ReaderLM v2 : Efficient HTML processing language model
ReaderLM v2 is a small language model launched by Jina AI with 1.5 billion parameters. It focuses on HTML to Markdown conversion and HTML to JSON data extraction with high accuracy.
Main functions
HTML to Markdown: Convert HTML content to Markdown format, retaining complete information and effectively using Markdown syntax, especially good at processing complex elements and long texts.
HTML to JSON: Extract specific information directly from HTML and generate JSON format data without the need for intermediate Markdown conversion steps. Users need to provide JSON schema.
Long text processing: Supports input and output of up to 512K tokens, effectively avoiding performance degradation in long text processing.
Multi-language support: Supports 29 languages, including English, Chinese, Japanese, etc.
High Performance: Outperforms many larger models in benchmark tests.
target users
Developers, content creators, data analysts, and businesses and researchers who need to extract structured data from web pages.
Application scenarios
Developer: Convert web news to Markdown format for use in technology blogs.
Data Analyst: Extract product information from web pages for market analysis.
Researchers: Extract paper information from academic websites and store it in JSON format.
Product features
Efficient HTML to Markdown conversion, retaining complete information and using appropriate Markdown syntax.
Powerful long text processing capabilities, supporting input and output of 512K tokens.
Direct HTML to JSON data extraction function to improve data processing efficiency.
Extensive multi-language support.
Small and efficient, it outperforms many larger models.
User Guide
ReaderLM v2 can be used in a variety of ways:
1. Reader API: Use x-engine: readerlm-v2 request header and Accept: text/event-stream to enable response streaming.
2. Google Colab: Test through Colab notebook.
3. Cloud platform deployment: Can be deployed on AWS SageMaker, Azure and GCP marketplace.
4. HTML to Markdown: Use the create_prompt function to create a prompt and then call the model.
5. HTML to JSON: Define JSON Schema first, then create prompts and call the model.