ViTLP is a visually guided generative text-layout pre-trained model that aims to improve the efficiency and accuracy of intelligent document processing. It combines OCR text localization and recognition, enabling fast and accurate text detection and recognition on document images. The released pre-trained checkpoint, ViTLP-medium (380M parameters), is a balanced choice given the limits on computing resources and pre-training dataset size: it maintains model performance while keeping inference speed and memory usage in check. On an Nvidia 4090, ViTLP typically processes a one-page document image in 5 to 10 seconds, which is competitive with most OCR engines.
Target audience:
Enterprises and research institutions that need document image processing, especially those that require automated document processing and archive digitization. ViTLP's fast inference speed and high accuracy make it well suited to these scenarios.
Example usage scenarios:
Case 1: Use ViTLP to digitize historical documents and automatically extract text information from documents.
Case 2: In the legal field, ViTLP is used to automatically process and extract information from a large number of case documents.
Case 3: In the financial industry, ViTLP is used to analyze contract documents and extract key terms.
Product Features:
• Native OCR text positioning and recognition: ViTLP can directly locate and recognize text on document images.
• Pre-trained model ViTLP-medium: a pre-trained model with 380M parameters that performs well under limited computing resources.
• Fast inference speed: On an Nvidia 4090, ViTLP processes one page of a document image in 5 to 10 seconds.
• Hugging Face platform support: The pre-trained weights of the ViTLP model are available on the Hugging Face platform, which makes them convenient to download and use.
• Easy to integrate and use: With the provided code and instructions, users can easily integrate ViTLP into their projects.
• Batch decoding support: With the provided decode.sh script, users can decode document images in batches.
• Suitable for intelligent document processing: ViTLP is particularly suitable for scenarios that require document image text detection and recognition, such as automated document processing, archive digitization, etc.
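The parameter count and per-page latency quoted above allow a rough capacity estimate. The following is a minimal Python sketch and is not part of ViTLP: the function names are hypothetical, the 4-byte and 2-byte precisions are assumptions (the text does not say which precision the model runs in), and pages are assumed to be processed one at a time. It counts the weights only, not activations or input images.

```python
# Back-of-envelope estimates from the figures quoted in this listing.
# Precisions and the one-page-at-a-time model are assumptions.

PARAMS = 380_000_000          # ViTLP-medium parameter count (from the text)
SECONDS_PER_PAGE = (5, 10)    # quoted range on an Nvidia 4090


def weight_memory_gb(params: int, bytes_per_param: int) -> float:
    """Memory taken by the weights alone, in GB (10**9 bytes)."""
    return params * bytes_per_param / 1e9


def batch_time_minutes(pages: int, seconds_per_page=SECONDS_PER_PAGE):
    """(best, worst) wall-clock minutes when pages are processed serially."""
    low, high = seconds_per_page
    return pages * low / 60, pages * high / 60


if __name__ == "__main__":
    print(f"32-bit weights: {weight_memory_gb(PARAMS, 4):.2f} GB")
    print(f"16-bit weights: {weight_memory_gb(PARAMS, 2):.2f} GB")
    best, worst = batch_time_minutes(1000)
    print(f"1000 pages: {best:.0f} to {worst:.0f} minutes")
```

Under these assumptions, a 1000-page archive would take roughly one and a half to three hours on a single GPU, which is a useful figure when planning a digitization job.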
How to use:
1. Visit ViTLP's GitHub page and clone the project locally.
2. Install the required dependencies by running `pip install -r requirements.txt`.
3. Clone the pre-trained ViTLP model weights into the specified directory with `git clone https://huggingface.co/veason/ViTLP-medium ckpts/ViTLP-medium`.
4. Run the demo with `python ocr.py` and upload a document image for testing.
5. See `decode.py` for the detailed inference code; batch decoding can be run with `bash decode.sh`.
6. If you need to fine-tune ViTLP, refer to the guide in the `./finetuning` directory.
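Steps 1 to 3 above can be scripted. The following is a minimal Python sketch under stated assumptions, not an official installer: `setup_commands`, `run_setup`, and the local folder name `ViTLP` are hypothetical, and the GitHub address must be supplied by the caller because this listing does not spell it out. Only the weights URL and the `ckpts/ViTLP-medium` target come from the steps themselves.

```python
import subprocess

# Weights URL and target directory as given in step 3.
WEIGHTS_URL = "https://huggingface.co/veason/ViTLP-medium"
WEIGHTS_DIR = "ckpts/ViTLP-medium"


def setup_commands(repo_url: str, repo_dir: str = "ViTLP"):
    """Return (command, working_directory) pairs for steps 1 to 3.

    repo_url is supplied by the caller; repo_dir is a hypothetical
    local folder name chosen for this sketch.
    """
    return [
        (["git", "clone", repo_url, repo_dir], None),
        (["pip", "install", "-r", "requirements.txt"], repo_dir),
        (["git", "clone", WEIGHTS_URL, WEIGHTS_DIR], repo_dir),
    ]


def run_setup(repo_url: str, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd, cwd in setup_commands(repo_url):
        print(f"[{cwd or '.'}] {' '.join(cmd)}")
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    # Placeholder URL: replace it with the address from ViTLP's GitHub page.
    run_setup("https://github.com/<owner>/ViTLP", dry_run=True)
```

The dry run only prints the commands, so the plan can be reviewed before anything is downloaded. Model repositories on Hugging Face generally store weight files with Git LFS, so it is likely that Git LFS needs to be installed before the weights clone in step 3 fetches the actual files.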