What is pdf-extract-api?
pdf-extract-api is an API that converts documents and images into structured JSON or Markdown text using OCR and Ollama-supported models. Built with FastAPI, it uses Celery for asynchronous task handling and Redis for caching OCR results. All processing runs locally without relying on cloud services, which keeps document data private.
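To illustrate why caching OCR results helps, here is a minimal sketch of content-addressed caching. It is not the project's actual code: the key format, the function names (cache_key, ocr_with_cache), and the plain dictionary standing in for Redis are all assumptions made for illustration.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical stand-in for Redis: a plain in-memory dict.
_cache: Dict[str, str] = {}


def cache_key(file_bytes: bytes, strategy: str) -> str:
    """Derive a deterministic key from the file content and the OCR strategy name."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    return f"ocr:{strategy}:{digest}"  # assumed key layout, not the project's


def ocr_with_cache(
    file_bytes: bytes, strategy: str, run_ocr: Callable[[bytes], str]
) -> str:
    """Return the cached text if this exact file was already processed."""
    key = cache_key(file_bytes, strategy)
    if key in _cache:
        return _cache[key]
    text = run_ocr(file_bytes)  # the expensive step runs only on a cache miss
    _cache[key] = text
    return text
```

Because the key depends on the file's bytes rather than its name, re-uploading the same document under a different filename would still hit the cache in this sketch.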
Who would benefit from using pdf-extract-api?
This API is aimed at developers and enterprises that need high-precision document conversion, especially those concerned about data privacy. It is particularly useful for converting large volumes of documents, such as legal documents, medical reports, and financial invoices, into structured data.
What are some use cases for pdf-extract-api?
Convert MRI reports into Markdown and JSON.
Convert invoices into JSON and remove PII.
Use different OCR strategies for PDF to Markdown conversion.
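The idea of selectable OCR strategies can be sketched as a small registry that maps a strategy name to a conversion function. The strategy names ("plain", "heading") and the functions below are hypothetical placeholders that do no real OCR; they only show the dispatch pattern, not the strategies the project ships.

```python
from typing import Callable, Dict

OcrStrategy = Callable[[bytes], str]

# Registry of available strategies, keyed by name (names are hypothetical).
STRATEGIES: Dict[str, OcrStrategy] = {}


def register(name: str) -> Callable[[OcrStrategy], OcrStrategy]:
    """Decorator that adds a conversion function to the registry."""
    def decorator(fn: OcrStrategy) -> OcrStrategy:
        STRATEGIES[name] = fn
        return fn
    return decorator


@register("plain")
def plain_strategy(data: bytes) -> str:
    # Placeholder: decodes bytes as text instead of running real OCR.
    return data.decode("utf-8", errors="replace")


@register("heading")
def heading_strategy(data: bytes) -> str:
    # Placeholder: same text, emitted as a Markdown heading.
    return "# " + plain_strategy(data)


def convert(data: bytes, strategy: str = "plain") -> str:
    """Run the named strategy, rejecting unknown names explicitly."""
    try:
        fn = STRATEGIES[strategy]
    except KeyError:
        raise ValueError(f"unknown strategy: {strategy!r}") from None
    return fn(data)
```

With this layout, adding a strategy means registering one more function, and callers choose it by name per request.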
What features does pdf-extract-api offer?
High-precision PDF to Markdown and JSON conversion.
Local processing using PyTorch-based OCR and Ollama models.
LLM-based post-processing to improve OCR text output.
Removal of personally identifiable information (PII) from PDFs.
Distributed queue processing with Celery.
OCR result caching with Redis.
Command-line tool for sending tasks and handling results.
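As a rough illustration of what PII removal means for extracted text, the sketch below masks email addresses and US-style Social Security numbers with regular expressions. The source does not say how the API removes PII, so this regex approach is a swapped-in simplification, and the function name redact_pii and the placeholder tokens are assumptions.

```python
import re

# Simplified patterns for illustration; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace matched emails and SSN-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = SSN_RE.sub("[SSN]", text)
    return text


print(redact_pii("Contact jane.doe@example.com, SSN 123-45-6789."))
# prints: Contact [EMAIL], SSN [SSN].
```

Pattern matching of this kind misses names, addresses, and free-form identifiers, which is one reason a model-based pass can be preferable for real documents.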
How do you use pdf-extract-api?
1. Clone the repository to your local machine.
2. Create a .env file and set the required environment variables in it.
3. Build and run Docker containers using Docker Compose.
4. Use the CLI tool to upload files for OCR conversion.
5. Retrieve the OCR results.
6. Clear the OCR cache.
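Because tasks run asynchronously through Celery, retrieving a result usually means polling until the task finishes. The sketch below shows that pattern under stated assumptions: fetch_status is a hypothetical callable you would supply (for example, one that queries the API), and the response shape with "state" and "result" keys is assumed, not taken from the project. The state names SUCCESS and FAILURE are standard Celery task states.

```python
import time
from typing import Any, Callable, Dict


def wait_for_result(
    task_id: str,
    fetch_status: Callable[[str], Dict[str, Any]],  # hypothetical status lookup
    interval: float = 1.0,
    timeout: float = 60.0,
) -> Any:
    """Poll a task until it succeeds, fails, or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(task_id)
        state = status.get("state")
        if state == "SUCCESS":
            return status["result"]
        if state == "FAILURE":
            raise RuntimeError(f"task {task_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} not finished after {timeout}s")
        time.sleep(interval)  # wait before asking again
```

Passing the status lookup in as a function keeps the polling logic independent of any particular endpoint or HTTP client.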