What is pdf-craft?
pdf-craft is a conversion tool for scanned book PDFs that outputs Markdown and EPUB. It uses the DocLayout-YOLO model for page layout analysis and combines it with OCR to extract text, automatically removing elements outside the body text, such as headers, footers, and footnotes, so that the output reads coherently and keeps a clear structure.
PDF to Markdown: Extracts the body text, preserves its structure, automatically inserts screenshots of images, tables, and formulas, and generates a clean Markdown file.
PDF to EPUB: Combines OCR with an LLM to build the book's table of contents and chapters, correct OCR errors, and optimize reading order, then outputs an EPUB file suitable for e-book readers.
Page layout analysis: Uses DocLayout-YOLO to identify text blocks, images, tables, and other elements so that the body text can be extracted accurately.
OCR text recognition: Uses PaddleOCR to improve recognition accuracy on scanned text.
Cross-page processing: Links text blocks logically across page breaks so that content spanning two pages reads smoothly.
Reading order optimization: Uses layoutreader to arrange text blocks in the order a human reader would follow.
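The filtering and cross-page steps described above can be illustrated with a minimal sketch. This is not pdf-craft's API or its actual algorithm: the Block type, the label names, and both functions are hypothetical, and sorting by vertical position stands in for the learned reading order that layoutreader provides.

```python
from dataclasses import dataclass

# Hypothetical label names; a real layout model defines its own classes.
NON_BODY_LABELS = {"header", "footer", "footnote"}

# Characters treated as ending a sentence (an assumption for this sketch).
SENTENCE_END = (".", "!", "?", ":")


@dataclass
class Block:
    """A detected layout block (hypothetical structure)."""
    page: int
    label: str
    top: float  # distance from the top of the page
    text: str


def body_blocks(blocks):
    """Drop non-body blocks, then sort by page and top-to-bottom position."""
    kept = [b for b in blocks if b.label not in NON_BODY_LABELS]
    return sorted(kept, key=lambda b: (b.page, b.top))


def join_across_pages(blocks):
    """Merge the first block of a page into the previous paragraph
    when that paragraph stopped mid-sentence at the page break."""
    paragraphs = []
    prev_page = None
    for block in blocks:
        text = block.text.strip()
        crosses_page = prev_page is not None and block.page != prev_page
        if paragraphs and crosses_page and not paragraphs[-1].endswith(SENTENCE_END):
            paragraphs[-1] = paragraphs[-1] + " " + text
        else:
            paragraphs.append(text)
        prev_page = block.page
    return paragraphs


if __name__ == "__main__":
    sample = [
        Block(1, "header", 0, "Chapter 1"),
        Block(1, "text", 100, "The quick brown fox jumps over"),
        Block(1, "footer", 900, "12"),
        Block(2, "text", 50, "the lazy dog."),
        Block(2, "text", 200, "A new paragraph."),
    ]
    for paragraph in join_across_pages(body_blocks(sample)):
        print(paragraph)
```

The punctuation check is a deliberately crude heuristic; it only shows why layout labels and page numbers are needed before text from adjacent pages can be stitched together.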
Academic research: Convert scanned papers to Markdown or EPUB.
E-book production: Convert book PDFs to EPUB, with a generated table of contents and chapters.
Document archiving: Convert scanned paper documents or PDFs to Markdown or EPUB for archiving.
Educational materials: Convert textbooks or handouts to make teaching and learning more efficient.
Personal study: Scan and organize materials to make note-taking and review easier.
GitHub repository: pdf-craft