Firecrawl recently launched a new feature, the LLMs.txt generator API (Alpha), which converts the content of any website into clean text files suitable for large language model (LLM) training. Given a website URL, Firecrawl crawls the site and its linked pages and generates text files in two formats, llms.txt and llms-full.txt, for later analysis and training.
The workflow is simple: once a URL is submitted, the system automatically crawls the site and extracts clean, meaningful text. It produces two kinds of file: llms.txt, a concise summary of the site that contains its key information, and llms-full.txt, the more detailed and complete text, suited to users who need in-depth analysis.
Users can set several key parameters. The first is "url", the URL of the website for which the LLMs.txt file should be generated. The optional "maxUrls" parameter caps the number of pages crawled; it accepts values from 1 to 100 and defaults to 10. Users can also choose whether to generate llms-full.txt, which is turned off by default.
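As a rough illustration of the parameters above, the following sketch builds and validates a request body before it is sent. The "url" and "maxUrls" names, the 1 to 100 range, and the defaults come from the description above; the "showFullText" field name and the build_llmstxt_request helper are assumptions made for this example, so check the documentation for the exact field names.

```python
from typing import Any, Dict

# Range and default for maxUrls as described in the article.
MAX_URLS_MIN = 1
MAX_URLS_MAX = 100
MAX_URLS_DEFAULT = 10


def build_llmstxt_request(
    url: str,
    max_urls: int = MAX_URLS_DEFAULT,
    full_text: bool = False,
) -> Dict[str, Any]:
    """Return a JSON-serializable request body (hypothetical helper)."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    if not MAX_URLS_MIN <= max_urls <= MAX_URLS_MAX:
        raise ValueError(
            f"maxUrls must be between {MAX_URLS_MIN} and {MAX_URLS_MAX}"
        )
    return {
        "url": url,
        "maxUrls": max_urls,
        # Assumed field name for the "also generate llms-full.txt" switch.
        "showFullText": full_text,
    }
```

Calling build_llmstxt_request("https://example.com") yields a body with maxUrls set to 10 and the full-text option disabled, matching the defaults described above.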
Note that the LLMs.txt generator runs asynchronously: users submit a request and then monitor the generation status. The system reports status updates such as "In Progress" or "Completed", so users can track progress at any time.
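The asynchronous flow can be sketched as a simple polling loop. This is a minimal sketch, not Firecrawl's client: fetch_status stands in for whatever call retrieves the job status (for example, an HTTP GET against the job's status URL), and the exact status strings, including the "failed" state, are assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict


def wait_for_completion(
    fetch_status: Callable[[], Dict[str, Any]],
    poll_interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll fetch_status until the job completes, fails, or times out.

    fetch_status is a hypothetical callable returning a dict with a
    "status" key; the status values checked here are assumed.
    """
    waited = 0.0
    while True:
        job = fetch_status()
        status = str(job.get("status", "")).lower()
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError(f"generation failed: {job}")
        if waited >= timeout:
            raise TimeoutError(f"job not finished after {waited:.0f}s")
        # Still in progress: wait before asking again.
        sleep(poll_interval)
        waited += poll_interval
```

Passing the sleep function as a parameter keeps the loop easy to test without real delays; in normal use the default time.sleep applies.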
Because the feature is still in the Alpha stage, it has some known limitations. First, only publicly accessible pages are supported; content behind a login or paywall cannot be processed. Second, during the Alpha phase, at most 5,000 URLs can be processed. Finally, as with any Alpha feature, the output format and processing flow may change based on user feedback.
Billing is based on the number of URLs processed, at a base cost of 1 credit per URL. Users can control costs by setting the maxUrls parameter.
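The billing rule reduces to simple arithmetic, sketched below. The sketch assumes that only pages actually processed are billed and that the number processed is the smaller of the pages found and maxUrls; estimate_cost is a hypothetical helper, not part of Firecrawl.

```python
# Base cost per processed URL, as stated in the article.
COST_PER_URL = 1


def estimate_cost(pages_found: int, max_urls: int = 10) -> int:
    """Estimate the charge for one generation run (hypothetical helper).

    Assumes billing applies to min(pages_found, max_urls) pages.
    """
    if pages_found < 0:
        raise ValueError("pages_found must be non-negative")
    if not 1 <= max_urls <= 100:
        raise ValueError("max_urls must be between 1 and 100")
    return min(pages_found, max_urls) * COST_PER_URL
```

Under these assumptions, a site with 250 discoverable pages and the default maxUrls of 10 would cost 10 units, while raising maxUrls to 100 would raise the cost to 100.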
Documentation: https://docs.firecrawl.dev/features/alpha/llmstxt