Clean, high-quality text data is crucial for AI training and data analysis. generate-llmstxt, an NPX package provided by Firecrawl, lets users extract structured text directly from a website and generate llms.txt and llms-full.txt files for LLMs. This article introduces its installation, usage and optimization techniques to help you extract LLM training data efficiently.
generate-llmstxt is an NPX package that uses the Firecrawl API to convert web pages into structured text files for LLM training or data analysis.
Output files
llms.txt: Key information extracted from the web pages, condensed into summary text
llms-full.txt: The full crawled web text, suitable for deeper AI training
Default storage location
public/llms.txt
public/llms-full.txt
Method 1: Provide the API Key directly on the command line
npx generate-llmstxt --api-key YOUR_FIRECRAWL_API_KEY
Method 2: Store the API Key in a .env file
Create a .env file in the project root directory and add the following:
FIRECRAWL_API_KEY=your_api_key_here
Then run
npx generate-llmstxt
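Before relying on Method 2, it can help to confirm that the .env file actually defines the key. The following is a minimal sketch, not part of the package: `has_firecrawl_key` is a hypothetical helper name, and it assumes the file uses the plain `FIRECRAWL_API_KEY=value` form shown above.

```shell
# Hypothetical helper (not part of generate-llmstxt): succeeds when the
# given .env file exists and defines a non-empty FIRECRAWL_API_KEY.
has_firecrawl_key() {
  env_file="${1:-.env}"   # default to .env in the current directory
  [ -f "$env_file" ] && grep -Eq '^FIRECRAWL_API_KEY=.+' "$env_file"
}
```

One possible use is `has_firecrawl_key .env && npx generate-llmstxt`, which only starts the generator when the key is present.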
Parameter | Effect | Default value |
---|---|---|
-k, --api-key <key> | Firecrawl API Key (can be omitted when it is set in .env) | Required |
-u, --url <url> | The target website URL to be crawled | https://example.com |
-m, --max-urls <number> | Maximum number of crawled pages (1-100) | 50 |
-o, --output-dir <path> | Specify the output directory | public |
npx generate-llmstxt -k your_api_key -u https://your-website.com -m 20
npx generate-llmstxt -u https://your-website.com -m 20
npx generate-llmstxt -k your_api_key -u https://your-website.com -o custom/output/path
npx generate-llmstxt -u https://your-website.com -o content/llms
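Since the table gives a 1-100 range for --max-urls, a wrapper script may want to validate the value before calling the tool. This is a rough sketch under that assumption; `valid_max_urls` is a hypothetical helper, and how the package itself reacts to out-of-range values is not stated in this article.

```shell
# Hypothetical helper (not part of generate-llmstxt): succeeds when $1 is
# an integer within the 1-100 range listed for --max-urls.
valid_max_urls() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 100 ]
}
```

For example, `valid_max_urls 20 && npx generate-llmstxt -u https://your-website.com -m 20` would skip the crawl when the limit is invalid.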
# LLMs.txt - AI Training Summary Data
- Website Name: Your Website
- Topic: Artificial Intelligence Data Processing
- Key Points:
  1. Provide data crawling API
  2. Suitable for LLM training
  3. Support text analysis
# LLMs-Full.txt - Full Text Data
## Website Title: Your Website - AI Data Extraction
Website provides an automated way to convert web page content into LLM training data. Its API allows users to crawl text and generate structured summary and full-text data...
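After a run, a quick check that both files were written can catch a failed crawl early. The sketch below assumes the file names and the default public directory described earlier; `check_llms_output` is a hypothetical helper, not something the package provides.

```shell
# Hypothetical helper (not part of generate-llmstxt): succeeds when both
# output files exist and are non-empty in the given directory.
check_llms_output() {
  out_dir="${1:-public}"   # default output directory from the table
  for name in llms.txt llms-full.txt; do
    if [ ! -s "$out_dir/$name" ]; then
      echo "missing or empty: $out_dir/$name" >&2
      return 1
    fi
  done
  return 0
}
```

Calling `check_llms_output content/llms` would apply the same check to a custom output directory.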
Node.js 14+ required
A valid Firecrawl API Key (command line or .env file) must be provided
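The Node.js requirement can be checked from a script before invoking npx. This is a minimal sketch that assumes the usual `vMAJOR.MINOR.PATCH` format printed by `node --version`; `node_version_ok` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds when a version string such as "v18.17.0"
# has a major version of at least 14.
node_version_ok() {
  major="${1#v}"         # strip the leading "v"
  major="${major%%.*}"   # keep only the major component
  case "$major" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or malformed input
  esac
  [ "$major" -ge 14 ]
}
```

A typical call would be `node_version_ok "$(node --version)"`.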
Using generate-llmstxt, you can easily crawl web content and generate structured text data for LLM training. Whether you need a summary (llms.txt) or the complete text (llms-full.txt), it can meet different AI needs.
Try npx generate-llmstxt now to improve AI training efficiency!