llms.txt Generator

Comprehensive documentation for the website content extraction tool

🎯 Overview

The llms.txt Generator is a web crawling tool that creates structured summaries of websites in Markdown format. It categorizes the pages it finds and generates clean, organized documentation suitable for AI training data, website analysis, or content audits.

🔍 Smart Crawling

Automatically discovers sitemaps, follows navigation patterns, and intelligently categorizes content across your entire website.

📱 Responsive Design

Professional interface that works seamlessly across all devices with modern styling and intuitive controls.

🎨 Professional Styling

Orange-themed design (#ff9500) matching modern web standards with clean typography and smooth interactions.

📥 Multiple Export Options

Download generated content as .txt or .md files, or copy directly to clipboard for immediate use.

✨ Features & Capabilities

🔧 Crawling Options

🎯 Link Source Selection

🗂️ Intelligent Categorization

Main Pages

About, Contact, Privacy, Terms, Legal

Services

Service offerings, Consulting, Solutions

Products

Product catalog, Shop, Pricing, Store

Tools

Applications, Calculators, Generators

Blog & Resources

Articles, News, Guides, Learning materials

Documentation

API docs, References, Manuals, Wiki

Case Studies

Portfolio, Work examples, Project showcases

Company Info

Team, Careers, Press, Company history
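
The categories above suggest keyword matching against each page URL. The following is a minimal sketch under that assumption, not the tool's actual logic: the category names come from the list above, while the keyword map, the matching order, the `categorizeUrl` function, and the `Other` fallback are all hypothetical.

```php
<?php
// Hypothetical keyword map: category names follow the list above, but the
// URL keywords and the order they are checked in are illustrative assumptions.
const CATEGORY_KEYWORDS = [
    'Documentation'    => ['docs', 'api', 'reference', 'manual', 'wiki'],
    'Blog & Resources' => ['blog', 'article', 'news', 'guide'],
    'Case Studies'     => ['case-stud', 'portfolio', 'showcase'],
    'Products'         => ['product', 'shop', 'pricing', 'store'],
    'Services'         => ['service', 'consulting', 'solution'],
    'Tools'            => ['tool', 'calculator', 'generator'],
    'Company Info'     => ['team', 'career', 'press', 'history'],
    'Main Pages'       => ['about', 'contact', 'privacy', 'terms', 'legal'],
];

// Return the first category whose keyword appears in the URL path.
function categorizeUrl(string $url): string
{
    // parse_url() returns null when there is no path, so cast to string.
    $path = strtolower((string) parse_url($url, PHP_URL_PATH));

    foreach (CATEGORY_KEYWORDS as $category => $keywords) {
        foreach ($keywords as $keyword) {
            if (strpos($path, $keyword) !== false) {
                return $category;
            }
        }
    }
    return 'Other'; // hypothetical fallback bucket for unmatched pages
}
```

Because the first match wins, more specific categories are listed before generic ones such as Main Pages. Plain substring matching is deliberately simple and can misfire on unrelated paths, so a real implementation would likely match whole path segments.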

📖 Usage Guide

🚀 Quick Start

  1. Enter a website URL (e.g., https://example.com)
  2. Choose your crawl type: Shallow or Deep
  3. Select link source: All, Header, Navigation, or Footer
  4. Click "Generate llms.txt"
  5. Preview the generated markdown content
  6. Copy to clipboard or download as .txt/.md file

🔍 Deep Crawl Process

When using deep crawl, the system follows this comprehensive discovery process:

  1. Sitemap Discovery: Checks /sitemap.xml first
  2. Fallback Options: If not found, tries /sitemap_index.xml
  3. Robots.txt Analysis: Extracts sitemap references from robots.txt
  4. Homepage Fallback: Uses homepage navigation if no sitemaps found
  5. Recursive Exploration: Follows internal links up to 100 pages
  6. Smart Filtering: Excludes feeds, media files, and non-content pages

Note: Deep crawl may take longer for large websites as it comprehensively explores the entire site structure and content.
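
Step 5 of the process above can be pictured as a breadth-first traversal with a hard page cap. This is only a sketch under stated assumptions: `crawlInternalLinks` and the `$fetchLinks` callback are hypothetical names, and the callback stands in for whatever code downloads a page and extracts its absolute links.

```php
<?php
// Breadth-first exploration of internal links with a hard page cap.
// $fetchLinks is a hypothetical callback that returns the absolute URLs
// found on a page; a real crawler would download and parse HTML there.
function crawlInternalLinks(string $startUrl, callable $fetchLinks, int $maxPages = 100): array
{
    $host    = parse_url($startUrl, PHP_URL_HOST);
    $queue   = [$startUrl];
    $visited = [];

    while ($queue !== [] && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already crawled via another link
        }
        $visited[$url] = true;

        foreach ($fetchLinks($url) as $link) {
            // Queue only same-host links that have not been crawled yet.
            if (parse_url($link, PHP_URL_HOST) === $host && !isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited); // URLs in the order they were crawled
}
```

Passing the link extractor as a callback keeps the traversal testable without network access, and the `$maxPages` default mirrors the 100-page figure in step 5.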

⚙️ Technical Specifications

🛠️ Technology Stack

🔧 Sitemap Discovery Algorithm

  1. Try /sitemap.xml
  2. If not found → try /sitemap_index.xml
  3. If not found → parse /robots.txt for Sitemap: entries
  4. If no sitemaps → fall back to homepage crawling

For sitemap indexes:

  - Recursively parse all child sitemaps
  - Extract all <loc> entries (page URLs)
  - Filter and deduplicate results
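
The two parsing steps in this algorithm, reading `Sitemap:` entries from robots.txt and pulling `<loc>` values out of a sitemap, can be sketched as follows. The function names `sitemapsFromRobots` and `parseSitemap` are hypothetical, and the sketch works on strings already in memory rather than fetching anything. It is one plausible approach using PHP's DOM extension, not necessarily how crawler.php does it.

```php
<?php
// Collect "Sitemap:" entries from the text of a robots.txt file.
function sitemapsFromRobots(string $robotsTxt): array
{
    // The field name is matched case-insensitively, one entry per line.
    preg_match_all('/^[ \t]*Sitemap:[ \t]*(\S+)/mi', $robotsTxt, $matches);
    return array_values(array_unique($matches[1]));
}

// Extract <loc> values from a sitemap and report whether it is an index.
// For an index, each <loc> is a child sitemap that still needs parsing.
function parseSitemap(string $xml): array
{
    $doc = new DOMDocument();
    // Treat empty or malformed XML as "nothing found" instead of failing.
    if ($xml === '' || !@$doc->loadXML($xml)) {
        return ['isIndex' => false, 'locs' => []];
    }

    $locs = [];
    foreach ($doc->getElementsByTagName('loc') as $node) {
        $locs[] = trim($node->textContent);
    }

    return [
        'isIndex' => $doc->documentElement->localName === 'sitemapindex',
        'locs'    => array_values(array_unique($locs)), // deduplicated, order kept
    ];
}
```

A caller would loop over the `locs` of an index, fetch each child sitemap, and call `parseSitemap` again until only page URLs remain.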

🚫 Filtering Rules

The system automatically excludes feeds, media files, and non-content pages.
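
As a minimal sketch, assuming exclusion is decided from the URL alone, a filter for feeds and media files might look like this. The extension list, the path pattern, and the names `isContentUrl` and `filterUrls` are illustrative assumptions, not the tool's actual rules.

```php
<?php
// Hypothetical exclusion rules: these extensions and path segments are
// illustrative assumptions, not the tool's actual filter list.
const EXCLUDED_EXTENSIONS   = ['jpg', 'jpeg', 'png', 'gif', 'svg', 'webp', 'mp3', 'mp4'];
const EXCLUDED_PATH_PATTERN = '#/(feed|rss)(/|$)#';

// Decide whether a URL looks like a regular content page.
function isContentUrl(string $url): bool
{
    $path = strtolower((string) parse_url($url, PHP_URL_PATH));

    $extension = pathinfo($path, PATHINFO_EXTENSION);
    if (in_array($extension, EXCLUDED_EXTENSIONS, true)) {
        return false; // media file
    }
    if (preg_match(EXCLUDED_PATH_PATTERN, $path) === 1) {
        return false; // feed endpoint
    }
    return true;
}

// Drop excluded URLs and duplicates, keeping the original order.
function filterUrls(array $urls): array
{
    return array_values(array_unique(array_filter($urls, 'isContentUrl')));
}
```

The pattern matches whole path segments, so `/blog/feed/` is excluded while `/feedback` is kept.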

📊 Performance Specifications

⚙️ Backend Configuration

To adjust the number of pages crawled, modify these variables in crawler.php:

    // Configuration: Adjust these values to control crawling limits
    private $shallowCrawlLimit = 25;   // Default limit for shallow crawl
    private $deepCrawlLimit = 1200;    // Maximum limit for deep crawl

    // Examples:
    // For faster crawling: set to 10 and 500
    // For more comprehensive: set to 50 and 2000
    // Maximum recommended: 100 and 3000 (with proper server resources)

Performance Note: Higher page limits will increase crawling time and server load. Deep crawls with 1200+ pages may take 5-15 minutes. See Admin Features Guide for detailed configuration instructions.
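
To show how these two limits might be used, here is a small sketch that picks a cap by crawl type and truncates a list of discovered URLs. Only the two property names and their defaults come from the configuration above. The `CrawlLimits` class and its `limitFor` and `apply` methods are hypothetical and are not taken from crawler.php.

```php
<?php
// Hypothetical class: only the two property names and default values come
// from the configuration above; the class and method names are illustrative.
class CrawlLimits
{
    private $shallowCrawlLimit = 25;   // Default limit for shallow crawl
    private $deepCrawlLimit = 1200;    // Maximum limit for deep crawl

    // Pick the page cap for the requested crawl type ("shallow" or "deep").
    public function limitFor(string $crawlType): int
    {
        return $crawlType === 'deep' ? $this->deepCrawlLimit : $this->shallowCrawlLimit;
    }

    // Truncate a list of discovered URLs to the cap for this crawl type.
    public function apply(array $urls, string $crawlType): array
    {
        return array_slice($urls, 0, $this->limitFor($crawlType));
    }
}
```

Applying the cap after deduplication and filtering means every slot within the limit goes to a distinct content page.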

🔧 Link Source Behavior (Shallow Crawl)