🔧 Admin Features: Crawl Limit Configuration

Complete guide to modifying shallow and deep crawl limits

📊 Current Configuration Overview

The llms.txt Generator currently supports two crawling modes with configurable limits:

Crawl Type      Current Limit   Recommended Range   Maximum Recommended
Shallow Crawl   25 pages        10-50 pages         100 pages
Deep Crawl      1200 pages      100-1500 pages      2000 pages

📍 File Location: All crawl limits are configured in the crawler.php file, specifically at lines 19-20.

🚀 Step-by-Step: Increasing Crawl Limits

📂 Step 1: Locate the Configuration File

  1. Open your project directory
  2. Find the file named crawler.php in the root folder
  3. Open crawler.php in your text editor or IDE
  4. Navigate to approximately lines 19-20 (near the top of the class definition)

🔍 Step 2: Find the Current Configuration

Look for these specific lines in the WebsiteCrawler class:

    // Configuration: Adjust these values to control crawling limits
    private $shallowCrawlLimit = 25;  // Default limit for shallow crawl
    private $deepCrawlLimit = 1200;   // Maximum limit for deep crawl
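
To illustrate what these two properties control, the sketch below shows one way a page limit can cap a breadth-first crawl. It is not the actual WebsiteCrawler code: the class name LimitedCrawlSketch, the crawl() signature, and the $linkMap array that stands in for real HTTP fetching are all assumptions made for illustration.

```php
<?php
// Hypothetical sketch: NOT the real WebsiteCrawler implementation.
// $linkMap (url => list of linked urls) stands in for fetching and parsing pages.
class LimitedCrawlSketch
{
    private $shallowCrawlLimit = 25;   // Same defaults as shown in the guide
    private $deepCrawlLimit = 1200;

    public function crawl(string $startUrl, string $crawlType, array $linkMap): array
    {
        $limit = ($crawlType === 'deep') ? $this->deepCrawlLimit : $this->shallowCrawlLimit;
        $queue = [$startUrl];
        $visited = [];

        // Stop as soon as the queue is empty or the page limit is reached.
        while ($queue !== [] && count($visited) < $limit) {
            $url = array_shift($queue);
            if (isset($visited[$url])) {
                continue; // Already crawled via another link
            }
            $visited[$url] = true;
            foreach ($linkMap[$url] ?? [] as $link) {
                if (!isset($visited[$link])) {
                    $queue[] = $link;
                }
            }
        }
        return array_keys($visited);
    }
}
```

With a home page that links to 40 other pages, a shallow crawl in this sketch stops after 25 pages, while a deep crawl visits all 41.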

✏️ Step 3: Modify the Limits

📈 To Increase Shallow Crawl Limit:

    // Original
    private $shallowCrawlLimit = 25;

    // Examples of increases (keep exactly one declaration in the class):
    private $shallowCrawlLimit = 50;   // For medium sites
    private $shallowCrawlLimit = 100;  // For large navigation structures
    private $shallowCrawlLimit = 200;  // For comprehensive shallow crawl (above the recommended maximum of 100)

📈 To Increase Deep Crawl Limit:

    // Original
    private $deepCrawlLimit = 1200;

    // Examples of increases (keep exactly one declaration in the class):
    private $deepCrawlLimit = 1500;  // For large corporate sites
    private $deepCrawlLimit = 2000;  // For comprehensive analysis
    private $deepCrawlLimit = 3000;  // For maximum coverage (use with caution)

⚠️ Important: Increasing limits above 2000 pages may cause performance issues and timeouts. Monitor server resources carefully.

💾 Step 4: Save and Test

  1. Save the crawler.php file
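  2. PHP normally picks up the edited file on the next request; if your development server caches compiled scripts (for example OPcache with timestamp validation disabled), restart it to apply changes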
  3. Test with a small website first to verify the changes work
  4. Monitor performance and adjust if needed
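
One way to confirm that the edited values were saved is to read the property defaults back with reflection. The sketch below is an illustration, not part of the tool: WebsiteCrawlerStandIn is a hypothetical stand-in for the real class, and readCrawlLimits() is an invented helper name.

```php
<?php
// Stand-in class so the sketch runs on its own; the real class is assumed
// to be WebsiteCrawler in crawler.php with the same two property names.
class WebsiteCrawlerStandIn
{
    private $shallowCrawlLimit = 25;
    private $deepCrawlLimit = 1200;
}

// ReflectionClass::getDefaultProperties() reports declared default values,
// including those of private properties.
function readCrawlLimits(string $className): array
{
    $defaults = (new ReflectionClass($className))->getDefaultProperties();
    return [
        'shallow' => $defaults['shallowCrawlLimit'] ?? null,
        'deep'    => $defaults['deepCrawlLimit'] ?? null,
    ];
}

$limits = readCrawlLimits('WebsiteCrawlerStandIn');
printf("Shallow limit: %d, deep limit: %d\n", $limits['shallow'], $limits['deep']);
```

To check the real file, load crawler.php first (assuming that including it only defines the class and has no side effects) and pass 'WebsiteCrawler' instead of the stand-in name.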

📉 Step-by-Step: Decreasing Crawl Limits

🎯 When to Decrease Limits

✏️ Decrease Configuration Examples

📉 For Faster Shallow Crawls:

    // Original
    private $shallowCrawlLimit = 25;

    // Decreased options (keep exactly one declaration in the class):
    private $shallowCrawlLimit = 10;  // Quick preview only
    private $shallowCrawlLimit = 15;  // Basic navigation structure
    private $shallowCrawlLimit = 5;   // Minimal crawl (homepage + key pages)

📉 For Faster Deep Crawls:

    // Original
    private $deepCrawlLimit = 1200;

    // Decreased options (keep exactly one declaration in the class):
    private $deepCrawlLimit = 500;  // Medium-depth analysis
    private $deepCrawlLimit = 200;  // Limited deep crawl
    private $deepCrawlLimit = 100;  // Shallow-deep hybrid
    private $deepCrawlLimit = 50;   // Minimal deep crawl

✅ Benefits of Lower Limits: Faster processing, reduced server load, quicker results, better for testing and development.

⚙️ Advanced Configuration Options

🕐 Execution Time Limits

If you increase crawl limits significantly, you may also need to adjust execution time:

    // Find this section in the crawl() method (around lines 31-33):
    if ($crawlType === 'deep') {
        set_time_limit(300);              // 5 minutes for deep crawls
        ini_set('memory_limit', '512M');  // Increase memory for large sites
    }

    // Increase time limit for larger crawls (choose one):
    set_time_limit(600);   // 10 minutes
    set_time_limit(900);   // 15 minutes
    set_time_limit(1200);  // 20 minutes

    // Increase memory for very large sites (choose one):
    ini_set('memory_limit', '1G');  // 1 GB
    ini_set('memory_limit', '2G');  // 2 GB

🌐 Request Timeout Settings

Adjust individual page request timeouts (around line 106):

    // Original
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);

    // Adjust for slower sites (choose one):
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);  // 30 seconds per page
    curl_setopt($ch, CURLOPT_TIMEOUT, 45);  // 45 seconds per page
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);  // 1 minute per page (for very slow sites)
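
When raising the overall request timeout, it can help to keep a shorter limit on the connection phase so unreachable hosts still fail quickly. The guide's code only sets CURLOPT_TIMEOUT; the helper name buildTimeoutOptions() and the 5-second connect timeout below are assumptions for illustration. The sketch needs the cURL extension for its constants.

```php
<?php
// Hypothetical helper that groups the per-request timeout options.
// CURLOPT_CONNECTTIMEOUT caps only the connection phase;
// CURLOPT_TIMEOUT caps the whole transfer, connection included.
function buildTimeoutOptions(int $totalSeconds = 15, int $connectSeconds = 5): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,
        // The connect timeout should never exceed the total timeout.
        CURLOPT_CONNECTTIMEOUT => min($connectSeconds, $totalSeconds),
        CURLOPT_TIMEOUT        => $totalSeconds,
    ];
}

// Usage (requires network access, so it is left commented out here):
// $ch = curl_init('https://example.com/');
// curl_setopt_array($ch, buildTimeoutOptions(30));
// $html = curl_exec($ch);
```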

📊 Recommended Combinations

Use Case        Shallow Limit   Deep Limit   Time Limit   Memory
Quick Testing   10              50           120s         256M
Standard Use    25              1200         300s         512M
Large Sites     50              2000         600s         1G
Enterprise      100             3000         1200s        2G
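
A quick way to sanity-check a combination is to divide the time limit by the deep limit. The Standard Use row, for example, leaves 300 / 1200 = 0.25 seconds per page on average. The sketch below repeats that arithmetic for each row; the function names are invented for illustration, and treating the time limit as a plain wall-clock budget is a simplifying assumption.

```php
<?php
// Average seconds available per page if the whole deep crawl has to
// finish within the time limit (simplified wall-clock view).
function secondsPerPage(int $timeLimitSeconds, int $pageLimit): float
{
    return $timeLimitSeconds / $pageLimit;
}

// Upper bound on run time if every request waited for the full cURL timeout.
function worstCaseSeconds(int $pageLimit, int $requestTimeoutSeconds): int
{
    return $pageLimit * $requestTimeoutSeconds;
}

// Rows from the table above: use case => [deep limit, time limit in seconds].
$combinations = [
    'Quick Testing' => [50, 120],
    'Standard Use'  => [1200, 300],
    'Large Sites'   => [2000, 600],
    'Enterprise'    => [3000, 1200],
];

foreach ($combinations as $useCase => [$deepLimit, $timeLimit]) {
    printf(
        "%-14s %.2f s per page, worst case %d s at a 15 s request timeout\n",
        $useCase,
        secondsPerPage($timeLimit, $deepLimit),
        worstCaseSeconds($deepLimit, 15)
    );
}
```

The per-page budgets are all well below the 15-second request timeout, so a crawl that hits many slow or unresponsive pages may run out of time before it reaches its page limit.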

🚨 Performance Considerations & Warnings

⚠️ Important Warnings

Server Resources: Higher limits consume more CPU, memory, and bandwidth. Monitor your server resources carefully.
Timeout Risks: Very high limits may cause PHP or web server timeouts. Always test changes incrementally.
Target Site Impact: Aggressive crawling may overload target websites. Be respectful of rate limits.
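
One common way to reduce the load on a target site is to enforce a minimum interval between consecutive requests. The guide does not say whether the crawler already does this, so the politeDelay() helper below and the half-second interval in the usage comment are purely hypothetical.

```php
<?php
// Hypothetical helper: sleeps just long enough that consecutive requests
// are at least $minIntervalSeconds apart. Returns the seconds slept.
// $now can be injected for testing; it defaults to the current time.
function politeDelay(?float $lastRequestAt, float $minIntervalSeconds, ?float $now = null): float
{
    if ($lastRequestAt === null) {
        return 0.0; // First request: nothing to wait for
    }
    $now = $now ?? microtime(true);
    $wait = max(0.0, $minIntervalSeconds - ($now - $lastRequestAt));
    if ($wait > 0) {
        usleep((int) round($wait * 1000000));
    }
    return $wait;
}

// Usage inside a crawl loop (sketch):
// $last = null;
// foreach ($urls as $url) {
//     politeDelay($last, 0.5);   // At most about two requests per second
//     $last = microtime(true);
//     // ... fetch $url ...
// }
```

A delay like this lengthens the total crawl time, so it interacts with the execution time limits discussed earlier.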

📊 Performance Impact Guide

🔧 Optimization Tips

  1. Test Incrementally: Start with small increases and monitor performance
  2. Monitor Memory: Watch server memory usage during large crawls
  3. Consider Caching: Implement result caching for frequently crawled sites
  4. Use Appropriate Mode: Use shallow crawl for quick analysis, deep crawl only when needed
  5. Monitor Logs: Check server logs for errors or timeouts
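
The caching tip could be implemented in many ways. As a minimal sketch, assuming crawl results can be represented as a plain array, the functions below store them as JSON files keyed by start URL and crawl type. The function names, the file naming scheme, and the one-hour lifetime are illustrative choices, not part of the existing tool.

```php
<?php
// Hypothetical file cache for crawl results.
function cachePath(string $url, string $crawlType, string $dir): string
{
    return $dir . '/crawl_' . md5($crawlType . '|' . $url) . '.json';
}

// Returns the cached result, or null when it is missing, stale, or unreadable.
function loadCachedCrawl(string $url, string $crawlType, string $dir, int $ttlSeconds = 3600): ?array
{
    $path = cachePath($url, $crawlType, $dir);
    if (!is_file($path) || (time() - filemtime($path)) > $ttlSeconds) {
        return null;
    }
    $data = json_decode(file_get_contents($path), true);
    return is_array($data) ? $data : null;
}

// Writes the result with an exclusive lock to avoid interleaved writes.
function saveCachedCrawl(string $url, string $crawlType, string $dir, array $pages): void
{
    file_put_contents(cachePath($url, $crawlType, $dir), json_encode($pages), LOCK_EX);
}

// Usage (sketch): consult the cache before starting a crawl.
// $dir = sys_get_temp_dir();
// $pages = loadCachedCrawl($startUrl, 'deep', $dir);
// if ($pages === null) {
//     $pages = /* run the crawl */ [];
//     saveCachedCrawl($startUrl, 'deep', $dir, $pages);
// }
```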

🛠️ Troubleshooting Common Issues

❌ Problem: "Maximum execution time exceeded"

Solution: Increase the set_time_limit() value or decrease crawl limits.

❌ Problem: "Out of memory" errors

Solution: Increase memory_limit or reduce crawl limits for large sites.

❌ Problem: Slow crawling performance

Solution: Reduce CURLOPT_TIMEOUT so that unresponsive pages fail sooner, or implement parallel processing so that several pages are fetched at once.
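
Parallel processing in PHP is often done with the curl_multi functions. The sketch below is one possible shape for it, not code from crawler.php: fetchConcurrently() is an invented name, and it assumes the cURL extension is installed.

```php
<?php
// Sketch of concurrent fetching with curl_multi.
// Returns url => ['status' => HTTP code (0 if no response), 'body' => ?string].
function fetchConcurrently(array $urls, int $timeoutSeconds = 15): array
{
    $urls = array_unique($urls);
    if ($urls === []) {
        return [];
    }

    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // Wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = [
            'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
            'body'   => curl_multi_getcontent($ch),
        ];
        curl_multi_remove_handle($multi, $ch);
    }
    return $results;
}

// Usage (requires network access):
// $pages = fetchConcurrently(['https://example.com/', 'https://example.com/about']);
```

Fetching in small batches rather than all URLs at once keeps the load on the target site and on local memory manageable.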

❌ Problem: Incomplete results

Solution: Check if limits are too low or if target site has restrictions.

✅ Testing Your Changes

  1. Start with a small, known website
  2. Test shallow crawl first, then deep crawl
  3. Monitor browser developer console for errors
  4. Check server logs for any issues
  5. Gradually test with larger, more complex sites