Methodologies and Best Practices
Writing the code is the easy part; keeping your scraper undetected and efficient is the challenge. If your scraper is getting blocked, returning incomplete data, or crashing servers, you are likely neglecting one of these core pillars of scraping methodology.
The "Stealth Mode" Protocol (Avoiding Detection)
Websites use "fingerprinting" to identify non-human traffic. To stay invisible, you must mimic organic user behavior:
- IP Rotation is Non-Negotiable: Never scrape a large target with a single IP address. Use proxy services to rotate your IP. For high-security sites, "residential proxies" (IPs from real home devices) are less likely to be blocked than "datacenter proxies".
- Proxy Pool Health Monitoring: Continuously health-check proxies. Remove any with >20% failure rate or latency >3s to avoid detection patterns.
- Sticky Sessions: Use the same IP for an entire user journey (home → product → checkout). Switching IPs mid-session is a fraud flag - keep IP sticky per domain for 5-10 minutes.
- Geographic Pinning: For geo-blocked targets, use region-specific proxies (e.g., us-nyc, de-berlin) and validate with IP geolocation lookups.
- Authentication Pattern Rotation: Rotate proxy credentials (username:password) along with IPs; some WAFs (Web Application Firewalls) fingerprint the auth format itself.
- Header Hygiene: The User-Agent string tells the server which client is making the request. Do not use your HTTP library's default user agent (requests sends "python-requests/2.x", urllib sends "Python-urllib/3.9"), which is an instant red flag. Rotate through strings of popular browsers (Chrome, Firefox) and include legitimate headers like “Referer” and “Accept-Language”.
- Header Ordering Matters: Real browsers send headers in a fixed, characteristic order, while HTTP libraries emit their own default ordering that anti-bot systems can fingerprint. Use httpx[http2] or curl_cffi when you need to control the header sequence and the HTTP/2 or TLS fingerprint.
- Proxy-Header Coupling: Rotating IPs without rotating associated headers is a major detection flag. A Madrid proxy with “Accept-Language: en-US” is geographically inconsistent and triggers behavioral scoring.
- Humanize Your Interactions: Real humans don't click links every 0.5 seconds with mathematical precision. Introduce randomized delays between requests. If using browser automation, randomize mouse movements and typing speeds to defeat behavioral profiling.
- Implement Rate Limiting: Scraping too quickly can overwhelm a server, leading to IP bans or service disruptions. By introducing delays between requests (e.g., 1-5 seconds), you mimic human browsing behavior and reduce the risk of detection as a bot, ensuring your scraper remains operational in the long term.
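Several of the points above (randomized delays, User-Agent rotation, legitimate-looking headers) can be sketched with the standard library alone. This is a minimal illustration, not a production setup: the User-Agent pool, referer, and delay bounds are placeholder values you would tune per target.

```python
import random
import time
import urllib.request

# Placeholder pool of common desktop browser User-Agent strings; rotate
# through real, current strings in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
]

def polite_delay(min_s=1.0, max_s=5.0):
    """Random human-like pause between requests (1-5 s, as suggested above)."""
    return random.uniform(min_s, max_s)

def build_request(url, referer="https://www.google.com/"):
    """Build a request with a rotated User-Agent and legitimate headers."""
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": referer,
    }
    return urllib.request.Request(url, headers=headers)

def polite_get(url):
    """Sleep a randomized interval, then fetch with rotated headers."""
    time.sleep(polite_delay())
    with urllib.request.urlopen(build_request(url), timeout=10) as resp:
        return resp.read()
```

With the requests library you would apply the same headers to a `requests.Session` instead; the pattern (jittered sleep, then a header-rotated request) is identical.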
Technical Efficiency & Architecture
- Look for Hidden APIs: Before parsing HTML, inspect the "Network" tab in your browser’s developer tools. Many modern sites load data via JSON APIs in the background. Sniffing these XHR requests is often faster, more reliable, and less bandwidth-intensive than scraping the full visual page.
- Cache Everything: During development, never hit the live server repeatedly for the same data. Store responses locally (caching). This speeds up your code testing and reduces the risk of getting banned for spamming requests.
- Watch Out for Honeypots: Sophisticated sites hide "traps"—links that are invisible to humans (via CSS like “display: none”) but visible to bots. If your scraper follows a honeypot link, your IP will be instantly blacklisted.
- Store Data Efficiently: Structured formats like JSON, CSV, or databases (SQL/NoSQL) optimize storage and retrieval. This choice reduces processing time later and ensures scalability as your dataset grows.
- Build Robust Error Handling: Network hiccups, server errors (5xx), or unexpected HTML changes can cause a scraper to crash. Implement retries with exponential backoff and logging to recover gracefully, maintaining uptime even during partial failures.
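The caching and retry advice above combine naturally in a few lines of standard-library Python; the cache directory name and backoff schedule below are illustrative choices, not a prescription.

```python
import hashlib
import os
import random
import time
import urllib.error
import urllib.request

CACHE_DIR = ".scrape_cache"  # hypothetical local cache directory

def backoff_delay(attempt):
    """Exponential backoff with jitter: ~1s, ~2s, ~4s, ..."""
    return 2 ** attempt + random.random()

def cached_fetch(url, retries=4):
    """Return a cached copy if we have one; otherwise fetch with retries."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest())
    if os.path.exists(path):  # cache hit: never re-hit the live server
        with open(path, "rb") as f:
            return f.read()
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = resp.read()
            with open(path, "wb") as f:
                f.write(body)  # store for the next development run
            return body
        except urllib.error.URLError:
            time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"giving up on {url} after {retries} attempts")
```

During development every repeated call is served from disk, so you can iterate on parsing logic without touching the live server.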
Ethical Scraper Governance
- Respect robots.txt/robots-ai.txt: These files serve as the "house rules" for a website, specifying which areas are off-limits to bots. While not always legally binding, ignoring them is the fastest way to trigger anti-bot defenses. Consider llms.txt, ai.txt, and robots-ai.txt, where available, to guide AI access.
- Don't DDoS the Target: Sending concurrent requests is powerful, but doing it too aggressively can crash a smaller site. Use throttling to limit your request rate, and try to scrape during "off-peak" hours (like late at night) to minimize impact.
- Ethical Exit Strategy: If a small site (<10k visits/day) blocks you, STOP. Buying more proxies to force your way in is not just unethical; it can also expose you to liability for unauthorized access.
- Not all data is meant for public consumption: Scraping personal information without consent or using it commercially in violation of the terms can lead to legal penalties. Always verify permissions and anonymize sensitive details when sharing scraped content.
- Comply with Legal Frameworks: Laws such as GDPR and CCPA restrict data collection from EU/US users without consent. Review terms of service, obtain necessary permissions, and avoid scraping protected content (e.g., paywalled articles).
- Respect the site’s purpose and user experience: avoid actions that could disrupt services, degrade performance, or harm real users.
- Be transparent and minimize data collection: collect only what is necessary for the stated objective; avoid sensitive or personally identifiable information (PII) without explicit consent.
- Responsible data use: prevent the dissemination of sensitive information, avoid deceptive practices, and protect privacy when sharing or storing scraped data.
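Python's standard library can enforce robots.txt rules before any request is sent, which covers the first governance point programmatically. A minimal sketch (the sample rules are invented):

```python
from urllib.robotparser import RobotFileParser

def policy_from_text(robots_txt):
    """Parse an already-fetched robots.txt body into a reusable policy."""
    rp = RobotFileParser()
    rp.modified()  # mark as freshly read so can_fetch() trusts the rules
    rp.parse(robots_txt.splitlines())
    return rp

# Invented rules for illustration.
RULES = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

policy = policy_from_text(RULES)
```

Before each request you would check `policy.can_fetch(my_agent, url)` and sleep at least `policy.crawl_delay(my_agent)` seconds; in production you would feed the parser the site's real /robots.txt via `RobotFileParser.set_url(...)` and `read()`.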
More detail: robots.txt, llms.txt, robots-ai.txt, and ai.txt
As LLMs and AI agents become primary content consumers, traditional crawler control mechanisms are proving insufficient. New standards have emerged to manage what content can be consumed, indexed, or used for training.
| Standard | Purpose | Location | Format | Adoption |
| --- | --- | --- | --- | --- |
| robots.txt | Traditional crawler control | /robots.txt | Plain text (REP syntax) | Universal |
| llms.txt | Structured content for LLMs | /llms.txt | Markdown | Growing (Mintlify, LangChain, Anthropic) |
| robots-ai.txt | AI crawler-specific directives | /robots-ai.txt | Plain text (REP extension) | Emerging |
| ai.txt | Granular AI consumption permissions | /ai.txt | JSON/YAML | Experimental |
robots.txt — The Classic Standard
The robots.txt file, originally created by Martijn Koster in 1994 and formally standardized as RFC 9309 by the IETF in September 2022, remains the foundational mechanism for controlling crawler access to websites. This plain text file, placed at the root of a domain (e.g., example.com/robots.txt), communicates to automated agents which sections of the site they may or may not access. While technically a voluntary protocol rather than an enforceable restriction, robots.txt has become the de facto standard that responsible crawlers—both search engines and AI companies—respect before accessing web content.
The protocol defines a simple syntax where User-agent lines identify specific crawlers, and Allow/Disallow directives specify accessible or restricted URL paths. Modern AI crawlers such as GPTBot (OpenAI), CCBot (Common Crawl), and Google-Extended (Google) now explicitly identify themselves in user-agent headers, allowing site owners to create targeted rules for AI-specific crawling alongside traditional search engine bots.
Basic Structure:
- User-agent: Identifies target crawler (use * for all)
- Allow/Disallow: Defines accessible or restricted paths
- Sitemap: Indicates XML sitemap location
- Crawl-delay: Requests delay between requests (seconds)
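A minimal robots.txt tying these directives together might look like this (the paths and the GPTBot rule are illustrative):

```text
# Rules for all crawlers
User-agent: *
Disallow: /admin/
Crawl-delay: 2

# AI-specific rule: block OpenAI's training crawler entirely
User-agent: GPTBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
```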
llms.txt — The AI-Optimized Standard
Proposed by Jeremy Howard (Co-Founder of Answer.AI) in September 2024, llms.txt represents a paradigm shift from restriction to provision. Unlike robots.txt, which tells crawlers what they cannot access, llms.txt actively provides structured, LLM-friendly content that AI systems can consume efficiently.
The fundamental problem this solves is that large language models struggle with HTML pages containing navigation menus, advertisements, JavaScript, and complex layouts—converting these to plain text often loses critical context or includes irrelevant information. LLMs also face limitations in their context windows, making it impractical to process entire websites. The llms.txt specification addresses this by having website owners curate a Markdown file containing a brief project summary, usage guidance, and curated links to the most relevant content in a format optimized for LLM consumption.
This approach empowers content creators to guide AI systems toward their most valuable and accurate information, rather than leaving AI models to parse and infer structure from raw HTML. The format has gained rapid adoption among developer tooling companies, with Mintlify, LangChain, Anthropic, and Cursor implementing support, creating momentum toward industry-wide standardization.
Key Points:
- H1 project name is the only required field
- Blockquote summary provides immediate context
- File lists use markdown hyperlinks with optional descriptions
- "Optional" section allows AI to skip secondary content when needed
- Files are served at /llms.txt (or subpath)
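A skeletal llms.txt following these points might look like this (the project name and links are invented for illustration):

```markdown
# Example Project

> Hypothetical open-source data toolkit. The blockquote summary gives an LLM immediate context before it follows any link.

## Docs

- [Quickstart](https://example.com/docs/quickstart.md): install and first run
- [API reference](https://example.com/docs/api.md): full function listing

## Optional

- [Changelog](https://example.com/changelog.md): release history, skippable
```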
robots-ai.txt — AI-Specific Directives
The robots-ai.txt specification (currently at version 1.1.0, published January 2026) extends the traditional robots.txt protocol with AI-specific capabilities. This emerging standard addresses the growing complexity of managing AI crawlers, which now include distinct agents for training data collection (e.g., GPTBot, CCBot) and real-time inference retrieval (e.g., ChatGPT-User, Claude-User).
The specification allows site owners to differentiate between these crawler types and apply different access rules based on whether content will be used for model training or for generating immediate answers in AI systems. Beyond simple allow/disallow decisions, robots-ai.txt introduces optional directives for crawl rate preferences (Request-rate, Crawl-delay) and preferred time windows for crawling (Visit-time). The format follows standard robots.txt syntax to ensure compatibility, but lives at a separate path (/robots-ai.txt) and operates as a supplementary layer that AI crawlers should check after respecting the primary robots.txt file. This approach acknowledges that not all AI crawlers currently implement extended protocols, so site owners are encouraged to maintain critical restrictions in their main robots.txt for reliable enforcement while using robots-ai.txt for granular AI-specific policies.
Key Directives:
- AI-Training-Disallow/Allow: Rules for training crawlers
- AI-Inference-Disallow/Allow: Rules for real-time retrieval crawlers
- Crawl-delay: Request delay in seconds
- Request-rate: Crawl rate (pages per second)
- Visit-time: Preferred crawling window (UTC)
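Since the spec is still emerging, treat the following as an illustrative sketch of a /robots-ai.txt rather than a canonical example; the directive names follow the list above, but exact value syntax may differ between versions:

```text
# Hypothetical /robots-ai.txt (supplementary to /robots.txt)
User-agent: GPTBot
AI-Training-Disallow: /blog/
AI-Inference-Allow: /docs/
Crawl-delay: 5
Request-rate: 1/10
Visit-time: 0200-0600
```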
ai.txt — Granular AI Permissions
The ai.txt format represents an experimental approach to AI crawler control, moving beyond simple access rules to provide comprehensive usage policies in machine-readable formats (JSON or YAML). While still in early stages of adoption, this specification allows site owners to define explicit permissions for different AI use cases, including training, inference, data enrichment, and redistribution.
The format enables structured declarations of which content categories are available for each purpose, rate-limiting parameters, and even retention policies for scraped data. Unlike text-based protocols, ai.txt's JSON/YAML structure supports complex nested policies that can programmatically express nuanced intentions—such as allowing training on blog content while restricting commercial use of scraped product data. This approach aligns with evolving legal frameworks such as GDPR and CCPA, where explicit consent and usage boundaries are becoming increasingly important for AI data governance.
Key Features:
- Explicit differentiation between training, inference, and enrichment use cases
- Commercial use flags and attribution requirements
- Structured rate-limiting parameters
- Preferred crawling time windows
- Contact information for policy inquiries
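Because ai.txt is experimental, the field names below are illustrative of the kinds of policies the format aims to express, not a normative schema:

```yaml
# Hypothetical /ai.txt policy (field names invented for illustration)
version: "1.0"
contact: ai-policy@example.com
permissions:
  training:
    allow: ["/blog/"]
    deny: ["/products/"]
  inference:
    allow: ["*"]
commercial_use: false
attribution: required
rate_limit:
  requests_per_minute: 30
crawl_window_utc: "02:00-06:00"
retention_days: 30
```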
Common Pitfalls: Silent Killers of Production Scrapers
Even experienced developers fall into these production-specific traps that cause silent data corruption, shadow bans, or catastrophic failures.
Selector Fragility & DOM Coupling
The Mistake: Using brittle XPath, such as /html/body/div[3]/div[2]/span, or auto-generated selectors that break when a site runs A/B tests or redesigns. Most scrapers fail within weeks due to this alone.
Why It Fails: Modern sites deploy daily UI changes. If your selector relies on positional indices or random class names (e.g., .css-1a2b3c), your scraper returns null values without raising errors, silently corrupting datasets.
The Fix:
- Prioritize semantic attributes: [data-testid], [data-cy], aria-label, or stable id attributes.
- Implement fallback selector chains: try CSS → XPath → text match before giving up.
- Use visual regression testing (Playwright screenshots) to proactively detect layout changes.
Pro Tip: Store a checksum of each target's DOM structure; alert when it changes >15%.
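The fallback-chain idea is library-agnostic. In this minimal sketch each extractor is any callable (a CSS lookup, an XPath query, a text match) tried in priority order, and the chain fails loudly instead of emitting the silent nulls described above:

```python
def extract_with_fallback(page, extractors, field_name="unknown"):
    """Try each (label, callable) extractor in priority order.

    `page` can be whatever your parsing library returns (a BeautifulSoup
    tree, an lxml element, ...). Raising instead of returning None is the
    point: silent nulls corrupt datasets without any error in the logs.
    """
    for label, fn in extractors:
        try:
            value = fn(page)
        except Exception:
            value = None  # a broken selector is treated as a miss
        if value is not None:
            return label, value
    raise LookupError(f"all selectors failed for field {field_name!r}")
```

With BeautifulSoup the extractors might be `("data-testid", lambda s: s.select_one("[data-testid=price]"))` first, a CSS class second, and a text match last; the dict-based example in the test below only illustrates the control flow.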
Pagination & Cursor Drift
The Mistake: Assuming “?page=1”, “?page=2” works forever. Many sites use cursor-based pagination (?after=ey...) with expiring tokens or rate-limit page turns per session.
Why It Fails: You hit page 50 successfully but get empty results or honeypot redirects. Your logs show 200 OK, so you don't notice the data loss until analytics fail.
The Fix:
- Always extract the "next" URL from the current page; never construct it manually.
- Validate pagination integrity: check for anchor elements unique to each page (e.g., last product ID).
- Persist cursor/timestamp state to disk; resume gracefully after crashes.
Pro Tip: Scrape pagination backwards (latest → oldest) to detect when old data archival breaks your loop.
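A minimal sketch of the cursor-persistence and server-driven pagination advice, assuming the target is a JSON API whose responses carry a "next" key; the key name and state-file path are placeholders to adapt per target:

```python
import json
import os

STATE_FILE = "pagination_state.json"  # hypothetical checkpoint file

def save_cursor(next_url, last_item_id):
    """Persist where we are so a crash can resume mid-crawl."""
    with open(STATE_FILE, "w") as f:
        json.dump({"next_url": next_url, "last_item_id": last_item_id}, f)

def load_cursor():
    """Return the saved checkpoint, or None on a fresh start."""
    if not os.path.exists(STATE_FILE):
        return None
    with open(STATE_FILE) as f:
        return json.load(f)

def next_page_url(page):
    """Always take the 'next' link the server itself emitted; never
    construct ?page=N by hand. `page` is a parsed response (dict here)."""
    return page.get("next")  # hypothetical key; adapt to the target
```

The crawl loop then becomes: fetch, validate an anchor element unique to the page (e.g., the last item ID), `save_cursor(...)`, and follow `next_page_url(...)` until it returns None.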
Session & Cookie Mismanagement
The Mistake: Creating a new requests.get() connection per request instead of reusing a persistent session. Modern WAFs (Web Application Firewall) flag this as bot behavior instantly.
Why It Fails: Without persistent cookies and TLS session reuse, you get 403s after 2-3 requests, especially behind Cloudflare or Akamai. Each request appears to be a cold connection from a different client.
The Fix:
- Use requests.Session() or aiohttp.ClientSession() and reuse it across your entire crawl.
- Warm up sessions: browse homepage → category → target page before scraping, mimicking human pathing.
- Export/import cookies in Netscape format for distributed crawling; don't regenerate sessions per worker.
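A standard-library sketch of session reuse and warm-up; the User-Agent is a placeholder, and with the requests library you would use `requests.Session()` to get the same cookie persistence:

```python
import http.cookiejar
import urllib.request

def make_session(user_agent="Mozilla/5.0 (X11; Linux x86_64) PlaceholderUA"):
    """Build ONE opener shared across the whole crawl: cookies persist in
    the jar, so the target sees a continuous session rather than repeated
    cold connections from apparently different clients."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar

def warm_up(opener, urls):
    """Mimic human pathing (home -> category -> target) before scraping."""
    for url in urls:
        opener.open(url, timeout=10).read()
```

Every `opener.open(...)` call after `warm_up` carries the cookies accumulated on the earlier pages, which is exactly what a new connection per request throws away.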
The Inverted Honeypot (Human-Only Traps)
The Mistake: Believing stealth means "not clicking invisible links". Advanced sites detect bots by what they don't interact with: invisible checkboxes, “onmousemove” listeners, or scroll-depth trackers.
Why It Fails: Your scraper extracts data perfectly, but IPs land on a shadow-ban list receiving poisoned data (incorrect prices, fake stock levels). Silent failure is the worst failure.
The Fix:
- Use Playwright/Selenium with stealth plugins (playwright-stealth, undetected-chromedriver).
- Simulate micro-interactions: page.hover(), realistic mouse trails, and keyboard scroll (PageDown).
- Override navigator.webdriver and patch Chrome object properties to remove automation flags.
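The browser-automation specifics depend on Playwright/Selenium and their stealth plugins, so here is just the reusable part: the JS patch commonly injected before page scripts run (for example via Playwright's `page.add_init_script`), plus a jittered mouse-path generator for humanized movement. Treat both as illustrative sketches, not a complete stealth setup.

```python
import random

# JS snippet that hides the automation flag by redefining the property
# before the page's own scripts can read it. The property name is real;
# how you inject it depends on your automation framework.
STEALTH_INIT_JS = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
"""

def mouse_trail(start, end, steps=20):
    """Generate a jittered path between two points so cursor movement
    looks organic instead of a single teleporting jump."""
    (x0, y0), (x1, y1) = start, end
    path = []
    for i in range(1, steps + 1):
        t = i / steps
        path.append((
            x0 + (x1 - x0) * t + random.uniform(-2, 2),
            y0 + (y1 - y0) * t + random.uniform(-2, 2),
        ))
    return path
```

In a Playwright session you would feed each `(x, y)` pair from `mouse_trail` into a mouse-move call with small sleeps between steps, rather than jumping straight to the target element.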
Legal Creep in Data Enrichment
The Mistake: Scraping public data (legal) but then enriching it with PII from other sources (e.g., LinkedIn emails) without re-validating consent. This crosses legal boundaries invisibly.
Why It Fails: GDPR forbids automated profiling without explicit consent. You might anonymize one dataset but de-anonymize it through cross-referencing, creating liability.
The Fix:
- Document data lineage: tag each field with source URL and consent status.
- Apply k-anonymity (k>=5) before merging datasets to prevent re-identification.
- Consult legal counsel before enriching scraped data with third-party PII (Personally Identifiable Information), even if "public."
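The k-anonymity check is straightforward to automate before merging datasets. A minimal sketch (the quasi-identifier field names are examples):

```python
from collections import Counter

def k_anonymous(records, quasi_identifiers, k=5):
    """True if every combination of quasi-identifier values appears at
    least k times, i.e. no record can be singled out by those fields."""
    combos = Counter(
        tuple(rec[q] for q in quasi_identifiers) for rec in records
    )
    return all(count >= k for count in combos.values())
```

Run this on the merged dataset with the fields an attacker could cross-reference (postal code, age band, job title, ...); if it returns False, generalize or drop values until it passes before sharing the data.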
The Python Ecosystem: The Standard Toolset
At DKL, Python is our main language, so it deserves special mention here. Python remains the undisputed champion of scraping due to its versatility. Choosing the right library depends on the target website's complexity.
Beautiful Soup (Best for Beginners & Static Data)
Beautiful Soup is a parsing library. It is excellent for navigating HTML trees and extracting data from messy code. It helps locate and separate specific data points from large HTML content using a simple syntax.
- Pros: Easy to learn and great for simple HTML parsing, for simple projects where the data is in plain HTML.
- Cons: It cannot fetch data (you need the requests library) and cannot handle JavaScript.
Selenium & Playwright (Best for Dynamic Content)
When a site uses JavaScript to load data (e.g., infinite scrolling or clicking "Load More"), simple parsers fail. Tools like Selenium and Playwright open a real browser (headless) to render the page.
- Pros: Handles complex, dynamic websites and automated testing (interacting with forms, clicking buttons, and scraping Single Page Applications, or SPAs).
- Cons: They are resource-heavy and slower than simple request-based scrapers.
Scrapy (Best for Scale)
Scrapy is a full-featured framework, not just a library. It handles requests, concurrency, and data export (CSV/JSON) out of the box. For large-scale data extraction, Scrapy is the most powerful open-source framework available. It allows you to send concurrent requests, making it incredibly fast.
- Pros: Best for large-scale projects where you need to scrape thousands of pages quickly using asynchronous processing. Exports data easily to CSV/JSON/databases, and provides built-in mechanisms (auto-throttling, retries, middleware) that help manage anti-bot measures.
- Cons: Steeper learning curve than Beautiful Soup.
Online and Enterprise Tools
For those who prefer low-code solutions or need enterprise-grade reliability, several online platforms stand out.
- ScraperAPI: This tool handles proxies, CAPTCHAs, and browsers for you. You simply send a request to their API, and they return the HTML. It is ideal for scaling up without managing infrastructure.
- OxyLabs: A market leader offering a massive IP pool and tools like "Oxy Copilot," which uses AI to generate scraping code from prompts.
- Octoparse: A visual, point-and-click tool that requires no programming knowledge. It is excellent for users who want to extract data into Excel or JSON via a visual interface.
- Browse AI: A no-code platform that lets you train "robots" to scrape data and monitor websites for changes simply by clicking on elements in a browser.
AI Tools
AI web scraping tools use artificial intelligence to adapt to layout changes automatically, reducing the need for constant code maintenance. These tools are revolutionizing the field by converting raw web data into LLM-ready formats.
- Bright Data: Rated as a top AI scraping tool, it offers an "LLM-ready" Search API and huge proxy networks. It features autonomous AI agents that can interact with websites in real time.
- Crawl4AI: An open-source Python library optimized for AI agents. It uses heuristics to speed up extraction for LLMs and supports aggressive crawling strategies.
- ScrapeGraphAI: A library that uses Large Language Models (LLMs) and graph logic to create scraping pipelines. You can simply provide a prompt and a URL to get structured data.
- Firecrawl: Designed specifically for AI applications, this tool scrapes a URL and returns clean Markdown or structured data, which is the preferred format for LLMs.
Reflections on AI crawling, hosting viability, and the future of open source
In recent years, we have witnessed an unprecedented transformation in the technological landscape. Generative artificial intelligence has evolved from an academic curiosity into an omnipresent force that shapes entire industries. Yet, behind every sophisticated language model and every clever chatbot response lies a silent infrastructure paying an increasingly steep price: the servers, repositories, and public resources that feed this ecosystem.
The Dilemma of Mass Crawling
AI crawlers have scaled to volumes that system administrators never anticipated. While traditional search engines like Google or Bing operate with a certain courtesy, respecting rate-limiting thresholds and robots.txt files, new scraping agents for AI model training frequently ignore these conventions. The result is a constant barrage of requests that can multiply a website's typical traffic by a hundredfold.
This situation poses a cruel paradox: the content that developers and creators have generously shared for decades to enrich the web is becoming raw material for systems that, in many cases, threaten to displace the very creators who made them possible. A programming blog that once received modest visits may now face astronomical hosting bills simply because its tutorials are valuable for training a code model.
The Economic Viability of Hosting
The economic model of public web hosting was designed under assumptions that are no longer valid. Free or low-cost hosting providers assumed human traffic patterns: peaks during high-activity hours, navigation through individual pages, and reasonable dwell times. AI bots break all these patterns: they operate 24/7, download content massively, and rarely respect the navigation structure designed for humans.
For personal projects, technical blogs, or small communities, this represents an existential threat. Many administrators are forced to implement drastic measures: blocking entire IP ranges, implementing verification systems that frustrate legitimate users, or simply removing their content from the public web. The tragedy is that these measures erode precisely the spirit of openness and sharing that made the internet great.
The Open Source Business Model in Question
Open source software has historically been a public good maintained through a combination of idealism, peer recognition, and indirect business models. Companies sponsored open-source projects because shared infrastructure benefited them. Developers contributed because they built a reputation and skills. This equation is cracking.
When large corporations use open-source projects to train their AI models without contributing to the ecosystem, they effectively privatize collectively generated value. A language model trained on millions of GitHub repositories can generate functionally similar code, but the company behind that model has not invested a single dollar in maintaining the tools that made its existence possible. Open source project maintainers, many of them volunteers, see no economic return while their creations are commoditized.
Traditional software licenses were not designed for this scenario. The GPL, MIT, and Apache all assume a world where software runs on computers, not where it is ingested to train statistical models. The legal and technological community is still searching for answers to fundamental questions about where legitimate use ends and exploitation begins.
The Tailwind Case: A Warning Sign
The case of Tailwind CSS and the recent layoffs at its parent company, Tailwind Labs, dramatically illustrates these tensions. In early 2025, the company announced a significant workforce reduction, including several key members of its development team. The official reason was the need to restructure toward a more AI-focused approach. Adam Wathan (creator of Tailwind CSS) disclosed that declining documentation traffic (attributed to AI summarization tools) was forcing workforce reductions at his company (more info in the #2388 PR of the tailwindcss project or his X account). This case is instructive for the entire open source ecosystem:
- Tailwind's primary revenue model: Premium UI components (Tailwind UI) purchased by developers who learn the framework through free documentation
- AI tools now answer "how do I center a div in Tailwind?" without users visiting the docs
- Result: 30-40% traffic drop, direct revenue impact, layoffs
The irony is palpable: Tailwind CSS is one of the most successful open-source projects of recent years, used by millions of developers and companies that generate trillions in economic value. Yet the business model based on premium documentation, courses, and complementary tools is being undermined by the very technology that feeds on the ecosystem Tailwind helped build. When a developer can ask an AI assistant to "generate a component with Tailwind" without consulting the official documentation, an entire potential revenue stream evaporates.
The layoffs at Tailwind are not simply business news: they are a symptom of a systemic disease. They represent the moment when a company that had found a sustainable balance between open source and economic viability discovers that this balance has been altered by forces beyond its control. If a project as popular and well-managed as Tailwind can be forced to cut staff, what hope do the thousands of smaller projects that keep digital infrastructure running have?
Where We Are Heading
We urgently need a new governance framework for the AI era. This includes technical mechanisms for creators to express their preferences about how their content is used, legal frameworks that recognize creators' rights in the context of model training, and above all, an honest conversation about how to fairly distribute the benefits of AI.
The promise of artificial intelligence is genuinely transformative: democratizing access to knowledge, accelerating innovation, and solving problems that seemed intractable. But if we achieve these benefits at the cost of destroying the ecosystems that made them possible, we will have made a tragic mistake. The future of the open web depends on finding a balance that honors both the potential of AI and the value of the human contribution that makes it possible.
The next time you interact with an AI assistant, take a moment to consider the vast tapestry of human knowledge that makes that interaction possible. And ask yourself: are we building a future where that tapestry can continue to grow, or one where we are wearing it down until it breaks?