
Web Scraping & AI: How to Extract Public Data Reliably

📅 2026-03-14 ⏱️ 5 min read

Traditional scrapers break at the slightest layout change. Discover how AI enables flexible, semantic data extraction.

The web is the world's largest database. Whether for competitor monitoring, real estate analysis, or recruiting, gathering targeted data (web scraping) is a major asset. But developers know the problem too well: a scraper built on rigid selectors (XPath expressions, CSS class names) breaks as soon as the target site changes its markup, even slightly. Adding an AI model to the pipeline addresses much of this long-standing fragility.

The Limits of Traditional Selector-Based Scraping

Traditionally, to retrieve a product price on an e-commerce site, you write a script targeting the exact CSS class of the price element (e.g., div.price-tag-large). The day the site switches to Tailwind CSS and replaces that class with utility classes such as text-red-500 font-bold, the selector matches nothing, and the script either raises an error or, worse, returns a null value without alerting you. Maintaining these bots is a recurring cost for data teams.
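
To make the failure mode concrete, here is a minimal sketch of a class-based extractor, using only Python's standard library. The two sample pages and the names ClassTextExtractor and extract_price are illustrative assumptions, not code from a real scraper. The one deliberate design choice is that a missing element raises an error instead of quietly returning nothing.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so depth tracking stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collects the text of the first element carrying a given CSS class."""

    def __init__(self, target_class: str) -> None:
        super().__init__()
        self.target_class = target_class
        self._depth = 0  # greater than 0 while inside the matched element
        self.text_parts: list[str] = []
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if not self.found and self.target_class in classes:
            self.found = True
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.text_parts.append(data)


def extract_price(html: str, css_class: str = "price-tag-large") -> str:
    """Return the text of the price element, or raise if the class is gone."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    text = "".join(parser.text_parts).strip()
    if not parser.found or not text:
        # Fail loudly: a silent None is what lets bad data into a pipeline.
        raise LookupError(f"no element with class {css_class!r}; layout may have changed")
    return text


# Hypothetical before/after markup for the same product page.
old_page = '<div class="price-tag-large">$19.99</div>'
new_page = '<div class="text-red-500 font-bold">$19.99</div>'
```

Calling extract_price(old_page) returns the price text, while extract_price(new_page) raises LookupError: the price is still on the page, but the selector can no longer see it.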

Semantic Scraping via LLMs

The modern approach combines a simple scraper that retrieves the page's raw HTML with a Large Language Model (LLM) tasked with extracting meaning. Instead of telling the machine "retrieve the text in the third div", you tell it: "identify the product price in this raw HTML code and return it as a number."

Because the model works from the semantic cues on the page (surrounding context, table tags, currency symbols), it can usually spot the information even if the CSS classes change or the layout is completely reorganized.
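
A minimal sketch of this idea follows. It assumes the model is reachable through some function that takes a prompt string and returns the reply text; that function is passed in as the hypothetical parameter complete, so no particular vendor API is implied. The prompt wording, the JSON reply format, and the names build_prompt and extract_price_with_llm are likewise assumptions made for illustration.

```python
import json
from typing import Callable

# Assumed prompt wording; asks for JSON so the reply can be validated.
PROMPT_TEMPLATE = (
    "Identify the product price in the HTML below. "
    'Reply with JSON only, in the form {{"price": <number or null>}}.\n\n'
    "HTML:\n{html}"
)


def build_prompt(html: str) -> str:
    """Embed the raw HTML in the extraction instruction."""
    return PROMPT_TEMPLATE.format(html=html)


def extract_price_with_llm(html: str, complete: Callable[[str], str]) -> float:
    """Ask the model for the price and validate its reply.

    `complete` is a hypothetical stand-in for whatever LLM client you use:
    it receives the prompt and returns the model's text reply.
    """
    raw = complete(build_prompt(html))
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return JSON: {raw!r}") from exc
    price = payload.get("price") if isinstance(payload, dict) else None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        raise ValueError(f"no numeric price in model reply: {raw!r}")
    return float(price)


def fake_model(prompt: str) -> str:
    """Canned reply used here in place of a real model call."""
    return '{"price": 19.99}'


price = extract_price_with_llm(
    '<div class="text-red-500 font-bold">$19.99</div>', fake_model
)
```

Validating the reply matters as much as the prompt: a model can return prose, a null, or a malformed object, and each of those should surface as an error, not as a silent gap in the data.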

Optimizing Extraction Costs

Passing the entire HTML of a page through GPT-4 or Claude for every request is expensive in tokens. A practical trick is to run the page through an HTML cleaning step first (for example Readability.js, or an HTML-to-Markdown converter such as Turndown) to strip out noise (styles, scripts, navigation headers), then send only the clean text skeleton to the model.
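
As a rough illustration of the cleaning step, the sketch below strips noisy elements with Python's standard-library HTML parser. It is a much simpler stand-in for a dedicated cleaning library, and the list of tags treated as noise, the sample page, and the names TextSkeleton and clean_html are assumptions to adapt per site.

```python
from html.parser import HTMLParser

# Elements whose content is assumed to be noise for extraction purposes.
NOISE_TAGS = {"script", "style", "noscript", "svg", "nav", "header", "footer"}


class TextSkeleton(HTMLParser):
    """Keeps visible text and drops everything inside noise elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a noise element
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.chunks.append(text)


def clean_html(html: str) -> str:
    """Return the page's visible text, one chunk per line."""
    parser = TextSkeleton()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical product page used only to show the size reduction.
page = """<html><head><style>.a{color:red}</style><script>var x = 1;</script></head>
<body><nav><a href="/">Home</a></nav><h1>Blue Widget</h1><p>Price: <b>$19.99</b></p></body></html>"""

skeleton = clean_html(page)
print(len(page), "characters in,", len(skeleton), "characters out")
```

The trade-off is that aggressive stripping can discard useful context, for instance a price that only appears inside a header element, so the noise list is worth checking against a few real pages before relying on it.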

Respecting Ethics and Detecting Blocks

Scraping robustly also means respecting target platforms. Space requests out with random delays, honor each site's robots.txt file, and do not collect protected or private personal data. If a firewall such as Cloudflare blocks your requests, rotating residential proxies remain essential to simulate genuine human connections.
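
The first two rules can be sketched with the standard library alone. In the example below, the user agent string, the delay bounds, and the helper names load_rules, polite_delay, and fetch_allowed are assumptions; the fetch function is injected so the sketch does not depend on any particular HTTP client.

```python
import random
import time
from typing import Callable, Iterable
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-research-bot"  # hypothetical bot name


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that was already downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def polite_delay(min_s: float = 2.0, max_s: float = 6.0) -> float:
    """Pick a random pause, in seconds, between two requests."""
    return random.uniform(min_s, max_s)


def fetch_allowed(
    urls: Iterable[str],
    rules: RobotFileParser,
    fetch: Callable[[str], str],
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Fetch only the URLs robots.txt permits, pausing after each request."""
    results: dict[str, str] = {}
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # disallowed path: skip it, do not work around it
        results[url] = fetch(url)
        sleep(polite_delay())
    return results


# Hypothetical robots.txt content for illustration.
rules = load_rules("User-agent: *\nDisallow: /account/\n")
```

In a real run, fetch would wrap your HTTP client and the robots.txt text would come from the target site itself; the check and the delay stay the same.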

Conclusion: Semantic Extraction Is the Future of Data

Thanks to LLMs, web scraping is shifting from a fragile craft to a reliable engineering discipline. Your data pipelines will no longer break at your competitors' slightest button redesign.



The Jour de Chance Team

Digital acquisition and media strategy experts.
