LLM Schema Extraction: From Messy HTML to Clean JSON
We use Groq's LLaMA models to parse raw scraped HTML into typed JSON objects. This post explains the prompt engineering and schema design behind it.
Deep dives, tutorials, and engineering notes from the team building Scrapify.
The classic serverless scraping problem: your function times out before the headless browser finishes. Here's the architecture that fixed it completely.
React, Vue, and Angular apps make scraping harder. We walk through every technique we use — waitForSelector, networkidle, and lazy-load scrolling.
A step-by-step walkthrough of setting up a weekly price extraction job, storing results in the dashboard, and querying the data with Chat.
Rate limiting, robots.txt compliance, and data minimisation aren't optional extras — they're core to how Scrapify works by default.
How we chunk, embed, and store scraped content so users can ask plain English questions and get grounded, accurate answers.