AI Tools

I Built 15 Scrapers in One Session

May 25, 2026 | 10 min read

Some links in this log are affiliate links. I earn a small commission at no extra cost to you.

The Build

Most government websites do not have APIs. That is the problem at the center of this build, and it is a problem that does not get talked about enough outside of policy circles and compliance departments. Legal documents, regulations, compliance rules, enforcement guidance: all of it is buried in PDFs sitting inside portals that were last updated somewhere around 2003. The people who need to understand these documents are not lawyers with four years of training and a research assistant. They are families navigating disability services systems for a child who was just diagnosed. They are advocates working for organizations that cannot afford in-house legal teams. They are professionals in education, healthcare, and social services who need to understand the rules of the systems they operate inside every single day. There is a gap between what exists (thousands of pages of dense legal text in locked-down portals) and what people actually need (plain English answers to specific questions), and search engines do not close it. You cannot Google your way to a clear answer when the source document is a 400-page federal regulation and the relevant clause is on page 287. The gap requires a translation layer. That is what this build is.

The decision to build the translation layer came out of a simple observation about what the tools can now do that they could not do two years ago. Playwright is a browser automation library built by Microsoft. It navigates portals the way a human would: it opens a browser, fills out forms, clicks through pagination, waits for JavaScript to render, and downloads the output. It does not care whether the portal has an API because it does not use the API. It uses the browser. That distinction is everything when you are working with government portals built on legacy infrastructure by vendors who have not updated their contracts since before mobile internet was a consideration. What Playwright does is turn an inaccessible portal into an accessible data source. Once the portal is accessible, pdfplumber handles the next step. PDF text extraction sounds simple until you sit with a government PDF that contains scanned tables, multi-column layouts, headers that span rows, footnotes embedded in the body, and text that was produced by optical character recognition from a physical document printed in 1998. pdfplumber reads the text layer the PDF already carries (it does not run OCR itself) and pulls out the tables, the structure, and the footnotes. It does not always get it perfect. Nothing does. But it gets enough of it right often enough to produce text that the next layer can work with. That next layer is Claude. The extracted text goes to the Claude API. The prompt instructs Claude to read the document, identify the subject matter, produce a plain English summary of the key requirements, flag any enforcement provisions, and note the effective date. What comes back is a structured output: jurisdiction, source, document type, plain English summary, key clauses, effective date. PostgreSQL stores everything. A database hosted on neon.tech gives the application its persistence layer. pgvector adds semantic vector search to PostgreSQL, and running it together with keyword search gives the application hybrid search, so that a user asking "what are the transition planning requirements for students with disabilities" finds the right section even if the document uses completely different terminology. The combination of these four tools works where nothing else did because each one handles the step it is actually suited for. No single tool tries to do everything. The system is the combination.
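
Hybrid search means two ranked result lists have to become one. The log does not say how the two lists are merged, so the sketch below uses reciprocal rank fusion, one common way to do it, purely as an illustration. The document IDs and the constant k=60 are assumptions, not values from this build.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in.
    The scores are summed, so documents that both searches agree on
    rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query, best match first.
vector_hits = ["doc-a", "doc-b", "doc-c"]   # semantic (vector) search
keyword_hits = ["doc-b", "doc-d", "doc-a"]  # keyword search

merged = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(merged)
```

A document that ranks well in both lists beats one that tops only a single list, which is the behavior the transition planning example depends on.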

The first scraper took the longest. That is always true and it does not stop being true no matter how many times you go through it. The first scraper was for a federal education regulation portal. The path issues came first. Playwright was configured to use a download directory that did not exist at the path specified. The error message was unhelpful in the specific way that path errors are always unhelpful: it told me what failed but not why. Claude Code handled the diagnosis. It read the error, identified the cause, rewrote the relevant section of the script, and explained the change in plain English. That exchange took four minutes from error to fix. The encoding problems came next. The PDFs contained characters outside the basic ASCII range. pdfplumber was extracting them correctly but the database write was failing because the connection was not configured for an encoding that could represent them. Claude Code fixed that too: connection string, encoding specification, retry logic. The third problem was the one that took the most time: one of the portals rendered its content with JavaScript instead of serving static HTML. Playwright can handle JavaScript rendering, but it needs a different configuration than static HTML scraping. The default configuration assumed static HTML and read the page as soon as it loaded. The JavaScript portal requires a wait condition: Playwright needs to wait until the JavaScript has finished rendering the content before it reads the page. Claude Code identified the pattern from the error log, wrote the wait condition, and the portal started returning data. The first scraper, start to finish including all three failure modes and their fixes, took approximately four hours. The fifteenth scraper, building on all the patterns established by the first fourteen, took forty minutes. Not because the fifteenth portal was simpler. Because the pattern was locked and the tooling was reusable. The helper functions written for scraper three got used in scrapers five, nine, and twelve. The encoding fix written for scraper one was already in the shared utility before scraper two started. Every scraper after the first is faster because the first one did the hardest work.
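
Two of those fixes, the missing download directory and the wait condition, fit in one small helper. This is a minimal sketch rather than the actual script from the build: the function names, the selectors, and the choice to wait on a results selector are all assumptions. It uses Playwright's synchronous Python API.

```python
from pathlib import Path


def ensure_download_dir(path):
    """Create the download directory (and any parents) if it is missing."""
    directory = Path(path).expanduser().resolve()
    directory.mkdir(parents=True, exist_ok=True)
    return directory


def download_pdf(url, results_selector, link_selector, download_dir):
    """Open a JavaScript-rendered page, wait for its content, save one PDF.

    Both selectors are hypothetical; every portal needs its own.
    """
    # Imported here so the path helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    target_dir = ensure_download_dir(download_dir)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # The wait condition: block until the script-rendered results exist.
        page.wait_for_selector(results_selector)
        # Clicking the link triggers the download that Playwright captures.
        with page.expect_download() as download_info:
            page.click(link_selector)
        download = download_info.value
        destination = target_dir / download.suggested_filename
        download.save_as(destination)
        browser.close()
    return destination
```

Creating the directory up front turns the vague path error into a non-issue, and waiting on a concrete selector is usually more predictable than a fixed sleep.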

The insight that changed how I think about this build is not a coding insight. It is a pattern recognition insight. Every government portal, regardless of the agency or the jurisdiction, has the same underlying architecture once you learn to read it. There is a search interface. There is a results list. There is a detail page. There is a download link or a render-in-browser PDF. The specific implementation varies. The pattern does not. Every PDF produced by a government agency, regardless of the subject matter, has the same structural characteristics. There is a title. There is an effective date. There is a table of contents. The body is organized in numbered sections with subsections. Footnotes reference other documents. Tables contain the compliance requirements. The specific content varies. The structure does not. Every Claude prompt for document translation follows the same template. Extract the subject. Identify the jurisdiction. Produce a plain English summary of requirements. Flag enforcement provisions. Note the effective date. Format as structured JSON. The specific output varies. The template does not. Once the pattern is locked across all three layers — the portal navigation, the PDF extraction, the Claude translation — building a new scraper for a new data source is a matter of identifying which variation of the known pattern you are dealing with and applying the appropriate configuration. The fifteenth scraper was faster than the first not because the problem was smaller but because the pattern recognition was sharper. The reusable portal helper cut the per-scraper code volume in half by the time the third scraper was complete. The extraction utility reduced the pdfplumber configuration from thirty lines to six. The Claude prompt template was stable by scraper four and untouched through fifteen. Pattern recognition is the skill. The code is the output of the pattern recognition.
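
The prompt template is the easiest of the three layers to show. What follows is a sketch under assumptions, not the prompt used in the build: the wording, the field names, and the validation step are mine, with the fields taken from the structured output described earlier in this log.

```python
import json

# Field names are assumptions based on the structured output this log describes.
REQUIRED_FIELDS = (
    "jurisdiction",
    "source",
    "document_type",
    "plain_english_summary",
    "key_clauses",
    "effective_date",
)

PROMPT_TEMPLATE = """Read the legal document below and translate it for a non-lawyer.
Respond with one JSON object containing exactly these keys: {fields}.
- Extract the subject and identify the jurisdiction.
- Summarize the requirements in plain English.
- Flag enforcement provisions in the key clauses.
- Note the effective date, or use null if none is stated.

DOCUMENT:
{document}
"""


def build_prompt(document_text):
    """Fill the shared template with one document's extracted text."""
    return PROMPT_TEMPLATE.format(
        fields=", ".join(REQUIRED_FIELDS), document=document_text
    )


def parse_translation(raw_response):
    """Parse the model's JSON reply and check that every field is present."""
    record = json.loads(raw_response)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return record


# A made-up reply, standing in for what the API would return.
sample_reply = json.dumps({
    "jurisdiction": "federal",
    "source": "example portal",
    "document_type": "regulation",
    "plain_english_summary": "Schools must plan for life after graduation.",
    "key_clauses": ["Planning must begin by a set age."],
    "effective_date": None,
})
record = parse_translation(sample_reply)
print(record["jurisdiction"])
```

Validating the reply before it reaches the database is one way to catch a failed translation early instead of storing a half-empty record.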

The signal that the build was working came from the API cost dashboard. When API costs start to climb without any deliberate action on your part, it means the system is processing at volume. 568 documents processed across 15 data sources. 557 processed successfully by Claude, or 98 percent. The remaining 11 were corrupted source files, documents with encoding errors that survived the extraction step but failed the translation step, and one document that was entirely in a language other than English that was not caught by the triage filter. The advocates who use the free tier are active. The search logs show what people actually look for: transition planning requirements, evaluation timelines, parental rights under IDEA, least restrictive environment standards. These are not abstract research queries. These are the specific questions that families and advocates face in IEP meetings and due process hearings and compliance reviews. The gated version, with deeper search, full document access, and the ability to export plain English summaries, has real users who found it without any marketing beyond the channel. Usage data is the most honest signal available. Nobody uses a tool they do not find valuable. The usage data says the tool is valuable. That is the signal.

The hardest problem this build surfaces is not a technical problem. The tech works. The hardest problem is the monetization question, and it is harder here than in most products because the base that needs this tool most (advocates, families navigating disability systems, professionals in under-resourced organizations) is also the base that is least able to pay enterprise pricing. The freemium model is the answer in practice because it is the only model that serves both the mission and the revenue goal simultaneously. The free tier covers the most common use cases: search, plain English summaries of key sections, access to the most recently updated documents across all 15 data sources. The free tier does not require a credit card. It does not expire. It does not downgrade to read-only after a trial period. What stays free stays free. The gated tier, priced for organizations rather than individuals, includes full document export, bulk search across the entire 557-document library, API access for organizations building their own compliance tools on top of the database, and priority processing as new documents are added. The business model here is not selling to the families. It is selling to the organizations that serve families: advocacy groups, law firms with disability practice areas, school districts that need compliance support at scale. Those organizations have budget. Those organizations also have the mission alignment to pay for a tool that makes the regulatory landscape navigable for the people they serve. This one build connects directly to the $9,650 per month gap and May 31, 2027 because it is the kind of recurring subscription product that compounds over time. One organizational subscriber renews every month without a new sales conversation. The asset, once built, keeps earning. That is the model. The build is documented so you can see exactly how it was assembled. Follow along. This is the build log.

The Tools

Tool	What I Used It For	Link
Playwright (Microsoft)	Browser automation for portal navigation	playwright.dev
pdfplumber	PDF text and table extraction	github.com/jsvine/pdfplumber
Claude API (Anthropic)	Plain English translation of legal documents	anthropic.com
PostgreSQL + pgvector	Document storage and hybrid search	postgresql.org
Claude Code	98% of debugging handled automatically	anthropic.com
ElevenLabs	AI voice synthesis (affiliate)	Try free 30 days

The Math

Item	Value	Notes
Documents scraped	568	Across multiple govt + non-govt entities
Documents processed by Claude	557	98% processed successfully
Scrapers built	15	One per data source/portal
Cost per document (Claude)	~$0.012	Haiku for triage + Sonnet for processing
Total processing cost	~$6.70	557 processed docs x ~$0.012
Free tier users	Active	Advocates and families
Revenue from this build	$0	Monetization question still open
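
The numbers in the table can be checked with a few lines of arithmetic. This sketch uses only the figures stated above. The per-document cost is approximate, so the totals are too. Note that the ~$6.70 total matches the 557 processed documents; multiplying by all 568 scraped documents gives a slightly higher figure.

```python
COST_PER_DOC = 0.012  # approximate Claude cost per document, from the table
SCRAPED = 568         # documents scraped
PROCESSED = 557       # documents Claude processed successfully

success_rate = PROCESSED / SCRAPED
cost_processed = round(PROCESSED * COST_PER_DOC, 2)
cost_all_scraped = round(SCRAPED * COST_PER_DOC, 2)

print(f"Success rate: {success_rate:.1%}")
print(f"Cost for processed documents: ${cost_processed:.2f}")
print(f"Cost if all scraped documents were processed: ${cost_all_scraped:.2f}")
```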

The Next Log

Next: The First Product That Actually Sold