Programmatic SEO · The mechanics
Where the data comes from — sourcing rows worth publishing.
A programmatic page is only ever as good as the row behind it, so the row is where the work is. This page covers the order to source data in (your own first), what counts as data that adds value versus filler that just justifies a URL, where the legal and ethical lines sit for scraping, and the hygiene step everyone skips.
The template is the easy part. The data is the job.
People who get programmatic SEO wrong spend their time on the template — a clever layout, a slick build pipeline — and treat the data as an afterthought, something to scrape in an hour to “fill the cells.” That’s backwards. The template is a frame; what hangs in it is the data. A great template fed empty rows produces empty pages, and empty pages don’t rank — they sit un-indexed or, worse, drag the rest of the site down. So the real question of programmatic SEO isn’t “what template?” — it’s “do I actually have something true and specific to put in every row?” If you don’t, you don’t have a programmatic project yet; you have a spreadsheet of URLs you can’t fill.
Your own data, first
The best data source is the one nobody else has: yours. A service business is sitting on more publishable, genuinely useful data than it realises.
- Jobs done. What you fixed, where, what it cost roughly, what was unusual about it. Anonymised, this is real, specific content — the stuff a prospect actually wants to read on a city page.
- Services. Every service line, with the detail a buyer needs: what’s involved, what it solves, what it doesn’t, how it’s priced.
- Locations. The areas you serve — and the real facts about each: response time there, the housing or building stock, the recurring problems, the landmarks. This is what makes a service-area page substantial rather than a swapped noun. (More on the local side in service-area pages.)
- FAQs. The questions customers actually ask, with your actual answers. Each one is a potential answer page.
- Pricing tiers. What things cost, in ranges if not exact. Buyers search for this constantly and almost nobody publishes it; if you can, it’s a strong, specific page.
- Response times, coverage, capacity. “Same-day in these suburbs, next-day in those” — concrete, useful, true only of you.
Your own data is also the data with the least legal and ethical complexity, the most credibility, and the strongest claim to being “unique.” Start here. Most of a good local programmatic build is your own records, organised and templated — which is exactly the same engine behind topical authority done at scale, and what the programmatic SEO service builds from.
Bayshore HVAC’s 184-page build ran almost entirely on Bayshore’s own data — the service lines, the neighbourhoods they actually cover, the response times they actually offer, the jobs they’d actually done. No scraping required. The matrix was service × neighbourhood × intent, and every cell was filled from records the business already had. That’s why it ranked — 3 → 67 keywords in 60 days — instead of reading like a template someone forgot to finish.
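A matrix build like that is, mechanically, a filtered cross-product: every (service, neighbourhood, intent) combination is a candidate, but only cells backed by a real record become pages. A minimal sketch of the idea — all names and records here are hypothetical stand-ins, not Bayshore's actual data or schema:

```python
from itertools import product

# Hypothetical inputs -- illustrative stand-ins only.
services = ["ac-repair", "furnace-install", "duct-cleaning"]
neighbourhoods = ["riverside", "harbor-point"]
intents = ["emergency", "quote", "maintenance"]

# Records the business already has, keyed by (service, neighbourhood).
# A cell only becomes a page if a record exists to fill it.
records = {
    ("ac-repair", "riverside"): {"response_time": "same-day", "jobs_done": 41},
    ("furnace-install", "harbor-point"): {"response_time": "next-day", "jobs_done": 12},
}

pages = [
    {"url": f"/{s}/{n}/{i}", **records[(s, n)]}
    for s, n, i in product(services, neighbourhoods, intents)
    if (s, n) in records  # skip cells with no data behind them
]

# Full matrix would be 3 x 2 x 3 = 18 candidate pages;
# only the 2 filled (service, neighbourhood) pairs x 3 intents = 6 ship.
print(len(pages))  # 6
```

The filter is the whole point: the matrix defines what *could* exist, the records decide what *does*.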
Public datasets, APIs, and scraped-and-cleaned
When your own data runs out, there are three external sources, in roughly descending order of how clean they are to use:
Public datasets. Government data, census figures, open registries, official statistics — much of it is genuinely public-domain or openly licensed and built to be reused. Population, demographics, permit volumes, climate data: legitimate raw material for a page, if it’s relevant to what the reader came for. Check the licence; most are permissive, some require attribution.
APIs. A lot of useful data is available through official APIs — maps, business directories, weather, reviews where the terms allow it. APIs come with terms of service: read them. Some permit caching and display; some don’t. Staying inside the terms is non-negotiable, partly because it’s the right thing and partly because building a content set on data you’ll lose access to is building on sand.
Scraped-and-cleaned — with real caveats. Sometimes the data you want is only on a web page. Scraping it is a grey area, and the line matters: respect robots.txt; respect the site’s terms of service; don’t republish someone else’s database wholesale. Taking a few facts, verifying them, and presenting them in your own structure with attribution is one thing. Lifting a competitor’s entire directory and putting your logo on it is copyright infringement and you’ll deserve what happens. The honest test: are you adding value — combining, contextualising, making it useful in a new way — or are you republishing someone’s work to skip doing your own? If it’s the latter, don’t.
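The robots.txt part of that checklist is mechanically easy to honour — Python's standard library parses the file for you. A small sketch, using a made-up robots.txt body and user-agent rather than any real site's policy:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (no network call) to show the check itself.
# These rules are a made-up example, not any real site's policy.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# robots.txt is a floor, not the whole answer: the site's terms of service
# and copyright still apply even when can_fetch() says yes.
print(rp.can_fetch("my-data-bot", "https://example.com/public/page"))   # True
print(rp.can_fetch("my-data-bot", "https://example.com/private/page"))  # False
```

In a real crawler you'd point the parser at the live `robots.txt` with `set_url()` and `read()`, and check every URL before fetching it — but passing the check is only the first of the three tests above, not all of them.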
If the only reason a row exists is to fill a cell, the cell shouldn’t exist. Data that adds value is real, specific, and useful to the reader. Everything else is padding wearing a column header.
Data that adds value vs. data that’s filler
Not all data earns a page. The difference between value and filler is whether a reader would actually want it.
Value: the response time you offer in that suburb. The real price range for that service. The actual difference between product A and B. The recurring problem in that area’s building stock. A definition someone genuinely needs. Data a person came looking for, that helps them decide.
Filler: the population of a town padded onto a service page where it changes nothing for the buyer. A weather average no one searching for a plumber cares about. “[City] is a vibrant community located in [county]” — a sentence that exists to be a sentence. Data dropped in solely so the URL has body text. Search engines are good at spotting pages that are mostly filler, and the “scaled content abuse” policy is aimed squarely at sets built that way — pages mass-produced primarily to manipulate rankings rather than to help anyone. The guardrail every cell has to clear is in the thin-content line; the short version is that filler doesn’t save a thin row — it just makes the thinness longer.
Data hygiene — the step everyone skips
Raw data is messy, and messy data makes broken pages. Before a single page is generated, the dataset gets cleaned:
- Deduplicate. “Tampa” and “Tampa, FL” and “tampa” are one row, not three. Near-duplicate rows produce near-duplicate pages, which is the fastest way to look like a content farm.
- Normalise. One format for names, one for dates, one for prices, one for everything. Inconsistent data shows up as inconsistent pages.
- Fill gaps — or drop the row. If a row is missing the data that would make its page substantial, you do one of two things: find the missing data, or delete the row. You do not ship the page with a hole in it where the substance should be. A row you can’t fill honestly is a page that shouldn’t exist — and skipping it makes the whole set stronger. (“We eat our own cooking” — our own geo matrix only builds the {vertical} × {city} cells we can fill with a real local angle; the rest don’t get made.)
- Sanity-check the demand. Cross every row against whether anyone actually searches for it. Rows with no demand become pages that sit un-indexed; better to know before you build than after. More on that in why aren’t my programmatic pages ranking.
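The first three steps — deduplicate, normalise, drop rows with holes — can be a few lines of code run before anything is generated. A sketch under assumed rules (the normalisation rule and the row schema here are hypothetical; a real dataset needs its own):

```python
def normalise_city(raw: str) -> str:
    # Hypothetical rule: drop the state suffix, trim, lowercase --
    # so "Tampa", "tampa", and "Tampa, FL" all collapse to "tampa".
    return raw.split(",")[0].strip().lower()

rows = [
    {"city": "Tampa",     "service": "drain repair", "price_range": "$150-$400"},
    {"city": "Tampa, FL", "service": "drain repair", "price_range": "$150-$400"},
    {"city": "tampa",     "service": "drain repair", "price_range": "$150-$400"},
    {"city": "Orlando",   "service": "drain repair", "price_range": None},  # hole
]

clean, seen = [], set()
for row in rows:
    key = (normalise_city(row["city"]), row["service"])
    if key in seen:
        continue               # deduplicate: one row per (city, service)
    if not row["price_range"]:
        continue               # fill the gap or drop the row -- never ship the hole
    seen.add(key)
    clean.append({**row, "city": normalise_city(row["city"])})

print(len(clean))  # 1 -- three Tampas collapse to one; Orlando's incomplete row is dropped
```

Four input rows become one publishable row, which is the honest outcome: the other three were a duplicate-shaped content farm and a page with a hole where the price should be.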
If you’ve combed your own records, the public datasets, and the APIs and you still can’t fill the rows with anything a reader would care about — that’s the answer. It means there isn’t a real programmatic play here, at least not at the scale you were picturing. Don’t manufacture filler to hit a page count; the right move is fewer pages, each one full, or a hand-written approach where each page gets real judgment instead of a thin data row. The boundary’s in programmatic vs. writing by hand.
Common questions
On sourcing data, specifically.
Is scraping data for programmatic pages legal?
It depends what you scrape and what you do with it. Respect robots.txt, respect the site’s terms of service, and don’t republish someone else’s database wholesale — that last one is copyright infringement. Taking a handful of facts, verifying them, and presenting them in your own structure with attribution is generally fine; lifting a competitor’s directory and rebranding it is not. When in doubt, use your own data and public datasets — they’re cleaner on every axis. Related: is programmatic SEO black hat.
I don’t have much data. Can I still do programmatic SEO?
You have more than you think — service lines, locations, jobs done, FAQs, pricing, response times. Combed honestly, that’s usually enough for a tight, full set. What you can’t do is manufacture filler to inflate the page count; if the rows aren’t there, the move is fewer pages, each one substantial, or hand-written depth instead. How much you can build follows directly from how much you can fill: how many programmatic pages can I make.
What’s the difference between data that helps a page and data that’s just padding?
Whether a reader came for it. Your response time in that suburb, the real price range, the actual A-vs-B difference — value. The population of a town on a plumbing page, a weather average no one asked for, “[city] is a vibrant community” — padding. Search engines are good at telling the difference, and a page that’s mostly padding is exactly what the “scaled content abuse” policy targets. The bar every page has to clear: the thin-content line.

Q2 capacity · 4 builds · 2 slots remaining
Got the data? Then you’ve got the pages.
Send us your URL and a sketch of what you’d want to scale. We’ll send back a free 5-minute Loom — what data you’re already sitting on, where the gaps are, and whether there’s a real programmatic build in it. No call required.