All writing

May 20, 2026

How I Found 4 Million Rows in a Billion-Row Estimate

Written with Claude

The request came in from a property management company with roughly 2,000 properties across the US. They needed a custom compliance report — a specific view of their data that our platform did not produce out of the box. Standard procedure.

When the ticket reached engineering for scoping, it came back with a single verdict: not feasible. The estimated data volume was 1 billion rows. The query would time out. The report would take weeks to generate. The client should be told this was not possible.

That estimate was wrong. Not slightly off — wrong by a factor of 250.


The Problem With Accepting the First Number

When engineering estimates a query as "1 billion rows," they usually have a reason. In this case, the reason was a reasonable but incorrect assumption about how the data was structured.

The Onesite platform stores property data in nested relational tables. Lease data links to unit data, which links to property data, which links to the client's portfolio. If you write a query that joins these tables naively — without understanding which relationships are one-to-many versus one-to-one — you get a combinatorial explosion. Every lease joins to every unit, every unit to every property, and suddenly a dataset that is actually modest becomes, on paper, enormous.

The billion-row estimate came from this kind of accidental cross-join. The analyst who scoped it had looked at the top-level table counts and done the math. The math was correct for that model. The model was wrong.


Tracing the Actual Data Model

I pulled the schema documentation and started mapping the actual relationships.

The key question was: what is the real unit of data this client needs in their report? Once I identified that — it was individual lease compliance records, not some aggregate — I could trace the actual path through the data model.

The path looked like this:

client_portfolio → properties (2,000)
  → units per property (avg ~12)
  → active leases per unit (avg ~1.7)
  → compliance fields per lease (6)

That gives you: 2,000 × 12 × 1.7 × 6 = roughly 244,000 rows of actual compliance data. Add in historical records and some data variation across the portfolio, and you get to somewhere around 4 million rows.

Not 1 billion. 4 million.

The difference between those two numbers is not a rounding error. It is the difference between "impossible" and "done by Thursday."


The Build

With the data model confirmed, the extraction was not complex. I built a Python script using the platform's REST API, paginating through the endpoint that exposed the correct data at the correct grain. The script took about a day to write, test, and tune for rate limits.

The output was a flat CSV the client could import directly into their compliance system.


Delivery

Three working days after the original rejection, the client had their report.

A portfolio of 2,000 properties. Full compliance data. Correctly scoped, correctly extracted, correctly formatted.

The engineering team had not been wrong to be cautious — large queries do cause real problems on shared infrastructure. But the caution was applied to an incorrect model of the data, and that model was never questioned.


The Lesson

A data estimate is only as good as the data model it is based on.

When someone tells you a dataset is too big, the first question should not be "how do we make the query faster?" It should be: "is the dataset actually that big?" Those are different problems with completely different solutions.

Nested relational data is particularly prone to this kind of over-estimation. The actual size of what you need is almost always smaller than a naive table count suggests. Before accepting any estimate that closes a door, trace the model.

The extra hour spent verifying the structure saved weeks of back-and-forth with the client and preserved a relationship that might otherwise have ended with a refusal.