3 · Where it gets tricky — Collecting Datasets

A frequency table looks so official that it's easy to trust it without asking where it came from. But a dataset is only as honest as the collecting behind it — and there are five classic ways that collecting goes wrong. Learn to smell these and you'll never be fooled by a confident-looking table again.

The Five Traps

1 — A biased sample. You only asked your mates, or only the kids at basketball, or only people awake at 3am posting online. The crowd you sampled isn't the crowd you're claiming to describe. The table is real; the claim built on it is a lie.

2 — A leading question. "Don't you agree our amazing canteen should stay open later?" The question shoves people toward an answer before they've thought. The data that pours out is the question's opinion, not theirs.

3 — Missing data. Some kids didn't answer, some results got lost, some readings were skipped. If you quietly drop those and tally only what's left, your frequency table is missing a chunk of the truth, and often the missing chunk isn't random: the kids who skip a homework survey may be exactly the ones who skip their homework.

4 — Double-counting. The same person fills the survey twice; the same observation gets tallied in two rows. One real thing becomes two in the table, and every count downstream is wrong.

5 — Mismatched units. Someone recorded heights in centimetres, someone else in metres, and you tallied them together. 1.6 and 160 land in different rows as if they were different heights. The same measurement, counted as two — your table is quietly nonsense.

Say it plainly: five poisons — a loaded sample, a loaded question, missing results, double-counted results, and mismatched units. Each one happens during collecting, long before the chart. A neat frequency table can't fix a poisoned dataset; it just makes the poison look respectable.
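Traps 4 and 5 are the easiest to watch happen in a few lines of code. Here is a minimal sketch, using made-up heights, of how one height splits into two rows and how a repeat respondent inflates a count. The names `raw` and `to_cm`, the respondent names, and the rule "anything under 3 must be metres" are all assumptions invented for this illustration, not a general cleaning method.

```python
from collections import Counter

# Made-up survey rows: (respondent, recorded_height). Not real data.
raw = [
    ("amy", 160),   # recorded in centimetres
    ("ben", 1.6),   # the same height, recorded in metres
    ("cal", 172),
    ("cal", 172),   # cal filled in the survey twice
]

# Naive tally: every row counted exactly as it was written down.
naive = Counter(height for _, height in raw)

def to_cm(value):
    """Assumed rule for this sketch only: values under 3 are metres."""
    return round(value * 100) if value < 3 else round(value)

# Cleaned tally: one answer per respondent, every height in centimetres.
one_per_person = {}
for person, height in raw:
    one_per_person.setdefault(person, to_cm(height))  # keep the first answer
cleaned = Counter(one_per_person.values())

print("naive:  ", dict(naive))
print("cleaned:", dict(cleaned))
```

The naive table has three rows (160, 1.6 and 172) with 172 counted twice. The cleaned table has two rows: two people at 160 cm and one at 172 cm. Nothing about the tallying changed; only the collecting mistakes were undone.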

Why These Are So Sneaky

None of these break the maths. The tallying is flawless; the percentages add to a hundred; the table is tidy and convincing. That's exactly what makes them dangerous — the rot is upstream, in how the data was gathered, where nobody downstream can see it. So the real question to ask of any dataset is never "is the table neat?" but "who got asked, what were they asked, and what got left out?" Get curious about the collecting, and the traps light up.
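To see "flawless maths, wrong answer" in action, here is a small sketch with an invented class of ten kids. The data, the `percentages` helper and the variable names are all hypothetical; the point is only that both tables add to a hundred, yet they tell different stories.

```python
from collections import Counter

# Invented class of 10 kids: (name, plays_basketball, favourite_sport).
kids = [
    ("a", True, "basketball"), ("b", True, "basketball"),
    ("c", True, "basketball"), ("d", True, "football"),
    ("e", False, "football"), ("f", False, "football"),
    ("g", False, "swimming"), ("h", False, "swimming"),
    ("i", False, "football"), ("j", False, "tennis"),
]

def percentages(rows):
    """Frequency table, as percentages of whichever rows were handed in."""
    counts = Counter(sport for _, _, sport in rows)
    return {sport: 100 * n / len(rows) for sport, n in counts.items()}

whole_class = percentages(kids)
# Biased sample: we only asked the kids who were at basketball.
biased = percentages([row for row in kids if row[1]])

print("whole class:", whole_class, "total:", sum(whole_class.values()))
print("basketball only:", biased, "total:", sum(biased.values()))
```

Asked at basketball, 75% pick basketball as their favourite. Across the whole class it is 30%. Both tables are tidy and both total 100, which is exactly why the neatness of a table tells you nothing about who got asked.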

A Worked One, Slowly

Question: a post claims "94% of teenagers love energy drinks!" — what should you check before believing it?

Start at the collecting, not the number. Who got asked? If the poll lived on an energy-drink brand's own page, the sample is biased — only fans were ever there. What were they asked? "Do you love our refreshing energy drink?" is a leading question. What got left out? Maybe thousands clicked "no" and the brand quietly dropped them — missing data. The 94% might be flawless arithmetic on a thoroughly poisoned dataset. Once you interrogate the collecting, a scary-confident statistic turns into "asked the wrong people the wrong question and hid the rest." Checking the source before the sum is the whole point of this concept.
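The missing-data part of that story is plain arithmetic, so it can be sketched quickly. Every number below is made up for illustration (the post gives only the 94%), and `percent_yes` is a hypothetical helper.

```python
def percent_yes(yes, no):
    """Share of 'yes' answers among all the answers counted."""
    return 100 * yes / (yes + no)

# Hypothetical counts the brand chose to keep.
kept_yes, kept_no = 470, 30
# Hypothetical "no" clicks that were quietly dropped.
dropped_no = 1500

reported = percent_yes(kept_yes, kept_no)
with_dropped = percent_yes(kept_yes, kept_no + dropped_no)

print(f"reported: {reported}%")          # 94.0%
print(f"with dropped answers: {with_dropped}%")  # 23.5%
```

Under these invented numbers, putting the dropped answers back pulls 94% down to 23.5%. Even that figure would only describe people who visited the brand's page and answered a leading question, so the other two traps are still in play.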

Where a Dataset Gets Poisoned

Us, Thinking Out Loud