1 · Where it comes from — Collecting Datasets

Here's the whole idea in one breath: data starts as a messy pile of raw observations, and the work is turning that pile into counts you can actually read. A tally and a frequency table do exactly that. And because you usually can't measure everyone, you take a sample — which only tells the truth if it's fair.

The Messy Pile

Imagine you stood at the school gate and wrote down the eye colour of the first forty kids through it: brown, blue, brown, brown, green, hazel, blue, brown… Forty scribbles. Is brown winning? You genuinely can't tell — your eye slides off a list of forty words. The raw observations are all there, but a pile isn't an answer. You have to organise it before it means anything.

The fix is ancient and brilliant: a tally. Each time a value shows up you draw one stroke next to it, and every fifth stroke goes across the previous four, so you can count in fives at a glance. Add up the strokes for each value and you've got a frequency — the number of times that value appeared. Lay every value beside its frequency in a little two-column table and that's a frequency table: the pile, finally turned into counts.

Say it plainly: raw observations are just a pile. A tally counts them as they come in; a frequency table is the tidy result — each value with the number of times it showed up. Same data, but now you can read it.

You Can't Measure Everyone

There's a second problem hiding behind the first. You wanted to know about the whole school's eye colour, but you only counted forty kids — you took a sample. That's normal; nobody measures the entire population every time. The catch is that a sample only tells the truth about the crowd if it's representative — a fair, unbiased slice of it.

Picture asking "what's everyone's favourite sport?" but only sampling kids walking out of basketball training. Your frequency table will scream basketball, and it'll be a lie about the school — because the sample was loaded before you started. A fair sample is picked so every kind of person in the crowd has an equal chance of turning up; a biased sample quietly leaves some of them out, then speaks confidently for people it never asked.

Flip the toy to its sampling mode. The same coloured crowd sits there the whole time — you're not changing it. A fair random handful lands close to the real split; a biased handful, grabbed only from one corner, gives a confidently wrong answer. Same population, two samples, two different "truths." That gap is the whole reason careful collecting matters.

Where a Dataset Comes From

The Messy Pile

You Can't Measure Everyone

Us, Thinking Out Loud