Probability Sampling

In probability sampling, every member of the population has a known, non-zero chance of being selected. That known selection probability is what lets statistical claims made about the sample transfer to the population.

Simple Random Sampling

Every member of the population has an equal chance of being selected. It is the conceptually cleanest method and the gold standard when feasible. Use it as your default when the population is well-defined and accessible.
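A minimal sketch of a simple random draw, using only Python's standard library. The 80-person population with a 60/40 group split and the helper name simple_random_sample are assumptions made for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every member is equally likely to be picked."""
    rng = random.Random(seed)  # local RNG so the draw can be reproduced
    return rng.sample(population, n)

# Hypothetical population: 80 people, 60% in group A and 40% in group B.
population = [("A", i) for i in range(48)] + [("B", i) for i in range(32)]

sample = simple_random_sample(population, 20, seed=0)
share_a = sum(1 for group, _ in sample if group == "A") / len(sample)
print(f"Group A share in sample: {share_a:.0%} (population: 60%)")
```

Changing the seed changes the Group A share from draw to draw, which is the chance variation that stratified sampling is meant to remove.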

Stratified Sampling

Divide the population into subgroups (strata) and randomly sample from each. Use this when you have known subgroups whose distributions matter — sampling equal numbers from each demographic group, or sampling from each class in a classification problem so your test set isn't all majority class.

In ML: sklearn.model_selection.train_test_split(stratify=y) and StratifiedKFold both implement this. Pass the label column and your class distribution will be preserved across splits, as closely as the split sizes allow.
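To show the idea behind a stratified split, here is a plain-Python sketch with proportional allocation. It is not how scikit-learn implements the stratify option; the function name stratified_sample and the 800/200 labeled dataset are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, fraction, seed=None):
    """Sample the same fraction from every stratum defined by key(item)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)  # proportional allocation
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical dataset of (row_id, label): 800 rows of class 0, 200 of class 1.
data = [(i, 0) for i in range(800)] + [(i, 1) for i in range(800, 1000)]

test_set = stratified_sample(data, key=lambda row: row[1], fraction=0.25, seed=0)
minority_share = sum(label for _, label in test_set) / len(test_set)
print(len(test_set), minority_share)
```

Because each class is sampled separately at the same rate, the minority share in the sample matches the population share regardless of the seed.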

Cluster Sampling

Divide the population into clusters and randomly select entire clusters. This is sometimes the only feasible option: think of geographic sampling, where you select cities and then survey everyone in those cities. It is less statistically efficient than simple random sampling for a given total sample size, but often much cheaper to execute.
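The city example can be sketched as follows. The cluster names, the sizes (10 cities of 50 residents), and the helper cluster_sample are hypothetical, chosen only to make the mechanics visible.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and return every member of the chosen ones."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # randomness is over clusters only
    members = [m for name in chosen for m in clusters[name]]
    return chosen, members

# Hypothetical clusters: 10 cities with 50 residents each.
cities = {
    f"city_{c}": [f"city_{c}/resident_{r}" for r in range(50)]
    for c in range(10)
}

chosen, respondents = cluster_sample(cities, 3, seed=0)
print(chosen, len(respondents))
```

Note that the random draw happens at the city level: nobody outside the three chosen cities can appear in the sample, which is why residents who resemble their neighbours add less information than the raw sample size suggests.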

Systematic Sampling

Order the population and select members at regular intervals (every k-th element). Easy to execute. It assumes the ordering isn't itself correlated with what you're studying — if it is, systematic sampling will mislead you. (Classic failure: sampling every 7th day in a weekly-seasonal dataset would always land on the same day of the week.)
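The weekly failure mode is easy to reproduce. The sketch below assumes a hypothetical daily dataset of 70 consecutive dates and a helper named systematic_sample; neither comes from a particular library.

```python
import random
from datetime import date, timedelta

def systematic_sample(ordered, k, seed=None):
    """Take every k-th element after a random start in [0, k)."""
    start = random.Random(seed).randrange(k)
    return ordered[start::k]

# Hypothetical daily dataset: 70 consecutive days.
days = [date(2024, 1, 1) + timedelta(days=i) for i in range(70)]

every_7th = systematic_sample(days, 7, seed=0)
every_5th = systematic_sample(days, 5, seed=0)

# An interval of 7 lines up with the weekly cycle; an interval of 5 does not.
weekdays_7 = {d.strftime("%A") for d in every_7th}
weekdays_5 = {d.strftime("%A") for d in every_5th}
print(len(weekdays_7), len(weekdays_5))
```

With an interval of 7 every sampled date falls on a single weekday, while an interval of 5 cycles through all seven, so any weekday effect in the data is hidden in the first sample and averaged over in the second.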

Sampling Methods Explorer

[Interactive figure: an 80-person population split into Group A (60%) and Group B (40%). The explorer shows which people each sampling method selects and how well each method preserves the true group proportions. Example draw with simple random sampling, where every person has an equal chance: of 20 people selected, Group A made up 70% of the sample (+10 pp) and Group B 30% (-10 pp). Clean, but may under-represent small subgroups by chance.]

Checkpoint

You're building a churn prediction model and need a test set. Your dataset is 90% non-churned users and 10% churned. You want to ensure your test set reflects this distribution rather than getting, by chance, a test set with only 2% churners. Which sampling method should you use?