Probability Sampling
In probability sampling, every member of the population has a known, non-zero chance of being selected. This is what we want: known selection probabilities are what let our statistical claims transfer from the sample to the population.
Simple Random Sampling
Every member of the population has an equal chance of being selected, and every possible sample of a given size is equally likely. This is the conceptually cleanest method and the gold standard when feasible. Use it as your default when the population is well-defined and accessible.
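As a minimal sketch of the idea, the snippet below draws a simple random sample without replacement using Python's standard library. The 80-member population of integer IDs, the sample size of 20, and the helper name simple_random_sample are illustrative assumptions, not part of any particular dataset or library.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every subset of size n is equally likely."""
    rng = random.Random(seed)  # local generator so the draw is reproducible
    return rng.sample(population, n)


# Hypothetical population: 80 member IDs.
population = list(range(80))
sample = simple_random_sample(population, 20, seed=0)
print(sorted(sample))
```

Fixing the seed makes the draw repeatable, which helps when a sample has to be audited or regenerated later.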
Stratified Sampling
Divide the population into subgroups (strata) and randomly sample from each. Use this when you have known subgroups whose distributions matter — sampling equal numbers from each demographic group, or sampling from each class in a classification problem so your test set isn't all majority class.
In ML: sklearn.model_selection.train_test_split(..., stratify=y) and StratifiedKFold both implement this. Pass the label column (as stratify=y to train_test_split, or as y to StratifiedKFold's split(X, y)) and the class distribution will be approximately preserved across splits.
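To make the mechanics concrete, here is a rough standard-library sketch of proportional stratified sampling. It is not how scikit-learn implements its helpers; it only illustrates the idea of sampling the same fraction from every stratum. The function name stratified_sample and the 60/40 group labels are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(labels, fraction, seed=None):
    """Return indices that sample roughly `fraction` of each stratum."""
    rng = random.Random(seed)

    # Group item indices by their stratum label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    # Draw the same fraction from every stratum (proportional allocation).
    chosen = []
    for indices in by_label.values():
        k = round(len(indices) * fraction)
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Hypothetical population: 48 members of group A, 32 of group B (60/40).
labels = ["A"] * 48 + ["B"] * 32
picked = stratified_sample(labels, 0.25, seed=0)
counts = Counter(labels[i] for i in picked)
print(counts)
```

Because each stratum is sampled at the same rate, the 60/40 split in the population carries over to the sample, up to rounding within each stratum.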
Cluster Sampling
Divide the population into clusters and randomly select entire clusters. Sometimes this is the only feasible option: think geographic sampling, where you select cities, then survey everyone in those cities. It is usually less statistically efficient than simple random sampling for a given total sample size, because members of the same cluster tend to resemble one another, but it is often much cheaper to execute.
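The city example can be sketched in a few lines. The cluster names, member IDs, and the helper cluster_sample below are hypothetical; the point is only that randomness is applied at the cluster level, and every member of a chosen cluster is included.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters whole clusters at random and return all their members."""
    rng = random.Random(seed)
    # Sort the cluster names so the draw is reproducible for a given seed.
    picked = rng.sample(sorted(clusters), n_clusters)
    members = [m for name in picked for m in clusters[name]]
    return picked, members


# Hypothetical clusters: city name -> residents.
clusters = {
    "city_a": ["a1", "a2", "a3"],
    "city_b": ["b1", "b2"],
    "city_c": ["c1", "c2", "c3", "c4"],
    "city_d": ["d1", "d2"],
}
picked, members = cluster_sample(clusters, 2, seed=0)
print(picked, members)
```

Note that the total sample size depends on which clusters happen to be drawn, which is one reason estimates from cluster samples tend to vary more than those from a simple random sample of similar size.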
Systematic Sampling
Order the population and select members at regular intervals (every k-th element). Easy to execute. It assumes the ordering isn't itself correlated with what you're studying — if it is, systematic sampling will mislead you. (Classic failure: sampling every 7th day in a weekly-seasonal dataset would always land on the same day of the week.)
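The weekly failure mode is easy to reproduce. The sketch below assumes a hypothetical 70-day daily series starting on an arbitrary date and a helper named systematic_sample; it takes every k-th element after a random start and then checks which weekdays were hit.

```python
import random
from datetime import date, timedelta


def systematic_sample(ordered, k, seed=None):
    """Take every k-th element after a random start in [0, k)."""
    start = random.Random(seed).randrange(k)
    return ordered[start::k]


# Hypothetical daily series: 70 consecutive days.
days = [date(2024, 1, 1) + timedelta(days=i) for i in range(70)]

every_7th = systematic_sample(days, 7, seed=0)
weekdays_7 = {d.strftime("%A") for d in every_7th}

every_10th = systematic_sample(days, 10, seed=0)
weekdays_10 = {d.strftime("%A") for d in every_10th}

print(weekdays_7)   # a single weekday, whichever the random start landed on
print(len(weekdays_10))
```

With an interval of 7 the sample only ever sees one day of the week, so any weekly pattern is invisible to it. An interval that does not share a factor with the period, such as 10 here, cycles through all the weekdays.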
Interactive demo: an 80-person population split into Group A (60%) and Group B (40%), used to compare which people each sampling method selects and how well it preserves the true group proportions. Under simple random sampling, 20 people are drawn at random, each with an equal chance: clean, but small subgroups may be under-represented by chance.
You're building a churn prediction model and need a test set. Your dataset is 90% non-churned users and 10% churned. You want to ensure your test set reflects this distribution rather than getting, by chance, a test set with only 2% churners. Which sampling method should you use?