You Will Always Be Sampling
Virtually every statistical claim a data scientist makes is a claim about a population, made from a sample. You will essentially never get the entire population. Even when your dataset feels exhaustive — every transaction, every user, every log line — it's still a sample from the conceptual population of "all transactions, users, and log lines, including the ones that will happen tomorrow."
So sampling isn't just about surveys and academic studies. It's about train/test splits, cross-validation, mini-batch training, bootstrap confidence intervals, active learning, and anomaly detection. The way you sample shapes everything that comes after.
✦
Sampling Is Everywhere in ML
- Dataset creation: Collecting representative training data and balancing classes are both sampling decisions, and they echo through your entire model.
- Train/test split: Randomly partitioning data is sampling. So is k-fold cross-validation.
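- Bootstrapping: Sampling with replacement to estimate the variability of a statistic, or to train ensembles such as random forests, where each tree is fit on its own bootstrap sample.
- Mini-batch sampling: In stochastic gradient descent, each iteration computes the gradient on a sampled subset of the data. Batch size and whether you reshuffle between epochs both affect how noisy those gradient estimates are.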
- Active learning: Selectively sampling the most informative instances for labeling — useful when labels are expensive.
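
To make the list above concrete, here is a minimal sketch of four of these sampling steps (train/test split, k-fold cross-validation, a percentile bootstrap interval, and shuffled mini-batches) using only Python's standard library. The function names and defaults are illustrative assumptions, not any library's API; in practice you would likely reach for an established library instead. Active learning is left out because it depends on a model's uncertainty scores.

```python
import random
import statistics


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and cut it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every index is held out exactly once."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


def bootstrap_ci(values, stat=statistics.mean, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, recompute the statistic."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    low = estimates[int((alpha / 2) * n_resamples)]
    high = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


def mini_batches(items, batch_size, seed=0):
    """Yield batches from one shuffled pass over the data (a single epoch)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


# Toy data standing in for a real dataset.
data = list(range(100))
train, test = train_test_split(data)
low, high = bootstrap_ci(data)
batches = list(mini_batches(train, batch_size=16))
```

Every function takes a seed so that each sample is reproducible. Change the seed and the split, the folds, the interval, and the batches all change with it, which is the sense in which the way you sample shapes what comes after.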