Feature Selection
Engineering creates features. Selection narrows them down to the ones that actually earn their place in the model. More features are not always better: including noise alongside signal inflates dimensionality, encourages overfitting, and hurts interpretability. Three families of approaches cover most cases.
Correlation-based selection. Compute pairwise correlations between features; drop one feature from any pair whose absolute correlation exceeds a threshold (e.g., 0.8). Use Pearson for linear relationships, Spearman for monotonic ones, and Kendall's tau for ordinal data. Simple, fast, and interpretable, but it misses nonlinear relationships and may remove features that are correlated yet still complementary.
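The pairwise filter described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `drop_correlated`, the synthetic columns, and the greedy "keep whichever column comes first" rule are all assumptions made for the example.

```python
import numpy as np

def drop_correlated(X, names, threshold=0.8):
    """Keep a feature only if its |Pearson r| with every already-kept
    feature is at or below the threshold (greedy, left to right)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-by-feature matrix
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]

# Hypothetical data: "a_noisy" is a near-copy of "a"; "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, a + 0.1 * rng.normal(size=200), b])

kept = drop_correlated(X, ["a", "a_noisy", "b"])
print(kept)  # the near-duplicate column is dropped
```

Which member of a correlated pair survives depends on column order here. In practice you might instead keep the one more strongly related to the target.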
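Recursive Feature Elimination (RFE). Wrap a model. Rank features by importance (coefficient magnitudes for linear models, impurity-based importances such as Gini or entropy for trees). Remove the least important. Rebuild. Repeat. Because features are scored in the context of the others, RFE accounts for how they work together, and it captures interactions and nonlinearities when the wrapped model does. The costs: it is computationally expensive, its rankings are specific to the wrapped model, and it risks overfitting if not combined with cross-validation.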
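The rank-remove-rebuild loop can be written out by hand to make the mechanics visible. The sketch below assumes an ordinary least-squares model and standardized features so that coefficient magnitudes are comparable; the name `rfe_linear` and the synthetic data are hypothetical. scikit-learn provides `RFE` and a cross-validated `RFECV` in `sklearn.feature_selection` for real use.

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Recursive feature elimination around a least-squares fit.
    Returns the indices of the surviving columns."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)  # put coefficients on one scale
    yc = y - y.mean()
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Refit on the current feature set.
        coef, *_ = np.linalg.lstsq(Xs[:, remaining], yc, rcond=None)
        # Drop the feature with the smallest absolute coefficient.
        weakest = int(np.argmin(np.abs(coef)))
        remaining.pop(weakest)
    return remaining

# Hypothetical data: only columns 1 and 4 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=300)

selected = rfe_linear(X, y, n_keep=2)
print(selected)
```

This version removes one feature per round and chooses `n_keep` in advance. Choosing the number of features by cross-validation, as the text recommends, would wrap this loop in a train/validation split.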
Univariate selection. Score each feature individually by its relationship with the target: Pearson correlation, F-test, or mutual information for regression; chi-square, ANOVA F-test, or mutual information for classification. Fast, simple, and model-agnostic. But it ignores feature interactions: a feature that is individually weak may be jointly strong with another. Use it as a first filter, not a final answer.
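A small synthetic case shows both the scoring and the blind spot. The sketch below scores each column by its absolute Pearson correlation with a regression target. The function name `univariate_scores` and the data-generating formula are assumptions chosen so that two columns matter only through their product. scikit-learn's `SelectKBest` with `f_regression` or `mutual_info_regression` does this kind of scoring in practice.

```python
import numpy as np

def univariate_scores(X, y):
    """Absolute Pearson correlation of each column of X with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 4))
# Column 0 has a direct linear effect. Columns 1 and 2 affect y only
# through their product. Column 3 is pure noise.
y = X[:, 0] + 2.0 * X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=n)

scores = univariate_scores(X, y)
top = int(np.argmax(scores))
print(np.round(scores, 3), top)
```

Columns 1 and 2 together explain most of the variance in `y`, yet each scores close to the noise column on its own, so a top-k univariate filter would likely discard them.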
Genomics: When Selection Is Everything
In genomics, you might have tens of thousands of features (gene expression levels) and only a few hundred samples. Feature selection is necessary in this case: with far more features than samples, a model can fit the training data almost perfectly and still generalize poorly. The right selection method depends on whether you care most about interpretability (correlation), predictive performance (RFE), or speed (univariate). In high-dimensional sensor data, such as wearables and industrial monitoring, the same calculus applies.
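The overfitting risk with far more features than samples is easy to reproduce. The sketch below is an illustration on purely synthetic data, not a genomics workflow: the sizes are arbitrary and the target is random noise with no relationship to the features, yet an unregularized least-squares fit still matches the training set.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 2000  # many more features than samples

X_train = rng.normal(size=(n_samples, n_features))
y_train = rng.normal(size=n_samples)  # pure noise, unrelated to X
X_test = rng.normal(size=(n_samples, n_features))
y_test = rng.normal(size=n_samples)

# With more columns than rows, lstsq returns the minimum-norm solution
# that reproduces the training targets.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = float(np.mean((X_train @ coef - y_train) ** 2))
test_mse = float(np.mean((X_test @ coef - y_test) ** 2))
print(f"train MSE: {train_mse:.2e}, test MSE: {test_mse:.2f}")
```

The training error is essentially zero while the test error is no better than predicting the mean, which is the gap that selection is meant to close.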
You have 500 features and want to select the best 20 for a random forest. You run univariate chi-square selection and keep the top 20. A colleague says you might be missing the best features. Why might they be right?