Feature Engineering for Tabular Data

Tabular feature engineering is the most general category and the one you'll do most often. The techniques here apply across domains — from customer churn to credit risk to healthcare outcomes.

Core techniques:

  • Descriptive statistics as features. Compute mean, std, min, max on numerical columns; frequency counts on categorical columns. These summary statistics can become inputs to downstream models, and they are especially useful when aggregated over groups (for example, mean spend per customer).
  • Encoding categorical variables. One-hot for unordered categories; ordinal for ordered ones; target encoding for high-cardinality categoricals (computed on training data only, to avoid leaking the target); frequency encoding when the count of occurrences is itself predictive.
  • Scaling and standardization. Needed for distance-based models (KNN, k-means clustering), where a raw comparison of age and household income lets the larger-scaled feature dominate the distance. It also speeds convergence for gradient-trained models such as logistic regression and neural networks.
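
To make the techniques above concrete, here is a minimal sketch using only the Python standard library. The toy records, the column names (customer, city, spend), and the helper names are all hypothetical and chosen for illustration; in practice a dataframe library would usually do this work.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical toy records; column names are illustrative only.
rows = [
    {"customer": "a", "city": "Lyon", "spend": 10.0},
    {"customer": "a", "city": "Lyon", "spend": 30.0},
    {"customer": "b", "city": "Oslo", "spend": 20.0},
    {"customer": "c", "city": "Lyon", "spend": 40.0},
]

def group_mean(rows, key, value):
    """Mean of `value` per distinct `key` (a group-level summary feature)."""
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r[value])
    return {k: mean(v) for k, v in groups.items()}

def frequency_encode(rows, column):
    """Map each category to the share of rows in which it appears."""
    counts = Counter(r[column] for r in rows)
    return {cat: n / len(rows) for cat, n in counts.items()}

def standardize(values):
    """Z-score: subtract the mean, divide by the population std."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

mean_spend = group_mean(rows, "customer", "spend")   # per-customer average
city_freq = frequency_encode(rows, "city")           # category -> share of rows
spend_z = standardize([r["spend"] for r in rows])    # mean 0, std 1
```

When these features feed a model, the statistics (means, standard deviations, category frequencies) should be computed on the training split and then reused on validation and test data, so that no information leaks from held-out rows.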

Interaction Features

Combine features through multiplication, addition, division, or any function that captures their joint behavior. Examples:

  • BMI = weight / height² — captures something neither weight nor height captures alone.
  • Debt-to-income ratio — captures affordability in a way that raw income and raw debt don't.
  • Petal area = petal_length × petal_width, which, as we saw in the Iris EDA, separated the species more cleanly than either dimension alone.
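
The three examples above can be sketched as small functions. The function names, argument names, and units (kilograms, metres) are assumptions made for illustration. The ratio helper returns a default when the denominator is zero or missing, which is one reasonable choice among several.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight divided by height squared."""
    return weight_kg / height_m ** 2

def safe_ratio(numerator, denominator, default=None):
    """Ratio feature that avoids dividing by a zero or missing denominator."""
    if not denominator:
        return default
    return numerator / denominator

def petal_area(petal_length, petal_width):
    """Product interaction: a rough area proxy from two lengths."""
    return petal_length * petal_width

person_bmi = bmi(70.0, 1.75)
debt_to_income = safe_ratio(12_000.0, 48_000.0)   # 0.25
no_income = safe_ratio(12_000.0, 0)               # None, not a ZeroDivisionError
area = petal_area(1.4, 0.2)
```

Ratios need an explicit rule for zero denominators; returning a sentinel and handling it downstream (imputation, or a separate indicator feature) keeps that decision visible rather than buried in an exception.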

You can also create polynomial features by raising existing features to higher powers. Adding these as extra inputs lets a linear model fit nonlinear relationships.
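
As a rough sketch of what a polynomial expansion produces, the hypothetical helper below generates every monomial of total degree 1 up to a chosen degree for one row of features. Scikit-learn's PolynomialFeatures transformer provides a comparable expansion for real workloads; this stdlib version is only meant to show the idea.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_features(x, degree=2):
    """Expand one row x into all monomials of total degree 1..degree.

    No constant (bias) term is included.
    """
    features = []
    for d in range(1, degree + 1):
        # Each index tuple picks d features (with repetition) to multiply.
        for idx in combinations_with_replacement(range(len(x)), d):
            features.append(prod(x[i] for i in idx))
    return features

# Column order for two inputs at degree 2: x0, x1, x0^2, x0*x1, x1^2
expanded = polynomial_features([2.0, 3.0], degree=2)
# expanded == [2.0, 3.0, 4.0, 6.0, 9.0]
```

Note that the expansion includes the cross term x0*x1, so it yields interaction features as well as pure powers.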

Interactions and Polynomials: Use Deliberately

Interaction and polynomial features increase dimensionality, raise overfitting risk, and reduce interpretability. Use them when EDA shows a clear relationship between a pair of features, or when model performance plateaus and you need more signal, and always pair them with regularization. Don't blanket-apply them: that's a recipe for an overfit, opaque model.
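
To see how quickly dimensionality grows, the sketch below counts the columns a full polynomial expansion would create. It assumes the expansion keeps every monomial of total degree 1 up to the chosen degree and drops the constant term; the function name is hypothetical.

```python
from math import comb

def n_polynomial_terms(n_features, degree):
    """Count monomials of total degree 1..degree over n_features inputs."""
    # comb(n + d, d) counts all monomials up to degree d, including the constant.
    return comb(n_features + degree, degree) - 1

small = n_polynomial_terms(10, 2)   # 65
wide = n_polynomial_terms(50, 2)    # 1325
cubic = n_polynomial_terms(50, 3)   # 23425
```

Fifty raw columns become more than a thousand at degree 2 and more than twenty thousand at degree 3, which is why a blanket expansion without regularization tends to overfit.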

Credit Risk Modeling

In credit risk modeling, classic interaction features have been in use for decades: debt-to-income ratio, payment-to-income ratio, account-age × utilization. The features are simple, but each encodes hard-won credit-industry domain expertise. Newer ML models often beat older rule-based ones not because the models are smarter, but because they use more engineered features that encode more domain knowledge.

Checkpoint

You have features 'loan_amount' and 'annual_income'. A domain expert says the ratio between them is more predictive than either value alone. What should you do?