Tabular Data

Tabular data comes in two types. Quantitative data is already numeric: ages, prices, temperatures. Categorical data is not: colors, countries, diagnoses. How you encode categorical data is an important decision, because the wrong choice can introduce false structure that biases every model downstream.

Four Encodings You Should Know

  • Nominal encoding — assigns each category a unique integer (Red=0, Blue=1, Green=2). Simple, but implies an ordering between categories that doesn't exist. Use cautiously.
  • One-hot encoding — creates a separate binary column for each category. Each row gets a 1 in exactly one column and 0s elsewhere. Right for unordered categories; expensive if there are hundreds of categories.
  • Ordinal encoding — for categories that do have a natural order (Small=0, Medium=1, Large=2). The numbers genuinely carry meaning here, so models can use the magnitude.
  • Label encoding — assigns integers in alphabetical order of the category labels. Useful for tracking, not for conveying magnitude.
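
The four encodings above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library API: the function names (`nominal_encode`, `one_hot_encode`, `ordinal_encode`, `label_encode`) and the sample values are assumptions made for this example.

```python
def nominal_encode(values):
    """Assign each category a unique integer in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values]


def one_hot_encode(values, categories):
    """One binary column per category; exactly one 1 per row."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def ordinal_encode(values, order):
    """Map each category to its rank in an explicitly supplied order."""
    rank = {c: i for i, c in enumerate(order)}
    return [rank[v] for v in values]


def label_encode(values):
    """Assign integers in alphabetical order of the category labels."""
    codes = {c: i for i, c in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


colors = ["Red", "Blue", "Green", "Blue"]   # hypothetical sample column
sizes = ["Medium", "Small", "Large"]        # hypothetical sample column

print(nominal_encode(colors))                               # [0, 1, 2, 1]
print(one_hot_encode(colors, ["Red", "Blue", "Green"]))
print(ordinal_encode(sizes, ["Small", "Medium", "Large"]))  # [1, 0, 2]
print(label_encode(colors))                                 # [2, 0, 1, 0]
```

Only `ordinal_encode` takes the order as an explicit argument, which is the point: the integers it returns mean something because a person supplied the ranking.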

Encoding Comparison

Categorical variable: values XS, S, M, L, XL (XS < S < M < L < XL; a real order exists)

Encoding method: one-hot. Problematic for this variable: one-hot discards the meaningful order between categories.

value  is_xs  is_s  is_m  is_l  is_xl
XS     1      0     0     0     0
S      0      1     0     0     0
M      0      0     1     0     0
L      0      0     0     1     0
XL     0      0     0     0     1

One-hot creates a separate binary column for each category. No column is numerically "higher" than another — but you're throwing away the real ordering between values.
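
To make the comparison concrete, the sketch below builds one-hot indicator rows for the five size values in plain Python and contrasts them with ordinal codes for the same values. The `one_hot_row` helper and the variable names are assumptions for this illustration.

```python
SIZES = ["XS", "S", "M", "L", "XL"]  # listed in their real order


def one_hot_row(value, categories=SIZES):
    """Return the is_xs ... is_xl indicator row for one value."""
    return [1 if value == c else 0 for c in categories]


# One-hot: every row has a single 1, and no row is "larger" than another.
for size in SIZES:
    print(size.ljust(3), one_hot_row(size))

# Ordinal: the integer codes keep XS < S < M < L < XL.
ordinal_code = {size: rank for rank, size in enumerate(SIZES)}
print(ordinal_code)
```

Every one-hot row sums to 1, so nothing in the rows themselves distinguishes XS from XL by magnitude; the ordinal codes keep that information.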

The Core Rule

If the categories have no inherent order (colors, countries, diagnoses) → use one-hot encoding. If they do have order (Small/Medium/Large, rating scales) → use ordinal encoding. Using nominal encoding on unordered categories silently introduces a false ordering that can bias your model.
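
The rule can be written as a small dispatch function. This is a sketch that assumes the caller already knows whether the categories are ordered; `encode_column` and its `order` parameter are hypothetical names for this example, not part of any library.

```python
def encode_column(values, order=None):
    """Apply the core rule to one categorical column.

    If `order` is given (lowest to highest), return ordinal integer codes.
    Otherwise treat the categories as unordered and return one-hot rows.
    """
    if order is not None:
        rank = {category: i for i, category in enumerate(order)}
        return [rank[v] for v in values]
    categories = sorted(set(values))  # fixes column order only, not a ranking
    return [[1 if v == c else 0 for c in categories] for v in values]


# Ordered categories: pass the order explicitly.
print(encode_column(["Large", "Small", "Medium"],
                    order=["Small", "Medium", "Large"]))  # [2, 0, 1]

# Unordered categories: no order, so one-hot (columns: Blue, Green, Red).
print(encode_column(["Red", "Blue", "Green"]))
# [[0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Making the order an explicit argument means an ordering can only enter the data when someone states it on purpose.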

Checkpoint

You have a 'satisfaction level' variable: Very Unsatisfied → Unsatisfied → Neutral → Satisfied → Very Satisfied. Which encoding is most appropriate?

The Single Biggest Mistake

Using an encoding that introduces an ordering the data doesn't have. If you nominal-encode color (Red=0, Blue=1, Green=2) and your model is even a little bit sensitive to ordinal relationships, you've smuggled in a bias that's hard to detect later. When in doubt, use one-hot for unordered categories.
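
One way to see the smuggled-in ordering is to compare distances between encoded categories. The sketch below uses the color codes from the paragraph above; the `euclidean` helper is defined here only for the illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Integer codes from the text: Red=0, Blue=1, Green=2.
integer_code = {"Red": [0], "Blue": [1], "Green": [2]}
# One-hot rows with columns (is_red, is_blue, is_green).
one_hot = {"Red": [1, 0, 0], "Blue": [0, 1, 0], "Green": [0, 0, 1]}

pairs = [("Red", "Blue"), ("Blue", "Green"), ("Red", "Green")]
for a, b in pairs:
    d_int = euclidean(integer_code[a], integer_code[b])
    d_hot = euclidean(one_hot[a], one_hot[b])
    print(f"{a}-{b}: integer={d_int:.2f}, one-hot={d_hot:.2f}")
```

With integer codes, Red sits twice as far from Green as from Blue, a relationship that exists only because of the numbering. With one-hot rows every pair is the same distance apart, so a distance-sensitive model has no false ordering to pick up.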

Survey Data in the Wild

The U.S. Census, NHANES, Pew Research, the European Social Survey — every one of these is a tabular dataset full of categorical variables that someone decided how to encode. Whether you're building a model on top of survey data or reading a published analysis, the encoding choices upstream are shaping the conclusions. When a paper reports "race" as a numerical variable, take a moment to think about how it got encoded. The choice is never neutral.

Checkpoint

A dataset has an "Education Level" column with values "High School", "Bachelor's", "Master's", "PhD". Which encoding is most appropriate?