Tabular Data
Tabular data comes in two broad types. Quantitative data is already numeric: ages, prices, temperatures. Categorical data is not: colors, countries, diagnoses. How you encode categorical data is an important decision, because the wrong choice can introduce false structure that biases every downstream model.
Four Encodings You Should Know
- Nominal encoding — assigns each category a unique integer (Red=0, Blue=1, Green=2). Simple, but implies an ordering between categories that doesn't exist. Use cautiously.
- One-hot encoding — creates a separate binary column for each category. Each row gets a 1 in exactly one column and 0s elsewhere. Right for unordered categories; expensive if there are hundreds of categories.
- Ordinal encoding — for categories that do have a natural order (Small=0, Medium=1, Large=2). The numbers genuinely carry meaning here, so models can use the magnitude.
- Label encoding — assigns integers in alphabetical order of the category labels. Useful for tracking, not for conveying magnitude.
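The four schemes above can be sketched in plain Python. This is a minimal illustration rather than any library's API: the function names (`nominal_encode`, `one_hot_encode`, `ordinal_encode`, `label_encode`) are hypothetical, and nominal encoding is assumed here to hand out integers in order of first appearance.

```python
def nominal_encode(values):
    """Assign each distinct category a unique integer (first-seen order, an assumption)."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values]


def one_hot_encode(values, categories):
    """Return one 0/1 row per value, with a single 1 in that value's column."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def ordinal_encode(values, order):
    """Map each value to its rank in an explicitly supplied order."""
    rank = {c: i for i, c in enumerate(order)}
    return [rank[v] for v in values]


def label_encode(values):
    """Assign integers by alphabetical order of the distinct labels."""
    rank = {c: i for i, c in enumerate(sorted(set(values)))}
    return [rank[v] for v in values]


colors = ["Red", "Blue", "Green", "Blue"]
sizes = ["Medium", "Small", "Large"]

print(nominal_encode(colors))                              # [0, 1, 2, 1]
print(one_hot_encode(colors, ["Red", "Blue", "Green"]))    # one row per value
print(ordinal_encode(sizes, ["Small", "Medium", "Large"])) # [1, 0, 2]
print(label_encode(colors))                                # [2, 0, 1, 0]
```

Note that only `ordinal_encode` takes the order as an argument: the meaning of its integers comes from the caller, not from the data or the alphabet.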
Example: a size variable with values XS, S, M, L, XL (XS < S < M < L < XL; a real order exists), one-hot encoded:
| value | is_xs | is_s | is_m | is_l | is_xl |
|---|---|---|---|---|---|
| XS | 1 | 0 | 0 | 0 | 0 |
| S | 0 | 1 | 0 | 0 | 0 |
| M | 0 | 0 | 1 | 0 | 0 |
| L | 0 | 0 | 0 | 1 | 0 |
| XL | 0 | 0 | 0 | 0 | 1 |
One-hot creates a separate binary column for each category. No column is numerically "higher" than another — but you're throwing away the real ordering between values.
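As a rough sketch, the table above can be rebuilt in a few lines of Python, alongside the ordinal alternative that keeps the ordering. The column names follow the table; the variable names (`one_hot_rows`, `ordinal_codes`) are illustrative only.

```python
sizes = ["XS", "S", "M", "L", "XL"]  # listed in their real order

# One-hot: one is_<size> column per category, exactly one 1 per row.
columns = [f"is_{s.lower()}" for s in sizes]
one_hot_rows = {
    value: [1 if value == s else 0 for s in sizes] for value in sizes
}

# Ordinal alternative: a single column whose integers preserve XS < S < M < L < XL.
ordinal_codes = {s: i for i, s in enumerate(sizes)}

print(columns)
for value, row in one_hot_rows.items():
    print(value, row)
print(ordinal_codes)
```

The one-hot version needs five columns and carries no notion that XL is larger than XS; the ordinal version needs one column and keeps that information.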
The Core Rule
If the categories have no inherent order (colors, countries, diagnoses) → use one-hot encoding. If they do have order (Small/Medium/Large, rating scales) → use ordinal encoding. Using nominal encoding on unordered categories silently introduces a false ordering that can bias your model.
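One way to express this rule in code is a small dispatcher: supply an order and you get ordinal codes, omit it and you get one-hot rows. The `encode` function and its `order` parameter are hypothetical names chosen for this sketch, and sorting the one-hot columns is only for reproducibility.

```python
def encode(values, order=None):
    """Ordinal codes if an order is supplied, otherwise one-hot rows."""
    if order is not None:
        rank = {c: i for i, c in enumerate(order)}
        return [rank[v] for v in values]
    # Column order is arbitrary for unordered categories; sorted for reproducibility.
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]


countries = ["Peru", "Kenya", "Peru"]   # no inherent order
ratings = ["Medium", "Large", "Small"]  # ordered

print(encode(countries))                                   # [[0, 1], [1, 0], [0, 1]]
print(encode(ratings, order=["Small", "Medium", "Large"])) # [1, 2, 0]
```

Making the order an explicit argument forces the person encoding the data to state whether an ordering exists, instead of letting one appear by accident.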
You have a 'satisfaction level' variable: Very Unsatisfied → Unsatisfied → Neutral → Satisfied → Very Satisfied. Which encoding is most appropriate?
The Single Biggest Mistake
Using an encoding that introduces an ordering the data doesn't have. If you nominal-encode color (Red=0, Blue=1, Green=2) and your model is even a little bit sensitive to ordinal relationships, you've smuggled in a bias that's hard to detect later. When in doubt, use one-hot for unordered categories.
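To see the false ordering concretely, compare pairwise distances between colors under the integer codes and under one-hot vectors. This is a small illustration using the Red=0, Blue=1, Green=2 codes from the text; the helper names (`int_distance`, `one_hot_distance`) are made up for the example.

```python
from itertools import combinations

colors = ["Red", "Blue", "Green"]

# Integer codes from the text: Red=0, Blue=1, Green=2.
int_codes = {c: i for i, c in enumerate(colors)}

# One-hot vectors: each color gets its own column.
one_hot = {c: [int(c == other) for other in colors] for c in colors}


def int_distance(a, b):
    """Absolute difference between the integer codes."""
    return abs(int_codes[a] - int_codes[b])


def one_hot_distance(a, b):
    """Number of positions where the one-hot vectors differ (Hamming distance)."""
    return sum(x != y for x, y in zip(one_hot[a], one_hot[b]))


for a, b in combinations(colors, 2):
    print(a, b, int_distance(a, b), one_hot_distance(a, b))
```

Under the integer codes, Red sits twice as far from Green as from Blue, a relationship that exists only in the encoding. Under one-hot, every pair of colors is equally far apart.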
Survey Data in the Wild
The U.S. Census, NHANES, Pew Research, the European Social Survey — every one of these is a tabular dataset full of categorical variables that someone decided how to encode. Whether you're building a model on top of survey data or reading a published analysis, the encoding choices upstream are shaping the conclusions. When a paper reports "race" as a numerical variable, take a moment to think about how it got encoded. The choice is never neutral.
A dataset has an 'Education Level' column with values: 'High School', 'Bachelor's', 'Master's', 'PhD'. Which encoding is most appropriate?