Machine Learning FAQ
How can I apply an SVM to categorical data?
I assume you are asking about categorical features, not the target variable, which is already assumed to be categorical (binary) in SVM classifiers.
First, there are two subtypes of categorical features: ordinal and nominal.
Ordinal means that an "order" is implied. For example, a customer satisfaction metric {"satisfied", "neutral", "dissatisfied"} is an ordinal variable since we can order it: "satisfied" > "neutral" > "dissatisfied". Here, we can simply map the string notation into an integer notation that preserves this order, for example "satisfied" = 1, "neutral" = 0, and "dissatisfied" = -1.
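As a minimal sketch of that mapping in Python: the dictionary name and the list of responses below are made up for illustration, and only the three labels and their integer codes come from the example above.

```python
# Hypothetical lookup table; the integer codes preserve the order
# "satisfied" > "neutral" > "dissatisfied".
satisfaction_to_int = {"dissatisfied": -1, "neutral": 0, "satisfied": 1}

# Made-up survey responses, used only to demonstrate the mapping.
responses = ["satisfied", "dissatisfied", "neutral", "satisfied"]

# Replace each string label with its integer code.
encoded = [satisfaction_to_int[response] for response in responses]
print(encoded)
```

A label that is missing from the dictionary raises a `KeyError`, which is usually preferable to silently assigning it an arbitrary number.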
If our variable is nominal, an "order" does not make sense. For example, think of "color"; there are some cases in image processing where ordering color values makes sense, but for simplicity, we can't say "red > blue > yellow" or so. To deal with such variables in SVM classification, we typically do a "one-hot" encoding. Here, we create so-called dummy variables that take binary values: one dummy variable for each possible value of that nominal feature variable. Say that our color variable can have one of three values: "red," "blue," or "yellow." And let's say we have the following dataset consisting of 4 training samples:
- sample 1: "blue"
- sample 2: "yellow"
- sample 3: "red"
- sample 4: "yellow"
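Then our one-hot encoding would look like this:

| sample | red | blue | yellow |
|--------|-----|------|--------|
| 1      | 0   | 1    | 0      |
| 2      | 0   | 0    | 1      |
| 3      | 1   | 0    | 0      |
| 4      | 0   | 0    | 1      |
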
Note that there's only one "true" value (the integer 1) in each row, and it marks the column that matches the color of that training sample. Sample 1 is blue, sample 2 is yellow, and so forth.
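The one-hot encoding of the four color samples can be sketched in plain Python as follows. The helper `one_hot_encode` is a hypothetical name written for this illustration, and the column order (red, blue, yellow) is an assumption; any fixed order works as long as it is used consistently for training and prediction.

```python
def one_hot_encode(values, categories):
    """Return one row per value, with a 1 in the column of its category."""
    # Map each category to its column index.
    column_of = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)   # start with all dummy variables off
        row[column_of[value]] = 1     # switch on the matching column
        rows.append(row)
    return rows


categories = ["red", "blue", "yellow"]          # assumed column order
samples = ["blue", "yellow", "red", "yellow"]   # samples 1 to 4 from the text

encoded = one_hot_encode(samples, categories)
for sample, row in zip(samples, encoded):
    print(sample, row)
```

Each resulting row is a purely numeric feature vector, which is the form an SVM implementation expects. In practice, existing utilities such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would typically be used instead of a hand-written helper.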