Decoding the Art of Data Encoding: Unraveling Techniques for Categorical and Numerical Features

Introduction:

In the vast landscape of data, the need for encoding arises because we encounter two distinct types of features: numerical and categorical. Machine learning algorithms favor numerical data, so the challenge lies in handling categorical features. This blog explores the world of encoding: the types of categories and a comprehensive array of techniques to bring both kinds of features into a form that machine learning models can use.

Understanding Categories:

Categories can be broadly classified into two types: Nominal Categorical Features and Ordinal Categorical Features. Nominal features have no inherent order (e.g., city names or colors), while ordinal features exhibit a clear order or hierarchy (e.g., education level: high school < bachelor's < master's).

Encoding Techniques for Ordinal Categories

Ordinal Encoding:

In this type of encoding, we list our categories in increasing order of rank and then map them to integers, so that the highest-ranked category receives the highest number.
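
Below is a minimal sketch of ordinal encoding with Scikit-learn's OrdinalEncoder; the feature name and the category order ("Low" < "Medium" < "High") are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"quality": ["Low", "High", "Medium", "Low"]})

# Pass the categories in increasing order of rank.
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df["quality_encoded"] = encoder.fit_transform(df[["quality"]])
print(df)  # Low -> 0.0, Medium -> 1.0, High -> 2.0
```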

Label Encoding:

This type of encoding is applied only to the target column: each output category is assigned an integer label. This encoding technique should not be used for the input features.
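
A short sketch of label encoding the target column with Scikit-learn; the class names are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

y = ["spam", "ham", "ham", "spam", "ham"]

le = LabelEncoder()
y_encoded = le.fit_transform(y)  # each class gets an integer label
print(list(le.classes_))         # ['ham', 'spam']
print(y_encoded)                 # [1 0 0 1 0]
```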

Target Guided Encoding:

In this technique, we use the relationship between each category and the target to encode it. Suppose we are dealing with a binary classification problem and want to convert categories A, B, and C into numbers. For each category, we count the rows where the target is 1 and divide by the total number of rows for that category (target 1 and 0 combined); this gives the mean of the target per category. Since we are interested in ranks, we assign higher numbers to categories with higher means and use these ranks as the encoded values.
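
A hedged sketch of target-guided encoding using Pandas; the column names and the toy data are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "B", "A", "C", "B", "A"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Mean of the target per category (fraction of rows with target == 1).
means = df.groupby("city")["target"].mean()

# Rank categories by their mean: higher mean -> higher encoded value.
ranks = means.rank(method="dense").astype(int)
df["city_encoded"] = df["city"].map(ranks)
print(df)
```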

Mean Encoding:

It is the same as target-guided encoding; the only difference is that we stop at the mean step and do not assign ranks: each category is replaced directly by its target mean.
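
A short, self-contained sketch of mean encoding with Pandas, using the same toy data as above; column names remain assumptions.

```python
import pandas as pd

df = pd.DataFrame({"city": ["A", "B", "A", "C", "B", "A"],
                   "target": [1, 0, 1, 0, 1, 0]})

# Replace each category directly with the mean of the target for that category.
df["city_encoded"] = df["city"].map(df.groupby("city")["target"].mean())
print(df)
```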

Encoding Techniques for Nominal Categories

One Hot Encoding:

Creates dummy variables/features equal to the number of categories in the nominal feature, introducing a potential curse of dimensionality. For example, if we have two categories, heads and tails, we add two features named heads and tails, and assign 1 to the matching feature for each row of the original column.
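
A minimal sketch of one-hot encoding with Pandas, mirroring the heads/tails example; the column name is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"toss": ["heads", "tails", "heads"]})

# One dummy column per category; 1 marks the category present in that row.
dummies = pd.get_dummies(df["toss"], dtype=int)
print(dummies)
#    heads  tails
# 0      1      0
# 1      0      1
# 2      1      0
```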

One Hot Encoding for Multiple Columns:

Mitigates the curse of dimensionality by adding only n-1 features. For the example above, we can add either heads or tails as a new feature instead of both, because if the one feature we keep has the value 0, we know the row automatically belongs to the other category. Another way to reduce the curse of dimensionality is to group the less frequent categories into a common category such as "Others", as shown in the sketch below.
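
A sketch of both dimensionality-reducing variants: dropping one dummy (n-1 columns) and grouping rare categories into "Others". The data and the rarity threshold are assumptions.

```python
import pandas as pd

# n-1 dummies: if "tails" == 0, the row must be "heads".
df = pd.DataFrame({"toss": ["heads", "tails", "heads", "tails"]})
reduced = pd.get_dummies(df["toss"], drop_first=True, dtype=int)

# Group infrequent categories into "Others" before encoding.
colors = pd.Series(["red", "blue", "red", "green", "violet", "red"])
counts = colors.value_counts()
rare = counts[counts < 2].index
grouped = colors.where(~colors.isin(rare), "Others")
encoded = pd.get_dummies(grouped, dtype=int)
print(encoded)
```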

Count/Frequency Encoding:

Replaces each category with the number of times it occurs in the feature (or with its relative frequency).
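
A minimal sketch of count and frequency encoding with Pandas; the data is illustrative.

```python
import pandas as pd

s = pd.Series(["A", "B", "A", "C", "A", "B"])

counts = s.value_counts()              # A: 3, B: 2, C: 1
count_encoded = s.map(counts)          # replace category with its count
freq_encoded = s.map(counts / len(s))  # or with its relative frequency
print(count_encoded.tolist(), freq_encoded.tolist())
```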

Encoding Numerical Features

Although machine learning algorithms prefer numerical features, variations in their ranges and distributions can still pose challenges. Two methods address this:

1) Discretization or Binning:

Description: Converts numerical features into categorical features by creating ranges.

Types of Binning:

i) Unsupervised Binning:

a) Equal Width/Uniform Binning:

    • Description: Decide the number of bins and calculate the width of each interval.

      • Formula: Width of each interval = (Max value in feature - Min value in feature) / Number of bins.
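
A sketch of equal-width binning with Pandas; pd.cut splits the range (max - min) into the chosen number of equally wide intervals. The data and labels are assumptions.

```python
import pandas as pd

ages = pd.Series([18, 22, 25, 31, 40, 47, 58, 63, 70])

# 3 bins of equal width, each ~ (70 - 18) / 3 wide.
binned = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
print(binned.value_counts())
```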

b) Equal Frequency/Quantile Binning:

    • Description: Determine the number of bins, each representing a quantile range.

      • Process: Divide data into percentiles, assigning points to respective bins based on quantiles.

      • Advantage: Ensures an approximately equal number of data points in each bin.
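
A sketch of equal-frequency (quantile) binning with pd.qcut; each bin receives roughly the same number of points. The data is illustrative.

```python
import pandas as pd

incomes = pd.Series([12, 15, 22, 30, 35, 41, 55, 80, 120, 300])

# 4 quantile-based bins (quartiles).
binned = pd.qcut(incomes, q=4, labels=["q1", "q2", "q3", "q4"])
print(binned.value_counts())  # roughly equal counts per bin
```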

c) K-means Binning:

    • Description: Choose the number of bins and apply the k-means clustering algorithm.

      • Process: Apply k-means clustering with the number of centroids equal to the number of bins. After clustering, each cluster is treated as an interval.
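
A sketch of k-means binning with Scikit-learn's KBinsDiscretizer; each cluster found by k-means becomes one interval. The data is illustrative.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.array([[1], [2], [2.5], [10], [11], [30], [31], [32]])

kbins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
x_binned = kbins.fit_transform(x)
print(x_binned.ravel())     # bin index assigned to each value
print(kbins.bin_edges_[0])  # interval boundaries derived from the centroids
```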

ii) Custom/Domain-based Binning:

    • Description: Leverage domain knowledge to define custom intervals.

      • Implementation: While Scikit-learn provides support for unsupervised binning, Pandas makes it easy to create custom bins based on domain-specific insights.
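
A sketch of domain-based binning with pd.cut and hand-picked edges; the age ranges and labels are assumptions chosen for illustration.

```python
import pandas as pd

ages = pd.Series([5, 17, 23, 36, 52, 67, 80])

# Domain-defined edges and labels instead of automatically computed ones.
bins = [0, 18, 35, 60, 100]
labels = ["child", "young_adult", "adult", "senior"]
age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group)
```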

iii) Supervised Binning:

    • Description: Tailor binning based on the target variable, incorporating labels for improved model performance.

      • Advantage: Customizing bin ranges based on the target variable's influence can enhance the predictive power of the model.
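
One common way to do supervised binning (an assumption here, not prescribed by this post) is to fit a shallow decision tree on the feature and target and reuse its split points as bin edges. The data and tree depth below are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

x = np.array([[3], [8], [15], [22], [30], [45], [52], [60]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# A shallow tree learns split thresholds that separate the target classes.
tree = DecisionTreeClassifier(max_depth=2).fit(x, y)
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)  # keep real splits only

edges = [-np.inf, *thresholds, np.inf]
binned = pd.cut(x.ravel(), bins=edges, labels=False)
print(binned)
```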

2) Binarization:

In binarization, a numerical feature is converted into a binary one: continuous values are mapped to 0 or 1 depending on whether they cross a chosen threshold. This technique simplifies the interpretation of numerical data, particularly when only the presence or absence of something matters rather than its exact magnitude.
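
A minimal sketch of binarization with Scikit-learn's Binarizer; the feature name and threshold value are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

hours_watched = np.array([[0.0], [1.5], [3.2], [0.4], [7.0]])

# Values above the threshold become 1, the rest become 0.
binarizer = Binarizer(threshold=1.0)
binary = binarizer.fit_transform(hours_watched)
print(binary.ravel())  # [0. 1. 1. 0. 1.]
```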

Conclusion:

Data encoding is a pivotal step in preparing diverse datasets for machine learning models. By understanding the nuances of encoding techniques for both categorical and numerical features, we can optimize the performance and reliability of our models. Explore the art of decoding data through meticulous encoding strategies for a seamless machine-learning journey.
