Demystifying Feature Scaling and Transformation in Machine Learning

Introduction:

In the world of machine learning, feature scaling and transformation are essential techniques that often remain hidden behind the scenes. This blog aims to shine a light on these critical processes and their significance in preparing data for machine learning models.

Why Feature Scaling is Essential:

Many machine learning algorithms rely on distance calculations as a fundamental part of their decision-making process. If the features in your dataset have significantly different scales, it can lead to distorted distances and affect the accuracy and efficiency of your models. In essence, when your features are not on the same scale, some features may dominate the learning process. To mitigate this, we employ feature scaling.
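To make this concrete, here is a minimal sketch (with made-up numbers) of how a feature with a large range can dominate a Euclidean distance calculation:

```python
# A toy illustration of why scale matters: with unscaled features,
# Euclidean distance is dominated by the feature with the larger range.
import numpy as np

# Hypothetical data points: (age in years, income in dollars)
a = np.array([25.0, 50_000.0])
b = np.array([55.0, 51_000.0])

# The raw distance is driven almost entirely by income, even though
# the 30-year age gap is arguably the more meaningful difference.
print(np.linalg.norm(a - b))  # ~1000.45

# After standardizing each feature (using assumed, illustrative
# statistics), both features contribute comparably to the distance.
mean = np.array([40.0, 50_500.0])
std = np.array([10.0, 15_000.0])
print(np.linalg.norm((a - mean) / std - (b - mean) / std))  # ~3.0
```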

Types of Feature Scaling

There are two common methods for feature scaling:

  1. Standardization

    Standardization is the process of converting a distribution into a standard normal distribution, where the mean is 0 and the standard deviation is 1. This technique is particularly useful for algorithms that heavily rely on distance calculations. The formula for standardization is:

    \(x_{standardized} = \frac{x - \text{mean}}{\text{std}}\), where \(x\) is the data point value, and mean and std are the mean and standard deviation of the respective feature. A short code sketch of standardization follows this list.

  2. Normalization

    Normalization, on the other hand, is about scaling numerical features to a specific range, typically between 0 and 1. It aims to bring all the features to a common scale, making them easier to work with for machine learning algorithms.
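As a concrete illustration of standardization (item 1 above), here is a minimal sketch using scikit-learn's StandardScaler on made-up data:

```python
# A minimal sketch of standardization using scikit-learn's StandardScaler.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # toy feature column

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.mean())  # ~0.0 after standardization
print(X_std.std())   # ~1.0 after standardization
```

In practice, you fit the scaler on the training set only and reuse the learned statistics to transform the test set, which avoids data leakage.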

Types of Normalization

There are several normalization methods:

  1. Min-Max Scaler

    The min-max scaler rescales each feature to a fixed range, typically [0, 1]. You can achieve this transformation using the formula:

    \(x_{transformed} = \frac{x - x_{min}}{x_{max} - x_{min}}\)

    where \(x_{min}\) and \(x_{max}\) are the minimum and maximum values in the feature column.

  2. Max Absolute Scaler

    When your dataset is sparse (contains many zeroes), the max absolute scaler is a suitable choice, because it scales without shifting the data and therefore preserves sparsity. The formula for this scaler is:

    \(x_{transformed} = \frac{x}{|x|_{max}}\)

    where \(|x|_{max}\) is the maximum absolute value present in the feature column. The result always lies in the range [-1, 1].

  3. Robust Scaler

    The robust scaler comes into play when your dataset contains outliers. Its formula is:

    \(x_{transformed} = \frac{x - x_{median}}{IQR}\)

    where \(x_{median}\) is the median of the feature column and IQR is the interquartile range, the difference between the 75th and 25th percentile values of the distribution.
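To see how these three scalers behave differently, here is a minimal sketch with made-up data that includes a single large outlier (100):

```python
# A minimal sketch comparing the three normalization scalers from above.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

print(MinMaxScaler().fit_transform(X).ravel())  # [0. 0.0101 0.0202 0.0303 1.]
print(MaxAbsScaler().fit_transform(X).ravel())  # [0.01 0.02 0.03 0.04 1.]
print(RobustScaler().fit_transform(X).ravel())  # [-1. -0.5 0. 0.5 48.5]
```

Note how the outlier squeezes the min-max output toward 0, while the robust scaler keeps the inliers well spread out because it relies on the median and IQR rather than the extremes.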

Mathematical Transformation

Mathematical transformations apply statistical functions to reshape a feature's distribution so that it becomes closer to normal. There are two types of mathematical transformations: the Function Transformer and the Power Transformer.

a) Function Transformer

  • Log Transformation: Replaces each value with its logarithm; mainly useful for compressing right-skewed data. It is defined only for positive values.

  • Reciprocal Transformation: Inverts data values using the formula 1/x, turning large values into small ones and vice versa.

  • Square Transformation: Replaces each data point with its square; suitable for left-skewed data.

  • Square Root Transformation: Replaces each data point with its square root; a milder alternative to the log transformation, commonly applied to count data.
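As a minimal sketch of the function transformer, here is the log transformation applied through scikit-learn's FunctionTransformer on made-up, right-skewed data (np.log1p, i.e. log(1 + x), is used so that zeroes are handled safely):

```python
# A minimal sketch of a log transform via scikit-learn's FunctionTransformer.
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0], [10.0], [100.0], [1000.0]])  # strongly right-skewed

log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X_log = log_tf.fit_transform(X)

print(X_log.ravel())                            # [0.69 2.40 4.62 6.91]
print(log_tf.inverse_transform(X_log).ravel())  # recovers the original values
```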

b) Power Transformer

  • Box-Cox Transformation

  • Yeo-Johnson Transformation

Exploring Box-Cox and Yeo-Johnson Transformations

Two common power transformations, the Box-Cox and Yeo-Johnson transformations, play crucial roles in achieving optimal feature scaling. They are designed to normalize data and address issues associated with skewed distributions.

Box-Cox Transformation Formula:

The Box-Cox transformation is defined as:

\[
y = \begin{cases} \dfrac{x^\lambda - 1}{\lambda}, & \text{if } \lambda \neq 0 \\ \log(x), & \text{if } \lambda = 0 \end{cases}
\]

where \(\lambda\) is a parameter estimated from the data (typically by maximum likelihood) so that the transformed values are as close to normal as possible. Note that the Box-Cox transformation requires \(x\) to be strictly positive.

Yeo-Johnson Transformation Formula:

The Yeo-Johnson transformation is similar to the Box-Cox transformation but is more flexible, allowing for the transformation of data with zero or negative values. It is given by:

\[
y = \begin{cases}
\dfrac{(x + 1)^\lambda - 1}{\lambda}, & \text{if } \lambda \neq 0,\ x \geq 0 \\[4pt]
\log(x + 1), & \text{if } \lambda = 0,\ x \geq 0 \\[4pt]
-\dfrac{(1 - x)^{2 - \lambda} - 1}{2 - \lambda}, & \text{if } \lambda \neq 2,\ x < 0 \\[4pt]
-\log(1 - x), & \text{if } \lambda = 2,\ x < 0
\end{cases}
\]
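Both power transformations are available through scikit-learn's PowerTransformer, which estimates \(\lambda\) from the data by maximum likelihood. Here is a minimal sketch on made-up data:

```python
# A minimal sketch of the Box-Cox and Yeo-Johnson transformations
# using scikit-learn's PowerTransformer.
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X_pos = rng.exponential(scale=2.0, size=(500, 1))  # strictly positive, right-skewed
X_any = X_pos - 1.0                                # shifted so some values are negative

# Box-Cox requires strictly positive inputs.
bc = PowerTransformer(method="box-cox")
X_bc = bc.fit_transform(X_pos)
print(bc.lambdas_)  # the lambda fitted by maximum likelihood
print(X_bc.mean())  # ~0 (PowerTransformer also standardizes by default)

# Yeo-Johnson handles zero and negative values as well.
yj = PowerTransformer(method="yeo-johnson")
X_yj = yj.fit_transform(X_any)
print(X_yj.std())   # ~1
```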

Conclusion:

Feature scaling and transformation are vital steps in preparing data for machine learning models. By understanding the various scaling techniques and the power transformations like Box-Cox and Yeo-Johnson, you can make informed decisions about which method is best suited to your specific dataset and machine learning algorithm, ultimately improving the performance and accuracy of your models.
