Handling Missing Data: A Comprehensive Guide for Data Scientists

Introduction:

As data scientists, the responsibility of handling missing values is paramount before unleashing the power of machine learning algorithms. In this comprehensive guide, we'll explore the intricacies of dealing with missing data, comparing and contrasting various methods to ensure optimal model performance.

Methods to Handle Missing Values:

1) Complete Case Analysis (Remove Data):

  • Best suited when the data is missing completely at random.

  • Considered when missing values comprise only a small fraction of the dataset, typically around 5%.

  • The method removes every data point (row) that contains a missing value, as in the sketch below.
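
A minimal sketch of complete case analysis with pandas; the DataFrame and its values are illustrative assumptions:

```python
import pandas as pd

# Illustrative data with a few missing entries
df = pd.DataFrame({
    "age": [25, None, 31, 42, 38],
    "salary": [50000, 62000, None, 71000, 58000],
})

# Fraction of rows containing at least one missing value
missing_fraction = df.isnull().any(axis=1).mean()
print(f"Rows with missing values: {missing_fraction:.0%}")

# Complete case analysis: drop every row with any missing value
df_complete = df.dropna()
```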

2) Impute the Missing Values (Replace Missing Values):

  • Essential when the missing data shows a specific pattern, i.e., when the data is not missing at random.

  • Two main approaches: Univariate Imputation and Multivariate Imputation.

Univariate Imputation for Numerical Data:

a) Replacing with Mean/Median/Mode:

Quick and simple, but may alter the distribution. Replace the missing value with the mean of the feature when the feature is approximately normally distributed; if the feature is skewed, it is usually better to use the median instead. This method should only be used when the data is missing completely at random and no more than about 5% of the values are missing.
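
One common way to do this is with scikit-learn's SimpleImputer; the array below is an illustrative assumption:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0], [np.nan], [31.0], [42.0], [38.0]])

# "mean" for roughly normal features, "median" for skewed ones
mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X)

median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X)
```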

b) Replacing with Arbitrary Value:

Suitable when data is not missing at random. Replace each missing value with a fixed, arbitrary value. Because this can distort the feature's distribution, the technique is rarely used in practice.
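
A minimal sketch with SimpleImputer; the sentinel -999 is an illustrative choice, not a recommendation:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0], [np.nan], [31.0]])

# Fill every missing entry with a fixed, arbitrary sentinel value
imputer = SimpleImputer(strategy="constant", fill_value=-999)
X_filled = imputer.fit_transform(X)
```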

c) Replacing with End-of-Distribution Value:

Better than an arbitrary value, because the replacement is anchored to the feature's own distribution. Replace the missing data with a value at the end of the distribution. For a skewed feature, use either Q1 - 1.5*IQR or Q3 + 1.5*IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1. For a normally distributed feature, use either the mean plus three standard deviations or the mean minus three standard deviations.
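
A pandas sketch of both rules, assuming an illustrative numeric Series s:

```python
import pandas as pd

s = pd.Series([25, None, 31, 42, 38, 29, 55])

# Skewed feature: impute at Q3 + 1.5*IQR (or Q1 - 1.5*IQR)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
s_skewed = s.fillna(q3 + 1.5 * iqr)

# Roughly normal feature: impute at mean + 3*std (or mean - 3*std)
s_normal = s.fillna(s.mean() + 3 * s.std())
```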

d) Replacing with Random Value:

Preserves the feature's distribution, but can distort its covariance with other features. Replace each missing value with a randomly chosen observed value from the same feature. Because the filled values ignore the other features, the covariance with those features may change.
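
A pandas sketch of random sample imputation; the Series and the fixed random_state are illustrative assumptions:

```python
import pandas as pd

s = pd.Series([25, None, 31, 42, None, 38])

# Draw one observed value (with replacement) for each missing slot
missing_mask = s.isnull()
sampled = s.dropna().sample(missing_mask.sum(), replace=True, random_state=42)
sampled.index = s[missing_mask].index  # align samples with the gaps

s_filled = s.fillna(sampled)
```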

e) Missing Indicator:

Introduces new features that flag missing values. For each feature with missing values, create a companion boolean feature that is true where the original value was missing and false otherwise.
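
scikit-learn can produce these flags directly; a minimal sketch, assuming a single numeric feature:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0], [np.nan], [31.0]])

# add_indicator=True appends a boolean column marking which rows were missing
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_with_flag = imputer.fit_transform(X)
# Column 0: the imputed feature; column 1: 1.0 where the value was missing
```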

Univariate Imputation for Categorical Data:

a) Most Frequent Value Imputation:

Easy to implement, but changes the feature's distribution. Used when the data is missing completely at random: replace each missing value with the mode of that feature. The transformation is simple, but it will distort the category distribution.
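
A minimal sketch with SimpleImputer on an illustrative categorical column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)

# Replace missing categories with the column's mode
imputer = SimpleImputer(strategy="most_frequent")
X_filled = imputer.fit_transform(X)  # the NaN becomes "red"
```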

b) Missing Category Imputation:

Introduces a "Missing" category for robust handling. Simply replace the missing data with a new category called "Missing". This method is generally used when more than about 10% of the feature's values are missing.
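
The same SimpleImputer handles this with a constant fill; the column below is again an illustrative assumption:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)

# Treat missingness as its own category instead of guessing a value
imputer = SimpleImputer(strategy="constant", fill_value="Missing")
X_filled = imputer.fit_transform(X)
```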

c) Replacing with Random Values:

Preserves the distribution without distortion. As with numerical data, replace each missing value with a randomly chosen observed value from the feature (see the random-value sketch above).

d) Missing Indicator Imputation:

Adds features indicating missing values. This is the same method used for numerical features, shown in the missing-indicator sketch above.

Multivariate Imputation (for Categorical and Numerical Data):

a) KNN Imputation:

Utilizes the KNN algorithm to replace missing values. The method can be summed up in three steps: i) choose the value of k; ii) treat each record as a point and compute the distance between the point with the missing value and every other point using the nan_euclidean distance, which ignores coordinates that are missing; iii) take the average of the feature in question over the k nearest points and use it to replace the missing value.
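
A minimal sketch with scikit-learn's KNNImputer, which uses the nan_euclidean distance internally; the two-feature array is an illustrative assumption:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [25.0, 50000.0],
    [np.nan, 62000.0],
    [31.0, np.nan],
    [42.0, 71000.0],
])

# Distances between rows skip missing coordinates; each gap is filled
# with the mean of that feature over the k nearest rows
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```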

b) Iterative Imputer (MICE Algorithm):

Slow but effective, and suitable when data is not missing at random. MICE stands for Multivariate Imputation by Chained Equations, and it improves the imputations iteratively. Suppose we have three columns, each with missing values. In iteration 0, replace the missing values in each column with that column's mean. In iteration 1, go column by column: for the current column, set its imputed values back to missing, train a machine-learning model on all the other records to predict that column, and replace its missing values with the model's predictions. Doing this for every column completes the iteration. Finally, compute the difference between the current and previous iterations; the goal is to drive this difference to zero, so the same steps are repeated until two consecutive iterations agree.
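
A minimal sketch with scikit-learn's IterativeImputer, which follows this chained-equations scheme; the data and parameters are illustrative:

```python
import numpy as np
# IterativeImputer is still experimental, so this enabling import is required
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [25.0, 50000.0, 3.0],
    [np.nan, 62000.0, 5.0],
    [31.0, np.nan, np.nan],
    [42.0, 71000.0, 8.0],
])

# Starts from a mean fill, then models each column from the others in
# round-robin fashion until the imputations stop changing (or max_iter)
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```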

Conclusion:

Effective handling of missing data is pivotal for robust machine learning models. From complete case analysis to advanced imputation techniques, choosing the right method depends on the nature of the missing data. Remember, a well-handled dataset is the foundation of accurate predictions in the realm of data science.
