7 Steps of Data Exploration & Preparation – Part 2

Sunil · Last Updated: 06 Aug, 2015
5 min read

Introduction

In Part-1 of this series, we looked at the first three steps of Data Exploration & Preparation, namely Variable identification, Univariate and Bivariate analysis. In this article, we will proceed from the fourth step.

We will look at methods for treating missing values. More importantly, we will also look at why missing values occur in our data and why treating them is necessary.


Why is missing value treatment required?

Missing data in the training dataset can reduce the power / fit of a model, or can produce a biased model, because we have not analysed the behaviour of, and relationships with, the other variables correctly. This can lead to wrong predictions or classifications.

[Image: cricket-playing rates by gender, before and after missing value treatment]

Notice the missing values in the image shown above. In the left table, where missing values have not been treated, the inference is that the chance of playing cricket is higher for males than for females. In the right table, which shows the data after treating the missing values (based on gender), females have a higher chance of playing cricket than males.
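The effect can be sketched in plain Python. The data below is hypothetical (the article's original table is an image, so its exact counts are not reproduced here), chosen so that complete-case analysis and gender-based imputation point in opposite directions:

```python
# Hypothetical yes/no "plays cricket" responses; None marks a missing answer.
male = ["Y", "Y", "Y", "N"]            # fully observed
female = ["Y", "Y", "N", None, None]   # two non-responses

def play_rate(values):
    """Fraction of 'Y' among the non-missing responses."""
    known = [v for v in values if v is not None]
    return sum(v == "Y" for v in known) / len(known)

def mode(values):
    """Most frequent non-missing value."""
    known = [v for v in values if v is not None]
    return max(set(known), key=known.count)

# Untreated (complete-case) analysis: males look more likely to play.
before = (play_rate(male), play_rate(female))

# After imputing the missing female answers with the female mode ("Y"),
# the conclusion flips: females look more likely to play.
female_imputed = [v if v is not None else mode(female) for v in female]
after = (play_rate(male), play_rate(female_imputed))
```

The point is not the particular numbers but that the treatment of missing values changes the conclusion drawn from the same data.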

 

Why does my data have missing values?

We looked at the importance of treating missing values in a dataset. Now, let's identify the reasons why these missing values occur. They may arise at two stages:

  1. Data Extraction: It is possible that there are problems with the extraction process. In such cases, we should double-check the data with the data guardians. Hashing procedures can also be used to verify that the extraction is correct. Errors at the data extraction stage are typically easy to find and can be corrected easily as well.
  2. Data collection: These errors occur at the time of data collection and are harder to correct. They can be categorized into four types:
    • Missing completely at random: The probability of a value being missing is the same for all observations. For example: respondents decide whether to declare their earnings by tossing a fair coin; if heads comes up, the respondent declares his/her earnings, otherwise not. Here, every observation has an equal chance of having a missing value.
    • Missing at random: The probability of a value being missing varies with the values/levels of other input variables. For example: when collecting data on age, females have a higher missing-value rate than males.
    • Missing that depends on unobserved predictors: The missing values are not random and are related to an unobserved input variable. For example: in a medical study, if a particular diagnostic test causes discomfort, there is a higher chance of dropping out of the study. This missingness is not random unless we have recorded "discomfort" as an input variable for all patients.
    • Missing that depends on the missing value itself: The probability of a value being missing is directly correlated with the value itself. For example: people with higher or lower incomes are more likely to withhold their earnings.

 

Methods to treat Missing Values

  1. Deletion: It is of two types: listwise deletion and pairwise deletion.
    • In listwise deletion, we delete observations where any of the variables is missing. Simplicity is one of the major advantages of this method, but it reduces the power of the model because it reduces the sample size.
    • In pairwise deletion, we perform each analysis with all the cases in which the variables of interest are present. The advantage of this method is that it keeps as many cases as possible available for each analysis; the disadvantage is that it uses a different sample size for different variables.

    • Deletion methods are appropriate when the data is "missing completely at random"; otherwise, non-random missing values can bias the model output.
  2. Mean/Mode/Median Imputation: Imputation fills in the missing values with estimated ones. The objective is to use the relationships that can be identified in the valid values of the dataset to help estimate the missing values. Mean/mode/median imputation is one of the most frequently used methods: it replaces the missing data for a given attribute with the mean or median (quantitative attribute) or mode (qualitative attribute) of all the known values of that variable. It is of two types:
    • Generalized imputation: We calculate the mean or median of all non-missing values of the variable and replace each missing value with it. In the table above, the variable "Manpower" has missing values, so we take the average of all its non-missing values (28.33) and substitute it for each missing value.
    • Similar case imputation: We calculate the average of the non-missing values separately for gender "Male" (29.75) and "Female" (25), then replace each missing value based on gender: 29.75 for males and 25 for females.
  3. Prediction Model: A prediction model is one of the more sophisticated methods for handling missing data. Here, we build a predictive model to estimate the values that will substitute for the missing data. We divide the dataset into two sets: one with no missing values for the variable and one with missing values. The first set becomes the training set; the second, in which the variable with missing values is treated as the target, becomes the test set. We then train a model to predict the target from the other attributes of the training set and use it to populate the missing values of the test set. Regression, ANOVA, logistic regression and various other modeling techniques can be used. This approach has two drawbacks:
    1. The model-estimated values are usually better behaved than the true values.
    2. If the other attributes in the dataset have no relationship with the attribute that has missing values, the model will not estimate the missing values precisely.
  4. KNN Imputation: In this method, the missing values of an observation are imputed using a given number of observations that are most similar to it, where the similarity of two observations is determined by a distance function. The method has some notable advantages and disadvantages.
    • Advantages:
      • k-nearest neighbours can predict both qualitative and quantitative attributes
      • No predictive model needs to be built for each attribute with missing data
      • Attributes with multiple missing values are easily treated
      • The correlation structure of the data is taken into consideration
    • Disadvantages:
      • The KNN algorithm is very time-consuming on large databases: it searches the entire dataset for the most similar instances.
      • The choice of k is critical. A higher value of k includes neighbours that are significantly different from the observation we need, whereas a lower value of k means missing out on significant neighbours.
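Listwise and pairwise deletion (method 1) can be sketched with pandas; the frame and its values are purely illustrative:

```python
import numpy as np
import pandas as pd

# Toy frame with missing entries scattered across columns.
df = pd.DataFrame({
    "age":      [25, 30, np.nan, 40, 35],
    "income":   [50, np.nan, 60, 80, 70],
    "manpower": [3, 4, 5, np.nan, 6],
})

# Listwise deletion: drop every row that has any missing value.
listwise = df.dropna()  # only the fully observed rows survive

# Pairwise deletion: each analysis keeps all rows where *its* variables
# are present, so different analyses see different sample sizes.
n_age_income = len(df[["age", "income"]].dropna())
n_age_manpower = len(df[["age", "manpower"]].dropna())
```

Note how listwise deletion keeps only two of the five rows, while each pairwise analysis retains three: exactly the trade-off described above.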
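Generalized and similar case imputation (method 2) in pandas; the data is hypothetical, but chosen so that the group means match the article's 29.75 (Male) and 25 (Female). The overall mean differs slightly from the article's 28.33 because the underlying table is not reproduced here:

```python
import numpy as np
import pandas as pd

# Hypothetical manpower values; the last Male and Female entries are missing.
df = pd.DataFrame({
    "gender":   ["Male", "Male", "Male", "Male",
                 "Female", "Female", "Male", "Female"],
    "manpower": [25, 28, 32, 34, 23, 27, np.nan, np.nan],
})

# Generalized imputation: every missing value gets the overall mean.
generalized = df["manpower"].fillna(df["manpower"].mean())

# Similar case imputation: each missing value gets the mean of its own
# gender group (29.75 for Male, 25 for Female).
similar = df.groupby("gender")["manpower"].transform(
    lambda s: s.fillna(s.mean())
)
```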
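The prediction model approach (method 3) might look like the following minimal sketch with scikit-learn's LinearRegression; the column names and numbers are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical frame: "experience" is fully observed, "salary" is the
# variable with missing values.
df = pd.DataFrame({
    "experience": [1, 2, 3, 4, 5, 6],
    "salary":     [30.0, 35.0, 40.0, 45.0, np.nan, np.nan],
})

# Split into two sets: rows where the target is observed (training set)
# and rows where it is missing (the "test" set to be filled in).
train = df[df["salary"].notna()]
test = df[df["salary"].isna()]

# Fit a model predicting the target from the other attributes...
model = LinearRegression().fit(train[["experience"]], train["salary"])

# ...and populate the missing values with its predictions.
df.loc[df["salary"].isna(), "salary"] = model.predict(test[["experience"]])
```

Any other model (ANOVA, logistic regression for a categorical target, etc.) can be slotted in the same way.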
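KNN imputation (method 4) is available off the shelf as scikit-learn's KNNImputer; a minimal sketch on a made-up matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small numeric matrix with one missing entry.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 4.0],
    [8.0, 9.0],
])

# Each missing entry is replaced by the mean of that feature over the
# k rows nearest under a nan-aware Euclidean distance.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
# Row 1's nearest neighbours are rows 0 and 2, so the gap becomes mean(2, 4) = 3.
```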

 

End note

In this article, we have discussed the importance of treating missing values in a dataset, the reasons why missing values occur, and the basic methods of treating them. In practice, the treatment can be a hybrid of the above methods, or something more complex. In the next article of this series, we will look at outlier treatment.

Have you ever dealt with a similar situation? We are excited to hear about your experiences and learnings. Talk with us in the comments section below.



Responses From Readers


Hassine Saidane

Thanks for the insights. Below are some suggestions to think about:
  1. How about building a predictive model using observations that have no missing values and all the variables, then estimating the missing values?
  2. As another alternative to KNN, cluster all the observations (on all variables without missing values) and estimate the missing values using the observations in the missing value's cluster. KNN can be more efficient here, or just use the corresponding cluster mean/median, etc., as the estimate. A CART model can also be efficient.
  3. What if a case (row) or a variable (column) has a lot of missing values? It would be wise to set a threshold (e.g. 20% missing) to weed out "bad" rows/columns.
  4. How about outliers? How should they be identified and treated? Dropped or adjusted?
Thanks

Vinayak

Thanks a lot for the useful information. I have read that multiple imputation handles missing values better than the other procedures.

Multiple Imputation: The problem with all of the other techniques is that they do not account for imputation variability, that is, the uncertainty in our estimates due to the imputed values; this variability must be reflected in the standard errors of our estimates. Multiple imputation does this by running each statistical procedure multiple times with different imputed values each time. The results from each run are then pooled to produce final estimates which account for imputation variability.

Advantages:
  a. Variability is more accurate with multiple imputations for each missing value.
  b. It considers variability due to sampling AND variability due to imputation.
  c. SAS: Proc MI.

Multiple imputation procedure:
  (a) Impute observations for all missing values in a variable V: use random samples from a normal distribution with the mean and SD of V. (Or use regression to predict the mean and SD, then sample from this normal distribution.)
  (b) Do the imputation M times, creating M complete data sets.
  (c) Analyze each of the M complete data sets.
  (d) Combine the results of the M analyses to draw conclusions.

Notes: Listwise deletion often works the best. However, under certain conditions, maximum likelihood methods and multiple imputation methods can work better. "MI [multiple imputation]" is a procedure by which missing data are imputed several times (e.g. using regression imputation) to produce several different complete-data estimates of the parameters. The parameter estimates from each imputation are then combined to give an overall estimate of the complete-data parameters as well as reasonable estimates of the standard errors. "Maximum likelihood (ML) approaches" operate by estimating the set of parameters that maximizes the probability of getting the data that was observed.

Thanks, Vinayak