How to Deal with Missing Data in Python?


Data scientists work with large datasets to produce better analyses. Missing values in the rows and columns of a dataset can lead to wrong predictions, so dealing with missing data is a major task for every data scientist who wants correct predictions. It is one of the top data preprocessing steps. If you want to know more about them, follow our post on the topic.

Top 4 Data Pre Processing Steps.

In this “How to” tutorial, you will learn the following things.

  1. How to find the missing values in the dataset.
  2. The methods for filling the missing values.
  3. How to find the total number of missing values in the dataset.
  4. How to filter out the missing values.

Follow the step-by-step methods below to learn more about this lesson.

Step 1: Import the necessary libraries.

import numpy as np
import pandas as pd
from pandas import Series,DataFrame

Step 2: Read the dataset using Pandas.

For this example, I am reading the sales dataset. The data is complete, but for demonstration purposes I am setting some values in the Sales and Price columns to missing using NumPy's nan value. If your dataset already has missing values, move on to Step 3.

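# Raw string, so the backslashes in the Windows path are not read as escape sequences.
address = r"C:\Users\skrsuDesktop\Jypter\data\sales_data.csv"
sales_data = pd.read_csv(address)
# Rename the columns so they are easier to reference.
sales_data.columns = ["orderNumber", "Quantity", "Price", "Sales", "Date"]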

[Image: the dataset read with Pandas]

Let’s set positions 2 and 6 of the Price column and positions 1, 4 and 7 of the Sales column to NaN. The iloc indexer is position-based and counts from zero, so these are index positions rather than ordinal row numbers.
Use the following code to select the specific rows and change their values to NaN.

missing = np.nan
sales_data.iloc[[2,6],2] = missing

This puts the missing value at positions 2 and 6 of the Price column.

sales_data.iloc[[1,4,7],3] = missing

This puts the missing value at positions 1, 4 and 7 of the Sales column.

[Image: sales data with the missing values]
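If you do not have the sales CSV file at hand, the same step can be tried on a small stand-in DataFrame. The sketch below is only an illustration: every value in it is invented, and only the column names come from this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sales_data.csv; all values are invented for illustration.
sales_data = pd.DataFrame({
    "orderNumber": [10100, 10101, 10102, 10103, 10104, 10105, 10106, 10107],
    "Quantity": [30, 50, 22, 49, 25, 29, 48, 22],
    "Price": [95.70, 81.35, 94.74, 83.26, 100.00, 96.66, 86.13, 98.57],
    "Sales": [2871.00, 4067.50, 2084.28, 4079.74, 2500.00, 2803.14, 4134.24, 2168.54],
    "Date": ["2/24/2003", "5/7/2003", "7/1/2003", "8/25/2003",
             "10/10/2003", "10/28/2003", "11/11/2003", "11/18/2003"],
})

missing = np.nan

# iloc is position-based and zero-indexed: column 2 is Price, column 3 is Sales.
sales_data.iloc[[2, 6], 2] = missing
sales_data.iloc[[1, 4, 7], 3] = missing

print(sales_data)
```

The printed frame shows NaN in the Price column at positions 2 and 6 and in the Sales column at positions 1, 4 and 7.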

Step 3: Find whether there is missing data in the dataset.

Use the isnull() method followed by sum(), as in sales_data.isnull().sum(), to find the missing values.

It will tell you the total number of missing values in each column.

[Image: checking missing values with Pandas]
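A minimal sketch of the check, assuming a small invented DataFrame with the same gaps as in this tutorial (two in Price, three in Sales). Chaining a second sum() gives the total for the whole dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 2 missing values in Price, 3 in Sales.
sales_data = pd.DataFrame({
    "Price": [95.70, 81.35, np.nan, 83.26, 100.00, 96.66, np.nan, 98.57],
    "Sales": [2871.00, np.nan, 2084.28, 4079.74, np.nan, 2803.14, 4134.24, np.nan],
})

# isnull() marks each cell as True/False; sum() counts the True values per column.
missing_per_column = sales_data.isnull().sum()

# Summing the per-column counts gives the total for the whole dataset.
total_missing = int(missing_per_column.sum())

# A quick yes/no answer to "is anything missing at all?"
has_missing = bool(sales_data.isnull().values.any())

print(missing_per_column)
print(total_missing, has_missing)
```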

Step 4: Filling the missing values.

To do this, use the Pandas DataFrame fillna() method. You can fill the values in three ways.
Suppose I want to fill the missing values with 0. Then I will call sales_data.fillna(0), passing 0 as the argument.

[Image: missing values filled with 0]
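As a small illustration of filling with a constant, here is a sketch on an invented four-row DataFrame. Note that fillna() returns a new DataFrame and leaves the original untouched unless you assign the result back.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column.
sales_data = pd.DataFrame({
    "Price": [95.70, 81.35, np.nan, 83.26],
    "Sales": [2871.00, np.nan, 2084.28, 4079.74],
})

# Replace every NaN with 0; the result is a new DataFrame.
zero_filled = sales_data.fillna(0)

print(zero_filled)
```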

You can also fill the missing values with the mean of the corresponding column. In this case, I will fill the missing values with the means of Price and Sales using the fillna() method, but instead of passing 0 as the argument, I will pass a dictionary that maps each column name to its fill value.

#mean of the price and Sales Column
mean_price= sales_data["Price"].mean()
mean_sales = sales_data["Sales"].mean()


filled_sales_data= sales_data.fillna({"Price":mean_price,"Sales":mean_sales})


[Image: missing values filled with the mean]

This fills all the missing values of the Price and Sales columns with the mean of the corresponding column.

You can also fill each missing value with the last non-null value in the same column using fillna(method="ffill"). Recent Pandas releases deprecate the method argument in favor of the equivalent sales_data.ffill().

[Image: missing values filled with the last non-null value]
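A short sketch of forward filling on invented data, using the ffill() method. Keep in mind that a missing value in the very first row has no earlier value to copy, so it would stay NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column.
sales_data = pd.DataFrame({
    "Price": [95.70, 81.35, np.nan, 83.26],
    "Sales": [2871.00, np.nan, 2084.28, 4079.74],
})

# Each NaN takes the last non-null value above it in the same column.
forward_filled = sales_data.ffill()

print(forward_filled)
```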

Step 5: Filtering out the null data in a large dataset.

Suppose you have a large dataset, or the dataset has columns or rows that are mostly null. Then instead of filling their values with fillna(), you should delete those rows and columns with dropna().

If you want to delete the rows, you can simply call dropna() without any axis argument, as in sales_data.dropna(). This drops every row that contains at least one null value.

[Image: rows with null values deleted]
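Here is a minimal sketch of row-wise deletion, assuming an invented DataFrame with gaps at the same positions as in this tutorial. Only the rows without any NaN survive.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Price is missing at positions 2 and 6, Sales at 1, 4 and 7.
sales_data = pd.DataFrame({
    "Price": [95.70, 81.35, np.nan, 83.26, 100.00, 96.66, np.nan, 98.57],
    "Sales": [2871.00, np.nan, 2084.28, 4079.74, np.nan, 2803.14, 4134.24, np.nan],
})

# Default behaviour: drop every row that has at least one NaN.
complete_rows = sales_data.dropna()

print(complete_rows)
```

Five of the eight rows contain a gap, which shows how quickly dropna() can shrink a dataset.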

To delete columns instead, pass axis=1 as the argument, as in dropna(axis=1). You should delete a column only when most of its rows are null. That is why most data scientists use the default, row-wise dropna().

[Image: columns with null values deleted]
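The column-wise variant can be sketched the same way. In this invented example the Quantity column is complete, so it is the only one that remains.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Quantity is complete, Price and Sales each have a gap.
sales_data = pd.DataFrame({
    "Quantity": [30, 50, 22, 49],
    "Price": [95.70, 81.35, np.nan, 83.26],
    "Sales": [2871.00, np.nan, 2084.28, 4079.74],
})

# axis=1 drops every column that has at least one NaN.
complete_columns = sales_data.dropna(axis=1)

print(complete_columns)
```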

In some cases all the values in a row are null. In that case, use dropna(how="all"), which drops only the rows in which every value is null. In our dataset there are no such rows, so the output is simply the original sales data, still containing the null values.

[Image: dropna with how="all"]
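To see the difference that how="all" makes, the sketch below uses an invented DataFrame in which one row is entirely null. This row does not exist in the tutorial's sales data; it is added here only for contrast.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the row at position 1 is null in every column.
example = pd.DataFrame({
    "Price": [95.70, np.nan, np.nan, 83.26],
    "Sales": [2871.00, np.nan, 2084.28, np.nan],
})

# how="all" removes only the rows where every value is NaN.
dropped_all_null = example.dropna(how="all")

# The default (how="any") removes rows with at least one NaN.
dropped_any_null = example.dropna()

print(dropped_all_null)
print(dropped_any_null)
```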


Data cleaning is a major step before building machine learning models for better predictions. Pandas is a popular library for cleaning raw data and turning it into structured data.

We hope this tutorial has given you a basic understanding of “How to Deal with Missing Data in Python”. If you want to suggest a tutorial for us to write, then contact us. You can also subscribe or like us for faster learning.


Data Science Learner Team.


What is a Student T Test in Statistics? An Overview for the Data Scientist


To find the z score in the normal distribution, you have to know the population standard deviation. Without it, you cannot compute the z score or relate the sample deviation to the population deviation. But in most real-world problems you do not know the population standard deviation. The Student's t-distribution came into existence to solve problems like this.

With the Student's t-distribution you work with small samples when you don't know the population standard deviation. You use the T-Table to determine whether there is a significant difference between two sets of data. The z-test relies on the known population standard deviation to compute the score, whereas the t-test uses the sample variance in its place.

Types of the Student T-test

One-Sample T Test

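Here you test the null hypothesis that the population mean is equal to a particular value, using the sample mean as evidence. For example, you test whether the mean test score of a sample of students is equal to the mean score of all the students in the population. The formula for the one-sample t-test is given below.

$$t = \frac{\overline{x} - \mu_{0}}{\sqrt{S^{2} / n}}$$

where $\overline{x}$ is the sample mean, $\mu_{0}$ is the hypothesized population mean, $S^{2}$ is the sample variance and $n$ is the sample size.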

How to calculate the Student t-test for one sample?

Just as you look up the z score in the Z table, you look up the critical value in the T-Table. This value depends on the sample size (n), through the degrees of freedom n - 1, and on the level of significance, $\alpha$ (the default is 0.05).
After that, you compare your calculated t value with the T-Table value to decide whether to reject or fail to reject the null hypothesis. If the absolute t value is greater than the critical value, you reject the null hypothesis; otherwise you fail to reject it.
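The one-sample calculation can be sketched with Python's standard library alone. The scores and the hypothesized mean below are invented for illustration, and the critical value is the two-tailed T-Table entry for 9 degrees of freedom at a 0.05 level of significance.

```python
import math
import statistics

# Hypothetical test scores for a sample of n = 10 students.
scores = [72, 75, 68, 80, 77, 74, 69, 81, 76, 78]
mu_0 = 70  # hypothesized population mean (an assumption for this example)

n = len(scores)
sample_mean = statistics.mean(scores)
sample_variance = statistics.variance(scores)  # uses the n - 1 denominator

# t = (sample mean - hypothesized mean) / standard error of the mean
t_value = (sample_mean - mu_0) / math.sqrt(sample_variance / n)
df = n - 1

# Two-tailed critical value for df = 9 and alpha = 0.05, read from a T-Table.
t_critical = 2.262

reject_null = abs(t_value) > t_critical
print(f"t = {t_value:.3f}, df = {df}, reject H0: {reject_null}")
```

With these numbers the t value is about 3.638, which is above 2.262, so the null hypothesis is rejected.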

Independent Two-Sample T-Test

In this t-test you test the null hypothesis that the means of two separate populations are equal, using the two sample means $\overline{x}_{1}$ and $\overline{x}_{2}$.

For example, you want to check whether the mean test scores of two samples of students are statistically equal or not. Note that this test is called the independent two-sample t-test because the samples are independent of each other.
The calculation for this test differs from the one-sample test and depends on which of the following cases applies.

  1. Equal sample size (n) and equal variance ($S^{2}$)
  2. Unequal sample size (n) and equal variance ($S^{2}$)
  3. Equal or unequal sample size (n) and unequal variances ($S_{1}^{2}$ and $S_{2}^{2}$)

The third case is the most common situation.
Note that comparing one thing to another here means finding the ratio between them. In the two-sample t-test you find the t ratio, which is the difference between the two sample means divided by its standard error. The formula is given below.

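$$t = \frac{\overline{x}_{1} - \overline{x}_{2}}{\sqrt{\frac{S_{1}^{2}}{n_{1}} + \frac{S_{2}^{2}}{n_{2}}}}$$

where $\overline{x}_{1}$ and $\overline{x}_{2}$ are the sample means, $S_{1}^{2}$ and $S_{2}^{2}$ are the sample variances, and $n_{1}$ and $n_{2}$ are the sample sizes.

The general formula for the degrees of freedom in this unequal-variance case is

$$df = \frac{\left(\frac{S_{1}^{2}}{n_{1}} + \frac{S_{2}^{2}}{n_{2}}\right)^{2}}{\frac{\left(S_{1}^{2}/n_{1}\right)^{2}}{n_{1}-1} + \frac{\left(S_{2}^{2}/n_{2}\right)^{2}}{n_{2}-1}}$$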
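For the unequal-variance case, the t ratio and the degrees of freedom can be computed as in the sketch below. It assumes the Welch form of the test and the Welch-Satterthwaite approximation for the degrees of freedom; the two samples are invented.

```python
import math
import statistics

# Hypothetical test scores; the samples differ in size and in spread.
sample_1 = [85, 90, 78, 92, 88, 76, 95, 89]
sample_2 = [80, 75, 82, 70, 78, 74]

n1, n2 = len(sample_1), len(sample_2)
mean_1, mean_2 = statistics.mean(sample_1), statistics.mean(sample_2)
var_1, var_2 = statistics.variance(sample_1), statistics.variance(sample_2)

# Each sample's variance divided by its size.
term_1, term_2 = var_1 / n1, var_2 / n2

# t ratio: difference of the means over the standard error of that difference.
t_value = (mean_1 - mean_2) / math.sqrt(term_1 + term_2)

# Welch-Satterthwaite approximation of the degrees of freedom.
df = (term_1 + term_2) ** 2 / (term_1 ** 2 / (n1 - 1) + term_2 ** 2 / (n2 - 1))

print(f"t = {t_value:.3f}, df = {df:.2f}")
```

The degrees of freedom are usually not a whole number here. When reading a printed T-Table, a common choice is to round down to the nearest whole number.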

Dependent Paired Sample T-test

It is used when the samples are dependent on each other. There are two cases: either one sample has been tested twice, or two samples have been matched or paired.
For example, you want to check whether a course has improved test scores by comparing the scores of the same group of students before and after taking the course.
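A paired test works on the difference within each pair, so it reduces to a one-sample t-test on those differences against a mean of zero. The sketch below assumes this standard formulation, which is not spelled out in this post, and uses invented before and after scores.

```python
import math
import statistics

# Hypothetical scores of the same 8 students before and after the course.
before = [62, 70, 68, 75, 80, 66, 73, 71]
after = [68, 74, 70, 79, 85, 65, 78, 76]

# One difference per student; the pairing is what makes the samples dependent.
differences = [a - b for a, b in zip(after, before)]

n = len(differences)
mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)  # sample standard deviation

# t = mean difference / standard error of the mean difference
t_value = mean_diff / (sd_diff / math.sqrt(n))
df = n - 1

print(f"mean difference = {mean_diff}, t = {t_value:.2f}, df = {df}")
```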

How to calculate the Student t-test for two independent samples?

Suppose you have two samples, Sample A and Sample B. Follow these steps to run a t-test on the two samples.

Step 1: Compare the means of the two samples, $\overline{x}_{1} - \overline{x}_{2}$. A large difference may already suggest which sample is better, but when the difference is small you have to calculate the t-test to tell whether it is significant.

Step 2: State the null and alternate hypotheses, and select the type of test and the degrees of freedom.

Step 3: Calculate the variance of each sample, $S_{1}^{2}$ for Sample A and $S_{2}^{2}$ for Sample B.

Step 4: Using all the calculated values, find the t value with the following formula.

$$t = \frac{\overline{x}_{1} - \overline{x}_{2}}{\sqrt{\frac{S_{1}^{2}}{n_{1}} + \frac{S_{2}^{2}}{n_{2}}}}$$

Here subscript 1 refers to Sample A and subscript 2 to Sample B, $\overline{x}$ is a sample mean, $S^{2}$ is a sample variance and $n$ is a sample size.

Step 5: Look up the critical value in the T-Table.

In the T-Table, you look up the type of test (one tail or two tails), the degrees of freedom and the level of significance.

Step 6: Compare the t value with the critical value.

[Image: t-test for two independent samples]

Step 7: After comparing, you can determine statistically whether to reject or fail to reject the null hypothesis.
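The seven steps can be walked through in code. The sketch below is one possible way to do it, not the only one: it assumes SciPy is installed, uses invented scores for Sample A and Sample B, and treats the variances as unequal. Instead of a printed T-Table, it takes the critical value from scipy.stats.t.ppf.

```python
import math
import statistics

from scipy import stats

# Hypothetical test scores for Sample A and Sample B.
sample_a = [85, 90, 78, 92, 88, 76, 95, 89]
sample_b = [80, 75, 82, 70, 78, 74]
alpha = 0.05  # level of significance for a two-tailed test

# Steps 1 and 3: sample means and sample variances.
mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
n_a, n_b = len(sample_a), len(sample_b)

# Step 4: t value; equal_var=False selects the unequal-variance (Welch) test.
result = stats.ttest_ind(sample_a, sample_b, equal_var=False)
t_value, p_value = result.statistic, result.pvalue

# Steps 2 and 5: degrees of freedom, then the two-tailed critical value.
term_a, term_b = var_a / n_a, var_b / n_b
df = (term_a + term_b) ** 2 / (term_a ** 2 / (n_a - 1) + term_b ** 2 / (n_b - 1))
t_critical = stats.t.ppf(1 - alpha / 2, df)

# Steps 6 and 7: compare and decide.
reject_null = abs(t_value) > t_critical
print(f"t = {t_value:.3f}, critical = {t_critical:.3f}, reject H0: {reject_null}")
```

Comparing the absolute t value with the critical value gives the same decision as checking whether the p-value is below the level of significance.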


The Student t-test is very useful when you have two samples and the difference between their means is very small. Because it uses the sample variance, you can compare the samples to find the best one. You can think of it as an upgrade of the z-test. Other things, such as the test type and the level of significance, are the same as for the z score.

We hope you have understood the Student's t-distribution and how to calculate it. If you want to add something to this tutorial or ask a question, please contact us. You can also subscribe to us and like the Data Science Learner page.
