Baby steps in Python – Exploratory analysis in Python (using Pandas)

kunal Last Updated : 17 Apr, 2015

In the last 2 posts of this series, we looked at how to install Python with the iPython interface, and at several useful libraries and data structures available in Python. If you have not gone through these posts and are new to Python, I would recommend that you go through them before going ahead.

In order to explore our data further, let me introduce you to another animal (as if Python was not enough!) – Pandas

pandas

Image Source: Wikipedia

Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. In this tutorial, we will use Pandas to read a data set from a Kaggle competition, perform exploratory analysis and build our first basic classification algorithm for solving this problem.

 

Introduction to Series and Dataframes

Series can be understood as a 1 dimensional labelled / indexed array. You can access individual elements of this series through these labels.
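To make this concrete, here is a minimal sketch of a Series. The passenger names and fare values are made up for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical fares keyed by passenger name (made-up values)
fares = pd.Series([7.25, 71.28, 8.05], index=["Owen", "Florence", "William"])

print(fares["Florence"])     # access an element by its label
print(fares.iloc[0])         # access an element by its position
print(fares[fares > 8.0])    # filtering keeps the labels attached to the values
```

Note that the labels travel with the values, so a filtered Series still tells you which passenger each fare belongs to.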

A dataframe is similar to an Excel worksheet – you have column names referring to columns and you have rows, which can be accessed by row numbers. The essential difference is that, in a dataframe, the column names and row numbers are known as the column index and the row index.

Series and dataframes form the core data model for Pandas in Python. Data sets are first read into dataframes and then various operations (e.g. group by, aggregation etc.) can be applied very easily to their columns.
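A small sketch of these ideas is shown below. The dataframe `toy` and its rows are invented for illustration; only the column names mimic the Titanic data.

```python
import pandas as pd

# Made-up rows that mimic a few Titanic columns (illustrative only)
toy = pd.DataFrame({
    "Pclass": [1, 3, 3, 2],
    "Sex": ["female", "male", "female", "male"],
    "Fare": [80.0, 8.0, 10.0, 13.0],
})

print(toy.columns.tolist())   # the column index
print(toy.index.tolist())     # the row index (0, 1, 2, ... by default)
print(toy.loc[1, "Fare"])     # one cell, addressed by row label and column name

# Group the rows by class, then aggregate the Fare column within each group
mean_fare = toy.groupby("Pclass")["Fare"].mean()
print(mean_fare)
```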

 

Kaggle dataset – Titanic: Machine Learning from Disaster

You can download the dataset from Kaggle. Here is the description of variables as provided by Kaggle:

[stextbox id = "grey"]
VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
[/stextbox]

Let the exploration begin

To begin, start the iPython interface in Inline Pylab mode by typing the following on your terminal / Windows command prompt:

[stextbox id = "grey"]ipython notebook --pylab=inline[/stextbox]

This opens up iPython notebook in pylab environment, which has a few useful libraries already imported. Also, you will be able to plot your data inline, which makes this a really good environment for interactive data analysis. You can check whether the environment has loaded correctly, by typing the following command (and getting the output as seen in the figure below):

[stextbox id = "grey"]plot(arange(5))[/stextbox]

ipython_pylab_check

I am currently working in Linux, and have stored the dataset in the following location:

 /home/kunal/Downloads/kaggle/train.csv

 

Importing libraries and the data set:

Following are the libraries we will use during this tutorial:

  • numpy
  • matplotlib
  • pandas

You can read a brief description about each of these libraries here. Please note that you do not need to import matplotlib and numpy in the Pylab environment, because it loads them already. I have still kept them in the code, in case you use the code in a different environment.

After importing the libraries, you read the dataset using the function read_csv(). This is what the code looks like at this stage:

[stextbox id = "grey"]
import pandas as pd
import numpy as np
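import matplotlib as plt
import matplotlib.pyplot #Loads the pyplot submodule, so that plt.pyplot also works outside the Pylab environment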

df = pd.read_csv("/home/kunal/Downloads/kaggle/train.csv") #Reading the dataset in a dataframe using Pandas
[/stextbox]
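Right after reading a file, it is worth checking that it loaded the way you expect. The sketch below uses a tiny made-up CSV held in a string (via `io.StringIO`) as a stand-in for train.csv, so that it runs without the Kaggle file. The rows are invented, and `sample` is just an illustrative name.

```python
import io
import pandas as pd

# A tiny made-up CSV standing in for train.csv (not real Titanic rows)
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,,7.92
"""

# read_csv accepts file paths as well as file-like objects
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # Age is read as a float column because one value is missing
```

With the real file, the same two calls on `df` tell you how many passengers and columns were read, and which columns pandas treated as numeric.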

 

Quick data exploration:

Once you have read the dataset, you can have a look at the first few rows by using the function head()

[stextbox id = "grey"]df.head(10)[/stextbox]

 

data_head

This should print 10 rows. Alternatively, you can look at more rows by printing the dataset.

Next, you can look at a summary of the numerical fields by using the describe() function

[stextbox id = "grey"]df.describe()[/stextbox]

data_describe

The describe() function provides count, mean, standard deviation (std), min, quartiles and max in its output (Read this article to refresh basic statistics to understand population distribution)

Here are a few inferences you can draw by looking at the output of the describe() function:

  1. Age has (891 – 714) 177 missing values.
  2. About 38% of passengers survived the tragedy. How do we know? The mean of the Survived field is 0.38 (remember, Survived has value 1 for those who survived and 0 otherwise).
  3. By looking at the percentiles of Pclass, you can see that more than 50% of passengers belong to class 3.
  4. The age distribution seems to be in line with expectation. The same goes for SibSp and Parch.
  5. The fare has some values of 0, indicating the possibility of some free tickets or data errors. On the other extreme, 512 looks like a possible outlier / error.
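The first two inferences can also be checked directly in code. This is a minimal sketch on made-up rows (the dataframe `toy` is hypothetical); on the real data you would run the same calls on `df`.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; swap in the df loaded from train.csv
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
})

# count() skips missing values, so rows minus count gives the number missing
missing_age = len(toy) - toy["Age"].count()
print(missing_age)
print(toy["Age"].isnull().sum())   # the same number, counted directly

# Survived holds only 0 and 1, so its mean is the share of survivors
print(toy["Survived"].mean())
```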

In addition to these statistics, you can also look at the median of these variables and compare it with the mean to spot possible skew in the data. The median can be found by:

[stextbox id = "grey"]df['Age'].median()[/stextbox]
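To see why comparing mean and median is useful, here is a small sketch with made-up fares that include one very large value. The numbers are invented for illustration.

```python
import pandas as pd

# Made-up fares with one very large value (illustrative only)
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 26.0, 512.0])

print(fares.mean())     # pulled upward by the single 512 fare
print(fares.median())   # the middle value, barely affected by the extreme fare
print(fares.skew())     # a positive value suggests a long right tail
```

When the mean sits well above the median, as it does here, a few large values are likely stretching the distribution to the right.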

 

For the non-numerical fields (e.g. Sex, Embarked etc.), we can look at unique values to understand whether they make sense or not. Since Name is a free-flowing field, we will exclude it from this analysis. Unique values can be printed by the following command:

[stextbox id = "grey"]df['Sex'].unique()[/stextbox]

Similarly, we can look at unique values of port of embarkation.
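Beyond the list of unique values, it often helps to see how frequent each value is. The sketch below uses `value_counts()` on a made-up Series; the values are invented and only mimic the Embarked column.

```python
import numpy as np
import pandas as pd

# Made-up port codes, including one missing value
embarked = pd.Series(["S", "C", "S", "Q", np.nan, "S"])

print(embarked.unique())   # distinct values, with NaN included

# Frequency of each value; dropna=False keeps the missing values visible
counts = embarked.value_counts(dropna=False)
print(counts)
```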

 

Distribution analysis:

Now that we are familiar with basic data characteristics, let us study the distribution of various variables. Let us start with the numeric variables – namely Age and Fare.

We plot their histograms using the following commands:

[stextbox id = "grey"]fig = plt.pyplot.figure()
ax = fig.add_subplot(111)
ax.hist(df['Age'].dropna(), bins = 10, range = (df['Age'].min(),df['Age'].max())) #dropna() removes missing ages, which can otherwise leave the plot empty
plt.pyplot.title('Age distribution')
plt.pyplot.xlabel('Age')
plt.pyplot.ylabel('Count of Passengers')
plt.pyplot.show()
[/stextbox]

and

[stextbox id = "grey"]fig = plt.pyplot.figure()
ax = fig.add_subplot(111)
ax.hist(df['Fare'], bins = 10, range = (df['Fare'].min(),df['Fare'].max()))
plt.pyplot.title('Fare distribution')
plt.pyplot.xlabel('Fare')
plt.pyplot.ylabel('Count of Passengers')
plt.pyplot.show()
[/stextbox]

histogram_age
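If you want the numbers behind a histogram rather than the picture, numpy can compute the bin counts directly. This is a sketch on made-up ages, with the bin count and range chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value
ages = pd.Series([2.0, 19.0, 22.0, 24.0, 31.0, 38.0, np.nan, 54.0, 80.0])

clean = ages.dropna()   # remove missing values before binning

# 4 equal-width bins between 0 and 80; the last bin includes its right edge
counts, edges = np.histogram(clean, bins=4, range=(0, 80))
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f}-{right:.0f}: {n}")
```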

Next, we look at box plots to understand the distributions. Box plot for fare can be plotted by:

[stextbox id = "grey"]df.boxplot(column='Fare')[/stextbox]

bloxplot_fare1

This shows a lot of outliers. Part of this can be driven by the fact that we are looking at fare across the 3 passenger classes. Let us segregate them by passenger class:

[stextbox id = "grey"]df.boxplot(column='Fare', by = 'Pclass')[/stextbox]

bloxplot_fare2
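The points drawn as outliers in a box plot are, by default in matplotlib, those lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies the upper part of that rule per class, so you can list the rows instead of just seeing dots. The data and the helper `upper_fence` are made up for illustration.

```python
import pandas as pd

# Made-up fares for illustration (not real Titanic rows)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    "Fare":   [30.0, 50.0, 60.0, 80.0, 500.0, 7.0, 8.0, 8.0, 9.0, 40.0],
})

def upper_fence(fares):
    """Upper whisker limit of a standard box plot: Q3 + 1.5 * IQR."""
    q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
    return q3 + 1.5 * (q3 - q1)

# Compute the fence within each class and broadcast it back onto the rows
fence = toy.groupby("Pclass")["Fare"].transform(upper_fence)

# Keep only the fares that lie above the fence of their own class
outliers = toy[toy["Fare"] > fence]
print(outliers)
```

A fare of 40 is unremarkable in first class but stands out in third class, which is exactly what splitting the box plot by Pclass is meant to reveal.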

Clearly, both Age and Fare require some amount of data munging. Age has about 20% missing values (177 of 891), while Fare has a few outliers, which demand deeper understanding. We will take this up later (in the next tutorial).

 

Categorical variable analysis:

Now that we understand distributions for Age and Fare, let us understand categorical variables in more detail. The following code plots the distribution of passengers by Pclass and their probability of survival:

[stextbox id = "grey"]temp1 = df.groupby('Pclass').Survived.count() #Number of passengers in each class
temp2 = df.groupby('Pclass').Survived.mean() #Survival rate in each class (the mean of a 0/1 field)
fig = plt.pyplot.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Pclass')
ax1.set_ylabel('Count of Passengers')
ax1.set_title("Passengers by Pclass")
temp1.plot(kind='bar')

ax2 = fig.add_subplot(122)
temp2.plot(kind = 'bar')
ax2.set_xlabel('Pclass')
ax2.set_ylabel('Probability of Survival')
ax2.set_title("Probability of survival by class")
[/stextbox]

categorical_pclass

You can plot similar graphs by Sex and port of embarkation.
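For example, the numbers behind a similar pair of graphs by Sex could be computed in one step, as in this sketch. The rows are made up; on the real data you would group `df` instead of `toy`.

```python
import pandas as pd

# Made-up rows for illustration; use the df loaded from train.csv in practice
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# count = passengers per group, mean = survival rate (Survived is 0/1)
by_sex = toy.groupby("Sex")["Survived"].agg(["count", "mean"])
print(by_sex)
```

Each column of `by_sex` can then be drawn with `.plot(kind='bar')`, in the same way as temp1 and temp2 above.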

Alternatively, these two plots can be combined into a single stacked chart:

[stextbox id = "grey"]temp3 = pd.crosstab([df.Pclass, df.Sex], df.Survived.astype(bool))
temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)[/stextbox]

crosstab_class_sex

You can also add port of embarkation into the mix:

crosstab_class_sex_port

If you have not realized already, we have just created two basic classification algorithms here, one based on Pclass and Sex, and the other on 3 categorical variables (including port of embarkation). You can quickly code this to create your first submission on Kaggle.
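One possible way to code such a rule is sketched below: predict survival for a group when more than half of that group survived in the training data. The 50% threshold, the variable names and all rows are my assumptions for illustration, and the output is not in Kaggle's submission format.

```python
import pandas as pd

# Made-up training rows (illustrative only, not real Titanic data)
train = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Survival rate for each (Pclass, Sex) group
rates = train.groupby(["Pclass", "Sex"])["Survived"].mean()

# Rule: predict survival when the group's rate is above 50%
rule = (rates > 0.5).astype(int).rename("Predicted").reset_index()

# Hypothetical unseen passengers; the merge looks up the rule for each one
new_passengers = pd.DataFrame({"Pclass": [1, 3], "Sex": ["female", "male"]})
predictions = new_passengers.merge(rule, on=["Pclass", "Sex"], how="left")
print(predictions)
```

A group that never appears in the training data would get a missing prediction after the left merge, so a real submission would need a fallback for that case.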

End Notes:

In this post, we saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) has increased by now, given how much help the library can provide in analyzing datasets.

We will start the next tutorial from this stage, where we will explore Age and Fare variables further, perform data munging and create a dataset for applying various modeling techniques. If you are following this series, we have covered a lot of ground in this tutorial. I would strongly urge that you take another dataset and problem and go through an independent example before we publish the next post in this series.

If you come across any difficulty while doing so, or you have any thoughts / suggestions / feedback on the post, please feel free to post them through comments below.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on twitter or like our facebook page.

Responses From Readers

Gaurav

Thanks Kunal, Its a wonderful article to get started with Data Analysis in Python. I am getting empty plot for Age distribution. Is this because df['Age'].min() = 0.41999999999999998

Phil Renaud

Really excellent overview - thank you! Any recommendations for which data set to look at next?

Stas

Good article! Thanks! It's better to avoid using of "--pylab=inline" - see details here: http://carreau.github.io/posts/10-No-PyLab-Thanks.ipynb.html