In the last two posts of this series, we looked at how to install Python with the IPython interface, along with several useful libraries and data structures available in Python. If you are new to Python and have not gone through those posts, I would recommend reading them before going ahead.
In order to explore our data further, let me introduce you to another animal (as if Python was not enough!): Pandas.
Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. In this tutorial, we will use Pandas to read a data set from a Kaggle competition, perform exploratory analysis and build our first basic classification algorithm for solving this problem.
A Series can be understood as a one-dimensional labelled / indexed array. You can access individual elements of a series through these labels.
A dataframe is similar to a sheet in an Excel workbook: you have column names referring to columns, and you have rows, which can be accessed using row numbers. The essential difference is that, in a dataframe, the column names and row numbers are known as the column index and the row index.
Series and dataframes form the core data model for Pandas in Python. Data sets are first read into dataframes, and then various operations (e.g. group by, aggregation etc.) can be applied very easily to their columns.
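To make the two structures concrete, here is a minimal sketch. The names and values are made up for illustration and are not taken from the Kaggle data.

```python
import pandas as pd

# Hypothetical toy values, not taken from the Kaggle dataset
ages = pd.Series([22, 38, 26], index=["a", "b", "c"])
print(ages["b"])            # access an element of the series through its label

toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Age": [22, 38, 26],
    "Fare": [7.25, 71.28, 7.92],
})
print(toy["Age"])           # a single column of a dataframe is a Series
print(toy.loc[1, "Name"])   # row label 1, column label "Name"
print(list(toy.columns))    # the column index
print(list(toy.index))      # the row index (0, 1, 2 by default)
```

Since no row labels were supplied for the dataframe, Pandas falls back to the row numbers 0, 1, 2 as the row index.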
You can download the dataset (passenger records from the Titanic) from Kaggle. Here is the description of the variables as provided by Kaggle:
[stextbox id = "grey"]VARIABLE DESCRIPTIONS:
survival Survival
(0 = No; 1 = Yes)
pclass Passenger Class
(1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin
embarked Port of Embarkation
(C = Cherbourg; Q = Queenstown; S = Southampton)
SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
Age is in Years; Fractional if Age less than One (1)
If the Age is Estimated, it is in the form xx.5
With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored. The following are the definitions used
for sibsp and parch.
Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent: Mother or Father of Passenger Aboard Titanic
Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws. Some children travelled
only with a nanny, therefore parch=0 for them. As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.
[/stextbox]
To begin, start the IPython interface in inline Pylab mode by typing the following on your terminal / Windows command prompt:
[stextbox id = "grey"]ipython notebook --pylab=inline[/stextbox]
This opens up the IPython notebook in the pylab environment, which has a few useful libraries already imported. You will also be able to plot your data inline, which makes this a really good environment for interactive data analysis. You can check whether the environment has loaded correctly by typing the following command, which should draw a straight line from 0 to 4:
[stextbox id = "grey"]plot(arange(5))[/stextbox]
I am currently working in Linux, and have stored the dataset in the following location:
/home/kunal/Downloads/kaggle/train.csv
The libraries we will use during this tutorial are pandas, numpy and matplotlib.
You can read a brief description about each of these libraries here. Please note that you do not need to import matplotlib and numpy in the Pylab environment, because it loads them already. I have still kept them in the code, in case you use the code in a different environment.
After importing the libraries, you read the dataset using the function read_csv(). This is what the code looks like up to this stage:
[stextbox id = "grey"]
import pandas as pd
import numpy as np
import matplotlib as plt
import matplotlib.pyplot #Loads the pyplot submodule, so that plt.pyplot also works outside Pylab
df = pd.read_csv("/home/kunal/Downloads/kaggle/train.csv") #Reading the dataset in a dataframe using Pandas
[/stextbox]
Once you have read the dataset, you can have a look at the top few rows by using the function head():
[stextbox id = "grey"]df.head(10)[/stextbox]
This should print 10 rows. Alternatively, you can look at more rows by printing the dataset itself.
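If you do not have the file at hand yet, the same two steps can be tried on a few lines of CSV text. This is only a sketch: the rows below are made up, and only a handful of the real columns are included.

```python
import io
import pandas as pd

# Hypothetical rows in the same spirit as train.csv (not the real data)
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,,7.92
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))   # first 2 rows only
print(sample.shape)     # (number of rows, number of columns)
print(sample.dtypes)    # data type that Pandas inferred for each column
```

Note that the empty Age field in the third row is read in as a missing value (NaN).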
Next, you can look at a summary of the numerical fields by using the describe() function:
[stextbox id = "grey"]df.describe()[/stextbox]
The describe() function provides count, mean, standard deviation (std), min, quartiles and max in its output (read this article to refresh the basic statistics needed to understand population distribution).
Here are a few inferences you can draw by looking at the output of the describe() function:
In addition to these statistics, you can also look at the median of these variables and compare it with the mean to spot possible skew in the dataset. The median can be found by:
[stextbox id = "grey"]df['Age'].median()[/stextbox]
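As a rough illustration of the mean versus median comparison, here is a sketch on a made-up series with one very large value. The numbers are hypothetical and only serve to show the effect.

```python
import pandas as pd

# Hypothetical fares: mostly small values plus one very large one
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 26.0, 500.0])

mean_fare = fares.mean()
median_fare = fares.median()
print(mean_fare, median_fare)

# A mean well above the median hints at a long right tail
print(mean_fare > median_fare)
print(fares.skew())   # sample skewness; positive for right-skewed data
```

The single large value pulls the mean up but leaves the median almost untouched, which is why comparing the two is a quick check for skew.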
For the non-numerical fields (e.g. Sex, Embarked etc.), we can look at the unique values to understand whether they make sense or not. Since Name is a free-flowing field, we will exclude it from this analysis. Unique values can be printed with the following command:
[stextbox id = "grey"]df['Sex'].unique()[/stextbox]
Similarly, we can look at the unique values of the port of embarkation.
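Beyond listing the unique values, it can help to see how often each one occurs. A minimal sketch with made-up values standing in for the Embarked column:

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for the Embarked column
embarked = pd.Series(["S", "C", "S", "Q", np.nan, "S"])

print(embarked.unique())      # distinct values, including the missing one

# Frequency of each value; dropna=False keeps missing values in the tally
counts = embarked.value_counts(dropna=False)
print(counts)
```

On the real dataframe the equivalent call would be df['Embarked'].value_counts(dropna=False).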
Now that we are familiar with the basic data characteristics, let us study the distribution of various variables. Let us start with the numeric variables, namely Age and Fare.
We plot their histograms using the following commands:
[stextbox id = "grey"]fig = plt.pyplot.figure()
ax = fig.add_subplot(111)
ax.hist(df['Age'].dropna(), bins = 10, range = (df['Age'].min(),df['Age'].max())) #Age has missing values; drop them before plotting
plt.pyplot.title('Age distribution')
plt.pyplot.xlabel('Age')
plt.pyplot.ylabel('Count of Passengers')
plt.pyplot.show()
[/stextbox]
and
[stextbox id = "grey"]fig = plt.pyplot.figure()
ax = fig.add_subplot(111)
ax.hist(df['Fare'], bins = 10, range = (df['Fare'].min(),df['Fare'].max()))
plt.pyplot.title('Fare distribution')
plt.pyplot.xlabel('Fare')
plt.pyplot.ylabel('Count of Passengers')
plt.pyplot.show()
[/stextbox]
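If a histogram looks odd or comes out empty, it can help to inspect the bin counts directly. The sketch below does this with numpy on made-up ages that include missing values; it is an aid for checking, not part of the plotting code above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, with missing values as in the real Age column
ages = pd.Series([2.0, 22.0, 24.0, np.nan, 38.0, 61.0, np.nan, 80.0])

clean = ages.dropna()   # remove missing values before binning

# Same idea as ax.hist: equal-width bins between the min and the max
counts, edges = np.histogram(clean, bins=4, range=(clean.min(), clean.max()))
print(edges)    # bin boundaries
print(counts)   # number of values that fall in each bin
```

The counts should add up to the number of non-missing values, which is a quick sanity check on any histogram.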
Next, we look at box plots to understand the distributions. The box plot for Fare can be plotted by:
[stextbox id = "grey"]df.boxplot(column='Fare')[/stextbox]
This shows a lot of outliers. Part of this can be driven by the fact that we are looking at fares across the 3 passenger classes. Let us segregate them by passenger class:
[stextbox id = "grey"]df.boxplot(column='Fare', by = 'Pclass')[/stextbox]
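The points drawn as outliers in a box plot follow, by default in matplotlib, the 1.5 x IQR rule: anything more than 1.5 times the interquartile range beyond the quartiles is flagged. Here is a sketch of the same rule applied by hand to made-up fares:

```python
import pandas as pd

# Hypothetical fares, not the real Fare column
fares = pd.Series([7.0, 8.0, 8.0, 10.0, 13.0, 26.0, 500.0])

q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1                     # interquartile range

lower = q1 - 1.5 * iqr            # values below this are flagged
upper = q3 + 1.5 * iqr            # values above this are flagged

outliers = fares[(fares < lower) | (fares > upper)]
print(lower, upper)
print(outliers)
```

Running the same logic per passenger class (for example inside a groupby on Pclass) would give the numeric counterpart of the segregated box plot.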
Clearly, both Age and Fare require some amount of data munging. Age has about 20% missing values (177 of the 891 rows in the Kaggle training set), while Fare has a few outliers, which demand deeper understanding. We will take this up later (in the next tutorial).
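The number and share of missing values per column can be checked directly. A minimal sketch on a made-up dataframe (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in Age only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05],
})

missing_count = toy.isnull().sum()    # number of missing values per column
missing_share = toy.isnull().mean()   # fraction of missing values per column
print(missing_count)
print(missing_share)
```

On the real dataframe, df.isnull().sum() gives the same breakdown for every column at once.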
Now that we understand the distributions of Age and Fare, let us look at the categorical variables in more detail. The following code plots the distribution of the population by Pclass and the probability of survival in each class:
[stextbox id = "grey"]temp1 = df.groupby('Pclass').Survived.count() #Number of passengers in each class
temp2 = df.groupby('Pclass').Survived.mean() #Share of survivors in each class (Survived is 0/1)
fig = plt.pyplot.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
temp1.plot(kind='bar', ax=ax1) #Draw on the left subplot explicitly
ax1.set_xlabel('Pclass')
ax1.set_ylabel('Count of Passengers')
ax1.set_title("Passengers by Pclass")
ax2 = fig.add_subplot(122)
temp2.plot(kind='bar', ax=ax2) #Draw on the right subplot explicitly
ax2.set_xlabel('Pclass')
ax2.set_ylabel('Probability of Survival')
ax2.set_title("Probability of survival by class")
[/stextbox]
You can plot similar graphs by Sex and port of embarkation.
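To avoid repeating the same group-by code for every variable, the two numbers behind each pair of plots can be wrapped in a small helper. This is a sketch: survival_summary is a hypothetical helper name, and the data below is made up.

```python
import pandas as pd

# Hypothetical data, not the real train.csv
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

def survival_summary(data, column):
    """Passenger count and survival rate for each value of `column`."""
    grouped = data.groupby(column)["Survived"]
    return pd.DataFrame({
        "passengers": grouped.count(),
        "survival_rate": grouped.mean(),   # Survived is 0/1, so the mean is a rate
    })

summary = survival_summary(toy, "Sex")
print(summary)
```

Each column of the result can then be plotted as a bar chart, for example with summary["survival_rate"].plot(kind="bar").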
Alternatively, these two plots can be combined into a single stacked chart:
[stextbox id = "grey"]temp3 = pd.crosstab([df.Pclass, df.Sex], df.Survived.astype(bool))
temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)
[/stextbox]
You can also add the port of embarkation into the mix:
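One way this could look is to pass a third column to the same crosstab call. The sketch below uses a small made-up dataframe so that it runs on its own; temp4 is a name chosen here for illustration.

```python
import pandas as pd

# Hypothetical rows, not the real train.csv
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 2],
    "Sex":      ["female", "male", "male", "female", "male", "female"],
    "Embarked": ["C", "S", "S", "Q", "S", "S"],
    "Survived": [1, 0, 0, 1, 0, 1],
})

# Rows: every observed (Pclass, Sex, Embarked) combination
# Columns: did not survive (False) / survived (True)
temp4 = pd.crosstab([toy.Pclass, toy.Sex, toy.Embarked],
                    toy.Survived.astype(bool))
print(temp4)

# In the notebook, the stacked chart would then be drawn with:
# temp4.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)
```

With the real data, replacing toy with df gives one bar per combination of class, sex and port.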
If you have not realized already, we have just created two basic classification algorithms here: one based on Pclass and Sex, and the other based on 3 categorical variables (including port of embarkation). You can quickly code this to create your first submission on Kaggle.
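As a rough sketch of how such a rule could be coded: predict survival for a group whenever more than half of its passengers survived in the training data. The 0.5 threshold, the column name Prediction, the fallback of 0 for unseen groups and all the data below are assumptions made for illustration, not something prescribed by Kaggle.

```python
import pandas as pd

# Hypothetical training rows, not the real train.csv
train = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 2, 2, 3],
    "Sex":      ["female", "male", "male", "female",
                 "male", "female", "male", "female"],
    "Survived": [1, 0, 0, 1, 0, 1, 0, 0],
})

# Survival rate for every (Pclass, Sex) group in the training data
rates = train.groupby(["Pclass", "Sex"])["Survived"].mean()

# Assumed rule: predict 1 when more than half of the group survived
rule = (rates > 0.5).astype(int).rename("Prediction").reset_index()

# Hypothetical unseen passengers to score
test = pd.DataFrame({
    "PassengerId": [101, 102, 103],
    "Pclass": [1, 3, 2],
    "Sex": ["female", "male", "male"],
})

# Look up the group rule for each passenger; default to 0 for unseen groups
submission = test.merge(rule, on=["Pclass", "Sex"], how="left")
submission["Prediction"] = submission["Prediction"].fillna(0).astype(int)
print(submission[["PassengerId", "Prediction"]])
```

Adding Embarked to both the groupby and the merge keys would give the three-variable version of the same rule.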
In this post, we saw how to do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) has increased by now, given the amount of help the library can provide in analyzing datasets.
We will start the next tutorial from this stage, where we will explore the Age and Fare variables further, perform data munging and create a dataset for applying various modeling techniques. If you are following this series, we have covered a lot of ground in this tutorial. I would strongly urge you to take another dataset and problem and work through an independent example before we publish the next post in this series.
If you come across any difficulty while doing so, or have any thoughts / suggestions / feedback on the post, please feel free to post them in the comments below.
Thanks Kunal, it's a wonderful article to get started with data analysis in Python. I am getting an empty plot for the Age distribution. Is this because df['Age'].min() = 0.41999999999999998?
Gaurav, can you check if you get a histogram with the following command: df['Age'].hist() If it is still empty, can you send the output of df.describe() to me via email? Which versions of Python, matplotlib and pandas are you using, and through which interface? Regards, Kunal
Really excellent overview - thank you! Any recommendations for which data set to look at next?
Thanks Phil! It entirely depends on your past experience. If you already know stats and predictive modeling technique - but are learning Python as a new tool - I would say that you should take up a bigger dataset and a more complex problem (e.g. Tree classification or movie review mining on Kaggle). If you are learning both the tools and the techniques for the first time, I would say to take a few well documented problems - example Iris dataset, so that you can learn about various techniques and their results from internet. Hope this helps. Regards, Kunal
Good article! Thanks! It's better to avoid using "--pylab=inline"; see details here: http://carreau.github.io/posts/10-No-PyLab-Thanks.ipynb.html