Baby steps in Python – Exploratory analysis in Python (using Pandas)

kunal Last Updated : 17 Apr, 2015


In the last 2 posts of this series, we looked at how to install Python with iPython interface and several useful libraries and data structures, which are available in Python. If you have not gone through these posts and are new to Python, I would recommend that you go through the previous posts before going ahead.

In order to explore our data further, let me introduce you to another animal (as if Python was not enough!) – Pandas

Image Source: Wikipedia

Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. In this tutorial, we will use Pandas to read a data set from a Kaggle competition, perform exploratory analysis and build our first basic classification algorithm for solving this problem.

Introduction to Series and Dataframes

Series can be understood as a 1 dimensional labelled / indexed array. You can access individual elements of this series through these labels.

A dataframe is similar to an Excel worksheet: you have column names referring to columns, and you have rows, which can be accessed by row number. The essential difference is that, in a dataframe, the column names and row numbers are known as the column index and the row index.

Series and dataframes form the core data model for Pandas in Python. Data sets are first read into dataframes, and then various operations (e.g. group by, aggregation etc.) can be applied very easily to their columns.
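To make these two structures concrete, here is a minimal sketch. The values and the names `ages` and `toy` are made up for illustration; they are not taken from the Kaggle data.

```python
import pandas as pd

# A Series: a 1-dimensional array whose elements are accessed by label
ages = pd.Series([22.0, 38.0, 26.0], index=['a', 'b', 'c'])
print(ages['b'])            # look up one element by its label

# A DataFrame: a table with a column index and a row index
# (illustrative values only)
toy = pd.DataFrame({'Pclass': [3, 1, 3, 1],
                    'Fare': [7.25, 71.28, 7.92, 53.10]})
print(toy['Fare'])          # a single column is itself a Series
print(toy.loc[1])           # one row, selected by its row index label

# Group by one column and aggregate another
mean_fare = toy.groupby('Pclass')['Fare'].mean()
print(mean_fare)
```

The last two lines show the kind of group-by and aggregation operation mentioned above: one line of code gives the average fare per passenger class.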

Kaggle dataset – Titanic: Machine Learning from Disaster

You can download the dataset from Kaggle. Here is the description of variables as provided by Kaggle:

[stextbox id = "grey"]

VARIABLE DESCRIPTIONS:
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5

With respect to the family relation variables (i.e. sibsp and parch)
some relations were ignored.  The following are the definitions used
for sibsp and parch.

Sibling:  Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
Spouse:   Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
Parent:   Mother or Father of Passenger Aboard Titanic
Child:    Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins,
nephews/nieces, aunts/uncles, and in-laws.  Some children travelled
only with a nanny, therefore parch=0 for them.  As well, some
travelled with very close friends or neighbors in a village, however,
the definitions do not support such relations.

[/stextbox]

Let the exploration begin

To begin, start the iPython interface in inline Pylab mode by typing the following at your terminal / Windows command prompt:

[stextbox id = "grey"]ipython notebook --pylab=inline[/stextbox]

This opens up iPython notebook in pylab environment, which has a few useful libraries already imported. Also, you will be able to plot your data inline, which makes this a really good environment for interactive data analysis. You can check whether the environment has loaded correctly, by typing the following command (and getting the output as seen in the figure below):

[stextbox id = "grey"]plot(arange(5))[/stextbox]

I am currently working in Linux, and have stored the dataset in the following location:

/home/kunal/Downloads/kaggle/train.csv

Importing libraries and the data set:

Following are the libraries we will use during this tutorial:

numpy
matplotlib
pandas

You can read a brief description about each of these libraries here. Please note that you do not need to import matplotlib and numpy because of Pylab environment. I have still kept them in the code, in case you use the code in a different environment.

After importing the libraries, you read the dataset using the function read_csv(). This is what the code looks like at this stage:

[stextbox id = "grey"]
import pandas as pd
import numpy as np
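import matplotlib as plt
import matplotlib.pyplot #Loads the pyplot submodule, so that plt.pyplot also works outside the Pylab environment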

df = pd.read_csv("/home/kunal/Downloads/kaggle/train.csv") #Reading the dataset in a dataframe using Pandas
[/stextbox]
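If you want to check that read_csv() parsed the file as expected, looking at the shape and the column types is a quick first step. The sketch below reads a tiny in-memory CSV instead of train.csv, so that it runs anywhere; the three rows are made up for illustration.

```python
import io
import pandas as pd

# A tiny stand-in for train.csv (illustrative rows, not the Kaggle data)
csv_text = "PassengerId,Survived,Pclass,Age\n1,0,3,22\n2,1,1,\n3,1,3,26\n"

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)     # (number of rows, number of columns)
print(toy.dtypes)    # the type pandas inferred for each column
```

An empty field, such as the missing age in the second row, is read as NaN, which is why the Age column comes out as a floating point column.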

Quick data exploration:

Once you have read the dataset, you can have a look at the top few rows by using the function head():

[stextbox id = "grey"]df.head(10)[/stextbox]

This should print 10 rows. Alternately, you can also look at more rows by printing the dataset.

Next, you can look at a summary of the numerical fields by using the describe() function:

[stextbox id = "grey"]df.describe()[/stextbox]

The describe() function provides count, mean, standard deviation (std), min, quartiles and max in its output (read this article to refresh basic statistics to understand population distribution).

Here are a few inferences you can draw by looking at the output of the describe() function:

Age has (891 - 714) 177 missing values.
We can also see that about 38% of passengers survived the tragedy. How? The mean of the Survived field is 0.38 (remember, Survived has value 1 for those who survived and 0 otherwise).
By looking at the percentiles of Pclass, you can see that more than 50% of passengers belong to class 3.
The age distribution seems to be in line with expectation. Same with SibSp and Parch.
The fare seems to have values of 0, indicating the possibility of some free tickets or data errors. At the other extreme, 512 looks like a possible outlier / error.
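The first two inferences can also be computed directly instead of being read off the describe() output. This is a minimal sketch on a made-up five-row frame (the name `toy` and its values are illustrative); on the real data you would use `df` in its place.

```python
import numpy as np
import pandas as pd

# Illustrative rows only, with two missing ages
toy = pd.DataFrame({'Survived': [0, 1, 1, 0, 1],
                    'Age': [22.0, np.nan, 26.0, 35.0, np.nan]})

total_rows = len(toy)
non_missing = toy['Age'].count()      # count() ignores NaN, just like describe()
missing = total_rows - non_missing
print(missing)

print(toy.isnull().sum())             # missing values for every column at once

# The mean of a 0/1 column is the share of rows with value 1
print(toy['Survived'].mean())
```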

In addition to these statistics, you can also look at the median of these variables and compare them with mean to see possible skew in the dataset. Median can be found out by:

[stextbox id = "grey"]df['Age'].median()[/stextbox]
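As a small illustration of the mean versus median comparison, the sketch below uses a made-up series of fares with one very large value. The numbers are illustrative, not taken from the dataset.

```python
import pandas as pd

# Illustrative fares: mostly small values plus one very large one
fares = pd.Series([7.25, 7.92, 8.05, 13.0, 26.0, 512.33])

print(fares.mean())      # pulled upwards by the single large fare
print(fares.median())    # barely affected by it
print(fares.skew())      # positive for a right-skewed distribution
```

A mean well above the median, as here, hints at a long right tail; a mean close to the median suggests a roughly symmetric distribution.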

For the non-numerical fields (e.g. Sex, Embarked etc.), we can look at the unique values to understand whether they make sense or not. Since Name is a free-flowing field, we will exclude it from this analysis. Unique values can be printed with the following command:

[stextbox id = "grey"]df['Sex'].unique()[/stextbox]

Similarly, we can look at the unique values of the port of embarkation.
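For the port of embarkation this could look like the sketch below. It also uses value_counts(), which is not used elsewhere in this post but is handy here because it shows how often each value occurs. The six rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative rows only, including one missing port
toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q', np.nan, 'S']})

print(toy['Embarked'].unique())                        # distinct values, NaN included

counts = toy['Embarked'].value_counts(dropna=False)    # frequency of each value
print(counts)
```

Passing `dropna=False` keeps the missing values in the tally, so unexpected gaps show up alongside unexpected codes.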

Distribution analysis:

Now that we are familiar with basic data characteristics, let us study distribution of various variables. Let us start with numeric variables – namely Age and Fare

We plot their histograms using the following commands:

[stextbox id = "grey"]fig = plt.pyplot.figure()
ax = fig.add_subplot(111)
ax.hist(df['Age'].dropna(), bins = 10, range = (df['Age'].min(),df['Age'].max())) #Drop the missing ages first; they can lead to an empty plot
plt.pyplot.title('Age distribution')
plt.pyplot.xlabel('Age')
plt.pyplot.ylabel('Count of Passengers')
plt.pyplot.show()
[/stextbox]

and

[stextbox id = "grey"]fig = plt.pyplot.figure()
ax = fig.add_subplot(111)
ax.hist(df['Fare'], bins = 10, range = (df['Fare'].min(),df['Fare'].max()))
plt.pyplot.title('Fare distribution')
plt.pyplot.xlabel('Fare')
plt.pyplot.ylabel('Count of Passengers')
plt.pyplot.show()
[/stextbox]

Next, we look at box plots to understand the distributions. Box plot for fare can be plotted by:

[stextbox id = "grey"]df.boxplot(column='Fare')[/stextbox]

This shows a lot of outliers. Part of this may be driven by the fact that we are looking at fares across all 3 passenger classes. Let us segregate them by passenger class:

[stextbox id = "grey"]df.boxplot(column='Fare', by = 'Pclass')[/stextbox]
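If you prefer numbers to a picture, the same per-class comparison can be made by combining groupby() with describe(). This is a sketch on made-up fares (the frame `toy` is illustrative); on the real data you would group `df` instead.

```python
import pandas as pd

# Illustrative fares only: one very expensive ticket in class 1
toy = pd.DataFrame({'Pclass': [1, 1, 1, 3, 3, 3],
                    'Fare': [30.0, 50.0, 500.0, 7.0, 8.0, 9.0]})

# One row of summary statistics per passenger class
fare_by_class = toy.groupby('Pclass')['Fare'].describe()
print(fare_by_class[['25%', '50%', '75%', 'max']])
```

A maximum far above the 75% quartile within a class is the numeric counterpart of the isolated points drawn above the box in the box plot.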

Clearly, both Age and Fare require some amount of data munging. Age has about 20% missing values (177 out of 891), while Fare has a few outliers, which demand deeper understanding. We will take this up later (in the next tutorial).

Categorical variable analysis:

Now that we understand the distributions of Age and Fare, let us look at the categorical variables in more detail. The following code plots the distribution of passengers by Pclass and their probability of survival:

[stextbox id = "grey"]temp1 = df.groupby('Pclass').Survived.count() #Number of passengers in each class
temp2 = df.groupby('Pclass').Survived.mean() #Share of survivors in each class (Survived is 0 or 1, so the mean equals sum / count)
fig = plt.pyplot.figure(figsize=(8,4))

ax1 = fig.add_subplot(121)
temp1.plot(kind='bar', ax=ax1) #Draw explicitly on the left-hand axes
ax1.set_xlabel('Pclass')
ax1.set_ylabel('Count of Passengers')
ax1.set_title("Passengers by Pclass")

ax2 = fig.add_subplot(122)
temp2.plot(kind='bar', ax=ax2) #Draw explicitly on the right-hand axes
ax2.set_xlabel('Pclass')
ax2.set_ylabel('Probability of Survival')
ax2.set_title("Probability of survival by class")
[/stextbox]

You can plot similar graphs by Sex and port of embarkation.
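To avoid repeating the group-by code for every variable, the two quantities can be wrapped in a small helper. This is only a sketch: `survival_summary` is a hypothetical name of my own, and the rows in `toy` are made up for illustration.

```python
import pandas as pd

def survival_summary(frame, column):
    """Return passenger count and survival rate for each value of `column`."""
    grouped = frame.groupby(column)['Survived']
    return pd.DataFrame({'count': grouped.count(),
                         'survival_rate': grouped.mean()})

# Illustrative rows only
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male', 'male'],
                    'Embarked': ['S', 'C', 'S', 'Q', 'S'],
                    'Survived': [0, 1, 1, 0, 1]})

by_sex = survival_summary(toy, 'Sex')
by_port = survival_summary(toy, 'Embarked')
print(by_sex)
print(by_port)
```

Each column of the returned frame can then be plotted with `.plot(kind='bar')`, in the same way as temp1 and temp2 above.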

Alternately, these two plots can also be visualized by combining them in a stacked chart:

[stextbox id = "grey"]temp3 = pd.crosstab([df.Pclass, df.Sex], df.Survived.astype(bool))
temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)[/stextbox]

You can also add the port of embarkation into the mix.
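A sketch of how this could be done is to pass a third column to the same crosstab() call. The name `temp4` and the rows in `toy` are illustrative assumptions, not the code behind the original chart.

```python
import pandas as pd

# Illustrative rows only
toy = pd.DataFrame({'Pclass': [1, 1, 3, 3, 3, 2],
                    'Sex': ['female', 'female', 'male', 'male', 'female', 'male'],
                    'Embarked': ['C', 'S', 'S', 'S', 'Q', 'S'],
                    'Survived': [1, 1, 0, 0, 1, 0]})

# Rows: every observed (Pclass, Sex, Embarked) combination
# Columns: False (did not survive) and True (survived)
temp4 = pd.crosstab([toy.Pclass, toy.Sex, toy.Embarked],
                    toy.Survived.astype(bool))
print(temp4)
```

Calling `temp4.plot(kind='bar', stacked=True)` on the result should then give a stacked chart with one bar per combination, in the same way as for temp3.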

If you have not realized it already, we have just created two basic classification algorithms here: one based on Pclass and Sex, and the other on 3 categorical variables (including the port of embarkation). You can quickly code this to create your first submission on Kaggle.
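One possible way to code the first of these, as a minimal sketch rather than a definitive implementation: look up the survival rate of each (Pclass, Sex) group and predict survival when it is above 0.5. The 0.5 threshold, the fallback to the overall survival rate for unseen groups, the function name `predict` and all rows below are my own assumptions for illustration.

```python
import pandas as pd

# Illustrative training rows only, not the Kaggle data
train = pd.DataFrame({
    'Pclass':   [1, 1, 3, 3, 3, 3, 1, 3],
    'Sex':      ['female', 'female', 'male', 'male',
                 'female', 'male', 'male', 'female'],
    'Survived': [1, 1, 0, 0, 1, 0, 0, 0],
})

# Survival rate observed in each (Pclass, Sex) group
rates = train.groupby(['Pclass', 'Sex'])['Survived'].mean()
overall = train['Survived'].mean()    # fallback for groups not seen in training

def predict(frame):
    """Predict 1 if the passenger's group survival rate is above 0.5, else 0."""
    lookup = rates.rename('rate').reset_index()
    merged = frame[['Pclass', 'Sex']].merge(lookup, on=['Pclass', 'Sex'], how='left')
    return (merged['rate'].fillna(overall) > 0.5).astype(int).values

# Illustrative passengers to score; (2, female) does not occur in the training rows
new_passengers = pd.DataFrame({'Pclass': [1, 3, 3, 2],
                               'Sex': ['female', 'male', 'female', 'female']})
predictions = predict(new_passengers)
print(predictions)
```

On the real data, `train` would be the dataframe read from train.csv, and the frame to score would come from the test file provided by the competition.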

End Notes:

In this post, we saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) has increased by now, given the amount of help the library can provide you in analyzing datasets.

We will start the next tutorial from this stage, where we will explore Age and Fare variables further, perform data munging and create a dataset for applying various modeling techniques. If you are following this series, we have covered a lot of ground in this tutorial. I would strongly urge that you take another dataset and problem and go through an independent example before we publish the next post in this series.

If you come across any difficulty while doing so, or you have any thoughts / suggestions / feedback on the post, please feel free to post them through comments below.


Gaurav

Thanks Kunal, Its a wonderful article to get started with Data Analysis in Python. I am getting empty plot for Age distribution. Is this because df['Age'].min() = 0.41999999999999998


Kunal Jain

Gaurav, Can you check if you are getting a histogram output with following command: df['Age'].hist() If you still get it empty, can you send the output of the df.describe() to me via email? Which version of Python, matplotlib and pandas are you using through which interface? Regards, Kunal

Phil Renaud

Really excellent overview - thank you! Any recommendations for which data set to look at next?

Thanks Phil! It entirely depends on your past experience. If you already know stats and predictive modeling technique - but are learning Python as a new tool - I would say that you should take up a bigger dataset and a more complex problem (e.g. Tree classification or movie review mining on Kaggle). If you are learning both the tools and the techniques for the first time, I would say to take a few well documented problems - example Iris dataset, so that you can learn about various techniques and their results from internet. Hope this helps. Regards, Kunal

Stas

Good article! Thanks! It's better to avoid using of "--pylab=inline" - see details here: http://carreau.github.io/posts/10-No-PyLab-Thanks.ipynb.html
