Note: This article was originally published on Sep 13th, 2015 and updated on Sept 11th, 2017
Here’s a situation you’ve got into in your data science project:
You are working on a classification problem and have generated your set of hypothesis, created features and discussed the importance of variables. Within an hour, stakeholders want to see the first cut of the model.
What will you do? You have hundreds of thousands of data points and quite a few variables in your training data set. In such a situation, if I were in your place, I would have used ‘Naive Bayes‘, which can be extremely fast relative to other classification algorithms. It works on Bayes theorem of probability to predict the class of unknown data sets.
In this article, I’ll explain the basics of this algorithm, so that next time when you come across large data sets, you can bring this algorithm to action. In addition, if you are a newbie in Python or R, you should not be overwhelmed by the presence of available codes in this article.
If you prefer to learn Naive Bayes theorem from the basics concepts to the implementation in a structured manner, you can enrol in this free course: Naive Bayes from Scratch
Project to apply Naive BayesProblem StatementHR analytics is revolutionizing the way human resources departments operate, leading to higher efficiency and better results overall. Human resources have been using analytics for years. However, the collection, processing, and analysis of data have been largely manual, and given the nature of human resources dynamics and HR KPIs, the approach has been constraining HR. Therefore, it is surprising that HR departments woke up to the utility of machine learning so late in the game. Here is an opportunity to try predictive analytics in identifying the employees most likely to get promoted. |
It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.
Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Bayes theorem provides a way of calculating posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at the equation below:
Let’s understand it using an example. Below I have a training data set of weather and corresponding target variable ‘Play’ (suggesting possibilities of playing). Now, we need to classify whether players will play or not based on weather condition. Let’s follow the below steps to perform it.
Step 1: Convert the data set into a frequency table
Step 2: Create Likelihood table by finding the probabilities like Overcast probability = 0.29 and probability of playing is 0.64.
Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of prediction.
Problem: Players will play if weather is sunny. Is this statement is correct?
We can solve it using above discussed method of posterior probability.
P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)
Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P( Yes)= 9/14 = 0.64
Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has higher probability.
Naive Bayes uses a similar method to predict the probability of different class based on various attributes. This algorithm is mostly used in text classification and with problems having multiple classes.
Pros:
Cons:
Again, scikit learn (python library) will help here to build a Naive Bayes model in Python. There are three types of Naive Bayes model under the scikit-learn library:
Gaussian: It is used in classification and it assumes that features follow a normal distribution.
Multinomial: It is used for discrete counts. For example, let’s say, we have a text classification problem. Here we can consider Bernoulli trials which is one step further and instead of “word occurring in the document”, we have “count how often word occurs in the document”, you can think of it as “number of times outcome number x_i is observed over the n trials”.
Bernoulli: The binomial model is useful if your feature vectors are binary (i.e. zeros and ones). One application would be text classification with ‘bag of words’ model where the 1s & 0s are “word occurs in the document” and “word does not occur in the document” respectively.
Try out the below code in the coding window and check your results on the fly!
require(e1071) #Holds the Naive Bayes Classifier Train <- read.csv(file.choose()) Test <- read.csv(file.choose()) #Make sure the target variable is of a two-class classification problem only levels(Train$Item_Fat_Content) model <- naiveBayes(Item_Fat_Content~., data = Train) class(model) pred <- predict(model,Test) table(pred)
Above, we looked at the basic Naive Bayes model, you can improve the power of this basic model by tuning parameters and handle assumption intelligently. Let’s look at the methods to improve the performance of Naive Bayes Model. I’d recommend you to go through this document for more details on Text classification using Naive Bayes.
Here are some tips for improving power of Naive Bayes Model:
In this article, we looked at one of the supervised machine learning algorithm “Naive Bayes” mainly used for classification. Congrats, if you’ve thoroughly & understood this article, you’ve already taken you first step to master this algorithm. From here, all you need is practice.
Further, I would suggest you to focus more on data pre-processing and feature selection prior to applying Naive Bayes algorithm.0 In future post, I will discuss about text and document classification using naive bayes in more detail.
Did you find this article helpful? Please share your opinions / thoughts in the comments section below.
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
10 Nov 23 • 08:00pm
Hi Sunil , From the weather and play table which is table [1] we know that frequency of sunny is 5 and play when sunny is 3 no play when suny is 2 so probability(play/sunny) is 3/5 = 0.6 Why do we need conditional probabilty to solve this? Is there problems that can be solved only using conditional probability. can you suggest such examples. Thanks, Arun
Arun, An example of a problem that requires the use of conditional probability is the Monty Hall problem. https://en.wikipedia.org/wiki/Monty_Hall_problem Conditional probability is used to solve this particular problem because the solution depends on Bayes' Theorem. Which was described earlier in the article.
It's a trivial example for illustration. The "Likelihood table" (a confusing misnomer, I think) is in fact a probability table that has the JOINT weather and play outcome probabilities in the center, and the MARGINAL probabilities of one variable (from integrating out the other variable from the joint) on the side and bottom. Say, weather type = w and play outcome = p. P(w,p) is the joint probabilities and P(p) and P(w) are the marginals. Bayes rule described above by Sunil stems from: P(w,p) = P(w|p) * P(p) = P(p|w) * P(w). From the center cells we have P(w,p) and from the side/bottom we get P(p) and P(w). Depending on what you need to calculate, it follows that: (1): P(w|p) = P(w,p) / P(p) and (2:)P(p|w) = P(w,p) / P(w), which is what you did with P(sunny,yes) = 3/14 and P(w) = 5/14, yielding (3/14 ) ( 14/5), with the 14's cancelling out. The main Bayes take away is that often, one of the two quantities above, P(w|p) or P(p|w) is much harder to get at than the other. So if you're a practitioner you'll come to see this as one of two mathematical miracles regarding this topic, the other being the applicability of Markov Chain Monte Carlo in circumventing some nasty integrals Bayes might throw at you. But I digress. Best, Erdem.
I had the same question. Have you found an answer? Thanks!
Great article and provides nice information.
Amazing content and useful information
I'm new to machine learning and Python.Could you please help to read data from CSV and to separate the same data set to training and test data
import pandas as pd person = pd.read_csv('example.csv') mask = np.random.rand(len(sales)) < 0.8 train = sales[mask] test = sales[~mask]
very useful article.
Very nice....but...if u dont mind...can you please give me that code in JAVA ...
is it possible to classify new tuple in orange data mining tool??
good.
I am really impressed together with your writing skills as wwell as with the format to your weblog. Is that this a paid theme or did you customiz it yourself? Anyway stay up the excellent quality writing, it's uncommon tto see a great weblog like this one these days..
You should be a part of a contest for one of the most useful websites online. I will highly recommend this blog!
Thanks for tips to improve the performance of models, that 's really precious experience.
Nice piece - Just to add my thoughts , people require a CA OCF-1 , We used a sample document here http://goo.gl/ibPgs2
Hi, I have a question regarding this statement: 'If continuous features do not have normal distribution, we should use transformation or different methods to convert it in normal distribution.' Can you provide an example or a link to the techniques? Thank you, MB
Can ӏ simply juѕt say what а comfoгt too find someone who actսaly knows what they're talking aƄout over the internet. Υou certainly know how to bring an issue to light and make it impߋrtant. A lot more peoⲣⅼe ought to read this and undᥱrstand this side of your story. I can't believe you aren't more popular because you certainly have the gift.
Great article! Thanks. Are there any similar articles for other classification algorithms specially target towards textual features and mix of textual/numeric features?
great article with basic clarity.....nice one
This article is extremely clear and well laid-out. Thank you!
ty
Explanation given in simple word. Well explained! Loved this article.
The 'y' should be capitalized in your code - great article though.
This is the best explanation of NB so far simple and short :)
Great article! Really enjoyed it. Just wanted to point out a small error in the Python code. Should be a capital "Y" in the predict like so : model.fit(x, Y) Thanks!
Is this dataset related to weather? I am confused as a newbie. Can you please guide?
best artical that help me to understand this concept
Am new to machine learning and this article was handy to me in understanding naive bayes especially the data on weather and play in the table. Thanks for sharing keep up
Thanks to you I can totally understand NB classifier.
Really nice article, very use-full for concept building.
I didn't understand the 3rd step. Highest probability out of which probability values? >> Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which has higher probability. Higher than what?
Concept explained well... nice Article
thanks nice artical that help me to understand this concept
Good article and I am waiting for text and documents classification using naive base algorithm.
Superb information in just one blog.Enjoyed the simplicity.Thanks for the effort.
Good start point for beginners
Weldone sanil I have a question regarding naive bayes,currently i am working on a project that is detect depression through naive bayes algorithm so plz suggest few links regarding my projects.i shall be gratefull to you. Thanku so much
Hi Abdul, Refer this link.
I am not understanding the x and the y variables. Can someone help me
Hi Tongesai, X here represents the dependent variable and y is used for target variable (which is to be predicted).