8 Powerful Hacks to Ace Data Science Hackathons

Lakshay Arora 17 Mar, 2020

Introduction

Like any discipline, data science also has a lot of “folk wisdom”. This folk wisdom is hard to teach formally or in a structured manner but it’s still crucial for success, both in the industry as well as in data science hackathons.

Newcomers in data science often form the impression that knowing all the machine learning algorithms would be a panacea for all machine learning problems. They tend to believe that once they know the most common algorithms (Gradient Boosting, Extreme Gradient Boosting, Deep Learning architectures), they will be able to perform well in their roles/organizations or top the leaderboards in competitions.

Sadly, that does not happen!


If you’re reading this, there’s a high chance you’ve participated in a data science hackathon (or several of them). I’ve personally struggled to improve my model’s performance in my initial hackathon days and it was quite a frustrating experience. I know a lot of newcomers who’ve faced the same obstacle.

So I decided to put together 8 powerful hacks that have helped me climb to the top echelons of hackathon leaderboards. Some of these hacks are straightforward, while a few will take practice to master.

And keeping the theme of hackathons going, make sure you register for the ‘Women-in-the-loop’ hackathon by Bain and Company!

 

The 8 Hacks to Ace Data Science Hackathons

  1. Understand the Problem Statement
  2. Build your Hypothesis Set
  3. Team Up
  4. Create a Generic Codebase
  5. Feature Engineering is the Key
  6. Ensemble (Almost) Always Wins
  7. Discuss! Collaborate!
  8. Trust Local Validation

 

Hack #1: Understand the Problem Statement

Seems too simple to be true? And yet, understanding the problem statement is the very first step to acing any data science hackathon:

  • Without understanding the problem statement, the data, and the evaluation metric, most of your work is fruitless. Spend time reading as much as possible about them and gain some functional domain knowledge if possible
  • Re-read all the available information. It will help you figure out an approach and direction before writing a single line of code. Only once you are very clear about the objective should you move on to the data exploration stage

Let me show you an example of a problem statement from a data science hackathon we conducted. Here’s the Problem Statement of the BigMart Sales Prediction problem:

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.

Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

The idea is to find the properties of a product and store which impact the sales of a product. Here, you can think of some of the factors based on your understanding that can make an impact on the sales and come up with some hypotheses without looking at the data.

 

Hack #2: Build your Hypothesis Set

  • Next, you should build a comprehensive list of hypotheses. Note that I am asking you to build this set of hypotheses before looking at the data; this ensures you are not biased by what you see in the data
  • It also gives you time to plan your workflow better. If you are able to think of hundreds of features, you can prioritize which ones you would create first
  • Read more about hypothesis generation here

I encourage you to go through the hypothesis generation stage for the BigMart Sales problem in this article: Approach and Solution to break in Top 20 of Big Mart Sales prediction. There, we divided the hypotheses into store-level and product-level ones. Let me illustrate a few examples here.

 

Store-Level Hypotheses:

  1. City type: Stores located in urban or Tier 1 cities should have higher sales because of the higher income levels of people there
  2. Population Density: Stores located in densely populated areas should have higher sales because of more demand
  3. Store Capacity: Stores that are very big in size should have higher sales as they act like one-stop-shops and people would prefer getting everything from one place
  4. Ambiance: Stores that are well-maintained and managed by polite and humble people are expected to have higher footfall and thus higher sales

 

Product-Level Hypotheses:

  1. Brand: Branded products should have higher sales because customers place more trust in them
  2. Packaging: Products with good packaging can attract customers and sell more
  3. Utility: Daily-use products should sell more than specialty products
  4. Advertising: Products that are advertised better in the store should have higher sales in most cases
  5. Promotional Offers: Products accompanied by attractive offers and discounts will sell more

 

Hack #3: Team Up!

  • Build a team and brainstorm together. Try to find people with complementary skill sets. If you have been a coder all your life, team up with someone who has been on the business side of things
  • This will give you a more diverse set of hypotheses and increase your chances of winning the hackathon. The only caveat is that you should both prefer the same tool/language stack
  • It will save you a lot of time, and you will be able to experiment with several ideas in parallel and climb to the top of the leaderboard
  • Getting a good score early in the competition also helps you team up with higher-ranked participants

There have been plenty of instances where hackathons were won by a team rather than an individual.

 

Hack #4: Create a Generic Codebase

  • Save valuable time in your next hackathon by creating a reusable, generic codebase with functions for your favorite models that can be used across hackathons. For example:
    • Create a variety of time-based features if the dataset has a time feature (a small sketch follows this list)
    • Write a function that applies different types of encoding schemes
    • Write functions that return results from a variety of models so that you can choose your baseline model wisely and plan your strategy accordingly
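For the first idea, here is a minimal sketch of a reusable time-feature helper, assuming only that your dataset has some datetime column whose name you pass in:

```python
import pandas as pd

def add_time_features(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Derive generic calendar features from a datetime column (reusable across hackathons)."""
    df = df.copy()
    df[date_col] = pd.to_datetime(df[date_col])
    df["year"] = df[date_col].dt.year
    df["month"] = df[date_col].dt.month
    df["day"] = df[date_col].dt.day
    df["dayofweek"] = df[date_col].dt.dayofweek
    df["is_weekend"] = (df[date_col].dt.dayofweek >= 5).astype(int)
    return df
```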

Here is the kind of code snippet I generally use to encode my train, test, and validation sets. I just need to pass a dictionary specifying which columns to encode and what kind of encoding scheme to use for each. I won't recommend using exactly the same code, but I do suggest keeping some of these functions handy so that you can spend more time on brainstorming and experimenting.
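A minimal sketch of such a helper, assuming label and one-hot encoding as the supported schemes and pandas/scikit-learn as the stack (not the exact function from the original post):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_features(train, test, encoding_map):
    """Apply the requested encoding scheme to each column.

    encoding_map is a dict like {"label": [cols], "onehot": [cols]}.
    Hypothetical helper for illustration only.
    """
    train, test = train.copy(), test.copy()

    for col in encoding_map.get("label", []):
        le = LabelEncoder()
        # Fit on combined values so unseen test categories don't break transform
        le.fit(pd.concat([train[col], test[col]]).astype(str))
        train[col] = le.transform(train[col].astype(str))
        test[col] = le.transform(test[col].astype(str))

    onehot_cols = encoding_map.get("onehot", [])
    if onehot_cols:
        combined = pd.concat([train, test], keys=["train", "test"])
        combined = pd.get_dummies(combined, columns=onehot_cols)
        train, test = combined.xs("train"), combined.xs("test")

    return train, test
```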

Here is a sample of how I use the above function. I just need to provide a dictionary where the keys are the types of encoding I want and the values are the names of the columns I want to encode:
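For example, with the sketch above (the column names here are only placeholders for whatever categorical columns your dataset has):

```python
# Hypothetical usage of the sketch above; the column names are placeholders
encoding_map = {
    "label": ["Outlet_Identifier", "Item_Type"],
    "onehot": ["Outlet_Size", "Outlet_Location_Type"],
}
train_encoded, test_encoded = encode_features(train, test, encoding_map)
```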

 

 

  • You can also use libraries like pandas-profiling to get a quick overview of the dataset right after reading in the data:
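A minimal sketch, assuming the pandas-profiling package is installed and the training file is named train.csv:

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("train.csv")  # assumed file name

# Generates an HTML report with distributions, correlations, and missing-value stats
profile = ProfileReport(df, title="Quick EDA")
profile.to_file("eda_report.html")
```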

 

Hack #5: Feature Engineering is Key

More data beats clever algorithms, but better data beats more data.

– Peter Norvig

Feature engineering! This is one of my favorite parts of a data science hackathon. I get to tap into my creative juices when it comes to feature engineering – and which data scientist doesn’t like that?
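To make this concrete, here is a small, hedged sketch of the kind of features one might derive for the BigMart problem; the column names (Outlet_Establishment_Year, Item_MRP, Item_Weight, Item_Visibility, Item_Type) are assumptions about the dataset:

```python
import pandas as pd

def add_bigmart_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative feature engineering; column names are assumed."""
    df = df.copy()

    # How long a store has been operating (the sales data was collected in 2013)
    df["Outlet_Age"] = 2013 - df["Outlet_Establishment_Year"]

    # Price per unit weight as a rough proxy for how "premium" an item is
    df["Price_Per_Weight"] = df["Item_MRP"] / df["Item_Weight"]

    # Item visibility relative to the average visibility of its item type
    df["Visibility_Ratio"] = (
        df["Item_Visibility"]
        / df.groupby("Item_Type")["Item_Visibility"].transform("mean")
    )

    return df
```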

 

Hack #6: Ensemble (Almost) Always Wins

  • 95% of winners have used ensemble models in their final submissions in DataHack hackathons
  • Ensemble modeling is a powerful way to improve the performance of your model. It is the art of combining the diverse predictions of individual models to improve the stability and predictive power of the final model
  • You will rarely find a data science hackathon whose top finishing solutions don't rely on ensemble models
  • You can learn more about the different ensemble techniques from the following articles:
    1. Basics of Ensemble Learning
    2. A Comprehensive Guide to Ensemble Learning

Here is an example of an advanced ensemble technique: the 3-level stacking used by Marios Michailidis:
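As a much simpler illustration of the stacking idea (not the original 3-level pipeline), here is a two-level sketch using scikit-learn's StackingRegressor on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge

# Synthetic data stands in for real competition data
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Level 1: diverse base models; Level 2: a simple meta-model trained on their out-of-fold predictions
stack = StackingRegressor(
    estimators=[
        ("gbm", GradientBoostingRegressor(random_state=42)),
        ("rf", RandomForestRegressor(n_estimators=200, random_state=42)),
    ],
    final_estimator=Ridge(),
    cv=5,
)
stack.fit(X, y)
```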

 

Hack #7: Discuss! Collaborate!

  • Stay up to date with forum discussions to make sure that you are not missing out on any obvious detail regarding the problem
  • Do not hesitate to ask questions on the forums or message other participants directly

 

Hack #8: Trust Local Validation

  • Do not jump into building models by dumping data into the algorithms. While it is useful to get a sense of basic benchmarks, you need to take a step back and build a robust validation framework
  • Without validation, you are just shooting in the dark. You will be at the mercy of overfitting, leakage and other possible evaluation issues
  • By replicating the competition's evaluation mechanism locally, you can iterate faster, measure improvements reliably, and make sure your model is robust across different subsets of the train/test data (a minimal sketch of such a setup follows the quote below)
  • Have a robust local validation set and avoid relying too much on the public leaderboard, as that can lead to overfitting and a big drop in your private rank
  • In the Restaurant Revenue Prediction contest, a team that was ranked first on the public leaderboard slipped down to rank 1962 on the private leaderboard

“The first we used to determine which rows are part of the public leaderboard score, while the second is used to determine the correct predictions. Along the way, we encountered much interesting mathematics, computer science, and statistics challenges.”

Source: BAYZ Team, Kaggle
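As mentioned above, here is a minimal sketch of a local validation setup, assuming an RMSE-style evaluation metric and scikit-learn; synthetic data stands in for the real competition data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def local_cv_score(model, X, y, n_splits=5):
    """K-fold cross-validation mirroring an RMSE-style leaderboard metric."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, valid_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[valid_idx])
        scores.append(np.sqrt(mean_squared_error(y[valid_idx], preds)))
    return np.mean(scores), np.std(scores)

# Trust the mean and spread of these folds more than small public-leaderboard movements
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=42)
mean_rmse, std_rmse = local_cv_score(RandomForestRegressor(random_state=42), X, y)
print(f"Local CV RMSE: {mean_rmse:.3f} +/- {std_rmse:.3f}")
```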

 

Final Thoughts

These 8 hacks have stood me in good stead regardless of the hackathon I'm participating in. Sure, a few tweaks here and there are necessary, but having a solid framework and structure in place will take you a long way towards success in data science hackathons.

I would love to hear your frameworks, hacks, and approaches to hackathons. Share your thoughts in the comments section below.

Use these hacks and participate in Women-in-the-loop: Data Science Hackathon by Bain & Company.

