Dealing with large-scale data has always been a challenging task for data scientists. With limited resources and computational power, it often becomes a daunting experience.
Pandas is one of the most commonly used Python libraries, but running it on a single core is not enough for large datasets. Most users do not want to optimise their entire workflow just to fit the hardware they already have; they do, however, want Pandas to run faster regardless of the size of the data.
So researchers at Berkeley have come up with Pandas on Ray, a library that wraps Pandas and transparently distributes the data and computation. It is aimed at existing Pandas users who want their programs to run quicker and scale better without making huge changes to the code.
Ray is a flexible, high-performance distributed execution framework.
According to the researchers, “The user does not need to know how many cores their system or cluster has, nor do they need to specify how to distribute the data”. Even on a single machine, users can continue using their usual Pandas notebooks but will experience a significant upgrade in processing speed.
All the user needs to do is replace the usual Pandas import statement with the following:
import ray.dataframe as pd
And you’re good to go! Ray is initialized automatically with the number of cores available to you.
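To get a feel for what that automatic setup spares you, here is a minimal sketch of the manual alternative: detect the number of cores, split the rows into that many chunks, process each chunk separately and combine the partial results. It uses only the Python standard library, and the function names (`split_rows`, `parallel_column_sum`) are hypothetical helpers made up for illustration. This is not how Pandas on Ray is implemented; it only shows the kind of bookkeeping the library is meant to hide.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, n_parts):
    """Split a list of rows into at most n_parts contiguous chunks."""
    n_parts = max(1, min(n_parts, len(rows)))
    size, extra = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional row each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def parallel_column_sum(rows, column, n_workers=None):
    """Sum one column by computing a partial sum per chunk, then combining."""
    # Assumption: default to one worker per core reported by the OS.
    n_workers = n_workers or os.cpu_count() or 1
    chunks = split_rows(rows, n_workers)
    # Threads keep this sketch simple; a real distributed framework would
    # use separate worker processes or machines for CPU-bound work.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(lambda chunk: sum(r[column] for r in chunk), chunks)
        return sum(partials)


# Hypothetical data: 100 rows with a single "price" column.
rows = [{"price": i} for i in range(1, 101)]
total = parallel_column_sum(rows, "price")
print(total)
```

With a library that distributes the work for you, none of this chunking or worker management needs to appear in your own code, which is the whole point of the one-line import change.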
You can read the official research paper, which includes a dataset and a demo on how to use the library, on Berkeley’s blog here.
We have also covered another product from Ray, a reinforcement library called RLlib, which you can read about here.
While still very much in its nascent stages, this is shaping up to be a very promising library. Heavy datasets tend to be problematic when computational resources are limited, so Pandas on Ray should provide a workaround for that.
This is a good alternative to Dask, though it is not at the same level yet. You can read about the difference between Ray and Dask here.
It is not available for Windows yet and there is no word on when that might happen. Currently, it can be used on both Mac and Linux machines.
Are you planning to use this library? Let us know in the comments section below.