Cleaning Data Frames Efficiently with Pandas

Jerry Hudspeth
4 min read · Jun 2, 2021

If you have ever heard anything about data science or data analytics, you’ve likely heard the saying, “data science is 80% cleaning data and 20% actually analyzing the data.” Whether or not you enjoy cleaning up a disorganized, chaotic data frame, we can agree that speeding this process up makes the job go by faster.

There are many ways to alter data frames with pandas, a data analysis library for Python created by Wes McKinney. But data frames can be thousands of rows long, and what was initially a minor waste of time can add up. Below is a brief overview of a data frame that we can use to demonstrate different data cleaning techniques.

Overview of sample data set

If we narrow our attention to the column titled “Name”, we can see that most of the values are cluttered; it would be easier to read just a last name than the entire name. Now let’s assume we’ve ignored all the advice about not using a for loop to iterate through a data frame and attempted to solve our problem that way.

An example of a for loop going through a data frame

I wrote an example for loop above that checks each of the 891 rows in the data frame individually to create a new column of the passengers’ last names. The first line, %%timeit, is a magic command that works in Jupyter notebooks and records how long the code takes to complete. The %% indicates that the command applies to the entire cell (%timeit would only time the line of code it was on). So we can tell that this for loop takes 137 ms ± 2.11 ms per loop. Note that “loop” here means one timed run of the whole cell, not one pass through our for loop, so 137 ms is the time needed to process all 891 rows. That may not sound like much, but the cost grows with every row: if it scales linearly, the same approach on a data frame of roughly 800,000 rows would take about two entire minutes. Enough to make 2 cups of minute rice. That is too slow and like my role model, Sonic the Hedgehog, I too, gotta go fast…
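Since the original screenshot is not reproduced here, the loop described above might look something like the sketch below. The sample rows, the “Last, Title First” name format, and the Last_Name column name are assumptions made for illustration, not the author’s exact code.

```python
import pandas as pd

# Hypothetical stand-in for the article's data frame (assumed name format:
# "Last, Title First").
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
    ]
})

# Visit every row one at a time and keep the text before the first comma.
# This relies on the default 0..n-1 integer index created above.
last_names = []
for i in range(len(df)):
    full_name = df.loc[i, "Name"]
    last_names.append(full_name.split(",")[0])

# Store the collected values as a new column.
df["Last_Name"] = last_names
print(df)
```

In a Jupyter notebook, putting %%timeit on the first line of the cell would time the whole cell, loop included.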

A faster way to edit data frames is with .apply() or .map(). For an element-wise operation on a single column the two behave essentially the same, and .map() is two characters shorter than .apply(), so we will use .map().

Immediately we can see that not only does this command take significantly less code, it also runs in a fraction of the time: 735 μs per loop instead of the previous 137 ms per loop. That is roughly 186 times faster, or a speed increase of about 18,500%.
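As a rough sketch of the .map() version described above, assuming the same hypothetical sample rows and “Last, Title First” name format (the author’s exact code is not shown here):

```python
import pandas as pd

# Hypothetical sample data; the name format is an assumption.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
    ]
})

# .map() calls the function once per value in the column and returns a
# new Series, so no explicit loop or temporary list is needed.
via_map = df["Name"].map(lambda full_name: full_name.split(",")[0])
df["Last_Name"] = via_map

# For an element-wise function like this one, Series.apply() gives the
# same values.
via_apply = df["Name"].apply(lambda full_name: full_name.split(",")[0])
same_result = via_map.tolist() == via_apply.tolist()
print(df)
print("map and apply agree:", same_result)
```

The whole transformation fits on one line, which is a large part of the appeal.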

But can we go even further and be faster than .map()?

The answer is yes, but we must first look at vectorization, and I certainly could not figure out how to make vectorization work in my code in the time crunch I was in. However, an explanation of the concept is as follows.

The principle behind vectorization comes from linear algebra, which works with whole vectors and matrices rather than individual numbers. The essence of vectorization is that a column of a data frame can be treated as an array, and a single command can then operate on the entire array at once. As one would expect, this saves a tremendous amount of time, because the work is done in optimized, compiled code instead of the Python interpreter running the same command separately for every row. An example of vectorization using numba that can be found in the pandas documentation is shown below.

Example of vectorization for a data frame
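To make the idea concrete without numba, here is a minimal sketch that uses only pandas and NumPy. It is not the numba example from the pandas documentation; the Fare column, its values, and the new column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names and values are assumptions.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
    ],
    "Fare": [7.25, 7.925, 8.05],
})

# Row-by-row version: Python handles one value at a time.
looped = [fare * 2 for fare in df["Fare"]]

# Vectorized version: one expression is applied to the whole column.
df["Double_Fare"] = df["Fare"] * 2

# NumPy functions accept a whole column as well.
df["Log_Fare"] = np.log(df["Fare"])

# For text, the .str accessor offers the same whole-column style.
df["Last_Name"] = df["Name"].str.split(",").str[0]
print(df)
```

One caveat: on numeric columns the vectorized form runs in compiled code and is usually much faster than a loop, but the .str methods on ordinary text columns still process values one by one internally, so they may not beat .map(). Timing both with %%timeit on your own data is the safest way to find out.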
