Data analysis is a key part of how organizations make decisions. However, not all data is created equal, and data cleansing is a crucial first step before generating any insights. You want to confirm that all data present is as valid and stable as possible before performing any analysis. To make good decisions, you need good data.
Removing duplicate values from your data set plays an important role in the cleansing process. Duplicate data takes up unnecessary storage space and slows down calculations at a minimum. At worst, duplicate data can skew analysis results and threaten the integrity of the data set.
pandas is an open-source Python library that optimizes the storage and manipulation of structured data. The library also has built-in support for data cleansing operations, including removing duplicate rows and columns.
This post will focus on the .duplicated and .drop_duplicates methods of the pandas library. Topics we'll cover include counting duplicated rows, the keep and subset arguments, dropping duplicate rows, and dropping duplicate columns.
How to Count the Number of Duplicated Rows in Pandas DataFrames
Best for: inspecting your data sets for duplicates without having to manually comb through rows and columns of data
Before we start removing duplicate rows, it's wise to get a better idea of where the duplicates are in our data set. Then we can decide how best to deal with them. To gather this intelligence, we'll turn to the aptly named .duplicated method.
Below is a product inventory of kitchen cutlery and utensils. This will be our base DataFrame over the following tutorials:
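The original table isn't reproduced here, so the snippet below builds a hypothetical version consistent with the results discussed in this tutorial (the "fork" and "spoon" rows recur at index positions 3 and 4); the SKU values are assumptions:

```python
import pandas as pd

# Hypothetical inventory data; the SKU values are illustrative.
# "fork" and "spoon" each appear twice, at index positions 0/3 and 1/4.
kitch_prod_df = pd.DataFrame({
    "sku": ["1001", "1002", "1003", "1001", "1002", "1004"],
    "product": ["fork", "spoon", "knife", "fork", "spoon", "spatula"],
    "department": ["kitchen"] * 6,
})

print(kitch_prod_df)
```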
With the .duplicated method, we can identify which rows are duplicates:
Here, we are calling .duplicated() on our DataFrame kitch_prod_df. Once executed, .duplicated() returns booleans organized into a pandas Series (a single labeled column of values), which we can inspect by printing to the terminal:
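A sketch of the call and its printout, using hypothetical sample data rebuilt here so the snippet runs on its own. As a bonus, summing the Series counts the duplicated rows, since True counts as 1:

```python
import pandas as pd

# Hypothetical sample data: "fork" and "spoon" recur at index 3 and 4
kitch_prod_df = pd.DataFrame({
    "sku": ["1001", "1002", "1003", "1001", "1002", "1004"],
    "product": ["fork", "spoon", "knife", "fork", "spoon", "spatula"],
    "department": ["kitchen"] * 6,
})

print(kitch_prod_df.duplicated())
# 0    False
# 1    False
# 2    False
# 3     True
# 4     True
# 5    False
# dtype: bool

# True counts as 1, so .sum() gives a quick duplicate count
print(kitch_prod_df.duplicated().sum())  # 2
```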
The easiest way to interpret these results is to think of .duplicated as asking the question "Is this row duplicated?" The method looks through each row to determine if all the data present completely matches the values of any other row in the DataFrame. If it finds an exact match to another row, it returns True. In other words, it answers the question "Is this row duplicated?" with "Yes."
This means that any rows that are marked False are unique. To return to the printout, this means we have four unique rows and two duplicate rows ("fork" and "spoon" at index positions 3 and 4).
The default behavior of .duplicated is to mark the second instance of a row as a duplicate (along with any other instances that follow it). It treats the first row containing the matching data as unique. However, you can modify this behavior by changing the arguments.
Let's review the arguments available for the .duplicated method and how you can adjust them to fit your needs.
Pandas .duplicated Arguments
1. The Keep Argument
Best for: fine-tuning how .duplicated treats the origin row for any duplicates it finds
Let's say we want to flag the origin row and any subsequent rows as duplicates. To do this, we turn to the keep argument:
kitch_prod_df.duplicated(keep = False)
Here, we call .duplicated() with the argument keep set to the boolean False.
In effect, this tells .duplicated we don't want to "keep" any duplicates. Now, any two or more rows that share the same data will all be marked True in the output. We can see the original DataFrame and the boolean Series printouts below.
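Sketched with hypothetical sample data (rebuilt here so the snippet runs on its own), every row that has a twin is now flagged:

```python
import pandas as pd

# Hypothetical sample data: "fork" and "spoon" recur at index 3 and 4
kitch_prod_df = pd.DataFrame({
    "sku": ["1001", "1002", "1003", "1001", "1002", "1004"],
    "product": ["fork", "spoon", "knife", "fork", "spoon", "spatula"],
    "department": ["kitchen"] * 6,
})

# keep=False flags every member of a duplicate group, origin row included
print(kitch_prod_df.duplicated(keep=False))
# 0     True
# 1     True
# 2    False
# 3     True
# 4     True
# 5    False
# dtype: bool
```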
As expected, the first and second rows that were marked False in our last example are now marked True since their data matches the duplicate rows in the fourth and fifth positions. This leaves us with two unique rows.
The default value for keep is "first." This tells the .duplicated method that we want to mark the first occurrence of any duplicated row as False and mark subsequent instances as True:
kitch_prod_df.duplicated(keep = 'first')
Since "first" is the default keep argument, we don't need to include it when calling .duplicated(), but it's still important to understand how it works. We can confirm the method performed as expected by viewing the printout:
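With hypothetical sample data (rebuilt so the snippet runs on its own), the explicit keep='first' call produces the same Series as the default:

```python
import pandas as pd

# Hypothetical sample data: "fork" and "spoon" recur at index 3 and 4
kitch_prod_df = pd.DataFrame({
    "sku": ["1001", "1002", "1003", "1001", "1002", "1004"],
    "product": ["fork", "spoon", "knife", "fork", "spoon", "spatula"],
    "department": ["kitchen"] * 6,
})

# keep='first' is the default: first occurrence is unique, later ones flagged
print(kitch_prod_df.duplicated(keep="first"))
# 0    False
# 1    False
# 2    False
# 3     True
# 4     True
# 5    False
# dtype: bool
```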
We can see the fourth and fifth rows are marked as duplicates, which matches the result when we executed .duplicated with no keep argument specified in the original example.
If you want to keep the last instance of a duplicate row, you can invert the default behavior of .duplicated by changing the argument to "last":
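A sketch of the call with hypothetical sample data, rebuilt here so the snippet runs on its own:

```python
import pandas as pd

# Hypothetical sample data: "fork" and "spoon" recur at index 3 and 4
kitch_prod_df = pd.DataFrame({
    "sku": ["1001", "1002", "1003", "1001", "1002", "1004"],
    "product": ["fork", "spoon", "knife", "fork", "spoon", "spatula"],
    "department": ["kitchen"] * 6,
})

# keep='last' treats the last occurrence as unique and flags earlier ones
print(kitch_prod_df.duplicated(keep="last"))
# 0     True
# 1     True
# 2    False
# 3    False
# 4    False
# 5    False
# dtype: bool
```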
Instead of marking the fourth and fifth rows, we have now identified the initial occurrences of the duplicate data as True in the first and second rows.
In some scenarios, it makes sense to limit our search for duplicates to specific columns instead of the entire DataFrame. To achieve this, we use the subset argument.
2. The Subset Argument
Best for: narrowing the scope of the columns that .duplicated inspects to produce more focused results
Scanning the entire DataFrame isn't always practical, especially if it's dozens of columns or hundreds of rows. If we identify columns where duplicates are likely to occur, we can pass the column names to .duplicated with the subset argument.
The original DataFrame for reference:
In this code, we are checking the DataFrame for duplicates in the "department" column:
kitch_prod_df.duplicated(subset = 'department')
Here, we set the subset argument equal to "department" so that .duplicated() only examines the column matching that label. The output is below.
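With hypothetical sample data in which every row shares the "kitchen" department, only the first occurrence escapes the flag:

```python
import pandas as pd

# Hypothetical sample data: every product sits in the same department
kitch_prod_df = pd.DataFrame({
    "sku": ["1001", "1002", "1003", "1001", "1002", "1004"],
    "product": ["fork", "spoon", "knife", "fork", "spoon", "spatula"],
    "department": ["kitchen"] * 6,
})

# Only the "department" column is compared, so every row after the
# first "kitchen" entry counts as a duplicate
print(kitch_prod_df.duplicated(subset="department"))
# 0    False
# 1     True
# 2     True
# 3     True
# 4     True
# 5     True
# dtype: bool
```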
If we went off these results, we could conclude that our entire DataFrame is full of duplicate records. However, with context, we would realize that all the products are under the "kitchen" department label. Fortunately, the subset argument allows us to expand our scope beyond a single column by using a Python list:
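The two-column check might be sketched as follows, again with hypothetical data rebuilt so the snippet runs on its own:

```python
import pandas as pd

# Hypothetical sample data: "fork" and "spoon" recur at index 3 and 4
kitch_prod_df = pd.DataFrame({
    "sku": ["1001", "1002", "1003", "1001", "1002", "1004"],
    "product": ["fork", "spoon", "knife", "fork", "spoon", "spatula"],
    "department": ["kitchen"] * 6,
})

# A row is flagged only when BOTH the "sku" and "department" values
# match another row
print(kitch_prod_df.duplicated(subset=["sku", "department"]))
# 0    False
# 1    False
# 2    False
# 3     True
# 4     True
# 5    False
# dtype: bool
```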
This time, we are searching for duplicates in both the "sku" and "department" columns. A row is marked as a duplicate only if its values match another row in both columns. Even though every record shares the same department name in our DataFrame, only the fourth and fifth rows are flagged as duplicates in our results.
The keep and subset arguments are also available to us in the .drop_duplicates method, which we'll review in the following sections.
How to Drop Duplicate Rows in Pandas DataFrames
Best for: removing rows you have determined are duplicates of other rows and will skew analysis results or otherwise waste storage space
Now that we know where the duplicates are in our DataFrame, we can use the .drop_duplicates method to remove them.
The original DataFrame for reference:
By default, .drop_duplicates will remove the second and additional occurrences of any duplicate rows when called:
kitch_prod_df.drop_duplicates(inplace = True)
In the above code, we call .drop_duplicates() on the kitch_prod_df DataFrame with the inplace argument set to True. This allows us to modify the existing DataFrame instead of returning a new copy of the DataFrame with the duplicates removed.
We can confirm that the duplicated rows were dropped by printing to the terminal. We can compare the original and new DataFrames and see that the rows at index positions 3 and 4 have been removed. Note that the index has not reordered in the modified DataFrame.
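A self-contained sketch with hypothetical data, showing the in-place removal and the surviving index labels:

```python
import pandas as pd

# Hypothetical sample data: "fork" and "spoon" recur at index 3 and 4
kitch_prod_df = pd.DataFrame({
    "sku": ["1001", "1002", "1003", "1001", "1002", "1004"],
    "product": ["fork", "spoon", "knife", "fork", "spoon", "spatula"],
    "department": ["kitchen"] * 6,
})

# inplace=True mutates kitch_prod_df rather than returning a new copy
kitch_prod_df.drop_duplicates(inplace=True)

# Rows 3 and 4 are gone, and the index keeps its original labels: 0, 1, 2, 5
print(kitch_prod_df)
```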
The inplace argument is not the only argument available with .drop_duplicates. We can use the keep argument to modify which rows are removed, similar to how we modified which rows were tagged True with the .duplicated method.
In this example, we have instructed .drop_duplicates() to keep the first instance of any row with a duplicated "sku" value and remove the rest:
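The exact call isn't shown in the text; a plausible sketch with hypothetical data, spelling out the subset and keep arguments:

```python
import pandas as pd

# Hypothetical sample data: SKUs "1001" and "1002" each appear twice
kitch_prod_df = pd.DataFrame({
    "sku": ["1001", "1002", "1003", "1001", "1002", "1004"],
    "product": ["fork", "spoon", "knife", "fork", "spoon", "spatula"],
    "department": ["kitchen"] * 6,
})

# Compare only the "sku" column; keep the first occurrence of each SKU
# and drop the repeats (the fourth and fifth rows here)
deduped = kitch_prod_df.drop_duplicates(subset="sku", keep="first")
print(deduped)
```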
Here, .drop_duplicates has identified duplicates in the values of the "sku" column and dropped the rows belonging to those repeat values — the fourth and fifth rows.
This video from DataDaft provides a live demonstration of the .drop_duplicates method:
We now understand how the .drop_duplicates method removes identical rows and how its arguments allow us to refine its behavior. Now, let's look at how we can modify this approach to remove duplicate columns.
How to Drop Duplicate Columns in Pandas DataFrames
Best for: removing columns you have determined are duplicates of other columns with only a slight adjustment to the syntax for dropping identical rows
You may encounter columns holding identical values that need to be removed. However, .drop_duplicates only works on rows.
Fortunately, pandas offers a trick for this use case. In short, we will temporarily turn our columns into rows.
Our DataFrame for reference, now including a second column labeled "products" and containing the same values as the original "product" column.
The .T property allows us to invert the axis of our DataFrame. You can see the output of printing kitch_prod_df.T to the terminal below.
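A sketch with hypothetical data, including a duplicate "products" column; after .T, the column labels become the index:

```python
import pandas as pd

# Hypothetical data with a "products" column mirroring "product"
kitch_prod_df = pd.DataFrame({
    "sku": ["1001", "1002", "1003", "1004"],
    "product": ["fork", "spoon", "knife", "spatula"],
    "products": ["fork", "spoon", "knife", "spatula"],
    "department": ["kitchen"] * 4,
})

# .T swaps the axes: each former column is now a row labeled by its name
print(kitch_prod_df.T)
```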
Now, .drop_duplicates can look through our DataFrame and remove duplicate rows, because the columns are the rows.
Here, we invert the kitch_prod_df DataFrame with the .T property so that .drop_duplicates() can examine the former columns as rows. By default, .drop_duplicates() keeps the first instance of a duplicate row and removes any others. We then undo the inversion with a second .T so the DataFrame displays normally, and we save the result to a new variable, no_dup_columns.
The output of printing the new DataFrame is below.
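Putting the whole chain together with hypothetical data (note that a round trip through .T can change column dtypes to object, since a transposed row must hold mixed types):

```python
import pandas as pd

# Hypothetical data with a "products" column mirroring "product"
kitch_prod_df = pd.DataFrame({
    "sku": ["1001", "1002", "1003", "1004"],
    "product": ["fork", "spoon", "knife", "spatula"],
    "products": ["fork", "spoon", "knife", "spatula"],
    "department": ["kitchen"] * 4,
})

# Transpose, drop the now-duplicate "products" row (its values match
# "product" exactly), then transpose back
no_dup_columns = kitch_prod_df.T.drop_duplicates().T
print(no_dup_columns)
```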
Here, we can see the duplicate column "products" is now removed from our DataFrame. Even though the column names between "product" and "products" were different, their values were identical.
We can modify how .drop_duplicates removes duplicate columns through the keep and subset arguments just like with dropping duplicate rows.
Improve data analysis and discovery by dropping duplicates in Pandas.
Good decisions rely on good data, making data cleansing an essential process for data-driven businesses. A major part of improving data quality is removing duplicate data, which can skew the results of data analysis and take up unnecessary storage space. The pandas library supports this critical need with built-in methods to find and remove duplicate rows and columns. Armed with these tools, you are ready to improve your business outcomes.
Originally published Mar 7, 2022 7:00:00 AM, updated March 21, 2022