Removing Duplicated Data in Pandas: A Step-by-Step Guide

Stephen Roddewig
Stephen Roddewig

Updated:

Published:

Data analysis is a key part of how organizations make decisions. However, all data is not created equal, and data cleansing is a crucial first step before generating any insights. You want to confirm that all data present is as valid and stable as possible before performing any analysis. To make good decisions, you need good data.

Data scientist removing duplicated rows with pandas

Download Now: An Introduction to Python [Free Guide]

Removing duplicate values from your data set plays an important role in the cleansing process. Duplicate data takes up unnecessary storage space and slows down calculations at a minimum. At worst, duplicate data can skew analysis results and threaten the integrity of the data set.

pandas is an open-source Python library that optimizes storage and manipulation of structured data. The framework also has built-in support for data cleansing operations, including removing duplicate rows and columns.

This post will focus on the .duplicated and .drop_duplicates methods of the pandas library. Topics we'll cover include:

How to Count the Number of Duplicated Rows in Pandas DataFrames

Best for: inspecting your data sets for duplicates without having to manually comb through rows and columns of data

Before we start removing duplicate rows, it's wise to get a better idea of where the duplicates are in our data set. Then we can decide how best to deal with them. To gather this intelligence, we'll turn to the aptly named .duplicated method.

Below is a product inventory of kitchen cutlery and utensils. This will be our base DataFrame over the following tutorials:

DataFrame with columns of kitchen utensil names, prices, product skus, and department name printed to the terminal

With the .duplicated method, we can identify which rows are duplicates:

 

kitch_prod_df.duplicated()

Here, we are calling .duplicated() on our DataFrame kitch_prod_df. Once executed, .duplicated() returns booleans organized into a pandas Series (also known as a column), which we can inspect by printing to the terminal:

Boolean list showing True values for fourth and fifth rows of source DataFrame printed to the terminal

The easiest way to interpret these results is to think of .duplicated as asking the question "Is this row duplicated?" The method looks through each row to determine if all the data present completely matches the values of any other row in the DataFrame. If it finds an exact match to another row, it returns True. In other words, it answers the question "Is this row duplicated?" with "Yes."

This means that any rows that are marked False are unique. To return to the printout, this means we have four unique rows and two duplicate rows ("fork" and "spoon" at index positions 3 and 4).

The default behavior of .duplicated is to mark the second instance of a row as the duplicate (and any other instances that follow it). It considers the first row to contain the matching data as unique. However, you can modify this behavior by changing the arguments.

Let's review the arguments available for the .duplicated method and how you can adjust them to fit your needs.

Pandas .duplicated Arguments

1. The Keep Argument

Best for: fine tuning how .duplicated treats the origin row for any duplicates it finds

Let's say we want to flag the origin row and any subsequent rows as duplicates. To do this, we turn to the keep argument:

 

kitch_prod_df.duplicated(keep = False)

Here, we call .duplicated() with the argument keep set to the boolean False.

In effect, this tells .duplicated we don't want to "keep" any duplicates. Now, any two or more rows that share the same data will all be marked True in the output. We can see the original DataFrame and the boolean Series printouts below.

DataFrame with original rows and Series showing first, second, fourth, and fifth rows marked True printed to the terminal

As expected, the first and second rows that were marked False in our last example are now marked True since their data matches the duplicate rows in the fourth and fifth positions. This leaves us with two unique rows.

The default value for keep is "first." This tells the .duplicated method that we want to mark the first and subsequent occurrences of any duplicated row as False and mark the last instance as True:

 

kitch_prod_df.duplicated(keep = 'first')

Since "first" is the default keep argument, we don't need to include it when calling .duplicated(), but it's still important to understand how it works. We can confirm the method performed as expected by viewing the printout:

Boolean Series with fourth and fifth rows showing True printed to the terminal

We can see the fourth and fifth rows are marked as duplicates, which matches the result when we executed .duplicated with no keep argument specified in the original example.

If you want to keep the last instance of a duplicate row, you can invert the default behavior of .duplicated by changing the argument to "last":

Boolean Series with first and second rows marked True printed to the terminal

Instead of marking the fourth and fifth rows, we have now identified the initial occurrences of the duplicate data as True in the first and second rows.

In some scenarios, it makes sense to limit our search for duplicates to specific columns instead of the entire DataFrame. To achieve this, we use the subset argument.

2. The Subset Argument

Best for: narrowing the scope of the rows that .duplicated inspects to suspect columns to produce more focused results

Scanning the entire DataFrame isn't always practical, especially if it's dozens of columns or hundreds of rows. If we identify columns where duplicates are likely to occur, we can pass the column names to .duplicated with the subset argument.

The original DataFrame for reference:

DataFrame with columns of kitchen utensil names, prices, product skus, and department name printed to the terminal

In this code, we are checking the DataFrame for duplicates in the "department" column:

 

kitch_prod_df.duplicated(subset = 'department')

Here, we set the subset argument equal to "department" so that .duplicated() only examines the column matching that label. The output is below.

Boolean Series showing True in rows two through six printed to the terminal

If we went off these results, we could conclude that our entire DataFrame is full of duplicate records. However, with context, we would realize that all the products are under the "kitchen" department label. Fortunately, the subset argument allows us to expand our scope beyond a single column by using a Python list:

 

kitch_prod_df.duplicated(subset = ['sku', 'department'])

This time, we are searching for duplicates in both the "sku" and "department" columns. Only if the data is the exact same in both columns will a row get marked as a duplicate. Even though every record shares the same department name in our DataFrame, only the fourth and fifth rows are flagged as duplicates in our results:

Boolean Series with fourth and fifth rows showing True printed to the terminal

The keep and subset arguments are also available to us in the .drop_duplicates method, which we'll review in the following sections.

How to Drop Duplicate Rows in Pandas DataFrames

Best for: removing rows you have determined are duplicates of other rows and will skew analysis results or otherwise waste storage space

Now that we know where the duplicates are in our DataFrame, we can use the .drop_duplicates method to remove them.

The original DataFrame for reference:

DataFrame with columns of kitchen utensil names, prices, product skus, and department name printed to the terminal

By default, .drop_duplicates will remove the second and additional occurrences of any duplicate rows when called:

 

kitch_prod_df.drop_duplicates(inplace = True)

In the above code, we call .drop_duplicates() on the kitch_prod_df DataFrame with the inplace argument set to True. This allows us to modify the existing DataFrame instead of returning a new copy of the DataFrame with the duplicates removed.

We can confirm that the duplicated rows were dropped by printing to the terminal. We can compare the original and new DataFrames and see that the rows at index positions 3 and 4 have been removed. Note that the index has not reordered in the modified DataFrame.

Original and modified DataFrames with fourth and fifth rows removed from modified DataFrame printed to the terminal

The inplace argument is not the only argument available with .drop_duplicates. We can use the keep argument to modify which rows are removed, similar to how we modified which rows were tagged True with the .duplicated method.

In this example, we have instructed .drop_duplicates() to remove the first instance of any duplicate row:

 

kitch_prod_df.drop_duplicates(keep = 'last', inplace = True)

The output is below.

DataFrame with dropped first and second rows printed to the terminal

Here we have removed the first two rows and retained the others. If we wanted to remove all duplicate rows regardless of their order, we can set the keep argument to the boolean False.

Subset is also available to us to narrow the columns which .drop_duplicates uses to locate and drop duplicate rows. Below, we are identifying the column named "sku" through the subset argument:

 

kitch_prod_df.drop_duplicates(subset = 'sku', inplace = True)

The results are below.

DataFrame with dropped fourth and fifth rows printed to the terminal

Here, .drop_duplicates has identified duplicates in the values of the "sku" column and dropped the rows belonging to those repeat values — the fourth and fifth rows.

This video from DataDaft provides a live demonstration of the .drop_duplicates method:

We now understand how the .drop_duplicates method removes identical rows and how its arguments allow us to refine its behavior. Now, let's look at how we can modify this approach to remove duplicate columns.

How to Drop Duplicate Columns in Pandas DataFrames

Best for: removing columns you have determined are duplicates of other columns with only a slight adjustment to the syntax for dropping identical rows

You may encounter columns that hold identical values that need to be removed. However, .drop_duplicates only works for rows.

Fortunately, Python offers a trick for this use case. In short, we will temporarily make our columns into rows.

Our DataFrame for reference, now including a second column labeled "products" and containing the same values as the original "product" column.

DataFrame with columns of kitchen utensil names, prices, product skus, department name, and a duplicate product column printed to the terminal

The .T property allows us to invert the axis of our DataFrame. You can see the output of printing kitch_prod_df.T to the terminal below.

DataFrame inverted so that column names are now row labels and each column is a row printed to the terminal

Now, .drop_duplicates can look through our DataFrame and remove duplicate rows, because the columns are the rows.

 

no_dup_columns = kitch_prod_df.T.drop_duplicates().T

Here, we are inverting the kitch_prod_df DataFrame with the .T property so that we can call .drop_duplicates() to examine the rows. By default, .drop_duplicates() will keep the first instance of a duplicate row and remove any others. We are then removing the inversion with the second .T property so the DataFrame will display normally and saving it to a new variable no_dup_columns.

The output of printing the new DataFrame is below.

DataFrame with "products" column now removed printed to the terminal

Here, we can see the duplicate column "products" is now removed from our DataFrame. Even though the column names between "product" and "products" were different, their values were identical.

We can modify how .drop_duplicates removes duplicate columns through the keep and subset arguments just like with dropping duplicate rows.

Improve data analysis and discovery by dropping duplicates in Pandas.

Good decisions rely on good data, making data cleansing an essential process for data-driven businesses. A major part of improving data quality is removing duplicate data, which can skew the results of data analysis and take up unnecessary storage space. The pandas library supports this critical need with built-in methods to find and remove duplicate rows and columns. Armed with these tools, you are ready to improve your business outcomes.

python

Topics: What Is Python?

Related Articles

We're committed to your privacy. HubSpot uses the information you provide to us to contact you about our relevant content, products, and services. You may unsubscribe from these communications at any time. For more information, check out our Privacy Policy.

A guide for marketers, developers, and data analysts.