pandas is an open-source Python library that simplifies and speeds data analysis. One of pandas' primary offerings is the DataFrame, which is a two-dimensional data structure that stores information in rows and columns — similar to a table in a database.
Below is an example DataFrame containing information about different car models printed to the terminal. We will be using this DataFrame for our tutorials.
When working with large data sets, you may need to remove one or more columns from a DataFrame. You may want to see how dropping a column affects the outcome of your analysis — or maybe you have discovered the values in a column are incorrect or outdated. Fortunately, pandas provides multiple approaches to remove columns you no longer need.
In this post, we will examine three different approaches for how to drop multiple columns in pandas DataFrames:
The most common approach for dropping multiple columns in pandas is the aptly named .drop method. Just like it sounds, this method was created to allow us to drop one or multiple rows or columns with ease. We will focus on columns for this tutorial.
1. Drop a single column.
What we like: You have the flexibility to remove a single column of data for more methodical testing of the modified DataFrame.
Let's review the use case of dropping a single column to familiarize ourselves with the syntax before moving on to multiple columns.
In this example, the code removes the named column "top_speed" by calling .drop() on the existing DataFrame:
Let's break down each of the arguments inside the parentheses:
'top_speed': The name of the column to drop. The argument in the first position will always be the column(s) you want .drop to remove.
axis = 1: Because the .drop method can remove columns or rows, you have to specify which axis the first argument belongs in. If axis is set to 0, then .drop would look for a row named 'top_speed' to drop.
inplace = True: The default behavior of .drop is to return a new DataFrame instead of modifying the existing car_df DataFrame. Setting inplace to the boolean True reverses that behavior so that the "top_speed" column is dropped from the original car_df.
Once this statement has executed, you can see the results by printing the modified DataFrame to the terminal with the print() function:
The output is below.
You've now dropped your first column. The next sections will focus on different ways to remove multiple columns with the .drop method.
2. Drop multiple columns by name.
What we like: You can be specific about the columns being dropped based on the context (e.g. your analysis doesn't require top speed values).
To build on the previous example where we dropped one named column, we'll now provide multiple column names to remove in a list:
By wrapping the column names in square brackets ([ ]), you've created a Python list. The .drop() method will look through the DataFrame for exact matches to the strings provided ("safety_rating" and "passenger_capacity") and drop those columns. You can provide as many names as columns you want to remove.
The output of executing this expression is printed below.
Note that naming columns relies on exact matches. This means that the arguments are both case-sensitive and intolerant of typos. In other words, passing "Safety_rating" or "safety_rtaing" into .drop will result in a KeyError.
This video from CodeWithData provides a live walkthrough of the previous two methods for removing columns in pandas:
3. Drop multiple columns by index.
What we like: You don't need to know the column names in advance and aren't at the mercy of typos.
The syntax for identifying columns by their index is a bit more complex than naming multiple columns. The key distinction is that you are now providing the integer label of a column. Because DataFrames are zero-indexed, this means that index 0 corresponds to the first column, index 1 corresponds to the second column, and so on.
In this scenario, the index list tells .drop to remove the columns at the third and sixth positions:
Because .drop() expects column names instead of index integers, you use the .columns property of the car_df DataFrame to retrieve the column names corresponding to index values 2 and 5. You can see the list pulled by running car_df.column[[2, 5]] in the screenshot:
.drop() is then able to remove these columns now that they are named, as confirmed in the printout:
So, why use index values if they require the extra step with the .column property?
You may not always know the column names of your DataFrame in advance, and this method removes the need to explicitly name them. You also remove the risk of errors from typos or mismatched capitalizations when using integers.
4. Drop multiple columns in a named range with .loc.
What we like: You can drop multiple columns without naming each of them.
Instead of having to specify each column name, you can provide a range to the .drop method with another built-in method for pandas called .loc. Short for location, .loc returns a cross-section of a DataFrame based on the row and column labels provided:
Here you are specifying all rows with the colon (:) as the first argument of .loc. The second argument selects all columns between the "top_speed" and "passenger_capacity" columns. Together, these arguments return a subset of the DataFrame consisting of three columns and all the rows within them for .drop() to remove.
The output is below.
Ranges, also known as slices, save you the trouble of naming every column to remove. This may seem trivial when removing a few columns, but in a DataFrame with dozens of columns, using .loc can save a lot of time.
5. Drop multiple columns in an index range with .iloc.
What we like: You can drop multiple columns without needing to know the column names, and this method avoids the pitfall of string mismatches.
The syntax for using an index range is almost the same as a named range. The key distinctions are that you are using integers instead of names to specify columns and that you are using the .iloc method. Short for integer location, .iloc is a counterpart to the .loc method and functions the same as .loc except that it accepts integers instead of strings.
In this example, you are removing the first four columns of the DataFrame:
Like the previous example, you use the colon (:) to specify all rows and add a second argument of two values joined by a colon to specify a range of columns after the comma (,). Since integer ranges are exclusive, the range concludes at the fifth (index 4) column, which means the fifth column is still included in the new DataFrame:
Over the following sections, we will examine two more approaches to dropping columns beyond the .drop method.
Method 2: The Difference Method
What we like: You can name only the columns you want to keep quickly and easily.
We now return to the .columns property to examine a new method: .difference. This use case also relies on the .drop method but in an entirely new way: instead of naming the columns we want to drop, we name the columns we want to keep.
In this example, the expression modifies the DataFrame to only retain the columns labeled "make," "model," and "avg_speed":
The easiest way to understand what is going on is to think back to the drop multiple columns by index example from the last section. In that use case, you used the .columns property to retrieve the names of columns based on their index values.
Here, the .difference() method takes a list of column names and returns any column names in the DataFrame not included in the list you provided. .drop() then removes the remaining columns as usual, resulting in a DataFrame with the three columns you explicitly named:
In a DataFrame with dozens of columns, the .difference method provides a simple way to retrieve a few rows of importance without using ranges.
Method 3: The Iterative Approach
What we like: You can use advanced search methods not available when specifying indexes or column names.
The iterative approach is a more advanced approach to dropping columns that leaves specifying columns behind for logical operators. While it requires a bit more setup, this process is a powerful and flexible way to dive deeper into your data.
In this example, you are removing any columns if their names contain the phrase "speed":
for col in car_df.columns:
if 'speed' in col:
Let's look at each level of this expression:
for col in car_df.columns: This is a for loop that will pass over every value in a provided list. In this case, the list is the column names pulled from the .columns property of car_df. col is a placeholder for the items being iterated over. You could also use "item," "column_name," "col_name," or anything else.
if 'speed' in col: Now that the Python loop is established, the expression will evaluate whether each column name meets the specified criteria. In this case, the if statement checks whether the phrase "speed" is in each column name (col). If the statement evaluates to true, the expression then executes the final line. If not, then the loop continues to the next column name until all have been evaluated.
del car_df[col]: If "speed" is in the column's name, the last statement is executed. The column name (col) is passed to the DataFrame (car_df) through the indexing operator ([ ]). Now that the column is accessed, it is deleted from the DataFrame with the del function.
You can confirm the logical expression is performing as expected by printing the modified DataFrame:
You can easily invert this statement by adding not to the if statement:
for col in car_df.columns:
if 'speed' not in col:
Now the DataFrame contains columns whose names only have "speed" in them:
Searching for a phrase in part of a larger string is known as partial string matching. The iterative approach means you no longer need to identify the exact column names to drop them. It can also be a quick way to discover related columns across a large DataFrame (e.g. columns containing speed values). This method can also be used to search for shared numbers in column names.
Drop columns from Pandas DataFrames to improve your analysis.
There are multiple ways to remove columns from DataFrames. You can name columns to drop, provide the index values, use ranges, provide the name of columns to keep, and define logic to loop through your DataFrame and filter out column names that don't match your criteria. No matter which method you use, you will be closer to discovering new insights and finding answers to your business's biggest questions.
Originally published Feb 23, 2022 7:00:00 AM, updated March 21 2022