Learning how to drop multiple columns with pandas, the Python data analysis library, can simplify and speed up data analysis. One of pandas' primary offerings is the DataFrame, a two-dimensional data structure that stores information in rows and columns, similar to a table in a database.
When working with large datasets, you may need to remove columns from a DataFrame. This could be to analyze the impact of dropping a column or to remove incorrect or outdated values. Pandas offers multiple approaches to remove unnecessary columns.
Below is an example DataFrame containing information about different car models printed to the terminal. We will be using this DataFrame for our tutorials.
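As a minimal sketch, a DataFrame like the one used in this post could be built as follows. The column names come from the examples later in the post, but the values and the column order are illustrative assumptions, not the original data.

```python
import pandas as pd

# Hypothetical sample data: values and column order are assumptions for illustration.
car_df = pd.DataFrame({
    "make": ["Toyota", "Honda", "Ford"],
    "model": ["Corolla", "Civic", "Mustang"],
    "avg_speed": [60, 62, 75],
    "top_speed": [118, 125, 155],
    "safety_rating": [5, 5, 4],
    "passenger_capacity": [5, 5, 4],
})

# Print the full table to the terminal.
print(car_df)
```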
In this post, we will examine three different approaches for how to drop multiple columns in pandas DataFrames:
How to Drop Multiple Columns in Pandas
Method 1: The Drop Method
The most common approach for dropping multiple columns in pandas is the aptly named .drop method. Just like it sounds, this method was created to allow us to drop one or multiple rows or columns with ease. We will focus on columns for this tutorial.
1. Drop a single column.
What we like: You have the flexibility to remove a single column of data for more methodical testing of the modified DataFrame.
Let's review the use case of dropping a single column to familiarize ourselves with the syntax before moving on to multiple columns.
In this example, the code removes the named column "top_speed" by calling .drop() on the existing DataFrame:
car_df.drop('top_speed', axis = 1, inplace = True)
Let's break down each of the arguments inside the parentheses:
- 'top_speed': The name of the column to drop. The first positional argument is always the label, or list of labels, you want .drop to remove.
- axis = 1: Because the .drop method can remove columns or rows, you have to specify which axis the first argument refers to. With axis set to 0 (the default), .drop would look for a row labeled 'top_speed' instead.
- inplace = True: The default behavior of .drop is to return a new DataFrame instead of modifying the existing car_df DataFrame. Setting inplace to the boolean True reverses that behavior so that the "top_speed" column is dropped from the original car_df.
Once this statement has executed, you can see the results by printing the modified DataFrame to the terminal with the print() function:
The output is below.
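The single-column drop can be sketched end to end on a small stand-in DataFrame. The data here is a hypothetical assumption; the point is to contrast the default behavior (a new DataFrame is returned) with inplace = True (the original is modified and None is returned).

```python
import pandas as pd

# Hypothetical stand-in for car_df; values are assumptions for illustration.
car_df = pd.DataFrame({
    "make": ["Toyota", "Honda"],
    "model": ["Corolla", "Civic"],
    "top_speed": [118, 125],
})

# Default behavior: .drop returns a new DataFrame and leaves car_df untouched.
trimmed_df = car_df.drop('top_speed', axis=1)
columns_after_copy = list(car_df.columns)

# inplace=True modifies car_df itself, and the call returns None.
result = car_df.drop('top_speed', axis=1, inplace=True)

print(car_df)
```

If you prefer to keep the original DataFrame available for comparison, assigning the returned copy to a new variable is an alternative to inplace = True.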
You've now dropped your first column. The next sections will focus on different ways to remove multiple columns with the .drop method.
2. Drop multiple columns by name.
What we like: You can be specific about the columns being dropped based on the context (e.g. your analysis doesn't require top speed values).
To build on the previous example where we dropped one named column, we'll now provide multiple column names to remove in a list:
car_df.drop(['safety_rating', 'passenger_capacity'], axis = 1, inplace = True)
By wrapping the column names in square brackets, you create a Python list. The .drop() method will look for exact matches to the provided strings and drop those columns. You can provide as many names as columns you want to remove.
The output of executing this expression is printed below.
Note that column names are case-sensitive and typos will result in a KeyError when using the .drop method.
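The sketch below, on hypothetical data, shows the list-based drop and the KeyError raised by a mis-capitalized name. It also uses the errors parameter of .drop, which the post does not cover: passing errors='ignore' skips labels that are not found instead of raising.

```python
import pandas as pd

# Hypothetical stand-in for car_df; values are assumptions for illustration.
car_df = pd.DataFrame({
    "make": ["Toyota", "Honda"],
    "model": ["Corolla", "Civic"],
    "top_speed": [118, 125],
    "safety_rating": [5, 5],
    "passenger_capacity": [5, 5],
})

# Drop two named columns at once by passing a list.
car_df.drop(['safety_rating', 'passenger_capacity'], axis=1, inplace=True)

# Column names are case-sensitive: 'Top_Speed' does not match 'top_speed'.
try:
    car_df.drop(['Top_Speed'], axis=1)
    typo_raised = False
except KeyError:
    typo_raised = True

# errors='ignore' silently skips labels that do not exist.
unchanged_df = car_df.drop(['Top_Speed'], axis=1, errors='ignore')

print(car_df)
```

Silently ignoring missing labels can hide real typos, so the default of raising an error is usually the safer choice during exploration.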
3. Drop multiple columns by index.
What we like: You don't need to know the column names in advance and aren't at the mercy of typos.
When identifying columns by index, provide the integer position of each column in the DataFrame, starting from index 0 for the first column.
In this scenario, the index list tells .drop to remove the columns at the third and sixth positions:
car_df.drop(car_df.columns[[2, 5]], axis = 1, inplace = True)
Because .drop() expects column names instead of index integers, you use the .columns property of the car_df DataFrame to retrieve the column names corresponding to index values 2 and 5. Running car_df.columns[[2, 5]] on its own returns just those names.
.drop() is then able to remove these columns now that they are named, as confirmed in the printout:
So, why use index values if they require the extra step with the .columns property?
You may not always know the column names of your DataFrame in advance, and this method removes the need to explicitly name them. You also remove the risk of errors from typos or mismatched capitalizations when using integers.
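Here is a short sketch of the index-based drop, assuming a hypothetical six-column DataFrame. Which names sit at positions 2 and 5 depends entirely on the column order, which is an assumption here.

```python
import pandas as pd

# Hypothetical stand-in for car_df; values and column order are assumptions.
car_df = pd.DataFrame({
    "make": ["Toyota", "Honda"],
    "model": ["Corolla", "Civic"],
    "avg_speed": [60, 62],
    "top_speed": [118, 125],
    "safety_rating": [5, 5],
    "passenger_capacity": [5, 5],
})

# Translate positions 2 and 5 (third and sixth columns) into column names.
names_to_drop = car_df.columns[[2, 5]]
print(list(names_to_drop))

# .drop receives the names, not the integers.
car_df.drop(names_to_drop, axis=1, inplace=True)
print(car_df)
```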
4. Drop multiple columns in a named range with .loc.
What we like: You can drop multiple columns without naming each of them.
Instead of having to specify each column name, you can provide a range to the .drop method with another pandas feature called .loc. Short for location, .loc is an indexer that returns a cross-section of a DataFrame based on the row and column labels provided:
car_df.drop(car_df.loc[:, 'top_speed':'passenger_capacity'], axis = 1, inplace = True)
Here you are specifying all rows with the colon (:) as the first argument of .loc. The second argument selects all columns from "top_speed" through "passenger_capacity", including both endpoints. Together, these arguments return a subset of the DataFrame consisting of three columns and all the rows within them for .drop() to remove.
The output is below.
Ranges, also known as slices, save you the trouble of naming every column to remove. This may seem trivial when removing a few columns, but in a DataFrame with dozens of columns, using .loc can save a lot of time.
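The named-range drop can be sketched as follows, assuming a hypothetical column order in which "top_speed" through "passenger_capacity" spans three adjacent columns. This sketch passes the slice's .columns to .drop to make the intent explicit, which is a small variation on the expression above.

```python
import pandas as pd

# Hypothetical stand-in for car_df; values and column order are assumptions.
car_df = pd.DataFrame({
    "make": ["Toyota", "Honda"],
    "model": ["Corolla", "Civic"],
    "avg_speed": [60, 62],
    "top_speed": [118, 125],
    "safety_rating": [5, 5],
    "passenger_capacity": [5, 5],
})

# Label slices with .loc include BOTH endpoints.
range_df = car_df.loc[:, 'top_speed':'passenger_capacity']
range_names = list(range_df.columns)
print(range_names)

# Drop every column that fell inside the slice.
car_df.drop(range_df.columns, axis=1, inplace=True)
print(car_df)
```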
5. Drop multiple columns in an index range with .iloc.
What we like: You can drop multiple columns without needing to know the column names, and this method avoids the pitfall of string mismatches.
The syntax for using an index range is almost the same as a named range. The key distinctions are that you use integer positions instead of labels to specify columns and that you use the .iloc indexer. Short for integer location, .iloc is the counterpart to .loc: it selects by position rather than by label, and the end of an .iloc slice is excluded, whereas a .loc slice includes both endpoints.
In this example, you are removing the first four columns of the DataFrame:
car_df.drop(car_df.iloc[:, 0:4], axis = 1, inplace = True)
Like the previous example, you use the colon (:) to specify all rows. After the comma, a second argument of two integers joined by a colon specifies the range of columns. Because the end of an integer slice is exclusive, 0:4 covers indexes 0 through 3 and stops before the fifth column (index 4), so the fifth column remains in the modified DataFrame:
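A brief sketch of the positional slice, using a hypothetical six-column DataFrame, shows that 0:4 removes four columns and leaves the one at index 4 in place.

```python
import pandas as pd

# Hypothetical stand-in for car_df; values and column order are assumptions.
car_df = pd.DataFrame({
    "make": ["Toyota", "Honda"],
    "model": ["Corolla", "Civic"],
    "avg_speed": [60, 62],
    "top_speed": [118, 125],
    "safety_rating": [5, 5],
    "passenger_capacity": [5, 5],
})

# Positions 0, 1, 2, and 3 are selected; position 4 is excluded.
first_four = car_df.iloc[:, 0:4]
first_four_names = list(first_four.columns)

# Passing .columns makes it explicit that .drop receives names.
car_df.drop(first_four.columns, axis=1, inplace=True)
print(car_df)
```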
Over the following sections, we will examine two more approaches to dropping columns beyond the .drop method.
Method 2: The Difference Method
What we like: You can name only the columns you want to keep quickly and easily.
We now return to the .columns property to examine a new method: .difference. This use case also relies on the .drop method but in an entirely new way: instead of naming the columns we want to drop, we name the columns we want to keep.
In this example, the expression modifies the DataFrame to only retain the columns labeled "make," "model," and "avg_speed":
car_df.drop(car_df.columns.difference(['make', 'model', 'avg_speed']), axis = 1, inplace = True)
Here, the .difference() method returns the column names in the DataFrame that are not included in the provided list, much as car_df.columns[[2, 5]] supplied names in the drop-by-index example.
.drop() then removes the remaining columns as usual, resulting in a DataFrame with the three columns you explicitly named:
In a DataFrame with dozens of columns, the .difference method provides a simple way to keep a few columns of importance without using ranges.
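The keep-list approach can be sketched on hypothetical data as shown here. One detail worth knowing: .difference returns the leftover names in sorted order by default, which does not affect the drop but can be surprising when you print them.

```python
import pandas as pd

# Hypothetical stand-in for car_df; values and column order are assumptions.
car_df = pd.DataFrame({
    "make": ["Toyota", "Honda"],
    "model": ["Corolla", "Civic"],
    "avg_speed": [60, 62],
    "top_speed": [118, 125],
    "safety_rating": [5, 5],
    "passenger_capacity": [5, 5],
})

# Names NOT in the keep list, returned in sorted order.
to_drop = car_df.columns.difference(['make', 'model', 'avg_speed'])
print(list(to_drop))

# The surviving columns keep their original order.
car_df.drop(to_drop, axis=1, inplace=True)
print(car_df)
```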
Method 3: The Iterative Approach
What we like: You can use advanced search methods not available when specifying indexes or column names.
The iterative approach is a more advanced way to drop columns: instead of specifying columns directly, you write a logical test that decides which columns to remove. While it requires a bit more setup, this process is a powerful and flexible way to dive deeper into your data.
In this example, you are removing any columns if their names contain the phrase "speed":
for col in car_df.columns:
    if 'speed' in col:
        del car_df[col]
Let's look at each level of this expression:
- for col in car_df.columns: This for loop iterates over each column name in the car_df DataFrame. The variable "col" represents each individual column name. You can use any variable name you prefer, such as "item," "column_name," or "col_name."
- if 'speed' in col: The if statement checks if the phrase "speed" is in each column name. If true, the final line is executed; otherwise, the loop continues to the next column name.
- del car_df[col]: If "speed" is in the column's name, the column is accessed using the indexing operator ([ ]) and deleted from the DataFrame with the del statement.
You can confirm the logical expression is performing as expected by printing the modified DataFrame:
You can easily invert this statement by adding not to the if statement:
for col in car_df.columns:
    if 'speed' not in col:
        del car_df[col]
Now the DataFrame contains only the columns whose names have "speed" in them:
The iterative approach allows you to search for a specific phrase within column names, eliminating the need to identify exact names. This method can quickly reveal related columns in a large DataFrame, such as those containing speed values or shared numbers in column names.
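Both the loop and its inverse can be sketched on hypothetical data as follows. The loop here iterates over list(car_df.columns), a snapshot of the names, which is a cautious variation rather than a requirement. The second half shows a list comprehension combined with .drop as one possible alternative to del.

```python
import pandas as pd

# Hypothetical stand-in for car_df; values and column order are assumptions.
car_df = pd.DataFrame({
    "make": ["Toyota", "Honda"],
    "model": ["Corolla", "Civic"],
    "avg_speed": [60, 62],
    "top_speed": [118, 125],
    "safety_rating": [5, 5],
})
full_df = car_df.copy()  # untouched copy for the inverted example

# Remove every column whose name contains "speed".
for col in list(car_df.columns):
    if 'speed' in col:
        del car_df[col]
print(car_df)

# Inverted logic without a loop: collect the non-speed names, then drop them.
non_speed_cols = [col for col in full_df.columns if 'speed' not in col]
speed_df = full_df.drop(columns=non_speed_cols)
print(speed_df)
```

The comprehension version returns a new DataFrame, so the original stays available if the filter turns out to be too broad.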
Drop columns from Pandas DataFrames to improve your analysis.
There are multiple ways to remove columns from DataFrames. You can name the columns to drop, provide their index values, use ranges, provide the names of the columns to keep, or define logic that loops through your DataFrame and filters out column names that don't match your criteria. No matter which method you use, you will be closer to discovering new insights and finding answers to your business's biggest questions.