pandas is a Python library built to streamline working with relational data. It offers two primary structures for capturing and manipulating data: Series and DataFrames.
When working with these data structures, you’ll often need to filter out rows, whether to inspect a subset of data or to cleanse the data set, such as removing duplicates.
Fortunately, pandas and Python offer a number of ways to filter rows in Series and DataFrames so you can get the answers you need to guide your business strategy. This post will cover the following approaches:
- How to Filter Rows by Column Value
- How to Filter Rows by Logical Conditions
- How to Filter Rows by Slice
- How to Filter Rows by Query
- How to Filter Rows by Missing Values
- How to Filter Rows by String Methods
How to Filter Rows in Pandas
1. How to Filter Rows by Column Value
Often, you want to find instances of a specific value in your DataFrame. You can easily filter rows based on whether they contain a value or not using the .loc indexing method.
For this example, you have a simple DataFrame of random integers arrayed across two columns and 10 rows:
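The original frame was generated randomly, so the exact values aren't reproducible here; a minimal way to build a comparable one looks like this (the seed is arbitrary, and your values will differ from the article's):

```python
import numpy as np
import pandas as pd

# Ten rows of random integers across two columns named "a" and "b";
# the seed is an arbitrary choice, so these values are only stand-ins
rng = np.random.default_rng(seed=1)
num_df = pd.DataFrame(rng.integers(0, 10, size=(10, 2)), columns=["a", "b"])
print(num_df)
```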
Say you only want to view rows that have the value 2 under the "a" column. .loc allows you to easily define this parameter:
num_df.loc[num_df['a'] == 2]
Here, .loc takes the logical expression as an argument: whenever the value in column "a" of num_df equals 2, the expression returns the boolean True, and .loc returns the corresponding row.
The output of executing this code and printing the result is below.
As expected, the .loc method has looked through each of the values under column "a" and filtered out all rows that don't contain the integer 2, leaving you with the two rows that matched your parameter.
For a deeper dive on the .loc method, you can check out our guide on indexing in Pandas. This guide also covers the indexing operator used in Example 2 and the .iloc method used in Example 3.
2. How to Filter Rows by Logical Conditions
In some cases, you will not want to find rows with one sole value but instead find groupings based on patterns. You can define patterns with logical expressions.
In this scenario, you have a DataFrame of 10 student test scores for a class. You can inspect the data it contains below.
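The individual scores aren't given in the text, so the frame below uses stand-in values, arranged so that five students are upperclassmen (grade 11 or 12) and three of those scored above 80:

```python
import pandas as pd

# Hypothetical roster: the names, grades, and scores are assumptions
# chosen to match the counts described in the walkthrough
tests_df = pd.DataFrame({
    "student": ["Ava", "Ben", "Cora", "Dan", "Eli",
                "Fay", "Gus", "Hana", "Ivan", "June"],
    "grade": [9, 9, 10, 10, 10, 11, 11, 11, 12, 12],
    "test_score": [76, 88, 67, 82, 91, 79, 85, 92, 70, 88],
})
print(tests_df)
```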
If you only want to inspect the test scores of upperclassmen, you can define the logic as an argument for the indexing operator []:
tests_df[(tests_df['grade'] > 10)]
Similar to the previous example, you are filtering the tests_df DataFrame to only show the rows where the values in the "grade" column are greater than (>) 10. You can confirm the expression performed as intended by printing to the terminal:
You now have a subset of five rows for each of the upperclassmen students. You can confirm this by inspecting the "grade" column. Only the values 11 and 12 are present.
You can add additional conditions using the boolean operator & (representing "and"). Now you are segmenting the data further to only show the top performers among the upperclassmen:
tests_df[(tests_df['grade'] > 10) & (tests_df['test_score'] > 80)]
Based on the defined conditions, a student must be at a grade level higher than 10 and have scored greater than 80 on the test. If either or both of these conditions are false, their row is filtered out.
The output is below.
The data subset is now further segmented to show the three rows that meet both of our conditions.
You can add flexibility to your conditions with the boolean operator | (representing "or"). In this example, the code would display the rows that either have a grade level greater than 10 or a test score greater than 80. Only one condition needs to be true to satisfy the expression:
tests_df[(tests_df['grade'] > 10) | (tests_df['test_score'] > 80)]
The next example will inspect another way to filter rows with indexing: the .iloc method.
3. How to Filter Rows by Slice
Sometimes you don't want to filter based on values at all but instead based on position. The .iloc method allows you to easily define a slice of the DataFrame to retrieve.
This example uses the Major League Baseball player salaries data set available on Kaggle. Feel free to download it and follow along.
You can examine a preview of the data below.
For this scenario, you are less interested in the year the data was collected or the team name of each player. You just want a quick sample of the first 10 rows of data that include the player name, their salary, and their player ID.
.iloc allows you to quickly define this slice:
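A sketch of that slice, using mlb_df as a stand-in name for the loaded data (the synthetic frame below only mimics the data set's shape, since the Kaggle CSV itself isn't reproduced here):

```python
import pandas as pd

# Synthetic stand-in for the Kaggle salaries data; in practice you would
# load the CSV with pd.read_csv, and these column names are assumptions
mlb_df = pd.DataFrame({
    "year": [2016] * 12,
    "team": ["BOS", "NYY", "LAD"] * 4,
    "player_name": [f"Player {i}" for i in range(12)],
    "salary": [500_000 + 100_000 * i for i in range(12)],
    "player_id": [f"id{i:03d}" for i in range(12)],
})

# Rows at positions 0 through 9, columns at positions 2 through 4
sample = mlb_df.iloc[0:10, 2:5]
print(sample.shape)  # (10, 3)
```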
Here, you are defining the ranges as arguments for .iloc that then pulls the row and column values at the specified locations. The first argument identifies the rows starting at index 0 and before index 10, returning 10 rows of data.
The second argument designates the columns starting at index 2 and before index 5, returning three columns of data. The output is below.
If you decide you want to see a subset of 10 rows and all columns, you can replace the second argument in .iloc with a colon:
pandas will interpret the colon to mean all columns, as seen in the output:
You can also use a colon to select all rows.
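Both colon variants can be sketched against the same kind of hypothetical frame (column names here are assumptions, not the real Kaggle headers):

```python
import pandas as pd

# Small stand-in for the salaries data set
mlb_df = pd.DataFrame({
    "year": [2016] * 12,
    "team": ["BOS"] * 12,
    "player_name": [f"Player {i}" for i in range(12)],
    "salary": [500_000 + 100_000 * i for i in range(12)],
    "player_id": [f"id{i:03d}" for i in range(12)],
})

first_ten = mlb_df.iloc[0:10, :]  # first 10 rows, every column
all_rows = mlb_df.iloc[:, 2:5]    # every row, columns 2 through 4
print(first_ten.shape, all_rows.shape)  # (10, 5) (12, 3)
```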
Let's return to condition-based filtering with the .query method.
4. How to Filter Rows by Query
The .query method of pandas allows you to define one or more conditions as a string. It also removes the need to use any of the indexing operators ([], .loc, .iloc) to access the DataFrame rows.
In this scenario, you once again have a DataFrame consisting of two columns of randomly generated integers:
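The integers below are stand-ins for the article's random values, picked so that exactly four of the "a" values fall strictly between 3 and 8:

```python
import pandas as pd

# Hypothetical integers standing in for the article's random data
num_df = pd.DataFrame({
    "a": [1, 4, 9, 5, 2, 7, 8, 6, 0, 3],
    "b": [8, 3, 1, 6, 9, 2, 5, 7, 4, 0],
})
print(num_df)
```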
You can quickly define a range of numbers as a string for the .query() function to pull from the DataFrame:
num_df.query('a < 8 and a > 3')
Here, .query() will search for every row where the value under the "a" column is less than 8 and greater than 3. You can confirm the function performed as expected by printing the result:
You have filtered the DataFrame from 10 rows of data down to four, where the values under column "a" are between 4 and 7. Note that you did not need to use the indexing operator when defining the column each condition applies to, as you did in Example 2.
This video from Sean MacKenzie walks through a live demonstration of the .query method:
5. How to Filter Rows by Missing Values
Not every data set is complete. pandas provides an easy way to filter out rows with missing values using the .notnull method.
For this example, you have a DataFrame of random integers across three columns:
However, you may have noticed that three values are missing in column "c" as denoted by NaN (not a number). You can filter these incomplete records from the DataFrame using .notnull() and the indexing operator:
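A self-contained sketch of that filter (the integers and the NaN positions are assumptions, matching the described shape of three columns with three missing values in "c"):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: three columns of integers, with three NaNs in "c"
df = pd.DataFrame({
    "a": [4, 1, 7, 3, 9, 0, 5, 8, 2, 6],
    "b": [2, 8, 5, 1, 6, 9, 3, 0, 7, 4],
    "c": [5.0, np.nan, 3.0, np.nan, 8.0, 1.0, np.nan, 6.0, 2.0, 9.0],
})

# Keep only the rows where "c" is not null
complete = df[df["c"].notnull()]
print(complete)  # 7 rows, no NaN left in "c"
```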
Here, you are calling .notnull() on each value contained under column "c". True to its name, .notnull() evaluates whether each value is null; if a value isn't null, .notnull() returns True. You then use a second indexing operator to apply the boolean Series generated by .notnull() as a mask, displaying only the rows that evaluate to True.
The output of this expression is below.
You have removed all three rows with null values from the DataFrame, ensuring your analysis only incorporates records with complete data.
The final method moves away from numerical conditions to examine how you can filter rows containing string data.
6. How to Filter Rows by String Methods
The majority of the examples in this post have focused on filtering numerical values. However, you can apply these methods to string data as well. In fact, strings have their own subset of methods to allow you to filter and segment data with even greater precision.
In this example, you have a DataFrame of data around user signups:
You want to display users who signed up this year (2022). Since the signup dates are stored as strings, you can use the .str property and .contains method to search the column for that value:
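The signup data itself isn't reproduced here, so the frame below is a hypothetical stand-in with the two columns the text describes:

```python
import pandas as pd

# Hypothetical signup data; the names, cities, and dates are assumptions
users_df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cal", "Dee", "Eve"],
    "city_state": ["Boston, MA", "Cambridge, MA", "Boston, MA",
                   "Chicago, IL", "Boston, MA"],
    "sign_up_date": ["2021-11-04", "2022-01-15", "2022-03-22",
                     "2021-08-30", "2022-06-01"],
})

# Keep rows whose sign_up_date string contains "2022"
signed_up_2022 = users_df[users_df["sign_up_date"].str.contains("2022")]
print(signed_up_2022)
```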
Once again, you are using the indexing operator to search the "sign_up_date" column. The .str property gives you access to the .contains() method, which evaluates whether each string in the specified column contains "2022". Rows that evaluate to True are then displayed by the second indexing operator.
The result is below.
You can reuse this syntax to search for users who are based in the same city. Since you know city will always be the first value listed under the "city_state" column, you can use the .startswith method to evaluate the strings:
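Against a hypothetical signup frame like the one the text describes, a city filter with .startswith might look like this (all values are assumptions):

```python
import pandas as pd

# Hypothetical signup data; the names, cities, and dates are assumptions
users_df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cal", "Dee", "Eve"],
    "city_state": ["Boston, MA", "Cambridge, MA", "Boston, MA",
                   "Chicago, IL", "Boston, MA"],
    "sign_up_date": ["2021-11-04", "2022-01-15", "2022-03-22",
                     "2021-08-30", "2022-06-01"],
})

# Keep rows whose city_state string begins with "Boston"
boston_users = users_df[users_df["city_state"].str.startswith("Boston")]
print(boston_users)
```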
While .contains would also work here, .startswith() is more efficient because it is only concerned with the beginning of the string. As soon as it finds a character that doesn't match the string "Boston" (e.g. the "C" in Cambridge instead of a "B") the function will move to the next value.
Note that the .contains and .startswith methods are both case sensitive, so searching with the string "boston" would return no results.
This video by sage81564 shows another string method that uses .contains and .loc:
Filter rows in Pandas to get answers faster.
Not all data is created equal. Filtering rows in pandas removes extraneous or incorrect data, leaving you with the cleanest data set available. You can filter by values, conditions, slices, queries, and string methods. You can even quickly remove rows with missing data to ensure you are only working with complete records. All these approaches help you find valuable insights to guide your business operations and shape strategy more easily and quickly.