pandas is a Python library created to expedite the importation and manipulation of relational data. To support this work, it has many built-in properties and methods for its two main data structures: Series and DataFrames.
One popular method during the data cleansing stage is the .notnull method. Data cleansing is a crucial step before performing advanced tests and analysis on the data set. You want to know that the data you have is as accurate and stable as possible to minimize the risk of skewed results and conclusions. Removing duplicates and dropping unneeded columns also helps speed up calculations by minimizing the amount of data functions have to parse through.
For example, you notice an unusual spike in user signups in a given month. However, .notnull would show that these user records are actually blank, meaning they were likely created as the result of a glitch or internal system error. Without the crosscheck .notnull and other data cleansing methods provide, you might draw false conclusions about the efficacy of the promotions and campaigns you ran in that time frame.
This post will cover everything you need to know about .notnull and provide three different examples of how you can use .notnull in your DataFrames.
What is the Pandas .notnull method?
.notnull is a general function of the pandas library in Python that detects if values are not missing for either a single value (scalar) or array-like objects. The function returns booleans to reflect whether the values evaluated are null (False) or not null (True). .notnull is an alias of the pandas .notna method.
How to Check if Something Is Not Null in Pandas
.notnull is a pandas function that will examine one or multiple values to validate that they are not null. In Python, null values are reflected as NaN (not a number) or None to signify no data present. .notnull will return False if either NaN or None is detected. If these values are not present, it will return True.
To better understand the .notnull method, let's examine how it functions in an example.
You have a simple DataFrame of a few numbers arranged in two columns. You can inspect the values below.
You can see the first column is not missing any values, but the second column has a NaN value in the second row. .notnull() will indicate the same in its output:
To see the results of calling .notnull(), use the print statement:
The output is below.
The .notnull method has confirmed your observation: the value in the second row of the second column is a null value, as indicated by .notnull returning False. It has also confirmed all the other values in the DataFrame are not null values since they are marked True.
If these results still aren't clear, think of .notnull as asking the question "Is this value not null?" Wherever it returns True, .notnull answers the question by saying "Yes." Wherever it returns False, the function is answering "No."
Now that you understand the concept and basic syntax of .notnull, let's review its inverse: .isnull.
What's the difference between the Pandas .notnull and .isnull methods?
.notnull and .isnull are inverse functions. They perform the same operations but deliver opposite results. .notnull will return True if a value is not a null value, and .isnull will return False if the value is not a null value. They are used to identify where data is present and where it's missing in a data set.
Let's expand on the distinctions between .notnull and .isnull by looking at .isnull in action.
Below, you have the same DataFrame of three integers and a NaN value as the previous example.
You call .isnull() the same way as .notnull:
The result of executing the print statement is below.
As you likely expected, the results from .notnull are now reversed in the .isnull call. The cell with NaN is marked True, and all other cells are marked False.
In other words, the question is now "Is this value null?" .isnull uses True to answer "Yes" and False to answer "No."
Now that you better understand the differences between these two methods, let's examine some more complex use cases for .notnull in pandas DataFrames.
Examples of .notnull Method Used in Pandas DataFrames
The .notnull method can be used in multiple ways to ensure the data you have is accurate and stable, including totaling the number of missing values, filtering out rows with missing values, and replacing missing values. Let's start with counting missing values.
1. Count missing values with .notnull.
So far, the DataFrame has been simple, so you could get away with simply printing the DataFrame and inspecting it to find the missing values. However, the .notnull method becomes essential when the size of the DataFrame is larger than the four cells in the previous tutorials.
Each of the following examples will be using a data set of property sales available on Kaggle. Feel free to download it and follow along. You can view a quick snapshot of the DataFrame with the .head() method:
The result of calling .head() on the DataFrame lots_df is below.
You can already get a sense of the size of your data set by noting the ellipses (...), which means there are more columns than will fit in the terminal window. To put hard numbers to your DataFrame dimensions, you can call the .size and .shape properties:
print('DataFrame shape:', lots_df.shape)
print('DataFrame size:', lots_df.size)
The result of printing these DataFrame properties is below.
You can now see the DataFrame is made up of 1,460 rows and 81 columns, which equates to 118,260 total data cells. Clearly, this is too large to simply eyeball any missing values.
Before doing anything else, it's a good idea to establish if there are any missing values. If not, then you don't need to worry about any of the other use cases examined in this post.
You can get a quick count of the number of cells with non-null data using the .notnull and .sum methods:
Here, you are calling .notnull() as before. However, a 1460x81 DataFrame of boolean values would still be too large to tell us much on its own. Instead, you are now chaining the Python .sum() function to total up every row under each column that holds a True value:
You now have a Series with an index of column names corresponding to integers representing the total number of rows with data present in each column.
Once again, given the size of the DataFrame, you are not able to inspect all columns in one view. However, you can at least note that the "LotFrontage" column is holding 1,201 rows of data compared to the other columns displaying 1,460 rows of data. This confirms that there are missing values. The next example will expand on this approach as you move into remediation strategies.
For now, you can calculate the total number of cells with data present by chaining another .sum() call onto the expression:
.sum() adds the totals held in each row of the Series. The result is below.
You can see you have 111,295 cells that are not missing values in the DataFrame. Recall that the .size property pulls the total number of all cells regardless of the data they hold. You can now find the total of missing cells through a simple subtraction operation:
total_cells = lots_df.size
cells_with_data = lots_df.notnull().sum().sum()
print('Total cells missing data:', total_cells - cells_with_data)
The result of the print statement and subtraction operation is below.
You now know you have 6,965 cells with null values. The next examples will move into how to address these missing values, starting with filtering them out.
2. Filter out rows with missing data with .notnull.
Let's return to the previous example and total up all available data in each column again. To do this, you call .sum() to count all instances of True in each column of the boolean DataFrame created by .notnull():
column_totals = lots_df.notnull().sum()
The printout of column_totals is below.
However, you're not interested in all columns, only the ones with missing data. Instead of having to manually look through the Series of 81 column names, you can have pandas filter the list for you:
values_missing = column_totals.loc[column_totals < lots_df.shape]
In this statement, you are comparing the values held in each row of column_totals against the total number of rows in the DataFrame, which is provided by taking the first (index 0) value of the .shape property of the original lots_df DataFrame.
If the value in any row of column_totals is less than 1,460, then .loc fetches the value at that index and stores the modified Series in the new variable values_missing.
To learn more about how .loc and the indexing operator work, check out our guide on indexing in pandas.
You can confirm the logical statement filtered the column names as expected by printing to the terminal:
Of the original 81 columns represented in the Series, you now have a reduced list of the 19 that .notnull did not find a full 1,460 rows of data in to now inspect.
Let's take a look at the first value in the values_missing Series. You can see that there are 1,201 rows of data under the "LotFrontage" column, telling you there are more than 200 missing cells. To create a DataFrame that excludes the records that are missing data on lot frontage, turn once again to the .loc method:
lotFrontage_missing_removed = lots_df.loc[lots_df['LotFrontage'].notnull()]
Here, .loc is locating every row in lots_df where .notnull() evaluates the data contained in the "LotFrontage" column as True. Each time the value under that column returns True, .loc retrieves the entire record associated with that value and saves it to the new DataFrame lotFrontage_missing_removed.
You can confirm .loc performed as expected by printing to the terminal with the .head() method:
Here, you can see that no missing values appear under the "LotFrontage" column in the snapshot the .head method provides. You can also examine the .shape property of the new DataFrame:
As expected, the .loc method removed the 259 rows that did not have data under the "LotFrontage" column, and you are left with a DataFrame of 1,201 rows for each record that did have data under this column.
If you were only interested in properties that included the lot frontage data, then you have now filtered out more than 250 unneeded records. Calculations will perform faster because your functions have that much less data to parse through.
In this example, you learned how to filter out rows with missing data. In the next scenario, you'll learn how to fill in missing data.
3. Replace missing values with .notnull.
In some situations, it's better to fill in missing values rather than filter them out. In the previous example, you were only interested in records with lot frontage values and created a copy of the DataFrame with any records that did not include this information removed.
However, if you wanted to update the existing DataFrame and add a placeholder value for any missing values, you can do so by combining the pandas .notnull and .fillna methods.
Let's take another look at the values_missing Series from the last example:
This time, you want to update any missing values under the "Alley" column. First, you can get a better sense of what the column describes in relation to the alley on any property:
Here, you are using .loc to look through the "Alley" column and retrieve any rows with non-null data. Since you know that only 91 rows have data, chances are that printing the "Alley" column as is would only show us null values instead of the actual data this column is meant to hold.
The output of executing this code is below.
You can see that the "Alley" column contains strings describing the alley type (paved, gravel, etc.). Now you are ready to add a placeholder value for any missing data that matches this context using the .fillna method:
lots_df['Alley'].fillna('no', inplace = True)
Here, you are calling .fillna() to look through the column labeled "Alley" of lots_df and overwrite any null values with the string "no" to indicate there is no alley on this property. The second argument inplace specifies whether you want to modify the existing DataFrame or create a copy. By setting it to True, you are running .fillna() on the original lots_df.
Since the operation was specific to the "Alley" column, you can access that column using the indexing operator () to print a quick snapshot with .head():
The output is below.
It appears that the .fillna method performed as expected. You can confirm every row under "Alley" now holds a value with .notnull() and .sum():
Because narrowing the scope to the "Alley" column returns a Series instead of a full DataFrame, you only need to call .sum() once after .notnull() to generate a count of each time True occurs in the column:
You can see that the .notnull and .sum methods returned a total of 1,460 rows containing data, which matches the base shape of the DataFrame.
Why does this matter? Why not just leave it blank if all you're doing is adding a value to say that there is no alley? The advantage of backfilling missing data is that it's now clear that these cells aren't blank by accident. You are explicitly stating there is no alley on each of these properties and replacing ambiguity with intentionality.
The disadvantage of adding placeholder data, of course, is that the calculations that you performed before around which columns were missing data will no longer work. .notnull does not care what the actual values are in a cell, only if they are null or not. Of course, you can use other pandas and Python methods to filter the data in the "Alley" column based on whether it matches the string "no" or not, but that will be a more complex process than simply calling .notnull on the DataFrame.
To achieve the best of both worlds, you can rework the previous workflow so that you create a copy of the DataFrame to run .fillna on, preserving the original DataFrame's values for reference while creating a more explicit DataFrame to perform your analysis:
alleys_backfilled = lots_df
alleys_backfilled['Alley'].fillna('no', inplace = True)
Whether it's better to filter out rows of missing data or backfill them depends on your goals and the scope of the analysis you are performing.
Use the pandas .notnull method to power data cleansing and improve data quality.
You need good data to make good decisions. However, data often requires examination and cleansing before you can draw accurate conclusions from it. The pandas library supports this process with the .notnull method to give you an exact count of how many cells within your DataFrame contain non-null data. You can then develop strategies to handle any missing data to improve data quality before you use it to inform your business strategies and make crucial decisions.