While data scientists can and do use SQL, it is often simply easier to manipulate your pandas dataframe with Python operations instead (or in addition). Personally, I like to use a mix of both languages to structure my data. Past a certain point, it can be more efficient to work with pandas operations once you already have a queried dataframe from SQL. For example, you might query all of the columns you need, read the result into a dataframe, and then apply the respective operations to organize your data before it is ultimately ingested into your data science model. With that being said, let’s dive a little deeper into some simple operations that might make your everyday work a little easier.
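To make that hand-off concrete, here is a minimal sketch of the workflow; the connection string, table name, and column names are hypothetical and just stand in for whatever your project uses.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table, purely for illustration
engine = create_engine("postgresql://user:password@localhost:5432/analytics")

# Query only the columns you need on the SQL side...
query = "SELECT column_1, column_2, column_3 FROM my_table"
df = pd.read_sql(query, engine)

# ...then do the finer-grained filtering and organizing in pandas before modeling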
Lesser/Greater
For all of these use cases, I will use a pretend pandas dataframe.
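The original pretend dataframe is not shown here, but if you want to follow along, something like the following stand-in works; the column names match the examples below, while the values are invented.

import pandas as pd

# A made-up dataframe to experiment with; the values are arbitrary
df = pd.DataFrame({
    'column_1': [-150, -50, 10, 20, 35, 500],
    'column_2': [5, 50, 60, 8, 45, 100],
    'column_3': ['blue', 'red', 'green', 'blue', 'red', 'blue'],
})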
The following operation is “less than”. You write your dataframe alias, which in this case is just df, and insert your column name where I have placed ‘column_1’. I have assigned the result to a new dataframe, named df_less_than_20, so that I only keep the records/rows where the column value is less than 20.
df_less_than_20 = df[df['column_1'] < 20]
The same concept can be applied to greater than:
df_more_than_20 = df[df['column_1'] > 20]
Although these operations are simple, they are still useful, and when put together they can be even more beneficial, as we will see below.
Another way to look at this feature is as the WHERE clause in SQL.
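If that comparison resonates, pandas also has a query method that lets you write the same filter in a more WHERE-clause-like style; this is just an alternative spelling of the example above, not a different result.

# Equivalent to df[df['column_1'] < 20], written in a WHERE-clause style
df_less_than_20 = df.query("column_1 < 20")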
And/Or
Now that we have the above statement, we can apply a further filter to our data.
We can use the & operation, the | operation, or both together.
To clarify:
- AND = &
- OR = |
I was already aware of the AND operation, but the OR operation was a more recent discovery for me, and it has been incredibly useful, especially when filtering data for accuracy and error analysis after your model has run. Of course, you can use this operation before that step of the process as well.
Now, we can use either or both of these in the following way:
df[(df['column_1'] >= -100) & (df['column_1'] <= 100)]
The above is saying: give me the rows where the value is between negative 100 and positive 100.
The next step is to use the OR operation to also pick up all rows that are negative:
df[(df['column_1'] < 0) | ((df['column_1'] >= -100) & (df['column_1'] <= 100))]
Note the extra parentheses around the AND portion; because & binds more tightly than |, they make the intended grouping explicit.
We can also strip away the middle clause to create the following snippet:
df[(df['column_1'] < 0) | (df['column_1'] <= 100)]
However, we could also replace one of the clauses with a filter on another column and another value:
df[(df['column_1'] < 0) | (df['column_2'] <= 50)]
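As a side note that goes beyond the original examples, pandas also offers a between helper that expresses the AND-style range check in one call; the bounds are inclusive by default.

# Same idea as (df['column_1'] >= -100) & (df['column_1'] <= 100)
df_in_range = df[df['column_1'].between(-100, 100)]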
Does/Does not Equal
Lastly, we have another way to filter our data: selecting rows where a column does, or does not, equal a certain value.
These two operations look like the following:
- DOES EQUAL: ==
- DOES NOT EQUAL: !=
Here are a few examples of both:
df[df['column_1'] == 100]
df[df['column_2'] == 50]
df[df['column_3'] == 'blue']
df[df['column_3'] != 'blue']
df[(df['column_3'] != 'red') | (df['column_200'] <= 8.60)]
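If you find yourself chaining several OR equality checks on the same column, a related pandas idiom (not part of the walkthrough above, but standard) is isin, with ~ to negate the mask.

# Rows where column_3 is any one of several values
df_colors = df[df['column_3'].isin(['blue', 'red'])]

# Rows where column_3 is none of those values (~ negates the boolean mask)
df_not_colors = df[~df['column_3'].isin(['blue', 'red'])]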
Summary
To summarize, we saw that we can combine a few of the operations discussed above to create a filtered dataset or pandas dataframe. Ultimately, this style of coding may be easier for data scientists who prefer to work in Python rather than in SQL.