6th November 2022
Extending fluent design to filtering.
If you’re like me, breaking a nice method chain to do filtering is annoying.
You might have a nice long list of operations, and then interupt it like so:
df = (
df_raw.set_index("A")
.sort_index()
.unstack()
.groupby("B")
.mean()
)
df = df[df["C"] == "some_value"]
df = df.join(df2).reset_index()
I’m a simple man, and I just want everything in a single chain!
Yes, in the trivial example above, you could just move the indexor into the chain, but what if you need to filter based a value in the dataframe at that instant, not in the raw dataframe? Then you’ll have issues.
So, how can we better do simple filtering with method chaining? I looked at a few options, so I thought it would be smart to document them.
MONTH | DAY | WEEKDAY | ORG_AIR | DEST_AIR | CANCELLED | |
---|---|---|---|---|---|---|
0 | 1 | 1 | 4 | LAX | SLC | 0 |
1 | 1 | 1 | 4 | DEN | IAD | 0 |
2 | 1 | 1 | 4 | DFW | VPS | 0 |
3 | 1 | 1 | 4 | DFW | DCA | 0 |
4 | 1 | 1 | 4 | LAX | MCI | 0 |
Our goal: get some fluent API design going to filter to flights into Los Angeles (and the absolutely awful airport that is LAX).
# Normal method. Boo!
df[df["DEST_AIR"] == "LAX"].head(2)
MONTH | ORG_AIR | DEST_AIR | CANCELLED | |
---|---|---|---|---|
262 | 1 | SFO | LAX | 0 |
703 | 1 | SFO | LAX | 0 |
So, stock standard answer is “use query”!
df.query("DEST_AIR == 'LAX'").head(2)
MONTH | ORG_AIR | DEST_AIR | CANCELLED | |
---|---|---|---|---|
262 | 1 | SFO | LAX | 0 |
703 | 1 | SFO | LAX | 0 |
I do not like query
. In a world of mypy, type hints, and similar, hiding away data manipulation logic inside a string feels fundamentally wrong to me.
A better solution, is to pass a callable into loc
:
df.loc[lambda x: x["DEST_AIR"] == "LAX"].head(2)
MONTH | DAY | WEEKDAY | ORG_AIR | DEST_AIR | CANCELLED | |
---|---|---|---|---|---|---|
19 | 1 | 1 | 4 | PHX | LAX | 0 |
22 | 1 | 1 | 4 | LAS | LAX | 0 |
So this is a little nicer. The lambda function means we can filter on the dataframe at that point in time, which is the critical thing.
Of course, if we didn’t like the lambda functions, we could slap our own method onto the dataframe:
select = lambda df, col, val: df[df[col] == val]
pd.DataFrame.select = select
df.select("DEST_AIR", "LAX").head(2)
MONTH | DAY | WEEKDAY | ORG_AIR | DEST_AIR | CANCELLED | |
---|---|---|---|---|---|---|
19 | 1 | 1 | 4 | PHX | LAX | 0 |
22 | 1 | 1 | 4 | LAS | LAX | 0 |
Provided your filtering always takes this simple form, and provided you don’t feel incredibly dirty patching your own method onto the DataFrame
class, this is very chainable.
df.select("DEST_AIR", "LAX").select("CANCELLED", 1).head(2)
MONTH | DAY | WEEKDAY | ORG_AIR | DEST_AIR | CANCELLED | |
---|---|---|---|---|---|---|
3069 | 1 | 20 | 2 | SFO | LAX | 1 |
3164 | 1 | 20 | 2 | SFO | LAX | 1 |
Now, if you wanted the same function, but didn’t want to patch the DataFrame
class, consider:
df.pipe(select, "DEST_AIR", "LAX").head(2)
MONTH | DAY | WEEKDAY | ORG_AIR | DEST_AIR | CANCELLED | |
---|---|---|---|---|---|---|
19 | 1 | 1 | 4 | PHX | LAX | 0 |
22 | 1 | 1 | 4 | LAS | LAX | 0 |
Really I think that something like this method should be added into the base pandas library. Maybe I’ll make a PR one day.
But for everything now in one place:
# Method chaining to filter your data!
import pandas as pd
# Download some data
repo = "https://github.com/PacktPublishing"
project = "Pandas-Cookbook/raw/master/data/flights.csv"
df_raw = pd.read_csv(f"{repo}/{project}")
df_raw = df_raw[["MONTH", "ORG_AIR", "DEST_AIR", "CANCELLED"]]
# Patch a custom method in
def select(df, col, val):
return df[df[col] == val]
pd.DataFrame.select = select
# And then pick your favourite method!
df = (
# You can use query
df_raw.query("DEST_AIR == 'LAX'")
# A custom patched method
.select("MONTH", 1)
# The same method via pipe
.pipe(select, "ORG_AIR", "SFO")
# Or pass a function to loc
.loc[lambda x: x["CANCELLED"] == 0]
)
If you know of other useful ways of filtering, please let me know!
For your convenience, here’s the code in one block:
df = (
df_raw.set_index("A")
.sort_index()
.unstack()
.groupby("B")
.mean()
)
df = df[df["C"] == "some_value"]
df = df.join(df2).reset_index()
import pandas as pd
df = pd.read_csv("https://github.com/PacktPublishing/Pandas-Cookbook/raw/master/data/flights.csv")
df = df[["MONTH", "DAY", "WEEKDAY", "ORG_AIR", "DEST_AIR", "CANCELLED"]]
df.head(5)
# Normal method. Boo!
df[df["DEST_AIR"] == "LAX"].head(2)
df.query("DEST_AIR == 'LAX'").head(2)
df.loc[lambda x: x["DEST_AIR"] == "LAX"].head(2)
select = lambda df, col, val: df[df[col] == val]
pd.DataFrame.select = select
df.select("DEST_AIR", "LAX").head(2)
df.select("DEST_AIR", "LAX").select("CANCELLED", 1).head(2)
df.pipe(select, "DEST_AIR", "LAX").head(2)
# Method chaining to filter your data!
import pandas as pd
# Download some data
repo = "https://github.com/PacktPublishing"
project = "Pandas-Cookbook/raw/master/data/flights.csv"
df_raw = pd.read_csv(f"{repo}/{project}")
df_raw = df_raw[["MONTH", "ORG_AIR", "DEST_AIR", "CANCELLED"]]
# Patch a custom method in
def select(df, col, val):
return df[df[col] == val]
pd.DataFrame.select = select
# And then pick your favourite method!
df = (
# You can use query
df_raw.query("DEST_AIR == 'LAX'")
# A custom patched method
.select("MONTH", 1)
# The same method via pipe
.pipe(select, "ORG_AIR", "SFO")
# Or pass a function to loc
.loc[lambda x: x["CANCELLED"] == 0]
)