DataFrame filtering with chaining

6th November 2022

Extending fluent design to filtering.

If you’re like me, breaking a nice method chain to do filtering is annoying.

You might have a nice long list of operations, and then interupt it like so:

df = (
    df_raw.set_index("A")
    .sort_index()
    .unstack()
    .groupby("B")
    .mean()
)
df = df[df["C"] == "some_value"]
df = df.join(df2).reset_index()

I’m a simple man, and I just want everything in a single chain!

Yes, in the trivial example above, you could just move the indexor into the chain, but what if you need to filter based a value in the dataframe at that instant, not in the raw dataframe? Then you’ll have issues.

So, how can we better do simple filtering with method chaining? I looked at a few options, so I thought it would be smart to document them.

import pandas as pd

df = pd.read_csv("https://github.com/PacktPublishing/Pandas-Cookbook/raw/master/data/flights.csv")
df = df[["MONTH", "DAY", "WEEKDAY", "ORG_AIR", "DEST_AIR", "CANCELLED"]]
df.head(5)
MONTHDAYWEEKDAYORG_AIRDEST_AIRCANCELLED
0114LAXSLC0
1114DENIAD0
2114DFWVPS0
3114DFWDCA0
4114LAXMCI0

Our goal: get some fluent API design going to filter to flights into Los Angeles (and the absolutely awful airport that is LAX).

# Normal method. Boo!
df[df["DEST_AIR"] == "LAX"].head(2)
MONTHORG_AIRDEST_AIRCANCELLED
2621SFOLAX0
7031SFOLAX0

So, stock standard answer is “use query”!

df.query("DEST_AIR == 'LAX'").head(2)
MONTHORG_AIRDEST_AIRCANCELLED
2621SFOLAX0
7031SFOLAX0

I do not like query. In a world of mypy, type hints, and similar, hiding away data manipulation logic inside a string feels fundamentally wrong to me.

A better solution, is to pass a callable into loc:

df.loc[lambda x: x["DEST_AIR"] == "LAX"].head(2)
MONTHDAYWEEKDAYORG_AIRDEST_AIRCANCELLED
19114PHXLAX0
22114LASLAX0

So this is a little nicer. The lambda function means we can filter on the dataframe at that point in time, which is the critical thing.

Of course, if we didn’t like the lambda functions, we could slap our own method onto the dataframe:

select = lambda df, col, val: df[df[col] == val]
pd.DataFrame.select = select

df.select("DEST_AIR", "LAX").head(2)
MONTHDAYWEEKDAYORG_AIRDEST_AIRCANCELLED
19114PHXLAX0
22114LASLAX0

Provided your filtering always takes this simple form, and provided you don’t feel incredibly dirty patching your own method onto the DataFrame class, this is very chainable.

df.select("DEST_AIR", "LAX").select("CANCELLED", 1).head(2)
MONTHDAYWEEKDAYORG_AIRDEST_AIRCANCELLED
30691202SFOLAX1
31641202SFOLAX1

Now, if you wanted the same function, but didn’t want to patch the DataFrame class, consider:

df.pipe(select, "DEST_AIR", "LAX").head(2)
MONTHDAYWEEKDAYORG_AIRDEST_AIRCANCELLED
19114PHXLAX0
22114LASLAX0

Really I think that something like this method should be added into the base pandas library. Maybe I’ll make a PR one day.

But for everything now in one place:

# Method chaining to filter your data!
import pandas as pd

# Download some data
repo = "https://github.com/PacktPublishing"
project = "Pandas-Cookbook/raw/master/data/flights.csv"
df_raw = pd.read_csv(f"{repo}/{project}")
df_raw = df_raw[["MONTH", "ORG_AIR", "DEST_AIR", "CANCELLED"]]

# Patch a custom method in
def select(df, col, val):
    return df[df[col] == val]
    
pd.DataFrame.select = select

# And then pick your favourite method!
df = (
    # You can use query
    df_raw.query("DEST_AIR == 'LAX'")
    # A custom patched method
    .select("MONTH", 1)
    # The same method via pipe
    .pipe(select, "ORG_AIR", "SFO")
    # Or pass a function to loc
    .loc[lambda x: x["CANCELLED"] == 0]
)

If you know of other useful ways of filtering, please let me know!


For your convenience, here’s the code in one block:

df = (
    df_raw.set_index("A")
    .sort_index()
    .unstack()
    .groupby("B")
    .mean()
)
df = df[df["C"] == "some_value"]
df = df.join(df2).reset_index()
import pandas as pd

df = pd.read_csv("https://github.com/PacktPublishing/Pandas-Cookbook/raw/master/data/flights.csv")
df = df[["MONTH", "DAY", "WEEKDAY", "ORG_AIR", "DEST_AIR", "CANCELLED"]]
df.head(5)
# Normal method. Boo!
df[df["DEST_AIR"] == "LAX"].head(2)
df.query("DEST_AIR == 'LAX'").head(2)
df.loc[lambda x: x["DEST_AIR"] == "LAX"].head(2)
select = lambda df, col, val: df[df[col] == val]
pd.DataFrame.select = select

df.select("DEST_AIR", "LAX").head(2)
df.select("DEST_AIR", "LAX").select("CANCELLED", 1).head(2)
df.pipe(select, "DEST_AIR", "LAX").head(2)
# Method chaining to filter your data!
import pandas as pd

# Download some data
repo = "https://github.com/PacktPublishing"
project = "Pandas-Cookbook/raw/master/data/flights.csv"
df_raw = pd.read_csv(f"{repo}/{project}")
df_raw = df_raw[["MONTH", "ORG_AIR", "DEST_AIR", "CANCELLED"]]

# Patch a custom method in
def select(df, col, val):
    return df[df[col] == val]
    
pd.DataFrame.select = select

# And then pick your favourite method!
df = (
    # You can use query
    df_raw.query("DEST_AIR == 'LAX'")
    # A custom patched method
    .select("MONTH", 1)
    # The same method via pipe
    .pipe(select, "ORG_AIR", "SFO")
    # Or pass a function to loc
    .loc[lambda x: x["CANCELLED"] == 0]
)