DataFrame filtering with chaining

2022-11-08

Extending fluent design to filtering.

If you’re like me, breaking a nice method chain to do filtering is annoying.

You might have a nice long list of operations, and then interupt it like so:

df = (
    df_raw.set_index("A")
    .sort_index()
    .unstack()
    .groupby("B")
    .mean()
)
df = df[df["C"] == "some_value"]
df = df.join(df2).reset_index()

I’m a simple man, and I just want everything in a single chain!

Yes, in the trivial example above, you could just move the indexor into the chain, but what if you need to filter based a value in the dataframe at that instant, not in the raw dataframe? Then you’ll have issues.

So, how can we better do simple filtering with method chaining? I looked at a few options, so I thought it would be smart to document them.

import pandas as pd

df = pd.read_csv("https://github.com/PacktPublishing/Pandas-Cookbook/raw/master/data/flights.csv")
df = df[["MONTH", "DAY", "WEEKDAY", "ORG_AIR", "DEST_AIR", "CANCELLED"]]
df.head(5)
MONTH DAY WEEKDAY ORG_AIR DEST_AIR CANCELLED
0 1 1 4 LAX SLC 0
1 1 1 4 DEN IAD 0
2 1 1 4 DFW VPS 0
3 1 1 4 DFW DCA 0
4 1 1 4 LAX MCI 0

Our goal: get some fluent API design going to filter to flights into Los Angeles (and the absolutely awful airport that is LAX).

# Normal method. Boo!
df[df["DEST_AIR"] == "LAX"].head(2)
MONTH ORG_AIR DEST_AIR CANCELLED
262 1 SFO LAX 0
703 1 SFO LAX 0

So, stock standard answer is “use query”!

df.query("DEST_AIR == 'LAX'").head(2)
MONTH ORG_AIR DEST_AIR CANCELLED
262 1 SFO LAX 0
703 1 SFO LAX 0

I do not like query. In a world of mypy, type hints, and similar, hiding away data manipulation logic inside a string feels fundamentally wrong to me.

A better solution, is to pass a callable into loc:

df.loc[lambda x: x["DEST_AIR"] == "LAX"].head(2)
MONTH DAY WEEKDAY ORG_AIR DEST_AIR CANCELLED
19 1 1 4 PHX LAX 0
22 1 1 4 LAS LAX 0

So this is a little nicer. The lambda function means we can filter on the dataframe at that point in time, which is the critical thing.

Of course, if we didn’t like the lambda functions, we could slap our own method onto the dataframe:

select = lambda df, col, val: df[df[col] == val]
pd.DataFrame.select = select

df.select("DEST_AIR", "LAX").head(2)
MONTH DAY WEEKDAY ORG_AIR DEST_AIR CANCELLED
19 1 1 4 PHX LAX 0
22 1 1 4 LAS LAX 0

Provided your filtering always takes this simple form, and provided you don’t feel incredibly dirty patching your own method onto the DataFrame class, this is very chainable.

df.select("DEST_AIR", "LAX").select("CANCELLED", 1).head(2)
MONTH DAY WEEKDAY ORG_AIR DEST_AIR CANCELLED
3069 1 20 2 SFO LAX 1
3164 1 20 2 SFO LAX 1

Now, if you wanted the same function, but didn’t want to patch the DataFrame class, consider:

df.pipe(select, "DEST_AIR", "LAX").head(2)
MONTH DAY WEEKDAY ORG_AIR DEST_AIR CANCELLED
19 1 1 4 PHX LAX 0
22 1 1 4 LAS LAX 0

Really I think that something like this method should be added into the base pandas library. Maybe I’ll make a PR one day.

But for everything now in one place:

# Method chaining to filter your data!
import pandas as pd

# Download some data
repo = "https://github.com/PacktPublishing"
project = "Pandas-Cookbook/raw/master/data/flights.csv"
df_raw = pd.read_csv(f"{repo}/{project}")
df_raw = df_raw[["MONTH", "ORG_AIR", "DEST_AIR", "CANCELLED"]]

# Patch a custom method in
def select(df, col, val):
    return df[df[col] == val]

pd.DataFrame.select = select

# And then pick your favourite method!
df = (
    # You can use query
    df_raw.query("DEST_AIR == 'LAX'")
    # A custom patched method
    .select("MONTH", 1)
    # The same method via pipe
    .pipe(select, "ORG_AIR", "SFO")
    # Or pass a function to loc
    .loc[lambda x: x["CANCELLED"] == 0]
)

If you know of other useful ways of filtering, please let me know!


For your convenience, here’s the code in one block:

df = (
    df_raw.set_index("A")
    .sort_index()
    .unstack()
    .groupby("B")
    .mean()
)
df = df[df["C"] == "some_value"]
df = df.join(df2).reset_index()
import pandas as pd

df = pd.read_csv("https://github.com/PacktPublishing/Pandas-Cookbook/raw/master/data/flights.csv")
df = df[["MONTH", "DAY", "WEEKDAY", "ORG_AIR", "DEST_AIR", "CANCELLED"]]
df.head(5)
# Normal method. Boo!
df[df["DEST_AIR"] == "LAX"].head(2)
df.query("DEST_AIR == 'LAX'").head(2)
df.loc[lambda x: x["DEST_AIR"] == "LAX"].head(2)
select = lambda df, col, val: df[df[col] == val]
pd.DataFrame.select = select

df.select("DEST_AIR", "LAX").head(2)
df.select("DEST_AIR", "LAX").select("CANCELLED", 1).head(2)
df.pipe(select, "DEST_AIR", "LAX").head(2)
# Method chaining to filter your data!
import pandas as pd

# Download some data
repo = "https://github.com/PacktPublishing"
project = "Pandas-Cookbook/raw/master/data/flights.csv"
df_raw = pd.read_csv(f"{repo}/{project}")
df_raw = df_raw[["MONTH", "ORG_AIR", "DEST_AIR", "CANCELLED"]]

# Patch a custom method in
def select(df, col, val):
    return df[df[col] == val]

pd.DataFrame.select = select

# And then pick your favourite method!
df = (
    # You can use query
    df_raw.query("DEST_AIR == 'LAX'")
    # A custom patched method
    .select("MONTH", 1)
    # The same method via pipe
    .pipe(select, "ORG_AIR", "SFO")
    # Or pass a function to loc
    .loc[lambda x: x["CANCELLED"] == 0]
)