DataFrame filtering with chaining

2022-11-22

Extending fluent design to filtering.

Show me everything!
Oh yeah, coding time.
Just the plots
Code is nasty.

If you’re like me, breaking a nice method chain to do filtering is annoying.

You might have a nice long list of operations, and then interupt it like so:

df = (
    df_raw.set_index("A")
    .sort_index()
    .unstack()
    .groupby("B")
    .mean()
)
df = df[df["C"] == "some_value"]
df = df.join(df2).reset_index()

I’m a simple man, and I just want everything in a single chain!

Yes, in the trivial example above, you could just move the indexor into the chain, but what if you need to filter based a value in the dataframe at that instant, not in the raw dataframe? Then you’ll have issues.

So, how can we better do simple filtering with method chaining? I looked at a few options, so I thought it would be smart to document them.

import pandas as pd

df = pd.read_csv("https://github.com/PacktPublishing/Pandas-Cookbook/raw/master/data/flights.csv")
df = df[["MONTH", "DAY", "WEEKDAY", "ORG_AIR", "DEST_AIR", "CANCELLED"]]
df.head(5)

	MONTH	DAY	WEEKDAY	ORG_AIR	DEST_AIR
0	1	1	4	LAX	SLC
1	1	1	4	DEN	IAD
2	1	1	4	DFW	VPS
3	1	1	4	DFW	DCA
4	1	1	4	LAX	MCI

Our goal: get some fluent API design going to filter to flights into Los Angeles (and the absolutely awful airport that is LAX).

# Normal method. Boo!
df[df["DEST_AIR"] == "LAX"].head(2)

	MONTH	ORG_AIR	DEST_AIR	CANCELLED
262	1	SFO	LAX	0
703	1	SFO	LAX	0

So, stock standard answer is “use query”!

df.query("DEST_AIR == 'LAX'").head(2)

	MONTH	ORG_AIR	DEST_AIR	CANCELLED
262	1	SFO	LAX	0
703	1	SFO	LAX	0

I do not like query. In a world of mypy, type hints, and similar, hiding away data manipulation logic inside a string feels fundamentally wrong to me.

A better solution, is to pass a callable into loc:

df.loc[lambda x: x["DEST_AIR"] == "LAX"].head(2)

	MONTH	DAY	WEEKDAY	ORG_AIR	DEST_AIR	CANCELLED
19	1	1	4	PHX	LAX	0
22	1	1	4	LAS	LAX	0

So this is a little nicer. The lambda function means we can filter on the dataframe at that point in time, which is the critical thing.

Of course, if we didn’t like the lambda functions, we could slap our own method onto the dataframe:

select = lambda df, col, val: df[df[col] == val]
pd.DataFrame.select = select

df.select("DEST_AIR", "LAX").head(2)

	MONTH	DAY	WEEKDAY	ORG_AIR	DEST_AIR	CANCELLED
19	1	1	4	PHX	LAX	0
22	1	1	4	LAS	LAX	0

Provided your filtering always takes this simple form, and provided you don’t feel incredibly dirty patching your own method onto the DataFrame class, this is very chainable.

df.select("DEST_AIR", "LAX").select("CANCELLED", 1).head(2)

	MONTH	DAY	WEEKDAY	ORG_AIR	DEST_AIR	CANCELLED
3069	1	20	2	SFO	LAX	1
3164	1	20	2	SFO	LAX	1

Now, if you wanted the same function, but didn’t want to patch the DataFrame class, consider:

df.pipe(select, "DEST_AIR", "LAX").head(2)

	MONTH	DAY	WEEKDAY	ORG_AIR	DEST_AIR	CANCELLED
19	1	1	4	PHX	LAX	0
22	1	1	4	LAS	LAX	0

Really I think that something like this method should be added into the base pandas library. Maybe I’ll make a PR one day.

But for everything now in one place:

# Method chaining to filter your data!
import pandas as pd

# Download some data
repo = "https://github.com/PacktPublishing"
project = "Pandas-Cookbook/raw/master/data/flights.csv"
df_raw = pd.read_csv(f"{repo}/{project}")
df_raw = df_raw[["MONTH", "ORG_AIR", "DEST_AIR", "CANCELLED"]]

# Patch a custom method in
def select(df, col, val):
    return df[df[col] == val]
    
pd.DataFrame.select = select

# And then pick your favourite method!
df = (
    # You can use query
    df_raw.query("DEST_AIR == 'LAX'")
    # A custom patched method
    .select("MONTH", 1)
    # The same method via pipe
    .pipe(select, "ORG_AIR", "SFO")
    # Or pass a function to loc
    .loc[lambda x: x["CANCELLED"] == 0]
)

If you know of other useful ways of filtering, please let me know!

For your convenience, here’s the code in one block:

df = (
    df_raw.set_index("A")
    .sort_index()
    .unstack()
    .groupby("B")
    .mean()
)
df = df[df["C"] == "some_value"]
df = df.join(df2).reset_index()
import pandas as pd

df = pd.read_csv("https://github.com/PacktPublishing/Pandas-Cookbook/raw/master/data/flights.csv")
df = df[["MONTH", "DAY", "WEEKDAY", "ORG_AIR", "DEST_AIR", "CANCELLED"]]
df.head(5)
# Normal method. Boo!
df[df["DEST_AIR"] == "LAX"].head(2)
df.query("DEST_AIR == 'LAX'").head(2)
df.loc[lambda x: x["DEST_AIR"] == "LAX"].head(2)
select = lambda df, col, val: df[df[col] == val]
pd.DataFrame.select = select

df.select("DEST_AIR", "LAX").head(2)
df.select("DEST_AIR", "LAX").select("CANCELLED", 1).head(2)
df.pipe(select, "DEST_AIR", "LAX").head(2)
# Method chaining to filter your data!
import pandas as pd

# Download some data
repo = "https://github.com/PacktPublishing"
project = "Pandas-Cookbook/raw/master/data/flights.csv"
df_raw = pd.read_csv(f"{repo}/{project}")
df_raw = df_raw[["MONTH", "ORG_AIR", "DEST_AIR", "CANCELLED"]]

# Patch a custom method in
def select(df, col, val):
    return df[df[col] == val]
    
pd.DataFrame.select = select

# And then pick your favourite method!
df = (
    # You can use query
    df_raw.query("DEST_AIR == 'LAX'")
    # A custom patched method
    .select("MONTH", 1)
    # The same method via pipe
    .pipe(select, "ORG_AIR", "SFO")
    # Or pass a function to loc
    .loc[lambda x: x["CANCELLED"] == 0]
)