Which Big City has the best Weather?

6th July 2020

Python Plotting Preferred Weather

Like many people, I might one day be moving to the United States for work. However, I would like to know where I should move to. Where is it too hot? Where is it too wet? Where is my little slice of paradise??

This is what we’ll be making:

[Image: the final plot - one donut chart per major US city, showing a year of daily weather ratings]

So using the NOAA Global Summary of the Day and the top 25 most populated cities in the US, we’ve got enough to make a plot.

But first, we have to smack this data format into something usable.

Data preparation

I downloaded the data from 2013 to 2020, and extracted the billion-odd csv files into a directory per year. Ouch. We’re going to have to pre-process this. Let’s load in the list of cities first from cities.csv. You can download that file here.
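For reference, here is the directory layout I'm assuming on disk (the drive and the example station filename are from my machine; adjust to taste):

D:/data/weather/
├── cities.csv
├── 2013/
│   ├── 71076099999.csv
│   └── ...
└── 2020/
    ├── 71076099999.csv
    └── ...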

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import cmasher as cmr
from pathlib import Path


root = Path("D:/data/weather/")

cities = pd.read_csv(root / "cities.csv")
cities.head()
            name      lat        long
0       New York  40.6635  -73.938700
1    Los Angeles  34.0194 -118.410800
2        Chicago  41.8376  -87.681800
3     Washington  38.9041  -77.017200
4  San Francisco  37.7775 -122.416389

Great, the cities loaded fine. Next we’ll need to find all the csv files for the weather stations and load them in. But first, let’s open one of the weather files as an example:

pd.read_csv(root / "2020/71076099999.csv").describe().T[["mean", "std", "min", "max"]]
                          mean           std           min           max
STATION           7.107610e+10  0.000000e+00  7.107610e+10  7.107610e+10
LATITUDE          5.956667e+01  1.282585e-13  5.956667e+01  5.956667e+01
LONGITUDE        -1.084833e+02  4.275283e-14 -1.084833e+02 -1.084833e+02
ELEVATION         3.180000e+02  0.000000e+00  3.180000e+02  3.180000e+02
TEMP              1.806011e+01  2.669257e+01 -3.280000e+01  6.630000e+01
TEMP_ATTRIBUTES   2.366854e+01  1.440817e+00  1.000000e+01  2.400000e+01
DEWP              7.848315e+00  2.382630e+01 -4.060000e+01  4.990000e+01
DEWP_ATTRIBUTES   2.361236e+01  1.461856e+00  1.000000e+01  2.400000e+01
SLP               1.017151e+03  1.084217e+01  9.926000e+02  1.048300e+03
SLP_ATTRIBUTES    2.366854e+01  1.440817e+00  1.000000e+01  2.400000e+01
STP               9.654730e+02  1.031018e+02  4.500000e+00  9.986000e+02
STP_ATTRIBUTES    2.366854e+01  1.440817e+00  1.000000e+01  2.400000e+01
VISIB             9.999000e+02  3.192211e-12  9.999000e+02  9.999000e+02
VISIB_ATTRIBUTES  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
WDSP              4.801685e+00  2.247471e+00  6.000000e-01  1.490000e+01
WDSP_ATTRIBUTES   2.366854e+01  1.440817e+00  1.000000e+01  2.400000e+01
MXSPD             8.594944e+00  3.352649e+00  1.900000e+00  1.810000e+01
GUST              5.808478e+02  4.868429e+02  1.500000e+01  9.999000e+02
MAX               3.051124e+01  2.678324e+01 -2.310000e+01  8.200000e+01
MIN               7.076404e+00  2.663468e+01 -4.320000e+01  5.450000e+01
PRCP              2.904494e-02  7.611089e-02  0.000000e+00  5.600000e-01
SNDP              4.333691e+02  4.848170e+02  8.000000e-01  9.999000e+02
FRSHTT            0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00

Alright, so we have one csv file per station, per year. What we want to do is match stations within a certain lat/long distance of each city’s coordinates. Oh boy, this is going to take a while to run.

Let’s write up something that goes through and just extracts station, lat, and long, and then filters down to the stations we want. I’m only going to use stations present in 2020 (but they’ll have data from other years).

station_locations = root / "station_locations.csv"

# If I've done this before, load in the previous results
if os.path.exists(station_locations):
    station_df = pd.read_csv(station_locations)
    
# Or else figure it out from scratch using the 2020 files
else:
    stations = {}
    y = "2020"
    cols = ["STATION", "LATITUDE", "LONGITUDE"]
    for f in os.listdir(root / y):
        path = root / y / f
        # Only read the first row of each file - the station location doesn't change
        res = pd.read_csv(path, nrows=1, usecols=cols)
        stations[f.replace(".csv", "")] = res
    station_df = pd.concat(list(stations.values()))
    station_df.columns = ["station", "lat", "long"]
    station_df.to_csv(station_locations, index=False)

Wow, that took a long time to run. Let’s save it out so we don’t have to do that again. And then I’m going to edit the code above to check whether station_locations.csv already exists, so this doesn’t rerun whenever I start this notebook up.
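If I were doing this again, I’d skip pandas for this scan entirely and read just the first data row of each file with the csv module, which avoids spinning up a full pd.read_csv per file. A sketch (first_row is a hypothetical helper, not something used above):

import csv

def first_row(path):
    # Read only the header plus one data row; station locations never change
    with open(path, newline="") as fh:
        row = next(csv.DictReader(fh))
    return row["STATION"], float(row["LATITUDE"]), float(row["LONGITUDE"])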

# Checking everything looks alright
station_df.head()
      station        lat       long
0  1001099999  70.933333  -8.666667
1  1001499999  59.791925   5.340850
2  1002099999  80.050000  16.250000
3  1003099999  77.000000  15.500000
4  1006099999  78.250000  22.816667

Okay, first issue down. Now what we can do is filter that list down to the stations that are in the cities we care about. So we want a method that maps a station to a city, if it is close enough:

def check_station(row, threshold=0.1):
    station, lat, long, *_ = row
    # Squared distance in degrees - crude, but fine at city scale
    distance = (cities.lat - lat)**2 + (cities.long - long)**2
    matches = distance < threshold
    if matches.sum():
        return cities[matches].name.iloc[0]
    else:
        return np.NaN

# Let's test this works - we should get New York out
check_station(("New York", 40.6, -73.9))
'New York'
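One caveat before moving on: squared differences in degrees aren’t a true distance, since a degree of longitude shrinks towards the poles. For well-separated US cities the crude version is fine, but if I wanted to be more careful, a haversine check in real kilometres would look something like this (a sketch; haversine_km and the ~30 km cutoff are my own inventions, not anything from NOAA):

def haversine_km(lat1, long1, lat2, long2):
    # Great-circle distance in km between points given in degrees
    lat1, long1, lat2, long2 = map(np.radians, [lat1, long1, lat2, long2])
    a = np.sin((lat2 - lat1) / 2)**2 \
        + np.cos(lat1) * np.cos(lat2) * np.sin((long2 - long1) / 2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# e.g. stations within ~30 km of New York's coordinates
close = haversine_km(station_df.lat, station_df.long, 40.6635, -73.9387) < 30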

Fantastic. Sticking with the simple degree-based check, let’s run it over all stations:

station_df["city"] = station_df.apply(check_station, axis=1)
final_stations = station_df.set_index("station").dropna()
final_stations.head()
                   lat       long          city
station
71538099999  42.275556 -82.955556       Detroit
72011354829  42.543060 -83.178060       Detroit
72029153970  32.746940 -96.530560        Dallas
72030464752  40.100000 -75.266670  Philadelphia
72033493764  39.166670 -77.166670    Washington
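As a quick sanity check, we can count what survived the match (a sketch; I haven’t shown the per-city spread here):

len(final_stations)                 # 91
final_stations.city.value_counts()  # how stations spread across cities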

Great, so we’ve got matching weather stations now. In fact, we have 91 of them! Now for the super slow part. For each of these stations, I want to go through and load in the csv files for all available years, keeping the city information preserved.

years = [y for y in os.listdir(root) if os.path.isdir(root / y)]
cols = ["STATION", "DATE", "MAX", "WDSP", "PRCP"]

df_name = root / "weather.csv"

if os.path.exists(df_name):
    df_weather = pd.read_csv(df_name)
else:
    dfs = []
    for y in years:
        for f in os.listdir(root / y):
            station = int(f.replace(".csv", ""))
            # Only load stations that matched one of our cities
            if station in final_stations.index:
                df = pd.read_csv(root / y / f, usecols=cols, parse_dates=["DATE"])
                df["city"] = final_stations.loc[station, "city"]
                dfs.append(df)
    df_weather = pd.concat(dfs).reset_index(drop=True)
    df_weather.to_csv(df_name, index=False)
    
df_weather    
            STATION       DATE  WDSP   MAX  PRCP     city
0       71538099999 2013-01-01   7.3  34.2   0.0  Detroit
1       71538099999 2013-01-02   5.8  30.2   0.0  Detroit
2       71538099999 2013-01-03  10.0  28.4   0.0  Detroit
3       71538099999 2013-01-04  14.8  30.2   0.0  Detroit
4       71538099999 2013-01-05   9.5  30.2   0.0  Detroit
...             ...        ...   ...   ...   ...      ...
426336  99849999999 2020-06-26   3.3  78.6   0.0  Chicago
426337  99849999999 2020-06-27   4.8  84.7   0.0  Chicago
426338  99849999999 2020-06-28   3.1  84.9   0.0  Chicago
426339  99849999999 2020-06-29   2.9  79.9   0.0  Chicago
426340  99849999999 2020-06-30   3.1  80.1   0.0  Chicago

[426341 rows × 6 columns]

Alright, now we’re getting somewhere: 426k rows! Except, oh boy, I can see that values like 999.9 are being used as placeholders for missing data. Let’s fix that up.

# Yes it has all three variants -_-
df_weather = df_weather.replace(999.9, np.NaN).replace(99.99, np.NaN).replace(9999.9, np.NaN)
df_weather
            STATION       DATE  WDSP   MAX  PRCP     city
0       71538099999 2013-01-01   7.3  34.2   0.0  Detroit
1       71538099999 2013-01-02   5.8  30.2   0.0  Detroit
2       71538099999 2013-01-03  10.0  28.4   0.0  Detroit
3       71538099999 2013-01-04  14.8  30.2   0.0  Detroit
4       71538099999 2013-01-05   9.5  30.2   0.0  Detroit
...             ...        ...   ...   ...   ...      ...
426336  99849999999 2020-06-26   3.3  78.6   0.0  Chicago
426337  99849999999 2020-06-27   4.8  84.7   0.0  Chicago
426338  99849999999 2020-06-28   3.1  84.9   0.0  Chicago
426339  99849999999 2020-06-29   2.9  79.9   0.0  Chicago
426340  99849999999 2020-06-30   3.1  80.1   0.0  Chicago

[426341 rows × 6 columns]
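That blanket replace works here because we only kept MAX, WDSP and PRCP, but it’s blunt: a sentinel like 999.9 could be a legitimate value in some other column (sea-level pressure, say). If I’m reading the GSOD format documentation right, each column has its own missing-value sentinel, so a column-aware version would be safer:

# Per-column missing-value sentinels (my reading of the GSOD docs)
sentinels = {"MAX": 9999.9, "WDSP": 999.9, "PRCP": 99.99}
for col, val in sentinels.items():
    df_weather[col] = df_weather[col].replace(val, np.NaN)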

Data engineering

The distinction here is that - at this point - we have a viable dataset which we can manipulate as we see fit. We can create features and play around, and we have a checkpoint right here to come back to. What I want to do is create a few features that capture what I would consider a good or bad day. I’ll keep it simple and break this into temperature and weather features. And note that - as I don’t really care what the temperature is during the night - I’ll use the max temperature for everything.

I want to define a sliding scale between -1 and 1, where -1 is a horrible cold day (cold and bad weather) and 1 is a horrible hot day (hot and bad weather). This is obviously entirely subjective, but here’s what I’ll do:

  • “Cold” is normalised between -10C and 15C (14F and 59F)
  • “Hot” is normalised between 28C and 38C (83F and 100F)
  • Precipitation (in inches) is normalised between 0 and 1.5
  • Wind (in knots) is normalised between 0 and 20
  • “Weather” is the maximum of the precipitation and wind ratings

We can see the distributions here:

df_weather[["MAX", "PRCP", "WDSP"]].hist(bins=100, log=True, layout=(1, 3), figsize=(10, 4));

[Image: log-scale histograms of the MAX, PRCP and WDSP distributions]

So let’s add those features in:

from sklearn.preprocessing import minmax_scale

df_weather["cold"] = minmax_scale(df_weather.MAX.clip(14, 59), (-1, 0))
df_weather["hot"] = minmax_scale(df_weather.MAX.clip(83, 100), (0, 1))
df_weather["rating_precip"] = minmax_scale(df_weather.PRCP.clip(0, 1.5), (0, 1))
df_weather["rating_wind"] = minmax_scale(df_weather.WDSP.clip(0, 20), (0, 1))

# Combine the weather and temp ratings
df_weather["rating_weather"] = df_weather[["rating_precip", "rating_wind"]].max(axis=1)
# The tiny offset keeps np.sign from returning 0 on perfectly mild days
df_weather["rating_temp"] = df_weather.cold + df_weather.hot + 0.001

# Make a little feature for our total rating: bad weather pushes the rating
# further in whichever direction the temperature already leans
df_weather["rating"] = np.sign(df_weather.rating_temp) * df_weather.rating_weather + df_weather.rating_temp
df_weather["rating"] = minmax_scale(df_weather["rating"].clip(-1, 1))
D:\anaconda3\lib\site-packages\pandas\core\series.py:679: RuntimeWarning: invalid value encountered in sign
  result = getattr(ufunc, method)(*inputs, **kwargs)
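That warning is just np.sign hitting the rows with missing data; they fall out of the daily averages later. To make the combination concrete, here’s the rating traced by hand for one made-up day, assuming each clipped column actually spans its clip bounds so that minmax_scale reduces to a fixed linear map:

# Hypothetical day: MAX = 90F, PRCP = 0.3 in, WDSP = 10 kt
cold = 0.0                                 # 90F clips to 59F, the top of the cold scale
hot = (90 - 83) / (100 - 83)               # ~0.41 on the hot scale
rating_weather = max(0.3 / 1.5, 10 / 20)   # max(0.2, 0.5) = 0.5
rating_temp = cold + hot + 0.001           # ~0.41
rating = np.sign(rating_temp) * rating_weather + rating_temp  # ~0.91
rating01 = (np.clip(rating, -1, 1) + 1) / 2  # ~0.96: a hot, blustery day

Then we can look at the distribution of the final ratings: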
ax = df_weather["rating"].hist(bins=100, log=True, figsize=(8, 4))
ax.set_xlabel("Rating"), ax.set_ylabel("Frequency");

[Image: log-scale histogram of the combined rating]

Obviously this is all completely dependent on arbitrary choices about what counts as good weather. Let’s not get caught up on that; we can come back to this feature later if we need to. For now, we want to take all the datapoints we have and average them for each day of the year.

df_weather["day"] = df_weather.DATE.dt.dayofyear
df = df_weather.groupby(["city", "day"]).rating.mean().reset_index()
df
             city  day    rating
0         Atlanta    1  0.411146
1         Atlanta    2  0.349296
2         Atlanta    3  0.442816
3         Atlanta    4  0.491880
4         Atlanta    5  0.398219
...           ...  ...       ...
10975  Washington  362  0.423438
10976  Washington  363  0.404648
10977  Washington  364  0.357907
10978  Washington  365  0.335265
10979  Washington  366  0.369574

[10980 rows × 3 columns]

Make the plot

Again we’re going to run into the issue that we’ve compressed two axes (temp and weather) down into one. Ah well, I’m not going to publish this anyway!

First, I’ll sort the cities by their average rating and pick a colorset. I’m going to use a colormap from cmasher because I think it looks great.

city_order = df.groupby("city").rating.mean().sort_values(ascending=False)
cmap = plt.get_cmap('cmr.fusion_r')

And then away we go plotting!

# Get a nice dark figure
bg = "#111111"
plt.style.use("dark_background")
fig, axes = plt.subplots(ncols=5, nrows=6, figsize=(10, 12))
fig.patch.set_facecolor(bg)

# Make donut plots
for (city, avg), ax in zip(city_order.items(), axes.flatten()):
    df_tmp = df[df.city == city]
    xs = np.ones(df_tmp.shape[0])
    colors = [cmap(0.95 * r) for r in df_tmp.rating]
    ws, ts = ax.pie(xs, colors=colors, counterclock=False, startangle=90)
    # Set the line for each wedge to stop artifacts
    for w, c in zip(ws, colors):
        w.set_linewidth(0.5)
        w.set_edgecolor(c)
        
    ax.set_title(city, pad=-2)
    ax.set_aspect('equal', 'box')
    ax.add_artist(plt.Circle((0, 0), .6, color=bg))
    
    # Add average rating color circle
    ax.add_artist(plt.Circle((0, 0), .2, color=cmap(avg)))


# Add the custom colorbar
fig.subplots_adjust(top=0.83, bottom=0.04)
cbar_ax = fig.add_axes([0.2, 0.9, 0.6, 0.015])
cbar_ax.set_title("Yearly weather trends for major US cities", pad=25, fontsize=18)
cb = mpl.colorbar.ColorbarBase(cbar_ax, cmap=cmap, orientation='horizontal', 
                               ticks=[0, 0.25, 0.5, 0.75, 1])
cb.outline.set_visible(False)
cb.ax.set_xticklabels(["Cold and/or \nBad Weather", "Chilly", "Feeling good", "Warm", "Hot and/or\nBad Weather"]);

# And add our annotations
plt.annotate('Using completely arbitrary temperature preferences', 
                    (0.5,1), (0, 15), xycoords='axes fraction', color="#a19a92",
                    textcoords='offset points', size=10, va='top', ha="center")
plt.annotate('Source: NOAA (https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ncdc:C00516)', 
                    (0.5,0.024), (0, 0), xycoords='figure fraction', color="#a19a92",
                    textcoords='offset points', size=10, va='bottom', ha="center")

[Image: the finished plot - yearly weather trend donuts for major US cities, with colorbar]
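If you want to save the figure out with the dark background intact, something like this should work (the filename and dpi are arbitrary):

fig.savefig("weather.png", facecolor=bg, dpi=200)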

There we go! It’s not perfect, but I think it’s pretty, and San Jose is certainly looking like a real nice place to go! And let’s not go live in Phoenix. I don’t know if it’s just stupidly hot there or if there’s bad weather as well (a failing of the plot, I admit) - either way, nope.



For your convenience, here’s the code in one block:

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import cmasher as cmr
from pathlib import Path


root = Path("D:/data/weather/")

cities = pd.read_csv(root / "cities.csv")
cities.head()
pd.read_csv(root / "2020/71076099999.csv").describe().T[["mean", "std", "min", "max"]]
station_locations = root / "station_locations.csv"

# If I've done this before, load in the previous results
if os.path.exists(station_locations):
    station_df = pd.read_csv(station_locations)
    
# Or else figure it out from scratch using the 2020 files
else:
    stations = {}
    y = "2020"
    cols = ["STATION", "LATITUDE", "LONGITUDE"]
    for f in os.listdir(root / y):
        path = root / y / f
        # Only read the first row of each file - the station location doesn't change
        res = pd.read_csv(path, nrows=1, usecols=cols)
        stations[f.replace(".csv", "")] = res
    station_df = pd.concat(list(stations.values()))
    station_df.columns = ["station", "lat", "long"]
    station_df.to_csv(station_locations, index=False)
# Checking everything looks alright
station_df.head()
def check_station(row, threshold=0.1):
    station, lat, long, *_ = row
    # Squared distance in degrees - crude, but fine at city scale
    distance = (cities.lat - lat)**2 + (cities.long - long)**2
    matches = distance < threshold
    if matches.sum():
        return cities[matches].name.iloc[0]
    else:
        return np.NaN

# Let's test this works - we should get New York out
check_station(("New York", 40.6, -73.9))
station_df["city"] = station_df.apply(check_station, axis=1)
final_stations = station_df.set_index("station").dropna()
final_stations.head()
years = [y for y in os.listdir(root) if os.path.isdir(root / y)]
cols = ["STATION", "DATE", "MAX", "WDSP", "PRCP"]

df_name = root / "weather.csv"

if os.path.exists(df_name):
    df_weather = pd.read_csv(df_name)
else:
    dfs = []
    for y in years:
        for f in os.listdir(root / y):
            station = int(f.replace(".csv", ""))
            # Only load stations that matched one of our cities
            if station in final_stations.index:
                df = pd.read_csv(root / y / f, usecols=cols, parse_dates=["DATE"])
                df["city"] = final_stations.loc[station, "city"]
                dfs.append(df)
    df_weather = pd.concat(dfs).reset_index(drop=True)
    df_weather.to_csv(df_name, index=False)
    
df_weather    
# Yes it has all three variants -_-
df_weather = df_weather.replace(999.9, np.NaN).replace(99.99, np.NaN).replace(9999.9, np.NaN)
df_weather
df_weather[["MAX", "PRCP", "WDSP"]].hist(bins=100, log=True, layout=(1, 3), figsize=(10, 4));
from sklearn.preprocessing import minmax_scale

df_weather["cold"] = minmax_scale(df_weather.MAX.clip(14, 59), (-1, 0))
df_weather["hot"] = minmax_scale(df_weather.MAX.clip(83, 100), (0, 1))
df_weather["rating_precip"] = minmax_scale(df_weather.PRCP.clip(0, 1.5), (0, 1))
df_weather["rating_wind"] = minmax_scale(df_weather.PRCP.clip(0, 20), (0, 1))

# Combine the weather and temp ratings
df_weather["rating_weather"] = df_weather[["rating_precip", "rating_wind"]].max(axis=1)
df_weather["rating_temp"] = df_weather.cold + df_weather.hot + 0.001

# Make some little feature to represent our total rating
df_weather["rating"] = np.sign(df_weather.rating_temp) * df_weather.rating_weather + df_weather.rating_temp
df_weather["rating"] = minmax_scale(df_weather["rating"].clip(-1, 1))
ax = df_weather["rating"].hist(bins=100, log=True, figsize=(8, 4))
ax.set_xlabel("Rating"), ax.set_ylabel("Frequency");
df_weather["day"] = df_weather.DATE.dt.dayofyear
df = df_weather.groupby(["city", "day"]).rating.mean().reset_index()
df
city_order = df.groupby("city").rating.mean().sort_values(ascending=False)
cmap = plt.get_cmap('cmr.fusion_r')
# Get a nice dark figure
bg = "#111111"
plt.style.use("dark_background")
fig, axes = plt.subplots(ncols=5, nrows=6, figsize=(10, 12))
fig.patch.set_facecolor(bg)

# Make donut plots
for (city, avg), ax in zip(city_order.items(), axes.flatten()):
    df_tmp = df[df.city == city]
    xs = np.ones(df_tmp.shape[0])
    colors = [cmap(0.95 * r) for r in df_tmp.rating]
    ws, ts = ax.pie(xs, colors=colors, counterclock=False, startangle=90)
    # Set the line for each wedge to stop artifacts
    for w, c in zip(ws, colors):
        w.set_linewidth(0.5)
        w.set_edgecolor(c)
        
    ax.set_title(city, pad=-2)
    ax.set_aspect('equal', 'box')
    ax.add_artist(plt.Circle((0, 0), .6, color=bg))
    
    # Add average rating color circle
    ax.add_artist(plt.Circle((0, 0), .2, color=cmap(avg)))


# Add the custom colorbar
fig.subplots_adjust(top=0.83, bottom=0.04)
cbar_ax = fig.add_axes([0.2, 0.9, 0.6, 0.015])
cbar_ax.set_title("Yearly weather trends for major US cities", pad=25, fontsize=18)
cb = mpl.colorbar.ColorbarBase(cbar_ax, cmap=cmap, orientation='horizontal', 
                               ticks=[0, 0.25, 0.5, 0.75, 1])
cb.outline.set_visible(False)
cb.ax.set_xticklabels(["Cold and/or \nBad Weather", "Chilly", "Feeling good", "Warm", "Hot and/or\nBad Weather"]);

# And add our annotations
plt.annotate('Using completely arbitrary temperature preferences', 
                    (0.5,1), (0, 15), xycoords='axes fraction', color="#a19a92",
                    textcoords='offset points', size=10, va='top', ha="center")
plt.annotate('Source: NOAA (https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.ncdc:C00516)', 
                    (0.5,0.024), (0, 0), xycoords='figure fraction', color="#a19a92",
                    textcoords='offset points', size=10, va='bottom', ha="center")