6th August 2020
Removing the hassle of misspelling from a dataset
In this tweet, Steven Rich pointed out that Philadelphia is spelled at least 57 different ways in the PPP loan data, and that this represents both a challenge to fix on the back-end and a perfect example of why you should do as much work as possible on the front-end to get better input.
In this write-up, we'll work out an easy way of fixing these spelling issues to produce a far better dataset to work with, using the Python library FuzzyWuzzy.
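If you don't already have the library, it's available on PyPI (installing python-Levenshtein alongside it speeds up the matching considerably):

pip install fuzzywuzzy python-Levenshtein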
We’ll source our copy of the PPP dataset from this Kaggle challenge. I’ve downloaded this into a local folder called fuzzy_ppp.
import pandas as pd
import numpy as np
df = pd.read_csv("fuzzy_ppp/PPP_data_150k_plus.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 661218 entries, 0 to 661217
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LoanRange 661218 non-null object
1 BusinessName 661210 non-null object
2 Address 661201 non-null object
3 City 661203 non-null object
4 State 661218 non-null object
5 Zip 661202 non-null float64
6 NAICSCode 654435 non-null float64
7 BusinessType 659789 non-null object
8 RaceEthnicity 661218 non-null object
9 Gender 661218 non-null object
10 Veteran 661218 non-null object
11 NonProfit 42462 non-null object
12 JobsRetained 620712 non-null float64
13 DateApproved 661218 non-null object
14 Lender 661218 non-null object
15 CD 661218 non-null object
dtypes: float64(3), object(13)
memory usage: 80.7+ MB
Alright, so we’ve got a lot of data here: 600k rows, but the column we really care about is City. Note: for a non-introductory tutorial, you should definitely utilise the Zip and State columns as well to better determine the actual city.
For now, let’s just see how many unique cities we have in our dataset:
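cities = np.sort(df.City.str.replace(",", "").str.strip().unique().astype(str))
print(cities.size, cities)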
15652 ['02155' '1026-02-006' '10550' ... 'ZUNI' 'ZWOLLE' 'nan']
Almost sixteen thousand cities. That is a big number. And some interesting cities appear to be entirely numeric. Perhaps they are postcodes? My lack of familiarity with the US system (as an Australian) means I lack a certain domain knowledge here, but let’s push on! To simplify things and make the output… readable… let’s just look at cities starting with "PH":
cities_p = [c for c in cities if c.startswith("PH")]
cities_p
['PHARR',
'PHEBA',
'PHELAN',
'PHELPS',
'PHENIX',
'PHENIX CITY',
'PHEONIX',
'PHIADELPHIA',
'PHIALDELPHIA',
'PHIL CAMPBELL',
'PHILA',
'PHILA.',
'PHILADELKPHIA',
'PHILADELPHIA',
'PHILADELPHILA',
'PHILADELPIA',
'PHILADEPHIA',
'PHILDADELPHIA',
'PHILDELPHIA',
'PHILIP',
'PHILIPPI',
'PHILIPSBURG',
'PHILIPSTOWN',
'PHILLIPS',
'PHILLIPSBURG',
'PHILLIPSTON',
'PHILMONT',
'PHILO',
'PHILOMATH',
'PHILPOT',
'PHIPPSBURG',
'PHOENICIA',
'PHOENIX',
'PHOENIX AZ 85017',
'PHOENIXA',
'PHOENIXVILE',
'PHOENIXVILLE',
'PHOENX',
'PHONEIX']
It’s not 57 different spellings in our dataset, but oh boy do we have a lot of different ways of writing down Philadelphia.
Next up, let’s get a proper list of cities from this GitHub repository. Unfortunately, this dataset is not comprehensive, but it will do for us!
actual_cities = pd.read_csv("fuzzy_ppp/us_cities.csv")
actual_cities
 | ID | STATE_CODE | STATE_NAME | CITY | COUNTY | LATITUDE | LONGITUDE |
---|---|---|---|---|---|---|---|
0 | 1 | AK | Alaska | Adak | Aleutians West | 55.999722 | -161.207778 |
1 | 2 | AK | Alaska | Akiachak | Bethel | 60.891854 | -161.392330 |
2 | 3 | AK | Alaska | Akiak | Bethel | 60.890632 | -161.199325 |
3 | 4 | AK | Alaska | Akutan | Aleutians East | 54.143012 | -165.785368 |
4 | 5 | AK | Alaska | Alakanuk | Wade Hampton | 62.746967 | -164.602280 |
... | ... | ... | ... | ... | ... | ... | ... |
29875 | 29876 | WY | Wyoming | Worland | Washakie | 44.013796 | -107.956260 |
29876 | 29877 | WY | Wyoming | Wright | Campbell | 43.829349 | -105.532327 |
29877 | 29878 | WY | Wyoming | Wyarno | Sheridan | 44.813333 | -106.773333 |
29878 | 29879 | WY | Wyoming | Yellowstone National Park | Park | 44.853913 | -110.674366 |
29879 | 29880 | WY | Wyoming | Yoder | Goshen | 41.912018 | -104.353507 |
29880 rows × 7 columns
For simplicity, let’s again look at just the cities starting with "PH":
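proper_names = np.unique([c.upper() for c in actual_cities.CITY if c.upper().startswith("PH")])
proper_names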
array(['PHARR', 'PHEBA', 'PHELAN', 'PHELPS', 'PHENIX', 'PHENIX CITY',
'PHIL CAMPBELL', 'PHILADELPHIA', 'PHILIP', 'PHILIPP', 'PHILIPPI',
'PHILIPSBURG', 'PHILLIPS', 'PHILLIPSBURG', 'PHILLIPSPORT',
'PHILLIPSVILLE', 'PHILMONT', 'PHILO', 'PHILOMATH', 'PHILOMONT',
'PHILPOT', 'PHIPPSBURG', 'PHLOX', 'PHOENICIA', 'PHOENIX',
'PHOENIXVILLE', 'PHYLLIS'], dtype='<U13')
Great! So we have a list of input cities and a list of actual cities. We don’t care about the entries that are already correct, so let’s pull out only the names which don’t match a known city.
unknown = [c for c in cities_p if c not in proper_names]
unknown
['PHEONIX',
'PHIADELPHIA',
'PHIALDELPHIA',
'PHILA',
'PHILA.',
'PHILADELKPHIA',
'PHILADELPHILA',
'PHILADELPIA',
'PHILADEPHIA',
'PHILDADELPHIA',
'PHILDELPHIA',
'PHILIPSTOWN',
'PHILLIPSTON',
'PHOENIX AZ 85017',
'PHOENIXA',
'PHOENIXVILE',
'PHOENX',
'PHONEIX']
Now time for the good stuff! Let’s delve into the fantastic package fuzzywuzzy. It uses some smart Levenshtein distance computation to determine how similar two words are.
So for an example unknown city from the list above, we’ll see how similar it is to all the potential cities we could match it to.
from fuzzywuzzy import fuzz
example = unknown[5]
for potential in proper_names:
    print(f"{example} -> {potential} = {fuzz.ratio(example, potential)}")
PHILADELKPHIA -> PHARR = 33
PHILADELKPHIA -> PHEBA = 33
PHILADELKPHIA -> PHELAN = 42
PHILADELKPHIA -> PHELPS = 53
PHILADELKPHIA -> PHENIX = 32
PHILADELKPHIA -> PHENIX CITY = 33
PHILADELKPHIA -> PHIL CAMPBELL = 54
PHILADELKPHIA -> PHILADELPHIA = 96
PHILADELKPHIA -> PHILIP = 53
PHILADELKPHIA -> PHILIPP = 50
PHILADELKPHIA -> PHILIPPI = 57
PHILADELKPHIA -> PHILIPSBURG = 42
PHILADELKPHIA -> PHILLIPS = 57
PHILADELKPHIA -> PHILLIPSBURG = 48
PHILADELKPHIA -> PHILLIPSPORT = 48
PHILADELKPHIA -> PHILLIPSVILLE = 38
PHILADELKPHIA -> PHILMONT = 38
PHILADELKPHIA -> PHILO = 44
PHILADELKPHIA -> PHILOMATH = 55
PHILADELKPHIA -> PHILOMONT = 36
PHILADELKPHIA -> PHILPOT = 50
PHILADELKPHIA -> PHIPPSBURG = 35
PHILADELKPHIA -> PHLOX = 33
PHILADELKPHIA -> PHOENICIA = 45
PHILADELKPHIA -> PHOENIX = 30
PHILADELKPHIA -> PHOENIXVILLE = 40
PHILADELKPHIA -> PHYLLIS = 30
We can see that the match with the highest similarity is, of course, PHILADELPHIA. So what if we just want to find the best match instead of printing them all out? That’s very easy to do!
from fuzzywuzzy import process
best_match = process.extractOne(example, proper_names)
print(best_match)
('PHILADELPHIA', 96)
If we want more than one, that’s also easy to do. Let’s pick a different word and get the top three matches:
example2 = unknown[0]
matches = process.extract(example2, proper_names, limit=3)
print(matches)
[('PHENIX', 92), ('PHOENIX', 86), ('PHENIX CITY', 79)]
How could we improve this? Firstly, we are ignoring the super useful information in the State and Zip columns, which we should also be using to inform our decision.
Secondly, at the moment all cities are weighted equally. But larger cities will have more entries, so their prior probability should be higher, and we could introduce a weighting based on city population. For example, Phoenix has a population around fifty times larger than Phenix, so if the two match scores were about the same, it would be statistically far more likely that the intended entry was Phoenix.
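As a rough sketch of what that weighting might look like (the city_population numbers and the alpha blend below are hypothetical choices for illustration, not part of fuzzywuzzy):

import numpy as np
from fuzzywuzzy import fuzz

# Hypothetical population lookup -- in practice, source this from census data
city_population = {"PHOENIX": 1_680_000, "PHENIX": 30_000}

def weighted_match(query, candidates, alpha=5):
    # Blend the fuzzy score with a log-population prior, so that when two
    # candidates score similarly, the larger city wins
    def score(candidate):
        return fuzz.ratio(query, candidate) + alpha * np.log10(city_population.get(candidate, 1))
    return max(candidates, key=score)

# PHENIX has the higher raw similarity to PHEONIX, but the population prior tips the result to PHOENIX
print(weighted_match("PHEONIX", ["PHENIX", "PHOENIX"]))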
So let’s now go through and extract the best match for every city in our inexact list.
for c in unknown:
    print(f"{c} -> {process.extractOne(c, proper_names)[0]}")
PHEONIX -> PHENIX
PHIADELPHIA -> PHILADELPHIA
PHIALDELPHIA -> PHILADELPHIA
PHILA -> PHILADELPHIA
PHILA. -> PHILADELPHIA
PHILADELKPHIA -> PHILADELPHIA
PHILADELPHILA -> PHILADELPHIA
PHILADELPIA -> PHILADELPHIA
PHILADEPHIA -> PHILADELPHIA
PHILDADELPHIA -> PHILADELPHIA
PHILDELPHIA -> PHILADELPHIA
PHILIPSTOWN -> PHILIP
PHILLIPSTON -> PHILLIPS
PHOENIX AZ 85017 -> PHOENIX
PHOENIXA -> PHOENIX
PHOENIXVILE -> PHOENIXVILLE
PHOENX -> PHOENIX
PHONEIX -> PHOENIX
An obvious issue is that our dataset of US cities is not complete. It turns out there is a tiny town called Philipstown in Putnam County, New York, but it’s not in our list, so it gets turned into another city. Apart from that, this is pretty good! If we wanted to do this in a dataframe, we could use df.City.map(lambda x: process.extractOne(x, proper_names)[0]) on anything that doesn’t match to extract a new series.
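For example, here’s a minimal sketch (the CityClean column name is just an illustration, and corrections here only covers the "PH" names we flagged as unknown):

# Build the correction table once, with a single extractOne call per unique unknown name
corrections = {c: process.extractOne(c, proper_names)[0] for c in unknown}

# Map it over the City column, leaving already-correct (or unflagged) names untouched
df["CityClean"] = df.City.map(lambda x: corrections.get(x, x))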
Now, this is all great, but there’s plenty more we can do with the fuzzywuzzy library.
Let’s have a different example: acronyms.
a = "National Aeronautics and Space Administration (NASA)"
b = "NASA"
We know these refer to the same thing, but if we tried a simple ratio similarity test, the result wouldn’t look too good:
print(fuzz.ratio(a, b))
14
Of course, this compares the entire strings. We can ask for the partial ratio, which will help us out a lot:
print(fuzz.partial_ratio(a, b))
100
But things can get more complicated:
a = "Machine Learning vs Artifical Intelligence"
b = "Artifical Intelligence vs Machine Learning"
print(fuzz.ratio(a, b), fuzz.partial_ratio(a, b))
52 52
Obviously swapping the order has confused things. So let’s tokenise!
print(fuzz.token_sort_ratio(a, b))
100
Oh yeah, that’s what we want. It’s broken the string down into tokens, sorted them, then taken the ratio so that we don’t care about order. But wait, there’s more!
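a = "Machine Learning vs Artifical Intelligence"
b = "The Incredible and Often Semantic difference between Artifical Intelligence and Machine Learning"
print(fuzz.token_sort_ratio(a, b), fuzz.token_set_ratio(a, b))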
59 96
The sheer size difference between the two strings means that token_sort_ratio is no longer giving good results. Instead, we can use token_set_ratio, which breaks both strings into tokens and then better takes into account the intersection of those two sets.
As a final note, computing the distance (and thus the ratios) between strings is a slow process. Running this over millions of records at a time will take a very long time, so this technique is best suited to cleaning data upon ingestion, such that the work is spread out over time rather than run in bulk.
I should point out that process.extract (and similar methods) allows you to specify the scoring algorithm used. By default it’s fuzz.WRatio, a smart algorithm which runs ratio, partial_ratio and the tokenised versions depending on the size difference and matching between the strings.
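For example, if we wanted to force a specific scorer (token_sort_ratio here, chosen arbitrarily):

best = process.extractOne(example, proper_names, scorer=fuzz.token_sort_ratio)
print(best)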
fuzzywuzzy is a great tool for matching strings based on their Levenshtein distance. The functions we touched on are:

- fuzz.ratio: Distance between two strings
- fuzz.partial_ratio: Takes partial matching into account
- fuzz.token_sort_ratio: Tokenises and sorts your strings
- fuzz.token_set_ratio: Takes the intersection of tokens into account better
- fuzz.WRatio: Smart weighted combination of many methods
- process.extract: Uses WRatio (by default) to get matches against a list of input strings
- process.extractOne: Returns only the best match from a list of input strings

But there are plenty more, so go check out the library!
For your convenience, here’s the code in one block:
import pandas as pd
import numpy as np

# Load the PPP loans dataset
df = pd.read_csv("fuzzy_ppp/PPP_data_150k_plus.csv")
df.info()

# Unique city names, with commas and surrounding whitespace stripped
cities = np.sort(df.City.str.replace(",", "").str.strip().unique().astype(str))
print(cities.size, cities)

# Just the cities starting with "PH", to keep the output readable
cities_p = [c for c in cities if c.startswith("PH")]
cities_p

# Reference list of real US cities
actual_cities = pd.read_csv("fuzzy_ppp/us_cities.csv")
actual_cities

proper_names = np.unique([c.upper() for c in actual_cities.CITY if c.upper().startswith("PH")])
proper_names

# Names that don't match any known city
unknown = [c for c in cities_p if c not in proper_names]
unknown

from fuzzywuzzy import fuzz

# Similarity of one unknown name against every candidate
example = unknown[5]
for potential in proper_names:
    print(f"{example} -> {potential} = {fuzz.ratio(example, potential)}")

from fuzzywuzzy import process

# Best match only
best_match = process.extractOne(example, proper_names)
print(best_match)

# Top three matches
example2 = unknown[0]
matches = process.extract(example2, proper_names, limit=3)
print(matches)

# Best match for every unknown name
for c in unknown:
    print(f"{c} -> {process.extractOne(c, proper_names)[0]}")

# Acronyms: ratio vs partial_ratio
a = "National Aeronautics and Space Administration (NASA)"
b = "NASA"
print(fuzz.ratio(a, b))
print(fuzz.partial_ratio(a, b))

# Word order: token_sort_ratio
a = "Machine Learning vs Artifical Intelligence"
b = "Artifical Intelligence vs Machine Learning"
print(fuzz.ratio(a, b), fuzz.partial_ratio(a, b))
print(fuzz.token_sort_ratio(a, b))

# Very different lengths: token_set_ratio
a = "Machine Learning vs Artifical Intelligence"
b = "The Incredible and Often Semantic difference between Artifical Intelligence and Machine Learning"
print(fuzz.token_sort_ratio(a, b), fuzz.token_set_ratio(a, b))