Inspired by http://creatingdata.us/etc/streets/, we compute a word vector for each street name and then visualize the relationships between street names.
import pandas as pd
from gensim.models import Word2Vec
import numpy as np
import multiprocessing
from umap import UMAP
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
Read the CSV file of all streets into memory. This file can be generated by the filter script; the code assumes it is saved in the following folder.
streets = pd.read_csv('../data/streets.csv')
# Group streets by postcode
groups = streets.groupby('postcode')
To train a Word2Vec model we normally need a list of sentences, usually mined from some text source. One way to emulate this with street names is to treat each postcode as a separate sentence that lists all the street names in that postcode.
# Create a list of lists containing the street names of each postcode
cleaned = []
for m_id, values in groups:
    city = []
    for nl, fr in values[['streetname_nl', 'streetname_fr']].values:
        # Add the Dutch and French names, skipping null values
        # (missing names are NaN floats, which have no .lower())
        if isinstance(nl, str):
            city.append(nl.lower())
        if isinstance(fr, str):
            city.append(fr.lower())
    cleaned.append(city)
Now we can compute the word vectors for each street name. First we create our model. Notable parameters are min_count, which states that a street name must occur at least twice to be included in the word vectors, and window, which specifies how far apart two words in a sentence may be and still be associated with each other (set to 10 here).
cores = multiprocessing.cpu_count() # Count the number of cores in a computer
# Note: this uses the gensim 3.x API; in gensim 4 `size` was renamed to `vector_size`
postcode_grouped = Word2Vec(min_count=2,
                            window=10,
                            size=300,
                            sample=6e-5,
                            alpha=0.03,
                            min_alpha=0.0007,
                            negative=20,
                            workers=cores-1)
# Build vocabulary, dropping streets that only occur once
postcode_grouped.build_vocab(cleaned, progress_per=10000)
# Train the word vectors
postcode_grouped.train(cleaned, total_examples=postcode_grouped.corpus_count, epochs=30, report_delay=1)
# L2-normalize the vectors in place to save memory; the model cannot be trained further afterwards
postcode_grouped.init_sims(replace=True)
This model extracts some of the associations between the occurrences of streets across different postcodes. Especially for commonly occurring street names, the nearest neighbours tend to be other streets in the same style.
postcode_grouped.wv.most_similar(positive=["dorpsstraat"])
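We can also query the similarity between two specific names directly. A minimal sketch; the names below other than "dorpsstraat" are hypothetical examples, not taken from the data, and must occur often enough to be in the vocabulary:
# Hypothetical example names; each must be in the model's vocabulary
for name in ("kerkstraat", "stationsstraat"):
    if name in postcode_grouped.wv.vocab:
        print(name, postcode_grouped.wv.similarity("dorpsstraat", name))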
We can also group streets by geolocation (the geolocation of the addresses belonging to each street). We collect the streets into bins and make the bins overlap, so that a neighbourhood is contained in at least one bin. Each bin is then interpreted as a 'sentence' for training the word vectors.
Read the addresses CSV file; this assumes you have the file saved in the following folder.
addresses = pd.read_csv('../data/belgium_addresses.csv')
min_x = addresses['EPSG:31370_x'].min()
max_x = addresses['EPSG:31370_x'].max()
min_y = addresses['EPSG:31370_y'].min()
max_y = addresses['EPSG:31370_y'].max()
binsize = 1000
coll = {}
# Get the ids of the necessary columns
x_id = addresses.columns.get_loc('EPSG:31370_x')
y_id = addresses.columns.get_loc('EPSG:31370_y')
nl_id = addresses.columns.get_loc('streetname_nl')
fr_id = addresses.columns.get_loc('streetname_fr')
for row in addresses.values:
    # Assign each address to four bins: one grid aligned to the bin size and
    # one shifted by half a bin in each direction, so that nearby addresses
    # always share at least one bin
    x = (row[x_id] // binsize) * binsize
    y = (row[y_id] // binsize) * binsize
    x_shifted = ((row[x_id] + binsize / 2) // binsize) * binsize - binsize / 2
    y_shifted = ((row[y_id] + binsize / 2) // binsize) * binsize - binsize / 2
    bins = [
        (x, y),
        (x_shifted, y),
        (x, y_shifted),
        (x_shifted, y_shifted),
    ]
    for pos in bins:
        if pos not in coll:
            coll[pos] = set()
        # Add the Dutch and French names, skipping null (NaN) values
        if isinstance(row[nl_id], str):
            coll[pos].add(row[nl_id].lower())
        if isinstance(row[fr_id], str):
            coll[pos].add(row[fr_id].lower())
blocks = [list(el) for el in coll.values() if el]
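To see the overlap mechanism concretely, here is a small sketch (with made-up coordinates) that recomputes the bin keys for two points roughly 450 m apart and shows that they share a bin:
# Sketch with hypothetical coordinates: two nearby points share a bin
def bin_keys(px, py, binsize=1000):
    x = (px // binsize) * binsize
    y = (py // binsize) * binsize
    xs = ((px + binsize / 2) // binsize) * binsize - binsize / 2
    ys = ((py + binsize / 2) // binsize) * binsize - binsize / 2
    return {(x, y), (xs, y), (x, ys), (xs, ys)}

print(bin_keys(900, 100) & bin_keys(1100, 500))  # non-empty intersection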
geo_grouped = Word2Vec(min_count=10,
                       window=10,
                       size=300,
                       sample=6e-5,
                       alpha=0.03,
                       min_alpha=0.0007,
                       negative=20,
                       workers=cores-1)
# Build vocabulary
geo_grouped.build_vocab(blocks, progress_per=10000)
# Train the word vectors
geo_grouped.train(blocks, total_examples=geo_grouped.corpus_count, epochs=30, report_delay=1)
# Normalize the vectors in place, as before
geo_grouped.init_sims(replace=True)
geo_grouped.wv.most_similar(positive=["dorpsstraat"])
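Since both models share many street names, a quick way to compare them is to check how much their nearest-neighbour lists agree for the same query; a minimal sketch:
# Sketch: overlap between the two models' ten nearest neighbours of "dorpsstraat"
pc_neighbours = {w for w, _ in postcode_grouped.wv.most_similar(positive=["dorpsstraat"], topn=10)}
geo_neighbours = {w for w, _ in geo_grouped.wv.most_similar(positive=["dorpsstraat"], topn=10)}
print(pc_neighbours & geo_neighbours)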
Finally, we use UMAP to reduce the 300-dimensional word vectors to two dimensions so we can plot them.
reducer = UMAP()
# Extract the vectors from the model
vectors = []
for word in postcode_grouped.wv.vocab:
    vectors.append(postcode_grouped.wv[word])
vectors = np.array(vectors)
# Create the low dimensional embedding
embedding = reducer.fit_transform(vectors)
plt.figure(figsize=(14, 9))
sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1])
# Extract the vectors from the model
vectors = []
for word in geo_grouped.wv.vocab:
    vectors.append(geo_grouped.wv[word])
vectors = np.array(vectors)
# Create the low dimensional embedding
embedding = reducer.fit_transform(vectors)
plt.figure(figsize=(14, 9))
sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1])
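The raw scatterplot is hard to read without labels. As a sketch, we can annotate the most frequent street names, using the occurrence counts gensim stores per vocabulary entry and assuming the rows of `embedding` still follow the vocabulary's iteration order (they do, since the vectors above were collected in that order):
# Sketch: annotate the ten most frequent street names in the geo embedding
words = list(geo_grouped.wv.vocab)
counts = np.array([geo_grouped.wv.vocab[w].count for w in words])
plt.figure(figsize=(14, 9))
sns.scatterplot(x=embedding[:, 0], y=embedding[:, 1])
for i in counts.argsort()[-10:]:
    plt.annotate(words[i], (embedding[i, 0], embedding[i, 1]))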