Author: Leonardo Espin.
Date: 11/09/2018
A common sight in New York City is people walking their dogs. Sometimes the dogs can be seen wearing boots, vests or even what it looks like expensive coats. This gave me the idea that perhaps dog ownership could be used as an indicator of disposable income, by tying geographical income information with New York City dog licensing data.
Why having such an indicator would be useful?. Because it would be a very flexible and dynamic proxy measure of income. For instance the most accurate geographical income information is obtained through the U.S. Census, which is a decennial census (the next one will take place in 2020). On the other hand, in NYC dog owners are required by law to license their dogs, which means that dog licensing data is updated continuously.
To investigate this hypothesis I utilized the following databases:
Since geographical information is best visualized on a map, I use the folium
python package. The geometry data for NYC zip codes I use is a geojson file that can be downloaded here. The file containing the latitude and longitude for the centroids of each zip code can be obtained here.
The map below shows the median income by zip code with color. Darker color represents higher income. The colorbar indicates income in thousands of dollars.
import folium
NYC_coord=[40.738, -73.98]
zip_geo = r'../DIchallenge/nyc-zip-code-tabulation-areas-polygons.geojson'
incomeMap = folium.Map(location=NYC_coord)
#bins are important for visualization
bins = [0, 20, 40, 60, 95, 233]
#the income map layer
incomeMap.choropleth(geo_data = zip_geo,
data = income, #income database
columns = ['postalCode', 'MedianIncome'],
key_on = 'feature.properties.postalCode',
name='Income', #name of the layer
fill_color = 'YlOrRd',
fill_opacity = 0.8,
line_opacity = 0.2,
legend_name = 'Income by zip code (thousands of dollars)',
threshold_scale = bins,
reset=True)
#adjust zoom of the map
incomeMap.fit_bounds(incomeMap.get_bounds())
#show the map
incomeMap
Note that to create this map zip code information has to be extracted from the income database, and these zip codes have to be matched to the ones availabe in the geometry geojson file (hidden code below, to reveal click the button at the left):
#reading and processing income database
income=pd.read_csv('../DIchallenge/ACS_12_5YR_B19013_with_ann.csv',na_values='-')
income['GEO.display-label']=income['GEO.display-label'].apply(lambda x: x.replace('ZCTA5 ',''))
income.rename(columns={'GEO.display-label':'postalCode', 'HD01_VD01': 'MedianIncome'},inplace=True)
income['MedianIncome']=income['MedianIncome'].astype('float64')
income.dropna(subset=['MedianIncome'],inplace=True)
#extracting zip code information from the geometry file
import json
with open(zip_geo, 'r') as f:
get_id = json.load(f)
#create list of available zip codes
geoZipCodes = [x['properties']['postalCode'] for x in get_id['features']]
#select only zip codes that are in NYC
subset=income[income['postalCode'].apply(lambda x: x in geoZipCodes)]
#append especial zip codes of the geometry file that missing in the income file, and set
#income as zero (for example for zipcode '11430' of the JFK airport)
extraZip=pd.DataFrame({'postalCode':geoZipCodes})
extraZip['MedianIncome']=extraZip['postalCode'].apply(lambda x:0)
#merge both dataframes
totalZip=pd.merge(subset, extraZip, how='outer', on=['postalCode'])
On the map below, I have plotted the number of registered Labradoodles in each zip code area. The radius of each circle is proportional to the number of registered dogs (clicking on any of the circles shows the number of registered dogs).
Note how labradoodles concentrate in regions of darker color, while there are only few or none in regions of lower income areas. So it seems that labraddodle owners tend to live in areas of higher income, mostly in Manhattan. An interesting observation is that a couple of these areas of large labradoodle concentration are in Brooklyn, near Williamsburg, area which is know for its recent gentrification trends. A similar observation about gentrification was made on this New York Times article.
#for definitions of the plotting functions see the section on data analysis below
plotGroup('Labradoodle','Labradoodle',incomeMap)
#the map
incomeMap
The map below shows the number of registered American Pit Bull Terriers. These dogs concentrate in Harlem, The Bronx and regions of lower income in Brooklyn, indicating that the owners of these dogs tend to live in lower income areas.
#for definitions of the plotting functions see the section on data analysis below
plotGroup('American Pit Bull Terrier/Pit Bull','Pit Bull',incomeMap,color='#0A8A9F')
#the map
incomeMap
The map below shows the number of registered Rottweilers. These dogs seem to be distributed all over the city, not showing a particular correlation with income:
#for definitions of the plotting functions see the section on data analysis below
plotGroup('Rottweiler','Rottweiler',incomeMap,color='#459f09')
#the map
incomeMap
How did I know which dog breeds to choose for these plots?. Because of the analysis thad I did below:
The dog licensing database has a variety of geographical fields, but for my purposes I only care about the zip code where the owner lives (hidden code below).
dog = pd.read_csv('../DIchallenge/dogsPlusIncome.csv',dtype={'ZipCode':str})
print('Number of rows: {}, number of columns: {}'.format(dog.shape[0],dog.shape[1]))
dog.head()
Note that the table above has the fields MedianIncome
, LAT
and LNG
which I added from the income database and the latitude and longitude database respectively. To clean and process the dog registration database I used the following code (hidden code below):
#drop rows with a zip code not in NYC
dog=dog[dog['ZipCode'].isin(totalZip['postalCode'])]
#cleaning BreedName column to unify spellings e.g. 'Toy Poodle' and 'Poodle, Toy'
dog['BreedName']=(dog['BreedName']
.str.strip() #remove spaces at beggining and end first
.str.split(',')
.apply(lambda x: x[0] if len(x)==1 else x[1]+' '+x[0]))
dog['BreedName']=dog['BreedName'].str.strip().values
#add median income per zip code and zip code centroid latitude and longitude to the
#dog database
totalZip.rename(columns={'postalCode':'ZipCode'},inplace=True)
#merge income and dog databases
dogFull=pd.merge(dog,totalZip,how='inner', on=['ZipCode'])
#read coordinates file
coordinates = pd.read_csv('../DIchallenge/US Zip Codes from 2013 Government Data.csv'
,dtype={'ZIP':str,'LAT':np.float64,'LNG':np.float64})
#merge coordinate and dog databases
coordinates.rename(columns={'ZIP':'ZipCode'},inplace=True)
dogFull=pd.merge(dogFull,coordinates,how='inner', on=['ZipCode'])
I needed a way to measure how each dog breed correlates income with the geographical distribution of the registered dogs. The solution I came up with is to use a simple linear regression between the number of registered dogs per zip code and the median income per zip code.
An important observation is that I am not assuming that a linear regression is going to be a good model, however I assumed (correctly) that the more correlation there is between registrations for a particular breed and the income, the higher the coefficient of determination should be. That is indeed the case.
model=LinearRegression()
def getDogR2(breed,model,df=dogFull,mergecol='ZipCode',incomecol='MedianIncome'):
test=df.groupby('BreedName').get_group(breed)
#describe will run on all the numerical variables, so I just
#keep the information generated from 'MedianIncome'
#the mean column contains the mean income, and std is zero because the
#value is common for all dogs in the same zipcode
tmp=test.groupby(mergecol)[incomecol].describe()
tmp['zip']=tmp.index
tmp=tmp[['zip','count','mean']]
reg_x=tmp[['count','zip']].values
reg_y=tmp['mean'].values
model.fit(reg_x,reg_y)
return model.score(reg_x,reg_y)
Next I get the scores for all breeds (hidden code below):
#note that by calculating the mean from the full dog database the average income per
#dog race is automatically weighted by the number of dogs in each Zipcode!
breeds=(dogFull.groupby('BreedName')[['BreedName','MedianIncome']]
.agg({'BreedName':['count'],'MedianIncome':['mean']}))
#flatten multi-index columns
breeds.columns = ['_'.join(col).strip() for col in breeds.columns.values]
#rename colums to something more palatable
breeds.rename(columns={'BreedName_count':'count','MedianIncome_mean':'income',
},inplace=True)
#add a breed column
breeds['breed']=breeds.index
#add R^2 coefficient
breeds['scores']=breeds['breed'].apply(lambda x:getDogR2(x,model))
pd.set_option('display.max_rows', 20)
#show only scores for groups that have more than a hundred dogs in total
breeds[breeds['count']>=100].sort_values(by=['scores'], ascending=False)
As seen in the table above, the American Pit Bull Terrier has the highest score, and its weighted average income is relatively low, which is why I choose this breed for the second plot. For the third breed plot I choose Rottweiler because as seen above has the lowest score.
I choose the Labradoodle breed (not seen above) for the first breed plot because it has a good score, its weighted average income is high but also there it has a reasonably high count of total registered dogs:
#show only scores for groups that have more than a hundred dogs in total
breeds[breeds['count']>=100].sort_values(by=['income'], ascending=False).head(10)
I have shown that in fact it seems that dog ownership in NYC can be tied to income. In particular owning some breeds of dogs is correlated with living in a high income area, thus presumably having high income. As described above, dog licensing data in NYC is updated on a regular basis, which means that this information can be used as a proxy measure of how the median income in different areas of the city changes.
Below I show the code I used for plotting the dog registrations(code hidden but to reveal click the button at the left)
def getDogGroup(breed,df=dog,mergecol='ZipCode',col1='MedianIncome',col2='LAT',col3='LNG'):
group=df.groupby('BreedName').get_group(breed)
#get dog counts, income and mean latitude and longitude from the secound
#grouping (zipcode). the mean column contains the mean income by zip, and std is
#zero because the value is common for all dogs in the same zipcode
tmp=(group.groupby(mergecol)[[col1,col2,col3]]
.agg({col1:['count','mean'],col2:'mean',col3:'mean'}))
#flatten multi-index columns
tmp.columns = ['_'.join(col).strip() for col in tmp.columns.values]
#rename colums to something more palatable
tmp.rename(columns={'MedianIncome_count':'count','MedianIncome_mean':'income',
'LAT_count':'count','LAT_mean':'lat',
'LNG_mean':'lng'},inplace=True)
#add a zipcode column
tmp['zip']=tmp.index
#tmp=tmp[['zip','count','mean']]
return tmp
def plotDog(dog,factor,paint,dmap):
folium.CircleMarker(location=[dog['lat'], dog['lng']],
radius=factor*dog['count'],
color=paint,
fill=True,
fill_opacity=0.3,
weight=1,
popup = 'count: '+str(dog['count']), #add text popup information to each point
).add_to(dmap)
def plotGroup(breed,layerName,plotMap,scaleMax='default',color="#000000"):
groupCounts=getDogGroup(breed)
plot_group = folium.FeatureGroup(name=layerName)
#getting range of counts for scaling the radius of the markers
if scaleMax=='default':
b=groupCounts['count'].describe()['max']
else: #assuming an integer or float is passed
b=scaleMax
a=1 #groupCounts['count'].describe()['min']
#plot the dog counts
groupCounts.apply(plotDog,axis = 1,args=(10/(b-a),color,plot_group))
plot_group.add_to(plotMap)
#add a layer control icon
folium.LayerControl().add_to(plotMap)