The Doggy "Economic Index"

Author: Leonardo Espin.

Date: 11/09/2018

A common sight in New York City is people walking their dogs. Sometimes the dogs can be seen wearing boots, vests or even what it looks like expensive coats. This gave me the idea that perhaps dog ownership could be used as an indicator of disposable income, by tying geographical income information with New York City dog licensing data.

Why having such an indicator would be useful?. Because it would be a very flexible and dynamic proxy measure of income. For instance the most accurate geographical income information is obtained through the U.S. Census, which is a decennial census (the next one will take place in 2020). On the other hand, in NYC dog owners are required by law to license their dogs, which means that dog licensing data is updated continuously.

To investigate this hypothesis I utilized the following databases:

Dog licensing data can be downloaded here. This dataset contains dog breed and the zip code of the owners, as well as other geographical information.
Geographical income information can be downloaded from the U.S. Census Bureau website, specifically table B19013 "Median Household Income in the Past 12 Months”

Since geographical information is best visualized on a map, I use the folium python package. The geometry data for NYC zip codes I use is a geojson file that can be downloaded here. The file containing the latitude and longitude for the centroids of each zip code can be obtained here.

NYC income map¶

The map below shows the median income by zip code with color. Darker color represents higher income. The colorbar indicates income in thousands of dollars.

import folium
NYC_coord=[40.738, -73.98]
zip_geo = r'../DIchallenge/nyc-zip-code-tabulation-areas-polygons.geojson'
incomeMap = folium.Map(location=NYC_coord)

#bins are important for visualization
bins = [0, 20, 40, 60, 95, 233]

#the income map layer
incomeMap.choropleth(geo_data = zip_geo, 
              data = income, #income database
              columns = ['postalCode', 'MedianIncome'],
              key_on = 'feature.properties.postalCode',
              name='Income', #name of the layer    
              fill_color = 'YlOrRd', 
              fill_opacity = 0.8, 
              line_opacity = 0.2,
              legend_name = 'Income by zip code (thousands of dollars)',
              threshold_scale = bins,
              reset=True)

#adjust zoom of the map
incomeMap.fit_bounds(incomeMap.get_bounds())
#show the map
incomeMap

Note that to create this map zip code information has to be extracted from the income database, and these zip codes have to be matched to the ones availabe in the geometry geojson file (hidden code below, to reveal click the button at the left):

#reading and processing income database
income=pd.read_csv('../DIchallenge/ACS_12_5YR_B19013_with_ann.csv',na_values='-')
income['GEO.display-label']=income['GEO.display-label'].apply(lambda x: x.replace('ZCTA5 ',''))
income.rename(columns={'GEO.display-label':'postalCode', 'HD01_VD01': 'MedianIncome'},inplace=True)
income['MedianIncome']=income['MedianIncome'].astype('float64')
income.dropna(subset=['MedianIncome'],inplace=True)

#extracting zip code information from the geometry file
import json
with open(zip_geo, 'r') as f:
    get_id = json.load(f)

#create list of available zip codes    
geoZipCodes = [x['properties']['postalCode'] for x in get_id['features']]

#select only zip codes that are in NYC
subset=income[income['postalCode'].apply(lambda x: x in geoZipCodes)]

#append especial zip codes of the geometry file that missing in the income file, and set
#income as zero (for example for zipcode '11430' of the JFK airport)
extraZip=pd.DataFrame({'postalCode':geoZipCodes})
extraZip['MedianIncome']=extraZip['postalCode'].apply(lambda x:0)

#merge both dataframes
totalZip=pd.merge(subset, extraZip, how='outer', on=['postalCode'])

Plotting dog registrations on the NYC income map¶

On the map below, I have plotted the number of registered Labradoodles in each zip code area. The radius of each circle is proportional to the number of registered dogs (clicking on any of the circles shows the number of registered dogs).

Note how labradoodles concentrate in regions of darker color, while there are only few or none in regions of lower income areas. So it seems that labraddodle owners tend to live in areas of higher income, mostly in Manhattan. An interesting observation is that a couple of these areas of large labradoodle concentration are in Brooklyn, near Williamsburg, area which is know for its recent gentrification trends. A similar observation about gentrification was made on this New York Times article.

#for definitions of the plotting functions see the section on data analysis below
plotGroup('Labradoodle','Labradoodle',incomeMap)

#the map
incomeMap

The map below shows the number of registered American Pit Bull Terriers. These dogs concentrate in Harlem, The Bronx and regions of lower income in Brooklyn, indicating that the owners of these dogs tend to live in lower income areas.

#for definitions of the plotting functions see the section on data analysis below
plotGroup('American Pit Bull Terrier/Pit Bull','Pit Bull',incomeMap,color='#0A8A9F')

#the map
incomeMap

The map below shows the number of registered Rottweilers. These dogs seem to be distributed all over the city, not showing a particular correlation with income:

#for definitions of the plotting functions see the section on data analysis below
plotGroup('Rottweiler','Rottweiler',incomeMap,color='#459f09')

#the map
incomeMap

How did I know which dog breeds to choose for these plots?. Because of the analysis thad I did below:

Dog breed analysis¶

The dog licensing database has a variety of geographical fields, but for my purposes I only care about the zip code where the owner lives (hidden code below).

dog = pd.read_csv('../DIchallenge/dogsPlusIncome.csv',dtype={'ZipCode':str})
print('Number of rows: {}, number of columns: {}'.format(dog.shape[0],dog.shape[1]))
dog.head()

Number of rows: 121355, number of columns: 18

Note that the table above has the fields MedianIncome, LAT and LNG which I added from the income database and the latitude and longitude database respectively. To clean and process the dog registration database I used the following code (hidden code below):

#drop rows with a zip code not in NYC
dog=dog[dog['ZipCode'].isin(totalZip['postalCode'])]

#cleaning BreedName column to unify spellings e.g. 'Toy Poodle' and 'Poodle, Toy'
dog['BreedName']=(dog['BreedName']
                     .str.strip() #remove spaces at beggining and end first
                     .str.split(',')
                     .apply(lambda x: x[0] if len(x)==1 else x[1]+' '+x[0]))
dog['BreedName']=dog['BreedName'].str.strip().values

#add median income per zip code and zip code centroid latitude and longitude to the
#dog database

totalZip.rename(columns={'postalCode':'ZipCode'},inplace=True)

#merge income and dog databases
dogFull=pd.merge(dog,totalZip,how='inner', on=['ZipCode'])

#read coordinates file
coordinates = pd.read_csv('../DIchallenge/US Zip Codes from 2013 Government Data.csv'
                          ,dtype={'ZIP':str,'LAT':np.float64,'LNG':np.float64})
#merge coordinate and dog databases
coordinates.rename(columns={'ZIP':'ZipCode'},inplace=True)
dogFull=pd.merge(dogFull,coordinates,how='inner', on=['ZipCode'])

Defining a score for the different breeds¶

I needed a way to measure how each dog breed correlates income with the geographical distribution of the registered dogs. The solution I came up with is to use a simple linear regression between the number of registered dogs per zip code and the median income per zip code.

An important observation is that I am not assuming that a linear regression is going to be a good model, however I assumed (correctly) that the more correlation there is between registrations for a particular breed and the income, the higher the coefficient of determination should be. That is indeed the case.

defining a function that performs regressions and returns scores:

model=LinearRegression()
def getDogR2(breed,model,df=dogFull,mergecol='ZipCode',incomecol='MedianIncome'):
  test=df.groupby('BreedName').get_group(breed)
  #describe will run on all the numerical variables, so I just
  #keep the information generated from 'MedianIncome'
  #the mean column contains the mean income, and std is zero because the
  #value is common for all dogs in the same zipcode
  tmp=test.groupby(mergecol)[incomecol].describe()
  tmp['zip']=tmp.index
  tmp=tmp[['zip','count','mean']]
  reg_x=tmp[['count','zip']].values
  reg_y=tmp['mean'].values
  model.fit(reg_x,reg_y)
  return model.score(reg_x,reg_y)

Next I get the scores for all breeds (hidden code below):

#note that by calculating the mean from the full dog database the average income per
#dog race is automatically weighted by the number of dogs in each Zipcode!
breeds=(dogFull.groupby('BreedName')[['BreedName','MedianIncome']]
        .agg({'BreedName':['count'],'MedianIncome':['mean']}))

#flatten multi-index columns
breeds.columns = ['_'.join(col).strip() for col in breeds.columns.values]

#rename colums to something more palatable
breeds.rename(columns={'BreedName_count':'count','MedianIncome_mean':'income',
                        },inplace=True)
#add a breed column
breeds['breed']=breeds.index

#add R^2 coefficient
breeds['scores']=breeds['breed'].apply(lambda x:getDogR2(x,model))

pd.set_option('display.max_rows', 20)

#show only scores for groups that have more than a hundred dogs in total
breeds[breeds['count']>=100].sort_values(by=['scores'], ascending=False)

As seen in the table above, the American Pit Bull Terrier has the highest score, and its weighted average income is relatively low, which is why I choose this breed for the second plot. For the third breed plot I choose Rottweiler because as seen above has the lowest score.

I choose the Labradoodle breed (not seen above) for the first breed plot because it has a good score, its weighted average income is high but also there it has a reasonably high count of total registered dogs:

#show only scores for groups that have more than a hundred dogs in total
breeds[breeds['count']>=100].sort_values(by=['income'], ascending=False).head(10)

Conclusions¶

I have shown that in fact it seems that dog ownership in NYC can be tied to income. In particular owning some breeds of dogs is correlated with living in a high income area, thus presumably having high income. As described above, dog licensing data in NYC is updated on a regular basis, which means that this information can be used as a proxy measure of how the median income in different areas of the city changes.

Addendum¶

Below I show the code I used for plotting the dog registrations(code hidden but to reveal click the button at the left)

defining functions for selecting dog breed subgroups and all the information required for plotting

def getDogGroup(breed,df=dog,mergecol='ZipCode',col1='MedianIncome',col2='LAT',col3='LNG'):
    group=df.groupby('BreedName').get_group(breed)
    #get dog counts, income and mean latitude and longitude from the secound
    #grouping (zipcode). the mean column contains the mean income by zip, and std is 
    #zero because the value is common for all dogs in the same zipcode
    tmp=(group.groupby(mergecol)[[col1,col2,col3]]
         .agg({col1:['count','mean'],col2:'mean',col3:'mean'}))
    #flatten multi-index columns
    tmp.columns = ['_'.join(col).strip() for col in tmp.columns.values]
    #rename colums to something more palatable
    tmp.rename(columns={'MedianIncome_count':'count','MedianIncome_mean':'income',
                        'LAT_count':'count','LAT_mean':'lat',
                        'LNG_mean':'lng'},inplace=True)
    #add a zipcode column
    tmp['zip']=tmp.index
    #tmp=tmp[['zip','count','mean']]
    return tmp

Define plotting functions

def plotDog(dog,factor,paint,dmap):
    folium.CircleMarker(location=[dog['lat'], dog['lng']],
                        radius=factor*dog['count'],
                        color=paint,
                        fill=True,
                        fill_opacity=0.3,
                        weight=1,
                        popup = 'count: '+str(dog['count']), #add text popup information to each point
                        ).add_to(dmap)

def plotGroup(breed,layerName,plotMap,scaleMax='default',color="#000000"):
    groupCounts=getDogGroup(breed)
    plot_group = folium.FeatureGroup(name=layerName)
    #getting range of counts for scaling the radius of the markers
    if scaleMax=='default':
        b=groupCounts['count'].describe()['max']
    else: #assuming an integer or float is passed
        b=scaleMax 
    a=1 #groupCounts['count'].describe()['min']
    #plot the dog counts
    groupCounts.apply(plotDog,axis = 1,args=(10/(b-a),color,plot_group))
    plot_group.add_to(plotMap)
    #add a layer control icon
    folium.LayerControl().add_to(plotMap)

	RowNumber	AnimalName	AnimalGender	AnimalBirthMonth	BreedName	Borough	ZipCode	CommunityDistrict	CensusTract2010	NTA	CityCouncilDistrict	CongressionalDistrict	StateSenatorialDistrict	LicenseIssuedDate	LicenseExpiredDate	MedianIncome	LAT	LNG
0	1753	SHADOW	M	01/01/00 12:00 AM	Beagle	Brooklyn	11236	318.0	1014.0	BK50	46.0	8.0	19.0	12/29/14	01/30/16	60758.0	40.639413	-73.900664
1	91624	JERRY	M	05/01/16 12:00 AM	Maltese	Brooklyn	11236	318.0	982.0	BK50	42.0	8.0	19.0	07/30/16	07/30/17	60758.0	40.639413	-73.900664
2	42	DEDE	M	02/01/13 12:00 AM	Boston Terrier	Brooklyn	11236	318.0	964.0	BK50	46.0	8.0	19.0	09/14/14	09/14/19	60758.0	40.639413	-73.900664
3	71	TY	M	09/01/10 12:00 AM	Siberian Husky	Brooklyn	11236	318.0	964.0	BK50	46.0	8.0	19.0	09/16/14	09/16/17	60758.0	40.639413	-73.900664
4	1191	SCRUFFY	M	12/01/06 12:00 AM	Chihuahua	Brooklyn	11236	318.0	968.0	BK50	46.0	8.0	19.0	12/14/14	12/14/16	60758.0	40.639413	-73.900664

	count	income	breed	scores
15	3329	51409.878943	American Pit Bull Terrier/Pit Bull	0.275358
71	106	69753.424528	Cardigan Welsh Corgi	0.270280
102	125	82716.864000	English Cocker Spaniel	0.257708
164	190	68045.821053	Long Haired Dachshund	0.243313
273	561	75053.983957	West High White Terrier	0.225465
235	168	77535.625000	Scottish Terrier	0.220872
269	235	81733.463830	Vizsla	0.208334
128	386	83921.059585	Goldendoodle	0.204071
233	109	67720.412844	Schnauzer Crossbreed	0.203457
191	393	72545.496183	Papillon	0.198559
...	...	...	...	...
179	380	63200.194737	Morkie	0.061142
167	4291	63062.927290	Maltese	0.060161
120	1495	63448.650167	German Shepherd Crossbreed	0.059933
207	836	62896.627990	Poodle	0.058674
121	1881	62527.585859	German Shepherd Dog	0.046949
81	264	60177.299242	Chow Chow	0.035944
16	810	57810.093827	American Staffordshire Terrier	0.035388
8	180	51438.033333	American Bully	0.034808
70	236	58486.432203	Cane Corso	0.021627
222	676	58242.732249	Rottweiler	0.017759

	count	income	breed	scores
186	115	86287.678261	Norwich Terrier	0.139998
211	173	85651.635838	Portuguese Water Dog	0.134644
156	793	85431.998739	Labradoodle	0.148085
263	241	84168.165975	Tibetan Terrier	0.169705
38	173	84044.884393	Bernese Mountain Dog	0.162090
128	386	83921.059585	Goldendoodle	0.204071
89	189	83314.730159	Cotton De Tulear	0.160571
102	125	82716.864000	English Cocker Spaniel	0.257708
269	235	81733.463830	Vizsla	0.208334
137	2023	81507.566980	Havanese	0.144715