The Doggy "Economic Index"

Author: Leonardo Espin.

Date: 11/09/2018

A common sight in New York City is people walking their dogs. Sometimes the dogs can be seen wearing boots, vests or even what it looks like expensive coats. This gave me the idea that perhaps dog ownership could be used as an indicator of disposable income, by tying geographical income information with New York City dog licensing data.

Why having such an indicator would be useful?. Because it would be a very flexible and dynamic proxy measure of income. For instance the most accurate geographical income information is obtained through the U.S. Census, which is a decennial census (the next one will take place in 2020). On the other hand, in NYC dog owners are required by law to license their dogs, which means that dog licensing data is updated continuously.

To investigate this hypothesis I utilized the following databases:

  • Dog licensing data can be downloaded here. This dataset contains dog breed and the zip code of the owners, as well as other geographical information.
  • Geographical income information can be downloaded from the U.S. Census Bureau website, specifically table B19013 "Median Household Income in the Past 12 Months”

Since geographical information is best visualized on a map, I use the folium python package. The geometry data for NYC zip codes I use is a geojson file that can be downloaded here. The file containing the latitude and longitude for the centroids of each zip code can be obtained here.

NYC income map

The map below shows the median income by zip code with color. Darker color represents higher income. The colorbar indicates income in thousands of dollars.

In [96]:
import folium
NYC_coord=[40.738, -73.98]
zip_geo = r'../DIchallenge/nyc-zip-code-tabulation-areas-polygons.geojson'
incomeMap = folium.Map(location=NYC_coord)

#bins are important for visualization
bins = [0, 20, 40, 60, 95, 233]

#the income map layer
incomeMap.choropleth(geo_data = zip_geo, 
              data = income, #income database
              columns = ['postalCode', 'MedianIncome'],
              key_on = 'feature.properties.postalCode',
              name='Income', #name of the layer    
              fill_color = 'YlOrRd', 
              fill_opacity = 0.8, 
              line_opacity = 0.2,
              legend_name = 'Income by zip code (thousands of dollars)',
              threshold_scale = bins,
              reset=True)

#adjust zoom of the map
incomeMap.fit_bounds(incomeMap.get_bounds())
#show the map
incomeMap
Out[96]:

Note that to create this map zip code information has to be extracted from the income database, and these zip codes have to be matched to the ones availabe in the geometry geojson file (hidden code below, to reveal click the button at the left):

In [ ]:
#reading and processing income database
income=pd.read_csv('../DIchallenge/ACS_12_5YR_B19013_with_ann.csv',na_values='-')
income['GEO.display-label']=income['GEO.display-label'].apply(lambda x: x.replace('ZCTA5 ',''))
income.rename(columns={'GEO.display-label':'postalCode', 'HD01_VD01': 'MedianIncome'},inplace=True)
income['MedianIncome']=income['MedianIncome'].astype('float64')
income.dropna(subset=['MedianIncome'],inplace=True)

#extracting zip code information from the geometry file
import json
with open(zip_geo, 'r') as f:
    get_id = json.load(f)

#create list of available zip codes    
geoZipCodes = [x['properties']['postalCode'] for x in get_id['features']]

#select only zip codes that are in NYC
subset=income[income['postalCode'].apply(lambda x: x in geoZipCodes)]

#append especial zip codes of the geometry file that missing in the income file, and set
#income as zero (for example for zipcode '11430' of the JFK airport)
extraZip=pd.DataFrame({'postalCode':geoZipCodes})
extraZip['MedianIncome']=extraZip['postalCode'].apply(lambda x:0)

#merge both dataframes
totalZip=pd.merge(subset, extraZip, how='outer', on=['postalCode']) 

Plotting dog registrations on the NYC income map

On the map below, I have plotted the number of registered Labradoodles in each zip code area. The radius of each circle is proportional to the number of registered dogs (clicking on any of the circles shows the number of registered dogs).

Note how labradoodles concentrate in regions of darker color, while there are only few or none in regions of lower income areas. So it seems that labraddodle owners tend to live in areas of higher income, mostly in Manhattan. An interesting observation is that a couple of these areas of large labradoodle concentration are in Brooklyn, near Williamsburg, area which is know for its recent gentrification trends. A similar observation about gentrification was made on this New York Times article.

In [95]:
#for definitions of the plotting functions see the section on data analysis below
plotGroup('Labradoodle','Labradoodle',incomeMap)

#the map
incomeMap
Out[95]:

The map below shows the number of registered American Pit Bull Terriers. These dogs concentrate in Harlem, The Bronx and regions of lower income in Brooklyn, indicating that the owners of these dogs tend to live in lower income areas.

In [85]:
#for definitions of the plotting functions see the section on data analysis below
plotGroup('American Pit Bull Terrier/Pit Bull','Pit Bull',incomeMap,color='#0A8A9F')

#the map
incomeMap
Out[85]:

The map below shows the number of registered Rottweilers. These dogs seem to be distributed all over the city, not showing a particular correlation with income:

In [88]:
#for definitions of the plotting functions see the section on data analysis below
plotGroup('Rottweiler','Rottweiler',incomeMap,color='#459f09')

#the map
incomeMap
Out[88]:

How did I know which dog breeds to choose for these plots?. Because of the analysis thad I did below:

Dog breed analysis

The dog licensing database has a variety of geographical fields, but for my purposes I only care about the zip code where the owner lives (hidden code below).

In [55]:
dog = pd.read_csv('../DIchallenge/dogsPlusIncome.csv',dtype={'ZipCode':str})
print('Number of rows: {}, number of columns: {}'.format(dog.shape[0],dog.shape[1]))
dog.head()
Number of rows: 121355, number of columns: 18
Out[55]:
RowNumber AnimalName AnimalGender AnimalBirthMonth BreedName Borough ZipCode CommunityDistrict CensusTract2010 NTA CityCouncilDistrict CongressionalDistrict StateSenatorialDistrict LicenseIssuedDate LicenseExpiredDate MedianIncome LAT LNG
0 1753 SHADOW M 01/01/00 12:00 AM Beagle Brooklyn 11236 318.0 1014.0 BK50 46.0 8.0 19.0 12/29/14 01/30/16 60758.0 40.639413 -73.900664
1 91624 JERRY M 05/01/16 12:00 AM Maltese Brooklyn 11236 318.0 982.0 BK50 42.0 8.0 19.0 07/30/16 07/30/17 60758.0 40.639413 -73.900664
2 42 DEDE M 02/01/13 12:00 AM Boston Terrier Brooklyn 11236 318.0 964.0 BK50 46.0 8.0 19.0 09/14/14 09/14/19 60758.0 40.639413 -73.900664
3 71 TY M 09/01/10 12:00 AM Siberian Husky Brooklyn 11236 318.0 964.0 BK50 46.0 8.0 19.0 09/16/14 09/16/17 60758.0 40.639413 -73.900664
4 1191 SCRUFFY M 12/01/06 12:00 AM Chihuahua Brooklyn 11236 318.0 968.0 BK50 46.0 8.0 19.0 12/14/14 12/14/16 60758.0 40.639413 -73.900664

Note that the table above has the fields MedianIncome, LAT and LNG which I added from the income database and the latitude and longitude database respectively. To clean and process the dog registration database I used the following code (hidden code below):

In [ ]:
#drop rows with a zip code not in NYC
dog=dog[dog['ZipCode'].isin(totalZip['postalCode'])]

#cleaning BreedName column to unify spellings e.g. 'Toy Poodle' and 'Poodle, Toy'
dog['BreedName']=(dog['BreedName']
                     .str.strip() #remove spaces at beggining and end first
                     .str.split(',')
                     .apply(lambda x: x[0] if len(x)==1 else x[1]+' '+x[0]))
dog['BreedName']=dog['BreedName'].str.strip().values

#add median income per zip code and zip code centroid latitude and longitude to the
#dog database

totalZip.rename(columns={'postalCode':'ZipCode'},inplace=True)

#merge income and dog databases
dogFull=pd.merge(dog,totalZip,how='inner', on=['ZipCode'])

#read coordinates file
coordinates = pd.read_csv('../DIchallenge/US Zip Codes from 2013 Government Data.csv'
                          ,dtype={'ZIP':str,'LAT':np.float64,'LNG':np.float64})
#merge coordinate and dog databases
coordinates.rename(columns={'ZIP':'ZipCode'},inplace=True)
dogFull=pd.merge(dogFull,coordinates,how='inner', on=['ZipCode'])

Defining a score for the different breeds

I needed a way to measure how each dog breed correlates income with the geographical distribution of the registered dogs. The solution I came up with is to use a simple linear regression between the number of registered dogs per zip code and the median income per zip code.

An important observation is that I am not assuming that a linear regression is going to be a good model, however I assumed (correctly) that the more correlation there is between registrations for a particular breed and the income, the higher the coefficient of determination should be. That is indeed the case.

  • defining a function that performs regressions and returns scores:
    model=LinearRegression()
    def getDogR2(breed,model,df=dogFull,mergecol='ZipCode',incomecol='MedianIncome'):
      test=df.groupby('BreedName').get_group(breed)
      #describe will run on all the numerical variables, so I just
      #keep the information generated from 'MedianIncome'
      #the mean column contains the mean income, and std is zero because the
      #value is common for all dogs in the same zipcode
      tmp=test.groupby(mergecol)[incomecol].describe()
      tmp['zip']=tmp.index
      tmp=tmp[['zip','count','mean']]
      reg_x=tmp[['count','zip']].values
      reg_y=tmp['mean'].values
      model.fit(reg_x,reg_y)
      return model.score(reg_x,reg_y)
    

Next I get the scores for all breeds (hidden code below):

In [58]:
#note that by calculating the mean from the full dog database the average income per
#dog race is automatically weighted by the number of dogs in each Zipcode!
breeds=(dogFull.groupby('BreedName')[['BreedName','MedianIncome']]
        .agg({'BreedName':['count'],'MedianIncome':['mean']}))

#flatten multi-index columns
breeds.columns = ['_'.join(col).strip() for col in breeds.columns.values]

#rename colums to something more palatable
breeds.rename(columns={'BreedName_count':'count','MedianIncome_mean':'income',
                        },inplace=True)
#add a breed column
breeds['breed']=breeds.index

#add R^2 coefficient
breeds['scores']=breeds['breed'].apply(lambda x:getDogR2(x,model))

pd.set_option('display.max_rows', 20)

#show only scores for groups that have more than a hundred dogs in total
breeds[breeds['count']>=100].sort_values(by=['scores'], ascending=False)
Out[58]:
count income breed scores
15 3329 51409.878943 American Pit Bull Terrier/Pit Bull 0.275358
71 106 69753.424528 Cardigan Welsh Corgi 0.270280
102 125 82716.864000 English Cocker Spaniel 0.257708
164 190 68045.821053 Long Haired Dachshund 0.243313
273 561 75053.983957 West High White Terrier 0.225465
235 168 77535.625000 Scottish Terrier 0.220872
269 235 81733.463830 Vizsla 0.208334
128 386 83921.059585 Goldendoodle 0.204071
233 109 67720.412844 Schnauzer Crossbreed 0.203457
191 393 72545.496183 Papillon 0.198559
... ... ... ... ...
179 380 63200.194737 Morkie 0.061142
167 4291 63062.927290 Maltese 0.060161
120 1495 63448.650167 German Shepherd Crossbreed 0.059933
207 836 62896.627990 Poodle 0.058674
121 1881 62527.585859 German Shepherd Dog 0.046949
81 264 60177.299242 Chow Chow 0.035944
16 810 57810.093827 American Staffordshire Terrier 0.035388
8 180 51438.033333 American Bully 0.034808
70 236 58486.432203 Cane Corso 0.021627
222 676 58242.732249 Rottweiler 0.017759

111 rows × 4 columns

As seen in the table above, the American Pit Bull Terrier has the highest score, and its weighted average income is relatively low, which is why I choose this breed for the second plot. For the third breed plot I choose Rottweiler because as seen above has the lowest score.

I choose the Labradoodle breed (not seen above) for the first breed plot because it has a good score, its weighted average income is high but also there it has a reasonably high count of total registered dogs:

In [56]:
#show only scores for groups that have more than a hundred dogs in total
breeds[breeds['count']>=100].sort_values(by=['income'], ascending=False).head(10)
Out[56]:
count income breed scores
186 115 86287.678261 Norwich Terrier 0.139998
211 173 85651.635838 Portuguese Water Dog 0.134644
156 793 85431.998739 Labradoodle 0.148085
263 241 84168.165975 Tibetan Terrier 0.169705
38 173 84044.884393 Bernese Mountain Dog 0.162090
128 386 83921.059585 Goldendoodle 0.204071
89 189 83314.730159 Cotton De Tulear 0.160571
102 125 82716.864000 English Cocker Spaniel 0.257708
269 235 81733.463830 Vizsla 0.208334
137 2023 81507.566980 Havanese 0.144715

Conclusions

I have shown that in fact it seems that dog ownership in NYC can be tied to income. In particular owning some breeds of dogs is correlated with living in a high income area, thus presumably having high income. As described above, dog licensing data in NYC is updated on a regular basis, which means that this information can be used as a proxy measure of how the median income in different areas of the city changes.

Addendum

Below I show the code I used for plotting the dog registrations(code hidden but to reveal click the button at the left)

  • defining functions for selecting dog breed subgroups and all the information required for plotting
In [ ]:
def getDogGroup(breed,df=dog,mergecol='ZipCode',col1='MedianIncome',col2='LAT',col3='LNG'):
    group=df.groupby('BreedName').get_group(breed)
    #get dog counts, income and mean latitude and longitude from the secound
    #grouping (zipcode). the mean column contains the mean income by zip, and std is 
    #zero because the value is common for all dogs in the same zipcode
    tmp=(group.groupby(mergecol)[[col1,col2,col3]]
         .agg({col1:['count','mean'],col2:'mean',col3:'mean'}))
    #flatten multi-index columns
    tmp.columns = ['_'.join(col).strip() for col in tmp.columns.values]
    #rename colums to something more palatable
    tmp.rename(columns={'MedianIncome_count':'count','MedianIncome_mean':'income',
                        'LAT_count':'count','LAT_mean':'lat',
                        'LNG_mean':'lng'},inplace=True)
    #add a zipcode column
    tmp['zip']=tmp.index
    #tmp=tmp[['zip','count','mean']]
    return tmp
  • Define plotting functions
In [43]:
def plotDog(dog,factor,paint,dmap):
    folium.CircleMarker(location=[dog['lat'], dog['lng']],
                        radius=factor*dog['count'],
                        color=paint,
                        fill=True,
                        fill_opacity=0.3,
                        weight=1,
                        popup = 'count: '+str(dog['count']), #add text popup information to each point
                        ).add_to(dmap)

def plotGroup(breed,layerName,plotMap,scaleMax='default',color="#000000"):
    groupCounts=getDogGroup(breed)
    plot_group = folium.FeatureGroup(name=layerName)
    #getting range of counts for scaling the radius of the markers
    if scaleMax=='default':
        b=groupCounts['count'].describe()['max']
    else: #assuming an integer or float is passed
        b=scaleMax 
    a=1 #groupCounts['count'].describe()['min']
    #plot the dog counts
    groupCounts.apply(plotDog,axis = 1,args=(10/(b-a),color,plot_group))
    plot_group.add_to(plotMap)
    #add a layer control icon
    folium.LayerControl().add_to(plotMap)