Sunday, February 17, 2013

Computing polygon centroids for non-contiguous polygons in R


I don't like doing spatial analysis in a graphical GIS program like Arc or QGis. I find them fine for looking at data, but not for analysis. For that, I'd rather use a scripting language, and I am NOT going to use Arc's python scripting support (my experience with arcgisscripting is not good).

So, I'm quite excited about the growing support for spatial data in Python and R (Yes, I probably need to get out more). I would love to get to the point where I could do all my analysis in Python or R and dump out the results for viewing in a GIS program.

So, in order to learn about the new R packages for spatial data, I set myself the task of calculating the centroids of non-contiguous polygons. So, more concretely, I mean:


  • I have a layer of polygons, and each polygon has a name/identifier (a postcode, zone number, etc)
  • I would like to calculate the centroid of each name/identifier. Note that there may be more than one polygon with that identifier, and in that case I want the centroid of all the polygons with that name/identifier. (Or, if you prefer, the centre of mass of all the polygons with the same identifier).
Note that the 'centroid' I am calculating here for each name/identifier may not be inside a polygon with that name/identifier. 

And here is the R code to do it (NB: In this data-set, the 'id' field for each polygon is 'tz06'):


#import everything you need in 1 package
library(GISTools)

#read in your shapefile
p = readShapePoly("GMA_polygons_wgs84")

#calculate the centroids of each polygon
trueCentroids = gCentroid(p, byid=TRUE)


#I like to whack the centroids in as extra columns
p$lat = coordinates(trueCentroids)[,2]
p$lon = coordinates(trueCentroids)[,1]

#The 'average' lat and lon is what I want. It is the centre 
#of mass of all polygons with that id. 
#Of course, if there is only a single polygon with that id,
#the average lat/lon is just the regular centroid, so I 
#initialize my centroid to that first
p$avglat = p$lat
p$avglon = p$lon


#Now go through and calculate the 'average' centroid
#for all the tricky areas that are made up of more than
#1 polygon
for(i in 1:length(p$tz06)) 
{
    #find all the polygons with the same id
    matches = p$tz06 == p$tz06[i]
        
    #if there is more than one such polygon
    if(sum(matches) > 1)
    {
        #we weight all of the centroids by area,
        #because we want the centre of mass
        areas = matches*p$Area_sqm
        weights = areas / sum(areas)

        p$avglat[i] = sum(p$lat*weights)
        p$avglon[i] = sum(p$lon*weights)
    }
}


I'm sure this is not the nicest way of doing this, so if you know a cleaner way (hell, there may even be a built-in function for this already!), please let me know. Also, I'm hopeless with regards to functional programming (my brain thinks procedurally), so I've dont the above with a loop, but if you can point out to me a nice functional way of achieving the above with map/filter/fold/etc, I'd be keen to hear about that too.



1 comment:

Hugo said...

This is gorgeous!