Wednesday, March 9, 2016

What's Watson Developer Cloud

A few times in the last year or so an IBM sales-person has come around to the company I work at spruiking products branded with the IBM 'Watson' label. To be frank, I came out of these meetings without a clear idea of what, exactly, the Watson suite of products was, or how it could be used in our business. The problem was that there were too many products sitting under the Watson brand, and the sales rep kept changing between different things.

So, I've just spent a couple of days in Melbourne, getting a better handle on what is available, what it can be used for, what it costs, etc. And here is a quick summary:

What is Watson?

I still can't definitively answer this. It seems to just be a brand/label for a range of products/APIs, focussed an analytics of some sort (speech analysis, text analysis, image analysis, sentiment analysis, etc). There are a few things on offer from IBM, and I think it helps to separate those:

  1. Services: IBM has a range of analytical services (language processing, image recognition, and so on), all exposed through APIs, and you pay per API request to use these. So you can submit (say) an image to the Watson image classification service and it tells you the probability that the image contains a cat (say). More on this in a moment.
  2. Application hosting: you can build an application that uses the IBM services, and IBM can also host that application for you. I'm not going to say more about this, as it is of less interest to me right now.

I have to say that the communication and documentation around all this from IBM is not good, and to start with it's all a bit bewildering as you work out how tightly coupled the app hosting is to the services (do you want an app, a virtual server? a container? a service? do you want to 'bind' your hosted app to your service? Too too many things and no coherent presentation of what they are and how they work together.... for example, it's still not clear to me when you would bind a service to an app and when you wouldn't.... it seems that you can use a service from an un-bound application, so maybe it's just a performance thing?). But anyway, I digress... from now on I'm just going to talk about the services, not the apps.

What Services are available? What do they do?

Below, I list the services that are available under the Watson brand, as at time of writing. There is an API that exposes each of the services, and the model seems to be that you are changed per API transaction, for each service. So for instance, there is an image classification service that you can train with some data, and then submit images to that service to get classification results. Or you can submit free text to the tone analysis service and it will return you information about the 'tone' of the language (aggressive, friendly, etc). Each such submitted sample would cost you an API call, which costs between $0.2 and $0.05 (depending on the volume of calls, you get a discount for volume). Pricing varies by service. You can get up to date info on pricing if you dig down from here.

Here are the services on offer, each of which you can call via a simple API (more on this in a second). The description next to each is the IBM-provided one. I'll say a bit more about a few of them underneath.

Concept Expansion: Maps euphemisms or colloquial terms to more commonly understood phrases

Concept Insights: Explore the concepts behind your input, identifying associations beyond traditional text matching.

Dialog: Enable your application to use natural language to converse with users

Document Conversion: Converts a  HTML, PDF, or Microsoft Word™ document into a normalized HTML, plain text, or a set of JSON-formatted Answer units.

Language Translation: Translate text from one language to another for specific domains.

Natural Language Classifier: Natural Language Classifier performs natural language classification on question texts. A user would be able to train their data and the predict the appropriate class for a input question.

Personality Insights: The Watson Personality Insights derives insights from transactional and social media data to identify psychological traits

Relationship Extraction: Intelligently finds relationships between sentences components (nouns, verbs, subjects, objects, etc.)

Retrieve and Rank: Add machine learning enhanced search capabilities to your application

Speech To Text: Low-latency, streaming transcription

Text to Speech: Synthesizes natural-sounding speech from text.

Tone Analyzer: It helps people detect, understand and revise the language tones of emotions, social propensities and writing styles from their writings.

Tradeoff Analytics: Helps make better choices under multiple conflicting goals. Combines smart visualization and recommendations for tradeoff exploration

Visual Recognition: Analyzes the visual content of images and videos to understand their content without requiring a textual description

Cognitive Commerce, Graph, and Insights: These are third-party provided. 

Some comments

Exposing analytics as a service could be great, but I don't think Watson quite gets you to where you want to be. Maybe it will soon, but not right now. For example, suppose you want to classify images into one of (say) 5 categories. Right now, you cannot do this with Watson except in a clumsy way -- the API appears to only support binary classification for images, so multi-class classification is not possible. Sure, you can build multiple binary classifiers but this is silly cludge to have to work through. There are some strange other restrictions too: you are limited to 15,000 training samples when you train the natural language classifier. Why? Training a deep network to do multi-class classification would certainly benefit from having more than 15,000 training samples, so why put this limit in there? Also, it's not clear to me what sort of training is going on under the hood. I know that putting things behind an API is meant to mask all that, but training neural network's can be quite fiddly and task specific... and I am really quite doubtful that the default network architecture, learning rate, batch size, etc are anywhere near optimal. It would be nice if IBM did hyper-parameter tuning for you behind the scenes too, but I'm pretty damn sure this is not happening -- you just get a trained classifier that may or may not have good settings for your particular problem.

In general, the Watson API offerings seem to occupy a space that will appeal to smaller organizations without a lot of analytics capacility, and may wow some managers who want 'insights' but don't care if those insights are accurate or not, or have a clear idea about what, exactly, they want to do with those insights. The prime example of this is the personality analysis API -- you can feed in free text (from politicians speeches, or a twitter handle, or wherever) and get back a 'personality model' of the person who wrote that text. The personality model is a bit like a horoscope though: almost anything it returns sounds half-plausible. So I can see managers saying 'great, this tells us which customers are open, or angry, or supportive, and this will let us target marketing to them!', but are the scored personality traits returned by the Watson API actually useful in targeted marketing? Who knows, but I'm certainly sceptical, and I'm scared that the richness of the outputs will be mis-taken by managers as something that is both accurate and useful, when it's far from clear that it is either.

Thursday, December 17, 2015

Simplest possible Theano examples

This is another write-up of my steps learning Theano. My intention is to end up using an easier-to-use library such as keras, which sits on top of Theano, but before I do that I want to make sure I have a reasonable handle on Theano itself.

I started out with the logistic regression example available here, and then went on to the denoising auto-encoder example available here. However, being completely new to Theano, I thought the examples were not as simple as they could be, so I stripped the examples down to the bare bones (no classes, no functions, no bothering about proper weight initialization). I wanted to focus on understanding Theano itself, and the code in the existing example is a bit of a distraction from that. Others may find themselves in a similar position (wanted the simplest possible examples), and that's why I've written them up and put them here.

Example 1: Logistic Regression

There are a couple of logistic regression examples for Theano, but of those, I think this one is the simplest and best one to start on. It is a great place to start and easy to do in Theano while getting used to the main concepts. I've reproduced the code below, with some additional comments added by me.

import numpy
import theano
import theano.tensor as T
rng = numpy.random

# Create some dummy data. Here we create a dataset of size 400
# each example has 784 independent variables/columns, and a randomly
# assigned 'label' between 0 and 1 (inclusive), so we are doing binary
# logistic regression

N = 400   #400 samples
feats = 784 #each with 784 features
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")  # input matrix (of size (N, feats))
y = T.vector("y")  # output vector of length N
w = theano.shared(rng.randn(feats), name="w") # weights
b = theano.shared(0., name="b") # a single bias/constant term
print("Initial model:")

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(, w) - b))   # Probability that target = 1
prediction = p_1 > 0.5                  # The prediction thresholded
xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1) # Cross-entropy loss function
cost = xent.mean() + 0.01 * (w ** 2).sum() # The cost to minimize. Note this includes a L2 penalty on the weights

gw, gb = T.grad(cost, [w, b])  # Compute the gradient of cost wrt w/b 

# Compile. The 0.1 in here is the learning rate (basically how quickly we try to go downhill when doing optimisation
train = theano.function(
          outputs=[prediction, xent],
          updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb)))
predict = theano.function(inputs=[x], outputs=prediction)

# Train
for i in range(training_steps):
    pred, err = train(D[0], D[1])

print("Final model:")
print("target values for D:")
print("prediction on D:")

Example 2: Auto-encoder

I didn't like the structure of the auto-encoder tutorial available here, so here is the stripped down example I worked through to solidify my Theano understanding. The nice thing about the following code is that the whole thing works if you run each code segment in the order presented here.

Step 1: get MNIST data

We will train our auto-encoder on the MNIST data, so we are essentially going to create a single later network which creates some key features of the MNIST digits, and then used those features to reconstruct (as best as possible) the MNIST digits.

For the example below, I will assume that you have downloaded the MNIST data (mnist.pkl.gz) and saved it somewhere.

Step 2: import required libraries and write data import function

Lets first make all the imports that we are going to need, and then there is a function which will load in the MNIST data for us. This function does this for us.

print '... loading data'  
# Load the dataset  
f ="/path/to/where/you/saved/mnist/mnist.pkl.gz", 'rb')  
train_set, valid_set, test_set = cPickle.load(f)  
data_x, data_y = train_set  
train_set_x = theano.shared(numpy.asarray(data_x,  

Nothing exciting there -- we just have our training data set, which is a matrix with 50000 rows (i.e. images), and 784 columns, which are the values in a 28x28 image, flattened out into a single length 784 vector. The values are between 0 (black) and 1 (white).

Step 3: Define our auto-encoder 

Now we need to specify our auto-encoder structure. We have 784 inputs (28x28 images), and so must also have 784 outputs (because we are trying to reconstruct the source image). I've gone for 500 hidden units, for no particularly good reason.

Because we have 784 inputs/outputs and 500 hidden nodes, we need a weights matrix of size (784,500) between our inputs and hidden layer, and one of size (500,784) between our hidden and output layers. Below, these are called W1 and W2. We also have two bias vectors b1 and b2. b1 is a vector of biases for the hidden layer (so length 500), and b2 is a vector of biases for the output layer (so length 784).

You can get into a long debate about the best way to initialize these weights, but I'll just be lazy and initialize them to something random and reasonably close to 0. I'll return to a discussion of the 'correct' way to initialize weights later.

#attempt at a simple auto-encoder in Theano  
#We do this by mapping a 28x28 image (i.e. 784 inputs)   
#into a hidden layer, and then back out again to a visible (output)   
#These are our input images. Each row is of length 784  
#The number of rows will depend on the batch size of course  
x = T.matrix('x')  
#These are the weights between the input and the hidden layer. There must be   
#784*n_hidden of them  
W1 = theano.shared(value=(rng.rand(n_visible, n_hidden)-0.5)*0.1, name="W1")  
#This is the bias into the first hidden layer, so we need n_hidden of those  
b1 = theano.shared(value=rng.rand(n_hidden)*0.02-0.01, name="b1")  
#These are the weights between the hidden layer and the output layer  
W2 = theano.shared(value=(rng.rand(n_hidden, n_visible)-0.5)*0.1, name="W2")  
#These are the biases for the output layer  
b2 = theano.shared(value=rng.rand(n_visible)*0.02-0.01, name="b2")  

Now, in principle, W1 and W2 are completely separate matrices....but it's a common trick to tie the two together, so W1 = transpose(W2). The reasoning for this is intuitively appealing but not mathematically precise: W1 encodes the image, so it is nice to think of the transpose operation as 'unpacking'/decoding the image. This would work exactly if the hidden layer had a linear activation function and W1 was orthogonal, but this is not the case. Still, we cut the number of parameters we have to estimate in half if we tie the weights, so maybe it is not a bad idea to do that.

I pick a non-linear activation function for the hidden units: the rectified linear transformation. I pick a sigmoid function for the output node, so that the response is squashed back to between 0 and 1 (the same range as the input image).

#This is the output from the first layer 
L1output = T.nnet.relu(, W1) + b1)  
#This is our output from the second layer, if we use tied weights  
L2output = T.nnet.sigmoid(, W1.T) + b2)  

Now that we've specified the network, we need to say something about how to train the network. Just being lazy, I choose to minimize squared error, which is often an OK choice. Once I've made that choice, I just need to get the gradients of the network parameters (W1, b1, b2) w.r.t the loss, and that will be enough for me to train my network.

#OK, we've specified our weights/bias structure, now lets specify   
#how to do updates.   
#First, lets work out the loss at the output nodes, for input x  
#Here I just use squared loss  
loss = T.mean((L2output-x)**2)  
#compute gradients w.r.t input parameters  
W1grad, b1grad, b2grad = T.grad(loss, [W1, b1, b2])  

Now, we've specified the loss we want to minimise, and the gradients required to minimise that loss, but we need to put that together in a training function, which we specify like so:

#now specify our update/training step  
trainf = theano.function(  
    outputs=[L2output, loss],  
predict = theano.function(inputs=[x], outputs=L2output)  

This just says that we update our weights and bias in the direction of the negative gradient. The 0.1 in the trainf function is just the learning rate -- higher values will mean faster learning (but maybe we get stuck in a local minima), lower values mean slower training but probably a better final model.

Now, that's pretty much it in terms of specifying the model and how to train it. Now we just need to do the actual training! This is pretty simple -- we break our data up into batched and do stochastic gradient decent on the batches:

#OK, that's it I believe! Now compile it and do the actual training  
#Before we do that, we just need to get Theano to 'evaluate' the data  
trainX = train_set_x.get_value()  
batch_size = 20  
for i in range(10):  
    batches = trainX.shape[0]/batch_size  
    errssofar = []  
    for batch in range(batches):      
        pred, err = trainf(trainX[batch*batch_size:(batch+1)*batch_size])  
        avgerr = numpy.mean(errssofar)  
        print("Iteration "+str(i)+" error on batch "+str(batch)+" is "+str(err)+". Average error across all batches this iteration is "+str(avgerr))  

Now that should get you through an initial 10 epochs of training. So now we'd like to see how well the auto-encoding has worked. Lets define some functions to let us look at the input images, the recovered output images, and the weights in the hidden neurons:

import Image  
#have a look at a particular image (input)  
def showInput(index):  
    Image.fromarray(numpy.asarray([int(item*255.99) for item in trainX[index]], dtype=numpy.uint8).reshape(28,28)).show()  
#have a look at a reconstructed output   
def showOutput(index):  
    Image.fromarray(numpy.asarray([int(item*255.99) for item in predict(trainX[index:index+1])[0]], dtype=numpy.uint8).reshape(28,28)).show()  
#have a look at some weights in the hidden layer  
def showHiddenUnit(index):  
    weights = [item for item in W1.get_value().T[index]]  
    minw, maxw = min(weights), max(weights)  
    weights = [int(255.99*(item-minw)/(maxw-minw)) for item in weights]  
    Image.fromarray(numpy.asarray(weights, dtype=numpy.uint8).reshape(28,28)).show()  

So lets now use these functions to look at one of our inputs, look at its reconstructed output, and then look at the weights:


Probably after only 10 iterations the reconstruction is not that good, so you might want to train a bit more before looking at the reconstruction again:

for i in range(10):  
    batches = trainX.shape[0]/batch_size  
    errssofar = []  
    for batch in range(batches):      
        pred, err = trainf(trainX[batch*batch_size:(batch+1)*batch_size])  
        avgerr = numpy.mean(errssofar)  
        print("Iteration "+str(i)+" error on batch "+str(batch)+" is "+str(err)+". Average error across all batches this iteration is "+str(avgerr)) 

Repeat this as many times as you like and see how the encoding gets a bit better with more training.

Initialisation of weights

Above, I just glossed over the initialisation of weights, but apparently this is quite an important thing, and there has been a lot of research into the effect of weights initialisation, and the optimal setting of initial weights. The current best practice seems to be Xavier initialization (see here for an informal explanation. Use Google or look at the original paper if you want the details)

Choice of loss function

In the example above I just chose to minimise squared loss but this is not the only choice. I don't want to go into further discussion of other options here, but I should be clear that squared loss is not the only (or even the best) choice.

Regularisation, Test/Training set separation, yadda yadda

In the interests of keeping the example simple, I left out a lot of stuff. It doesnt really make sense, for example, just to train on a training data set. In reality we want to use a test set to tell us when we are overfitting our data, and maybe even a validation set to see how well we will do on 'unseen' data. I've deliberately left out all this stuff to keep things simple in the example. Similarly, it sometimes makes sense to include some regularisation in the loss function -- maybe penalising larger weights. This helps prevent overfitting.

Adding noise to input

Another idea to improve the auto-encoder and prevent it overfitting is to add noise to the input, but try to get the network to learn/predict the input without the noise. The justification here is that we want the network to learn the 'essential' features of the input, and by adding noise to the input (but not the target/output) we are encouraging it to do this, rather than just memorize the input.

Sunday, December 13, 2015

Getting started with deep learning in python...installation of keras with tensorflow backend

I've put these instructions down for installing keras to work with tensorflow in case they are helpful for others.

I'm going to assume that you are already on top of how to install python packages, or else have the excellent Anaconda (from continuum analytics) installed, which makes it easy via it's conda package manager (You can get it at

If you have Anaconda installed, you should use its built-in package manager to install additional packages needed by keras:

conda install pyyaml
conda install h5py

You probably also want to make sure you are up to date with scipy, numpy, six:

conda upgrade six
conda upgrade scipy
conda upgrade numpy

Now, keras needs a back-end of either Theano or Tensorflow. I started using Tensorflow (for no particularly good reason, that's just what I picked). I just followed the Tensorflow setup instructions for Mac, which essentially consisted of typing the following at the command prompt:

pip install --upgrade

Now, finally, you want to install keras, which is the python library that sits atop Tensorflow/Theano and makes it easier to specify and train a neural network. You can clone the git repository which lives here. I just downloaded the zip file (from the same location), and unzipped it.

The keras website says to cd to the keras directory and run sudo python install

However, this did not work for me, because keras attempts to install Theano, and I had problems with that part of the install. Theano is listed as a dependency I suspect for historical reasons, even if you intend to use keras with a tensorflow backend. So, I just edited and removed theano as a dependency. By the time you read this, this issue will probably be fixed, and you wont need to edit

Lastly, before you run keras, you need to create a .keras directory in your home folder, and in that directory, create a file called keras.json, and put the following in it: {"epsilon": 1e-07, "floatx": "float32", "backend": "tensorflow"}

Now, you should be able to run keras. At least this seems to have worked for me.

Sunday, March 16, 2014

Public accountability gone in motorway funding

On the weekend the state government announced the "NorthConnex": another tolled motorway for Sydney. This one will cost $3 billion. That's small relative to the $11 billion WestConnex, but hardly small change. Some people think motorways are a bad idea because they induce more traffic, others think it is folly to invest billions of dollars in roadways when car travel is plateauing or falling in developed countries. Personally, I'm not opposed to motorways provided that:

  1. They have a benefit/cost ratio significantly above one
  2. The assumptions underlying the cost/benefit study are reasonable (there are many ways to game these things)

And this brings me to my main issue with these motorway proposals: there is no publicly available cost/benefit study for either of them. In a sane world, before we agreed to investing billions of dollars in infrastructure, the case for doing so would be publicly laid out. But it isn't, and the politicians are doing society two great disservices by their failure to do this: firstly, they may be wasting our money on ill-conceived projects; and secondly, they are entrenching a culture of opaqueness in the handling of public funds. This second point is especially important in NSW given our terrible record on corruption: if politicians can spend billions of dollars without the accountability of a public cost/benefit study, there will always be the potential for politicians to approve dud projects for a kickback, or a position on the board of a road construction or motorway operating company on retirement. Just so I am clear, I am not claiming that the WestConnex and NorthConnex motorways are going ahead due to corruption, I am simply pointing out that there are no publicly available cost/benefit studies, and we are being asked to trust the politicians and motorway companies without proof. This is no way to run major infrastructure investment.

My second peeve about these projects is that both of them are relying, to a significant degree, on tolls to fund them. This is a bad idea because funding a motorway this way lowers the benefit/cost ratio. This is easy to show mathematically, but is intuitively explained as follows:

  • The motorway operator has an incentive to maximize cost-recovery/revenue, and the toll that does this is not the same as the toll that maximizes public benefit.
  • Tolling individual links in a road network causes perverse outcomes: witness the Cross-City tunnel, which is tolled, and under-utilised, while the equivalent untolled surface route is still congested.

Friday, November 29, 2013

Marginal cost of congestion in Sydney is around $0.7/km

Post Synopsis: Every kilometre you drive in Sydney during peak hour, you are costing other motorists $0.66.... so if you drive 20km to work, other people are paying for your trip to the tune of about $13. If that sounds OK to you, remember that when you are stuck in traffic, you are paying (in time, extra fuel, etc) for someone else's decision to drive during peak hour. The best way out of this mess is congestion charging.

Congestion sucks

We all know that congestion is a drag in Sydney, and in most other large developed and developing cities (for fun, check out this congestion in Sao Paulo). A couple of studies have had a stab at estimating the total cost of congestion in Sydney. They found the following:
  • A CIE report estimated congestion costs of ~ $0.29 per vehicle kilometre in 2005. These are total costs (not avoidable/dead-weight)
  • BITRE estimate congestion costs of ~ $0.10 per vehicle kilometre in a 2007 working paper (a really good paper, btw). These are avoidable/dead-weight costs[*] -- total costs are around double that, so not so far from the CIE study.
I've reported these results (and others, below) in costs per vehicle-km, because thats the most common tangible measure, but if you want the total annual cost, you can multiply the above by the number of vehicle-km in Sydney (42.4 billion in 2005, say), and get total annual costs of $8-12 billion/year (depending on whether you use BITRE or CIE total costs).

Why average congestion costs are not so useful

Having just told you about a couple of prior studies that give average congestion costs, I have to break to it you that average costs aren't really that useful. Why? Two reasons:

  • Firstly, there is that old problem with averages: your average temperature can be fine if you have your head in the oven and your feet in the freezer....  If you are driving during a relatively quiet period, when congestion is low or non-existent, then you aren't imposing costs on others. If, on the other hand, you are driving during peak hour, then you are going to be imposing very high costs. 
  • Secondly, congestion is 'caused' by those last few thousand cars that are on the network. People refer to this effect as the 'school holiday effect', because you only need to remove a small amount of traffic for congestion to improve. This is telling us that the marginal cost of each additional vehicle is much higher than the average cost across all vehicles.
So what really matters, from a policy point of view, is the marginal cost of congestion. That is, what is the cost imposed by each additional driver? Since congestion costs are high during the peak, in this blog post I'm going to work out a rough estimate of the marginal congestion cost of each additional vehicle during peak hour in Sydney

So what are the marginal costs?

We are interested in the costs imposed on other motorists by each additional driver at peak time. We can work this out, to a reasonable approximation, as follows:

The average traffic speed during the inter-peak period is 42.9km/h [**]. The average traffic speed during the peak period is 30.0 km/h [***], so this means that traffic is 43% faster in the inter-peak period than during the (congested) peak period.

What causes this? Extra traffic on the roads, of course. Looking at the Figure below, we can see that during peak times, the volume of traffic on Sydney roads is ~ 270,000 vehicles during peak times, but only ~160,000 during inter-peak.

Source: BTS HTS 2011/12 summary report.

So, an extra 110,000 vehicles on the road network increases trip times by 43%. On a per-vehicle basis, this means each additional vehicle slows down all other traffic by 0.000325%. So if the vehicle is on the road for a minute, it slows down all other traffic by 0.000325% for that minute. Since there are 270,000 vehicles on the road at peak time, this means it costs those other motorists 0.88 minutes of delay. Valuing that at $14/hour (a pretty standard choice for peak travel), that's $0.205 of delay costs per minute of peak travel. If we add in other costs (extra running costs, pollution, etc), as in the BITRE report, this comes up to $0.33/minute. Average peak travel speed is 30 km/h, so converting this to a per-km costs, we get $0.66/km.

Note that this figure (of $0.66/km) is an average across Sydney for the peak period, and would be much higher in specific (congested) parts of the road network (and lower elsewhere). I haven't done the numbers, but my guess is that it would be a few times higher on congested links, which implies the marginal costs on congested roads might be in the vicinity of $2/km in the peak period.

[*] In the economics of congestion, there is an 'optimal' level of congestion: each additional traveller derives a benefit from making a trip, but there is a cost to other motorists due to congestion, and the optimal level of traffic is when the benefit to the additional traveller is equal to the costs imposed on existing travellers. Avoidable/dead-weight-loss congestion costs are the net costs taking into account the benefits of additional travel as well as the costs. Total costs are usually just the costs of delay relative to free-flow speeds.

[**] I've calculated average speed by using inter-peak travel times and distances from the NSW BTS's Strategic Travel Model. I've calculated the average speed across all trips in the ABS's journey-to-work data-set, using these travel times/distances.

[***] calculated as in [**], but I use am-peak travel times/distances.

Sunday, February 17, 2013

Computing polygon centroids for non-contiguous polygons in R

I don't like doing spatial analysis in a graphical GIS program like Arc or QGis. I find them fine for looking at data, but not for analysis. For that, I'd rather use a scripting language, and I am NOT going to use Arc's python scripting support (my experience with arcgisscripting is not good).

So, I'm quite excited about the growing support for spatial data in Python and R (Yes, I probably need to get out more). I would love to get to the point where I could do all my analysis in Python or R and dump out the results for viewing in a GIS program.

So, in order to learn about the new R packages for spatial data, I set myself the task of calculating the centroids of non-contiguous polygons. So, more concretely, I mean:

  • I have a layer of polygons, and each polygon has a name/identifier (a postcode, zone number, etc)
  • I would like to calculate the centroid of each name/identifier. Note that there may be more than one polygon with that identifier, and in that case I want the centroid of all the polygons with that name/identifier. (Or, if you prefer, the centre of mass of all the polygons with the same identifier).
Note that the 'centroid' I am calculating here for each name/identifier may not be inside a polygon with that name/identifier. 

And here is the R code to do it (NB: In this data-set, the 'id' field for each polygon is 'tz06'):

#import everything you need in 1 package

#read in your shapefile
p = readShapePoly("GMA_polygons_wgs84")

#calculate the centroids of each polygon
trueCentroids = gCentroid(p, byid=TRUE)

#I like to whack the centroids in as extra columns
p$lat = coordinates(trueCentroids)[,2]
p$lon = coordinates(trueCentroids)[,1]

#The 'average' lat and lon is what I want. It is the centre 
#of mass of all polygons with that id. 
#Of course, if there is only a single polygon with that id,
#the average lat/lon is just the regular centroid, so I 
#initialize my centroid to that first
p$avglat = p$lat
p$avglon = p$lon

#Now go through and calculate the 'average' centroid
#for all the tricky areas that are made up of more than
#1 polygon
for(i in 1:length(p$tz06)) 
    #find all the polygons with the same id
    matches = p$tz06 == p$tz06[i]
    #if there is more than one such polygon
    if(sum(matches) > 1)
        #we weight all of the centroids by area,
        #because we want the centre of mass
        areas = matches*p$Area_sqm
        weights = areas / sum(areas)

        p$avglat[i] = sum(p$lat*weights)
        p$avglon[i] = sum(p$lon*weights)

I'm sure this is not the nicest way of doing this, so if you know a cleaner way (hell, there may even be a built-in function for this already!), please let me know. Also, I'm hopeless with regards to functional programming (my brain thinks procedurally), so I've dont the above with a loop, but if you can point out to me a nice functional way of achieving the above with map/filter/fold/etc, I'd be keen to hear about that too.

Saturday, February 9, 2013

The disaster that is Google Drive

My wife is a smart lady. Not particularly interested in technology, but no luddite either. She recently signed up with Google Drive so that pictures and videos of our kids could be backed up online.

Big mistake.

Files mysteriously go missing from our home computer's LOCAL folders (i.e. the folder of images and videos she wants to sync). In fact many of the local folders were empty! Some have strange duplicates (for example, a "Pictures" and a "Pictures(1)" folder). This was quite a shock when she first discovered it -- "Google Drive is deleting the precious files I want to backup!!!". Looking into it, we discovered that other people have had the same problem.

We were able to find some of our 'missing' files in the "Trash" on our local PC. This caused us even more alarm, because we still lost a lot of really precious stuff from our local computer. So what this meant for us is that while we were able to go to Trash and recover some of the files, we lost many of them. My wife also assured me she wasn't doing anything strange like moving stuff around and deleting stuff while trying to back it up.

This seems really bizarre behaviour for a cloud storage solution. It should never delete stuff from your local computer for you. Or perhaps there are a few cases where it should (if you have manually deleted those items on another computer that syncs with your drive account), but if it does, it should be really careful about it. My wife didn't do anything to suggest to Google Drive that it should delete those items.

So, in panic we shut off Google Drive on our home computer and went to the web interface to see what was there. Alarmingly, all the folders we looked at were empty! Real panic stations now... but looking at our usage, we found that we were using ~20GB of Drive storage. Digging a little further, I discovered that I could click on the (hard to find) "All Items" link on the web interface and we could see everything that was uploaded (actually I think this means that it was moved to the online Google Drive Trash, but online Trash fortunately has to be emptied manually). I manually downloaded these (the interface to do this is terrible, as an added bonus, and you can only download in 2GB chunks). End result is that I think I've recovered our stuff. But the first thing I will do when I've confirmed that I've recovered everything is delete our Google Drive account and sign up with Box or DropBox or someone (anyone!) else.

Note that I am a big user of Google products, and I'm generally pretty disposed to liking Google: my work uses gmail for corporate email, I use it for my personal email, share videos via youtube, and so on. Heck this blog is run by Google. But Drive has been a massive headache for us.

Thinking of using Google Drive? Just dont do it.