rickwoodramblings: December 2015

This is another write-up of my steps learning Theano. My intention is to end up using an easier-to-use library such as keras, which sits on top of Theano, but before I do that I want to make sure I have a reasonable handle on Theano itself.

I started out with the logistic regression example available here, and then went on to the denoising auto-encoder example available here. However, being completely new to Theano, I thought the examples were not as simple as they could be, so I stripped the examples down to the bare bones (no classes, no functions, no bothering about proper weight initialization). I wanted to focus on understanding Theano itself, and the code in the existing example is a bit of a distraction from that. Others may find themselves in a similar position (wanted the simplest possible examples), and that's why I've written them up and put them here.

Example 1: Logistic Regression

There are a couple of logistic regression examples for Theano, but of those, I think this one is the simplest and best one to start on. It is a great place to start and easy to do in Theano while getting used to the main concepts. I've reproduced the code below, with some additional comments added by me.

import numpy
import theano
import theano.tensor as T
rng = numpy.random

# Create some dummy data. Here we create a dataset of size 400
# each example has 784 independent variables/columns, and a randomly
# assigned 'label' between 0 and 1 (inclusive), so we are doing binary
# logistic regression

N = 400   #400 samples
feats = 784 #each with 784 features
D = (rng.randn(N, feats), rng.randint(size=N, low=0, high=2))
training_steps = 10000

# Declare Theano symbolic variables
x = T.matrix("x")  # input matrix (of size (N, feats))
y = T.vector("y")  # output vector of length N
w = theano.shared(rng.randn(feats), name="w") # weights
b = theano.shared(0., name="b") # a single bias/constant term
print("Initial model:")
print(w.get_value())
print(b.get_value())

# Construct Theano expression graph
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))   # Probability that target = 1
prediction = p_1 > 0.5                  # The prediction thresholded
xent = -y * T.log(p_1) - (1-y) * T.log(1-p_1) # Cross-entropy loss function
cost = xent.mean() + 0.01 * (w ** 2).sum() # The cost to minimize. Note this includes a L2 penalty on the weights

gw, gb = T.grad(cost, [w, b])  # Compute the gradient of cost wrt w/b 

# Compile. The 0.1 in here is the learning rate (basically how quickly we try to go downhill when doing optimisation
train = theano.function(
          inputs=[x,y],
          outputs=[prediction, xent],
          updates=((w, w - 0.1 * gw), (b, b - 0.1 * gb)))
predict = theano.function(inputs=[x], outputs=prediction)

# Train
for i in range(training_steps):
    pred, err = train(D[0], D[1])

print("Final model:")
print(w.get_value())
print(b.get_value())
print("target values for D:")
print(D[1])
print("prediction on D:")
print(predict(D[0]))

Example 2: Auto-encoder

I didn't like the structure of the auto-encoder tutorial available here, so here is the stripped down example I worked through to solidify my Theano understanding. The nice thing about the following code is that the whole thing works if you run each code segment in the order presented here.

Step 1: get MNIST data

We will train our auto-encoder on the MNIST data, so we are essentially going to create a single later network which creates some key features of the MNIST digits, and then used those features to reconstruct (as best as possible) the MNIST digits.

For the example below, I will assume that you have downloaded the MNIST data (mnist.pkl.gz) and saved it somewhere.

Step 2: import required libraries and write data import function

Lets first make all the imports that we are going to need, and then there is a function which will load in the MNIST data for us. This function does this for us.

print '... loading data'  
# Load the dataset  
f = gzip.open("/path/to/where/you/saved/mnist/mnist.pkl.gz", 'rb')

train_set, valid_set, test_set = cPickle.load(f)  
f.close()  
data_x, data_y = train_set  
train_set_x = theano.shared(numpy.asarray(data_x,  
                  dtype=theano.config.floatX))

Nothing exciting there -- we just have our training data set, which is a matrix with 50000 rows (i.e. images), and 784 columns, which are the values in a 28x28 image, flattened out into a single length 784 vector. The values are between 0 (black) and 1 (white).

Step 3: Define our auto-encoder

Now we need to specify our auto-encoder structure. We have 784 inputs (28x28 images), and so must also have 784 outputs (because we are trying to reconstruct the source image). I've gone for 500 hidden units, for no particularly good reason.

Because we have 784 inputs/outputs and 500 hidden nodes, we need a weights matrix of size (784,500) between our inputs and hidden layer, and one of size (500,784) between our hidden and output layers. Below, these are called W1 and W2. We also have two bias vectors b1 and b2. b1 is a vector of biases for the hidden layer (so length 500), and b2 is a vector of biases for the output layer (so length 784).

You can get into a long debate about the best way to initialize these weights, but I'll just be lazy and initialize them to something random and reasonably close to 0. I'll return to a discussion of the 'correct' way to initialize weights later.

#attempt at a simple auto-encoder in Theano  
#  
#  
#We do this by mapping a 28x28 image (i.e. 784 inputs)   
#into a hidden layer, and then back out again to a visible (output)   
#layer  
n_visible=784  
n_hidden=500  
#These are our input images. Each row is of length 784  
#The number of rows will depend on the batch size of course  
x = T.matrix('x')  
#These are the weights between the input and the hidden layer. There must be   
#784*n_hidden of them  
W1 = theano.shared(value=(rng.rand(n_visible, n_hidden)-0.5)*0.1, name="W1")  
#This is the bias into the first hidden layer, so we need n_hidden of those  
b1 = theano.shared(value=rng.rand(n_hidden)*0.02-0.01, name="b1")  
#These are the weights between the hidden layer and the output layer  
W2 = theano.shared(value=(rng.rand(n_hidden, n_visible)-0.5)*0.1, name="W2")  
#These are the biases for the output layer  
b2 = theano.shared(value=rng.rand(n_visible)*0.02-0.01, name="b2")

Now, in principle, W1 and W2 are completely separate matrices....but it's a common trick to tie the two together, so W1 = transpose(W2). The reasoning for this is intuitively appealing but not mathematically precise: W1 encodes the image, so it is nice to think of the transpose operation as 'unpacking'/decoding the image. This would work exactly if the hidden layer had a linear activation function and W1 was orthogonal, but this is not the case. Still, we cut the number of parameters we have to estimate in half if we tie the weights, so maybe it is not a bad idea to do that.

I pick a non-linear activation function for the hidden units: the rectified linear transformation. I pick a sigmoid function for the output node, so that the response is squashed back to between 0 and 1 (the same range as the input image).

#This is the output from the first layer 
L1output = T.nnet.relu(T.dot(x, W1) + b1)  
#This is our output from the second layer, if we use tied weights  
L2output = T.nnet.sigmoid(T.dot(L1output, W1.T) + b2)

Now that we've specified the network, we need to say something about how to train the network. Just being lazy, I choose to minimize squared error, which is often an OK choice. Once I've made that choice, I just need to get the gradients of the network parameters (W1, b1, b2) w.r.t the loss, and that will be enough for me to train my network.

#OK, we've specified our weights/bias structure, now lets specify   
#how to do updates.   
#First, lets work out the loss at the output nodes, for input x  
#Here I just use squared loss  
loss = T.mean((L2output-x)**2)  
#compute gradients w.r.t input parameters  
W1grad, b1grad, b2grad = T.grad(loss, [W1, b1, b2])

Now, we've specified the loss we want to minimise, and the gradients required to minimise that loss, but we need to put that together in a training function, which we specify like so:

#now specify our update/training step  
trainf = theano.function(  
    inputs=[x],  
    outputs=[L2output, loss],  
    updates=((W1,W1-0.1*W1grad),(b1,b1-0.1*b1grad),(b2,b2-0.1*b2grad))  
)  
predict = theano.function(inputs=[x], outputs=L2output)

This just says that we update our weights and bias in the direction of the negative gradient. The 0.1 in the trainf function is just the learning rate -- higher values will mean faster learning (but maybe we get stuck in a local minima), lower values mean slower training but probably a better final model.

Now, that's pretty much it in terms of specifying the model and how to train it. Now we just need to do the actual training! This is pretty simple -- we break our data up into batched and do stochastic gradient decent on the batches:

#OK, that's it I believe! Now compile it and do the actual training  
#Before we do that, we just need to get Theano to 'evaluate' the data  
trainX = train_set_x.get_value()  
batch_size = 20  
for i in range(10):  
    batches = trainX.shape[0]/batch_size  
    errssofar = []  
    for batch in range(batches):      
        pred, err = trainf(trainX[batch*batch_size:(batch+1)*batch_size])  
        errssofar.append(err)  
        avgerr = numpy.mean(errssofar)  
        print("Iteration "+str(i)+" error on batch "+str(batch)+" is "+str(err)+". Average error across all batches this iteration is "+str(avgerr))

Now that should get you through an initial 10 epochs of training. So now we'd like to see how well the auto-encoding has worked. Lets define some functions to let us look at the input images, the recovered output images, and the weights in the hidden neurons:

import Image  
#have a look at a particular image (input)  
def showInput(index):  
    Image.fromarray(numpy.asarray([int(item*255.99) for item in trainX[index]], dtype=numpy.uint8).reshape(28,28)).show()  
#have a look at a reconstructed output   
def showOutput(index):  
    Image.fromarray(numpy.asarray([int(item*255.99) for item in predict(trainX[index:index+1])[0]], dtype=numpy.uint8).reshape(28,28)).show()  
#have a look at some weights in the hidden layer  
def showHiddenUnit(index):  
    weights = [item for item in W1.get_value().T[index]]  
    minw, maxw = min(weights), max(weights)  
    weights = [int(255.99*(item-minw)/(maxw-minw)) for item in weights]  
    Image.fromarray(numpy.asarray(weights, dtype=numpy.uint8).reshape(28,28)).show()

So lets now use these functions to look at one of our inputs, look at its reconstructed output, and then look at the weights:

showInput(0)  
showOutput(0)  
showHiddenUnit(0)

Probably after only 10 iterations the reconstruction is not that good, so you might want to train a bit more before looking at the reconstruction again:

for i in range(10):  
    batches = trainX.shape[0]/batch_size  
    errssofar = []  
    for batch in range(batches):      
        pred, err = trainf(trainX[batch*batch_size:(batch+1)*batch_size])  
        errssofar.append(err)  
        avgerr = numpy.mean(errssofar)  
        print("Iteration "+str(i)+" error on batch "+str(batch)+" is "+str(err)+". Average error across all batches this iteration is "+str(avgerr))

Repeat this as many times as you like and see how the encoding gets a bit better with more training.

Initialisation of weights

Above, I just glossed over the initialisation of weights, but apparently this is quite an important thing, and there has been a lot of research into the effect of weights initialisation, and the optimal setting of initial weights. The current best practice seems to be Xavier initialization (see here for an informal explanation. Use Google or look at the original paper if you want the details)

Choice of loss function

In the example above I just chose to minimise squared loss but this is not the only choice. I don't want to go into further discussion of other options here, but I should be clear that squared loss is not the only (or even the best) choice.

Regularisation, Test/Training set separation, yadda yadda

In the interests of keeping the example simple, I left out a lot of stuff. It doesnt really make sense, for example, just to train on a training data set. In reality we want to use a test set to tell us when we are overfitting our data, and maybe even a validation set to see how well we will do on 'unseen' data. I've deliberately left out all this stuff to keep things simple in the example. Similarly, it sometimes makes sense to include some regularisation in the loss function -- maybe penalising larger weights. This helps prevent overfitting.

Adding noise to input

Another idea to improve the auto-encoder and prevent it overfitting is to add noise to the input, but try to get the network to learn/predict the input without the noise. The justification here is that we want the network to learn the 'essential' features of the input, and by adding noise to the input (but not the target/output) we are encouraging it to do this, rather than just memorize the input.

rickwoodramblings

Thursday, December 17, 2015

Simplest possible Theano examples