Learning spring school - Day 2 Lab: Neural networks

  1. Don't forget that after starting R you need to type
    .libPaths(paste(getwd(), "/rlibs",sep=""))
  2. First you'll experiment with code based on Hugh's "Neural Nets: A simple example" presented this morning.
  3. Next, some experiments with real data
  4. Experiments with neural nets that have more than 1 hidden layer
  5. Self-organizing maps

2. Some experiments with real data

library(MASS)
data(Pima.tr)
library(nnet)  # this line changed 4.18pm Wed May 24
x <- Pima.tr[,-8]
mu <- apply(x,2,mean)
sig <- apply(x,2,sd)
x <- scale(x,mu,sig)
y <- Pima.tr[,8]
y <- (y=='Yes')+0

nn1 <- nnet(x,y,size=5,decay=.5,entropy=TRUE,maxit=1000)
   # the above is an alternative to the formula interface for specifying a
   # neural network: you provide the x matrix and y vector directly.
result <- predict(nn1)

table(actual=y,pred=round(result))
hist(result)

# now the test set...
x.te <- Pima.te[,-8]
x.te <- scale(x.te,mu,sig)
y.te <- (Pima.te[,8]=='Yes')+0

result.te <- predict(nn1,newdata=x.te)

table(actual=y.te,pred=round(result.te))
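
The two tables above can be reduced to a single misclassification rate, which is the quantity the questions below ask about. The block below is a minimal sketch that refits the same model so it runs on its own; the helper name misclass and the seed are illustrative choices, not part of the original lab.

```r
library(MASS)   # Pima.tr, Pima.te
library(nnet)

set.seed(1)     # arbitrary seed so the random starting weights are reproducible

# scale both sets with the training means and standard deviations
x    <- as.matrix(Pima.tr[, -8])
mu   <- colMeans(x)
sig  <- apply(x, 2, sd)
x    <- scale(x, mu, sig)
x.te <- scale(as.matrix(Pima.te[, -8]), mu, sig)
y    <- (Pima.tr[, 8] == "Yes") + 0
y.te <- (Pima.te[, 8] == "Yes") + 0

nn1 <- nnet(x, y, size = 5, decay = 0.5, entropy = TRUE,
            maxit = 1000, trace = FALSE)

# hypothetical helper: fraction of cases whose rounded prediction is wrong
misclass <- function(actual, prob) mean(round(prob) != actual)

err.tr <- misclass(y,    predict(nn1))
err.te <- misclass(y.te, predict(nn1, newdata = x.te))
c(train = err.tr, test = err.te)
```

The training rate is usually the more optimistic of the two, which is why the questions ask about cross-validated or test-set error.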

QUESTIONS

  1. Fix the number of hidden units at 5. How many parameters does this network have?
  2. For 5 hidden units, choose a value of decay that minimizes cross-validated misclassification rate.
  3. What happens if you don't scale the inputs in this example? (Scaling is done by the commands mu <- apply(x,2,mean); sig <- apply(x,2,sd); x <- scale(x,mu,sig).)
  4. Why is it important to scale the training and test features by the same constants (here the vectors mu and sig calculated from the training set)?
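
For question 2, one way to organize the search is a K-fold cross-validation loop over a grid of decay values. The block below is a sketch under stated assumptions: 5 folds, an arbitrary candidate grid, and an arbitrary seed, none of which come from the original lab. Note that the scaling constants are recomputed from the training folds only, in the spirit of question 4.

```r
library(MASS)
library(nnet)

set.seed(2)                       # arbitrary seed for folds and starting weights
x.raw <- as.matrix(Pima.tr[, -8])
y     <- (Pima.tr[, 8] == "Yes") + 0

K      <- 5                                          # number of folds (assumption)
fold   <- sample(rep(1:K, length.out = nrow(x.raw))) # random fold label per row
decays <- c(0.01, 0.1, 0.5, 1, 2)                    # illustrative candidate grid

cv.err <- sapply(decays, function(d) {
  wrong <- 0
  for (k in 1:K) {
    tr  <- fold != k
    # scaling constants come from the training folds only
    mu  <- colMeans(x.raw[tr, ])
    sig <- apply(x.raw[tr, ], 2, sd)
    fit <- nnet(scale(x.raw[tr, ], mu, sig), y[tr], size = 5, decay = d,
                entropy = TRUE, maxit = 1000, trace = FALSE)
    p   <- predict(fit, newdata = scale(x.raw[!tr, , drop = FALSE], mu, sig))
    wrong <- wrong + sum(round(p) != y[!tr])
  }
  wrong / nrow(x.raw)             # cross-validated misclassification rate
})

data.frame(decay = decays, cv.err = cv.err)
```

With only 200 training cases the estimates are noisy, so it is worth rerunning with a different seed before trusting small differences between decay values.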

Alternatively, you can use the AMORE library to fit a neural net. It currently has no option for binomial deviance as a criterion and uses least squares instead. Later, we'll look at AMORE primarily as a means of fitting a network with more than 1 hidden layer.

# the three calls below (newff, train, sim) replace the calls nn1 <- nnet(... and result <- predict(nn1) above.

library(AMORE)
# n.neurons gives the layer sizes: here 7 inputs, 1 hidden unit, 1 output.
# Change the middle value to the number of hidden units you want to compare with nnet.
net <- newff(n.neurons=c(7,1,1), learning.rate.global=1e-2, momentum.global=0.5,
             error.criterium="LMS", Stao=NA, hidden.layer="sigmoid", 
             output.layer="sigmoid", method="ADAPTgdwm")
thenn <- train(net, x, y, error.criterium="LMS", report=TRUE, 
  show.step=100, n.shows=100 )
result <- sim(thenn$net,x)

# the call below generates predictions for the test set, replacing result.te <- predict(nn1,newdata=x.te)
result.te <- sim(thenn$net,x.te)

QUESTIONS

  1. Does AMORE produce similar fits to nnet with the same number of units and no decay?
  2. AMORE has no decay parameter. An alternative way to control overfitting is "early stopping": the network training algorithm is stopped before convergence. Experiment with the number of iterations (equal to n.shows*show.step) and see if you can get a better test set fit by early stopping than by training the net to convergence.
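
One way to explore early stopping is to train in short chunks and record the test error after each chunk. The block below is a sketch only. It assumes that passing the previously trained net back to train() continues from its current weights, and the chunk size, number of chunks, and the choice of 5 hidden units are illustrative. Strictly speaking, the stopping point should be chosen on a separate validation set rather than on the test set.

```r
library(MASS)
library(AMORE)

set.seed(3)   # arbitrary seed

x    <- as.matrix(Pima.tr[, -8])
mu   <- colMeans(x)
sig  <- apply(x, 2, sd)
x    <- scale(x, mu, sig)
x.te <- scale(as.matrix(Pima.te[, -8]), mu, sig)
y    <- (Pima.tr[, 8] == "Yes") + 0
y.te <- (Pima.te[, 8] == "Yes") + 0

net <- newff(n.neurons = c(7, 5, 1), learning.rate.global = 1e-2,
             momentum.global = 0.5, error.criterium = "LMS", Stao = NA,
             hidden.layer = "sigmoid", output.layer = "sigmoid",
             method = "ADAPTgdwm")

chunk   <- 100                 # iterations per call to train()
n.chunk <- 20                  # total iterations = chunk * n.chunk
err.te  <- numeric(n.chunk)
for (i in 1:n.chunk) {
  # assumption: training resumes from the weights stored in net
  thenn <- train(net, x, y, error.criterium = "LMS", report = TRUE,
                 show.step = chunk, n.shows = 1)
  net   <- thenn$net
  err.te[i] <- mean(round(sim(net, x.te)) != y.te)
}

plot(chunk * (1:n.chunk), err.te, type = "b",
     xlab = "iterations", ylab = "test misclassification rate")
```

If the curve has a minimum before the last chunk, stopping there would have given a better test fit than continuing.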

3. Experiments with neural nets that have more than 1 hidden layer

The code below creates a network with 3 hidden units on the first layer, and 2 on the second layer, and trains the network on the simple 2D example with a circular class in the middle. It's primarily intended to demonstrate that the algorithm works.

x <- matrix(rnorm(200),ncol=2)
y <- matrix(((x[,1]^2+x[,2]^2)>1.4) +0,ncol=1)
library(AMORE)

# ----- make a grid of test points so we can plot fitted function
n1 <- 100
n2 <- 110
x1grid <- seq(-3,3,length.out=n1)
x2grid <- seq(-3,3,length.out=n2)
xg <- as.matrix(expand.grid(x1grid,x2grid))
# ----- end of grid making

# fit the model.  Note that it's important to get the right number
# of n.neurons in each level, including the number of inputs.
net <- newff(n.neurons=c(2,3,2,1), learning.rate.global=1e-2, momentum.global=0.5,
             error.criterium="LMS", Stao=NA, hidden.layer="sigmoid", 
             output.layer="sigmoid", method="ADAPTgdwm")
thenn <- train(net, x, y, error.criterium="LMS", report=TRUE, 
  show.step=100, n.shows=100 )

# get predictions
result <- sim(thenn$net,x)  # for training data
result2 <- sim(thenn$net,xg)# for grid points

# plot results
par(mfrow=c(1,2))
image(x1grid,x2grid,matrix(result2,n1,n2))
points(x,col=round(result)+1,pch=19)
points(x,pch='o')
plot(x,col=y+1,pch=19)
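
Since n.neurons fixes the whole architecture, it can help to check how many parameters a given choice implies. The helper below is a hypothetical convenience function, not part of AMORE or nnet; it assumes a fully connected feed-forward net in which every non-input unit has one bias term.

```r
# hypothetical helper: number of weights and biases implied by a vector of
# layer sizes (inputs first, outputs last)
count.params <- function(n.neurons) {
  n.in  <- head(n.neurons, -1)   # units feeding each layer
  n.out <- tail(n.neurons, -1)   # units in each layer after the input layer
  sum((n.in + 1) * n.out)        # "+ 1" is the bias of each receiving unit
}

# the network above: (2+1)*3 + (3+1)*2 + (2+1)*1 = 20
count.params(c(2, 3, 2, 1))
```

The same function can be used to compare the size of a deeper network against a single-hidden-layer network with more units.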

QUESTIONS

  1. How many hidden units do you need in the first layer to get a good fit?
  2. Is it worth the effort to have more than 1 hidden layer? I'm not sure. To explore this, I make up a dataset whose structure specifically has more than 1 layer, and then fit a model with 2 hidden layers (using AMORE).

    The example is a modification of the code used in day 1 to simulate a logistic model. Notice that the predictors given to the neural net are the original data values (z), even though the output is a nonlinear function of linear combinations of the original data values. If the multiple layers are working well, I figure the first layer should learn the linear combinations and the second layer should learn the nonlinear functions.

    n <- 200        # number of training points
    n.test <- 200   # number of test points
    p <- 5          # dimension of input space
    z <- matrix(rnorm((n+n.test)*p),ncol=p)
    x <- matrix(0,nrow=n+n.test,ncol=p)
    for (i in 1:p)
      x[,i] <- z%*%rnorm(p)
    mu <- x[,1] + x[,2]*x[,3] + sin(x[,4])  
    y <- mu + rnorm(n+n.test)   # one independent noise value per observation
    
    library(AMORE)
    net <- newff(n.neurons=c(5,3,3,1), learning.rate.global=1e-2, momentum.global=0.5,
                 error.criterium="LMS", Stao=NA, hidden.layer="sigmoid", 
                 output.layer="purelin", method="ADAPTgdwm")
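    tr <- 1:n                   # first n rows are training points, the rest are test points
    thenn <- train(net, z[tr,], y[tr], error.criterium="LMS", report=TRUE, 
      show.step=1, n.shows=100 )
    result <- sim(thenn$net,z[tr,])      # for training data
    result.te <- sim(thenn$net,z[-tr,])  # for test data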
    
    Experiment with the architecture of the network to get a better fit (either for the training set or ideally for the test set)
  3. I haven't had time to actually compare this to a single layer network. Either modify the code above to fit a 1-layer network (to be fair you might want to include more hidden units in layer 1), or use nnet. Do you need more than 1 hidden layer in this case?
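
As a starting point for question 3, the block below sketches the nnet route: a single hidden layer with a linear output unit, fit to the same kind of simulated data and scored by mean squared error on the held-out half. The seed, the number of hidden units (6), the decay value, and the helper name mse are illustrative assumptions, not recommendations.

```r
library(nnet)

set.seed(4)                       # arbitrary seed
n <- 200; n.test <- 200; p <- 5
N <- n + n.test
z <- matrix(rnorm(N * p), ncol = p)
x <- z %*% matrix(rnorm(p * p), ncol = p)  # each column is a random linear combination of z
mu <- x[, 1] + x[, 2] * x[, 3] + sin(x[, 4])
y  <- mu + rnorm(N)

tr <- 1:n                         # first n rows for training, the rest for testing

# single hidden layer; linout = TRUE gives a linear output unit for regression
fit1 <- nnet(z[tr, ], y[tr], size = 6, linout = TRUE, decay = 0.01,
             maxit = 2000, trace = FALSE)

# hypothetical helper: mean squared error
mse <- function(actual, pred) mean((actual - pred)^2)

mse.tr <- mse(y[tr],  predict(fit1))
mse.te <- mse(y[-tr], predict(fit1, newdata = z[-tr, ]))
c(train = mse.tr, test = mse.te)
```

Computing the same two numbers from the AMORE fit (using sim on the training and test rows) gives a like-for-like comparison between the one-layer and two-layer networks.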

4. Self-organizing maps

Here, we'll try some simple examples with SOMs. Recall that a SOM is an unsupervised network. It also has some resemblance to k-means.

First, the crabs data. We'll use the log of the physical measurements to cluster the data, and then examine how the species are grouped according to the fitted map. This is a built-in example for the batchSOM function in the "class" library.

library(class)
example(batchSOM)
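
To see the resemblance to k-means mentioned above, the same log measurements can be clustered with kmeans and cross-tabulated against the crab groups. This is a sketch for comparison only: the choice of 4 centers, the seed, and nstart are assumptions, and the group labels are rebuilt the way I believe example(batchSOM) builds them (four blocks of 50 rows).

```r
library(MASS)   # crabs data

set.seed(5)     # arbitrary seed; kmeans uses random starting centers
lcrabs    <- log(crabs[, 4:8])    # log of the five physical measurements
crabs.grp <- factor(c("B", "b", "O", "o")[rep(1:4, each = 50)])

km  <- kmeans(lcrabs, centers = 4, nstart = 20)
tab <- table(group = crabs.grp, cluster = km$cluster)
tab
```

Unlike the SOM, k-means imposes no grid structure on the cluster centres, so neighbouring cluster numbers have no particular meaning.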

Here's an example that will be used later in the week. It's a two-dimensional problem, but the structure is difficult for conventional unsupervised methods to detect.

# the code below is modified from the example above.
dm <- read.table('http://euler.acadiau.ca/~hchipman/SpringSchool2006/data/double_moon.amat',head=F)
pairs(dm)
plot(dm[,1:2],col=dm[,3]+1,pch=19)
dm.labels <- dm[,3]
dm <- dm[,-3]

# fit the SOM
library(class)
gr <- somgrid(topo='hexagonal')
dm.som <- batchSOM(dm,gr,c(4,4,2,2,1,1,1,0,0))
par(mfrow=c(1,1))
plot(dm.som) # in this case not very useful since you only have 2 x's

# try to visualize what is going on
symb <- c(1:26+64,1:26+96) # codes for letter symbols.

# find the nearest bin centre for each data point
bins <- as.numeric(knn1(dm.som$codes, dm, 0:47))

# plot original with labels, and labels in the grid.
par(mfrow=c(2,2))
plot(dm,pch=symb[bins])  # letter labels
plot(dm,pch=19,col=dm.labels+1)  # class labels

# letter labels on grid
plot(dm.som$grid, type = "n")
symbols(dm.som$grid$pts[, 1], dm.som$grid$pts[, 2],
             circles = rep(0.4, 48), inches = FALSE, add = TRUE)
points(dm.som$grid$pts[bins, ],pch= symb[bins])

# observed labels on grid
plot(dm.som$grid, type = "n")
symbols(dm.som$grid$pts[, 1], dm.som$grid$pts[, 2],
             circles = rep(0.4, 48), inches = FALSE, add = TRUE)
points(dm.som$grid$pts[bins, ],pch= as.character(dm.labels))
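
The plots above can be backed up with a simple numerical summary: for each occupied node, what fraction of its points belongs to the majority class? The block below is a sketch that uses a synthetic stand-in for the double-moon file (two noisy, interleaved half circles) so that it runs without the download; the real data may look different, and the names n.per and purity are illustrative.

```r
library(class)

set.seed(6)     # arbitrary seed
# synthetic stand-in for the double-moon data (an assumption)
n.per <- 150
t1 <- runif(n.per, 0, pi)
t2 <- runif(n.per, 0, pi)
dm <- rbind(cbind(cos(t1),     sin(t1)),
            cbind(1 - cos(t2), 0.5 - sin(t2))) +
      matrix(rnorm(4 * n.per, sd = 0.1), ncol = 2)
dm.labels <- rep(0:1, each = n.per)

gr     <- somgrid(topo = "hexagonal")            # default 8 x 6 grid, 48 nodes
dm.som <- batchSOM(dm, gr, c(4, 4, 2, 2, 1, 1, 1, 0, 0))
bins   <- as.numeric(knn1(dm.som$codes, dm, 0:47))

# class counts per occupied node, and the share held by the majority class
tab    <- table(node = bins, class = dm.labels)
purity <- apply(tab, 1, max) / rowSums(tab)
summary(purity)
mean(purity == 1)    # fraction of occupied nodes that hold a single class
```

Replacing the synthetic dm and dm.labels with the values read from the file gives the same summary for the real data.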

The last time I remembered to update the "modification date" for this page was May 22, 2006.
Hugh Chipman, Acadia University