The following session introduces some features of the R environment by using them. Many features of the system will be unfamiliar and puzzling at first, but this puzzlement will soon disappear.
In the text below, text in this font is R code. To avoid confusion, the R prompt (a ">") is not displayed. Comments follow the R code they describe, and are in the font you are reading now. Short comments are prefaced by a "#", since this is how R marks a comment. Longer comments appear in paragraphs and are indicated by context.
None of the output from R (either text or graphics) is displayed in this document. You can copy and paste these commands (with or without comments) into an R window to see what happens. If you are trying this session yourself, I encourage you to experiment!
If you are experimenting with this session yourself, you will need a working copy of R. Instructions for installation of R on various platforms are on the preliminaries for R page.
This example is modified from the online Introduction to R manual. Modifications are quite extensive, placing more emphasis on loading datasets, manipulation of objects, and training models in "learning" problems.
# Login, start your windowing system.
$ R# Start R as appropriate for your platform. The command above is the Unix/Linux way to start R. On Windows or Mac, you'd double-click an R icon or choose R from the Start menu.
# The R program begins, with a banner.
# (Within R, the prompt on the left hand side will not be shown to avoid confusion.)
help.start()# Start the HTML interface to the online help (using a web browser available on your machine). You should briefly explore the features of this facility with the mouse.
# Iconify the help window and move on to the next part.
q()# This stops R
# Now start R up again, and we'll continue with the first session.
x <- rnorm(50)
y <- rnorm(x)# Generate two pseudo-random normal vectors of x- and y-coordinates. Given a vector, rnorm draws as many values as the vector has elements, so y also has length 50.
plot(x, y)# Plot the points in the plane. A graphics window will appear automatically.
x# Typing the name of an object will display its contents.
ls()# See which R objects are now in the R workspace.
rm(x, y)# Remove objects no longer needed. (Clean up). What happens if you "ls()" after this?
x <- 1:20# Make x = (1, 2, ..., 20).
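x = 1:20# At the top level, "=" and "<-" both perform assignment. Inside a function call, "=" names an argument instead, so "<-" is the safer habit.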
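The difference between the two assignment operators only shows up inside function calls. A minimal sketch of this, where the names a, b, and tmp_vals are made up for illustration:

```r
# At the top level, both operators assign.
a <- 1:20
b = 1:20
same <- identical(a, b)

# Inside a call, "=" names an argument and creates no variable ...
m1 <- mean(x = c(2, 4, 6))
# ... while "<-" creates tmp_vals in the workspace and then passes it on.
m2 <- mean(tmp_vals <- c(2, 4, 6))

same                  # both assignments produced the same vector
exists("tmp_vals")    # tmp_vals was created as a side effect of the call
```

Both calls to mean give the same answer; only the second leaves a new object behind.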
dummy <- data.frame(x=x, y= x + rnorm(x))
dummy# Make a data frame of two columns, x and y, and look at it.
fm <- lm(y ~ x, data=dummy)
summary(fm)# Fit a simple linear regression of y on x and look at the analysis.
attach(dummy)# Make the columns in the data frame visible as variables.
plot(x, y)# Standard point plot.
abline(0, 1, lty=3)# The true regression line: (intercept 0, slope 1).
abline(coef(fm))# Add the regression line.
detach()# Remove data frame from the search path.
rm(fm, x, dummy)# Clean up again.
q()# Quit. You will be asked if you want to save your workspace, which contains all the objects you have created. For this session, you don't need to do this. Saving a workspace makes the objects you created available to you in future sessions.
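As an aside, the regression steps above can also be run without attach(), by handing the data frame to each function. This is only a sketch of that alternative; the fixed seed is an assumption added so that reruns give the same numbers.

```r
set.seed(1)                       # assumption: fixed seed for repeatable noise
x <- 1:20
dummy <- data.frame(x = x, y = x + rnorm(x))
fm <- lm(y ~ x, data = dummy)

coef(fm)                          # estimated intercept and slope
confint(fm)                       # confidence intervals for both coefficients

plot(y ~ x, data = dummy)         # the formula interface looks up x and y in dummy
abline(0, 1, lty = 3)             # the true line: intercept 0, slope 1
abline(fm)                        # the fitted line, taken from the model object
```

Because nothing is attached, there is no search path to clean up afterwards.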
Now start R again. If you are running this yourself, you will need to download the e1071 package. See the preliminaries for R page if you don't know how to do this.
library(e1071)# load the "e1071" package, which is a collection of machine learning tools and datasets.
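data(Glass)# Load the "Glass" dataset. If R cannot find it, the dataset may instead be provided by the "mlbench" package in your installation; in that case try data(Glass, package="mlbench").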
An alternate approach is to read the data from a file or from a URL (below):
glass2 <- read.table('http://www.ics.uci.edu/~mlearn/databases/glass/glass.data',
                     sep=',', header=FALSE)
We'll use the built-in version of the data, instead of the downloaded one.
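For reference, here is a small sketch of reading a comma-separated file with no header line. The file is a temporary one written on the spot, and its three rows and four column names are toy values invented for illustration, not the real glass data.

```r
# Write a tiny comma-separated file with no header line (toy values).
tmp <- tempfile(fileext = ".data")
writeLines(c("1,1.50,13.1,1",
             "2,1.51,13.9,1",
             "3,1.52,12.7,2"), tmp)

toy <- read.table(tmp, sep = ",", header = FALSE)
names(toy)                        # default names V1, V2, ... when there is no header

# Give the columns meaningful (hypothetical) names and make the class a factor.
names(toy) <- c("id", "RI", "Na", "Type")
toy$Type <- factor(toy$Type)
dim(toy)
str(toy)
```

A file read this way arrives without column names or factor columns, which is one reason the packaged version of a dataset is often more convenient.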
dim(Glass)# What are the dimensions of the data (rows, then columns)?
summary(Glass)# Summarize each column of Glass.
?Glass# Ask for help on the Glass dataset. Because the data are part of a package, there is a help page that describes the data. You can ask for help on any documented R object (either a dataset or a function) using ? or help(), for example help(Glass).
library(lattice)
histogram(~Na|Type,data=Glass)# Look at histograms of the "Na" variable for each type of glass separately. Note that the more basic "hist" command is included with base R, while "histogram" is part of the fancier "lattice" package.
plot(Glass[,1:2],col=as.numeric(Glass$Type),pch=19)# Plot the first two variables, colored by glass type, to get an idea of how separable the classes are.
plot(Glass[,1:5],col=as.numeric(Glass$Type),pch=19)# look at more scatterplots simultaneously
# Lots of things are going on in the above operations.
Glass[,1:2]#Subscript columns 1 and 2
It's also possible to subscript rows or both rows and columns, for example
Glass[1:10,]
Glass[1:10,1:2]
Glass$Type# We can also subscript data frames by name
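The same subscripting rules apply to any data frame. The sketch below uses the iris data frame, chosen here only because it ships with R and needs no extra package; the cutoff of 7 is arbitrary.

```r
df <- iris                               # a data frame included with R

first_block <- df[1:10, 1:2]             # rows 1 to 10, columns 1 and 2
by_name     <- df[, "Sepal.Length"]      # a column selected by name
same_col    <- df$Sepal.Length           # the same column using $

# A logical condition in the row position keeps only the matching rows.
long_sepals <- df[df$Sepal.Length > 7, c("Sepal.Length", "Species")]

dim(first_block)
identical(by_name, same_col)
head(long_sepals)
```

Leaving the row or column position empty means "keep everything" in that dimension.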
Below we'll fit a neural network to the data.
library(nnet)# load the neural network library
tempdata <- Glass# work on a copy of the data
tempdata[,1:9] <- scale(tempdata[,1:9])# standardize the predictor columns
nn1 <- nnet(Type~.,data=tempdata,size=10,decay=1)# fit the model
predict(nn1,type='class')
table(actual=Glass$Type,predicted=predict(nn1,type='class'))
sum(Glass$Type==predict(nn1,type='class'))# look at predictions.
Glass$Type==predict(nn1,type='class') makes a logical vector of length 214, and the sum counts the number of "TRUE" values.
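The count above is computed on the same data used to fit the network, so it tends to be optimistic. A rough sketch of checking predictions on held-out rows follows. It uses the iris data (which ships with R) rather than Glass, and the seed, the split of 100 training rows, and the settings size=5 and decay=0.1 are all arbitrary assumptions made for illustration.

```r
library(nnet)

set.seed(42)                               # assumption: fixed seed for a repeatable split
dat <- iris
dat[, 1:4] <- scale(dat[, 1:4])            # standardize the four predictors

train <- sample(nrow(dat), 100)            # row numbers used for fitting
fit <- nnet(Species ~ ., data = dat[train, ],
            size = 5, decay = 0.1, trace = FALSE)

# Predict only the rows that were not used for fitting.
pred <- predict(fit, newdata = dat[-train, ], type = "class")
conf_mat <- table(actual = dat$Species[-train], predicted = pred)
acc <- mean(pred == dat$Species[-train])   # the mean of a logical vector is a proportion

conf_mat
acc
```

Taking the mean instead of the sum turns the count of correct predictions into a proportion.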
There is more material I hope to cover beyond the session above. Since that material was distributed to students a week in advance, most of you have probably worked through it already. The topics below are ordered (roughly) by concept.
1:10 # integers
seq(3,10,.1) # a sequence 3 to 10 in steps of .1
seq(3,10,l=20) # instead specify the length
rep(0,20) # repeat 0, 20 times
c(1:10,3,seq(1,2,l=4)) # paste together in vector
All of the above could be assigned, e.g. x<-1:10 or x=1:10
Random:
x <- c(1,1,1,1,2,2,3)
sample(x,2) # pick 2 elements of x w/o replacement
sample(x) # permute elements of x
sample(x,rep=T) # sample length(x) elements with replacement
rnorm(10,2,1) # simulate 10 N(2,1) observations
rbinom(10,1,.5) # simulate 10 binomials with 1 trial each, and prob=.5
rbinom(10,1,1:10/11) # second or third (here) argument can be vector
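Random draws change from run to run unless the generator is seeded first. A short sketch, with an arbitrary seed value of 123:

```r
set.seed(123)                # any integer works; 123 is arbitrary
a <- rnorm(5)
set.seed(123)                # resetting the seed replays the same stream
b <- rnorm(5)
identical(a, b)

x <- c(1, 1, 1, 1, 2, 2, 3)
perm  <- sample(x)                     # a permutation keeps the same elements
two   <- sample(x, 2)                  # two elements, without replacement
coins <- rbinom(10, 1, 1:10 / 11)      # one 0/1 draw per probability
```

Seeding is useful whenever you want someone else to reproduce the exact numbers you saw.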
x <- matrix(1:6,2,3)
y <- matrix(c(1,1,1),3,1)
x%*%y # matrix multiplication
x*x # elementwise multiplication
mean(x) # mean of elements of x
apply(x,2,mean) # apply the mean function over dimension 2,
# i.e. one mean per column
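To make the shapes in these matrix operations concrete, here is a small sketch using the same x and y. The names row_means and col_means are introduced only for illustration.

```r
x <- matrix(1:6, 2, 3)              # filled column by column
y <- matrix(c(1, 1, 1), 3, 1)

prod_xy <- x %*% y                  # (2 x 3) times (3 x 1) gives a 2 x 1 matrix
dim(prod_xy)

row_means <- apply(x, 1, mean)      # margin 1: one mean per row
col_means <- apply(x, 2, mean)      # margin 2: one mean per column

# colMeans is a built-in shortcut for the column case.
all.equal(col_means, colMeans(x))
```

Multiplying by a column of ones, as here, simply adds up each row of x.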
sqr <- function(x) {
x*x
} # define a function
sqr(3.3)
y <- rep(0,10)
for (i in 1:10){
y[i] <- sqr(i)
}
y
WARNING: The above is an inefficient way to square the elements of a vector. In general, in R, it's much faster to "vectorize" operations than to carry out for loops, for example y <- (1:10)*(1:10). However, situations will arise when loops are necessary.
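A rough way to see the difference is to time both versions on a longer vector. This is a sketch only: the vector length of 100000 and the helper name loop_square are arbitrary choices, and the timings will vary by machine.

```r
sqr <- function(x) x * x
xs <- as.numeric(1:100000)          # doubles, to avoid integer overflow when squaring

loop_square <- function(v) {
  out <- numeric(length(v))         # preallocate the result
  for (i in seq_along(v)) out[i] <- sqr(v[i])
  out
}

y_loop <- loop_square(xs)
y_vec  <- sqr(xs)                   # one vectorized call
y_sap  <- sapply(xs, sqr)           # hides the loop, but still calls sqr once per element

identical(y_loop, y_vec)            # same answer either way
system.time(loop_square(xs))
system.time(sqr(xs))
```

If a loop is unavoidable, preallocating the result vector, as loop_square does, avoids growing it one element at a time.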
plot(x) will do different things depending on whether x is a vector, a matrix, a data frame, or an object created by a complex function like lm. We'll encounter object-oriented functions such as plot, summary, and predict.
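What such a function does is decided by the class of its argument. The sketch below shows this with a fitted model and with a made-up class; the class name "temperature" and its summary method are hypothetical and exist only for this example.

```r
fm <- lm(dist ~ speed, data = cars)    # cars is a dataset shipped with R
class(fm)                              # "lm", so summary(fm) uses summary.lm
class(summary(fm))                     # "summary.lm"

# A hypothetical class with its own summary method.
summary.temperature <- function(object, ...) {
  cat("Temperatures from", min(object$degrees), "to", max(object$degrees), "\n")
  invisible(range(object$degrees))
}

obs <- structure(list(degrees = c(12, 18, 15)), class = "temperature")
rng <- summary(obs)                    # dispatches to summary.temperature
```

Typing methods(summary) lists the classes for which your session currently has a summary method.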
Model formulas identify the predictors (inputs, features) and response (output, target), can specify functions of predictors to use, automatically generate indicator variables for categorical predictors, and can represent more complex modelling structure, such as nested terms, conditional structure, etc. The variables identified in the formula usually correspond to named columns of a data frame.
library(MASS)
crabs[1:2,]
glm(sp~FL,data=crabs,family=binomial) # use FL only as a predictor
glm(sp~.,data=crabs,family=binomial) # use everything except sp as a predictor
glm(sp~.-sex-index,data=crabs,family=binomial) # exclude terms sex and index
glm(sp~FL+RW,data=crabs,family=binomial) # use FL and RW as predictors.
The way in which a predictor actually enters the model will depend on the model being fit. For example, a linear regression (lm) or generalized linear model (glm) may use them as linear predictors, while a tree or a neural network will estimate nonlinear functions and possibly interactions. In these latter cases, the main purpose of the formula is to say what variables to use.
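One way to see what a formula produces is to ask for the design matrix directly with model.matrix. The data frame below is a toy example with invented values; only the column names it generates matter here.

```r
# Toy data: a numeric predictor and a three-level categorical predictor.
d <- data.frame(y   = c(1.2, 2.3, 3.1, 4.8, 5.0, 6.7),
                x1  = 1:6,
                grp = factor(c("a", "b", "c", "a", "b", "c")))

mm1 <- model.matrix(y ~ x1 + grp, data = d)
colnames(mm1)        # grp is expanded into indicator columns grpb and grpc

mm2 <- model.matrix(y ~ . - x1, data = d)
colnames(mm2)        # "." means every other column; "- x1" then drops x1

mm3 <- model.matrix(y ~ x1 + I(x1^2), data = d)
colnames(mm3)        # I() protects arithmetic so x1^2 enters as its own column
```

With the default contrasts, the first level of grp ("a") is absorbed into the intercept, which is why only two indicator columns appear.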
str is a good way to "look under the hood" of any object without getting 10,000 lines of printed output...
junk <- kmeans(matrix(rnorm(100),25,4),4)
str(junk) # show a compact summary of the "str"ucture of junk
str(junk,2) # as above, but only recurse to depth 2
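Once str has shown what an object contains, the pieces can be pulled out by name. A brief sketch on the same kind of kmeans result, with an arbitrary seed added so the clustering is repeatable:

```r
set.seed(2)                                   # assumption: arbitrary fixed seed
junk <- kmeans(matrix(rnorm(100), 25, 4), 4)

names(junk)                   # the components that str() lists
dim(junk$centers)             # one row per cluster, one column per variable
table(junk$cluster)           # how many of the 25 rows fell in each cluster
str(junk, max.level = 1)      # the depth limit, spelled out by argument name
```

The same approach of names() followed by $ works on most fitted-model objects, since they are usually lists underneath.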
To run R commands stored in a file, use source("mycommands.txt"). R also has a batch mode, but it's not likely to be relevant here.
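A self-contained sketch of source() in action: the two commands are first written to a temporary file, which stands in for a file like mycommands.txt, and the object names in it are made up for this example.

```r
# Write two commands to a temporary file standing in for mycommands.txt.
script <- tempfile(fileext = ".txt")
writeLines(c("z_from_file <- 1:5",
             "total_from_file <- sum(z_from_file)"),
           script)

source(script)            # runs the commands as if they had been typed at the prompt
total_from_file           # objects created by the file are now in the workspace
```

Keeping commands in a file this way makes it easy to rerun an analysis from the top.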
In places these labs may be "over the top", in that they present much more material than you have a hope of finishing. Don't panic. Pick and choose what looks interesting. Ask lots of questions. Have fun.