STA 4953 (Spring 2001) Exercise 3A

Name:

Exercise 3A

Full credit will only be given to correct answers with a clear explanation of how they are obtained. Use additional paper as necessary.

Introduction to S-Plus

S-Plus is a comprehensive statistical software package with a large collection of built-in standard statistical procedures (e.g. ANOVA, regression). However, the real strength of the package lies in the versatile, object-oriented S-Plus programming language, which lets us create new statistical functions and procedures efficiently to suit our own purposes. You can start S-Plus on Helix by typing Splus5 (the "5" is because we are currently using version 5 of S-Plus).

Data Objects: When using S-Plus, think of your data sets as data objects. You can display any S-Plus object by simply typing its name. The simplest data object is a vector. Other data objects include matrices, data frames, and lists. We shall focus only on vectors and matrices for now.

If you are working from a terminal with the X-windows display environment set up properly, you can get on-line help from S-Plus by typing help.start() at the S-Plus prompt. This will start the help menu in Netscape. If you are not working from an X-windows terminal, you can get help for any command, say <commandname>, by typing help(commandname).

Exercise

A vector is a set of numbers, character values, logical values, etc. You can form a vector by combining several elements of the same type with the "c" function. For example, n<- c(2, 5.6, 8.1, 9.5) will create a vector, named n, consisting of four numbers 2, 5.6, 8.1, 9.5.
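As a small illustration of the "same type" requirement, the sketch below shows what c does when its arguments are mixed, and that a vector passed to c is spliced into one longer vector. The names nums, mixed, and longer are made up for this example. The code is plain S and was written with R's dialect of S in mind, so treat the exact behavior on S-Plus 5 as an assumption.

```r
# Hypothetical names; not part of the exercise.
nums <- c(2, 5.6, 8.1, 9.5)      # numeric vector with four elements
mixed <- c(2, "A")               # a number mixed with a character is converted to character
longer <- c(-1, nums, 10)        # an existing vector is spliced in, not nested

mode(nums)       # type of the elements of nums
mode(mixed)      # type of the elements of mixed
length(longer)   # number of elements in longer
```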

Try the following commands, then examine each created object by typing its name. Write down the S-Plus output and an interpretation of the operations performed by each command.

n<- c(2, 5.6, 8.1, 9.5)

m<-n/sum(n)

Output:

Interpretation:
Creates a vector named m in which each element is the corresponding element of n divided by the sum of the elements of n.

n<-c(-2.3, n, 4,-4.5, 8.5)

len<- length(n)

polyA<- rep ("A", length(n))

mysequence<- c ("C", "T", "T", "A", "G", "C", "A", "G", "G", "T")

Output:
[1] "C" "T" "T" "A" "G" "C" "A" "G" "G" "T"

Interpretation:
Creates a character vector named mysequence consisting of the ten elements listed in parentheses in the command statement.

CompStrand<- function (DNAseq){

CompStrand (mysequence)

CompStrand (polyA)

CompStrand (c("G", "A", "A", "T", "T", "C"))

What do you notice when you compare the complementary strand with the original sequence?
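The body of CompStrand is not shown in this handout, so the following is only a sketch of one way such a function could be written, not the course's definition. It assumes the function replaces each base by its Watson-Crick partner (A with T, C with G) and leaves the order unchanged; whether the original also reverses the strand is not known. The name compSketch and the lookup vectors bases and partners are hypothetical.

```r
# Hypothetical stand-in for CompStrand; the real definition is not shown.
compSketch <- function(DNAseq) {
    bases <- c("A", "C", "G", "T")       # the four nucleotides
    partners <- c("T", "G", "C", "A")    # partner of each entry of bases, in the same order
    # match() gives the position of each letter in bases;
    # indexing partners by that position returns the complementary letter
    partners[match(DNAseq, bases)]
}

compSketch(c("C", "T", "T", "A"))
```

Calling compSketch(mysequence) would apply the same lookup to each of the ten letters of mysequence.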

A matrix is a rectangular array of numbers. In S-Plus, a matrix is constructed using the "matrix" command. Try typing the following:

P<-matrix(c(0.2, 0.4, 0.5, 0.1, 0.3, 0.3, 0.1, 0.2, 0.1, 0.1, 0.2, 0.3, 0.4, 0.2, 0.2, 0.4), 4, 4)
Then type P at the S-Plus prompt and record the results.

Output:
     [,1] [,2] [,3] [,4]
[1,]  0.2  0.3  0.1  0.4
[2,]  0.4  0.3  0.1  0.2
[3,]  0.5  0.1  0.2  0.2
[4,]  0.1  0.2  0.3  0.4

Verify that P is the transition probability matrix you had in Problem 1 of Exercise 2A. Note that S-Plus fills in the entries of a matrix by columns. Also, the first number, 4, in the command statement after the matrix entries determines the number of rows in the matrix. The second number after the matrix entries determines the number of columns in the matrix.
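A short sketch of the column-wise fill: the byrow argument of matrix makes it fill row by row instead, which lets the entries be typed in the order in which they are displayed. The name Prow is hypothetical, and the row-sum check with apply is an addition for illustration, relying on the fact that each row of a transition probability matrix adds up to 1.

```r
# Column-wise fill (the default), as in the exercise.
P <- matrix(c(0.2, 0.4, 0.5, 0.1, 0.3, 0.3, 0.1, 0.2,
              0.1, 0.1, 0.2, 0.3, 0.4, 0.2, 0.2, 0.4), 4, 4)

# The same matrix entered row by row (Prow is a hypothetical name).
Prow <- matrix(c(0.2, 0.3, 0.1, 0.4,
                 0.4, 0.3, 0.1, 0.2,
                 0.5, 0.1, 0.2, 0.2,
                 0.1, 0.2, 0.3, 0.4), 4, 4, byrow = T)

max(abs(P - Prow))    # largest difference between the two matrices
apply(P, 1, sum)      # sum of each row of P
```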

Find out about the "solve" function in S-Plus by typing help(solve). Use it to find the stationary distribution of the Markov nucleotide sequence with the transition probability matrix P. You might want to refer back to Problem 2 of Exercise 2B, where you set up the system of linear equations to be solved in order to obtain the stationary distribution.

The stationary distribution is

x	f_o(x)
A
C
G
T

Record below the sequence of S-Plus commands used.
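One possible sequence of commands is sketched below. Exercise 2B is not reproduced here, so the sketch assumes the system consists of the balance equations fP = f, with one redundant equation replaced by the condition that the probabilities add up to 1. The names A, b, and statdist are hypothetical, and the code was written with R's dialect of S in mind.

```r
P <- matrix(c(0.2, 0.4, 0.5, 0.1, 0.3, 0.3, 0.1, 0.2,
              0.1, 0.1, 0.2, 0.3, 0.4, 0.2, 0.2, 0.4), 4, 4)

A <- t(P) - diag(4)        # row j of A encodes sum over i of f[i] * P[i, j], minus f[j], equal to 0
A[4, ] <- rep(1, 4)        # replace the last equation by f[1] + f[2] + f[3] + f[4] = 1
b <- c(0, 0, 0, 1)         # right-hand side of the modified system
statdist <- solve(A, b)    # solves A %*% statdist = b

statdist                   # candidate stationary distribution, states in the same order as in P
statdist %*% P             # should reproduce statdist if fP = f holds
```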

What does the S-Plus command t(P) do?
Output:
     [,1] [,2] [,3] [,4]
[1,]  0.2  0.4  0.5  0.1
[2,]  0.3  0.3  0.1  0.2
[3,]  0.1  0.1  0.2  0.3
[4,]  0.4  0.2  0.2  0.4

Interpretation:
Returns the transpose of P: rows and columns are interchanged, so entry (i, j) of t(P) is entry (j, i) of P.

Type "eigen(t(P))" at the S-Plus prompt. From the output, pick out the eigenvector corresponding to the eigenvalue 1. Can you obtain the stationary distribution from the eigenvector? How?
Answer:
The eigenvector is (0.6281143, 0.5413018, 0.4238495, 0.7404600), because if
eig1<-matrix(c(0.6281143, 0.5413018, 0.4238495, 0.7404600),1,4)
then eig1%*%P = eig1 (in S-Plus, use %*% to multiply matrices). Obtain the stationary distribution by dividing the eigenvector by the sum of its elements, eig1/sum(eig1), so that the probabilities add up to 1.

Why do we have to use "eigen(t(P))" instead of "eigen(P)" in the previous problem?
Answer:
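Because S-Plus returns right eigenvectors (column vectors v with Pv = v for the eigenvalue 1), and the stationary distribution is a left eigenvector of P (a row vector f with fP = f). The left eigenvectors of P are the right eigenvectors of t(P).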
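The answer above can be sketched in code. The sketch picks the eigenvalue closest to 1 rather than assuming where eigen places it, takes real parts in case the result is stored as complex numbers, and rescales the eigenvector so its entries add up to 1. The names e, k, v, statdist2, e0, and k0 are hypothetical. The code was written with R's dialect of S in mind; the scaling and sign of the vectors returned by S-Plus 5 may differ, which the division by sum(v) is meant to absorb.

```r
P <- matrix(c(0.2, 0.4, 0.5, 0.1, 0.3, 0.3, 0.1, 0.2,
              0.1, 0.1, 0.2, 0.3, 0.4, 0.2, 0.2, 0.4), 4, 4)

e <- eigen(t(P))                    # right eigenvectors of t(P) are left eigenvectors of P
k <- order(abs(e$values - 1))[1]    # position of the eigenvalue closest to 1
v <- Re(e$vectors[, k])             # corresponding eigenvector, real part
statdist2 <- v / sum(v)             # rescale so the entries add up to 1

statdist2 %*% P                     # should agree with statdist2

# For comparison: the right eigenvector of P itself for the eigenvalue 1.
e0 <- eigen(P)
k0 <- order(abs(e0$values - 1))[1]
Re(e0$vectors[, k0])                # all entries equal, because each row of P sums to 1
```

The second part shows why eigen(P) alone is not enough: its eigenvector for the eigenvalue 1 is a constant vector and carries no information about the stationary distribution.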
Go to the following links to read some introductory S-Plus documents.
SPLUS - TUTORIAL
Introductory Course on SPlus
More helpful links may be found at http://helix.biostat.utsa.edu/~kgarnett/bioinformatics/splus.html