STA 4953 (Spring 2001) Exercise 4A

Name:

E-mail:

Exercise 4A

Full credit will only be given to correct answers with a clear explanation of how they are obtained. Use additional paper as necessary.

Look up the helpfile of the S-Plus command "sample".

Generate a random nucleotide sequence of length 1000 using the independent model (i.e. the tetrahedral die model) with probabilities P(A) = 0.3, P(C) = 0.2, P(G) = 0.25, P(T) = 0.25. Name your sequence "myseq".

Let N_A (respectively N_C, N_G, and N_T) be the random variable representing the number of A's (respectively C's, G's, and T's) in such a random sequence as generated in problem #2.

(a). What kind of probability distribution does N_A (respectively N_C, N_G, and N_T) have? What are the parameters in that distribution?

(b). What is the expected value, of N_A (respectively N_C, N_G, and N_T)?

x	Probability Distribution	Parameters	E(x)
N_A
N_C
N_G
N_T

Figure out the observed counts of the bases in myseq you obtained in problem #2. Fill in the table below.

(Hint: In Exercise3A, you have created an S-Plus object called "mysequence". Find out what happens when you do mysequence[mysequence=="A"]. Do the same for "C", "G", and "T". This might give you some idea.)

Base	Expected Count (from #3)	Observed Count
A
C
G
T

Show the S-Plus command(s) that you used.

Generate a random nucleotide sequence of length 1000 according to the Markov model with the transition probability matrix P given in Problem #1 in Exercise 2A. Assume that the sequence has already attained its stationary distribution at the beginning. Repeat problem 4 for this sequence.

Hint: You'll need to use "for"-loops and some "if-else" statements. The "dimnames" function for matrices may also be useful. Check out some of the introductory S-Plus documents.)

Base Expected Count
Observed Count

A

C

G

T

Show the S-Plus commands that you used.