Name:  
   E-mail:  


Exercise 4A

Full credit will only be given to correct answers with a clear explanation of how they are obtained. Use additional paper as necessary.

  1. Look up the helpfile of the S-Plus command "sample".
  2. Generate a random nucleotide sequence of length 1000 using the independent model (i.e. the tetrahedral die model) with probabilities P(A) = 0.3, P(C) = 0.2, P(G) = 0.25, P(T) = 0.25. Name your sequence "myseq".
    Solution: myseq<-sample(c("A", "C", "G", "T"),size=1000,replace=T,prob=c(0.2,0.3,0.25,0.25))
  3. Let NA (respectively NC, NG, and NT) be the random variable representing the number of A's (respectively C's, G's, and T's) in such a random sequence as generated in problem #2.

    (a). What kind of probability distribution does NA (respectively NC, NG, and NT) have? What are the parameters in that distribution?

    (b). What is the expected value, of NA (respectively NC, NG, and NT)?

    x Probability Distribution Parameters E(x)
    NA
    NC
    NG
    NT
  4. Figure out the observed counts of the bases in myseq you obtained in problem #2. Fill in the table below.

    (Hint: In Exercise3A, you have created an S-Plus object called "mysequence". Find out what happens when you do mysequence[mysequence=="A"]. Do the same for "C", "G", and "T". This might give you some idea.)

    Base Expected Count
    (from #3)
    Observed Count
    A
    C
    G
    T

    Show the S-Plus command(s) that you used.

  5. Generate a random nucleotide sequence of length 1000 according to the Markov model with the transition probability matrix P given in Problem #1 in Exercise 2A. Assume that the sequence has already attained its stationary distribution at the beginning. Repeat problem 4 for this sequence.

    Hint: You'll need to use "for"-loops and some "if-else" statements. The "dimnames" function for matrices may also be useful. Check out some of the introductory S-Plus documents.)

    Base Expected Count
    Observed Count
    A
    C
    G
    T


    The expected values were obtained as follows:
    Let I(j)A = 1 if the jth base is A; 0 if the jth base is not A.
    Then,
    NA = I(1)A + I(2)A + ... + I(1000)A
    Since the expected value of a sum is equal to the sum of the expected values, we have
    E[NA] = sum(j=1 to 1000)[E(I(j)A)] = 0.2691466 + 0.2691466 + ... + 0.2691466 = 269.1466
    The expected counts for C, G, and T are obtained similarly.

    Show the S-Plus commands that you used.