Name:  
   E-mail:  


STA 4953 Test I Solutions
Introduction to Bioinformatics
February 22, 2001, 3:30-4:45 p.m.



Part I. [9 points] Probability models for DNA



The tetrahedral die model
A random nucleotide sequence is generated independently according to the probability distribution
(f(A), f(C), f(G), f(T)) = (0.25, 0.2, 0.2, 0.35)
When a quadruplet made up of 4 nucleotide bases from this sequence is observed, what is the probability that
  1. it contains at least one A?

    Solution:
    P(contains at least one A)
    = 1 - P(contains no A's in four independent rolls of the tetrahedral die)
    = 1 - [1 - P(A)]^4, since P(no A in one roll of the die) = 1 - P(A), and so P(no A's in four rolls of the die) = [1 - P(A)]^4
    = 1 - (1 - 0.25)^4 = 0.6836


  2. all four bases are identical?

    Solution:
    P(all four bases are identical)
    = P(all four bases are A's or all four bases are C's or all four bases are G's or all four bases are T's)
    = P(all four bases are A's) + P(all four bases are C's) + P(all four bases are G's) + P(all four bases are T's)
    = (0.25)^4 + (0.2)^4 + (0.2)^4 + (0.35)^4
    = 0.0221
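
As a quick check on both answers, the sketch below enumerates all 4^4 = 256 possible quadruplets under the independence model and adds up the probabilities of those that satisfy each event. The names `f`, `quad_prob`, `p_at_least_one_a`, and `p_all_identical` are illustrative only and are not part of the original solution.

```python
from itertools import product

# Base probabilities from the tetrahedral die model.
f = {"A": 0.25, "C": 0.2, "G": 0.2, "T": 0.35}

def quad_prob(quad):
    """Probability of one specific quadruplet when bases are independent."""
    p = 1.0
    for base in quad:
        p *= f[base]
    return p

quads = list(product("ACGT", repeat=4))  # all 256 quadruplets

# Sum the probabilities of the quadruplets that belong to each event.
p_at_least_one_a = sum(quad_prob(q) for q in quads if "A" in q)
p_all_identical = sum(quad_prob(q) for q in quads if len(set(q)) == 1)

print(round(p_at_least_one_a, 4))  # 0.6836
print(round(p_all_identical, 4))   # 0.0221
```

Brute-force enumeration is only practical because the sample space is tiny; the closed-form complement and addition rules used in the solutions are the intended method.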


The Markov chain model
A random nucleotide sequence is generated by a Markov chain with transition probability matrix
   P  =      A    C    G    T
        A ( 0.2  0.3  0.1  0.4 )
        C ( 0.4  0.3  0.1  0.2 )
        G ( 0.5  0.1  0.2  0.2 )
        T ( 0.1  0.2  0.3  0.4 )

If the probability distribution of the first base is
(f(A), f(C), f(G), f(T)) = (0.25, 0.2, 0.2, 0.35),
what is the probability of getting an A in the third base? Describe one way to find the stationary distribution of this Markov chain.

Solution:
Label the first, second, and third bases X0, X1, and X2, and let fn denote the probability distribution of Xn. The third base is reached on the second transition, so we need to find f2(A).
Since
(f1(A), f1(C), f1(G), f1(T)) * P = (f2(A), f2(C), f2(G), f2(T)),
we must first find the probability distribution of X1.
We know that the probability distribution of X0 is
x f0(x)
A 0.25
C 0.2
G 0.2
T 0.35
and
(f0(A), f0(C), f0(G), f0(T))* P = (f1(A), f1(C), f1(G), f1(T))

To determine the probability distribution of X1, you must find
f1(A), f1(C), f1(G), f1(T).
To find f1(A), compute
P(X1=A) = P(X1=A | X0=A) * P(X0=A) + P(X1=A | X0=C) * P(X0=C) + P(X1=A | X0=G) * P(X0=G) + P(X1=A | X0=T) * P(X0=T)
= (0.2)*(0.25) + (0.4) * (0.2) + (0.5) * (0.2) + (0.1) * (0.35)
= 0.265

f1(C), f1(G), and f1(T) are found similarly to get the probability distribution of X1:



x f1(x)
A 0.265
C 0.225
G 0.19
T 0.32

Now, to find f2(A), compute
P(X2=A) = P(X2=A | X1=A) * P(X1=A) + P(X2=A | X1=C) * P(X1=C) + P(X2=A | X1=G) * P(X1=G) + P(X2=A | X1=T) * P(X1=T)
= (0.2)*(0.265) + (0.4) * (0.225) + (0.5) * (0.19) + (0.1) * (0.32)
= 0.27
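
The two-step calculation above is a row vector multiplied by P twice. The following is a minimal sketch in plain Python; the helper name `step` and the list layout (order A, C, G, T) are assumptions made for illustration.

```python
# Transition matrix P: rows are the current base, columns the next base,
# both in the order A, C, G, T.
P = [
    [0.2, 0.3, 0.1, 0.4],
    [0.4, 0.3, 0.1, 0.2],
    [0.5, 0.1, 0.2, 0.2],
    [0.1, 0.2, 0.3, 0.4],
]
f0 = [0.25, 0.2, 0.2, 0.35]  # distribution of the first base, X0

def step(f, P):
    """One transition: the row vector f multiplied by the matrix P."""
    n = len(f)
    return [sum(f[i] * P[i][j] for i in range(n)) for j in range(n)]

f1 = step(f0, P)  # distribution of the second base, X1
f2 = step(f1, P)  # distribution of the third base, X2

print([round(x, 4) for x in f1])  # [0.265, 0.225, 0.19, 0.32]
print(round(f2[0], 4))            # 0.27, the probability that the third base is A
```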

One way to find the stationary distribution is to use the fact that once the stationary distribution is reached, the following equation is true:
(f(A), f(C), f(G), f(T)) * P = (f(A), f(C), f(G), f(T)).
From this equation and the above matrix, P, we can get the following system of equations:
f(A) = 0.2 * f(A) + 0.4 * f(C) + 0.5 * f(G) + 0.1 * f(T)
f(C) = 0.3 * f(A) + 0.3 * f(C) + 0.1 * f(G) + 0.2 * f(T)
f(G) = 0.1 * f(A) + 0.1 * f(C) + 0.2 * f(G) + 0.3 * f(T)
f(T) = 0.4 * f(A) + 0.2 * f(C) + 0.2 * f(G) + 0.4 * f(T)

The four equations above are linearly dependent, so the system to solve consists of any 3 of them together with the equation
1 = f(A) + f(C) + f(G) + f(T)

Another way to find the stationary distribution is to find a left eigenvector of P (equivalently, an eigenvector of the transpose of P) with eigenvalue 1 and normalize it so that its entries sum to 1, making it a probability distribution.
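
The linear-system approach can be sketched as follows. This is one possible implementation, not the method prescribed by the exam: it keeps the stationarity equations for A, C, and G, adds the normalization equation, and solves the 4 x 4 system by Gauss-Jordan elimination with exact fractions. The variable names (`rows`, `stationary`) are illustrative.

```python
from fractions import Fraction

# Transition matrix P as exact fractions (rows and columns in the order A, C, G, T).
P = [[Fraction(x, 10) for x in row] for row in
     [[2, 3, 1, 4], [4, 3, 1, 2], [5, 1, 2, 2], [1, 2, 3, 4]]]
n = len(P)

# Augmented system: for j = A, C, G the equation sum_i f(i) * P[i][j] - f(j) = 0,
# followed by the normalization equation sum_i f(i) = 1.
rows = []
for j in range(n - 1):
    rows.append([P[i][j] - (1 if i == j else 0) for i in range(n)] + [Fraction(0)])
rows.append([Fraction(1)] * n + [Fraction(1)])

# Gauss-Jordan elimination with exact arithmetic.
for col in range(n):
    # Pick any row at or below the diagonal with a nonzero entry in this column.
    pivot = next(r for r in range(col, n) if rows[r][col] != 0)
    rows[col], rows[pivot] = rows[pivot], rows[col]
    rows[col] = [x / rows[col][col] for x in rows[col]]
    for r in range(n):
        if r != col:
            factor = rows[r][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]

stationary = [rows[i][n] for i in range(n)]  # (f(A), f(C), f(G), f(T))
print([float(x) for x in stationary])
```

Exact fractions avoid round-off, so the result can be checked by confirming that multiplying it by P returns the same vector.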



Part II [6 points] Statistical Inference

Hypothesis Test

A DNA sequence has the following nucleotide and dinucleotide counts:

A 246 AA 40 AC 74 AG 19 AT 113
C 219 CA 86 CC 76 CG 26 CT 31
G 191 GA 92 GC 12 GG 38 GT 49
T 344 TA 28 TC 56 TG 108 TT 151

Suppose the nucleotide bases are generated independently. Test the hypothesis that the base probability distribution is
(f(A), f(C), f(G), f(T)) = (0.25, 0.2, 0.2, 0.35),
    [Hint: Either the Pearson statistic

    X^2 = sum(i={A,C,G,T}) [(Oi - Ei)^2 / Ei]

    or the likelihood ratio statistic

    G^2 = 2 * sum(i={A,C,G,T}) [Oi * log(Oi / Ei)]

    may be used. The critical value of the chi-square distribution at the 0.05 level with 3 degrees of freedom is 7.815.]


Solution:

H0: f(A) = 0.25, f(C) = 0.2, f(G) = 0.2, and f(T) = 0.35
H1: Either f(A) does not equal 0.25, or f(C) does not equal 0.2, or f(G) does not equal 0.2, or f(T) does not equal 0.35

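The sequence contains n = 246 + 219 + 191 + 344 = 1000 bases, so under H0 the expected counts Ei = n * f(i) are 250, 200, 200, and 350 for A, C, G, and T.
Using the Pearson statistic, we have
X^2 = sum(i={A,C,G,T}) [(Oi - Ei)^2 / Ei]
= [(246 - 250)^2/250] + [(219 - 200)^2/200] + [(191 - 200)^2/200] + [(344 - 350)^2/350]
= 2.3769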

Since 2.3769 < 7.815, we fail to reject H0 at the 0.05 level: the data do not provide sufficient evidence that the base distribution differs from the hypothesized one.

Using the likelihood ratio statistic (with log denoting the natural logarithm), we have
G^2 = 2 * sum(i={A,C,G,T}) [Oi * log(Oi / Ei)]
= 2 * {(246) * log(246/250) + (219) * log(219/200) + (191) * log(191/200) + (344) * log(344/350)}
= 2.3294

Since 2.3294 < 7.815, we fail to reject H0 at the 0.05 level: the data do not provide sufficient evidence that the base distribution differs from the hypothesized one.
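
Both statistics can be reproduced with a few lines of Python. This is a sketch using only the standard library; the dictionary names are illustrative, and the critical value is simply copied from the hint rather than computed.

```python
import math

observed = {"A": 246, "C": 219, "G": 191, "T": 344}
f_null = {"A": 0.25, "C": 0.2, "G": 0.2, "T": 0.35}  # distribution under H0

n = sum(observed.values())                        # 1000 bases
expected = {b: n * f_null[b] for b in observed}   # 250, 200, 200, 350

# Pearson chi-square statistic.
x2 = sum((observed[b] - expected[b]) ** 2 / expected[b] for b in observed)

# Likelihood ratio statistic (math.log is the natural logarithm).
g2 = 2 * sum(observed[b] * math.log(observed[b] / expected[b]) for b in observed)

critical_value = 7.815  # chi-square, 0.05 level, 3 degrees of freedom (from the hint)

print(round(x2, 4), x2 > critical_value)  # 2.3769 False
print(round(g2, 4), g2 > critical_value)  # 2.3294 False
```

The two statistics are close to each other here, as expected when the observed counts are near the expected counts.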


Estimation
Suppose the Strong/Weak (S/W) classification for this DNA sequence conforms to a Markov chain. Estimate the transition probability matrix

   P  =   ( pSS   pSW )
          ( pWS   pWW ).


Solution:
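The strong (S) bases are C and G, and the weak (W) bases are A and T. First we need to find

P(Strong) = (191 + 219) / 1000 = 0.41
and
P(Weak) = (246 + 344) / 1000 = 0.59

Use these probabilities, together with the dinucleotide frequencies (each dinucleotide count divided by the 999 dinucleotides in the sequence), to find pSS, pSW, pWS, and pWW.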

pSS
= P(current base is strong given that previous base is strong)
= P(current base is strong | previous base is strong)
= P(current base is strong and previous base is strong) / P(previous base is strong)
= P(CC or CG or GC or GG) / P(Strong)
= [ P(CC) + P(CG) + P(GC) + P(GG) ] / P(Strong)
= [(76 / 999) + (26 / 999) + (12 / 999) + (38/999)] / 0.41
= 0.3711

pSW
= P(current base is weak given that previous base is strong)
= P(current base is weak | previous base is strong)
= P(current base is weak and previous base is strong) / P(previous base is strong)
= P(CA or CT or GA or GT) / P(Strong)
= [ P(CA) + P(CT) + P(GA) + P(GT) ] / P(Strong)
= [(86 / 999) + (31 / 999) + (92 / 999) + (49 / 999)] / 0.41
= 0.6299

pWS
= P(current base is strong given that previous base is weak)
= P(current base is strong | previous base is weak)
= P(current base is strong and previous base is weak) / P(previous base is weak)
= P(AC or AG or TC or TG) / P(Weak)
= [ P(AC) + P(AG) + P(TC) + P(TG) ] / P(Weak)
= [(74 / 999) + (19 / 999) + (56 / 999) + (108/999)] / 0.59
= 0.4360

pWW
= P(current base is weak given that previous base is weak)
= P(current base is weak | previous base is weak)
= P(current base is weak and previous base is weak) / P(previous base is weak)
= P(AA or AT or TA or TT) / P(Weak)
= [ P(AA) + P(AT) + P(TA) + P(TT) ] / P(Weak)
= [(40 / 999) + (113 / 999) + (28 / 999) + (151 / 999)] / 0.59
= 0.5633

Hence, the transition probability matrix is

   P  =   ( 0.3711   0.6299 )
          ( 0.4360   0.5633 ).
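
The four estimates can be reproduced directly from the count table. The sketch below follows the same calculation as the solution (joint dinucleotide frequency divided by the marginal class frequency); the helper `sw` and the dictionary names are illustrative assumptions.

```python
# Nucleotide and dinucleotide counts from the table in the problem.
base_counts = {"A": 246, "C": 219, "G": 191, "T": 344}
pair_counts = {
    "AA": 40, "AC": 74, "AG": 19, "AT": 113,
    "CA": 86, "CC": 76, "CG": 26, "CT": 31,
    "GA": 92, "GC": 12, "GG": 38, "GT": 49,
    "TA": 28, "TC": 56, "TG": 108, "TT": 151,
}

def sw(base):
    """Strong (S) for C or G, weak (W) for A or T."""
    return "S" if base in "CG" else "W"

n_bases = sum(base_counts.values())  # 1000
n_pairs = sum(pair_counts.values())  # 999

# Marginal class probabilities P(Strong) and P(Weak).
p_class = {"S": 0.0, "W": 0.0}
for base, count in base_counts.items():
    p_class[sw(base)] += count / n_bases

# Joint probabilities of (previous class, current class).
p_joint = {"SS": 0.0, "SW": 0.0, "WS": 0.0, "WW": 0.0}
for pair, count in pair_counts.items():
    p_joint[sw(pair[0]) + sw(pair[1])] += count / n_pairs

# Conditional probabilities: p_XY = P(previous X and current Y) / P(previous X).
p_trans = {key: p_joint[key] / p_class[key[0]] for key in p_joint}

for key in ("SS", "SW", "WS", "WW"):
    print(key, round(p_trans[key], 4))
```

With this method the rows sum to roughly 1.001 and 0.9993 rather than exactly 1, because the marginals are based on 1000 bases while the joint frequencies are based on 999 dinucleotides. An alternative that yields rows summing exactly to 1 is to divide each class-pair count by its row total (410 for S, 589 for W).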


Part III [5 points] Select the best answer and write the letter in the provided space.

  1. The process of making an RNA copy of DNA is called
      A. Transcription
      B. Translation
      C. Moderation
      D. Replication
      E. Gobilization
  2. The process of reading an amino acid sequence from an RNA molecule is called
      A. Transcription
      B. Translation
      C. Repudiation
      D. Replication
      E. Cross-market capitalization
  3. A protein molecule is made up of
      A. nucleotide bases
      B. A, C, G, and U
      C. chromosomes
      D. cells
      E. amino acids
  4. The base guanine is always paired with
      A. Adenine
      B. Guanine
      C. Cytosine
      D. Thymine
      E. Guanine is never paired with another base in a molecule of DNA
  5. Which of the DNA sequences below is a palindrome?
      A. TAC
      B. TCTCTCT
      C. AAAAAAAAAA
      D. ACGT
      E. MASDP


Part IV Extra Credit Problems [1 point extra credit]

  1. Hershey and Chase differentiated between DNA and protein by:
      A. labeling the DNA with phosphorus-32, proteins with sulfur-35
      B. labeling the DNA with sulfur-35, proteins with phosphorus-32
      C. labeling the DNA with cesium, proteins with chloride
      D. labeling the DNA with carbon-14, proteins with hydrogen-3
  2. In Avery's experiment, the ability of an extract of heat-killed, smooth, disease-causing bacteria to transform rough, non-disease-causing bacteria was blocked by treatment with
      A. proteinase
      B. DNase
      C. RNase
      D. Jerry Springer
      E. Calcitonin