Name: ______________________________
E-mail: ______________________________
Part I. [9 points] Probability models for DNA

The tetrahedral die model

A random nucleotide sequence is generated independently according to the probability distribution
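The Markov chain model

A random nucleotide sequence is generated by a Markov chain with transition probability matrix

P (rows: current base; columns: next base)

        A     C     G     T
  A    0.2   0.3   0.1   0.4
  C    0.4   0.3   0.1   0.2
  G    0.5   0.1   0.2   0.2
  T    0.1   0.2   0.3   0.4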
If the probability distribution of the first base is

Solution: The probability of getting an A in the third base is the same as the probability of getting an A on the second transition, so we need to find f2(A). We know that the probability distribution of X0 is f0(A) = 0.25, f0(C) = 0.2, f0(G) = 0.2, f0(T) = 0.35.
To determine the probability distribution of X1, first find

P(x1=A) = P(x1=A | x0=A) * P(x0=A) + P(x1=A | x0=C) * P(x0=C) + P(x1=A | x0=G) * P(x0=G) + P(x1=A | x0=T) * P(x0=T)
        = (0.2) * (0.25) + (0.4) * (0.2) + (0.5) * (0.2) + (0.1) * (0.35)
        = 0.265

f1(C), f1(G), and f1(T) are found similarly to get the probability distribution of X1: f1(A) = 0.265, f1(C) = 0.225, f1(G) = 0.19, f1(T) = 0.32.
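The update used in this step is a row vector multiplied by the transition matrix, so it can be checked numerically. The following is a minimal Python sketch, not part of the original solution. The matrix P is read off the coefficients that appear in the solution's equations, and `step` and `stationary` are hypothetical helper names. Applying `step` twice gives the distribution of the third base. Repeating it many times (power iteration, used here in place of solving the linear system or finding the eigenvector) approximates the stationary distribution that the solution goes on to discuss.

```python
BASES = "ACGT"

# P[i][j] = P(next base = BASES[j] | current base = BASES[i]),
# taken from the coefficients used in the solution.
P = [
    [0.2, 0.3, 0.1, 0.4],  # from A
    [0.4, 0.3, 0.1, 0.2],  # from C
    [0.5, 0.1, 0.2, 0.2],  # from G
    [0.1, 0.2, 0.3, 0.4],  # from T
]
f0 = [0.25, 0.2, 0.2, 0.35]  # distribution of X0 over A, C, G, T


def step(f, P):
    """Return the next-step distribution f * P (row vector times matrix)."""
    n = len(f)
    return [sum(f[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, iterations=200):
    """Approximate the stationary distribution by power iteration.

    The iteration count is an arbitrary choice for this sketch.
    """
    f = [1.0 / len(P)] * len(P)  # start from the uniform distribution
    for _ in range(iterations):
        f = step(f, P)
    return f


f1 = step(f0, P)      # distribution of X1
f2 = step(f1, P)      # distribution of X2 (the third base)
pi = stationary(P)    # approximate stationary distribution

for name, dist in (("f1", f1), ("f2", f2), ("stationary", pi)):
    print(name, {b: round(p, 4) for b, p in zip(BASES, dist)})
```

With these inputs, f1(A) and f2(A) come out as 0.265 and 0.27, matching the hand computation, and the power iteration settles near (0.269, 0.232, 0.182, 0.317) for (A, C, G, T).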
Now, to find f2(A), compute

P(x2=A) = P(x2=A | x1=A) * P(x1=A) + P(x2=A | x1=C) * P(x1=C) + P(x2=A | x1=G) * P(x1=G) + P(x2=A | x1=T) * P(x1=T)
        = (0.2) * (0.265) + (0.4) * (0.225) + (0.5) * (0.19) + (0.1) * (0.32)
        = 0.27

One way to find the stationary distribution is to use the fact that, once the stationary distribution is reached, the following equation holds:

(f(A), f(C), f(G), f(T)) * P = (f(A), f(C), f(G), f(T))

From this equation and the matrix P above, we get the following system of equations:

f(A) = 0.2 * f(A) + 0.4 * f(C) + 0.5 * f(G) + 0.1 * f(T)
f(C) = 0.3 * f(A) + 0.3 * f(C) + 0.1 * f(G) + 0.2 * f(T)
f(G) = 0.1 * f(A) + 0.1 * f(C) + 0.2 * f(G) + 0.3 * f(T)
f(T) = 0.4 * f(A) + 0.2 * f(C) + 0.2 * f(G) + 0.4 * f(T)

The system to solve consists of any 3 of the above 4 equations along with the equation

1 = f(A) + f(C) + f(G) + f(T)

Another way to find the stationary distribution is to find the eigenvector with eigenvalue 1 and normalize it to make it a probability distribution.

Part II [6 points] Statistical Inference

Hypothesis Test

A DNA sequence has the following nucleotide and dinucleotide counts:
Suppose the nucleotide bases are generated independently. Test the hypothesis that the base probability distribution is f(A) = 0.25, f(C) = 0.2, f(G) = 0.2, f(T) = 0.35.
[Either the Pearson statistic

X^2 = sum(i in {A,C,G,T}) [(O_i - E_i)^2 / E_i]

or the likelihood ratio statistic

G^2 = 2 * sum(i in {A,C,G,T}) [O_i * log(O_i / E_i)]

may be used. The critical value X^2_0.05 with 3 degrees of freedom is 7.815.]

Solution:

H0: f(A) = 0.25, f(C) = 0.2, f(G) = 0.2, and f(T) = 0.35
H1: Either f(A) does not equal 0.25, or f(C) does not equal 0.2, or f(G) does not equal 0.2, or f(T) does not equal 0.35

Using the Pearson statistic, with expected counts E_i = 1000 * f(i), we have

X^2 = sum(i in {A,C,G,T}) [(O_i - E_i)^2 / E_i]
    = [(246 - 250)^2 / 250] + [(219 - 200)^2 / 200] + [(191 - 200)^2 / 200] + [(344 - 350)^2 / 350]
    = 2.3769

Since 2.3769 < 7.815, we fail to reject H0: the data do not provide sufficient evidence that the base distribution differs from the one stated in H0.
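Using the likelihood ratio, we have

G^2 = 2 * [246 * log(246/250) + 219 * log(219/200) + 191 * log(191/200) + 344 * log(344/350)]
    = 2.33 (approximately, using the natural logarithm)

Since 2.33 < 7.815, we again fail to reject H0.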
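Both test statistics can be checked with a few lines of Python. This is a sketch under stated assumptions rather than part of the original solution: the observed counts and the H0 probabilities are the ones quoted in the solution, the logarithm is taken to be natural, and the variable names are chosen only for illustration.

```python
import math

# Observed nucleotide counts and the hypothesized distribution (from the solution).
observed = {"A": 246, "C": 219, "G": 191, "T": 344}
h0 = {"A": 0.25, "C": 0.2, "G": 0.2, "T": 0.35}
CRITICAL_VALUE = 7.815  # chi-square, 3 degrees of freedom, 0.05 level (from the problem)

n = sum(observed.values())
expected = {base: n * p for base, p in h0.items()}

# Pearson statistic: sum of (O - E)^2 / E over the four bases.
pearson = sum((observed[b] - expected[b]) ** 2 / expected[b] for b in observed)

# Likelihood ratio statistic: 2 * sum of O * ln(O / E) (natural log assumed).
g2 = 2 * sum(observed[b] * math.log(observed[b] / expected[b]) for b in observed)

print(f"X^2 = {pearson:.4f}, reject H0: {pearson > CRITICAL_VALUE}")
print(f"G^2 = {g2:.4f}, reject H0: {g2 > CRITICAL_VALUE}")
```

The two statistics are close to each other here, as expected when every observed count is near its expected value, and neither exceeds the critical value.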
Solution: First we need to find

P(Strong) = (191 + 219) / 1000 = 0.41
P(Weak) = (246 + 344) / 1000 = 0.59

Use these probabilities to find pSS, pSW, pWS, and pWW.

pSS = P(current base is strong | previous base is strong)
    = P(current base is strong and previous base is strong) / P(previous base is strong)
    = P(CC or CG or GC or GG) / P(Strong)
    = [P(CC) + P(CG) + P(GC) + P(GG)] / P(Strong)
    = [(76 / 999) + (26 / 999) + (12 / 999) + (38 / 999)] / 0.41
    = 0.3711

pSW = P(current base is weak | previous base is strong)
    = P(current base is weak and previous base is strong) / P(previous base is strong)
    = P(CA or CT or GA or GT) / P(Strong)
    = [P(CA) + P(CT) + P(GA) + P(GT)] / P(Strong)
    = [(n(CA) + n(CT) + n(GA) + n(GT)) / 999] / 0.41
    = (258 / 999) / 0.41
    = 0.6299

pWS = P(current base is strong | previous base is weak)
    = P(current base is strong and previous base is weak) / P(previous base is weak)
    = P(AC or AG or TC or TG) / P(Weak)
    = [P(AC) + P(AG) + P(TC) + P(TG)] / P(Weak)
    = [(74 / 999) + (19 / 999) + (56 / 999) + (108 / 999)] / 0.59
    = 0.4360

pWW = P(current base is weak | previous base is weak)
    = P(current base is weak and previous base is weak) / P(previous base is weak)
    = P(AA or AT or TA or TT) / P(Weak)
    = [P(AA) + P(AT) + P(TA) + P(TT)] / P(Weak)
    = [(40 / 999) + (113 / 999) + (28 / 999) + (151 / 999)] / 0.59
    = 0.5633

Hence, the transition probability matrix is

                          current S   current W
  previous S (C or G)      0.3711      0.6299
  previous W (A or T)      0.4360      0.5633
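The four estimates can be reproduced with a short Python sketch, offered here as a check rather than as part of the original solution. The strong-to-strong, weak-to-strong, and weak-to-weak totals are sums of the dinucleotide counts quoted in the solution. The strong-to-weak total of 258 is an assumption, inferred from the reported pSW = 0.6299 and from the 999 pairs in total. All names are illustrative.

```python
# Nucleotide counts from the solution; C and G are the strong bases.
nucleotide = {"A": 246, "C": 219, "G": 191, "T": 344}
STRONG = {"C", "G"}

# Dinucleotide counts grouped by (previous class, current class).
# SW = 258 is an assumption inferred from the reported pSW and the 999 pairs.
pair_counts = {
    ("S", "S"): 76 + 26 + 12 + 38,
    ("S", "W"): 258,
    ("W", "S"): 74 + 19 + 56 + 108,
    ("W", "W"): 40 + 113 + 28 + 151,
}

n_bases = sum(nucleotide.values())
n_pairs = sum(pair_counts.values())

# Marginal class probabilities P(Strong) and P(Weak), as in the solution.
p_class = {"S": 0.0, "W": 0.0}
for base, count in nucleotide.items():
    p_class["S" if base in STRONG else "W"] += count / n_bases

# P(current | previous) = P(previous and current) / P(previous)
transition = {
    (prev, cur): (count / n_pairs) / p_class[prev]
    for (prev, cur), count in pair_counts.items()
}

for (prev, cur), p in transition.items():
    print(f"p{prev}{cur} = {p:.4f}")
```

Because the numerators come from 999 pairs while the denominators come from 1000 single-base counts, each row of the resulting matrix sums to 1 only approximately.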
Part III [5 points] Select the best answer and write the letter in the provided space.
Part IV Extra Credit Problems [1 point extra credit]