STA 4953 (Spring 2001) Test I

Name:

E-mail:

STA 4953 Test I Solutions: Introduction to Bioinformatics, February 22, 2001, 3:30-4:45 p.m.

Part I. [9 points] Probability models for DNA

The tetrahedral die model
A random nucleotide sequence is generated independently according to the probability distribution (f(A), f(C), f(G), f(T)) = (0.25, 0.2, 0.2, 0.35). When a quadruplet of four consecutive nucleotide bases from this sequence is observed, what is the probability that

it contains at least one A?

Solution:

P(contains at least one A)

= 1 - P(contains no A's in four independent rolls of the tetrahedral die)

= 1 - [1 - P(A)]⁴, since P(no A's in one roll of the die) = 1 - P(A) and the rolls are independent, so P(no A's in four rolls of the die) = [1 - P(A)]⁴

= 1 - (1 - 0.25)⁴ = 1 - 0.3164 = 0.6836

all four bases are identical?

Solution:
P(all four bases are identical)
= P(all four bases are A's or all four bases are C's or all four bases are G's or all four bases are T's)
= P(all four bases are A's) + P(all four bases are C's) + P(all four bases are G's) + P(all four bases are T's), because the four events are mutually exclusive
=(0.25)⁴ + (0.2)⁴ + (0.2)⁴ + (0.35)⁴
= 0.0221
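
The two tetrahedral-die answers above can be checked numerically. The following is a minimal sketch in Python; the variable names are illustrative and not part of the original solution.

```python
# Base probabilities from the problem statement.
base_probs = {"A": 0.25, "C": 0.2, "G": 0.2, "T": 0.35}
n = 4  # length of the quadruplet

# P(at least one A) = 1 - P(no A in n independent draws).
p_at_least_one_a = 1 - (1 - base_probs["A"]) ** n

# P(all n bases identical) = sum over the four bases of p ** n,
# since "all A", "all C", "all G", "all T" are mutually exclusive.
p_all_identical = sum(p ** n for p in base_probs.values())

print(round(p_at_least_one_a, 4))  # 0.6836
print(round(p_all_identical, 4))   # 0.0221
```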

The Markov chain model
A random nucleotide sequence is generated by a Markov chain with transition probability matrix

P = (rows: current base; columns: next base)

        A     C     G     T
  A   0.2   0.3   0.1   0.4
  C   0.4   0.3   0.1   0.2
  G   0.5   0.1   0.2   0.2
  T   0.1   0.2   0.3   0.4

If the probability distribution of the first base is (f(A), f(C), f(G), f(T)) = (0.25, 0.2, 0.2, 0.35), what is the probability of getting an A in the third base? Describe one way to find the stationary distribution of this Markov chain.

Solution:
Label the first base X₀, with distribution f₀. The third base is then X₂, which is reached after two transitions, so we need to find f₂(A).
Since (f₁(A), f₁(C), f₁(G), f₁(T)) * P = (f₂(A), f₂(C), f₂(G), f₂(T)), we must first find the probability distribution of X₁.
We know that the probability distribution of X₀ is

x    f₀(x)
A    0.25
C    0.2
G    0.2
T    0.35

and (f₀(A), f₀(C), f₀(G), f₀(T))* P = (f₁(A), f₁(C), f₁(G), f₁(T))
To determine the probability distribution of X₁, you must find f₁(A), f₁(C), f₁(G), and f₁(T). To find f₁(A), condition on the value of X₀:
P(x₁=A) = P(x₁=A | x₀=A) * P(x₀=A) + P(x₁=A | x₀=C) * P(x₀=C) + P(x₁=A | x₀=G) * P(x₀=G) + P(x₁=A | x₀=T) * P(x₀=T)
= (0.2)*(0.25) + (0.4) * (0.2) + (0.5) * (0.2) + (0.1) * (0.35)
= 0.265

f₁(C), f₁(G), and f₁(T) are found similarly to get the probability distribution of X₁:

x    f₁(x)
A    0.265
C    0.225
G    0.19
T    0.32

Now to find f₂(A) solve
P(x₂=A) = P(x₂=A | x₁=A) * P(x₁=A) + P(x₂=A | x₁=C) * P(x₁=C) + P(x₂=A | x₁=G) * P(x₁=G) + P(x₂=A | x₁=T) * P(x₁=T)
= (0.2)*(0.265) + (0.4) * (0.225) + (0.5) * (0.19) + (0.1) * (0.32)
= 0.27
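
The two-step calculation above is a row vector multiplied by P twice. A minimal Python sketch follows; the helper name `step` is hypothetical, and the matrix is stored as nested lists with rows for the current base and columns for the next base.

```python
BASES = ["A", "C", "G", "T"]

# Transition matrix from the problem: P[i][j] = P(next = BASES[j] | current = BASES[i]).
P = [
    [0.2, 0.3, 0.1, 0.4],
    [0.4, 0.3, 0.1, 0.2],
    [0.5, 0.1, 0.2, 0.2],
    [0.1, 0.2, 0.3, 0.4],
]
f0 = [0.25, 0.2, 0.2, 0.35]  # distribution of the first base, X0


def step(dist, matrix):
    """Return the row vector dist * matrix (one transition of the chain)."""
    return [
        sum(dist[i] * matrix[i][j] for i in range(len(dist)))
        for j in range(len(matrix[0]))
    ]


f1 = step(f0, P)  # distribution of the second base, X1
f2 = step(f1, P)  # distribution of the third base, X2

print([round(x, 4) for x in f1])       # [0.265, 0.225, 0.19, 0.32]
print(round(f2[BASES.index("A")], 4))  # 0.27
```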

One way to find the stationary distribution is to use the fact that once the stationary distribution is reached, the following equation is true:
(f(A), f(C), f(G), f(T)) * P = (f(A), f(C), f(G), f(T)).
From this equation and the above matrix, P, we can get the following system of equations:
f(A) = 0.2 * f(A) + 0.4 * f(C) + 0.5 * f(G) + 0.1 * f(T)
f(C) = 0.3 * f(A) + 0.3 * f(C) + 0.1 * f(G) + 0.2 * f(T)
f(G) = 0.1 * f(A) + 0.1 * f(C) + 0.2 * f(G) + 0.3 * f(T)
f(T) = 0.4 * f(A) + 0.2 * f(C) + 0.2 * f(G) + 0.4 * f(T)

These four equations are linearly dependent, so the system to solve consists of any 3 of the above 4 equations along with the normalization equation
1 = f(A) + f(C) + f(G) + f(T)

Another way to find the stationary distribution is to find the left eigenvector of P with eigenvalue 1 (equivalently, an eigenvector of the transpose of P) and normalize it so that its entries sum to 1, making it a probability distribution.
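
The linear-system approach can be carried out numerically. The sketch below is one possible way to do it, not part of the original solution: it takes the balance equations for A, C, and G plus the normalization equation and solves them exactly with Gauss-Jordan elimination on fractions. The helper `solve_linear_system` is a hypothetical name introduced here for illustration.

```python
from fractions import Fraction

# Transition matrix from the problem; rows = current base, columns = next base.
P = [[Fraction(s) for s in row] for row in (
    ("0.2", "0.3", "0.1", "0.4"),
    ("0.4", "0.3", "0.1", "0.2"),
    ("0.5", "0.1", "0.2", "0.2"),
    ("0.1", "0.2", "0.3", "0.4"),
)]
n = len(P)


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination (hypothetical helper)."""
    size = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(size):
        # Swap a row with a nonzero entry in this column into the pivot position.
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot equals 1.
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[r][size] for r in range(size)]


# Balance equations for A, C, G: sum_i f(i) * P[i][j] - f(j) = 0.
a = [[P[i][j] - (1 if i == j else 0) for i in range(n)] for j in range(n - 1)]
b = [Fraction(0)] * (n - 1)
# Normalization equation: f(A) + f(C) + f(G) + f(T) = 1.
a.append([Fraction(1)] * n)
b.append(Fraction(1))

stationary = solve_linear_system(a, b)
print([float(x) for x in stationary])
```

Exact fractions are used so that the check f * P = f holds without rounding error; floating-point arithmetic would work equally well for a numerical answer.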

Part II [6 points] Statistical Inference

Hypothesis Test

A DNA sequence has the following nucleotide and dinucleotide counts:

A	246	AA	40	AC	74	AG	19	AT	113
C	219	CA	86	CC	76	CG	26	CT	31
G	191	GA	92	GC	12	GG	38	GT	49
T	344	TA	28	TC	56	TG	108	TT	151

Suppose the nucleotide bases are generated independently. Test the hypothesis that the base probability distribution is (f(A), f(C), f(G), f(T)) = (0.25, 0.2, 0.2, 0.35), using the Pearson statistic

X² = sum(i={A,C,G,T}) [(O_i - E_i)² / E_i]

and the likelihood ratio statistic

G² = 2 * sum(i={A,C,G,T}) [O_i * log(O_i / E_i)]

Use the 0.05 significance level.

Solution:

H₀: f(A) = 0.25, f(C) = 0.2, f(G) = 0.2, and f(T) = 0.35
H₁: At least one of these equalities does not hold

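Using the Pearson statistic, with O_i the observed count of base i and E_i = 1000 * f(i) its expected count under H₀, so that (E_A, E_C, E_G, E_T) = (250, 200, 200, 350), we have
X² = sum(i={A,C,G,T}) [(O_i - E_i)² / E_i]
= (246 - 250)²/250 + (219 - 200)²/200 + (191 - 200)²/200 + (344 - 350)²/350
= 0.0640 + 1.8050 + 0.4050 + 0.1029
= 2.3769

The critical value is 7.815, the upper 0.05 point of the chi-square distribution with 4 - 1 = 3 degrees of freedom. Since 2.3769 < 7.815, we fail to reject H₀: the data do not provide sufficient evidence against H₀. (Failing to reject H₀ is not evidence that H₀ is true.)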

Using the likelihood ratio statistic (log denotes the natural logarithm), we have
G² = 2 * sum(i={A,C,G,T}) [O_i * log(O_i / E_i)]
= 2 * {246 * log(246/250) + 219 * log(219/200) + 191 * log(191/200) + 344 * log(344/350)}
= 2.3294

Since 2.3294 < 7.815, we fail to reject H₀: the data do not provide sufficient evidence against H₀.
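
Both test statistics can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the critical value 7.815 is taken from the solution above rather than computed, and the variable names are illustrative.

```python
import math

# Observed base counts and hypothesized probabilities from the problem.
observed = {"A": 246, "C": 219, "G": 191, "T": 344}
null_probs = {"A": 0.25, "C": 0.2, "G": 0.2, "T": 0.35}

n = sum(observed.values())  # 1000 bases
expected = {base: n * p for base, p in null_probs.items()}

# Pearson chi-square statistic.
x2 = sum((observed[b] - expected[b]) ** 2 / expected[b] for b in observed)

# Likelihood ratio statistic (natural logarithm).
g2 = 2 * sum(observed[b] * math.log(observed[b] / expected[b]) for b in observed)

# Chi-square critical value, 3 degrees of freedom, 0.05 level (quoted in the solution).
CRITICAL_VALUE = 7.815

print(round(x2, 4), round(g2, 4))                # 2.3769 2.3294
print(x2 < CRITICAL_VALUE, g2 < CRITICAL_VALUE)  # True True
```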

Estimation
Suppose the Strong/Weak (S/W) classification for this DNA sequence conforms to a Markov chain. Estimate the transition probability matrix

P =
        S       W
  S   p_SS    p_SW
  W   p_WS    p_WW

Solution:
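First we need the marginal probabilities of the two classes. The strong bases are C and G; the weak bases are A and T.

P(Strong) = (191 + 219) / 1000 = 0.41
and
P(Weak) = (246 + 344) / 1000 = 0.59

Use these probabilities to find p_SS, p_SW, p_WS, and p_WW. The joint probabilities are estimated from the 999 dinucleotide counts.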

p_SS
= P(current base is strong given that previous base is strong)
= P(current base is strong | previous base is strong)
= P(current base is strong and previous base is strong) / P(previous base is strong)
= P(CC or CG or GC or GG) / P(Strong)
= [P(CC) + P(CG) + P(GC) + P(GG)] / P(Strong)
= [(76/999) + (26/999) + (12/999) + (38/999)] / 0.41
= 0.3711

p_SW
= P(current base is weak given that previous base is strong)
= P(current base is weak | previous base is strong)
= P(current base is weak and previous base is strong) / P(previous base is strong)
= P(CA or CT or GA or GT) / P(Strong)
= [P(CA) + P(CT) + P(GA) + P(GT)] / P(Strong)
= [(86/999) + (31/999) + (92/999) + (49/999)] / 0.41
= 0.6299

p_WS
= P(current base is strong given that previous base is weak)
= P(current base is strong | previous base is weak)
= P(current base is strong and previous base is weak) / P(previous base is weak)
= P(AC or AG or TC or TG) / P(Weak)
= [P(AC) + P(AG) + P(TC) + P(TG)] / P(Weak)
= [(74/999) + (19/999) + (56/999) + (108/999)] / 0.59
= 0.4360

p_WW
= P(current base is weak given that previous base is weak)
= P(current base is weak | previous base is weak)
= P(current base is weak and previous base is weak) / P(previous base is weak)
= P(AA or AT or TA or TT) / P(Weak)
= [P(AA) + P(AT) + P(TA) + P(TT)] / P(Weak)
= [(40/999) + (113/999) + (28/999) + (151/999)] / 0.59
= 0.5633

Hence, the transition probability matrix is

P =
        S        W
  S   0.3711   0.6299
  W   0.4360   0.5633
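
The estimation steps can be sketched in Python as follows. The code mirrors the method used in this solution (joint frequencies from the 999 dinucleotides divided by marginal frequencies from the 1000 bases); the names `CLASS_OF`, `class_counts`, and `joint_counts` are illustrative.

```python
base_counts = {"A": 246, "C": 219, "G": 191, "T": 344}
pair_counts = {
    "AA": 40, "AC": 74, "AG": 19, "AT": 113,
    "CA": 86, "CC": 76, "CG": 26, "CT": 31,
    "GA": 92, "GC": 12, "GG": 38, "GT": 49,
    "TA": 28, "TC": 56, "TG": 108, "TT": 151,
}

# Strong (S) bases are C and G; weak (W) bases are A and T.
CLASS_OF = {"A": "W", "T": "W", "C": "S", "G": "S"}

n_bases = sum(base_counts.values())  # 1000
n_pairs = sum(pair_counts.values())  # 999

# Pool base counts by class.
class_counts = {"S": 0, "W": 0}
for base, count in base_counts.items():
    class_counts[CLASS_OF[base]] += count

# Pool dinucleotide counts by (previous class, current class).
joint_counts = {(prev, cur): 0 for prev in "SW" for cur in "SW"}
for pair, count in pair_counts.items():
    joint_counts[(CLASS_OF[pair[0]], CLASS_OF[pair[1]])] += count

# p_xy = P(previous = x and current = y) / P(previous = x).
transition = {
    key: (joint_counts[key] / n_pairs) / (class_counts[key[0]] / n_bases)
    for key in joint_counts
}

for prev in "SW":
    print(prev, [round(transition[(prev, cur)], 4) for cur in "SW"])
# S [0.3711, 0.6299]
# W [0.436, 0.5633]
```

Because the joint frequencies are divided by 999 pairs while the marginal frequencies are divided by 1000 bases, each row sums to approximately 1 rather than exactly 1.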

Part III [5 points] Select the best answer and write the letter in the provided space.

The process of making an RNA copy of DNA is called
The process of reading an amino acid sequence from an RNA molecule is called
A protein molecule is made up of
The base guanine is always paired with
Which of the DNA sequences below is a palindrome?

Part IV Extra Credit Problems [1 point extra credit]

Hershey and Chase differentiated between DNA and protein by:
In Avery's experiment, the ability of an extract of heat-killed, smooth, disease-causing bacteria to transform rough, non-disease-causing bacteria was blocked by treatment with