STA 4953 Final Projects
Introduction to Bioinformatics
Due Friday, May 11, 2001 by 1:15 p.m.
Gather yourselves into groups of four. Each group must contain at least one
Computer Science major and one Mathematics major, and preferably at least one
individual who has completed the STA 3523 class.
Each group will choose one of the following two projects to work on. Each
student will deliver an oral progress report on the part of the project under
his/her responsibility on Tues. 4/24 (Project A) or Thurs. 4/26 (Project B).
The group will prepare a written report together, to be turned in by 1:15 p.m.
on 5/11.
Project A. Viral Genome Sequence Composition
1. Write a short essay (no more than 1 double-spaced page) to introduce the
family of adenoviruses. What is an adenovirus made up of? What medical
conditions are associated with adenoviruses? Are there any existing remedies
for serious medical conditions caused by adenoviruses?
2. The accession numbers of five complete adenovirus genome sequences are given
below. You can access their GenBank records by following the links in the
Bioinformatics homepage (-> Genome projects -> Double Stranded Viruses). Fill
in the appropriate information:
Accession | Host | Length
NC_001405 |      |
NC_001784 |      |
NC_001813 |      |
NC_001958 |      |
NC_002501 |      |
3. Obtain the nucleotide, dinucleotide and trinucleotide counts for each of the
above viruses. Display your counts in the form of tables and bar charts. Are
there significant variations in nucleotide, dinucleotide, and trinucleotide
compositions among the five adenoviruses?
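The counting step in question 3 can be sketched as follows. This is only a
minimal illustration, assuming that the genome has already been read into a
plain string and that overlapping windows are wanted; the function names
kmer_counts and print_table are hypothetical, and the toy sequence is not one
of the adenovirus genomes.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping windows with symbols other than A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

def print_table(counts):
    """Print each k-mer in alphabetical order with its count and relative frequency."""
    total = sum(counts.values())
    for kmer in sorted(counts):
        print(f"{kmer}\t{counts[kmer]}\t{counts[kmer] / total:.3f}")

toy = "ACGTACGTAC"  # toy sequence for illustration only
for k in (1, 2, 3):  # nucleotides, dinucleotides, trinucleotides
    print_table(kmer_counts(toy, k))
```

Whether to count overlapping or non-overlapping windows is a modeling choice;
the sketch uses overlapping windows, so a sequence of length n yields n - k + 1
windows of size k.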
4. (a) If each of these viral genomes is to be modeled as a sequence of
independent and identically distributed (i.i.d.) random variables (i.e., the
rolling-die model), what are the parameters of the model and what are their
maximum likelihood estimates?
(b) Repeat part (a) with a Markov chain model.
(c) Do the data fit model (a) or model (b) better? Explain in detail how you
assess the goodness of fit of the models.
(d) Do you think that the better model in (c) is a sufficiently good model for
these viral DNA sequences? If not, what improvement would you suggest?
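One possible way to set up parts (a) to (c) is sketched below. It assumes a
first-order Markov chain for part (b), a sequence containing only A, C, G and
T, and a likelihood ratio comparison for part (c); none of these choices is
prescribed by the assignment, and all function names are hypothetical. Under
the i.i.d. model the maximum likelihood estimates are the base frequencies;
under the first-order Markov model they are the transition counts divided by
the row totals.

```python
import math
from collections import Counter

def iid_mle(seq):
    """MLE of the base probabilities: relative frequency of each base."""
    counts = Counter(seq)
    n = sum(counts.values())
    return {b: counts[b] / n for b in "ACGT"}

def markov_mle(seq):
    """MLE of first-order transition probabilities: n(a,b) / n(a)."""
    pairs = Counter(zip(seq, seq[1:]))
    rows = Counter(seq[:-1])
    return {a: {b: pairs[(a, b)] / rows[a] for b in "ACGT"}
            for a in "ACGT" if rows[a] > 0}

def iid_loglik(seq, p):
    """Log-likelihood of bases 2..n, so it is comparable with the Markov one."""
    return sum(math.log(p[b]) for b in seq[1:])

def markov_loglik(seq, P):
    """Log-likelihood of bases 2..n given the first base."""
    return sum(math.log(P[a][b]) for a, b in zip(seq, seq[1:]))

toy = "ACGTACGTAC"  # toy sequence for illustration only
p = iid_mle(toy)
P = markov_mle(toy)
# Approximate likelihood ratio statistic; larger values favor the Markov model.
lr = 2 * (markov_loglik(toy, P) - iid_loglik(toy, p))
print(p, lr)
```

For long sequences, a statistic of this kind is commonly compared with a
chi-square distribution with (4 - 1)^2 = 9 degrees of freedom, but that
approximation and the choice of test should be justified in your report.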
5. Pick any one of the adenoviruses. Make an overall amino acid count and codon
count over all the proposed coding regions (CDS) in the GenBank record. Display
your results in the form of tables and bar charts. For those amino acids coded
by multiple codons, test whether their synonymous codons are used with equal
frequency.
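The equal-usage test in question 5 could, for example, be set up as a
chi-square goodness-of-fit test per amino acid. The sketch below assumes the
standard genetic code, a coding sequence that is already in frame, and
hypothetical function names; it returns only the statistic and its degrees of
freedom, leaving the p-value to a chi-square table or a statistics package.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))

def codon_counts(cds):
    """Count non-overlapping codons, reading from the first base of the CDS."""
    cds = cds.upper()
    return Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))

def synonymous_chi_square(counts):
    """Return {amino acid: (chi-square statistic, degrees of freedom)}."""
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    results = {}
    for aa, codons in families.items():
        observed = [counts.get(c, 0) for c in codons]
        total = sum(observed)
        if len(codons) < 2 or total == 0:
            continue  # single-codon amino acids (M, W) or unobserved ones
        expected = total / len(codons)  # equal usage under the null hypothesis
        stat = sum((o - expected) ** 2 / expected for o in observed)
        results[aa] = (stat, len(codons) - 1)
    return results

toy_cds = "AAAAAAAAAAAG"  # toy CDS: AAA AAA AAA AAG, all lysine
print(synonymous_chi_square(codon_counts(toy_cds)))
```

The chi-square approximation is unreliable when the expected count per codon
is small, so rare amino acids may need to be treated with caution.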
Project B. Database Search
1. Perform a BLAST search with the following query sequence (this is the
nucleotide sequence around a certain gene of the fire ant) against a nucleotide
database:
ATAACATCGCGACTTTAAGCCTGAGAACATACTGCTGGACGAGCACGGTCACGTCAGGAT
ATCGGATCTGGGATTGGCATGTGATTTCTCAAAGAAGAAGCCGCACGCGAGCGTTGGCACA
CACGATATATCGCACCTGAGATCATTAA
From the BLAST result, can you get any idea of what gene this is?
2. Translate the above sequence into amino acids in all 6 possible reading
frames (3 on the direct strand, 3 on the complementary strand). Which amino
acid sequence is most likely to be the real one? You may use the GCG program
"translate" or write your own program.
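If you write your own program for question 2, the six-frame translation might
look like the sketch below. It assumes the standard genetic code and uses
hypothetical function names; the frame labels (+1 to +3 and -1 to -3) and the
"longest stretch without a stop codon" heuristic are illustrative conventions
rather than a prescribed method, and incomplete trailing codons are dropped.

```python
# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse, to read the opposite strand 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons from the first base; unknown codons become X."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames

def longest_open_stretch(protein):
    """Length of the longest run of amino acids uninterrupted by a stop."""
    return max(len(part) for part in protein.split("*"))

toy = "ATGAAATAG"  # toy sequence for illustration only
for name, protein in six_frames(toy).items():
    print(name, protein, longest_open_stretch(protein))
```

A long stretch without stop codons is only a heuristic for picking the likely
frame; the BLAST results in the later questions provide an independent check.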
3. Perform a BLAST search using the amino acid sequence you have selected from
question 2 as your query sequence against a protein database. Repeat the run
for all different choices of substitution matrices and gap penalty functions
provided by the NCBI BLAST server. Are your results affected by the different
choices? Select a list of sequences that are consistently within the top 30
hits of the BLAST searches.
4. From the BLAST result, can you get any idea of what the protein is? Does it
seem to agree with your finding in question 1? If not, which one are you more
inclined to believe? Why?
5. What organisms do the proteins in your selected list come from? Which
organism is biologically closest to the fire ant?
6. Take a protein sequence from the organism you have chosen and align it
against the fire ant protein using the GCG programs Gap and Bestfit. Explain
the differences between the results of Gap and Bestfit.
7. In the output of both the Gap and Bestfit programs, there are parameters
called "Quality" and "Quality Ratio". Find out the meaning of these parameters,
how they are calculated, and verify that they have been calculated correctly in
your outputs.