STA 4953 Final Projects

Introduction to Bioinformatics

Due Friday, May 11, 2001 by 1:15 p.m.

 

Form groups of four. Each group must contain at least one Computer Science major and one Mathematics major, and preferably at least one student who has completed STA 3523.

 

Each group will choose one of the following two projects to work on. Each student will deliver an oral progress report on his/her part of the project on Tues. 4/24 (Project A) or Thurs. 4/26 (Project B). The group will prepare a joint written report, to be turned in by 1:15 p.m. on 5/11.

 

Project A.  Viral Genome Sequence Composition

1.      Write a short essay (no more than one double-spaced page) introducing the adenovirus family. What is an adenovirus made up of? What medical conditions are associated with adenoviruses? Are there any existing remedies for the serious medical conditions caused by adenoviruses?

 

2.      The accession numbers of five complete adenovirus genome sequences are given below. You can access their GenBank records by following the links on the Bioinformatics homepage (-> Genome projects -> Double Stranded Viruses). Fill in the appropriate information in the table:

 

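Accession      Host              Length

NC_001405      ____________      ____________

NC_001784      ____________      ____________

NC_001813      ____________      ____________

NC_001958      ____________      ____________

NC_002501      ____________      ____________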

 

3.      Obtain the nucleotide, dinucleotide and trinucleotide counts for each of the above viruses. Display your counts in the form of tables and bar charts. Are there significant variations in nucleotide, dinucleotide, and trinucleotide compositions among the five adenoviruses?
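For groups that write their own counting program, the following is a minimal Python sketch of overlapping k-mer counting. The function name kmer_counts and the toy sequence are illustrative assumptions, not part of the assignment. The sketch assumes the genome has already been read into a single string and that words containing ambiguity codes (such as N) are to be skipped.

```python
from collections import Counter
from itertools import product

def kmer_counts(seq, k):
    """Count overlapping words of length k over the alphabet A, C, G, T."""
    seq = seq.upper()
    # Start every possible word at zero so unobserved words appear in tables.
    counts = Counter({"".join(p): 0 for p in product("ACGT", repeat=k)})
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing ambiguity codes such as N
            counts[word] += 1
    return counts

# Toy sequence for illustration only; not taken from any GenBank record.
toy = "ACGTACGTAA"
for k in (1, 2, 3):
    observed = {w: c for w, c in kmer_counts(toy, k).items() if c > 0}
    print(k, observed)
```

Counting overlapping words (a window that advances one base at a time) is one convention; whichever convention a group adopts should be stated in the report.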

 

4.      (a) If each of these viral genomes is to be modeled as a sequence of independent and identically distributed (i.i.d.) random variables (i.e., the rolling-die model), what are the parameters of the model and what are their maximum likelihood estimates?

(b) Repeat part (a) with a Markov chain model.

(c) Do the data fit model (a) or model (b) better? Explain in detail how you assess the goodness of fit of the models.

(d) Do you think that the better model in (c) is a sufficiently good model for these viral DNA sequences? If not, what improvements would you suggest?
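As a starting point for question 4, here is a minimal Python sketch under stated assumptions: the Markov chain is taken to be first order, the sequence contains only the letters A, C, G, T in upper case, and the function names (iid_mle, markov_mle, lr_statistic) are hypothetical. The estimates are the usual relative frequencies, n_b / n for the i.i.d. model and n_ab / n_a for the transitions. The likelihood ratio statistic shown is only one possible way to compare the two models.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def iid_mle(seq):
    """MLE of base probabilities under the i.i.d. model: p_b = n_b / n."""
    counts = Counter(seq)
    n = len(seq)
    return {b: counts[b] / n for b in ALPHABET}

def markov_mle(seq):
    """MLE of first-order transition probabilities: p_ab = n_ab / n_a."""
    pairs = Counter(zip(seq, seq[1:]))
    trans = {}
    for a in ALPHABET:
        row_total = sum(pairs[(a, b)] for b in ALPHABET)
        for b in ALPHABET:
            trans[(a, b)] = pairs[(a, b)] / row_total if row_total else 0.0
    return trans

def lr_statistic(seq):
    """2 * (logL_Markov - logL_iid), both conditional on the first base."""
    tail = seq[1:]
    p = iid_mle(tail)          # i.i.d. fit to the same positions the chain predicts
    trans = markov_mle(seq)
    ll_iid = sum(math.log(p[b]) for b in tail)
    ll_markov = sum(math.log(trans[(a, b)]) for a, b in zip(seq, tail))
    return 2.0 * (ll_markov - ll_iid)

# Toy sequence for illustration only; not taken from any GenBank record.
toy = "ACGTACGTAA"
print(iid_mle(toy))
print(markov_mle(toy)[("A", "C")])
print(lr_statistic(toy))
```

For a long sequence generated by the i.i.d. model, this statistic is approximately chi-square distributed with (4 - 1)^2 = 9 degrees of freedom, so a value far in the upper tail favors the Markov chain. A toy sequence of ten bases is far too short for that approximation to hold.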

5.      Pick any one of the adenoviruses. Make an overall amino acid count and codon count over all the proposed coding regions (CDS) listed in its GenBank record. Display your results in the form of tables and bar charts. For each amino acid coded by multiple codons, test whether its synonymous codons are used with equal frequency.
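One way to approach the equal-usage test is a chi-square goodness-of-fit statistic for each amino acid. The Python sketch below assumes the standard genetic code, a coding sequence already extracted in frame as a string, and hypothetical function names (codon_counts, synonymous_chi_square). It returns the statistic and its degrees of freedom, and leaves the comparison against a chi-square table to the reader.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_counts(cds):
    """Count in-frame codons of a coding sequence that starts at position 0."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    return Counter(c for c in codons if c in CODON_TABLE)

def synonymous_chi_square(counts):
    """Per amino acid: (chi-square statistic, degrees of freedom) for equal usage."""
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    results = {}
    for aa, codons in families.items():
        total = sum(counts[c] for c in codons)
        if len(codons) < 2 or total == 0:
            continue  # single-codon amino acids (M, W) or no observations
        expected = total / len(codons)
        stat = sum((counts[c] - expected) ** 2 / expected for c in codons)
        results[aa] = (stat, len(codons) - 1)
    return results

# Toy coding sequence for illustration only; not taken from any GenBank record.
toy_cds = "ATGGCTGCTGCCAAATAA"
print(codon_counts(toy_cds))
print(synonymous_chi_square(codon_counts(toy_cds)))
```

The chi-square approximation is unreliable when expected counts are small, as they are in this toy sequence. Pooling the counts over all CDS features of a genome should give much larger totals.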

 

 

Project B.  Database Search

 

1.      Perform a BLAST search with the following query sequence (this is the nucleotide sequence around a certain gene of the fire ant) against a nucleotide database:

 

ATAACATCGCGACTTTAAGCCTGAGAACATACTGCTGGACGAGCACGGTCACGTCAGGAT
ATCGGATCTGGGATTGGCATGTGATTTCTCAAAGAAGAAGCCGCACGCGAGCGTTGGCACA
CACGATATATCGCACCTGAGATCATTAA

 

From the BLAST result, can you tell what gene this is?

 

2.      Translate the above sequence into amino acids in all 6 possible reading frames (3 on the direct strand, 3 on the reverse-complement strand). Which amino acid sequence is most likely to be the real one? You may use the GCG program "translate" or write your own program.
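For groups that choose to write their own program, a minimal Python sketch of six-frame translation follows. It assumes the standard genetic code. The function names and the frame labels (+1 to +3 for the direct strand, -1 to -3 for the reverse complement) are illustrative choices, and any incomplete or ambiguous codon is reported as X.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def translate(seq, offset=0):
    """Translate complete codons starting at the given offset (0, 1 or 2)."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")  # X for codons with ambiguity codes
        for i in range(offset, len(seq) - 2, 3)
    )

def six_frames(seq):
    """Translate all three frames of each strand."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq, offset)
        frames[f"-{offset + 1}"] = translate(rc, offset)
    return frames

# Toy sequence for illustration only; it is not the fire ant query.
for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

A frame interrupted by few or no stop codons is often a reasonable first candidate for the real one, although that heuristic alone does not settle the question.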

 

3.      Perform a BLAST search against a protein database, using the amino acid sequence you selected in question 2 as your query sequence. Repeat the search for each choice of substitution matrix and gap penalty provided by the NCBI BLAST server. Are your results affected by the different choices? Select a list of sequences that are consistently within the top 30 hits of the BLAST searches.

 

4.      From the BLAST result, can you tell what the protein is? Does it agree with your finding in question 1? If not, which result are you more inclined to believe? Why?

 

5.      What organisms do the proteins on your selected list come from? Which organism is biologically closest to the fire ant?

 

6.      Take a protein sequence from the organism you have chosen and align it against the fire ant protein using each of the GCG programs Gap and Bestfit. Explain the differences between the results of Gap and Bestfit.

 

7.      In the output of both the Gap and Bestfit programs, there are parameters called "Quality" and "Quality Ratio". Find out the meaning of these parameters, how they are calculated, and verify that they have been calculated correctly in your outputs.