STA 4953 Exercise 6A: Sequence Composition

Name:

STA 4953 (Spring 2001) Exercise 6A, Due 3/8/2001
Introduction to GCG
GCG is a comprehensive genetic sequence analysis software package, first developed in the 1970s at the University of Wisconsin. It is therefore also referred to as the Wisconsin Package.

Starting GCG and using genhelp
To start GCG on helix, type gcg at the helix prompt. Descriptions of all GCG programs can be found by running genhelp. While viewing pages of this manual, press <Ctrl>B to go back to the previous page.
Alternatively, you can visit the online manual. The first time you use this site, go to http://www.gcg.com/genhelp/ and register. Remember your username and password; you will need them each time you visit the site. At the bottom of the page, click UNIX.

In this exercise, we shall learn to use a very simple GCG program called "composition". This program analyzes the base composition of a nucleic acid sequence or the amino acid composition of a protein sequence. Find the description of composition in the Genhelp manual (available by typing "genhelp" once you are logged onto GCG, or in the online manual) and print a copy.
In order to run composition, you may have to reformat your sequence.

Reformatting Sequence Files
GCG programs require input sequence files in a particular format. Sequences retrieved from databases such as GenBank, or obtained by biologists in their labs, are not necessarily in a format that GCG can read. Such sequences need to be reformatted.
Look up the descriptions of the programs "fromgenbank" and "reformat" in the Genhelp manual (available by typing "genhelp" once you are logged onto GCG) or in the online manual.

Use the descriptions of "fromgenbank" and "reformat" to convert the GenBank hemoglobin files you retrieved in Exercise #5. Also convert the *.na and *.aa files you prepared in Exercise #5 to GCG format.

Run "composition" on the *.na and *.aa files that you reformatted to GCG format above.

Exercise
Based on the outputs of the composition program on the appropriate sequence data, answer the following questions:

What are the total base counts in the nucleic acid sequences (those you retrieved from GenBank in Exercise #5 and ran composition on) of the alpha1, alpha2, beta, and delta genes of hemoglobin? What are the relative frequencies of the bases? (You need to do some simple calculations to get the relative frequencies: relative frequency = (specific base count) / (total base count).)

Base	Count	Relative Frequency
A
C
G
T
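
The relative-frequency arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation, not a replacement for the composition program; the function name base_frequencies and the toy sequence are hypothetical and are not hemoglobin data.

```python
from collections import Counter


def base_frequencies(sequence):
    """Return {base: (count, relative_frequency)} for A, C, G and T.

    Characters other than A, C, G, T (for example N or gaps) are ignored,
    so the total is the number of unambiguous bases only.
    """
    seq = sequence.upper()
    counts = Counter(base for base in seq if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    # relative frequency = (specific base count) / (total base count)
    return {base: (counts[base], counts[base] / total) for base in "ACGT"}


# Hypothetical toy sequence, used only to show the output layout.
example = "ATGGTGCACC"
for base, (count, freq) in base_frequencies(example).items():
    print(f"{base}\t{count}\t{freq:.3f}")
```

In practice the counts would come from the composition output; only the division step needs to be done by hand or with a sketch like this one.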

Repeat Question 1 for dinucleotides.

Dinucleotide	Count	Rel. Freq.	Dinucleotide	Count	Rel. Freq.
AA			GA
AC			GC
AG			GG
AT			GT
CA			TA
CC			TC
CG			TG
CT			TT
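
The same calculation for dinucleotides might look like the sketch below. It assumes overlapping windows (positions 1-2, 2-3, 3-4, and so on); check the composition description in Genhelp to confirm which convention the program actually uses. The function name and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product


def dinucleotide_frequencies(sequence):
    """Return {dinucleotide: (count, relative_frequency)} for all 16 pairs.

    Assumption: dinucleotides are counted in overlapping windows.
    Pairs containing anything other than A, C, G, T are skipped.
    """
    seq = sequence.upper()
    valid = {a + b for a, b in product("ACGT", repeat=2)}
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    counts = Counter(pair for pair in pairs if pair in valid)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no valid dinucleotides")
    return {d: (counts[d], counts[d] / total) for d in sorted(valid)}


# Hypothetical toy sequence, used only to show the output layout.
example = "ATGGTGCACC"
for dinuc, (count, freq) in dinucleotide_frequencies(example).items():
    print(f"{dinuc}\t{count}\t{freq:.3f}")
```

With overlapping windows, a sequence of n unambiguous bases yields n - 1 dinucleotides, so the relative frequencies are taken over n - 1 rather than n.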

Do hemoglobin molecules contain all 20 different kinds of amino acids? If not, write down the name(s), along with the 3-letter and 1-letter code of the missing amino acid(s). (Helpful link to codon table)
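
One way to double-check this answer against the composition output is to list the standard amino acids that never occur in a protein sequence. The following is a minimal sketch; the function name missing_amino_acids and the toy peptide are hypothetical, and the sequence is assumed to use the standard 1-letter codes.

```python
# The 20 standard amino acids: 1-letter code -> (3-letter code, name).
AMINO_ACIDS = {
    "A": ("Ala", "Alanine"),       "R": ("Arg", "Arginine"),
    "N": ("Asn", "Asparagine"),    "D": ("Asp", "Aspartic acid"),
    "C": ("Cys", "Cysteine"),      "Q": ("Gln", "Glutamine"),
    "E": ("Glu", "Glutamic acid"), "G": ("Gly", "Glycine"),
    "H": ("His", "Histidine"),     "I": ("Ile", "Isoleucine"),
    "L": ("Leu", "Leucine"),       "K": ("Lys", "Lysine"),
    "M": ("Met", "Methionine"),    "F": ("Phe", "Phenylalanine"),
    "P": ("Pro", "Proline"),       "S": ("Ser", "Serine"),
    "T": ("Thr", "Threonine"),     "W": ("Trp", "Tryptophan"),
    "Y": ("Tyr", "Tyrosine"),      "V": ("Val", "Valine"),
}


def missing_amino_acids(protein):
    """Return (name, 3-letter, 1-letter) for each amino acid not present."""
    present = set(protein.upper())
    return [
        (name, three, one)
        for one, (three, name) in AMINO_ACIDS.items()
        if one not in present
    ]


# Hypothetical toy peptide, not a hemoglobin chain.
for name, three, one in missing_amino_acids("MKVLAG"):
    print(f"{name}\t{three}\t{one}")
```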

For each of the alpha, beta, and delta chains of hemoglobin, answer the following questions (Helpful link to codon table):

Chain	Alpha 1	Alpha 2	Beta	Delta
Amino Acid
Codons
Count