Name:  



STA 4953 (Spring 2001)
Exercise 6A, Due 3/8/2001

Introduction to GCG
GCG is a comprehensive genetic sequence analysis software package, first developed in the 1970s at the University of Wisconsin. For this reason, it is also referred to as the Wisconsin Package.

Starting GCG and using genhelp
To start GCG on helix, type gcg at the helix prompt. Descriptions of all GCG programs can be found by running genhelp. While viewing pages of this manual, use <CTRL>B to go back to the previous page.
Alternatively, you can visit the online manual. The first time you enter the site, go to http://www.gcg.com/genhelp/ and register. Remember your username and password, because you will need them each time you visit the site. At the bottom of the page, click UNIX.

In this exercise, we shall learn to use a very simple GCG program called "composition". This program analyzes the base composition of a nucleic acid sequence or the amino acid composition of a protein sequence. Find the description of composition in the Genhelp manual (type "genhelp" when logged onto GCG, or use the online manual) and print a copy.
In order to run composition, you may have to reformat your sequence.

Reformatting Sequence Files
GCG programs require input sequence files in a particular format. Sequences retrieved from databases such as GenBank, or obtained by biologists in their labs, are not necessarily in a format that GCG can read. Such sequences need to be reformatted.
Look up the descriptions of the programs "fromgenbank" and "reformat" in the Genhelp manual (type "genhelp" when logged onto GCG, or use the online manual).

Use the descriptions of "fromgenbank" and "reformat" to convert the GenBank hemoglobin files you retrieved in Exercise #5. Also convert the *.na and *.aa files you prepared in Exercise #5 to GCG format.

Run "composition" on the *.na and *.aa files that you have reformatted to GCG format above.

Exercise
Based on the outputs of the composition program on the appropriate sequence data, answer the following questions:

  1. What are the total base counts in the nucleic acid sequences (those you retrieved from GenBank in Exercise #5 and ran composition on) of the alpha1, alpha2, beta, and delta genes of hemoglobin? What are the relative frequencies of the bases? (You will need to do some simple calculations to get the relative frequencies: relative frequency = (specific base count) / (total base count).)

    Base    Count    Relative Frequency
    A
    C
    G
    T
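
The relative-frequency arithmetic in Question 1 can be checked with a short script. The following is only a minimal sketch in Python, not part of GCG: the function name base_frequencies and the example sequence are made up for illustration, and the sketch assumes that only the symbols A, C, G, and T contribute to the total base count.

```python
from collections import Counter

def base_frequencies(sequence):
    """Return {base: (count, relative frequency)} for the bases A, C, G, T."""
    counts = Counter(sequence.upper())
    # Total base count: only A, C, G, T are counted (an assumption of this sketch).
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    # relative frequency = (specific base count) / (total base count)
    return {base: (counts[base], counts[base] / total) for base in "ACGT"}

# Made-up 10-base sequence, used only to illustrate the arithmetic.
for base, (count, freq) in base_frequencies("ATGGTGCACC").items():
    print(base, count, round(freq, 3))
```

The four relative frequencies should sum to 1, which is a quick way to check hand calculations made from the composition output.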



  2. Repeat Question 1 for dinucleotides.

    Dinucleotide  Count  Rel. Freq.    Dinucleotide  Count  Rel. Freq.
    AA                                 GA
    AC                                 GC
    AG                                 GG
    AT                                 GT
    CA                                 TA
    CC                                 TC
    CG                                 TG
    CT                                 TT
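
The same calculation extends to dinucleotides. The sketch below (Python, with the hypothetical function name dinucleotide_frequencies) assumes that dinucleotides are counted in overlapping windows, so a sequence of N bases yields N - 1 dinucleotides. Whether the composition program counts them this way is an assumption here; check the composition description in the Genhelp manual before comparing numbers.

```python
from collections import Counter
from itertools import product

def dinucleotide_frequencies(sequence):
    """Return {dinucleotide: (count, relative frequency)} for all 16 pairs."""
    seq = sequence.upper()
    # Overlapping windows (assumed): bases 1-2, 2-3, 3-4, and so on.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    all_pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    total = sum(counts[p] for p in all_pairs)
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T dinucleotides")
    return {p: (counts[p], counts[p] / total) for p in all_pairs}

# Made-up sequence, used only to illustrate the arithmetic.
result = dinucleotide_frequencies("ATGGTGCACC")
print(result["TG"])  # count and relative frequency of TG
```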



  3. Do hemoglobin molecules contain all 20 different kinds of amino acids? If not, write down the name(s), along with the 3-letter and 1-letter codes, of the missing amino acid(s). (Helpful link to codon table)
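
One way to double-check an answer to Question 3 is to compare the 1-letter codes present in a protein sequence against the 20 standard amino acids. The sketch below is a Python illustration only: the function name missing_amino_acids and the test peptide are hypothetical, and the input is assumed to be a plain string of 1-letter codes rather than a GCG-format file.

```python
# The 20 standard amino acids: 1-letter code -> (3-letter code, name).
AMINO_ACIDS = {
    "A": ("Ala", "Alanine"),       "R": ("Arg", "Arginine"),
    "N": ("Asn", "Asparagine"),    "D": ("Asp", "Aspartic acid"),
    "C": ("Cys", "Cysteine"),      "Q": ("Gln", "Glutamine"),
    "E": ("Glu", "Glutamic acid"), "G": ("Gly", "Glycine"),
    "H": ("His", "Histidine"),     "I": ("Ile", "Isoleucine"),
    "L": ("Leu", "Leucine"),       "K": ("Lys", "Lysine"),
    "M": ("Met", "Methionine"),    "F": ("Phe", "Phenylalanine"),
    "P": ("Pro", "Proline"),       "S": ("Ser", "Serine"),
    "T": ("Thr", "Threonine"),     "W": ("Trp", "Tryptophan"),
    "Y": ("Tyr", "Tyrosine"),      "V": ("Val", "Valine"),
}

def missing_amino_acids(protein):
    """Return (name, 3-letter code, 1-letter code) for each absent amino acid."""
    present = set(protein.upper())
    return [(name, three, one)
            for one, (three, name) in sorted(AMINO_ACIDS.items())
            if one not in present]

# Made-up peptide containing every standard amino acid except tryptophan.
print(missing_amino_acids("ACDEFGHIKLMNPQRSTVY"))
```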




  4. For each of the alpha, beta, and delta chains of hemoglobin, answer the following questions (Helpful link to codon table):
      Which amino acid is the most abundant in the chain?
      What are the codons coding for that amino acid?
      From the nucleic acid sequence, locate the codons coding for that amino acid (you will probably need to look at the annotations in the GenBank file to locate the beginning of the coding sequence) and write down the count for each codon.

    Chain         Alpha 1    Alpha 2    Beta    Delta
    Amino Acid
    Codons
    Count
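
The codon count in Question 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not a GCG feature: the function name codon_counts is hypothetical, the standard genetic code is used, and the input is assumed to be the coding sequence only, starting at the first base of the start codon with any introns already removed (as indicated by the CDS annotation in the GenBank file). Reading a genomic sequence from the wrong position, or across an intron, would give meaningless counts.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_counts(coding_sequence, amino_acid):
    """Count, in frame, each codon that codes for the given 1-letter amino acid."""
    seq = coding_sequence.upper()
    # Read non-overlapping triplets from the first base; ignore a trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(codons)
    synonymous = [c for c, aa in CODON_TABLE.items() if aa == amino_acid]
    return {c: counts[c] for c in synonymous}

# Made-up coding sequence: ATG CTG CTC TTA TAA
print(codon_counts("ATGCTGCTCTTATAA", "L"))
```

Codons that code for the amino acid but never occur in the sequence are reported with a count of 0, which makes it easy to fill in one row per codon.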