A full description of these programs (and all of the programs offered
by GCG) can be found in the Genhelp
manual (accessible when logged onto GCG by typing genhelp) or the
online manual.
Appendix III
Appendix IV
Bestfit
Blast
Codonfrequency
Codonpreference
Compare
Composition
Distances
Dotplot
Ending A Program Prematurely
Enzyme.dat Datafile
FASTA
Fetch
Frames
Gap
Growtree
Map
Mapplot
Motifs
Netblast
Peptidestructure
Pileup
Plotstructure
Profilescan
Reformatting Files
Setplot
Suspending A Program
To End a Program Prematurely:
Press <CTRL> + C.
To Suspend a Program:
Press <CTRL> + Z. Typing % fg %job_number brings the suspended job
to the foreground, where you can work with it again; for example,
% fg %6. If you cannot remember which programs you suspended, type
% jobs to list the jobs and their job numbers.
Reformatting files:
GCG requires that input sequence files be in a
particular format. Before you run a program in GCG using a
sequence from Genbank or other source, it may need to be
reformatted so that it can be read by GCG. When a sequence is
being reformatted, the reformatting process begins after the ".." in
the file, which separates the heading from the data to be
reformatted. Here is an example of a typical dividing line:
Gamma.Seq Length: 11375 January 1, 1997 10:09 Type: N Checksum: 6474 ..
Make a quick check (by viewing the file with the cat or more
command) to ensure that the ".." is located after the heading and
before the data, and that there are no adjacent periods (..) in
the heading itself. At the helix prompt, type reformat. You
will be asked for the name of the file to be reformatted. The
output will give you the length of the reformatted sequence, date
of the file, time, and type of the sequence (N for nucleotide, P
for protein), followed by the reformatted sequence, which can now
be read by GCG. One of the restrictions for reformatting is that a
sequence may not be more than 350,000 characters long. If your
sequence is longer than this, it will need to be broken up into
sequences small enough to be reformatted by GCG. This can be done
using ChopUp.
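The two checks described above (locating the ".." divider and staying under the 350,000 character limit) can be sketched in a few lines of Python. This is only an illustration: the function names split_at_divider and chop are hypothetical, and the code does not reproduce what reformat or ChopUp actually do internally.

```python
MAX_LEN = 350_000  # size limit quoted in the text above


def split_at_divider(text):
    """Split file contents at the first '..' into (heading, data)."""
    idx = text.find("..")
    if idx == -1:
        raise ValueError("no '..' dividing line found")
    return text[: idx + 2], text[idx + 2 :]


def chop(sequence, max_len=MAX_LEN):
    """Cut a sequence string into consecutive pieces of at most max_len."""
    return [sequence[i : i + max_len] for i in range(0, len(sequence), max_len)]


# A tiny made-up file: heading line ending in '..', then the data.
example = "Gamma.Seq Length: 10 Type: N Checksum: 6474 ..\nACGTACGTAC\n"
heading, data = split_at_divider(example)
pieces = chop(data.strip(), max_len=4)  # small limit just for the demo
print(heading)
print(pieces)
```

Because the split happens at the first "..", a stray pair of periods in the heading would cut the heading short, which is why the quick check above is worth doing.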
A file from Genbank can be reformatted as above or by using the
command fromgenbank. Each sequence in the file will have its own
output, e.g. the nucleotide and amino acid sequences will have
their own output files. The output files are named according to the
LOCUS line at the beginning of the sequence entry in the main file.
It is a good idea to check the resulting output. The same
restriction of 350,000 characters applies here, although
Fromgenbank will automatically divide the file into the appropriate
number of output files to accommodate the extra characters.
Composition:
Composition is one of the first programs from GCG that we will be
using. This program analyzes the base (a,c,t,g) composition of nucleic
acid sequences or the amino acid composition of protein sequences.
For nucleotide sequences, Composition also determines the
dinucleotide and trinucleotide content. To run Composition on a
sequence, type composition at the prompt. You will be asked to
name the sequence to be analyzed and to provide a name for the
output file. It will help you to keep track of your files if you
end the output name with .composition or .comp, etc. The output
will provide you with a count of the number of occurrences of each
amino acid, base, dinucleotide, and trinucleotide.
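To see what kind of counts Composition reports, here is a minimal Python sketch. It assumes that dinucleotides and trinucleotides are counted at every position (overlapping); whether GCG counts them exactly this way is an assumption, and kmer_counts is a hypothetical helper name.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping run of k symbols in the sequence."""
    seq = sequence.lower()
    return Counter(seq[i : i + k] for i in range(len(seq) - k + 1))


seq = "acgtacgt"
bases = kmer_counts(seq, 1)          # single bases
dinucleotides = kmer_counts(seq, 2)  # pairs of adjacent bases
trinucleotides = kmer_counts(seq, 3)
print(bases)
print(dinucleotides)
print(trinucleotides)
```

The same function with k=1 on a protein sequence gives the amino acid composition.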
Codonfrequency:
Codonfrequency counts the codons (three
nucleotide bases which code for an amino acid) in one or more
specified ranges of a sequence and creates a frequency table for
the codons in those ranges. To run codonfrequency,
type codonfrequency at the prompt. You determine the range(s)
desired by answering the prompted questions. To choose the entire
sequence, choose the range to begin at 1 and to end at 350 (for a
sequence containing 350 nucleotide bases). For reverse, choose
No. More than one range can be chosen by answering the prompted
questions. After you have chosen all of the ranges on which to run
codonfrequency, type N at the prompted question to continue.
Since we are interested in the output file, choose to W)rite the
frequencies to your output file, and choose a file name (something
ending in .codfreq would be appropriate). The output is a
table giving the number of codons in the specified range, the
observed frequency of each codon per 1000, and the fraction for
each codon within its family (all codons coding for the same
amino acid).
If your sequence is rejected, check the heading to verify that the
Type is N for nucleotide. If the Type of the sequence is
missing
or incorrect, you can correct it by reformatting as follows:
reformat /NUCleotide filename or reformat /PROtein filename
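The three columns of the Codonfrequency table (count, frequency per 1000, and fraction within the codon family) can be reproduced with a short Python sketch. It assumes the standard genetic code and a single forward range that starts in frame; codon_frequencies is a hypothetical name, and the layout of the real output file is not reproduced here.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codon_frequencies(sequence, begin=1, end=None):
    """Return {codon: (count, per_1000, fraction_in_family)} for a range.

    begin and end are 1-based and inclusive, as in the prompts above.
    """
    seq = sequence.upper()[begin - 1 : end]
    codons = [seq[i : i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    total = sum(counts.values())
    family_totals = Counter()
    for codon, n in counts.items():
        family_totals[CODON_TABLE[codon]] += n
    return {
        codon: (n, 1000 * n / total, n / family_totals[CODON_TABLE[codon]])
        for codon, n in counts.items()
    }


table = codon_frequencies("ATGCTGCTGCTATAA")
for codon, (n, per_1000, fraction) in sorted(table.items()):
    print(codon, n, round(per_1000, 1), round(fraction, 2))
```

In the example, CTG and CTA both code for leucine, so CTG has a family fraction of 2/3 even though it makes up 400 of every 1000 codons overall.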
Blast:
Blast takes as input either a nucleic acid (consisting of
nucleotide bases) or protein sequence (consisting of amino acids)
as a query and compares the query sequence to a database of
sequences to find sequences similar to the query. To start, type
blast. To run a search against the entire sequence, just press
Enter at the Begin and End prompts. To run a search against a
portion of the sequence, specify the Begin and End positions. You
are then asked to choose a database. This is done according to the
type of sequence that you are using for the query, either
nucleotide or protein. For a nucleotide sequence, we will most
likely be using 3) genembl n GenBank. For a protein sequence
(amino acid), choose either the 1)PIR or 2)SwissProt databases.
PIR is the US protein sequence database and SwissProt is its
European counterpart. Just press Enter to choose the default and
ignore hits expected to occur more than 10 times. If you plan to
print your output, you may choose to limit the number of sequences
in the output to 50.
You can accept the default output name or choose a new one. The
output will consist of a list of sequences from the chosen
database, which most closely match the query (input sequence), and
gives the range within the sequences that had a significant
alignment to the query. At the end of this list are the actual
significant alignments between the query and each match. Finally,
a complete list of parameter settings used for the search is
listed. The Bit Score tells the number of segment pairs that you
would have to look through in order to find a high scoring pair as
good as this one. For instance, for a bit score of 60, you would
have to search about 2^60 independent segment pairs before you
would find an alignment with a high scoring pair as good as this
one by chance. The E-Value is the number of hits with a score at
least as high as the observed score that you would expect to see
purely by chance when you do a search of this size.
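The relationship between the bit score and the size of the search can be put into numbers. The sketch below uses the commonly quoted BLAST relation E = m * n * 2^(-S), where S is the bit score and m and n are the query and database lengths; this formula comes from general BLAST statistics rather than from this guide, and the function names are hypothetical.

```python
import math


def expected_chance_hits(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)


def chance_of_at_least_one_hit(expected_hits):
    """Probability of one or more chance hits, assuming hits are rare and independent."""
    return 1.0 - math.exp(-expected_hits)


# A search space of 2^30 x 2^30 = 2^60 segment pairs and a bit score of 60.
e = expected_chance_hits(60, 2**30, 2**30)
print(e, chance_of_at_least_one_hit(e))
```

With a bit score of 60 and a search space of 2^60, about one such hit is expected by chance, which matches the example in the text. Each additional bit halves that expectation.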
Netblast:
Netblast is similar to Blast, but Netblast uses remote
databases and Blast uses local databases in GCG. Netblast is
advantageous in that it ensures that you are using the most
up-to-date databases. To run, type netblast, etc.
FASTA:
This is another
popular database search program. It does
essentially the same thing as Blast using a different algorithm.
The general impression is that FASTA works better with nucleotide
sequences while Blast works better with amino acid sequences.
FASTA is much slower than Blast so you are encouraged to run it as
a Batch job by using the command line option -BAT. This
will free
your system for other work while the program runs. The program
will run in the background or at another time even if you log off
of the computer. To run, type fasta -BAT. The program prompts
you for all required parameters that you didn't specify on the
command line. After the prompt, the program submits itself to the
batch queue. You can type fasta <filename> -Default -BAT and
all of the prompted questions will be answered using the defaults.
For an example, go here.
After a batch job has been executed, you will receive an e-mail message
notifying you of its completion. The output file will be in the
directory from which you submitted the program.
Bestfit:
Use Bestfit to find the best segment of similarity between
two sequences. If the relationship
between two sequences is weak or unknown, bestfit is the best tool
in the Wisconsin Package to identify the regions of similarity
between the two sequences. This is done by inserting gaps into the
sequences in order to maximize the number of matches and obtain an
optimal alignment. The quality of the alignment is determined
according to the local homology algorithm of Smith and Waterman.
To run, type bestfit, and answer the prompted questions. The gap
extension penalty of 50 is high, but use it as default until
further notice. It would be a good idea to end the output name
with ".bestfit". The output shows the actual alignment according
to the bestfit.
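For readers curious about what a local alignment does, here is a minimal Python sketch of the Smith and Waterman idea. The scores used (match 2, mismatch -1, gap -2) are made up for the illustration and are not Bestfit's scoring matrices or its separate gap creation and extension penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned piece of a, aligned piece of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


result = smith_waterman("TTTACGTACGTTT", "GGGACGTACGGG")
print(result)
```

Only the shared middle segment is reported; the unrelated ends of both sequences are left out, which is the behavior that makes a local method useful when the overall relationship is weak.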
Compare:
Compare compares two protein or nucleic acid
sequences
and creates a file of the coordinates of the points of similarity
between the sequences for plotting with DotPlot ("Dot-plotting is
the best method in the Wisconsin Package(TM) for comparing two
sequences when you suspect that there could be more than one
segment of similarity between the two"). To run Compare, type compare.
You will enter the two sequences to compare. Use the default
values everywhere except for the output filename, which you may
want to end with ".compare". The output is designed to be read by
dotplot. To see an example of a dotplot, go
here
and see the plots at the end of the page.
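The file of coordinates that Compare writes can be pictured with a small Python sketch. It assumes a simple window and stringency rule: a point is recorded wherever at least a given number of symbols match inside a window of a given length. The parameter names and the defaults here are illustrative assumptions, not Compare's own settings or file format.

```python
def dot_points(a, b, window=3, stringency=3):
    """Return 1-based (i, j) pairs where the windows starting at i in a
    and j in b share at least `stringency` matching positions."""
    points = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= stringency:
                points.append((i + 1, j + 1))
    return points


strict = dot_points("ACGTT", "ACGAA", window=3, stringency=3)
relaxed = dot_points("ACGTT", "ACGAA", window=3, stringency=2)
print(strict)
print(relaxed)
```

Plotting these points with one sequence on each axis gives the dot-plot: a run of points along a diagonal marks a segment of similarity, and lowering the stringency adds more points (and more noise).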
Dotplot:
Dotplot makes a dot-plot (graph) of the output from
Compare. It is a good tool for helping you to visualize the
segments of similarity between two sequences. Before running
dotplot, you will first need to set up the graphics by typing
setplot (see Setplot below; once Setplot has been configured for
the first time, you will still need to set up the graphics again
each time you log in. Both the initial and the subsequent graphics
setups are described below). To run dotplot, type dotplot. The input file
is the output file from compare. Use the default density.
Setplot:
Setplot allows you to set up the graphics configuration
in one step. The first session using setplot will be used to set
up the Xwindows and Postscript graphics devices.
1. After starting GCG, type setplot.
2. Enter C to create a new device.
3. Fill in the necessary information. Tab to move through the
fields, and use the E command to enter the first 3 items:
Unique Abbreviation: psfile
Description: Write to a postscript file
Port/Queue/file: $program$-$time$.ps
Languages: Move by up/down arrow to the item PostScript
Devices: Move by up/down arrow to the item EPSF
4. Type M to return to the main menu and choose S to save
changes. Now you have created a postscript output device.
5. Repeat steps 2 through 4 to create another output device
for X-windows.
Unique Abbreviation: xwin
Description: X terminal
Port/Queue/file: xterm
Languages: Move by up/down arrow to the item Xwindows
Devices: Move by up/down arrow to the item color
6. You have now created an X-windows and a PostScript output
device.
Before executing a GCG graphics program, run setplot, arrow to the
device of your choice, and press return.
After the initial session in which you created the graphics devices,
you will set up the graphics by typing setplot. Choose the desired
device by using the up/down arrows to highlight your choice, then
press return. If you are not on a Sun station, choose psfile.
This will write the results to a postscript file to be printed. If
you are on a Sun station and would like to view the plot before
printing, choose xwin.
Gap:
Gap is used to find the global alignment of two complete
sequences by maximizing the number of matches while minimizing the
number of gaps. Use Gap when you want an alignment of the entire
length of the sequences. To run, type gap. You will be asked for
the names of the sequences to be aligned (you can align two nucleotide
sequences or two protein sequences; if there is a problem, check
the Type of each sequence). Use the default gap creation and
extension penalties. Choose the output name. The output will give
you the gap alignment of the two sequences.
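A global alignment, in contrast to a local one, must run from the first to the last symbol of both sequences. The Python sketch below shows the classic dynamic programming approach (usually credited to Needleman and Wunsch) with a single made-up gap penalty; it is an illustration only and does not use Gap's scoring matrices or its separate creation and extension penalties.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned a, aligned b) for an end-to-end alignment."""

    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # leading gaps are penalized
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


total, line_a, line_b = global_alignment("GATTACA", "GATACA")
print(total)
print(line_a)
print(line_b)
```

Six matches and one gap give a score of 6 - 2 = 4 here. Raising the gap penalty makes the program less willing to open gaps, which is the trade-off the gap creation and extension penalties control.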
Pileup:
Pileup is used to create an alignment of a group
of
related sequences, aligning the whole length of the sequences. The
alignment is begun by aligning the two most similar sequences,
forming a cluster, and then continues by aligning the sequence (or
cluster of sequences) most similar to this cluster, forming a new
cluster. This process is repeated until all of the sequences have
been aligned. The sequences to be aligned using Pileup must be in a
list file. Be sure that you are aligning sequences of the same
Type, either all nucleotide or all protein. To make a list file you
can use a text editor such as pico. Type pico, and the window
for the editor will appear. List the names of the sequences to be
aligned in the window and name the file, ending the filename with
".list". (Naming the file can be done as you exit the editor, <CTRL>
+ X, or by saving without exiting the editor, <CTRL> + O.
To continue with Pileup, you will need to
exit the text editor.) To run pileup, type pileup. When you are
asked for the sequences, type @<filename of list
file>.list. Use the
default gap creation and extension penalties. You will probably
want to plot the output, so choose A)Plot…. Choose the default density
and choose an output filename ending with .msf (this file can be
used as the input file for the Distances program).
Codonpreference:
Codonpreference is a gene prediction program,
which tries to find the protein coding regions of a gene by
comparing how often each codon is used to
represent an amino acid with a codon frequency table. If the usage
of codons is not randomly distributed, then there is an indication
that this region of the sequence codes for a gene. Begin
by setting the graphics device using setplot, if this has not
already been done during this session on the computer. Type
codonpreference. The sequence used should be a nucleotide
sequence. Choose Yes to run codonpreference on the reverse strand
as well as on the direct strand. For the codonfrequency file we
will use GenMoreData:human_high.cod (if using human hemoglobin).
Use defaults for the other questions.
The output will be a graph.
The boxes below the plot indicate the open reading frames for that
translation frame. The short lines extending above the box are the
potential start codons. The short lines extending below the box
indicate potential stop codons. The short lines below the open
reading frame plot indicate rare codons. We are looking for
portions of the graph, specifically in the open reading frame, that
fall below or above the threshold (indicated by two horizontal
dot/dash lines in the graph). The dotted graphed line represents
the bias, while the solid graphed line represents codonpreference.
To find a protein coding region, we are looking at the area graphed
above the open reading frame with few rare codons. You can compare
your results to the CDS coordinates in the GenBank record.
Frames:
You need to set your graphical device before running
Frames if you have not already done so during this session on the
computer. Frames, like Codonpreference, is also a gene prediction
program. Frames shows the open reading frames for all six
translation frames of the DNA sequence. The input file should be a
nucleotide sequence. The output file is a graph with the open
reading frames indicated by boxes. By default the potential start
and stop codons, which are indicated by short lines extending
above and below the boxes, are at the beginning and end of the
boxes, respectively. The dots above each translation frame
represent rare codons.
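The search for open reading frames in all six translation frames can be sketched in Python. The version below assumes that an open reading frame runs from an ATG to the next in-frame stop codon (TAA, TAG or TGA); the exact rules and options used by Frames may differ, and open_reading_frames is a hypothetical name.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def open_reading_frames(seq, min_codons=1):
    """Return (frame, start, end) for each ATG...stop stretch.

    Frames +1..+3 are on the direct strand, -1..-3 on the reverse strand.
    Positions are 1-based on the strand being read (for the reverse
    frames they count from the start of the reverse complement).
    """
    found = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            start = None
            for i in range(offset, len(s) - 2, 3):
                codon = s[i : i + 3]
                if codon == "ATG" and start is None:
                    start = i
                elif codon in STOPS and start is not None:
                    if (i - start) // 3 >= min_codons:
                        found.append((strand + str(offset + 1), start + 1, i + 3))
                    start = None
    return found


orfs = open_reading_frames("CCATGAAATAGCC")
print(orfs)
```

In the example only frame +3 contains a start codon followed by an in-frame stop, so a single open reading frame covering bases 3 to 11 is reported.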
Fetch:
Fetch is used to copy GCG files or sequences from
the GCG
database to your working directory. To get ("fetch") a file from
the GCG database and put it into your directory, just type fetch.
You will be asked for the filename.
Map:
Map takes a DNA sequence and maps it, showing (above the
sequence) all of the restriction enzyme cut points for the direct
strand (unless otherwise specified) of the sequence. It also
provides a protein translation of each reading frame below the
sequence. In order to get the most out of this program, you will
need to be familiar with the file enzyme.dat ("fetch" this file to
your working directory), Appendix VII, and Appendix III. A brief
explanation of each is provided below, but you should take the time to
look at these files yourself. In order to run map, you will need
to set up graphics ("setplot"). To run, type map. You will
specify the sequence to be mapped along with the enzyme(s) (you can
use more than one enzyme per session) used to cut, the reading
frames used to determine the cut sites, and the name of the output
file. The output file will have both the direct and reverse
strands of the sequence with the enzyme that cuts for the direct
strand specified above the sequence and a protein translation of
all reading frames below the sequence. The cut site will be
designated by a line directly above a base. The actual cut will
occur after this base (to be certain, you will need to double check
by looking the enzyme up in the enzyme.dat file). At the end of
the file, there is a list of the enzymes that did cut the sequence
and a list of the enzymes that were specified when running the
program but did not have any cuts. If you want the program to
label the cuts on both the direct and reverse strands, type
map -BOT at the command line.
Appendix III:
Appendix III outlines all of the acceptable
characters allowed by GCG in sequences. It gives a full list of
symbols that represent nucleotide bases, their complements, and
their Cambridge equivalent. Symbols other than the usual A,C,G,T
are possible. A list of the standard one-letter amino acid codes
and their three-letter equivalents is also included. Knowledge
of these symbols will help you understand the notation used
in the enzyme.dat file of restriction enzyme cut sites.
You should view this file in either the online Genhelp or the
Unix Genhelp (by typing genhelp when logged onto GCG).
Appendix VII:
This Appendix contains datafile descriptions from
GCG. We are interested in the restriction enzyme datafile,
enzyme.dat.
Enzyme.dat datafile:
This datafile contains a list of all of the
restriction enzymes used by Map, Mapsort, and Mapplot along with
their cut sites. If you are going to cut a DNA sequence with a
restriction enzyme, you can look this enzyme up in the enzyme.dat
file to find out the location on the sequence that it will cut.
For example, say you want to use the enzyme AsuI. In the
enzyme.dat file, you will find this enzyme name followed by
G'GnC_C. The " ' " tells you where the enzyme will cut the
sequence on the direct strand. That is, on the direct strand of the
sequence, anywhere that the sequence segment GGnCC occurs (reading
from the 5' end to the 3' end on the direct strand; from left to
right), this enzyme will cut the sequence after the first G. In
Appendix III the "n" can be found to mean G or A or T or C. The
"_" shows the cut position for this enzyme on the reverse strand of
the sequence. So, wherever the sequence segment CCnGG (reading
from left to right; 3' end to 5' end on the reverse strand) occurs
on the reverse strand of the sequence, this enzyme will cut the
sequence between the two G's, that is, after the first G when the
reverse strand is read from its 5' end.
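The recognition pattern notation can be read by a short Python sketch that finds where an enzyme such as AsuI would cut the direct strand. The table of ambiguity symbols is the standard IUB set; the function names are hypothetical, only the pattern string itself is handled (not the rest of an enzyme.dat line), and a pattern without a "'" or "_" mark would raise an error here.

```python
import re

# Standard IUB ambiguity symbols written as regular expression classes.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def parse_site(pattern):
    """Split a pattern like G'GnC_C into (bases, direct cut, reverse cut).

    Each cut is the number of bases of the site, as written on the
    direct strand, that lie to the left of the mark.
    """
    bases = pattern.replace("'", "").replace("_", "").upper()
    direct_cut = pattern.replace("_", "").index("'")
    reverse_cut = pattern.replace("'", "").index("_")
    return bases, direct_cut, reverse_cut


def direct_cut_positions(sequence, pattern):
    """1-based positions of the base after which the direct strand is cut."""
    bases, direct_cut, _ = parse_site(pattern)
    # The lookahead (?=...) lets overlapping sites be found as well.
    site = re.compile("(?=" + "".join(IUB[b] for b in bases) + ")")
    return [m.start() + direct_cut for m in site.finditer(sequence.upper())]


parsed = parse_site("G'GnC_C")
cuts = direct_cut_positions("AAGGACCTTGGTCCAA", "G'GnC_C")
print(parsed)
print(cuts)
```

The example sequence contains the site twice (GGACC and GGTCC), and in both cases the direct strand is cut after the first G of the site, at bases 3 and 10.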
Mapplot:
Mapplot constructs a graphical display of the restriction
map. It graphically displays the site, cut position, and total
number of cuts for each enzyme. In order to run Mapplot, you will
need to set up graphics (setplot). To run, type mapplot. You
will input the sequence to be mapped, the range within the sequence
to be mapped, and the restriction enzymes to cut with. The output will
consist of one horizontal line, which is representative of the
sequence, for each restriction enzyme. The cut sites for that
enzyme are represented by slash marks on the line at the cut
position. The total number of cuts for the restriction enzyme is
at the end of the line just before the restriction enzyme name.
Distances:
Distances creates a matrix (table) of the
pairwise
distances within a group of aligned sequences. The output file
from Pileup will be the input file for Distances. The output from
Distances will be the input for Growtree. To run, type
distances. When asked for the aligned sequences, type the
filename from Pileup (ending in .msf) followed by {*}. For
example, if you named the output file from Pileup hum_gtr.msf ,
then your input file for Distances will be hum_gtr.msf{*}. Use
the default distance correction method, 3 Kimura protein distance.
End the output filename with .distances.
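To give a feel for the numbers in the distance matrix, here is a Python sketch that uses the commonly cited Kimura approximation for protein distances, d = -ln(1 - p - 0.2 p^2), where p is the fraction of aligned positions that differ. Whether Distances applies exactly this formula, and how it treats gaps, are assumptions; the sequence names and function names are made up for the example.

```python
import math

GAPS = set("-.~")  # symbols treated as gaps in this sketch


def kimura_protein_distance(a, b):
    """Distance between two aligned protein sequences of equal length.

    Positions with a gap in either sequence are skipped. For very
    different sequences (p above roughly 0.85) the formula is undefined
    and math.log raises ValueError.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x not in GAPS and y not in GAPS]
    p = sum(x != y for x, y in pairs) / len(pairs)
    return -math.log(1 - p - 0.2 * p * p)


def distance_matrix(aligned):
    """aligned maps a name to its aligned sequence; returns {(name, name): distance}."""
    names = list(aligned)
    return {
        (m, n): kimura_protein_distance(aligned[m], aligned[n])
        for m in names
        for n in names
    }


matrix = distance_matrix({"seq1": "AAAAAAAAAA", "seq2": "AAAAAAAAAC", "seq3": "AAAA-AAAAA"})
print(round(matrix[("seq1", "seq2")], 4))
```

One difference in ten positions (p = 0.1) gives a distance slightly above 0.1, because the correction allows for repeated changes at the same position.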
Growtree:
Growtree takes the output file from Distances and
creates a phylogenetic tree. Two methods can be used to do this,
either the neighbor-joining or the UPGMA method. It will create
either a text file (filename ending in .trees) or a figure (a tree
diagram). Be sure to set up graphics before beginning (setplot).
Type growtree. The input file is the distances output file.
Choose the method to use and output filename. The output will have
two separate files; a text file with the filename ending in .trees,
and a figure file (growtree.figure).
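Of the two methods mentioned, UPGMA is the simpler one to sketch. The Python code below repeatedly joins the two closest clusters and averages their distances to everything else; it returns only the branching order as nested tuples, without branch lengths, and is an illustration rather than a description of Growtree's implementation. The function name and the input layout are assumptions.

```python
def upgma(names, distances):
    """Build a tree from pairwise distances.

    distances maps frozenset({x, y}) to the distance between leaves x and y.
    The tree is returned as nested tuples, e.g. (("A", "B"), "C").
    """
    size = {name: 1 for name in names}  # cluster -> number of leaves
    d = {frozenset(k): v for k, v in distances.items()}
    while len(size) > 1:
        # Pick the closest pair of clusters and join them.
        a, b = sorted(min(d, key=d.get), key=str)
        merged = (a, b)
        for other in size:
            if other not in (a, b):
                # Average distance, weighted by the number of leaves.
                d[frozenset((merged, other))] = (
                    size[a] * d[frozenset((a, other))]
                    + size[b] * d[frozenset((b, other))]
                ) / (size[a] + size[b])
        # Drop every distance that involves the two old clusters.
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


tree = upgma(
    ["A", "B", "C"],
    {frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 6},
)
print(tree)
```

A and B are joined first because they are the closest pair, and C is attached to that cluster afterwards. Neighbor-joining follows a different rule for choosing which pair to join and can give a different tree from the same distances.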
Peptidestructure:
Peptidestructure is a program that makes
predictions for the secondary structure of the peptide sequence.
The output file from Peptidestructure is used to make a graphical
presentation using Plotstructure. The input file for
Peptidestructure must be a single protein sequence.
Plotstructure:
Plotstructure takes the results from
Peptidestructure and makes a graphical display. The measures of
the protein secondary structure predicted by Peptidestructure can
be displayed on parallel panels of a graph or with a
two-dimensional presentation. Set up graphics using setplot.
Type plotstructure. The input file must be the file from the
output of Peptidestructure. Choose the type of graph. The output
will be in a file named "plotstructure-(some numbers).ps". In the
panel output, the horizontal line across the surface probability
panel at position 1.0 on the y-axis indicates the expected surface
probability calculated for a random sequence. Any value above this
line is indicative of an increased probability of being found on a
protein surface. In the two-dimensional plot, helices are
shown with a sine wave, beta-sheets with a sharp saw-tooth wave,
turns with 180 degree turns, and coils with a dull saw-tooth wave.
Whenever hydrophilicity, surface probability, flexibility, or
antigenic index exceed a certain threshold, special symbols are
superimposed over the wave, with the size of the symbol
proportional to the value of the attribute.
Motifs:
Motifs is used to find interesting regions of
proteins.
Profilescan:
ProfileScan uses a database of profiles to
find
structural and sequence motifs in protein sequences.