GCG Programs

A full description of these programs (and all of the programs offered by GCG) can be found in the Genhelp manual (accessible when logged on to GCG by typing genhelp) or the online manual.

Appendix III
Appendix VII
Bestfit
Blast
Codonfrequency
Codonpreference
Compare
Composition
Distances
Dotplot
Ending A Program Prematurely
Enzyme.dat Datafile
FASTA
Fetch
Frames
Gap
Growtree
Map
Mapplot
Motifs
Netblast
Peptidestructure
Pileup
Plotstructure
Profilescan
Reformatting Files
Setplot
Suspending A Program

To End a Program Prematurely: Press <CTRL> + C

To Suspend a Program: Press <CTRL> + Z. Typing % fg %job_number brings the suspended job back to the foreground, where you can work with it again; for example, % fg %6. If you cannot remember which programs you suspended, type % jobs to list the jobs and their job numbers.

Reformatting files: GCG requires that input sequence files be in a particular format. Before you run a GCG program on a sequence from GenBank or another source, the sequence may need to be reformatted so that GCG can read it. Reformatting begins after the "..", which separates the heading from the data to be reformatted. Here is an example of a typical dividing line:
Gamma.Seq Length: 11375 January 1, 1997 10:09 Type: N Checksum: 6474 ..
Make a quick check (by viewing the file with the cat or more command) to ensure that the ".." comes after the heading and before the data, and that there are no adjacent periods (..) in the heading itself. At the helix prompt, type reformat. You will be asked for the name of the file to be reformatted. The output gives the length of the reformatted sequence, the date and time of the file, and the type of the sequence (N for nucleotide, P for protein), followed by the reformatted sequence, which can now be read by GCG. One restriction is that a sequence may not be more than 350,000 characters long. A longer sequence must be broken up into pieces small enough to be reformatted by GCG. This can be done using ChopUp.
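The quick check described above can also be done with a short script. The following Python sketch is not part of GCG; the function name check_gcg_file is hypothetical, and it assumes that the dividing line ends with ".." and that every letter after that line is a sequence symbol.

```python
MAX_GCG_LENGTH = 350_000  # limit quoted in the text above


def check_gcg_file(text):
    """Return (number_of_heading_lines, sequence_length) for GCG-style text.

    Raises ValueError if the ".." divider is missing, if ".." first appears
    in the middle of a heading line, or if the sequence is too long.
    """
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if ".." in line:
            divider = index
            break
    else:
        raise ValueError("no '..' dividing line found")

    # Assumption: a real dividing line ends with "..", so a ".." that
    # appears mid-line is a stray pair of periods in the heading.
    if not lines[divider].rstrip().endswith(".."):
        raise ValueError(f"stray '..' in heading line {divider + 1}")

    # Count letters only; position numbers and blanks are skipped.
    length = sum(ch.isalpha() for line in lines[divider + 1:] for ch in line)
    if length > MAX_GCG_LENGTH:
        raise ValueError(f"sequence has {length} characters; split it first")
    return divider + 1, length
```

Calling check_gcg_file on the contents of a file before running reformat shows whether the heading or the length is likely to cause trouble.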

A file from GenBank can be reformatted as above or by using the command fromgenbank. Each sequence in the file will have its own output, e.g. the nucleotide and amino acid sequences will have their own output files. The output files are named according to the LOCUS line at the beginning of the sequence entry in the main file. It is a good idea to check the resulting output. The same restriction of 350,000 characters applies here, although Fromgenbank will automatically divide the file into the appropriate number of output files to accommodate the excess characters.

Composition:
Composition is one of the first programs from GCG that we will be using. This program analyzes the base (a, c, t, g) composition of nucleic acid sequences or the amino acid composition of protein sequences. For nucleotide sequences, Composition also determines the dinucleotide and trinucleotide content. To run Composition on a sequence, type composition at the prompt. You will be asked to name the sequence to be analyzed and to provide a name for the output file. It will help you keep track of your files if you end the output name with .composition or .comp, etc. The output provides a count of the number of occurrences of each amino acid or, for nucleotide sequences, of each base, dinucleotide, and trinucleotide.
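To make the counts concrete, here is a minimal Python sketch of the kind of tally Composition reports. It is an illustration only, not GCG's code; the function name composition is hypothetical, and counting dinucleotides and trinucleotides in overlapping windows is an assumption.

```python
from collections import Counter


def composition(seq, k):
    """Count every overlapping substring of length k in seq.

    k=1 gives base (or amino acid) counts, k=2 dinucleotides,
    k=3 trinucleotides.
    """
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, dict(composition("acgtac", k)))
```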

Codonfrequency:
Codonfrequency counts the codons (groups of three nucleotide bases, each coding for an amino acid) in one or more specified ranges of a sequence and creates a frequency table for the codons in those ranges. To run codonfrequency, type codonfrequency at the prompt. You determine the range(s) desired by answering the prompted questions. To choose the entire sequence, set the range to begin at 1 and to end at the last base (350 for a sequence containing 350 nucleotide bases). For reverse, choose No. More than one range can be chosen by answering the prompted questions. After you have chosen all the ranges on which to run codonfrequency, type N at the prompted question to continue. Since we are interested in the output file, choose to W)rite the frequencies to your output file, and choose a file name (something ending in .codfreq would be appropriate). The output is a table giving the number of codons in the specified range, the observed frequency of each codon per 1000, and the fraction for each codon within its family (all codons coding for the same amino acid). If your sequence is rejected, check the heading to verify that the Type is N for nucleotide. If the Type of the sequence is missing or incorrect, you can correct it by reformatting as follows:
reformat /NUCleotide filename or reformat /PROtein filename
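The three columns of the frequency table can be illustrated with a short Python sketch. This is not GCG's implementation: the function name codon_frequencies is hypothetical, the standard genetic code is assumed, and codons containing symbols other than A, C, G, and T are simply skipped.

```python
from collections import Counter

# Standard genetic code, with codons ordered by T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def codon_frequencies(seq, begin, end):
    """Tabulate codons in seq[begin..end] (1-based, inclusive).

    Returns {codon: (count, per_thousand, fraction_within_family)}.
    """
    region = seq.upper()[begin - 1:end]
    codons = [region[i:i + 3] for i in range(0, len(region) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    total = sum(counts.values())

    # A family is the set of codons that code for the same amino acid.
    family_totals = Counter()
    for codon, n in counts.items():
        family_totals[CODON_TABLE[codon]] += n

    return {
        codon: (n, 1000 * n / total, n / family_totals[CODON_TABLE[codon]])
        for codon, n in counts.items()
    }
```

For the 15-base sequence ATGGCTGCCGCTTAA, the codon GCT occurs twice among five codons (400 per 1000) and accounts for two thirds of the alanine family.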

Blast:
Blast takes a nucleic acid sequence (consisting of nucleotide bases) or a protein sequence (consisting of amino acids) as a query and compares it to a database of sequences to find sequences similar to the query. To start, type blast. To run a search against the entire sequence, just press Enter at the Begin and End prompts. To run a search against a portion of the sequence, specify the Begin and End positions. You are then asked to choose a database. This is done according to the type of sequence that you are using for the query, either nucleotide or protein. For a nucleotide sequence, we will most likely be using 3) genembl n GenBank. For a protein sequence (amino acid), choose either the 1)PIR or 2)SwissProt databases. PIR is the US protein sequence database and SwissProt is its European counterpart. Just press Enter to choose the default and ignore hits expected to occur more than 10 times. If you plan to print your output, you may choose to limit the number of sequences in the output to 50. You can accept the default output name or choose a new one.
The output consists of a list of the sequences from the chosen database that most closely match the query (input sequence), together with the range within each sequence that had a significant alignment to the query. After this list come the actual significant alignments between the query and each match. Finally, the output lists all of the parameter settings used for the search. The Bit Score indicates how many segment pairs you would have to look through in order to find, by chance, a high scoring pair as good as this one. For instance, for a bit score of 60, you would have to search about 2^60 independent segment pairs before you would find an alignment with a high scoring pair as good as this one by chance. The E-Value is the number of alignments scoring as high as the observed score that you would expect to find purely by chance in a search of this size; for small values it is close to the probability of finding at least one such alignment.
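The relationship between the bit score and the E-value can be written as E = m × n × 2^(-S), where S is the bit score, m is the query length, and n is the total length of the database. The Python sketch below evaluates that formula; the function name expected_hits is hypothetical, and the sketch uses raw lengths, whereas Blast itself works with adjusted ("effective") lengths, so its reported values will differ somewhat.

```python
def expected_hits(bit_score, query_length, database_length):
    """Expected number of chance alignments scoring at least bit_score.

    Simplified form E = m * n * 2**(-S) using raw sequence lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # A bit score of 60 against a search space of 300 x 1,000,000 residues.
    print(expected_hits(60, 300, 1_000_000))
```

Each additional bit in the score halves the number of chance hits expected, which is why a difference of ten bits corresponds to roughly a thousandfold change in the E-value.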

Netblast:
Netblast is similar to Blast, but Netblast uses remote databases and Blast uses local databases in GCG. Netblast is advantageous in that it ensures that you are using the most up-to-date databases. To run, type netblast, etc.

FASTA:
This is another popular database search program. It does essentially the same thing as Blast using a different algorithm. The general impression is that FASTA works better with nucleotide sequences while Blast works better with amino acid sequences. FASTA is much slower than Blast, so you are encouraged to run it as a batch job by using the command line option -BAT. This frees your system for other work while the program runs. The program will run in the background or at another time, even if you log off the computer. To run, type fasta -BAT. The program prompts you for all required parameters that you didn't specify on the command line. After the prompts, the program submits itself to the batch queue. You can type fasta <filename> -Default -BAT and all of the prompted questions will be answered using the defaults. After a batch job has been executed, you will receive an e-mail message notifying you of its completion. The output file will be in the directory from which you submitted the program.

Bestfit:
Bestfit compares two sequences and finds the best segment of similarity between them. If the relationship between two sequences is weak or unknown, Bestfit is the best tool in the Wisconsin Package for identifying the regions of similarity between the two sequences. This is done by introducing gaps into the sequences in order to maximize the number of matches and obtain an optimal alignment. The goodness of the alignment is determined according to the local homology algorithm of Smith and Waterman. To run, type bestfit, and answer the prompted questions. The gap extension penalty of 50 is high, but use it as the default until further notice. It would be a good idea to end the output name with ".bestfit". The output shows the actual alignment found by Bestfit.
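The idea behind the Smith and Waterman local alignment can be sketched in a few lines of Python. This is a simplified illustration, not Bestfit itself: the function name and the match, mismatch, and gap values are hypothetical, and a single per-position gap penalty is assumed in place of separate gap creation and extension penalties.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best local alignment between a and b.

    Scores are never allowed to fall below zero, so the best-scoring
    segment can start and end anywhere in either sequence.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # Only the shared segment ACG contributes to the score.
    print(local_alignment_score("TTTACGTTT", "GGGACGGGG"))
```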

Compare:
Compare compares two protein or nucleic acid sequences and creates a file of the coordinates of the points of similarity between the sequences for plotting with DotPlot ("Dot-plotting is the best method in the Wisconsin Package(TM) for comparing two sequences when you suspect that there could be more than one segment of similarity between the two"). To run Compare, type compare. You will enter the two sequences to compare. Use the default values everywhere except for the output filename, which you may want to end with ".compare". The output is designed to be read by dotplot.
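One common way to find points of similarity for a dot-plot is to slide a window along both sequences and record a point wherever enough positions in the two windows match. The Python sketch below illustrates that idea only; the function name and the window and stringency parameters are hypothetical stand-ins, and Compare's actual scoring and defaults are described in the Genhelp manual.

```python
def similarity_points(a, b, window=3, stringency=3):
    """Return 1-based (i, j) pairs where windows of a and b are similar.

    A point is recorded when at least `stringency` of the `window`
    positions starting at a[i] and b[j] are identical.
    """
    points = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(x == y for x, y in zip(a[i:i + window], b[j:j + window]))
            if matches >= stringency:
                points.append((i + 1, j + 1))
    return points


if __name__ == "__main__":
    print(similarity_points("ACGTT", "TACGA"))
    print(similarity_points("ACGTT", "TACGA", window=3, stringency=2))
```

Lowering the stringency relative to the window adds more points to the plot, which helps with weak similarities at the cost of more background noise.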

Dotplot:
Dotplot makes a dot-plot (graph) of the output from Compare. It is a good tool for helping you to visualize the segments of similarity between two sequences. Before running dotplot, you will first need to set up the graphics by typing setplot. (See Setplot below. Once Setplot has been configured for the first time, you will still need to set up the graphics again each time you log in. The initial and subsequent graphics setups are described below.) To run dotplot, type dotplot. The input file is the output file from compare. Use the default density.

Setplot:
Setplot allows you to set up the graphics configuration in one step. The first session using setplot will be used to set up the Xwindows and Postscript graphics devices.
1. After starting GCG, type setplot.
2. Enter C to create a new device.
3. Fill in the necessary information. Tab to move through the fields, and use the E command to enter the first 3 items:
Unique Abbreviation: psfile
Description: Write to a postscript file
Port/Queue/file: $program$-$time$.ps
Languages: Move by up/down arrow to the item PostScript
Devices: Move by up/down arrow to the item EPSF
4. Type M to return to the main menu and choose S to save changes. Now you have created a PostScript output device.
5. Repeat steps 2 through 4 to create another output device for X-windows.
Unique Abbreviation: xwin
Description: X terminal
Port/Queue/file: xterm
Languages: Move by up/down arrow to the item Xwindows
Devices: Move by up/down arrow to the item color
6. You have now created an X-windows and a PostScript output device.
After this initial session, set up the graphics before running any GCG graphics program by typing setplot. Use the up/down arrows to highlight the device of your choice, then press return. If you are not on a Sun station, choose psfile. This will write the results to a PostScript file to be printed. If you are on a Sun station and would like to view the plot before printing, choose xwin.

Gap:
Gap is used to find the global alignment of two complete sequences by maximizing the number of matches while minimizing the number of gaps. Use Gap when you want an alignment of the entire length of the sequences. To run, type gap. You will be asked for the names of the sequences to be aligned. You can align two nucleotide sequences or two protein sequences; if there is a problem, check the Type of each sequence. Use the default gap creation and extension penalties. Choose the output name. The output will give you the gap alignment of the two sequences.
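A global alignment scores the two sequences from end to end, so unmatched ends are penalized as gaps instead of being ignored. The Python sketch below shows the classic dynamic programming recurrence for this (the Needleman and Wunsch approach); it is an illustration, not Gap's code. The function name and the match, mismatch, and gap values are hypothetical, and one per-position gap penalty stands in for separate creation and extension penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best end-to-end alignment of a and b."""
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # ACGT against AGT needs one gap: three matches (+3) and one gap (-2).
    print(global_alignment_score("ACGT", "AGT"))
```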

Pileup:
Pileup is used to create an alignment of a group of related sequences, aligning the whole length of the sequences. The alignment begins by aligning the two most similar sequences, forming a cluster, and then continues by aligning the sequence (or cluster of sequences) most similar to this cluster, forming a new cluster. This process is repeated until all of the sequences have been aligned. The sequences to be aligned using Pileup must be named in a list file. Be sure that you are aligning sequences of the same Type, either all nucleotide or all protein. To make a list file you can use a text editor such as pico. Type pico, and the window for the editor will appear. List the names of the sequences to be aligned in the window and name the file, ending the filename with ".list". (You can name the file as you exit the editor with <CTRL> + X, or by saving without exiting with <CTRL> + O. To continue with Pileup, you will need to exit the text editor.) To run pileup, type pileup. When asked for the sequences, type @<filename of list file>.list. Use the default gap creation and extension penalties. You will probably want to plot the output, so choose A)Plot. Choose the default density and choose an output filename ending with .msf. (This file can be used as the input file for the Distances program.)

Codonpreference:
Codonpreference is a gene prediction program that tries to find the protein coding regions of a gene by comparing how often each codon is used to encode its amino acid against a codon frequency table. If the usage of codons is not randomly distributed, this is an indication that the region of the sequence codes for a gene. Begin by setting the graphics device using setplot, if this has not already been done during this session on the computer. Type codonpreference. The sequence used should be a nucleotide sequence. Choose Yes to run codonpreference on the reverse strand as well as on the direct strand. For the codonfrequency file we will use GenMoreData:human_high.cod (if using human hemoglobin). Use defaults for the other questions.
The output will be a graph. The boxes below the plot indicate the open reading frames for that translation frame. The short lines extending above a box are the potential start codons. The short lines extending below a box indicate potential stop codons. The short lines below the open reading frame plot indicate rare codons. We are looking for portions of the graph, specifically within an open reading frame, that fall below or above the threshold (indicated by two horizontal dot/dash lines in the graph). The dotted graphed line represents the bias, while the solid graphed line represents codonpreference. To find a protein coding region, look at the area graphed above an open reading frame that has few rare codons. You can compare your results to the CDS coordinates in the GenBank record.

Frames:
You need to set your graphics device before running Frames if you have not already done so during this session on the computer. Frames, like Codonpreference, is a gene prediction program. Frames shows the open reading frames for all six translation frames of the DNA sequence. The input file should be a nucleotide sequence. The output file is a graph with the open reading frames indicated by boxes. By default the potential start and stop codons, which are indicated by short lines extending above and below the boxes, are at the beginning and end of the boxes, respectively. The dots above each translation frame represent rare codons.
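The six translation frames are the three reading frames of the direct strand plus the three of its reverse complement. The Python sketch below lists open reading frames in each; it is an illustration and not the Frames program. The function names are hypothetical, an open reading frame is assumed to run from an ATG to the first in-frame stop codon, and coordinates for the minus frames are counted along the reverse-complemented strand.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """List (start, end) of ORFs in one frame (frame is 0, 1, or 2).

    Positions are 1-based and inclusive, from the A of ATG through the
    last base of the stop codon.
    """
    orfs = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            orfs.append((start + 1, i + 3))
            start = None
    return orfs


def six_frame_orfs(seq):
    """Map frame labels (+1..+3, -1..-3) to the ORFs found in each."""
    seq = seq.upper()
    rev = reverse_complement(seq)
    result = {}
    for frame in range(3):
        result[f"+{frame + 1}"] = orfs_in_frame(seq, frame)
        result[f"-{frame + 1}"] = orfs_in_frame(rev, frame)
    return result


if __name__ == "__main__":
    print(six_frame_orfs("CCATGAAATAGCC"))
```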

Fetch:
Fetch is used to copy GCG files or sequences from the GCG database to your working directory. To get ("fetch") a file from the GCG database and put it into your directory, just type fetch. You will be asked for the filename.

Map:
Map takes a DNA sequence and maps it, showing (above the sequence) all of the restriction enzyme cut points for the direct strand (unless otherwise specified) of the sequence. It also provides a protein translation of each reading frame below the sequence. In order to get the most out of this program, you will need to be familiar with the file enzyme.dat ("fetch" this file to your working directory), Appendix VII, and Appendix III. A brief explanation of each is provided below, but you should take the time to look at these files yourself. In order to run map, you will need to set up graphics ("setplot"). To run, type map. You will specify the sequence to be mapped, the enzyme(s) used to cut (you can use more than one enzyme per session), the reading frames to translate, and the name of the output file. The output file will have both the direct and reverse strands of the sequence, with each enzyme that cuts the direct strand named above the sequence and a protein translation of all reading frames below the sequence. The cut site is designated by a line directly above a base. The actual cut will occur after this base (to be certain, double check by looking the enzyme up in the enzyme.dat file). At the end of the file, there is a list of the enzymes that did cut the sequence and a list of the enzymes that were specified when running the program but did not cut. If you want the program to label the cuts on both the direct and reverse strands, type map -BOT at the command line.

Appendix III:
Appendix III outlines all of the characters that GCG accepts in sequences. It gives a full list of symbols that represent nucleotide bases, their complements, and their Cambridge equivalents. Symbols other than the usual A, C, G, T are possible. A list of the standard one-letter amino acid codes and their three-letter equivalents is also included. Knowing these symbols will help you understand the notation used in the enzyme.dat file of restriction enzyme cut sites. You should view this file in either the online Genhelp or the Unix Genhelp (by typing genhelp when logged on to GCG).

Appendix VII:
This Appendix contains datafile descriptions from GCG. We are interested in the restriction enzyme datafile, enzyme.dat.

Enzyme.dat datafile:
This datafile contains a list of all of the restriction enzymes used by Map, Mapsort, and Mapplot along with their cut sites. If you are going to cut a DNA sequence with a restriction enzyme, you can look this enzyme up in the enzyme.dat file to find out where on the sequence it will cut. For example, say you want to use the enzyme AsuI. In the enzyme.dat file, you will find this enzyme name followed by G'GnC_C. The " ' " tells you where the enzyme will cut the sequence on the direct strand. That is, on the direct strand of the sequence, anywhere that the sequence segment GGnCC occurs (reading from the 5' end to the 3' end on the direct strand; from left to right), this enzyme will cut the sequence after the first G. In Appendix III the "n" can be found to mean G or A or T or C. The "_" shows the cut position for this enzyme on the reverse strand of the sequence. So, wherever the complementary segment CCnGG (reading from left to right; 3' end to 5' end on the reverse strand) occurs on the reverse strand, this enzyme will cut between the fourth and fifth bases, that is, just before the final G.
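The notation can be read mechanically, as the following Python sketch shows for a recognition pattern such as G'GnC_C. It is an illustration only and does not read the enzyme.dat file itself. The function names are hypothetical, the ambiguity symbols are the standard IUPAC codes, and treating a pattern without "_" as cutting both strands at the same position is an assumption.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def parse_site(pattern):
    """Split a pattern like G'GnC_C into (bases, top_cut, bottom_cut).

    Each cut value is the number of pattern bases to the left of the mark.
    """
    bases = ""
    top = bottom = None
    for ch in pattern:
        if ch == "'":
            top = len(bases)
        elif ch == "_":
            bottom = len(bases)
        else:
            bases += ch.upper()
    if top is None:
        raise ValueError("pattern has no ' cut mark")
    if bottom is None:
        bottom = top  # assumption: no "_" means both strands cut together
    return bases, top, bottom


def find_cuts(seq, pattern):
    """Return (top_cut, bottom_cut) pairs for every site in seq.

    Each value is the number of direct-strand bases to the left of the cut.
    """
    bases, top, bottom = parse_site(pattern)
    # A lookahead is used so that overlapping sites are all reported.
    regex = re.compile("(?=(" + "".join(IUPAC[b] for b in bases) + "))")
    return [(m.start() + top, m.start() + bottom) for m in regex.finditer(seq.upper())]


if __name__ == "__main__":
    print(parse_site("G'GnC_C"))
    print(find_cuts("AAGGACCTT", "G'GnC_C"))
```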

Mapplot:
Mapplot constructs a graphical display of the restriction map. It graphically displays the site, cut position, and total number of cuts for each enzyme. In order to run Mapplot, you will need to set up graphics (setplot). To run, type mapplot. You will input the sequence to be mapped, the range within the sequence to be mapped, and the restriction enzymes to cut. The output will consist of one horizontal line, which is representative of the sequence, for each restriction enzyme. The cut sites for that enzyme are represented by slash marks on the line at the cut position. The total number of cuts for the restriction enzyme is at the end of the line just before the restriction enzyme name.

Distances:
Distances creates a matrix (table) of the pairwise distances within a group of aligned sequences. The output file from Pileup will be the input file for Distances. The output from Distances will be the input for Growtree. To run, type distances. When asked for the aligned sequences, type the filename from Pileup (ending in .msf) followed by {*}. For example, if you named the output file from Pileup hum_gtr.msf, then your input file for Distances will be hum_gtr.msf{*}. Use the default distance correction method, 3 Kimura protein distance. End the output filename with .distances.
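The Kimura protein distance corrects the observed fraction of differing positions, p, for multiple substitutions at the same site using d = -ln(1 - p - 0.2 p^2). The Python sketch below applies that formula to one pair of aligned sequences. It is an illustration, not the Distances program: the function name is hypothetical, skipping columns that contain a gap character is an assumption, and the result is in substitutions per site, so the scaling may differ from what Distances prints.

```python
import math


def kimura_protein_distance(a, b, gap_chars=".-~"):
    """Kimura-corrected distance between two aligned protein sequences."""
    pairs = [
        (x, y)
        for x, y in zip(a.upper(), b.upper())
        if x not in gap_chars and y not in gap_chars
    ]
    if not pairs:
        raise ValueError("no aligned positions to compare")
    p = sum(x != y for x, y in pairs) / len(pairs)
    inner = 1 - p - 0.2 * p * p
    if inner <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)


if __name__ == "__main__":
    print(kimura_protein_distance("MKTAYIAK", "MKSAYLAK"))
```

Computing this value for every pair of sequences in the alignment gives the kind of matrix that a tree-building step takes as input.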

Growtree:
Growtree takes the output file from Distances and creates a phylogenetic tree. Two methods can be used to do this, either the neighbor-joining or the UPGMA method. Be sure to set up graphics before beginning (setplot). Type growtree. The input file is the distances output file. Choose the method to use and the output filename. The output consists of two separate files: a text file with the filename ending in .trees, and a figure file containing the tree diagram (growtree.figure).
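Of the two methods, UPGMA is the simpler to sketch: it repeatedly joins the two closest clusters and averages their distances to every other cluster. The Python sketch below illustrates the method on a small distance matrix and is not Growtree's code. The function name upgma and the nested-tuple tree representation are hypothetical choices made for this example.

```python
def upgma(names, matrix):
    """Build a UPGMA tree from a symmetric distance matrix.

    Returns nested tuples (left, right, height); a leaf is just its name.
    """
    # Active clusters: id -> (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {(i, j): matrix[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)

    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        height = dist[(i, j)] / 2
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)

        # Distance from the merged cluster to each remaining cluster is
        # the size-weighted average of the two old distances.
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)

        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j, height), n_i + n_j)
        next_id += 1

    (tree, _size), = clusters.values()
    return tree


if __name__ == "__main__":
    print(upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]]))
```

UPGMA assumes that all lineages change at the same rate; neighbor-joining does not make that assumption, which is one reason to try both methods on the same distances.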

Peptidestructure:
Peptidestructure is a program that makes predictions for the secondary structure of the peptide sequence. The output file from Peptidestructure is used to make a graphical presentation using Plotstructure. The input file for Peptidestructure must be a single protein sequence.

Plotstructure:
Plotstructure takes the results from Peptidestructure and makes a graphical display. The measures of the protein secondary structure predicted by Peptidestructure can be displayed on parallel panels of a graph or with a two-dimensional presentation. Set up graphics using setplot. Type plotstructure. The input file must be the output file from Peptidestructure. Choose the type of graph. The output will be in a file named "plotstructure-(some numbers).ps". In the panel output, the horizontal line across the surface probability panel at position 1.0 on the y-axis indicates the expected surface probability calculated for a random sequence. Any value above this line indicates an increased probability of being found on a protein surface. In the two-dimensional plot, helices are shown with a sine wave, beta-sheets with a sharp saw-tooth wave, turns with 180 degree turns, and coils with a dull saw-tooth wave. Whenever hydrophilicity, surface probability, flexibility, or antigenic index exceeds a certain threshold, special symbols are superimposed over the wave, with the size of the symbol proportional to the value of the attribute.

Motifs:
Motifs is used to find interesting regions of proteins.

Profilescan:
ProfileScan uses a database of profiles to find structural and sequence motifs in protein sequences.