PIMA Help


NAME
     pima - Pattern-Induced Multi-sequence Alignment program

SYNOPSIS
     pima [options] cluster_name seq_filename
        [ref_seq_name sec_struct_seq_filename]

EXAMPLES
     pima SAMPLE sample-family.fa
     pima SAMPLE-STRUCT sample-struct.fa 1ldm pdb-dssp.ss

DESCRIPTION
     pima  performs  a  multi-sequence  alignment  of  a  set  of
     (presumably  related)  sequences  using  an extension of our
     covering pattern construction  algorithm  (Smith  and  Smith
     1990,  1992).  All pairwise comparisons between sequences in
     the set are performed and  the  resulting  scores  clustered
     into one or more families using two different linkage
     rules: 1) maximal linkage (Smith and  Smith,  1990)  and  2)
     sequential  branching  (see Smith and Smith, 1992).  For the
     latter, all pairwise  scores  are  sorted  high-to-low,  the
     first  sequence  from  the highest scoring pair is chosen as
     the "reference sequence", and the sequences clustered  based
     strictly  on  the  order  of  similarity  to  the  reference
     sequence.  Each cluster is  then  multiply-aligned  using  a
     pattern-based  alignment  algorithm (Smith and Smith, 1992).
     Patterns are constructed using one  of  two  extended  amino
     acid alphabets (see below).
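
     The sequential branching rule described above can be sketched
     as follows.  This is only an illustration of the ordering step,
     not the code used by pima; the function name and the
     dictionary-of-pairs input are assumptions made for the example.

```python
def sequential_branching_order(scores):
    """Return (reference, ordered_names) for a dict of pairwise scores.

    `scores` maps (name_a, name_b) tuples to a similarity score.
    Hypothetical helper: it only illustrates the rule in the text.
    """
    # Sort all pairwise scores high-to-low.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # The first sequence of the highest scoring pair is the reference.
    (reference, _partner), _best = ranked[0]
    # Collect every other sequence's score against the reference.
    to_reference = {}
    for (a, b), score in scores.items():
        if a == reference:
            to_reference[b] = score
        elif b == reference:
            to_reference[a] = score
    # Order strictly by similarity to the reference sequence.
    ordered = sorted(to_reference, key=to_reference.get, reverse=True)
    return reference, ordered


example_scores = {
    ("A", "B"): 200.0, ("A", "C"): 150.0, ("B", "C"): 180.0,
    ("A", "D"): 90.0, ("B", "D"): 60.0, ("C", "D"): 70.0,
}
reference, ordered = sequential_branching_order(example_scores)
```

     Note that B and C score 180.0 against each other, yet C is
     still placed by its score against the reference A (150.0).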

     If secondary structure sequences are  provided  for  one  or
     more  of  the primary sequences (one of which must be desig-
     nated as a "reference  sequence")  then  the  sequences  are
     clustered  using the sequentially branching rule and the set
     multiply-aligned using a secondary structure-dependent gap
     penalty algorithm (Smith and Smith, 1992).


     Original Amino Acid Class Hierarchy Alphabet (Class1  alpha-
     bet):


                       Amino Acid Classes                     Match score

                                                                  -2
                _______________ X __________________               0
               /          /           \             \
            _ f _        /       ______r _______     \             1
          /  /    \     /       /   /     \     \     \
         /  c      \   e       /   m       p     \   _ j __        2
        /  /  \     \ / \     /   / \     / \     \ /   \  \
       /  a    b     d   \   /   l   k   o   n     i     h  \      3
      /  / \  / \   /|\   \ /   / \ / \ / \  /\   / \   / \  \
     C   I V  L M  F W Y   H   N   D   E  Q  K R  S T   A G   P    5


     New 83 Character Pattern Alphabet (Patgen alphabet):

     We have recently developed  an  alternate  pattern  alphabet
     that  includes  the  standard  IUPAC  codes for the 20 amino
     acids plus additional  characters  for  63  combinations  of
     amino-acids.  These combinations provide the  highest amount
     of information (i.e., most abundant as  compared  to  random
     expectation)  observed  in  our database of aligned sequence
     families (Ladunga I, Wiese B, and Smith RF, In preparation):

        J IV   f LV    n AV   t QK   1 PT   8 AE   ( QP    ; NH    _ NE
        U RK   h AG    Z QE   u RQ   2 NG   9 AL   ) AST   < QS    { IF
        a DE   i ILV   o AT   v DG   3 QH   ! NT   * ILM   ? QL    | SV
        b IL   j LF    p PS   w LP   4 LS   # ES   + KT    @ MV    } RP
        c FY   B ND    q NS   y EG   5 TV   $ IT   , GP    [ EP    ~ RH
        d ST   k LM    r AP   z RG   6 HY   % DS   / KS    ] AGS   . GK
        e AS   m GS    s EK   0 NK   7 IM   & RS   : LT    ^ GT    X (wildcard)

     For both alphabets, gaps are denoted by "g"s.
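
     As an illustration of how a Patgen pattern character stands
     for a set of amino acids, the sketch below expands a few of
     the characters from the table above.  Only a subset of the 63
     combinations is included, and the function names are
     hypothetical; this is not how pima itself stores the alphabet.

```python
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")

# A subset of the Patgen table, copied from the listing above.
PATGEN_SUBSET = {
    "J": "IV", "U": "RK", "a": "DE", "b": "IL", "c": "FY",
    "d": "ST", "e": "AS", "i": "ILV", ")": "AST", "*": "ILM",
    "]": "AGS",
}


def expand(symbol):
    """Return the set of amino acids that a pattern character covers."""
    if symbol == "g":            # gaps are denoted by "g"
        return frozenset()
    if symbol == "X":            # wildcard
        return AMINO_ACIDS
    if symbol in PATGEN_SUBSET:
        return frozenset(PATGEN_SUBSET[symbol])
    if symbol in AMINO_ACIDS:    # standard one-letter code
        return frozenset(symbol)
    raise KeyError(f"character not in this illustrative subset: {symbol!r}")


def matches(symbol, residue):
    """True if `residue` is covered by the pattern character `symbol`."""
    return residue in expand(symbol)
```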

PARAMETERS
     cluster_name
           An arbitrary name used to label the cluster.

     seq_filename
           Name of the input file containing the sequences to  be
           clustered  and multi-aligned.  Sequences can be in any
           of the following  formats:   IG/Stanford,  GenBank/GB,
           NBRF,    EMBL,    Pearson/Fasta,   PIR/CODATA,   Table
           (LOCUS_NAMESEQUENCE [one seq/line]).  LOCUS_NAMES
           cannot contain left or right parentheses.  The format
           of the output sequence files will match the format  of
           this input file.

     ref_seq_name
           [optional; if specified, then  sec_struct_seq_filename
           must also be specified]. Locus name of one of the pri-
           mary sequences for which the secondary structure is in
           the file sec_struct_seq_filename.

     sec_struct_seq_filename
           [optional; if specified, then ref_seq_name  must  also
           be  specified]  Name  of  a  file containing secondary
           structure sequences for one or  more  of  the  primary
           sequences   in   the  set.   The  secondary  structure
           sequences in this file must be in one of  the  formats
           listed above (see seq_filename, above).  The locus
           name of each sequence must be the locus name of its
           corresponding primary sequence with the suffix '.ss'
           (e.g. 1ldm.ss).  An alpha-helix, 3-10 helix, and
           beta-strand must be designated 'h', 'g', and 'e',
           respectively.  All other characters in the secondary
           structure sequences will be ignored with respect to
           the structure-dependent gap penalty.  To allow
           gaps to be placed between the first and second elements
           and between the second-to-last and last elements of these
           structures, the first and last 2 elements of each should
           be changed to another character designation.  In the
           secondary structure sequence file pdb-dssp.ss provided
           with this package, these end cap elements are designated
           'i', 'f', and 'd' for alpha-helices, 3-10 helices, and
           beta-strands, respectively.
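
     The end cap convention can be applied to a secondary structure
     string with a short script such as the one below.  It is a
     sketch that assumes "first and last 2 elements" means the
     first two and the last two characters of each run, and that a
     run too short to keep an interior is capped entirely; both are
     assumptions, as is the function name.

```python
import re

# End cap characters used in pdb-dssp.ss, as described above.
END_CAPS = {"h": "i", "g": "f", "e": "d"}


def cap_ends(sec_struct, n=2):
    """Replace the first and last `n` characters of each h/g/e run."""
    def recap(match):
        run = match.group(0)
        cap = END_CAPS[run[0]]
        if len(run) <= 2 * n:
            # Assumption: short runs become end caps throughout.
            return cap * len(run)
        return cap * n + run[n:-n] + cap * n

    # Each alternative matches one uninterrupted run of a single type.
    return re.sub(r"h+|g+|e+", recap, sec_struct)


capped = cap_ends("--hhhhhh--eeeee-ggg-")
```

     Characters other than h, g, and e are left untouched, since
     they are ignored by the structure-dependent gap penalty.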


OPTIONS
     -c number       Use a cluster score cutoff of  number.  This
                    is  the  lowest  match  score  to  be used to
                    incorporate a sequence into a  cluster.   The
                    default  value  of  0.0  will force all input
                    sequences into 1 cluster, but the final  pat-
                    tern may be completely degenerate.

     -d number      Use a length dependent gap penalty of number.
                    This  is  the  cost  of extending a gap.  The
                    default value is dependent on the matrix file
                    used.

     -h              This option will print a short help  message
                    and quit.

     -i number       Use a  length  independent  gap  penalty  of
                    number.  This  is  the cost of opening a gap.
                    The default value is dependent on the  matrix
                    file used.

     -l number       Use minimum local score of number.  This  is
                    the  lowest  score a quadrant can have before
                    an attempt is made to join this local  align-
                    ment with the local alignment at the previous
                    step.  The default value is dependent on  the
                    matrix file used.

     -m file         Use matrix file  with  the  name  file.  The
                    default matrix (class1.mat) uses the origi-
                    nal amino acid class hierarchy alphabet.  The
                    matrix  file patgen.mat uses the new 83 char-
                    acter pattern alphabet.

     -n              Do not use numerical extensions on each step
                    of the alignment.

     -t number       Use a secondary  structure  gap  penalty  of
                    number.  This is the cost of a gap at a posi-
                    tion   matching   a    secondary    structure
                    character.  The default value is dependent on
                    the matrix file used and is always  10  times
                    the  value  of  the  length  independent  gap
                    penalty of the matrix file.

     -u characters   Use characters  as  the  list  of  secondary
                    structure  characters  instead of the default
                    characters of hge.

     -w number       Use a minimum local alignment width of number
                    instead of the default 15.  A quadrant with a
                    width less than this value is ignored, and no
                    attempt is made to join this local alignment
                    with the local alignment at the previous step.

     -M              Only perform maximal linkage.   This  option
                    will  also  drop the -ML from the output file
                    names.

     To see the default values for a given matrix, run the program
     pima-pm and enter the name of the matrix for which you want to
     see the default values.  Hit return until you see the default
     value of the parameter you are interested in, and then
     interrupt (control-C) the program.
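
     When pima is driven from a script, the argument order in the
     SYNOPSIS can be assembled as shown below.  The helper name is
     hypothetical and the cutoff value 50.0 is purely illustrative;
     only the flags documented above are used.

```python
def build_pima_command(cluster_name, seq_filename,
                       ref_seq_name=None, sec_struct_seq_filename=None,
                       options=()):
    """Build an argument list following: pima [options] cluster_name ..."""
    # ref_seq_name and sec_struct_seq_filename must be given together.
    if (ref_seq_name is None) != (sec_struct_seq_filename is None):
        raise ValueError("ref_seq_name and sec_struct_seq_filename "
                         "must be specified together")
    command = ["pima", *options, cluster_name, seq_filename]
    if ref_seq_name is not None:
        command += [ref_seq_name, sec_struct_seq_filename]
    return command


command = build_pima_command(
    "SAMPLE-STRUCT", "sample-struct.fa", "1ldm", "pdb-dssp.ss",
    options=["-m", "patgen.mat", "-c", "50.0"],  # illustrative values
)
```

     The resulting list can be passed to subprocess.run(command)
     on a system where pima and its auxiliary programs are
     installed.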

OUTPUT FILES CREATED
     cluster_name[-ML|-SB][.ext].cluster
           The cluster tree(s) created by the clustering
           algorithm(s): maximal linkage clusters are labeled
           with '-ML' appended to the cluster_name; sequential
           branching clusters are labeled '-SB'. If more than one
           cluster is generated from the input sequence set, each
           cluster  is  given  an  extension  (cluster_name-ML.1,
           cluster_name-ML.2, etc).  Each cluster  in  a  cluster
           file  is  represented  as  a nested list with sequence
           names separated by a match score, e.g.:
           CLUSTER_NAME-ML((A 200.0 B) 150.0 C)
           File format: cluster_name-[ML|SB][.ext]cluster_nested_list
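
     A nested list such as the one in the example above can be read
     back into a tree with a small recursive parser.  The sketch
     below assumes that every node joins exactly two children, as
     in the example, and that the cluster name ends at the first
     left parenthesis; the function names are hypothetical.

```python
import re


def parse_cluster_line(line):
    """Return (cluster_name, tree); a tree is a name or (left, score, right)."""
    name, paren, rest = line.partition("(")
    if not paren:
        raise ValueError("no nested list found")
    # Tokens are parentheses or runs of non-space, non-parenthesis text.
    tokens = re.findall(r"[()]|[^\s()]+", "(" + rest)
    position = 0

    def node():
        nonlocal position
        token = tokens[position]
        position += 1
        if token != "(":
            return token                      # a sequence name (leaf)
        left = node()
        score = float(tokens[position])       # match score joining the two
        position += 1
        right = node()
        if tokens[position] != ")":
            raise ValueError("expected ')'")
        position += 1
        return (left, score, right)

    return name.strip(), node()


def leaves(tree):
    """List the sequence names of a parsed tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, _score, right = tree
    return leaves(left) + leaves(right)


name, tree = parse_cluster_line("CLUSTER_NAME-ML((A 200.0 B) 150.0 C)")
```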

     cluster_name[-ML|-SB][.ext].pattern
           The "root" AACC pattern constructed from each cluster.
           File format: cluster_name-[ML|SB][.ext]AACC_sequence

     cluster_name[-ML|-SB][.ext].pima
           The pattern-induced multiple-sequence alignment of
           each clustered sequence set; includes the "nodal"
           patterns used to align the sequences.  The nodal
           patterns have the locus name cluster_name-[ML|SB].ext;
           extensions added to the sequence names match the
           extension of the nodal pattern used to align the
           corresponding sequence subset, e.g. seq_1-ML.1 and
           seq_2-ML.1 would be aligned by nodal pattern
           cluster_name-ML.1.
           File format: the same as that of the input sequence
           file, seq_filename.
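
     The naming scheme for the three output files can be summarized
     in code.  This sketch only composes names from the templates
     listed above; the helper name is hypothetical, and passing
     linkage=None models the -M option, which drops the -ML label.

```python
def output_file_names(cluster_name, linkage=None, ext=None):
    """Compose cluster_name[-ML|-SB][.ext] plus each output suffix."""
    if linkage not in (None, "ML", "SB"):
        raise ValueError("linkage must be 'ML', 'SB', or None")
    stem = cluster_name
    if linkage is not None:
        stem += "-" + linkage
    if ext is not None:
        # Used when more than one cluster is generated.
        stem += "." + str(ext)
    return [stem + suffix for suffix in (".cluster", ".pattern", ".pima")]


names = output_file_names("SAMPLE", "ML", 1)
```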

REQUIRED AUXILIARY PROGRAMS/SCRIPTS/FILES
     Programs: cluster-pima, pima-mso, pima-pm,  extract-cluster-
     loci,   extract-records,   extract-root-pat,  print-cluster,
     trim-root-num, print-pima, make-cluster, make-pattern
     Files: class1.mat, patgen.mat

NOTES
     Only minimal  sequence  information  is  maintained  by  the
     sequence input and output routines.  Additionally, not every
     aspect of the  various  sequence  file  formats  is  handled
     correctly.   If in doubt, please use sequence files that are
     in Fasta or table format.
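
     Sequences held in another form can be written out in Fasta
     format before running pima, for example with the sketch below.
     The function name and the 60-character line width are
     assumptions; the parenthesis check reflects the restriction on
     locus names stated under seq_filename.

```python
def write_fasta(records, handle, width=60):
    """Write (name, sequence) pairs to `handle` in Fasta format."""
    for name, sequence in records:
        # Locus names cannot contain left or right parentheses.
        if "(" in name or ")" in name:
            raise ValueError(f"parentheses not allowed in name: {name!r}")
        handle.write(f">{name}\n")
        # Wrap the sequence to a fixed line width.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")
```

     For example, calling write_fasta([("seq_1", "ACDEFG")], out)
     with an open text file out writes a header line ">seq_1"
     followed by the sequence.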

REFERENCES
     Smith, Randall F. and Smith, Temple  F.  (1990).   Automatic
     generation of primary sequence patterns from sets of related
     protein sequences.  PNAS 87:118-122.

     Smith, Randall F. and Smith, Temple F. (1992).  Pattern-
     Induced Multi-sequence Alignment (PIMA) algorithm employing
     secondary structure-dependent gap penalties for comparative
     protein modelling.  Protein Engineering 5:35-41.


     Randall F. Smith
     Human Genome Center, Dept. of Molecular and Human Genetics,
     Baylor College of Medicine, Houston TX  77096
     rsmith@bcm.tmc.edu

     Temple F. Smith
     Molecular Bio-Engineering Research Center
     Boston Univ.,  36 Cummington St, Boston, MA 02115
     tsmith@darwin.bu.edu

     Copyright (c) 1990, 1991, 1992, MBCRR, Dana-Farber Cancer Institute and Harvard University.
     Copyright (c) 1993, 1994, Baylor College of Medicine.

