- Build a Position Weight Matrix (PWM)    The user submits
a binding site alignment
file.
This file is input to the hmmbuild command from the HMMER
software suite, to build the
position weight matrix.
Only the emission parameters in the file are used in the search.
These are the four numbers in lines starting with 1,2,3 etc.
The sixth column in these lines is the position in the multiple alignment
which these parameters represent. For example, in the following position
weight matrix:
PWM_column A C G T alignment_column
1 -231 -514 1472 -1392 1
2 568 158 834 -4321 2
3 -4321 273 2016 -1959 3
4 -1106 2016 -304 -4321 4
5 1357 -4323 -370 -4321 5
6 -93 560 1305 -4321 6
7 -4321 1915 690 -2407 7
8 -300 1801 -1414 -2189 9
9 -1229 2152 -1046 -4321 10
10 -4321 2486 -4323 -4321 11
11 1555 -4323 -4323 -4321 12
12 -126 -4323 -551 802 13
13 -1859 -4323 2181 -1399 14
14 -4321 2486 -4323 -4321 15
15 1555 -4323 -4323 -4321 16
16 -1612 1168 -4323 602 18
17 -2636 1020 83 348 19
generated from the alignment
GAGCAAGTCCCATGCAATT
TAGCAGGTCCCATGCACTT
AAGCAAGTCCCATGCAATG
GGGCAAGTCCCATGCAATA
AAGCAGTTCCCATGCAGAC
AAGCAGTTCCCATGCAGAC
AAGCACCTCCCATGCAGAC
GAGAAGCGCCCAAGCATTG
GCGCAACCACCAGTCATTT
GCGCAACCACCAGTCATTT
GGGCAGCGAACAAGCATCT
CAGCAGGCGCCATGCAACC
GCGCAACCACCAGTCAGCT
GGTGAGCCCCCAAGCATCT
ACGCAACACCCATACACTC
AGCAGCCGCCCATGCATTG
CAGCAGGATACATGCAACC
TGCGGCCGCCCATGCACTC
GACCACCAAGCAAGCACCT
we can see that there are 19 columns in the alignment but only
17 columns (displayed inverted as rows) in the position weight matrix.
Columns 8 and 17 HMMER judged too close to backgound frequency to
extract a log-odds scoring column. If this position weight matrix
were used in an actual search, the nucleotides in positions 8 and 17
would be ignored and merely used as placeholders.
There can be poorly conserved columns anywhere in the alignment, but
for a successful search, the alignment must contain at least 6
well conserved columns.
hmmbuild then uses a tree-based sequence weighting scheme to
calculate the weighted counts of nucleotides at each position
in the multiple alignment. Background frequency priors are added to
each count at each position. These are the frequencies of nucleotides
in the non-exonic regions of C. elegans and C. briggsae genomes to be searched.
(Using current GFF files to determine these regions, the frequencies
are 34.23% G/C and 65.77% A/T content.)
Then the ratio with the background nucleotide frequency of the sequence
being searched is taken, and the log2 is taken of this ratio,
giving the 'bits'. The resulting PWM is the 4 x n matrix in
which a cell i,j is the bits score of nucleotide i
at sequence position j.
If the bits are too low in flanking positions (resulting from a
position in the alignment having close to background frequency),
hmmbuild automatically excludes these positions from the PWM.
Because of this, if the user is unsure where the region of conservation ends,
it is best to err on the side of including unconserved flanking
regions, as this won't affect the results. For details, see the HMMER
UserGuide.
- Isolate and Classify non-exonic sequence   Using
GFF
 (General Feature Format) exon annotation files, the program
snip.plx isolates all non-exonic
sequence from either C. elegans or C. briggsae genomic
sequence.
(See this illustration).
Sequences are classified as
- 3' intergenic   between the 3' ends of two genes on opposite strands
- 5' intergenic   between the 5' ends of two genes on opposite strands
- 3'/5' intergenic   between two genes on the same strand
- intronic#   between exons of the same gene (though not necessarily the same splice variant)
Note: if the gene has only one splice form, these fragments correspond to actual numbered introns
- BEGIN or END   at the beginning or ending of a chromosome (C. elegans) or sequence read (C. briggsae)
- other    all segments not in any above category (such as those bounded by exons from two different genes.
In addition, any segment will be prefixed by alt_ if either
of its bounding exons belongs to a gene with more than one splice
variant. Note that no attempt is made to correctly number introns
for each splice variant.
- Search non-exonic regions using the PWM   The program
SimpleSearch is used as a scanning window to
find up to N (user-defined) top-scoring hits in each genome.
Technically, the program chooses the score cutoff to produce as many
top-scoring hits as possible less than N. It estimates this cutoff by
sampling 1% of the windows and sorting the resulting list of scores.
Therefore, the number of hits retrieved will be slightly less than N in most
cases. If the PWM has very little information (low bits in most
positions) then there will be many windows achieving the same score,
and the number of hits retrieved could be substantially less than N.
This can arise if the binding sites are incorrectly aligned or poorly
conserved.
- Filter out 'orphan' hits   We then use a 1-1
mapping of C. elegans to C. briggsae orthologs provided
in the file
orthologs-2.00
to find only those hits in one species for which there is a hit in the
matching ortholog of the other species. The filter traverses all
ortholog pairs, and retrieves the top-scoring hit for each ortholog,
called a 'hit-pair'. The percentage of top-scoring
hits filtered out in each species during this step will depend on the
number of hits (N) the user chooses to retrieve. A permissive
(i.e. 20000) value for N will include many more hit-pairs during this
step, but since the hit-pairs passing the filter are then sorted by
either average or maximum or minimum score (see next step), often the
value of N chosen won't affect which hits appear at the top of the list.
The only case where this can affect the top-ranked hit-pairs is the
results sorted by mismatch first, then by some score criterion
(minimum, average, or maximum).
- Output sorted Hit-Pairs in HTML tables   Finally,
the resulting hit-pairs are sorted by either minimum, average, or
maximum score in hit-pair, and by number of mismatches between the hit
sequences. All six possible combinations of primary/secondary sortings
are provided. Ranking by average score first gives the most statistically
correct measure of rank order in the context of our protocol. The other
sortings are provided for more specialized analysis of the results.
|