Comparing WHAM with BWA using wgsim

This document presents results from an evaluation of WHAM against BWA using the wgsim tool.

Methodology:

First, we generate 100000 single-end 100bps reads using wgsim. The command is exactly same as the one used on this page.

wgsim -N 100000 -r 0.01 -1 100 -S11 -d0 -e0 hs37m.fa r0.fq /dev/null

Then, we issued the following command to build a WHAM index on the sequence file hs37m.fa. The options -l 100 and -v 4 specify that the index will be used to align 100bps reads with up to 4 errors. It takes about one hour to build this index on a machine with a 2.67GHz Intel Xeon CPU and 24GB memory.

wham_build -l 100 -v 4 -p 5 --mask hs37m.fa indexes/wgsim

After building the index, we used two commands (described below) to align the simulated reads with two different error models. The SAM outputs are stored in wham1.sam and wham2.sam, respectively.

The first command find the best matches with up to 4 errors and 2 gaps. Specifying –e 0 instructs WHAM to report quality scores for all alignments. The command takes around 7 seconds to run on our machine. This command is:

wham -v 4 -g 2 –e 0 -S r0.fa indexes/wgsim wham1.sam

The second command finds alignments with up to 8 errors and 6 gaps. Since the index is built with 4 errors, WHAM can’t guarantee that it find all matched alignments under this error model, but it can find more alignments compared to the first command. The command takes around 21 seconds to run. This second command is:

wham -v 8 -g 6 –e 0 -S r0.fa indexes/wgsim wham2.sam

Next, we used the wgsim script to compare the accuracy of the outputs of WHAM and BWA-SW. The results are shown in the following tables. In each table, we show the number of true positives and the number of false positives in the second and third columns respectively. For BWA-SW, we only show the result for quality scores lower than 36. Note that the quality scores for BWA-SW and WHAM are not identical (both use different methods), so one can’t compare the quality scores directly.

WHAM using the first command

Quality Score #True Positives #False Positives
36 77353 2
33 87851 12
28 90354 88
8 90949 88
6 91186 97
5 91621 97
4 91764 103
3 93319 945
2 93869 1309
1 94586 2026
WHAM using the second command

Quality Score #True Positives #False Positives
36 77344 2
33 87842 12
28 90712 86
8 91312 86
6 91552 95
5 91985 95
4 92131 101
3 94313 1520
2 94864 1887
1 95581 2606
BWA-SW using the first command

Quality Score #True Positives #False Positives
36 83999 1
35 84426 1
34 85590 1
33 86840 1
32 87413 1
31 87564 1
30 87928 2
29 88039 2
28 88180 2
27 88306 2
26 88437 2
25 88581 2
24 88721 2
23 88885 2
22 89051 2
21 89329 2
20 89695 3
19 89854 3
18 90071 3
17 90217 3
16 90431 3
15 90642 4
14 90844 4
13 91114 4
12 91346 4
11 91843 5
10 92474 7
9 92714 7
8 93095 10
7 93347 12
6 93794 15
5 94068 17
4 94638 24
3 94862 35
2 95577 103
1 95798 184

As we can see from tables above, the first WHAM command produces fewer true positives. The second command finds similar number of correct alignments as BWA, but outputs more false positives than BWA.

To analyze the reason why WHAM outputs more wrong alignments, we use the following command to output up to 5 best alignments for each read, and use wgsim to verify these alignments.

We found that with this command, WHAM finds 97708 true positives, more than 95798 of BWA. This means that WHAM finds the correct alignments based on its error model, but has difficulty in ranking the correct alignment to the best match due to its simple model of quality score. We are actively looking for life sciences collaborators to give input on the scoring model, and collaborate on customizing it for their environment. Please send an email via the comment form below if you would like to work with us.

Scoring method used in WHAM

In the current scoring model in WHAM, the quality score of an alignment depends on the number of errors in this alignment, and also the number of other alignments for a given read.

Suppose that a given read aligns to multiple positions on the target genome, i.e. nk positions for k errors. Also assume that the best alignment has kmin errors. The posterior probability that the best alignment is actually correct is P[kmin errors] / (P[kmin errors] +  Σ k * P[k errors]). This probability is then mapped to a Phred quality score.

The probability of an alignment with k errors, P[k errors], is calculated using a binomial distribution. Suppose a read has l bases, and each one has a probability of p to be wrong. Then, we have P[k errors] = C(l, k) * pk*(1-p)l-k. In the current version of WHAM, p = 0.01, which is a Phred quality score of 20. In future version, we might use base quality from FASTQ format to evaluate an alignment.

Acknowledgements

Thanks to Toby Bloom for pointing us to the wgsim tool and suggesting this evaluation with BWA!

 

Comments are closed.