BLAST word-size

Length of the exact sequence match that seeds the final alignment

blastn -query genes.fasta -subject genome.fasta -word_size 11

A BLAST search starts by finding a perfect sequence match of the length given by -word_size. This initial region of exact match is then extended in both directions, allowing gaps and substitutions based on the scoring thresholds.
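The seeding step described above can be illustrated with a toy sketch. The seed_hits function below is a hypothetical helper, not part of BLAST: it lists every exact word of a given length shared by two short sequences, which is the kind of match a nucleotide search would then try to extend. Real BLAST uses an indexed lookup rather than this brute-force loop, and the sequences are made up for illustration.

```shell
# Toy illustration only (hypothetical helper, not a BLAST command):
# print each exact word of length w found in both sequences, followed by
# its 1-based start position in the query and in the subject.
seed_hits() {
  awk -v q="$1" -v s="$2" -v w="$3" 'BEGIN {
    for (i = 1; i + w - 1 <= length(q); i++) {
      word = substr(q, i, w)
      for (j = 1; j + w - 1 <= length(s); j++)
        if (substr(s, j, w) == word) print word, i, j
    }
  }'
}

# The two sequences share the 6-letter stretch ACGTTG.
seed_hits ACGTTGCA TTACGTTG 5   # two overlapping seeds are found
seed_hits ACGTTGCA TTACGTTG 7   # no seed: the shared stretch is too short
```

With a word size of 5 the shared stretch yields seeds, while a word size of 7 yields none, so that region would never be extended into an alignment.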

Changing the word-size trades sensitivity against specificity: a smaller value finds more, but less accurate, hits; a larger value limits the results to almost perfect hits.

- Decreasing the word-size will increase the number of detected homologous sequences, but hits can include more fragmented alignments due to gaps and substitutions (example: search for homologous genes between distant species, see also: -task blastn)
- Increasing the word-size will give fewer hits, as it requires a longer continuous region of exact match. If the word-size is chosen to be almost the size of the query, BLAST will search for almost exact matches (example: search for the location of gene sequences in the genome they originate from)
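One way to see this trade-off on your own data is to count the reported hits at several word sizes. This is a minimal sketch, assuming blastn is installed and reusing the file names from the example command at the top of the page; count_hits is a hypothetical helper, and -outfmt 6 requests tabular output with one line per reported alignment.

```shell
# Hypothetical helper: number of alignments blastn reports for a word size.
# Arguments: query FASTA, subject FASTA, word size.
count_hits() {
  blastn -task blastn -query "$1" -subject "$2" -word_size "$3" -outfmt 6 |
    awk 'END { print NR }'
}

# Example usage (uncomment once genes.fasta and genome.fasta exist):
# for w in 7 11 16; do
#   echo "word_size=$w hits=$(count_hits genes.fasta genome.fasta "$w")"
# done
```

The counts depend entirely on the input sequences, so no expected numbers are given here.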

For short sequences, word-size must be less than half the query length, otherwise reliable hits can be missed.
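The rule of thumb above can be turned into a small calculation. The sketch below uses two hypothetical helpers: fasta_length sums the residues of a single-record FASTA file, and suggest_word_size returns the largest integer strictly below half that length. Capping the result at 11 (the -task blastn default) and flooring it at 4 (the smallest value blastn accepts) are assumptions made for this example, not recommendations from the BLAST documentation.

```shell
# Hypothetical helper: total sequence length of a single-record FASTA file.
fasta_length() {
  awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Hypothetical helper: largest word size strictly below half the query length.
# Assumptions: cap at 11 (the -task blastn default) and floor at 4 (the
# smallest value blastn accepts). For queries of 8 bases or fewer the floor
# of 4 no longer satisfies the "less than half" rule.
suggest_word_size() {
  len=$1
  w=$(( (len - 1) / 2 ))
  if [ "$w" -gt 11 ]; then w=11; fi
  if [ "$w" -lt 4 ]; then w=4; fi
  echo "$w"
}

suggest_word_size 20   # prints 9

# Example usage (primer.fasta is a placeholder file name):
# len=$(fasta_length primer.fasta)
# blastn -task blastn -query primer.fasta -subject genome.fasta \
#   -word_size "$(suggest_word_size "$len")"
```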

Default word-sizes

nucleotide sequence search blastn with default megablast (blastn): -word_size 28

nucleotide sequence search blastn only (blastn -task blastn): -word_size 11

amino acid search (blastp): -word_size 3

→ BLAST command-line options

Setting the word-size to a very low value (-word_size 5) makes a blastn search very slow, because many more short seed matches are found and each one has to be extended.
