Tools‎ > ‎Shotgun sequencing‎ > ‎

Remove host sequences

In silico removal of host sequences

removing host (contamination) sequences in order to analyze remaining (bacterial) sequences

How to filter out host reads from paired-end fastq files?

a) bowtie2 mapping against host: write all (mapped/unmapped) reads to a single .bam file
b) Samtools: use flags to extract unmapped reads
c) bedtools: split paired-end reads into separated files

a) bowtie2 mapping against host sequence

Host example: human genome hg19 (download bowtie2 hg19 index)

# 1) create bowtie2 index database (host_DB) from host reference genome
bowtie2-build host_genome.fna host_DB

# 2) bowtie2 mapping against host sequence database, keep both mapped and unmapped reads (paired-end reads)
bowtie2 -x host_DB -1 SAMPLE_r1.fastq -2 SAMPLE_r2.fastq -S SAMPLE_mapped_and_unmapped.sam

# 3) convert file .sam to .bam
samtools view -bS SAMPLE_mapped_and_unmapped.sam > SAMPLE_mapped_and_unmapped.bam

b) filter required unmapped reads

# SAMtools SAM-flag filter: get unmapped pairs (both ends unmapped)
samtools view -b -f 12 -F 256 SAMPLE_mapped_and_unmapped.bam > SAMPLE_bothEndsUnmapped.bam
-f 12     Extract only (-f) alignments with both reads unmapped: <read unmapped><mate unmapped>
-F 256   Do not(-F) extract alignments which are: <not primary alignment>
see meaning of SAM-flags

c)  split paired-end reads into separated fastq files .._r1 .._r2

# sort bam file by read name (-n) to have paired reads next to each other as required by bedtools
samtools sort -n SAMPLE_bothEndsUnmapped.bam SAMPLE_bothEndsUnmapped_sorted

bedtools bamtofastq
-i SAMPLE_bothEndsUnmapped_sorted.bam -fq SAMPLE_host_removed_r1.fastq -fq2 SAMPLE_host_removed_r2.fastq

see other options for  → converting .bam to .fastq


Two files of paired-end reads, containing non-host sequences


As alternative, bowtie2 option --un-conc can be used to directly get unmapped reads.

But SAM-flags gives more flexibility in extracting the required reads, including
- correctly mapped pairs <read mapped in proper pair>
  (mapped distance as expected and correctly oriented to each other)
- keeping or rejecting pairs in which only one read is mapped

Removel of ribosomal RNA

To in silico remove rRNA, see software tool → SortMeRNA