In silico removal of host sequences
Removing host (contamination) sequences in order to analyze the remaining (bacterial) sequences
Quick solution to get the read pairs that do not map concordantly to the host reference genome.
# download ready-to-use bowtie2 database of the human host genome GRCh38 (hg38)
wget https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip
unzip GRCh38_noalt_as.zip
# run bowtie2 mapping (using --un-conc-gz to get gzip-compressed output files; 8 processors)
bowtie2 -p 8 -x
# bowtie2 results (gz files without gz ending)
ls SAMPLE_host_removed.1 SAMPLE_host_removed.2
# rename host-sequence free samples
mv SAMPLE_host_removed.1 SAMPLE_host_removed_R1.fastq.gz
mv SAMPLE_host_removed.2 SAMPLE_host_removed_R2.fastq.gz

Option --un-conc gives results like the samtools option -F 2 (excluding reads "mapped in proper pair"). Read pairs in which only one of the two reads maps to the host sequence might therefore still be included in the "host removed" output. For better control over the read filtering options, see the workflow below.

If the multi-processor option -p is used, output reads might be in a different order than in the input files. Use option --reorder to keep the original read order. (Read order refers to the .sam output but might also affect the host-removed read output files .1 and .2.)

2) Using bowtie2 together with samtools
Complex solution that gives better control over the rejected reads by using SAM flags
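To make the SAM flag filters concrete, here is a small sketch in plain shell arithmetic that mimics how samtools applies -f (all listed bits must be set) and -F (none of the listed bits may be set). The helper name passes_filter and the four example flags are illustrative choices, not part of bowtie2 or samtools. The filter -f 12 -F 256 (read unmapped, mate unmapped, not a secondary alignment) is used here as one possible stricter alternative to -F 2.

```shell
#!/bin/sh
# passes_filter FLAG REQUIRED EXCLUDED
# Hypothetical helper: succeeds if FLAG has all REQUIRED bits set (like samtools -f)
# and none of the EXCLUDED bits set (like samtools -F).
passes_filter() {
    _flag=$1; _req=$2; _exc=$3
    [ $(( _flag & _req )) -eq "$_req" ] && [ $(( _flag & _exc )) -eq 0 ]
}

# Example flags for the first read of a pair:
#   99 = 1+2+32+64  paired, mapped in proper pair, mate on reverse strand
#   65 = 1+64       paired, both reads mapped, but not in a proper pair
#   73 = 1+8+64     paired, read mapped, mate unmapped
#   77 = 1+4+8+64   paired, read unmapped, mate unmapped
for flag in 99 65 73 77; do
    if passes_filter "$flag" 0 2; then loose=kept; else loose=dropped; fi
    if passes_filter "$flag" 12 256; then strict=kept; else strict=dropped; fi
    echo "flag $flag: -F 2 -> $loose; -f 12 -F 256 -> $strict"
done
```

With these four flags, -F 2 drops only the proper pair (99), whereas -f 12 -F 256 keeps only the pair in which both reads are unmapped (77).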
How to filter out host reads from paired-end fastq files?
a) bowtie2 mapping against host genome: write all (mapped and unmapped) reads to a single .bam file
b) samtools view: use filter-flags to extract unmapped reads
c) samtools fastq: split paired-end reads into separated R1 and R2 fastq files
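The three steps above could be sketched as a single shell function. This is a minimal sketch, not a definitive implementation: the function name remove_host_reads, all file names, and the output prefix are placeholders. The filter -f 12 -F 256 (both reads of the pair unmapped, secondary alignments skipped) is one possible choice of SAM flags. It also assumes a reasonably recent samtools that compresses samtools fastq output automatically when the file name ends in .gz.

```shell
#!/bin/sh
# Hypothetical wrapper around steps a) to c); all names are placeholders.
# usage: remove_host_reads INDEX R1.fastq.gz R2.fastq.gz OUT_PREFIX [THREADS]
remove_host_reads() {
    index=$1          # bowtie2 index prefix, e.g. GRCh38_noalt_as
    r1=$2             # input forward reads
    r2=$3             # input reverse reads
    out=$4            # prefix for all output files
    threads=${5:-8}   # number of processors (default 8)

    # a) map against the host genome; write mapped and unmapped reads to one .bam
    bowtie2 -p "$threads" -x "$index" -1 "$r1" -2 "$r2" \
        | samtools view -b -o "${out}_mapped_and_unmapped.bam" - || return 1

    # b) keep only pairs with read and mate both unmapped (-f 12),
    #    excluding secondary alignments (-F 256)
    samtools view -b -f 12 -F 256 -o "${out}_both_unmapped.bam" \
        "${out}_mapped_and_unmapped.bam" || return 1

    # c) sort by read name so that mates are adjacent, then split into R1 and R2
    samtools sort -n -o "${out}_both_unmapped_sorted.bam" \
        "${out}_both_unmapped.bam" || return 1
    samtools fastq -n \
        -1 "${out}_host_removed_R1.fastq.gz" \
        -2 "${out}_host_removed_R2.fastq.gz" \
        -0 /dev/null -s /dev/null \
        "${out}_both_unmapped_sorted.bam"
}
```

A call such as remove_host_reads GRCh38_noalt_as SAMPLE_R1.fastq.gz SAMPLE_R2.fastq.gz SAMPLE would then write SAMPLE_host_removed_R1.fastq.gz and SAMPLE_host_removed_R2.fastq.gz; the input file names in this call are assumed for illustration.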
a) bowtie2 mapping against host sequence