
NCBI SRA file format

→ Install SRA-tools (fastq-dump, prefetch, ...)

Converting SRA files to fastq


fastq-dump can convert a local .sra file, or download a run directly from NCBI when given only its accession number


# local use (path to .sra file)
fastq-dump --split-spot path/to/local/file/SRR649944.sra
SRR649944.fastq

# direct download from NCBI/SRA (only accession number, no path)
fastq-dump --split-3 SRR649944
SRR649944_1.fastq
SRR649944_2.fastq

A copy of the .sra file is saved to a local cache folder and reused by later fastq-dump calls, so the data is not downloaded again:
$HOME/ncbi/public/sra/SRR649944.sra
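
To see whether a run is already in the cache before calling fastq-dump, a small check like the following may help. This is only a sketch: the function name sra_is_cached and the variable SRA_CACHE_DIR are made up for this example, and the cache location can differ depending on toolkit version and configuration.

```shell
# Hypothetical helper: succeed if a non-empty cached .sra copy exists.
# SRA_CACHE_DIR is an assumed variable name (not a toolkit setting);
# it defaults to the cache path shown above.
sra_is_cached() (
    acc=$1
    cache_dir=${SRA_CACHE_DIR:-$HOME/ncbi/public/sra}
    [ -s "$cache_dir/$acc.sra" ]
)

# Report whether fastq-dump would need to download the run again
if sra_is_cached SRR649944; then
    echo "SRR649944: cached copy found"
else
    echo "SRR649944: not cached, fastq-dump would download it"
fi
```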

Download only
1) using prefetch
prefetch only downloads the .sra file, for later use by fastq-dump
prefetch SRR649944   # stores .sra file in $HOME/ncbi/public/sra/
fastq-dump --split-3 SRR649944  # takes the file from $HOME/ncbi/public/sra/ (no second download)
SRR649944_1.fastq
SRR649944_2.fastq
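
For several runs, the prefetch and fastq-dump steps above can be wrapped in a loop over a list of accessions. A minimal sketch, assuming a text file with one accession per line; the function name fetch_and_dump and the override variables PREFETCH and FASTQ_DUMP are hypothetical and only exist to make the example easy to try out.

```shell
# Download and convert every accession listed in a text file (one per line).
# PREFETCH and FASTQ_DUMP are hypothetical override variables for the
# executable names; they are not settings of the SRA toolkit.
fetch_and_dump() (
    list=$1
    while IFS= read -r acc; do
        # skip blank lines and comment lines
        case $acc in ''|'#'*) continue ;; esac
        # </dev/null keeps the tools from reading the accession list on stdin
        "${PREFETCH:-prefetch}" "$acc" </dev/null ||
            { echo "prefetch failed: $acc" >&2; continue; }
        "${FASTQ_DUMP:-fastq-dump}" --split-3 "$acc" </dev/null ||
            echo "fastq-dump failed: $acc" >&2
    done < "$list"
)
```

Usage: fetch_and_dump accessions.txt   (accessions.txt is a placeholder file name)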

2) using wget  (not recommended)

Download error
After a download error, a leftover cache and/or lock file may need to be removed before trying again:
rm $HOME/ncbi/public/sra/SRR649944.sra.cache
rm $HOME/ncbi/public/sra/SRR649944.sra.cache.lock
rm $HOME/ncbi/public/sra/SRR649944.sra.tmp.23569.tmp   # temporary file left by prefetch

http://www.ncbi.nlm.nih.gov/books/NBK158899/#SRA_download.downloading_sra_data_using
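
The cleanup after a download error can be collected into one function, so that the temporary file name does not have to be typed by hand. This is a sketch under assumptions: the function name clean_sra_leftovers and the variable SRA_CACHE_DIR are invented for this example, and the file name patterns are simply the three shown above.

```shell
# Remove leftover cache, lock and temporary files for one accession.
# The downloaded .sra file itself is not touched.
# SRA_CACHE_DIR is an assumed variable name; the default is the path above.
clean_sra_leftovers() (
    acc=$1
    cache_dir=${SRA_CACHE_DIR:-$HOME/ncbi/public/sra}
    for f in "$cache_dir/$acc.sra.cache" \
             "$cache_dir/$acc.sra.cache.lock" \
             "$cache_dir/$acc".sra.tmp.*.tmp; do
        # an unmatched glob stays literal, so test for existence first
        if [ -e "$f" ]; then
            rm -f -- "$f" && echo "removed $f"
        fi
    done
)
```

Usage: clean_sra_leftovers SRR649944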

Options


Extracting fastq files from SRA files, for paired-end reads

fastq-dump --split-3 SAMPLE

results:
  SAMPLE_1.fastq
  SAMPLE_2.fastq
  SAMPLE.fastq (only if .sra contains single reads / single-end sequencing)

  --split-3   splits paired reads into files *_1.fastq and *_2.fastq; single read (if any) into  *.fastq

   SAMPLE can be an SRA accession (downloaded from NCBI or taken from the local ncbi/public/sra/ archive) or a direct path to a local .sra file
   fastq-dump --split-3 SRR649944
   fastq-dump --split-3 path/to/local/file/SRR649944.sra
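
After a --split-3 run it can be worth checking that both mate files exist and contain the same number of reads. A minimal sketch, assuming four lines per FASTQ record (sequence lines not wrapped); the function names fastq_read_count and check_pairs are made up for this example.

```shell
# Count the reads in a FASTQ file, assuming four lines per record.
fastq_read_count() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Succeed only if PREFIX_1.fastq and PREFIX_2.fastq both exist
# and hold the same number of reads.
check_pairs() (
    prefix=$1
    if [ ! -f "${prefix}_1.fastq" ] || [ ! -f "${prefix}_2.fastq" ]; then
        echo "missing ${prefix}_1.fastq or ${prefix}_2.fastq" >&2
        exit 1
    fi
    n1=$(fastq_read_count "${prefix}_1.fastq")
    n2=$(fastq_read_count "${prefix}_2.fastq")
    echo "${prefix}: $n1 / $n2 reads"
    [ "$n1" -eq "$n2" ]
)
```

Usage: check_pairs SRR649944   (run in the directory holding the fastq-dump output)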


Converting SRA files into a single fastq file

fastq-dump --split-spot  SAMPLE
results:
   SAMPLE.fastq

options:
 --split-spot   splits paired-end spots into individual reads, but writes all of them to a single fastq file



To use in a pipe

fastq-dump -Z --split-spot SAMPLE  | bowtie2  ...

options:
  -Z    writes sequences to standard output
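
Before piping a whole run into an aligner, it can help to look at the first few spots only. The sketch below combines -Z with -X (--maxSpotId, listed in the help output at the end of this page). The function name preview_spots and the override variable FASTQ_DUMP are hypothetical.

```shell
# Print the first N spots (default 5) of a sample to standard output.
# FASTQ_DUMP is a hypothetical override variable for the executable name.
preview_spots() {
    "${FASTQ_DUMP:-fastq-dump}" -X "${2:-5}" -Z --split-spot "$1"
}
```

Usage: preview_spots SRR649944 100 | less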



Filter read length of SRA samples

fastq-dump --minReadLen 80 --split-3 SAMPLE
fastq-dump --minReadLen 80 --split-spot -Z SAMPLE | bowtie2 ...

options:
  --minReadLen 80    extracts only sequences >= 80 bp from the SRA file (applied to individual reads with --split-spot, otherwise to whole spots)
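
For a single FASTQ file that has already been extracted, a comparable length filter can be sketched with awk. This only illustrates the ">= length" rule; it is not how fastq-dump works internally, the function name filter_min_len is made up, and it assumes four lines per record. Filtering the two files of a pair separately this way can leave reads without their mate.

```shell
# Keep only records whose sequence is at least MIN_LEN bases long.
# Reads FASTQ on stdin, writes FASTQ on stdout (four lines per record).
filter_min_len() {
    awk -v min="$1" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 { if (length(seq) >= min) print header "\n" seq "\n" plus "\n" $0 }
    '
}
```

Usage: filter_min_len 80 < SAMPLE.fastq > SAMPLE.min80.fastq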





read more

http://ncbi.github.io/sra-tools/fastq-dump.html
https://github.com/ncbi/sra-tools/wiki/HowTo:-Access-SRA-Data
http://www.ncbi.nlm.nih.gov/books/NBK47528/
http://www.ncbi.nlm.nih.gov/books/NBK242621/#_SRA_Download_Guid_BK_The_SRA_Toolkit_
http://www.ncbi.nlm.nih.gov/books/NBK158899/#SRA_download.downloading_sra_data_using


fastq-dump --help
Usage:
  fastq-dump [options] <path> [<path>...]
  fastq-dump [options] <accession>

INPUT
  -A|--accession <accession>       Replaces accession derived from <path> in
                                   filename(s) and deflines (only for single
                                   table dump)
  --table <table-name>             Table name within cSRA object, default is
                                   "SEQUENCE"

PROCESSING

Read Splitting                     Sequence data may be used in raw form or
                                     split into individual reads
  --split-spot                     Split spots into individual reads

Full Spot Filters                  Applied to the full spot independently
                                     of --split-spot
  -N|--minSpotId <rowid>           Minimum spot id
  -X|--maxSpotId <rowid>           Maximum spot id
  --spot-groups <[list]>           Filter by SPOT_GROUP (member): name[,...]
  -W|--clip                        Apply left and right clips

Common Filters                     Applied to spots when --split-spot is not
                                     set, otherwise - to individual reads
  -M|--minReadLen <len>            Filter by sequence length >= <len>
  -R|--read-filter <[filter]>      Split into files by READ_FILTER value
                                   optionally filter by value:
                                   pass|reject|criteria|redacted
  -E|--qual-filter                 Filter used in early 1000 Genomes data: no
                                   sequences starting or ending with >= 10N
  --qual-filter-1                  Filter used in current 1000 Genomes data

Filters based on alignments        Filters are active when alignment
                                     data are present
  --aligned                        Dump only aligned sequences
  --unaligned                      Dump only unaligned sequences
  --aligned-region <name[:from-to]>  Filter by position on genome. Name can
                                   either be accession.version (ex:
                                   NC_000001.10) or file specific name (ex:
                                   "chr1" or "1"). "from" and "to" are 1-based
                                   coordinates
  --matepair-distance <from-to|unknown>  Filter by distance between matepairs.
                                   Use "unknown" to find matepairs split
                                   between the references. Use from-to to limit
                                   matepair distance on the same reference

Filters for individual reads       Applied only with --split-spot set
  --skip-technical                 Dump only biological reads

OUTPUT
  -O|--outdir <path>               Output directory, default is working
                                   directory '.' )
  -Z|--stdout                      Output to stdout, all split data become
                                   joined into single stream
  --gzip                           Compress output using gzip
  --bzip2                          Compress output using bzip2

Multiple File Options              Setting these options will produce more
                                     than 1 file, each of which will be suffixed
                                     according to splitting criteria.
  --split-files                    Dump each read into separate file.Files
                                   will receive suffix corresponding to read
                                   number
  --split-3                        Legacy 3-file splitting for mate-pairs:
                                   First biological reads satisfying dumping
                                   conditions are placed in files *_1.fastq and
                                   *_2.fastq If only one biological read is
                                   present it is placed in *.fastq Biological
                                   reads 3 and above are ignored.
  -G|--spot-group                  Split into files by SPOT_GROUP (member name)
  -R|--read-filter <[filter]>      Split into files by READ_FILTER value
                                   optionally filter by value:
                                   pass|reject|criteria|redacted
  -T|--group-in-dirs               Split into subdirectories instead of files
  -K|--keep-empty-files            Do not delete empty files

FORMATTING

Sequence
  -C|--dumpcs <[cskey]>            Formats sequence using color space (default
                                   for SOLiD),"cskey" may be specified for
                                   translation
  -B|--dumpbase                    Formats sequence using base space (default
                                   for other than SOLiD).

Quality
  -Q|--offset <integer>            Offset to use for quality conversion,
                                   default is 33
  --fasta <[line width]>           FASTA only, no qualities, optional line
                                   wrap width (set to zero for no wrapping)

Defline
  -F|--origfmt                     Defline contains only original sequence name
  -I|--readids                     Append read id after spot id as
                                   'accession.spot.readid' on defline
  --helicos                        Helicos style defline
  --defline-seq <fmt>              Defline format specification for sequence.
  --defline-qual <fmt>             Defline format specification for quality.
                                   <fmt> is string of characters and/or
                                   variables. The variables can be one of: $ac
                                   - accession, $si spot id, $sn spot
                                   name, $sg spot group (barcode), $sl spot
                                   length in bases, $ri read number, $rn
                                   read name, $rl read length in bases. '[]'
                                   could be used for an optional output: if
                                   all vars in [] yield empty values whole
                                   group is not printed. Empty value is empty
                                   string or 0 for numeric variables. Ex:
                                   @$sn[_$rn]/$ri '_$rn' is omitted if name
                                   is empty
 
OTHER:
  --disable-multithreading         disable multithreading
  -h|--help                        Output brief explanation of program usage
  -V|--version                     Display the version of the program
  -L|--log-level <level>           Logging level as number or enum string One
                                   of (fatal|sys|int|err|warn|info) or (0-5)
                                   Current/default is warn
  -v|--verbose                     Increase the verbosity level of the program
                                   Use multiple times for more verbosity
  --ncbi_error_report              Control program execution environment
                                   report generation (if implemented). One of
                                   (never|error|always). Default is error
  --legacy-report                  use legacy style 'Written spots' for tool

fastq-dump : 2.3.4