NCBI ftp genome download

How to download all reference genomes of a selected species from NCBI

Ubuntu / Linux command line terminal


1) Download list of all available reference genomes

Download the complete list of manually reviewed genomes (the RefSeq database, which is a subset of GenBank):

rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt ./

Alternatively, download the list of all available genomes (GenBank), which may include low-quality assemblies:

rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt ./

→ read more at NCBI
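
Before picking columns with cut, it can help to see which column number holds which field. The following is a minimal sketch, assuming the column header is the last line starting with '#' among the first five lines of the summary file; list_columns is a hypothetical helper name, not an NCBI tool.

```shell
# Print "column_number: column_name" for each field of the header line.
# Assumption: the header is the last '#' line among the first 5 lines of the file.
list_columns() {
    head -n 5 "$1" | grep '^#' | tail -n 1 | sed 's/^#[[:space:]]*//' |
        awk -F'\t' '{ for (i = 1; i <= NF; i++) print i ": " $i }'
}

# Usage (after the rsync download above):
# list_columns assembly_summary_refseq.txt
```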


2) Search for available genomes of a species

Example: Eubacterium limosum (RefSeq database; columns 8, 9, 14, 15, 16 hold the organism name, strain, genome representation, release date and assembly name)

grep -E 'Eubacterium.*limosum' assembly_summary_refseq.txt | cut -f 8,9,14,15,16

Eubacterium limosum strain=ATCC 8486 Full 2017/04/03 ASM80767v2

Eubacterium limosum strain=SA11 Full 2015/12/23 ASM148172v1

Eubacterium limosum strain=8486cho Full 2018/05/31 ASM318251v1

Eubacterium limosum strain=DFI.6.107 Full 2021/10/25 ASM2055962v1

Eubacterium limosum strain=B2 Full 2022/05/23 ASM2352075v1

Eubacterium limosum Full 2019/02/19 ASM90068377v1
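
The grep pattern matches anywhere in a line, so it could also hit a submitter or strain field. A possible refinement, sketched here under the assumption that column 8 is the organism name and column 12 the assembly level (for example "Complete Genome", "Chromosome", "Scaffold" or "Contig"), is to match the species against column 8 only and keep one assembly level. The function name select_genomes is hypothetical.

```shell
# Print organism, strain, assembly level and assembly name for one species.
# $1 = assembly summary file
# $2 = species name, matched against the start of column 8
# $3 = assembly level to keep (column 12), e.g. "Complete Genome"
select_genomes() {
    awk -F'\t' -v OFS='\t' -v sp="$2" -v level="$3" '
        !/^#/ && index($8, sp) == 1 && $12 == level { print $8, $9, $12, $16 }
    ' "$1"
}

# Usage:
# select_genomes assembly_summary_refseq.txt "Eubacterium limosum" "Complete Genome"
```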


3) Get FTP download link

# for the selected genomes (Eubacterium limosum), get the NCBI ftp download folder (column 20)

grep -E 'Eubacterium.*limosum' assembly_summary_refseq.txt | cut -f 20 > ftp_links.txt

head ftp_links.txt

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/807/675/GCF_000807675.2_ASM80767v2

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/481/725/GCF_001481725.1_ASM148172v1

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/182/515/GCF_003182515.1_ASM318251v1

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/559/625/GCF_020559625.1_ASM2055962v1

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/023/520/755/GCF_023520755.1_ASM2352075v1

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/683/775/GCF_900683775.1_ASM90068377v1

# extend each folder link to the exact genome file (fna or gff) and write one rsync command per genome into a script

awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print "rsync -t -v "ftpdir,file" ./"}' ftp_links.txt | sed 's/https/rsync/g' > download_fna_files.sh

awk 'BEGIN{FS=OFS="/";filesuffix="genomic.gff.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print "rsync -t -v "ftpdir,file" ./"}' ftp_links.txt | sed 's/https/rsync/g' > download_gff_files.sh

head download_fna_files.sh

rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/807/675/GCF_000807675.2_ASM80767v2/GCF_000807675.2_ASM80767v2_genomic.fna.gz ./

rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/481/725/GCF_001481725.1_ASM148172v1/GCF_001481725.1_ASM148172v1_genomic.fna.gz ./

rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/182/515/GCF_003182515.1_ASM318251v1/GCF_003182515.1_ASM318251v1_genomic.fna.gz ./

rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/020/559/625/GCF_020559625.1_ASM2055962v1/GCF_020559625.1_ASM2055962v1_genomic.fna.gz ./

rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/023/520/755/GCF_023520755.1_ASM2352075v1/GCF_023520755.1_ASM2352075v1_genomic.fna.gz ./

rsync -t -v rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/900/683/775/GCF_900683775.1_ASM90068377v1/GCF_900683775.1_ASM90068377v1_genomic.fna.gz ./
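
The awk one-liners rely on the assembly name being the 10th "/"-separated field of the link. The same link extension can be sketched with shell parameter expansion, which takes the last path component regardless of its position. make_rsync_cmd is a hypothetical helper name; the file suffix is passed in as the second argument.

```shell
# Build one rsync command from an NCBI assembly folder URL.
# $1 = folder URL from column 20, $2 = file suffix (e.g. genomic.fna.gz)
make_rsync_cmd() {
    dir=${1%/}                       # drop a trailing slash, if any
    asm=${dir##*/}                   # last path component = assembly folder name
    rsync_dir="rsync://${dir#*://}"  # swap the URL scheme for rsync
    printf 'rsync -t -v %s/%s_%s ./\n' "$rsync_dir" "$asm" "$2"
}

# Usage: one command per line of ftp_links.txt
# while read -r url; do make_rsync_cmd "$url" genomic.fna.gz; done < ftp_links.txt > download_fna_files.sh
```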


4) Run download

Download the .fna genome files (FASTA format) by running the generated script:

source download_fna_files.sh

ls

download_fna_files.sh

GCF_000807675.2_ASM80767v2_genomic.fna.gz

GCF_001481725.1_ASM148172v1_genomic.fna.gz

GCF_003182515.1_ASM318251v1_genomic.fna.gz

GCF_020559625.1_ASM2055962v1_genomic.fna.gz

GCF_023520755.1_ASM2352075v1_genomic.fna.gz

GCF_900683775.1_ASM90068377v1_genomic.fna.gz
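
Each assembly folder on the NCBI server also contains a file md5checksums.txt, which can be used to check that a download is intact. The sketch below assumes that this file has been downloaded next to the genome file and that its lines have the form "hash  ./filename"; verify_md5 is a hypothetical helper name.

```shell
# Compare the MD5 sum of a downloaded file with the value listed in md5checksums.txt.
# $1 = checksum file, $2 = downloaded file
# Assumption: checksum lines look like "<md5>  ./<filename>".
verify_md5() {
    name=$(basename "$2")
    expected=$(awk -v f="./$name" '$2 == f { print $1 }' "$1")
    actual=$(md5sum "$2" | cut -d ' ' -f 1)
    if [ -n "$expected" ] && [ "$expected" = "$actual" ]; then
        echo "OK $name"
    else
        echo "FAILED $name"
        return 1
    fi
}

# Usage:
# verify_md5 md5checksums.txt GCF_000807675.2_ASM80767v2_genomic.fna.gz
```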


# get the description (top line) of each genome .fna file (more metadata is in the file assembly_summary_refseq.txt)

find . -name "*.fna.gz" -exec sh -c 'printf "%s: " "$1"; zcat "$1" | head -n 1' _ {} \;

./GCF_000807675.2_ASM80767v2_genomic.fna.gz: >NZ_CP019962.1 Eubacterium limosum strain ATCC 8486 chromosome, complete genome

./GCF_001481725.1_ASM148172v1_genomic.fna.gz: >NZ_CP011914.1 Eubacterium limosum strain SA11 chromosome, complete genome

./GCF_003182515.1_ASM318251v1_genomic.fna.gz: >NZ_QGUD01000001.1 Eubacterium limosum strain 8486cho Ga0206405_101, whole genome shotgun sequence

./GCF_020559625.1_ASM2055962v1_genomic.fna.gz: >NZ_JAJCLO010000001.1 Eubacterium limosum strain DFI.6.107 IMADOJIF_1, whole genome shotgun sequence

./GCF_023520755.1_ASM2352075v1_genomic.fna.gz: >NZ_CP097376.1 Eubacterium limosum strain B2 chromosome, complete genome

./GCF_900683775.1_ASM90068377v1_genomic.fna.gz: >NZ_LR215983.1 Eubacterium limosum isolate Eubacterium limosum 81C1 chromosome 1
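
The top line only describes the first sequence in each file. Assemblies labeled "whole genome shotgun sequence" usually consist of several contigs, so counting the FASTA header lines gives a quick impression of how fragmented each genome is. This is a minimal sketch; count_seqs is a hypothetical helper name.

```shell
# Print "file<TAB>number of sequences" for each gzipped FASTA file given.
count_seqs() {
    for f in "$@"; do
        # every sequence record starts with a '>' header line
        n=$(gzip -dc "$f" | grep -c '^>' || true)
        printf '%s\t%s\n' "$f" "$n"
    done
}

# Usage:
# count_seqs ./*_genomic.fna.gz
```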


Alternative: manual ftp download

Download genome .fna files manually from the NCBI website:

https://ftp.ncbi.nlm.nih.gov/genomes/refseq/

ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/


NCBI file formats

Sequences

fna - genome sequence, as a single or multiple contig nucleotide sequences (FASTA format)

ffn - gene sequences (multi-FASTA format), no longer available

faa - protein amino-acid sequences (FASTA format)

Annotations

gff - gene annotations (location, function, ...), gff from NCBI does not include sequence

gbff - gene annotations and sequence (genbank format)

gpff - protein annotations and sequence (genbank format)

https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#files
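
As a quick look into a downloaded gff file, the feature types in its third tab-separated column (gene, CDS, ...) can be counted. This is a minimal sketch that skips comment lines starting with '#'; count_gff_features is a hypothetical helper name.

```shell
# Count annotation features per type (column 3) in a gzipped gff file.
# Output: "feature_type<TAB>count", sorted by feature type.
count_gff_features() {
    gzip -dc "$1" |
        awk -F'\t' '!/^#/ && NF >= 3 { n[$3]++ } END { for (t in n) print t "\t" n[t] }' |
        sort
}

# Usage:
# count_gff_features GCF_000807675.2_ASM80767v2_genomic.gff.gz
```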


see also

→ NCBI FTP FAQ

→ Strain-level metagenomics (PanPhlAn) genome download