Searching and Downloading Genome Data using NCBI Datasets
Table of contents
1. NCBI Datasets: the CLI version
   1a. Installing NCBI Datasets
2. Genome retrieval options
3. Using the datasets CLI to download a single genome assembly
4. How can I download multiple genomes at once using datasets?
   4a. Using dataformat to explore metadata
5. Datasets documentation and tutorials
6. Bonus exercise: downloading a large number of genomes
1. NCBI Datasets: the CLI version
NCBI Datasets comprises an API, a web interface and a command-line tool (CLI). In this workshop, we already covered the web interface and how it can be used to search for and download your genomes of interest.
As useful as the web interface is, at times it's much more convenient to access genomes from a command-line environment. Let's say you're working on your institution's high-performance computing (HPC) system and you need to download dozens (or hundreds) of genomes. With the Datasets web interface, this would be a two-step process:
- Download the genome data package locally;
- Transfer the files to the HPC system.
With the NCBI Datasets CLI, you can do this process in a single step. Our CLI allows users to access not only genomes, but also genes, ortholog sets and virus genomes.
1a. Installing NCBI Datasets
For this exercise, we will install the datasets CLI in the Gitpod instance you are using. The same instructions can be followed to install it on your own machine or HPC system.
The list of commands below will accomplish the following tasks:
- Create a new conda environment named `datasets` and install the datasets CLI tool and the UNIX tree tool (useful for visualizing the folder structure) in that new environment.
- Activate the `datasets` environment.
- Test the installation by calling the datasets CLI.
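The three steps above can be sketched as follows. This is a minimal version, assuming conda is already installed and that the CLI and `tree` are available from the conda-forge channel under the package names `ncbi-datasets-cli` and `tree`; adjust the channel or package names if your setup differs.

```shell
# Assumption: conda is installed; package names are those used on conda-forge.
# 1. Create an environment named "datasets" with the datasets CLI and tree.
conda create -y -n datasets -c conda-forge ncbi-datasets-cli tree

# 2. Activate the new environment.
conda activate datasets

# 3. Call the CLI without arguments to confirm it is on the PATH.
datasets
```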
`datasets` will print the help message below:
datasets is a command-line tool that is used to query and download biological sequence data
across all domains of life from NCBI databases.
Refer to NCBI's [download and install](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/) documentation for information about getting started with the command-line tools.
Usage
datasets [command]
Data Retrieval Commands
summary Print a data report containing gene, genome or virus metadata
download Download a gene, genome or virus dataset as a zip file
rehydrate Rehydrate a downloaded, dehydrated dataset
Miscellaneous Commands
completion Generate autocompletion scripts
Flags
--api-key string Specify an NCBI API key
--debug Emit debugging info
--help Print detailed help about a datasets command
--version Print version of datasets
Use datasets <command> --help for detailed help about a command.
2. Genome retrieval options
Today we're focusing on retrieving genome data and metadata using the NCBI Datasets CLI. We will explore the options available to download data and metadata, including filtering options.
For this exercise, we will be using an environment in Gitpod. You need your GitHub credentials to log in to it. Here's the link to the NCBI Datasets pod:
In this environment, we have the necessary tools installed for you to explore NCBI Datasets without the need to configure anything. When you decide to use NCBI Datasets on your own machine or HPC system, you need to install it. More information on how to install NCBI Datasets can be found in our documentation page.
The NCBI Datasets CLI command structure is very intuitive. If you take a look at the diagram below, you will notice that the commands are built by choosing one option from each vertical rectangle. Let's start!
The datasets CLI has even more filtering options than the web interface. For example, for chromosome-level assemblies, you can choose which chromosomes to download using the `--chromosomes` flag. Use the `--help` flag to see all available flags and options in datasets.
Flags
--assembly-version string Limit to 'latest' assembly accession version or include 'all' (latest + previous versions)
(default "latest")
--include string(,string) Specify the data files to include (comma-separated).
* genome: genomic sequence
* rna: transcript
* protein: amnio acid sequences
* cds: nucleotide coding sequences
* gff3: general feature file
* gtf: gene transfer format
* gbff: GenBank flat file
* seq-report: sequence report file
* none: do not retrieve any sequence files
(default [genome])
--reference Limit to reference genomes
--tax-exact-match Exclude sub-species when a species-level taxon is specified
Global Flags
--annotated Limit to annotated genomes
--api-key string Specify an NCBI API key
--assembly-level string Limit to genomes at one or more assembly levels (comma-separated):
* chromosome
* complete
* contig
* scaffold
(default "[]")
--assembly-source string Limit to 'RefSeq' (GCF_) or 'GenBank' (GCA_) genomes (default "all")
--chromosomes strings Limit to a specified, comma-delimited list of chromosomes, or 'all' for all chromosomes
--debug Emit debugging info
--dehydrated Download a dehydrated zip archive including the data report and locations of data files (use the rehydrate command to retrieve data files).
--exclude-atypical Exclude atypical assemblies
--filename string Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")
--help Print detailed help about a datasets command
--mag string Limit to metagenome assembled genomes (only) or remove them from the results (exclude) (default "all")
--no-progressbar Hide progress bar
--preview Show information about the requested data package
--released-after string Limit to genomes released on or after a specified date (MM/DD/YYYY)
--released-before string Limit to genomes released on or before a specified date (MM/DD/YYYY)
--search strings Limit results to genomes with specified text in the searchable fields:
species and infraspecies, assembly name and submitter.
To search multiple strings, use the flag multiple times.
--version Print version of datasets
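As a sketch of how these flags combine, the command below asks for a preview of a data package instead of downloading it. The taxon name is only an illustrative choice, and the example assumes that `datasets download genome taxon` accepts a quoted scientific name in the same way `datasets summary genome taxon` does later in this tutorial.

```shell
# Preview the package that would be downloaded: reference genome only,
# requesting the genome FASTA and the GFF3 annotation.
datasets download genome taxon "Octopus bimaculoides" \
    --reference \
    --include genome,gff3 \
    --preview
```

Dropping `--preview` runs the actual download, and adding `--filename` names the resulting zip file.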
3. Using the datasets CLI to download a single genome assembly
For our first exercise, let's practice downloading a single genome using the datasets CLI. Imagine that you read a genomics paper about an organism that you are interested in; in this case, let's use the California two-spot octopus (Octopus bimaculoides) as our example.
The NCBI assembly accession for this species' reference genome is `GCF_001194135.2`, and you can use the command below to download it:
datasets download genome accession GCF_001194135.2 --filename octopus-bimaculoides-ref.zip
Collecting 1 records [================================================] 100% 1/1
Downloading: octopus-bimaculoides-ref.zip 643MB done
After downloading the data package, let's unzip it and take a look at the folder structure:
unzip octopus-bimaculoides-ref.zip -d octopus-ref
Archive: octopus-bimaculoides-ref.zip
inflating: octopus-ref/README.md
inflating: octopus-ref/ncbi_dataset/data/assembly_data_report.jsonl
inflating: octopus-ref/ncbi_dataset/data/GCF_001194135.2/GCF_001194135.2_ASM119413v2_genomic.fna
inflating: octopus-ref/ncbi_dataset/data/dataset_catalog.json
tree octopus-ref
octopus-ref/
|-- README.md
`-- ncbi_dataset
`-- data
|-- GCF_001194135.2
| `-- GCF_001194135.2_ASM119413v2_genomic.fna
|-- assembly_data_report.jsonl
`-- dataset_catalog.json
3 directories, 4 files
The folder structure of the data package downloaded with the CLI is the same as the one you would get from a web download. The genome assembly FASTA is located inside a folder named after the assembly accession, inside the `data` folder. The `data` folder also contains a metadata report (`assembly_data_report.jsonl`) and a catalog listing which files were included in the data package (`dataset_catalog.json`).
In addition to the genome FASTA file, you might be interested in other files associated with this assembly, such as transcripts, proteins, GFF3, etc. To customize the data package, you can use the `--include` flag and list everything you would like to download. If the files are available, they will be included in your data package. Let's test that:
datasets download genome accession GCF_001194135.2 --include genome,rna,protein --filename octopus-genome-rna-prot.zip
Collecting 1 records [================================================] 100% 1/1
Downloading: octopus-genome-rna-prot.zip 672MB done
unzip octopus-genome-rna-prot.zip -d octopus-genome-rna-protein
Archive: octopus-genome-rna-prot.zip
inflating: octopus-genome-rna-protein/README.md
inflating: octopus-genome-rna-protein/ncbi_dataset/data/assembly_data_report.jsonl
inflating: octopus-genome-rna-protein/ncbi_dataset/data/GCF_001194135.2/GCF_001194135.2_ASM119413v2_genomic.fna
inflating: octopus-genome-rna-protein/ncbi_dataset/data/GCF_001194135.2/rna.fna
inflating: octopus-genome-rna-protein/ncbi_dataset/data/GCF_001194135.2/protein.faa
inflating: octopus-genome-rna-protein/ncbi_dataset/data/dataset_catalog.json
tree octopus-genome-rna-protein
octopus-genome-rna-protein
|-- README.md
`-- ncbi_dataset
`-- data
|-- GCF_001194135.2
| |-- GCF_001194135.2_ASM119413v2_genomic.fna
| |-- protein.faa
| `-- rna.fna
|-- assembly_data_report.jsonl
`-- dataset_catalog.json
3 directories, 6 files
As we can see, this new data package has two extra files in the same folder where the genome FASTA was in the first package we downloaded: `rna.fna` and `protein.faa`.
Now, let's download multiple genomes at once.
4. How can I download multiple genomes at once using datasets?
If we have a long list of genome accessions we want to retrieve, or we want to download all genomes for a certain taxon, the datasets CLI lets us do that in a single command.
Let's map out a strategy for doing this in the most efficient way possible. By efficient we mean not downloading unnecessary files, since doing so costs time and storage.
We can use a web/CLI approach, where we search for the genomes of interest on the web, or we can do everything in the command-line environment. We will explore the command-line-only option, and to do so, we will use another CLI tool called dataformat, the companion tool to datasets, which explores metadata and converts it to TSV or Excel formats.
4a. Using dataformat to explore metadata
NCBI Datasets provides metadata reports in JSON or JSON-Lines format, and those can be retrieved in two different ways:
- using the `datasets summary` command; or
- as part of any data package.
To facilitate the visualization of metadata, you can use dataformat to convert the JSON/JSON-Lines files to TSV or Excel format.
In the diagram below, you can see all the different kinds of data reports available and how to build the dataformat command. You can pipe the `datasets summary` output directly to dataformat using the `--as-json-lines` flag, or you can point dataformat at existing files with the `--package` flag (a zipped data package) or the `--inputfile` flag (a report file).
Each metadata report has its own list of fields and our schema is recorded in our documentation pages.
We can also use the `--help` flag to look at all options and fields.
Help output:
Flags
--fields strings Comma-separated list of fields
- accession
- ani-best-ani-match-ani
- ani-best-ani-match-assembly
- ani-best-ani-match-assembly_coverage
- ani-best-ani-match-category
- ani-best-ani-match-organism
- ani-best-ani-match-type_assembly_coverage
- ani-best-match-status
- ani-category
- ani-check-status
- ani-comment
- ani-submitted-ani-match-ani
- ani-submitted-ani-match-assembly
- ani-submitted-ani-match-assembly_coverage
- ani-submitted-ani-match-category
- ani-submitted-ani-match-organism
- ani-submitted-ani-match-type_assembly_coverage
- ani-submitted-organism
- ani-submitted-species
- annotinfo-busco-complete
- annotinfo-busco-duplicated
- annotinfo-busco-fragmented
- annotinfo-busco-lineage
- annotinfo-busco-missing
- annotinfo-busco-singlecopy
- annotinfo-busco-totalcount
- annotinfo-busco-ver
- annotinfo-featcount-gene-non-coding
- annotinfo-featcount-gene-other
- annotinfo-featcount-gene-protein-coding
- annotinfo-featcount-gene-pseudogene
- annotinfo-featcount-gene-total
- annotinfo-method
- annotinfo-name
- annotinfo-pipeline
- annotinfo-provider
- annotinfo-release-date
- annotinfo-release-version
- annotinfo-report-url
- annotinfo-software-version
- annotinfo-status
- assminfo-assembly-method
- assminfo-atypicalis-atypical
- assminfo-atypicalwarnings
- assminfo-bioproject
- assminfo-bioproject-lineage-accession
- assminfo-bioproject-lineage-parent-accession
- assminfo-bioproject-lineage-parent-accessions
- assminfo-bioproject-lineage-title
- assminfo-biosample-accession
- assminfo-biosample-attribute-name
- assminfo-biosample-attribute-value
- assminfo-biosample-bioproject-accession
- assminfo-biosample-bioproject-parent-accession
- assminfo-biosample-bioproject-parent-accessions
- assminfo-biosample-bioproject-title
- assminfo-biosample-description-comment
- assminfo-biosample-description-organism-common-name
- assminfo-biosample-description-organism-infraspecific-breed
- assminfo-biosample-description-organism-infraspecific-cultivar
- assminfo-biosample-description-organism-infraspecific-ecotype
- assminfo-biosample-description-organism-infraspecific-isolate
- assminfo-biosample-description-organism-infraspecific-sex
- assminfo-biosample-description-organism-infraspecific-strain
- assminfo-biosample-description-organism-name
- assminfo-biosample-description-organism-pangolin
- assminfo-biosample-description-organism-tax-id
- assminfo-biosample-description-title
- assminfo-biosample-ids-db
- assminfo-biosample-ids-label
- assminfo-biosample-ids-value
- assminfo-biosample-last-updated
- assminfo-biosample-models
- assminfo-biosample-owner-contact-lab
- assminfo-biosample-owner-name
- assminfo-biosample-package
- assminfo-biosample-publication-date
- assminfo-biosample-status-status
- assminfo-biosample-status-when
- assminfo-biosample-submission-date
- assminfo-blast-url
- assminfo-description
- assminfo-level
- assminfo-linked-assm-accession
- assminfo-linked-assm-type
- assminfo-name
- assminfo-notes
- assminfo-paired-assm-accession
- assminfo-paired-assm-changed
- assminfo-paired-assm-manual-diff
- assminfo-paired-assm-name
- assminfo-paired-assm-only-genbank
- assminfo-paired-assm-only-refseq
- assminfo-paired-assm-status
- assminfo-refseq-category
- assminfo-release-date
- assminfo-sequencing-tech
- assminfo-status
- assminfo-submitter
- assminfo-suppression-reason
- assminfo-synonym
- assminfo-type
- assmstats-contig-l50
- assmstats-contig-n50
- assmstats-gaps-between-scaffolds-count
- assmstats-gc-count
- assmstats-gc-percent
- assmstats-genome-coverage
- assmstats-number-of-component-sequences
- assmstats-number-of-contigs
- assmstats-number-of-organelles
- assmstats-number-of-scaffolds
- assmstats-scaffold-l50
- assmstats-scaffold-n50
- assmstats-total-number-of-chromosomes
- assmstats-total-sequence-len
- assmstats-total-ungapped-len
- checkm-completeness
- checkm-completeness-percentile
- checkm-contamination
- checkm-marker-set
- checkm-marker-set-rank
- checkm-species-tax-id
- checkm-version
- current-accession
- organelle-assembly-name
- organelle-bioproject-accessions
- organelle-description
- organelle-infraspecific-name
- organelle-submitter
- organelle-total-seq-length
- organism-common-name
- organism-infraspecific-breed
- organism-infraspecific-cultivar
- organism-infraspecific-ecotype
- organism-infraspecific-isolate
- organism-infraspecific-sex
- organism-infraspecific-strain
- organism-name
- organism-pangolin
- organism-tax-id
- source_database
- type_material-display_text
- type_material-label
- wgs-contigs-url
- wgs-project-accession
- wgs-url
-h, --help help for genome
--inputfile string Input file
--package string Data package (zip archive), inputfile parameter is relative to the root path inside the archive
Global Flags
--elide-header Do not output header
--force Force dataformat to run without type check prompt
For our question here, let's assume that we want to look at all available turtle genomes submitted by the Vertebrate Genomes Project and check their assembly level (`assminfo-level`) and scaffold N50 (`assmstats-scaffold-n50`), in addition to the scientific name (`organism-name`) and NCBI assembly accession (`accession`).
datasets summary genome taxon turtles --as-json-lines --search "Vertebrates Genome" | dataformat tsv genome \
--fields accession,organism-name,assminfo-level,assmstats-scaffold-n50
Assembly Accession Organism Name Assembly Level Assembly Stats Scaffold N50
GCA_009430475.1 Actinemys marmorata Scaffold 13640393
GCA_007922185.1 Carettochelys insculpta Scaffold 45881824
GCA_007922165.1 Chelydra serpentina Scaffold 21135443
GCF_000241765.5 Chrysemys picta bellii Chromosome 16028813
GCA_000241765.5 Chrysemys picta bellii Chromosome 16028813
GCA_004028625.2 Cuora amboinensis Scaffold 247606
GCA_003846335.1 Cuora mccordi Scaffold 32628679
GCA_007922305.1 Dermatemys mawii Scaffold 34365009
GCA_007922225.1 Emydura subglobosa Scaffold 44759016
GCA_007922155.1 Mesoclemmys tuberculata Scaffold 46415247
GCF_000230535.1 Pelodiscus sinensis Scaffold 3350749
GCA_000230535.1 Pelodiscus sinensis Scaffold 3350749
GCA_007922175.1 Pelusios castaneus Scaffold 14055032
GCA_007922195.1 Podocnemis expansa Scaffold 37101370
GCF_002925995.2 Terrapene carolina triunguis Scaffold 24249581
GCA_002925995.2 Terrapene carolina triunguis Scaffold 24249581
The assemblies are shown in alphabetical order (by scientific name), and we can see that for some species, like Chrysemys picta bellii, we have two assemblies with the same scaffold N50. If we check the accession numbers, we will see that we have a GCF and a GCA assembly (GCF_000241765.5 and GCA_000241765.5). GCA assemblies and their associated annotation files represent the original submission by users, while GCF assemblies are a copy of the original submission that was selected to be annotated and included in the RefSeq collection. You can read more about the differences between GCA and GCF assemblies in our documentation page.
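The pairing can be checked from the accessions alone, since each GCA/GCF pair in the table above shares the part after the prefix. Below is a minimal sketch using standard UNIX tools on six accessions copied from that table; the file names `accessions.txt` and `paired.txt` are placeholders. Version numbers can differ between paired assemblies in general, so treat this as a quick check for this table rather than a general rule.

```shell
# Six accessions copied from the table above (placeholder file name).
cat > accessions.txt <<'EOF'
GCA_009430475.1
GCF_000241765.5
GCA_000241765.5
GCF_000230535.1
GCA_000230535.1
GCA_007922185.1
EOF

# Strip the GCA_/GCF_ prefix, then print each identifier that occurs more than once.
sed 's/^GC[AF]_//' accessions.txt | sort | uniq -d > paired.txt
cat paired.txt
```

Identifiers listed in `paired.txt` are those present as both a GenBank (GCA) and a RefSeq (GCF) assembly.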
For the sake of brevity, let's download only the GCF assemblies. In this case, we can create a list of accessions and save it as a text file that can be used as input for datasets. Let's call this file `turtles.acc`:
Using dataformat to create a text file:
In this case, since we are using only RefSeq genomes, we can use `dataformat` to help us create the input file with the accessions we need.
Using nano to create a text file:
You can create a text file with a list of accessions anywhere that's convenient for you. The only thing to be aware of is that the file should use UTF-8 encoding to avoid any issues.
1. Open nano: `nano`
2. Paste/type the list of accessions/identifiers you would like to use.
You can either right click and select "Paste" or use `Control + V` (Windows) or `Cmd + V` (Mac)
3. Press `Control + X` to exit
4. `Save modified buffer`: type `Y`
5. Type the file name: `turtles.acc` and press `Enter`
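The list can also be built without typing it by hand. The sketch below does not use `dataformat`; instead it filters a saved copy of the summary table with `awk`, keeping the accession column of the RefSeq (GCF_) rows. The file name `turtles-vgp.tsv` is a placeholder, and only a few rows from the table above are reproduced so that the snippet runs on its own.

```shell
# A few rows of the summary table above, tab-separated (placeholder file name).
{
  printf 'Assembly Accession\tOrganism Name\tAssembly Level\tAssembly Stats Scaffold N50\n'
  printf 'GCF_000241765.5\tChrysemys picta bellii\tChromosome\t16028813\n'
  printf 'GCA_000241765.5\tChrysemys picta bellii\tChromosome\t16028813\n'
  printf 'GCF_000230535.1\tPelodiscus sinensis\tScaffold\t3350749\n'
  printf 'GCA_007922175.1\tPelusios castaneus\tScaffold\t14055032\n'
  printf 'GCF_002925995.2\tTerrapene carolina triunguis\tScaffold\t24249581\n'
} > turtles-vgp.tsv

# Keep column 1 of rows whose accession starts with GCF_; the header row does not match.
awk -F'\t' '$1 ~ /^GCF_/ { print $1 }' turtles-vgp.tsv > turtles.acc
cat turtles.acc
```

The resulting file holds one accession per line, which is the layout the `--inputfile` flag reads. If the CLI is available, piping `datasets summary genome taxon turtles --as-json-lines` (with the same `--search` filter plus `--assembly-source RefSeq`) into `dataformat tsv genome --fields accession --elide-header` should produce the same list directly; this assumes the summary command accepts the filter flags listed for download.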
Now let's download the selected assemblies using datasets. In this case, we will download genomes by accession and not taxon:
datasets download genome accession --inputfile turtles.acc --filename turtles_vgp-select.zip
Collecting 3 records [================================================] 100% 3/3
Downloading: turtles_vgp-select.zip 2.18GB done
unzip turtles_vgp-select.zip -d turtles-VGP
tree turtles-VGP/
turtles-VGP/
|-- README.md
`-- ncbi_dataset
`-- data
|-- GCF_000230535.1
| `-- GCF_000230535.1_PelSin_1.0_genomic.fna
|-- GCF_000241765.5
| `-- GCF_000241765.5_Chrysemys_picta_BioNano-3.0.4_genomic.fna
|-- GCF_002925995.2
| `-- GCF_002925995.2_T_m_triunguis-2.0_genomic.fna
|-- assembly_data_report.jsonl
`-- dataset_catalog.json
5 directories, 6 files
5. Datasets documentation and tutorials
We have a documentation page with lots of information about datasets and about NCBI genomes and assemblies in general. Examples include:
- Command-line tools: how to download and install the datasets CLI;
- How-to guides: short, one-line datasets CLI tasks;
- Tutorials: multi-task, longer tutorials, mostly based on feedback or questions we get from users.
Please explore our docs and feel free to reach out if you need help with any tasks.
6. Bonus exercise: downloading a large number of genomes
By now, you've learned how to download one or more genome assemblies and their associated annotation files. You also learned how to filter the download/metadata retrieval using the datasets CLI flags.
In the examples we explored today, we were dealing with a relatively small number of genomes and a small download size. But it's not uncommon for our users to request thousands of genomes or hundreds of gigabytes of data. In those cases, we recommend using the "dehydration/rehydration" option in datasets. Some definitions:
Dehydrated data package: a data package without any data files (FASTA, GFF3, etc.), containing only the dataset catalog and assembly report. A dehydrated data package includes a `fetch.txt` file, with a list of files to be downloaded during rehydration.
Rehydration: to rehydrate a dehydrated data package means to download the data files that are listed in the `fetch.txt` file. Users can choose to rehydrate all files listed in `fetch.txt` or use the `--match` flag to rehydrate only the files whose names contain a given string.
We will use the same turtle download example with the `--dehydrated` flag and look at the differences in the results.
datasets download genome accession --inputfile turtles.acc --filename turtles_vgp-select-dehy.zip --dehydrated
Collecting 3 records [================================================] 100% 3/3
Downloading: turtles_vgp-select-dehy.zip 6.79kB done
The first difference is the download size: 6.79 kB for the dehydrated package vs 2.18 GB for the regular download.
unzip turtles_vgp-select-dehy.zip -d turtles-vgp-dehydrated
tree turtles-vgp-dehydrated/
turtles-vgp-dehydrated/
|-- README.md
`-- ncbi_dataset
|-- data
| |-- assembly_data_report.jsonl
| `-- dataset_catalog.json
`-- fetch.txt
2 directories, 4 files
datasets rehydrate --directory turtles-vgp-dehydrated
Found 3 of 3 files for rehydration
Completed 3 of 3 [================================================] 100%
Let's look again at the folder `turtles-vgp-dehydrated` to see what changed:
tree turtles-vgp-dehydrated/
turtles-vgp-dehydrated/
|-- README.md
`-- ncbi_dataset
|-- data
| |-- GCF_000230535.1
| | `-- GCF_000230535.1_PelSin_1.0_genomic.fna
| |-- GCF_000241765.5
| | `-- GCF_000241765.5_Chrysemys_picta_BioNano-3.0.4_genomic.fna
| |-- GCF_002925995.2
| | `-- GCF_002925995.2_T_m_triunguis-2.0_genomic.fna
| |-- assembly_data_report.jsonl
| `-- dataset_catalog.json
`-- fetch.txt
5 directories, 7 files
Apart from `fetch.txt`, the folder structure now looks the same as the one we had with the original (non-dehydrated) download.
The most important question to answer is: why would I use this?
And the answer is: a dehydrated download is faster and more reliable than a regular download, simply because the number of files and amount of data being transferred is smaller. Also, the download process is serial, which means that one file is downloaded after the other, while the rehydration process runs in parallel, where multiple files can be downloaded at the same time. By default, datasets downloads/rehydrates 10 files in parallel, and that number can go up to 30. Another advantage is that the rehydration process can be resumed if it is interrupted for any reason; the same is not true for downloads: they either finish successfully or fail and can't be resumed, only restarted from the beginning.
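The options mentioned in this section can be combined on the rehydration step. Below is a minimal sketch, assuming the dehydrated package from above has been unzipped into `turtles-vgp-dehydrated` and that the number of parallel downloads is set with a `--max-workers` flag. That flag name is not shown elsewhere in this tutorial, so confirm it with `datasets rehydrate --help`.

```shell
# Rehydrate only the files whose names contain this accession, with 20 parallel workers.
# Assumption: --max-workers is the flag that controls parallel downloads.
datasets rehydrate --directory turtles-vgp-dehydrated \
    --match GCF_000230535.1 \
    --max-workers 20

# Fetch the remaining files; if a transfer is interrupted, re-running the same command resumes it.
datasets rehydrate --directory turtles-vgp-dehydrated --max-workers 20
```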