NCBI Datasets:
Datasets - NCBI: https://www.ncbi.nlm.nih.gov/datasets/
Installation
Windows download link:
https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/win64/datasets.exe
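Add the directory containing datasets.exe to the PATH environment variable, then type datasets in cmd; if the usage message appears, the installation succeeded.
Installation with conda: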
conda create -n ncbi_datasets
conda activate ncbi_datasets
conda install -c conda-forge ncbi-datasets-cli
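Either installation route can be checked the same way: the executable has to be reachable through PATH. The following is a minimal sketch for a POSIX shell (for example inside the activated conda environment); the helper name have_tool is made up for this note and is not part of the datasets CLI.

```shell
# Report whether a command is reachable through PATH.
# have_tool is a hypothetical helper, not something shipped with datasets.
have_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

# If the check fails, print a hint instead of stopping the shell.
have_tool datasets || echo "activate the conda environment or fix PATH, then retry"
```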
Usage
Examples
datasets download genome accession GCF_000001405.39 --chromosomes X,Y --exclude-gff3 --exclude-rna
datasets download genome taxon "bos taurus"
datasets download gene gene-id 672
datasets download gene symbol brca1 --taxon mouse
datasets download gene accession NP_000483.3
datasets download virus genome taxon sars-cov-2 --host dog
datasets download virus protein S --host dog --filename SARS2-spike-dog.zip
datasets download --input-json request_file.json --filename output.zip
Options for genome downloads:
Use these flags to download only the data you need.
Flags
  -a, --annotated                 only include genomes with annotation
      --assembly-level string     restrict assemblies to a comma-separated list of one or more of: chromosome, complete_genome, contig, scaffold
      --assembly-source string    restrict assemblies to refseq or genbank only
      --chromosomes strings       limit to a specified, comma-delimited list of chromosomes (default [all])
      --dehydrated                download a dehydrated zip archive including the data report and locations of data files (use the rehydrate command to retrieve data files).
      --exclude-genomic-cds       exclude cds_from_genomic.fna (genomic cds file)
      --exclude-gff3              exclude genomic.gff (gff3 annotation file)
      --exclude-protein           exclude protein.faa (protein sequence file)
      --exclude-rna               exclude rna.fna (transcript sequence file)
      --exclude-seq               exclude genomic.fna (genomic sequence file)
  -h, --help                      help for genome
      --include-gbff              include genomic.gbff (GenBank flat file sequence and annotation), if available
      --include-gtf               include genomic.gtf (gtf annotation file), if available
      --reference                 limit to reference and representative (GCF_ and GCA_) assemblies
      --released-before string    only include genomes that have been released before a specified date (MM/DD/YYYY)
      --released-since string     only include genomes that have been released after a specified date (MM/DD/YYYY)
      --search strings            only include genomes that have the specified text in the
                                  searchable fields: species and infraspecies, assembly name and submitter
                                  To provide multiple strings '--search' can be included multiple times
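The two date flags expect MM/DD/YYYY, which is easy to get wrong when dates are kept in ISO form. A small sketch of a converter follows; the function name to_mdy is hypothetical, and it assumes the input is already a well-formed YYYY-MM-DD string (no validation is done).

```shell
# Convert an ISO date (YYYY-MM-DD) to the MM/DD/YYYY form that
# --released-since and --released-before expect.
# to_mdy is a hypothetical helper; it does not validate its input.
to_mdy() {
    y=${1%%-*}        # text before the first '-'  -> year
    rest=${1#*-}      # text after the first '-'   -> MM-DD
    m=${rest%%-*}     # month
    d=${rest#*-}      # day
    printf '%s/%s/%s\n' "$m" "$d" "$y"
}

# Possible use, combining flags from the list above (not run here):
#   datasets download genome taxon "bos taurus" --annotated \
#       --assembly-level chromosome,complete_genome \
#       --released-since "$(to_mdy 2021-03-05)"
```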
For example, to download genome data for fungi (taxid: 4751):
(The taxid can be found by searching NCBI. Other download options can be listed with the command datasets download.)
datasets download genome taxon "4751" --dehydrated --filename fungi_genome_dataset.zip --api-key 123456789abcdefghijk
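Because the data set is large, --dehydrated first downloads only a zip archive of JSON metadata (the data report and the locations of the data files). The trailing --api-key keeps a burst of requests in a short time from getting your IP blocked by the server; an API key can be obtained by registering an NCBI account.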
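Typing the key directly into the command leaves it in the shell history. One way around this, sketched below, is to keep it in an environment variable and append --api-key only when that variable is set. The function name fetch_dehydrated and the DRY_RUN switch are inventions of this sketch, and the variable name NCBI_API_KEY is simply the name chosen here.

```shell
# Build the dehydrated download command, adding --api-key only when
# a key is available. Assumptions of this sketch: the key lives in the
# variable NCBI_API_KEY, and DRY_RUN (hypothetical) prints instead of runs.
fetch_dehydrated() {
    taxon="$1"
    outfile="$2"
    # Reuse the positional parameters to hold the command line.
    set -- datasets download genome taxon "$taxon" --dehydrated --filename "$outfile"
    if [ -n "${NCBI_API_KEY:-}" ]; then
        set -- "$@" --api-key "$NCBI_API_KEY"
    fi
    if [ -n "${DRY_RUN:-}" ]; then
        # Show the command without contacting the server.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Setting DRY_RUN=1 and calling fetch_dehydrated 4751 fungi_genome_dataset.zip prints the command for inspection; with DRY_RUN unset the same call runs it.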
After fungi_genome_dataset.zip has been downloaded, unzip it under the current directory; the file structure is as follows:
Archive:  fungi_genome_dataset.zip
  inflating: fungi_genome_dataset/README.md
  inflating: fungi_genome_dataset/ncbi_dataset/data/*/assembly_data_report.jsonl
  inflating: fungi_genome_dataset/ncbi_dataset/data/dataset_catalog.json
  inflating: fungi_genome_dataset/ncbi_dataset/fetch.txt
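Before fetching anything, it can help to see how many files the archive refers to. The sketch below counts the entries in fetch.txt, on the assumption that the file lists one remote data file per non-empty line; the function name count_pending_files is hypothetical.

```shell
# Count the data files listed in a dehydrated archive's fetch.txt.
# Assumption: one remote file per non-empty line.
# count_pending_files is a hypothetical helper name.
count_pending_files() {
    fetch_list="$1/ncbi_dataset/fetch.txt"
    if [ ! -f "$fetch_list" ]; then
        echo "no fetch.txt under $1" >&2
        return 1
    fi
    # grep -c . counts the lines that contain at least one character.
    grep -c . "$fetch_list"
}
```

For the archive above, the call would be count_pending_files fungi_genome_dataset.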
Download
## If the command reports "not find", check the path format carefully
datasets rehydrate --directory fungi_genome_dataset/
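Since a wrong path is the usual cause of that error, the directory can be checked before rehydrate is called. This is a rough sketch: it assumes --directory should point at the unzipped folder that contains ncbi_dataset/, as in the command above. The function name rehydrate_checked and the DRY_RUN switch are hypothetical.

```shell
# Check the unzipped folder before running rehydrate.
# Assumption: --directory must be the folder that contains ncbi_dataset/.
# rehydrate_checked and DRY_RUN are hypothetical names for this sketch.
rehydrate_checked() {
    dir="${1%/}"    # drop one trailing slash, if present
    if [ ! -d "$dir/ncbi_dataset" ]; then
        echo "expected $dir/ncbi_dataset; check the path" >&2
        return 1
    fi
    if [ -n "${DRY_RUN:-}" ]; then
        # Show the command without downloading anything.
        echo "datasets rehydrate --directory $dir/"
    else
        datasets rehydrate --directory "$dir/"
    fi
}
```

Usage would be rehydrate_checked fungi_genome_dataset/; a missing ncbi_dataset folder is reported before any download starts.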