Batch-downloading large-scale bioinformatics datasets with NCBI Datasets

NCBI Datasets:

Datasets - NCBI: https://www.ncbi.nlm.nih.gov/datasets/

Installation

Windows download link:

https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/win64/datasets.exe

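Add the folder containing datasets.exe to the PATH environment variable. If typing datasets in cmd prints the usage help, the installation succeeded.

Installing with conda: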

conda create -n ncbi_datasets

conda activate ncbi_datasets

conda install -c conda-forge ncbi-datasets-cli
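Either installation route can be checked from a shell. The following is a minimal sketch assuming a POSIX shell (on Windows, typing datasets in cmd as described above does the same job); the helper name have_cmd is made up for illustration.

```shell
# have_cmd is a hypothetical helper: it succeeds if the named program is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd datasets; then
    echo "datasets found at: $(command -v datasets)"
else
    echo "datasets is not on PATH; check the conda environment or the PATH setting"
fi
```

If the second message appears after a conda install, the ncbi_datasets environment is probably not activated in the current shell.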

Usage

Examples
  datasets download genome accession GCF_000001405.39 --chromosomes X,Y --exclude-gff3 --exclude-rna
  datasets download genome taxon "bos taurus"
  datasets download gene gene-id 672
  datasets download gene symbol brca1 --taxon mouse
  datasets download gene accession NP_000483.3
  datasets download virus genome taxon sars-cov-2 --host dog
  datasets download virus protein S --host dog --filename SARS2-spike-dog.zip
  datasets download --input-json request_file.json --filename output.zip

Options for genome downloads:

Pick the flags that select the data you need.

Flags
  -a, --annotated                only include genomes with annotation
      --assembly-level string    restrict assemblies to a comma-separated list of one or more of: chromosome, complete_genome, contig, scaffold
      --assembly-source string   restrict assemblies to refseq or genbank only
      --chromosomes strings      limit to a specified, comma-delimited list of chromosomes (default [all])
      --dehydrated               download a dehydrated zip archive including the data report and locations of data files (use the rehydrate command to retrieve data files).
      --exclude-genomic-cds      exclude cds_from_genomic.fna (genomic cds file)
      --exclude-gff3             exclude genomic.gff (gff3 annotation file)
      --exclude-protein          exclude protein.faa (protein sequence file)
      --exclude-rna              exclude rna.fna (transcript sequence file)
      --exclude-seq              exclude genomic.fna (genomic sequence file)
  -h, --help                     help for genome
      --include-gbff             include genomic.gbff (GenBank flat file sequence and annotation), if available
      --include-gtf              include genomic.gtf (gtf annotation file), if available
      --reference                limit to reference and representative (GCF_ and GCA_) assemblies
      --released-before string   only include genomes that have been released before a specified date (MM/DD/YYYY)
      --released-since string    only include genomes that have been released after a specified date (MM/DD/YYYY)
      --search strings           only include genomes that have the specified text in the
                                 searchable fields: species and infraspecies, assembly name and submitter
                                 To provide multiple strings '--search' can be included multiple times
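Several of these flags can be combined in one command. As a sketch, the function below only assembles and prints a command line so it can be inspected before running it. The function name and the output filename are illustrative; every flag is taken from the help text above, which may differ in other releases of the tool, so confirm with datasets download genome --help on your installation.

```shell
# build_genome_cmd TAXON OUTFILE: print (do not run) a download command that
# combines several flags from the help text. Name and arguments are illustrative.
build_genome_cmd() {
    taxon=$1
    outfile=$2
    printf 'datasets download genome taxon "%s"' "$taxon"
    printf ' --reference --annotated'
    printf ' --assembly-level chromosome,complete_genome'
    printf ' --exclude-rna --exclude-protein'
    printf ' --filename %s\n' "$outfile"
}

build_genome_cmd "bos taurus" cattle_reference.zip
```

Printing first makes it easy to spot a wrong flag before starting a large transfer; copy the printed line into the shell to run it.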

For example, to download genome data for fungi (taxid: 4751):

(The taxid can be found by searching NCBI. Other download options are listed by running datasets download.)

datasets download genome taxon "4751" --dehydrated --filename fungi_genome_dataset.zip --api-key 123456789abcdefghijk

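Because the data volume is large, first download a dehydrated zip archive, which holds only the JSON data reports and the locations of the data files rather than the sequences themselves. The --api-key flag helps keep a burst of requests in a short time from getting your IP blocked by the server; an API key can be obtained by registering an NCBI account.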
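To avoid pasting the key into every command, it can be read from a variable. This is a minimal sketch: NCBI_KEY is an assumed variable name that you export yourself, and download_dehydrated is a hypothetical wrapper around the same command shown above.

```shell
# download_dehydrated TAXID OUTFILE: hypothetical wrapper around the command
# above. NCBI_KEY is an assumed variable that you set yourself.
download_dehydrated() {
    taxid=$1
    out=$2
    if [ -z "${NCBI_KEY:-}" ]; then
        echo "NCBI_KEY is not set" >&2
        return 1
    fi
    datasets download genome taxon "$taxid" --dehydrated \
        --filename "$out" --api-key "$NCBI_KEY"
}

# Usage (not run here):
#   export NCBI_KEY=your_key_here
#   download_dehydrated 4751 fungi_genome_dataset.zip
```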

Once fungi_genome_dataset.zip has finished downloading, unzip it into the current directory. The resulting file structure is:

Archive:  fungi_genome_dataset.zip
  inflating:fungi_genome_dataset/README.md
  inflating:fungi_genome_dataset/ncbi_dataset/data/*/assembly_data_report.jsonl
  inflating:fungi_genome_dataset/ncbi_dataset/data/dataset_catalog.json
  inflating:fungi_genome_dataset/ncbi_dataset/fetch.txt
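Before fetching anything, it can help to see how much the package refers to. The sketch below assumes that fetch.txt lists one file per line and that each line of assembly_data_report.jsonl is one assembly record (the JSON Lines convention); count_lines is an illustrative helper.

```shell
# count_lines FILE: print the number of newline-terminated lines in FILE
# (0 if the file does not exist).
count_lines() {
    if [ -f "$1" ]; then
        wc -l < "$1" | tr -d ' '
    else
        echo 0
    fi
}

pkg=fungi_genome_dataset
echo "files listed in fetch.txt: $(count_lines "$pkg/ncbi_dataset/fetch.txt")"

# The report path contains a wildcard in the listing above, so search for it.
if [ -d "$pkg" ]; then
    find "$pkg" -name assembly_data_report.jsonl | while read -r report; do
        echo "$report: $(count_lines "$report") records"
    done
fi
```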

Download

## If a "not find" error appears, double-check the path format
datasets rehydrate --directory fungi_genome_dataset/
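
Judging from the unzipped layout shown earlier, --directory should point at the folder that contains ncbi_dataset/, so a "not find" error is often just the wrong path. A small sketch that checks this first; rehydrate_checked is a hypothetical wrapper, not part of the tool.

```shell
# rehydrate_checked DIR: run rehydrate only if DIR looks like an unzipped
# dehydrated package, i.e. it contains ncbi_dataset/fetch.txt.
rehydrate_checked() {
    dir=${1%/}    # drop one trailing slash, if any
    if [ ! -f "$dir/ncbi_dataset/fetch.txt" ]; then
        echo "no ncbi_dataset/fetch.txt under '$dir'; wrong --directory?" >&2
        return 1
    fi
    datasets rehydrate --directory "$dir/"
}

# Usage (not run here):
#   rehydrate_checked fungi_genome_dataset/
```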