BioPython: Splitting a Large fastq into Smaller fastq Files

Purpose: my sequencing data file is a bit too large for my machine to handle in one go, so I want to split it into several parts and run them separately.


Original source: https://biopython.org/wiki/Split_large_file. "It is useful to be able to split a sequence file into a set of smaller files, each containing a subset of the original file's sequences."


Step 1: Simulate paired-end fastq files

wgsim -N 4000 -1 150 -2 150 NC_008253.fna reads_1.fastq reads_2.fastq
  • -N sets the number of read pairs to simulate (a quick count check follows below)
  • -1, -2 set the lengths of the two reads in each pair
  • run the command with no arguments to see the remaining options
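
As a quick sanity check (my own addition, not part of the original recipe), Biopython can count the simulated reads to confirm the -N value took effect:

from Bio import SeqIO

# count the records in one of the simulated files; expect 4000, matching -N
n = sum(1 for _ in SeqIO.parse("reads_1.fastq", "fastq"))
print(n)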

Step 2: Python code

Code

First, define a function so that not all of the data has to be read into memory at once (I don't fully understand this code myself yet).

There are many possible ways to solve this general problem; this recipe uses a generator function to avoid having all the data in memory at once.

def batch_iterator(iterator, batch_size):
    """Yield successive lists of up to batch_size records from iterator."""
    entry = True  # ensure the outer loop runs at least once
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = iterator.__next__()  # the original recipe has entry = iterator.next()
            except StopIteration:
                entry = None
            if entry is None:
                break  # end of file reached
            batch.append(entry)
        if batch:  # do not yield an empty final batch
            yield batch

If you use the original line at the hash comment, you get AttributeError: 'generator' object has no attribute 'next'. In Python 3.x the next() method of a generator (any function containing the yield keyword is treated as a generator function) was renamed to __next__(); the built-in next(iterator) is the idiomatic spelling and works in both Python 2 and 3.
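
To see what the generator actually does, here is a tiny demonstration on a plain Python iterator (my own toy example; the integers stand in for SeqRecord objects):

for batch in batch_iterator(iter(range(7)), 3):
    print(batch)
# prints [0, 1, 2], then [3, 4, 5], then [6]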

import sys
from Bio import SeqIO

file_format = sys.argv[1]  # "fastq" or "fasta"
input_file = sys.argv[2]
reads_number = int(sys.argv[3])

record_iter = SeqIO.parse(open(input_file), file_format)
for i, batch in enumerate(batch_iterator(record_iter, reads_number)):
    # name and write each chunk in the input format, so fasta input also works
    filename = "group_%i.%s" % (i + 1, file_format)
    with open(filename, "w") as handle:
        count = SeqIO.write(batch, handle, file_format)
    print("Wrote %i records to %s" % (count, filename))

This is slightly modified from the original recipe.

Usage
python3 split_Fastq_into_multiple_Small_fastq.py fastq reads_1.fastq 1000
  • the first argument specifies the file format, fastq or fasta
  • the second argument specifies the input file
  • the third argument specifies how many reads each small output file should hold
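
With the 4000 simulated reads above and 1000 reads per file, the script should produce four files; the expected output (inferred from the numbers, not copied from a run) would be:

Wrote 1000 records to group_1.fastq
Wrote 1000 records to group_2.fastq
Wrote 1000 records to group_3.fastq
Wrote 1000 records to group_4.fastq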
Closing note: it seems rather slow!
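
A likely reason is that SeqIO parses every record into a SeqRecord object. If speed matters, a minimal parse-free sketch, assuming standard 4-line fastq records (no wrapped lines) and fastq-only input, could copy raw lines instead:

import sys
from itertools import islice

input_file = sys.argv[1]
reads_per_file = int(sys.argv[2])

with open(input_file) as fh:
    i = 0
    while True:
        # grab reads_per_file records, 4 raw lines each, without parsing
        chunk = list(islice(fh, reads_per_file * 4))
        if not chunk:
            break
        i += 1
        filename = "group_%i.fastq" % i
        with open(filename, "w") as out:
            out.writelines(chunk)
        print("Wrote %i records to %s" % (len(chunk) // 4, filename))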