Goal: the sequencing data file I have is a bit too large for my computer to handle, so I want to split it into several pieces and run them separately.
Original source: https://biopython.org/wiki/Split_large_file — "It is useful to be able to split a sequence file into a set of smaller files, each containing a subset of the original file's sequences."
Step 1: simulate paired-end FASTQ files
```
wgsim -N 4000 -1 150 -2 150 NC_008253.fna reads_1.fastq reads_2.fastq
```
- `-N` specifies the number of read pairs to simulate
- `-1` and `-2` specify the lengths of the two mates (read 1 and read 2)
- Run the command with no arguments to see the other options
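As a quick sanity check (my own addition, not part of the original recipe, assuming the reads_1.fastq produced above), the simulated reads can be counted with Biopython to confirm the `-N` setting:

```python
from Bio import SeqIO

# Count the records in one of the simulated files.
n = sum(1 for _ in SeqIO.parse("reads_1.fastq", "fastq"))
print(n)  # expect 4000, matching wgsim -N 4000
```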
Step 2: Python code

First, define a function so that not all of the data has to be read into memory at once (I don't fully understand this code myself yet).
"There are many possible ways to solve this general problem; this recipe uses a generator function to avoid having all the data in memory at once."
```python
def batch_iterator(iterator, batch_size):
    entry = True
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = iterator.__next__()  ### the original code was entry = iterator.next()
            except StopIteration:
                entry = None
            if entry is None:
                break
            batch.append(entry)
        if batch:
            yield batch
```
If you use the original code from the `###` comment, it raises `AttributeError: 'generator' object has no attribute 'next'`: in Python 3.x, the `next()` method of a generator (any function containing the `yield` keyword is treated as a generator function) was renamed to `__next__()`.
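For illustration (my own addition, not part of the recipe), here is the rename on a throwaway generator; note that the built-in `next()` function works in both Python 2 and 3 and is the more idiomatic spelling:

```python
# Generator expressions are generators too.
gen = (x for x in range(3))

print(gen.__next__())  # 0 -- the Python 3 method name
print(next(gen))       # 1 -- the built-in next() works in Python 2 and 3
# gen.next()           # would raise AttributeError on Python 3
```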
The main script then parses the input file and writes it out in batches:

```python
import sys
from Bio import SeqIO

file_format = sys.argv[1]        # "fastq" or "fasta"
input_file = sys.argv[2]
reads_number = int(sys.argv[3])  # records per output file

# SeqIO.parse accepts a filename directly, so no explicit open() is needed.
record_iter = SeqIO.parse(input_file, file_format)
for i, batch in enumerate(batch_iterator(record_iter, reads_number)):
    # Name the output after the actual format; the original hard-coded
    # ".fastq" and "fastq" here, which breaks when splitting a FASTA file.
    filename = "group_%i.%s" % (i + 1, file_format)
    with open(filename, "w") as handle:
        count = SeqIO.write(batch, handle, file_format)
    print("Wrote %i records to %s" % (count, filename))
```
Slightly modified from the original code.
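As an aside, the batching logic itself can also be written more compactly with `itertools.islice`. This is a sketch of an equivalent alternative (my own variant, not from the Biopython wiki recipe):

```python
from itertools import islice

def batch_iterator_islice(iterator, batch_size):
    """Equivalent to batch_iterator() above, batching with islice."""
    batch = list(islice(iterator, batch_size))
    while batch:
        yield batch
        batch = list(islice(iterator, batch_size))
```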
Usage
```
python3 split_Fastq_into_multiple_Small_fastq.py fastq reads_1.fastq 1000
```
- The first positional argument specifies the file format (`fastq` or `fasta`)
- The second specifies the input file
- The third specifies how many reads each small output file should hold
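One caveat for the paired-end data from Step 1: running the script on reads_1.fastq and then on reads_2.fastq would overwrite the group_*.fastq files from the first run, since the output prefix is fixed. A hedged sketch of one way to split both mates in lockstep (my own extension, reusing `batch_iterator()` from Step 2; it assumes the two files list mates in the same order, as wgsim produces them):

```python
from Bio import SeqIO

# batch_iterator() is the generator function defined in Step 2.
batch_size = 1000
iter1 = SeqIO.parse("reads_1.fastq", "fastq")
iter2 = SeqIO.parse("reads_2.fastq", "fastq")
for i, (batch1, batch2) in enumerate(
    zip(batch_iterator(iter1, batch_size), batch_iterator(iter2, batch_size))
):
    # Write mate 1 and mate 2 of each chunk to matching file names.
    SeqIO.write(batch1, "group_%i_1.fastq" % (i + 1), "fastq")
    SeqIO.write(batch2, "group_%i_2.fastq" % (i + 1), "fastq")
```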