Goal: the sequencing data file I have is a bit too large for my computer to handle, so I want to split it into several pieces and run them separately.
Original source: https://biopython.org/wiki/Split_large_file — "It is useful to be able to split a sequence file into a set of smaller files, each containing a subset of the original file's sequences."
Step 1: simulate paired-end FASTQ files
```
wgsim -N 4000 -1 150 -2 150 NC_008253.fna reads_1.fastq reads_2.fastq
```
- `-N` specifies the number of read pairs to simulate
- `-1` and `-2` specify the lengths of the two mates (read 1 and read 2)
- Run the command with no arguments to see the other options
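As a quick sanity check (my own addition, not part of the original recipe, assuming the reads_1.fastq produced above), the simulated reads can be counted with Biopython to confirm the `-N` setting:

```python
from Bio import SeqIO

# Count the records in one of the simulated files.
n = sum(1 for _ in SeqIO.parse("reads_1.fastq", "fastq"))
print(n)  # expect 4000, matching wgsim -N 4000
```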
Step 2: Python code

First, define a function so that not all of the data has to be read into memory at once (I don't fully understand this code myself yet).
"There are many possible ways to solve this general problem; this recipe uses a generator function to avoid having all the data in memory at once."
```python
def batch_iterator(iterator, batch_size):
    entry = True
    while entry:
        batch = []
        while len(batch) < batch_size:
            try:
                entry = iterator.__next__()  ### the original code was entry = iterator.next()
            except StopIteration:
                entry = None
            if entry is None:
                break
            batch.append(entry)
        if batch:
            yield batch
```
If you use the original code from the `###` comment, it raises `AttributeError: 'generator' object has no attribute 'next'`: in Python 3.x, the `next()` method of a generator (any function containing the `yield` keyword is treated as a generator function) was renamed to `__next__()`.
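For illustration (my own addition, not part of the recipe), here is the rename on a throwaway generator; note that the built-in `next()` function works in both Python 2 and 3 and is the more idiomatic spelling:

```python
# Generator expressions are generators too.
gen = (x for x in range(3))

print(gen.__next__())  # 0 -- the Python 3 method name
print(next(gen))       # 1 -- the built-in next() works in Python 2 and 3
# gen.next()           # would raise AttributeError on Python 3
```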
The main script then parses the input file and writes it out in batches:

```python
import sys
from Bio import SeqIO

file_format = sys.argv[1]        # "fastq" or "fasta"
input_file = sys.argv[2]
reads_number = int(sys.argv[3])  # records per output file

# SeqIO.parse accepts a filename directly, so no explicit open() is needed.
record_iter = SeqIO.parse(input_file, file_format)
for i, batch in enumerate(batch_iterator(record_iter, reads_number)):
    # Name the output after the actual format; the original hard-coded
    # ".fastq" and "fastq" here, which breaks when splitting a FASTA file.
    filename = "group_%i.%s" % (i + 1, file_format)
    with open(filename, "w") as handle:
        count = SeqIO.write(batch, handle, file_format)
    print("Wrote %i records to %s" % (count, filename))
```
Slightly modified from the original code.
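As an aside, the batching logic itself can also be written more compactly with `itertools.islice`. This is a sketch of an equivalent alternative (my own variant, not from the Biopython wiki recipe):

```python
from itertools import islice

def batch_iterator_islice(iterator, batch_size):
    """Equivalent to batch_iterator() above, batching with islice."""
    batch = list(islice(iterator, batch_size))
    while batch:
        yield batch
        batch = list(islice(iterator, batch_size))
```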
Usage
```
python3 split_Fastq_into_multiple_Small_fastq.py fastq reads_1.fastq 1000
```
- The first positional argument specifies the file format (`fastq` or `fasta`)
- The second specifies the input file
- The third specifies how many reads each small output file should hold
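One caveat for the paired-end data from Step 1: running the script on reads_1.fastq and then on reads_2.fastq would overwrite the group_*.fastq files from the first run, since the output prefix is fixed. A hedged sketch of one way to split both mates in lockstep (my own extension, reusing `batch_iterator()` from Step 2; it assumes the two files list mates in the same order, as wgsim produces them):

```python
from Bio import SeqIO

# batch_iterator() is the generator function defined in Step 2.
batch_size = 1000
iter1 = SeqIO.parse("reads_1.fastq", "fastq")
iter2 = SeqIO.parse("reads_2.fastq", "fastq")
for i, (batch1, batch2) in enumerate(
    zip(batch_iterator(iter1, batch_size), batch_iterator(iter2, batch_size))
):
    # Write mate 1 and mate 2 of each chunk to matching file names.
    SeqIO.write(batch1, "group_%i_1.fastq" % (i + 1), "fastq")
    SeqIO.write(batch2, "group_%i_2.fastq" % (i + 1), "fastq")
```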