Centrifuge: fast classification of metagenomic sequences


Centrifuge: rapid and sensitive classification of metagenomic sequences


Centrifuge is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species on desktop computers. The system uses an indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (4.2 GB for 4078 bacterial and 200 archaeal genomes) and classifies sequences at very high speed, allowing it to process the millions of reads from a typical high-throughput DNA sequencing run within a few minutes. Together, these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers. Because of its space-optimized indexing schemes, Centrifuge also makes it possible to index the entire NCBI nonredundant nucleotide sequence database (a total of 109 billion bases) with an index size of 69 GB, in contrast to k-mer-based indexing schemes, which require far more extensive space.






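Centrifuge is another fast, effective tool for classifying metagenomic sequences (reads, contigs, or whole chromosomes). It classifies sequences with an index that combines the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, and a genome compression strategy cuts its memory requirements enough that it can handle an index at the scale of the NT database. k-mer-based tools such as Kraken do not need this kind of compression step, but they have to store a very large k-mer table. They are fast and accurate (thanks to a long k-mer, k=31), but their sensitivity is low, especially for environments with complex diversity.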
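To make the BWT/FM-index idea concrete, here is a minimal Python sketch of exact-match backward search. It is a toy illustration of the general technique, not Centrifuge's implementation: the class name FMIndex and its methods are made up for this example, the suffix array is built by naive sorting, the occurrence table is stored uncompressed, and the genome compression step is not modeled at all.

```python
from collections import Counter


def build_bwt(text):
    """Return (bwt, suffix_array) for text, built by naive suffix sorting."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix
    # (text[-1] is the sentinel, which handles the suffix starting at 0).
    bwt = "".join(text[i - 1] for i in sa)
    return bwt, sa


class FMIndex:
    """Toy FM index (hypothetical helper class, for illustration only)."""

    def __init__(self, text):
        self.bwt, self.sa = build_bwt(text)
        counts = Counter(self.bwt)
        # C[c]: number of characters in the text strictly smaller than c.
        self.C, total = {}, 0
        for c in sorted(counts):
            self.C[c] = total
            total += counts[c]
        # occ[c][i]: number of occurrences of c in bwt[:i].
        self.occ = {c: [0] * (len(self.bwt) + 1) for c in counts}
        for i, ch in enumerate(self.bwt):
            for c in self.occ:
                self.occ[c][i + 1] = self.occ[c][i] + (ch == c)

    def search(self, pattern):
        """Return sorted start positions of exact matches of pattern."""
        lo, hi = 0, len(self.bwt)  # current suffix-array interval [lo, hi)
        for c in reversed(pattern):  # extend the match one base to the left
            if c not in self.C:
                return []
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return []
        return sorted(self.sa[lo:hi])


index = FMIndex("ACGTACGTTACG")
hits = index.search("ACG")
```

Each step of the search narrows a suffix-array interval using only the C and occ tables, so the cost of a lookup depends on the pattern length rather than on the size of the indexed text. That property is what makes an FM index attractive for large reference collections.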

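Centrifuge comes from the Center for Computational Biology (CCB) at Johns Hopkins University. Its software architecture is quite similar to that of bowtie2 and hisat2, and so is its command-line interface, so the learning cost is low.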
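As a rough illustration of how bowtie2-like the interface is, the sketch below assembles a Centrifuge command line in Python. The helper centrifuge_command and all file names are hypothetical, and the options used (-x, -1, -2, -U, -S, --report-file, -p) are written from memory of the Centrifuge manual, so check them against `centrifuge --help` for your installed version before relying on them.

```python
def centrifuge_command(index_prefix, reads1, out_path, report_path,
                       reads2=None, threads=1):
    """Build an argv list for a Centrifuge run (hypothetical helper).

    Option names are assumed from the Centrifuge manual; verify them
    against the installed version.
    """
    cmd = ["centrifuge", "-x", index_prefix, "-p", str(threads)]
    if reads2 is None:
        cmd += ["-U", reads1]                # unpaired reads
    else:
        cmd += ["-1", reads1, "-2", reads2]  # paired-end reads
    cmd += ["-S", out_path, "--report-file", report_path]
    return cmd


# Hypothetical index prefix and file names, for illustration only.
cmd = centrifuge_command("p+h+v", "sample_R1.fq", "sample.centrifuge.tsv",
                         "sample.report.tsv", reads2="sample_R2.fq",
                         threads=8)
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where Centrifuge and the index are installed.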

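The current p+h+v index (Bacteria, Viruses, Human) is 13 GB and contains 28,718 nucleotide sequences, 14,871 NCBI Taxonomy nodes, and 8,382 species. The NT index is 77 GB and contains 39,648,092 nucleotide sequences with taxonomy information for 1,028,487 species.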

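Interestingly, Centrifuge allows a single sequence to carry multiple taxonomy labels, and a threshold setting lets you collapse multiple hits back to LCA (lowest common ancestor) mode. In multi-hit mode, abundances can be quantified with an EM algorithm. centrifuge-kreport converts Centrifuge results into a Kraken-style report, which deserves praise. Kaiju also offers Kraken-style output, so downstream programs can be fairly uniform. There really ought to be a standard for this.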
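The two ideas above, collapsing multi-hit reads to an LCA and splitting ambiguous reads with EM, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Centrifuge's actual code: the function names, the toy taxonomy, and the read data are invented, the collapse rule jumps straight to the LCA once a read has more than max_labels hits, and the EM weights each candidate taxon by abundance times genome length.

```python
from collections import defaultdict


def lca(taxa, parent):
    """Lowest common ancestor; parent maps child -> parent, root -> None."""
    def path_to_root(t):
        out = [t]
        while parent.get(t) is not None:
            t = parent[t]
            out.append(t)
        return out

    taxa = list(taxa)
    common = path_to_root(taxa[0])
    for t in taxa[1:]:
        ancestors = set(path_to_root(t))
        common = [x for x in common if x in ancestors]
    return common[0]  # deepest node shared by every path


def collapse_hits(hits, parent, max_labels):
    """Keep all labels if few enough, otherwise fall back to their LCA."""
    return set(hits) if len(hits) <= max_labels else {lca(hits, parent)}


def em_abundance(read_hits, genome_len, n_iter=200, tol=1e-10):
    """Estimate relative abundances from reads with ambiguous hits."""
    taxa = sorted({t for hits in read_hits for t in hits})
    if not taxa:
        return {}
    alpha = {t: 1.0 / len(taxa) for t in taxa}  # uniform starting point
    for _ in range(n_iter):
        # E-step: share each read among its candidate taxa in proportion
        # to abundance * genome length.
        counts = defaultdict(float)
        for hits in read_hits:
            weights = {t: alpha[t] * genome_len[t] for t in hits}
            total = sum(weights.values())
            if total == 0:
                continue
            for t, w in weights.items():
                counts[t] += w / total
        # M-step: length-normalize the expected counts, then renormalize.
        density = {t: counts[t] / genome_len[t] for t in taxa}
        norm = sum(density.values())
        new_alpha = {t: density[t] / norm for t in taxa}
        delta = max(abs(new_alpha[t] - alpha[t]) for t in taxa)
        alpha = new_alpha
        if delta < tol:
            break
    return alpha


# Toy taxonomy and reads (invented data).
parent = {"root": None, "genusX": "root", "sp1": "genusX",
          "sp2": "genusX", "sp3": "root"}
reads = [{"sp1"}] * 6 + [{"sp2"}] * 2 + [{"sp1", "sp2"}] * 4
abund = em_abundance(reads, {"sp1": 1000, "sp2": 1000})
```

In this toy data set the six reads unique to sp1 and the two unique to sp2 pull the four ambiguous reads mostly toward sp1, so the estimate settles near 0.75 for sp1 and 0.25 for sp2.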



