Archives

Categories

使用RDP对16S序列进行分类:rdp_classifier

今天探讨的是使用RDP分类器对16S/ITS/18S序列进行分类;

一、RDP分类器

RDP分类器是老牌的序列分类软件,采用了 naïve Bayesian classifier,最低分类等级到属水平。 QIIME(1.9 版本)使用的是RDP比较老的版本(2.2),当前版本为 2.12, 这个符合QIIME 一贯的作风,OTU 聚类 usearch 还一直采用Usearch比较老的算法版本(Uclust).

QIIME老用户请切换至 QIIME2, 全新的设计,还带有很多新的特性。

具体算法原理请参考:

Wang, Q, G. M. Garrity, J. M. Tiedje, and J. R. Cole. 2007. Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Appl Environ Microbiol. 73(16):5261-7

二、下载安装

分类器:

可以从Github上下载源代码编译(仓库:RDPTools), 也可以直接从 SourceForge 下载编译程序:

https://sourceforge.net/projects/rdp-classifier/files/rdp-classifier/rdp_classifier_2.12.zip/download

程序执行需要JAVA环境,大部分Linux操作系统都带有JAVA环境,所以不需要额外做些什么。

训练集:

官方提供了 RDPClassifier_16S_trainsetNo16 fungalits_warcup_trainingdata_V2 fungalits_UNITE_trainingdata_07042014 RDPClassifier_fungiLSU_trainsetNo11 四个训练集, 可以从 SourceForge 下载:

wget https://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/RDPClassifier_16S_trainsetNo16_rawtrainingdata.zip/download
wget https://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/fungalits_warcup_trainingdata_V2.zip/download
wget https://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/fungalits_UNITE_trainingdata_07042014.zip/download
wget https://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/RDPClassifier_fungiLSU_trainsetNo11_rawtrainingdata.zip/download

下面我们有:

.
├── fungalits_UNITE_trainingdata_07042014
├── fungalits_warcup_trainingdata_V2
├── otus.fa
├── RDPClassifier_16S_trainsetNo16_rawtrainingdata
├── rdp_classifier_2.12
└── RDPClassifier_fungiLSU_trainsetNo11_rawtrainingdata

接下来可以进行序列分类了。

三、命令行执行

RDP分类器提供了Web服务,地址: https://rdp.cme.msu.edu/classifier/classifier.jsp, 如果只有几条序列,直接提交到Web上就可以了。

下载的程序(SourceForge 只包含了 classify 程序), 对于我们做序列分类足够使用了。

命令接口:

classify 主程序接口:

$java -jar  classifier.jar
USAGE: ClassifierMain <subcommand> <subcommand args ...>
default command is classify
        classify      - classify one or multiple samples
        crossvalidate - cross validate accuracy testing
        comp-trainset - compare multiple training sets to find shared and unique taxa and sequences
        libcompare    - compare two samples
        loot          - leave one (sequence or taxon) out accuracy testing
        merge-detail  - merge classification detail result files to create a taxon assignment counts file
        merge-count   - merge multiple taxon assignment count files to into one count file
        random-sample - random select a subset or subregion of sequences
        rm-dupseq     - remove identical or any sequence contained by another sequence
        rm-partialseq - remove partial sequences
        taxa-sim      - calculate and plot the similarities within taxa
        train         - retrain classifier
        version       - taxonomy versions of the pre-compiled training sets

训练:

使用分类器首先序列对提供的训练集进行训练,使用 train 子命令,其接口如下:

$java -jar rdp_classifier_2.12/dist/classifier.jar  train
Command Error: taxon file must be specified
usage: train [-c <arg>] [-m <arg>] [-n <arg>] [-o <arg>] [-s <arg>] [-t <arg>] [-v <arg>]
 -c,--copynumber_file <arg>   contains at least name, rank and the mean copy number of taxa. A header line is required
                              to find the corresponding columns
                              Only the copy number of the lowest rank taxa will be loaded and the copy number of the
                              other taxa are derived from these.
 -m,--mod <arg>               the modifcation information of the taxonomy
 -n,--version_no <arg>        an integer used to refer to a training set
 -o,--out_dir <arg>           the output directory
 -s,--seq <arg>               training sequences in FASTA format with lineage in the header:
                              a list taxon names seperated by ';' with highest rank taxon first.
                              The lowest rank of the lineage have to be the same for all sequence.
                              The lowest rank is not limited to genus
 -t,--tax_file <arg>          contains the hierarchical taxonomy information in the following format:
                              taxid*taxon name*parent taxid*depth*rank
                              Fields taxid, the parent taxid and depth should be in integer format
                              The taxid, or the combination of taxon name and rank is unique
                              depth indicates the depth from the root taxon.
                              Note: the depth for the root is 0
 -v,--version <arg>           the version of the hierarchical taxonomy

RDP 提供的训练文件满足以下参数需求:

cd RDPClassifier_16S_trainsetNo16_rawtrainingdata
java -jar  ../rdp_classifier_2.12/dist/classifier.jar  train  -n 16  -o .  -s trainset16_022016.fa  -t  trainset16_db_taxid.txt  -c  rrnDB-4.4.4_copynumber_RDP_022016.txt
wget  https://raw.githubusercontent.com/rdpstaff/classifier/master/samplefiles/rRNAClassifier.properties

必须要有rRNAClassifier.properties 这个文件, 例如这样就可以了:

# Sample ResourceBundle properties file
bergeyTree=bergeyTrainingTree.xml
probabilityList=genus_wordConditionalProbList.txt
probabilityIndex=wordConditionalProbIndexArr.txt
wordPrior=logWordPrior.txt
classifierVersion=RDP Naive Bayesian rRNA Classifier Version 2.11

执行分类器的时候需要指定这个文件。

分类:

现在就可以使用 classify 子命令进行序列分类, 其接口:

$java -jar  classifier.jar  classify
Command Error: Require the output file for classification assignment
usage:  [options] <samplefile>[,idmappingfile] ...
 -b,--bootstrap_outfile <arg>   the output file containing the number of
                                matching assignments out of 100 bootstraps for
                                major ranks. Default is null
 -c,--conf <arg>                assignment confidence cutoff used to determine
                                the assignment count for each taxon. Range
                                [0-1], Default is 0.8.
 -d,--metadata <arg>            the tab delimited metadata file for the samples,
                                with first row containing attribute name and
                                first column containing the sample name
 -f,--format <arg>              tab-delimited output format:
                                [allrank|fixrank|biom|filterbyconf|db]. Default
                                is allRank.
                                allrank: outputs the results for all ranks
                                applied for each sequence: seqname, orientation,
                                taxon name, rank, conf, ...
                                fixrank: only outputs the results for fixed
                                ranks in order: domain, phylum, class, order,
                                family, genus
                                biom: outputs rich dense biom format if OTU or
                                metadata provided
                                filterbyconf: only outputs the results for major
                                ranks as in fixrank, results below the
                                confidence cutoff were bin to a higher rank
                                unclassified_node
                                db: outputs the seqname, trainset_no, tax_id,
                                conf.
 -g,--gene <arg>                16srrna, fungallsu, fungalits_warcup,
                                fungalits_unite. Default is 16srrna. This option
                                can be overwritten by -t option
 -h,--hier_outfile <arg>        tab-delimited output file containing the
                                assignment count for each taxon in the
                                hierarchical format. Default is null.
 -m,--biomFile <arg>            the input clluster biom file. The classification
                                result will replace the taxonomy of the
                                corresponding cluster id.
 -o,--outputFile <arg>          tab-delimited text output file for
                                classification assignment.
 -q,--queryFile                 legacy option, no longer needed
 -s,--shortseq_outfile <arg>    the output file containing the sequence names
                                that are too short to be classified
 -t,--train_propfile <arg>      property file containing the mapping of the
                                training files if not using the default. Note:
                                the training files and the property file should
                                be in the same directory.
 -w,--minWords <arg>            minimum number of words for each bootstrap
                                trial. Default(maximum) is 1/8 of the words of
                                each sequence. Minimum is 5

我们使用程序的宗旨一般是: 非高级用户就使用默认参数。

最简单执行模式:

java -jar rdp_classifier_2.12/dist/classifier.jar classify  -t  RDPClassifier_16S_trainsetNo16_rawtrainingdata/rRNAClassifier.properties -c 0.5  -o  otu_16S_trainsetNo16.txt  otus.fa

参数设置:

分类器主要要考虑的参数为:可信度阈值,对于比较短(长度短于250碱基)的序列推荐设置: 0.5, 请参考:

Partial sequences with length shorter than 250 bps should use bootstrap cutoff 50%.

四、探讨

1. 命令行接口

Java类程序最佳实践应该为 jar 包 写一个Wrap, 这样比较容易嵌入命令行程序,只要在PATH路径就可以随便调用,而不用去了解jar路径。

现在可以这样:

cd rdp_classifier_2.12
touch  rdp_classifier
export PATH=$PWD:$PATH

rdp_classifier的内容是这样:

#!/usr/bin/perl -w

use strict;
use warnings;
use File::Basename;
use Cwd 'abs_path';

die "Usage: rdp_classifier

default command is classify
        classify      - classify one or multiple samples
        crossvalidate - cross validate accuracy testing
        comp-trainset - compare multiple training sets to find shared and unique taxa and sequences
        libcompare    - compare two samples
        loot          - leave one (sequence or taxon) out accuracy testing
        merge-detail  - merge classification detail result files to create a taxon assignment counts file
        merge-count   - merge multiple taxon assignment count files to into one count file
        random-sample - random select a subset or subregion of sequences
        rm-dupseq     - remove identical or any sequence contained by another sequence
        rm-partialseq - remove partial sequences
        taxa-sim      - calculate and plot the similarities within taxa
        train         - retrain classifier
        version       - taxonomy versions of the pre-compiled training sets
" if(@ARGV <=  1);

my $path = dirname( abs_path($0) );

my $cmd = join(" ", @ARGV);
system qq{java -jar $path/dist/classifier.jar $cmd\n};

现在就可以愉快的使用命令行做序列分类了:

rdp_classifier  classify  -t  RDPClassifier_16S_trainsetNo16_rawtrainingdata/rRNAClassifier.properties     -b 100  -c 0.5  -o  otu_16S_trainsetNo16.txt  otus.fa

2. 库选择

这里有一个误区, “库越大越好”, 其实首先要保证的是库里序列分类的准确性,如果很多分类都不准确,那怎么能预测准确呢? 具体解释请参考Robert Edgar 的描述: FAQ: Which taxonomy database should I use?

本文材料为 BASE (Biostack Applied bioinformatic SEies ) 课程 Linux Command Line Tools for Life Scientists 材料, 版权归 上海逻捷信息科技有限公司 所有

Last Update: 2017-09-22 3:09 PM

Comments are closed.