
Denoising in Amplicon Sequencing Analysis – DADA2 and Its Platform Integration

Introduction

DADA2

The core of DADA2 (Divisive Amplicon Denoising Algorithm) is denoising; it then builds an OTU-like table based on the concept of Amplicon Sequence Variants (ASVs).

For the details of the algorithm, see the following papers:

Denoising PCR-amplified metagenome data
DADA2: High-resolution sample inference from Illumina amplicon data
Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses

2. Core Algorithm

The core algorithm of DADA2:

1. Error-rate model estimation, using the selfConsist unsupervised learning scheme;
2. Rate λ_ji: using the error-rate model, measures how likely it is that amplicon sequence i was produced as an error from sequence j;
3. Abundance p-value: uses abundance information to test whether a sequence at its observed abundance can be explained by the error model alone, judged against an empirical threshold, e.g. OMEGA_A = 1e-40;
4. Divisive partitioning forms amplicon clusters; the cluster center is defined as the denoised sequence, and the cluster size defines the denoised sequence's abundance.
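The abundance p-value of step 3 can be sketched as follows. This is an illustrative Python approximation of the Poisson test described in the DADA2 paper, not the package's actual R implementation; the function names are our own.

```python
from math import exp, factorial

OMEGA_A = 1e-40  # DADA2's default significance threshold

def poisson_tail(a, mu, n_terms=100):
    """P(X >= a) for X ~ Poisson(mu), summing n_terms tail terms
    with the recurrence pmf(k+1) = pmf(k) * mu / (k+1)."""
    term = exp(-mu) * mu ** a / factorial(a)   # pmf(a)
    total = 0.0
    k = a
    for _ in range(n_terms):
        total += term
        k += 1
        term *= mu / k
    return total

def abundance_p_value(a_i, n_j, lam_ji):
    """P(X >= a_i | X >= 1) where mu = n_j * lam_ji is the expected
    number of reads of sequence i produced as errors from sequence j;
    conditioning on X >= 1 reflects that i was observed at all."""
    mu = n_j * lam_ji
    return poisson_tail(a_i, mu) / (1.0 - exp(-mu))

# A sequence seen 50 times when the error model predicts only
# mu = 1000 * 1e-3 = 1 read cannot be explained by errors alone,
# so it would seed a new partition (a new ASV).
print(abundance_p_value(50, 1000, 1e-3) < OMEGA_A)   # prints True
```

A p-value below OMEGA_A means the sequence is too abundant to be an error of its partition center, which is exactly what triggers a split in the divisive partitioning of step 4.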

3. DADA2 in Practice

3.1 Materials

Download the test data:

wget http://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip

Download the training sets:

Silva 128: Silva taxonomic training data formatted for DADA2 (Silva version 128)

wget https://zenodo.org/record/824551/files/silva_nr_v128_train_set.fa.gz
wget https://zenodo.org/record/824551/files/silva_species_assignment_v128.fa.gz

RDP trainset 16: RDP taxonomic training data formatted for DADA2

wget https://zenodo.org/record/801828/files/rdp_species_assignment_16.fa.gz
wget https://zenodo.org/record/801828/files/rdp_train_set_16.fa.gz

RDP trainset 14:

wget https://zenodo.org/record/158955/files/rdp_species_assignment_14.fa.gz
wget https://zenodo.org/record/158955/files/rdp_train_set_14.fa.gz

UNITE:

wget https://unite.ut.ee/sh_files/sh_general_release_10.10.2017.zip

3.2 Installing DADA2

Test platform: R 3.3.0

Installing the DADA2 R package

Default installation via Bioconductor:

source("https://bioconductor.org/biocLite.R")
biocLite("dada2")

This installs version 1.2.2.

Installation via devtools (development version):

devtools::install_github("benjjneb/dada2")

This installs the latest version, 1.5.8.

3.3 Test Workflow

```r
library(dada2)
packageVersion("dada2")

rdp_path <- "/project/task/dada2/RDP-16"
miseq_sop <- "/project/task/dada2/MiSeq_SOP"

fnFs <- sort(list.files(miseq_sop, pattern="_R1_001.fastq"))
fnRs <- sort(list.files(miseq_sop, pattern="_R2_001.fastq"))

sample.names <- sapply(strsplit(fnFs, "_"), `[`, 1)
fnFs <- file.path(miseq_sop, fnFs)
fnRs <- file.path(miseq_sop, fnRs)

plotQualityProfile(fnFs[1:2])

filt_path <- file.path(miseq_sop, "filtered")
filtFs <- file.path(filt_path, paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_path, paste0(sample.names, "_R_filt.fastq.gz"))

filt <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,160),
      maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
      compress=TRUE, multithread=TRUE)

derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)

names(derepFs) <- sample.names
names(derepRs) <- sample.names

errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)

dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)
seqtab  <- makeSequenceTable(mergers)
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)

getN <- function(x) sum(getUniques(x))
track <- cbind(filt, sapply(dadaFs, getN), sapply(mergers, getN), rowSums(seqtab), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoised", "merged", "tabled", "nonchim")
rownames(track) <- sample.names

taxa <- assignTaxonomy(seqtab.nochim,  file.path(rdp_path, "rdp_train_set_16.fa.gz"), multithread=TRUE)
seqtab_taxa <- cbind('#seq'=rownames(taxa), t(seqtab.nochim), taxa)
write.table(seqtab_taxa, "dada2-counts-taxon.txt", sep="\t", quote=F, row.names = F)

taxa.plus <- addSpecies(taxa, file.path(rdp_path, "rdp_species_assignment_16.fa.gz"))
seqtable.taxa.plus <- cbind('#seq'=rownames(taxa.plus), t(seqtab.nochim), taxa.plus)
write.table(seqtable.taxa.plus , "dada2-counts-taxon.species.txt", sep="\t", quote=F, row.names = F)

```

3.4 Workflow Walkthrough

The workflow consists of the following steps:

Filtering and trimming
Error-rate learning
Denoising
Merging paired-end reads
Chimera removal
Taxonomy assignment
Species-level assignment

The whole workflow is driven by a handful of key functions: filterAndTrim, derepFastq, learnErrors, dada, mergePairs, assignTaxonomy, and addSpecies.

addSpecies and assignSpecies deserve special attention: their logic is to assign a species label only when exact sequence matches agree on the classification. Two requirements follow from this: 1. the reference database must be large enough; 2. its sequences must be accurately classified.

DADA2 provides silva_species_assignment_v128.fa.gz and rdp_species_assignment_16.fa.gz for identifying sequences classifiable to the species level. These files were derived from the raw reference sequences by:

a. removing sequences with ambiguous classification, e.g. Uncultured, unclassified, Outgroup, and Unidentified entries;
b. removing sequences with incomplete classification (both genus and species information must be present);
c. keeping only the genus and species information.
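The exact-match agreement rule described above can be sketched as follows. This is illustrative Python, not the package's R code; ref_db and its record layout are invented for the example.

```python
# Sketch of the assignSpecies/addSpecies logic: a query ASV receives a
# species label only when every reference sequence that exactly contains
# it agrees on (genus, species).  ref_db is a hypothetical in-memory
# database of (sequence, genus, species) records.

def assign_species(query, ref_db):
    hits = {(genus, species)
            for seq, genus, species in ref_db
            if query in seq}            # exact substring match only
    if len(hits) == 1:                  # all exact hits agree
        return hits.pop()
    return None                         # no hit, or conflicting hits

ref_db = [("TTACGTACGTAA", "Escherichia", "coli"),
          ("GGACGTACGTCC", "Escherichia", "coli")]
print(assign_species("ACGTACGT", ref_db))   # unambiguous: assigned
```

Adding a record with the same sequence fragment but a different species would make the hits conflict, and the query would be left unassigned, which is why database size and accuracy matter so much here.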

3.5 QIIME Integration

QIIME integrates DADA2 as a plugin to build feature tables, replacing traditional OTU tables.

The qiime dada2 command builds the feature table, uses MD5 hashes as feature descriptors, and supports both single-end (denoise-single) and paired-end (denoise-paired) modes.

Supported command-line options (most map directly onto the DADA2 R function arguments):

--i-demultiplexed-seqs: the demultiplexed input sequences;
--p-trunc-len-f: truncation length for forward reads; set to 0 to disable truncation;
--p-trunc-len-r: truncation length for reverse reads; set to 0 to disable truncation;
--o-table: the FeatureTable, used in downstream analyses;
--o-representative-sequences: the representative sequences (cluster centers);
--p-hashed-feature-ids: use hash values as feature identifiers;
--p-no-hashed-feature-ids: do not use hash values as feature identifiers;
--p-n-reads-learn: number of reads used for unsupervised error learning;
--p-n-threads: number of threads.

Example:

qiime dada2 denoise-paired \
    --i-demultiplexed-seqs paired-end-sequences.qza \
    --p-trunc-len-f 0 \
    --p-trunc-len-r 0 \
    --o-table table.qza \
    --o-representative-sequences rep-seqs.qza
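The hashed feature identifiers mentioned above can be reproduced as follows, assuming the hash is the MD5 digest of the feature's sequence (the sequence below is made up for illustration):

```python
import hashlib

def feature_id(sequence):
    """MD5 hex digest of an ASV sequence, used as a stable feature ID:
    identical sequences get identical IDs across samples and runs."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()

print(feature_id("TACGGAGGATGCGAGCGTTA"))   # a 32-character hex identifier
```

This is why the same ASV observed in different studies ends up with the same row name in the feature table, which makes tables directly mergeable.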

This material is part of the BASE (Biostack Applied bioinformatic SEries) course "Linux Command Line Tools for Life Scientists"; copyright belongs to 上海逻捷信息科技有限公司.

Last Update: 2017/10/16 21:20
