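Mounting a Hard Disk Manually on CentOS 7 in Three Steps: GUI Edition

Step 1: Open the Disks utility

Path: Applications -> Utilities -> Disks

[Figure: Disk]

Step 2: Format the disk

Click the settings button and choose Format to format the disk. Keep the default ext4 filesystem.

[Figure: 2-1]

[Figure: 2-2]

Step 3: Mount the disk

Choose the directory to mount the disk on. On Biostack series servers, for example, the disk is mounted under the root directory.

Create the mount point under the root directory: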

cd /
mkdir biostack

In the mount options dialog, set the mount point to /biostack and switch off the automatic mount options.
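After mounting, it can be worth confirming that /biostack is really backed by the new ext4 partition. The sketch below parses text laid out like /proc/mounts. The device names in the sample are hypothetical; on a real system you would read /proc/mounts itself.

```python
def parse_mounts(text):
    """Map each mount point to (device, filesystem type).

    `text` uses the /proc/mounts layout: device, mount point, type, options, ...
    """
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        device, mount_point, fs_type = fields[:3]
        mounts[mount_point] = (device, fs_type)
    return mounts


# Hypothetical sample; on a real system use open("/proc/mounts").read().
SAMPLE_MOUNTS = """\
/dev/sda2 / xfs rw,relatime 0 0
/dev/sdb1 /biostack ext4 rw,relatime 0 0
"""

mounts = parse_mounts(SAMPLE_MOUNTS)
device, fs_type = mounts["/biostack"]
print(f"/biostack is on {device} ({fs_type})")
```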

From the Biostack team: version 2017-08-03

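Biostack R-Series Bioinformatics Appliance: Software, Hardware, and Software Configuration

Dell currently sells its 13th-generation servers, of which the R730 is the model most widely used for high-performance computing; the 14th generation is expected to be released in Q4. Some companies, game companies for example, upgrade their servers continually and retire the older generations, and those retired machines still offer very good performance.

Learning bioinformatics is attracting more and more attention, and the fastest way to learn it is hands-on practice: writing programs yourself or using bioinformatics applications to solve bioinformatics problems. That requires a software and hardware environment to learn on.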

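Latest updates

2017-07-12 Integrated the div_seq_lite data analysis pipeline
  Data from 10 samples (about 30,000 sequences per sample on average) can be analyzed in 1.5 minutes. All core functions use USEARCH. The main functions are:
      1. Sequence trimming
      2. Paired-end read merging
      3. OTU table construction
      4. Removal of chloroplast and mitochondrial 16S sequences
      5. Community composition profiling
      6. Alpha diversity analysis and rarefaction curves
      7. Beta diversity analysis

Standard configuration price

5,800 RMB per unit
Notes: 1. Shipping is included.
    2. Tax is included.
    3. The server is all original Dell hardware, except for the additionally fitted SATA disks and the 8 GB/16 GB memory.
    4. The 2 TB SATA disks and the 8 GB/16 GB memory are brand-new parts.
    5. The operating system runs on two 300 GB disks in a RAID 1 mirror for redundancy and stability.
    6. The Biostack software has its own 2 TB disk, which simplifies later maintenance.
    7. The project directory is kept separate, which makes it easy to build a RAID array for expansion.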

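Hardware configuration

Server configuration as installed:

CPU: 2 x E5645 (2.40 GHz), 12 cores and 24 threads in total
Memory: 4 x 4 GB, 16 GB in total
RAID controller: 1 x 6/i RAID controller, supporting RAID 0/1/5/10 and others
Disks: 2 x 300 GB 15K SAS
Disks: 2 x 2 TB SATA
Power supply: single PSU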

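Software configuration

System partitions and directory layout:

1. biostack partition: installed applications and databases
2. project partition: data and project directories

Accounts:

1. root password: biostack
2. Administrator user: biostack, password: biostack

Base development environment and services

The operating system is CentOS 7 installed in developer mode, so the C/C++ compilers are preinstalled.

FTP server installation

Directory layout:

1. FTP root directory: /project/pub
2. User home directory: /project/pub/<username>
3. User data directory: /project/pub/<username>/pub

Adding a virtual user:

1. Edit /etc/vsftpd/virtusers and add the user name and password.
2. Add a per-user configuration file: create a configuration file for that user in the /etc/vsftpd/vconf directory.

Apply the changes: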

db_load -T -t hash -f /etc/vsftpd/virtusers /etc/vsftpd/virtusers.db
systemctl restart  vsftpd.service
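With `-T -t hash`, db_load reads plain text in which keys and values sit on alternating lines, so the virtusers file is expected to list each user name followed by its password on the next line. A minimal sketch of producing that text; the user names and passwords here are hypothetical:

```python
def render_virtusers(users):
    """Render {user name: password} as alternating name/password lines."""
    lines = []
    for name, password in users.items():
        if "\n" in name or "\n" in password:
            raise ValueError("names and passwords must not contain newlines")
        lines.append(name)
        lines.append(password)
    return "\n".join(lines) + "\n"


# Hypothetical accounts, for illustration only.
text = render_virtusers({"alice": "s3cret", "bob": "pw"})
print(text, end="")
```

Writing this text to /etc/vsftpd/virtusers and rerunning the db_load command above would rebuild the hash database.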

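Samba server installation

Add Samba shared directories as needed.

Edit /etc/samba/smb.conf and add the following:

[project]
# path of the shared directory
path = /biostack
public = yes
browsable = yes
# restrict access to this user group
valid users = @biostack
writable = yes
guest ok = yes
read only = no
available = yes

Apply the changes: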

sudo systemctl restart smb.service
sudo systemctl restart nmb.service
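smb.conf uses an INI-like syntax, so a share definition can be given a quick sanity check with Python's configparser before restarting the services. This is only a sketch: it is not Samba's own parser, and the sample text is a shortened copy of the share above. It also shows why a comment placed after a value is risky: configparser keeps it as part of the value, and Samba is documented to recognize comments only at the start of a line.

```python
import configparser

SHARE_TEXT = """\
[project]
# path of the shared directory
path = /biostack
valid users = @biostack
writable = yes
guest ok = yes
"""


def load_share(text, name):
    """Return the options of one share section as a plain dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser[name])


share = load_share(SHARE_TEXT, "project")
print(share["path"], share["valid users"])

# A trailing comment is kept as part of the value.
trailing = load_share("[demo]\npath = /biostack  #comment\n", "demo")
print(repr(trailing["path"]))
```

For the real file, Samba's own `testparm` utility remains the authoritative check.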

Perl environment

Preinstalled Perl modules:

App::cpanminus
Array::Utils
B::Hooks::EndOfScope
Capture::Tiny
Class::Data::Inheritable
Class::HPLOO
Class::Load
Class::Load::XS
Convert::Binary::C
CPAN::Meta
CPAN::Meta::Check
CPAN::Meta::YAML
Crypt::RC4
Cwd
Data
Data::OptList
DBD::SQLite
Devel::GlobalDestruction
Devel::OverloadInfo
Devel::StackTrace
Digest::Perl::MD5
Dist::CheckConflicts
Encode
Eval::Closure
Exception::Class
Exporter::Tiny
ExtUtils::PkgConfig
File::Find::Rule
File::Grep
File::Path
File::pushd
File::Slurper
File::Temp
File::Which
Getopt::Long
Graph
Graph::ReadWrite
HDB
HTML-TableExtract
IO::String
IO::Stringy
JSON::PP
List::MoreUtils
List::Util
Log::Log4perl
Module::Implementation
Module::Metadata
Module::Runtime::Conflicts
Moose
MRO::Compat
namespace::clean
Number::Compare
OLE::Storage_Lite
Package::DeprecationManager
Package::Stash
Package::Stash::XS
Params::Util
Parse::Yapp
PerlIO::utf8_strict
Spreadsheet::ParseExcel
Sub::Exporter
Sub::Exporter::Progressive
Sub::Identify
Sub::Install
Sub::Name
Sub::Uplevel
SVG
Switch
Test::CleanNamespaces
Test::Deep
Test::Exception
Test::Fatal
Test::Harness
Test::Most
Test::Needs
Test::Pod
Test::Requires
Test::Simple
Test::Warn
Test::Warnings
Text::CSV
Text::Glob
Try::Tiny
Variable::Magic
XML::DOM
XML::DOM::XPath
XML::Filter::BufferText
XML::RegExp
XML::SAX::Expat
XML::SAX::Writer
XML::Simple
XML::Writer
XML::XPathEngine
YAML::LibYAML

Python environment

Preinstalled Python modules:

attrdict
backports.shutil-get-terminal-size
backports.ssl-match-hostname
Beaker
biom-format
biome
biopython
blivet
brewer2mpl
Brlapi
burrito
burrito-fillings
cffi
chardet
click
cogent
configobj
configshell-fb
coverage
cryptography
cupshelpers
custodia
cycler
Cython
decorator
di
dnspython
docopt
emperor
enum34
ete3
ethtool
firstboot
freeipa
fros
funcsigs
future
ga4ghmongo
gdata
ggplot
gssapi
idna
iniparse
initial-setup
iotop
ipaclient
ipaddr
ipaddress
ipalib
ipaplatform
ipapython
IPy
ipython
ipython-genutils
javapackages
jwcrypto
kdcproxy
kitchen
kmod
langtable
lvm
lxml
M2Crypto
Magic-file-extensions
Mako
MarkupSafe
matplotlib
mock
mongoengine
mykatlas
mykrobe
MySQL-python
natsort
ncbi-genome-download
netaddr
netifaces
nose
ntplib
numpy
openlmi
pandas
Paste
pathlib
pathlib2
patsy
pbr
pcp
perf
pexpect
pickleshare
pip
ply
policycoreutils-default-encoding
prompt-toolkit
psycopg2
ptyprocess
pyasn1
pycparser
pycups
pycurl
Pygments
pygobject
pygpgme
pyinotify
pykickstart
pyliblzma
pymongo
pynast
pyOpenSSL
pyparsing
pyparted
pyqi
pysam
pysmbc
python-augeas
python-dateutil
python-dmidecode
python-ldap
python-meh
python-memcached
python-nss
python-yubico
pytz
pyudev
pyusb
PyVCF
pywbem
pyxattr
PyYAML
qcli
qiime
qiime-default-reference
qrcode
requests
rtslib-fb
scandir
scdate
scikit-bio
scipy
seaborn
seobject
sepolicy
setproctitle
setroubleshoot
setuptools
simplegeneric
six
slip
slip.dbus
SSSDConfig
statsmodels
targetcli-fb
targetd
Tempita
traitlets
urlgrabber
urllib3
urwid
virtualenv
wcwidth
wheel
XlsxWriter
yum-langpacks
yum-metadata-parser

R environment

Preinstalled R packages:

annotate
BiocGenerics
class
digest
GenomicFeatures
graphics
labeling
methods
pkgconfig
rlang
spatial
tools
zlibbioc
AnnotationDbi
BiocInstaller
cluster
docopt
GenomicRanges
grDevices
lambda.r
mgcv
plogr
rlist
splines
topGO
AnnotationForge
BiocParallel
codetools
dplyr
getopt
grid
lattice
mime
plyr
rpart
stats
translations
ape
biomaRt
colorspace
edgeR
ggplot2
GSEABase
lazyeval
munsell
qvalue
Rsamtools
stats4
utils
assertthat
Biostrings
compiler
foreign
ggtree
gtable
limma
nlme
R6
RSQLite
stringi
vegan
base
bit
curl
futile.logger
git2r
gtools
locfit
nnet
RBGL
rstudioapi
stringr
whisker
BH
bit64
datasets
futile.options
glue
hash
magrittr
openssl
RColorBrewer
rtracklayer
SummarizedExperiment
withr
BiasedUrn
bitops
data.table
genefilter
GO.db
httr
MASS
optparse
Rcpp
S4Vectors
survival
XML
bindr
blob
DBI
geneLenDataBase
goseq
IRanges
Matrix
parallel
RCurl
scales
tcltk
xtable
bindrcpp
boot
devtools
GenomeInfoDb
GOstats
jsonlite
matrixStats
permute
reshape
snow
tibble
XVector
Biobase
Category
dichromat
GenomicAlignments
graph
KernSmooth
memoise
pheatmap
reshape2
SparseM
tidyr
yaml

Biostack preinstalled applications

Preinstalled applications:

bbmap-36.84
blast-2.2.26
blat-36
bowtie-1.2R
bowtie2-2.3.0
bwa-0.7.15
cd-hit-4.6.1
clustal-omega-1.2.3
diamond-0.8.37
hmmer-2.3.2
hmmer-3.1b2
infernal-1.1.2
mafft-7.222
snap-1.0dev.96
mauve-2.4.0
minimap-0.2
MUMmer-3.23
muscle-3.8.1551
ncbi-blast+-2.6.0
verticalize-0.0.1
prank-121218
RAPSearch-2.24
vsearch-2.4.2
bmtagger-3.102
KronaTools-2.7
mash-1.1.1
pplacer-1.1a18
ribopicker-0.4.3
sortmerna-2.1b
metaphlan2-2.6.0
uproc-2.0.0-rc1
bioawk-1.0
FastQC-0.11.5
fastq-tools-0.8
fastx_toolkit-0.0.14
FLASH-1.2.11
fqtools-2.0
jellyfish-1.1.11
jellyfish-2.2.6
seqkit-0.3.5
seqtk-1.2
sickle-1.33
sratoolkit-2.8.1
Trimmomatic-0.36
wgsim-0.3.1
jdk-1.7.0_80
jdk-1.8.0_111
R-3.3.3
bamtools-2.4.1
bcftools-1.4
bedtools2-2.26.0
htslib-1.4
igvtools-2.3.91
picard-2.9.0
sambamba-0.6.5
samblaster-0.1.23
samtools-0.1.18
samtools-1.4
EMBOSS-6.6.0
gargs-0.3.7
genometools-1.5.9
tabtk-0.1
csvtk-0.8.0
tsv-utils-1.1.11
parallel-20161022
htop-2.0.2
aspera-connect-3.6.1.110647
axel-2.11
gffread-0.9.6
pigz-2.3.4
xz-5.2.2
cufflinks-2.2.1
hisat2-2.0.5
transrate-1.0.3
kallisto-0.43.0
Rockhopper-2.0.3
RSEM-1.3.0
Salmon-0.8.2
STAR-2.5.3a
stringtie-1.3.3b
subread-1.5.1
tophat-2.1.1
TransDecoder-3.0.1
trinityrnaseq-2.2.0
idba-1.1.3
idba-mt-1.0
CAP3-021015
megahit-1.0.6.1
miniasm-0.2
SOAPdenovo2-240
SPAdes-3.9.1
velvet-1.2.10
VelvetOptimiser-2.2.5
andi-0.10
FastTree-2.1.9
Parsnp-1.2
phylip-3.696
trimal-1.4.1
abricate-0.3
ANIcalculator-1.0
aragorn-1.2.38
barrnap-0.7
eggnog-mapper-0.12.7
glimmer-3.02b
GlimmerHMM-3.0.4
mcl-14-137
augustus-3.2.3
BRAKER1-1.10
gm_et-4.33
prodigal-2.6.3
busco-3.0
tbl2asn-25.3
mlst-2.6
prokka-1.12
delly-0.7.6
freebayes-1.1.0
lumpy-sv-0.2.13
vcfanno-0.2.4
snp-sites-2.3.2
wham-1.8.0
snpEff-4.3
seqtk_utils-0.1.1
tabtk_utils-0.1.4
CNOGpro-utils-0.1.1
ImageMagick_utils-0.1.1
blast_utils-0.1.1
blobtools-0.9.19.3
taxonomy_utils-0.1.1
breseq-0.30.0
snippy-3.1
vcftools-0.1.14
go-1.8.1
boost-1.61.0
ksontk_utils-0.0.1
RGI-3.1.1
ResistoMap-0.0.x

Biostack preinstalled databases

Databases:

16SMicrobial
brenda
dbcan
eggnog-mapper
greengene
hg38
nr
pfam
RDP
rfam
sintax
swiss_prot
taxonomy
ardb
CARD
eggnog
gene_ontology
hg19
mash
nt
pynast_greengene
Resfams
silva
STRING
tax4fun

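From the Biostack team: version 2017-08-01

Updating Your PowerEdge Server with the Dell Server Update Utility (SUU) and the Lifecycle Controller

An earlier post, "Updating the PERC 6/i RAID Controller Firmware on a Dell R710 in Two Steps", described one approach. Sometimes, however, the operating system cannot be booted, and a USB drive is then the simplest option. This post describes how to update using a USB drive together with SUU and the Lifecycle Controller.

A few notes before we begin:

1. Copying the downloaded firmware or update packages directly onto a USB drive did not work in our tests on the R710.
2. The method described here was tested on an R710.
3. You can download the SUU for your model directly, but it is large: the latest version exceeds 10 GB.
4. The Lifecycle Controller accepts the Windows versions of the update packages, so you can choose the WIN32 format whatever operating system is installed on the machine.

Download Dell Repository Manager

Download Dell_Repository_Manager_2.1.0.451 for Windows and install it.

Build the SUU image

Reference: Creating a customized SUU with Dell Repository Manager

Update server firmware or drivers with the Lifecycle Controller

Reference: Introduction to the Lifecycle Controller on PowerEdge servers

From the Biostack team: version 2017-08-01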

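Div-seq lite 16S Data Analysis Pipeline: A Tutorial

Introduction

Microbiome research studies all members of the microbial community in a given environment, together with their full genetic and physiological functions. Traditional pure-culture methods have been developed and refined for a century, and the work of isolating, culturing, and identifying strains from the environment is approaching its limits. With advances in sequencing technology, directly amplifying and sequencing conserved genetic marker fragments from environmental samples, such as the 16S rRNA gene, has become a new research approach. As sequencing has improved and become cheaper, the microbial diversity of more and more environments has been studied, for example the human gut, hot springs, soil, volcanic craters, Antarctic glaciers, the deep sea, biogas slurry, urban transit systems, and domestic water, with far-reaching impact on major systemic issues in health, agriculture, the environment, and the oceans. The Human Microbiome Project (HMP) studied the microbiomes of human body sites (gastrointestinal tract, oral cavity, nasal cavity, female reproductive tract, and skin) extensively and revealed that the microbiome is closely tied to human health. It also spurred microbiome initiatives in many countries; the United States raised microbiome research to the national level with its National Microbiome Initiative. Microbiome research has therefore become one of the hottest frontier areas in biomedicine and an important part of future precision medicine. For this reason we developed div_seq, a general-purpose microbiome data analysis pipeline that resolves microbial composition and function quickly and effectively, giving deeper insight into community structure and diversity and into microbial function and metabolic mechanisms in the environment.

About Div-seq-lite

Div-seq-lite is built entirely on USEARCH version 10, whose features include quality control and paired-end read merging (fastq_mergepairs, doi: 10.1093/bioinformatics/btv401), OTU table construction (UPARSE, PubMed: 23955772, dx.doi.org/10.1038/nmeth.2604), community composition analysis (SINTAX), and diversity index analysis (alpha and beta).

The pipeline consists of the following analysis steps (modules):

1. Raw data quality control: trimming
2. Paired-end read merging: mergepairs
3. OTU table construction: uparse
4. Taxonomic classification of representative sequences: sintax
5. Taxonomic composition and its visualization: taxonomy
6. Within-sample diversity analysis (alpha diversity), including rarefaction curves: alpha
7. Between-sample diversity analysis (beta diversity): beta

Running the program

Div-seq-lite wraps all subprograms in the div_seq main program, which takes two arguments:

div_seq  [configuration file, e.g. metadata.tsv]  [subcommand, e.g. uparse]

div_seq has been added to the PATH, so it can be run from any directory.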
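To run the whole pipeline, one option is to call div_seq once per module. The sketch below only builds and prints the command lines. It assumes that the module names listed above are the exact subcommand names and that they should run in the listed order; neither is confirmed here beyond the `uparse` example.

```python
import shlex

# Assumed subcommand names, taken from the module list in this post.
MODULES = ["trimming", "mergepairs", "uparse", "sintax", "taxonomy", "alpha", "beta"]


def build_commands(config_file, modules=MODULES):
    """Return one div_seq argument list per module, in order."""
    return [["div_seq", config_file, module] for module in modules]


commands = build_commands("metadata.tsv")
for command in commands:
    print(" ".join(shlex.quote(part) for part in command))
```

To execute the steps for real, each list could be passed to `subprocess.run(command, check=True)` so that a failing step stops the run.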

Configuration files

The mapping file

The mapping_file uses the same format as a QIIME mapping file. For example:

#SampleID   BarcodeSequence LinkerPrimerSequence    Description
F1_1    AGAGTA,TTAGGC   AAAAAAAAAAAAAAA F1
F1_2    GGAAGA,CTCAGA   AAAAAAAAAAAAAAA F1
F1_4    CTTCCA,GCCTTA   AAAAAAAAAAAAAAA F1
F2_1    TGACCA,ATGCCT   AAAAAAAAAAAAAAA F2
F2_2    AGTTCC,TGAATG   AAAAAAAAAAAAAAA F2
F2_3    GTACTT,CCAGCT   AAAAAAAAAAAAAAA F2
F3_1    CAGATC,GTGAAA   AAAAAAAAAAAAAAA F3
F3_2    TAATCG,ACTTGA   AAAAAAAAAAAAAAA F3
F3_3    ATCACG,TACAGC   AAAAAAAAAAAAAAA F3

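Div-seq-lite currently supports only demultiplexed data, so the BarcodeSequence and LinkerPrimerSequence columns can be filled with arbitrary values. The fourth column holds the sample group.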
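Because only the first and fourth columns matter here, a mapping file can be checked with a few lines of Python before a run. This is a sketch, not part of div_seq; it assumes whitespace-separated columns and treats lines starting with `#` as the header.

```python
from collections import defaultdict


def read_mapping(text):
    """Return {sample ID: group} from the first and fourth columns."""
    sample_to_group = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split()
        if len(fields) < 4:
            raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
        sample_to_group[fields[0]] = fields[3]
    return sample_to_group


def samples_by_group(sample_to_group):
    """Invert the mapping into {group: [sample IDs]}."""
    groups = defaultdict(list)
    for sample, group in sample_to_group.items():
        groups[group].append(sample)
    return dict(groups)


MAPPING = """\
#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription
F1_1\tAGAGTA,TTAGGC\tAAAAAAAAAAAAAAA\tF1
F1_2\tGGAAGA,CTCAGA\tAAAAAAAAAAAAAAA\tF1
F2_1\tTGACCA,ATGCCT\tAAAAAAAAAAAAAAA\tF2
"""

groups = samples_by_group(read_mapping(MAPPING))
print(groups)
```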

The metadata file

#metadata
project_home    /project/div_seq
project_id       data_analysis
raw_data        /project/div_seq/raw_data
mapping_file    /project/div_seq/mapping_file.txt
singleton        F

#hardware
cpus             24
parallel         8
threads          5

#trimmomatic
trimmomatic       /biostack/tools/fastx_utils/Trimmomatic-0.36/trimmomatic-0.36.jar
trim_parameter    LEADING:3  TRAILING:3  MINLEN:100
trim_adapter      /biostack/tools/fastx_utils/Trimmomatic-0.36/adapters/TruSeq3-PE-2.fa:2:30:12:1
trim_mod          SLIDINGWINDOW:4:5

#usearch
usearch          /biostack/tools/alignment/usearch-10.0.240/usearch
usearch_mergepairs -fastq_minmergelen 0 -fastq_maxmergelen 500 -fastq_maxdiffs 10 -fastq_pctid 80  -fastq_trunctail 2

#taxonomy
sintax_db        /biostack/database/sintax/rdp_16s_v16_sp.udb
sintax_cutoff    0.8
taxon_filter     NONE

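This is a two-column, whitespace-separated file: the first field on each line is the parameter name and the rest of the line is its value. It holds the basic runtime parameters and rarely needs adjusting.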
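The layout is simple enough to read with a short parser, which can help when checking a configuration before a run. The sketch below is an assumption about how such a file can be read, not div_seq's own code: it treats lines starting with `#` as section comments and everything after the first field as the value.

```python
def read_metadata(text):
    """Return {parameter name: value string} from a metadata file."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or section comment such as "#hardware"
        parts = line.split(None, 1)  # split once, on the first run of whitespace
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        config[key] = value
    return config


METADATA = """\
#metadata
project_home    /project/div_seq
singleton        F

#hardware
cpus             24

#trimmomatic
trim_parameter    LEADING:3  TRAILING:3  MINLEN:100
"""

config = read_metadata(METADATA)
print(config["cpus"], config["trim_parameter"].split())
```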

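Other questions

The intermediate results produced by Div-seq-lite can be analyzed further with other programs, for example using QIIME to draw 3D Emperor plots of beta diversity and to run PCoA.

1. How to build a phylogenetic tree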

align_seqs.py -i OTUs_represent_tags.fasta -t /biostack/database/pynast_greengene/core_set_aligned.fasta.imputed -o align
filter_alignment.py -o  align -i align/OTUs_represent_tags_aligned.fasta
make_phylogeny.py -i align/OTUs_represent_tags_aligned_pfiltered.fasta -o OTUs_represent_tags.tre
beta_diversity.py -i OTU_table.biom -o  bdiv --metrics weighted_unifrac,unweighted_unifrac,bray_curtis -t OTUs_represent_tags.tre

2. How to run PCoA

single_rarefaction.py -i  OTU_table.biom -o OTU_table.rarify.biom -d  DEPTH   # replace DEPTH with the number of sequences to keep per sample
beta_diversity.py -i  OTU_table.rarify.biom -o bdiv --metrics weighted_unifrac,unweighted_unifrac,bray_curtis -t OTUs_represent_tags.tre
principal_coordinates.py -i  bray_curtis.distmx.txt -o  bray_curtis_coords.txt
make_2d_plots.py -i bray_curtis_coords.txt -m mapping_file.txt -o 2d_bray_curtis
make_emperor.py -i bray_curtis_coords.txt -m  mapping_file.txt -o 3d_bray_curtis
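The rarefaction depth passed to `-d` is often set to the smallest per-sample sequence count so that no sample is dropped. This is a common convention, not something this post prescribes. A sketch that computes the totals from a tab-delimited OTU table whose first column is the OTU ID; the table below is made up:

```python
def sample_totals(otu_table_text):
    """Sum the counts per sample in a tab-delimited OTU table."""
    lines = [line for line in otu_table_text.splitlines() if line.strip()]
    samples = lines[0].split("\t")[1:]  # header row: OTU ID, then sample names
    totals = dict.fromkeys(samples, 0)
    for line in lines[1:]:
        fields = line.split("\t")
        for sample, value in zip(samples, fields[1:]):
            totals[sample] += int(float(value))
    return totals


# Hypothetical counts, for illustration only.
OTU_TABLE = (
    "#OTU ID\tF1_1\tF1_2\tF2_1\n"
    "OTU_1\t10\t0\t5\n"
    "OTU_2\t3\t7\t0\n"
    "OTU_3\t0\t2\t9\n"
)

totals = sample_totals(OTU_TABLE)
depth = min(totals.values())
print(totals, depth)
```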

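3. How to draw a Venn diagram

Venn diagrams and similar plots suit a small number of sets; with more than five sets they become hard to read. In that case the following can be used instead: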

otu-upset otu_table.txt 1 otu_table.mask.txt
upset.R otu_table.mask.txt upset.pdf
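The general idea behind an UpSet-style plot is to count how many OTUs share each combination of samples or groups. The sketch below illustrates that idea only. It is not the code of otu-upset, and treating the numeric argument in the command above as a minimum count threshold is an assumption.

```python
from collections import Counter


def membership_counts(otu_counts, min_count=1):
    """Count OTUs per combination of samples in which they are present.

    otu_counts: {OTU ID: {sample: count}}
    An OTU is "present" in a sample when its count is >= min_count.
    """
    patterns = Counter()
    for counts in otu_counts.values():
        present = frozenset(s for s, c in counts.items() if c >= min_count)
        if present:
            patterns[present] += 1
    return patterns


# Hypothetical counts, for illustration only.
OTU_COUNTS = {
    "OTU_1": {"F1": 10, "F2": 0, "F3": 5},
    "OTU_2": {"F1": 3, "F2": 7, "F3": 0},
    "OTU_3": {"F1": 0, "F2": 2, "F3": 0},
    "OTU_4": {"F1": 4, "F2": 0, "F3": 1},
}

patterns = membership_counts(OTU_COUNTS)
for members, n_otus in patterns.most_common():
    print(sorted(members), n_otus)
```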

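Before running this, we recommend using USEARCH UNCROSS to correct low-abundance OTU counts caused by cross-talk during sequencing.

This produces a plot like the following:

[Figure: upset]

The code is available from the Biostack GitHub repository: https://github.com/biostack-repo/otu-upset

From the Biostack team: version 2017-07-31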

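Updating the PERC 6/i RAID Controller Firmware on a Dell R710 in Two Steps

1. Download the new firmware

Download the firmware package SAS-RAID_Firmware_F96NR_LN_6.3.3-0002_X00.BIN from Dell's official site. Link: https://downloads.dell.com/FOLDER01002136M/1/SAS-RAID_Firmware_F96NR_LN_6.3.3-0002_X00.BIN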

cd /biostack/misc
wget https://downloads.dell.com/FOLDER01002136M/1/SAS-RAID_Firmware_F96NR_LN_6.3.3-0002_X00.BIN

2. Install

Run:

# sh  SAS-RAID_Firmware_F96NR_LN_6.3.3-0002_X00.BIN
Collecting inventory...
........
Running validation...
PERC 6/i Integrated Controller 0
The version of this Update Package is newer than the currently installed version.
Software application name: PERC 6/i Integrated Controller 0 Firmware
Package version: 6.3.3-0002
Installed version: 6.3.1-0003
Continue? Y/N:Y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER DELL PRODUCTS WHILE UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
.......................................................................................................................
Device: PERC 6/i Integrated Controller 0
  Application: PERC 6/i Integrated Controller 0 Firmware
  The operation was successful.
Would you like to reboot your system now?
Continue? Y/N:
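The transcript shows the package version (6.3.3-0002) being compared with the installed version (6.3.1-0003). How Dell's updater performs the comparison is not shown; the sketch below is one plausible way to compare such strings numerically, for example in a script that decides whether an update is worth downloading.

```python
import re


def parse_version(version):
    """Turn '6.3.3-0002' into the tuple (6, 3, 3, 2)."""
    return tuple(int(part) for part in re.split(r"[.-]", version))


def is_newer(package, installed):
    """Return True if the package version is strictly newer."""
    return parse_version(package) > parse_version(installed)


print(is_newer("6.3.3-0002", "6.3.1-0003"))
```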

From the Biostack team: version 2017-07-31

Centrifuge: Fast Classification of Metagenomic Sequences

Title:

Centrifuge: rapid and sensitive classification of metagenomic sequences

Abstract:

Centrifuge is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species on desktop computers. The system uses an indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (4.2 GB for 4078 bacterial and 200 archaeal genomes) and classifies sequences at very high speed, allowing it to process the millions of reads from a typical high-throughput DNA sequencing run within a few minutes. Together, these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers. Because of its space-optimized indexing schemes, Centrifuge also makes it possible to index the entire NCBI nonredundant nucleotide sequence database (a total of 109 billion bases) with an index size of 69 GB, in contrast to k-mer-based indexing schemes, which require far more extensive space.

Link:

http://genome.cshlp.org/content/26/12/1721

Source code:

https://github.com/infphilo/centrifuge
http://www.ccb.jhu.edu/software/centrifuge

Installation:

axel ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/downloads/centrifuge-1.0.3-beta-Linux_x86_64.zip
axel ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/nt.tar.gz
axel ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/b+h+v.tar.gz

Notes:

Centrifuge: rapid and sensitive classification of metagenomic sequences, http://biorxiv.org/content/early/2016/05/25/054965

http://www.homolog.us/blogs/blog/2015/12/07/centrifuge-a-low-ram-metagenomic-classifier-from-salzberg-group/

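Overview:

[Figure: schematic of genome compression]

Centrifuge is another fast and effective classifier for metagenomic sequences (reads, contigs, and whole chromosomes). It optimizes classification with an index that combines the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, and its genome compression strategy cuts memory requirements substantially, so it can handle an index on the scale of the NT database. Kraken and other k-mer-based tools do not need such compression, but they have to store a very large k-mer table. They are fast and accurate (thanks to a long k-mer, k=31), but their sensitivity is low, especially for environments with complex diversity.

Centrifuge comes from the Center for Computational Biology (CCB) at Johns Hopkins University. Its software architecture is quite similar to that of bowtie2 and hisat2, as is its command-line interface, so it is easy to learn.

The current p+h+v index (Bacteria, Viruses, Human) is 13 GB and contains 28,718 nucleotide sequences, 14,871 NCBI Taxonomy nodes, and 8,382 species. The NT index is 77 GB and contains 39,648,092 nucleotide sequences and information on 1,028,487 species.

Interestingly, Centrifuge allows a single sequence to carry multiple taxonomy labels, and a threshold can be set to collapse multiple hits into an LCA (lowest common ancestor) assignment. In multi-hit mode, abundances can be quantified with an EM algorithm. centrifuge-kreport converts Centrifuge results into a Kraken-style report, which is a welcome feature. Kaiju also provides Kraken-style output, which keeps downstream programs uniform; a common standard would be a good thing.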
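The idea of collapsing several hits to their lowest common ancestor can be sketched in a few lines. This is a simplified illustration, not Centrifuge's implementation: the toy taxonomy below is hypothetical, `max_labels` is a made-up parameter name, and all labels are collapsed into a single LCA as soon as the threshold is exceeded.

```python
def lineage(taxon, parent):
    """Return the path from a taxon up to the root (the root is its own parent)."""
    path = [taxon]
    while parent[taxon] != taxon:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lca(taxa, parent):
    """Lowest common ancestor of one or more taxa."""
    if not taxa:
        raise ValueError("need at least one taxon")
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        ancestors = set(lineage(taxon, parent))
        common = [node for node in common if node in ancestors]
    return common[0]  # deepest node shared by every lineage


def assign(hits, parent, max_labels):
    """Keep all distinct labels, or collapse to the LCA if there are too many."""
    labels = sorted(set(hits))
    return labels if len(labels) <= max_labels else [lca(labels, parent)]


# Toy taxonomy (child -> parent), simplified for illustration.
PARENT = {
    "root": "root",
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia": "Proteobacteria",
    "Salmonella": "Proteobacteria",
    "Escherichia coli": "Escherichia",
    "Escherichia fergusonii": "Escherichia",
    "Salmonella enterica": "Salmonella",
}

hits = ["Escherichia coli", "Escherichia fergusonii", "Salmonella enterica"]
print(assign(hits, PARENT, max_labels=2))
```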

Version:

2016-12-12.v1

STAMP: An Easy-to-Use, Raw-Count-Based Tool for Identifying Biological Differences Between Metagenomic Communities

Title:

Identifying biologically relevant differences between metagenomic communities

Abstract:

Motivation: Metagenomics is the study of genetic material recovered directly from environmental samples. Taxonomic and functional differences between metagenomic samples can highlight the influence of ecological factors on patterns of microbial life in a wide range of habitats. Statistical hypothesis tests can help us distinguish ecological influences from sampling artifacts, but knowledge of only the P-value from a statistical hypothesis test is insufficient to make inferences about biological relevance. Current reporting practices for pairwise comparative metagenomics are inadequate, and better tools are needed for comparative metagenomic analysis.

Results: We have developed a new software package, STAMP, for comparative metagenomics that supports best practices in analysis and reporting. Examination of a pair of iron mine metagenomes demonstrates that deeper biological insights can be gained using statistical techniques available in our software. An analysis of the functional potential of ‘Candidatus Accumulibacter phosphatis’ in two enhanced biological phosphorus removal metagenomes identified several subsystems that differ between the A.phosphatis stains in these related communities, including phosphate metabolism, secretion and metal transport.

Availability: Python source code and binaries are freely available from our website at http://kiwi.cs.dal.ca/Software/STAMP

Link:

http://bioinformatics.oxfordjournals.org/content/26/6/715.long

Source code:

http://kiwi.cs.dal.ca/Software/STAMP
https://github.com/dparks1134/STAMP

Installation:

https://github.com/dparks1134/STAMP/releases/download/v2.1.3/STAMP_2_1_3.exe

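Overview:

[Figure: STAMPS]

Profiling metagenomic data (taxonomic profiles and functional profiles) is the first step in interpreting it. An important way to understand the function and mechanisms of environmental samples in more depth, however, is comparison: by controlling variables (or exploiting naturally differing conditions), one can predict which factors drive changes in metagenomic communities.

Metagenomic sequencing data now readily yields taxonomic and functional classifications, quantified as counts (numbers of reads). STAMP offers a friendly user interface (official builds exist for Windows and Linux), a choice of statistical approaches (two samples, two groups, or multiple groups), and a variety of visualizations (bar plots, heatmaps, PCA plots, and so on). Differential features can be filtered by odds ratio, relative risk, difference in abundance, and similar measures. It is easy to use even without bioinformatics experience.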
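For a single feature compared between two samples, the effect sizes mentioned above follow directly from the raw counts. The sketch below uses the textbook definitions; STAMP's exact formulas (for example, any pseudocount handling) are not reproduced here, and the counts are made up. Zero counts are not handled.

```python
import math


def effect_sizes(x1, n1, x2, n2):
    """Effect sizes for a feature seen x1 times in n1 reads vs x2 times in n2 reads."""
    p1, p2 = x1 / n1, x2 / n2
    return {
        "difference": p1 - p2,                          # difference in proportions
        "relative_risk": p1 / p2,                       # ratio of proportions
        "odds_ratio": (x1 / (n1 - x1)) / (x2 / (n2 - x2)),
    }


# Hypothetical counts: 30 of 100 reads in sample 1, 10 of 100 in sample 2.
sizes = effect_sizes(30, 100, 10, 100)
for name, value in sizes.items():
    print(f"{name}: {value:.3f}")
```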

Version:

2016-12-6.v1

COGNIZER: A Functional Annotation Framework for Metagenomes

Title:

COGNIZER: A Framework for Functional Annotation of Metagenomic Datasets

Abstract:

Recent advances in sequencing technologies have resulted in an unprecedented increase in the number of metagenomes that are being sequenced world-wide. Given their volume, functional annotation of metagenomic sequence datasets requires specialized computational tools/techniques. In spite of having high accuracy, existing stand-alone functional annotation tools necessitate end-users to perform compute-intensive homology searches of metagenomic datasets against “multiple” databases prior to functional analysis. Although, web-based functional annotation servers address to some extent the problem of availability of compute resources, uploading and analyzing huge volumes of sequence data on a shared public web-service has its own set of limitations. In this study, we present COGNIZER, a comprehensive stand-alone annotation framework which enables end-users to functionally annotate sequences constituting metagenomic datasets. The COGNIZER framework provides multiple workflow options. A subset of these options employs a novel directed-search strategy which helps in reducing the overall compute requirements for end-users. The COGNIZER framework includes a cross-mapping database that enables end-users to simultaneously derive/infer KEGG, Pfam, GO, and SEED subsystem information from the COG annotations.

Link:

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0142102

Source code:

http://metagenomics.atc.tcs.com/function/cognizer/

Installation:

wget http://metagenomics.atc.tcs.com/cognizer/application/COGNIZER_source_code.zip
unzip COGNIZER_source_code.zip
mv  source_code   cognizer-0.9b
cd cognizer-0.9b
gcc -O2 -g  cognizer.c  -o  cognizer

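Modification: so that cognizer can be run from any directory, change how the source code locates blastall and RAPSearch. Remove the relative paths so that the tools only need to be reachable through the PATH.
Modification: change the relative path of the database (db) directory to an absolute path so that it can be reached from any directory.
Modification: switch the RAPSearch mode to the RAPSearch2 command-line interface, use -z for multithreading, and add a bit score cutoff with a minimum bit score of 60.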
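The bit score cutoff can also be applied after the search, on the tabular output. The sketch below assumes a 12-column, tab-separated m8-style file with the bit score in the last column and comment lines starting with `#`. The sample rows are hypothetical, and this is not the change made inside the COGNIZER source.

```python
def filter_by_bitscore(lines, min_bitscore=60.0):
    """Keep tabular hits whose bit score (column 12) is at least min_bitscore."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if float(fields[11]) >= min_bitscore:
            kept.append(line)
    return kept


# Hypothetical hits, for illustration only.
HITS = [
    "# Fields: query, subject, identity, aln-len, mismatch, gap-open, "
    "q.start, q.end, s.start, s.end, log(e-value), bit-score",
    "read_1\tCOG0001\t85.0\t40\t6\t0\t1\t40\t10\t49\t-12.5\t75.2",
    "read_2\tCOG0002\t60.0\t30\t12\t0\t1\t30\t5\t34\t-2.1\t41.0",
]

kept = filter_by_bitscore(HITS)
print(kept)
```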

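Overview:

In its fast annotation mode, COGNIZER uses the NCBI COG database (ftp://ftp.ncbi.nih.gov/pub/COG/COG/myva) as the RAPSearch index for sequence similarity search, and then cross-maps the hits to other databases such as GO, KEGG, and FIG. Its biggest limitation is probably that the database is small. The paper "MOCAT2: a metagenomic assembly, annotation and profiling framework" also reports that its COG profiles are somewhat better than COGNIZER's, most likely because of the database. Another database for COG annotation is eggNOG, which is considerably larger; searched with diamond, it should run at about the same speed as myva with RAPSearch, and both are certainly faster than using blastall as the alignment engine. If you accept searching against the NCBI COG sequence set, COGNIZER is a good choice.

Version: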

2016-12-01.v1

RAPSearch2: A Fast, Efficient Alignment Tool for NGS Reads That Indexes Protein Databases with a Collision-Free Hash Table

Title:

RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data

Abstract:

Summary: With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly desirable. In a previous work, we developed RAPSearch, an algorithm that achieved a ~20–90-fold speedup relative to BLAST while still achieving similar levels of sensitivity for short protein fragments derived from NGS data. RAPSearch, however, requires a substantial memory footprint to identify alignment seeds, due to its use of a suffix array data structure. Here we present RAPSearch2, a new memory-efficient implementation of the RAPSearch algorithm that uses a collision-free hash table to index a similarity search database. The utilization of an optimized data structure further speeds up the similarity search—another 2–3 times. We also implemented multi-threading in RAPSearch2, and the multi-thread modes achieve significant acceleration (e.g. 3.5X for 4-thread mode). RAPSearch2 requires up to 2G memory when running in single thread mode, or up to 3.5G memory when running in 4-thread mode.

Availability and implementation: Implemented in C++, the source code is freely available for download at the RAPSearch2 website: http://omics.informatics.indiana.edu/mg/RAPSearch2/.

Contact: yye@indiana.edu

Supplementary information: Available at the RAPSearch2 website.

Link:

http://bioinformatics.oxfordjournals.org/content/28/1/125.abstract

Source code:

http://omics.informatics.indiana.edu/mg/RAPSearch2/
https://github.com/zhaoyanswill/RAPSearch2 (not the latest version)
https://sourceforge.net/projects/rapsearch2/files/ (latest version)

Installation:

axel https://sourceforge.net/projects/rapsearch2/files/RAPSearch2.24_64bits.tar.gz/download
tar xzvf  RAPSearch2.24_64bits.tar.gz
mv  RAPSearch2.24_64bits  RAPSearch-2.24

Notes:

Zhang, X. (2013). A New Module in RAPSearch2 for Fast Protein Similarity Search of Paired-end Sequences.

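Overview:

RAPSearch2 is the successor to RAPSearch. It changes the implementation of the RAPSearch algorithm, replacing the earlier suffix array with a collision-free hash table for indexing the database, which further reduces memory use. In our experience it is still not as fast as newer tools such as Diamond. RAPSearch2 also includes a module that supports paired-end reads.

Version: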

2016-12-11.v1

IDBA-MT: An Assembler for Metatranscriptomic Data

Title:

IDBA-MT: De Novo Assembler for Metatranscriptomic Data Generated from Next-Generation Sequencing Technology

Abstract:

High-throughput next-generation sequencing technology provides a great opportunity for analyzing metatranscriptomic data. However, the reads produced by these technologies are short and an assembling step is required to combine the short reads into longer contigs. As there are many repeat patterns in mRNAs from different genomes and the abundance ratio of mRNAs in a sample varies a lot, existing assemblers for genomic data, transcriptomic data, and metagenomic data do not work on metatranscriptomic data and produce chimeric contigs, that is, incorrect contigs formed by merging multiple mRNA sequences. To our best knowledge, there is no assembler designed for metatranscriptomic data. In this article, we introduce an assembler called IDBA-MT, which is designed for assembling reads from metatranscriptomic data. IDBA-MT produces much fewer chimeric contigs (reduce by 50% or more) when compared with existing assemblers such as Oases, IDBA-UD, and Trinity.

Link:

http://online.liebertpub.com/doi/abs/10.1089/cmb.2013.0042

Source code:

https://code.google.com/archive/p/hku-idba-mt/source/default/source

Installation:

git clone  https://github.com/jameslz/idba_mt-and-idba_mtp
#edit: idba_mt/idba_mtp libheader.h, add  #include <stdint.h>
make

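Overview:

There are not many assemblers for metatranscriptomes. Established transcriptome assemblers such as Trinity and Oases (https://github.com/dzerbino/oases) are generally used; see benchmark papers such as "Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation". Some people use DNA assemblers directly, for example IDBA-UD and MetaVelvet. The biggest problem in metatranscriptome assembly is chimeric contigs, so the assemblers designed for metatranscriptomes all try to address it. Some require paired-end reads, such as IDBA-MT, and some require protein sequences as a guide, such as IDBA-MTP.

The IDBA-MT package is hosted on Google Code (https://code.google.com/archive/p/hku-idba-mt/source/default/source). It has been imported into GitHub (https://github.com/jameslz/idba_mt-and-idba_mtp) to make it easier to download. To use IDBA-MT, first complete the assembly with IDBA-UD, then run IDBA-MT to correct chimeric sequences.

Version: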

2016-11-24.v1