Archives

Centrifuge: rapid classification of metagenomic sequences

Title:

Centrifuge: rapid and sensitive classification of metagenomic sequences

Abstract:

Centrifuge is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species on desktop computers. The system uses an indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (4.2 GB for 4078 bacterial and 200 archaeal genomes) and classifies sequences at very high speed, allowing it to process the millions of reads from a typical high-throughput DNA sequencing run within a few minutes. Together, these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers. Because of its space-optimized indexing schemes, Centrifuge also makes it possible to index the entire NCBI nonredundant nucleotide sequence database (a total of 109 billion bases) with an index size of 69 GB, in contrast to k-mer-based indexing schemes, which require far more extensive space.

Link:

http://genome.cshlp.org/content/26/12/1721

Source code:

https://github.com/infphilo/centrifuge
http://www.ccb.jhu.edu/software/centrifuge

Installation:

axel ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/downloads/centrifuge-1.0.3-beta-Linux_x86_64.zip
axel ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/nt.tar.gz
axel ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/b+h+v.tar.gz

Notes:

Centrifuge: rapid and sensitive classification of metagenomic sequences, http://biorxiv.org/content/early/2016/05/25/054965

http://www.homolog.us/blogs/blog/2015/12/07/centrifuge-a-low-ram-metagenomic-classifier-from-salzberg-group/

Overview:


[Figure: genome compression schematic]

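Centrifuge is another fast and effective classifier for metagenomic sequences (reads, contigs, or whole chromosomes). It optimizes classification with an index that combines the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, and its genome compression strategy substantially reduces memory requirements, so it can handle an index at the scale of the NCBI nt database. K-mer-based tools such as Kraken need no such compression step, but they must store a very large k-mer table; they are fast and precise (long k-mers, k=31) but much less sensitive, especially in highly diverse environments.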
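To make the BWT/FM-index idea concrete, here is a minimal Python sketch of exact-match counting by backward search. It is a toy illustration under stated assumptions, not Centrifuge's implementation: the rotation sort and full occurrence table are only practical for tiny strings, and the names `bwt_index` and `count_matches` are made up for this example.

```python
from collections import Counter


def bwt_index(text):
    """Build a toy FM index for `text`: first-column offsets, occurrence table, length."""
    text += "$"  # sentinel; sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    bwt = "".join(rot[-1] for rot in rotations)  # last column of the sorted rotations

    # first[c] = number of characters in the text that are strictly smaller than c
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first, occ, len(bwt)


def count_matches(index, pattern):
    """Count exact occurrences of `pattern` by backward search over the index."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of matching rows in the sorted rotations
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

Each pattern character narrows the row range with two table lookups, so the search cost depends on the read length rather than on the size of the indexed genomes.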

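Centrifuge comes from the Center for Computational Biology (CCB) at Johns Hopkins University. Its software architecture resembles that of bowtie2 and hisat2, and its command-line interface is similar as well, so the learning curve is gentle.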

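The current p+h+v index (Bacteria, Viruses, Human) is 13 GB and contains 28,718 nucleotide sequences, 14,871 NCBI Taxonomy nodes, and 8,382 species. The nt index is 77 GB and contains 39,648,092 nucleotide sequences with 1,028,487 species records.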

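Interestingly, Centrifuge allows a single sequence to carry multiple taxonomy labels, and a threshold can be set to collapse multiple hits to their lowest common ancestor (LCA). In multi-hit mode, abundance is quantified with an EM algorithm. centrifuge-kreport converts Centrifuge results into a Kraken-style report, which is a welcome feature; Kaiju also provides Kraken-style output. This keeps downstream programs consistent, and a common standard would be better still.

Version: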
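The collapse of multiple hits to an LCA can be sketched in a few lines of Python. This is only an illustration of the idea: the `parent` map is a hypothetical toy taxonomy (not real NCBI taxids), and `max_labels` is an assumed name for the threshold, not a Centrifuge option name.

```python
def lineage(taxid, parent):
    """Return the path from `taxid` up to the root (the root is its own parent)."""
    path = [taxid]
    while parent[taxid] != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lca(taxids, parent):
    """Lowest common ancestor of several taxids in a rooted taxonomy tree."""
    common = None
    for taxid in taxids:
        path = lineage(taxid, parent)
        if common is None:
            common = path
        else:
            keep = set(path)
            common = [node for node in common if node in keep]
    return common[0]  # first shared node on the way up is the lowest one


def assign_labels(hits, parent, max_labels):
    """Keep all distinct hits if there are few enough, otherwise report their LCA."""
    hits = sorted(set(hits))
    if len(hits) <= max_labels:
        return hits
    return [lca(hits, parent)]


# Hypothetical toy taxonomy: 1 is the root, 10 a genus, 20/21 species, 30-32 strains.
TOY_PARENT = {1: 1, 10: 1, 20: 10, 21: 10, 30: 20, 31: 20, 32: 21}
```

With `max_labels=1` every ambiguous read is reported at its LCA, which mirrors the single-label behaviour of Kraken-style classifiers.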

2016-12-12.v1

STAMP: an easy-to-use tool, based on raw counts, for identifying biologically relevant differences between metagenomic communities

Title:

Identifying biologically relevant differences between metagenomic communities

Abstract:

Motivation: Metagenomics is the study of genetic material recovered directly from environmental samples. Taxonomic and functional differences between metagenomic samples can highlight the influence of ecological factors on patterns of microbial life in a wide range of habitats. Statistical hypothesis tests can help us distinguish ecological influences from sampling artifacts, but knowledge of only the P-value from a statistical hypothesis test is insufficient to make inferences about biological relevance. Current reporting practices for pairwise comparative metagenomics are inadequate, and better tools are needed for comparative metagenomic analysis.

Results: We have developed a new software package, STAMP, for comparative metagenomics that supports best practices in analysis and reporting. Examination of a pair of iron mine metagenomes demonstrates that deeper biological insights can be gained using statistical techniques available in our software. An analysis of the functional potential of ‘Candidatus Accumulibacter phosphatis’ in two enhanced biological phosphorus removal metagenomes identified several subsystems that differ between the A.phosphatis strains in these related communities, including phosphate metabolism, secretion and metal transport.

Availability: Python source code and binaries are freely available from our website at http://kiwi.cs.dal.ca/Software/STAMP

Link:

http://bioinformatics.oxfordjournals.org/content/26/6/715.long

Source code:

http://kiwi.cs.dal.ca/Software/STAMP
https://github.com/dparks1134/STAMP

Installation:

https://github.com/dparks1134/STAMP/releases/download/v2.1.3/STAMP_2_1_3.exe

Overview:

[Figure: STAMP]

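Profiling metagenome data (taxonomic profiles and functional profiles) is the first step in interpreting it. An important way to understand the function and mechanisms of environmental samples in more depth, however, is comparison: by controlling variables (or exploiting naturally differing conditions), one can infer which factors drive changes in metagenomic communities.

Metagenome sequencing data now readily yields taxonomic and functional classifications, summarized as counts (numbers of reads). STAMP offers a friendly graphical interface (official builds for Windows and Linux), a choice of statistical approaches (two samples, two groups, or multiple groups), and varied visualizations (bar plots, heatmaps, PCA plots, and so on). Differential features can be filtered by odds ratio, relative risk, difference in abundance, and similar effect sizes, so it is easy to use even without bioinformatics experience.

Version: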
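As a reminder of what these effect sizes mean for count data, the following Python sketch computes them for one feature in two samples using the textbook 2x2 definitions. It is an assumption that these match STAMP's exact formulas; the function name and arguments are invented for the example.

```python
def effect_sizes(x1, n1, x2, n2):
    """Effect sizes for one feature seen x1 times in n1 reads and x2 times in n2 reads."""
    if min(x1, x2) <= 0 or x1 >= n1 or x2 >= n2:
        # ratios are undefined at the boundaries; real tools add a pseudocount here
        raise ValueError("counts must satisfy 0 < x < n in both samples")
    p1, p2 = x1 / n1, x2 / n2
    return {
        "difference_in_proportions": p1 - p2,
        "ratio_of_proportions": p1 / p2,  # relative risk
        "odds_ratio": (x1 / (n1 - x1)) / (x2 / (n2 - x2)),
    }
```

For rare features the odds ratio and the ratio of proportions are nearly equal, which is why either one can serve as a filter alongside the p-value.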

2016-12-6.v1

COGNIZER: a functional annotation framework for metagenomes

Title:

COGNIZER: A Framework for Functional Annotation of Metagenomic Datasets

Abstract:

Recent advances in sequencing technologies have resulted in an unprecedented increase in the number of metagenomes that are being sequenced world-wide. Given their volume, functional annotation of metagenomic sequence datasets requires specialized computational tools/techniques. In spite of having high accuracy, existing stand-alone functional annotation tools necessitate end-users to perform compute-intensive homology searches of metagenomic datasets against “multiple” databases prior to functional analysis. Although, web-based functional annotation servers address to some extent the problem of availability of compute resources, uploading and analyzing huge volumes of sequence data on a shared public web-service has its own set of limitations. In this study, we present COGNIZER, a comprehensive stand-alone annotation framework which enables end-users to functionally annotate sequences constituting metagenomic datasets. The COGNIZER framework provides multiple workflow options. A subset of these options employs a novel directed-search strategy which helps in reducing the overall compute requirements for end-users. The COGNIZER framework includes a cross-mapping database that enables end-users to simultaneously derive/infer KEGG, Pfam, GO, and SEED subsystem information from the COG annotations.

Link:

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0142102

Source code:

http://metagenomics.atc.tcs.com/function/cognizer/

Installation:

wget http://metagenomics.atc.tcs.com/cognizer/application/COGNIZER_source_code.zip
unzip COGNIZER_source_code.zip
mv  source_code   cognizer-0.9b
cd cognizer-0.9b
gcc -O2 -g  cognizer.c  -o  cognizer

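Modification: so that cognizer can be run from any directory, edit the paths to blastall or RAPSearch in the source code. Remove the relative paths so that the program works whenever RAPSearch or blastall is reachable through the PATH environment variable.
Modification: change the relative database (db) paths to absolute paths so that they resolve from any directory.
Modification: switch the RAPSearch mode to the RAPSearch2 command-line interface, use -z for multithreading, and add a bit score limit with a minimum bit score of 60.

Overview: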
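The minimum bit score rule can also be applied after the search. The Python sketch below filters tabular hits, assuming a BLAST-style 12-column tab-separated layout with the bit score in the last column; that layout is an assumption about the output file, and the function name is hypothetical.

```python
def filter_hits(lines, min_bitscore=60.0):
    """Keep (query, subject, bitscore) for hits whose bit score is at least the cutoff."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a 12-column hit line (assumed layout)
        bitscore = float(fields[11])
        if bitscore >= min_bitscore:
            kept.append((fields[0], fields[1], bitscore))
    return kept
```

Filtering in a separate step keeps the cutoff easy to change without rerunning the alignment.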

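In its fast annotation mode, COGNIZER uses the NCBI COG database (ftp://ftp.ncbi.nih.gov/pub/COG/COG/myva) as the RAPSearch index for sequence similarity search, then cross-maps the hits to other databases such as GO, KEGG, and FIG. Its biggest weakness is probably the small database. The MOCAT2 paper (MOCAT2: a metagenomic assembly, annotation and profiling framework) also reports that its COG profiles are somewhat better than COGNIZER's, most likely because of the database. Another database for COG annotation is eggNOG, which is considerably larger; searched with DIAMOND, it should be about as fast as myva + RAPSearch, and both are certainly faster than using blastall as the alignment engine. If searching against the NCBI COG sequence set is acceptable, COGNIZER is a very good option.

Version: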

2016-12-01.v1

RAPSearch2: a fast and efficient aligner for NGS reads that indexes the protein database with a collision-free hash table

Title:

RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data

Abstract:

Summary: With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly desirable. In a previous work, we developed RAPSearch, an algorithm that achieved a ~20–90-fold speedup relative to BLAST while still achieving similar levels of sensitivity for short protein fragments derived from NGS data. RAPSearch, however, requires a substantial memory footprint to identify alignment seeds, due to its use of a suffix array data structure. Here we present RAPSearch2, a new memory-efficient implementation of the RAPSearch algorithm that uses a collision-free hash table to index a similarity search database. The utilization of an optimized data structure further speeds up the similarity search—another 2–3 times. We also implemented multi-threading in RAPSearch2, and the multi-thread modes achieve significant acceleration (e.g. 3.5X for 4-thread mode). RAPSearch2 requires up to 2G memory when running in single thread mode, or up to 3.5G memory when running in 4-thread mode.

Availability and implementation: Implemented in C++, the source code is freely available for download at the RAPSearch2 website: http://omics.informatics.indiana.edu/mg/RAPSearch2/.

Contact: yye@indiana.edu

Supplementary information: Available at the RAPSearch2 website.

Link:

http://bioinformatics.oxfordjournals.org/content/28/1/125.abstract

Source code:

http://omics.informatics.indiana.edu/mg/RAPSearch2/
https://github.com/zhaoyanswill/RAPSearch2 (not the latest version)
https://sourceforge.net/projects/rapsearch2/files/ (latest version)

Installation:

axel https://sourceforge.net/projects/rapsearch2/files/RAPSearch2.24_64bits.tar.gz/download
tar xzvf  RAPSearch2.24_64bits.tar.gz
mv  RAPSearch2.24_64bits  RAPSearch-2.24

Notes:

Zhang, X. (2013). A New Module in RAPSearch2 for Fast Protein Similarity Search of Paired-end Sequences.

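Overview:

RAPSearch2 is the successor to RAPSearch. It changes the implementation of the RAPSearch algorithm, replacing the suffix array used to index the database with a collision-free hash table, which further reduces memory usage. In practice it is still not as fast as later tools such as DIAMOND. RAPSearch2 also includes a module that supports paired-end reads.

Version: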
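One simple way to get a collision-free hash over short protein seeds is direct addressing: encode each fixed-length seed as a base-20 number and use it as a table slot. The Python sketch below shows that idea only; it is an assumption-laden toy, not RAPSearch2's actual data structure (which, among other things, works on a reduced amino acid alphabet), and all names are invented.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
CODE = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def seed_slot(seed):
    """Map a seed to a unique slot by reading it as a base-20 number."""
    slot = 0
    for aa in seed:
        slot = slot * len(AMINO_ACIDS) + CODE[aa]
    return slot


def build_seed_table(proteins, k=3):
    """Index every k-mer of every protein; table[slot] lists (protein_id, position)."""
    table = [[] for _ in range(len(AMINO_ACIDS) ** k)]
    for pid, seq in enumerate(proteins):
        for pos in range(len(seq) - k + 1):
            seed = seq[pos:pos + k]
            if all(aa in CODE for aa in seed):  # skip seeds with ambiguous residues
                table[seed_slot(seed)].append((pid, pos))
    return table


def lookup(table, seed):
    """Return all database positions where `seed` occurs."""
    return table[seed_slot(seed)]
```

Because distinct seeds of the same length always map to distinct slots, a lookup is a single array access with no collision handling; the price is a table whose size grows as 20^k.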

2016-12-11.v1

IDBA-MT: an assembler for metatranscriptomic data

Title:

IDBA-MT: De Novo Assembler for Metatranscriptomic Data Generated from Next-Generation Sequencing Technology

Abstract:

High-throughput next-generation sequencing technology provides a great opportunity for analyzing metatranscriptomic data. However, the reads produced by these technologies are short and an assembling step is required to combine the short reads into longer contigs. As there are many repeat patterns in mRNAs from different genomes and the abundance ratio of mRNAs in a sample varies a lot, existing assemblers for genomic data, transcriptomic data, and metagenomic data do not work on metatranscriptomic data and produce chimeric contigs, that is, incorrect contigs formed by merging multiple mRNA sequences. To our best knowledge, there is no assembler designed for metatranscriptomic data. In this article, we introduce an assembler called IDBA-MT, which is designed for assembling reads from metatranscriptomic data. IDBA-MT produces much fewer chimeric contigs (reduce by 50% or more) when compared with existing assemblers such as Oases, IDBA-UD, and Trinity.

Link:

http://online.liebertpub.com/doi/abs/10.1089/cmb.2013.0042

Source code:

https://code.google.com/archive/p/hku-idba-mt/source/default/source

Installation:

git clone  https://github.com/jameslz/idba_mt-and-idba_mtp
#edit: idba_mt/idba_mtp libheader.h, add  #include <stdint.h>
make

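Overview:

There are not many assemblers for metatranscriptomes. People generally use established transcriptome assemblers such as Trinity and Oases (https://github.com/dzerbino/oases); see benchmarking papers such as "Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation". Some use DNA assemblers directly, such as IDBA-UD and MetaVelvet. The biggest problem in metatranscriptome assembly is chimeric contigs, so assemblers designed for metatranscriptomes all try to address it. Some require paired-end reads (IDBA-MT), and others need protein sequences as a guide (IDBA-MTP).

The IDBA-MT package is hosted on Google Code (https://code.google.com/archive/p/hku-idba-mt/source/default/source). It has been imported to GitHub at https://github.com/jameslz/idba_mt-and-idba_mtp for easier download. IDBA-MT requires an assembly from IDBA-UD first; IDBA-MT is then used to correct chimeric contigs.

Version: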

2016-11-24.v1

IDBA-UD: an assembler for single-cell and metagenomic sequencing data

Title:

IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth

Abstract:

Motivation: Next-generation sequencing allows us to sequence reads from a microbial environment using single-cell sequencing or metagenomic sequencing technologies. However, both technologies suffer from the problem that sequencing depth of different regions of a genome or genomes from different species are highly uneven. Most existing genome assemblers usually have an assumption that sequencing depths are even. These assemblers fail to construct correct long contigs.

Results: We introduce the IDBA-UD algorithm that is based on the de Bruijn graph approach for assembling reads from single-cell sequencing or metagenomic sequencing technologies with uneven sequencing depths. Several non-trivial techniques have been employed to tackle the problems. Instead of using a simple threshold, we use multiple depth-relative thresholds to remove erroneous k-mers in both low-depth and high-depth regions. The technique of local assembly with paired-end information is used to solve the branch problem of low-depth short repeat regions. To speed up the process, an error correction step is conducted to correct reads of high-depth regions that can be aligned to high-confident contigs. Comparison of the performances of IDBA-UD and existing assemblers (Velvet, Velvet-SC, SOAPdenovo and Meta-IDBA) for different datasets, shows that IDBA-UD can reconstruct longer contigs with higher accuracy.

Availability: The IDBA-UD toolkit is available at our website http://www.cs.hku.hk/~alse/idba_ud

Contact: chin@cs.hku.hk

© The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Link:

http://bioinformatics.oxfordjournals.org/content/28/11/1420.abstract

Source code:

https://github.com/loneknightpy/idba

Installation:

axel https://github.com/loneknightpy/idba/archive/1.1.3.tar.gz
tar xzvf idba-1.1.3.tar.gz
cd idba-1.1.3
./build.sh

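Overview:

IDBA-UD is a de novo assembler for single-cell genomes and metagenomes. By default, short reads are limited to 128 bases, while current mainstream high-throughput platforms (HiSeq X Ten, HiSeq 3000/4000) produce 150 bp reads, so the source code must be modified to assemble short reads longer than 128 bp. The change is described at https://groups.google.com/forum/#!topic/hku-idba/GL-1VZnhLI0; in case that page is unavailable, the post is reproduced below: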

Hi,

I’ve started a new thread for this in case anyone wants to do the same thing. I wanted idba_ud to run with larger kmers. It seems to work pretty well.

I changed /idba-1.1.2/src/sequence/short_sequence.h to longer kMaxShortSequence to:

static const uint32_t kMaxShortSequence = 500;

and changed the max kmer size in idba-1.1.2/src/basic/kmer.h by changing the number of bits to:

static const uint32_t kNumUint64 = 16;

Then recompiled:

./configure
make

Now IDBA is working with my 300 bp paired end illumina data with kmers of 100, 200 and 300. Assembly looks much better so far than it was with kmers limited to 124 bp. I don’t really understand why assemblers limit kmer size, but I’m not a mathematician. The highest kmer size always seems to give the ‘best’ assembly. If you run into a memory problem it might be worth limiting the number of threads used, although I haven’t tested this, when I use fewer than maximum, it works.

T.
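Before patching and recompiling, it can be worth checking whether a dataset actually exceeds the 128-base limit mentioned above. The Python sketch below scans reads and reports the longest one; it assumes plain four-line FASTQ records, and the function names are invented for this example.

```python
def max_read_length(fastq_lines):
    """Return the longest sequence length, assuming four-line FASTQ records."""
    longest = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # second line of each record holds the bases
            longest = max(longest, len(line.strip()))
    return longest


def needs_patch(fastq_lines, limit=128):
    """True if any read is longer than the default short-read limit."""
    return max_read_length(fastq_lines) > limit
```

A typical use would pass an open file handle, for example `needs_patch(open("reads_1.fq"))`, where the file name is a placeholder.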

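Several members of the IDBA family ship in this package, including idba, idba-trans, and idba_hybrid. For metatranscriptome assembly, idba-mt and idba-mtp are available; they will be covered in a separate post.

Besides IDBA-UD, metagenome assemblers include MetaVelvet, Omega, and others. Another competitor is MEGAHIT, which is widely used for metagenome assembly.

Version: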

2016-11-24.v1

SortMeRNA: fast and accurate filtering of rRNA sequences from NGS read sets

Title:

SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data

Abstract:

Motivation: The application of next-generation sequencing (NGS) technologies to RNAs directly extracted from a community of organisms yields a mixture of fragments characterizing both coding and non-coding types of RNAs. The task to distinguish among these and to further categorize the families of messenger RNAs and ribosomal RNAs (rRNAs) is an important step for examining gene expression patterns of an interactive environment and the phylogenetic classification of the constituting species.

Results: We present SortMeRNA, a new software designed to rapidly filter rRNA fragments from metatranscriptomic data. It is capable of handling large sets of reads and sorting out all fragments matching to the rRNA database with high sensitivity and low running time.

Availability: http://bioinfo.lifl.fr/RNA/sortmerna

Contact: evguenia.kopylova@lifl.fr

Link:

https://academic.oup.com/bioinformatics/article-abstract/28/3/433/189113/Identification-and-removal-of-ribosomal-RNA

Source code:

https://github.com/biocore/sortmerna

Installation:

axel https://github.com/biocore/sortmerna/archive/2.1.tar.gz
tar xzvf sortmerna-2.1.tar.gz
cd sortmerna-2.1
./configure  --prefix=$PWD
make -j 20

./indexdb_rna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db:\
    ./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-db:\
    ./rRNA_databases/silva-arc-16s-id95.fasta,./index/silva-arc-16s-db:\
    ./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-db:\
    ./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-db:\
    ./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s-db:\
    ./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-db:\
    ./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s-db

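Overview:

When running RNA-seq on an uncharacterized species (if a reference sequence exists, you can simply index it yourself), the rRNA content of the data needs to be measured to assess the quality of the experiment.

In a metatranscriptome study the goal is to understand the functional structure of an environmental sample, and the 16S rDNA sequences in the reads can also be used to survey the taxa present. Keep in mind, however, that the first step of a metatranscriptome experiment is to deplete rRNA at the bench (for example with a kit such as the Ribo-Zero rRNA Removal Kit, http://www.illumina.com/products/by-type/molecular-biology-reagents/ribo-zero-rrna-removal-human-mouse-rat.html), so the resulting 16S rDNA profile is incomplete and should be treated only as a reference.

Even in metagenome analysis, 16S rDNA sequences can be used to survey community composition.

All of these tasks come down to identifying the rRNA sequences in a read set. There are many tools for this, such as riboPicker and KneadData, and aligning directly against rDNA sequence databases also works: any similarity search engine such as bwa, bowtie2, blat, usearch, or blast will do. SortMeRNA does well on speed, sensitivity, and ease of use, and the databases bundled with the software are comprehensive, including: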

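1. silva-bac-16s
2. silva-bac-23s
3. silva-arc-16s
4. silva-arc-23s
5. silva-euk-18s
6. silva-euk-28s
7. rfam-5s
8. rfam-5.8s

These 8 databases cover the 16S, 23S, 18S, 28S, 5S, and 5.8S rRNA types. The output reports the proportion of reads matching each database and separates rRNA from non-rRNA sequences, and alignments are available in several formats, including SAM and BLAST.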

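SortMeRNA also provides a few features outside its core strengths, such as OTU picking. Note that paired-end input must be supplied as interleaved reads; seqtk or seqtk_utils can handle the conversion.

Version: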
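For readers who want to see what interleaving involves, here is a small Python sketch that merges two paired FASTQ streams record by record. It assumes four-line FASTQ records with mates in the same order in both files; it is an illustration, not a replacement for seqtk, and the function names are made up.

```python
from itertools import islice


def fastq_records(handle):
    """Yield four-line FASTQ records from an open text handle."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(r1_handle, r2_handle, out_handle):
    """Write mate 1 then mate 2 for each pair; return the number of pairs written."""
    pairs = 0
    # zip stops at the shorter input, so unequal files silently drop the extra reads
    for rec1, rec2 in zip(fastq_records(r1_handle), fastq_records(r2_handle)):
        out_handle.writelines(rec1)
        out_handle.writelines(rec2)
        pairs += 1
    return pairs
```

In practice the three handles would be files opened with `open(...)`; the returned pair count is a quick sanity check against the expected library size.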

2016-11-22.v1

wgsim: a general-purpose read simulator

Title:

wgsim: Reads simulator

Abstract:

Wgsim is a small tool for simulating sequence reads from a reference genome. It is able to simulate diploid genomes with SNPs and insertion/deletion (INDEL) polymorphisms, and simulate reads with uniform substitution sequencing errors. It does not generate INDEL sequencing errors, but this can be partly compensated by simulating INDEL polymorphisms. Wgsim outputs the simulated polymorphisms, and writes the true read coordinates as well as the number of polymorphisms and sequencing errors in read names. One can evaluate the accuracy of a mapper or a SNP caller with wgsim_eval.pl that comes with the package.

Link:

https://github.com/lh3/wgsim

Source code:

https://github.com/lh3/wgsim

Installation:

git clone  https://github.com/lh3/wgsim
cd wgsim
gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm

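Overview:

wgsim is a general-purpose read simulator that requires a reference genome. Its command-line options control:

1. the outer distance between the two ends of a pair (equivalent to the insert size);
2. read length;
3. error rate;
4. number of reads;
5. mutation rate;
6. fraction of indels

and more. It is well suited to simulating metagenome datasets, although many reference genomes must be supplied.

To simulate metagenome data, download the microbial genomes of interest from RefSeq, typically bacterial, archaeal, and viral sequences in chosen proportions, merge them, and run the simulation. The paper "A comparative study of metagenomics analysis pipelines at the species level" is a useful reference; its mixture was hg19 + 315 microbial genomes (292 237 species: 74 bacteriophages, 69 viruses and 149 bacteria). The counts are given per species, so different strains of the same species were presumably removed, yet real metagenome data certainly contains many strains of the same species. This is also what makes metagenome analysis hard: chimeric assemblies.

NCBI's new assembly database no longer offers bundled downloads, so fetching RefSeq is not as simple as it used to be. A convenient program for downloading RefSeq is Kraken_db_install_scripts.

Version:

2016-11-21.v1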
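An alternative to merging the genomes first is to run wgsim once per genome with a read budget proportional to its intended abundance. The Python sketch below only builds the command lines; the abundance weights, file names, and output naming are hypothetical, while `-N`, `-1`, and `-2` are wgsim's options for the number of read pairs and the two read lengths.

```python
def wgsim_commands(abundance, total_pairs, read_len=150):
    """Build one wgsim command per reference FASTA, splitting pairs by abundance.

    `abundance` maps a FASTA path to a relative weight (hypothetical values).
    """
    total_weight = sum(abundance.values())
    commands = []
    for fasta, weight in sorted(abundance.items()):
        pairs = round(total_pairs * weight / total_weight)
        prefix = fasta.rsplit(".", 1)[0]
        commands.append([
            "wgsim",
            "-N", str(pairs),      # number of read pairs for this genome
            "-1", str(read_len),   # length of the first read
            "-2", str(read_len),   # length of the second read
            fasta, prefix + "_1.fq", prefix + "_2.fq",
        ])
    return commands
```

Each command list can be passed to `subprocess.run`, and the per-genome FASTQ files concatenated afterwards to form the mock community.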

Genix: an online automated annotation pipeline for bacterial genomes

Title:

Genix: A New Online Automated Pipeline for Bacterial Genome Annotation

Abstract:

Next-Generation Sequencing (NGS) has significantly reduced the cost of genome sequencing projects, resulting in an expressive increase in the availability of genomic data in public databases. The cheaper and easier is to sequence new genomes, the more accurate the annotation steps have to be to avoid both the loss of information and the accumulation of erroneous features that may affect the accuracy of further analysis. In the case of bacteria genomes, a range of web annotation software has been developed; however, many applications have yet to incorporate the steps required to improve their result, including the removal of false-positive/spurious and a more complete identification of non-coding features. We present Genix, a new web-based bacterial genome annotation pipeline. A comparison of the results generated by Genix for four reference genomes against those generated by other annotation tools indicated that our pipeline is able to provide results that are closer to the reference genome annotation, with a smaller amount of false-positive proteins and missing functional annotated proteins. Additionally, the metrics obtained by Genix were slightly better than those obtained by Prokka, a state-of-art standalone annotation system. Our results indicate that Genix is a useful tool that is able to provide a more refined result, and may be a user-friendly way to obtain high quality results.

Link:

http://femsle.oxfordjournals.org/content/early/2016/11/16/femsle.fnw263

Source code:

https://github.com/fredericokremer/genix

Installation:

http://labbioinfo.ufpel.edu.br/genix/

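Overview:

Many automated bacterial genome annotation pipelines already exist, including RAST, BASys, and Prokka. Still, anything with a novel contribution can be published, and that gives us more options.

Genix offers both a web version and source code (P.S. web applications that withhold their source code are not playing fair). It is free but requires a registered account. It is implemented with Apache, MySQL, SQLite, Perl, Python, and Bash. Given a sequence and its taxonomic information, the Genix pipeline readily produces a genome annotation, and if some additional submission details are supplied it can even generate a GenBank file ready for submission. As with any online service, server load means jobs are queued and processed one at a time.

Genix mainly relies on the following tools:

  1. Protein-coding gene prediction: Prodigal
  2. tRNA gene prediction: tRNAscan-SE
  3. rRNA gene prediction: RNAmmer
  4. tmRNA gene prediction: Aragorn
  5. ncRNA prediction: blastn + infernal + Rfam
  6. Visualization: JBrowse genome browser.

The choice of sequence database matters for every application, because sequence alignment is computationally expensive. Faster aligners such as DIAMOND and USEARCH help, but shrinking the database also brings a large performance gain. The most typical approach is to partition by taxonomy, for example into Viruses, Archaea, and Bacteria, and users can of course supply their own sequence set. eggNOG has done the same since version 4.0; after all, annotating a bacterial genome does not require the whole UniProt or eggNOG database, only the bacterial part. Genix automatically downloads UniProt sequences according to the supplied taxonomy information (an NCBI Taxonomy identifier) and filters redundancy with CD-HIT to reduce the size of the reference database. If the species is unknown, set the NCBI Taxonomy identifier for Bacteria.

Compared with Prokka, what sets Genix apart is its refinement of gene models. Protein-coding genes in prokaryotes are much simpler than in eukaryotes, and Prodigal is the tool of choice among gene predictors, but it still produces some erroneous gene models. Genix filters the predicted models with AntiFam, then validates the remaining CDS sequences and corrects start codons.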
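The kind of CDS validation described here can be illustrated with a few basic checks in Python. This is a hedged sketch of generic sanity checks, not Genix's actual validation code; it assumes the common bacterial start codons ATG, GTG, and TTG and the standard stop codons, and the function name is hypothetical.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # common bacterial start codons (assumed set)
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(seq):
    """Return a list of problems found in a predicted CDS (empty list means it passes)."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3")
    # split into complete codons, ignoring any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons or codons[0] not in START_CODONS:
        problems.append("no start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems
```

A model that fails the start codon check is a candidate for start correction rather than outright removal, which is the distinction the paragraph above draws between filtering and correcting.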

**Genix workflow diagram**

Version:

2016-11-19.v1

Scalpel: an indel caller supporting single samples, family samples, and matched normal/tumor pairs

Title:

Indel variant analysis of short-read sequencing data with Scalpel.

Abstract:

As the second most common type of variation in the human genome, insertions and deletions (indels) have been linked to many diseases, but the discovery of indels of more than a few bases in size from short-read sequencing data remains challenging. Scalpel (http://scalpel.sourceforge.net) is an open-source software for reliable indel detection based on the microassembly technique. It has been successfully used to discover mutations in novel candidate genes for autism, and it is extensively used in other large-scale studies of human diseases. This protocol gives an overview of the algorithm and describes how to use Scalpel to perform highly accurate indel calling from whole-genome and whole-exome sequencing data. We provide detailed instructions for an exemplary family-based de novo study, but we also characterize the other two supported modes of operation: single-sample and somatic analysis. Indel normalization, visualization and annotation of the mutations are also illustrated. Using a standard server, indel discovery and characterization in the exonic regions of the example sequencing data can be completed in ~5 h after read mapping.

Link:

http://www.nature.com/nprot/journal/v11/n12/full/nprot.2016.150.html

Source code:

http://scalpel.sourceforge.net

Installation:

axel  https://sourceforge.net/projects/scalpel/files/scalpel-0.5.3.tar.gz/download
tar xzvf  scalpel-0.5.3.tar.gz
cd scalpel-0.5.3
make

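Overview:

Nature Protocols has published protocols for many excellent bioinformatics tools, such as the Trinity RNA-seq assembly platform and the HISAT, StringTie and Ballgown RNA-seq analysis suite. This post covers Scalpel, an indel caller whose protocol has just appeared in Nature Protocols.

Before indel calling, preprocessing can largely follow the Broad GATK best practices or the SpeedSeq best practices. GATK best practices: BWA-MEM + SAMtools + Picard. SpeedSeq best practices: BWA-MEM + Sambamba + SAMBLASTER. Usually we would call indels directly with GATK or FreeBayes.

Besides calling indels in a single sample, Scalpel offers a family mode that identifies inherited and de novo indels, and it can also call somatic indels between matched normal and tumor samples. It can serve as a complement to a GATK or FreeBayes workflow.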
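The distinction between inherited and de novo indels in a trio reduces to set logic once the calls are made. The Python sketch below shows that logic only; it is not how Scalpel works internally (Scalpel uses microassembly and also considers read support in the parents), and the tuple layout and function name are assumptions for the example.

```python
def classify_child_indels(child, father, mother):
    """Split a child's indel calls into inherited and de novo sets.

    Each call is a hashable key such as (chrom, pos, ref, alt).
    """
    parental = father | mother
    return {
        "inherited": child & parental,  # also seen in at least one parent
        "de_novo": child - parental,    # seen in the child only
    }
```

The same pattern with a tumor set and a matched normal set gives candidate somatic indels as `tumor - normal`.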

[Figure: main steps of the Scalpel protocol]

Version:

2016-11-18.v1