IDBA-UD: a sequence assembler for single-cell and metagenomic data

Title:

IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth

Abstract:

Motivation: Next-generation sequencing allows us to sequence reads from a microbial environment using single-cell sequencing or metagenomic sequencing technologies. However, both technologies suffer from the problem that sequencing depth of different regions of a genome or genomes from different species are highly uneven. Most existing genome assemblers usually have an assumption that sequencing depths are even. These assemblers fail to construct correct long contigs.

Results: We introduce the IDBA-UD algorithm that is based on the de Bruijn graph approach for assembling reads from single-cell sequencing or metagenomic sequencing technologies with uneven sequencing depths. Several non-trivial techniques have been employed to tackle the problems. Instead of using a simple threshold, we use multiple depth-relative thresholds to remove erroneous k-mers in both low-depth and high-depth regions. The technique of local assembly with paired-end information is used to solve the branch problem of low-depth short repeat regions. To speed up the process, an error correction step is conducted to correct reads of high-depth regions that can be aligned to high-confident contigs. Comparison of the performances of IDBA-UD and existing assemblers (Velvet, Velvet-SC, SOAPdenovo and Meta-IDBA) for different datasets, shows that IDBA-UD can reconstruct longer contigs with higher accuracy.

Availability: The IDBA-UD toolkit is available at our website http://www.cs.hku.hk/~alse/idba_ud

Contact: chin@cs.hku.hk

© The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Link:

http://bioinformatics.oxfordjournals.org/content/28/11/1420.abstract

Source code:

https://github.com/loneknightpy/idba

Installation:

axel -o idba-1.1.3.tar.gz https://github.com/loneknightpy/idba/archive/1.1.3.tar.gz
tar xzvf idba-1.1.3.tar.gz
cd idba-1.1.3
./build.sh

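Overview:

IDBA-UD is a de novo assembler for single-cell genomes and metagenomes. With the default build, short reads are limited to 128 bases, whereas the current mainstream high-throughput platforms (HiSeq X Ten, HiSeq 3000/4000) produce 150 bp reads, so the source code has to be modified before reads longer than 128 bp can be assembled. The change is described at https://groups.google.com/forum/#!topic/hku-idba/GL-1VZnhLI0. In case that page is not accessible, the forum post is reproduced below: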

Hi,

I’ve started a new thread for this in case anyone wants to do the same thing. I wanted idba_ud to run with larger kmers. It seems to work pretty well.

I changed /idba-1.1.2/src/sequence/short_sequence.h to longer kMaxShortSequence to:

static const uint32_t kMaxShortSequence = 500;

and changed the max kmer size in idba-1.1.2/src/basic/kmer.h by changing the number of bits to:

static const uint32_t kNumUint64 = 16;

Then recompiled:

./configure
make

Now IDBA is working with my 300 bp paired end illumina data with kmers of 100, 200 and 300. Assembly looks much better so far than it was with kmers limited to 124 bp. I don’t really understand why assemblers limit kmer size, but I’m not a mathematician. The highest kmer size always seems to give the ‘best’ assembly. If you run into a memory problem it might be worth limiting the number of threads used, although I haven’t tested this, when I use fewer than maximum, it works.

T.
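
As a rough sanity check on the numbers in the quoted post, the arithmetic can be sketched as follows. This assumes the common packing of 2 bits per base into 64-bit words, which the post does not state, and the function name is made up for illustration. The result is only an upper bound: the post reports a practical limit of 124 before the change, so IDBA presumably reserves a few bits for other purposes.

```python
BITS_PER_WORD = 64
BITS_PER_BASE = 2  # assumption: A/C/G/T packed into 2 bits each


def kmer_capacity_upper_bound(num_uint64: int) -> int:
    """Upper bound on k for a k-mer stored in `num_uint64` 64-bit words."""
    return num_uint64 * BITS_PER_WORD // BITS_PER_BASE


# 16 is the value from the post; 4 is shown only for comparison.
for words in (4, 16):
    print(f"kNumUint64 = {words:2d} -> k can be at most {kmer_capacity_upper_bound(words)}")
```

With 16 words the bound is comfortably above the k = 300 used in the post, and kMaxShortSequence = 500 covers 300 bp reads.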

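Several members of the IDBA family ship in the same package, for example idba, idba-trans and idba_hybrid. For metatranscriptome assembly, idba-mt and idba-mtp are also available; they will be covered in a separate post.

Besides IDBA-UD, metagenome assemblers include MetaVelvet and Omega. Another competitor, MEGAHIT, is widely used for metagenome assembly.

Version: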

2016-11-24.v1

SortMeRNA: fast and accurate filtering of rRNA sequences from NGS read sets

Title:

SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data

Abstract:

Motivation: The application of next-generation sequencing (NGS) technologies to RNAs directly extracted from a community of organisms yields a mixture of fragments characterizing both coding and non-coding types of RNAs. The task to distinguish among these and to further categorize the families of messenger RNAs and ribosomal RNAs (rRNAs) is an important step for examining gene expression patterns of an interactive environment and the phylogenetic classification of the constituting species.

Results: We present SortMeRNA, a new software designed to rapidly filter rRNA fragments from metatranscriptomic data. It is capable of handling large sets of reads and sorting out all fragments matching to the rRNA database with high sensitivity and low running time.

Availability: http://bioinfo.lifl.fr/RNA/sortmerna

Contact: evguenia.kopylova@lifl.fr

Link:

https://academic.oup.com/bioinformatics/article-abstract/28/3/433/189113/Identification-and-removal-of-ribosomal-RNA

Source code:

https://github.com/biocore/sortmerna

Installation:

axel -o sortmerna-2.1.tar.gz https://github.com/biocore/sortmerna/archive/2.1.tar.gz
tar xzvf sortmerna-2.1.tar.gz
cd sortmerna-2.1
./configure --prefix=$PWD
make -j 20

./indexdb_rna --ref ./rRNA_databases/silva-bac-16s-id90.fasta,./index/silva-bac-16s-db:\
    ./rRNA_databases/silva-bac-23s-id98.fasta,./index/silva-bac-23s-db:\
    ./rRNA_databases/silva-arc-16s-id95.fasta,./index/silva-arc-16s-db:\
    ./rRNA_databases/silva-arc-23s-id98.fasta,./index/silva-arc-23s-db:\
    ./rRNA_databases/silva-euk-18s-id95.fasta,./index/silva-euk-18s-db:\
    ./rRNA_databases/silva-euk-28s-id98.fasta,./index/silva-euk-28s-db:\
    ./rRNA_databases/rfam-5s-database-id98.fasta,./index/rfam-5s-db:\
    ./rRNA_databases/rfam-5.8s-database-id98.fasta,./index/rfam-5.8s-db

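Overview:

For RNA-seq of an unknown species (if a reference sequence is available, you can simply build your own index from it), the rRNA content of the data has to be measured in order to assess the quality of the experiment.

In metatranscriptomics the goal is to understand the functional structure of an environmental sample, and species information can also be surveyed through the 16S rRNA sequences in the metatranscriptome. Keep in mind, however, that the first step of a metatranscriptome experiment is to deplete rRNA at the bench (for example with a kit such as the Ribo-Zero rRNA Removal Kit, http://www.illumina.com/products/by-type/molecular-biology-reagents/ribo-zero-rrna-removal-human-mouse-rat.html), so the 16S profile is incomplete and should only be used as a reference.

Even in metagenome analysis, species composition can be surveyed through 16S rDNA sequences.

At the core of all these problems is identifying the rRNA sequences in a read set. There are many tools for this, such as riboPicker and KneadData. Aligning reads directly against an rDNA sequence database also works, with any similarity search engine such as bwa, bowtie2, blat, usearch or blast. SortMeRNA does well on speed, sensitivity and ease of use, and the databases bundled with it are fairly rich, including: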

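1. silva-bac-16s
2. silva-bac-23s
3. silva-arc-16s
4. silva-arc-23s
5. silva-euk-18s
6. silva-euk-28s
7. rfam-5s
8. rfam-5.8s

These eight databases (the same ones indexed by the indexdb_rna command above) cover rRNA types such as 23S, 16S, 5S and 5.8S. The output reports the percentage of reads matching each database, separates the rRNA reads from the non-rRNA reads, and can be written in several formats such as SAM and BLAST.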

SortMeRNA also offers a few features that are not its strong suit, such as OTU picking. Note that paired-end input has to be in interleaved format; seqtk or seqtk_utils can produce it.
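
To make the interleaved format concrete, here is a minimal Python sketch of what interleaving does. It is not how seqtk is implemented; it assumes plain, unwrapped four-line FASTQ records and that both files list mates in the same order. In practice a dedicated tool such as `seqtk mergepe` is the more robust choice.

```python
import io
from itertools import islice
from typing import Iterator, List, TextIO


def fastq_records(handle: TextIO) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of 4 lines (assumes unwrapped records)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(r1: TextIO, r2: TextIO, out: TextIO) -> int:
    """Write R1 and R2 records alternately to `out`; return the number of pairs."""
    pairs = 0
    mates = fastq_records(r2)
    for rec1 in fastq_records(r1):
        rec2 = next(mates, None)
        if rec2 is None:
            raise ValueError("R2 has fewer records than R1")
        # Normalise line endings so a missing final newline cannot fuse records.
        out.writelines(line.rstrip("\n") + "\n" for line in rec1 + rec2)
        pairs += 1
    if next(mates, None) is not None:
        raise ValueError("R2 has more records than R1")
    return pairs


# Tiny in-memory demonstration with made-up reads.
demo_r1 = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nTTTT\n+\nIIII\n")
demo_r2 = io.StringIO("@a/2\nCCCC\n+\nIIII\n@b/2\nGGGG\n+\nIIII\n")
demo_out = io.StringIO()
print(interleave(demo_r1, demo_r2, demo_out), "pairs written")
```

For real files, pass handles opened with `open(path)` instead of the `io.StringIO` objects.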

Version:

2016-11-22.v1

wgsim: a general-purpose read simulator

Title:

wgsim: Reads simulator

Abstract:

Wgsim is a small tool for simulating sequence reads from a reference genome. It is able to simulate diploid genomes with SNPs and insertion/deletion (INDEL) polymorphisms, and simulate reads with uniform substitution sequencing errors. It does not generate INDEL sequencing errors, but this can be partly compensated by simulating INDEL polymorphisms. Wgsim outputs the simulated polymorphisms, and writes the true read coordinates as well as the number of polymorphisms and sequencing errors in read names. One can evaluate the accuracy of a mapper or a SNP caller with wgsim_eval.pl that comes with the package.

Link:

https://github.com/lh3/wgsim

Source code:

https://github.com/lh3/wgsim

Installation:

git clone  https://github.com/lh3/wgsim
cd wgsim
gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm

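Overview:

wgsim is a fairly general-purpose read simulator that requires a reference genome. Its command-line options show what can be controlled:

1. outer distance between the two ends of a pair (equivalent to the insert size);
2. read length;
3. error rate;
4. number of reads;
5. mutation rate;
6. fraction of indels,

and so on. It is well suited to simulating metagenome datasets, although that requires supplying many reference genomes.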
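
The six controls listed above map onto wgsim's command-line flags roughly as in the sketch below. The helper function, the chosen values and the file names are illustrative assumptions; check `wgsim` without arguments for the authoritative option list of your build.

```python
import shlex
from typing import List


def wgsim_command(ref: str, out1: str, out2: str, *,
                  outer_distance: int = 500,
                  read_length: int = 150,
                  error_rate: float = 0.02,
                  n_pairs: int = 1_000_000,
                  mutation_rate: float = 0.001,
                  indel_fraction: float = 0.15) -> List[str]:
    """Build a wgsim argument list from the six parameters discussed above."""
    return [
        "wgsim",
        "-d", str(outer_distance),   # outer distance between the two ends
        "-1", str(read_length),      # length of the first read
        "-2", str(read_length),      # length of the second read
        "-e", str(error_rate),       # base error rate
        "-N", str(n_pairs),          # number of read pairs
        "-r", str(mutation_rate),    # mutation rate
        "-R", str(indel_fraction),   # fraction of mutations that are indels
        ref, out1, out2,
    ]


# Hypothetical file names; print the command rather than running it.
cmd = wgsim_command("genomes.fa", "sim_R1.fq", "sim_R2.fq", n_pairs=200_000)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once wgsim is on the PATH.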

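To simulate metagenome data, you can download the microbial genomes you want to simulate directly from RefSeq, typically bacterial, archaeal and viral sequences in chosen proportions, merge them, and run the simulation. The paper "A comparative study of metagenomics analysis pipelines at the species level" is a useful reference; its mixture was hg19 + 315 microbial genomes (292 237 species: 74 bacteriophages, 69 viruses and 149 bacteria). That study works at the species level, so different strains of the same species were presumably removed. Real metagenome data, however, will certainly contain many strains of the same species, and this is exactly what makes metagenome analysis hard: chimeric assemblies.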
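
When genomes are mixed in chosen proportions, one simple way to set the proportions is to pick a target depth per genome and derive the number of read pairs from it. The sketch below does that arithmetic; the community, genome lengths and depths are invented for illustration and do not come from the paper cited above.

```python
def pairs_for_depth(genome_length: int, depth: float, read_length: int) -> int:
    """Read pairs needed so that 2 * read_length * pairs is about depth * genome_length."""
    return round(depth * genome_length / (2 * read_length))


READ_LENGTH = 150

# Hypothetical community: name -> (genome length in bp, target depth).
community = {
    "bacterium_A": (5_000_000, 20.0),
    "archaeon_B": (2_000_000, 5.0),
    "phage_C": (50_000, 100.0),
}

plan = {
    name: pairs_for_depth(length, depth, READ_LENGTH)
    for name, (length, depth) in community.items()
}
for name, pairs in plan.items():
    print(f"{name}: simulate {pairs} read pairs")
```

Each genome can then be simulated separately with its own pair count and the resulting FASTQ files concatenated, which keeps the relative abundances explicit.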

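Because NCBI's new assembly database does not offer packaged bulk downloads, fetching RefSeq is no longer as simple as it used to be. A convenient tool for downloading RefSeq is Kraken_db_install_scripts.

Version:

2016-11-21.v1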

Genix: an online automated annotation pipeline for bacterial genomes

Title:

Genix: A New Online Automated Pipeline for Bacterial Genome Annotation

Abstract:

Next-Generation Sequencing (NGS) has significantly reduced the cost of genome sequencing projects, resulting in an expressive increase in the availability of genomic data in public databases. The cheaper and easier is to sequence new genomes, the more accurate the annotation steps have to be to avoid both the loss of information and the accumulation of erroneous features that may affect the accuracy of further analysis. In the case of bacteria genomes, a range of web annotation software has been developed; however, many applications have yet to incorporate the steps required to improve their result, including the removal of false-positive/spurious and a more complete identification of non-coding features. We present Genix, a new web-based bacterial genome annotation pipeline. A comparison of the results generated by Genix for four reference genomes against those generated by other annotation tools indicated that our pipeline is able to provide results that are closer to the reference genome annotation, with a smaller amount of false-positive proteins and missing functional annotated proteins. Additionally, the metrics obtained by Genix were slightly better than those obtained by Prokka, a state-of-art standalone annotation system. Our results indicate that Genix is a useful tool that is able to provide a more refined result, and may be a user-friendly way to obtain high quality results.

Link:

http://femsle.oxfordjournals.org/content/early/2016/11/16/femsle.fnw263

Source code:

https://github.com/fredericokremer/genix

Installation:

http://labbioinfo.ufpel.edu.br/genix/

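Overview:

Many automated bacterial genome annotation pipelines already exist, including RAST, BASys and Prokka. Either way, anything with some novelty can be published, and it gives us more options.

Genix offers both a web version and source code (PS: a web application that does not release its source is just being a rogue). It is free but requires a registered account, and it is built with Apache, MySQL, SQLite, Perl, Python and Bash. Given the input sequences and the taxonomic classification of the organism, the Genix pipeline conveniently produces a genome annotation; if some additional database submission details are supplied, it can directly generate a submission-ready GenBank file. As for server load, like every online service, jobs are queued and processed one at a time.

Genix mainly relies on the following tools:

  1. Protein-coding gene prediction: Prodigal
  2. tRNA gene prediction: tRNAscan-SE
  3. rRNA gene prediction: RNAmmer
  4. tmRNA gene prediction: Aragorn
  5. ncRNA prediction: blastn + infernal + Rfam
  6. Data visualization: the JBrowse genome browser.

The choice of sequence database is a concern for every application, because sequence alignment is computationally expensive. Accelerated aligners such as Diamond and Usearch help, but shrinking the database also brings a substantial performance gain. The most typical approach is to partition by taxonomy, for example Viruses/Archaea/Bacteria, though you can of course supply your own sequence set. eggNOG has done the same since version 4.0; after all, annotating a bacterial genome does not need the whole UniProt/eggNOG database, only the bacterial portion. Genix automatically downloads UniProt sequences according to the supplied taxonomy (an NCBI Taxonomy identifier) and uses CD-HIT to remove redundant sequences, which reduces the size of the reference database. If the species is unknown, set the NCBI Taxonomy identifier for Bacteria.

What distinguishes Genix from Prokka is its refinement of the predicted gene models. Protein-coding genes in prokaryotes are much simpler than in eukaryotes, and Prodigal is the first choice among prediction tools, yet it still produces some erroneous gene models. Genix filters the predicted models with AntiFam, then validates the CDS sequences and corrects start codons.

**Genix workflow diagram**

Version: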

2016-11-19.v1

Scalpel: an indel detection tool supporting single samples, family samples, and matched normal/tumor pairs

Title:

Indel variant analysis of short-read sequencing data with Scalpel.

Abstract:

As the second most common type of variation in the human genome, insertions and deletions (indels) have been linked to many diseases, but the discovery of indels of more than a few bases in size from short-read sequencing data remains challenging. Scalpel (http://scalpel.sourceforge.net) is an open-source software for reliable indel detection based on the microassembly technique. It has been successfully used to discover mutations in novel candidate genes for autism, and it is extensively used in other large-scale studies of human diseases. This protocol gives an overview of the algorithm and describes how to use Scalpel to perform highly accurate indel calling from whole-genome and whole-exome sequencing data. We provide detailed instructions for an exemplary family-based de novo study, but we also characterize the other two supported modes of operation: single-sample and somatic analysis. Indel normalization, visualization and annotation of the mutations are also illustrated. Using a standard server, indel discovery and characterization in the exonic regions of the example sequencing data can be completed in ~5 h after read mapping.

Link:

http://www.nature.com/nprot/journal/v11/n12/full/nprot.2016.150.html

Source code:

http://scalpel.sourceforge.net

Installation:

axel -o scalpel-0.5.3.tar.gz https://sourceforge.net/projects/scalpel/files/scalpel-0.5.3.tar.gz/download
tar xzvf scalpel-0.5.3.tar.gz
cd scalpel-0.5.3
make

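Overview:

Nature Protocols has published protocols for many excellent bioinformatics tools, for example the Trinity RNA-seq assembly platform and the HISAT, StringTie and Ballgown RNA-seq analysis suite. This post covers Scalpel, an indel detection tool whose protocol has just appeared in Nature Protocols.

Before indel calling, preprocessing can generally follow either the Broad GATK best practices or the SpeedSeq best practices. GATK best practices: BWA-MEM + SAMtools + Picard. SpeedSeq best practices: BWA-MEM + Sambamba + SAMBLASTER. Normally we might call indels directly with GATK or FreeBayes.

Besides single-sample indel calling, Scalpel offers a family mode that identifies inherited and de novo indels, and it can also call somatic indels from matched normal and tumor samples. It can serve as an extension alongside a GATK or FreeBayes pipeline.

Main steps of the Scalpel protocol

Version: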

2016-11-18.v1

UNOISE2: microbial diversity analysis through error correction of Illumina reads

Title:

UNOISE2: Improved error-correction for Illumina 16S and ITS amplicon reads

Abstract:

Amplicon sequencing of tags such as 16S and ITS ribosomal RNA is a popular method for investigating microbial populations. In such experiments, sequence errors caused by PCR and sequencing are difficult to distinguish from true biological variation. I describe UNOISE2, an updated version of the UNOISE algorithm for denoising (error-correcting) Illumina amplicon reads and show that it has comparable or better accuracy than DADA2.

Link:

http://biorxiv.org/content/early/2016/10/15/081257

Software:

http://www.drive5.com/usearch/

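Overview:

I have introduced Usearch on Weibo many times. It has three main roles: (1) sequence similarity search; (2) processing of microbial diversity data, where it has gradually built up a small ecosystem; and (3) a Swiss army knife for sequence manipulation. It has plenty of competitors in all three: Diamond and RAPSearch for the first; vsearch close behind for the second, along with established tools such as QIIME and Mothur; and too many to list for the third, chiefly seqtk and seqkit.

This post, however, is about the newly released UNOISE2, which performs error correction (a crowded category as well). It removes reads with sequencing errors, chimeras, PhiX contamination and low-complexity sequences, after which an OTU table can be built directly. The UNOISE2 pipeline recommends starting from the rawest reads and running straight through: merge paired reads, filter, dereplicate, denoise, align, and build the OTU table. Optionally, a step that adjusts read orientation can be added; it needs a reference database such as the RDP or Silva database.

Usearch's licensing also has to be mentioned. The 32-bit version is free for everyone, in industry or academia, while the 64-bit version is paid, with academic pricing considerably lower than commercial pricing. The software has just reached version 9.0 and the sales model has changed from an annual subscription to licensing per major version, which is more user-friendly.

Key points from the official website: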

-. UNOISE algorithm

The UNOISE algorithm performs error-correction (denoising) on amplicon reads. It is implemented in the unoise command. UNOISE is designed for Illumina reads, not earlier technologies such as 454 pyrosequencing.

Correct biological sequences are recovered from the reads, resolving distinct sequences down to a single difference (sometimes) or two or more differences (almost always). I consider this approach superior to traditional OTU clustering at 97% identity because OTUs may merge different species (or more generally, different phenotypes) with distinct sequences while denoising gives the best possible resolution.

Errors are corrected as follows:

- Reads with sequencing error are identified and removed.
- Abundances are corrected (when the OTU table is generated).
- Chimeras are removed.
- PhiX sequences are removed.
- Low-complexity sequences due to Illumina artifacts are removed.

Using denoised sequences as OTUs has two possible drawbacks: a single species may be split into two OTUs due to different strains or paralogs, and the sensitivity is slightly lower because UPARSE can make robust OTUs from unique sequences with abundance as low as 2 while the minimum abundance for UNOISE is around 4. I consider splitting of strains to be a good thing, because they may have different phenotypes and hence different ecological roles. Splitting due to paralogs is relatively benign (what does it matter?), and is not solved by clustering at 97% identity because paralogs have identities <97% in some cases. Splitting or lumping is unavoidable regardless of whether the clustering identity is 97% or 100% so I would argue that it is better to resolve as many distinct biological sequences as possible. Sensitivity to unique sequences with abundance <8 (summed over all samples) is rarely important in practice.

Denoised sequences are valid OTUs (the clustering identity is 100%, if you like) and can be used to generate an OTU table in just the same way as 97% OTUs.

-. UNOISE pipeline

A UNOISE pipeline recovers biological sequences from an amplicon sequencing experiment by performing error-correction (denoising) of Illumina reads. UNOISE is not designed for other sequencing technologies, e.g. 454 pyrosequencing reads. The UNOISE algorithm is implemented in the unoise command.

See Tutorials for example scripts & data.

Reads in FASTQ format: I strongly recommend starting from “raw” reads, i.e. the reads originally provided by the sequencing machine base-calling software. You should do quality filtering with USEARCH rather than using reads that have already been filtered by third-party software.

Reads in FASTA format: The unoise command supports reads in FASTA format. You may need to do this if your reads have already been quality filtered by some other method and you don’t have access to the original FASTQ reads.

Sample pooling: I recommend combining reads from as many samples as possible. See sample pooling for discussion.

Read quality filtering: Quality filtering of the reads should be done using USEARCH because the maximum expected error filtering method is much more effective at suppressing reads with high error rates than other filters, e.g. those based on average Q scores. Using a maximum expected errors of 1.0 is a good default choice (-fastq_maxee 1.0 option to fastq_filter or -fastq_merge_maxee 1.0 option of fastq_mergepairs). You can use fastx_learn to estimate the error rate after filtering.
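
To see why expected-error filtering behaves differently from an average-Q filter, the quantity can be computed by hand. The sketch below is not USEARCH code; it applies the standard Phred definition (error probability 10^(-Q/10), summed over the read) and assumes Phred+33 encoded quality strings. The function names are made up for illustration.

```python
def expected_errors(quality: str, phred_offset: int = 33) -> float:
    """Sum of per-base error probabilities implied by a Phred quality string."""
    return sum(10 ** (-(ord(char) - phred_offset) / 10) for char in quality)


def mean_q(quality: str, phred_offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores, as used by simple average-Q filters."""
    return sum(ord(char) - phred_offset for char in quality) / len(quality)


def passes_maxee(quality: str, max_ee: float = 1.0) -> bool:
    """True if the read would be kept at the given maximum expected errors."""
    return expected_errors(quality) <= max_ee


# 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
good = "I" * 100
mixed = "I" * 95 + "#" * 5   # mostly excellent bases, five very poor ones

for name, qual in (("good", good), ("mixed", mixed)):
    print(name, f"mean Q = {mean_q(qual):.1f}",
          f"EE = {expected_errors(qual):.2f}", passes_maxee(qual))
```

The mixed read still has a high mean Q, yet its five poor bases alone contribute more than three expected errors, so it fails a 1.0 threshold.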

Global trimming: You should trim reads to a fixed length unless the sequences are contigs generated by a paired read assembler, in which case it may not be necessary. You should also trim any primer-binding sequences at the ends of the reads. See global trimming for discussion.

Unique sequences: Get the set of unique sequences with abundances using the fastx_uniques command with the -sizeout option. This will be the input file for the unoise command.

Creating an OTU table: Denoised sequences are valid OTUs (the clustering identity is 100%, if you like) and can be used to generate an OTU table in just the same way as 97% OTUs. Reads must have sample identifiers for this to work. The simplest way to do this is usually to use the -relabel @ option of fastq_filter or fastq_mergepairs.

Example commands: For typical Illumina reads with one pair of FASTQ files (R1 and R2) per sample.

usearch -fastq_mergepairs *_R1.fastq -relabel @ -fastqout reads.fq

usearch -fastq_filter reads.fq -fastq_maxee 1.0 -fastaout filtered.fa

usearch -fastx_uniques filtered.fa -fastaout uniques.fa -sizeout

usearch -unoise uniques.fa -tabbedout out.txt -fastaout denoised.fa

usearch -usearch_global reads.fq -db denoised.fa -strand plus -id 0.97 -otutabout otu_table.txt
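
The dereplication step in the commands above (fastx_uniques with -sizeout) is conceptually simple, and a small sketch may help when inspecting uniques.fa. This is an illustration of the idea, not USEARCH's implementation; the `Uniq` label prefix is an assumption, while the `;size=N;` annotation follows the abundance convention that -sizeout refers to.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def dereplicate(sequences: Iterable[str]) -> List[Tuple[str, str]]:
    """Collapse identical sequences and label each with its abundance.

    Returns (label, sequence) pairs sorted by decreasing abundance;
    ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter(seq.upper() for seq in sequences)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (f"Uniq{rank};size={count};", seq)
        for rank, (seq, count) in enumerate(ordered, start=1)
    ]


# Made-up reads: three copies of one sequence, two of another, one singleton.
reads = ["ACGT", "acgt", "TTTT", "ACGT", "GGGG", "TTTT"]
for label, seq in dereplicate(reads):
    print(f">{label}\n{seq}")
```

The abundances carried in the labels are what the denoising step uses to decide which unique sequences are likely to be real and which are error-derived.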

Version:

2016-11-17.v1

MG-RAST: the classic online metagenome analysis platform for taxonomic composition and functional profiling

Title:

The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes

Abstract:

Background Random community genomes (metagenomes) are now commonly used to study microbes in different environments. Over the past few years, the major challenge associated with metagenomics shifted from generating to analyzing sequences. High-throughput, low-cost next-generation sequencing has provided access to metagenomics to a wide range of researchers.

Results A high-throughput pipeline has been constructed to provide high-performance computing to all researchers interested in using metagenomics. The pipeline produces automated functional assignments of sequences in the metagenome by comparing both protein and nucleotide databases. Phylogenetic and functional summaries of the metagenomes are generated, and tools for comparative metagenomics are incorporated into the standard views. User access is controlled to ensure data privacy, but the collaborative environment underpinning the service provides a framework for sharing datasets between multiple users. In the metagenomics RAST, all users retain full control of their data, and everything is available for download in a variety of formats.

Conclusion The open-source metagenomics RAST service provides a new paradigm for the annotation and analysis of metagenomes. With built-in support for multiple data sources and a back end that houses abstract data types, the metagenomics RAST is stable, extensible, and freely available to all researchers. This service has removed one of the primary bottlenecks in metagenome sequence analysis – the availability of high-performance computing for annotating the data.

http://metagenomics.nmpdr.org

Link:

http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-386

Source code:

https://github.com/MG-RAST/MG-RAST

Overview:

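MG-RAST provides an online platform for metagenome and metatranscriptome data and accepts both raw reads and assembled contigs. Setting aside the quality of the analysis itself, the project identifier it assigns can conveniently be cited in a paper, which gives strong support to reproducible research. EBI Metagenomics offers the same advantage for online metagenome/metatranscriptome analysis and is also a good choice. MG-RAST has so far completed the analysis of 268,325 samples and has just been upgraded to version 4.0. For a single sample, the following information is available:

1. Sequence statistics and quality control (GC content plot, nucleotide composition, sequence length distribution)
2. Feature prediction (classified as rRNA / protein coding);
3. Duplicate-read analysis (using DRISEE);
4. k-mer profile (rank abundance visualization);
5. Alignment result statistics
6. COG/NOG functional profile;
7. KEGG classification (KEGG Ortholog categories);
8. SEED annotation;
9. Taxonomic composition;
10. Diversity analysis (rarefaction curve / diversity indices)
11. Metadata

All data, including the data behind the plots, can be downloaded, which is genuinely convenient and is one reason the service is so popular. In addition, MG-RAST aligns sequences against the M5NR non-redundant database, which avoids the hassle of searching many databases separately.

Version: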

2016-11-16.v1

Miniasm+Racon: fast and accurate assembly of third-generation sequencing data

Title:

Fast and accurate de novo genome assembly from long uncorrected reads

Abstract:

The assembly of long reads from Pacific Biosciences and Oxford Nanopore Technologies typically requires resource intensive error correction and consensus generation steps to obtain high quality assemblies. We show that the error correction step can be omitted and high quality consensus sequences can be generated efficiently with a SIMD accelerated, partial order alignment based stand-alone consensus module called Racon. Based on tests with PacBio and Oxford Nanopore datasets we show that Racon coupled with Miniasm enables consensus genomes with similar or better quality than state-of-the-art methods while being an order of magnitude faster.

Link:

http://biorxiv.org/content/early/2016/08/05/068122

Source code:

https://github.com/isovic/racon

Installation:

git clone https://github.com/isovic/racon.git && cd racon && make modules && make tools && make -j

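Overview:

This is assembly software for third-generation sequencing. Data from the third-generation platforms, Nanopore and PacBio, share two traits: long reads and a high error rate. They therefore normally need special processing (consensus, error correction) before assembly. Miniasm, developed by Heng Li, can assemble raw, uncorrected long reads directly and quickly, but it does not polish the assembled contigs, so the result still contains many SNP and indel errors. Racon was written to solve exactly this problem. It accepts GFA, FASTA, FASTQ, SAM, MHAP and PAF input, which makes it more general than Quiver or Nanopolish. Hence the new combination Miniasm+Racon: efficient and fast.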
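
Miniasm writes its assembly as a GFA graph rather than FASTA. Since Racon is described above as accepting GFA this conversion may be unnecessary, but many downstream tools expect FASTA, so a minimal sketch of extracting the contig sequences is given here. It assumes GFA segment lines of the form `S<TAB>name<TAB>sequence[<TAB>tags...]`; the unitig name in the demo is made up.

```python
from typing import Iterable, Iterator


def gfa_to_fasta(gfa_lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTA record per GFA segment ('S') line that stores a sequence."""
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue  # skip header, link and other record types
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed segment line
        name, sequence = fields[1], fields[2]
        if sequence == "*":
            continue  # GFA uses '*' when the sequence is not stored
        yield f">{name}\n{sequence}\n"


demo_gfa = [
    "H\tVN:Z:1.0\n",
    "S\tutg000001l\tACGTACGTAC\tLN:i:10\n",
    "L\tutg000001l\t+\tutg000002l\t-\t5M\n",
    "S\tutg000002l\t*\tLN:i:8\n",
]
print("".join(gfa_to_fasta(demo_gfa)), end="")
```

For a real assembly, iterate over `open("assembly.gfa")` and write the records to a file.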

Version:

2016-10-15.v1

Resfams: annotation of resistance genes with profile HMMs

Title:

Improved annotation of antibiotic resistance determinants reveals microbial resistomes cluster by ecology

Abstract:

Antibiotic resistance is a dire clinical problem with important ecological dimensions. While antibiotic resistance in human pathogens continues to rise at alarming rates, the impact of environmental resistance on human health is still unclear. To investigate the relationship between human-associated and environmental resistomes, we analyzed functional metagenomic selections for resistance against 18 clinically relevant antibiotics from soil and human gut microbiota as well as a set of multidrug-resistant cultured soil isolates. These analyses were enabled by Resfams, a new curated database of protein families and associated highly precise and accurate profile hidden Markov models, confirmed for antibiotic resistance function and organized by ontology. We demonstrate that the antibiotic resistance functions that give rise to the resistance profiles observed in environmental and human-associated microbial communities significantly differ between ecologies. Antibiotic resistance functions that most discriminate between ecologies provide resistance to β-lactams and tetracyclines, two of the most widely used classes of antibiotics in the clinic and agriculture. We also analyzed the antibiotic resistance gene composition of over 6000 sequenced microbial genomes, revealing significant enrichment of resistance functions by both ecology and phylogeny. Together, our results indicate that environmental and human-associated microbial communities harbor distinct resistance genes, suggesting that antibiotic resistance functions are largely constrained by ecology.

Article:

http://www.nature.com/ismej/journal/v9/n1/full/ismej2014106a.html

Source code:

https://github.com/dantaslab/resfams
http://www.dantaslab.org/resfams/

Installation:

wget http://dantaslab.wustl.edu/resfams/Resfams-proteins.tar.gz
tar xzvf Resfams-proteins.tar.gz
cat proteins/* > resfams.faa
diamond makedb --in resfams.faa -d Resfams

wget http://dantaslab.wustl.edu/resfams/Resfams.hmm.gz
gunzip Resfams.hmm.gz
hmmpress Resfams.hmm

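Overview:

Annotation of antibiotic resistance genes attracts a lot of attention in pathogen genome sequencing and metagenome sequencing projects. Sequence databases such as the Antibiotic Resistance Database (ARDB) and the Comprehensive Antibiotic Resistance Database (CARD) have long been used for this purpose. Resfams instead offers a profile-based similarity search strategy for annotating gene sequences (functional profile annotation). For assembled sequences, HMMER is still reasonably fast. For a metagenome project that skips assembly and annotates ORFs translated directly from reads, however, the computational load remains large. In that case a sequence similarity search tool such as Diamond or Usearch can quickly flag candidate resistance genes (with a threshold permissive enough that nothing is missed), and the HMM profiles can then be used to filter out false positives, which speeds up the whole process.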
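
The two-stage idea in the last sentence can be sketched as a small filter between the fast search and the HMM search. The sketch assumes the fast search was written in the 12-column BLAST tabular format (query ID in column 1, e-value in column 11), which Diamond can produce; the threshold, sequence IDs and function names are illustrative assumptions, not values from the Resfams paper.

```python
from typing import Iterable, Iterator, Set


def candidate_ids(tabular_lines: Iterable[str], max_evalue: float = 10.0) -> Set[str]:
    """Query IDs with at least one hit at or below a deliberately loose e-value."""
    keep: Set[str] = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            keep.add(fields[0])
    return keep


def subset_fasta(fasta_lines: Iterable[str], keep: Set[str]) -> Iterator[str]:
    """Yield only the FASTA records whose ID (first word of the header) is in `keep`."""
    emit = False
    for line in fasta_lines:
        if line.startswith(">"):
            words = line[1:].split()
            emit = bool(words) and words[0] in keep
        if emit:
            yield line


# Made-up hits and ORFs to show the flow.
hits = [
    "orf1\tRF0001\t45.0\t120\t60\t2\t1\t120\t5\t124\t1e-20\t95.1\n",
    "orf3\tRF0007\t30.0\t40\t28\t0\t3\t42\t10\t49\t50\t25.0\n",
]
orfs = [">orf1 sample\n", "MKT\n", ">orf2\n", "MAA\n", ">orf3\n", "MLL\n"]
print("".join(subset_fasta(orfs, candidate_ids(hits))), end="")
```

Only the reduced FASTA would then be searched against Resfams.hmm with HMMER, so the expensive profile search runs on a small fraction of the ORFs.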

Version:

2016-11-14.v1

FASTQSim: a high-throughput sequencing read simulator supporting the Illumina, Ion, PacBio and Roche platforms

Title:

FASTQSim: platform-independent data characterization and in silico read generation for NGS datasets.

Abstract:

BACKGROUND:
High-throughput next generation sequencing technologies have enabled rapid characterization of clinical and environmental samples. Consequently, the largest bottleneck to actionable data has become sample processing and bioinformatics analysis, creating a need for accurate and rapid algorithms to process genetic data. Perfectly characterized in silico datasets are a useful tool for evaluating the performance of such algorithms. Background contaminating organisms are observed in sequenced mixtures of organisms. In silico samples provide exact truth. To create the best value for evaluating algorithms, in silico data should mimic actual sequencer data as closely as possible.
RESULTS:
FASTQSim is a tool that provides the dual functionality of NGS dataset characterization and metagenomic data generation. FASTQSim is sequencing platform-independent, and computes distributions of read length, quality scores, indel rates, single point mutation rates, indel size, and similar statistics for any sequencing platform. To create training or testing datasets, FASTQSim has the ability to convert target sequences into in silico reads with specific error profiles obtained in the characterization step.
CONCLUSIONS:
FASTQSim enables users to assess the quality of NGS datasets. The tool provides information about read length, read quality, repetitive and non-repetitive indel profiles, and single base pair substitutions. FASTQSim allows the user to simulate individual read datasets that can be used as standardized test scenarios for planning sequencing projects or for benchmarking metagenomic software. In this regard, in silico datasets generated with the FASTQsim tool hold several advantages over natural datasets: they are sequencing platform independent, extremely well characterized, and less expensive to generate. Such datasets are valuable in a number of applications, including the training of assemblers for multiple platforms, benchmarking bioinformatics algorithm performance, and creating challenge datasets for detecting genetic engineering toolmarks, etc.

Article:

http://bmcresnotes.biomedcentral.com/articles/10.1186/1756-0500-7-533

Source code:

https://sourceforge.net/projects/fastqsim

Installation:

    axel -o FASTQsim_v2.0.tgz https://sourceforge.net/projects/fastqsim/files/FASTQsim_v2.0.tgz/download
    tar xzvf FASTQsim_v2.0.tgz
    mv FASTQsim_v2.0 FASTQsim-2.0

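Overview:

FASTQSim is a high-throughput sequencing read simulator that supports the Illumina, Ion, PacBio and Roche platforms. It is widely used for simulating metagenome data, for example in the paper:
Evaluating performance of metagenomic characterization algorithms using in silico datasets generated with FASTQSim

Version: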

2016-11-13.v1