CaBLASTX: Entropy-scaling search of massive biological data

Paper:

Entropy-scaling search of massive biological data

Abstract:

The continual onslaught of new omics data has forced upon scientists the fortunate problem of having too much data to analyze. Luckily, it turns out that many datasets exhibit well-defined structure that can be exploited for the design of smarter analysis tools. We introduce an entropy-scaling data structure—which given a low fractal dimension database, scales in both time and space with the entropy of that underlying database—to perform similarity search, a fundamental operation in data science. Using these ideas, we present accelerated versions of standard tools for use by practitioners in the three domains of high-throughput drug screening, metagenomics, and protein structure search, none of which have any loss in specificity or significant loss in sensitivity: Ammolite, 12x speedup of small molecule similarity search with less than 4% loss in sensitivity; CaBLASTX, 673x speedup of BLASTX with less than 5% loss in sensitivity; and esFragBag, 10x speedup of FragBag with less than 0.2% loss in sensitivity.

Paper link:

http://arxiv.org/abs/1503.05638

GitHub repository:

https://github.com/ndaniels/cablastp2

Official homepage:

http://cast.csail.mit.edu/

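Overview:

For short-read alignment there are already Usearch, Diamond, Vsearch, RapSearch2, Lambda, and the hardware-accelerated Tera-BLAST. Now there is also CaBLASTX, which uses an entropy-scaling data structure. It is somewhat slower than Diamond, but it loses less than 5% sensitivity. The NR database is very large, so accelerated algorithms are the only practical way to search it. This matters most for metagenomic data, where the reads have to be aligned with BLASTX and aligning directly against NR demands too much compute. CaBLASTX speeds up the alignment by calling blastn-short.

Note: large genomes still need a lot of memory. On a machine with 96 GB of RAM, the run against the NR database did not complete.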
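The entropy-scaling idea from the abstract can be illustrated with a toy example. The sketch below is not the CaBLASTX code: it assumes plain Euclidean points instead of sequences, and the names `build_clusters` and `search` are made up for illustration. It only shows the general pattern of grouping the database into clusters of bounded radius, running a coarse search over cluster centers, and then running a fine search inside the clusters that survive.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def build_clusters(points, cluster_radius):
    """Greedily assign each point to the first center within cluster_radius.

    Returns a list of (center, members) pairs. A point that fits no
    existing cluster becomes the center of a new one.
    """
    clusters = []
    for p in points:
        for center, members in clusters:
            if dist(center, p) <= cluster_radius:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return clusters


def search(clusters, query, radius, cluster_radius):
    """Return every stored point within `radius` of `query`."""
    hits = []
    for center, members in clusters:
        # Coarse step: by the triangle inequality, a cluster whose center is
        # farther than radius + cluster_radius cannot contain any hit.
        if dist(center, query) <= radius + cluster_radius:
            # Fine step: compare the query with each member of the cluster.
            hits.extend(m for m in members if dist(m, query) <= radius)
    return hits


# Toy database: two tight groups and one isolated point (illustrative data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
clusters = build_clusters(points, cluster_radius=0.5)
result = search(clusters, query=(0.05, 0.05), radius=0.1, cluster_radius=0.5)
print(len(clusters), sorted(result))
```

Because the coarse step only discards clusters that provably contain no hit, this toy version returns the same points as a brute-force scan. The saving comes from skipping whole clusters, so it grows with how redundant the database is.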

Installation:

Database download:

   axel -n 24 -o nr-20140917-cablastx.tgz http://giant.csail.mit.edu/gems/nr-20140917-cablastx.tgz

Program download:

   http://giant.csail.mit.edu/gems/cablastx-src.tgz
   http://giant.csail.mit.edu/gems/cablastx-linux-amd64.tgz

Testing the program:

   cablastx-compress --ext-seed-size 0 --match-seq-id-threshold 70  --ext-seq-id-threshold 60 --max-seeds 20 -p 40 nr-20140917-cablastx nr.fasta 
   cablastx /biostack/database/nr/nr-20140917-cablastx test.fa --blast-args -outfmt 5 -out cablastx.tsv -evalue 0.1 -max_target_seqs 1 -num_threads 24
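When the two steps above are driven from a pipeline, it can help to assemble the command lines in a script. The following is a minimal sketch that only reuses the options shown above. The helper names `compress_cmd` and `search_cmd` are hypothetical, and it assumes that everything after `--blast-args` is passed through to BLAST unchanged.

```python
import shlex


def compress_cmd(db_dir, fasta, threads=40):
    """Argument list for the cablastx-compress call shown above."""
    return [
        "cablastx-compress",
        "--ext-seed-size", "0",
        "--match-seq-id-threshold", "70",
        "--ext-seq-id-threshold", "60",
        "--max-seeds", "20",
        "-p", str(threads),
        db_dir, fasta,
    ]


def search_cmd(db_dir, query, blast_args=()):
    """Argument list for a cablastx search; blast_args are appended
    after --blast-args (assumed to be handed on to BLAST as-is)."""
    cmd = ["cablastx", db_dir, query]
    if blast_args:
        cmd += ["--blast-args", *blast_args]
    return cmd


cmd = search_cmd(
    "/biostack/database/nr/nr-20140917-cablastx",
    "test.fa",
    ["-evalue", "0.1", "-num_threads", "24"],
)
print(shlex.join(cmd))

# To actually run it (requires the cablastx binaries on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with paths, and `check=True` makes a failed run (for example, running out of memory) raise an error instead of passing silently.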

Related tools:

Diamond: https://github.com/bbuchfink/diamond/
Usearch: http://www.drive5.com/usearch/
Kraken: https://ccb.jhu.edu/software/kraken/
CaBLAST: http://www.nature.com/nbt/journal/v30/n7/full/nbt.2241.html

Update: 2015/6/4 7:51
