K-mer Applications in Bioinformatics and a List of Tools

This is a starting point; I will keep updating this topic over time.

Introduction

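K-mers are widely used in bioinformatics, for example in genome assembly, estimating sequencing coverage, correcting errors in sequencing data, multiple sequence alignment, and repeat detection. Counting k-mers is memory-intensive, however, so a well-chosen data structure helps reduce memory use. Khmer, for example, uses a probabilistic data structure (Bloom filter, http://en.wikipedia.org/wiki/Bloom_filter), while Jellyfish uses a parallel lock-free hash table. Reducing memory use sometimes requires a trade-off among run time, memory, and disk space. The lists below cover commonly used k-mer counting tools and some example applications.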
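To make the memory trade-off concrete, here is a minimal Python sketch. It is not the implementation of Khmer, Jellyfish, or any other tool listed here. The names `iter_kmers`, `count_kmers`, `BloomFilter`, and `count_repeated_kmers` are hypothetical, as are the Bloom filter size and the salted SHA-256 hashing. The first function counts exactly with a hash table. The second uses a Bloom filter so that k-mers seen only once, which are often sequencing errors, never get their own table entry.

```python
import hashlib
from collections import Counter


def iter_kmers(seq, k):
    """Yield every overlapping substring of length k from seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers(seq, k):
    """Exact counting: one hash-table entry per distinct k-mer."""
    return Counter(iter_kmers(seq, k))


class BloomFilter:
    """Fixed-size bit array: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        # Sizes are illustrative assumptions, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def count_repeated_kmers(seq, k, bloom=None):
    """Count only k-mers seen at least twice; first sightings go to the filter."""
    if bloom is None:
        bloom = BloomFilter()
    counts = {}
    for kmer in iter_kmers(seq, k):
        if kmer in counts:
            counts[kmer] += 1
        elif kmer in bloom:
            counts[kmer] = 2  # second sighting: promote to the exact table
        else:
            bloom.add(kmer)  # first sighting: costs a few bits, no table entry
    return counts


if __name__ == "__main__":
    print(count_kmers("ACGTACGT", 4))
    print(count_repeated_kmers("ACGTACGT", 4))
```

Because a Bloom filter can return false positives, this sketch may occasionally report a count of 2 for a k-mer that occurred only once. A real tool would need an extra step, such as a second pass over the reads, to correct such counts.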

Tools

  1. DSK (Rizk et al. 2013) [1] http://minia.genouest.org/dsk/
  2. Musket (Liu et al. 2013) [2] http://musket.sourceforge.net/homepage.htm#latest
  3. Khmer (McDonald and Brown 2013) [3] http://khmer.readthedocs.org/en/latest/
  4. BFCounter (Melsted and Pritchard 2011) [4] http://pritch.bsd.uchicago.edu/bfcounter.html
  5. Simrank (DeSantis et al. 2011) [5] http://search.cpan.org/~shuriko/String-Simrank-0.079/lib/String/Simrank.pm
  6. Kmer (Walenz and Florea 2011) [6] http://sourceforge.net/apps/mediawiki/kmer/index.php?title=Main_Page
  7. Jellyfish (Marcais and Kingsford 2011) [7] http://www.cbcb.umd.edu/software/jellyfish/
  8. Tallymer (Kurtz et al. 2008) [8] http://www.zbh.uni-hamburg.de/?id=211
  9. NmerFreq [9] http://compbio.umbc.edu/research/software/188-2/

Utility

  1. PriMux (Hysom et al. 2012) [10] http://sourceforge.net/projects/PriMux
  2. COPE (Liu et al. 2012) [11] http://sourceforge.net/projects/coperead/
  3. KASpOD (Parisot et al. 2012) [12] http://g2im.u-clermont1.fr/kaspod/
  4. SINA (Pruesse et al. 2012) [13] http://www.arb-silva.de/aligner
  5. SlideSort (Shimizu and Tsuda 2011) [14] http://www.cbrc.jp/~shimizu/slidesort/index.php
  6. piRNApredictor (Zhang et al. 2011) [15] http://59.79.168.90/piRNA/
  7. Gk-arrays (Philippe et al. 2011) [16] http://crac.gforge.inria.fr/gkarrays/
  8. Reptile (Yang et al. 2010) [17] http://aluru-sun.ece.iastate.edu/doku.php?id=reptile
  9. Figaro (White et al. 2008) [18] http://sourceforge.net/apps/mediawiki/amos/index.php?title=Figaro
  10. BLMT (Ganapathiraju et al. 2004) [19]
  11. MRD (Subramanian et al. 2002) [20]

References

  1. Rizk G, Lavenier D, Chikhi R. 2013. DSK: k-mer counting with very low memory usage. Bioinformatics 29(5): 652-653.

  2. Liu Y, Schroder J, Schmidt B. 2013. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics 29(3): 308-315.

  3. McDonald E, Brown T. 2013. khmer: Working with Big Data in Bioinformatics.

  4. Melsted P, Pritchard JK. 2011. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics 12: 333.

  5. DeSantis TZ, Keller K, Karaoz U, Alekseyenko AV, Singh NN, Brodie EL, Pei Z, Andersen GL, Larsen N. 2011. Simrank: Rapid and sensitive general-purpose k-mer search tool. BMC Ecol 11: 11.

  6. Walenz B, Florea L. 2011. Sim4db and Leaff: utilities for fast batch spliced alignment and sequence indexing. Bioinformatics 27(13): 1869-1870.

  7. Marcais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6): 764-770.

  8. Kurtz S, Narechania A, Stein JC, Ware D. 2008. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics 9: 517.

  9. http://compbio.umbc.edu/research/software/188-2/

  10. Hysom DA, Naraghi-Arani P, Elsheikh M, Carrillo AC, Williams PL, Gardner SN. 2012. Skip the alignment: degenerate, multiplex primer and probe design using K-mer matching instead of alignments. PLoS One 7(4): e34560.

  11. Liu B, Yuan J, Yiu SM, Li Z, Xie Y, Chen Y, Shi Y, Zhang H, Li Y, Lam TW et al. 2012. COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly. Bioinformatics 28(22): 2870-2874.

  12. Parisot N, Denonfoux J, Dugat-Bony E, Peyret P, Peyretaillade E. 2012. KASpOD–a web service for highly specific and explorative oligonucleotide design. Bioinformatics 28(23): 3161-3162.

  13. Pruesse E, Peplies J, Glockner FO. 2012. SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics 28(14): 1823-1829.

  14. Shimizu K, Tsuda K. 2011. SlideSort: all pairs similarity search for short reads. Bioinformatics 27(4): 464-470.

  15. Zhang Y, Wang X, Kang L. 2011. A k-mer scheme to predict piRNAs and characterize locust piRNAs. Bioinformatics 27(6): 771-776.

  16. Philippe N, Salson M, Lecroq T, Leonard M, Commes T, Rivals E. 2011. Querying large read collections in main memory: a versatile data structure. BMC Bioinformatics 12: 242.

  17. Yang X, Dorman KS, Aluru S. 2010. Reptile: representative tiling for short read error correction. Bioinformatics 26(20): 2526-2533.

  18. White JR, Roberts M, Yorke JA, Pop M. 2008. Figaro: a novel statistical method for vector sequence removal. Bioinformatics 24(4): 462-467.

  19. Ganapathiraju M, Manoharan V, Klein-Seetharaman J. 2004. BLMT: statistical sequence analysis using N-grams. Appl Bioinformatics 3(2-3): 193-200.

  20. Subramanian S, Madgula VM, George R, Mishra RK, Pandit MW, Kumar CS, Singh L. 2002. MRD: a microsatellite repeats database for prokaryotic and eukaryotic genomes. Genome Biol 3(12): PREPRINT0011.

Updated: 2013-08-14
