This post is just a starting point; I will keep updating this topic over time.


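K-mers are widely used in bioinformatics, for example in genome assembly, estimating genome sequencing coverage, error correction of sequencing data, multiple sequence alignment, and repeat detection. Counting k-mers is memory-intensive, however, so a well-chosen data structure helps keep memory usage down. Khmer, for instance, uses a probabilistic data structure (a Bloom filter), while Jellyfish uses a parallel lock-free hash table. Reducing memory usage sometimes means trading off among running time, memory, and disk space. The first list below gives commonly used k-mer counting tools, and the second list gives some example applications.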
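To make the memory trade-off concrete, here is a minimal Python sketch, not taken from any of the tools listed in this post. `count_kmers` keeps every k-mer in an in-memory hash table, while `count_repeated_kmers` uses a toy Bloom filter so that k-mers seen only once never enter the table. All names here (`count_kmers`, `BloomFilter`, `count_repeated_kmers`) and the parameters `num_bits` and `num_hashes` are hypothetical and chosen for illustration only; real tools use far more compact encodings and faster hash functions.

```python
from collections import Counter
import hashlib


def count_kmers(seq, k):
    """Exact k-mer counts: every distinct k-mer is stored in a hash table."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def count_repeated_kmers(seq, k, bloom=None):
    """Count only k-mers seen at least twice.

    A first sighting is recorded in the Bloom filter alone; the k-mer
    enters the hash table on its second sighting. A Bloom false positive
    can let a singleton into the table with a count of 2.
    """
    if bloom is None:
        bloom = BloomFilter()
    seq = seq.upper()
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
        elif kmer in bloom:
            counts[kmer] = 2  # second sighting (or a false positive)
        else:
            bloom.add(kmer)
    return counts


exact = count_kmers("ACGTACGTAC", 4)
repeated = count_repeated_kmers("ACGTACGTAC", 4)
print(exact["ACGT"], repeated["ACGT"])  # 2 2
```

Since sequencing errors tend to produce many k-mers that occur only once, keeping singletons out of the hash table is where a scheme like this would save memory, at the cost of occasional overcounts from false positives.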


  1. DSK (Rizk et al. 2013) [1]
  2. Musket (Liu et al. 2013) [2]
  3. Khmer (McDonald and Brown 2013) [3]
  4. BFCounter (Melsted and Pritchard 2011) [4]
  5. Simrank (DeSantis et al. 2011) [5]
  6. Kmer (Walenz and Florea 2011) [6]
  7. Jellyfish (Marcais and Kingsford 2011) [7]
  8. Tallymer (Kurtz et al. 2008) [8]
  9. NmerFreq [9]


  1. PriMux (Hysom et al. 2012) [10]
  2. COPE (Liu et al. 2012) [11]
  3. KASpOD (Parisot et al. 2012) [12]
  4. SINA (Pruesse et al. 2012) [13]
  5. SlideSort (Shimizu and Tsuda 2011) [14]
  6. piRNApredictor (Zhang et al. 2011) [15]
  7. Gk-arrays (Philippe et al. 2011) [16]
  8. Reptile (Yang et al. 2010) [17]
  9. Figaro (White et al. 2008) [18]
  10. BLMT (Ganapathiraju et al. 2004) [19]
  11. MRD (Subramanian et al. 2002) [20]


  1. Rizk G, Lavenier D, Chikhi R. 2013. DSK: k-mer counting with very low memory usage. Bioinformatics 29(5): 652-653.

  2. Liu Y, Schroder J, Schmidt B. 2013. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics 29(3): 308-315.

  3. McDonald E, Brown T. 2013. khmer: Working with Big Data in Bioinformatics.

  4. Melsted P, Pritchard JK. 2011. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics 12: 333.

  5. DeSantis TZ, Keller K, Karaoz U, Alekseyenko AV, Singh NN, Brodie EL, Pei Z, Andersen GL, Larsen N. 2011. Simrank: Rapid and sensitive general-purpose k-mer search tool. BMC Ecol 11: 11.

  6. Walenz B, Florea L. 2011. Sim4db and Leaff: utilities for fast batch spliced alignment and sequence indexing. Bioinformatics 27(13): 1869-1870.

  7. Marcais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6): 764-770.

  8. Kurtz S, Narechania A, Stein JC, Ware D. 2008. A new method to compute K-mer frequencies and its application to annotate large repetitive plant genomes. BMC Genomics 9: 517.


  10. Hysom DA, Naraghi-Arani P, Elsheikh M, Carrillo AC, Williams PL, Gardner SN. 2012. Skip the alignment: degenerate, multiplex primer and probe design using K-mer matching instead of alignments. PLoS One 7(4): e34560.

  11. Liu B, Yuan J, Yiu SM, Li Z, Xie Y, Chen Y, Shi Y, Zhang H, Li Y, Lam TW et al. 2012. COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly. Bioinformatics 28(22): 2870-2874.

  12. Parisot N, Denonfoux J, Dugat-Bony E, Peyret P, Peyretaillade E. 2012. KASpOD–a web service for highly specific and explorative oligonucleotide design. Bioinformatics 28(23): 3161-3162.

  13. Pruesse E, Peplies J, Glockner FO. 2012. SINA: accurate high-throughput multiple sequence alignment of ribosomal RNA genes. Bioinformatics 28(14): 1823-1829.

  14. Shimizu K, Tsuda K. 2011. SlideSort: all pairs similarity search for short reads. Bioinformatics 27(4): 464-470.

  15. Zhang Y, Wang X, Kang L. 2011. A k-mer scheme to predict piRNAs and characterize locust piRNAs. Bioinformatics 27(6): 771-776.

  16. Philippe N, Salson M, Lecroq T, Leonard M, Commes T, Rivals E. 2011. Querying large read collections in main memory: a versatile data structure. BMC Bioinformatics 12: 242.

  17. Yang X, Dorman KS, Aluru S. 2010. Reptile: representative tiling for short read error correction. Bioinformatics 26(20): 2526-2533.

  18. White JR, Roberts M, Yorke JA, Pop M. 2008. Figaro: a novel statistical method for vector sequence removal. Bioinformatics 24(4): 462-467.

  19. Ganapathiraju M, Manoharan V, Klein-Seetharaman J. 2004. BLMT: statistical sequence analysis using N-grams. Appl Bioinformatics 3(2-3): 193-200.

  20. Subramanian S, Madgula VM, George R, Mishra RK, Pandit MW, Kumar CS, Singh L. 2002. MRD: a microsatellite repeats database for prokaryotic and eukaryotic genomes. Genome Biol 3(12): PREPRINT0011.

