Insights into the phylogeny and coding potential of microbial dark matter

Supplementary Information

SAG assembly

The draft genomes of all but 13 SAGs were generated at the DOE Joint Genome Institute (JGI) using Illumina technology. An Illumina standard shotgun library was constructed and sequenced on the Illumina HiSeq 2000 platform. All general aspects of library construction and sequencing performed at the JGI can be found on the JGI website. Raw Illumina sequence data were filtered for known Illumina sequencing and library preparation artifacts and then screened and trimmed according to the k-mers present in the dataset. High-depth k-mers, presumably derived from MDA amplification bias, cause problems in the assembly, especially if the k-mer depth varies by orders of magnitude between regions of the genome. Reads representing highly abundant k-mers were removed such that no k-mers with a coverage of more than 30x were present after filtering. Reads with an average k-mer depth of less than 2x were also removed.

The following steps were then performed for assembly: (1) filtered Illumina reads were assembled using Velvet version 1.1.04 (Zerbino and Birney, 2008). The VelvetOptimiser script (version 2.1.7) was used with default optimization functions (n50 for k-mer choice, total number of base pairs in large contigs for cov_cutoff optimization). (2) 1-3 kbp simulated paired-end reads were created from Velvet contigs using the wgsim software. (3) The normalized Illumina reads were assembled together with the simulated read pairs using Allpaths-LG (version 41043) (Gnerre and MacCallum, 2011).

Parameters for the assembly steps were: 1) VelvetOptimiser (-v -s 51 -e 71 -i 4 -t 1 -o "-ins_length 250 -min_contig_lgth 500"); 2) wgsim (-e 0 -1 100 -2 100 -r 0 -R 0 -X 0); 3) Allpaths-LG (prepareAllpathsParams: PHRED_64=1 PLOIDY=1 FRAG_COVERAGE=125 JUMP_COVERAGE=25 LONG_JUMP_COV=50; runAllpathsParams: THREADS=8 RUN=std_pairs TARGETS=standard VAPI_WARN_ONLY=True OVERWRITE=True).

For the remaining 13 SAGs, the draft genomes were generated at the JGI on a Roche 454 Genome Sequencer FLX System using Titanium chemistry according to the manufacturer's protocols (454 Life Sciences, Branford, CT). For these hybrid 454/Illumina assemblies, steps (1)-(3) were identical. Next, (4) Allpaths contigs larger than 1 kbp were shredded into 1 kbp pieces with 200 bp overlaps. (5) Lastly, the Allpaths shreds and raw 454 pyrosequence reads were assembled using the 454 Newbler assembler version 2.4 (Roche). All assemblies are available on the Microbial Dark Matter project website.
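
The k-mer based read filtering described above can be illustrated with a small sketch. The text does not name the tool JGI used or the k-mer size, so the function name `filter_reads`, the default `k=21`, and the greedy keep-or-drop rule are all assumptions made for illustration. The sketch also ignores reverse-complement strands and quality trimming.

```python
from collections import Counter


def kmers(seq, k):
    """Return every overlapping k-mer of seq (empty list if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def filter_reads(reads, k=21, max_depth=30, min_mean_depth=2.0):
    """Hypothetical depth filter, not the JGI implementation.

    Drops reads whose mean k-mer depth in the raw data is below
    min_mean_depth, then greedily keeps reads only while no k-mer
    in the kept set would exceed max_depth.
    """
    # Pass 1: k-mer depth over the raw reads.
    raw_depth = Counter()
    for read in reads:
        raw_depth.update(kmers(read, k))

    kept = []
    kept_depth = Counter()
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        mean_depth = sum(raw_depth[x] for x in read_kmers) / len(read_kmers)
        if mean_depth < min_mean_depth:
            continue  # low-depth read, likely error or contamination
        added = Counter(read_kmers)
        # Reject the read if keeping it would push any k-mer past the cap.
        if any(kept_depth[x] + n > max_depth for x, n in added.items()):
            continue
        kept_depth.update(added)
        kept.append(read)
    return kept
```

With this rule, a region amplified to several hundred-fold depth by MDA is thinned to at most 30x, while regions already below the cap are left untouched.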

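  1. Quality-control the raw Illumina reads: remove known artifacts, then filter reads by k-mer frequency:
    a. remove reads containing k-mers with a frequency above 30x;
    b. remove reads whose average k-mer frequency is below 2x.
  2. Assemble the filtered reads with Velvet (using VelvetOptimiser to tune the parameters).
  3. Use wgsim to simulate 1-3 kbp paired-end reads from the Velvet assembly.
  4. Assemble the simulated reads together with the normalized Illumina reads using Allpaths-LG.
454/Illumina hybrid assembly strategy:
  1. Process the Illumina reads following the workflow above;
  2. Shred the Allpaths-LG contigs (> 1 kb) into 1 kb pieces with 200 bp overlaps;
  3. Combine the 454 reads with the pieces from step 2 and assemble them with 454 Newbler.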
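
To make step 3 concrete, here is a rough sketch of what simulating error-free 100 bp read pairs with 1-3 kbp inserts from a contig could look like. It is not wgsim itself. The function name `simulate_pairs`, the uniform insert-size distribution, and the inward-facing read orientation are assumptions for illustration only.

```python
import random

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def simulate_pairs(contig, n_pairs, read_len=100,
                   min_insert=1000, max_insert=3000, rng=None):
    """Hypothetical error-free paired-read simulator (assumed uniform inserts)."""
    rng = rng or random.Random(0)
    pairs = []
    longest = min(max_insert, len(contig))
    if longest < min_insert:
        return pairs  # contig too short to hold one insert
    for _ in range(n_pairs):
        insert = rng.randint(min_insert, longest)
        start = rng.randint(0, len(contig) - insert)
        fragment = contig[start:start + insert]
        read1 = fragment[:read_len]             # forward read at the fragment start
        read2 = revcomp(fragment[-read_len:])   # reverse read at the fragment end
        pairs.append((read1, read2))
    return pairs
```

Because the reads are copied from the contigs without errors, the pairs carry only long-range linkage information, which is what the next assembly step needs from them.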
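
The shredding in step 2 amounts to a sliding window of 1 kb advanced by 800 bp, as the following sketch shows. The function name `shred` and the handling of the contig tail (a final piece anchored at the contig end, so its overlap may exceed 200 bp) are assumptions. The source does not say how the last piece was cut.

```python
def shred(contig, piece_len=1000, overlap=200):
    """Hypothetical shredder: cut a contig into overlapping fixed-length pieces."""
    if len(contig) <= piece_len:
        return []  # only contigs larger than piece_len are shredded
    step = piece_len - overlap
    pieces = []
    start = 0
    while start + piece_len < len(contig):
        pieces.append(contig[start:start + piece_len])
        start += step
    # Assumed tail rule: anchor the last piece at the contig end so no bases are lost.
    pieces.append(contig[-piece_len:])
    return pieces
```

A 2,600 bp contig, for example, yields three pieces starting at positions 0, 800 and 1,600, each sharing exactly 200 bp with its neighbour.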
Microbial Dark Matter Project
