wgsim: Reads simulator


Wgsim is a small tool for simulating sequence reads from a reference genome. It can simulate diploid genomes with SNPs and insertion/deletion (INDEL) polymorphisms, and simulate reads with uniform substitution sequencing errors. It does not generate INDEL sequencing errors, but this can be partly compensated for by simulating INDEL polymorphisms. Wgsim outputs the simulated polymorphisms, and writes the true read coordinates, as well as the number of polymorphisms and sequencing errors, into the read names. One can evaluate the accuracy of a mapper or a SNP caller with the evaluation script that comes with the package.
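Because the true coordinates are stored in the read names, they can be recovered with a few lines of code. The sketch below assumes the name layout `contig_start_end_e1:s1:i1_e2:s2:i2_counter/mate`, which is how I read the output format in wgsim.c; check it against the FASTQ your build actually produces. The class and function names are made up for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class WgsimReadName(NamedTuple):
    contig: str
    start: int                      # leftmost coordinate of the fragment
    end: int                        # rightmost coordinate of the fragment
    errors: Tuple[int, int]         # sequencing errors in (read 1, read 2)
    substitutions: Tuple[int, int]  # substitution polymorphisms per read
    indels: Tuple[int, int]         # indel polymorphisms per read
    pair_id: str                    # hexadecimal pair counter
    mate: Optional[int]             # 1 or 2, or None if no "/N" suffix


def parse_wgsim_name(name: str) -> WgsimReadName:
    """Parse a read name, assuming the wgsim layout described above."""
    name = name.strip().lstrip("@")
    mate = None
    if "/" in name:
        name, mate_str = name.rsplit("/", 1)
        mate = int(mate_str)
    # Split from the right: the contig name itself may contain underscores.
    contig, start, end, stats1, stats2, pair_id = name.rsplit("_", 5)
    e1, s1, i1 = (int(x) for x in stats1.split(":"))
    e2, s2, i2 = (int(x) for x in stats2.split(":"))
    return WgsimReadName(contig, int(start), int(end),
                         (e1, e2), (s1, s2), (i1, i2), pair_id, mate)
```

Comparing `start` and `end` with the position reported by a mapper gives a simple per-read correctness check.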




git clone
cd wgsim
gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm


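wgsim is a fairly general-purpose read simulator. It requires a reference genome, and its command-line options let you control:

1. the outer distance between paired-end reads (equivalent to the insert size);
2. the read length;
3. the error rate;
4. the number of reads;
5. the mutation rate;
6. the fraction of indels

and so on. It is well suited to simulating metagenome datasets, although that requires supplying many reference genomes.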
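The options listed above can be assembled into a wgsim command line. This is a minimal sketch: the flags (`-d`, `-1`, `-2`, `-e`, `-N`, `-r`, `-R`) and the default values are as I recall them from the wgsim usage message, so run `wgsim` without arguments to confirm them for your build. The function name and file names are hypothetical.

```python
def build_wgsim_command(ref_fa, out1, out2, *,
                        outer_distance=500, read_length=70,
                        error_rate=0.02, n_pairs=1000000,
                        mutation_rate=0.001, indel_fraction=0.15,
                        wgsim="wgsim"):
    """Return the argument list for one wgsim run (flags are assumptions)."""
    return [
        wgsim,
        "-d", str(outer_distance),   # outer distance between the two ends
        "-1", str(read_length),      # length of the first read
        "-2", str(read_length),      # length of the second read
        "-e", str(error_rate),       # base error rate
        "-N", str(n_pairs),          # number of read pairs
        "-r", str(mutation_rate),    # rate of mutations
        "-R", str(indel_fraction),   # fraction of mutations that are indels
        ref_fa, out1, out2,
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)`; redirect standard output to a file if you want to keep the list of simulated polymorphisms.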

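To simulate metagenome data, you can download the genomes of the microbes you want directly from RefSeq, typically bacterial, archaeal and viral sequences in some chosen proportion, merge them into one file, and run the simulation. For reference, the paper "A comparative study of metagenomics analysis pipelines at the species level" used hg19 + 315 microbial genomes (292 237 species: 74 bacteriophages, 69 viruses and 149 bacteria). That composition is counted by species, so different strains of the same species were presumably removed. Real metagenome data, however, will certainly contain many strains of the same species, and this is one of the hard parts of metagenome analysis: chimeric assemblies.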
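The merging step can be sketched as a plain concatenation of FASTA files. This assumes the downloaded genomes have already been decompressed to plain-text FASTA; the function name is hypothetical, and it does nothing to control the relative abundance of each genome.

```python
def merge_fasta(inputs, output):
    """Concatenate FASTA files into one reference; return the record count."""
    n_records = 0
    with open(output, "w") as out:
        for path in inputs:
            last = "\n"
            with open(path) as fh:
                for line in fh:
                    if line.startswith(">"):
                        n_records += 1
                    out.write(line)
                    last = line
            # Keep the next file's header from being glued onto this sequence.
            if not last.endswith("\n"):
                out.write("\n")
    return n_records
```

The merged file can then be given to wgsim as the reference genome.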

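Because NCBI's new assembly database does not offer a bulk download, fetching RefSeq is no longer as simple as it used to be. A convenient program for downloading RefSeq is: Kraken_db_install_scripts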



