Next-Generation Sequencing (NGS) has significantly reduced the cost of genome sequencing projects, resulting in a substantial increase in the availability of genomic data in public databases. The cheaper and easier it is to sequence new genomes, the more accurate the annotation steps have to be, to avoid both the loss of information and the accumulation of erroneous features that may affect the accuracy of further analysis. In the case of bacterial genomes, a range of web annotation software has been developed; however, many applications have yet to incorporate the steps required to improve their results, including the removal of false-positive/spurious predictions and a more complete identification of non-coding features. We present Genix, a new web-based bacterial genome annotation pipeline. A comparison of the results generated by Genix for four reference genomes against those generated by other annotation tools indicated that our pipeline is able to provide results that are closer to the reference genome annotation, with fewer false-positive proteins and fewer missing functionally annotated proteins. Additionally, the metrics obtained by Genix were slightly better than those obtained by Prokka, a state-of-the-art standalone annotation system. Our results indicate that Genix is a useful tool that is able to provide a more refined result, and may be a user-friendly way to obtain high-quality results.





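There are already plenty of automated bacterial genome annotation pipelines, including RAST, BASys, and Prokka. Either way, anything that brings something new can be published, and it gives us more options to choose from.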

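Genix offers both a web version and its source code (P.S. a web application that does not release its source code is just not playing fair). It is free, but you need to register an account. It is implemented with Apache, MySQL, SQLite, Perl, Python, and Bash. Given the input sequences and the taxonomic information of the organism, the Genix pipeline makes it easy to obtain a genome annotation. If you also supply some extra database submission information, it can directly generate a GenBank file that is ready for submission. As for server load, it works like every other online service: jobs are queued and processed one at a time.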

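Genix is built mainly from the following tools:

  1. Protein-coding gene prediction: Prodigal
  2. tRNA gene prediction: tRNAscan-SE
  3. rRNA gene prediction: RNAmmer
  4. tmRNA gene prediction: Aragorn
  5. ncRNA prediction: blastn + Infernal + Rfam
  6. Data visualization: the JBrowse genome browser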
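To give a feel for how such tools are chained together, here is a minimal Python sketch that only builds the command lines for three of the prediction steps. These are ordinary invocations of each tool, not the commands Genix itself runs; the function name `build_commands`, the file names, and the output layout are all made up for illustration.

```python
from pathlib import Path


def build_commands(genome: str, outdir: str) -> dict[str, list[str]]:
    """Return one command line per annotation step (nothing is executed here).

    Hypothetical helper: the file names below are assumptions, not Genix's.
    """
    out = Path(outdir)
    return {
        # Prodigal: -i input, -a protein translations, -o coordinates, -f format
        "cds": ["prodigal", "-i", genome, "-a", str(out / "proteins.faa"),
                "-o", str(out / "cds.gff"), "-f", "gff"],
        # tRNAscan-SE: -B selects the bacterial model, -o names the result table
        "trna": ["tRNAscan-SE", "-B", "-o", str(out / "trna.txt"), genome],
        # Aragorn: -m restricts the search to tmRNA genes, -o names the report
        "tmrna": ["aragorn", "-m", "-o", str(out / "tmrna.txt"), genome],
    }


commands = build_commands("genome.fna", "annotation")
for step, cmd in commands.items():
    print(step, " ".join(cmd))
```

If the tools are installed, each list can be passed to `subprocess.run(cmd, check=True)`. The rRNA and ncRNA steps are left out here to keep the sketch short.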

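The choice of sequence database is something every application has to think about, because sequence alignment is computationally expensive. One option is to use a faster alignment tool such as DIAMOND or USEARCH, but shrinking the database itself also gives a large performance gain. The most typical approach is to split the database by taxonomy, for example into Viruses/Archaea/Bacteria, and of course you can also supply your own sequence set. eggNOG has done the same since version 4.0. After all, annotating a bacterial genome does not need the whole UniProt/eggNOG database, only the bacterial part. Genix automatically downloads the UniProt sequences that match the taxonomy information you provide (an NCBI Taxonomy identifier) and runs CD-HIT to filter out redundant sequences, which reduces the size of the reference database. What if the species is unknown? Then just set the NCBI Taxonomy identifier to Bacteria.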
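The idea of shrinking the reference database can be sketched in a few lines of Python. This is not Genix's code: it assumes UniProt-style FASTA headers, where the `OX=` field carries the NCBI Taxonomy identifier, and it uses exact-duplicate removal as a crude stand-in for CD-HIT, which really clusters by sequence identity. The function names, the `allowed_taxids` parameter, and the toy records are all invented for the example.

```python
import re

# UniProt FASTA headers carry the NCBI Taxonomy identifier as "OX=<number>"
OX_FIELD = re.compile(r"\bOX=(\d+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def reduce_database(records, allowed_taxids):
    """Keep records from the allowed taxa and drop exact duplicate sequences.

    Exact matching is only a stand-in for CD-HIT's identity-based clustering.
    """
    seen, kept = set(), []
    for header, seq in records:
        match = OX_FIELD.search(header)
        if match is None or int(match.group(1)) not in allowed_taxids:
            continue  # wrong taxon, or no taxonomy field in the header
        if seq in seen:
            continue  # redundant copy of a sequence already kept
        seen.add(seq)
        kept.append((header, seq))
    return kept


# Toy records with made-up accessions (83333 = E. coli K-12, 9606 = human)
demo = """>sp|X00001|TOY1_ECOLI Toy protein OS=Escherichia coli (strain K12) OX=83333
MKT
AYI
>sp|X00002|TOY2_ECOLI Toy duplicate OS=Escherichia coli (strain K12) OX=83333
MKTAYI
>sp|X00003|TOY3_HUMAN Toy protein OS=Homo sapiens OX=9606
MSEQ
""".splitlines()

reduced = reduce_database(read_fasta(demo), allowed_taxids={83333})
print(len(reduced), "record(s) kept")
```

In a real setting the allowed set would hold every taxon under the chosen lineage, and the redundancy step would be handed to CD-HIT rather than done by exact matching.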

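What sets Genix apart from Prokka is the refinement of the predicted gene models. Protein-coding genes in prokaryotes are much simpler than in eukaryotes, and Prodigal is the first choice among the available prediction tools, but it still produces some wrong gene models. Genix filters the predicted gene models with AntiFam, then validates the resulting CDS sequences and corrects their start codons.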
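The kind of check that a CDS validation step performs might look like the Python sketch below. It is only an illustration under simple assumptions, not Genix's actual rules: it accepts the three start codons that Prodigal uses (ATG, GTG, TTG) and the standard stop codons, and the function name `check_cds` is hypothetical.

```python
START_CODONS = {"ATG", "GTG", "TTG"}  # start codons considered by Prodigal
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard stop codons


def check_cds(seq: str) -> list[str]:
    """Return a list of problems found in a CDS (an empty list means it passes)."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons or codons[0] not in START_CODONS:
        problems.append("no valid start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        problems.append("internal stop codon")
    return problems


for cds in ("ATGAAATAA", "ATGTAAAAATAA", "CTGAAATAA"):
    print(cds, check_cds(cds) or "ok")
```

A model that fails such checks could be dropped or sent to a start-codon correction step. How Genix decides between the two is not described here.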

**Genix workflow diagram**



