Archives

Categories

如何根据列表快速提取fasta/q序列子集/补集:seqtk_subseq

本部分内容主要介绍如何根据列表快速提取 fasta/q 序列子集/补集,主要解决下面两个问题场景:

 1. 交集 寻找出现在列表的序列子集
 2. 补集 寻找未出现在列表的序列子集

并集的问题我们也经常会遇到,比如我们手头上有很多序列集合,但是序列有重叠(序列名称一致、序列碱基/氨基酸一致),这个问题可以通过程序 seqtk_uniq 解决, 下次再讲.

一、 seqtk_subseq 介绍

下面讲如何使用seqtk_namesseqtk_subseq 这两个小工具解决这些问题, 程序实现主要是使用 klib (高效的轻量级C库,大部分是使用宏实现,学习如何使用klib, 请 移步 Wiki

下面是当前主要功能模块:

Common components

khash.h: generic hash table based on double hashing.
kbtree.h: generic search tree based on B-tree.
ksort.h: generic sort, including introsort, merge sort, heap sort, comb sort, Knuth shuffle and the k-small algorithm.
kseq.h: generic stream buffer and a FASTA/FASTQ format parser.
kvec.h: generic dynamic array.
kdq.h: generic double-ended queue (de-queue).
klist.h: generic single-linked list and memory pool.
kstring.{h,c}: basic string library.
kmath.{h,c}: numerical routines including MT19937-64 pseudorandom number generator, basic nonlinear programming and a few special math functions.
kson.{h,c}: simple JSON parser (no streaming)
kthread.c: simple multi-threading models.

Components for more specific use cases

kexpr.{h,c}: parsing and evaluating mathematical expressions
ksa.c: constructing suffix arrays for strings with multiple sentinels, based on a revised SAIS algorithm.
knetfile.{h,c}: random access to remote files on HTTP or FTP.
kurl.{h,c}: FILE-like interfaces to libcurl.
kopen.c: smart stream opening.
khmm.{h,c}: basic HMM library.
ksw.{h,c}: Striped Smith-Waterman algorithm.
knhx.{h,c}: Newick tree format parser.

有效使用这些函数,还用什么 Perl/Python 等动态语言完成计算/IO密集度比较高的数据处理任务呢?

因为我们经常遇到的是很大的数据集,为了测试下面小程序, 这里使用的测试数据都比较大, 并且从一个大文件中随机抽取 200000 条序列和 200 条序列以减小序列分布上的偏差。

材料清单:

序列集EAOA2.fq.gz合基本信息

seqtk_info EAOA2.fq.gz

#sequence       base            min_len    max_len    avg_len
16017817        2402672550      150        150        150.00

序列随机抽取:

seqtk sample   EAOA2.fq.gz  200 | gzip >200.fq.gz
seqtk sample   EAOA2.fq.gz  200000 | gzip >200000.fq.gz

seqtk_names: 快速获取序列的名字列表:

seqtk_names  200.fq.gz >200.txt
seqtk_names  200000.fq.gz >200000.txt

seqtk_subseq: 交集/补集

seqtk_subseq 命令行接口:

$ seqtk_subseq
Usage: seqtk_subseq [options] <in.fa> <name.list>
Options:
  -s INT     subset option: 1 intersection, 0 supplementary set, default: 1
  -v         print version number

使用-s 可选参数, 默认是交集模式:

$seqtk_subseq  EAOA2.fq.gz  200.txt
$seqtk_subseq -s 1 EAOA2.fq.gz  200.txt

前面两个命令是等价的, 补集模式:

seqtk_subseq -s 0 EAOA2.fq.gz  200.txt

另外: 命令行模式程序支持 streaming 模式可以提供很大便利, 比如我获取 文件 200.fq.gz 和 EAOA2.fq.gz文件的交集, 那么:

cat  200.fq.gz | seqtk_names - | seqtk_subseq  -s 1  EAOA2.fq.gz  -  | gzip  > intersection.fq.gz

就可以很愉快的解决问题;

二、其它相关工具介绍

前面两个小工具是本着一个程序解决一个问题的原则来实现的, 下面要介绍两个瑞士军刀类工具: seqtkseqkit

seqtk工具

seqtk 是 klib 作者 liheng 实现的处理序列文件的工具集合,下面是根据各种接口:

Usage:   seqtk <command> <arguments>
Version: 1.2-r94

Command: seq       common transformation of FASTA/Q
         comp      get the nucleotide composition of FASTA/Q
         sample    subsample sequences
         subseq    extract subsequences from FASTA/Q
         fqchk     fastq QC (base/quality summary)
         mergepe   interleave two PE FASTA/Q files
         trimfq    trim FASTQ using the Phred algorithm

         hety      regional heterozygosity
         gc        identify high- or low-GC regions
         mutfa     point mutate FASTA at specified positions
         mergefa   merge two FASTA/Q files
         famask    apply a X-coded FASTA to a source FASTA
         dropse    drop unpaired from interleaved PE FASTA/Q
         rename    rename sequence names
         randbase  choose a random base from hets
         cutN      cut sequence at long N
         listhet   extract the position of each het

seqtk的subseq可以解决这个问题:

seqtk subseq EAOA2.fq.gz  200.txt >/dev/null

但是只能求交集模式, 很多时候我们需要补集模式

seqkit 工具

下面是seqkit的命令接口, 相当丰富,可以解决很多很多问题

SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 0.7.1-dev

Author: Wei Shen <shenwei356@gmail.com>

Documents  : http://bioinf.shenwei.me/seqkit
Source code: https://github.com/shenwei356/seqkit
Please cite: https://doi.org/10.1371/journal.pone.0163962

Suggestion : Install pigz to gain better parsing performance for gzipped data

Usage:
  seqkit [command]

Available Commands:
  common          find common sequences of multiple files by id/name/sequence
  concat          concatenate sequences with same ID from multiple files
  convert         convert FASTQ quality encoding between Sanger, Solexa and Illumina
  dup             duplicate sequences N times
  faidx           create FASTA index file
  fq2fa           convert FASTQ to FASTA
  fx2tab          convert FASTA/Q to tabular format (with length/GC content/GC skew)
  genautocomplete generate shell autocompletion script
  grep            search sequences by pattern(s) of name or sequence motifs
  head            print first N FASTA/Q records
  help            Help about any command
  locate          locate subsequences/motifs
  range           print FASTA/Q records in a range (start:end)
  rename          rename duplicated IDs
  replace         replace name/sequence by regular expression
  restart         reset start position for circular genome
  rmdup           remove duplicated sequences by id/name/sequence
  sample          sample sequences by number or proportion
  seq             transform sequences (revserse, complement, extract ID...)
  shuffle         shuffle sequences
  sliding         sliding sequences, circular genome supported
  sort            sort sequences by id/name/sequence/length
  split           split sequences into files by id/seq region/size/parts
  stats           simple statistics of FASTA/Q files
  subseq          get subsequences by region/gtf/bed, including flanking sequences
  tab2fx          convert tabular format to FASTA/Q format
  version         print version information and check for update

Flags:
      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
  -h, --help                            help for seqkit
      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
      --id-regexp string                regular expression for parsing ID (default "^([^\\s]+)\\s?")
  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)
  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
      --quiet                           be quiet and do not show extra information
  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

Use "seqkit [command] --help" for more information about a command.

seqkit 执行提取序列

seqkit  common -j 8  EAOA2.fq.gz    200000.fq.gz  > /dev/null
seqkit  grep -f <(seqkit seq -n -i 200000.fq.gz) EAOA2.fq.gz

seqkit在使用多线程情况可以很大程序提高执行效率。

三、比较和总结

seqtk_subseq 使用klib 实现, 采用了很简单的Hash装载序列集合,然后遍历序列文件,所以对查询序列多少不是很敏感。

下面简单测试下几种情况下执行效率:

seqtk_subseq 版本

time seqtk_names 200000.fq.gz |  seqtk_subseq EAOA2.fq.gz  -  >/dev/null

real    0m30.265s
user    0m30.076s
sys     0m0.245s

time seqtk_names EAOA2.fq.gz  | seqtk_subseq  200000.fq.gz -  >/dev/null

real    0m37.530s
user    0m42.271s
sys     0m1.330s

time seqtk_names <(pigz -dk -c -p 2 200000.fq.gz) | \
seqtk_subseq  <(pigz -dk -c -p 2 EAOA2.fq.gz) -  >/dev/null

real    0m20.460s
user    0m10.840s
sys     0m1.440s    

seqtk subseq 版本

time seqtk_names 200000.fq.gz |  seqtk subseq EAOA2.fq.gz  -  >/dev/null

real    0m31.372s
user    0m31.201s
sys     0m0.242s

time seqtk_names EAOA2.fq.gz  | seqtk subseq  200000.fq.gz -  >/dev/null

real    0m38.925s
user    0m46.647s
sys     0m1.575s

time seqtk_names <(pigz -dk -c -p 2 200000.fq.gz) | \
seqtk subseq <(pigz -dk -c -p 2 EAOA2.fq.gz)  -  >/dev/null

real    0m20.082s
user    0m11.580s
sys     0m1.344s

seqkit common 版本

单线程模式(pigz)

time  seqkit  common   EAOA2.fq.gz    200000.fq.gz  > /dev/null
[INFO] read file: EAOA2.fq.gz
[INFO] read file: 200000.fq.gz
[INFO] find common seqs ...
[INFO] 200000 unique sequence IDs found in 2 files, which belong to 200000 records in the first file: EAOA2.fq.gz
[INFO] extract seqs from the first file: EAOA2.fq.gz

real    2m2.577s
user    2m42.880s
sys     0m16.566s

time  seqkit  common  200000.fq.gz EAOA2.fq.gz  > /dev/null
[INFO] read file: 200000.fq.gz
[INFO] read file: EAOA2.fq.gz
[INFO] find common seqs ...
[INFO] 200000 unique sequence IDs found in 2 files, which belong to 200000 records in the first file: 200000.fq.gz
[INFO] extract seqs from the first file: 200000.fq.gz

real    1m41.730s
user    2m43.950s
sys     0m12.143s

多线程模式(pigz)

time  seqkit  common -j 20  EAOA2.fq.gz    200000.fq.gz  > /dev/null
[INFO] read file: EAOA2.fq.gz
[INFO] read file: 200000.fq.gz
[INFO] find common seqs ...
[INFO] 200000 unique sequence IDs found in 2 files, which belong to 200000 records in the first file: EAOA2.fq.gz
[INFO] extract seqs from the first file: EAOA2.fq.gz

real    1m35.691s
user    3m17.335s
sys     0m16.924s

time  seqkit  common -j 20  200000.fq.gz  EAOA2.fq.gz  > /dev/null
[INFO] read file: 200000.fq.gz
[INFO] read file: EAOA2.fq.gz
[INFO] find common seqs ...
[INFO] 200000 unique sequence IDs found in 2 files, which belong to 200000 records in the first file: 200000.fq.gz
[INFO] extract seqs from the first file: 200000.fq.gz

real    1m1.913s
user    2m34.953s
sys     0m13.042s

seqkit grep 模式(pigz)

time  seqkit  grep -f <(seqkit seq -n -i 200000.fq.gz) EAOA2.fq.gz  > /dev/null

real    0m22.403s
user    0m16.333s
sys     0m3.403s

time  seqkit  grep -f <(seqkit seq -n -i EAOA2.fq.gz) 200000.fq.gz  > /dev/null

real    0m28.713s
user    0m42.685s
sys     0m6.438s

seqkit grep 模式(gzip)

time  seqkit  grep -f <(seqkit seq -n -i 200000.fq.gz) EAOA2.fq.gz  > /dev/null

real 0m41.321s
user 0m14.727s
sys  0m3.402s

time  seqkit  grep -f <(seqkit seq -n -i EAOA2.fq.gz) 200000.fq.gz  > /dev/null

real 0m49.428s
user 0m47.589s
sys  0m7.809s

综上: seqkit grep 模式 为做交集最佳实践 (系统需要安装pigz), 使用小文件进行查找性能更佳, 但是综上结果显示各个软件最佳状态差距没有太大,那就随意选吧,选最方便/最喜欢的那个。

update:

综上: 从seqkit在gzip和pigz模式执行效率看,性能还是比较依赖pigz, 当然C实现seqtk subseq 或者 seqtk_subseq 都在pigz解压(线程设置为2)时,seqkit性能还是没有超过seqtk subseq 或者 seqtk_subseq, 不过接口更干净。

不过今天要提到的重点是同时支持交集和补集功能的 seqtk_subseq, pgzip 还是可以提供很大的优势。

写本文档过程中和seqkit作者沈伟(Github)进行了有效沟通,再次表示感谢。

本文材料为 BASE (Biostack Applied bioinformatic SEies ) 课程 Linux Command Line Tools for Life Scientists 材料, 版权归 上海逻捷信息科技有限公司 所有

Last Upate: 2017/9/16 22:00

Comments are closed.