Biostack.ORG starts building its community

Follow us: Biostack.ORG, a professional bioinformatics community



biostack Weekly | 2015-07-19 (Issue 8)

13’th July, 2015
23:26 RT @DrJCThrash Automated and accurate estimation of gene family abundance from shotgun metagenomes w/ @phylogenomics @tjsharpton

14’th July, 2015
18:14 RIG: Recalibration and Interrelation of Genomic Sequence Data with the GATK
19:30 Bandage: interactive visualization of de novo genome assemblies
19:35 Good laboratory practice for clinical next-generation sequencing informatics pipelines
21:13 RT @moorejh RT @craigbrownphd MIT proves flash is as fast as RAM, and cheaper, for #bigdata: #technology
21:22 RT @BioMickWatson Cpipe appears to be GATK best practices implemented in Bpipe:
21:34 RT @gilbertjacka @hollybik talking about for data visualization – which uses colors and shapes to make data more accessible. #EEGen15
22:20 RT @Carlybacter Stringtie – new #transcriptome assembly from @StevenSalzberg1 lab #EEGen15
22:35 RT @genetics_blog Accelerating Scientific Publication in Biology
22:36 RT @genetics_blog The bacterial pangenome as a tool for analyzing pathogenic bacteria (review)
22:38 RT @DrSLJ38 Read these blogs #IDRNgenomics
22:42 RT @genetics_blog quantro: a data-driven approach to guide the choice of an appropriate normalization method

15’th July, 2015
22:04 Metagenomics of toilet waste from long distance flights
22:21 RT @moorejh #machinelearning #datascience RT @newsycbot A Step by Step Backpropagation Example
23:06 Cpipe: a shared variant detection pipeline designed for diagnostic settings
23:27 MetaPathways v2.5: quantitative functional, taxonomic and usability improvements
23:27 Hyperscape: visualization for complex biological networks
23:29 Investigating microbial co-occurrence patterns based on metagenomic compositional data
23:29 Correcting Illumina data

16’th July, 2015
07:53 RT @kc31958 RW: “First genome of any species should be of highest quality possible”. #UGMAsia @Pacbio
21:28 RT @pathogenomenick #mgen journal launched! Read the interview with Stanley Falkow in the ‘standing on the shoulders of giants’ section:
21:28 Big Data: Astronomical or Genomical?

21:33 RT @metagenomic_lit MBBC: an efficient approach for metagenomic binning based on clustering.
22:08 RT @andrewjpage Gubbins is now available on #homebrew
22:45 RT @assemblathon IEEE Xplore Abstract – SWAP-Assembler 2: Scalable Genome Assembler towards Millions of Cores

17’th July, 2015
22:07 RT @JavaScriptDaily Fundamental Node.js Design Patterns:
22:16 RT @genetics_blog CSHL protocols: RNA Sequencing & Analysis
22:21 How to Succeed at Clinical Genome Sequencing
23:15 RT @HeathrTurnr New #Rpackage UpSetR provides alternative to Venn diagrams, with optional plots highlighting specific intersections.

18’th July, 2015
07:53 RT @KevinADavies Lee Hood, who knows a thing or two about DNA sequencing, predicts the $100 genome in 5-8 years using synthetic nanopores #evolseq
08:39 RT @bioinformer Roche’s 454 Sues Thermo Fisher’s @IonTorrent for Patent Infringement | GenomeWeb @rochesequencing #genomics #biotech
10:06 RT @Amazing_Maps Visualizing city densities –
10:07 RT @KevinADavies Bill Efcavitch: according to @thermofisher still some 15,000 Sanger instruments running today (out of ~30,000 install base). #evolseq
10:23 RT @nanopore You can now join the queue for the PromethION Access Programme
10:36 RT @pathogenomenick ete2 – where have you been all my life, quite the best way to annotate trees!
21:06 RT @EricTopol Nice that the human interactome is so simple (not!) @CellCellPress
22:38 RT @DrJCThrash Investigating microbial co-occurrence patterns based on metagenomic compositional data

19’th July, 2015
10:56 Maize pan-transcriptome provides novel insights into genome complexity and quantitative trait variation | bioRxiv
12:58 RT @pypi_updates metameta Toolkit for analyzing meta-transcriptome/metagenome mapping data
13:02 RT @KantorKantor Impact of the Gut Metagenome on Autoimmunity
19:03 RT @koadman Alm describing Smillie’s Strainfinder2 software to predict strains from metagenome & a close reference genome #UrbanGenome

biostack Weekly | 2015-07-12 (Issue 7)

6’th July, 2015
22:39 RT @moorejh Don’t Get Your Kids’ Genes Sequenced Just To Keep Up #DNA #genomics #ethics
23:00 RT @Strep_papers WGS accurately predicts antimicrobial resistance in Escherichia coli.
23:12 RT @biorxivpreprint SPARTA: Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis

7’th July, 2015
23:40 RT @kbradnam Still seeing lots of traffic for this #ACGT post: My favorite bioinformatics blogs of 2014
23:40 RT @kbradnam Some more bioinformatics blogs are listed here:
23:42 RT @uBiome Announcing the uBiome $100k Microbiome Grant Contest!

8’th July, 2015
23:13 A robust approach for identifying differentially abundant features in metagenomic samples
23:17 IVA: accurate de novo assembly of RNA virus genomes

9’th July, 2015
06:16 CoreGenomics: MinION for 16S rRNA-seq
08:26 LayerCake: a tool for the visual comparison of viral deep sequencing data
23:29 RT @AllSeq MT @Exome_seq: Exome Sequencing: Current and Future Perspectives. #OA

10’th July, 2015
08:12 RT @phylogenomics New #microBEnet blog post: The Baby-Associated Built Environment (BABE) Microbiome Project
08:14 RT @Bioconductor IONiseR Quality Assessment Tools for Oxford Nanopore MinION data
08:15 RT @genetics_blog Genome Modeling System: A Knowledge Management Platform for Genomics
08:16 RT @mcclure111 Programming is a constant struggle between “What is harder: Implementing it myself, or understanding the existing, undocumented solution?”
08:17 RT @genetics_blog Toward effective software solutions for big biology
08:17 RT @pathogenomenick A stunningly comprehensive report Putting Pathogen Genomics into Practice from @PHGFoundation much to discuss
11:58 RT @EricTopol The 7 nanometer chip was projected to be developed by 2018, but announced today @IBM
13:08 RT @MicrobiomDigest Designer microbiome: MIT biologists program common gut bacteria @LATerynbrown #Btheta
13:29 RT @DrKatHolt The Lung Microbiome: New Principles for Respiratory Bacteriology in Health and Disease @robertpdickson @PLOSPathogens
15:21 RT @pathogenomenick ICYMI: a landmark publication on translational pathogen genomics @leilaluheshi @c_rands et al @PHGFoundation
21:05 The myths of bioinformatics software
21:43 RT @MicrobiomDigest Here, there, and everywhere: DNA contaminants creep in from the most unlikely places – Karl Gruber – EMBO Reports
21:45 RT @chandanpal143 Useful databases and resources for microbiologists. #database #bioinformatics
21:55 RT @jgoecks Slides from 2015 #usegalaxy visualization workshop:
21:59 RT @BioCyc Freely Available Metabolic and Genome Posters from BioCyc
22:06 RT @homolog_us Will I Use Kallisto? Definitely, Most Likely and Never ( @pmelsted @lpachter
22:44 RT @RNASeqBlog #RNAMiner – A Bioinformatics Protocol for Mining Large #RNASeq Data – @mizzouengineer

11’th July, 2015
13:28 Omics! Omics!: Clinical Metagenomics Pipelines: Revisiting & Reflecting
13:28 Omics! Omics!: Leaky clinical metagenomics pipelines are a very serious issue
13:35 Good laboratory practice for clinical next-generation sequencing informatics pipelines
14:40 RT @BorisAdryan Preliminary. Please get in touch if you are going to the @SoftwareSaved Collabo Workshop & want to work on @bionode:
15:32 RT @MrGreenify Happy to announce that BioJS 3 will be using @polymer and @Web_Components for viz in science. Get involved @ #biojs15
15:41 RT @analyticbridge Six categories of Data Scientists
15:57 RT @GigaScience UMLS, it’s not ANOTHER ontology, but an integration of related phenotype ontologies #ismb2015 #bioont15
16:01 Identification of protein coding regions in RNA transcripts
16:11 SplicePie: a novel analytical approach for the detection of alternative, non-sequential and recursive splicing
16:27 RT @froggleston Fantastic keynote this morning at #bosc2015 from @hollybik, reminding bioinformaticians that biologists exist! :)
16:28 RT @biocrusoe Slides for my #BOSC2015 talk “Portable workflow and tool descriptions with the CWL”
16:32 RT @michaelhoffman Now: Sebastian Schoenherr: Bringing #Hadoop into Bioinformatics with Cloudgene and CloudMan. #bosc2015
17:33 RT @sjackman QUAST 3 is released! Faster (5x–100x running w/o reference; parallel with large references) brew upgrade quast
17:46 RT @DickHardt My first @kickstarter has launched. Desktop Container Computer to run @docker containers

RT @ianmclean VisPy. GPU accelerated datavis for Python
20:27 RT @radaniba Set up Kubernetes with a Docker compose one-liner
20:30 RT @jaredtobin Abstract Syntax Graphs for DSLs. Really exceptional read.
20:42 RT @genetics_blog FCMM: A comparative metagenomic approach for functional characterization of multiple metagenome samples
23:15 RT @DunhamLab One neat thing about paper is it was a nice bonus from our mixed yeast metagenome data: lots of yeast HiCs for cheap
23:16 RT @gedankenstuecke @OfficialSMBE Potential and pitfalls of eukaryotic metagenome skimming: A test case for lichens.
23:33 RT @LJDishaw Metagenome Sequencing of the Hadza Hunter-Gatherer Gut Microbiota: Current Biology

12’th July, 2015
11:30 Accelerating Scientific Publication in Biology | bioRxiv
12:02 RT @SebastienBadia I’m pretty sure this is an accurate diagram of Openstack’s architecture.
13:21 RT @geetduggal Thanks @erikgarrison for your cpp VCF parser so I don’t have to :) tip hat to @nomad421
14:32 Screening Currency Notes for Microbial Pathogens and Antibiotic Resistance Genes Using a Shotgun Metagenomic Approach
16:06 RT @koitaxoumemesa my new paper re: #microbiome & #metagenome of nursing & weaning transition in pigs in @MicrobiomeJ w @davidamills
16:54 Sharing and executing linked data queries in a collaborative environment
17:34 RT @BangGoesSci The Sum of Our Parts. Fascinating article on the importance of the #microbiome in immunity
18:02 Sequence Your Microbiome
19:58 Indian currency notes harbour antibiotic-resistant genes: study
20:13 RT @HPDIHealth The advantages of soil based organisms for health. #probiotics #microbiome

biostack Weekly | 2015-07-05 (Issue 6)

29’th June, 2015
13:19 RT @lexnederbragt Getting the most out of RNA-seq data analysis #peerJPreprint
14:57 CanvasXpress is a standalone HTML5 graphing library written in Javascript to explore complex data sets.
16:57 RT @BI0graphika #Antibiotic Resistance explorer, a reactive #datavis built with #dcjs + #crossfilter #d3js #bootstrap #bio4j
22:23 Prioritizing risks of antibiotic resistance genes in all metagenomes : Nature Reviews Microbiology
23:11 RT @homolog_us @lpachter @pmelsted @yarbsalocin Thanks. How do I tell the index command about multiple isoforms?

30’th June, 2015
07:21 Assembly and diploid architecture of an individual human genome via single-molecule technologies
07:26 RT @klmr #Modules now supports regular #Rstats packages and treats them as modules—with all the perks!
07:50 RT @shenderson Immutable serialisable data structures in C++11 | Victor Laskin’s Blog #c++ #cpp
20:33 Go: Slice search vs map lookup
22:49 BGT: efficient and flexible genotype query across many samples arxiv: Github:
22:52 Optimal Seed Solver: Optimizing Seed Selection in Read Mapping
23:11 RT @torstenseemann @BioMickWatson @crashfrog yes, we use 1 cell per genome roughly, which is 600 to 1000 Mbp so about 100x and no need for illumina
23:16 RT @homolog_us Tools or Algorithms, What is More Important in Bioinformatics? – The Answer Will Shock You (
23:16 RT @mgymrek My thoughts on the PrediXcan method for using transcriptome data to map complex traits:
23:23 RT @EduEyras Optimising Transcriptome Assembly

1’st July, 2015
08:18 Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection
10:04 Points of View: Unentangling complex plots
18:59 ClustVis: a web tool for visualizing clustering of multivariate data using Principal Component Analysis and heatmap
19:17 RT @dhimmel Introducing data on publication and acceptance delays for 3,482 journals
19:25 RT @michaelwaskom seaborn 0.6 is out! Take a look at the release notes to find out what’s new:
19:52 RT @yokofakun Nucleic Acids Research Web Server issue 2015
19:58 RT @goKarumi Useful Python Libraries for Startups
19:59 RT @pbailis Phil Bernstein et al.’s textbook is still one of the best books on the theory of serializability (and it’s free):

3’rd July, 2015
22:03 RT @elixirpipe Elixir in times of microservices by @josevalim – #elixirlang

4’th July, 2015
06:40 RT @genetics_blog Bandage: interactive visualisation of de novo genome assemblies
06:42 RT @DaleYuzuki RT @Y_Gilad: Another attempt to use F1000 platform for a refutation study. By @StevenSalzberg1
11:55 ERGC: An efficient referential genome compression algorithm
21:37 RT @biocrusoe TIL: all NCBI code is in a public SVN repository cc @pjacock
22:33 RT @ElixirTip Here is a great article on mutable value chains by @joeerl.
22:33 RT @joakimk Really good intro to streams in general and @elixirlang streams in particular: by @drewolson.
22:35 RT @PeterCMarks Multi-process flow-based programming in @elixirlang : #myelixirstatus
22:48 RT @erikgarrison visualizing graph alignments using vg view -dA and graphviz
22:53 RT @genetics_blog How Many Genes Are Expressed in a Transcriptome?

5’th July, 2015
13:39 Blasted Bioinformatics!?: NCBI working on SAM output from BLAST+
14:48 RT @ozgurakgun this is indeed how most programs work on a multicore computer!
15:07 RT @smllmp …and finally pipeline parallelism! Been looking for a long time! Paves way for #flowbased in #elixirlang? #EUC2015

biostack Weekly | 2015-06-28 (Issue 5)


21’st June, 2015

15:23 RT @MicrobiomDigest Biodiversity and distribution of polar freshwater DNA viruses – Daniel Aguirre de Cárcer @ScienceAdvances #virome
15:24 RT @MicrobiomDigest Metagenomic analysis of #microbiome in colon tissue from subjects with #IBD reveals interplay of viruses and bacteria
15:31 RT @DataScienceCtrl Data Science Wars: R versus Python
15:31 RT @DataScienceCtrl Comprehensive list of data science resources
15:57 Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees , bloomtree
16:13 RT @laurencerowe CFFI and #pypy working well. Seeing a 3-4x speedup over CPython/pysam with @brent_p’s htslib wrapper –
17:03 RT @yokofakun two distinct projects: Workflow Description Language vs Common Workflow Language
17:39 RT @guillemch Workflow tool makers: Let us define the data flow, not task dependencies

22’nd June, 2015

15:49 Scientific Luigi :Extra helper methods for writing scientific workflows in Luigi
15:49 RT @MicrobiomDigest 16S rRNA gene-based profiling of infant gut microbiota strongly influenced by sample processing & PCR primer choice
17:40 Python, Ruby, and Golang: A Command-Line Application Comparison – Real Python
19:08 Go development environment for Vim
19:09 RT @vivainio Thought: languages benefiting most from #webassembly are the ones with precise memory management & no GC. C++, #rustlang, what else?

23’rd June, 2015
07:55 RT @jaredtsimpson .@ljdursi on the tricky problem of merging structural variation calls (with code!):
07:55 RT @benjaminlmoore GenomeD3Plot: A library for rich, interactive visualizations of genomic data in web applications #d3js
07:56 RT @mason_lab The impact of read length on quantification of differentially expressed genes and splice junction detection.
07:57 RT @DrJCThrash Bayesian mixture analysis for metagenomic community profiling
07:57 RT @JC_pathogenomic NEW: Assembling Short Reads from Jumping Libraries with Large Insert Sizes.

24’th June, 2015
07:48 From trainee to tenure-track: ten tips | Genome Biology | Full Text

25’th June, 2015
08:10 MetaQuery: a web server for rapid annotation and quantitative analysis of specific genes in the human gut microbiome
08:11 The problem with P values: defining clinical vs. statistical significance – Public Health
08:11 BASiCS: Bayesian Analysis of Single-Cell Sequencing Data
08:13 The changing form of Antarctic biodiversity : Nature : Nature Publishing Group
08:16 Comprehensive identification and analysis of human accelerated regulatory DNA
08:17 DT: An R interface to the DataTables library
08:18 d3heatmap: Interactive heat maps
08:49 Diving into Genetics and Genomics: RPKM/FPKM, TPM and raw counts for RNA-seq
11:38 Hyperscape: Visualization for complex biological networks
11:40 RT @GaetanBurgio Personal Microbiome code bar! ‘Identifying personal microbiomes using metagenomic codes’ PNAS
11:47 Assembling short reads from jumping libraries with large insert sizes
11:47 QVZ: lossy compression of quality values
11:47 ACE: accurate correction of errors using K-mer tries
14:13 Metabolic network modeling of microbial communities
22:57 Virulence genes are a signature of the microbiome in the colorectal tumor microenvironment
22:57 Gut resistome development in healthy twin pairs in the first year of life

26’th June, 2015
14:48 TRAL: tandem repeat annotation library
18:03 RT @RobLanfear Why should you open source your code? Because sometimes smart folks point out you could do it 100x faster…
22:23 rSeqNP: a non-parametric approach for detecting differential expression and splicing from RNA-Seq data
22:23 RAPTR-SV: a hybrid method for the detection of structural variants
22:24 Annocript: a flexible pipeline for the annotation of transcriptomes able to identify putative long noncoding RNAs
22:24 Normalization and noise reduction for single cell RNA-seq experiments
22:24 Oasis: online analysis of small RNA deep sequencing data
23:01 On enhancing variation detection through pan-genome indexing | bioRxiv
23:19 Command-line Bootcamp

28’th June, 2015
12:52 RT @MarkLittlewood The slides from @billjaneway #BoS2015 talk on Productive Bubbles & Unicorn Bubbles.
20:36 DISTMIX: direct imputation of summary statistics for unmeasured SNPs from mixed ethnicity cohorts
20:37 ISQuest: Finding Insertion Sequences in Prokaryotic Sequence Fragment Data
21:20 RT @biorxivpreprint TransRate: reference free quality assessment of de-novo transcriptome assemblies
21:25 Salmon: Accurate, Versatile and Ultrafast Quantification from RNA-seq Data using Lightweight-Alignment
21:41 RT @AndrewJesaitis Hopefully the Common Workflow Language can solve the interoperability bottleneck in science (esp Bioinformatics)
21:42 RT @ctitusbrown Features of #CommonWL: dataflow; streaming; scatter/gather for parallelism; built-in docker support.
21:43 RT @RLangTip Extracting data from PDFs with #rstats. Vector Image Processing (PDF)
21:54 RT @kbradnam NCBI BLAST+ v2.2.31 released #ACGT
21:54 RT @fulhack A Beginners Guide to Building Data Pipelines with Luigi – by @Dylan_Barth
22:15 RT @nanopore Interested in joining the PromethION Early Access Programme (PEAP)? Find out more:
22:24 RT @JCVenter At Human Longevity we are currently sequencing 36,000 genomes/year, scaling up to 100,000 genomes/year

biostack Weekly | 2015-06-21 (Issue 4)

Twitter list

14’th June, 2015
08:56 RT @sjackman Various programs used for the assembly of @PacBio data. Anyone have corrections or additions for me? @infoecho
09:12 RT @pathogenomenick Frequency-based haplotype reconstruction from deep sequencing data of bacterial populations
12:40 RT @sjackman UniqTag: My paper for assigning k-mer IDs to genes using a minimizer has been published in @PLOSONE. R package & Ruby
12:49 NxTrim: optimized trimming of Illumina mate pair reads
22:10 Bridging the knowledge gap: from microbiome composition to function
22:18 Bespoke diets based on gut microbes could help beat disease and obesity

15’th June, 2015
18:22 Detecting somatic mutations in genomic sequences by means of Kolmogorov-Arnold analysis

16’th June, 2015
07:54 The impact of Docker containers on the performance of genomic pipelines
23:22 Error Tree: A Tree Structure for Hamming & Edit Distances & Wildcards Matching
23:24 Ultra-large alignments using phylogeny-aware profiles
23:24 Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA
23:25 MetaPathways v2.5: Quantitative functional, taxonomic, and usability improvements

17’th June, 2015
08:03 35 invaluable books on Data Visualization
08:55 Madagascar open-source software project for multidimensional data analysis and reproducible computational experiments
08:56 BitMapper: an efficient all-mapper based on bit-vector computing
18:00 Can we change our microbiome to prevent colorectal cancer development?, Acta Oncologica, Informa Healthcare
20:33 RT @MicrobiomDigest Newfound groups of bacteria are mixing up the tree of life #JillBanfield @UCBerkeleyNews
20:46 GTEx publishes results from RNA-Seq pilot study | RNA-Seq Blog
21:01 Investigating microbial co-occurrence patterns based on metagenomic compositional data
21:10 FM-index for dummies
21:13 Linear-Time Sequence Comparison Using Minimal Absent Words & Applications
21:58 RT @torstenseemann With @nanopore MkI flow cells at $900 (500 Mbp) I think @pacbio is safe for a while until MkII flowcell without electronics on it is out.
22:01 RT @hadleywickham dplyr 0.4.2 is now out: Now preserves attributes (so works with haven) + lots of crash fixes #rstats
22:02 RT @lexnederbragt New blog post: Developments in high throughput #sequencing – June 2015 edition (aka, where to put the MinION)
22:03 RT @dzerbino @jtleek @aaronquinlan @michaelhoffman WiggleTools reads BAM to Wig:
22:07 RT @JonBadalamenti For those keeping score: @PacBio’s in-house cluster has 16 nodes, 48 cores/node, 256 GB RAM, Lustre 2.1.2 HPFS #PBUGM @UMNmsi
22:13 RT @JonBadalamenti Justin Zook showing awesome results from @aphillippy @sergekoren’s MHAP doing human genome assemblies #PBUGM
22:21 RT @genetics_blog Unusual biology across a group comprising >15% of domain Bacteria via @MicrobiomDigest
22:25 RT @researchingsf MY: First draft assembly of tetraploid coffee arabica: 91-95% complete at >60X coverage, longest reads >65kb, longest contig >4Mb #pbugm
22:27 RT @PacBio Kevin Corcoran kicks off user group meeting, 100+ types of plants/animals genomes sequenced w PacBio, your genome deserves long reads #PBUGM
22:29 RT @PacBio LH: Coming soon! New barcodes for SMRT Sequencing, set of 384 barcodes, comprised of 16 bp each, flexibility for PCR primer design #PBUGM
22:54 RT @vivekbhr @jtleek @aaronquinlan @michaelhoffman Try deeptools.

18’th June, 2015
07:40 What’s the difference between Causality and Correlation?
08:03 Mining the microbial dark matter
08:22 Re-booting the human gut : Wyss Institute at Harvard
08:45 Starcode: sequence clustering based on all-pairs search
13:12 [1506.05185] CARGO: Effective format-free compressed storage of genomic information
22:25 RT @jscott_toronto Fungi to blame for fatal Berkeley balcony collapse, via @nytimes
22:41 RT @abremges Automated Contamination Detection in Single-Cell Sequencing by @l0x et al.
22:42 RT @DrKatHolt Plot trees & annotation (incl BEAST, time axis) with ggtree for R! Well spotted @mackas21
22:53 RT @infoecho FALCON PacBio Reads Assembler Internal, pipeline, directories, command line tools, scripts, inputs. outputs…
22:57 RNASequel: accurate and repeat tolerant realignment of RNA-seq reads
23:07 Dispersing misconceptions and identifying opportunities for the use of ‘omics’ in soil microbial ecology

19’th June, 2015
08:26 Compact graphical representation of phylogenetic data and metadata with GraPhlAn
08:31 RT @mikethemadbiol A Question About Oxford Nanopore and the Cost Model: This is the first mention of how much the much-vaunted Ox…
08:32 RT @researchingsf KFA: Combining ~7.5M @PacBio long reads + 183M short reads = 35 fusion genes, 56 fusion sites, 30 highly expressed fusion isoforms #pbugm
21:41 RT @BioMickWatson Complete Genomics Revolocity and the future of genome sequencing
21:47 Genomics & Transcriptomics hold key to understanding ecological and evolutionary processes
21:47 Informing the Design of Direct-to-Consumer Interactive Personal Genomics Reports

20’th June, 2015
08:32 Bermuda: Bidirectional de novo assembly of transcripts with new insights for handling uneven coverage
18:28 RT @hjpimentel Reproducibility? We’ve got that. kallisto analysis from preprint snakemake + knitr @yarbsalocin @pmelsted @lpachter
21:51 Workflow management software for pipeline development in NGS
22:25 RT @yokofakun BioShaDock registry “A #Bioinformatics Shared #Docker registry” ( via @Yvan2935 )
22:36 7 Tools for Data Visualization in R, Python, and Julia
22:44 Efficient development workflow using Git submodules and Docker Compose
22:45 RT @justinlivi Excited to try this development workflow with Docker Compose via @airpair
23:19 RT @mikaelhuss JD: Queue: parallelizes workflows, robust (reruns failed jobs), traceable, deletes intermediate files, reusable components #einframps2015
23:22 RT @mikaelhuss KH: Future: porting all tools to Spark. Parallelizing variant callers. Interactive ad hoc queries of big data in genomics. #einframps2015
23:29 RT @pathogenomenick The microbiome of the human lower airways: a next generation sequencing perspective
23:30 RT @torstenseemann Trim Galore! – auto detects which sequencing adaptors you used – by @FelixNmnKrueger #bioinformatics
23:37 Biographika: rich interactive data visualizations on the web for the research community
23:39 RT @yokofakun [delicious] CONSERTING: integrating cnv analysis with structural-variation detection #tweet: We developed Copy…

21’st June, 2015
11:45 LFQC: A lossless compression algorithm for FASTQ files
11:49 Bio4j: a high-performance cloud-enabled graph-based data platform | bioRxiv
14:38 Composition and temporal stability of the gut microbiota in older persons
14:48 RT @hammer_lab Introducing pileup.js, a Browser-based Genome Viewer (by @danvdk):

The impact of Docker container technology on genomic data analysis performance


The impact of Docker containers on the performance of genomic pipelines


Genomic pipelines consist of several pieces of third party software and, because of their experimental nature, frequent changes and updates are commonly necessary, thus raising serious distribution and reproducibility issues. Docker containers technology offers an ideal solution, as it allows the packaging of pipelines in an isolated and self-contained manner. This makes it easy to distribute and execute pipelines in a portable manner across a wide range of computing platforms. Thus the question that arises is to what extent the use of Docker containers might affect the performance of these pipelines. Here we address this question and conclude that Docker containers have only a minor impact on the performance of common genomic pipelines, which is negligible when the executed jobs are long in terms of computational time.



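Genomic data analysis pipelines often include a large number of third-party analysis tools; a typical mammalian genome project uses as many as 140+ tools and databases. Because bioinformatics tools are upgraded frequently, Docker container technology is well suited to bioinformatics analysis pipelines.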




v1, 2015/6/16 9:06

Minutes of: Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences

Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences

Journal: Nature Reviews Microbiology





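The small subunit ribosomal RNA gene is the familiar 16S rRNA gene of prokaryotes, that is, bacteria (Bac.) and archaea (Arc.). Research on 16S rRNA and the number of sequences in public databases have both grown rapidly, and there are now more than 4 million sequence entries. However, there is still no unified framework for the classification and nomenclature of bacteria and archaea.

In this Analysis article, the authors analyse the large volume of data in 16S rRNA sequence databases and compile the taxonomic nomenclature in use. From this they propose rational taxonomic boundaries for the high taxa of Bac. and Arc. based on 16S rRNA gene sequence identities, present an updated census of the kinds and numbers of bacteria and archaea, and set out the rationale for a stable hierarchical classification of high taxa that applies to both cultured and uncultured microorganisms. Their analysis also shows that only near full-length 16S rRNA sequences give an accurate estimate of taxonomic diversity. In addition, the existing data show that most current studies of environmental samples stop at the high taxonomic ranks.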


1、 It is remarkable that, although nearly 1.3 million eukaryotic species have been described so far, this number might represent only 20% of the richness that exists.

2、 Only ~11,000 bacterial and archaeal species have been classified so far. At the current rate of ~600 new descriptions per year, it has been estimated that it would take >1,000 years to classify all of the remaining species.

3、 Moreover, the full extent of their diversity is difficult to conceptualize, owing to the lack of objective criteria (such as numerical thresholds) for taxonomic circumscriptions of uncultured microorganisms, which are identified only by sequence data. Thus, estimates for the total number of bacterial and archaeal species vary widely, from 3 × 10^4 to ~10^12.

4、 Data growth in SILVA REF 114 between 2006 and 2012: The number of newly detected species (at 98.7% sequence identity) was about 4 × 10^4 taxa and the number of newly detected genera (at 94.5% sequence identity) was about 1.3 × 10^4 taxa per year over the past six years.

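5、 Regressing the rate of newly detected species and genera against the sequence data added each year shows that the rate falls steadily year on year (a clear linear decrease is observed); that is, the rate of detection of new genera and of new species may be close to zero by the end of 2015 and 2017, respectively. (Editor's note: this trend reflects only growth under current technology. As more unusual environments are explored and new sequencing technologies arrive, the picture may yet change unexpectedly; after all, “the world is so big, we should go and see it”.)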
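The extrapolation in note 5 can be illustrated with an ordinary least-squares line fitted to yearly detection rates and solved for the year where it reaches zero. This is only a minimal sketch: the function name `year_rate_hits_zero` and the toy numbers are made up for illustration and are not the paper's data or method.

```python
def year_rate_hits_zero(years, rates):
    """Fit rate = slope * year + intercept by least squares and
    return the year at which the fitted line crosses zero."""
    if len(years) != len(rates) or len(set(years)) < 2:
        raise ValueError("need matching lists with at least two distinct years")
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(rates) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, rates))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope >= 0:
        raise ValueError("rate is not decreasing, so there is no zero crossing ahead")
    return -intercept / slope


# Hypothetical values, NOT taken from the paper: a rate that drops by 0.1 per year.
toy_years = [2006, 2007, 2008, 2009, 2010, 2011, 2012]
toy_rates = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]
crossing = year_rate_hits_zero(toy_years, toy_rates)
```

As the editor's note points out, a straight-line extrapolation like this only holds while sampling and sequencing technology stay the same.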

6、 Although formal rules govern the naming of microorganisms, delineating taxa and their hierarchical classifications is a manual process that is inevitably subjective. Only the rank of species is circumscribed by a combination of well-accepted criteria, which include:

  1. DNA–DNA hybridization (DDH), with a threshold around 70%;
  2. Average sequence identities of shared genes (ANI), with a threshold of around 94–96%;
  3. 16S rRNA gene sequence identities, with a threshold of around 98.7%;
  4. In addition, genetic criteria should always be accompanied by a discriminant phenotypic property.
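The species criteria listed above can be read as a small decision rule. The sketch below is one hedged way to encode it. The class `PairwiseEvidence`, the function `same_species`, the use of the lower end (94%) of the ANI range, and treating a value exactly at the threshold as passing are all assumptions made for illustration, not something the paper prescribes.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds as quoted in the notes (all in percent).
DDH_MIN = 70.0   # DNA-DNA hybridization
ANI_MIN = 94.0   # assumption: lower end of the quoted 94-96% range
SSU_MIN = 98.7   # 16S rRNA gene sequence identity


@dataclass
class PairwiseEvidence:
    """Measurements for one pair of strains; None means not measured."""
    ddh: Optional[float] = None
    ani: Optional[float] = None
    ssu_identity: Optional[float] = None


def same_species(ev: PairwiseEvidence) -> Optional[bool]:
    """True if every available genetic measure supports one species,
    False if none does, None if there is no data or the measures conflict."""
    votes = []
    if ev.ddh is not None:
        votes.append(ev.ddh >= DDH_MIN)
    if ev.ani is not None:
        votes.append(ev.ani >= ANI_MIN)
    if ev.ssu_identity is not None:
        votes.append(ev.ssu_identity >= SSU_MIN)
    if not votes:
        return None
    if all(votes):
        return True
    if not any(votes):
        return False
    return None  # conflicting evidence needs a human decision


# Toy inputs, invented for illustration only.
clear_yes = same_species(PairwiseEvidence(ddh=85.0, ani=97.5, ssu_identity=99.4))
clear_no = same_species(PairwiseEvidence(ddh=40.0, ani=88.0, ssu_identity=97.0))
conflict = same_species(PairwiseEvidence(ani=97.0, ssu_identity=98.0))
```

The fourth criterion is deliberately left out of the code: a discriminant phenotypic property is a judgment that the genetic numbers cannot replace.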

7、 Genera are also recognized by their phylogenetic separation from other such groups and the possession of 16S rRNA gene sequence identities of >95%. There are no robust rules for the circumscription of ranks above genus. (Editor's note: yet the microbial community diversity studies that have grown rapidly with NGS, such as MiSeq-based 16S rDNA and ITS surveys and metagenomics, mostly work at the high taxonomic ranks.)

8、 The pre-classification of environmental sequences in a global and unified system may have many advantages, such as avoiding cumbersome taxonomic revisions and producing a stable taxonomic framework. (Editor's note: this sentence links the preceding discussion to what follows.)

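9、 Using the LTP 102 data from the SILVA database, the authors calculate sequence identity thresholds for genus and every higher rank (see Table 1). Because LTP records only one strain per species, a species-level threshold cannot be calculated. Note that these are all minimum thresholds, and the sequence identity between different taxa at a given rank may be higher than the calculated threshold, so it is best to consult other types of data as well, such as morphological, genetic and environmental (contextual) data.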
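Because the thresholds are minima, an identity value below a rank's threshold rules that rank out, while a value above it only makes the rank compatible. A minimal sketch of that lookup follows. The function name `most_specific_compatible_rank` is hypothetical, and the default table holds only the two values quoted in these notes (98.7% for species, 94.5% for genus); thresholds for higher ranks would have to be supplied from Table 1 of the paper, which is not reproduced here.

```python
# Only the two values quoted in these notes; higher ranks must be added
# by the caller from Table 1 of the paper.
QUOTED_THRESHOLDS = {"species": 98.7, "genus": 94.5}


def most_specific_compatible_rank(identity_pct, thresholds=QUOTED_THRESHOLDS):
    """Return the most specific rank whose minimum identity threshold
    the pair reaches, or None if it reaches none of them.

    The result is a candidate only: two sequences from different taxa
    can still exceed the minimum threshold for a rank."""
    # Check ranks from the strictest threshold to the loosest.
    for rank, minimum in sorted(thresholds.items(), key=lambda kv: kv[1], reverse=True):
        if identity_pct >= minimum:
            return rank
    return None


rank_a = most_specific_compatible_rank(99.0)
rank_b = most_specific_compatible_rank(96.0)
rank_c = most_specific_compatible_rank(90.0)
```

Passing a larger dictionary extends the same lookup to family, order, class and phylum without changing the function.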


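10、 The authors extract specific regions from the LTP 108 sequences to simulate how accurately the different identity thresholds estimate richness. The V1–V2 fragment is the one most strongly affected by the threshold. Meanwhile, analysis of the taxa recovery rate indicates a great underestimation of taxa richness when partial sequences are used. Although the situation tends to ameliorate as longer segments are considered, near full-length 16S rRNA gene sequences are required for accurate richness estimations and accurate classifications of high taxa. (Editor's note: why does the recovery rate for high taxa require longer sequences? Is it because the higher the rank, the lower the identity threshold between taxa and the blurrier the boundaries between them?)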

11、 In conclusion, the results indicate that near full-length SSU rRNA gene sequences are required for accurate richness estimations and accurate classifications of high taxa.

12、 The calculated identity thresholds are then applied to the data in SILVA REF 114. For the high taxa, the analyses provide evidence for a current census of at least 6.1 × 10^4 genera, 9.6 × 10^3 families, 4.2 × 10^3 orders, 2.2 × 10^3 classes and 1.3 × 10^3 phyla of bacteria and archaea. See Table 2.


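13、 The estimates show that the number of high taxa far exceeds what is currently described in LPSN. At the phylum level, for example, the estimate is (1356 − 84) ≈ 1300, whereas LPSN lists only 27. The main cause of this gap may be estimation bias from anomalous data such as amplification and sequencing errors and chimeras. Even allowing for the maximum possible effect of chimeras, however, the number of phyla only falls from about 1300 to about 1100.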

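14、 Examining the SILVA REF 114 data at three identity levels (98.7%, 99.0% and 99.5%) yields about 2.1 × 10^4 species-level OTUs, and the estimated global total of microbial species is about 4 × 10^5, far fewer than the earlier estimate of 2 × 10^6. This figure should also be treated with caution, because the estimation strategy simplifies many factors, such as data saturation, the proportion of low-abundance species and the sources of estimation bias.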

15、 To bring cultured and uncultured microorganisms under the same classification standard, the article proposes the concept of CTUs: a new biodiversity unit called the CTU (candidate taxonomic unit), which is compatible with the hierarchy that was established in the Bacteriological Code. The authors define a CTU as a biological entity that is delineated by a monophyletic set of sequences with a sequence identity that stays within, or very close to, the taxonomic threshold that is proposed for a given rank. (Editor's note: in other words, the unit of study is a monophyletic set of sequences that meets, or comes very close to, the identity threshold set for the rank in question.)


16、 CTU is conceived to be a combination of the OTU and OPU (operational phylogenetic unit), which are currently used by microbial ecologists and systematists.

17. How CTUs are identified: The recognition of CTUs starts with a preliminary classification based on OTUs (which is calculated using standard clustering algorithms with the different taxonomic thresholds that are identified in Table 1), the results of which guide the search for local meaningful phylogenetic clades in a reliable tree topology.

18. The CTU approach makes it easy to re-evaluate existing taxonomic descriptions. The authors use the phylum Spirochaetes as an example (by the end of 2013, the 112 described species had been organized into 16 genera, four families, one order and one class) and split the current phylum into five separate classes: Spirochaetes.Class1 (which comprises the genera Borrelia, Treponema, Sphaerochaeta and Spirochaeta), Spirochaetes.Class2 (which comprises the genus Exilispira), Spirochaetes.Class3 (which comprises the genus Brevinema), Spirochaetes.Class4 (which comprises the genus Brachyspira) and Spirochaetes.Class5 (which comprises the genera Leptospira, Leptonema and Turneriella).

19. CTUs can also be used to classify the diversity of environmental samples, especially the uncultured Bacteria/Archaea data now held in the databases. The authors evaluated 9,426 sequences selected from 11 phyla (6 clades + 8 candidate divisions) in Silva REF 108. It is remarkable that the total number of delineated CTUs (3,553 CTUs, distributed into 2,053 genera, 767 families, 411 orders, 240 classes and 82 phyla) was nearly identical to the total number of calculated OTUs (3,562 in total).

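20. The paper points out that the early so-called "environmental clades" were tied to the environmental context of the sequence data, that only a small fraction of sequences were ever described this way, and that these descriptions did not rigorously consider aspects that are related to size, phylogenetic depth, internal taxonomic classification or a standard nomenclature format. The authors then evaluate the alphaproteobacterial clade SAR11 using the CTU approach.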

21. The discussion of the phylogenetic continuity of classes and phyla, together with the evaluation results for SAR11, TM7, OD1, OP3, OP11 and the phyla Elusimicrobia, Caldiserica and Armatimonadetes, further validates the effectiveness of the CTU approach. These discrepancies again emphasize the need to reconcile classification and nomenclature at all levels.
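
To make items 9 and 17 above more concrete, here is a minimal Python sketch of the two underlying operations: measuring pairwise sequence identity (and taking the lowest value within a taxon as a candidate minimum threshold), and grouping sequences into OTUs at a chosen threshold. Everything in it is an assumption for illustration: the function names, the toy sequences and the threshold values are hypothetical, the sequences are assumed to be pre-aligned, and the greedy clustering is only a stand-in for the "standard clustering algorithms" the paper mentions. The real thresholds are the ones in Table 1, which is not reproduced here.

```python
from itertools import combinations


def identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Columns where both sequences carry a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def min_identity(seqs):
    """Lowest pairwise identity within one taxon (needs at least two sequences)."""
    return min(identity(a, b) for a, b in combinations(seqs, 2))


def greedy_otus(seqs, threshold):
    """Greedy clustering: each sequence joins the first OTU whose
    representative (its first member) it matches at >= threshold,
    otherwise it founds a new OTU."""
    otus = []
    for s in seqs:
        for otu in otus:
            if identity(s, otu[0]) >= threshold:
                otu.append(s)
                break
        else:
            otus.append([s])
    return otus


# Toy aligned sequences (hypothetical, far shorter than a real 16S gene).
toy = ["ACGTACGTAC", "ACGTACGTAT", "ACGTTCGTAT", "TTGTACCTAC"]

genus_like_min = min_identity(toy[:3])   # lowest identity among the first three
strict = greedy_otus(toy, 90.0)          # arbitrary "strict" toy threshold
loose = greedy_otus(toy, 70.0)           # arbitrary "loose" toy threshold
print(genus_like_min, len(strict), len(loose))
```

Lowering the threshold merges more sequences into fewer OTUs, which is why each rank gets its own threshold. In the paper the OTUs are only the starting point: the CTUs are then sought as monophyletic clades in a tree, a step this sketch does not attempt.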


biostack Weekly | 2015-06-14 (Issue 3)

Twitter list

7’th June, 2015

09:38 RT @genetics_blog Finally, labeled multi-panel ggplot2 figures made easy thanks @ClausWilke
09:38 RT @genetics_blog ACE: Accurate Correction of Errors using K-mer tries
09:39 RT @genetics_blog oswitch. okay this is awesome. h/t @pathogenomenick
09:40 RT @genetics_blog sRNAtoolbox: an integrated collection of small RNA research tools
11:15 RT @biorxivpreprint Cpipe: a shared variant detection pipeline designed for diagnostic settings
11:53 RT @genetics_blog A de novo DNA Sequencing and Variant Calling Algorithm for Nanopore data
11:54 RT @genetics_blog Determining Exon Connectivity in Complex mRNAs by Nanopore Sequencing
15:38 RT @christianmaioli xpresso – Java library that lets you write Python-like code #java #python #programming

8’th June, 2015

07:44 Metagenomics of the Human Intestinal tract: From who is there to what is done there
08:10 RT @DNAmlin “C++ in the modern world” by @CPP_Coder

9’th June, 2015

07:50 RT @TechCrunch Apple Is Open-Sourcing Swift, Its New Programming Language by @kylebrussell
18:19 Using populations of human and microbial genomes for organism detection in metagenomes
18:31 [1506.02424] Algorithms for finding transposons in gene sequences

10’th June, 2015

07:30 GAML: genome assembly by maximum likelihood
07:39 RT @infoecho HTML5 slides with #IPython NB: Exploring Falcon Assembly String Graph Output #bioinformatics @PacBio Human Assembly
08:08 RT @ErichMSchwarz PBcR-MHAP de novo assembly of PacBio reads published:
08:10 RT @Magdoll Sequencing macaque MHC class I gene with PacBio Iso-Seq. P4-C2 chemistry with barcoding.
08:46 Adventures with Nanopore

11’th June, 2015

13:15 FunPat: function-based pattern analysis on RNA-seq time series data
18:26 RT @chuttenh @nsegata >3000 metagenome meta-analysis (!) profiled at the strain level for all organisms with >=1 reference genome #symbiomes
18:43 EDLIB : C/C++ library for sequence alignment using edit distance
18:52 GraphMap for Fast and Sensitive Mapping of Long Noisy Reads

12’th June, 2015

16:14 ProDeGe: a computational protocol for fully automated decontamination of genomes

13’th June, 2015

12:00 RT @hackernewsbot NixOS Linux…
15:54 RT @animesh1977 UProC: tools for ultra-fast protein domain classification
19:12 The landscape of genomic imprinting across diverse adult human tissues
19:14 Battling Phages: How Bacteria Defend against Viral Attack
19:24 Funders must encourage scientists to share
22:56 RT @MicrobiomePaper Metagenome Sequencing Reveals Rhodococcus Dominance in Farpuk Cave, Mizoram,… #microbiome
23:12 RT @MicrobiomDigest Nucleotide 9-mers Characterize the Type II Diabetic Gut Metagenome – Balazs Szalkai -ArXiv #bioinformatics

14’th June, 2015

07:27 RT @yokofakun “dbSNP build 144 now available”
07:27 RT @ctitusbrown As of latest master, khmer supports Python 3.4: Thanks, @luizirber! cc @biocrusoe @brettsky
07:27 RT @smllmp New blog: #Workflow tool makers: Allow defining the #dataflow rather than task dependencies:
08:15 RT @infoecho Genome Assembly Tutorial in 140 chrs: Sequence Overlap, String Graph and Fundamental String Algebra

ProDeGe: automated removal of contaminant sequences from genome sequences


ProDeGe: a computational protocol for fully automated decontamination of genomes


Single amplified genomes and genomes assembled from metagenomes have enabled the exploration of uncultured microorganisms at an unprecedented scale. However, both these types of products are plagued by contamination. Since these genomes are now being generated in a high-throughput manner and sequences from them are propagating into public databases to drive novel scientific discoveries, rigorous quality controls and decontamination protocols are urgently needed. Here, we present ProDeGe (Protocol for fully automated Decontamination of Genomes), the first computational protocol for fully automated decontamination of draft genomes. ProDeGe classifies sequences into two classes—clean and contaminant—using a combination of homology and feature-based methodologies. On average, 84% of sequence from the non-target organism is removed from the data set (specificity) and 84% of the sequence from the target organism is retained (sensitivity). The procedure operates successfully at a rate of ~0.30 CPU core hours per megabase of sequence and can be applied to any type of genome sequence.
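
The abstract defines sensitivity as the fraction of target sequence retained and specificity as the fraction of non-target sequence removed, and quotes a cost of about 0.30 CPU core hours per megabase. The Python sketch below shows how those quantities could be computed for a benchmark where the true origin of every contig is known. The function names, the tuple layout and the toy contigs are assumptions made for illustration; this is not ProDeGe's own evaluation code.

```python
def decontamination_metrics(contigs):
    """Length-weighted sensitivity and specificity.

    contigs: iterable of (length_bp, is_target, kept) tuples, where
    is_target is the known truth and kept is the tool's decision.
    """
    target_total = target_kept = 0
    nontarget_total = nontarget_removed = 0
    for length, is_target, kept in contigs:
        if is_target:
            target_total += length
            if kept:
                target_kept += length
        else:
            nontarget_total += length
            if not kept:
                nontarget_removed += length
    sensitivity = target_kept / target_total if target_total else float("nan")
    specificity = (
        nontarget_removed / nontarget_total if nontarget_total else float("nan")
    )
    return sensitivity, specificity


def estimated_cpu_hours(total_bp, hours_per_mb=0.30):
    """Rough cost estimate using the per-megabase rate quoted in the abstract."""
    return total_bp / 1_000_000 * hours_per_mb


# Hypothetical benchmark: (length in bp, truly from target?, kept by the tool?)
toy_contigs = [
    (500_000, True, True),
    (300_000, True, True),
    (200_000, True, False),   # target sequence wrongly discarded
    (150_000, False, False),
    (50_000, False, True),    # contaminant wrongly retained
]

sens, spec = decontamination_metrics(toy_contigs)
hours = estimated_cpu_hours(sum(length for length, _, _ in toy_contigs))
print(sens, spec, hours)
```

Weighting by contig length rather than contig count follows the abstract's wording ("84% of sequence"), so one long misclassified contig costs more than several short ones.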



Source code: (note: it cannot be downloaded at the moment!)


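  • Background

    One task in metagenomic data analysis is to reconstruct the dominant genomes in an environmental sample. What remains after that is to assess the completeness of the reconstructed genomes and to identify potential contaminant (non-target) sequences.

  • Data analysis workflow

    The input is the assembled sequences (which must be prokaryotic genome sequences; gene prediction uses Prodigal) plus taxonomy information. Contigs go in and contigs come out, each now labeled either 'Clean' or 'Contaminant'.
    The method uses two binning strategies: BLAST binning and k-mer binning.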


  • Sensitivity, specificity and run time


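
The k-mer binning strategy mentioned in the workflow above relies on the general observation that contigs from the same genome tend to share a similar k-mer composition. The Python sketch below illustrates that idea only: it builds k-mer frequency profiles and labels each contig by its distance to a reference profile. The value k=4, the Manhattan distance, the cutoff and all names are assumptions chosen for a small example; the post does not describe ProDeGe's actual k, features or classifier, and this is not a reimplementation of it.

```python
from collections import Counter


def kmer_profile(seq, k=4):
    """Relative frequencies of the k-mers in seq (k-mers with non-ACGT letters are skipped)."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def profile_distance(p, q):
    """Manhattan distance between two profiles (0 = identical, 2 = no shared k-mers)."""
    return sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in set(p) | set(q))


def label_contigs(contigs, reference, k=4, cutoff=1.0):
    """Label each contig 'Clean' if its profile is within cutoff of the reference profile."""
    ref = kmer_profile(reference, k)
    return {
        name: "Clean"
        if profile_distance(kmer_profile(seq, k), ref) <= cutoff
        else "Contaminant"
        for name, seq in contigs.items()
    }


# Toy data (hypothetical): the reference stands in for trusted target sequence.
reference = "ACGT" * 20
contigs = {
    "contig_a": "ACGT" * 10,          # same composition as the reference
    "contig_b": "AAAAATTTTT" * 4,     # very different composition
}
labels = label_contigs(contigs, reference)
print(labels)
```

Real contigs are much longer and a real tool would need a principled cutoff; the point here is only that composition alone, without any homology search, can separate sequences of different origin, which is what makes it a useful complement to BLAST binning.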



2015/6/12 22:32