biostack Weekly | 2015-07-19 (Issue 8)

13’th July, 2015
23:26 RT @DrJCThrash Automated and accurate estimation of gene family abundance from shotgun metagenomes w/ @phylogenomics @tjsharpton http://t.co/5djMe0WTpA

14’th July, 2015
18:14 RIG: Recalibration and Interrelation of Genomic Sequence Data with the GATK http://t.co/25OPkzHBrX
19:30 Bandage: interactive visualization of de novo genome assemblies http://t.co/86o6LbONwp
19:35 Good laboratory practice for clinical next-generation sequencing informatics pipelines http://t.co/z1kxP5pK9W
21:13 RT @moorejh RT @craigbrownphd MIT proves flash is as fast as RAM, and cheaper, for #bigdata: http://t.co/SInwJoHT5i #technology
21:22 RT @BioMickWatson Cpipe appears to be GATK best practices implemented in Bpipe: http://t.co/k7O2HvWE0j
21:34 RT @gilbertjacka @hollybik talking about http://t.co/rMY4lIR6M5 for data visualization – which uses colors and shapes to make data more accessible. #EEGen15
22:20 RT @Carlybacter Stringtie – new #transcriptome assembly http://t.co/7P9RU90P9R from @StevenSalzberg1 lab #EEGen15
22:35 RT @genetics_blog Accelerating Scientific Publication in Biology http://t.co/yv4Gig0cYT
22:36 RT @genetics_blog The bacterial pangenome as a tool for analyzing pathogenic bacteria http://t.co/PuS5QdOqiQ (review) http://t.co/dRH3RaQ2IR
22:38 RT @DrSLJ38 Read these blogs http://t.co/ajDRW6eGbc https://t.co/wDb9odsjOF http://t.co/XwYI6mlQeH https://t.co/WkH1APJlRw #IDRNgenomics
22:42 RT @genetics_blog quantro: a data-driven approach to guide the choice of an appropriate normalization method http://t.co/ChVnsSEsji http://t.co/sgfASgHsPG

15’th July, 2015
22:04 Metagenomics of toilet waste from long distance flights http://t.co/YshekTGF0a
22:21 RT @moorejh #machinelearning #datascience RT @newsycbot A Step by Step Backpropagation Example http://t.co/FElQqECtmS http://t.co/pu5CJ0SewV
23:06 Cpipe: a shared variant detection pipeline designed for diagnostic settings http://t.co/6bPNJSVRVP
23:27 MetaPathways v2.5: quantitative functional, taxonomic and usability improvements http://t.co/pmucUeh5tV
23:27 Hyperscape: visualization for complex biological networks http://t.co/uD0SvFI4PL
23:29 Investigating microbial co-occurrence patterns based on metagenomic compositional data http://t.co/GgpGaRgaK5
23:29 Correcting Illumina data http://t.co/ooX85lPeKG

16’th July, 2015
07:53 RT @kc31958 RW: “First genome of any species should be of highest quality possible”. #UGMAsia @Pacbio http://t.co/rCUDOqV9S3
21:28 RT @pathogenomenick #mgen journal launched! Read the interview with Stanley Falkow in the ‘standing on the shoulders of giants’ section: http://t.co/TdrkiJnT1M
21:28 Big Data: Astronomical or Genomical? http://t.co/J0vXhJc9bV

21:33 RT @metagenomic_lit MBBC: an efficient approach for metagenomic binning based on clustering. http://t.co/uIOjls7DDC
22:08 RT @andrewjpage Gubbins is now available on #homebrew http://t.co/iAcxXiLUwj
22:45 RT @assemblathon IEEE Xplore Abstract – SWAP-Assembler 2: Scalable Genome Assembler towards Millions of Cores http://t.co/ljPELnIf99

17’th July, 2015
22:07 RT @JavaScriptDaily Fundamental Node.js Design Patterns: http://t.co/VDuh7D2bIL
22:16 RT @genetics_blog CSHL protocols: RNA Sequencing & Analysis http://t.co/259IOwVXNu
22:21 How to Succeed at Clinical Genome Sequencing http://t.co/rfW9j8lBFt
23:15 RT @HeathrTurnr New #Rpackage UpSetR provides alternative to Venn diagrams, with optional plots highlighting specific intersections. http://t.co/GbCFklLufD

18’th July, 2015
07:53 RT @KevinADavies Lee Hood, who knows a thing or two about DNA sequencing, predicts the $100 genome in 5-8 years using synthetic nanopores #evolseq
08:39 RT @bioinformer Roche’s 454 Sues Thermo Fisher’s @IonTorrent for Patent Infringement | GenomeWeb https://t.co/v4aDfCZpuZ @rochesequencing #genomics #biotech
10:06 RT @Amazing_Maps Visualizing city densities –http://t.co/SuNWZGqTCP
10:07 RT @KevinADavies Bill Efcavitch: according to @thermofisher still some 15,000 Sanger instruments running today (out of ~30,000 install base). #evolseq
10:23 RT @nanopore You can now join the queue for the PromethION Access Programme https://t.co/lMzBXvuYeB
10:36 RT @pathogenomenick ete2 – where have you been all my life, quite the best way to annotate trees! http://t.co/SGzCea5iPB
21:06 RT @EricTopol Nice that the human interactome is so simple (not!) http://t.co/tHjVYZZU7V @CellCellPress http://t.co/MsvQa5ZJ82
22:38 RT @DrJCThrash Investigating microbial co-occurrence patterns based on metagenomic compositional data http://t.co/O51kZRF0nW

19’th July, 2015
10:56 Maize pan-transcriptome provides novel insights into genome complexity and quantitative trait variation | bioRxiv http://t.co/6bmyYp4fCE
12:58 RT @pypi_updates metameta 0.0.0.33: Toolkit for analyzing meta-transcriptome/metagenome mapping data http://t.co/O7PuDtkWxC
13:02 RT @KantorKantor Impact of the Gut Metagenome on Autoimmunity http://t.co/vL8xORxvyJ
19:03 RT @koadman Alm describing Smillie’s Strainfinder2 software to predict strains from metagenome & a close reference genome #UrbanGenome

biostack Weekly | 2015-07-12 (Issue 7)

6’th July, 2015
22:39 RT @moorejh Don’t Get Your Kids’ Genes Sequenced Just To Keep Up http://t.co/7LFqd9bswn #DNA #genomics #ethics http://t.co/q4q1I1bUAr
23:00 RT @Strep_papers WGS accurately predicts antimicrobial resistance in Escherichia coli. http://t.co/utOCog1bDZ http://t.co/14ByUH4Isp
23:12 RT @biorxivpreprint SPARTA: Simple Program for Automated reference-based bacterial RNA-seq Transcriptome Analysis http://t.co/NvjeFjbplO

7’th July, 2015
23:40 RT @kbradnam Still seeing lots of traffic for this #ACGT post: My favorite bioinformatics blogs of 2014 http://t.co/qaRJfgBbvu
23:40 RT @kbradnam Some more bioinformatics blogs are listed here: http://t.co/jYchdq9obF
23:42 RT @uBiome Announcing the uBiome $100k Microbiome Grant Contest! http://t.co/Va6F9Bg4aG

8’th July, 2015
23:13 A robust approach for identifying differentially abundant features in metagenomic samples http://t.co/JjQ56wWoF0
23:17 IVA: accurate de novo assembly of RNA virus genomes http://t.co/LUp3yrwgbG

9’th July, 2015
06:16 CoreGenomics: MinION for 16S rRNA-seq http://t.co/DI5Gjub8je
08:26 LayerCake: a tool for the visual comparison of viral deep sequencing data http://t.co/t7b4sCkrPi
23:29 RT @AllSeq MT @Exome_seq: Exome Sequencing: Current and Future Perspectives. http://t.co/WlRedlRU7K #OA

10’th July, 2015
08:12 RT @phylogenomics New #microBEnet blog post: The Baby-Associated Built Environment (BABE) Microbiome Project http://t.co/WXdIhmv40X
08:14 RT @Bioconductor http://t.co/RpAxXSXEV2 IONiseR Quality Assessment Tools for Oxford Nanopore MinION data
08:15 RT @genetics_blog Genome Modeling System: A Knowledge Management Platform for Genomics http://t.co/KOWjS6v5Wh http://t.co/dXReKGr6e8
08:16 RT @mcclure111 Programming is a constant struggle between “What is harder: Implementing it myself, or understanding the existing, undocumented solution?”
08:17 RT @genetics_blog Toward effective software solutions for big biology http://t.co/GbdWX6plSH
08:17 RT @pathogenomenick A stunningly comprehensive report Putting Pathogen Genomics into Practice from @PHGFoundation http://t.co/Cj8r7grX6H much to discuss
11:58 RT @EricTopol The 7 nanometer chip was projected to be developed by 2018, but announced today http://t.co/Q8PYoY8XhO @IBM http://t.co/aqjxSulyTy
13:08 RT @MicrobiomDigest Designer microbiome: MIT biologists program common gut bacteria @LATerynbrown #Btheta http://t.co/8gtmdvVkBM http://t.co/fKvuGLas43
13:29 RT @DrKatHolt The Lung Microbiome: New Principles for Respiratory Bacteriology in Health and Disease @robertpdickson @PLOSPathogens http://t.co/lnZERU9dAn
15:21 RT @pathogenomenick ICYMI: a landmark publication on translational pathogen genomics @leilaluheshi @c_rands et al @PHGFoundation http://t.co/igB2ORw2nw
21:05 The myths of bioinformatics software https://t.co/NSEIOBL7bO
21:43 RT @MicrobiomDigest Here, there, and everywhere: DNA contaminants creep in from the most unlikely places – Karl Gruber – EMBO Reports http://t.co/1UEFOHDFi4
21:45 RT @chandanpal143 Useful databases and resources for microbiologists. http://t.co/EaqSyXxtXw #database #bioinformatics
21:55 RT @jgoecks Slides from 2015 #usegalaxy visualization workshop: https://t.co/Hvbc6XgSEy
21:59 RT @BioCyc Freely Available Metabolic and Genome Posters from BioCyc http://t.co/WVD9Ex7tEX
22:06 RT @homolog_us Will I Use Kallisto? Definitely, Most Likely and Never (http://t.co/ioo76dFNRc) @pmelsted @lpachter
22:44 RT @RNASeqBlog #RNAMiner – A Bioinformatics Protocol for Mining Large #RNASeq Data – http://t.co/w9PDXucDiu @mizzouengineer http://t.co/yhGKHrximM

11’th July, 2015
13:28 Omics! Omics!: Clinical Metagenomics Pipelines: Revisiting & Reflecting http://t.co/PxQyW2nSLg
13:28 Omics! Omics!: Leaky clinical metagenomics pipelines are a very serious issue http://t.co/yJEtQ5bClb
13:35 Good laboratory practice for clinical next-generation sequencing informatics pipelines http://t.co/z1kxP5pK9W
14:40 RT @BorisAdryan Preliminary. Please get in touch if you are going to the @SoftwareSaved Collabo Workshop & want to work on @bionode: http://t.co/VMdxdnFhwf
15:32 RT @MrGreenify Happy to announce that BioJS 3 will be using @polymer and @Web_Components for viz in science. Get involved @ http://t.co/JEnXc8k0B4 #biojs15
15:41 RT @analyticbridge Six categories of Data Scientists http://t.co/VDxIK0gZ62
15:57 RT @GigaScience UMLS, it’s not ANOTHER ontology, but an integration of related phenotype ontologies #ismb2015 #bioont15
16:01 Identification of protein coding regions in RNA transcripts http://t.co/FOxOPyaDDP
16:11 SplicePie: a novel analytical approach for the detection of alternative, non-sequential and recursive splicing http://t.co/aUzIXv8fGO
16:27 RT @froggleston Fantastic keynote this morning at #bosc2015 from @hollybik, reminding bioinformaticians that biologists exist! :) https://t.co/LSBEYLvWWt
16:28 RT @biocrusoe Slides for my #BOSC2015 talk “Portable workflow and tool descriptions with the CWL” https://t.co/r8YazKpULg
16:32 RT @michaelhoffman Now: Sebastian Schoenherr: Bringing #Hadoop into Bioinformatics with Cloudgene and CloudMan. #bosc2015
17:33 RT @sjackman QUAST 3 is released! Faster (5x–100x running w/o reference; parallel with large references) brew upgrade quast http://t.co/iSfyglSdjT
17:46 RT @DickHardt My first @kickstarter has launched. Desktop Container Computer to run @docker containers https://t.co/vldHJJtcSP http://t.co/mBjav4PSLv

17:46 RT @ianmclean VisPy. GPU accelerated datavis for Python http://t.co/ng755QjaAb
20:27 RT @radaniba Set up Kubernetes with a Docker compose one-liner http://t.co/fc56rh84Lo
20:30 RT @jaredtobin Abstract Syntax Graphs for DSLs. Really exceptional read. http://t.co/rKQjpeMwDJ
20:42 RT @genetics_blog FCMM: A comparative metagenomic approach for functional characterization of multiple metagenome samples http://t.co/FkQ7LtM9pA
23:15 RT @DunhamLab One neat thing about paper is it was a nice bonus from our mixed yeast metagenome data: lots of yeast HiCs for cheap http://t.co/x21I6zArmP
23:16 RT @gedankenstuecke @OfficialSMBE "Potential and pitfalls of eukaryotic metagenome skimming: A test case for lichens" http://t.co/UP7pvR983G
23:33 RT @LJDishaw Metagenome Sequencing of the Hadza Hunter-Gatherer Gut Microbiota: Current Biology http://t.co/ieHLYBUJFN

12’th July, 2015
11:30 Accelerating Scientific Publication in Biology | bioRxiv http://t.co/siFlbAwvRk
12:02 RT @SebastienBadia I’m pretty sure this is an accurate diagram of Openstack’s architecture. http://t.co/mVrawE3yko http://t.co/wJg4AAhDrX
13:21 RT @geetduggal Thanks @erikgarrison for your cpp VCF parser so I don’t have to :) https://t.co/8PhaJCAJrG tip hat to @nomad421
14:32 Screening Currency Notes for Microbial Pathogens and Antibiotic Resistance Genes Using a Shotgun Metagenomic Approach http://t.co/G6H70XALpH
16:06 RT @koitaxoumemesa my new paper re: #microbiome & #metagenome of nursing & weaning transition in pigs http://t.co/dix3UH3ElZ in @MicrobiomeJ w @davidamills
16:54 Sharing and executing linked data queries in a collaborative environment http://t.co/XmUTsf0TYC
17:34 RT @BangGoesSci The Sum of Our Parts. Fascinating article on the importance of the #microbiome in immunity http://t.co/YBOXyo6u51
18:02 Sequence Your Microbiome http://t.co/xQm094XhON http://t.co/6j4TEym9lK
19:58 Indian currency notes harbour antibiotic-resistant genes: study http://t.co/ZvrqNCeDnH
20:13 RT @HPDIHealth The advantages of soil based organisms for health. http://t.co/Nzo04AGfEi #probiotics #microbiome

biostack Weekly | 2015-07-05 (Issue 6)

29’th June, 2015
13:19 RT @lexnederbragt Getting the most out of RNA-seq data analysis #peerJPreprint https://t.co/LpfIvUkg56
14:57 CanvasXpress is a standalone HTML5 graphing library written in Javascript to explore complex data sets. http://t.co/pHZREpUGRP
16:57 RT @BI0graphika #Antibiotic Resistance explorer, a reactive #datavis built with #dcjs + #crossfilter http://t.co/4M3c0jFmwP #d3js #bootstrap #bio4j
22:23 Prioritizing risks of antibiotic resistance genes in all metagenomes : Nature Reviews Microbiology http://t.co/l32Q2GVuDx
23:11 RT @homolog_us @lpachter @pmelsted @yarbsalocin Thanks. How do I tell the index command about multiple isoforms? http://t.co/LkAAF5I0VT

30’th June, 2015
07:21 Assembly and diploid architecture of an individual human genome via single-molecule technologies http://t.co/uY0INfkYSA
07:26 RT @klmr #Modules now supports regular #Rstats packages and treats them as modules—with all the perks! https://t.co/Xmf1K8ywa7 http://t.co/u9HPq9uVza
07:50 RT @shenderson Immutable serialisable data structures in C++11 | Victor Laskin’s Blog http://t.co/OY3ZONWfYf #c++ #cpp
20:33 Go: Slice search vs map lookup http://t.co/VAPNAFhyA9
22:49 BGT: efficient and flexible genotype query across many samples arxiv: http://t.co/JZk8OxgaAZ Github: http://t.co/hDSKvGkCBA
22:52 Optimal Seed Solver: Optimizing Seed Selection in Read Mapping http://t.co/IHGh0FvlGu
23:11 RT @torstenseemann @BioMickWatson @crashfrog yes, we use 1 cell per genome roughly, which is 600 to 1000 Mbp so about 100x and no need for illumina
23:16 RT @homolog_us Tools or Algorithms, What is More Important in Bioinformatics? – The Answer Will Shock You (http://t.co/h5WUaiRN1w)
23:16 RT @mgymrek My thoughts on the PrediXcan method for using transcriptome data to map complex traits: http://t.co/95vVMjgTzw
23:23 RT @EduEyras Optimising Transcriptome Assembly http://t.co/OfJ6IeLvZC

1’st July, 2015
08:18 Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection http://t.co/CMqH5poOKv
10:04 Points of View: Unentangling complex plots http://t.co/8S2ACL2cZs
18:59 ClustVis: a web tool for visualizing clustering of multivariate data using Principal Component Analysis and heatmap http://t.co/BMMmvhzVv9
19:17 RT @dhimmel Introducing data on publication and acceptance delays for 3,482 journals http://t.co/mkG9VgJfEZ http://t.co/OELtsz0Moi
19:25 RT @michaelwaskom seaborn 0.6 is out! Take a look at the release notes to find out what’s new: http://t.co/JjvR0Ftbqv http://t.co/DIE0oWRZTC
19:52 RT @yokofakun Nucleic Acids Research Web Server issue 2015 http://t.co/DgAePp7Jef
19:58 RT @goKarumi Useful Python Libraries for Startups http://t.co/3oAjXvAIHU
19:59 RT @pbailis Phil Bernstein et al.’s textbook is still one of the best books on the theory of serializability (and it’s free): http://t.co/x9g6a2yuFD

3’rd July, 2015
22:03 RT @elixirpipe Elixir in times of microservices by @josevalim – http://t.co/L7IuRgXfev #elixirlang

4’th July, 2015
06:40 RT @genetics_blog Bandage: interactive visualisation of de novo genome assemblies http://t.co/ov62ZX7hO0 http://t.co/EBgFz8g33L
06:42 RT @DaleYuzuki RT @Y_Gilad: Another attempt to use F1000 platform for a refutation study. By @StevenSalzberg1 http://t.co/y9vll7HNdo
11:55 ERGC: An efficient referential genome compression algorithm http://t.co/WpsCWNeQS0
21:37 RT @biocrusoe TIL: all NCBI code is in a public SVN repository http://t.co/OT1KpunBtP cc @pjacock
22:33 RT @ElixirTip Here is a great article on mutable value chains by @joeerl. http://t.co/1iZ6LlBhoc
22:33 RT @joakimk Really good intro to streams in general and @elixirlang streams in particular: http://t.co/S5O0isD0az by @drewolson.
22:35 RT @PeterCMarks Multi-process flow-based programming in @elixirlang : http://t.co/iTYOWFgRPg #myelixirstatus
22:48 RT @erikgarrison visualizing graph alignments using vg view -dA and graphviz http://t.co/ZIu1LCjwAA
22:53 RT @genetics_blog How Many Genes Are Expressed in a Transcriptome? http://t.co/UBCzUnHE3P

5’th July, 2015
13:39 Blasted Bioinformatics!?: NCBI working on SAM output from BLAST+ http://t.co/z7zqQu3XPR
14:48 RT @ozgurakgun this is indeed how most programs work on a multicore computer! http://t.co/IvSpNdgGkO
15:07 RT @smllmp …and finally pipeline parallelism! Been looking for a long time! Paves way for #flowbased in #elixirlang? #EUC2015 http://t.co/a36ygUz7yk

biostack Weekly | 2015-06-28 (Issue 5)

Not daring to slack off, here is this week's list of Twitter favorites.

21’st June, 2015

15:23 RT @MicrobiomDigest Biodiversity and distribution of polar freshwater DNA viruses – Daniel Aguirre de Cárcer @ScienceAdvances #virome http://t.co/mZ27PhyB45
15:24 RT @MicrobiomDigest Metagenomic analysis of #microbiome in colon tissue from subjects with #IBD reveals interplay of viruses and bacteria http://t.co/VcTdcttjK3
15:31 RT @DataScienceCtrl Data Science Wars: R versus Python http://t.co/abLJo6fKtt http://t.co/dAg522ZZDj
15:31 RT @DataScienceCtrl Comprehensive list of data science resources http://t.co/UZ7ZuTmt59 http://t.co/CSmqGVpCY5
15:57 Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees http://t.co/E9mGFY2CCA , bloomtree https://t.co/R3I7JTTITV
16:13 RT @laurencerowe CFFI and #pypy working well. Seeing a 3-4x speedup over CPython/pysam with @brent_p’s htslib wrapper – https://t.co/2Hjy6Jelds
17:03 RT @yokofakun two distinct projects: Workflow Description Language https://t.co/FWEVcF5phU vs Common Workflow Language https://t.co/Lg4mAQQTuM
17:39 RT @guillemch Workflow tool makers: Let us define the data flow, not task dependencies http://t.co/b3lA4p7WQn

22’nd June, 2015

15:49 Scientific Luigi: Extra helper methods for writing scientific workflows in Luigi http://t.co/zSCis39uCS
15:49 RT @MicrobiomDigest 16S rRNA gene-based profiling of infant gut microbiota strongly influenced by sample processing & PCR primer choice http://t.co/j53hstM8FY
17:40 Python, Ruby, and Golang: A Command-Line Application Comparison – Real Python https://t.co/2qMSyRxEDP
19:08 Go development environment for Vim http://t.co/oBTOvuwp3d
19:09 RT @vivainio Thought: languages benefiting most from #webassembly are the ones with precise memory management & no GC. C++, #rustlang, what else?

23’rd June, 2015
07:55 RT @jaredtsimpson .@ljdursi on the tricky problem of merging structural variation calls (with code!): http://t.co/nmnnfk7cpB
07:55 RT @benjaminlmoore GenomeD3Plot: A library for rich, interactive visualizations of genomic data in web applications http://t.co/W8EatssiDp #d3js
07:56 RT @mason_lab The impact of read length on quantification of differentially expressed genes and splice junction detection. http://t.co/W2eOKV92eh
07:57 RT @DrJCThrash Bayesian mixture analysis for metagenomic community profiling http://t.co/oOb2Dq0nrc
07:57 RT @JC_pathogenomic NEW: Assembling Short Reads from Jumping Libraries with Large Insert Sizes. http://t.co/uGMyKVozKd

24’th June, 2015
07:48 From trainee to tenure-track: ten tips | Genome Biology | Full Text http://t.co/ZkvoC18dYq

25’th June, 2015
08:10 MetaQuery: a web server for rapid annotation and quantitative analysis of specific genes in the human gut microbiome http://t.co/NifKqvCQg5
08:11 The problem with P values: defining clinical vs. statistical significance – Public Health http://t.co/snzX1u68Ff
08:11 BASiCS: Bayesian Analysis of Single-Cell Sequencing Data http://t.co/aXpUkVsZVh
08:13 The changing form of Antarctic biodiversity : Nature : Nature Publishing Group http://t.co/UjRN3GpaJk
08:16 Comprehensive identification and analysis of human accelerated regulatory DNA http://t.co/b6FIMixZZF
08:17 DT: An R interface to the DataTables library http://t.co/FWRLB93WXN
08:18 d3heatmap: Interactive heat maps http://t.co/gpyaeqAE42
08:49 Diving into Genetics and Genomics: RPKM/FPKM, TPM and raw counts for RNA-seq http://t.co/2mqQvLHRsp
11:38 Hyperscape: Visualization for complex biological networks http://t.co/djGMa3E5nY
11:40 RT @GaetanBurgio Personal Microbiome code bar! ‘Identifying personal microbiomes using metagenomic codes’ PNAS http://t.co/s6BDnvkOpb http://t.co/zutojSMmQS
11:47 Assembling short reads from jumping libraries with large insert sizes http://t.co/CFqXRWFody
11:47 QVZ: lossy compression of quality values http://t.co/qCRfoGDX2f
11:47 ACE: accurate correction of errors using K-mer tries http://t.co/wo0ngboWQT
14:13 Metabolic network modeling of microbial communities http://t.co/TZ4GwIg4IS
22:57 Virulence genes are a signature of the microbiome in the colorectal tumor microenvironment http://t.co/hKjZfgUhdM
22:57 Gut resistome development in healthy twin pairs in the first year of life http://t.co/VzhkWm3arM

26’th June, 2015
14:48 TRAL: tandem repeat annotation library http://t.co/zO9tMoWzfL
18:03 RT @RobLanfear Why should you open source your code? Because sometimes smart folks point out you could do it 100x faster… https://t.co/BELTkyHckD
22:23 rSeqNP: a non-parametric approach for detecting differential expression and splicing from RNA-Seq data http://t.co/tbSZBL3iBC
22:23 RAPTR-SV: a hybrid method for the detection of structural variants http://t.co/VvWsqmTMd8
22:24 Annocript: a flexible pipeline for the annotation of transcriptomes able to identify putative long noncoding RNAs http://t.co/wSBlztxUx8
22:24 Normalization and noise reduction for single cell RNA-seq experiments http://t.co/qUa5nSTsJM
22:24 Oasis: online analysis of small RNA deep sequencing data http://t.co/xecItgAWCp
23:01 On enhancing variation detection through pan-genome indexing | bioRxiv http://t.co/0vTi5xtpvN
23:19 Command-line Bootcamp http://t.co/CUJnbrV3xP

28’th June, 2015
12:52 RT @MarkLittlewood The slides from @billjaneway #BoS2015 talk on Productive Bubbles & Unicorn Bubbles. http://t.co/uFOli2r44Q
20:36 DISTMIX: direct imputation of summary statistics for unmeasured SNPs from mixed ethnicity cohorts http://t.co/Nfl7ztJkpG
20:37 ISQuest: Finding Insertion Sequences in Prokaryotic Sequence Fragment Data http://t.co/ric34FwUkO
21:20 RT @biorxivpreprint TransRate: reference free quality assessment of de-novo transcriptome assemblies http://t.co/eREsW8Bhby
21:25 Salmon: Accurate, Versatile and Ultrafast Quantification from RNA-seq Data using Lightweight-Alignment http://t.co/HHGOeGyWe2
21:41 RT @AndrewJesaitis Hopefully the Common Workflow Language can solve the interoperability bottleneck in science (esp Bioinformatics) https://t.co/QT8qiYiquu
21:42 RT @ctitusbrown Features of #CommonWL: dataflow; streaming; scatter/gather for parallelism; built-in docker support.
21:43 RT @RLangTip Extracting data from PDFs with #rstats. Vector Image Processing http://t.co/56Yvg6vJbk (PDF)
21:54 RT @kbradnam NCBI BLAST+ v2.2.31 released #ACGT http://t.co/WDsKOYFcxI
21:54 RT @fulhack A Beginners Guide to Building Data Pipelines with Luigi http://t.co/ypglsO9EIt – by @Dylan_Barth
22:15 RT @nanopore Interested in joining the PromethION Early Access Programme (PEAP)? Find out more: https://t.co/RZvDkErPMM
22:24 RT @JCVenter At Human Longevity we are currently sequencing 36,000 genomes/year, scaling up to 100,000 genomes/year

biostack Weekly | 2015-06-21 (Issue 4)

Twitter list

14’th June, 2015
08:56 RT @sjackman Various programs used for the assembly of @PacBio data. Anyone have corrections or additions for me? @infoecho http://t.co/bZGxUbw5mX
09:12 RT @pathogenomenick Frequency-based haplotype reconstruction from deep sequencing data of bacterial populations http://t.co/NXX3eYIrBh
12:40 RT @sjackman UniqTag: My paper for assigning k-mer IDs to genes using a minimizer has been published in @PLOSONE. R package & Ruby http://t.co/iI7gjxTOh0
12:49 NxTrim: optimized trimming of Illumina mate pair reads http://t.co/nXAvv3x8Bu
22:10 Bridging the knowledge gap: from microbiome composition to function http://t.co/WEkTlOJnDF
22:18 Bespoke diets based on gut microbes could help beat disease and obesity http://t.co/xzgboCgTBo http://t.co/8xL9lT62hJ

15’th June, 2015
18:22 Detecting somatic mutations in genomic sequences by means of Kolmogorov-Arnold analysis http://t.co/5ac2eDn6Kz

16’th June, 2015
07:54 The impact of Docker containers on the performance of genomic pipelines https://t.co/GUnfi2lnJu https://t.co/hhptSgebCD
23:22 Error Tree: A Tree Structure for Hamming & Edit Distances & Wildcards Matching http://t.co/7KtDi2Lb9x
23:24 Ultra-large alignments using phylogeny-aware profiles http://t.co/P1OXnYCuUy
23:24 Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA http://t.co/DrRsKqtOj8
23:25 MetaPathways v2.5: Quantitative functional, taxonomic, and usability improvements http://t.co/FZzgNCYzvs

17’th June, 2015
08:03 35 invaluable books on Data Visualization http://t.co/rymfsLcRaa
08:55 Madagascar open-source software project for multidimensional data analysis and reproducible computational experiments http://t.co/G8brxyfsWN
08:56 BitMapper: an efficient all-mapper based on bit-vector computing http://t.co/fatqIIKsB8
18:00 Can we change our microbiome to prevent colorectal cancer development?, Acta Oncologica, Informa Healthcare http://t.co/TmXCMoeNSA
20:33 RT @MicrobiomDigest Newfound groups of bacteria are mixing up the tree of life #JillBanfield @UCBerkeleyNews https://t.co/mBRHWTUVvL http://t.co/yDVx18LA2e
20:46 GTEx publishes results from RNA-Seq pilot study | RNA-Seq Blog http://t.co/IlKXSIGcUD
21:01 Investigating microbial co-occurrence patterns based on metagenomic compositional data http://t.co/1g56yu5BCu
21:10 FM-index for dummies http://t.co/nWvgt81jWQ
21:13 Linear-Time Sequence Comparison Using Minimal Absent Words & Applications http://t.co/3aOrwm9Mk1
21:58 RT @torstenseemann With @nanopore MkI flow cells at $900 (500 Mbp) I think @pacbio is safe for a while until MkII flowcell without electronics on it is out.
22:01 RT @hadleywickham dplyr 0.4.2 is now out: https://t.co/VQaIl5LpQl. Now preserves attributes (so works with haven) + lots of crash fixes #rstats
22:02 RT @lexnederbragt New blog post: Developments in high throughput #sequencing – June 2015 edition (aka, where to put the MinION) https://t.co/mujE1WSlwV
22:03 RT @dzerbino @jtleek @aaronquinlan @michaelhoffman WiggleTools reads BAM to Wig: https://t.co/lAlkWDkjqY
22:07 RT @JonBadalamenti For those keeping score: @PacBio’s in-house cluster has 16 nodes, 48 cores/node, 256 GB RAM, Lustre 2.1.2 HPFS #PBUGM @UMNmsi
22:13 RT @JonBadalamenti Justin Zook showing awesome results from @aphillippy @sergekoren’s MHAP doing human genome assemblies #PBUGM
22:21 RT @genetics_blog Unusual biology across a group comprising >15% of domain Bacteria http://t.co/8wXzIyJXPn via @MicrobiomDigest http://t.co/1aZOhVZkD3
22:25 RT @researchingsf MY: First draft assembly of tetraploid coffee arabica: 91-95% complete at >60X coverage, longest reads >65kb, longest contig >4Mb #pbugm
22:27 RT @PacBio Kevin Corcoran kicks off user group meeting, 100+ types of plants/animals genomes sequenced w PacBio, your genome deserves long reads #PBUGM
22:29 RT @PacBio LH: Coming soon! New barcodes for SMRT Sequencing, set of 384 barcodes, comprised of 16 bp each, flexibility for PCR primer design #PBUGM
22:54 RT @vivekbhr @jtleek @aaronquinlan @michaelhoffman Try deeptools. http://t.co/IKPazR4MUD

18’th June, 2015
07:40 What’s the difference between Causality and Correlation? http://t.co/a8NYrLRijz
08:03 Mining the microbial dark matter http://t.co/lWaQoswgqZ
08:22 Re-booting the human gut : Wyss Institute at Harvard http://t.co/5LVZHts9Ss
08:45 Starcode: sequence clustering based on all-pairs search http://t.co/vCi0YsZtI2
13:12 [1506.05185] CARGO: Effective format-free compressed storage of genomic information http://t.co/E6aoV7kJdD
22:25 RT @jscott_toronto Fungi to blame for fatal Berkeley balcony collapse, via @nytimes http://t.co/MLWX24x4Z1
22:41 RT @abremges Automated Contamination Detection in Single-Cell Sequencing http://t.co/tPc1nevsSA by @l0x et al.
22:42 RT @DrKatHolt Plot trees & annotation (incl BEAST, time axis) with ggtree for R! http://t.co/9ahqKQLTaI Well spotted @mackas21 http://t.co/2ldvAiH2fj
22:53 RT @infoecho FALCON PacBio Reads Assembler Internal, pipeline, directories, command line tools, scripts, inputs. outputs… http://t.co/htM9eGWkJT
22:57 RNASequel: accurate and repeat tolerant realignment of RNA-seq reads http://t.co/oZ4dUvN9TN
23:07 Dispersing misconceptions and identifying opportunities for the use of ‘omics’ in soil microbial ecology http://t.co/t8Q0rYk5WC

19’th June, 2015
08:26 Compact graphical representation of phylogenetic data and metadata with GraPhlAn https://t.co/RcQ1N5RkTy
08:31 RT @mikethemadbiol A Question About Oxford Nanopore and the Cost Model: This is the first mention of how much the much-vaunted Ox… http://t.co/XrAvSHilLE
08:32 RT @researchingsf KFA: Combining ~7.5M @PacBio long reads + 183M short reads = 35 fusion genes, 56 fusion sites, 30 highly expressed fusion isoforms #pbugm
21:41 RT @BioMickWatson Complete Genomics Revolocity and the future of genome sequencing http://t.co/Xpl6aINRVF
21:47 Genomics & Transcriptomics hold key to understanding ecological and evolutionary processes http://t.co/7IlD54wveY
21:47 Informing the Design of Direct-to-Consumer Interactive Personal Genomics Reports http://t.co/4IFlyAJLa1

20’th June, 2015
08:32 Bermuda: Bidirectional de novo assembly of transcripts with new insights for handling uneven coverage http://t.co/mBarjBxTN1
18:28 RT @hjpimentel Reproducibility? We’ve got that. kallisto analysis from preprint https://t.co/DYY1UNMdad snakemake + knitr @yarbsalocin @pmelsted @lpachter
21:51 Workflow management software for pipeline development in NGS https://t.co/6eNUffGTrs
22:25 RT @yokofakun BioShaDock registry “A #Bioinformatics Shared #Docker registry” http://t.co/yyYteIPfVz ( via @Yvan2935 )
22:36 7 Tools for Data Visualization in R, Python, and Julia http://t.co/bDYWM7W13t
22:44 Efficient development workflow using Git submodules and Docker Compose https://t.co/Jgz0t7q52y
22:45 RT @justinlivi Excited to try this development workflow with Docker Compose http://t.co/o6whOiaD16 via @airpair
23:19 RT @mikaelhuss JD: Queue: parallelizes workflows, robust (reruns failed jobs), traceable, deletes intermediate files, reusable components #einframps2015
23:22 RT @mikaelhuss KH: Future: porting all tools to Spark. Parallelizing variant callers. Interactive ad hoc queries of big data in genomics. #einframps2015
23:29 RT @pathogenomenick The microbiome of the human lower airways: a next generation sequencing perspective http://t.co/RMAMce0ddU
23:30 RT @torstenseemann Trim Galore! – auto detects which sequencing adaptors you used – by @FelixNmnKrueger #bioinformatics http://t.co/IMe16qtBYA
23:37 Biographika: rich interactive data visualizations on the web for the research community http://t.co/5Wjnsdf5Rb
23:39 RT @yokofakun [delicious] CONSERTING: integrating cnv analysis with structural-variation detection #tweet: We developed Copy… http://t.co/EsQDH2FDeX

21’st June, 2015
11:45 LFQC: A lossless compression algorithm for FASTQ files http://t.co/FNKTaBoBpi
11:49 Bio4j: a high-performance cloud-enabled graph-based data platform | bioRxiv http://t.co/u7ShVigfKg
14:38 Composition and temporal stability of the gut microbiota in older persons http://t.co/VMBzmN4yWk
14:48 RT @hammer_lab Introducing pileup.js, a Browser-based Genome Viewer (by @danvdk): http://t.co/m77EmuOWEq

The impact of Docker container technology on the performance of genomic data analysis

Article title:

The impact of Docker containers on the performance of genomic pipelines

Abstract:

Genomic pipelines consist of several pieces of third party software and, because of their experimental nature, frequent changes and updates are commonly necessary thus raising serious distribution and reproducibility issues. Docker containers technology offers an ideal solution, as it allows the packaging of pipelines in an isolated and self-contained manner. This makes it easy to distribute and execute pipelines in a portable manner across a wide range of computing platforms. Thus the question that arises is to what extent the use of Docker containers might affect the performance of these pipelines. Here we address this question and conclude that Docker containers have only a minor impact on the performance of common genomic pipelines, which is negligible when the executed jobs are long in terms of computational time.

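Article link:

https://peerj.com/preprints/1171/

Commentary:

Genomic data analysis pipelines often include a large number of third-party analysis tools; a typical mammalian genome project uses as many as 140+ tools and databases (http://blog.openhelix.eu/?p=20002). Because bioinformatics tools are upgraded frequently, Docker container technology is well suited to bioinformatics analysis pipelines.

The article's evaluation shows that the performance loss is small, especially for applications with long computation times.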
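
To see why a container cost matters less for long jobs, here is a minimal sketch. It assumes, purely for illustration, that the container adds a fixed cost per task; the 2-second figure and the run times are hypothetical values, not measurements from the paper.

```python
def relative_overhead(native_seconds, fixed_cost_seconds):
    """Fractional slowdown when a fixed cost is added to a native run time."""
    if native_seconds <= 0:
        raise ValueError("native_seconds must be positive")
    return fixed_cost_seconds / native_seconds


# Hypothetical per-task container cost (an assumption, not a measured value).
FIXED_COST = 2.0

# Hypothetical native run times: 10 seconds, 10 minutes, 10 hours.
for native in (10.0, 600.0, 36000.0):
    slowdown = relative_overhead(native, FIXED_COST)
    print(f"native {native:>8.0f} s -> slowdown {slowdown:.3%}")
```

Under this assumption the same fixed cost is a large share of a short task and a vanishing share of a long one, which matches the direction of the article's conclusion.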


Benchmark source code:

https://github.com/cbcrg/docker-benchmarks

Last updated:
v1, 2015/6/16 9:06

Minutes of: Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences

Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences

Journal: Nature Reviews Microbiology

IF: 23.317

Link: http://www.nature.com/nrmicro/journal/v12/n9/abs/nrmicro3330.html


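Summary:

The small subunit ribosomal RNA gene is better known as the 16S rRNA gene of prokaryotes, that is, bacteria (Bac.) and archaea (Arc.). Research on 16S rRNA and the number of sequences in public databases have both grown rapidly, and there are now more than 4 million sequence entries. However, there is no unified framework for the classification and nomenclature of bacteria and archaea.

In this analysis article, the researchers analyzed the large volume of data in 16S rRNA sequence databases, compiled the taxonomic nomenclature in use, and proposed rational taxonomic boundaries for Bac. and Arc. at high taxonomic ranks based on 16S rRNA sequence identities. The article covers: rational taxonomic boundaries for bacteria and archaea at the level of high taxa; an updated census of the kinds and numbers of bacteria and archaea worldwide; and the rationale for a stable hierarchical classification of high taxa that applies to both cultured and uncultured microorganisms. The analysis also found that only near full-length 16S rRNA sequences give accurate estimates of taxonomic diversity or divergence. In addition, analysis of existing data shows that most current studies of environmental samples stop at high taxonomic ranks.

Notes: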

1. It is remarkable that, although nearly 1.3 million eukaryotic species have been described so far, this number might represent only 20% of the richness that exists.

2. Only ~11,000 bacterial and archaeal species have been classified so far. At the current rate of ~600 new descriptions per year, it has been estimated that it would take >1,000 years to classify all of the remaining species.

3. Moreover, the full extent of their diversity is difficult to conceptualize, owing to the lack of objective criteria (such as numerical thresholds) for taxonomic circumscriptions of uncultured microorganisms, which are identified only by sequence data. Thus, estimates for the total number of bacterial and archaeal species vary widely, from 3 × 10^4 to ~10^12.

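4. Data growth in Silva REF 114 between 2006 and 2012: The number of newly detected species (at 98.7% sequence identity) was about 4 × 10^4 taxa and the number of newly detected genera (at 94.5% sequence identity) was about 1.3 × 10^4 taxa per year over the past six years.

5. Regressing the ratio of newly discovered species and genera against the new sequence data added each year shows that the discovery rate falls markedly year by year (a clear linear decrease is observed), that is, the rate of detection of new genera and of new species may be close to zero, by the end of 2015 and 2017, respectively. (Editor's note: this trend is based only on current technology. Exploring more and more unusual environments with new sequencing technologies may yet open up unexpected new territory; after all, "the world is so big, and we want to go and see it".)

6. Although there are formal rules for naming microorganisms, classification itself (taxa and their hierarchical classifications) is a manual process that is inevitably subjective. Only the rank of species is circumscribed by a combination of well-accepted criteria, which include: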

  1. DNA–DNA hybridization (DDH), with a threshold around 70%;
  2. Average sequence identities of shared genes (ANI), with a threshold of around 94–96%;
  3. and 16S rRNA gene sequence identities, with a threshold of around 98.7%.
  4. genetic criteria should always be accompanied by a discriminant phenotypic property
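
The three numerical criteria above can be checked side by side. The sketch below is only an illustration: the function name is hypothetical, the ANI cutoff of 95.0 is an arbitrary midpoint of the quoted 94–96% range, and treating each threshold as "greater than or equal to" is an assumption. It does not encode the phenotypic requirement, which cannot be reduced to a number.

```python
# Thresholds quoted in the notes above.
DDH_MIN = 70.0   # DNA-DNA hybridization, percent ("around 70%")
ANI_MIN = 95.0   # assumed midpoint of the quoted 94-96% range
SSU_MIN = 98.7   # 16S rRNA gene sequence identity, percent


def species_evidence(ddh, ani, ssu_identity):
    """Report, per criterion, whether a pair of strains falls in the same-species range."""
    return {
        "DDH": ddh >= DDH_MIN,
        "ANI": ani >= ANI_MIN,
        "16S": ssu_identity >= SSU_MIN,
    }


# Hypothetical pairwise values for two strain pairs.
print(species_evidence(82.0, 97.1, 99.2))
print(species_evidence(45.0, 88.0, 99.0))
```

In the second hypothetical pair the 16S identity alone would suggest a single species while DDH and ANI would not, which is why the criteria are used in combination.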

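7. Genera are also recognized by their phylogenetic separation from other such groups and the possession of 16S rRNA gene sequence identities of >95%. There are no robust rules for the circumscription of ranks above genus. (Editor's note: however, the microbial community diversity studies that are growing rapidly with NGS, such as MiSeq-based 16S rDNA and ITS studies and metagenomics, mostly focus on high taxonomic ranks.)

8. The pre-classification of environmental sequences in a global and unified system may have many advantages, such as avoiding cumbersome taxonomic revisions and producing a stable taxonomic framework. (Editor's note: this point links the preceding discussion to what follows.)

9. Using the LTP 102 data from the SILVA database, sequence identity thresholds were calculated for genus and each higher rank; see Table 1. Because LTP records only one strain per species, no threshold can be calculated at the species level. Note that these thresholds are all minimum values, and the sequence identities between different taxa at a given rank may be higher than the calculated threshold, so it is best to consult other kinds of data as well, such as morphological, genetic and environmental (contextual) data.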


10. Specific regions were extracted from the LTP 108 sequences to simulate how accurately different identity thresholds perform in richness estimation. The V1–V2 fragment was the most strongly affected by the thresholds. Meanwhile, analysis of the taxa recovery rate indicates a great underestimation of taxa richness when partial sequences are used. Although the situation tends to ameliorate as longer segments are considered, near full-length 16S rRNA genes sequences are required for accurate richness estimations and accurate classifications of high taxa. (Editor's note: why does the recovery rate for high taxa require longer sequences? Could it be that the higher the rank, the lower the identity threshold between taxa and the blurrier the boundaries between them?)

11. In conclusion, the results indicate that near full-length SSU rRNA gene sequences are required for accurate richness estimations and accurate classifications of high taxa.

12. The calculated identity thresholds were then used to analyze the data in Silva REF 114. For the high taxa, the analyses provide evidence for a current census of at least 6.1 × 10^4 genera, 9.6 × 10^3 families, 4.2 × 10^3 orders, 2.2 × 10^3 classes and 1.3 × 10^3 phyla of bacteria and archaea. See Table 2.


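13. The estimates show that the number of high-rank taxa far exceeds what is currently described in LPSN. For example, the estimate at the phylum level is (1356-84)≈1300, whereas LPSN lists only 27. The main cause of this discrepancy is probably estimation bias introduced by anomalous data such as amplification and sequencing errors and chimeras. Even allowing for the maximum effect of chimeras, however, the number of phyla only falls from about 1,300 to about 1,100.

14. Examining the Silva REF 114 data at three identity levels, 98.7%, 99.0% and 99.5%, found about 2.1 × 10^4 species-level OTUs, and the global total of microbial species was estimated at about 4 × 10^5, far fewer than the earlier estimate of 2 × 10^6. This figure should also be treated with caution, because the estimation strategy simplifies many factors, such as data saturation, the proportion of low-abundance species, and the sources of estimation bias.

15. To make the classification criteria consistent between cultured and uncultured microorganisms, the article proposes the concept of the CTU. A new biodiversity unit called the CTU (candidate taxonomic unit), which is compatible with the hierarchy that was established in the Bacteriological Code. Define a CTU as a biological entity that is delineated by a monophyletic set of sequences with a sequence identity that stays within, or very close to, the taxonomic threshold that is proposed for a given rank. (Editor's note: a CTU is a monophyletic set of sequences whose identity meets, or is very close to, the sequence identity threshold defined for a given rank, treated as the biological entity under study.)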


16. CTU is conceived to be a combination of the OTU and OPU (operational phylogenetic unit), which are currently used by microbial ecologists and systematists.

17. How CTUs are identified: The recognition of CTUs starts with a preliminary classification based on OTUs (which is calculated using standard clustering algorithms with the different taxonomic thresholds that are identified in Table 1), the results of which guide the search for local meaningful phylogenetic clades in a reliable tree topology.
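
The preliminary OTU step can be pictured with a small sketch. The notes do not say which clustering algorithm the authors used, so greedy centroid clustering is swapped in here as one common choice. The function names and toy sequences are hypothetical, the input is assumed to be already aligned to equal length, and only the two thresholds quoted in these notes (98.7% for species, 94.5% for genus) are used.

```python
def identity(a, b):
    """Percent identity between two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # skip columns that are gaps in both sequences
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def greedy_otus(seqs, threshold):
    """Put each sequence in the first cluster whose centroid it matches at >= threshold."""
    centroids = []  # representative sequence of each cluster
    clusters = []   # member names of each cluster, in the same order
    for name, seq in seqs.items():
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                clusters[i].append(name)
                break
        else:
            # No centroid was close enough, so this sequence founds a new cluster.
            centroids.append(seq)
            clusters.append([name])
    return clusters


# Toy 20-nt "alignments": b differs from a at 1 site, c differs from a at 3 sites.
toy = {
    "a": "ACGTACGTACGTACGTACGT",
    "b": "ACGTACGTACGTACGTACGA",
    "c": "TTGTACGTACGTACGTACGA",
}
print(greedy_otus(toy, 94.5))  # genus-level threshold from the notes
print(greedy_otus(toy, 98.7))  # species-level threshold from the notes
```

Greedy clustering depends on input order, which is one reason the article follows this step with a search for monophyletic clades in a tree rather than taking the OTUs as final.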

18. The CTU approach makes it easy to re-evaluate existing taxonomic descriptions. As an example, the authors use the phylum Spirochaetes (by the end of 2013, the 112 described species had been organized into 16 genera, four families, one order and one class) and divide the current phylum into five separate classes [Spirochaetes.Class1 (which comprises the genera Borrelia, Treponema, Sphaerochaeta and Spirochaeta), Spirochaetes.Class2 (which comprises the genus Exilispira), Spirochaetes.Class3 (which comprises the genus Brevinema), Spirochaetes.Class4 (which comprises the genus Brachyspira) and Spirochaetes.Class5 (which comprises the genera Leptospira, Leptonema and Turneriella)].

19. CTUs can also be used to classify the diversity of environmental samples, particularly the uncultured Bacteria/Archaea data currently in databases. The authors selected 9,426 sequences from 11 phyla (6 clades + 8 candidate divisions) in Silva REF 108 for evaluation. It is remarkable that the total number of delineated CTUs (3,553 CTUs, distributed into 2,053 genera, 767 families, 411 orders, 240 classes and 82 phyla) was nearly identical to the total number of calculated OTUs (3,562 in total).

20. The article points out that the early so-called "environmental clades" were tied to the environmental context of the sequence data, that only a small fraction of sequences were described this way, and that the relevant factors and issues were not rigorously considered (not considered aspects that are related to size, phylogenetic depth, internal taxonomic classification or a standard nomenclature format). It then uses the CTU approach to evaluate the alphaproteobacterial clade SAR11.

21. The discussion of the continuity of phylogenetic relationships among classes and phyla, together with the evaluation of SAR11, TM7, OD1, OP3, OP11 and the phyla Elusimicrobia, Caldiserica and Armatimonadetes, further validates the CTU approach. These discrepancies again emphasize the need to reconcile classification and nomenclature at all levels.

PDF of the paper: http://pan.baidu.com/s/1kTAEEMz

biostack Weekly | 2015-06-14 (Issue 3)

Twitter list

7’th June, 2015

09:38 RT @genetics_blog Finally, labeled multi-panel ggplot2 figures made easy http://t.co/vl1W1IsbG7 thanks @ClausWilke http://t.co/K62PTnZsrG
09:38 RT @genetics_blog ACE: Accurate Correction of Errors using K-mer tries http://t.co/DeMbJYbbcP
09:39 RT @genetics_blog oswitch. okay this is awesome. https://t.co/QsfYZriiNn h/t @pathogenomenick http://t.co/5zLeWtlgUX
09:40 RT @genetics_blog sRNAtoolbox: an integrated collection of small RNA research tools http://t.co/2taeOBl8ki
11:15 RT @biorxivpreprint Cpipe: a shared variant detection pipeline designed for diagnostic settings http://t.co/poE6Xbyjhh
11:53 RT @genetics_blog A de novo DNA Sequencing and Variant Calling Algorithm for Nanopore data http://t.co/x77DQTpPiW https://t.co/SlcjLif2RI
11:54 RT @genetics_blog Determining Exon Connectivity in Complex mRNAs by Nanopore Sequencing http://t.co/a70XpBU7wX
15:38 RT @christianmaioli xpresso – Java library that lets you write Python-like code https://t.co/pHfGX2OHDQ #java #python #programming

8’th June, 2015

07:44 Metagenomics of the Human Intestinal tract: From who is there to what is done there http://t.co/FYzhGQWGwl
08:10 RT @DNAmlin “C++ in the modern world” by @CPP_Coder https://t.co/tXomAUpwmZ

9’th June, 2015

07:50 RT @TechCrunch Apple Is Open-Sourcing Swift, Its New Programming Language http://t.co/rZYC3kVxcx by @kylebrussell
18:19 Using populations of human and microbial genomes for organism detection in metagenomes http://t.co/7kDpnbU9wT
18:31 [1506.02424] Algorithms for finding transposons in gene sequences http://t.co/EOzuRoyoED

10’th June, 2015

07:30 GAML: genome assembly by maximum likelihood http://t.co/NvwsYrOLAr
07:39 RT @infoecho HTML5 slides with #IPython NB: Exploring Falcon Assembly String Graph Output #bioinformatics @PacBio Human Assembly https://t.co/R8t2kKjRpf
08:08 RT @ErichMSchwarz PBcR-MHAP de novo assembly of PacBio reads published: http://t.co/wcwgLOsvEw
08:10 RT @Magdoll Sequencing macaque MHC class I gene with PacBio Iso-Seq. P4-C2 chemistry with barcoding. http://t.co/8ytLFsHv9W
08:46 Adventures with Nanopore http://t.co/PcH6qqTD1s

11’th June, 2015

13:15 FunPat: function-based pattern analysis on RNA-seq time series data http://t.co/yN1vSQR3aH
18:26 RT @chuttenh @nsegata >3000 metagenome meta-analysis (!) profiled at the strain level for all organisms with >=1 reference genome #symbiomes
18:43 EDLIB : C/C++ library for sequence alignment using edit distance http://t.co/XmQfVZfBke
18:52 GraphMap for Fast and Sensitive Mapping of Long Noisy Reads http://t.co/30EsqZRZyE

12’th June, 2015

16:14 ProDeGe: a computational protocol for fully automated decontamination of genomes http://t.co/1IBC1dcYAo

13’th June, 2015

12:00 RT @hackernewsbot NixOS Linux… http://t.co/eW4L5Pf6rA
15:54 RT @animesh1977 UProC: tools for ultra-fast protein domain classification http://t.co/Abo9fzhTdT
19:12 The landscape of genomic imprinting across diverse adult human tissues http://t.co/0mcM3tJQ89
19:14 Battling Phages: How Bacteria Defend against Viral Attack http://t.co/DMr067r1hL http://t.co/8Mo9C5VXdP
19:24 Funders must encourage scientists to share http://t.co/oPlUvzUyD2
22:56 RT @MicrobiomePaper Metagenome Sequencing Reveals Rhodococcus Dominance in Farpuk Cave, Mizoram,… http://t.co/kAbusVHwKi #microbiome http://t.co/WklS4COp0Q
23:12 RT @MicrobiomDigest Nucleotide 9-mers Characterize the Type II Diabetic Gut Metagenome – Balazs Szalkai -ArXiv #bioinformatics http://t.co/LDWMJ7rND8

14’th June, 2015

07:27 RT @yokofakun “dbSNP build 144 now available” http://t.co/u01nURGLdF
07:27 RT @ctitusbrown As of latest master, khmer supports Python 3.4: https://t.co/O6QtElV9Ze Thanks, @luizirber! cc @biocrusoe @brettsky
07:27 RT @smllmp New blog: #Workflow tool makers: Allow defining the #dataflow rather than task dependencies: http://t.co/e74wTfl6I8 http://t.co/vXeVefzKhm
08:15 RT @infoecho Genome Assembly Tutorial in 140 chrs: Sequence Overalp, String Graph and Fundamental String Algebra http://t.co/NdgfPkJfT

ProDeGe: automated removal of contaminant sequences from genome sequences

Article title:

ProDeGe: a computational protocol for fully automated decontamination of genomes

Abstract:

Single amplified genomes and genomes assembled from metagenomes have enabled the exploration of uncultured microorganisms at an unprecedented scale. However, both these types of products are plagued by contamination. Since these genomes are now being generated in a high-throughput manner and sequences from them are propagating into public databases to drive novel scientific discoveries, rigorous quality controls and decontamination protocols are urgently needed. Here, we present ProDeGe (Protocol for fully automated Decontamination of Genomes), the first computational protocol for fully automated decontamination of draft genomes. ProDeGe classifies sequences into two classes—clean and contaminant—using a combination of homology and feature-based methodologies. On average, 84% of sequence from the non-target organism is removed from the data set (specificity) and 84% of the sequence from the target organism is retained (sensitivity). The procedure operates successfully at a rate of ~0.30 CPU core hours per megabase of sequence and can be applied to any type of genome sequence.
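
The abstract's definitions of sensitivity (target sequence retained) and specificity (non-target sequence removed) can be written out as a short sketch. Weighting by base pairs rather than by contig count is an assumption based on the wording "84% of sequence". The function names and the toy contigs are hypothetical, and the 0.30 CPU core hours per megabase rate is the figure quoted in the abstract.

```python
def decontamination_metrics(contigs):
    """Compute (sensitivity, specificity) from (length_bp, is_target, kept) tuples."""
    target_bp = target_kept = 0
    nontarget_bp = nontarget_removed = 0
    for length, is_target, kept in contigs:
        if is_target:
            target_bp += length
            if kept:
                target_kept += length
        else:
            nontarget_bp += length
            if not kept:
                nontarget_removed += length
    # Fraction of target sequence retained after decontamination.
    sensitivity = target_kept / target_bp if target_bp else float("nan")
    # Fraction of non-target sequence removed by decontamination.
    specificity = nontarget_removed / nontarget_bp if nontarget_bp else float("nan")
    return sensitivity, specificity


def cpu_core_hours(total_bp, hours_per_mb=0.30):
    """Estimate run cost from the per-megabase rate quoted in the abstract."""
    return total_bp / 1_000_000 * hours_per_mb


# Hypothetical draft genome: three target contigs and two contaminant contigs.
toy_contigs = [
    (50_000, True, True),
    (30_000, True, True),
    (20_000, True, False),   # target contig wrongly removed
    (10_000, False, False),  # contaminant correctly removed
    (5_000, False, True),    # contaminant wrongly kept
]
sens, spec = decontamination_metrics(toy_contigs)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
print(f"estimated cost for 5 Mb: {cpu_core_hours(5_000_000):.2f} CPU core hours")
```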

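Article link:

http://www.nature.com/ismej/journal/vaop/ncurrent/full/ismej2015100a.html

Web application:

https://prodege.jgi-psf.org

Source code:

http://prodege.jgi-psf.org//downloads/src (Note: the download does not currently work!)

Commentary:

  • Background

    One task in metagenomic data analysis is to reconstruct the dominant genomes of an environmental sample. What remains after that is to assess the completeness of the reconstructed genomes and to identify potential contaminant (non-target) sequences.

  • Analysis pipeline

    The input is the assembled sequences (these must be prokaryotic genome sequences; gene prediction uses Prodigal) plus taxonomy information. Contig sequences go in and contig sequences come out, now labeled as ‘Clean’ or ‘Contaminant’;
    The method uses two binning strategies: BLAST binning and k-mer binning;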


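  • Sensitivity, specificity and run time

    The reported figures do not match those on the official website, possibly because the versions differ;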


Related software:

CheckM https://github.com/Ecogenomics/CheckM

2015/6/12 22:32

TransDecoder: predicting open reading frames (ORFs) in transcript sequences

Tool description:

TransDecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed based on RNA-Seq alignments to the genome using Tophat and Cufflinks.
TransDecoder identifies likely coding sequences based on the following criteria:

* a minimum length open reading frame (ORF) is found in a transcript sequence

* a log-likelihood score similar to what is computed by the GeneID software is > 0.

* the above coding score is greatest when the ORF is scored in the 1st reading frame as compared to scores in the other 5 reading frames.

* if a candidate ORF is found fully encapsulated by the coordinates of another candidate ORF, the longer one is reported.  
  However, a single transcript can report multiple ORFs (allowing for operons, chimeras, etc).

* optionally, the putative peptide has a match to a Pfam domain above the noise cutoff score.
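
The first criterion, a minimum-length ORF searched across all six reading frames, can be illustrated with a minimal sketch. This is not TransDecoder's implementation: it only reports complete ATG-to-stop ORFs, uses the standard stop codons, and omits the log-likelihood scoring and Pfam steps entirely. The function names and the `min_len` default (in nucleotides) are assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Yield (start, end) of ATG...stop ORFs in one forward frame (0, 1 or 2)."""
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            yield start, i + 3  # end is exclusive and includes the stop codon
            start = None


def six_frame_orfs(seq, min_len=300):
    """Return ORFs of at least min_len nucleotides from both strands, longest first."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                if end - start >= min_len:
                    found.append((strand, frame, start, end, s[start:end]))
    return sorted(found, key=lambda orf: orf[3] - orf[2], reverse=True)


# Toy transcript with a single short ORF on the forward strand.
toy_transcript = "CCATGAAATTTGGGTAACC"
print(six_frame_orfs(toy_transcript, min_len=9))
```

Coordinates on the "-" strand refer to the reverse-complemented sequence, a simplification that keeps the sketch short.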

Official homepage:

https://transdecoder.github.io/

GitHub repository:

https://github.com/TransDecoder/TransDecoder

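Tool guide:

There are several strategies for predicting ORFs in a transcript sequence, for example:

     (1) Align against the NR database to choose the reading frame, then translate in that frame using ESTscan or EMBOSS sixpack/transeq;
     (2) For assembled RNA-seq data from microbes, use a gene prediction tool such as Prodigal directly;
     (3) For transcripts from eukaryotic de novo RNA-seq assembly, TransDecoder is recommended;

The latest version of TransDecoder is much easier to use than earlier versions. The most convenient change is that what used to be a single command is now split into several steps. For example, searching the predicted longest_orfs sequences against Pfam, and homology searches against other databases, are now separate from ORF prediction, so the data can be split up and submitted with external tools, which reduces run time.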
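
Splitting the longest_orfs sequences so that each part can be searched independently might look like the sketch below. It is a generic FASTA splitter, not part of TransDecoder. The function names and the toy records are hypothetical, and how the chunks are then submitted (cluster scheduler, GNU parallel, and so on) is left out.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def split_records(records, n_chunks):
    """Distribute records round-robin into n_chunks lists."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


# Toy peptide records standing in for the predicted ORF sequences.
toy_lines = [">orf1", "MKT", "LLV", ">orf2", "MAA", ">orf3", "MGG"]
for n, chunk in enumerate(split_records(read_fasta(toy_lines), 2)):
    print(n, [header for header, _ in chunk])
```

Round-robin assignment keeps the chunks close to equal in record count; balancing by total sequence length would be a reasonable alternative when ORF lengths vary widely.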

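Installation is straightforward: download, unpack, then run make.

Tip: TransDecoder uses cd-hit-est to cluster sequences. This step could be rewritten to use usearch/vsearch for speed, although there are usually not many sequences, so speed is rarely an issue. Even so, the cd-hit-est step run by TransDecoder.Predict should be switched to multi-threaded mode so that CPUs are not wasted;

Related tools: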

EMBOSS http://emboss.sourceforge.net/
ESTscan http://estscan.sourceforge.net/
TransDecoder https://transdecoder.github.io/
Usearch http://www.drive5.com/usearch/