Bioinformatics and High-Throughput Sequencing - Abstracts

Symposium, Date: 24 March 2010, Venue: Paris

1 Jean-Jacques Daudin, AgroParisTech, Paris, France

Statistical challenges from the analysis of NGS-Metagenomics experiments - Daudin's speech

All metagenomics experiments share the same first step: DNA is extracted directly from all the microbes living in a particular environment (sea, soil, human gut, cheese...). Sequence-based metagenomics captures a massive amount of information on the microbial community under study. The tags are then filtered for quality, assembled, and aligned to a reference gene set containing known genes and unknown ORFs. Then the statistical questions arise: Is the experiment repeatable? What are the sources of variability? How many species or genes were really present? Is there any difference between two conditions? What about experimental design? Some of these questions may be solved using standard tools, and some of them need new methods.
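As an illustration of the "standard tools" side, the question "how many species or genes were really present?" is often approached with a richness estimator. The sketch below uses the bias-corrected Chao1 estimator, which is an assumption made for this example and is not named in the abstract; the function name and the input counts are hypothetical.

```python
def chao1(counts):
    """Bias-corrected Chao1 richness estimate from per-species read counts.

    Assumption: `counts` holds one non-negative integer per species or gene
    (the number of tags assigned to it). Zero counts are ignored.
    """
    observed = [c for c in counts if c > 0]
    s_obs = len(observed)                      # species actually seen
    f1 = sum(1 for c in observed if c == 1)    # singletons (seen once)
    f2 = sum(1 for c in observed if c == 2)    # doubletons (seen twice)
    # Many singletons relative to doubletons suggests many unseen species.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical tag counts for eight reference genes, one of them never observed.
example_counts = [10, 5, 1, 1, 1, 2, 2, 0]
print(chao1(example_counts))
```

The estimate is never below the observed richness, and it grows when rare species dominate the sample, which is the typical situation in deep metagenomic sequencing.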

2 Roderic Guigo, Centre de Regulació Genòmica, Barcelona, Spain

Discovery and quantification of RNA by RNAseq experiments - Guigo's speech

Transcribed regions have long been regarded as a distinguishing characteristic of functional portions of the human genome. Massively parallel sequencing of RNAs with next-generation sequencing (NGS) instruments promises, for the first time, sufficient sequencing depth for full transcriptome characterization, that is, for the identification of every transcript species in the cell and their quantification; in particular, for the accurate estimation of the relative abundances of alternative transcript isoforms from the same gene, and of the expression of novel non-coding RNAs. However, the most cost-effective of these technologies typically produce very short sequence reads, which complicates transcript reconstruction and quantification. We will discuss computational approaches being developed to address this issue and to produce accurate estimates of transcript quantities in the cell.
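To make the quantification problem concrete: a short read often matches several isoforms of the same gene, so it cannot simply be counted for one of them. A common way to handle this is an expectation-maximization (EM) scheme, sketched below. This is a generic illustration, not the method presented in the talk; it ignores transcript length and mapping quality, and all names are hypothetical.

```python
def em_abundances(read_compat, isoforms, iterations=100):
    """Estimate relative isoform abundances from ambiguous read assignments.

    read_compat: one tuple per read, listing the isoforms the read is
                 compatible with (assumed to be known from the alignment).
    isoforms:    list of all isoform names.
    """
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(iterations):
        expected = {iso: 0.0 for iso in isoforms}
        # E-step: split each read among its compatible isoforms,
        # in proportion to the current abundance estimates.
        for compatible in read_compat:
            weight = sum(theta[iso] for iso in compatible)
            for iso in compatible:
                expected[iso] += theta[iso] / weight
        # M-step: new abundances are the normalized expected read counts.
        total = sum(expected.values())
        theta = {iso: expected[iso] / total for iso in isoforms}
    return theta


# Hypothetical gene with two isoforms: 3 reads unique to A, 1 unique to B,
# and 4 reads that fall in exons shared by both.
reads = [("A",)] * 3 + [("B",)] + [("A", "B")] * 4
print(em_abundances(reads, ["A", "B"]))
```

In this toy case the shared reads end up split in the same 3:1 ratio as the unique reads, so the estimates approach 0.75 for A and 0.25 for B.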

3 Richard Christen, Université de Nice

Analyses of microbial biodiversity through massive sequencing of ribosomal RNA sequences. Results & problems. - Christen's speech

Analysis of PCR amplicons of ribosomal RNA sequences (RNA or DNA) is the standard method for analyzing microbial diversity (archaea, bacteria and eukaryota). Their direct, massive sequencing (Roche 454) has replaced Sanger sequencing of clone libraries. The result is millions of sequences within weeks, instead of hundreds of sequences after several months. Many projects have thus analyzed the microbial composition of environments (water, soil, digestive tract...). The published results all show a much greater diversity than previously estimated. But many uncertainties remain.

The discussion will cover current and possible answers to these questions.

4 Valentina Boeva, Institut Curie, Paris

Prediction of transcription factor binding sites from ChIP-Seq data through de novo TFBS motif discovery. - Boeva's speech

Next-generation sequencing technologies have enabled genome-wide identification of binding sites of DNA-associated proteins. Recently, a number of applications predicting binding sites from ChIP-Seq data have been published [1-4]. One of the major problems in this kind of approach is choosing the threshold for DNA tag coverage. Generally, the threshold is selected using a False Discovery Rate estimate, obtained either by Monte Carlo simulation, by using data from a control experiment, or by using the Poisson distribution for tag density. Still, our experience showed that regions whose DNA tag coverage is too low to pass the selection, and which are thus discarded by most tools, very often contain binding site motif occurrences right in the area of top peak coverage. Since the top coverage area is rather short, the probability that a predicted binding site motif is found there by chance is extremely small as well. The observed number of such regions is much greater than expected, so we conclude that some of these regions should be included in the final peak selection. We propose an algorithm which serves two ends in ChIP-Seq data analysis: de novo binding site motif identification and binding site extraction without explicit threshold selection. First, one chooses a set of peaks with high DNA tag coverage. Then, one identifies de novo motifs in the top regions of those peaks. Next, using the extracted PSSMs, one selects peaks with lower DNA tag coverage which contain motifs in their central area. A user-defined threshold is set for the total number of expected false positive hits among the selected peaks. The algorithm is implemented as a Java package, MICSA (Motif Identification for ChIP-Seq Analysis), which is available at our website http://bioinfo-out.curie.fr/projects/micsa/ . The MICSA package was tested on real data for the oncogenic transcription factor EWS-FLI1. Through the comparison with transcriptomic data, dozens of putative direct targets of EWS-FLI1 were discovered.
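The third step of the procedure described above (keeping low-coverage peaks that carry a motif near their summit) can be sketched as follows. This is a minimal Python illustration and not the MICSA implementation, which is a Java package: the function names, the fixed score cutoff, the uniform background and the toy motif are all assumptions. In particular, the abstract controls the expected number of false positives, whereas this sketch uses a plain score threshold for brevity.

```python
import math

BASES = "ACGT"


def log_odds_pssm(count_matrix, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log-odds scoring matrix."""
    pssm = []
    for column in count_matrix:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pssm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pssm


def best_hit(sequence, pssm):
    """Return (score, offset) of the best motif placement, or None."""
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(pssm[i][sequence[start + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


def rescue_peaks(peaks, pssm, half_window, min_score):
    """Keep peaks whose central window around the summit contains a motif hit.

    peaks: list of (name, sequence, summit_index) tuples (hypothetical layout).
    """
    kept = []
    for name, sequence, summit in peaks:
        lo = max(0, summit - half_window)
        hi = min(len(sequence), summit + half_window + 1)
        hit = best_hit(sequence[lo:hi], pssm)
        if hit is not None and hit[0] >= min_score:
            kept.append(name)
    return kept


# Toy motif "GGAA", as if learned from ten high-coverage peaks.
counts = [
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]
pssm = log_odds_pssm(counts)
low_coverage_peaks = [
    ("p1", "TTTTGGAATTTT", 5),  # motif sits at the summit
    ("p2", "GGAATTTTTTTT", 8),  # motif is far from the summit
]
print(rescue_peaks(low_coverage_peaks, pssm, half_window=3, min_score=5.0))
```

Restricting the scan to a short window around the summit is what keeps the chance of a spurious motif hit small, which is the argument the abstract makes for accepting such peaks.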

  1. A. Valouev et al. (2008) Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data, Nature Methods, 5(9):829-834.
  2. H. Ji et al. (2008) An integrated software system for analyzing ChIP-chip and ChIP-seq data, Nature Biotechnology, 26(11):1293-1300.
  3. D.A. Nix et al. (2008) Empirical methods for controlling false positives and estimating confidence in ChIP-Seq peaks, BMC Bioinformatics, 9:523.
  4. A. Fejes et al. (2008) FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology, Bioinformatics, 24(15):1729-1730.

5 Jean-Marc Aury, Centre National de Séquençage, Evry

New sequencing technologies at Genoscope: genome assembly and annotation - Aury's speech

New high-throughput sequencing technologies can generate several gigabases of sequence per week. Sequencing of complete 'large' genomes is accelerating, and previously used methods must be adapted or revised. Genoscope, as the national sequencing center, is involved in several de novo whole-genome sequencing projects. The talk will detail the methods put in place at Genoscope for the assembly and annotation of these large eukaryotic genomes.

6 Hughes Richard, Univ. d'Evry

Detection and Annotation of Alternative splicing events with RNA-Seq data - Richard's speech

Abstract missing

7 Gregory Kucherov, CNRS/LIFL and INRIA, Lille

Seed design framework for mapping AB SOLiD reads - Kucherov's speech

The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications. We propose a rigorous and flexible algorithmic solution to mapping SOLiD color-space reads to a reference genome. The solution relies on an advanced method of seed design that uses, on the one hand, a faithful probabilistic model of read matches and, on the other hand, a novel seeding principle especially adapted to read mapping. Our method can handle both lossy and lossless frameworks and is able to distinguish, at the level of seed design, between SNPs and reading errors. We illustrate our approach by several seed designs and demonstrate their efficiency. (Joint work with Laurent Noé and Marta Girdea)
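The distinction between SNPs and reading errors comes from the color-space encoding itself: each color encodes a pair of adjacent bases, so a single-base SNP changes two adjacent colors, while an isolated reading error changes one. The sketch below shows the encoding and a simple spaced-seed lookup. The seed pattern "11001" is an arbitrary example chosen for this illustration and is not one of the designs from the talk; the leading primer base of real SOLiD reads is omitted.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(dna):
    """Encode a DNA string in color space (one color per adjacent base pair)."""
    return "".join(str(CODE[a] ^ CODE[b]) for a, b in zip(dna, dna[1:]))


def mismatch_positions(x, y):
    """Indices at which two equal-length color strings differ."""
    return [i for i, (a, b) in enumerate(zip(x, y)) if a != b]


def seed_hits(read, reference, seed):
    """Reference offsets where the read start matches at every '1' of the seed.

    Positions marked '0' in the seed are "don't care" positions.
    Assumes the read is at least as long as the seed.
    """
    care = [i for i, s in enumerate(seed) if s == "1"]
    span = len(seed)
    return [
        pos
        for pos in range(len(reference) - span + 1)
        if all(read[i] == reference[pos + i] for i in care)
    ]


reference = to_colors("ACGTTGCAAGCT")
exact_read = to_colors("CAAGCT")   # taken from the reference
snp_read = to_colors("CAATCT")     # same locus with a G->T SNP

# A SNP shows up as two adjacent color mismatches.
print(mismatch_positions(exact_read, snp_read))

# A contiguous seed misses the SNP read; a seed with two adjacent
# "don't care" positions still finds it.
print(seed_hits(snp_read, reference, "11111"))
print(seed_hits(snp_read, reference, "11001"))
```

Placing the don't-care positions in adjacent pairs is what makes a seed tolerant to SNPs rather than to isolated color errors, which gives an intuition for why seed shape matters in this setting.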

8 Gunnar Raetsch, Max Planck Institute, Tübingen

Machine Learning Methods for RNA-seq-based Transcriptome Reconstruction - Raetsch's speech

My talk focuses on techniques to quantitatively describe transcriptomes measured with current high-throughput sequencing technologies (RNA-Seq). We have developed and applied machine learning-based methods to the various levels of the RNA-Seq data analysis: a) We extended the alignment method QPALMA and combined it with the GenomeMapper short read aligner to align both spliced and unspliced reads with high accuracy, while taking advantage of each read's quality information and computational splice site predictions. b) We advanced methods to de novo predict transcripts based on NGS data. In particular, we extended the gene finding system mGene to take advantage of read alignments to more accurately predict gene structures. c) Moreover, we developed a method, called rQuant, that simultaneously estimates biases inherent in library preparation, sequencing, and read mapping and determines the abundances of given transcripts. I will discuss the machine learning methodology that we are using to solve these problems and also show comparisons of our methods with related techniques.

Author: Eric Rivals <rivals_AT_lirmm.fr>

Date: 2010/03/30 11:17:07