RSCURS: Measuring the bias in codon usage from ribosomal activity
Paulet et al. DNA Research 2017
Summary: We propose to measure codon usage bias transcriptome-wide using high-throughput sequencing data that capture ribosomal activity (i.e., Ribo-seq). Codon usage bias is generally higher in highly expressed genes than in other genes. Because codon usage impacts translation, it is desirable to measure it directly from the location of ribosomes during translation. This is what our new method (and software) does using Ribo-seq data. This page:
- is the companion web-page of a scientific article (see below)
- gives access to a portable, user-friendly computer program (that can be run on any system) and to some data for computing the codon usage bias from ribosome profiling data.
If you use this software please cite:
Ribo-seq enlightens Codon Usage Bias
D. Paulet, A. David, E. Rivals
DNA Research, dsw062. doi: 10.1093/dnares/dsw062, 2017.
1 Codon usage, its bias, and the RSCU measure.
DNA, the blueprint of life, is a chain of nucleotides, while proteins, the major players in the cell, are chains of amino acids.
A gene generally codes for a protein, meaning that part of its DNA sequence can be deciphered by reading successive groups of three nucleotides, called codons, each of which is translated into one amino acid.
With four possible nucleotides, one can form 64 possible codons, but only 20 amino acids need to be encoded. Hence, the genetic code is said to be "degenerate" (in other words, redundant), since up to 6 synonymous codons encode the same amino acid.
The genetic code is shared by all living species, and thus somehow universal, but each species may use it in a slightly different way. Indeed, during the course of evolution a gene may favour some codons among all equivalent codons for a given amino acid.
When genes started to accumulate in the databases in the early 1980s, scientists realised that codon usage is biased: among all possible codons for a given amino acid, codons are not equally used 1. Some are preferred and some are disfavoured. This bias exists in most species, both bacterial and eukaryotic (including humans).
Many statistics were proposed to measure this bias. Sharp et al. introduced the measure called Relative Synonymous Codon Usage (RSCU), which is widely used to measure codon usage bias 2.
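RSCU is computed for each codon of each amino acid. It is a real value between \(0\) and the number of synonymous codons for that amino acid; a value of \(1\) means the codon is used exactly as often as expected if all synonymous codons were used equally.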
Formula: For an amino acid \(i\), let \(n_i\) denote the number of codons that code for amino acid \(i\).
For the \(j\)-th codon of amino acid \(i\), let \(x_{i,j}\) denote the number of occurrences of that codon. Then the RSCU for codon \(j\) of amino acid \(i\) is given by the following formula:
\(RSCU_{i,j} = \frac{ n_i \, x_{i,j} }{ \sum_{k=1}^{n_i} x_{i,k} }\)
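As an illustration of the formula, here is a minimal Python sketch that computes RSCU values from codon counts. It is not the code of the program described on this page: the function name `rscu` and the table `SYNONYMOUS_CODONS` are hypothetical, the table covers only two amino acids of the standard genetic code, and the counts are made up for the example.

```python
from collections import Counter

# Synonymous codon families for two amino acids only (standard genetic code).
# A full table would list all amino acids; this subset is for illustration.
SYNONYMOUS_CODONS = {
    "Lys": ["AAA", "AAG"],
    "Ala": ["GCT", "GCC", "GCA", "GCG"],
}

def rscu(codon_counts, families=SYNONYMOUS_CODONS):
    """Return {codon: RSCU} for every codon listed in `families`."""
    values = {}
    for amino_acid, codons in families.items():
        n_i = len(codons)  # number of synonymous codons for this amino acid
        total = sum(codon_counts.get(c, 0) for c in codons)  # sum over k of x_{i,k}
        for codon in codons:
            if total == 0:
                # The amino acid was never observed: RSCU is undefined.
                values[codon] = float("nan")
            else:
                values[codon] = n_i * codon_counts.get(codon, 0) / total
    return values

# Made-up counts: Lys is biased towards AAG, Ala is used uniformly.
counts = Counter({"AAA": 10, "AAG": 30, "GCT": 5, "GCC": 5, "GCA": 5, "GCG": 5})
rscu_values = rscu(counts)
print(rscu_values["AAA"], rscu_values["AAG"])  # 0.5 1.5
```

With these counts, the four Ala codons each get an RSCU of 1 (no bias), while AAG is over-represented (1.5) and AAA under-represented (0.5) among the Lys codons.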
2 Ribosome profiling or ribosome sequencing
Ribosome sequencing (Ribo-seq) is a sequencing assay that freezes the ribosomes during translation and extracts all the RNA fragments protected by the ribosomes 3. Those fragments are sequenced with a deep sequencing technique, which yields millions of short reads. Those reads are mapped to a reference genome to obtain the positions of ribosomes during translation. For each mapped fragment, it is possible to infer which codon was being translated. The analysis of Ribo-seq data reveals which codons are occupied by ribosomes.
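To make the inference step concrete, the following Python sketch shows one plausible way to go from a mapped read to the codon being translated. It rests on assumptions that are not taken from the program itself: the read is mapped on the mRNA, the translated codon lies a fixed number of nucleotides (`shift`) downstream of the read start, and all positions are 0-based (SAM files report 1-based positions, so 1 would have to be subtracted first). The function name `translated_codon` and the sequence are hypothetical.

```python
def translated_codon(mrna, cds_start, cds_end, read_start, shift):
    """Return (codon_number, codon) for one mapped read, or None.

    Positions are 0-based offsets on the mRNA; cds_end is exclusive.
    codon_number is 1-based (the start codon has number 1).
    """
    site = read_start + shift      # nucleotide assumed to lie in the translated codon
    if site < cds_start or site >= cds_end:
        return None                # the read does not point inside the coding sequence
    codon_index = (site - cds_start) // 3
    start = cds_start + 3 * codon_index
    codon = mrna[start:start + 3]
    if len(codon) < 3:
        return None                # incomplete codon at the end of the sequence
    return codon_index + 1, codon

# Made-up mRNA: a 2-nt 5' UTR followed by a short coding sequence.
mrna = "GG" + "ATGGCTAAAGCCGCAAAGTAA"
result = translated_codon(mrna, cds_start=2, cds_end=len(mrna), read_start=0, shift=12)
print(result)  # (4, 'GCC')
```

Here the read starts at position 0, the shift of 12 points to position 12 of the mRNA, which falls in the fourth codon (GCC) of the coding sequence.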
3 Software and example
We provide a computer program, called RSCU-from-RiboSeq.jar, which was developed in the Java language and should thus run on various operating systems (including Linux, Windows, and MacOS). To run, it requires the Java Runtime Environment (JRE) version 1.7 or higher, which can be freely installed.
This program is freely available (under a CeCILL license) here: RSCU-from-RiboSeq.jar
- New versions
- Input files
  - the reference transcriptome in FASTA format: a file of all mRNA sequences
  - the SAM file containing the Ribo-seq reads mapped on a reference transcriptome
  - the corresponding transcriptome annotation in General Feature Format (GFF): for each sequence of the transcriptome, it tells where the coding sequence is located, which makes it possible to determine which codons the reading frame contains
- Parameters
  - a threshold for the minimal length of considered Open Reading Frames (ORF), given as an integer value
  - a shift to apply to the read mapping position to obtain the position of the translated codon
  - a range of codons to consider for computing the codon usage bias; it is given as a pair of minimal and maximal codon numbers (2 integer values)
  - a prefix name for the output files
- Toy example
Command:
java -jar RSCU-from-RiboSeq.jar MUS_mRNA_selection.fna MUS_SRR1605293_filtered.sam GCF_000001635.23_GRCm38.p3_genomic.gff 240 12 20 200 miniMUS
All the data necessary to run this toy example are available in this compressed archive (48 MB). You need to decompress it (with unzip or similar software) before using the files and running the command.
Details of the command-line parameters:
Parameter : Description
- MUS_mRNA_selection.fna : FASTA file containing the RNA sequences (only one isoform per gene is kept)
- MUS_SRR1605293_filtered.sam : mapping file of Ribo-seq reads (obtained with the CRAC mapper), restricted to a subset of 60 genes
- GCF_000001635.23_GRCm38.p3_genomic.gff : RNA annotation file in GFF format (obtained from the NCBI website)
- 240 : threshold for the minimal length of considered Open Reading Frames (ORF), an integer value
- 12 : position shift applied to each read location to determine the position of the translated codon for that read
- 20 : minimum codon number of the range of codons used for counting mapped reads (counting starts at the 20-th codon)
- 200 : maximum codon number of the range of codons used for counting mapped reads (counting stops at the 200-th codon)
- miniMUS : prefix to name the output files (all output files start with the user-chosen prefix)
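The Python sketch below illustrates how the numeric parameters of the toy example (240, 20 and 200) could act as filters when counting ribosome-occupied codons. It is only an interpretation, not the program's actual logic: it assumes that the ORF length threshold is compared to the coding sequence length in nucleotides, that codon numbers are 1-based, and that both bounds of the codon range are inclusive. The function name `count_codons_in_range`, the sequence and the read positions are hypothetical.

```python
from collections import Counter

def count_codons_in_range(cds, codon_numbers, min_codon, max_codon, min_orf_length):
    """Tally the codons hit by reads, keeping codon numbers in [min_codon, max_codon].

    cds: coding sequence as a string, starting at the start codon.
    codon_numbers: one 1-based codon number per mapped read.
    min_orf_length: compared to len(cds) in nucleotides (an assumption).
    """
    counts = Counter()
    if len(cds) < min_orf_length:
        return counts                      # ORF too short: ignored entirely
    for number in codon_numbers:
        if min_codon <= number <= max_codon:
            start = 3 * (number - 1)
            codon = cds[start:start + 3]
            if len(codon) == 3:            # skip numbers beyond the end of the CDS
                counts[codon] += 1
    return counts

# Made-up coding sequence of 92 codons (276 nt): ATG, 30 x GCT, 60 x AAA, TAA.
cds = "ATG" + "GCT" * 30 + "AAA" * 60 + "TAA"
reads = [1, 5, 25, 25, 40, 90, 250]        # codon number inferred for each read
counts = count_codons_in_range(cds, reads, min_codon=20, max_codon=200, min_orf_length=240)
print(counts)
```

In this example the reads at codons 1 and 5 fall before the 20-th codon and the read at codon 250 lies beyond the 200-th, so only four reads are counted: two on GCT and two on AAA. Counts obtained this way are the kind of input from which an RSCU value per codon can then be computed.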