--- title: "Generating and using Ensembl based annotation packages" author: "Johannes Rainer" graphics: yes package: ensembldb output: BiocStyle::html_document: toc_float: true includes: in_header: ensembldb.bioschemas.html vignette: > %\VignetteIndexEntry{Generating an using Ensembl based annotation packages} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteDepends{ensembldb,EnsDb.Hsapiens.v86,BiocStyle,AnnotationHub,ggbio,Gviz} --- ```{r biocstyle, echo = FALSE, results = "asis", message = FALSE} library(BiocStyle) BiocStyle::markdown() ``` # Introduction The `ensembldb` package provides functions to create and use transcript centric annotation databases/packages. The annotation for the databases are directly fetched from Ensembl 1 using their Perl API. The functionality and data is similar to that of the `TxDb` packages from the `GenomicFeatures` package, but, in addition to retrieve all gene/transcript models and annotations from the database, the `ensembldb` package provides also a filter framework allowing to retrieve annotations for specific entries like genes encoded on a chromosome region or transcript models of lincRNA genes. From version 1.7 on, `EnsDb` databases created by the `ensembldb` package contain also protein annotation data (see Section [11](#org4707794) for the database layout and an overview of available attributes/columns). For more information on the use of the protein annotations refer to the *proteins* vignette. Another main goal of this package is to generate *versioned* annotation packages, i.e. annotation packages that are build for a specific Ensembl release, and are also named according to that (e.g. `EnsDb.Hsapiens.v86` for human gene definitions of the Ensembl code database version 86). This ensures reproducibility, as it allows to load annotations from a specific Ensembl release also if newer versions of annotation packages/releases are available. It also allows to load multiple annotation packages at the same time in order to e.g. compare gene models between Ensembl releases. In the example below we load an Ensembl based annotation package for Homo sapiens, Ensembl version 86. The `EnsDb` object providing access to the underlying SQLite database is bound to the variable name `EnsDb.Hsapiens.v86`. ```{r load-libs, warning=FALSE, message=FALSE} library(EnsDb.Hsapiens.v86) ## Making a "short cut" edb <- EnsDb.Hsapiens.v86 ## print some informations for this package edb ## For what organism was the database generated? organism(edb) ``` ```{r no-network, echo = FALSE, results = "hide"} ## Disable code chunks that require network connection - conditionally ## disable this on Windows only. This is to avoid TIMEOUT errors on the ## Bioconductor Windows build maching (issue #47). use_network <- TRUE ``` # Using `ensembldb` annotation packages to retrieve specific annotations One of the strengths of the `ensembldb` package and the related `EnsDb` databases is its implementation of a filter framework that enables to efficiently extract data sub-sets from the databases. The `ensembldb` package supports most of the filters defined in the `AnnotationFilter` Bioconductor package and defines some additional filters specific to the data stored in `EnsDb` databases. Filters can be passed directly to all methods extracting data from an `EnsDb` (such as `genes`, `transcripts` or `exons`). Alternatively it is possible with the `addFilter` or `filter` functions to add a filter directly to an `EnsDb` which will then be used in all queries on that object. The `supportedFilters` method can be used to get an overview over all supported filter classes, each of them (except the `GRangesFilter`) working on a single column/field in the database. ```{r filters} supportedFilters(edb) ``` These filters can be divided into 3 main filter types: - `IntegerFilter`: filter classes extending this basic object can take a single numeric value as input and support the conditions `==`, `!=`, `>`, `<`, `>=` and `<=`. All filters that work on chromosomal coordinates, such as the `GeneEndFilter` extend `IntegerFilter`. - `CharacterFilter`: filter classes extending this object can take a single or multiple character values as input and allow conditions: `==`, `!=`, `"startsWith"` , `"endsWith"` and `"contains"`. All filters working on IDs extend this class. - `GRangesFilter`: takes a `GRanges` object as input and supports all conditions that `findOverlaps` from the `IRanges` package supports (`"any"`, `"start"`, `"end"`, `"within"`, `"equal"`). Note that these have to be passed using the parameter `type` to the constructor function. The supported filters are: - `EntrezFilter`: allows to filter results based on NCBI Entrezgene identifiers of the genes. - `ExonEndFilter`: filter using the chromosomal end coordinate of exons. - `ExonIdFilter`: filter based on the (Ensembl) exon identifiers. - `ExonRankFilter`: filter based on the rank (index) of an exon within the transcript model. Exons are always numbered from 5' to 3' end of the transcript, thus, also on the reverse strand, the exon 1 is the most 5' exon of the transcript. - `ExonStartFilter`: filter using the chromosomal start coordinate of exons. - `GeneBiotypeFilter`: filter using the gene biotypes defined in the Ensembl database; use the `listGenebiotypes` method to list all available biotypes. - `GeneEndFilter`: filter using the chromosomal end coordinate of gene. - `GeneIdFilter`: filter based on the Ensembl gene IDs. - `GeneNameFilter`: filter based on the names (symbols) of the genes. - `GeneStartFilter`: filter using the chromosomal start coordinate of gene. - `GRangesFilter`: allows to retrieve all features (genes, transcripts or exons) that are either within (setting parameter `type` to "within") or partially overlapping (setting `type` to "any") the defined genomic region/range. Note that, depending on the called method (`genes`, `transcripts` or `exons`) the start and end coordinates of either the genes, transcripts or exons are used for the filter. For methods `exonsBy`, `cdsBy` and `txBy` the coordinates of `by` are used. - `SeqNameFilter`: filter by the name of the chromosomes the genes are encoded on. - `SeqStrandFilter`: filter for the chromosome strand on which the genes are encoded. - `SymbolFilter`: filter on gene symbols; note that no database columns *symbol* is available in an `EnsDb` database and hence the gene name is used for filtering. - `TxBiotypeFilter`: filter on the transcript biotype defined in Ensembl; use the `listTxbiotypes` method to list all available biotypes. - `TxEndFilter`: filter using the chromosomal end coordinate of transcripts. - `TxIdFilter`: filter on the Ensembl transcript identifiers. - `TxNameFilter`: to be compliant with `TxDb` annotation resources from the `GenomicFeatures` package, the `tx_name` database column contains also the transcript IDs and hence the `TxNameFilter` is identical to the `TxIdFilter`. Transcript names (*external name* in the Ensembl databases) are provided in column `"tx_external_name"`. - `TxExternalNameFilter`: filter based on the *transcript names* which are provided by Ensembl with the *external name* attribute (or are listed in the `"transcript_name"` field in GTF files. - `TxStartFilter`: filter using the chromosomal start coordinate of transcripts. In addition to the above listed *DNA-RNA-based* filters, *protein-specific* filters are also available: - `ProtDomIdFilter`, `ProteinDomainIdFilter`: filter by the protein domain ID. - `ProteinDomainSourceFilter`: filter by the source of the protein domain (database or method, e.g. *pfam*). - `ProteinIdFilter`: filter by Ensembl protein ID filters. - `UniprotDbFilter`: filter by the name of the Uniprot database. - `UniprotFilter`: filter by the Uniprot ID. - `UniprotMappingTypeFilter`: filter by the mapping type of Ensembl protein IDs to Uniprot IDs. These can however only be used on `EnsDb` databases that provide protein annotations, i.e. for which a call to `hasProteinData` returns `TRUE`. `EnsDb` databases for more recent Ensembl versions (starting from Ensembl 87) provide also evidence levels for individual transcripts in the `tx_support_level` database column. Such databases support also a `TxSupportLevelFilter` filter to use this columns for filtering. A simple use case for the filter framework would be to get all transcripts for the gene *BCL2L11*. To this end we specify a `GeneNameFilter` with the value *BCL2L11*. As a result we get a `GRanges` object with `start`, `end`, `strand` and `seqname` being the start coordinate, end coordinate, chromosome name and strand for the respective transcripts. All additional annotations are available as metadata columns. Alternatively, by setting `return.type` to "DataFrame", or "data.frame" the method would return a `DataFrame` or `data.frame` object instead of the default `GRanges`. ```{r transcripts} Tx <- transcripts(edb, filter = GeneNameFilter("BCL2L11")) Tx ## as this is a GRanges object we can access e.g. the start coordinates with head(start(Tx)) ## or extract the biotype with head(Tx$tx_biotype) ``` The parameter `columns` of the extractor methods (such as `exons`, `genes` or `transcripts)` allows to specify which database attributes (columns) should be retrieved. The `exons` method returns by default all exon-related columns, the `transcripts` all columns from the transcript database table and the `genes` all from the gene table. Note however that in the example above we got also a column `gene_name` although this column is not present in the transcript database table. By default the methods return also all columns that are used by any of the filters submitted with the `filter` argument (thus, because a `GeneNameFilter` was used, the column `gene_name` is also returned). Setting `returnFilterColumns(edb) <- FALSE` disables this option and only the columns specified by the `columns` parameter are retrieved. Instead of passing a filter *object* to the method it is also possible to provide a filter *expression* written as a `formula`. The `formula` has to be written in the form `~ ` with `` being the field (database column) in the database, `` the condition for the filter object and `` its value. Use the `supportedFilter` method to get the field names corresponding to each filter class. ```{r transcripts-filter-expression} ## Use a filter expression to perform the filtering. transcripts(edb, filter = ~ gene_name == "ZBTB16") ``` Filter expression have to be written as a formula (i.e. starting with a `~`) in the form *column name* followed by the logical condition. Alternatively, `EnsDb` objects can be filtered directly using the `filter` function. In the example below we use the `filter` function to filter the `EnsDb` object and pass that filtered database to the `transcripts` method using the `|>` (pile operator). ```{r transcripts-filter, message = FALSE} edb |> filter(~ symbol == "BCL2" & tx_biotype != "protein_coding") |> transcripts() ``` Adding a filter to an `EnsDb` enables this filter (globally) on all subsequent queries on that object. We could thus filter an `EnsDb` to (virtually) contain only features encoded on chromosome Y. ```{r filter-Y} edb_y <- addFilter(edb, SeqNameFilter("Y")) ## All subsequent filters on that EnsDb will only work on features encoded on ## chromosome Y genes(edb_y) ## Get all lincRNAs on chromosome Y genes(edb_y, filter = ~ gene_biotype == "lincRNA") ``` To get an overview of database tables and available columns the function `listTables` can be used. The method `listColumns` on the other hand lists columns for the specified database table. ```{r list-columns} ## list all database tables along with their columns listTables(edb) ## list columns from a specific table listColumns(edb, "tx") ``` Thus, we could retrieve all transcripts of the biotype *nonsense\_mediated\_decay* (which, according to the definitions by Ensembl are transcribed, but most likely not translated in a protein, but rather degraded after transcription) along with the name of the gene for each transcript. Note that we are changing here the `return.type` to `DataFrame`, so the method will return a `DataFrame` with the results instead of the default `GRanges`. ```{r transcripts-example2} Tx <- transcripts(edb, columns = c(listColumns(edb , "tx"), "gene_name"), filter = TxBiotypeFilter("nonsense_mediated_decay"), return.type = "DataFrame") nrow(Tx) Tx ``` For protein coding transcripts, we can also specifically extract their coding region. In the example below we extract the CDS for all transcripts encoded on chromosome Y. ```{r cdsBy} yCds <- cdsBy(edb, filter = SeqNameFilter("Y")) yCds ``` Using a `GRangesFilter` we can retrieve all features from the database that are either within or overlapping the specified genomic region. In the example below we query all genes that are partially overlapping with a small region on chromosome 11. The filter restricts to all genes for which either an exon or an intron is partially overlapping with the region. ```{r genes-GRangesFilter} ## Define the filter grf <- GRangesFilter(GRanges("11", ranges = IRanges(114129278, 114129328), strand = "+"), type = "any") ## Query genes: gn <- genes(edb, filter = grf) gn ## Next we retrieve all transcripts for that gene so that we can plot them. txs <- transcripts(edb, filter = GeneNameFilter(gn$gene_name)) ``` ```{r granges-zbtb16, message = FALSE, echo = FALSE} plot(3, 3, pch = NA, xlim = c(start(gn), end(gn)), ylim = c(0, length(txs)), yaxt = "n", ylab = "") ## Highlight the GRangesFilter region rect(xleft = start(grf), xright = end(grf), ybottom = 0, ytop = length(txs), col = "red", border = "red") for(i in 1:length(txs)) { current <- txs[i] rect(xleft = start(current), xright = end(current), ybottom = i-0.975, ytop = i-0.125, border = "grey") text(start(current), y = i-0.5, pos = 4, cex = 0.75, labels = current$tx_id) } ``` As we can see, 4 transcripts of the gene ZBTB16 are also overlapping the region. Below we fetch these 4 transcripts. Note, that a call to `exons` will not return any features from the database, as no exon is overlapping with the region. ```{r transcripts-GRangesFilter} transcripts(edb, filter = grf) ``` The `GRangesFilter` supports also `GRanges` defining multiple regions and a query will return all features overlapping any of these regions. Besides using the `GRangesFilter` it is also possible to search for transcripts or exons overlapping genomic regions using the `exonsByOverlaps` or `transcriptsByOverlaps` known from the `GenomicFeatures` package. Note that the implementation of these methods for `EnsDb` objects supports also to use filters to further fine-tune the query. The functions `listGenebiotypes` and `listTxbiotypes` can be used to get an overview of allowed/available gene and transcript biotype ```{r biotypes} ## Get all gene biotypes from the database. The GeneBiotypeFilter ## allows to filter on these values. listGenebiotypes(edb) ## Get all transcript biotypes from the database. listTxbiotypes(edb) ``` Data can be fetched in an analogous way using the `exons` and `genes` methods. In the example below we retrieve `gene_name`, `entrezid` and the `gene_biotype` of all genes in the database which names start with "BCL2". ```{r genes-BCL2} ## We're going to fetch all genes which names start with BCL. BCLs <- genes(edb, columns = c("gene_name", "entrezid", "gene_biotype"), filter = GeneNameFilter("BCL", condition = "startsWith"), return.type = "DataFrame") nrow(BCLs) BCLs ``` Sometimes it might be useful to know the length of genes or transcripts (i.e. the total sum of nucleotides covered by their exons). Below we calculate the mean length of transcripts from protein coding genes on chromosomes X and Y as well as the average length of snoRNA, snRNA and rRNA transcripts encoded on these chromosomes. For the first query we combine two `AnnotationFilter` objects using an `AnnotationFilterList` object, in the second we define the query using a filter expression. ```{r example-AnnotationFilterList} ## determine the average length of snRNA, snoRNA and rRNA genes encoded on ## chromosomes X and Y. mean(lengthOf(edb, of = "tx", filter = AnnotationFilterList( GeneBiotypeFilter(c("snRNA", "snoRNA", "rRNA")), SeqNameFilter(c("X", "Y"))))) ## determine the average length of protein coding genes encoded on the same ## chromosomes. mean(lengthOf(edb, of = "tx", filter = ~ gene_biotype == "protein_coding" & seq_name %in% c("X", "Y"))) ``` Not unexpectedly, transcripts of protein coding genes are longer than those of snRNA, snoRNA or rRNA genes. At last we extract the first two exons of each transcript model from the database. ```{r example-first-two-exons} ## Extract all exons 1 and (if present) 2 for all genes encoded on the ## Y chromosome exons(edb, columns = c("tx_id", "exon_idx"), filter = list(SeqNameFilter("Y"), ExonRankFilter(3, condition = "<"))) ``` # Extracting gene/transcript/exon models for RNASeq feature counting For the feature counting step of an RNAseq experiment, the gene or transcript models (defined by the chromosomal start and end positions of their exons) have to be known. To extract these from an Ensembl based annotation package, the `exonsBy`, `genesBy` and `transcriptsBy` methods can be used in an analogous way as in `TxDb` packages generated by the `GenomicFeatures` package. However, the `transcriptsBy` method does not, in contrast to the method in the `GenomicFeatures` package, allow to return transcripts by "cds". While the annotation packages built by the `ensembldb` contain the chromosomal start and end coordinates of the coding region (for protein coding genes) they do not assign an ID to each CDS. A simple use case is to retrieve all genes encoded on chromosomes X and Y from the database. ```{r transcriptsBy-X-Y} TxByGns <- transcriptsBy(edb, by = "gene", filter = SeqNameFilter(c("X", "Y"))) TxByGns ``` Since Ensembl contains also definitions of genes that are on chromosome variants (supercontigs), it is advisable to specify the chromosome names for which the gene models should be returned. In a real use case, we might thus want to retrieve all genes encoded on the *standard* chromosomes. In addition it is advisable to use a `GeneIdFilter` to restrict to Ensembl genes only, as also *LRG* (Locus Reference Genomic) genes2 are defined in the database, which are partially redundant with Ensembl genes. ```{r exonsBy-RNAseq, message = FALSE, eval = FALSE} ## will just get exons for all genes on chromosomes 1 to 22, X and Y. ## Note: want to get rid of the "LRG" genes!!! EnsGenes <- exonsBy(edb, by = "gene", filter = AnnotationFilterList( SeqNameFilter(c(1:22, "X", "Y")), GeneIdFilter("ENSG", "startsWith"))) ``` The code above returns a `GRangesList` that can be used directly as an input for the `summarizeOverlaps` function from the `GenomicAlignments` package 3. Alternatively, the above `GRangesList` can be transformed to a `data.frame` in *SAF* format that can be used as an input to the `featureCounts` function of the `Rsubread` package 4. ```{r toSAF-RNAseq, message = FALSE, eval = FALSE} ## Transforming the GRangesList into a data.frame in SAF format EnsGenes.SAF <- toSAF(EnsGenes) ``` Note that the ID by which the `GRangesList` is split is used in the SAF formatted `data.frame` as the `GeneID`. In the example below this would be the Ensembl gene IDs, while the start, end coordinates (along with the strand and chromosomes) are those of the the exons. Also functions from the `GenomicFeatures` package can be applied to `EnsDb` databases, such as the `exonicParts` function to extract a `GRanges` of non-overlapping exon parts which can be used in the `DEXSeq` package. ```{r disjointExons, message = FALSE, eval = FALSE} ## Create a GRanges of non-overlapping exon parts. edb_sub <- filter(edb, filter = AnnotationFilterList( SeqNameFilter(c(1:22, "X", "Y")), GeneIdFilter("ENSG%", "startsWith"))) DJE <- exonicParts(edb_sub) ``` # Retrieving sequences for gene/transcript/exon models The methods to retrieve exons, transcripts and genes (i.e. `exons`, `transcripts` and `genes`) return by default `GRanges` objects that can be used to retrieve sequences using the `getSeq` method e.g. from BSgenome packages. The basic workflow is thus identical to the one for `TxDb` packages, however, it is not straight forward to identify the BSgenome package with the matching genomic sequence. Most BSgenome packages are named according to the genome build identifier used in UCSC which does not (always) match the genome build name used by Ensembl. Using the Ensembl version provided by the `EnsDb`, the correct genomic sequence can however be retrieved easily from the `AnnotationHub` using the `getGenomeTwoBitFile`. If no 2bit file matching the Ensembl version is available, the function tries to identify a file with the correct genome build from the *closest* Ensembl release and returns that instead. In the code block below we retrieve first the `TwoBitFile` with the genomic DNA sequence, extract the genomic start and end coordinates for all genes defined in the package, subset to genes encoded on sequences available in the `TwoBitFile` and extract all of their sequences. Note: these sequences represent the sequence between the chromosomal start and end coordinates of the gene. ```{r transcript-sequence-AnnotationHub, message = FALSE, eval = FALSE } library(EnsDb.Hsapiens.v86) edb <- EnsDb.Hsapiens.v86 ## Get the TwoBit with the genomic sequence matching the Ensembl version ## using the AnnotationHub package. dna <- ensembldb:::getGenomeTwoBitFile(edb) ## Get start/end coordinates of all genes. genes <- genes(edb) ## Subset to all genes that are encoded on chromosomes for which ## we do have DNA sequence available. genes <- genes[seqnames(genes) %in% seqnames(seqinfo(dna))] ## Get the gene sequences, i.e. the sequence including the sequence of ## all of the gene's exons and introns. geneSeqs <- getSeq(dna, genes) ``` To retrieve the (exonic) sequence of transcripts (i.e. without introns) we can use directly the `extractTranscriptSeqs` method defined in the `GenomicFeatures` on the `EnsDb` object, eventually using a filter to restrict the query. ```{r transcript-sequence-extractTranscriptSeqs, message = FALSE, eval = FALSE } ## get all exons of all transcripts encoded on chromosome Y yTx <- exonsBy(edb, filter = SeqNameFilter("Y")) ## Retrieve the sequences for these transcripts from the TwoBitile. library(GenomicFeatures) yTxSeqs <- extractTranscriptSeqs(dna, yTx) yTxSeqs ## Extract the sequences of all transcripts encoded on chromosome Y. yTx <- extractTranscriptSeqs(dna, edb, filter = SeqNameFilter("Y")) ## Along these lines, we could use the method also to retrieve the coding sequence ## of all transcripts on the Y chromosome. cdsY <- cdsBy(edb, filter = SeqNameFilter("Y")) extractTranscriptSeqs(dna, cdsY) ``` Next we retrieve transcript sequences from genes encoded on chromosome Y using the `BSGenome` package for the human genome. Ensembl version 86 based on the `GRCh38` genome build and we thus load the corresponding `BSGenome` package. ```{r extractTranscriptSeqs-BSGenome, warning = FALSE, message = FALSE } library(BSgenome.Hsapiens.NCBI.GRCh38) bsg <- BSgenome.Hsapiens.NCBI.GRCh38 ## Get the genome version unique(genome(bsg)) unique(genome(edb)) ## Extract the full transcript sequences. yTxSeqs <- extractTranscriptSeqs( bsg, exonsBy(edb, "tx", filter = SeqNameFilter("Y"))) yTxSeqs ## Extract just the CDS Test <- cdsBy(edb, "tx", filter = SeqNameFilter("Y")) yTxCds <- extractTranscriptSeqs( bsg, cdsBy(edb, "tx", filter = SeqNameFilter("Y"))) yTxCds ``` # Integrating annotations from Ensembl based `EnsDb` packages with UCSC based annotations Sometimes it might be useful to combine (Ensembl based) annotations from `EnsDb` packages/objects with annotations from other Bioconductor packages, that might base on UCSC annotations. To support such an integration of annotations, the `ensembldb` packages implements the `seqlevelsStyle` and `seqlevelsStyle<-` from the `GenomeInfoDb` package that allow to change the style of chromosome naming. Thus, sequence/chromosome names other than those used by Ensembl can be used in, and are returned by, the queries to `EnsDb` objects as long as a mapping for them is provided by the `GenomeInfoDb` package (which provides a mapping mostly between UCSC, NCBI and Ensembl chromosome names for the *main* chromosomes). In the example below we change the seqnames style to UCSC. ```{r seqlevelsStyle, message = FALSE } ## Change the seqlevels style form Ensembl (default) to UCSC: seqlevelsStyle(edb) <- "UCSC" ## Now we can use UCSC style seqnames in SeqNameFilters or GRangesFilter: genesY <- genes(edb, filter = ~ seq_name == "chrY") ## The seqlevels of the returned GRanges are also in UCSC style seqlevels(genesY) ``` Note that in most instances no mapping is available for sequences not corresponding to the main chromosomes (i.e. contigs, patched chromosomes etc). What is returned in cases in which no mapping is available can be specified with the global `ensembldb.seqnameNotFound` option. By default (with `ensembldb.seqnameNotFound` set to "ORIGINAL"), the original seqnames (i.e. the ones from Ensembl) are returned. With `ensembldb.seqnameNotFound` "MISSING" each time a seqname can not be found an error is thrown. For all other cases (e.g. `ensembldb.seqnameNotFound = NA`) the value of the option is returned. ```{r seqlevelsStyle-2, message = FALSE } seqlevelsStyle(edb) <- "UCSC" ## Getting the default option: getOption("ensembldb.seqnameNotFound") ## Listing all seqlevels in the database. seqlevels(edb)[1:30] ## Setting the option to NA, thus, for each seqname for which no mapping is available, ## NA is returned. options(ensembldb.seqnameNotFound=NA) seqlevels(edb)[1:30] ## Resetting the option. options(ensembldb.seqnameNotFound = "ORIGINAL") ``` As an alternative, `seqlevelsStyle` for `EnsDb` supports also to define custom renaming. Below we thus define a `data.frame` with new names for some specific chromosomes. A column `"Ensembl"` is expected to contain the original chromosome names and the second column the new names. In the example below we simply want to rename some selected chromosomes, thus we define the mapping `data.frame` and pass that to the `seqlevelsStyle` method. ```{r} mymap <- data.frame( Ensembl = c(1, 21, "X", "Y"), myway = c("one", "twentyone", "chrX", "chrY") ) seqlevelsStyle(edb) <- mymap ``` With that we have now chromosomes 1, 21, X and Y renamed to the new names. Below we list the last 6 values showing the new names for chromosomes X and Y. ```{r} tail(seqlevels(edb)) ``` At last changing the seqname style to the default value `"Ensembl"`. ```{r seqlevelsStyle-restore } seqlevelsStyle(edb) <- "Ensembl" ``` # Interactive annotation lookup using the `shiny` web app In addition to the `genes`, `transcripts` and `exons` methods it is possibly to search interactively for gene/transcript/exon annotations using the internal, `shiny` based, web application. The application can be started with the `runEnsDbApp()` function. The search results from this app can also be returned to the R workspace either as a `data.frame` or `GRanges` object. # Plotting gene/transcript features using `ensembldb` and `Gviz` and `ggbio` The `Gviz` package provides functions to plot genes and transcripts along with other data on a genomic scale. Gene models can be provided either as a `data.frame`, `GRanges`, `TxDB` database, can be fetched from biomart and can also be retrieved from `ensembldb`. Below we generate a `GeneRegionTrack` fetching all transcripts from a certain region on chromosome Y. Note that if we want in addition to work also with BAM files that were aligned against DNA sequences retrieved from Ensembl or FASTA files representing genomic DNA sequences from Ensembl we should change the `ucscChromosomeNames` option from `Gviz` to `FALSE` (i.e. by calling `options(ucscChromosomeNames = FALSE)`). This is not necessary if we just want to retrieve gene models from an `EnsDb` object, as the `ensembldb` package internally checks the `ucscChromosomeNames` option and, depending on that, maps Ensembl chromosome names to UCSC chromosome names. ```{r gviz-plot, message=FALSE } ## Loading the Gviz library library(Gviz) library(EnsDb.Hsapiens.v86) edb <- EnsDb.Hsapiens.v86 ## Retrieving a Gviz compatible GRanges object with all genes ## encoded on chromosome Y. gr <- getGeneRegionTrackForGviz(edb, chromosome = "Y", start = 20400000, end = 21400000) ## Define a genome axis track gat <- GenomeAxisTrack() ## We have to change the ucscChromosomeNames option to FALSE to enable Gviz usage ## with non-UCSC chromosome names. options(ucscChromosomeNames = FALSE) plotTracks(list(gat, GeneRegionTrack(gr))) options(ucscChromosomeNames = TRUE) ``` Above we had to change the option `ucscChromosomeNames` to `FALSE` in order to use it with non-UCSC chromosome names. Alternatively, we could however also change the `seqnamesStyle` of the `EnsDb` object to `UCSC`. Note that we have to use now also chromosome names in the *UCSC style* in the `SeqNameFilter` (i.e. "chrY" instead of "Y"). ```{r message=FALSE } seqlevelsStyle(edb) <- "UCSC" ## Retrieving the GRanges objects with seqnames corresponding to UCSC chromosome names. gr <- getGeneRegionTrackForGviz(edb, chromosome = "chrY", start = 20400000, end = 21400000) seqnames(gr) ## Define a genome axis track gat <- GenomeAxisTrack() plotTracks(list(gat, GeneRegionTrack(gr))) ``` We can also use the filters from the `ensembldb` package to further refine what transcripts are fetched, like in the example below, in which we create two different gene region tracks, one for protein coding genes and one for lincRNAs. ```{r gviz-separate-tracks, message=FALSE, warning=FALSE } protCod <- getGeneRegionTrackForGviz(edb, chromosome = "chrY", start = 20400000, end = 21400000, filter = GeneBiotypeFilter("protein_coding")) lincs <- getGeneRegionTrackForGviz(edb, chromosome = "chrY", start = 20400000, end = 21400000, filter = GeneBiotypeFilter("lincRNA")) plotTracks(list(gat, GeneRegionTrack(protCod, name = "protein coding"), GeneRegionTrack(lincs, name = "lincRNAs")), transcriptAnnotation = "symbol") ## At last we change the seqlevels style again to Ensembl seqlevelsStyle <- "Ensembl" ``` Alternatively, we can also use `ggbio` for plotting. For `ggbio` we can directly pass the `EnsDb` object along with optional filters (or as in the example below a filter expression as a `formula`). ```{r pplot-plot, message = FALSE, eval = FALSE } library(ggbio) ## Create a plot for all transcripts of the gene SKA2 autoplot(edb, ~ gene_name == "SKA2") ``` To plot the genomic region and plot genes from both strands we can use a `GRangesFilter`. ```{r pplot-plot-2, message = FALSE, eval = FALSE } ## Get the chromosomal region in which the gene is encoded ska2 <- genes(edb, filter = ~ gene_name == "SKA2") strand(ska2) <- "*" autoplot(edb, GRangesFilter(ska2), names.expr = "gene_name") ``` # Using `EnsDb` objects in the `AnnotationDbi` framework Most of the methods defined for objects extending the basic annotation package class `AnnotationDbi` are also defined for `EnsDb` objects (i.e. methods `columns`, `keytypes`, `keys`, `mapIds` and `select`). While these methods can be used analogously to basic annotation packages, the implementation for `EnsDb` objects also support the filtering framework of the `ensembldb` package. In the example below we first evaluate all the available columns and keytypes in the database and extract then the gene names for all genes encoded on chromosome X. ```{r AnnotationDbi, message = FALSE } library(EnsDb.Hsapiens.v86) edb <- EnsDb.Hsapiens.v86 ## List all available columns in the database. columns(edb) ## Note that these do *not* correspond to the actual column names ## of the database that can be passed to methods like exons, genes, ## transcripts etc. These column names can be listed with the listColumns ## method. listColumns(edb) ## List all of the supported key types. keytypes(edb) ## Get all gene ids from the database. gids <- keys(edb, keytype = "GENEID") length(gids) ## Get all gene names for genes encoded on chromosome Y. gnames <- keys(edb, keytype = "GENENAME", filter = SeqNameFilter("Y")) head(gnames) ``` In the next example we retrieve specific information from the database using the `select` method. First we fetch all transcripts for the genes *BCL2* and *BCL2L11*. In the first call we provide the gene names, while in the second call we employ the filtering system to perform a more fine-grained query to fetch only the protein coding transcripts for these genes. ```{r select, message = FALSE, warning=FALSE } ## Use the /standard/ way to fetch data. select(edb, keys = c("BCL2", "BCL2L11"), keytype = "GENENAME", columns = c("GENEID", "GENENAME", "TXID", "TXBIOTYPE")) ## Use the filtering system of ensembldb select(edb, keys = ~ gene_name %in% c("BCL2", "BCL2L11") & tx_biotype == "protein_coding", columns = c("GENEID", "GENENAME", "TXID", "TXBIOTYPE")) ``` Finally, we use the `mapIds` method to establish a mapping between ids and values. In the example below we fetch transcript ids for the two genes from the example above. ```{r mapIds, message = FALSE } ## Use the default method, which just returns the first value for multi mappings. mapIds(edb, keys = c("BCL2", "BCL2L11"), column = "TXID", keytype = "GENENAME") ## Alternatively, specify multiVals="list" to return all mappings. mapIds(edb, keys = c("BCL2", "BCL2L11"), column = "TXID", keytype = "GENENAME", multiVals = "list") ## And, just like before, we can use filters to map only to protein coding transcripts. mapIds(edb, keys = list(GeneNameFilter(c("BCL2", "BCL2L11")), TxBiotypeFilter("protein_coding")), column = "TXID", multiVals = "list") ``` Note that, if the filters are used, the ordering of the result does no longer match the ordering of the genes. # Important notes These notes might explain eventually unexpected results (and, more importantly, help avoiding them): - The ordering of the results returned by the `genes`, `exons`, `transcripts` methods can be specified with the `order.by` parameter. The ordering of the results does however **not** correspond to the ordering of values in submitted filter objects. The exception is the `select` method. If a character vector of values or a single filter is passed with argument `keys` the ordering of results of this method matches the ordering of the key values or the values of the filter. - Results of `exonsBy`, `transcriptsBy` are always ordered by the `by` argument. - The CDS provided by `EnsDb` objects **always** includes both, the start and the stop codon. - Transcripts with multiple CDS are at present not supported by `EnsDb`. - At present, `EnsDb` support only genes/transcripts for which all of their exons are encoded on the same chromosome and the same strand. - Since a single Ensembl gene ID might be mapped to multiple NCBI Entrezgene IDs methods such as `genes`, `transcripts` etc return a `list` in the `"entrezid"` column of the resulting result object. # Getting or building `EnsDb` databases/packages Some of the code in this section is not supposed to be automatically executed when the vignette is built, as this would require a working installation of the Ensembl Perl API, which is not expected to be available on each system. Also, building `EnsDb` from alternative sources, like GFF or GTF files takes some time and thus also these examples are not directly executed when the vignette is build. ## Getting `EnsDb` databases Some `EnsDb` databases are available as `R` packages from Bioconductor and can be simply installed with the `install` function from the `BiocManager` package. The name of such annotation packages starts with *EnsDb* followed by the abbreviation of the organism and the Ensembl version on which the annotation bases. `EnsDb.Hsapiens.v86` provides thus an `EnsDb` database for homo sapiens with annotations from Ensembl version 86. Since Bioconductor version 3.5 `EnsDb` databases can also be retrieved directly from `AnnotationHub`. ```{r AnnotationHub-query, message = FALSE, eval = use_network} library(AnnotationHub) ## Load the annotation resource. ah <- AnnotationHub() ## Query for all available EnsDb databases query(ah, "EnsDb") ``` We can simply fetch one of the databases. ```{r AnnotationHub-query-2, message = FALSE, eval = use_network} ahDb <- query(ah, pattern = c("Xiphophorus Maculatus", "EnsDb", 87)) ## What have we got ahDb ``` Fetch the `EnsDb` and use it. ```{r AnnotationHub-fetch, message = FALSE, eval = FALSE } ahEdb <- ahDb[[1]] ## retriebe all genes gns <- genes(ahEdb) ``` We could even make an annotation package from this `EnsDb` object using the `makeEnsembldbPackage` and passing `dbfile(dbconn(ahEdb))` as `ensdb` argument. ## Building annotation packages ### Directly from Ensembl databases The `fetchTablesFromEnsembl` function uses the Ensembl Perl API to retrieve the required annotations from an Ensembl database (e.g. from the main site *ensembldb.ensembl.org*). Thus, to use this functionality to build databases, the Ensembl Perl API needs to be installed (see 5 for details). Below we create an `EnsDb` database by fetching the required data directly from the Ensembl core databases. The `makeEnsembldbPackage` function is then used to create an annotation package from this `EnsDb` containing all human genes for Ensembl version 75. ```{r edb-from-ensembl, message = FALSE, eval = FALSE } library(ensembldb) ## get all human gene/transcript/exon annotations from Ensembl (75) ## the resulting tables will be stored by default to the current working ## directory fetchTablesFromEnsembl(75, species = "human") ## These tables can then be processed to generate a SQLite database ## containing the annotations (again, the function assumes the required ## txt files to be present in the current working directory) DBFile <- makeEnsemblSQLiteFromTables() ## and finally we can generate the package makeEnsembldbPackage(ensdb = DBFile, version = "0.99.12", maintainer = "Johannes Rainer ", author = "J Rainer") ``` The generated package can then be build using `R CMD build EnsDb.Hsapiens.v75` and installed with `R CMD INSTALL EnsDb.Hsapiens.v75*`. Note that we could directly generate an `EnsDb` instance by loading the database file, i.e. by calling `edb <- EnsDb(DBFile)` and work with that annotation object. To fetch and build annotation packages for plant genomes (e.g. arabidopsis thaliana), the *Ensembl genomes* should be specified as a host, i.e. setting `host` to "mysql-eg-publicsql.ebi.ac.uk", `port` to `4157` and `species` to e.g. "arabidopsis thaliana". ### From a GTF or GFF file Alternatively, the `ensDbFromAH`, `ensDbFromGff`, `ensDbFromGRanges` and `ensDbFromGtf` functions allow to build EnsDb SQLite files from a `GRanges` object or GFF/GTF files from Ensembl (either provided as files or *via* `AnnotationHub`). These functions do not depend on the Ensembl Perl API, but require a working internet connection to fetch the chromosome lengths from Ensembl as these are not provided within GTF or GFF files. Also note that protein annotations are usually not available in GTF or GFF files, thus, such annotations will not be included in the generated `EnsDb` database - protein annotations are only available in `EnsDb` databases created with the Ensembl Perl API (such as the ones provided through `AnnotationHub` or as Bioconductor packages). In the next example we create an `EnsDb` database using the `AnnotationHub` package and load also the corresponding genomic DNA sequence matching the Ensembl version. We thus first query the `AnnotationHub` package for all resources available for `Mus musculus` and the Ensembl release 77. Next we create the `EnsDb` object from the appropriate `AnnotationHub` resource. We then use the `getGenomeTwoBitFile` method on the `EnsDb` to directly look up and retrieve the correct or best matching `TwoBitFile` with the genomic DNA sequence. At last we retrieve the sequences of all exons using the `getSeq` method. ```{r gtf-gff-edb, message = FALSE, eval = FALSE } ## Load the AnnotationHub data. library(AnnotationHub) ah <- AnnotationHub() ## Query all available files for Ensembl release 77 for ## Mus musculus. query(ah, c("Mus musculus", "release-77")) ## Get the resource for the gtf file with the gene/transcript definitions. Gtf <- ah["AH28822"] ## Create a EnsDb database file from this. DbFile <- ensDbFromAH(Gtf) ## We can either generate a database package, or directly load the data edb <- EnsDb(DbFile) ## Identify and get the TwoBit object with the genomic DNA sequence matching ## the EnsDb annotation. Dna <- getGenomeTwoBitFile(edb) ## We next retrieve the sequence of all exons on chromosome Y. exons <- exons(edb, filter = SeqNameFilter("Y")) exonSeq <- getSeq(Dna, exons) ``` In the example below we load a `GRanges` containing gene definitions for genes encoded on chromosome Y and generate a `EnsDb` SQLite database from that information. ```{r EnsDb-from-Y-GRanges, message = FALSE, eval = use_network} ## Generate a sqlite database from a GRanges object specifying ## genes encoded on chromosome Y load(system.file("YGRanges.RData", package = "ensembldb")) Y ## Create the EnsDb database file DB <- ensDbFromGRanges(Y, path = tempdir(), version = 75, organism = "Homo_sapiens") ## Load the database edb <- EnsDb(DB) edb ``` Alternatively we can build the annotation database using the `ensDbFromGtf` `ensDbFromGff` functions, that extract most of the required data from a GTF respectively GFF (version 3) file which can be downloaded from Ensembl (e.g. from for human gene definitions from Ensembl version 75; for plant genomes etc, files can be retrieved from ). All information except the chromosome lengths, the NCBI Entrezgene IDs and protein annotations can be extracted from these GTF files. The function also tries to retrieve chromosome length information automatically from Ensembl. Below we create the annotation from a gtf file that we fetch directly from Ensembl. ```{r EnsDb-from-GTF, message = FALSE, eval = FALSE } library(ensembldb) ## the GTF file can be downloaded from ## ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/ gtffile <- "Homo_sapiens.GRCh37.75.gtf.gz" ## generate the SQLite database file DB <- ensDbFromGtf(gtf = gtffile) ## load the DB file directly EDB <- EnsDb(DB) ## alternatively, build the annotation package ## and finally we can generate the package makeEnsembldbPackage(ensdb = DB, version = "0.99.12", maintainer = "Johannes Rainer ", author = "J Rainer") ``` # Database layout The database consists of the following tables and attributes (the layout is also shown in Figure [165](#orgfe7e95a)). Note that the protein-specific annotations might not be available in all `EnsDB` databases (e.g. such ones created with `ensembldb` version < 1.7 or created from GTF or GFF files). - **gene**: all gene specific annotations. - `gene_id`: the Ensembl ID of the gene. - `gene_name`: the name (symbol) of the gene. - `gene_biotype`: the biotype of the gene. - `gene_seq_start`: the start coordinate of the gene on the sequence (usually a chromosome). - `gene_seq_end`: the end coordinate of the gene on the sequence. - `seq_name`: the name of the sequence (usually the chromosome name). - `seq_strand`: the strand on which the gene is encoded. - `seq_coord_system`: the coordinate system of the sequence. - `description`: the description of the gene. - **entrezgene**: mapping of Ensembl genes to NCBI Entrezgene identifiers. Note that this mapping can be a one-to-many mapping. - `gene_id`: the Ensembl gene ID. - `entrezid`: the NCBI Entrezgene ID. - **tx**: all transcript related annotations. Note that while no `tx_name` column is available in this database column, all methods to retrieve data from the database support also this column. The returned values are however the ID of the transcripts. - `tx_id`: the Ensembl transcript ID. - `tx_biotype`: the biotype of the transcript. - `tx_seq_start`: the start coordinate of the transcript. - `tx_seq_end`: the end coordinate of the transcript. - `tx_cds_seq_start`: the start coordinate of the coding region of the transcript (NULL for non-coding transcripts). - `tx_cds_seq_end`: the end coordinate of the coding region of the transcript. - `tx_external_name`: the *external name* of the transcript. - `gc_count`: from Ensembl release 98 on, the **tx** table contains also a column `gc_count` providing the transcript's G-C content expressed as a percentage. - `gene_id`: the gene to which the transcript belongs. `EnsDb` databases for more recent Ensembl releases have also a column `tx_support_level` providing the evidence level for a transcript (1 high evidence, 5 low evidence, NA no evidence calculated). - **exon**: all exon related annotation. - `exon_id`: the Ensembl exon ID. - `exon_seq_start`: the start coordinate of the exon. - `exon_seq_end`: the end coordinate of the exon. - **tx2exon**: provides the n:m mapping between transcripts and exons. - `tx_id`: the Ensembl transcript ID. - `exon_id`: the Ensembl exon ID. - `exon_idx`: the index of the exon in the corresponding transcript, always from 5' to 3' of the transcript. - **chromosome**: provides some information about the chromosomes. - `seq_name`: the name of the sequence/chromosome. - `seq_length`: the length of the sequence. - `is_circular`: whether the sequence in circular. Note that the Ensembl Perl API does not correctly define/return this information (see [issue 86](https://github.com/jorainer/ensembldb/issues/86)) and a value of `0` is provided for all chromosomes. The `seqinfo` method on `EnsDb` objects manually sets `isCircular` to `TRUE` for chromosomes named `"MT"`. - **protein**: provides protein annotation for a (coding) transcript. - `protein_id`: the Ensembl protein ID. - `tx_id`: the transcript ID which CDS encodes the protein. - `protein_sequence`: the peptide sequence of the protein (translated from the transcript's coding sequence after applying eventual RNA editing). - **uniprot**: provides the mapping from Ensembl protein ID(s) to Uniprot ID(s). Not all Ensembl proteins are annotated to Uniprot IDs, also, each Ensembl protein might be mapped to multiple Uniprot IDs. - `protein_id`: the Ensembl protein ID. - `uniprot_id`: the Uniprot ID. - `uniprot_db`: the Uniprot database in which the ID is defined. - `uniprot_mapping_type`: the type of the mapping method that was used to assign the Uniprot ID to an Ensembl protein ID. - **protein\_domain**: provides protein domain annotations and mapping to proteins. - `protein_id`: the Ensembl protein ID on which the protein domain is present. - `protein_domain_id`: the ID of the protein domain (from the protein domain source). - `protein_domain_source`: the source/analysis method in/by which the protein domain was defined (such as pfam etc). - `interpro_accession`: the Interpro accession ID of the protein domain. - `prot_dom_start`: the start position of the protein domain within the protein's sequence. - `prot_dom_end`: the end position of the protein domain within the protein's sequence. - **metadata**: some additional, internal, informations (Genome build, Ensembl version etc). - `name` - `value` - *virtual* columns: - `symbol`: the database does not have such a database column, but it is still possible to use it in the `columns` parameter. This column is *symlinked* to the `gene_name` column. - `tx_name`: similar to the `symbol` column, this column is *symlinked* to the `tx_id` column. The database layout: as already described above, protein related annotations (green) might not be available in each `EnsDb` database. ![img](images/dblayout.png "Database layout.") # Session information ```{r sessionInfo } sessionInfo() ``` # Footnotes 1 2 3 4 5