--- title: "Mapping between genome, transcript and protein coordinates" author: "Johannes Rainer" graphics: yes package: ensembldb output: BiocStyle::html_document: toc_float: true includes: in_header: coordinate-mapping.bioschemas.html vignette: > %\VignetteIndexEntry{Mapping between genome, transcript and protein coordinates} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteDepends{ensembldb,EnsDb.Hsapiens.v86,BiocStyle,Gviz,BSgenome.Hsapiens.NCBI.GRCh38} --- ```{r biocstyle, echo = FALSE, results = "asis", message = FALSE } library(BiocStyle) BiocStyle::markdown() ``` # Introduction Besides retrieving genomic and protein annotations, `ensembldb` provides also functionality to map between protein, transcript and genome coordinates. This mapping relies on annotations of proteins (their sequences) to their encoding transcripts which are stored in `EnsDb` databases. The available functions and their input and result objects are: - `genomeToTranscript`: maps genomic coordinates to coordinates within the transcript(s) encoded at the specified coordinates. The function takes a `GRanges` as input and returns an `IRangesList` of length equal to the length of the input object. Each `IRanges` in the `IRangesList` provides the coordinates within the respective transcript. - `genomeToProtein`: maps genomic coordinates to positions within the sequence of the eventually encoded protein(s) in the genomic region. The function takes a `GRanges` as input and returns an `IRangesList` of the same length, each element (`IRanges`) providing the coordinates within the respective protein's sequence. - `proteinToTranscript`: maps protein-relative coordinates to positions within the encoding transcripts. Takes an `IRanges` as input and returns an `IRangesList` of the same length with each element (`IRanges`) providing the coordinates within the transcript (relative to the first nucleotide of the transcript). - `proteinToGenome`: maps protein-relative coordinates to genomic region(s) encoding the amino acid sequence. The function takes an `IRanges` as input and returns a `list` of either `GRanges` (if Ensembl protein identifiers are provided in the input object) or `GRangesList` (if the protein identifier provided for the input range can be mapped to multiple Ensembl protein IDs). - `transcriptToGenome`: maps coordinates within a transcript to genomic coordinates. Takes an `IRanges` as input and returns a `GRangesList` (within-transcript coordinates can be mapped to several exons and hence genomic locations). The returned `GRangesList` has the same length than the input `IRanges`, with empty `GRanges` for transcripts that can not be found in the database (and thus can not be mapped to the genome) or for `IRanges` that define regions outside of the transcript's sequence. - `transcriptToProtein`: maps regions within a transcript to the respective amino acid residues in the encoded protein's sequence. The function takes an `IRanges` as input and returns an `IRanges` of the same length. - `transcriptToCds`: maps between transcript-relative and CDS-relative coordinates (for protein-coding genes only). - `cdsToTranscript`: maps between CDS-relative and transcript-relative coordinates (for protein-coding genes only). All functions, except `proteinToGenome` and `transcriptToGenome` return `IRanges` with negative coordinates if the mapping failed (e.g. because the identifier is unknown to the database, or if, for mappings to and from protein coordinates, the input coordinates are not within the coding region of a transcript). `proteinToGenome` and `transcriptToGenome` return empty `GRanges` if mappings fail. Each protein encoding transcript is annotated by Ensembl to an unique translation with an assigned Ensembl protein ID. In addition, Ensembl provides a mapping from Uniprot identifiers to Ensembl protein IDs. This is however in many cases an one-to-many mapping, i.e. a single Uniprot ID is assigned to multiple Ensembl protein IDs. As an additional complication, the coding region might not be complete for some transcripts and either their 3' or 5' ends (or both) are not defined (or can not be mapped to the genome). In such cases, the length of the CDS does not match the length of the annotated protein sequence. Reported position mappings between proteins and transcripts might for that particular cases not be correct. In such cases `FALSE` is reported in a column named `"cds_ok"` in the results from functions mapping to and from protein coordinates. The `cdsToTranscript` and `transcriptToCds` functions are helpful to enable the mapping of variants in genes that are usually provided as positions within the gene's (actually transcript's) coding sequence. An example for such a mapping is provided in section *Mapping transcript coordinates to genomic coordinates* further below. Below we load all required libraries and filter the `EnsDb` database containing the annotations from Ensembl release 86 to chromosome X. All genes of the examples in this vignette are encoded on chromosome X and subsetting the `EnsDb` to a single chromosome speeds all database queries considerably up. ```{r load-libs, message = FALSE} library(ensembldb) library(EnsDb.Hsapiens.v86) edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X") ``` # Mapping genomic coordinates to transcript-relative coordinates Below we define a genomic region on chromosome X for which we want to identify the transcripts that are eventually encoded by that position and determine the coordinates of the genomic region within these (relative to their first nucleotide). ```{r genomeToTranscript-define} gnm <- GRanges("X:107716399-107716401") ``` Before we map the coordinates we visualize the genomic region and all genes and transcripts overlapping it. ```{r genomeToTranscript-ex1-plot, message = FALSE } library(Gviz) ## Since we're using Ensembl chromosome names we have to set: options(ucscChromosomeNames = FALSE) ## Define a genome axis track gat <- GenomeAxisTrack(range = gnm) ## Get all genes in that region gnm_gns <- getGeneRegionTrackForGviz(edbx, filter = GRangesFilter(gnm)) gtx <- GeneRegionTrack(gnm_gns, name = "tx", geneSymbol = TRUE, showId = TRUE) ## Generate a higlight track ht <- HighlightTrack(trackList = list(gat, gtx), range = gnm) ## plot the region plotTracks(list(ht)) ``` The genomic position overlaps two transcripts of the gene TSC22D3, encoded on the reverse strand on chromosome X. Below we map this position to the nucleotides within the encoded transcripts with the `genomeToTranscript` function. ```{r genomeToTranscript-ex1-map, message = FALSE} ## Map genomic coordinates to within-transcript coordinates gnm_tx <- genomeToTranscript(gnm, edbx) ``` As a result we get an `IRangesList` object of length 1 (since we defined a single genomic region as input). The `IRanges` in the first element of our result object provides the mapped coordinates within each transcript that is encoded in the specified genomic region. ```{r genomeToTranscript-ex1-object} gnm_tx ``` The genomic region overlapped two transcripts and the transcript-relative coordinates for both are reported. The ID of the transcripts are provided as the `names` of the `IRanges`. The original genomic coordinates are listed in additional metadata columns (which can be accessed with the `mcols` method). Also the ID of the exon encoded at the genomic region and its index/rank within the transcript is available in the `"exon_id"` metadata column. To illustrate mapping of multiple genomic regions, we define below 2 genomic regions: twice the example region from above but once restricting to the reverse and once to the forward strand. ```{r genomeToTranscript-ex2, message = FALSE} gnm_1 <- gnm strand(gnm_1) <- "-" gnm_2 <- gnm strand(gnm_2) <- "+" gnm <- c(gnm_1, gnm_2) genomeToTranscript(gnm, edbx) ``` The result for the first region is the same as above. The result for the second region is an `IRanges` with negative coordinates, since there is no transcript encoded on the forward strand at the genomic position. # Mapping genomic coordinates to protein-relative coordinates We can use the `genomeToProtein` function to map genomic coordinates to within-protein sequence coordinates. In addition to the `genomeToTranscript` function, this function determines whether the transcript at the genomic position encodes a protein, and if so, maps the genomic coordinates to coordinates within the respective protein sequence. To this end, the transcript-relative coordinates for the genomic region have to be located within CDS of the transcript (excluding the stop codon, i.e. last 3 nucleotides of the CDS, since they are not translated into an amino acid). Below we define 4 genomic regions and map them to protein-relative coordinates: the first corresponds to the first 4 nucleotides of the CDS of ENST00000381578, the second to the last nucleotide of the CDS of the same transcript. The third region maps to the last nt before the stop codon of ENST00000381578 and the last region is located within an intron of the same transcript. ```{r genomeToProtein-ex1, message = FALSE} gnm <- GRanges("X", IRanges(start = c(630898, 644636, 644633, 634829), width = c(5, 1, 1, 3))) gnm_prt <- genomeToProtein(gnm, edbx) ``` The resulting object has the length 4, one `IRanges` for each region in the input `GRanges`. The warning messages indicate that not all of the regions could be mapped to within-protein coordinates. We explore now the results for each input region separately. ```{r genomeToProtein-ex1-res1} gnm_prt[[1]] ``` The genomic region could be mapped to positions within the coding regions of 4 different transcripts, each of them being annotated to its own Ensembl protein ID. The input region was mapped to the first 4 nucleotides of each transcripts' CDS and was thus mapped to the amino acid residues 1 and 2 of the encoded protein: the first 3 nucleotides to the first amino acid, the 4th to the second amino acid. The encoding transcript ID, the exon ID, exon rank and the input genomic region are provided as metadata columns in the result `IRanges`. A metadata column `cds_ok` provides the additional information whether the length of each transcripts' CDS matches the length of the encoded protein sequence. This is an important information, as not all protein coding transcripts in Ensembl have complete CDS, either because their 3' or their 5' (or both) ends are incomplete (or could not be mapped/aligned to the genome). Mappings to or from protein coordinates with a `cds_ok` being `FALSE` might not be correct and should be manually evaluated e.g. using the Ensembl genome browser. The second genomic region maps to last nucleotide of the CDS of ENST00000381578, which is however part of the stop codon that is not translated. The coordinates can therefore not be mapped to the protein sequence and an `IRanges` with negative start position is thus returned. ```{r genomeToProtein-ex1-res2} gnm_prt[[2]] ``` The third region can be mapped to the last nucleotide before the stop codon and was thus mapped to the last amino acid of the encoded protein. ```{r genomeToProtein-ex1-res3 } gnm_prt[[3]] ``` This region maps however to coordinates within two transcripts, each with their own translation. Below we retrieve the protein sequences for both protein IDs to evaluate if it corresponds indeed to the last amino acid for the protein encoded by ENST00000381578. ```{r genomeToProtein-ex1-res3-2, message = FALSE} prt <- proteins(edbx, filter = ProteinIdFilter(names(gnm_prt[[3]]))) nchar(prt$protein_sequence) ``` As expected, the position mapped to the last amino acid of the amino acid sequence associated with both protein IDs. In fact, these amino acid sequences are identical. The result for the last region can not be mapped to any transcript-relative coordinates and hence also not to any protein. As a result, an `IRanges` with negative coordinates is returned. ```{r genomeToProtein-ex1-res4} gnm_prt[[4]] ``` # Mapping protein coordinates to transcript coordinates The `proteinToTranscript` method allows to map protein-sequence relative coordinates to the encoding region in the transcript. A protein identifier and the coordinates within the protein sequence have to be provided with an `IRanges` object, the protein identifiers (ideally Ensembl protein IDs or also Uniprot IDs) either provided as `names` of the object, or in one of its metadata columns. The function will first try to find the protein identifiers in the database and, if found, map the provided coordinates to transcript-relative positions. In our first example we retrieve the transcript-relative coordinates of positions 5 to 9 within the amino acid sequence of the gene *GAGE10*. Below we first get the protein ID for this gene from the database and define then the `IRanges` with the protein sequence-internal coordinates. ```{r proteinToTranscript-ex1, message = FALSE} GAGE10 <- proteins(edbx, filter = ~ genename == "GAGE10") GAGE10 ## Define the IRanges object. GAGE10_prt <- IRanges(start = 5, end = 9, names = GAGE10$protein_id) ``` Now we use the `proteinToTranscript` function to map the coordinates. The function also compares the length of the CDS with the length of the encoded protein sequence and, if they are not matching, returns a `FALSE` in the result object's `cds_ok` metadata column. In such cases (i.e. when the CDS of the transcript is incomplete), the returned coordinates could be wrong. ```{r proteinToTranscript-ex1-map, message = FALSE} GAGE10_tx <- proteinToTranscript(GAGE10_prt, edbx) ``` The result is a `list` with the same length as the input `IRanges`, each element representing the mapping of the protein-relative coordinates to positions within all encoding transcripts. Note that the transcript coordinates are relative to the first nucleotide of the 5' UTR, not of the CDS. ```{r proteinToTranscript-ex1-res} GAGE10_tx ``` If Ensembl protein identifiers are used, the mapping between protein- and transcript coordinates will always be 1:1. Many Uniprot identifiers are however annotated to more than one Ensembl protein ID and the resulting `IRanges` for one input region might thus be of length larger than one. Below we define regions in protein sequences identified by Uniprot IDs. In addition, to illustrate a failing mapping, we add a region with an invalid ID. ```{r proteinToTranscript-ex2, message = FALSE} ids <- c("O15266", "Q9HBJ8", "donotexistant") prt <- IRanges(start = c(13, 43, 100), end = c(21, 80, 100)) names(prt) <- ids prt_tx <- proteinToTranscript(prt, edbx, idType = "uniprot_id") ``` The region within the first protein with a Uniprot ID can be mapped to 4 different Ensembl protein IDs and the coordinates are thus mapped to regions within 4 transcripts. ```{r proteinToTranscript-ex2-res1} prt_tx[[1]] ``` The Uniprot identifier for the second region can be mapped to a single Ensembl protein ID and we get thus coordinates within a single transcript. ```{r proteinToTranscript-ex2-res2} prt_tx[[2]] ``` The last identifier can not be mapped to any Ensembl protein, and a region with negative coordinates is thus returned. ```{r proteinToTranscript-ex2-res3} prt_tx[[3]] ``` # Mapping protein coordinates to the genome The `proteinToGenome` function allows to map coordinates within the amino acid sequence of a protein to the corresponding DNA sequence on the genome. A protein identifier and the coordinates of the sequence within its amino acid sequence are required and have to be passed as an `IRanges` object to the function. The protein identifier can either be passed as `names` of this object, or provided in a metadata column (`mcols`). In our first example we map the positions 5 to 9 within the amino acid sequence of the protein *ENSP00000385415* from gene GAGE10 to the genome. ```{r proteinToGenome-gage10-define, message = FALSE} ## Define the IRanges object. GAGE10_prt <- IRanges(start = 5, end = 9, names = "ENSP00000385415") ``` We can now map the protein-relative coordinates to genomic coordinates. By default, `proteinToGenome` assumes the names of the provided `IRanges` object to be Ensembl protein identifiers. ```{r proteinToGenome-gage10-map, message = FALSE} GAGE10_gnm <- proteinToGenome(GAGE10_prt, edbx) ``` `proteinToGenome` returns a `list`, one element for each range in the input `IRanges`. We did only map a single range and hence the result is a `list` of length 1. The `GRanges` object in the first element of `GAGE10_gnm` represents the coordinates of the DNA sequence encoding positions 5 to 9 in the protein. ```{r proteinToGenome-gage10-res} GAGE10_gnm ``` This `GRanges` contains also useful additional information as metadata columns, such as the ID of the encoding transcript (column `"tx_id"`) the ID and index of the exon within the protein region is encoded (columns `"exon_id"` and `"exon_rank"`), the start and end coordinates from the input `IRanges` object (columns `"protein_start"` and `"protein_end"`) and a `logical` indicating whether the length of the encoding transcript's CDS matches the protein sequence length (`"cds_ok"`). Special care should be taken if a `FALSE` is reported in this last column. In such cases the returned genomic coordinates might not be correct and they should be manually checked using the Ensembl genome browser. The reason to use a `list` as a result object and not, e.g. a `GRangesList`, was the one-to-many mappings between Uniprot identifiers and Ensembl protein IDs. To illustrate this, we map positions within 3 proteins identified by their Uniprot identifiers to genomic regions. ```{r proteinToGenome-uniprot-ids, message = FALSE} ## Define the IRanges providing Uniprot IDs. uni_rng <- IRanges(start = c(2, 12, 8), end = c(2, 15, 17), names = c("D6RDZ7", "O15266", "H7C2F2")) ## We have to specify that the IDs are Uniprot IDs uni_gnm <- proteinToGenome(uni_rng, edbx, idType = "uniprot_id") ``` The length of the protein coding region of the encoding transcript for two of the 3 proteins (*D6RDZ7* and *H7C2F2*) do not match the length of the protein sequence. For some transcripts the CDS is not complete (either at the 3', 5' or both ends). Mapped coordinates might not be correct in such cases and it is strongly suggested to manually check the mapped coordinates. The result from the comparison of the protein sequence and the CDS length is provided in the `"cds_ok"` metadata column of the `GRanges` with the genomic coordinates. ```{r proteinToGenome-uniprot-cds_ok} uni_gnm[[3]] ``` Mappings between Uniprot and Ensembl protein IDs can be one-to-many. In such cases `proteinToGenome` returns a `GRangesList` with each element being the coordinates calculated for each annotated Ensembl protein. In our example, each of the first two proteins was annotated to 4 Ensembl proteins. ```{r proteinToGenome-uniprot-counts} ## To how many Ensembl proteins was each Uniprot ID mapped? lengths(uni_gnm) ``` Below we show the genomic coordinates for the within-protein positions calculated for all 4 Ensembl protein IDs for *O15266*. ```{r proteinToGenome-uniprot-multi} uni_gnm[["O15266"]] ``` As a last example we fetch all protein domains for the gene SYP and map all of them to the genome. To fetch protein domain information we select all columns from the *protein\_domain* table. In addition, we retrieve the result as a `AAStringSet`. Additional annotations will be available in the `mcols` of this result object. ```{r proteinToGenome-SYP-fetch-domains, message = FALSE } SYP <- proteins(edbx, filter = ~ genename == "SYP", columns = c("protein_id", "tx_id", listColumns(edbx, "protein_domain")), return.type = "AAStringSet") SYP ``` Each protein sequence of the gene SYP has multiple protein domains annotated to it, thus protein IDs and sequences are redundant in the `AAStringSet`. We restrict the result below to a single protein. ```{r proteinToGenome-SYP-single-protein, message = FALSE} ## How many proteins are annotated to SYP? unique(mcols(SYP)$protein_id) ## Reduce the result to a single protein SYP <- SYP[names(SYP) == "ENSP00000263233"] ## List the available protein domains and additional annotations mcols(SYP) ``` Next we create the `IRanges` object, one range for each protein domain, and perform the mapping of the protein domains to the genome. This time we provide the protein identifiers with one of the metadata columns and pass the name of this column with the `id` parameter. ```{r proteinToGenome-SYP-map, message = FALSE} SYP_rng <- IRanges(start = mcols(SYP)$prot_dom_start, end = mcols(SYP)$prot_dom_end) mcols(SYP_rng) <- mcols(SYP) ## Map the domains to the genome. We set "id" to the name ## of the metadata columns containing the protein IDs SYP_gnm <- proteinToGenome(SYP_rng, edbx, id = "protein_id") ``` The function mapped each domain to the genome and returned a `list` with the mapping result for each as a `GRanges` object. As an example we show the mapping result for the second protein domain (*PF01284*). ```{r proteinToGenome-SYP-second} SYP_gnm[[2]] ``` The protein domain is encoded by a sequence spanning exons 2 to 5 of the transcript ENST00000263233. Note that the gene is encoded on the reverse strand. The individual ranges are ordered by the index of the respective exon within the transcript. At last we plot the encoding transcript and all of the mapped protein domains for the protein *ENSP00000263233* of SYP. ```{r proteinToGenome-SYP-plot, message = FALSE} library(Gviz) ## Define a genome axis track gat <- GenomeAxisTrack() ## Get the transcript ID: txid <- SYP_gnm[[1]]$tx_id[1] ## Get a GRanges for the transcript trt <- getGeneRegionTrackForGviz(edbx, filter = TxIdFilter(txid)) ## Define a GRanges for the mapped protein domains and add ## metadata columns with the grouping of the ranges and the ## IDs of the corresponding protein domains, so they can be ## identified in the plot dmns <- unlist(GRangesList(SYP_gnm)) dmns$grp <- rep(1:length(SYP_rng), lengths(SYP_gnm)) dmns$id <- rep(mcols(SYP_rng)$protein_domain_id, lengths(SYP_gnm)) ## Since we're using Ensembl chromosome names we have to set options(ucscChromosomeNames = FALSE) ## Plotting the transcript and the mapped protein domains. plotTracks(list(gat, GeneRegionTrack(trt, name = "tx"), AnnotationTrack(dmns, group = dmns$grp, id = dmns$id, groupAnnotation = "id", just.group = "above", shape = "box", name = "Protein domains")), transcriptAnnotation = "transcript") ``` # Mapping transcript coordinates to genomic coordinates Coordinates within transcript sequences can be mapped to genomic coordinates with the `transcriptToGenome` function. In the example below we map coordinates within 2 transcript to the genome. ```{r transcriptToGenome-map, message = FALSE} rng_tx <- IRanges(start = c(501, 1), width = c(5, 5), names = c("ENST00000486554", "ENST00000381578")) rng_gnm <- transcriptToGenome(rng_tx, edbx) ``` The function returns a `GRangesList` with the `GRanges` in each element containing the genomic coordinates to which the positions could be mapped (or an empty `GRanges` if the transcript identifier can not be found in the database or the input range is not within the transcript's sequence). The length of each `GRanges` depends on the number of exons the region in the transcript spans. ```{r transcriptToGenome-res-1} rng_gnm ``` The region in the first transcript (*ENST00000486554*) is mapped to two genomic regions, because part of it is located in the first, and part in the second exon of the transcript. All 5 nucleotides of the second region are within the transcript's first exon and are thus mapped to only a single genomic region. Next we map variants in the gene PKP2 to the corresponding genomic coordinates. The variants are *PKP2 c.1643DelG* and *c.1881DelC* and the positions we are looking for are thus nucleotides 1643 and 1881 within the **CDS** of the gene/transcript. Looking up the available transcripts for this gene we identified *ENST00000070846* as the representative transcript for the gene. Since the positions are not relative to the transcription start site we can not use the `transcriptToGenome` function for the mapping, but we have to map the cds-relative positions first to transcript-relative coordinates. We do this below using the `cdsToTranscript` function. ```{r pkp2-cdsToTranscript} ## Define the position within the CDS of the transcript pkp2_cds <- IRanges(start = c(1643, 1881), width = c(1, 1), name = rep("ENST00000070846", 2)) ## Convert cds-relative to transcript-relative coordinates pkp2 <- cdsToTranscript(pkp2_cds, EnsDb.Hsapiens.v86) pkp2 ``` With the coordinates being now relative to the first nucleotide of the transcript we can use the `transcriptToGenome` function for the final mapping of the position to the genome. ```{r pkp2-transcriptToGenome} pkp2_gnm <- transcriptToGenome(pkp2, EnsDb.Hsapiens.v86) pkp2_gnm ``` To verify that the nucleotides at the positions are indeed *G* and *C* as stated in the definition of the variant (*c.1643DelG* and *c.1881DelC*) we extract below the nucleotide at the identified genomic position. We thus load the package providing the genome sequence for GRCh38 on which Ensembl release 86 is based. ```{r pkp2-variant-pos-validate} library(BSgenome.Hsapiens.NCBI.GRCh38) getSeq(BSgenome.Hsapiens.NCBI.GRCh38, pkp2_gnm) ``` # Mapping transcript coordinates to protein coordinates The `transcriptToProtein` function can be used to map coordinates within a transcript to the corresponding coordinates within the encoded protein. Note that only regions within protein coding transcripts can be mapped and that coordinates within the transcript are assumed to be relative to the first nucleotide of the transcript (not of the CDS). Below we define regions within several transcripts and map these to the corresponding amino acid residues in the protein. ```{r transcriptToPrptein-map, message = FALSE} rng_tx <- IRanges(start = c(501, 1, 200), width = c(5, 5, 4), names = c("ENST00000486554", "ENST00000381578", "ENST00000431238")) rng_prt <- transcriptToProtein(rng_tx, edbx) ``` The mapping did throw several warnings. The region within transcript *ENST00000431238* can not be mapped to protein coordinates, because the transcript does not encode a protein. Transcript *ENST00000381578* encodes a protein, but the specified nucleotides 1 to 5 are located in the 5' UTR of the transcript and can therefore also not be mapped. Finally, the CDS of the transcript *ENST00000486554* is not complete and, while the coordinates were mapped to protein residues, they might not be correct. ```{r transcriptToProtein-res} rng_prt ``` For transcript coordinates that could not be mapped, negative coordinates are returned (see lines/elements 2 and 3 above). The first region could be mapped, but the returned protein-relative coordinates might be wrong, because the CDS of the transcript is incomplete (hence a `FALSE` is reported in metadata column `"cds_ok"`). In fact, only the 3' end of the CDS is incomplete for this transcript and the returned coordinates are thus correct. # Session information ```{r sessionInfo } sessionInfo() ```