--- title : "__MethPed__A DNA Methylation Classifier Tool for the Identification of Pediatric Brain Tumor Subtypes" author: "*Mohammad Tanvir Ahamed, Anna Danielsson, Szilárd Nemes and Helena Carén*University of Gothenburg, Sweden" date: " Date: *`r Sys.Date()`*__*Version: 1.0.0*__" output: BiocStyle::html_document: toc: yes toc_depth: 4 number_sections: TRUE BiocStyle::pdf_document: toc: TRUE toc_depth: 4 number_sections: TRUE vignette: > %\VignetteIndexEntry{MethPed User Guide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\usepackage[utf8x]{inputenc} --- # Introduction The __MethPed__ classifier of pediatric brain tumors presented in the MethPed article [1] can be accessed through the __MethPed__ package. This package includes the probes that were used for building the classification algorithm, the MethPed classifier and a published data set that we have used as an example. # Necessary packages and installation guide The __MethPed__ package can be installed through Bioconductor ```r source("http://bioconductor.org/biocLite.R") biocLite("MethPed") ``` The __MethPed__ classifier depends on the __randomForest__ package for random forest algorithm. If necessary this can be installed from CRAN. ```r install.packages("randomForest") ``` If there are missing values present in the data, the __impute__ package can be installed for missing data imputation. This is however optional and the user can use any algorithm for missing value imputation. ```r source("http://bioconductor.org/biocLite.R") biocLite("impute") ``` # Input data type and format for the __MethPed__ classifier ## Input data type Data for running MethPed is generated by the Illumina Infinium HumanMethylation 450 BeadChip arrays. MethPed assumes that the input data are beta values ($\beta$). Beta values ($\beta$) are the estimate of methylation level using the ratio of intensities between methylated and unmethylated alleles. $$\beta~=~\frac{Methylated~allele~intensity~(M)}{(Unmethylated~allele~intensity~(U)~+~Methylated~allele~intensity~(M)~+~offset)}$$ The "offset" is chosen to avoid dividing with small values. Illumina uses a default of 100. $\beta$ are between 0 and 1 with 0 being unmethylated and 1 fully methylated. ## Input data format __MethPed__ assumes that the input data are in 3 specific data classes 1) _ExpressionSet_ class, 2) _matrix_ class and 3) _data.frame_ class. ### _ExpressionSet_ class The [_ExpressionSet_](https://www.bioconductor.org/packages/3.3/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf) class is designed to combine several different sources of information into a single convenient structure. This type of class can be manipulated (e.g., subsetted, copied) conveniently, and is the input or output from many Bioconductor functions. Input data in _ExpressionSet_ class is as ```{r} ## Data Structure ``` ```{r, include=FALSE} ## Data Structure library(MethPed) data(MethPed_sample) MethPed_sample_v1<- MethPed_sample ``` ```{r, echo=FALSE} head(MethPed_sample_v1) ``` ```{r} ## Data class ``` ```{r, echo=FALSE} class(MethPed_sample_v1) ``` ### _matrix_ class In the _matrix_ class formatted data, all the values of the matrix are beta values ($\beta$). In the matrix, the probes are found in the rows and the samples in the columns. Input data in “_matrix_” class is as ```{r, include=FALSE} library(Biobase) ``` ```{r} ## Data Structure ``` ```{r, echo=FALSE} MethPed_sample_v2<-Biobase::exprs(MethPed_sample) head(MethPed_sample_v2) ``` ```{r} ## Data class ``` ```{r, echo=FALSE} class(MethPed_sample_v2) ``` ### _data.frame_ class In the _data.frame_ class formatted data, one column of the dataset are the probes and there is no order specificity of this column. This column can be anywhere in the dataset. The rest of the columns will be the beta values ($\beta$) and each column represents an individual sample. The header of the data frame is the name of the corresponding column (Probe column name i.e. “TargetID” and sample column name i.e. “Tumor.A”, “Tumor.B”). Input data in _data.frame_ class is as ```{r} ## Data Structure ``` ```{r, echo=FALSE} MethPed_sample_v3 <- data.frame(MethPed_sample_v2) MethPed_sample_v3$TargetID <- rownames(MethPed_sample_v3) rownames(MethPed_sample_v3) <-NULL head(MethPed_sample_v3) ``` ```{r} ## Data class ``` ```{r, echo=FALSE} class(MethPed_sample_v3) ``` # Workflow of __MethPed__ classifier The __MethPed__ classifier uses the _Random Forest (RF) algorithm_ to classify unknown pediatric brain tumor samples into sub-types. The classification proceeds with the selection of the beta values ($\beta$) needed for the classification. The computational process proceeded in two stages. The first stage commences with a reduction of the probe pool or building the training probe dataset for classification. We have named this dataset as __"predictors"__. The second stage is to apply the RF algorithm to classify the probe data of interest based on the training probe dataset (predictors). For the construction of the training probe pool (predictors), methylation data generated by the Illumina Infinium HumanMethylation 450 BeadChip arrays were downloaded from the Gene Expression Omnibus ([GEO](http://www.ncbi.nlm.nih.gov/geo/)). Four hundred seventy-two cases were available, representing several brain tumor diagnoses (DIPG, glioblastoma, ETMR, medulloblastoma, ependymoma, pilocytic astrocytoma) and their further subgroups. Corresponding GEO accession number are given in the following table Table: ___Datasets used to build the MethPed classifier___ Diagnosis | GEO accession ----------------------|--------------- DIPG | GSE50022 Glioblastoma | GSE55712, GSE36278 ETMR | GSE52556 Medulloblastoma | GSE54880 Ependymoma | GSE45353 Pilocytic astrocytoma | GSE44684 The data sets were merged and probes that did not appear in all data sets were filtered away. In addition, about 190,000 CpGs were removed due to SNPs, repeats and multiple mapping sites. The final data set contained 206,823 unique probes and nine tumor classes including the medulloblastoma subgroups. K–neighbor imputation was used for missing probe data. After that, a large number of regression analyses were performed to select the 100 probes per tumor class that had the highest predictive power (AUC values). Based on the identified 900 methylation sites, the nine pediatric brain tumor types could be accurately classified using the multiclass classification algorithm __MethPed__ [1]. The complete list of 900 probes can be observed as ```{r} # List of 900 probes in predictors library(MethPed) data(MethPed_900probes) head(MethPed_900probes) ``` # A working example of __MethPed__ ## Dataset without missing probe values An example dataset without missing probe beta values ($\beta$), is provided with the __MethPed__ package in _ExpressionSet_ class. This dataset consists of 2 patients and 468,821 probes. At first we can take a look at the data: ```{r Loading_package_and_data} # Loading package library(MethPed) # Loading data set data(MethPed_sample) head(MethPed_sample) class(MethPed_sample) ``` ### Check missing value in data Additionally, it is good to check if there are missing probe values. This will identify if any values are missing from each column/sample in the probe data set. For this operation we can use the `checkNA` command form the __MethPed__ package. Running this function might require some time for large data sets. ```{r Check_missing_probe_data, echo=TRUE} ## Check for missing value missingIndex <- checkNA(MethPed_sample) missingIndex ``` In this example, there are no missing values (probe names or beta values) in the data set. Now we are set to run the MethPed algorithm. ### Apply the __MethPed__ classifier The function `MethPed` has the argument `prob` which can take two values _TRUE_ or _FALSE_. The value _TRUE_ will return the conditional probability of a sample belonging to a specific group. _FALSE_ will return the binary classification (0 or 1) of a sample belonging to a specific group (based on maximum conditional probability). By default `MethPed` will consider _TRUE_ for `prob` argument. During running the __MethPed__ algorithm, additional information on data and progress of data analysis are seen. The analysis will take longer time depending on the amount of data. ```{r, include=FALSE} set.seed(1000) ``` ```{r Run_MethPed_classifier, echo=TRUE, message=TRUE, warning=FALSE} # Run the MethPed classifier myClassification <- MethPed(MethPed_sample) ``` + First message will show the name of the probe column in the data (Default: “TargetID”). + Second message will show "OK" if at least one probe is common between input dataset and predictors (900 probes are used in the predictor [1]. See [Workflow of MethPed classifier](#womc) section for details). + Third message will report about missing value in the data. + Fourth message will list how many probes and samples are in the input data. + Fifth will show the progress of RF analysis. + Sixth will show how many probes are missing in the input data compared with the 900 probes in predictors. The output of the algorithm is partitioned in 6 parts. The full output can be seen by typing _myClassification_ (Variable name defined by user) ```{r Output_of_Methped_classifier} myClassification ``` The first part contains the name of the probe column, the second part contains the number of probes and the third part the total number of samples in the main data. Each information can be extracted as ```{r MethPed_classifier_output_description_1-3} # First part myClassification$target_id # Second part myClassification$probes # Third part myClassification$sample ``` The fourth part contains the names of the probes that were included in the original classifier but are missing from the data at hand (input). This can be accessed through the command: ```{r MethPed_classifier_output_description_4} # Fourth part myClassification$probes_missing ``` In the example, one probe that was included in the original MethPed classifier, is missing from the present data set. If a large number of probes are missing, this could potentially lead to a drop in the predictive accuracy of the model. However, as can be seen by the fifth part of the result: ```{r MethPed_classifier_output_description_5} # Fifth part myClassification$oob_err.rate ``` the out-of-bag error is 1.95%; only slightly higher than the 1.7% in the original MethPed classifier. For details see the MethPed article [1]. The last output contains the predictions. We chose not to assign tumors to one or another group but to give probabilities of belonging to each group. ```{r MethPed_classifier_output_description_6} # Sixth part myClassification$predictions ``` In the example Tumor.A preserves the highest prediction probability, which is 0.796 for subtype MB_Gr3. This should be interpreted as a conditional probability. The probability that Tumor.A belongs to the subtype MB_Gr3 for the given data is 0.796. Similarly, Tumor.B preserves the highest prediction probability, which is 0.771 for the subtype PiloAstro. ### Output summary Alternatively, the `summary` function can be used to get the result from the `MethPed` output. ```{r Summary_of_MethPed_classifier_1} summary(myClassification) ``` If only the maximum conditional probability of samples is preferred, this can be obtained by supplying the FALSE value to the argument `prob`. ```{r, include=FALSE} set.seed(1000) ``` ```{r, include=FALSE} myClassification_max <- MethPed(MethPed_sample,prob=FALSE) ``` ```r myClassification_max <- MethPed(MethPed_sample,prob=FALSE) summary(myClassification_max) ``` ```{r, echo=FALSE} summary(myClassification_max) ``` ### Output visualization The output of the classification can also be visualized in a bar plot ```{r MethPed_bar_plot_custom_code} par(mai = c(1, 1, 1, 2), xpd=TRUE) mat<-t(myClassification$predictions) mycols <- c("green",rainbow(nrow(mat),start=0,end=1)[nrow(mat):1],"red") barplot(mat, col = mycols, beside=FALSE,axisnames=TRUE, ylim=c(0,1),xlab= "Sample",ylab="Probability") legend( ncol(mat)+0.5,1, legend = rownames(mat),fill = mycols,xpd=TRUE, cex = 0.6) ``` or the `plot` command to used to get the same bar plot ```r myClassification <- MethPed(MethPed_sample,prob=TRUE) plot(myClassification) ``` ```{r MethPed_bar_plot_generic_function_TRUE, echo=FALSE} plot(myClassification) ``` If only the maximum conditional probability of samples is preferred, this can be obtained by supplying the FALSE value to the argument `prob`. ```r myClassification_max <- MethPed(MethPed_sample,prob=FALSE) plot(myClassification_max) ``` ```{r MethPed_bar_plot_generic_function_FALSE, echo=FALSE} plot(myClassification_max) ``` ### Count missing probes The name of the probes that were included in the original __MethPed__ classifier (900 probes) but are missing from the data at hand, can be accessed by the command `probeMis`. ```{r Function_to_get_missing_probes} probeMis(myClassification) ``` ## Dataset with missing probe values In the previous section __MethPed__ classification on a dataset without missing values were shown. In this section, __MethPed__ classification on dataset with missing values are shown. ### Missing beta value in the data set The dataset “MethPed_sample” provided with the package is a non-missing valued dataset. As an example we will now generate a data set with missing probe values. ```{r} # Loading dataset data(MethPed_sample) class(MethPed_sample) ``` As the provided dataset is in the _ExpressionSet_ class, we have to extract the matrix of beta values for further process ```{r, include=FALSE} library(Biobase) ``` ```r # Loading library library(Biobase) # exprs function is loaded from the package Biobase MethPed_sample_matrix<-Biobase::exprs(MethPed_sample) head(MethPed_sample_matrix) ``` ```{r, echo=FALSE} MethPed_sample_matrix<-Biobase::exprs(MethPed_sample) head(MethPed_sample_matrix) ``` ```{r} class(MethPed_sample_matrix) ``` We can check for missing value in the extracted matrix of beta values ```{r} checkNA(MethPed_sample_matrix) ``` There are no missing values in the extracted matrix of beta values. Now we will remove some beta values (Replace with NA) from this matrix to make a missing valued dataset. ```{r} MethPed_sample_missing<-MethPed_sample_matrix MethPed_sample_missing[c(1,10,200),2]<-NA MethPed_sample_missing[c(4,600,500,1000),1]<-NA head(MethPed_sample_missing,10) ``` This dataset is now ready for __MethPed__ classification. ### Missing values imputation __MethPed__ cannot be applied on data sets with missing value. Before applying the __MethPed__ classifier, it’s important to check if there are missing values of probes in the dataset. To check missing values, one can use the ´checkNA´ function ```{r} checkNA(MethPed_sample_missing) ``` In this example there are 2 tumors with in total 7 missing beta values in the data. The classifier requires that all probes that were used for training the model are present for prediction. If this condition is not met, no prediction will be returned for the tumors with missing probes. One possible solution is to discard probes that have missing data. This might not be efficient if a large number of probe's values are missing in one or more samples. A more feasible approach can be to impute the missing values. Here we illustrate imputation by the means of K-nearest neighbors. For technical details and relevant references we refer readers to the documentation of the [_impute_](https://bioconductor.org/packages/release/bioc/html/impute.html) package. It should be noted that, this package was developed for gene expression data, and not for methylation data. However, most likely the method performs well on methylation data as well [3]. The _impute_ package depends on a random value for initiation. Therefore, to assure randomness, the eventual set of random seeds should be removed. ```{r Delete_random_seed} if(exists(".Random.seed")) rm(.Random.seed) ``` As mentioned before, the dataset contains 2 tumors. So the imputation will be run on two column of the methylation data. ```{r, include=FALSE} library(impute) imputedData <- impute.knn(MethPed_sample_missing) ``` ```r # Loading library library(impute) # Apply imputation imputedData <- impute.knn(MethPed_sample_missing) # extract the imputed matrix form object "data" imputedData <- imputedData$data # Check for missing value checkNA(imputedData) ``` ```{r Combine_probe_name_with_impoted_data, include=FALSE} imputedData <- imputedData$data ``` ```{r, echo=FALSE} checkNA(imputedData) ``` Now we are set to run the MethPed algorithm on the imputated matrix. ```{r, include=FALSE} set.seed(1000) ``` ```{r, include=FALSE} myClassification <- MethPed(imputedData) ``` ```r myClassification <- MethPed(imputedData) summary(myClassification) ``` ```{r, echo=FALSE} summary(myClassification) ``` # Contact and Citation ## Contact For bug-reports, comments and feature requests please write to helenacarenlab+methped@gmail.com. ## Citation When using MethPed in your work, please cite as: ```{r} citation('MethPed') ``` # Authors information Mohammad Tanvir Ahamed, Anna Danielsson, Szilárd Nemes and Helena Carén, University of Gothenburg, Sweden. # Session info ```{r} sessionInfo() ``` # References [1] Anna Danielsson, Szilárd Nemes, Magnus Tisell, Birgitta Lannering, Claes Nordborg, Magnus Sabel, and Helena Carén. "MethPed: A DNA Methylation Classifier Tool for the Identification of Pediatric Brain Tumor Subtypes". Clinical Epigenetics 2015, 7:62, 2015 [2] Breiman, Leo (2001). "Random Forests". Machine Learning 45 (1): 5–32. doi:10.1023/A:1010933404324 [3] Troyanskaya, O., M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. "Missing Value Estimation Methods for DNA Microarrays." Bioinformatics 17.6 (2001): 520-25.