if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("LiNk-NY/terraTCGAdata")
Some public Terra workspaces come pre-packaged with TCGA data (i.e., cloud data
resources are linked within the data model), particularly the workspaces
labelled OpenAccess_V1-0. Datasets harmonized to the hg38 genome use a
different data model / workflow and are not compatible with the functions in
this package. For compatible workspaces, we use the Terra data model and
represent the data as a MultiAssayExperiment object.
For more information on MultiAssayExperiment, please see the vignette in
that package.
library(AnVIL)
library(terraTCGAdata)
A valid GCloud SDK installation is required to use the package. Use the
gcloud_exists() function from the AnVIL package to check
whether it is installed on your system.
gcloud_exists()
## [1] FALSE
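Because the functions shown later need a working gcloud installation, it can help to guard Terra-dependent code. The following is a minimal sketch, not part of the package: `has_gcloud` is a hypothetical variable name, and the `tryCatch()` wrapper is an assumption added so that the check returns FALSE instead of failing if the call is unavailable in your AnVIL version.

```r
# Check that the AnVIL package is installed before calling into it;
# && short-circuits, so AnVIL:: is never evaluated when it is missing.
has_gcloud <- requireNamespace("AnVIL", quietly = TRUE) &&
    isTRUE(tryCatch(AnVIL::gcloud_exists(), error = function(e) FALSE))

if (has_gcloud) {
    message("gcloud found; Terra queries can be run")
} else {
    message("gcloud not found; install the Google Cloud SDK first")
}
```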
You can also use the gcloud_project() function to set a project name by
specifying the project argument. Called without arguments, it reports the
currently active project:
gcloud_project()
To get a list of available TCGA workspaces, use the findTCGAworkspaces()
function:
findTCGAworkspaces()
You can then set a package-wide option with the terraTCGAworkspace() function
and check the setting with getOption("terraTCGAdata.workspace").
terraTCGAworkspace("TCGA_COAD_OpenAccess_V1-0_DATA")
getOption("terraTCGAdata.workspace")
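Since the package-wide setting is stored as an ordinary R option, it can also be inspected or supplied with base R tools. The sketch below uses only base R. The helper `resolve_workspace` is hypothetical and only illustrates how an explicit argument could take precedence over the option; it is not the package's internal code.

```r
# Hypothetical helper: prefer an explicit workspace, else fall back to the option.
resolve_workspace <- function(workspace = NULL) {
    if (is.null(workspace))
        workspace <- getOption("terraTCGAdata.workspace")
    if (is.null(workspace))
        stop("no workspace given and option 'terraTCGAdata.workspace' is unset")
    workspace
}

# Set the option directly with base R, remembering the previous value.
old <- options(terraTCGAdata.workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")

resolve_workspace()                      # falls back to the option
resolve_workspace("another_workspace")   # an explicit argument wins

# Restore whatever was set before.
options(old)
```

Saving the return value of `options()` and passing it back afterwards restores the earlier state, including the case where the option was not set at all.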
To determine which datasets to download, use the getClinicalTable()
function to list all of the columns that correspond to clinical data
from the different collection centers.
ct <- getClinicalTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
ct
names(ct)
After picking a column from the getClinicalTable() output, use the column
name as input to the getClinical() function to obtain the data:
column_name <- "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin"
clin <- getClinical(
    columnName = column_name,
    participants = TRUE,
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA"
)
clin[, 1:6]
dim(clin)
We use the same approach for assay data. We first produce a list of assays
with getAssayTable() and then select one, along with any sample
codes of interest.
at <- getAssayTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
at
names(at)
You can get a summary table of all the samples in the data by using the
sampleTypesTable() function:
sampleTypesTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
Note that if you have the package-wide option set, the workspace argument is not needed in the function call.
prot <- getAssayData(
    assayName = "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
    sampleCode = c("01", "10"),
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA",
    sampleIdx = 1:4
)
head(prot)
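The sampleCode values used above are TCGA sample type codes, the two-digit field in a TCGA barcode (for example, 01 for primary solid tumor and 10 for blood derived normal). The base R sketch below illustrates the idea of filtering by sample code. The barcodes are made-up examples and `sample_code` is a hypothetical helper, not a function from this package; how the package performs the filtering internally is not shown here.

```r
# Made-up barcodes following the TCGA layout; the sample type code
# occupies characters 14 and 15.
barcodes <- c(
    "TCGA-AA-0001-01A-01R-0001-07",
    "TCGA-AA-0001-10A-01D-0001-01",
    "TCGA-AA-0002-11A-01R-0001-07"
)

# Hypothetical helper: extract the two-digit sample type code.
sample_code <- function(barcode) substr(barcode, 14, 15)

codes <- sample_code(barcodes)

# Keep only the barcodes whose code is one of the requested values.
keep <- barcodes[codes %in% c("01", "10")]

# Tabulate how many barcodes carry each code.
table(codes)
```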
Finally, once you have collected all the relevant column names,
these can be inputs to the main terraTCGAdata function:
mae <- terraTCGAdata(
    clinicalName = "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin",
    assays = c(
        "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
        "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"
    ),
    sampleCode = NULL,
    split = FALSE,
    sampleIdx = 1:4,
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA"
)
mae
We expect that most OpenAccess_V1-0 cancer datasets follow this data model.
If you encounter any errors, please provide a minimal reproducible example
at https://github.com/waldronlab/terraTCGAdata.
sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] terraTCGAdata_1.0.0 MultiAssayExperiment_1.22.0
## [3] SummarizedExperiment_1.26.0 Biobase_2.56.0
## [5] GenomicRanges_1.48.0 GenomeInfoDb_1.32.0
## [7] IRanges_2.30.0 S4Vectors_0.34.0
## [9] BiocGenerics_0.42.0 MatrixGenerics_1.8.0
## [11] matrixStats_0.62.0 AnVIL_1.8.0
## [13] dplyr_1.0.8 BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] lattice_0.20-45 tidyr_1.2.0 assertthat_0.2.1
## [4] digest_0.6.29 utf8_1.2.2 R6_2.5.1
## [7] futile.options_1.0.1 rapiclient_0.1.3 evaluate_0.15
## [10] httr_1.4.2 pillar_1.7.0 zlibbioc_1.42.0
## [13] rlang_1.0.2 jquerylib_0.1.4 Matrix_1.4-1
## [16] rmarkdown_2.14 stringr_1.4.0 RCurl_1.98-1.6
## [19] DelayedArray_0.22.0 compiler_4.2.0 xfun_0.30
## [22] pkgconfig_2.0.3 htmltools_0.5.2 tidyselect_1.1.2
## [25] tibble_3.1.6 GenomeInfoDbData_1.2.8 bookdown_0.26
## [28] codetools_0.2-18 fansi_1.0.3 crayon_1.5.1
## [31] bitops_1.0-7 grid_4.2.0 jsonlite_1.8.0
## [34] lifecycle_1.0.1 DBI_1.1.2 magrittr_2.0.3
## [37] formatR_1.12 cli_3.3.0 stringi_1.7.6
## [40] XVector_0.36.0 futile.logger_1.4.3 bslib_0.3.1
## [43] ellipsis_0.3.2 generics_0.1.2 vctrs_0.4.1
## [46] lambda.r_1.2.4 tools_4.2.0 glue_1.6.2
## [49] purrr_0.3.4 parallel_4.2.0 fastmap_1.1.0
## [52] yaml_2.3.5 BiocManager_1.30.17 knitr_1.39
## [55] sass_0.4.1