library(cBioPortalData)
library(AnVIL)
The cBioPortal for Cancer Genomics website is a great resource for interactive exploration of study datasets. However, it does not easily allow the analyst to obtain and further analyze the data.
We’ve developed the cBioPortalData package to fill this need to
programmatically access the data resources available on the cBioPortal.
The cBioPortalData package provides an R interface for accessing the
cBioPortal study data within the Bioconductor ecosystem.
It downloads study data from the cBioPortal API (the full API specification can be found here https://cbioportal.org/api) and uses Bioconductor infrastructure to cache and represent the data.
We use the MultiAssayExperiment (@Ramos2017-er) package to integrate,
represent, and coordinate multiple experiments for the studies available in the
cBioPortal. This package, in conjunction with curatedTCGAData, gives access to
a large trove of publicly available bioinformatic data. Please see our
publication here (@Ramos2020-ya).
We demonstrate common use cases of cBioPortalData and curatedTCGAData
during Bioconductor conference
workshops.
Data are provided as a single MultiAssayExperiment per study. The
MultiAssayExperiment representation usually contains SummarizedExperiment
objects for expression data and RaggedExperiment objects for mutation and
CNV-type data. RaggedExperiment is a data class for representing ‘ragged’
genomic location data, meaning that the measurements per sample vary.
For more information, please see the RaggedExperiment and
SummarizedExperiment vignettes.
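To make the notion of 'ragged' data concrete, the following sketch builds a small RaggedExperiment from two samples that carry a different number of genomic ranges. The ranges, scores, and sample names are invented for illustration and do not come from any cBioPortal study.

```r
library(GenomicRanges)
library(RaggedExperiment)

# Two hypothetical samples with a different number of measured ranges
sample1 <- GRanges(
    c(A = "chr1:1-10:-", B = "chr1:8-14:+", C = "chr2:15-18:+"),
    score = 3:5
)
sample2 <- GRanges(
    c(D = "chr1:1-10:-", E = "chr2:11-18:+"),
    score = 1:2
)

# Each named argument becomes one column (sample) of the object
ragexp <- RaggedExperiment(sample1 = sample1, sample2 = sample2)

# One row per range across all samples, one column per sample
dim(ragexp)

# Matrix view of the scores: NA where a sample has no value for a range
sparseAssay(ragexp)
```

Because each range is only measured in its own sample, the matrix view is mostly NA, which is why mutation and segmented copy-number data are kept in this class rather than in a dense matrix.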
As we work through the data, there are some datasets that cannot be represented
as MultiAssayExperiment objects. This can be due to a number of reasons, such
as the way the data are handled, the presence of mismatched identifiers, invalid
data types, etc. To see which datasets are currently not building, we can
refer to getStudies() with the buildReport = TRUE argument.
cbio <- cBioPortal()
## Warning in .service_validate_md5sum(api_reference_url, api_reference_md5sum, : service version differs from validated version
##     service url: https://www.cbioportal.org/api/v2/api-docs
##     observed md5sum: 008be96361f24a5c8d1cfb7f10ae9c97
##     expected md5sum: 07ceb76cc5afcf54a9cf2e1a689b18f7
studies <- getStudies(cbio, buildReport = TRUE)
head(studies)
## # A tibble: 6 × 15
##   name           description publicStudy groups status importDate allSampleCount
##   <chr>          <chr>       <lgl>       <chr>   <int> <chr>               <int>
## 1 Adrenocortica… "TCGA Adre… TRUE        "PUBL…      0 2022-10-2…             92
## 2 Acute Lymphob… "Comprehen… TRUE        "PUBL…      0 2022-10-2…             93
## 3 Hypodiploid A… "Whole gen… TRUE        ""          0 2022-10-2…             44
## 4 Adenoid Cysti… "Whole exo… TRUE        "ACYC…      0 2022-10-2…             12
## 5 Adenoid Cysti… "Targeted … TRUE        "ACYC…      0 2022-10-2…             28
## 6 Adenoid Cysti… "Whole-gen… TRUE        "ACYC…      0 2022-10-2…             25
## # ℹ 8 more variables: readPermission <lgl>, studyId <chr>, cancerTypeId <chr>,
## #   referenceGenome <chr>, pmid <chr>, citation <chr>, api_build <lgl>,
## #   pack_build <lgl>
The last two columns will show the availability of each studyId for
either download method (pack_build for cBioDataPack and api_build for
cBioPortalData).
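As a minimal sketch of how these two columns might be used, the helper below lists the study identifiers that are reported as building for a given download method. The data frame is a toy stand-in for the build report (its study IDs and logical values are invented), and `buildable_ids` is a hypothetical helper, not a function from the package.

```r
# Toy stand-in for the table returned by getStudies(cbio, buildReport = TRUE);
# the study IDs and build flags below are invented for illustration
studies_toy <- data.frame(
    studyId = c("study_a", "study_b", "study_c"),
    api_build = c(TRUE, TRUE, FALSE),
    pack_build = c(TRUE, FALSE, NA),
    stringsAsFactors = FALSE
)

# Hypothetical helper: study IDs reported as building for one method
buildable_ids <- function(studies, method = c("api", "pack")) {
    method <- match.arg(method)
    flag <- studies[[paste0(method, "_build")]]
    # %in% treats NA (no build status recorded) as not building
    studies[["studyId"]][flag %in% TRUE]
}

buildable_ids(studies_toy, "api")
buildable_ids(studies_toy, "pack")
```

With the real build report, the same call would be `buildable_ids(studies, "api")` for cBioPortalData or `buildable_ids(studies, "pack")` for cBioDataPack.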
There are two main user-facing functions for downloading data from cBioPortal.
cBioDataPack makes use of the tarball distribution of study data. This is
useful when the user wants to download and analyze the entirety of the data as
available from the cBioPortal.org website.
cBioPortalData allows a more flexible approach to obtaining study data
based on the available parameters such as molecular profile identifiers. This
option is useful for users who have a set of gene symbols or identifiers and
would like to get a smaller subset of the data that correspond to a particular
molecular profile.
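The identifiers that cBioPortalData expects can be looked up through the API before requesting any data. The sketch below assumes network access to the public cBioPortal service and uses the package's query helpers with the "acc_tcga" study as an example; the exact columns returned may differ between API versions.

```r
library(cBioPortalData)

# Connect to the public cBioPortal API (requires network access)
cbio <- cBioPortal()

# Molecular profiles available for one study
mols <- molecularProfiles(cbio, studyId = "acc_tcga")
mols[["molecularProfileId"]]

# Gene panels known to the API
panels <- genePanels(cbio)
head(panels)

# Sample lists defined for the study
sampleLists(cbio, studyId = "acc_tcga")
```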
cBioDataPack accesses the packaged study data and returns an integrative MultiAssayExperiment representation.
## Use ask=FALSE for non-interactive use
laml <- cBioDataPack("laml_tcga", ask = FALSE)
laml
## A MultiAssayExperiment object of 12 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 12:
##  [1] cna: SummarizedExperiment with 24776 rows and 191 columns
##  [2] cna_hg19.seg: RaggedExperiment with 13571 rows and 191 columns
##  [3] linear_cna: SummarizedExperiment with 24776 rows and 191 columns
##  [4] methylation_hm27: SummarizedExperiment with 10968 rows and 194 columns
##  [5] methylation_hm450: SummarizedExperiment with 10968 rows and 194 columns
##  [6] mrna_seq_rpkm: SummarizedExperiment with 19720 rows and 179 columns
##  [7] mrna_seq_rpkm_zscores_ref_all_samples: SummarizedExperiment with 19720 rows and 179 columns
##  [8] mrna_seq_rpkm_zscores_ref_diploid_samples: SummarizedExperiment with 19719 rows and 179 columns
##  [9] mrna_seq_v2_rsem: SummarizedExperiment with 20531 rows and 173 columns
##  [10] mrna_seq_v2_rsem_zscores_ref_all_samples: SummarizedExperiment with 20531 rows and 173 columns
##  [11] mrna_seq_v2_rsem_zscores_ref_diploid_samples: SummarizedExperiment with 20440 rows and 173 columns
##  [12] mutations: RaggedExperiment with 2584 rows and 197 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files
The cBioPortalData function provides a more flexible and granular way to request a MultiAssayExperiment object from a study ID, molecular profile, gene panel, or sample list.
acc <- cBioPortalData(api = cbio, by = "hugoGeneSymbol", studyId = "acc_tcga",
    genePanelId = "IMPACT341",
    molecularProfileIds = c("acc_tcga_rppa", "acc_tcga_linear_CNA")
)
## harmonizing input:
##   removing 1 colData rownames not in sampleMap 'primary'
acc
## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 2:
##  [1] acc_tcga_linear_CNA: SummarizedExperiment with 339 rows and 90 columns
##  [2] acc_tcga_rppa: SummarizedExperiment with 57 rows and 46 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files
Note. To avoid overloading the API service, the API was designed to only query a part of the study data. Therefore, the user is required to enter either a set of genes of interest or a gene panel identifier.
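As a sketch of the gene-based alternative to a gene panel, the request below passes a set of Hugo symbols through the `genes` argument. It assumes that this argument is available in the installed version of cBioPortalData and that the public API is reachable; the gene symbols are arbitrary examples chosen for illustration.

```r
library(cBioPortalData)

# Connect to the public cBioPortal API (requires network access)
cbio <- cBioPortal()

# Hypothetical genes of interest, given as Hugo symbols
genes_of_interest <- c("TP53", "CTNNB1", "MEN1")

# Request one molecular profile restricted to those genes
acc_genes <- cBioPortalData(
    api = cbio, by = "hugoGeneSymbol", studyId = "acc_tcga",
    genes = genes_of_interest,
    molecularProfileIds = "acc_tcga_linear_CNA"
)
acc_genes
```

Setting `by = "hugoGeneSymbol"` tells the function how to interpret the values in `genes`; Entrez identifiers would need `by = "entrezGeneId"` instead.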
In cases where a download is interrupted, the user may experience a corrupt
cache. The user can clear the cache for a particular study by using the
removeCache function. Note that this function only works for data downloaded
through the cBioDataPack function.
removeCache("laml_tcga")
For users who wish to clear the entire cBioPortalData cache, it is
recommended that they use:
## recursive = TRUE is needed for unlink() to delete a directory
unlink("~/.cache/cBioPortalData/", recursive = TRUE)
We can draw a Kaplan-Meier (K-M) plot using a few
variables from the colData slot of the MultiAssayExperiment. First, we load
the necessary packages:
library(survival)
library(survminer)
We can check the data to look out for any issues.
table(colData(laml)$OS_STATUS)
## 
##   0:LIVING 1:DECEASED 
##         67        133
class(colData(laml)$OS_MONTHS)
## [1] "character"
Now, we clean the data a bit to ensure that our variables are of the right type for the subsequent survival model fit.
collaml <- colData(laml)
## Recode the "[Not Available]" placeholder as a missing value
collaml[collaml$OS_MONTHS == "[Not Available]", "OS_MONTHS"] <- NA
## Convert the survival time from character to numeric
collaml$OS_MONTHS <- as.numeric(collaml$OS_MONTHS)
colData(laml) <- collaml
We specify a simple survival model using SEX as a covariate and we draw
the K-M plot.
fit <- survfit(
    Surv(OS_MONTHS, as.numeric(substr(OS_STATUS, 1, 1))) ~ SEX,
    data = colData(laml)
)
ggsurvplot(fit, data = colData(laml), risk.table = TRUE)
If you are interested in a particular study dataset that is not currently building, please open an issue at our GitHub repository and we will do our best to resolve the issues with either the data or the code.
We appreciate your feedback!
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] survminer_0.4.9             ggpubr_0.6.0               
##  [3] ggplot2_3.4.2               survival_3.5-5             
##  [5] cBioPortalData_2.12.0       MultiAssayExperiment_1.26.0
##  [7] SummarizedExperiment_1.30.0 Biobase_2.60.0             
##  [9] GenomicRanges_1.52.0        GenomeInfoDb_1.36.0        
## [11] IRanges_2.34.0              S4Vectors_0.38.0           
## [13] BiocGenerics_0.46.0         MatrixGenerics_1.12.0      
## [15] matrixStats_0.63.0          AnVIL_1.12.0               
## [17] dplyr_1.1.2                 BiocStyle_2.28.0           
## 
## loaded via a namespace (and not attached):
##   [1] jsonlite_1.8.4            magrittr_2.0.3           
##   [3] magick_2.7.4              GenomicFeatures_1.52.0   
##   [5] farver_2.1.1              rmarkdown_2.21           
##   [7] BiocIO_1.10.0             zlibbioc_1.46.0          
##   [9] vctrs_0.6.2               memoise_2.0.1            
##  [11] Rsamtools_2.16.0          RCurl_1.98-1.12          
##  [13] rstatix_0.7.2             BiocBaseUtils_1.2.0      
##  [15] htmltools_0.5.5           progress_1.2.2           
##  [17] lambda.r_1.2.4            curl_5.0.0               
##  [19] broom_1.0.4               sass_0.4.5               
##  [21] bslib_0.4.2               htmlwidgets_1.6.2        
##  [23] zoo_1.8-12                futile.options_1.0.1     
##  [25] cachem_1.0.7              commonmark_1.9.0         
##  [27] GenomicAlignments_1.36.0  mime_0.12                
##  [29] lifecycle_1.0.3           pkgconfig_2.0.3          
##  [31] Matrix_1.5-4              R6_2.5.1                 
##  [33] fastmap_1.1.1             GenomeInfoDbData_1.2.10  
##  [35] shiny_1.7.4               digest_0.6.31            
##  [37] colorspace_2.1-0          RaggedExperiment_1.24.0  
##  [39] AnnotationDbi_1.62.0      RSQLite_2.3.1            
##  [41] labeling_0.4.2            filelock_1.0.2           
##  [43] RTCGAToolbox_2.30.0       km.ci_0.5-6              
##  [45] fansi_1.0.4               RJSONIO_1.3-1.8          
##  [47] abind_1.4-5               httr_1.4.5               
##  [49] compiler_4.3.0            bit64_4.0.5              
##  [51] withr_2.5.0               backports_1.4.1          
##  [53] BiocParallel_1.34.0       carData_3.0-5            
##  [55] DBI_1.1.3                 highr_0.10               
##  [57] ggsignif_0.6.4            biomaRt_2.56.0           
##  [59] rappdirs_0.3.3            DelayedArray_0.26.0      
##  [61] rjson_0.2.21              tools_4.3.0              
##  [63] httpuv_1.6.9              glue_1.6.2               
##  [65] restfulr_0.0.15           promises_1.2.0.1         
##  [67] gridtext_0.1.5            grid_4.3.0               
##  [69] generics_0.1.3            gtable_0.3.3             
##  [71] KMsurv_0.1-5              tzdb_0.3.0               
##  [73] tidyr_1.3.0               data.table_1.14.8        
##  [75] hms_1.1.3                 car_3.1-2                
##  [77] xml2_1.3.3                utf8_1.2.3               
##  [79] XVector_0.40.0            markdown_1.6             
##  [81] pillar_1.9.0              stringr_1.5.0            
##  [83] vroom_1.6.1               RCircos_1.2.2            
##  [85] later_1.3.0               splines_4.3.0            
##  [87] ggtext_0.1.2              BiocFileCache_2.8.0      
##  [89] lattice_0.21-8            rtracklayer_1.60.0       
##  [91] bit_4.0.5                 tidyselect_1.2.0         
##  [93] Biostrings_2.68.0         miniUI_0.1.1.1           
##  [95] knitr_1.42                gridExtra_2.3            
##  [97] bookdown_0.33             futile.logger_1.4.3      
##  [99] xfun_0.39                 DT_0.27                  
## [101] stringi_1.7.12            yaml_2.3.7               
## [103] evaluate_0.20             codetools_0.2-19         
## [105] archive_1.1.5             tibble_3.2.1             
## [107] BiocManager_1.30.20       cli_3.6.1                
## [109] xtable_1.8-4              munsell_0.5.0            
## [111] jquerylib_0.1.4           survMisc_0.5.6           
## [113] Rcpp_1.0.10               GenomicDataCommons_1.24.0
## [115] dbplyr_2.3.2              png_0.1-8                
## [117] XML_3.99-0.14             rapiclient_0.1.3         
## [119] parallel_4.3.0            TCGAutils_1.20.0         
## [121] ellipsis_0.3.2            readr_2.1.4              
## [123] blob_1.2.4                prettyunits_1.1.1        
## [125] bitops_1.0-7              scales_1.2.1             
## [127] purrr_1.0.1               crayon_1.5.2             
## [129] rlang_1.1.0               KEGGREST_1.40.0          
## [131] rvest_1.0.3               formatR_1.14