PLoS Comput Biol. 2021 Jun 11;17(6):e1008944. doi: 10.1371/journal.pcbi.1008944. Online ahead of print.
Cancer cells display massive dysregulation of key regulatory pathways due to now well-catalogued mutations and other DNA-related aberrations. Moreover, enormous heterogeneity has been commonly observed in the identity, frequency and location of these aberrations across individuals with the same cancer type or subtype, and this variation naturally propagates to the transcriptome, resulting in myriad types of dysregulated gene expression programs. Many have argued that a more integrative and quantitative analysis of heterogeneity of DNA and RNA molecular profiles may be necessary for designing more systematic explorations of alternative therapies and improving predictive accuracy. We introduce a representation of multi-omics profiles which is sufficiently rich to account for observed heterogeneity and support the construction of quantitative, integrated, metrics of variation. Starting from the network of interactions existing in Reactome, we build a library of “paired DNA-RNA aberrations” that represent prototypical and recurrent patterns of dysregulation in cancer; each two-gene “Source-Target Pair” (STP) consists of a “source” regulatory gene and a “target” gene whose expression is plausibly “controlled” by the source gene. The STP is then “aberrant” in a joint DNA-RNA profile if the source gene is DNA-aberrant (e.g., mutated, deleted, or duplicated), and the downstream target gene is “RNA-aberrant”, meaning its expression level is outside the normal, baseline range. With M STPs, each sample profile has exactly one of the 2M possible configurations. We concentrate on subsets of STPs, and the corresponding reduced configurations, by selecting tissue-dependent minimal coverings, defined as the smallest family of STPs with the property that every sample in the considered population displays at least one aberrant STP within that family. These minimal coverings can be computed with integer programming. Given such a covering, a natural measure of cross-sample diversity is the extent to which the particular aberrant STPs composing a covering vary from sample to sample; this variability is captured by the entropy of the distribution over configurations. We apply this program to data from TCGA for six distinct tumor types (breast, prostate, lung, colon, liver, and kidney cancer). This enables an efficient simplification of the complex landscape observed in cancer populations, resulting in the identification of novel signatures of molecular alterations which are not detected with frequency-based criteria. Estimates of cancer heterogeneity across tumor phenotypes reveals a stable pattern: entropy increases with disease severity. This framework is then well-suited to accommodate the expanding complexity of cancer genomes and epigenomes emerging from large consortia projects.