Title: | A Tidy Framework to Hack Gene Expression Signatures |
---|---|
Description: | A collection of cancer transcriptomics gene signatures as well as a simple and tidy interface to compute single sample enrichment scores either with the original procedure or with three alternatives: the "combined z-score" of Lee et al. (2008) <doi:10.1371/journal.pcbi.1000217>, the "single sample GSEA" of Barbie et al. (2009) <doi:10.1038/nature08460> and the "singscore" of Foroutan et al. (2018) <doi:10.1186/s12859-018-2435-4>. The 'get_sig_info()' function can be used to retrieve information about each signature implemented. |
Authors: | Andrea Carenzo [aut, cre] , Loris De Cecco [aut] , Federico Pistore [aut] |
Maintainer: | Andrea Carenzo <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.0 |
Built: | 2024-11-22 03:36:52 UTC |
Source: | https://github.com/acare/hacksig |
check_sig()
is a helper function that shows useful information about signatures
that you want to test on your gene expression matrix.
check_sig(expr_data, signatures = "all")
check_sig(expr_data, signatures = "all")
expr_data |
A normalized gene expression matrix (or data frame) with gene symbols as row names and samples as columns. |
signatures |
It can be a list of signatures or a character vector indicating
keywords for a group of signatures. The default ( |
A tibble with a number of rows equal to the number of input signatures and five columns:
signature_id
, a unique identifier associated to a signature;
n_genes
, the number of genes composing a signature;
n_present
and frac_present
, the number and fraction of genes in a
signature which are present in expr_data
, respectively;
missing_genes
, the missing gene symbols for each signature.
check_sig(test_expr) check_sig(test_expr, "estimate")
check_sig(test_expr) check_sig(test_expr, "estimate")
Obtain gene signatures implemented in hacksig
as a named list of gene symbols.
get_sig_genes(keywords = "all")
get_sig_genes(keywords = "all")
keywords |
A character vector indicating keywords for a group of signatures.
The default ( |
A named list of gene signatures.
get_sig_info()
to get valid keywords for signatures.
get_sig_genes() get_sig_genes("estimate")
get_sig_genes() get_sig_genes("estimate")
get_sig_info()
returns information about all gene signatures implemented in
hacksig
.
get_sig_info()
get_sig_info()
A tibble with one row per signature and four columns:
signature_id
, a unique identifier associated to a signature;
signature_keywords
, valid keywords to use in the signatures
argument of
hack_sig()
and check_sig()
as well as in the keywords
argument of
get_sig_genes()
;
publication_doi
, the original publication DOI;
description
, a brief description about the signature.
check_sig()
, hack_sig()
, get_sig_genes()
get_sig_info()
get_sig_info()
Given a gene expression matrix and a 0-1 vector indicating the distant metastasis
status of samples, hack_cinsarc()
classifies samples into one of two risk
classes, C1 or C2, using the CINSARC signature as implemented in
Chibon et al., 2010.
hack_cinsarc(expr_data, dm_status)
hack_cinsarc(expr_data, dm_status)
expr_data |
A normalized gene expression matrix (or data frame) with gene symbols as row names and samples as columns. |
dm_status |
A numeric vector specifying whether a sample has either (1) or not (0) developed distant metastasis. |
CINSARC (Complexity INdex in SARComas) is a prognostic 67-gene signature related to mitosis and control of chromosome integrity. It was developed to improve metastatic outcome prediction in soft tissue sarcomas over the FNCLCC (Fédération Francaise des Centres de Lutte Contre le Cancer) grading system.
A tibble with one row for each sample in expr_data
and two columns:
sample_id
and cinsarc_class
.
The CINSARC method implemented in hacksig
makes use of leave-one-out cross
validation (LOOCV) to classify samples into C1/C2 risk groups (see Lesluyes & Chibon, 2020).
First, gene expression values are centered by their mean across samples.
Then, for each iteration of the LOOCV, mean normalized gene values are computed
by metastasis group (i.e. compute the metastatic centroids). Then, one minus the
Spearman's correlation between centered samples and metastatic centroids are computed.
Finally, if a sample is more correlated to the non-metastatic centroid, then
it is assigned to the C1 class (low risk). Conversely, if a sample is more
correlated to the metastatic centroid, then it is assigned to the C2 class (high risk).
codeocean.com/capsule/4933686/tree/v4
Chibon, F., Lagarde, P., Salas, S., Pérot, G., Brouste, V., Tirode, F., Lucchesi, C., de Reynies, A., Kauffmann, A., Bui, B., Terrier, P., Bonvalot, S., Le Cesne, A., Vince-Ranchère, D., Blay, J. Y., Collin, F., Guillou, L., Leroux, A., Coindre, J. M., & Aurias, A. (2010). Validated prediction of clinical outcome in sarcomas and multiple types of cancer on the basis of a gene expression signature related to genome complexity. Nature medicine, 16(7), 781–787. doi:10.1038/nm.2174.
Lesluyes, T., & Chibon, F. (2020). A Global and Integrated Analysis of CINSARC-Associated Genetic Defects. Cancer research, 80(23), 5282–5290. doi:10.1158/0008-5472.CAN-20-0512.
# generate random distant metastasis outcome set.seed(123) test_dm_status <- sample(c(0, 1), size = ncol(test_expr), replace = TRUE) hack_cinsarc(test_expr, test_dm_status)
# generate random distant metastasis outcome set.seed(123) test_dm_status <- sample(c(0, 1), size = ncol(test_expr), replace = TRUE) hack_cinsarc(test_expr, test_dm_status)
Obtain Immune, Stroma, ESTIMATE and Tumor Purity scores from a cohort of samples, using the method implemented in Yoshihara et al., 2013.
hack_estimate(expr_data)
hack_estimate(expr_data)
expr_data |
A normalized gene expression matrix (or data frame) with gene symbols as row names and samples as columns. |
The ESTIMATE (Estimation of STromal and Immune cells in MAlignant Tumors using Expression data) method was developed with the aim to estimate the fraction of tumor cells in a sample by using gene expression instead of copy number data. The fundamental assumption of this method is that the tumor microenvironment is a very rich and dynamic ecosystem, in which immune infiltrating cells and stroma play a major role. The ESTIMATE score is defined as the combination (i.e. sum) of immune and stroma scores and can be thought of as a "non-tumor score". Consequently, a high ESTIMATE enrichment gives a low tumor purity score and viceversa.
A tibble with one row for each sample in expr_data
and five columns:
sample_id
, immune_score
, stroma_score
, estimate_score
and purity_score
.
Raw immune and stromal signatures scores are computed using single sample GSEA with rank normalization (Barbie et al., 2009). Then, the ESTIMATE score is computed by summing the immune and stroma scores. Finally, the tumor purity score is obtained with the following formula:
bioinformatics.mdanderson.org/public-software/estimate/
Barbie, D. A., Tamayo, P., Boehm, J. S., Kim, S. Y., Moody, S. E., Dunn, I. F., Schinzel, A. C., Sandy, P., Meylan, E., Scholl, C., Fröhling, S., Chan, E. M., Sos, M. L., Michel, K., Mermel, C., Silver, S. J., Weir, B. A., Reiling, J. H., Sheng, Q., Gupta, P. B., … Hahn, W. C. (2009). Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature, 462(7269), 108–112. doi:10.1038/nature08460.
Yoshihara, K., Shahmoradgoli, M., Martínez, E., Vegesna, R., Kim, H., Torres-Garcia, W., Treviño, V., Shen, H., Laird, P. W., Levine, D. A., Carter, S. L., Getz, G., Stemke-Hale, K., Mills, G. B., & Verhaak, R. G. (2013). Inferring tumour purity and stromal and immune cell admixture from expression data. Nature communications, 4, 2612. doi:10.1038/ncomms3612.
hack_estimate(test_expr)
hack_estimate(test_expr)
Obtain various immune biomarkers scores, which combined together give the immunophenoscore (Charoentong et al., 2017).
hack_immunophenoscore(expr_data, extract = "ips")
hack_immunophenoscore(expr_data, extract = "ips")
expr_data |
A normalized gene expression matrix (or data frame) with gene symbols as row names and samples as columns. |
extract |
A string controlling which type of biomarker scores you want to obtain. Possible choices are:
|
The immunophenoscore is conceived as a quantification of tumor immunogenicity. It is obtained by aggregating multiple immune biomarkers scores, which are grouped into four major classes:
MHC molecules (MHC), expression of MHC class I, class II, and non-classical molecules;
Immunomodulators (CP), expression of certain co-inhibitory and co-stimulatory molecules;
Effector cells (EC), infiltration of activated CD8+/CD4+ T cells and Tem (effector memory) CD8+/CD4+ cells;
Suppressor cells (SC), infiltration of immunosuppressive cells (Tregs and MDSCs).
The table below shows in detail the 26 immune biomarkers and cell types grouped by class together with the number of genes which represent them:
Class | | Biomarker/cell type | | No. genes |
MHC | B2M | 1 |
MHC | HLA-A | 1 |
MHC | HLA-B | 1 |
MHC | HLA-C | 1 |
MHC | HLA-DPA1 | 1 |
MHC | HLA-DPB1 | 1 |
MHC | HLA-E | 1 |
MHC | HLA-F | 1 |
MHC | TAP1 | 1 |
MHC | TAP2 | 1 |
CP | CD27 | 1 |
CP | CTLA-4 | 1 |
CP | ICOS | 1 |
CP | IDO1 | 1 |
CP | LAG3 | 1 |
CP | PD1 | 1 |
CP | PD-L1 | 1 |
CP | PD-L2 | 1 |
CP | TIGIT | 1 |
CP | TIM3 | 1 |
EC | Act CD4 | 24 |
EC | Act CD8 | 26 |
EC | Tem CD4 | 27 |
EC | Tem CD8 | 25 |
SC | MDSC | 20 |
SC | Treg | 20 |
A tibble with one row for each sample in expr_data
, a column sample_id
indicating sample identifiers and a number of additional columns depending
on the choice of extract
.
Samplewise gene expression z-scores are obtained for each of 26 immune cell
types and biomarkers. Then, weighted averaged z-scores are computed for each
class and the raw immunophenoscore () results as the sum of the
four class scores. Finally, the immunophenoscore (
) is given as an
integer value between 0 and 10 in the following way:
, if
;
, if
;
, if
.
github.com/icbi-lab/Immunophenogram
Charoentong, P., Finotello, F., Angelova, M., Mayer, C., Efremova, M., Rieder, D., Hackl, H., & Trajanoski, Z. (2017). Pan-cancer Immunogenomic Analyses Reveal Genotype-Immunophenotype Relationships and Predictors of Response to Checkpoint Blockade. Cell reports, 18(1), 248–262. doi:10.1016/j.celrep.2016.12.019.
hack_sig()
to compute Immunophenoscore biomarkers in different
ways (e.g. use signatures = "ips"
and method = "singscore"
).
check_sig()
to check if all/most of the Immunophenoscore biomarkers are
present in your expression matrix (use signatures = "ips"
).
hack_immunophenoscore(test_expr) hack_immunophenoscore(test_expr, extract = "class")
hack_immunophenoscore(test_expr) hack_immunophenoscore(test_expr, extract = "class")
Compute gene signature single sample scores in one of different ways. You can choose to apply either the original procedure or one of three single sample scoring methods: the combined z-score (Lee et al., 2008), the single sample GSEA (Barbie et al., 2009) or the singscore method (Foroutan et al., 2018).
hack_sig( expr_data, signatures = "all", method = "original", direction = "none", sample_norm = "raw", rank_norm = "none", alpha = 0.25 )
hack_sig( expr_data, signatures = "all", method = "original", direction = "none", sample_norm = "raw", rank_norm = "none", alpha = 0.25 )
expr_data |
A normalized gene expression matrix (or data frame) with gene symbols as row names and samples as columns. |
signatures |
It can be a list of signatures or a character vector indicating
keywords for a group of signatures. The default ( |
method |
A character string specifying which method to use for computing the single sample score for each signature. You can choose one of:
|
direction |
A character string specifying the singscore computation method depending on the direction of the signatures. Can be on of:
|
sample_norm |
A character string specifying the type of normalization affecting the single sample GSEA scores. Can be one of:
|
rank_norm |
A character string specifying how gene expression ranks should be normalized in the single sample GSEA procedure. Valid choices are:
|
alpha |
A numeric scalar. Exponent in the running sum of the single sample GSEA
score calculation which weighs the gene ranks. Defaults to |
For "original"
method, it is intended the procedure used in the original
publication by the authors for computing the signature score.
hack_sig()
can compute signature scores with the original method only if
this is a relatively simple procedure (e.g weighted sum of fitted model
coefficients and expression values).
For more complex methods, such as CINSARC, ESTIMATE and Immunophenoscore,
use the dedicated functions.
If signatures
is a custom list of gene signatures, then the "ssgsea"
method will be applied by default.
A tibble with one row for each sample in expr_data
, a column sample_id
indicating sample identifiers and one column for each input signature giving
single sample scores.
This section gives a brief explanation of how single sample scores are obtained from different methods.
Gene expression values are centered by their mean value and scaled by their standard deviation across samples for each gene (z-scores). Then, for each sample and signature, corresponding z-scores are added up and divided by the square root of the signature size (i.e. the number of genes composing a signature).
The combined z-score method is also implemented in the R package GSVA
(Hänzelmann et al., 2013).
For each sample, genes are ranked by expression value in increasing order and
rank normalization may follow (see argument rank_norm
). Then, two probability-like
vectors are computed for each sample and signature:
, the cumulative sum of weighted ranks divided by their total
sum for genes in the signature;
, the cumulative sum of ones (indicating genes not in the signature)
divided by the number of genes not in the signature.
The single sample GSEA score is obtained by adding up the elements of the
vector difference .
Finally, single sample scores could be normalized either across samples or across
gene signatures and samples.
The single sample GSEA method is also implemented in the R package GSVA
(Hänzelmann et al., 2013).
For signatures whose genes are supposed to be up- or down-regulated, genes are ranked by expression value in increasing or decreasing order, respectively. For signatures whose direction is unknown, genes are ranked by absolute expression in increasing order and are median-centered. Enrichment scores are then computed for each sample and signature by averaging gene ranks for genes in the signature. Finally, normalized scores are obtained by subtracting the theoretical minimum mean rank from the score and dividing by the difference between the theoretical maximum and minimum mean ranks.
The hacksig
implementation of this method works only with unidirectional (i.e.
all genes up- or down-regulated) and undirected gene signatures.
If you want to get single sample scores for bidirectional gene signatures (i.e.
signatures composed of both up- and down-regulated genes), please use the R
package singscore
(Foroutan et al., 2018).
Barbie, D. A., Tamayo, P., Boehm, J. S., Kim, S. Y., Moody, S. E., Dunn, I. F., Schinzel, A. C., Sandy, P., Meylan, E., Scholl, C., Fröhling, S., Chan, E. M., Sos, M. L., Michel, K., Mermel, C., Silver, S. J., Weir, B. A., Reiling, J. H., Sheng, Q., Gupta, P. B., … Hahn, W. C. (2009). Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature, 462(7269), 108–112. doi:10.1038/nature08460.
Foroutan, M., Bhuva, D. D., Lyu, R., Horan, K., Cursons, J., & Davis, M. J. (2018). Single sample scoring of molecular phenotypes. BMC bioinformatics, 19(1), 404. doi:10.1186/s12859-018-2435-4.
Hänzelmann, S., Castelo, R., & Guinney, J. (2013). GSVA: gene set variation analysis for microarray and RNA-seq data. BMC bioinformatics, 14, 7. doi:10.1186/1471-2105-14-7.
Lee, E., Chuang, H. Y., Kim, J. W., Ideker, T., & Lee, D. (2008). Inferring pathway activity toward precise disease classification. PLoS computational biology, 4(11), e1000217. doi:10.1371/journal.pcbi.1000217.
get_sig_info()
to get information about all implemented signatures.
check_sig()
to check if signatures are applicable to your data.
hack_cinsarc()
to apply the original CINSARC procedure.
hack_estimate()
to obtain the original ESTIMATE scores.
hack_immunophenoscore()
to apply the original Immunophenoscore procedure.
# Raw ssGSEA scores for all implemented signatures can be obtained with: hack_sig(test_expr, method = "ssgsea") # To obtain 0-1 normalized ssGSEA scores, use: hack_sig(test_expr, method = "ssgsea", sample_norm = "separate") # You can also change the exponent of the ssGSEA running sum with: hack_sig(test_expr, method = "ssgsea", sample_norm = "separate", alpha = 0.5) # To obtain combined z-scores for custom gene signatures, use: custom_list <- list(rand_sig1 = rownames(test_expr)[1:5], rand_sig2 = c(rownames(test_expr)[6:8], "RANDOMGENE")) hack_sig(test_expr, custom_list, method = "zscore")
# Raw ssGSEA scores for all implemented signatures can be obtained with: hack_sig(test_expr, method = "ssgsea") # To obtain 0-1 normalized ssGSEA scores, use: hack_sig(test_expr, method = "ssgsea", sample_norm = "separate") # You can also change the exponent of the ssGSEA running sum with: hack_sig(test_expr, method = "ssgsea", sample_norm = "separate", alpha = 0.5) # To obtain combined z-scores for custom gene signatures, use: custom_list <- list(rand_sig1 = rownames(test_expr)[1:5], rand_sig2 = c(rownames(test_expr)[6:8], "RANDOMGENE")) hack_sig(test_expr, custom_list, method = "zscore")
stratify_sig()
is supposed to be used in combination after hack_sig()
in
order to classify your samples in one of two or more signature classes.
stratify_sig(sig_data, cutoff = "original", probs = seq(0, 1, 0.25))
stratify_sig(sig_data, cutoff = "original", probs = seq(0, 1, 0.25))
sig_data |
A tibble result of a call to |
cutoff |
A character specifying which function to use to categorize samples by signature scores. Can be one of:
|
probs |
A numeric vector of probabilities with values in |
A tibble with the same dimension as sig_data
, having a column sample_id
with sample identifiers and one column for each input signature giving
sample classes.
scores <- hack_sig(test_expr, "immune") stratify_sig(scores)
scores <- hack_sig(test_expr, "immune") stratify_sig(scores)
A gene expression matrix simulating expression profiles of 20 samples. It should be used only for testing purpose.
test_expr
test_expr
A random normal data matrix with 20000 genes as rows and 20 samples as columns.
class(test_expr) dim(test_expr) check_sig(test_expr)
class(test_expr) dim(test_expr) check_sig(test_expr)