Drug response in cancer patients varies dramatically due to inter- and intra-tumor heterogeneity. While many studies have aimed to identify signature genes or biomarkers for drug response, transcriptome context plays a significant role in shaping the actual treatment outcome.
To this end, we developed a deep learning approach for drug response prediction based on deep regenerative models. By applying the variational autoencoder (VAE) framework, we generated representative models for more than 1000 cell lines using their baseline expression profile and trained prediction models for drug response based on the latent representation of the baseline expression profiles. Rigorous quality assessment and validation were implemented, including cross-validation, multiple replications, and cross-panel evaluation. Because the models were built on baseline gene expression profile, they are widely applicable.
Figure 1. General pipeline for building models.
Our imputation pipeline and analysis consist of three major components (Figure 1): a deep regenerative model for the representation of the baseline expression of cell lines used in CCLE and GDSC projects (Figure 1A), a regression model to impute drug response using the measured drug response data from CCLE and GDSC (Figure 1B), and validations and applications of our approach to TCGA and other clinical data sets. First, we used the baseline gene expression data to build the VAE models, resulting in compressed representation of the samples in the low-dimensional latent space, which consisted of latent vectors. Second, we utilized the latent vectors as the exposure variables and trained regression models to predict drug response using the observed data from CCLE and GDSC. These predictive models were achieved by employing an Elastic Net strategy (VAE model followed by Elastic Net, or VAEN) and we did it for each measured compound.
2.1. Data collection
Baseline data. The Cancer Cell Line Encyclopedia (CCLE) project has assessed gene expression in 1156 cell lines (version July 18, 2018). We downloaded RPKM values from RNA-sequencing (RNA-seq) data. Each cell line has its original cell lineage. The pool of tested cell lines could be matched to a wide range of cancer types, including both solid cancers and haematopoietic and lymphoid tissues. We excluded the lineages that had less than 20 samples. Our working dataset included 1100 cell lines from 19 cell lineages, which were used to build the VAE models. We selected the most variably expressed genes to construct the VAE models. Note that the same gene expression data in those cell lines was used for drug response prediction for CCLE and GDSC drugs.
Drug response data. The CCLE project assessed the pharmacological profiles of 24 anti-cancer compounds in 504 cell lines. We downloaded data from the CCLE (https://portals.broadinstitute.org/ccle) data portal, including mutations and drug response measured by the activity area (named “ActArea”). The GDSC project assessed drug response for 251 compounds using the same pool of cell lines. We downloaded the fitted dose response file from GDSC website (Release 7.0, version 17.3, access date: 8/29/2018). In this study, we used the negative log transformed IC50 (-LN_IC50) in the Elastic Net models, although AUC can also be used for the same purpose.
2.2. Model training
We implemented a three-layer VAE model, with the input layer, encoder, and latent layer, decoder, and the output layer (Figure 1A). The python library for deep learning, Keras (version 2.1.6) with a TensorFlow backend (version 1.0.1), was used to implement the VAE. Encoder is a process to encode the input vector with a mean vector and a standard deviation vector, respectively, followed by a nonlinear transformation, e.g., the rectified linear units (ReLU) or the Sigmoid activation. We defined the loss function as the mean squared error plus the KL loss. As for VAEN, a regression model was trained for each drug following a Elastic Net strategy with 5-fold cross-validation to select lambda. We used the average R2 in the holdout samples to select the model.
We explored TWO ways of normalization:
- Rank-based inverse normal transformation (Rank)
- Z-score normalization of all genes for each sample (ZS)
2.3. Model performance
Figure 2. Comparisons betweend observed and predicted drug response.
The predicted drug response is similar to the original data. Some compounds, such as 17-AAG, Irinotecan, Paclitaxel, PD−0325901, and Topotecan, had a relatively wider range of response, whereas other compounds (e.g., AEW541, Erlotinib, Nutlin-3, and PLX4720) had a narrow distribution.
Figure 3. Model performance evaluated by CCLE, GDSC, and TCGA data.
We used the rank-based transformation and the sigmoid activation for the VAE part. Using the VAEN models, the self-prediction accuracy for 24 CCLE compounds, measured by Pearson correlation coefficient (PCC), ranged between 0.38 (LBW242) and 0.77 (Irinotecan). For the 251 GDSC compounds, the self-prediction accuracy ranged between 0.26 (Avagacestat) and 0.82 (AZ628), with 203/251 (80.88%) compounds having PCC greater than 0.5 (Figure 3A).
We used the 14 drugs measured by both CCLE and GDSC to evaluate the prediction results. We used ActArea to represent the drug response in CCLE. In the comparison of the predicted drug response among the cell lines, we found a positive correlation between the observed and predicted drug response (Figure 3B). When a drug showed a high consistent drug response between CCLE and GDSC [e.g., PCC (Nilotinib) = 0.75 between the observed drug response in CCLE and in GDSC among the 227 cell lines], it also displayed a trend towards a high consistency in the predicted data [PCC (Nilotinib) = 0.8 for predicted drug response]. Notably, for each drug, the VAE models selected for response imputation differed in the CCLE prediction model and in the GDSC prediction model. Because different VAE models are projected in different spaces with inconsistent conformation, it is mathematically challenging to obtain highly correlated prediction. The results in Figure 2B indicated that our models are reproducible across study panels. Strikingly, the predicted response to each drug using the CCLE model and the GDSC model had even higher correlation than the observed response. A similar positive trend was observed when we applied the models to TCGA data. The drugs having high consistency in the original CCLE and GDSC projects tended to have high consistency among the predicted drug response (using the models trained by CCLE and GDSC, respectively) in cancer samples (Figure 3C).
3. Webserver workflow
Figure 4. Workflow of the webserver.
Our webserver provides user-friendly interfaces for users to submit jobs, check job status, and retrieve results.
Figure 5. Job submission form.
- Job identifier: Job identifier can be generated automatically or customized by the submitter. It is confidential to other users and can be used for job status monitoring and result retrieval.(See Results page).It is required.
- Model selection: DrVAEN supports 4 backend-VAE models for expression data representation. In our test, Rank+Sigmoid has the better performance for most of drugs, but not for all. Other models may serve better for some drugs respectively.
- Drug panel: There are 24 drugs in CCLE panel and 251 drugs in GDSC panel (14 drugs are shared by both CCLE and GDSC). We have set a limitation on the total number of selected compounds to 50. For each job, users can submit up to 50 drugs. It will take up to 10 minutes to finish.
- Expression file: An expression file must be uploaded. The file should be in TSV format and following the format: genes are in rows and samples are in columns. For genes, we support Gene symbols, NCBI gene IDs, and Ensembl IDs (e.g., ENSG00000123091. No dot suffix). Please refer to the example dataset.
sample1 sample2 ...... sampleN Gene1 Gene2 ...... GeneX
- Operation buttons: Submit, reset the submission form, or access the example dataset.
(1) Predicted drug response velues
For each selected model, DrVAEN will generate a data table to show the predicted drug response values for all submitted drugs. (Of note, in CCLE, we used ActArea as the measurement of drug response. ActArea had a negative relationship with IC50 (i.e., a low IC50 means a high ActArea and high sensitivity). In GDSC, we used –LN(IC50) as the measurement of drug response and, thus, the predicted values had the same trend as ActArea, i.e., a high predicted value indicates a high sensitivity).
(2) Drug rank summary table
For each sample in the expression file, the predicted response value for each drug is compared with the predicted drug response data for TCGA pan-cancer samples (10459 samples across 33 cancer types). The rank of the user-input drugs in each sample was calculated as compared to all 10459 TCGA samples. For each drug, a lower rank number indicates the sample receives higher drug response (higher sensitivity). This table shows the sorted drugs in each sample based the rank.
(3) Drug response distribution and comparison with TCGA Data
The distributions of the predicted drug response for user submitted data and TCGA data will be displayed as boxplots. Users can compare their drug response results with the predicted drug responses for TCGA pan-cancer samples (all cancer types are shown in one box, for the drug response distribution in each cancer type, you can check them in our designed function DrTCGA).
At the page bottom, a “download” button is provided for downloading all the results.
3.3. Results retrieval
Job queue. DrVAEN implements a queue mechanism. Our DrVAEN webserver can run five jobs at the same time (note: each job can contain multiple samples). When a new job is submitted, if there are less than 5 jobs running, the newly submitted job will be ran directly with a "Running" status. Otherwise, the new job will be put into the waitlist with a "Pending" status. Once a job is finished, DrVAEN will check the queue and pick the next job to run until the waitlist is empty. The results page allows users to monitor the job queue and check their own job status by retrieving their job ID. The job ID is partially masked for a confidential purpose.
Job status. We used different color schemes to present the job status.
4.Drug response in TCGA
We have applied VAEN in all TCGA pan-cancer samples for ~10 000 cancer samples from 33 cancer types. Users may explore these results in DrTCGA page. The drug/compound name will be automatically searched and listed after you type at least two letters. If there is no drug name listed, it means our DrVAEN did not support the inputed drug name.
6. How to cite
Please cite: Peilin Jia#, Ruifeng Hu#, Zhongming Zhao*. DrVAEN: Drug response prediction through deep Variational Autoencoder model followed by Elastic Net. Web site: https://bioinfo.uth.edu/drvaen