Introduction to DeepVISP:


Viral infections induce various human diseases, including cancer. The molecular mechanisms of viral oncogenesis are complex and may involve chronic inflammation, cell cycle dysregulation, interference with cellular DNA repair mechanisms that results in genome instability, and disruption of host genetic and epigenetic integrity. Epstein-Barr virus (EBV) was the first human virus to be categorized as carcinogenic and is involved in a variety of human cancers, including Burkitt lymphoma, Hodgkin lymphomas, NK/T cell lymphomas and numerous subtypes of gastric carcinomas. In addition, persistent hepatitis B virus (HBV) infection combined with chronic inflammation may lead to chronic liver disease, progression to cirrhosis and the consequent development of hepatocellular carcinoma (HCC), which ranks as the fifth or sixth most common cancer today. Human papillomavirus (HPV) infection, especially with the subset of mucosotropic HPVs (i.e., “high-risk” HPVs), is closely associated with more than 99% of human cervical carcinomas and is also linked to squamous-cell carcinoma and head and neck squamous cell carcinoma (HNSC). Since oncogenic viruses are reported to cause ~15% of all tumor cases, detecting viral integration sites (VISs) in the human genome is crucial for understanding the underlying carcinogenic mechanisms.

In this study, we compiled the integration sites (VISs) of three oncogenic viruses, namely Epstein–Barr virus (EBV), hepatitis B virus (HBV) and human papillomavirus (HPV), and carefully studied the most frequently targeted genes and several genomic features of these VISs. Based on this benchmark data set, we developed a new deep convolutional neural network (CNN) model with an attention architecture, named DeepVISP (Deep learning for VIS prediction), for accurately predicting oncogenic VISs in the human genome. Our evaluation indicated that DeepVISP outperforms conventional machine-learning methods by automatically learning informative features and essential genomic positions from primary DNA sequences, and that it achieves high accuracy and robust performance, with average Area Under the Curve (AUC) values greater than 0.8 for the different oncogenic viruses on both the training and independent data sets. Moreover, by exploiting the specific sequence features captured by DeepVISP, we decoded some regulatory factors potentially involved in virus integration and tumorigenesis, such as HOXB7, LHX6, and IKZF1. A systematic literature search further validated the biological relevance of the prediction results. In addition, cluster analysis of informative motifs indicated that representative k-mers in the simultaneous motifs may contribute to virus recognition of host genes. A user-friendly web server (https://bioinfo.uth.edu/DeepVISP) was built for predicting putative oncogenic VISs in the human genome. Taken together, our study not only represents the first attempt to model the selection of oncogenic VISs with deep learning approaches at superior accuracy, but also systematically identifies potential DNA functional elements that may be related to virus insertion and the associated carcinogenesis.
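For readers unfamiliar with the AUC metric quoted above, the following is a minimal sketch of how an area under the ROC curve can be computed from labels and predicted scores. It uses the pairwise-ranking definition of AUC; the function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not taken from the DeepVISP evaluation.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    `labels` holds 1 for positives (e.g. true VISs) and 0 for negatives.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (made-up numbers, for illustration only).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the reported threshold of 0.8.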


  Schematic overview of the DeepVISP pipeline:


(A) Experimentally determined VISs were collected as positive data from VISDB, an internally maintained database, yielding 20,588 HBV, 5,118 HPV and 1,112 EBV VISs. One-hot encoding is used to represent the position-specific composition of the nucleotides in a VDS, with a 4-digit binary vector associated with each nucleotide; (B) We designed the deep learning framework with four parts: an input layer, two convolution-pooling modules, an attention layer and an output layer; (C) Evaluation of the model, development of the web server, and decoding of potential regulatory factors involved in oncogenic virus integration.
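The one-hot encoding in step (A) can be illustrated with a short sketch. The helper name `one_hot_encode`, the A, C, G, T column order, and the handling of ambiguous bases such as N are assumptions made for illustration; the DeepVISP implementation may differ on each of these points.

```python
# Column order of the 4-digit binary vector (an assumed convention).
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Return one 4-element binary vector per nucleotide in `sequence`."""
    encoded = []
    for base in sequence.upper():
        # Bases outside A/C/G/T (e.g. N) map to the all-zero vector.
        # This is an assumption, not a documented DeepVISP behavior.
        encoded.append([1 if base == ref else 0 for ref in NUCLEOTIDES])
    return encoded


# Encode a short toy sequence and print one row per position.
example = "ACGTN"
matrix = one_hot_encode(example)
for base, row in zip(example, matrix):
    print(base, row)
```

A sequence of length L therefore becomes an L x 4 binary matrix, which is the kind of input a convolutional layer can scan position by position.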

The Workflow of DeepVISP

  The deep learning framework implemented in DeepVISP:


CNNs usually contain multiple parts, including an input layer, convolutional layers, a fully connected layer and an output layer. In this work, we designed our model with an input layer, several convolutional layers, an attention layer, and an output layer. We used the rectified linear unit (ReLU) as the activation function. Specifically, the input layer accepts the labeled training data with its representative features, and the convolutional layers perform feature extraction and representation. The attention layer was included to capture the underlying importance of the VDS (500, 500). It takes the feature representation from the last convolutional layer as input and calculates a score indicating whether the neural network should pay more attention to the features at that position. Subsequently, the feature vectors captured by the convolutional layers and the attention scores are integrated and fed to a logistic regression classifier to obtain an output score.
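The attention and output steps described above can be sketched in plain Python. This is a simplified illustration, not the DeepVISP code: it assumes a single learned scoring vector, softmax normalization of the per-position scores, and a weighted sum as the way features and attention scores are "integrated". All names (`attention_pool`, `logistic_output`) and all numeric values are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def softmax(values):
    """Normalize raw scores into weights that sum to 1."""
    highest = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, att_weights):
    """Score each position, then return (weights, weighted feature sum).

    `features` is a list of per-position feature vectors, standing in for
    the output of the last convolutional layer.
    """
    scores = [dot(h, att_weights) for h in features]
    alphas = softmax(scores)
    n_channels = len(features[0])
    context = [
        sum(a * h[c] for a, h in zip(alphas, features))
        for c in range(n_channels)
    ]
    return alphas, context


def logistic_output(context, out_weights, bias):
    """Logistic regression on the pooled features; returns a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(dot(context, out_weights) + bias)))


# Toy input: 3 positions with 2 feature channels each (made-up numbers).
features = [[0.1, 0.0], [0.9, 0.4], [0.2, 0.1]]
alphas, context = attention_pool(features, att_weights=[2.0, 1.0])
score = logistic_output(context, out_weights=[1.5, -0.5], bias=0.0)
```

In this sketch the second position receives the largest attention weight because its features align best with the scoring vector, which mirrors the idea of the network attending to the most informative positions.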


  If you find DeepVISP useful, please cite:



Xu H, Jia P, Zhao Z. DeepVISP: deep learning for virus site integration prediction and motif discovery. Advanced Science, 2021, 8(9): 2004958.