Introduction to DeepVISP:
Viral infections cause a variety of human diseases, including cancer. The molecular mechanisms of viral oncogenesis are complex and may involve chronic inflammation, cell cycle dysregulation, interference with cellular DNA repair mechanisms that results in genome instability, and disruption of host genetic and epigenetic integrity. Epstein-Barr virus (EBV) was the first human virus to be classified as carcinogenic and is involved in a variety of human cancers, including Burkitt lymphoma, Hodgkin lymphoma, NK/T-cell lymphomas and numerous subtypes of gastric carcinoma. In addition, persistent hepatitis B virus (HBV) infection combined with chronic inflammation may lead to chronic liver disease, progression to cirrhosis and the subsequent development of hepatocellular carcinoma (HCC), which is currently ranked as the fifth or sixth most common cancer. Human papillomavirus (HPV) infection, especially with the subset of mucosotropic HPVs (i.e., “high-risk” HPVs), is closely associated with more than 99% of human cervical carcinomas, as well as with squamous-cell carcinoma and head and neck squamous cell carcinoma (HNSC). Because oncogenic viruses are reported to cause ~15% of tumor cases overall, detecting viral integration sites (VISs) in the human genome is crucial for understanding the underlying carcinogenic mechanisms.
In this study, we compiled the integration sites (VISs) of three oncogenic viruses, namely Epstein-Barr virus (EBV), hepatitis B virus (HBV) and human papillomavirus (HPV), and examined the genes they most frequently target along with several genomic features of these VISs. Based on this benchmark data set, we developed a new deep convolutional neural network (CNN) model with an attention architecture, named DeepVISP (Deep learning for VIS prediction), for accurately predicting oncogenic VISs in the human genome. Our evaluation indicated that DeepVISP outperforms conventional machine-learning methods by automatically learning informative features and essential genomic positions from primary DNA sequences, and that it achieves high accuracy and robust performance, with average area under the curve (AUC) values greater than 0.8 for the different oncogenic viruses in both the training and independent data sets. Moreover, by exploiting the specific sequence features captured by DeepVISP, we identified regulatory factors potentially involved in virus integration and tumorigenesis, such as HOXB7, LHX6 and IKZF1. A systematic literature search further supported the biological relevance of the prediction results. In addition, cluster analysis of informative motifs indicated that representative k-mers in the simultaneous motifs may contribute to the virus recognition of host genes. A user-friendly web server (https://bioinfo.uth.edu/DeepVISP) was built for predicting putative oncogenic VISs in the human genome. Taken together, our study not only represents the first attempt to model the selection of oncogenic VISs with deep learning approaches at superior accuracy, but also systematically identifies potential DNA functional elements that may be related to virus insertion and the associated carcinogenesis.
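The summary above mentions two generic ingredients: a model that reads primary DNA sequences, and evaluation by AUC. The following is a minimal sketch of those two ideas only, not the DeepVISP implementation. The function names `one_hot_encode` and `auc_score`, the A/C/G/T channel order, and the choice to encode ambiguous bases such as N as all zeros are assumptions made for illustration.

```python
from typing import List, Sequence

BASES = "ACGT"  # assumed channel order; the source does not specify one


def one_hot_encode(seq: str) -> List[List[int]]:
    """Encode a DNA string as a len(seq) x 4 matrix.

    Bases outside A/C/G/T (for example N) become an all-zero row,
    which is an illustrative convention, not one taken from the source.
    """
    index = {base: i for i, base in enumerate(BASES)}
    matrix = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        matrix.append(row)
    return matrix


def auc_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Compute AUC as the probability that a randomly chosen positive
    example receives a higher score than a randomly chosen negative one.
    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy inputs, invented purely to exercise the two functions.
    print(one_hot_encode("ACGTN"))
    print(auc_score([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

A matrix of this shape is the typical input to a one-dimensional convolutional layer, and the pairwise AUC definition used here is quadratic in the number of examples, so it is suited to small illustrations rather than large benchmark sets.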