Oncogenic viruses account for about one sixth of tumorigenesis cases. A clear and detailed understanding of where viruses integrate, how integration sites are distributed, and which specific viral sequences are present in the human genome is helpful for curing cancers caused by viral infection. Investigating the integration of viruses into host genes or sequences from a genomic perspective is of great significance for screening virus-susceptible populations, preventing virus infection, developing new therapeutics and precisely optimizing treatment. In this study, we develop VISDB (Virus Integration Site DataBase) to provide a knowledgebase of site-related information about viruses integrated into the human genome and to benefit researchers studying viruses and correlated malignant diseases.
VISDB covers 9 main viruses, including 5 DNA oncoviruses (HBV, HPV, EBV, MCV, AAV2) and 4 RNA retroviruses (HIV, MLV, HTLV, XMRV). The current version of VISDB deposits 77,602 virus integration sites (VISes) carefully curated from 108 publications (some articles cover multiple viruses).
  1. HBV, 20558 VISes, 45 publications
  2. HPV, 5118 VISes, 31 publications
  3. EBV, 1144 VISes, 7 publications
  4. MCV, 55 VISes, 9 publications
  5. AAV2, 24 VISes, 3 publications
  6. HIV, 16797 VISes, 10 publications
  7. HTLV-1, 33845 VISes, 4 publications
  8. MLV, 32 VISes, 1 publication
  9. XMRV, 29 VISes, 2 publications
Figure 1 shows the overview of VISDB. Scientific publications reporting VISes, the VISes themselves, and VIS-related data such as virus sequences, genes, miRNAs, fragile sites and other kinds of annotations are curated and stored in VISDB. We first extract VISes from literature downloaded from public databases such as PubMed and ScienceDirect. Target genes, nearby genes, fragile sites and miRNAs are provided in some publications, but in other cases the original publication gives only part of this information. The missing information can still be curated if the publication provides the exact integration position, including genome assembly, chromosome and location on the chromosome. Therefore, we discard studies that do not contain exact integration position information. Because VISDB already contains a large number of VISes, discarding a small number of them does not materially affect VIS coverage. Oncogenes and tumor suppressor genes are further screened out from target genes and nearby genes, and the associations between genes and miRNAs are also curated. Finally, functions such as browse, search, curation, gene feature, microRNA feature, download, statistics, help, and feedback are provided for users.
Figure 1. Overview of VISDB

VIS data model

We build a universal virus integration event model to coordinate virus integration sites across various original publications or datasets. The model covers detection, analysis and archival of virus integration sites (VIS), allowing junction sequences to show complicated cases of viral integration such as rearrangement, mutation, reverse inserts and microhomology junctions. As shown in Figure 2, our model consists of four kinds of data: basic information, virus sequence, target sequence and junction sequence. The category of basic information includes the objects or tools used to detect or describe a VIS in the original article, such as the samples and experimental assays used in the experiment and the source of the literature. In addition, we add some identifying attributes, such as a status attribute to show whether the VIS is validated by an experimental assay and a completeness attribute to evaluate the integrity of the VIS.
For the virus sequence category of VISDB, we extract and record information about the reference genome that the virus sequence was originally aligned to, including the code of the genome assembly, the genome name, the FASTA file of the viral genome and the hyperlink to the genome in the NCBI Nucleotide database. We also record the integrated sequences and metadata about each sequence, such as start breakpoint, stop breakpoint and viral gene name. If the integrated virus sequence is not provided by the original publication, we curate it when the reference genome and breakpoints are given. When two breakpoints are given, we extract the sequence between the start breakpoint and the end breakpoint. When only one breakpoint is given, we extract two 100 bp sequences, one upstream and one downstream of the VIS, and concatenate them into one sequence separated by the substring "| |".
The integration position on the human reference genome is critical for curating a VIS. In the best case, we can extract the chromosome, cytoband, location on the chromosome and genome assembly from the original article. The human sequences into which the virus integrated are also extracted and deposited in the database. For a VIS with a single location, we record the sequences both upstream and downstream of the insertion site. If both a start position and an end position are provided, the sequence upstream of the start position, the sequence downstream of the end position, and the sequence between the two positions are curated according to the genome assembly declared in the article.
The junction sequence category plays a significant role in analyzing integration patterns. Although we wish to store the FASTA file of the junction sequence and its annotation, together with the coverage and mapped reads of each VIS, we find junction sequences for fewer than 5 percent of VISes. Furthermore, we mark all endpoints belonging to the human sequence or virus sequence and map these points to specific positions in the reference genome so that the integration event can be visualized, as shown in Figure 3.
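As an illustration only, the two extraction cases described above might be sketched in Python as follows. The function name, the 1-based inclusive coordinate convention and the handling of genome ends are assumptions made for this sketch, not a description of the VISDB implementation; circular viral genomes are not handled.

```python
from typing import Optional

SEPARATOR = "| |"  # substring placed between the two flanking sequences


def extract_virus_sequence(genome: str, start: int,
                           end: Optional[int] = None,
                           flank: int = 100) -> str:
    """Return the integrated sequence or its flanks (hypothetical helper).

    Assumes 1-based, inclusive breakpoints on a linear reference sequence.
    """
    if end is not None:
        # Two breakpoints: the sequence between them, in either order.
        low, high = sorted((start, end))
        return genome[low - 1:high]
    # One breakpoint: up to `flank` bases on each side, joined by the separator.
    index = start - 1
    upstream = genome[max(0, index - flank):index]
    downstream = genome[index:index + flank]
    return upstream + SEPARATOR + downstream
```

For a single breakpoint far from either end of the genome, the result is 203 characters long: 100 bases, the three-character separator, and 100 bases.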
Figure 2. A data view of an integration event. An integration event contains basic information about the VIS, information about the integrated virus sequence, the target sequence or point in the host genome, and the junction sequence after the integration event has occurred. Double ellipses mean that multiple virus sequences may exist.
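To make the four categories concrete, the data model could be sketched with Python dataclasses as below. All class and field names are hypothetical, and the completeness score shown here (the fraction of the four categories that are filled) is only an assumed stand-in for the completeness attribute, whose actual definition is not given on this page.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BasicInfo:
    pubmed_id: Optional[str] = None
    sample: Optional[str] = None
    assay: Optional[str] = None
    validated: bool = False  # status: validated by an experimental assay


@dataclass
class VirusSequence:
    reference_genome: str
    start_breakpoint: Optional[int] = None
    stop_breakpoint: Optional[int] = None
    viral_gene: Optional[str] = None


@dataclass
class TargetSequence:
    assembly: str
    chromosome: str
    start: int
    end: Optional[int] = None


@dataclass
class JunctionSequence:
    sequence: str


@dataclass
class IntegrationEvent:
    basic: BasicInfo
    target: TargetSequence
    # Several virus sequences may belong to one event (double ellipses in Figure 2).
    virus_sequences: List[VirusSequence] = field(default_factory=list)
    junction: Optional[JunctionSequence] = None

    def completeness(self) -> float:
        """Assumed score: share of the four categories that hold data."""
        present = [True, True, bool(self.virus_sequences), self.junction is not None]
        return sum(present) / len(present)
```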

VIS visualization

We develop a visualization tool to display rich information about the features of integration sites. Virus integration into the human genome can follow many different patterns. In the simplest pattern, a segment of the virus sequence is broken off and inserted into the host genome with no other process taking place during the integration event. However, reverse inserts, rearrangements, microhomology and mutations may occur during integration, making the event complex. In this study, we consider a virus sequence integrated within a human sequence to have the form "human sequence" + "virus-mixed sequences" + "human sequence". In other words, a junction sequence is composed of a human sequence preceding the integration region, a mixed sequence containing virus sequences and unknown sequences but no human sequences, and a human sequence following the integration region. Notably, both an overlap between the human sequence and the virus sequence and an unknown sequence between them are allowed. However, no human sequence can exist in the mixed sequence; otherwise, the integration event is divided into two events.
Figure 3. Examples of integration events. Red lines are virus sequences, black lines are human sequences, gray lines are sequences that belong to neither virus nor human, and green lines are overlaps of virus sequence and human sequence. Mutations in the integration event are not included. Characters in blue are coordinates on the virus or human reference genome. Coordinates are hyperlinks to NCBI genome assembly or NCBI Nucleotide.
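The rule that a human sequence inside the mixed region divides one event into two can be sketched as follows. This is a simplified illustration under stated assumptions: segments are given as hypothetical (kind, label) pairs in junction order, overlaps between human and virus sequence are not modelled, and a mixed region is only reported as an event if it contains at least one virus segment.

```python
from typing import List, Optional, Tuple

Segment = Tuple[str, str]  # (kind, label); kind is "human", "virus" or "unknown"
Event = Tuple[Segment, List[Segment], Segment]  # left human, mixed, right human


def split_events(segments: List[Segment]) -> List[Event]:
    """Split an ordered run of segments into human + mixed + human events."""
    events: List[Event] = []
    left: Optional[Segment] = None
    mixed: List[Segment] = []
    for segment in segments:
        kind = segment[0]
        if kind == "human":
            # A human segment closes the current event and opens the next one.
            if left is not None and any(k == "virus" for k, _ in mixed):
                events.append((left, mixed, segment))
            left, mixed = segment, []
        elif kind in ("virus", "unknown"):
            if left is not None:
                mixed.append(segment)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return events
```

With this sketch, the input human, virus, unknown, human, virus, human yields two events that share the middle human segment as a flank.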

Steps of constructing this knowledgebase

VISes in our database are extracted from scientific literature or downloaded from other VIS databases. Because our data model is very comprehensive in its VIS criteria, only a few VISes meet our requirements completely. Therefore, we first fill the items with values extracted from original articles or downloaded from public databases, then curate those VISes to improve the integrity of VIS information. However, a null value is unavoidable for VIS-related information that cannot be obtained using the methods described.

Literature-based VISes

Collecting literature containing VISes
All scientific literature containing VISes was downloaded from PubMed, ScienceDirect, Google Scholar and Wiley with the authorization of the University of Texas Health Science Center at Houston. We searched these data sources using different combinations of the following keywords: virus integration, viral integration, integration, integration site, full name of virus, abbreviation of virus, etc. The retrieved articles formed the initial literature set. For example, articles related to HBV were retrieved with the following statements:
  1. ((viral integration[Title/Abstract] OR virus integration[Title/Abstract]) AND (HBV[Title/Abstract] OR hepatitis B virus[Title/Abstract]))
  2. ((integration site[Title/Abstract]) AND (HBV[Title/Abstract] OR hepatitis B virus[Title/Abstract]))
  3. ((integration[Title]) AND (HBV[Title/Abstract] OR hepatitis B virus[Title/Abstract]))
In this way, we collected about 150 research articles for HBV, 23 papers for AAV2, 29 papers for EBV, 21 papers for MCV, 90 papers for HPV, etc.
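The three search statements follow one pattern, so the statements for other viruses could be generated with a small helper like the sketch below. The function name is hypothetical, and Boolean operators are written in uppercase as a stylistic choice of this sketch.

```python
from typing import List


def build_pubmed_queries(abbreviation: str, full_name: str) -> List[str]:
    """Build the three search statements for one virus (illustrative helper)."""
    virus = f"({abbreviation}[Title/Abstract] OR {full_name}[Title/Abstract])"
    topics = [
        "(viral integration[Title/Abstract] OR virus integration[Title/Abstract])",
        "(integration site[Title/Abstract])",
        "(integration[Title])",
    ]
    return [f"({topic} AND {virus})" for topic in topics]


if __name__ == "__main__":
    for query in build_pubmed_queries("HPV", "human papillomavirus"):
        print(query)
```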
Filtering literature with better quality VISes and downloading supplementary files
For the initial literature set, we scanned the full text of each article to check whether it contained valid integration sites. If a paper contained VISes, the supplementary files of the original article were downloaded for further interpolation. If supplementary files for a certain VIS were not provided, we discarded the paper. Furthermore, with the advance of next-generation sequencing, more VISes have been found, and the VISes from articles published in recent years cover almost all of the VISes reported in literature published 20 or 30 years ago. Indeed, few studies published before 2004 reported specific locations in the host genome or breakpoints in the virus genome, and we had to discard them. Thus, some articles published before 2004 are not included in VISDB. After this step, we retained 108 papers.
Extracting VIS information from literature
VIS information is scattered across scientific literature and public biological databases without any fixed format. Therefore, we had to extract VISes from the literature manually. In some cases, the following VIS information could be extracted from the literature database, the original paper or its supplementary files.
  1. Paper metadata, such as PubMed ID, title, journal, year of publication, authors and their affiliations
  2. Samples used in detecting VIS, including sample name, sample type, sample size, donor's age and gender
  3. Human reference genome used to align the junction sequence
  4. Virus genome used to detect the integrated virus sequence
  5. Method used to detect VIS
  6. Disease of a sample
  7. Integration locations in human genome, including chromosome, loci, orientation, start position and end position, etc.
  8. Integrated virus sequence, including start breakpoint, end breakpoint, orientation, etc. (for VIS detected by NGS-related technology, coverage and reads number are recorded as well)
  9. Target genes or genes in the vicinity of VIS
  10. Fragile sites in which the VIS is located
  11. miRNA related to target genes
  12. Other information
Unfortunately, this information was often hard to obtain from the literature directly. In most cases, we used the following tactics:
  1. Search the whole article to extract the objective information such as human genome, virus genome, experimental assay
  2. Copy the text from the PDF version of the article, save it to a text file and import it into Excel
  3. Use string functions provided by Excel (such as concat, find, left, right and len) to extract or normalize data items
  4. Use sort and replace functions provided by Excel to remove duplicates or group the rows to accelerate data compilation.
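The same kind of string handling can be expressed in a few lines of Python. The sketch below is only an analogue of the Excel steps above, not the procedure we used: the location format "chr8:128748315-128748869" and both function names are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple


def parse_location(text: str) -> Tuple[str, int, Optional[int]]:
    """Split a location such as 'chr8:100-200' into chromosome, start and end."""
    text = text.strip().replace(",", "")  # drop padding and thousands separators
    colon = text.find(":")                # same idea as Excel's find()
    if colon == -1:
        raise ValueError(f"no chromosome separator in {text!r}")
    chromosome = text[:colon]             # left()
    rest = text[colon + 1:]               # right()
    dash = rest.find("-")
    if dash == -1:
        return chromosome, int(rest), None
    return chromosome, int(rest[:dash]), int(rest[dash + 1:])


def unique_rows(rows: List[str]) -> List[str]:
    """Remove duplicate rows while keeping their first-seen order."""
    return list(dict.fromkeys(row.strip() for row in rows))
```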
Curating VISes with public biological database
After extracting VIS information from the literature, we curate VISes with public biological databases such as NCBI GenBank, KEGG, ENCODE, Genecards, RID, ONGene, TSGene, HumCFS, miRTarBase, miRNA, etc.
  1. Validation of target genes
  2. Validation of genes near VIS without targeting any gene
  3. Annotation of VISes with oncogenes from the ONGene database
  4. Annotation of VISes with tumor suppressor genes from the TSGene database
  5. Extraction of upstream and downstream sequences of VISes
  6. Extraction of integrated virus sequence and human sequence if two locations or breakpoints are provided
  7. Extraction of fragile sites from HumCFS and correlation with VISes
  8. Extraction of miRNAs from miRTarBase and miRNA and correlation with VISes
  9. Links to GenBank, KEGG, ENCODE, Genecards, etc.
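As a rough illustration of the first two steps, deciding whether a VIS falls inside a target gene or only has a nearby gene, and computing the distance in bp, might look like the sketch below. The gene symbols and coordinates in the usage example are placeholders, the function name is hypothetical, and all genes are assumed to lie on the same chromosome and assembly as the VIS.

```python
from typing import List, Optional, Tuple

Gene = Tuple[str, int, int]  # (symbol, start, end), inclusive coordinates


def annotate_vis(position: int, genes: List[Gene]) -> Optional[Tuple[str, int, str]]:
    """Return (symbol, distance_bp, relation) for one integration position."""
    nearest: Optional[Tuple[str, int, str]] = None
    for symbol, start, end in genes:
        if start <= position <= end:
            return symbol, 0, "target"  # the VIS lies inside this gene
        distance = start - position if position < start else position - end
        if nearest is None or distance < nearest[1]:
            nearest = (symbol, distance, "nearby")
    return nearest


if __name__ == "__main__":
    genes = [("GENE_A", 100, 200), ("GENE_B", 500, 600)]  # placeholder genes
    print(annotate_vis(150, genes))
    print(annotate_vis(260, genes))
```

A linear scan is enough for a sketch; an interval index would be a more natural choice for genome-scale annotation.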
Revising VISes and computing auxiliary data to visualize VISes
Once we had finished collecting VISes for the 9 viruses, a major challenge emerged: the same type of VIS information was reported in many inconsistent ways. For instance, some authors use chr23 and chr24 to refer to the allosomes, while most papers use chrX and chrY. MLL4 was reported as a hotspot oncogene resulting in hepatoma carcinomas, but this gene's official symbol in GenBank is KMT2B. Moreover, some authors use one of its aliases as the gene name, such as RX2, MLL2, TRX2, WBP7, DYT28, MLL1B, WBP-7 and CXXC10. These inconsistencies cause inaccuracy in statistics. To reduce this risk, we perform the following revisions and normalizations:
  1. Normalize gene names to the official symbols in GenBank
  2. Use GRCh38/hg38 as the reference genome to visualize the integration event for literature-curated VISes (for imported VISes, we link to the source database)
  3. Flag VISes detected by NGS-related technology (these need no experimental assay)
  4. Use bp as the unit of measurement in calculating the distance of gene to VIS
  5. Use millimeter as the unit of measurement for sample sizes
  6. Give each virus a default code if no specific virus reference genome is provided; this code does not link to any genome in the Nucleotide database
  7. Give all VISes without a specific reference genome a default code that does not link to UCSC or NCBI.
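The chromosome and gene-name normalizations can be sketched as simple lookup tables. The mappings below contain only examples mentioned in the text above; the function names and the case-insensitive matching of gene aliases are assumptions of this sketch, and a real alias table would have to come from GenBank.

```python
# Numeric allosome names used by some authors, mapped to the common names.
CHROMOSOME_ALIASES = {"chr23": "chrX", "chr24": "chrY"}

# A few of the aliases listed above for the official symbol KMT2B.
GENE_ALIASES = {
    alias: "KMT2B"
    for alias in ("MLL4", "TRX2", "WBP7", "WBP-7", "DYT28", "MLL1B", "CXXC10")
}


def normalize_chromosome(name: str) -> str:
    """Return the chromosome name with a 'chr' prefix and X/Y for 23/24."""
    name = name.strip()
    if not name.startswith("chr"):
        name = "chr" + name
    return CHROMOSOME_ALIASES.get(name, name)


def normalize_gene(symbol: str) -> str:
    """Return the official symbol for a known alias, else the input unchanged."""
    symbol = symbol.strip()
    return GENE_ALIASES.get(symbol.upper(), symbol)
```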

Imported VISes

Downloading VISes and literature
RID (Retrovirus Integration Database, https://rid.ncifcrf.gov/) is a relational database containing information about retrovirus integration sites in host genomes and is sponsored by the HIV Dynamics and Replication Program (HIV DRP), National Cancer Institute, NIH. It collects about 4 million VISes from 18 papers on HIV, HTLV, MLV and ALV. The insert position on the host chromosome, target genes or nearest genes, and distances are presented. In addition, the locations in the human host genome are mapped to hg19. However, the website does not list the reference virus genome or the details of samples and experimental assays, let alone the sequence around the integration site and the junction sites. To give more details about those VISes, we download some VISes as well as the literature from RID and further curate the VIS information that is not provided by RID.
HPVbase is an integrated viral resource and analysis platform for human papillomavirus-mediated carcinomas. It contains 1257 entries, including methylation patterns and miRNA expression. We download some VISes that are consistent with our schema and curate them with more VIS information.

Replenishing VISes to match VISDB
After downloading VISes from RID and consulting the literature, we curated those VISes according to the original papers and public biological databases.
  1. Extract experimental assay from original paper and correlate it to VIS
  2. Extract sample information from original paper and correlate it to corresponding VISes
  3. Extract disease information related to the sample and correlate it to samples
  4. Extract virus reference genome to which junction sequence is aligned
  5. Supplement genes without gene ID in RID and normalize them with the official symbol.

Contact us

We appreciate your feedback. Please send an email if you wish to make a request, leave a comment, or report a bug.

Zhongming Zhao, PhD, MS
Chair Professor for Precision Health
Professor of Biomedical Informatics and Human Genetics
School of Biomedical Informatics and School of Public Health
Founding Director, Center for Precision Health
Director, UTHealth Cancer Genomics Core
The University of Texas Health Science Center at Houston (UTHealth)
Phone: 713-500-3631
Email: zhongming.zhao@uth.tmc.edu

Deyou Tang, PhD
Visiting scholar
School of Biomedical Informatics and School of Public Health
University of Texas Health Science Center at Houston
Email: Deyou.Tang@uth.tmc.edu

Citation of the Database:

To cite the VISDB website in a publication, please quote the following:
Tang D, Li B, Xu T, Hu R, Tan D, Song X, Jia P, Zhao Z (2020) VISDB: a manually curated database of viral integration sites in the human genome. Nucleic Acids Research 48(D1):D633-D641