Oncogenic viruses account for about one sixth of tumorigenesis cases. A detailed understanding of
where viruses integrate, how integration sites are distributed, and which viral sequences are present in the
human genome helps in treating cancers caused by viral infection. Investigating the integration
of viruses into host genes or sequences from a genomic perspective is important for screening virus-susceptible
populations, preventing viral infection, developing new therapeutics and optimizing treatment.
In this study, we develop VISDB (Virus Integration Site
DataBase) to provide a knowledgebase of site-related
information about viruses integrated into the human genome and to benefit researchers studying
viruses and related malignant diseases.
VISDB covers 9 main viruses, including 5 DNA oncoviruses (HBV, HPV, EBV, MCV, AAV2) and
4 RNA retroviruses (HIV, MLV, HTLV, XMRV). The current version of VISDB contains 77,602 integration sites (VISes)
carefully curated from 108 publications (some articles cover multiple viruses).
HBV, 20558 VISes, 45 publications
HPV, 5118 VISes, 31 publications
EBV, 1144 VISes, 7 publications
MCV, 55 VISes, 9 publications
AAV2, 24 VISes, 3 publications
HIV, 16797 VISes, 10 publications
HTLV-1, 33845 VISes, 4 publications
MLV, 32 VISes, 1 publication
XMRV, 29 VISes, 2 publications
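As a quick consistency check on the figures above, the per-virus counts can be summed. In this sketch the last row is keyed as XMRV, which is an assumption based on the nine viruses named earlier; the numbers themselves are copied from the list.

```python
# (VIS count, publication count) per virus, copied from the list above.
# The "XMRV" key for the last row is an assumption (ninth virus named in the text).
counts = {
    "HBV": (20558, 45),
    "HPV": (5118, 31),
    "EBV": (1144, 7),
    "MCV": (55, 9),
    "AAV2": (24, 3),
    "HIV": (16797, 10),
    "HTLV-1": (33845, 4),
    "MLV": (32, 1),
    "XMRV": (29, 2),
}

total_vis = sum(vis for vis, _ in counts.values())
total_publication_rows = sum(pubs for _, pubs in counts.values())

print(total_vis)               # 77602
print(total_publication_rows)  # 112
```

The VIS counts add up to the stated 77,602, while the per-virus publication counts add up to 112 rather than 108, which is consistent with some articles covering more than one virus.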
Figure 1 shows an overview of VISDB. Publications reporting VISes, the VISes themselves and VIS-related data such as virus sequences,
genes, miRNAs, fragile sites and other annotations are curated and stored in VISDB.
We first extract VISes from literature downloaded from public databases such as PubMed and ScienceDirect.
Target genes, nearby genes, fragile sites and miRNAs are provided in some publications, but in other cases
we can only get part of the information from the original publication. However, the missing information can be curated
if the publication provides the exact integration position, including genome assembly, chromosome and location on the chromosome.
Therefore, we discard studies that do not contain exact
integration position information.
Moreover, because we have collected a large number of VISes in VISDB, discarding a small number of VISes will not affect
the integrity of VIS coverage. Oncogenes and tumor suppressor genes are
further screened out from target genes and nearby genes, and the associations between genes and miRNA are
also curated. Finally, functions such as browse, search, curation, gene feature, microRNA feature,
download, statistics, help, and feedback are provided for users.
VIS data model
We build a universal virus integration event model to harmonize virus integration sites
across original publications and datasets. The model covers detection, analysis and archival
of virus integration sites (VISes), and allows junction sequences to represent complicated cases of viral integration such as
rearrangements, mutations, reverse inserts and microhomology junctions. As shown in Figure 2,
our model consists of four kinds of data: basic information, virus sequence,
target sequence and junction sequence. The basic information category includes
the objects or tools used to detect or describe a VIS in the original article, such as the samples and experimental assays
used and the source publication. In addition, we add
identifying attributes such as a status attribute, which shows whether the VIS is validated by
an experimental assay, and a completeness attribute, which evaluates the integrity of the VIS record.
For the virus sequence category, we extract and record information about the reference
genome to which the virus sequence was originally aligned, including the genome assembly code,
the genome name, the FASTA file of the viral genome and the hyperlink to the genome in the NCBI Nucleotide
database. We also record the integrated sequences and metadata about each sequence, such as start
breakpoint, stop breakpoint and viral gene name. If the integrated sequence is not provided
by the original publication, we curate it when the reference genome and breakpoints are given.
When two breakpoints are given, we extract the sequence between the start and end breakpoints.
When only one breakpoint is given, we extract two 100 bp sequences, one upstream and one downstream
of the VIS, and concatenate them into one sequence separated by the substring "| |".
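The two extraction cases can be illustrated with a minimal sketch. The function name, the 1-based inclusive coordinates, and the convention that a single breakpoint lies just after the given base are assumptions made for illustration; only the 100 bp flank length and the "| |" separator come from the text.

```python
from typing import Optional


def integrated_sequence(genome: str, start: int,
                        end: Optional[int] = None, flank: int = 100) -> str:
    """Return the curated viral sequence for one VIS.

    `genome` is the viral reference sequence as a plain string.
    Positions are assumed to be 1-based and inclusive.
    """
    if end is not None:
        # Two breakpoints: take the sequence between them (inclusive).
        low, high = sorted((start, end))
        return genome[low - 1:high]
    # One breakpoint: take `flank` bases on each side and join with "| |".
    upstream = genome[max(0, start - flank):start]
    downstream = genome[start:start + flank]
    return upstream + "| |" + downstream


# Toy 8-base "genome" with a short flank so the result is easy to read.
print(integrated_sequence("ACGTACGT", 3, flank=2))  # CG| |TA
```

Near the ends of the genome the flanks are simply shorter than 100 bp; a real pipeline might instead need to handle circular viral genomes, which this sketch does not attempt.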
The integration position on the human reference genome is critical for curating a VIS.
In the best case, we can extract the chromosome, cytoband, chromosomal location and genome assembly
from the original article. The host sequences into which the virus integrated are also extracted and deposited in the database.
For a VIS with a single location, we record the sequences both upstream and downstream of the insertion site.
If both start and end positions are provided, we curate the sequence upstream of the start position,
the sequence downstream of the end position, and the sequence between the start and end positions,
according to the genome assembly declared in the article.
The junction sequence category plays a significant role in analyzing integration patterns.
Although we aim to store the FASTA file of the junction sequence together with its annotation, coverage
and mapped reads for each VIS, junction sequences are available for fewer than 5 percent of VISes.
Furthermore, we mark all endpoints belonging to the human or virus sequence and map
them to specific positions in the reference genome so that the integration event can be
visualized, as shown in Figure 3.
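The four data categories described in this section could be represented roughly as follows. This is a sketch, not the actual VISDB schema: all class and field names are hypothetical, only a few fields per category are shown, and computing completeness as the fraction of filled fields is an assumption about how such an attribute might work.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class BasicInfo:
    pubmed_id: Optional[str] = None
    sample: Optional[str] = None
    assay: Optional[str] = None
    validated: Optional[bool] = None  # the "status" attribute


@dataclass
class VirusSequence:
    assembly: Optional[str] = None
    start_breakpoint: Optional[int] = None
    stop_breakpoint: Optional[int] = None
    viral_gene: Optional[str] = None
    sequence: Optional[str] = None


@dataclass
class TargetSequence:
    assembly: Optional[str] = None
    chromosome: Optional[str] = None
    cytoband: Optional[str] = None
    start: Optional[int] = None
    end: Optional[int] = None


@dataclass
class JunctionSequence:
    sequence: Optional[str] = None
    coverage: Optional[float] = None
    mapped_reads: Optional[int] = None


@dataclass
class VIS:
    basic: BasicInfo = field(default_factory=BasicInfo)
    virus: VirusSequence = field(default_factory=VirusSequence)
    target: TargetSequence = field(default_factory=TargetSequence)
    junction: JunctionSequence = field(default_factory=JunctionSequence)

    def completeness(self) -> float:
        """Fraction of fields that are not null (an assumed definition)."""
        parts = (self.basic, self.virus, self.target, self.junction)
        values = [getattr(p, f.name) for p in parts for f in fields(p)]
        return sum(v is not None for v in values) / len(values)


vis = VIS(target=TargetSequence(assembly="hg38", chromosome="chr1", start=1000))
print(round(vis.completeness(), 3))  # 0.176 (3 of 17 fields filled)
```

Keeping every field optional reflects the point made above that many records lack some of this information, for example the junction sequence.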
We develop a visualization tool to display rich information about features of integration sites.
Virus integration into the human genome can follow many different patterns. In the simplest pattern, a
segment of the virus sequence breaks off and is inserted into the host genome with no other process
involved in the integration event. However, reverse inserts, rearrangements, microhomology and
mutations may occur during integration, making the integration event complex.
In this study, we consider a virus integrated into a human sequence to have the form “human
sequence” + “virus-mixed sequences” + “human sequence”. In other words, a junction sequence is composed
of a human sequence preceding the integration region, a mixed sequence containing virus sequences and unknown
sequences but no human sequence, and a human sequence following the integration region. Notably,
both an overlap between the human sequence and the virus sequence and an unknown sequence between the human
sequence and the virus sequence are allowed. However, no human sequence can exist within the mixed sequence;
otherwise, the integration event is divided into two events.
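The splitting rule can be sketched as follows. The segment representation (a list of labeled pieces), the function name, and the choice to let an intervening human segment serve as both the right flank of one event and the left flank of the next are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, str]  # (kind, sequence); kind is "human", "virus" or "unknown"


def split_events(segments: List[Segment]) -> List[Dict]:
    """Split an ordered list of segments into human + mixed + human events."""
    events = []
    left = None            # most recent human segment seen
    mixed: List[Segment] = []  # virus/unknown segments since `left`
    for kind, seq in segments:
        if kind == "human":
            # A human segment closes the current event if the mixed region
            # has a left flank and contains at least one virus segment.
            if left is not None and any(k == "virus" for k, _ in mixed):
                events.append({"left": left, "mixed": mixed, "right": seq})
            # It also becomes the left flank of the next possible event.
            left, mixed = seq, []
        elif kind in ("virus", "unknown"):
            mixed.append((kind, seq))
        else:
            raise ValueError(f"unexpected segment kind: {kind}")
    return events


segments = [
    ("human", "AAAA"), ("virus", "CCCC"), ("unknown", "NN"),
    ("human", "GGGG"), ("virus", "TTTT"), ("human", "AAGG"),
]
print(len(split_events(segments)))  # 2
```

Mixed segments that are not flanked by human sequence on both sides are dropped here, since they do not fit the three-part form described above.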
Steps of constructing this knowledgebase
VISes in our database are extracted from the scientific literature or downloaded from other VIS
databases. Because our data model is comprehensive in its VIS criteria, only a few
VISes meet our requirements completely. Therefore, we first fill the items with
values extracted from the original articles or downloaded from public databases, and then curate the
VISes to improve the integrity of VIS information. However, a null value is unavoidable for
VIS-related information that cannot be obtained using the methods described.
Collecting literature containing VISes
All scientific literature containing VISes was downloaded from PubMed, ScienceDirect, Google
Scholar and Wiley with the authorization of The University of Texas Health Science Center at
Houston. We searched these sources using different combinations of the following keywords: virus
integration, viral integration, integration, integration site, full name of virus, abbreviation of
virus, etc. The retrieved articles formed the initial literature set. For
example, articles related to HBV were retrieved with the following statements:
(( viral integration[Title/Abstract] or virus integration[Title/Abstract] ) AND
(HBV[Title/Abstract] or hepatitis B virus[Title/Abstract]))
(( integration site[Title/Abstract] ) AND (HBV[Title/Abstract] or hepatitis B
(( integration[Title] ) AND (HBV[Title/Abstract] or hepatitis B
In this way, we collected about 150 research articles for HBV, 23 papers for AAV2, 29 papers for
EBV, 21 papers for MCV, 90 papers for HPV, etc.
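Search statements of the first form shown above can be composed programmatically for each virus. The sketch below only builds the query string; the helper names are hypothetical, and the operators are written in upper case as a conservative choice.

```python
from typing import Iterable


def field_or(terms: Iterable[str], tag: str) -> str:
    """Join terms with OR, attaching the same field tag to each term."""
    return "(" + " OR ".join(f"{term}[{tag}]" for term in terms) + ")"


def build_query(integration_terms: Iterable[str], virus_terms: Iterable[str],
                integration_tag: str = "Title/Abstract",
                virus_tag: str = "Title/Abstract") -> str:
    """Combine integration keywords and virus names into one statement."""
    return ("(" + field_or(integration_terms, integration_tag)
            + " AND " + field_or(virus_terms, virus_tag) + ")")


query = build_query(["viral integration", "virus integration"],
                    ["HBV", "hepatitis B virus"])
print(query)
```

Passing a different pair of virus terms, such as an abbreviation and its full name, yields the corresponding statement for another virus.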
Filtering literature with better quality VISes and downloading supplementary files
For the initial literature set, we scanned the full text to check whether it contained valid
integration sites. If a paper contained VISes, the supplementary files of the original article were
downloaded for further extraction. If supplementary files for a given VIS were not provided,
we discarded the paper. Furthermore, with the advance of next-generation sequencing,
more VISes have been found, and VISes from recently published articles cover almost all of those
reported in literature published 20 or 30 years ago. Indeed, few studies published before 2004
reported specific locations in the host genome or breakpoints in the virus genome, and we had to discard them.
Thus, some articles published before 2004 are not included in VISDB. After this step,
we retained 108 papers.
Extracting VIS information from literature
VIS information was scattered across the scientific literature and public biological databases without any fixed format.
Therefore, we had to extract VISes from the literature manually. Where available, the following VIS information was
extracted from literature libraries, original papers or supplementary files:
Paper metadata, such as PubMed ID, title, journal, year of publication, authors and their
Samples used in detecting VIS, including sample name, sample type, sample size, donor's age
Human reference genome used to align the junction sequence
Virus genome used to detect the integrated virus sequence
Method used to detect VIS
Disease of a sample
Integration locations in human genome, including chromosome, loci, orientation, start position and
end position, etc
Integrated virus sequence, including start breakpoint, end breakpoint, orientation, etc. (for
VISes detected by NGS-related technology, coverage and read number are recorded as well)
Target genes or genes in the vicinity of VIS
Fragile sites in which the VIS is located
miRNA related to target genes
Unfortunately, it was laborious to obtain the aforementioned VIS information directly from the literature.
In most cases, we used the following tactics:
Search the whole article to extract the objective information such as human genome, virus
genome, experimental assay
Copy the text in literature with pdf format, save it to a text file and import it to
Use string functions provided by Excel (such as concat, find, left, right and len) to
extract or normalize data items
Use sort and replace functions provided by Excel to remove duplicates or group the rows to
accelerate data compilation.
Curating VISes with public biological database
After extracting VIS information from the literature, we curate VISes with public biological databases such
as NCBI GenBank, KEGG, ENCODE, Genecards, RID, ONGene, TSGene, HumCFS, miTarBase, miRNA, etc. The curation covers:
Validation of target genes
Validation of genes near VIS without targeting any gene
Annotation of VISes with oncogenes from the ONGene database
Annotation of VISes with tumor suppressor genes from the TSGene database
Extraction of upstream and downstream sequences of VISes
Extraction of integrated virus sequence and human sequence if two locations or breakpoints are
Extract fragile sites from HumCFS and correlate them with VISes
Extract miRNAs from miTarBase and miRNA and correlate them with VISes
Set links to GenBank, KEGG, ENCODE, Genecards, etc.
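The distinction between target genes and nearby genes in the steps above can be illustrated with a small sketch. The gene records, names and coordinates below are hypothetical placeholders; a real curation step would take gene coordinates from a public annotation source for the same genome assembly as the VIS.

```python
from typing import List, NamedTuple, Optional, Tuple


class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def annotate(chrom: str, pos: int,
             genes: List[Gene]) -> Optional[Tuple[str, str, int]]:
    """Return (symbol, relation, distance in bp) for one integration position.

    The relation is "target" when the position falls inside a gene and
    "nearby" for the closest gene on the same chromosome otherwise.
    """
    best = None
    for gene in genes:
        if gene.chrom != chrom:
            continue
        if gene.start <= pos <= gene.end:
            return (gene.symbol, "target", 0)
        dist = gene.start - pos if pos < gene.start else pos - gene.end
        if best is None or dist < best[2]:
            best = (gene.symbol, "nearby", dist)
    return best


# Hypothetical gene intervals, for illustration only.
genes = [Gene("GENE_A", "chr1", 1000, 2000), Gene("GENE_B", "chr1", 5000, 6000)]
print(annotate("chr1", 1500, genes))  # ('GENE_A', 'target', 0)
print(annotate("chr1", 4000, genes))  # ('GENE_B', 'nearby', 1000)
```

A linear scan is enough for a sketch; with genome-wide annotations an interval index would be the more usual choice.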
Revising VISes and computing auxiliary data to visualize VISes
A major challenge emerged once we finished collecting VISes for the 9 viruses: the same type of
VIS information was often reported inconsistently. For instance, some authors use
chr23 and chr24 to refer to the allosomes, while most papers use chrX and chrY. MLL4 was reported as a hotspot
oncogene resulting in hepatoma, but the official symbol of this gene in GenBank is KMT2B. Moreover, some
authors use one of its aliases as the gene name, such as RX2, MLL2, TRX2, WBP7, DYT28, MLL1B, WBP-7 and CXXC10.
Such inconsistencies cause inaccuracy in statistics. To reduce this risk, we perform the
following revisions and normalizations:
Normalize gene names to the official symbols in GenBank
Use GRCh38/hg38 as the reference genome to visualize the integration event for literature-curated
VISes (for imported VISes, we navigate to the source database)
State VISes detected by NGS-related technology (no need for experimental assay)
Use bp as the unit of measurement in calculating the distance of gene to VIS
Use millimeter as the unit of measurement for sample sizes
Give each virus a default code if no specific virus reference genome is provided, and this
code can’t link to any genome in Nucleotide database
Give all VISes without specific reference genome a default code that does not link to UCSC or
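The chromosome and gene-name normalizations described in this section might look like the sketch below. The lookup tables contain only the examples mentioned in the text and are not a complete alias list; the function names are hypothetical.

```python
# Alias tables built from the examples in the text; illustrative, not exhaustive.
CHROM_ALIASES = {"chr23": "chrX", "chr24": "chrY"}
GENE_ALIASES = {
    "MLL4": "KMT2B",
    "MLL2": "KMT2B",
    "TRX2": "KMT2B",
    "WBP7": "KMT2B",
}


def normalize_chrom(name: str) -> str:
    """Return a chromosome name in the chrN / chrX / chrY style."""
    name = name.strip()
    suffix = name[3:] if name.lower().startswith("chr") else name
    if suffix.upper() in ("X", "Y"):
        suffix = suffix.upper()
    chrom = "chr" + suffix
    return CHROM_ALIASES.get(chrom, chrom)


def normalize_gene(symbol: str) -> str:
    """Map a known alias to its official symbol; leave other symbols as-is."""
    symbol = symbol.strip()
    return GENE_ALIASES.get(symbol, symbol)


print(normalize_chrom("chr23"), normalize_gene("MLL4"))  # chrX KMT2B
```

Applying both functions before counting ensures that, for example, integrations reported under MLL4 and KMT2B are tallied as the same gene.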
Downloading VISes and literature
RID (Retrovirus Integration Database, https://rid.ncifcrf.gov/) is a relational database containing
information about retrovirus integration sites in host genomes and is sponsored by the HIV Dynamics
and Replication Program (HIV DRP), National Cancer Institute, NIH. It collects about 4 million VISes
from 18 papers on HIV, HTLV, MLV and ALV. The insertion position on the host chromosome, the target or nearest
genes and their distances are presented, and locations in the human genome are mapped to hg19.
However, the website does not list the reference virus genome or the details of samples and experimental assays,
let alone the sequence around the integration site or the junction sites. To provide
more details about those VISes, we download some VISes together with their source literature from RID and further
curate the VIS information that is not provided by RID.
HPVbase is an integrated viral resource and analysis platform for human papillomavirus-mediated
carcinomas. It contains 1257 entries, including methylation patterns and miRNA expression. We
download some VISes that are consistent with our scheme and curate them with more VIS information.
Replenishing VISes to match VISDB
After downloading VISes from RID and consulting the corresponding literature, we curated those VISes according to the original
papers and public biological databases:
Extract experimental assay from original paper and correlate it to VIS
Extract sample information from original paper and correlate it to corresponding VISes
Extract disease information related to the sample and correlate it to samples
Extract virus reference genome to which junction sequence is aligned
Supplement genes without gene ID in RID and normalize with official symbol.
We appreciate your feedback. Please send an email if you wish to make a request or a comment, or to report a bug.
Zhongming Zhao, PhD, MS
Chair Professor for Precision Health
Professor of Biomedical Informatics and Human Genetics
School of Biomedical Informatics and School of Public Health
Founding Director, Center for Precision Health
Director, UTHealth Cancer Genomics Core
The University of Texas Health Science Center at Houston (UTHealth)
Deyou Tang, PhD
School of Biomedical Informatics and School of Public Health
University of Texas Health Science Center at Houston
Citation of the Database:
To cite the VISDB website in a publication, please quote the following: Tang D, Li B, Xu T, Hu R, Tan D, Song X, Jia P, Zhao Z (2020) VISDB: a manually curated database of viral integration sites in the human genome. Nucleic Acids Research 48(D1):D633-D641 https://www.ncbi.nlm.nih.gov/pubmed/31598702