DeepFun

Tutorial


  • 1. Basic information
  • 2. Model construction and evaluation
  • 3. General workflow of DeepFun webserver
  • 4. Screen analysis
    • 4.1 Input
    • 4.2 Output
  • 5. Saturation analysis
    • 5.1 Input
    • 5.2 Output

1. Basic Information


DeepFun predicts the impact of genetic variants on a wide range of chromatin features using a deep convolutional neural network (CNN) model. DeepFun provides two major functions: "screen analysis" and "in silico saturated mutagenesis" analysis. The screen analysis predicts the impact of a list of query variants across 7879 chromatin features (see the full list here). For each variant, DeepFun takes the surrounding 1000 bp sequence context and predicts the accessibility or binding probability of the sequence carrying the reference allele and of the sequence carrying the alternative allele. Because the DeepFun model does not use any variant-specific information, but instead predicts the accessibility or binding affinity of a 1000 bp sequence context, it can predict the impact of variants that have never been observed before.
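
The construction of the two allele-specific sequences can be sketched as follows. This is a minimal illustration, not DeepFun's actual code: the function name `allele_windows` is hypothetical, the chromosome is assumed to be available as an in-memory string, and the variant is assumed to sit at the same relative offset as in the 200 bp saturation window (one base fewer upstream than downstream).

```python
def allele_windows(chrom_seq, pos, ref, alt, window=1000):
    """Return (ref_seq, alt_seq), each `window` bp long, around a 1-based SNV.

    Assumption (not stated in the tutorial): the variant is placed at 0-based
    index window // 2 - 1 of the window.
    """
    idx = pos - 1                        # convert 1-based position to 0-based
    start = idx - (window // 2 - 1)      # first base of the window
    end = idx + window // 2 + 1          # one past the last base of the window
    if start < 0 or end > len(chrom_seq):
        raise ValueError("window extends past the end of the chromosome")
    if chrom_seq[idx].upper() != ref.upper():
        raise ValueError("reference allele does not match the genome sequence")
    ref_seq = chrom_seq[start:end].upper()
    offset = idx - start                 # index of the variant inside the window
    alt_seq = ref_seq[:offset] + alt.upper() + ref_seq[offset + 1:]
    return ref_seq, alt_seq
```

Both sequences would then be scored by the model independently; the two outputs differ only at the variant position.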

The "in silico saturated mutagenesis" analysis dissects the impact of one specific variant and its surrounding 200 bp region on a user-specified chromatin feature. Specifically, DeepFun mutates every single base (from 99 bp upstream to 100 bp downstream of the variant) in the sequence carrying the reference allele and in the sequence carrying the alternative allele, predicts the accessibility or binding probability of each mutated sequence, and calculates the SAD score for each base substitution.

DeepFun Design

Note: Currently, input variants may be given in human genome version hg19/GRCh37 or hg38/GRCh38 (via LiftOver). The DeepFun model was trained on hg19/GRCh37. Thus, for input variants in hg38/GRCh38 coordinates, we first use LiftOver to convert the coordinates to hg19/GRCh37 and then run the job. Variants that cannot be mapped to hg19/GRCh37 coordinates are excluded from downstream analyses.

2. Model construction and evaluation


The DeepFun framework uses the expanded chromatin feature resources from the ENCODE Project Consortium and the Roadmap Epigenomics Consortium (May 06, 2019): DNase-seq (DNA accessibility profiles) and ChIP-seq (including histone mark and transcription factor (TF) binding profiles) (see the full list here). According to the functional category and completeness of the assays, we classified them into two tiers. Tier 1 assays include DNase-seq, 16 histone marks and the transcription factor CTCF, across 3451 experiments in total; Tier 2 assays include the remaining 305 transcription factors, across 4428 experiments in total.

For the datasets in both tiers, each downloaded narrow peak was converted into a 1000 bp genomic interval by extending 500 bp on each side of the peak midpoint. We then greedily merged peaks based on their distance to an adjacent peak, until no peaks overlapped by more than 200 bp. The center of a merged peak was determined as a weighted average of the midpoints of the merged peaks from the individual profiles. These peaks were regarded as potential epigenomic active sites. Next, we applied an extended version of the Basset model, with the default three convolutional layers and two fully connected hidden layers followed by a fully connected sigmoid transformation layer, to predict the accessibility or binding probability of each peak across the different chromatin features. From all epigenomic active sites, we randomly selected 80% for training, 10% for validation and 10% for testing. We used the area under the receiver operating characteristic curve (AUC) to evaluate performance on the validation and testing sets. To avoid overfitting, network training was stopped once the loss on the validation set had not decreased for 12 successive epochs of Bayesian optimization. By this measure, DeepFun achieved median AUCs of 0.933, 0.872 and 0.80 for DNase-seq, histone mark and transcription factor assays, respectively. A detailed description and evaluation of the DeepFun model is available in Pei et al. (2020).
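
The greedy peak merging described above can be illustrated with a short sketch for a single chromosome. It is not the code used to build DeepFun: the function name `merge_peaks` is hypothetical, and the weight of each peak is assumed to be the number of original peaks merged into it, since the tutorial does not say how the weighted average is computed.

```python
def merge_peaks(midpoints, width=1000, max_overlap=200):
    """Greedily merge fixed-width peaks on one chromosome.

    Returns a sorted list of (midpoint, weight) pairs, where weight is the
    number of original peaks merged into that peak (an assumption).
    """
    peaks = sorted((float(m), 1) for m in midpoints)
    while len(peaks) > 1:
        # Distance between each pair of adjacent midpoints.
        gaps = [peaks[i + 1][0] - peaks[i][0] for i in range(len(peaks) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)  # closest adjacent pair
        # Two intervals of equal width overlap by (width - distance) bp.
        if width - gaps[i] <= max_overlap:
            break  # no remaining pair overlaps by more than max_overlap
        (m1, w1), (m2, w2) = peaks[i], peaks[i + 1]
        merged = ((m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2)
        peaks[i:i + 2] = [merged]  # merged midpoint lies between m1 and m2
    return peaks
```

Each returned midpoint `m` corresponds to the 1000 bp interval from `m - 500` to `m + 500`.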

3. General workflow of DeepFun webserver


[Figure: general workflow of the DeepFun webserver]

4. Screen analysis: compute functional impact of query variants across chromatin features


DeepFun does not predict the functional impact of a variant directly; instead, it makes predictions for the two alleles separately. Specifically, for each variant, DeepFun takes the surrounding 1000 bp sequence context and predicts the activity (accessibility or binding) probability of the sequence carrying the reference allele and of the sequence carrying the alternative allele. To evaluate the impact of a variant, we implemented two measures of the difference between the two alleles, the SNP activity difference (SAD) and the relative log fold change of odds (log-odds) difference, both of which reflect the degree of functional impact of the input variant. Because the output layer of the DeepFun model is a fully connected sigmoid transformation layer, the predicted activity probability for both the reference and the alternative allele ranges from 0 to 1. Therefore, a positive SAD value indicates that the alternative allele increases the epigenetic signal compared with the reference allele, while a negative value indicates that it decreases the signal.
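
As a small worked illustration, both measures can be computed from the two predicted probabilities. This is a sketch under assumptions: the function names are hypothetical, the natural logarithm is used because the tutorial does not state the base, and the clipping constant `eps` is added here only to avoid dividing by zero.

```python
import math


def sad(ref_prob, alt_prob):
    """SNP activity difference: alternative minus reference probability."""
    return alt_prob - ref_prob


def log_odds_diff(ref_prob, alt_prob, eps=1e-7):
    """Difference of log-odds between the alternative and reference alleles.

    Probabilities are clipped to [eps, 1 - eps] (an assumption) so that the
    odds stay finite when the model outputs exactly 0 or 1.
    """
    def logit(p):
        p = min(max(p, eps), 1.0 - eps)
        return math.log(p / (1.0 - p))

    return logit(alt_prob) - logit(ref_prob)
```

For example, a reference probability of 0.2 and an alternative probability of 0.5 give a SAD of 0.3. The log-odds difference is more sensitive than SAD to changes near 0 or 1, which is one reason to report both.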

4.1 Input:

  1. Job Identifier: Required. The job identifier can be generated automatically or customized by the submitter. It is confidential and is used for job status monitoring and result retrieval (see the Results page).
  2. Genome version: DeepFun supports coordinates in human genome version hg19/GRCh37 (default; used for training) or hg38/GRCh38 (converted via LiftOver).
  3. Model/Panel: Model A integrates 3451 samples, including DNase-seq (1548), histone mark (1536) and transcription factor CTCF (367) profiles. Model B integrates 4428 samples, covering all remaining transcription factor binding profiles.
  4. Paste variant calls: The input box accepts either a simple 4-column format (chromosome, position, reference allele and alternative allele, separated by spaces or tabs, one variant per line) or the VCF format with 5 or more columns (chromosome, position, SNP ID, reference allele and alternative allele, followed by any other information). Each screen analysis job allows a maximum of 3000 variants. DeepFun only accepts variants located on chromosomes 1 to 22 and X; variants on chromosome Y or the mitochondrial genome are rejected because of a configuration limitation of the original Basset model.
  5. Upload a file: In the screen analysis, the user can instead upload a file containing the variants in the required format.
  6. Operation buttons: After entering all the necessary information, the user should first verify the inputs. If verification passes, the submit button is activated for job submission.
  7. Email: Users can provide an email address to receive notifications of job status (Submitted/Finished).
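
The input rules in item 4 can be checked locally before submission with a sketch like the one below. It is not the webserver's validator: the function name `parse_variants` is hypothetical, and accepting a `chr` prefix and skipping `#` header lines are assumptions made here for convenience.

```python
ALLOWED_CHROMS = {str(i) for i in range(1, 23)} | {"X"}
MAX_VARIANTS = 3000  # limit for one screen analysis job


def parse_variants(text, max_variants=MAX_VARIANTS):
    """Split pasted variant lines into (accepted, rejected).

    Accepted entries are (chrom, pos, ref, alt) tuples; rejected entries are
    the original lines that do not meet the format or chromosome rules.
    """
    accepted, rejected = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # assumption: skip VCF headers
            continue
        fields = line.split()                  # splits on spaces or tabs
        if len(fields) == 4:                   # simple 4-column format
            chrom, pos, ref, alt = fields
        elif len(fields) >= 5:                 # VCF-like: CHROM POS ID REF ALT ...
            chrom, pos, _snp_id, ref, alt = fields[:5]
        else:
            rejected.append(line)
            continue
        name = chrom[3:] if chrom.lower().startswith("chr") else chrom
        if name.upper() not in ALLOWED_CHROMS or not pos.isdigit():
            rejected.append(line)
            continue
        accepted.append((name.upper(), int(pos), ref.upper(), alt.upper()))
    if len(accepted) > max_variants:
        raise ValueError(
            f"{len(accepted)} variants exceed the limit of {max_variants}"
        )
    return accepted, rejected
```

For the saturation analysis, the same check could be reused with `max_variants=1`.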

4.2 Output
A download link will be available after the job is finished. All output files are zipped into one downloadable file.

  1. Ref: Predicted chromatin feature activity probabilities for sequences carrying the reference allele.
  2. Alt: Predicted chromatin feature activity probabilities for sequences carrying the alternative allele.
  3. SAD: Predicted change in chromatin feature activity probability between the sequence carrying the reference allele and the sequence carrying the alternative allele, defined as SAD:

    SAD = Alt - Ref

  4. log_odds_diff: Predicted log fold change in the odds of chromatin feature activity between the sequence carrying the reference allele and the sequence carrying the alternative allele, defined as the log-odds difference:

    log_odds_diff = log(Alt / (1 - Alt)) - log(Ref / (1 - Ref))

  5. SAD_summary: A summary of SAD over all chromatin features, together with the most strongly associated chromatin features (ranked by SAD value).
  6. log_odds_diff_summary: A summary of the log-odds difference over all chromatin features, together with the most strongly associated chromatin features (ranked by log-odds difference value).

In addition, we present heat maps to better demonstrate the tissue and cell type specificity of variant impact within the same assay. When the number of input variants exceeds 50, only the top 50 variants with the highest SAD scores are displayed.
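
The top-50 selection for the heat maps could look like the sketch below. The tutorial does not say how one SAD score per variant is obtained from the many chromatin features, so ranking each variant by its largest SAD value is an assumption, and the function name `top_variants` is hypothetical.

```python
def top_variants(sad_by_variant, limit=50):
    """Return the variant IDs to display in a heat map.

    sad_by_variant maps a variant ID to its list of SAD values across the
    chromatin features of one assay. Variants are ranked by their largest
    SAD value (an assumption); inputs at or below the limit are kept as is.
    """
    if len(sad_by_variant) <= limit:
        return list(sad_by_variant)
    ranked = sorted(
        sad_by_variant,
        key=lambda variant: max(sad_by_variant[variant]),
        reverse=True,
    )
    return ranked[:limit]
```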

5. In silico saturated mutagenesis: in-depth analysis of the sequence context.


DeepFun performs "in silico saturated mutagenesis" analysis to discover potentially informative sequence features. Specifically, it mutates every single base in the input sequences carrying the reference allele and the alternative allele, and calculates the resulting changes in sequence activity probability. Note that in silico saturated mutagenesis is time-consuming, so DeepFun currently accepts only one variant per job. Users first need to specify the target chromatin features of interest, which can be determined from the screen analysis. Typically, the top 1 to 10 most informative features are of interest for in silico saturated mutagenesis.
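
The enumeration of single-base substitutions can be sketched as a small generator. The function name `saturation_mutants` is hypothetical, and the sketch only produces the mutated sequences; scoring them with the model is outside its scope.

```python
BASES = "ACGT"


def saturation_mutants(seq):
    """Yield (index, new_base, mutated_seq) for every single-base substitution.

    For a 200 bp sequence of A/C/G/T this yields 200 * 3 = 600 mutants.
    """
    seq = seq.upper()
    for i, old in enumerate(seq):
        for new in BASES:
            if new != old:
                yield i, new, seq[:i] + new + seq[i + 1:]
```

Running this once on the reference-allele sequence and once on the alternative-allele sequence gives the two sets of mutants that the analysis compares, which also shows why the analysis is limited to one variant per job: each chromatin feature requires on the order of 1200 predictions.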

5.1 Input:

  1. Job Identifier: Required. The job identifier can be generated automatically or customized by the submitter. It is confidential and is used for job status monitoring and result retrieval (see the Results page).
  2. Genome version: DeepFun supports coordinates in human genome version hg19/GRCh37 (default; used for training) or hg38/GRCh38 (converted via LiftOver).
  3. Model/Panel: Model A integrates 3451 samples, including DNase-seq (1548), histone mark (1536) and transcription factor CTCF (367) profiles. Model B integrates 4428 samples, covering all remaining transcription factor binding profiles.
  4. Paste variant calls: The input box accepts either a simple 4-column format (chromosome, position, reference allele and alternative allele, separated by spaces or tabs, one variant per line) or the VCF format with 5 or more columns (chromosome, position, SNP ID, reference allele and alternative allele, followed by any other information). Each in silico saturated mutagenesis job allows only one variant. DeepFun only accepts variants located on chromosomes 1 to 22 and X; variants on chromosome Y or the mitochondrial genome are rejected because of a configuration limitation of the original Basset model.
  5. Email: Users can provide an email address to receive notifications of job status (Submitted/Finished).
  6. Operation buttons: After entering all the necessary information, the user should first verify the inputs. If verification passes, the submit button is activated for job submission.
  7. Selection of user-specified profiles.
    • Experiment target: (a) DNase-seq, histone mark and transcription factor CTCF profiles;
      (b) Transcription factor profiles (excluding CTCF).
    • Tissue/Cell type: Available tissues or cell types for the selected DNase-seq, histone mark or transcription factor.
    • Accession: Replicates of chromatin features (see the full profile list here).

5.2 Output
For each chromatin feature, the in silico saturated mutagenesis analysis provides two heat maps showing the change patterns of sequence activity probabilities from 99 bp upstream to 100 bp downstream of the variant, one for the sequence carrying the reference allele and one for the sequence carrying the alternative allele. In addition, the maximum gain and maximum loss SAD values for each base position are provided in the results.
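
The per-position gain and loss values can be derived from the substitution scores as in the sketch below. The function name `gain_loss` and the input layout are hypothetical; the gain is taken as the largest SAD and the loss as the smallest SAD among the substitutions at a position, which is one plausible reading of the output description.

```python
def gain_loss(sad_scores):
    """Summarize saturation mutagenesis scores per position.

    sad_scores maps (position, new_base) to the SAD of that substitution.
    Returns {position: (max_gain, max_loss)}, where max_gain is the largest
    and max_loss the smallest SAD among the substitutions at that position.
    """
    by_position = {}
    for (position, _base), score in sad_scores.items():
        by_position.setdefault(position, []).append(score)
    return {
        position: (max(values), min(values))
        for position, values in sorted(by_position.items())
    }
```

Positions where both values are close to zero tolerate any substitution, while a strongly negative loss marks a base that the predicted activity depends on.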

Copyright © 2009-Present - The University of Texas Health Science Center at Houston (UTHealth)