iAnnotateSV package

Required Packages

We require that you install:

pandas:v0.16.2
biopython:v1.65
Pillow:v3.4.2
reportlab:v3.3.0
coloredlogs:v5.2

Quick Usage

If you know python I have created a small test script in /iAnnotateSV/test directory it runs a test on existing code and compares the result with the output file.

Else To Run:
  • If you want to run with default options:

python iAnnotateSV.py -i svFile.txt -ofp outputfilePrefix -o /path/to/output/dir -r hg19 -d 3000

  • If you want to run with your own transcripts:

python iAnnotateSV.py -i svFile.txt -ofp outputfilePrefix -o /path/to/output/dir -r hg19 -d 3000 -c canonicalTranscripts.txt

  • If you want to run with your own transcripts & make plots:

python iAnnotateSV.py -i svFile.txt -ofp outputfilePrefix -o /path/to/output/dir -r hg19 -d 3000 -c canonicalTranscripts.txt -u uniprot.txt -p

usage: iAnnotateSV.py [options]

**Annotate SV based on a specific human reference**

**optional arguments:**
    -h, --help            show this help message and exit
    -v, --verbose         make lots of noise [default]
    -r hg19, --refFileVersion hg19
                        Which human reference file to be used, hg18,hg19 or
                        hg38
    -ofp test, --outputFilePrefix test
                        Prefix for the output file
    -o /somedir, --outputDir /somedir
                        Full Path to the output dir
    -i svfile.txt, --svFile svfile.txt
                        Location of the structural variants file to annotate
    -d 3000, --distance 3000
                        Distance used to extend the promoter region
    -a, --autoSelect      Auto Select which transcript to be used[default]
    -c canonicalExons.txt, --canonicalTranscripts canonicalExons.txt
                        Location of canonical transcript list for each gene.
                        Use only if you want the output for specific
                        transcripts for each gene.
    -p, --plotSV          Plot the structural variant in question[default]
    -u uniprot.txt, --uniprotFile uniprot.txt
                        Location of UniProt list contain information for
                        protein domains. Use only if you want to plot the
                        structural variant
    -rr RepeatRegionFile.tsv, --repeatFile RepeatRegionFile.tsv
                        Location of the Repeat Region Bed File
    -dgv DGvFile.tsv, --dgvFile DGvFile.tsv
                        Location of the Database of Genomic Variants Bed File
    -cc CosmicConsensus.tsv, --cosmicConsensusFile CosmicConsensus.tsv
                        Location of the Cosmic Consensus TSV file
    -cct CosmicFusionCounts.tsv, --cosmicCountsFile CosmicConsensus.tsv
                        Location of the Cosmic Fusion Counts TSV file

Input file format is a tab-delimited file containing:

chr1 pos1 str1 chr2 pos2 str2

as the header and where:

  • chr1: Its the chromosome name for first break point [1,2,3,4,5,6,7 etc..],
  • pos1: Its the chromosome loaction for first break point [1-based],
  • str1: Its the read direction for the first break point [0=top/plus/reference, 1=bottom/minus/complement],
  • chr2: Its the chromosome name for second break point [1,2,3,4,5,6,7 etc..],
  • pos2: Its the chromosome loaction for second break point [1-based],
  • str2: Its the read direction for the second break point [0=top/plus/reference, 1=bottom/minus/complement],

Output file will is a tab-delimited file containing:

chr1 pos1 str1 chr2 pos2 str2 gene1 transcript1 site1 gene2 transcript2 site2 fusion

as the header and where:

  • chr1 : Its the chromosome name for first break point [1,2,3,4,5,6,7 etc..],
  • pos1 : Its the chromosome loaction for first break point [1-based],
  • str1 : Its the read direction for the first break point [0=top/plus/reference, 1=bottom/minus/complement],
  • chr2 : Its the chromosome name for second break point [1,2,3,4,5,6,7 etc..],
  • pos2 : Its the chromosome loaction for second break point [1-based],
  • str2 : Its the read direction for the second break point [0=top/plus/reference, 1=bottom/minus/complement],
  • gene1 : Gene for the first break point,
  • transcript1 : Transcript used for the first breakpoint,
  • site1 : Explanation of the site where the first breakpoint occurs [Example=>Intron of EWSR1(+):126bp after exon 10],
  • kinasedomain1 : Explanation of the site where the first breapoint involves a Kinase Domain or not[Example=>Partial Kinase Domain Included]
  • gene2 : Gene for the second break point,
  • transcript2 : Transcript used for the second breakpoint,
  • site2 : Explanation of the site where the second breakpoint occurs [Example=>Intron of ERG(-):393bp after exon 4],
  • kinasedomain2 : Explanation of the site where the second breapoint involves a Kinase Domain or not[Example=>Partial Kinase Domain Included]
  • fusion : Explanation if the evnet leads to fusion or not. [Example=>Protein Fusion: in frame {EWSR1:ERG}]
  • Cosmic_Fusion_Counts : Number of Counts for the Events from Cosmic Fusion Results
  • repName-repClass-repFamily:-site1 : Repeat Region Annotation for site 1
  • repName-repClass-repFamily:-site2 : Repeat Region Annotation for site 2
  • CC_Chr_Band : Cosmic Cancer Census Chromosome Band
  • CC_Tumour_Types(Somatic) : Cosmic Cancer Census Tumor Type in Somatic Samples
  • CC_Cancer_Syndrome : Cosmic Cancer Census Cancer Syndrome the genes are related to.
  • CC_Mutation_Type : Cosmic Cancer Census Mutation Types the Genes are related to.
  • CC_Translocation_Partner : Cosmic Cancer Census Translocation Partners for the gene.
  • DGv_Name-DGv_VarType-site1 : Database of Genomic Variants annotation for site 1
  • DGv_Name-DGv_VarType-site : Database of Genomic Variants annotation for site 2
Example Plot:Image of EWSR1-ERG Fusion

The above plot shows the following:

  • There are three tracks for each break point. The first three tracks belong to breakpoint 1 and second three tracks belong to breakpoint 2.
  • Thre three tracks are:
    • Gene Model Track:
      • Displays Exons as brown and there direction with exons in arrow formation.
      • In orange it displays the breakpoint description.
    • Alignment Track:
      • Displays the direction of the reads for the breakpoint. Also displays the co-ordinate in Purple.
      • Read in Positive direction are Blue and Negative direction are Red
    • Protein Domain Track:
      • Displays the Protein Domain as green boxes with there names in green.

Output file name for plot is Gene1-Chromosome1_Position1_Gene2-Chromosome2_Position2_EventType.jpg All the Outputs are written into a folder called iAnnotateSVplots in the given output directory

Please look at examples of input and output files in /data/test directory where: /data/test/testData.txt is the input file /data/test/testResult.txt is the output file

The refFileVersion are automaticslly chosen from /data/references. But caution this is only tested on hg19. All these files are created using UCSC table browser.

The example for canonical transcripts can be also found in /data/canonicalInfo. In general the file is tab-delimited containing:

Gene Transcripts

as the headers where:

  • Gene : Gene symobol should match the gene name from /data/references file.
  • Transcripts : Transcripts is a particular transcript that you are interested in using instead of auto-selection.

The file for hg19 uniprot is created using UCSC table browser (Uniprot spAnnot track). The file for hg19 is in /data/UcscUniprotdomainInfo

Module iAnnotateSV contents

Submodules

AnnotateEachBreakpoint module

  • This module will annotate each breakpoint taking in:
    • chr : chromosome for the event,
    • pos : position in the chromosome for the event,
    • str : direction of the reads for the event[either 0 or 1],
    • referenceDataframe : a pandas data-frame that will store reference information
    Example:AnnotateEachBreakpoint(chr1,pos1,str1,refDF)

FindATranscript module

  • This module will automatically find the highest preference transcript based on input:
    • queryDF : Its a dataframe with
      • c = zone: 1=exon, 2=intron, 3=3’-UTR, 4=5’-UTR, 5=promoter
      • d,e = for exons: which one, and how far
      • d1,d2,e1,e2 = for introns: between which exons and how far?
      • f = for introns: how many bases in the partially completed codon?,
    • referenceDataframe : a pandas data-frame that will store reference information
    Example:FindATranscript(queryDF,refDF)

FindAllTranscripts module

  • This module will find all transcripts based on input:
    • queryDF : Its a dataframe with
      • c = zone: 1=exon, 2=intron, 3=3’-UTR, 4=5’-UTR, 5=promoter
      • d,e = for exons: which one, and how far
      • d1,d2,e1,e2 = for introns: between which exons and how far?
      • f = for introns: how many bases in the partially completed codon?,
    • referenceDataframe : a pandas data-frame that will store reference information
    Example:FindAllTranscripts(queryDF,refDF)

FindCanonicalTranscript module

  • This module will Finad a canonical transcript based on the input for main function and output of FindAllTranscripts:
    • geneList : List of genes [this will normally be list with same names] for the Structural Variant in question,
    • transcriptList : List of transcripts for the Structural Variant in question,
    • siteList : direction of the site for the event[either 0 or 1],
    • zoneList : different zones in which the event can occur [zone: 1=exon, 2=intron, 3=3’-UTR, 4=5’-UTR, 5=promoter]
    • strandList : direction of the read for the event[either 0 or 1],
    • intronnumList : Which intron the event occurs if the event is in intron for each transcript,
    • intronframeList : What is the frame of the intron where the event is occuring for each transcript.
    • ctDict : a dictionary containing the canonical transcript information for each gene. [Gene=>Transcript]
    Example:FindCT(geneList,transcriptList,siteList,zoneList,strandList,intronnumList,intronframeList,ctDict)

PredictFunction module

  • This module will predict the function of each annotated breakpoint

  • It takes two pandas series which has following information:
    • gene : Gene for the event,
    • transcript : Transcript used for the event,
    • site : Explanation for site where the event occurs,
    • zone : Where does the event occur [ 1=exon, 2=intron, 3=3’-UTR, 4=5’-UTR, 5=promoter ],
    • strand : Direction of the transcript,
    • str : Direction of the read,
    • intronnum : Which intron the event occurs if the event is in intron,
    • intronframe : What is the frame of the intron where the event is occuring.
    Example:

    ann1S = pandas.Series([gene1,transcript1,site1,zone1,strand1,str1,intronnum1,intronframe1],index=['gene1', 'transcript1', 'site1', 'zone1', 'txstrand1', 'readstrand1', 'intronnum1','intronframe1'])

    ann2S = pandas.Series([gene2,transcript2,site2,zone2,strand2,str2,intronnum2,intronframe2],index=['gene2', 'transcript2', 'site2', 'zone2', 'txstrand2', 'readstrand2', 'intronnum2','intronframe2'])

    So ann1S & ann2S are series that will go to PredictFuntionForSV()

    PredictFunctionForSV(ann1S,ann2S)

AddExternalAnnotations module

  • This module will annotate each breakpoint for Repeat Region, Database of Genomic Variants and Cosmic Census taking in:
    • repeat region file : Repeat Track from UCSC in tab-delimited format (see: /data/repeat_region/hg19_repeatRegion.tsv),
    • data base of genimic variant file : DGv in tab-delimited format (see: /data/database_of_genomic_variants/hg19_DGv_Annotation.tsv),
    • cosmic census file : cosmic census file from sanger (see: /data/cosmic/cancer_gene_census.tsv),
    • cosmic fusion counts file : cosmic fusion counts file from from cosmic fusion export (see: /data/cosmic/cosmic_fusion_counts.tsv),
    • structural variants file : File containing the breakpoint information,
    • output prefix : Output Prefix for the output files (.xlsx,.json,.txt),
    • output directory : Directory where the output needs to be written
    Example:

    makeCommandLineForAEA = "-r " + repeatregionFilePath + " -d " + dgvFilePath + " -c " + ccFilePath + " -cct " + cctFilePath + " -s " + svFilePath + " -ofp AnnotatedSV" + " -o " + outputDir

    AddExternalAnnotations.main(makeCommandLineForAEA)

AnnotateForRepeatRegion module

  • This module has two submodules will read and annotate each breakpoint for Repeat Region
    1. ReadRepeatFile

      • This will read a tab-delimited file into a panadas dataframe
    2. AnnotateRepeatRegion

      • This is will annotate the breakpoints for repeat region.
      Example:AnnotateRepeatRegion(verbose, count, svObject, repeatregionDict)

AnnotateForDGv module

  • This module has two submodules will read and annotate each breakpoint for Database of Genomic Variants
    1. ReadDGv

      • This will read a tab-delimited file into a panadas dataframe
    2. AnnotateDGv

      • This is will annotate the breakpoints for Database of Genomic Variants.
      Example:AnnotateDGv(verbose, count, svObject, dgvDict)

AnnotateForCosmic module

  • This module will annotate each breakpoint for Cosmic Census

    Example:AnnotateFromCosmicCensusFile(comic_census_filename, verbose, count, svObject)
    Example:AnnotateFromComicFusionCountsFile(comic_fusion_counts_filename, verbose, count, svObject)

helper module

  • This module has multiple submodules
    1. ReadFile()
      • This will read a tab-delimited file into a panadas dataframe
    2. ExtendPromoterRegion()
      • This will extend the promoter region to a given length
    3. bp2str()
      • This will convert base pair information to string information

VisualizeSV module

  • This module will annotate each breakpoint taking in:
    • svDataFrame : Annotated structurla varaints dataframe obtained from PredictFuntion Module,
    • referenceDataFrame : a pandas data-frame that will store reference information,
    • uniprotDataFrame : making a dataframe from the uniprot data.
    • args : This has all the arguments that are generated from iAnnotateSV module
    Example:

    VisualizeSV(svDataFrame,referenceDataFrame,uniprotDataFrame,args)

    Example Plot:Image of EWSR1-ERG Fusion

iAnnotateSV module

  • This module is the driver module, it takes user information and runs all other module to produce proper structural variant annotation
usage: iAnnotateSV.py [options]**

Annotate SV based on a specific human reference**

optional arguments:**
-h, --help            show this help message and exit
-v, --verbose         make lots of noise [default]
-r hg19, --refFileVersion hg19
                        Which human reference file to be used, hg18,hg19 or
                        hg38
-ofp test, --outputFilePrefix test
                        Prefix for the output file
-o /somedir, --outputDir /somedir
                        Full Path to the output dir
-i svfile.txt, --svFile svfile.txt
                        Location of the structural variants file to annotate
-d 3000, --distance 3000
                        Distance used to extend the promoter region
-a, --autoSelect      Auto Select which transcript to be used[default]
-c canonicalExons.txt, --canonicalTranscripts canonicalExons.txt
                        Location of canonical transcript list for each gene.
                        Use only if you want the output for specific
                        transcripts for each gene.
-p, --plotSV          Plot the structural variant in question[default]
-u uniprot.txt, --uniprotFile uniprot.txt
                        Location of UniProt list contain information for
                        protein domains. Use only if you want to plot the
                        structural variant
-rr RepeatRegionFile.tsv, --repeatFile RepeatRegionFile.tsv
                        Location of the Repeat Region Bed File
-dgv DGvFile.tsv, --dgvFile DGvFile.tsv
                        Location of the Database of Genomic Variants Bed File
-cc CosmicConsensus.tsv, --cosmicConsensusFile CosmicConsensus.tsv
                        Location of the Cosmic Consensus TSV file
-cct CosmicFusionCounts.tsv, --cosmicCountsFile CosmicConsensus.tsv
                        Location of the Cosmic Fusion Counts TSV file