iAnnotateSV package =================== Required Packages ================= We require that you install: :pandas: `v0.16.2 `_ :biopython: `v1.65 `_ :Pillow: `v3.4.2 `_ :reportlab: `v3.3.0 `_ :coloredlogs: `v5.2 `_ Quick Usage =========== If you know python I have created a small test script in /iAnnotateSV/test directory it runs a test on existing code and compares the result with the output file. Else To Run: * If you want to run with default options: ``python iAnnotateSV.py -i svFile.txt -ofp outputfilePrefix -o /path/to/output/dir -r hg19 -d 3000`` * If you want to run with your own transcripts: ``python iAnnotateSV.py -i svFile.txt -ofp outputfilePrefix -o /path/to/output/dir -r hg19 -d 3000 -c canonicalTranscripts.txt`` * If you want to run with your own transcripts & make plots: ``python iAnnotateSV.py -i svFile.txt -ofp outputfilePrefix -o /path/to/output/dir -r hg19 -d 3000 -c canonicalTranscripts.txt -u uniprot.txt -p`` .. code-block:: bash usage: iAnnotateSV.py [options] **Annotate SV based on a specific human reference** **optional arguments:** -h, --help show this help message and exit -v, --verbose make lots of noise [default] -r hg19, --refFileVersion hg19 Which human reference file to be used, hg18,hg19 or hg38 -ofp test, --outputFilePrefix test Prefix for the output file -o /somedir, --outputDir /somedir Full Path to the output dir -i svfile.txt, --svFile svfile.txt Location of the structural variants file to annotate -d 3000, --distance 3000 Distance used to extend the promoter region -a, --autoSelect Auto Select which transcript to be used[default] -c canonicalExons.txt, --canonicalTranscripts canonicalExons.txt Location of canonical transcript list for each gene. Use only if you want the output for specific transcripts for each gene. -p, --plotSV Plot the structural variant in question[default] -u uniprot.txt, --uniprotFile uniprot.txt Location of UniProt list contain information for protein domains. Use only if you want to plot the structural variant -rr RepeatRegionFile.tsv, --repeatFile RepeatRegionFile.tsv Location of the Repeat Region Bed File -dgv DGvFile.tsv, --dgvFile DGvFile.tsv Location of the Database of Genomic Variants Bed File -cc CosmicConsensus.tsv, --cosmicConsensusFile CosmicConsensus.tsv Location of the Cosmic Consensus TSV file -cct CosmicFusionCounts.tsv, --cosmicCountsFile CosmicConsensus.tsv Location of the Cosmic Fusion Counts TSV file Input file format is a tab-delimited file containing: chr1 pos1 str1 chr2 pos2 str2 as the header and where: * **chr1:** Its the chromosome name for first break point [1,2,3,4,5,6,7 etc..], * **pos1:** Its the chromosome loaction for first break point [1-based], * **str1:** Its the read direction for the first break point [0=top/plus/reference, 1=bottom/minus/complement], * **chr2:** Its the chromosome name for second break point [1,2,3,4,5,6,7 etc..], * **pos2:** Its the chromosome loaction for second break point [1-based], * **str2:** Its the read direction for the second break point [0=top/plus/reference, 1=bottom/minus/complement], Output file will is a tab-delimited file containing: chr1 pos1 str1 chr2 pos2 str2 gene1 transcript1 site1 gene2 transcript2 site2 fusion as the header and where: * **chr1** : Its the chromosome name for first break point [1,2,3,4,5,6,7 etc..], * **pos1** : Its the chromosome loaction for first break point [1-based], * **str1** : Its the read direction for the first break point [0=top/plus/reference, 1=bottom/minus/complement], * **chr2** : Its the chromosome name for second break point [1,2,3,4,5,6,7 etc..], * **pos2** : Its the chromosome loaction for second break point [1-based], * **str2** : Its the read direction for the second break point [0=top/plus/reference, 1=bottom/minus/complement], * **gene1** : Gene for the first break point, * **transcript1** : Transcript used for the first breakpoint, * **site1** : Explanation of the site where the first breakpoint occurs [Example=>Intron of EWSR1(+):126bp after exon 10], * **kinasedomain1** : Explanation of the site where the first breapoint involves a Kinase Domain or not[Example=>Partial Kinase Domain Included] * **gene2** : Gene for the second break point, * **transcript2** : Transcript used for the second breakpoint, * **site2** : Explanation of the site where the second breakpoint occurs [Example=>Intron of ERG(-):393bp after exon 4], * **kinasedomain2** : Explanation of the site where the second breapoint involves a Kinase Domain or not[Example=>Partial Kinase Domain Included] * **fusion** : Explanation if the evnet leads to fusion or not. [Example=>Protein Fusion: in frame {EWSR1:ERG}] * **Cosmic_Fusion_Counts** : Number of Counts for the Events from Cosmic Fusion Results * **repName-repClass-repFamily:-site1** : Repeat Region Annotation for site 1 * **repName-repClass-repFamily:-site2** : Repeat Region Annotation for site 2 * **CC_Chr_Band** : Cosmic Cancer Census Chromosome Band * **CC_Tumour_Types(Somatic)** : Cosmic Cancer Census Tumor Type in Somatic Samples * **CC_Cancer_Syndrome** : Cosmic Cancer Census Cancer Syndrome the genes are related to. * **CC_Mutation_Type** : Cosmic Cancer Census Mutation Types the Genes are related to. * **CC_Translocation_Partner** : Cosmic Cancer Census Translocation Partners for the gene. * **DGv_Name-DGv_VarType-site1** : Database of Genomic Variants annotation for site 1 * **DGv_Name-DGv_VarType-site** : Database of Genomic Variants annotation for site 2 :Example Plot: .. image:: ../images/EWSR1-chr22_29688289_ERG-chr21_39775034_Translocation.jpg :height: 300px :width: 300px :scale: 100 % :alt: Image of EWSR1-ERG Fusion :align: center The above plot shows the following: * There are three tracks for each break point. The first three tracks belong to breakpoint 1 and second three tracks belong to breakpoint 2. * Thre three tracks are: * Gene Model Track: * Displays **Exons** as **brown** and there direction with exons in arrow formation. * In **orange** it displays the **breakpoint description**. * Alignment Track: * Displays the direction of the reads for the breakpoint. Also displays the **co-ordinate** in **Purple.** * Read in **Positive** direction are **Blue** and **Negative** direction are **Red** * Protein Domain Track: * Displays the **Protein Domain** as **green boxes** with there names in green. Output file name for plot is Gene1-Chromosome1_Position1_Gene2-Chromosome2_Position2_EventType.jpg All the Outputs are written into a folder called **iAnnotateSVplots** in the given output directory Please look at examples of input and output files in /data/test directory where: /data/test/testData.txt is the input file /data/test/testResult.txt is the output file The refFileVersion are automaticslly chosen from /data/references. **But caution this is only tested on hg19**. All these files are created using UCSC table browser. The example for canonical transcripts can be also found in /data/canonicalInfo. In general the file is tab-delimited containing: Gene Transcripts as the headers where: * **Gene** : Gene symobol should match the gene name from /data/references file. * **Transcripts** : Transcripts is a particular transcript that you are interested in using instead of auto-selection. The file for hg19 uniprot is created using UCSC table browser (Uniprot spAnnot track). The file for hg19 is in /data/UcscUniprotdomainInfo Module ``iAnnotateSV`` contents ------------------------------- .. automodule:: iAnnotateSV :members: :undoc-members: :show-inheritance: Submodules ---------- ``AnnotateEachBreakpoint`` module --------------------------------- .. automodule:: iAnnotateSV.AnnotateEachBreakpoint :members: FindATranscript, FindAllTranscript :undoc-members: :show-inheritance: - This module will annotate each breakpoint taking in: * **chr** : chromosome for the event, * **pos** : position in the chromosome for the event, * **str** : direction of the reads for the event[either 0 or 1], * **referenceDataframe** : a pandas data-frame that will store reference information :Example: ``AnnotateEachBreakpoint(chr1,pos1,str1,refDF)`` ``FindATranscript`` module -------------------------- .. automodule:: iAnnotateSV.FindTranscrpit.FindATranscript :members: :undoc-members: :show-inheritance: - This module will automatically find the highest preference transcript based on input: * **queryDF** : Its a dataframe with * **c** = zone: 1=exon, 2=intron, 3=3'-UTR, 4=5'-UTR, 5=promoter * **d,e** = for exons: which one, and how far * **d1,d2,e1,e2** = for introns: between which exons and how far? * **f** = for introns: how many bases in the partially completed codon?, * **referenceDataframe** : a pandas data-frame that will store reference information :Example: ``FindATranscript(queryDF,refDF)`` ``FindAllTranscripts`` module ----------------------------- .. automodule:: iAnnotateSV.FindTranscrpit.FindAllTranscripts :members: :undoc-members: :show-inheritance: - This module will find all transcripts based on input: * **queryDF** : Its a dataframe with * **c** = zone: 1=exon, 2=intron, 3=3'-UTR, 4=5'-UTR, 5=promoter * **d,e** = for exons: which one, and how far * **d1,d2,e1,e2** = for introns: between which exons and how far? * **f** = for introns: how many bases in the partially completed codon?, * **referenceDataframe** : a pandas data-frame that will store reference information :Example: ``FindAllTranscripts(queryDF,refDF)`` ``FindCanonicalTranscript`` module ---------------------------------- .. automodule:: iAnnotateSV.FindCanonicalTranscript :members: :undoc-members: :show-inheritance: - This module will Finad a canonical transcript based on the input for main function and output of FindAllTranscripts: * **geneList** : List of genes [this will normally be list with same names] for the Structural Variant in question, * **transcriptList** : List of transcripts for the Structural Variant in question, * **siteList** : direction of the site for the event[either 0 or 1], * **zoneList** : different zones in which the event can occur [zone: 1=exon, 2=intron, 3=3'-UTR, 4=5'-UTR, 5=promoter] * **strandList** : direction of the read for the event[either 0 or 1], * **intronnumList** : Which intron the event occurs if the event is in intron for each transcript, * **intronframeList** : What is the frame of the intron where the event is occuring for each transcript. * **ctDict** : a dictionary containing the canonical transcript information for each gene. [Gene=>Transcript] :Example: ``FindCT(geneList,transcriptList,siteList,zoneList,strandList,intronnumList,intronframeList,ctDict)`` ``PredictFunction`` module -------------------------- .. automodule:: iAnnotateSV.PredictFunction :members: :undoc-members: :show-inheritance: - This module will predict the function of each annotated breakpoint - It takes two pandas series which has following information: * **gene** : Gene for the event, * **transcript** : Transcript used for the event, * **site** : Explanation for site where the event occurs, * **zone** : Where does the event occur [ 1=exon, 2=intron, 3=3'-UTR, 4=5'-UTR, 5=promoter ], * **strand** : Direction of the transcript, * **str** : Direction of the read, * **intronnum** : Which intron the event occurs if the event is in intron, * **intronframe** : What is the frame of the intron where the event is occuring. :Example: ``ann1S = pandas.Series([gene1,transcript1,site1,zone1,strand1,str1,intronnum1,intronframe1],index=['gene1', 'transcript1', 'site1', 'zone1', 'txstrand1', 'readstrand1', 'intronnum1','intronframe1'])`` ``ann2S = pandas.Series([gene2,transcript2,site2,zone2,strand2,str2,intronnum2,intronframe2],index=['gene2', 'transcript2', 'site2', 'zone2', 'txstrand2', 'readstrand2', 'intronnum2','intronframe2'])`` So **ann1S** & **ann2S** are series that will go to PredictFuntionForSV() ``PredictFunctionForSV(ann1S,ann2S)`` ``AddExternalAnnotations`` module --------------------------------- .. automodule:: iAnnotateSV.AddExternalAnnotations :members: ReadSVFile :undoc-members: :show-inheritance: - This module will annotate each breakpoint for Repeat Region, Database of Genomic Variants and Cosmic Census taking in: * **repeat region file** : Repeat Track from UCSC in tab-delimited format (see: ``/data/repeat_region/hg19_repeatRegion.tsv``), * **data base of genimic variant file** : DGv in tab-delimited format (see: ``/data/database_of_genomic_variants/hg19_DGv_Annotation.tsv``), * **cosmic census file** : cosmic census file from sanger (see: ``/data/cosmic/cancer_gene_census.tsv``), * **cosmic fusion counts file** : cosmic fusion counts file from from cosmic fusion export (see: ``/data/cosmic/cosmic_fusion_counts.tsv``), * **structural variants file** : File containing the breakpoint information, * **output prefix** : Output Prefix for the output files (.xlsx,.json,.txt), * **output directory** : Directory where the output needs to be written :Example: ``makeCommandLineForAEA = "-r " + repeatregionFilePath + " -d " + dgvFilePath + " -c " + ccFilePath + " -cct " + cctFilePath + " -s " + svFilePath + " -ofp AnnotatedSV" + " -o " + outputDir`` ``AddExternalAnnotations.main(makeCommandLineForAEA)`` ``AnnotateForRepeatRegion`` module ---------------------------------- .. automodule:: iAnnotateSV.AnnotateForRepeatRegion :members: ReadRepeatFile,AnnotateRepeatRegion :undoc-members: :show-inheritance: - This module has two submodules will read and annotate each breakpoint for Repeat Region 1. **ReadRepeatFile** * This will read a tab-delimited file into a panadas dataframe 2. **AnnotateRepeatRegion** * This is will annotate the breakpoints for repeat region. :Example: ``AnnotateRepeatRegion(verbose, count, svObject, repeatregionDict)`` ``AnnotateForDGv`` module ------------------------- .. automodule:: iAnnotateSV.AnnotateForDGv :members: ReadDGvFile,AnnotateDGv :undoc-members: :show-inheritance: - This module has two submodules will read and annotate each breakpoint for Database of Genomic Variants 1. **ReadDGv** * This will read a tab-delimited file into a panadas dataframe 2. **AnnotateDGv** * This is will annotate the breakpoints for Database of Genomic Variants. :Example: ``AnnotateDGv(verbose, count, svObject, dgvDict)`` ``AnnotateForCosmic`` module ---------------------------- .. automodule:: iAnnotateSV.AnnotateForCosmic :members: AnnotateFromCosmicCensusFile,AnnotateFromComicFusionCountsFile :undoc-members: :show-inheritance: - This module will annotate each breakpoint for Cosmic Census :Example: ``AnnotateFromCosmicCensusFile(comic_census_filename, verbose, count, svObject)`` :Example: ``AnnotateFromComicFusionCountsFile(comic_fusion_counts_filename, verbose, count, svObject)`` ``helper`` module ----------------- .. automodule:: iAnnotateSV.helper :members: ReadFile,ExtendPromoterRegion,bp2str :undoc-members: :show-inheritance: - This module has multiple submodules 1. **ReadFile()** * This will read a tab-delimited file into a panadas dataframe 2. **ExtendPromoterRegion()** * This will extend the promoter region to a given length 3. **bp2str()** * This will convert base pair information to string information ``VisualizeSV`` module ---------------------- .. automodule:: iAnnotateSV.VisualizeSV :members: :undoc-members: :show-inheritance: - This module will annotate each breakpoint taking in: * **svDataFrame** : Annotated structurla varaints dataframe obtained from PredictFuntion Module, * **referenceDataFrame** : a pandas data-frame that will store reference information, * **uniprotDataFrame** : making a dataframe from the uniprot data. * **args** : This has all the arguments that are generated from iAnnotateSV module :Example: ``VisualizeSV(svDataFrame,referenceDataFrame,uniprotDataFrame,args)`` :Example Plot: .. image:: ../images/EWSR1-chr22_29688289_ERG-chr21_39775034_Translocation.jpg :height: 300px :width: 300px :scale: 100 % :alt: Image of EWSR1-ERG Fusion :align: center ``iAnnotateSV`` module ---------------------- .. automodule:: iAnnotateSV.iAnnotateSV :members: processSV :undoc-members: :show-inheritance: - This module is the driver module, it takes user information and runs all other module to produce proper structural variant annotation .. code-block:: bash usage: iAnnotateSV.py [options]** Annotate SV based on a specific human reference** optional arguments:** -h, --help show this help message and exit -v, --verbose make lots of noise [default] -r hg19, --refFileVersion hg19 Which human reference file to be used, hg18,hg19 or hg38 -ofp test, --outputFilePrefix test Prefix for the output file -o /somedir, --outputDir /somedir Full Path to the output dir -i svfile.txt, --svFile svfile.txt Location of the structural variants file to annotate -d 3000, --distance 3000 Distance used to extend the promoter region -a, --autoSelect Auto Select which transcript to be used[default] -c canonicalExons.txt, --canonicalTranscripts canonicalExons.txt Location of canonical transcript list for each gene. Use only if you want the output for specific transcripts for each gene. -p, --plotSV Plot the structural variant in question[default] -u uniprot.txt, --uniprotFile uniprot.txt Location of UniProt list contain information for protein domains. Use only if you want to plot the structural variant -rr RepeatRegionFile.tsv, --repeatFile RepeatRegionFile.tsv Location of the Repeat Region Bed File -dgv DGvFile.tsv, --dgvFile DGvFile.tsv Location of the Database of Genomic Variants Bed File -cc CosmicConsensus.tsv, --cosmicConsensusFile CosmicConsensus.tsv Location of the Cosmic Consensus TSV file -cct CosmicFusionCounts.tsv, --cosmicCountsFile CosmicConsensus.tsv Location of the Cosmic Fusion Counts TSV file