FusionHunter: identifying fusion transcripts using paired-end RNA-seq
Author and Code License
- Yang Li
- The GNU General Public License v3.0
- 32-bit or 64-bit GNU/Linux
- Perl v5.10.1 or higher
- GCC 4.0 or higher with the Standard C++ Library
- GNU make
- version 1.4, 2012/06/09
- FusionHunter can now handle RNA-seq data from non-human species.
- IUM reads are now also mapped onto the gene model before being considered as fusion transcript candidates.
- version 1.3, 2011/05/20
- Bowtie is used to align full and segment reads
- Download FusionHunter source code
- Download FusionHunter annotation package for hg18 / hg19
- Extract the archive, e.g. tar -xzvf FusionHunter.tar.gz
- Under FusionHunter directory, type: ./install.sh
Note from authors
- FusionHunter uses the Perl package Parallel::ForkManager, whose installation code is included in the FusionHunter package. If you have root privileges, install.sh installs Parallel::ForkManager for you; otherwise, ask someone with root privileges on your server for help.
- FusionHunter uses Bowtie as its alignment tool. The 64-bit bowtie and bowtie-build executables are in FusionHunter/bin. If your machine is 32-bit, copy the 'bowtie' and 'bowtie-build' executables from FusionHunter/src/bowtie-0.12.7-linux-i386/ into FusionHunter/bin, replacing the 64-bit originals.
- FusionHunter utilizes some Kent Source Code from Jim Kent, UCSC
Inputs (highlighted are mandatory)
Left (first) part of the paired-end reads (FASTQ format). Should be named as XXXX/1.
Right (second) part of the paired-end reads (FASTQ format). Should be named as XXXX/2.
The directory containing the reference genome in FASTA format. References for various species can be downloaded from the UCSC Genome Browser. After downloading the .fa files for the chromosomes, merge them into one file, e.g. for assembly hg18: cat chr*.fa > hg18.fa.
Only 'major' chromosomes should be included in the reference genome. Undetermined scaffolds must be excluded, since they might lead to spurious fusion outputs.
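One way to build the merged reference while skipping undetermined scaffolds is sketched below. It assumes UCSC-style file names in which scaffold files contain an underscore (e.g. chr1_random.fa) and major chromosomes do not; the function name merge_major_chroms is illustrative and not part of FusionHunter.

```shell
#!/bin/sh
# Sketch: concatenate only "major" chromosome FASTA files into one reference.
# Assumption: scaffold files have an underscore in their name
# (e.g. chr1_random.fa), while major chromosomes (chr1.fa, chrX.fa) do not.
merge_major_chroms() {
    src_dir=$1   # directory holding the per-chromosome chr*.fa files
    out_fa=$2    # merged output file, e.g. hg18.fa
    : > "$out_fa"
    for fa in "$src_dir"/chr*.fa; do
        [ -e "$fa" ] || continue
        case "$(basename "$fa")" in
            *_*) continue ;;   # skip files that look like scaffolds
        esac
        cat "$fa" >> "$out_fa"
    done
}
```

For example, merge_major_chroms ./hg18_chroms hg18.fa would write the merged file to hg18.fa. Check the resulting sequence names (grep '^>' hg18.fa) before building the index, since naming conventions differ between assemblies.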
The path and base name of the Bowtie index, with NO '/' at the end. For example, the hg18 Bowtie index consists of hg18.1.ebwt, hg18.2.ebwt, hg18.3.ebwt, hg18.4.ebwt, hg18.rev.1.ebwt and hg18.rev.2.ebwt, so its base name is 'hg18' and BowtieIdx = DirtoBowtieIndex/hg18.
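The base-name convention can be checked before a run with a small helper like the one below. This is a sketch only: check_bowtie_index is a hypothetical name, and it tests for exactly the six .ebwt files listed above.

```shell
#!/bin/sh
# Sketch: verify that a Bowtie index base name (e.g. /data/index/hg18)
# has all six .ebwt files. Prints the cleaned base name on success.
check_bowtie_index() {
    base=${1%/}   # drop a trailing '/', which BowtieIdx must not have
    for suffix in 1 2 3 4 rev.1 rev.2; do
        if [ ! -f "$base.$suffix.ebwt" ]; then
            echo "missing: $base.$suffix.ebwt" >&2
            return 1
        fi
    done
    echo "$base"
}
```

If the index does not exist yet, bowtie-build hg18.fa hg18 (bowtie-build ships in FusionHunter/bin) creates files with the base name 'hg18' in the current directory.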
Path and file name of the gene annotation list; we suggest the UCSC annotation. We provide the hg18 Gene_annotation in our annotation package (file name hg18.ucscKnownGene). For species other than human, users can download a gene annotation file from the UCSC Table Browser, with the first 10 columns in GenePred table format and the last column holding the gene name.
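A quick sanity check on a downloaded annotation file might look like the sketch below. It assumes the file is tab-separated and that a valid line has at least 11 fields (10 GenePred columns plus the gene name); lines starting with '#' are treated as headers. The function name is illustrative.

```shell
#!/bin/sh
# Sketch: count annotation lines with fewer than 11 tab-separated fields.
# Assumption: 10 GenePred columns + 1 gene-name column = at least 11 fields.
count_short_annotation_lines() {
    awk -F '\t' '
        /^#/    { next }     # skip header or comment lines
        NF < 11 { bad++ }    # too few columns for GenePred + gene name
        END     { print bad + 0 }
    ' "$1"
}
```

A result of 0 means every line has enough columns; it does not confirm that the columns hold the right values.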
Path and file name of the repeat region annotation. We provide the Repeats annotation for hg18 in our annotation package (file name hg18.repeats). OPTIONAL for other species. Leave blank if not available for your data.
Path and file name of the self-alignment regions. We provide the SelfAlign annotation for hg18 in our annotation package (file name hg18.chain.pairs). OPTIONAL for other species. Leave blank if not available for your data.
Path and file name of the human EST database. We provide the EST annotation for hg18 in our annotation package (file name hg18.SpliceEST). OPTIONAL for other species. Leave blank if not available for your data.
Output of fusion by FusionHunter. Default is FusionHunter.fusion
Output of readthrough by FusionHunter. Default is FusionHunter.readthrough
BASIC OPTIONS (highlighted are mandatory)
Set 1 if running on hg18/hg19; otherwise 0.
Number of cores used for Bowtie alignment. Since Bowtie alignment is the most time-consuming step in FusionHunter, we suggest using as many cores as possible.
Size of the partial reads. It should not be longer than half of the full read length, and we strongly suggest using exactly half, e.g. 25 if the RNA-seq reads are 50 bp.
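The suggested half-length value can be derived from the data itself, as in this sketch. It assumes an uncompressed FASTQ file in which all reads have the same length, so only the first record is inspected; half_read_length is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print half the length of the first read in a FASTQ file.
# Assumption: all reads share one length, so the first record is enough.
half_read_length() {
    # Line 2 of a FASTQ record is the sequence; int() rounds down for odd lengths
    awk 'NR == 2 { print int(length($0) / 2); exit }' "$1"
}
```

For 50 bp reads this prints 25, matching the example above.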
Min number of paired-end reads that support a fusion (encompassing a fusion junction). Default is 2.
Min number of junction spanning reads to support a fusion. Default is 1.
PLEASE NOTE: if you set MINSPAN = 1, then to reduce false positives any candidate junction supported by only 1 spanning read is discarded UNLESS the fusion junction point lies exactly on an annotated exon boundary. This step is embedded in FusionHunter.
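The rule above can be written out as a small decision function. This is only an illustration of the stated logic, not FusionHunter's actual code; the function name and the 1/0 boundary flag are hypothetical.

```shell
#!/bin/sh
# Sketch of the MINSPAN rule described above (not FusionHunter's own code).
# Args: number of spanning reads, MINSPAN, and 1 or 0 for whether the
# junction lies exactly on an annotated exon boundary (hypothetical flag).
# Returns 0 to keep the candidate junction, 1 to discard it.
keep_junction() {
    span=$1
    minspan=$2
    on_boundary=$3
    [ "$span" -ge "$minspan" ] || return 1   # below the MINSPAN threshold
    if [ "$span" -eq 1 ] && [ "$on_boundary" -ne 1 ]; then
        return 1                             # single read, not on an exon boundary
    fi
    return 0
}
```

With MINSPAN = 1, keep_junction 1 1 0 discards the candidate, while keep_junction 1 1 1 and keep_junction 2 1 0 keep it.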
If set to 1, repetitive regions in reference genome will be filtered out when performing gapped alignment. If set to 0, no filtering would be done on reference genome. Default is 1.
Size of exact match for each junction flanking tile. Default is 4.
Min size of the maximum base coverage on either side of the junction. Default is 8.
Max allowed repeat proportion of a read (used in reduceBwt). Default is 0.6.
Number of chains to overlap with a read (used in reduceBwt). Default is 20.
Max allowed repeat proportion of a read (more stringent, used in leftRightOvlp). Default is 0.2.
Max allowed alignment proportion between a pair of reads (used in leftRightOvlp). Default is 0.2.
Distance to self-chain boundary (used in postLeftRightOvlp). Default is 200000.
Proportion of overlap with a region (used in regionPairs). Default is 0.8.
- Li Y, Chien J, Smith DI, and Ma J. FusionHunter: identifying fusion transcripts in cancer using paired-end RNA-seq. Bioinformatics, 27(12):1708-10, 2011.
Specificity and Sensitivity
To increase specificity (get more reliable results), set the variables PAIRNUM, MINSPAN, TILE and MINOVLP larger; to increase sensitivity (get more results), set them smaller.
How to run FusionHunter
- Copy FusionHunter.cfg to your working directory
- Configure variables in FusionHunter.cfg. Variables in INPUT section must be changed.
- Under your working directory, type: DirToFusionHunter/bin/FusionHunter.pl FusionHunter.cfg
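Before launching a run, it can help to confirm that the read files named in the configuration actually exist. The sketch below assumes the configuration file uses one "KEY = value" entry per line, as in the L = ... and R = ... example in this document; cfg_get and check_inputs are hypothetical helpers, not part of FusionHunter.

```shell
#!/bin/sh
# Sketch: read values from a "KEY = value" style configuration file.
# Assumption: one entry per line, key at the start of the line.
cfg_get() {
    key=$1
    cfg=$2
    # Strip "KEY =" and surrounding spaces; report the first match only
    sed -n "s/^[[:space:]]*$key[[:space:]]*=[[:space:]]*//p" "$cfg" |
        sed 's/[[:space:]]*$//' | head -n 1
}

# Check that the left (L) and right (R) read files exist.
check_inputs() {
    cfg=$1
    for key in L R; do
        path=$(cfg_get "$key" "$cfg")
        if [ ! -f "$path" ]; then
            echo "$key: file not found: $path" >&2
            return 1
        fi
    done
}
```

A possible use is check_inputs FusionHunter.cfg && DirToFusionHunter/bin/FusionHunter.pl FusionHunter.cfg, so the run starts only when both read files are found.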
The sample data are paired-end reads from the K-562-4 sample (a test sample extracted from SRX006134 in the NCBI SRA).
- Install FusionHunter.
- Download the sample project and the sample output.
- Extract the archive, e.g. tar -xzvf FusionHunter_sample_data.tar.gz
- Copy FusionHunter.cfg to your working directory and configure it, e.g. set L = DirtoSampleData/K-562-4_BCR_ABL1_1.fastq and R = DirtoSampleData/K-562-4_BCR_ABL1_2.fastq
- Run FusionHunter; this takes about 5 minutes with CORE=10
- Compare the results in FusionHunter_output/FusionHunter.fusion with the sample output
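The final comparison can be done by eye or with a rough helper such as the sketch below. It assumes an order-insensitive, line-by-line comparison is good enough for a first check; the function name is illustrative, and the path of the sample output file depends on where you extracted it.

```shell
#!/bin/sh
# Sketch: compare two result files while ignoring line order.
# Returns 0 if both files contain the same set of lines, non-zero otherwise.
same_lines() {
    mine=$1       # e.g. FusionHunter_output/FusionHunter.fusion
    expected=$2   # the downloaded sample output file
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    sort "$mine" > "$a"
    sort "$expected" > "$b"
    cmp -s "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return $status
}
```

If the files differ, running diff on them shows which lines differ.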
Email: yangli9 AT illinois.edu