TrueSight: Self-training Algorithm for Splice Junction Detection using RNA-seq

Author and Code License


Source Code

Code includes:


-f/--fastq <string>
input file name. For paired-end reads, use -f fq_1 fq_2; if there are multiple files for fq_1/fq_2, they should be separated by commas. Names for paired-end reads can be either template_name/1 and template_name/2, or just template_name. All reads must have the same length, which must be >= 36bp. This argument is REQUIRED.
-r/--bowtie-index <string>
bowtie reference dir/name; the bowtie index should therefore be dir/name.*.ebwt, built using bowtie-build. The reference genome should also be located here, e.g. dir/name.fa. Chromosome lines in dir/name.fa must be clean, with NO space within the chromosome line, such as ">XXXXX\n". This argument is REQUIRED.
-p/--thread <int>
number of cores to use
-v/--mismatch <int>
number of mismatches (0-3); default is 2.
-i/--min-intron-length <int>
min intron length; default 20bp.
-I/--max-intron-length <int>
max intron length; default 200,000bp.
-s/--segment-length <int>
segment length (18-25bp). For reads with 36<=LENGTH<=50, segment length is forced to be int(LENGTH/2); for reads with LENGTH>50, the default segment length is 25. TrueSight is faster with a larger segment length, but is then more likely to fail to align reads that span more than one junction within a segment.
if set, only canonical splice junctions will be reported.
if set, TrueSight will report "template_name/1" and "template_name/2" for names of paired-end reads, instead of reporting "template_name" as default. Note that Cufflinks only handles alignment results with "template_name".
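
The segment-length rule described for -s/--segment-length can be written out as a small function. This is only a sketch of the rule as stated above, not TrueSight's own code; the function name effective_segment_length and its error handling are assumptions made for illustration.

```python
def effective_segment_length(read_length, requested=None):
    """Segment length implied by the -s/--segment-length description.

    Hypothetical helper: it mirrors the documented rule only and is not
    taken from the TrueSight source.
    """
    if read_length < 36:
        raise ValueError("reads must be at least 36bp")
    if read_length <= 50:
        # For 36 <= LENGTH <= 50 the value is forced to int(LENGTH/2),
        # whatever was requested on the command line.
        return read_length // 2
    if requested is None:
        # Documented default for LENGTH > 50.
        return 25
    if not 18 <= requested <= 25:
        raise ValueError("segment length must be within 18-25bp")
    return requested


if __name__ == "__main__":
    for length in (36, 45, 50, 75, 100):
        print(length, effective_segment_length(length))
```

Note that for reads of 36 to 50bp the forced value int(LENGTH/2) always falls inside the 18-25bp range.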

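The input requirements stated for -f/--fastq and -r/--bowtie-index (equal read lengths of at least 36bp, and no space in the chromosome lines of the reference FASTA) can be checked before a run. The sketch below is a hypothetical pre-flight check and not part of TrueSight; the function names are invented for illustration, and it assumes FASTQ records of exactly four lines each.

```python
def bad_fasta_headers(lines):
    """Return FASTA header lines whose name is empty or contains whitespace."""
    bad = []
    for line in lines:
        if line.startswith(">"):
            name = line[1:].rstrip("\r\n")
            if name == "" or any(ch.isspace() for ch in name):
                bad.append(line.rstrip("\r\n"))
    return bad


def fastq_read_lengths(lines):
    """Return the set of read lengths, assuming four lines per FASTQ record."""
    return {len(line.strip()) for i, line in enumerate(lines) if i % 4 == 1}


def reads_ok(lines, min_length=36):
    """True if every read has the same length and that length is >= min_length."""
    lengths = fastq_read_lengths(lines)
    return len(lengths) == 1 and min(lengths) >= min_length


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python check_inputs.py dir/name.fa reads.fq
    fasta_path, fastq_path = sys.argv[1], sys.argv[2]
    with open(fasta_path) as fa:
        for header in bad_fasta_headers(fa):
            print("unclean chromosome line:", header)
    with open(fastq_path) as fq:
        print("reads ok:", reads_ok(fq))
```

Both helpers accept any iterable of lines, so they work on open file handles as well as on in-memory lists.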

sorted alignment results in bam format
gapped alignment results in sam format
inferred splice junctions. Six columns represent chrom, junction start (first base on intron), junction end (first base on exon), if canonical (1 for canonical, 2 for semi-canonical and 0 for non-canonical), mapping numbers and TrueSight score (0-1).
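
The six-column junction output can be loaded and filtered by TrueSight score with a few lines of Python. This is a minimal sketch under stated assumptions: columns are whitespace-separated, there is no header line, the mapping number is an integer, and the 0.5 cut-off is an arbitrary illustrative threshold rather than a recommended value. The names Junction, parse_junction_line and filter_junctions are hypothetical.

```python
from collections import namedtuple

# Field names follow the column description above; they are not an
# official schema.
Junction = namedtuple(
    "Junction", "chrom start end canonical mapping_number score"
)

# Meaning of the fourth column as documented.
MOTIF_CLASS = {0: "non-canonical", 1: "canonical", 2: "semi-canonical"}


def parse_junction_line(line):
    """Parse one junction record, assuming six whitespace-separated columns."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 columns, got %d" % len(fields))
    chrom, start, end, canonical, mapping_number, score = fields
    return Junction(
        chrom, int(start), int(end), int(canonical),
        int(mapping_number), float(score),
    )


def filter_junctions(lines, min_score=0.5):
    """Keep junctions whose TrueSight score is at least min_score."""
    kept = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        junction = parse_junction_line(line)
        if junction.score >= min_score:
            kept.append(junction)
    return kept


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python filter_junctions.py <junction file>
    with open(sys.argv[1]) as handle:
        for j in filter_junctions(handle):
            print(j.chrom, j.start, j.end, MOTIF_CLASS[j.canonical], j.score)
```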


Simulation Datasets

Simulated evaluation datasets (50, 75, 100bp paired-end human RNA-seq).
Benchmark junctions from simulated datasets.
TrueSight output junctions.

Note from authors

TrueSight is not designed as a universal gapped alignment tool, such as BLAT. Instead, TrueSight takes all possible splice junctions of one transcriptome as a whole, and learns a regression model to find the best assignment for them. The behaviour of TrueSight is not guaranteed when the number of splice junctions is too small (e.g. <1,000, whereas a typical human RNA-seq dataset with 20M 75bp reads would have >100,000 splice junctions) or when exonic (fully aligned) reads are not included.

Running TrueSight

For single end datasets: .

For paired-end datasets: .

Bug report

Email: yangli9 AT


Yang Li and Jian Ma