TrueSight: Self-training Algorithm for Splice Junction Detection using RNA-seq
Author and Code License
- Yang Li
- The GNU General Public License v3.0
- 32-bit or 64-bit GNU/Linux
- Perl v5.10.1 or higher
- GCC 4.0+ with Standard C++ Library
- GNU make
- ~10 GB memory for ~30M RNA-seq reads
- samtools in default PATH
- version 0.06, Sep 15, 2012
- 'train' function in lr_trirls package to estimate logistic regression parameters
- bowtie to align full/segment reads
input file name. For paired-end reads, use -f fq_1 fq_2; if there are multiple files for fq_1/fq_2, they should be separated by commas. Names for paired-end reads CAN be template_name/1 and template_name/2, OR just template_name. All reads should have the same length, which must be >= 36bp. This argument is REQUIRED.
bowtie reference dir/name; the bowtie index should therefore be dir/name.*.ebwt, built using bowtie-build. The reference genome should also be located here, e.g. dir/name.fa. Each chromosome line in dir/name.fa should be clean, with NO space within the line, such as ">XXXXX\n". This argument is REQUIRED.
number of cores to use
number of mismatches (0-3); default is 2.
min intron length; default 20bp.
max intron length; default 200,000bp.
segment length (18-25bp). For reads with 36<=LENGTH<=50, segment length is forced to be int(LENGTH/2); for reads with LENGTH>50, the default segment length is 25. TrueSight is faster with a larger segment length, but it is then more likely to fail to align reads that span more than one junction within a segment.
if set, only canonical splice junctions will be reported.
if set, TrueSight will report "template_name/1" and "template_name/2" for names of paired-end reads, instead of reporting "template_name" as default. Note that cufflinks only handles alignment results with "template_name".
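The segment length rule given for the segment length option above can be sketched in Python as follows. This is only an illustration of the rule as stated in this README, not TrueSight's own code; the function name `segment_length` and its `requested` parameter are hypothetical.

```python
def segment_length(read_length, requested=None):
    """Return the segment length implied by the rule in this README.

    `requested` stands for a user-supplied segment length (hypothetical
    parameter name); None means "use the default".
    """
    if read_length < 36:
        raise ValueError("reads must be at least 36bp long")
    if read_length <= 50:
        # For 36 <= LENGTH <= 50 the value is forced to int(LENGTH/2),
        # so any requested value is ignored.
        return read_length // 2
    if requested is None:
        # Default for reads longer than 50bp.
        return 25
    if not 18 <= requested <= 25:
        raise ValueError("segment length must be within 18-25bp")
    return requested


for length in (36, 50, 75):
    print(length, segment_length(length))
```

For example, 36bp reads are split into 18bp segments, while 75bp reads use the default of 25bp unless another value in the 18-25bp range is requested.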
sorted alignment results in bam format
gapped alignment results in sam format
inferred splice junctions. The six columns are: chrom, junction start (first base of the intron), junction end (first base of the exon), canonical flag (1 for canonical, 2 for semi-canonical, 0 for non-canonical), mapping numbers, and TrueSight score (0-1).
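A minimal Python sketch for reading the six-column junction output is shown below. It assumes the columns are whitespace-separated, that there is no header line, and that the mapping number is an integer; none of these details are confirmed by this README. The sample lines are invented for illustration only, and the 0.5 score cutoff is an arbitrary example, not a recommended threshold.

```python
from typing import Iterable, Iterator, NamedTuple


class Junction(NamedTuple):
    chrom: str
    start: int      # first base of the intron
    end: int        # first base of the exon
    canonical: int  # 1 canonical, 2 semi-canonical, 0 non-canonical
    mappings: int   # "mapping numbers" column (assumed to be an integer)
    score: float    # TrueSight score, 0-1


def parse_junctions(lines: Iterable[str]) -> Iterator[Junction]:
    """Yield one Junction per six-column line; other lines are skipped."""
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue
        chrom, start, end, canonical, mappings, score = fields
        yield Junction(chrom, int(start), int(end),
                       int(canonical), int(mappings), float(score))


# Invented sample records, for illustration only.
SAMPLE = "chr1\t14830\t14970\t1\t12\t0.98\nchr1\t15039\t15796\t0\t2\t0.31\n"

junctions = list(parse_junctions(SAMPLE.splitlines()))
# Keep canonical junctions with a score of at least 0.5 (arbitrary cutoff).
kept = [j for j in junctions if j.canonical == 1 and j.score >= 0.5]
print(len(junctions), len(kept))
```

To read a real output file, pass an open file handle to `parse_junctions` in place of `SAMPLE.splitlines()`.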
- Li Y, Li-Byarlay H, Burns P, Borodovsky M, Robinson GE, and Ma J. TrueSight: a new algorithm for splice junction detection using RNA-seq. Nucleic Acids Research. doi: 10.1093/nar/gks1311. First published online: December 18, 2012
- Early version appeared in Proceedings of the 16th Annual International Conference on Research in Computational Molecular Biology (RECOMB), 2012.
Simulated evaluation datasets (50, 75, 100bp paired-end human RNA-seq).
TrueSight output junctions from simulated datasets.
Note from authors
TrueSight is not designed as a universal gapped alignment tool such as BLAT. Instead, TrueSight takes all possible splice junctions of one transcriptome as a whole and learns a regression model to find the best assignment for them. The behaviour of TrueSight is not guaranteed when the number of splice junctions is too small (e.g. <1,000, while typical human RNA-seq with 20M 75bp reads would have >100,000 splice junctions) or when exonic (fully aligned) reads are not included.
For single-end datasets: truesight_single.pl
For paired-end datasets: truesight_pair.pl
Email: yangli9 AT illinois.edu