Small RNA Transfrags from CSHL
Small RNA reads from Cold Spring Harbor Lab (CSHL) were assembled
into transfrags by merging overlapping reads. In order to minimize
ambiguity from reads that have the potential to map to multiple genomic
loci, only the uniquely mapping reads were used to generate transfrags.
The BED6+ format files are based on, but not generated directly by, the
"intervals-to-contigs" Galaxy tool written by Assaf Gordon (gordon@cshl.edu) in the Hannon lab at CSHL. Below is a description of the columns in this format, and how each column is calculated.
Output Columns
(Bed-style transfrag information)
- chromosome
- transfrag's start coordinate
- transfrag's end coordinate
- transfrag's name. The numeral in the name indicates the transfrag's abundance rank within this dataset.
- Score (0 to 1000), calculated as 1000*[# reads in transfrag]/[# reads in most abundant transfrag in this dataset]
- Strand (orientation, + or -)
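The score formula can be sketched in a few lines of Python. The function name is hypothetical, and rounding to the nearest integer is an assumption: it is consistent with the value 286 given for 1000*(4/14) in the worked example further down, but the source does not state the rounding rule.

```python
def transfrag_score(reads_in_transfrag: int, reads_in_top_transfrag: int) -> int:
    """Scale a transfrag's read count to the 0-1000 BED score range.

    Rounding to the nearest integer is an assumption, not a documented rule.
    """
    return round(1000 * reads_in_transfrag / reads_in_top_transfrag)


print(transfrag_score(14, 14))  # 1000 (the most abundant transfrag)
print(transfrag_score(4, 14))   # 286
```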
(Additional Sequences Information)
- transfrag's length (number of covered bases = end - start)
- number of unique sequences in this transfrag
- total reads count in this transfrag
- minimum sequence-count value
- maximum sequence-count value
- average sequence-count value
- first-quartile sequence-count value
- median sequence-count value
- third-quartile sequence-count value
(Additional Reads Information)
- minimum reads-count value
- maximum reads-count value
- average reads-count value
- first-quartile reads-count value
- median reads-count value
- third-quartile reads-count value
(Additional Intervals Information)
- number of regions in this transfrag (each region has its own values
for sequence-count and reads-count)
- starting coordinates of significant regions in this transfrag (see
example below)
- length (in bases) of each significant region
- sequence-count for each significant region
- reads-count for each significant region
- Integrated reads-count sum (inner product of columns 24 and 26, i.e. region lengths and region reads-counts)
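As a reading aid, here is a minimal Python sketch that splits one line of this format into named fields. The field names and the function are hypothetical labels chosen for this illustration, not names used by the original tool. The sketch assumes whitespace-separated columns and 27 columns per line, as in the example lines shown below.

```python
# Hypothetical field names, one per column, in column order (1..27).
FIELD_NAMES = [
    "chrom", "start", "end", "name", "score", "strand",
    "length", "unique_sequences", "total_reads",
    "seq_min", "seq_max", "seq_mean", "seq_q1", "seq_median", "seq_q3",
    "reads_min", "reads_max", "reads_mean", "reads_q1", "reads_median", "reads_q3",
    "region_count", "region_starts", "region_lengths",
    "region_seq_counts", "region_read_counts", "integrated_reads",
]
TEXT_FIELDS = {"chrom", "name", "strand"}
INT_FIELDS = {"start", "end", "score", "length", "unique_sequences",
              "total_reads", "region_count", "integrated_reads"}
LIST_FIELDS = {"region_starts", "region_lengths",
               "region_seq_counts", "region_read_counts"}


def parse_transfrag_line(line: str) -> dict:
    """Parse one whitespace-separated transfrag line into a dict."""
    values = line.split()
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} columns, got {len(values)}")
    record = {}
    for name, raw in zip(FIELD_NAMES, values):
        if name in TEXT_FIELDS:
            record[name] = raw
        elif name in INT_FIELDS:
            record[name] = int(raw)
        elif name in LIST_FIELDS:
            record[name] = [int(item) for item in raw.split(",")]
        else:
            # Summary statistics are parsed as floats to be safe.
            record[name] = float(raw)
    return record


line = ("chr1 170 225 transfrag-2 286 + 55 2 4 1 2 1.21818 1 1 1 1 4 2.38182 "
        "1 3 3 3 170,190,202 20,12,23 1,2,1 3,4,1 131")
record = parse_transfrag_line(line)
print(record["name"], record["region_lengths"], record["integrated_reads"])
```

A parsed record makes it easy to check a line for internal consistency, for instance that the region lengths sum to the transfrag length.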
Concrete Example
Assume the following intervals over an imaginary chromosome chr1:
chr1 100 132 4
chr1 110 142 3
chr1 130 160 7
chr1 170 201 3
chr1 190 225 1
Plotting these intervals:
These intervals cover two contiguous regions (marked in red in the plot): 100-160
and 170-225.
The output file will contain two lines (one for each transfrag):
chr1 100 160 transfrag-1 1000 + 60 3 14 1 3 1.6 1 2 2 4 14 7.35 7 7 7 5 100,110,130,133,143 10,20,3,10,17 1,2,3,2,1 4,7,14,10,7 441
chr1 170 225 transfrag-2 286 + 55 2 4 1 2 1.21818 1 1 1 1 4 2.38182 1 3 3 3 170,190,202 20,12,23 1,2,1 3,4,1 131
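The grouping of the five intervals into two transfrags can be sketched as a sort-and-merge pass. This is an illustration only, not the code of the "intervals-to-contigs" tool. The function name and dict keys are hypothetical, strand is ignored (the example has no orientation information), and treating intervals that merely touch as overlapping is an assumption.

```python
def merge_into_transfrags(intervals):
    """Merge overlapping (chrom, start, end, read_count) intervals."""
    transfrags = []
    for chrom, start, end, reads in sorted(intervals):
        last = transfrags[-1] if transfrags else None
        # Assumption: an interval starting at or before the current end overlaps.
        if last and last["chrom"] == chrom and start <= last["end"]:
            last["end"] = max(last["end"], end)
            last["sequences"] += 1   # one more distinct sequence in this transfrag
            last["reads"] += reads   # accumulate the total read count
        else:
            transfrags.append({"chrom": chrom, "start": start, "end": end,
                               "sequences": 1, "reads": reads})
    return transfrags


example = [
    ("chr1", 100, 132, 4),
    ("chr1", 110, 142, 3),
    ("chr1", 130, 160, 7),
    ("chr1", 170, 201, 3),
    ("chr1", 190, 225, 1),
]
for t in merge_into_transfrags(example):
    print(t["chrom"], t["start"], t["end"], t["sequences"], t["reads"])
# chr1 100 160 3 14
# chr1 170 225 2 4
```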
The rest of the explanation will focus on the first transfrag only:
(transfrag information):
- transfrag is on chromosome chr1 (column 1)
- transfrag starts at coordinate 100 (column 2)
- transfrag ends at coordinate 160 (column 3)
- transfrag's name is transfrag-1 (column 4). It is the most abundant transfrag in the sample, hence the rank of 1.
- transfrag has a score of 1000 (column 5) - it is the most abundant
transfrag in this dataset. The second transfrag would have a score of
1000*(4/14) = 286.
- transfrag has sense orientation (column 6) - assumed so
because no orientation information was found.
- transfrag covers 60 bases (column 7)
- transfrag has 3 sequences (column 8)
- transfrag has 14 reads (column 9)
(sequence-count information):
- minimum sequence-count is 1 (only one interval covers coordinates
100 to 110) (column 10)
- maximum sequence-count is 3 (three intervals are covering
coordinates 130 to 132) (column 11)
- average sequence-count value is 1.6 ( 60 bases are covered, with
coverage sum = 10x1 + 20x2 + 3x3 + 10x2 + 17x1 = 96. 96/60=1.6 ) (column
12)
- first-quartile sequence-count value is 1 ( There are 27 bases
covered with value 1, 30 bases covered with value 2, and three bases
covered with value 3) (column 13)
- median sequence-count value is 2 (column 14)
- third-quartile sequence-count value is 2 ( column 15 )
(reads-count information):
- minimum reads-count is 4 (coordinates 100 to 110 are covered by the
lowest number of reads = 4) (column 16)
- maximum reads-count is 14 (three intervals, whose reads-count sum is
14, are covering coordinates 130 to 132) (column 17)
- average reads-count value is 7.35 ( 60 bases are covered, with
coverage sum = 10x4 + 20x7 + 3x14 + 10x10 + 17x7 = 441. 441/60=7.35 )
(column 18)
- first-quartile reads-count value is 7 ( 10 bases covered with 4, 37
covered with 7, 10 covered with 10, 3 covered with 14 ) (column 19)
- median reads-count value is 7 (column 20)
- third-quartile reads-count value is 7 ( column 21 )
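The per-base statistics in columns 10 to 21 can be checked from the region lengths and per-region values (columns 24 to 26). The sketch below expands each region into one value per covered base and summarizes the result. The function name is hypothetical, and the nearest-rank quartile rule is an assumption: the source does not say which quartile definition the tool uses, although this rule reproduces the numbers of both example transfrags.

```python
import math


def per_base_summary(lengths, values):
    """Summarize per-base values given run lengths and one value per run."""
    # Expand the (length, value) runs into one sorted value per covered base.
    per_base = sorted(v for n, v in zip(lengths, values) for _ in range(n))
    total = len(per_base)

    def quartile(k):
        # Nearest-rank rule (an assumption): value at rank ceil(k * total / 4).
        rank = max(1, math.ceil(k * total / 4))
        return per_base[rank - 1]

    return {
        "min": per_base[0],
        "max": per_base[-1],
        "mean": sum(per_base) / total,
        "q1": quartile(1),
        "median": quartile(2),
        "q3": quartile(3),
    }


lengths = [10, 20, 3, 10, 17]                         # column 24 of transfrag-1
print(per_base_summary(lengths, [1, 2, 3, 2, 1]))     # sequence-counts, column 25
# {'min': 1, 'max': 3, 'mean': 1.6, 'q1': 1, 'median': 2, 'q3': 2}
print(per_base_summary(lengths, [4, 7, 14, 10, 7]))   # reads-counts, column 26
# {'min': 4, 'max': 14, 'mean': 7.35, 'q1': 7, 'median': 7, 'q3': 7}
```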
(significant regions information):
- This transfrag has five significant regions (column 22).
- Significant coordinates are the positions at which the
sequence-count or reads-count changes. In this example, the coordinates
are 100,110,130,133,143. Refer to the plot of the intervals to see how
these coordinates are determined. (column 23).
- Each significant region covers a varying number of bases. For example,
the region that starts at 100 covers 10 bases, the region that starts
at 110 covers 20 bases, the region that starts at 130 covers 3 bases,
and so on. (column 24)
- Sequence-count value for each significant region. Example: the
region which starts at 100 has sequence-count=1. (column 25)
- Reads-count value for each significant region. Example: the region
which starts at 100 has reads-count=4. (column 26)
- Integrated reads-count sum (column 27) - (Inner product of column 24 and column
26) = 10*4 + 20*7 + 3*14 + 10*10 + 17*7 = 441
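The significant regions and the integrated sum can be derived from the raw intervals with a simple sweep over coverage change points. This is a sketch under an inferred convention, not the tool's confirmed behavior: the example's numbers are reproduced when each interval's end coordinate is treated as its last covered base, while the transfrag as a whole is clipped at its reported end coordinate. The function name is hypothetical.

```python
def significant_regions(intervals):
    """Return (start, length, sequence_count, reads_count) per region.

    intervals: (start, end, read_count) tuples belonging to one transfrag.
    """
    transfrag_end = max(end for _, end, _ in intervals)
    events = {}  # position -> (change in sequence-count, change in reads-count)
    for start, end, reads in intervals:
        # Inferred convention: coverage stops just past `end`, clipped to the
        # transfrag's reported end coordinate.
        stop = min(end + 1, transfrag_end)
        for pos, d_seq, d_reads in ((start, 1, reads), (stop, -1, -reads)):
            seq, rd = events.get(pos, (0, 0))
            events[pos] = (seq + d_seq, rd + d_reads)
    # Keep only positions where at least one of the two counts changes.
    positions = sorted(p for p, delta in events.items() if delta != (0, 0))
    regions = []
    seq_count = reads_count = 0
    for pos, next_pos in zip(positions, positions[1:]):
        seq_count += events[pos][0]
        reads_count += events[pos][1]
        regions.append((pos, next_pos - pos, seq_count, reads_count))
    return regions


regions = significant_regions([(100, 132, 4), (110, 142, 3), (130, 160, 7)])
for region in regions:
    print(region)
# (100, 10, 1, 4)
# (110, 20, 2, 7)
# (130, 3, 3, 14)
# (133, 10, 2, 10)
# (143, 17, 1, 7)

# Integrated reads-count sum: region lengths times region reads-counts.
print(sum(length * reads for _, length, _, reads in regions))  # 441
```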