DRACO uses a sliding window approach to analyze short reads tiling long target transcripts. By default, the window size equals 90% of the median read length, and the window is slid with an offset equal to 5% of its size.
For each window, the following analysis steps are performed:
1. Reads covering the entire window are used to build a graph. Each base is a vertex; when two bases are observed to co-mutate within the same read, they are connected by an edge. Each edge is weighted according to the co-mutation frequency of the two bases (essentially, the number of reads in which both bases are simultaneously mutated).
2. From the adjacency matrix of the graph, the normalized Laplacian matrix is calculated.
3. Spectral clustering is applied to this matrix to identify the number of co-existing structural conformations (clusters) for the window being analyzed.
4. Graph-Cut is then used to weight each vertex in the graph according to its affinity to each conformation.
5. Reads are assigned to the identified conformations.
6. The window is slid by the chosen offset, and steps 1-5 are repeated until the entire transcript has been analyzed.
7. Consecutive windows forming the same number of conformations are grouped into window sets and merged.
8. For each window set, the relative conformation stoichiometries and the individual reactivity profiles are reconstructed. Window sets found to form different numbers of conformations are reported separately.
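The graph-building and Laplacian steps above can be sketched as follows. This is a minimal illustration with numpy, not DRACO's actual implementation: the function names are hypothetical, and the binary reads-by-bases mutation matrix is a simplification of the real data structures.

```python
import numpy as np

def comutation_graph(reads: np.ndarray) -> np.ndarray:
    """Build a weighted adjacency matrix from a reads-by-bases binary
    mutation matrix: edge (i, j) counts reads where both bases mutate."""
    # reads[r, i] == 1 if base i is mutated in read r
    adj = reads.T @ reads          # co-mutation counts for every base pair
    np.fill_diagonal(adj, 0)       # no self-edges
    return adj.astype(float)

def normalized_laplacian(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalized Laplacian: L = I - D^-1/2 A D^-1/2."""
    deg = adj.sum(axis=1)
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(deg), 0.0)
    return np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

# Toy example: 4 reads covering a 3-base window
reads = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [0, 1, 1],
                  [1, 0, 1]])
A = comutation_graph(reads)      # bases 0 and 1 co-mutate in 2 reads
L = normalized_laplacian(A)
```

The spectral clustering of step 3 then operates on the eigenvalues/eigenvectors of `L`.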
When analyzing a transcript, DRACO keeps in memory all the reads mapping to that transcript. At extreme sequencing depths (>500,000X), memory consumption can become prohibitive on some systems. If your system does not have sufficient RAM, the reads can be randomly downsampled along the transcript to achieve a lower mean coverage. Usually, a coverage of 20,000X is sufficient for DRACO to efficiently deconvolve the underlying structural heterogeneity.
For additional information concerning downsampling, please consult the documentation for the rf-count module of the RNA Framework.
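The idea behind random downsampling can be sketched as follows. This is only a conceptual illustration (the downsampling itself is performed by rf-count, not by this code); the function name and the read representation as (start, end) tuples are assumptions.

```python
import random

def downsample_reads(reads, transcript_len, target_coverage, seed=0):
    """Randomly keep reads so that the expected mean coverage across the
    transcript approaches `target_coverage` (illustrative only)."""
    rng = random.Random(seed)
    # Current mean coverage = total sequenced bases / transcript length
    total_bases = sum(end - start for start, end in reads)
    mean_cov = total_bases / transcript_len
    if mean_cov <= target_coverage:
        return list(reads)                  # already below target: keep all
    keep_prob = target_coverage / mean_cov  # fraction of reads to retain
    return [r for r in reads if rng.random() < keep_prob]

# Toy example: 10,000 reads of 150 bp on a 1,500 nt transcript (~1,000X)
reads = [(i % 1350, i % 1350 + 150) for i in range(10_000)]
kept = downsample_reads(reads, 1_500, 100)  # target ~100X mean coverage
```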
Usage
To list the required parameters, simply type:
```
$ draco --help
```
Parameter | Type | Description |
---|---|---|
--mm | string | Input mutation map (MM) file |
--output | string | Output JSON file (Default: draco_deconvolved.json) |
--processors | int | Number of processors to use (Default: 0) Note #1: when set to 0, all available processors will be used Note #2: only the analysis of multiple transcripts can be parallelized. If analyzing a single transcript, setting this parameter to a value > 1 will not speed up the execution |
--whitelist | string | A whitelist file, containing the IDs of the transcripts to be analyzed, one per row (Default: all transcripts will be analyzed) |
--shape | | Enables spectral analysis on all four bases (Default: only A/C bases) Note: this feature is highly experimental |
Mutation filtering options | ||
--minBaseCoverage | int | Minimum coverage per base (Default: 100) |
--minBaseMutations | int | Minimum mutations per base (Default: 2) |
--minMutationFreq | float | Bases below this mutation frequency will be discarded as noise (Default: 0.005) |
--minReadMutations | int | Reads with fewer than this number of mutations will be discarded as non-informative (Default: 2) |
Spectral deconvolution options | ||
--maxClusters | int | Maximum allowed number of clusters/conformations (Default: unlimited) |
--minFilteredReads | int | Minimum number of reads (post-filtering) to perform spectral analysis (Default: 5) |
--minPermutations | int | Minimum number of permutations performed to build the null model (Default: 8) |
--maxPermutations | int | Maximum number of permutations performed to build the null model (Default: 400) |
--firstEigengapShift | float | How much the mean of the first eigengap's null model should be shifted by (Default: 0.9) |
--eigengapCumRelThresh | float | Minimum relative difference between the eigengap and the null model, as a fraction of the cumulative difference between the previous eigengaps and their respective null models (Default: 0.1) Note: this does not apply to the first eigengap |
--eigengapAbsThresh | float | Minimum absolute difference between the eigengap and the null model (Default: 0.03) |
--alpha | float | Below this p-value, the null hypothesis is rejected and the eigengap is marked as informative (Default: 0.01) |
--beta | float | Above this p-value, the alternative hypothesis is rejected and the eigengap is marked as non-informative (Default: 0.2) Note: this threshold does not apply to the first eigengap |
--firstEigengapBeta | float | Beta p-value threshold for the first eigengap (Default: 0.4) |
--minNullStdev | float | Minimum standard deviation for the null model (Default: 0.025) Note: when this threshold is not met, --extraPermutations additional permutations will be performed |
--extraPermutations | int | Additional permutations to perform when the standard deviation of the null model is < --minNullStdev (Default: 50) |
--minWinBases | int | Minimum number of bases in window (post-filtering) to perform spectral analysis (Default: 10) |
--lookaheadEigengaps | int | Number of eigengaps to look ahead after a non-informative eigengap is encountered (Default: 3) |
--saveEigengapData | | Saves eigengap data for plotting (Default: off) |
--eigengapDataOut | string | Eigengap data output folder (Default: ./eigengap_data) |
Graph-Cut options | ||
--minClusterFraction | float | Minimum fraction of reads assigned to each cluster/conformation (Default: 0.05) Note: if this threshold is not met, the number of clusters is automatically decreased |
--softClusteringInits | int | Number of initializations for the graph-cut (Default: 500) Note: the initialization with the lowest score is picked |
--softClusteringIters | int | Number of refinement iterations performed on the graph-cut (Default: 20) Note: the cut with the lowest score is picked |
--softClusteringWeightModule | float | Magnitude of the weight adjustments applied to the cluster weights in order to find the lowest score (Default: 0.005) |
Windowed analysis options | ||
--winLenFraction | float | Length of the window as a fraction of the median read length (Default: 0.9) Note: this parameter and --absWinLen are mutually exclusive |
--absWinLen | int | Absolute length of the window (Default: 0) Note: this parameter and --winLenFraction are mutually exclusive |
--winOffsetFraction | float | Slide offset as fraction of the size of the window (Default: 0.05) Note: this parameter and --absWinOffset are mutually exclusive |
--absWinOffset | int | Absolute slide offset (Default: 0) Note: this parameter and --winOffsetFraction are mutually exclusive |
--maxIgnoreWins | int | Maximum number of internal windows with a different number of clusters to ignore when merging two external sets of windows (Default: 0) |
--minExtWins | int | Minimum number of external windows, having the same number of clusters, needed to trigger merging (Default: 6) |
--nonInformativeToSurround | | Non-informative windows (windows with 0 detected clusters) are set to the same number of clusters as the surrounding windows (Default: off) |
--allNonInformativeToOne | | If all windows in the transcript are non-informative (0 clusters), the number of clusters is forced to 1 (Default: off) |
--reportNonInformative | | Also reports non-informative windows in the output JSON file |
Understanding the algorithm
While most users are advised to run DRACO with its default parameters, as these are the result of careful and thorough calibration, it might be useful to adjust the analysis on a case-by-case basis.
DRACO analysis is performed in windows. The size of the window is by default `winLenFraction` × the median read length; an arbitrary length (in bp) for the window can also be used by setting `absWinLen`. Similarly, the slide offset is by default `winOffsetFraction` × the length of the window, but an arbitrary offset (in bp) can be specified via `absWinOffset`.

When long reads are used, reducing the size of the window might allow the identification of small, highly dynamic regions of the target transcript. For instance, if 300 bp-long reads are being used and the structurally dynamic region within the target transcript is only 90 nt long, the spectral analysis might not be able to detect it (especially if probing with DMS, as only ~50% of the bases will be informative); in these situations, reducing the size of the window can increase the sensitivity of the analysis.
The spectral deconvolution represents the most critical step of the algorithm, as it is responsible for determining the number of conformations formed by each window. During this step, the eigengaps, representing the distances between consecutive eigenvalues, are compared to a null model built by performing random permutations over the original data matrix. The distribution of the null model is well approximated by a Weibull distribution. The number of permutations performed varies for each eigengap, and lies between `minPermutations` and `maxPermutations`: as long as the standard deviation (σ) of the null model is < `minNullStdev`, the number of permutations is increased by `extraPermutations`.
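As an illustration, eigengaps and a permutation-based null model can be sketched as follows. This is a simplified sketch, not DRACO's implementation: the function names are hypothetical, and permuting the symmetric matrix directly stands in for DRACO's permutation of the original data matrix.

```python
import numpy as np

def eigengaps(laplacian: np.ndarray) -> np.ndarray:
    """Distances between consecutive eigenvalues of the Laplacian."""
    evals = np.sort(np.linalg.eigvalsh(laplacian))
    return np.diff(evals)

def null_eigengaps(matrix: np.ndarray, n_perm: int, rng) -> np.ndarray:
    """Null model: eigengaps recomputed after random permutations of the
    matrix entries (a stand-in for permuting the original data matrix)."""
    gaps = []
    for _ in range(n_perm):
        flat = rng.permutation(matrix.ravel()).reshape(matrix.shape)
        sym = (flat + flat.T) / 2            # keep the matrix symmetric
        gaps.append(eigengaps(sym))
    return np.array(gaps)                    # shape: (n_perm, n_bases - 1)

# Toy example
M = np.diag([0.0, 1.0, 3.0])
gaps = eigengaps(M)                          # [1.0, 2.0]
null = null_eigengaps(M, n_perm=8, rng=np.random.default_rng(0))
```

Each observed eigengap is then compared to the distribution of its column in `null`.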
Starting with the first eigengap, DRACO compares each eigengap to its respective null model. The aim of this comparison is to determine which eigengaps can be considered informative, as this directly translates into the number of conformations/clusters formed by the analyzed window. For instance, if the first two eigengaps are informative, two conformations (clusters) are present; if the first three eigengaps are informative, three conformations are present, and so on.
In order to be considered informative, each eigengap must fulfill a number of criteria:

- The distance (d) between the mean (μ) of the null model and the eigengap must be ≥ `eigengapAbsThresh`.
- The distance between μ and the eigengap must be ≥ `eigengapCumRelThresh` × the cumulative sum of the distances for the previous eigengaps. For instance, d4, the distance between the 4th eigengap and its null model, must be ≥ `eigengapCumRelThresh` × (d1 + d2 + d3). This, of course, does not apply to the first eigengap.
- A t-test is used to assess whether the distance between the eigengap and the null model is significant. If the p-value is ≤ `alpha`, the null hypothesis is rejected and the eigengap is marked as informative. If the p-value is ≥ `beta` (or, in the case of the first eigengap, ≥ `firstEigengapBeta`), the alternative hypothesis is rejected and the eigengap is marked as non-informative. If the p-value lies between these two thresholds, neither hypothesis can be rejected, so additional permutations are performed. If, after `maxPermutations` permutations, the p-value still lies between the two thresholds, the alternative hypothesis is rejected and the eigengap is marked as non-informative.
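The decision logic for a single eigengap can be summarized in a small sketch. The function name is hypothetical; the p-value is assumed to have been computed beforehand (in DRACO, via a t-test against the permutation null model), and the default thresholds mirror the parameter table above.

```python
def eigengap_informative(gap, null_mean, prev_distances, p_value,
                         abs_thresh=0.03, cum_rel_thresh=0.1,
                         alpha=0.01, beta=0.2):
    """Decide whether an eigengap is informative, mirroring the criteria
    above. Returns True (informative), False (non-informative), or None
    (undecided: additional permutations would be needed)."""
    d = gap - null_mean
    if d < abs_thresh:
        return False                        # absolute-distance criterion failed
    if prev_distances and d < cum_rel_thresh * sum(prev_distances):
        return False                        # cumulative-relative criterion failed
    if p_value <= alpha:
        return True                         # null hypothesis rejected
    if p_value >= beta:
        return False                        # alternative hypothesis rejected
    return None                             # perform extra permutations
```

For example, an eigengap at 0.5 against a null mean of 0.1 with p = 0.001 would be informative, while the same eigengap with p = 0.05 would remain undecided.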
The analysis terminates when a non-informative eigengap is encountered, as the subsequent eigengaps are normally non-informative as well. In some particular cases, however, eigengaps have been observed to behave unexpectedly: although an eigengap is marked as non-informative, one or more eigengaps immediately downstream of it are again informative. To account for these situations, whenever a non-informative eigengap is encountered, DRACO can perform a lookahead evaluation of the `lookaheadEigengaps` subsequent eigengaps. If one of these is marked as informative, the analysis of the eigengaps is allowed to continue.
Once the number of conformations has been determined, the algorithm uses a normalized Graph-Cut approach to weight each base in the window according to its affinity to each of the different conformations. This approach tries to find the optimal partitioning of the previously built graph, and is composed of two steps:
- Initialization. Each base of each cluster is assigned a random weight. This step is repeated `softClusteringInits` times, and the solution with the lowest score is picked.
- Refinement. The weights of the bases are dynamically altered by `softClusteringWeightModule` in an attempt to further minimize the score. This step is repeated `softClusteringIters` times, and the solution with the lowest score is picked.
Note: in the original implementation of DRACO, this step was carried out only once. While significantly faster, the downside was a higher risk of finding suboptimal graph partitioning solutions, corresponding to local score minima. Lowering `softClusteringWeightModule` and increasing `softClusteringIters` can slow down the analysis, but it significantly increases the likelihood that the true optimal solution is identified, further ensuring the reproducibility of the analysis results.
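The two-step scheme above (random-restart initialization followed by local refinement) can be sketched generically. This is not DRACO's graph-cut scoring, which is omitted here: `score_fn` is an arbitrary objective to be minimized, and all names are illustrative.

```python
import random

def soft_graph_cut(score_fn, n_bases, n_clusters,
                   inits=500, iters=20, weight_module=0.005, seed=0):
    """Random-restart initialization followed by local refinement of the
    per-base cluster weights (illustrative sketch of the two steps)."""
    rng = random.Random(seed)

    # Initialization: try many random weight assignments, keep the best
    best = None
    for _ in range(inits):
        w = [[rng.random() for _ in range(n_clusters)] for _ in range(n_bases)]
        if best is None or score_fn(w) < score_fn(best):
            best = w

    # Refinement: nudge each weight by ±weight_module, keep improvements
    for _ in range(iters):
        for i in range(n_bases):
            for k in range(n_clusters):
                for delta in (weight_module, -weight_module):
                    trial = [row[:] for row in best]
                    trial[i][k] += delta
                    if score_fn(trial) < score_fn(best):
                        best = trial
    return best

# Toy objective: pull every weight toward 0.5
score = lambda w: sum((x - 0.5) ** 2 for row in w for x in row)
weights = soft_graph_cut(score, n_bases=2, n_clusters=2, inits=50, iters=5)
```

Smaller `weight_module` steps with more `iters` explore the score landscape more finely, which is why the trade-off described in the note above arises.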
Following identification of the optimal graph partitioning, reads are assigned to each conformation. If the fraction of reads assigned to a conformation is < `minClusterFraction`, the number of conformations for the window is decreased by one, and the step is repeated.
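The check on the minimum cluster fraction can be expressed as a small sketch (the function name is hypothetical; `assignments` maps each read to its conformation index):

```python
def enforce_min_cluster_fraction(assignments, n_clusters, min_fraction=0.05):
    """If any conformation received fewer than `min_fraction` of the
    reads, signal that the cluster count must be decreased by one and
    the deconvolution repeated (illustrative sketch)."""
    total = len(assignments)
    counts = [assignments.count(k) for k in range(n_clusters)]
    if min(counts) / total < min_fraction:
        return n_clusters - 1   # repeat the step with one fewer cluster
    return n_clusters

# 1% of reads in cluster 1 is below the 5% default: drop to one cluster
new_k = enforce_min_cluster_fraction([0] * 99 + [1] * 1, 2)
```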
The final step of the analysis involves merging consecutive windows found to form the same number of conformations into window sets. Consider, however, the following case:
In this situation, two sets of windows found to form two conformations are interrupted by one window forming a single conformation. This often happens with short reads (<100 bp), as the analyzed window might not contain enough information to identify the coexisting conformations. By default (left), DRACO would report three window sets. It is possible to account for these cases by allowing DRACO to ignore at most `maxIgnoreWins` internal windows forming a discordant number of conformations. For the merging to occur, the total number of external windows on either side of the discordant internal windows must be ≥ `minExtWins`.
It is however possible for the windows on the left and on the right side of the discordant internal windows to form different numbers of conformations:

In such a case, the internal discordant windows will be merged into a single set with the windows on one side only if the number of windows on that side is both greater than the number of windows on the other side and ≥ `minExtWins`.
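The symmetric merging case (same number of conformations on both flanks) can be sketched as follows. This is an illustrative simplification only: it ignores the asymmetric case described above, the function name is hypothetical, and a non-zero `max_ignore` default is used for demonstration (the `maxIgnoreWins` default is 0).

```python
def merge_discordant(windows, max_ignore=1, min_ext=6):
    """Relabel short runs of discordant windows sandwiched between two
    flanks forming the same number of conformations. `windows` is the
    per-window conformation count (illustrative sketch)."""
    out = list(windows)
    i = 0
    while i < len(out):
        j = i
        while j < len(out) and out[j] == out[i]:
            j += 1                          # left flank: out[i:j]
        k = j
        while k < len(out) and out[k] != out[i]:
            k += 1                          # discordant run: out[j:k]
        m = k
        while m < len(out) and out[m] == out[i]:
            m += 1                          # right flank: out[k:m]
        run, flanks = k - j, (j - i) + (m - k)
        if 0 < run <= max_ignore and m > k and flanks >= min_ext:
            out[j:k] = [out[i]] * run       # absorb the discordant run
        else:
            i = j                           # advance past the left flank
    return out

# Three 2-conformation windows on each side of one discordant window:
# 6 external windows meet the default min_ext, so the run is absorbed.
merged = merge_discordant([2, 2, 2, 1, 2, 2, 2])
```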
Output JSON files
DRACO produces a JSON file with the following structure:

```json
{
  "filename": "sample.mm",
  "transcripts": [
    {
      "id": "RNA_1",
      "sequence": "TCTATTCTACATTGATAGAACAACTCT...TCTGTCTATCACCGCTAGAGCACTCGGTGATTGCA",
      "nReads": 20000,
      "windows": [
        {
          "start": 0,
          "end": 294,
          "stoichiometries": [ 0.388, 0.304, 0.308 ],
          "counts": [
            [ 0,1,0,23,0,0,33,0,54,0,5,0,0,...,301,0,334,0,3,1,0,0,0,0,0,0 ],
            [ 0,6,0,0,0,0,0,0,0,0,36,0,0,2,...,0,0,0,0,0,0,134,0,86,0,0,62,63 ],
            [ 0,1,0,17,0,0,57,0,0,0,75,0,11,...,181,0,0,0,0,0,0,0,0,118,0,0,0,0 ]
          ],
          "coverage": [
            [ 40,69,121,157,194,245,297,337,378,416,465,...,521,482,462,430,404,359,311,287,250,211 ],
            [ 37,73,108,135,163,198,238,279,317,340,369,...,534,512,481,463,440,414,380,240,204,173 ],
            [ 35,66,107,141,168,206,250,292,337,380,415,...,428,394,364,336,309,271,240,199,181,111 ]
          ],
          "weights": [
            [ 0.000,1.000,0.000,0.086,0.000,0.055,0.000,0.000,...,0.110,0.110,0.000,0.000,0.000,0.000,0.000,0.000 ],
            [ 0.039,0.035,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.922,0.000,0.000,0.035,0.925,0.000,0.000 ],
            [ 0.165,0.000,0.000,0.855,0.000,0.020,0.000,0.855,...,0.000,0.000,0.000,0.055,0.000,0.055,0.000,0.835 ]
          ],
          "preCoverage": [ 23979,24261,24506,24751,24996,25257,25539,...,25813,26078,26340,26616,26889,27182,27441,27719,28009 ]
        }
      ]
    }
  ]
}
```
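As a quick example of consuming this output, per-base mutation frequencies can be derived by dividing `counts` by `coverage` for each conformation. The helper name is hypothetical; the field names match the structure shown above, and the embedded JSON is a minimal stand-in for a real DRACO output file.

```python
import json

def mutation_frequencies(window):
    """Per-base mutation frequency for each conformation in a window,
    computed as counts / coverage (0.0 where coverage is 0)."""
    profiles = []
    for counts, coverage in zip(window["counts"], window["coverage"]):
        profiles.append([c / cov if cov > 0 else 0.0
                         for c, cov in zip(counts, coverage)])
    return profiles

# Minimal example mimicking the structure of a DRACO JSON file
data = json.loads("""{
  "transcripts": [{
    "id": "RNA_1",
    "windows": [{
      "start": 0, "end": 4,
      "stoichiometries": [0.6, 0.4],
      "counts":   [[0, 2, 0, 10], [1, 0, 5, 0]],
      "coverage": [[100, 100, 100, 100], [50, 50, 50, 0]]
    }]
  }]
}""")

for tx in data["transcripts"]:
    for win in tx["windows"]:
        profiles = mutation_frequencies(win)
```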
The transcripts array contains all the transcripts processed by DRACO. Each transcript includes the following keys:
Key | Description |
---|---|
id | Transcript's ID |
sequence | Transcript's sequence |
nReads | Total number of reads mapping to the transcript |
windows | An array containing the deconvolution results for each window set |
The windows array is further structured as follows:
Key | Description |
---|---|
start | Start position of the window (0-based) |
end | End position of the window (1-based) |
stoichiometries | An array containing the relative abundances for the reconstructed conformations |
counts | An array of arrays, each one containing the number of mutations per base for each of the reconstructed conformations |
coverage | An array of arrays, each one containing the coverage per base for each of the reconstructed conformations, calculated using all the assigned reads post-filtering |
weights | An array of arrays, each one containing the weight per base from the Graph-Cut analysis |
preCoverage | An array containing the total coverage per base along the window, calculated using only the pre-filtering reads covering the entire window that have been used to perform the spectral analysis (it can therefore be lower than the sum of the coverage arrays) |
Post-processing of DRACO output
DRACO JSON output files can be somewhat hard to interpret and process. To facilitate this operation, we developed the `rf-json2rc` module, which is included in the RNA Framework. The module converts the reactivity profiles deconvolved by DRACO into RNA Count (RC) files, which can subsequently be normalized with the `rf-norm` module and then fed to the `rf-fold` module for secondary structure modeling. For additional details, please refer to the `rf-json2rc` docs.