DRACO uses a sliding window approach to analyze short reads tiling long target transcripts. The default size of the window is 100 nt, slid with an offset equal to 1% the window's size.
For each window, the following analysis steps are performed:
-
Reads covering the entire window are used to build a graph. In the graph, each base is a vertex; when two bases are observed to co-mutate within the same read, they are connected by an edge in the graph. Each edge, is weighted according to the co-mutation frequency of the two bases (or essentialy, the number of reads in which both bases are simultaneously mutated).
-
From the adjacency matrix of the graph, the normalized Laplacian matrix is calculated.
-
Spectral clustering is applied to this matrix, to identify the number of co-existing structural conformations (clusters) for the window being analyzed.
-
Graph-Cut is then used to weight each vertex in the graph according to its affinity to each conformation.
-
The window is slid by the chosen offset, and steps 1-4 are repeated, until the entire transcript has been analyzed.
-
Consecutive windows forming the same number of conformations are grouped into window sets and merged.
-
Reads are assigned to each window set
-
For each window set, the relative conformation stoichiometries, and the individual reactivity profiles are reconstructed. Window sets found to form different numbers of conformations, are reported separately.
When analyzing a transcript, DRACO keeps in memory all the reads mapping to that transcript. At extreme sequencing depths (>500,000X), the memory consumption can become prohibitive on some system. If your system does not have sufficient RAM, reads can be randomly downsampled along the transcript, to achieve a lower mean coverage. Usually, a coverage of 10,000-20,000X is sufficient for DRACO to efficiently deconvolve the underlying structural heterogeneity.
For additional information concerning downsampling, please consult the documentation for the rf-count module of the RNA Framework.
Usage
To list the required parameters, simply type:
$ draco --help
| Parameter | Type | Description |
|---|---|---|
| --mm | string | Input mutation map (MM) file Note: to pass multiple replicates, simply call the parameter multiple times (e.g., `--mm rep1.mm --mm rep2.mm) |
| --output | string | Output JSON file (Default: draco_deconvolved.json) |
| --processors | int | Number of processors to use (Default: 0) Note: when set to 0, all available processors will be used |
| --whitelist | string | A whitelist file, containing the IDs of the transcripts to be analyzed, one per raw (Default: all transcripts will be analyzed) |
| --shape | Enables spectral analysis on all four bases (Default: only A/C bases) Note: this feature is highly experimental |
|
| --outputRawNClusters | string | Outputs a BED-like file containing, for each transcript, the raw number of clusters detected for each window |
| --log-level | string | Specifies the log level (allowed values: trace, debug, info, warn, error) |
| Mutation filtering options | ||
| --minBaseCoverage | int | Minimum coverage per base (Default: 100) |
| --minBaseMutations | int | Minimum mutations per base (Default: 2) |
| --minMutationFreq | float | Bases below this mutation frequency will be discarded as noise (Default: 0.005) |
| --minReadMutations | float | Reads with fewer than this number of mutations will be discarded as non-informative (Default: 2) |
| Spectral deconvolution options | ||
| --maxClusters | int | Maximum allowed number of clusters/conformations (Default: unlimited) |
| --minFilteredReads | int | Minimum number of reads (post-filtering) to perform spectral analysis (Default: 100) |
| --minPermutations | int | Minimum number of permutations performed to build the null model (Default: 10) |
| --maxPermutations | int | Maximum number of permutations performed to build the null model (Default: 100) |
| --ignoreFirstEigengap | The first eigengap is ignored | |
| --eigengapCumRelThresh | float | Minimum relative difference between the eigengap and the null model, as a fraction of the cumulative difference between the previous eigengaps and their respective null models (Default: 0.1) |
| --eigengapAbsThresh | float | Minimum absolute difference between the eigengap and its null model (Default: 0.03) |
| --alpha | float | Below this p-value, the null hypothesis is rejected and the eigengap is marked as informative (Default: 0.01) |
| --beta | float | Above this p-value, the alternative hypothesis is rejected and the eigengap is marked as non-informative (Default: 0.2) |
| --minNullStdev | float | Minimum standard deviation for the null model (Default: 0.025) Note: when this threshold is not met, --extraPermutations additional permutations will be performed |
| --extraPermutations | int | Additional permutations to perform when the standard deviation of the null model is < --minNullStdev (Default: 50) |
| --minWinBases | int | Minimum number of bases in window (post-filtering) to perform spectral analysis (Default: 10) |
| --lookaheadEigengaps | int | Number of eigengaps to look ahead after a non-informative eigengap is encountered (Default: 0) |
| --saveEigengapData | Saves eigengap data for plotting (Default: off) | |
| --eigengapDataOut | string | Eigengap data output folder (Default: ./eigengap_data) |
| Graph-Cut options | ||
| --minClusterFraction | float | Minimum fraction of reads assigned to each cluster/conformation (Default: 0.005) Note: if this threshold is not met, the number of clusters is automatically decreased |
| --softClusteringInits | float | Number of iterations for the initialization process of the graph-cut. Note: The initialization with the lowest score is picked (Default: 500) |
| --softClusteringIters | float | Number of iterations performed on graph-cut Note: The cut with the lowest score is picked (Default: 50) |
| --softClusteringWeightModule | float | The module of the weight that is used to change the cluster weights in order to find the lowest score (Default: 0.005) |
| Windowed analysis options | ||
| --winLen | float | Length of the window. If this value is comprised between 0 and 1, it is interpreted as a fraction of the median read length. If this value is > 1, it is interpreted as the absolute length of the window (Default: 100) Note: this parameter and --winLenFracRnaLen are mutually exclusive |
| --winLenFracRnaLen | float | Length of the window as fraction of the length of the RNA Note: this parameter and --winLen are mutually exclusive |
| --winOffset | float | Window sliding offset. If this value is comprised between 0 and 1, it is interpreted as a fraction of the window's length. If this value is ≥ 1 , it is interpreted as the number of bases to slide the window by (Default: 0.01) |
| --maxIgnoreWins | int | Maximum number of internal windows with a different number of clusters to ignore when merging two external sets of windows (Default: 0) |
| --minExtWins | int | Minimum number of external windows, having the same number of clusters, needed to trigger merging (Default: 6) |
| --nonInformativeToSurround | int | Non-informative windows (windows with 0 detected clusters) are set to the same number of clusters of surrounding windows (Default: false) |
| --allNonInformativeToOne | int | If all windows in the transcript are non-informative (0 clusters), the number of clusters is forced to 1 (Default: off) |
| --reportNonInformative | int | Reports also non-informative windows in the output JSON file |
| --assignmentsDumpDir | string | When specified, a folder is generated containing an MM file for each set of merged windows, containing a dump of the reads assigned to each conformation |
| --skipAmbiguousAssignments | When specified, if the best assignment score for a read is equal across multiple clusters, the read is discarded | |
| --minWindowsOverlap | Minimum overlap between non-contiguous windows with the same number of conformations in order to merge them in the same set. The value must be comprised between 0 and 1, and it is interpreted as a fraction of the window length (Default: 1) | |
| ## Understanding the algorithm |
Note
Since v1.3 it is possible to follow step-by-step the analysis perfromed by DRACO by setting the parameter --log-level trace
While it is advisable for most users to run DRACO with its default parameters, as these are the results of a careful and thorough calibration, it might be useful to adjust the analysis on a case-by-case basis.
DRACO analysis is performed in sliding windows. The size of the window (--winLen) is by default 100, and the sliding offset (--winOffset) is by default 1% of the length of the window. We observed that a window size of 100 is optimal in most cases, as it allows the identification of both small and large structurally-heterogeneous regions. This length is recommended even if longer reads have been sequenced, as, if the structurally-heterogeneous region in the transcript is much smaller than the read length (for example, 300 bp-long reads and a structurally-heterogeneous region of 90 nt), information might get lost in the spectral deconvolution.
The spectral deconvolution represents the most critical step of the algorithm, as this is responsible for determining the number of conformations populated by each window. During this step, the eigengaps, representing the distance between consecutive eigenvalues, are compared to a null model, built by performing random permutations over the original data matrix. The distribution of the null model is well approximated by a Weibull distribution. The number of permutations performed varies for each eigengap, and it is comprised between --minPermutations and --maxPermutations. Permutations are performed as long as the standard deviation (σ) of the null model is < --minNullStdev, increasing by --extraPermutations.
Starting with the first eigengap, DRACO compares each eigengap to its respective null model. Aim of this comparison is to determine which eigengaps can be considered to be informative, as this directly translates into the number of conformations/clusters populated by the analyzed window. For instance, if the first two eigengaps are informative, two conformations (clusters) are present; if the first three eigengaps are informative, three conformations are present, and so on.
In order to be considered informative, each eigengap must fulfill a number of criteria:
-
The distance (d) between the mean (μ) of the null model and the eigengap must be ≥
--eigengapAbsThresh. -
The distance between the μ and the eigengap must be greater than
--egengapCumRelThreshold× the cumulative sum of the previous eigengaps. For instance, d4, the distance between the 4th eigengap and its null model, must be ≥--egengapCumRelThreshold× (d1 + d2 + d3). This, of course, does not apply to the first eigengap -
A t-test is used to assess whether the distance between the eigengap and the null model is significant. If the p-value is ≤
--alpha, the null hypothesis is rejected, and the eigengap is marked as informative. If the p-value is ≥--beta, the alternative hypothesis is rejected, and the eigengap is marked as non-informative. If the p-value is comprised between these two thresholds, neither of the two hypotheses can be rejected, so additional permutations are performed. If, after having performed--maxPermutationspermutations the p-value is still comprised between the two thresholds, then the alternative hypothesis is rejected and the eigengap is marked as non-informative.
The analysis terminates when a non-informative eigengap is encountered as, normally, the subsequent eigengaps are non-informative as well. In some particular cases, however, we have eigengaps to behave unexpectedly. For instance, at times the first eigengap might overlap with its null model, which would cause DRACO to report 0 clusters and terminate the analysis for that window. This can be avoided by enabling the --ignoreFirstEigengap parameter. In other cases, although an eigengap is marked as non-informative, one or more eigengaps immediately downstream of it are again informative. To account for these situations, whenever a non-informative eigengap is encountered, DRACO can perform a lookahead evaluation of the --lookaheadEigengaps subsequent eigengaps. If one of these is marked as informative, then the analysis of the eigengaps is allowed to continue. This analysis is disabled by default.
Unless the parameter --outputRawNClusters had been enabled (which would cause DRACO to only report the raw number of identified clusters/conformations per window), once the number of conformations has been determined, the algorithm uses a normalized Graph-Cut approach to weight each base in the window, according to its affinity to each of the different conformations. This approach tries to find the best way to partition the previously built graph. It is composed of two steps:
- Initialization. During this step, each base of each cluster is assigned a random weight. The step is repeated
--softClusteringInitstimes and the solution with the lowest score is picked. - Refinement. During this step, the weights of the bases are dynamically altered by
--softClusteringWeightModulein an attempt to further minimize the score. This step is repeated--softClusteringItersand the solution with the lowest score is picked.
Note
In the original implementation of DRACO, this step was carried out only once. While being significantly faster, the downside was a higher risk of finding suboptimal graph partitioning solutions, corresponding to local score minima. Lowering the --softClusteringWeightModule and increasing the --softClusteringIters can slow down the analysis, but it significantly increases the likelihood that the true optimal solution is identified, further ensuring reproducibility of the analysis results.
Final step of the analysis involves merging consecutive windows found to form the same number of conformations, into window sets. Let's however consider the following case:
In this situation, two sets of windows found to form two conformations, are interrupted by one window forming one conformation. By default (see figure above, left panel), DRACO would report three window sets. There are two ways to handle these cases:
-
The first possibility is to make DRACO ignore at most
--maxIgnoreWinsinternal windows forming a discordant number of conformations. For the merging to occur, the total number of external windows on either sides of the discordant internal windows must be ≥--minExtWins(see figure above, right panel). -
The second possibility (preferred, no information loss) is to make DRACO report every window set it did identify, but allowing non-contiguous window sets forming the same number of conformations to be merged, provided that the overlap between the last window of one set and the first window of the following set is at least
--minWindowsOverlap× the window length (e.g.,--minWindowsOverlap 0.5requires a minimum overlap of 50% of the window length; see figure below).
Following window merging, reads are assigned to each conformation. If the fraction of reads assigned to a conformation is < --minClusterFraction × the number of conformations for that window set is decreased by one, and the step is repeated.
Handling of replicates
Since v1.3, DRACO can handle replicate experiments. Multiple MM files can be passed by simply calling the --mm parameter multiple times.
During the spectral analysis DRACO will separately determine the number of conformations populated by corresponding windows across replicates, and the number of conformations for a given window will be set as the median of the number of conformations identified across all replicates. For instance, in an experiment with 3 replicates, if two identified 1 conformation and one identified 2 conformations, the number of conformations for the window will be set to 1. Similarly, if the three replicates respectively identified 1, 2 and 3 conformations, the number of conformations for the window will be set to 2.
During the Graph-Cut part of the algorithm, graph partitioning and score minimization will be simultaneously performed on the graphs of each replicate, hence avoiding experiment-specific local minima and ensuring optimal parititioning.
Output JSON files
DRACO produces a JSON file, with the following structure:
{
"filenames":["sample.mm"],
"transcripts":[
{
"id":"RNA_1",
"sequence":"TCTATTCTACATTGATAGAACAACTCT...TCTGTCTATCACCGCTAGAGCACTCGGTGATTGCA",
"nReads":[20000],
"windows":[
[{
"start":0,
"end":294,
"stoichiometries":[ 0.388,0.304,0.308 ],
"counts":[
[ 0,1,0,23,0,0,33,0,54,0,5,0,0,...,301,0,334,0,3,1,0,0,0,0,0,0 ],
[ 0,6,0,0,0,0,0,0,0,0,36,0,0,2,...,0,0,0,0,0,0,134,0,86,0,0,62,63 ],
[ 0,1,0,17,0,0,57,0,0,0,75,0,11,...,181,0,0,0,0,0,0,0,0,118,0,0,0,0 ]
],
"coverage":[
[ 40,69,121,157,194,245,297,337,378,416,465,...,521,482,462,430,404,359,311,287,250,211 ],
[ 37,73,108,135,163,198,238,279,317,340,369,...,534,512,481,463,440,414,380,240,204,173 ],
[ 35,66,107,141,168,206,250,292,337,380,415,...,428,394,364,336,309,271,240,199,181,111 ]
],
"weights":[
[ 0.000,1.000,0.000,0.086,0.000,0.055,0.000,0.000,...,0.110,0.110,0.000,0.000,0.000,0.000,0.000,0.000 ],
[ 0.039,0.035,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.922,0.000,0.000,0.035,0.925,0.000,0.000 ],
[ 0.165,0.000,0.000,0.855,0.000,0.020,0.000,0.855,...,0.000,0.000,0.000,0.055,0.000,0.055,0.000,0.835 ]
],
"preCoverage":[ 23979,24261,24506,24751,24996,25257,25539,...,25813,26078,26340,26616,26889,27182,27441,27719,28009 ]
}]
]
}
]
}
The transcripts array contains all the transcripts processed by DRACO. Each transcript includes the following keys:
| Key | Description |
|---|---|
| id | Transcript's ID |
| sequence | Transcript's sequence |
| nReads | An array containing the total number of reads mapping to the transcript in each replicate |
| windows | An array of arrays containing the deconvolution results for each window set across each replicate being analyzed |
The windows array is further structured as follows:
| Key | Description |
|---|---|
| start | Start position of the window (0-based) |
| end | End position of the window (1-based) |
| stoichiometries | An array containing the relative abundances for the reconstructed conformations |
| counts | An array of arrays, each containing the number of mutations per base for each of the reconstructed conformations |
| coverage | An array of arrays, each containing the coverage per base for each of the reconstructed conformations, calculated using all the assigned reads post-filtering |
| weights | An array of arrays, each containing the weight per base from the Graph-Cut analysis |
| preCoverage | An array containing the total coverage per base along the window, calculated using only the reads pre-filtering, covering the entire window, that have been used to perform the spectral analysis (therefore, it can be < than the sum of the coverage arrays) |
Since v1.3, in case multiple replicates are being analyzed, an additional clusterConfidence array is present right after the windows array, which is structured as it follows:
| Key | Description |
|---|---|
| nClusters | Number of detected clusters/conformations |
| confidence | A value between 0 and 1 representing the fraction of replicates for which the spectral analysis found the above number of conformations (e.g., if nClusters is 2 and out of two replicates one was found to form 2 conformations and the other 3 conformations, confidence will be 0.5) |
| start | Start position of the window (0-based) |
| end | End position of the window (1-based) |
Post-processing of DRACO output
DRACO JSON output files can be somehow hard to interpret and process. To facilitate this operation, we developed the rf-json2rc module, that has been included in the RNA Framework. The module allows converting the reactivity profiles deconvolved by DRACO into RNA Count (RC) files, that can be subsequently normalized with the rf-norm module, and then fed into the rf-fold module for secondary structure modeling. For additional details, please refer to the rf-json2rc docs.