DRACO uses a sliding window approach to analyze short reads tiling long target transcripts. By default, the window size is equal to 90% of the median read length, and the window is slid with an offset equal to 5% of the window's size.

[Figure: DRACO workflow]

For each window, the following analysis steps are performed:

  1. Reads covering the entire window are used to build a graph in which each base is a vertex; when two bases are observed to co-mutate within the same read, they are connected by an edge. Each edge is weighted according to the co-mutation frequency of the two bases (essentially, the number of reads in which both bases are simultaneously mutated).

  2. From the adjacency matrix of the graph, the normalized Laplacian matrix is calculated.

  3. Spectral clustering is applied to this matrix to identify the number of co-existing structural conformations (clusters) for the window being analyzed (a minimal sketch of steps 1-3 is shown after this list).

  4. Graph-Cut is then used to weight each vertex in the graph according to its affinity to each conformation.

  5. Reads are assigned to each conformation.

  6. The window is slid by the chosen offset, and steps 1-5 are repeated until the entire transcript has been analyzed.

  7. Consecutive windows forming the same number of conformations are grouped into window sets and merged.

  8. For each window set, the relative conformation stoichiometries and the individual reactivity profiles are reconstructed. Window sets found to form different numbers of conformations are reported separately.
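
The following Python sketch illustrates steps 1-3 on a toy read × base mutation matrix; it is a didactic approximation (random data, made-up variable names), not DRACO's actual implementation:

import numpy as np

# Hypothetical toy data: one row per read covering the window, one column per
# base; True marks a mutated base.
rng = np.random.default_rng(0)
mutations = rng.random((2000, 50)) < 0.03

# Step 1: co-mutation graph. Edge weight = number of reads in which both bases
# are simultaneously mutated (dividing by the number of reads would give the
# co-mutation frequency).
adjacency = mutations.astype(float).T @ mutations.astype(float)
np.fill_diagonal(adjacency, 0.0)

# Step 2: symmetric normalized Laplacian, L = I - D^(-1/2) A D^(-1/2).
degree = adjacency.sum(axis=1)
d_inv_sqrt = np.zeros_like(degree)
d_inv_sqrt[degree > 0] = 1.0 / np.sqrt(degree[degree > 0])
laplacian = np.eye(adjacency.shape[0]) - d_inv_sqrt[:, None] * adjacency * d_inv_sqrt[None, :]

# Step 3: eigengaps are the distances between consecutive sorted eigenvalues;
# the number of informative eigengaps estimates the number of conformations.
eigenvalues = np.sort(np.linalg.eigvalsh(laplacian))
eigengaps = np.diff(eigenvalues)
print(eigengaps[:5])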

When analyzing a transcript, DRACO keeps in memory all the reads mapping to that transcript. At extreme sequencing depths (>500,000X), the memory consumption can become prohibitive on some systems. If your system does not have sufficient RAM, the reads can be randomly downsampled along the transcript to achieve a lower mean coverage. Usually, a coverage of 20,000X is sufficient for DRACO to efficiently deconvolve the underlying structural heterogeneity.

For additional information concerning downsampling, please consult the documentation for the rf-count module of the RNA Framework.


Usage

To list the available parameters, simply type:

$ draco --help
Parameter Type Description
--mm string Input mutation map (MM) file
--output string Output JSON file (Default: draco_deconvolved.json)
--processors int Number of processors to use (Default: 0)
Note #1: when set to 0, all available processors will be used
Note #2: only the analysis of multiple transcripts can be parallelized. If analyzing a single transcript, setting this parameter to a value > 1 will not speed up the execution
--whitelist string A whitelist file, containing the IDs of the transcripts to be analyzed, one per row (Default: all transcripts will be analyzed)
--shape Enables spectral analysis on all four bases (default is only A/C bases)
Note: this feature is highly experimental
Mutation filtering options
--minBaseCoverage int Minimum coverage per base (Default: 100)
--minBaseMutations int Minimum mutations per base (Default: 2)
--minMutationFreq float Bases below this mutation frequency will be discarded as noise (Default: 0.005)
--minReadMutations float Reads with fewer than this number of mutations will be discarded as non-informative (Default: 2)
Spectral deconvolution options
--maxClusters int Maximum allowed number of clusters/conformations (Default: unlimited)
--minFilteredReads int Minimum number of reads (post-filtering) to perform spectral analysis (Default: 5)
--minPermutations int Minimum number of permutations performed to build the null model (Default: 8)
--maxPermutations int Maximum number of permutations performed to build the null model (Default: 400)
--firstEigengapShift float How much the mean of the first eigengap's null model should be shifted by (Default: 0.9)
--eigengapCumRelThresh float Minimum relative difference between the eigengap and the null model, as a fraction of the cumulative difference between the previous eigengaps and their respective null models (Default: 0.1)
Note: this does not apply to the first eigengap
--eigengapAbsThresh float Minimum absolute difference between the eigengap and the null model (Default: 0.03)
--alpha float Below this p-value, the null hypothesis is rejected and the eigengap is marked as informative (Default: 0.01)
--beta float Above this p-value, the alternative hypothesis is rejected and the eigengap is marked as non-informative (Default: 0.2)
Note: this threshold does not apply to the first eigengap
--firstEigengapBeta float Beta p-value threshold for the first eigengap (Default: 0.4)
--minNullStdev float Minimum standard deviation for the null model (Default: 0.025)
Note: when this threshold is not met, --extraPermutations additional permutations will be performed
--extraPermutations int Additional permutations to perform when the standard deviation of the null model is < --minNullStdev (Default: 50)
--minWinBases int Minimum number of bases in window (post-filtering) to perform spectral analysis (Default: 10)
--lookaheadEigengaps int Number of eigengaps to look ahead after a non-informative eigengap is encountered (Default: 3)
--saveEigengapData Saves eigengap data for plotting (Default: off)
--eigengapDataOut string Eigengap data output folder (Default: ./eigengap_data)
Graph-Cut options
--minClusterFraction float Minimum fraction of reads assigned to each cluster/conformation (Default: 0.05)
Note: if this threshold is not met, the number of clusters is automatically decreased
--softClusteringInits float Number of iterations for the initialization step of the graph-cut (Default: 500)
Note: the initialization with the lowest score is picked
--softClusteringIters float Number of refinement iterations performed on the graph-cut (Default: 20)
Note: the cut with the lowest score is picked
--softClusteringWeightModule float Magnitude of the weight adjustment applied to the cluster weights in order to find the lowest score (Default: 0.005)
Windowed analysis options
--winLenFraction float Length of the window as a fraction of the median read length (Default: 0.9)
Note: this parameter and --absWinLen are mutually exclusive
--absWinLen int Absolute length of the window (Default: 0)
Note: this parameter and --winLenFraction are mutually exclusive
--winOffsetFraction float Slide offset as fraction of the size of the window (Default: 0.05)
Note: this parameter and --absWinOffset are mutually exclusive
--absWinOffset int Absolute slide offset (Default: 0)
Note: this parameter and --winOffsetFraction are mutually exclusive
--maxIgnoreWins int Maximum number of internal windows with a different number of clusters to ignore when merging two external sets of windows (Default: 0)
--minExtWins int Minimum number of external windows, having the same number of clusters, needed to trigger merging (Default: 6)
--nonInformativeToSurround Non-informative windows (windows with 0 detected clusters) are set to the same number of clusters as the surrounding windows (Default: off)
--allNonInformativeToOne If all windows in the transcript are non-informative (0 clusters), the number of clusters is forced to 1 (Default: off)
--reportNonInformative Also reports non-informative windows in the output JSON file


Understanding the algorithm

While it is advisable for most users to run DRACO with its default parameters, as these are the result of a careful and thorough calibration, in some cases it might be useful to adjust the analysis on a case-by-case basis.

DRACO analysis is performed in windows. By default, the size of the window is winLenFraction × the median read length; an arbitrary window length (in bp) can also be used by setting absWinLen. Similarly, the slide offset is by default winOffsetFraction × the length of the window, but an arbitrary offset (in bp) can be specified via absWinOffset. When long reads are used, reducing the size of the window might allow the identification of small, highly dynamic regions of the target transcript. For instance, if 300 bp-long reads are being used and the structurally dynamic region within the target transcript is only 90 nt long, the spectral analysis might not be able to detect it (especially when probing with DMS, as only ~50% of the bases will be informative); in these situations, reducing the size of the window can increase the sensitivity of the analysis.
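
For example, assuming ~300 nt reads and the default winLenFraction and winOffsetFraction, the window length and slide offset would be computed roughly as follows (the exact rounding used by DRACO may differ):

import statistics

read_lengths = [298, 301, 300, 299, 302]                  # hypothetical read lengths
window_len = int(0.9 * statistics.median(read_lengths))   # --winLenFraction -> 270
window_offset = max(1, int(0.05 * window_len))            # --winOffsetFraction -> 13
print(window_len, window_offset)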

The spectral deconvolution represents the most critical step of the algorithm, as it is responsible for determining the number of conformations formed by each window. During this step, the eigengaps, representing the distances between consecutive eigenvalues, are compared to a null model built by performing random permutations of the original data matrix. The distribution of the null model is well approximated by a Weibull distribution. The number of permutations performed varies for each eigengap, ranging between minPermutations and maxPermutations. If the standard deviation (σ) of the null model is < minNullStdev, extraPermutations additional permutations are performed, until this threshold is met or maxPermutations is reached.
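
The sketch below illustrates the idea behind the permutation-based null model for a single eigengap, assuming that the columns (bases) of the read × base mutation matrix are shuffled independently to destroy the co-mutation structure; the actual permutation scheme, and the fitting of the Weibull distribution, may differ in DRACO:

import numpy as np

def eigengap_profile(mutations):
    # Adjacency -> normalized Laplacian -> sorted eigenvalues -> eigengaps.
    m = mutations.astype(float)
    adjacency = m.T @ m
    np.fill_diagonal(adjacency, 0.0)
    degree = adjacency.sum(axis=1)
    d_inv_sqrt = np.zeros_like(degree)
    d_inv_sqrt[degree > 0] = 1.0 / np.sqrt(degree[degree > 0])
    laplacian = np.eye(len(degree)) - d_inv_sqrt[:, None] * adjacency * d_inv_sqrt[None, :]
    return np.diff(np.sort(np.linalg.eigvalsh(laplacian)))

rng = np.random.default_rng(1)
mutations = rng.random((2000, 50)) < 0.03                 # hypothetical toy data

null_first_eigengap = []
for _ in range(8):                                        # minPermutations; more can be added
    shuffled = np.column_stack([rng.permutation(col) for col in mutations.T])
    null_first_eigengap.append(eigengap_profile(shuffled)[0])

mu = float(np.mean(null_first_eigengap))
sigma = float(np.std(null_first_eigengap))
print(eigengap_profile(mutations)[0], mu, sigma)          # observed eigengap vs. its null model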

Starting with the first eigengap, DRACO compares each eigengap to its respective null model. The aim of this comparison is to determine which eigengaps can be considered informative, as this directly translates into the number of conformations/clusters formed by the analyzed window. For instance, if the first two eigengaps are informative, two conformations (clusters) are present; if the first three eigengaps are informative, three conformations are present, and so on.

[Figure: DRACO eigengaps]

In order to be considered informative, each eigengap must fulfill a number of criteria:

  1. The distance (d) between the mean (μ) of the null model and the eigengap must be ≥ eigengapAbsThresh.

  2. The distance between μ and the eigengap must be ≥ eigengapCumRelThresh × the cumulative sum of the distances of the previous eigengaps from their respective null models. For instance, d4, the distance between the 4th eigengap and its null model, must be ≥ eigengapCumRelThresh × (d1 + d2 + d3). This, of course, does not apply to the first eigengap.

  3. A t-test is used to assess whether the distance between the eigengap and the null model is significant. If the p-value is ≤ alpha, the null hypothesis is rejected, and the eigengap is marked as informative. If the p-value is ≥ beta (or, in the case of the first eigengap, ≥ firstEigengapBeta), the alternative hypothesis is rejected, and the eigengap is marked as non-informative. If the p-value falls between these two thresholds, neither of the two hypotheses can be rejected, so additional permutations are performed. If, after maxPermutations permutations have been performed, the p-value still falls between the two thresholds, the alternative hypothesis is rejected and the eigengap is marked as non-informative.

The analysis normally terminates when a non-informative eigengap is encountered, as the subsequent eigengaps are usually non-informative as well. In some particular cases, however, eigengaps have been observed to behave unexpectedly: although an eigengap is marked as non-informative, one or more eigengaps immediately downstream of it are again informative. To account for these situations, whenever a non-informative eigengap is encountered, DRACO can perform a lookahead evaluation of the lookaheadEigengaps subsequent eigengaps. If one of these is marked as informative, the analysis of the eigengaps is allowed to continue.
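
The sketch below summarizes one possible reading of these decision rules, using the parameter names from the table above; the exact statistical test performed by DRACO may differ from the one-sample t-test assumed here:

import numpy as np
from scipy import stats

def classify_eigengap(observed, null_values, prev_distances,
                      eigengap_abs_thresh=0.03, eigengap_cum_rel_thresh=0.1,
                      alpha=0.01, beta=0.2):
    """null_values: eigengap values from the permuted matrices;
    prev_distances: distances d of the previous eigengaps from their null models."""
    distance = observed - float(np.mean(null_values))

    # Criterion 1: minimum absolute distance from the null model mean.
    if distance < eigengap_abs_thresh:
        return "non-informative"

    # Criterion 2: minimum distance relative to the cumulative distance of the
    # previous eigengaps (does not apply to the first eigengap).
    if prev_distances and distance < eigengap_cum_rel_thresh * sum(prev_distances):
        return "non-informative"

    # Criterion 3: assumed one-sample t-test of the null values against the
    # observed eigengap. For the first eigengap, firstEigengapBeta (and
    # firstEigengapShift) would apply instead; omitted here for brevity.
    pvalue = stats.ttest_1samp(null_values, popmean=observed, alternative="less").pvalue
    if pvalue <= alpha:
        return "informative"
    if pvalue >= beta:
        return "non-informative"
    return "undetermined"   # perform additional permutations and test again

# Example (made-up numbers): a clearly separated first eigengap
print(classify_eigengap(0.35, [0.11, 0.12, 0.10, 0.13, 0.12, 0.11, 0.12, 0.13], []))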

Once the number of conformations has been determined, the algorithm uses a normalized Graph-Cut approach to weight each base in the window according to its affinity to each of the different conformations. This approach tries to find the best way to partition the previously built graph. It is composed of two steps:

  1. Initialization. During this step, each base of each cluster is assigned a random weight. The step is repeated softClusteringInits times and the solution with the lowest score is picked.
  2. Refinement. During this step, the weights of the bases are dynamically altered by softClusteringWeightModule in an attempt to further minimize the score. This step is repeated softClusteringIters times and the solution with the lowest score is picked.

Note: in the original implementation of DRACO, this step was carried out only once. While significantly faster, this came with a higher risk of finding suboptimal graph partitioning solutions, corresponding to local score minima. Lowering softClusteringWeightModule and increasing softClusteringIters can slow down the analysis, but significantly increases the likelihood that the true optimal solution is identified, further ensuring the reproducibility of the analysis results.
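
To illustrate the two steps above, the following sketch performs a random initialization followed by small perturbations of the weights by ± softClusteringWeightModule, keeping the solution with the lowest score; the score function used here is a generic soft normalized-cut objective chosen for illustration, not DRACO's actual scoring function, and all names are made up:

import numpy as np

def soft_cut_score(adjacency, weights):
    # weights: (n_bases, n_clusters) soft membership of each base in each cluster.
    degree = adjacency.sum(axis=1)
    score = 0.0
    for c in range(weights.shape[1]):
        w = weights[:, c]
        assoc = float(w @ degree)
        cut = float(w @ adjacency @ (1.0 - w))
        score += cut / assoc if assoc > 0 else 0.0
    return score

def graph_cut(adjacency, n_clusters, inits=500, iters=20, weight_module=0.005, seed=0):
    rng = np.random.default_rng(seed)
    n_bases = adjacency.shape[0]

    # 1. Initialization: random weights, repeated `inits` times; keep the best.
    best = min((rng.random((n_bases, n_clusters)) for _ in range(inits)),
               key=lambda w: soft_cut_score(adjacency, w))

    # 2. Refinement: nudge individual weights by +/- weight_module, keeping only
    #    the changes that lower the score; repeated `iters` times.
    for _ in range(iters):
        for i in range(n_bases):
            for c in range(n_clusters):
                for delta in (weight_module, -weight_module):
                    trial = best.copy()
                    trial[i, c] = np.clip(trial[i, c] + delta, 0.0, 1.0)
                    if soft_cut_score(adjacency, trial) < soft_cut_score(adjacency, best):
                        best = trial
    return best

# Example usage: weights = graph_cut(adjacency, n_clusters=2) on a co-mutation graph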

Following identification of the optimal graph partitioning, reads are assigned to each conformation. If the fraction of reads assigned to a conformation is < minClusterFraction, the number of conformations for the window is decreased by one, and the step is repeated.
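
A minimal sketch of this check is shown below; deconvolve(reads, k) is a hypothetical stand-in for any routine that assigns each read to one of k conformations (it is not part of DRACO):

def assign_with_min_fraction(deconvolve, reads, n_clusters, min_cluster_fraction=0.05):
    # Decrease the number of conformations until every cluster receives at
    # least min_cluster_fraction of the reads.
    while n_clusters > 1:
        assignments = deconvolve(reads, n_clusters)
        fractions = [assignments.count(c) / len(assignments) for c in range(n_clusters)]
        if min(fractions) >= min_cluster_fraction:
            return n_clusters, assignments
        n_clusters -= 1
    return 1, deconvolve(reads, 1)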

The final step of the analysis involves merging consecutive windows found to form the same number of conformations into window sets. Consider, however, the following case:

[Figure: example of window set merging]

In this situation, two sets of windows found to form two conformations are interrupted by one window forming one conformation. This often happens with short reads (<100 bp), as the analyzed window might not contain enough information to identify the coexisting conformations. By default (left), DRACO would report three window sets. It is possible to account for these cases by allowing DRACO to ignore at most maxIgnoreWins internal windows forming a discordant number of conformations. For the merging to occur, the total number of external windows on either side of the discordant internal windows must be ≥ minExtWins.

It is, however, possible for the windows on the left and right sides of the discordant internal windows to form different numbers of conformations:

[Figure: window set merging with discordant flanking sets]

In such a case, the internal discordant windows will be merged into a single set with the windows on one side only if the number of windows on that side is greater than the number of windows on the other side, and is also ≥ minExtWins.
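
The sketch below captures one interpretation of this merging logic for a single run of discordant internal windows flanked by two window sets (the function and its arguments are illustrative, not DRACO's code):

def merge_window_sets(left, internal, right, max_ignore_wins=0, min_ext_wins=6):
    """left / right: (number of conformations, number of windows) of the two
    flanking window sets; internal: number of discordant windows between them.
    Returns the number of conformations assigned to the internal windows after
    merging, or None if no merging occurs."""
    (left_confs, left_wins), (right_confs, right_wins) = left, right
    if internal > max_ignore_wins:
        return None
    if left_confs == right_confs:
        return left_confs if left_wins + right_wins >= min_ext_wins else None
    # Flanking sets disagree: merge with the larger side, if it is large enough.
    if left_wins > right_wins and left_wins >= min_ext_wins:
        return left_confs
    if right_wins > left_wins and right_wins >= min_ext_wins:
        return right_confs
    return None

# Example (made-up window counts): two sets forming 2 conformations, separated
# by a single window forming 1 conformation
print(merge_window_sets((2, 7), 1, (2, 6), max_ignore_wins=1, min_ext_wins=6))   # -> 2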

Output JSON files

DRACO produces a JSON file, with the following structure:

{
   "filename":"sample.mm",
   "transcripts":[
      {
         "id":"RNA_1",
         "sequence":"TCTATTCTACATTGATAGAACAACTCT...TCTGTCTATCACCGCTAGAGCACTCGGTGATTGCA",
         "nReads":20000,
         "windows":[
            {
               "start":0,
               "end":294,
               "stoichiometries":[ 0.388,0.304,0.308 ],
               "counts":[
                  [ 0,1,0,23,0,0,33,0,54,0,5,0,0,...,301,0,334,0,3,1,0,0,0,0,0,0 ],
                  [ 0,6,0,0,0,0,0,0,0,0,36,0,0,2,...,0,0,0,0,0,0,134,0,86,0,0,62,63 ],
                  [ 0,1,0,17,0,0,57,0,0,0,75,0,11,...,181,0,0,0,0,0,0,0,0,118,0,0,0,0 ]
               ],
               "coverage":[
                  [ 40,69,121,157,194,245,297,337,378,416,465,...,521,482,462,430,404,359,311,287,250,211 ],
                  [ 37,73,108,135,163,198,238,279,317,340,369,...,534,512,481,463,440,414,380,240,204,173 ],
                  [ 35,66,107,141,168,206,250,292,337,380,415,...,428,394,364,336,309,271,240,199,181,111 ]
               ],
               "weights":[
                  [ 0.000,1.000,0.000,0.086,0.000,0.055,0.000,0.000,...,0.110,0.110,0.000,0.000,0.000,0.000,0.000,0.000 ],
                  [ 0.039,0.035,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.922,0.000,0.000,0.035,0.925,0.000,0.000 ],
                  [ 0.165,0.000,0.000,0.855,0.000,0.020,0.000,0.855,...,0.000,0.000,0.000,0.055,0.000,0.055,0.000,0.835 ]
               ],
               "preCoverage":[ 23979,24261,24506,24751,24996,25257,25539,...,25813,26078,26340,26616,26889,27182,27441,27719,28009 ]
            }
         ]
      }
   ]
}

The transcripts array contains all the transcripts processed by DRACO. Each transcript includes the following keys:

Key Description
id Transcript's ID
sequence Transcript's sequence
nReads Total number of reads mapping to the transcript
windows An array containing the deconvolution results for each window set


The windows array is further structured as follows:

Key Description
start Start position of the window (0-based)
end End position of the window (1-based)
stoichiometries An array containing the relative abundances for the reconstructed conformations
counts An array of arrays, each one containing the number of mutations per base for each of the reconstructed conformations
coverage An array of arrays, each one containing the coverage per base for each of the reconstructed conformations, calculated using all the assigned reads post-filtering
weights An array of arrays, each one containing the weight per base from the Graph-Cut analysis
preCoverage An array containing the total coverage per base along the window, calculated using only the (pre-filtering) reads covering the entire window that have been used to perform the spectral analysis (therefore, it can be lower than the sum of the coverage arrays)
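
As an example of downstream processing, the following Python snippet reads the output JSON using the keys described above and computes raw per-base mutation frequencies (counts / coverage) for each conformation of each window set (the file name is hypothetical; proper reactivity normalization is handled by the RNA Framework modules described below):

import json

with open("draco_deconvolved.json") as handle:    # hypothetical file name
    data = json.load(handle)

for transcript in data["transcripts"]:
    for window in transcript["windows"]:
        print(transcript["id"], window["start"], window["end"], window["stoichiometries"])
        for conformation, (counts, coverage) in enumerate(zip(window["counts"], window["coverage"])):
            # Raw mutation frequencies per base for this conformation.
            freqs = [m / c if c > 0 else 0.0 for m, c in zip(counts, coverage)]
            print("  conformation", conformation, ["%.3f" % f for f in freqs[:10]], "...")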


Post-processing of DRACO output

DRACO JSON output files can be somewhat hard to interpret and process. To facilitate this operation, we developed the rf-json2rc module, which is included in the RNA Framework. This module allows converting the reactivity profiles deconvolved by DRACO into RNA Count (RC) files, which can subsequently be normalized with the rf-norm module and then fed to the rf-fold module for secondary structure modeling. For additional details, please refer to the rf-json2rc docs.