DRACO uses a sliding window approach to analyze short reads tiling long target transcripts. The default size of the window is 100 nt, slid with an offset equal to 1% the window's size.

DRACO Workflow

For each window, the following analysis steps are performed:

Reads covering the entire window are used to build a graph. In the graph, each base is a vertex; when two bases are observed to co-mutate within the same read, they are connected by an edge in the graph. Each edge, is weighted according to the co-mutation frequency of the two bases (or essentialy, the number of reads in which both bases are simultaneously mutated).
From the adjacency matrix of the graph, the normalized Laplacian matrix is calculated.
Spectral clustering is applied to this matrix, to identify the number of co-existing structural conformations (clusters) for the window being analyzed.
Graph-Cut is then used to weight each vertex in the graph according to its affinity to each conformation.
The window is slid by the chosen offset, and steps 1-4 are repeated, until the entire transcript has been analyzed.
Consecutive windows forming the same number of conformations are grouped into window sets and merged.
Reads are assigned to each window set
For each window set, the relative conformation stoichiometries, and the individual reactivity profiles are reconstructed. Window sets found to form different numbers of conformations, are reported separately.

When analyzing a transcript, DRACO keeps in memory all the reads mapping to that transcript. At extreme sequencing depths (>500,000X), the memory consumption can become prohibitive on some system. If your system does not have sufficient RAM, reads can be randomly downsampled along the transcript, to achieve a lower mean coverage. Usually, a coverage of 10,000-20,000X is sufficient for DRACO to efficiently deconvolve the underlying structural heterogeneity.

For additional information concerning downsampling, please consult the documentation for the rf-count module of the RNA Framework.

Usage

To list the required parameters, simply type:

$ draco --help

Parameter	Type	Description
--mm	string	Input mutation map (MM) file Note: to pass multiple replicates, simply call the parameter multiple times (e.g., `--mm rep1.mm --mm rep2.mm)
--output	string	Output JSON file (Default: draco_deconvolved.json)
--processors	int	Number of processors to use (Default: 0) Note: when set to 0, all available processors will be used
--whitelist	string	A whitelist file, containing the IDs of the transcripts to be analyzed, one per raw (Default: all transcripts will be analyzed)
--shape		Enables spectral analysis on all four bases (Default: only A/C bases) Note: this feature is highly experimental
--outputRawNClusters	string	Outputs a BED-like file containing, for each transcript, the raw number of clusters detected for each window
--log-level	string	Specifies the log level (allowed values: trace, debug, info, warn, error)
		Mutation filtering options
--minBaseCoverage	int	Minimum coverage per base (Default: 100)
--minBaseMutations	int	Minimum mutations per base (Default: 2)
--minMutationFreq	float	Bases below this mutation frequency will be discarded as noise (Default: 0.005)
--minReadMutations	float	Reads with fewer than this number of mutations will be discarded as non-informative (Default: 2)
		Spectral deconvolution options
--maxClusters	int	Maximum allowed number of clusters/conformations (Default: unlimited)
--minFilteredReads	int	Minimum number of reads (post-filtering) to perform spectral analysis (Default: 100)
--minPermutations	int	Minimum number of permutations performed to build the null model (Default: 10)
--maxPermutations	int	Maximum number of permutations performed to build the null model (Default: 100)
--ignoreFirstEigengap		The first eigengap is ignored
--eigengapCumRelThresh	float	Minimum relative difference between the eigengap and the null model, as a fraction of the cumulative difference between the previous eigengaps and their respective null models (Default: 0.1)
--eigengapAbsThresh	float	Minimum absolute difference between the eigengap and its null model (Default: 0.03)
--alpha	float	Below this p-value, the null hypothesis is rejected and the eigengap is marked as informative (Default: 0.01)
--beta	float	Above this p-value, the alternative hypothesis is rejected and the eigengap is marked as non-informative (Default: 0.2)
--minNullStdev	float	Minimum standard deviation for the null model (Default: 0.025) Note: when this threshold is not met, `--extraPermutations` additional permutations will be performed
--extraPermutations	int	Additional permutations to perform when the standard deviation of the null model is < `--minNullStdev` (Default: 50)
--minWinBases	int	Minimum number of bases in window (post-filtering) to perform spectral analysis (Default: 10)
--lookaheadEigengaps	int	Number of eigengaps to look ahead after a non-informative eigengap is encountered (Default: 0)
--saveEigengapData		Saves eigengap data for plotting (Default: off)
--eigengapDataOut	string	Eigengap data output folder (Default: ./eigengap_data)
		Graph-Cut options
--minClusterFraction	float	Minimum fraction of reads assigned to each cluster/conformation (Default: 0.005) Note: if this threshold is not met, the number of clusters is automatically decreased
--softClusteringInits	float	Number of iterations for the initialization process of the graph-cut. Note: The initialization with the lowest score is picked (Default: 500)
--softClusteringIters	float	Number of iterations performed on graph-cut Note: The cut with the lowest score is picked (Default: 50)
--softClusteringWeightModule	float	The module of the weight that is used to change the cluster weights in order to find the lowest score (Default: 0.005)
		Windowed analysis options
--winLen	float	Length of the window. If this value is comprised between 0 and 1, it is interpreted as a fraction of the median read length. If this value is > 1, it is interpreted as the absolute length of the window (Default: 100) Note: this parameter and `--winLenFracRnaLen` are mutually exclusive
--winLenFracRnaLen	float	Length of the window as fraction of the length of the RNA Note: this parameter and `--winLen` are mutually exclusive
--winOffset	float	Window sliding offset. If this value is comprised between 0 and 1, it is interpreted as a fraction of the window's length. If this value is ≥ 1 , it is interpreted as the number of bases to slide the window by (Default: 0.01)
--maxIgnoreWins	int	Maximum number of internal windows with a different number of clusters to ignore when merging two external sets of windows (Default: 0)
--minExtWins	int	Minimum number of external windows, having the same number of clusters, needed to trigger merging (Default: 6)
--nonInformativeToSurround	int	Non-informative windows (windows with 0 detected clusters) are set to the same number of clusters of surrounding windows (Default: false)
--allNonInformativeToOne	int	If all windows in the transcript are non-informative (0 clusters), the number of clusters is forced to 1 (Default: off)
--reportNonInformative	int	Reports also non-informative windows in the output JSON file
--assignmentsDumpDir	string	When specified, a folder is generated containing an MM file for each set of merged windows, containing a dump of the reads assigned to each conformation
--skipAmbiguousAssignments		When specified, if the best assignment score for a read is equal across multiple clusters, the read is discarded
--minWindowsOverlap		Minimum overlap between non-contiguous windows with the same number of conformations in order to merge them in the same set. The value must be comprised between 0 and 1, and it is interpreted as a fraction of the window length (Default: 1)

## Understanding the algorithm

Note

Since v1.3 it is possible to follow step-by-step the analysis perfromed by DRACO by setting the parameter --log-level trace

While it is advisable for most users to run DRACO with its default parameters, as these are the results of a careful and thorough calibration, it might be useful to adjust the analysis on a case-by-case basis.

DRACO analysis is performed in sliding windows. The size of the window (--winLen) is by default 100, and the sliding offset (--winOffset) is by default 1% of the length of the window. We observed that a window size of 100 is optimal in most cases, as it allows the identification of both small and large structurally-heterogeneous regions. This length is recommended even if longer reads have been sequenced, as, if the structurally-heterogeneous region in the transcript is much smaller than the read length (for example, 300 bp-long reads and a structurally-heterogeneous region of 90 nt), information might get lost in the spectral deconvolution.

The spectral deconvolution represents the most critical step of the algorithm, as this is responsible for determining the number of conformations populated by each window. During this step, the eigengaps, representing the distance between consecutive eigenvalues, are compared to a null model, built by performing random permutations over the original data matrix. The distribution of the null model is well approximated by a Weibull distribution. The number of permutations performed varies for each eigengap, and it is comprised between --minPermutations and --maxPermutations. Permutations are performed as long as the standard deviation (σ) of the null model is < --minNullStdev, increasing by --extraPermutations.

Starting with the first eigengap, DRACO compares each eigengap to its respective null model. Aim of this comparison is to determine which eigengaps can be considered to be informative, as this directly translates into the number of conformations/clusters populated by the analyzed window. For instance, if the first two eigengaps are informative, two conformations (clusters) are present; if the first three eigengaps are informative, three conformations are present, and so on.

DRACO eigengaps

In order to be considered informative, each eigengap must fulfill a number of criteria:

The distance (d) between the mean (μ) of the null model and the eigengap must be ≥ --eigengapAbsThresh.
The distance between the μ and the eigengap must be greater than --egengapCumRelThreshold × the cumulative sum of the previous eigengaps. For instance, d₄, the distance between the 4^th eigengap and its null model, must be ≥ --egengapCumRelThreshold × (d₁ + d₂ + d₃). This, of course, does not apply to the first eigengap
A t-test is used to assess whether the distance between the eigengap and the null model is significant. If the p-value is ≤ --alpha, the null hypothesis is rejected, and the eigengap is marked as informative. If the p-value is ≥ --beta, the alternative hypothesis is rejected, and the eigengap is marked as non-informative. If the p-value is comprised between these two thresholds, neither of the two hypotheses can be rejected, so additional permutations are performed. If, after having performed --maxPermutations permutations the p-value is still comprised between the two thresholds, then the alternative hypothesis is rejected and the eigengap is marked as non-informative.

The analysis terminates when a non-informative eigengap is encountered as, normally, the subsequent eigengaps are non-informative as well. In some particular cases, however, we have eigengaps to behave unexpectedly. For instance, at times the first eigengap might overlap with its null model, which would cause DRACO to report 0 clusters and terminate the analysis for that window. This can be avoided by enabling the --ignoreFirstEigengap parameter. In other cases, although an eigengap is marked as non-informative, one or more eigengaps immediately downstream of it are again informative. To account for these situations, whenever a non-informative eigengap is encountered, DRACO can perform a lookahead evaluation of the --lookaheadEigengaps subsequent eigengaps. If one of these is marked as informative, then the analysis of the eigengaps is allowed to continue. This analysis is disabled by default.

Unless the parameter --outputRawNClusters had been enabled (which would cause DRACO to only report the raw number of identified clusters/conformations per window), once the number of conformations has been determined, the algorithm uses a normalized Graph-Cut approach to weight each base in the window, according to its affinity to each of the different conformations. This approach tries to find the best way to partition the previously built graph. It is composed of two steps:

Initialization. During this step, each base of each cluster is assigned a random weight. The step is repeated --softClusteringInits times and the solution with the lowest score is picked.
Refinement. During this step, the weights of the bases are dynamically altered by --softClusteringWeightModule in an attempt to further minimize the score. This step is repeated --softClusteringIters and the solution with the lowest score is picked.

Note

In the original implementation of DRACO, this step was carried out only once. While being significantly faster, the downside was a higher risk of finding suboptimal graph partitioning solutions, corresponding to local score minima. Lowering the --softClusteringWeightModule and increasing the --softClusteringIters can slow down the analysis, but it significantly increases the likelihood that the true optimal solution is identified, further ensuring reproducibility of the analysis results.

Final step of the analysis involves merging consecutive windows found to form the same number of conformations, into window sets. Let's however consider the following case:

DRACO eigengaps

In this situation, two sets of windows found to form two conformations, are interrupted by one window forming one conformation. By default (see figure above, left panel), DRACO would report three window sets. There are two ways to handle these cases:

The first possibility is to make DRACO ignore at most --maxIgnoreWins internal windows forming a discordant number of conformations. For the merging to occur, the total number of external windows on either sides of the discordant internal windows must be ≥ --minExtWins (see figure above, right panel).
The second possibility (preferred, no information loss) is to make DRACO report every window set it did identify, but allowing non-contiguous window sets forming the same number of conformations to be merged, provided that the overlap between the last window of one set and the first window of the following set is at least --minWindowsOverlap × the window length (e.g., --minWindowsOverlap 0.5 requires a minimum overlap of 50% of the window length; see figure below).

DRACO eigengaps

Following window merging, reads are assigned to each conformation. If the fraction of reads assigned to a conformation is < --minClusterFraction × the number of conformations for that window set is decreased by one, and the step is repeated.

Handling of replicates

Since v1.3, DRACO can handle replicate experiments. Multiple MM files can be passed by simply calling the --mm parameter multiple times.

During the spectral analysis DRACO will separately determine the number of conformations populated by corresponding windows across replicates, and the number of conformations for a given window will be set as the median of the number of conformations identified across all replicates. For instance, in an experiment with 3 replicates, if two identified 1 conformation and one identified 2 conformations, the number of conformations for the window will be set to 1. Similarly, if the three replicates respectively identified 1, 2 and 3 conformations, the number of conformations for the window will be set to 2.

During the Graph-Cut part of the algorithm, graph partitioning and score minimization will be simultaneously performed on the graphs of each replicate, hence avoiding experiment-specific local minima and ensuring optimal parititioning.

Output JSON files

DRACO produces a JSON file, with the following structure:

{
   "filenames":["sample.mm"],
   "transcripts":[
      {
         "id":"RNA_1",
         "sequence":"TCTATTCTACATTGATAGAACAACTCT...TCTGTCTATCACCGCTAGAGCACTCGGTGATTGCA",
         "nReads":[20000],
         "windows":[
            [{
               "start":0,
               "end":294,
               "stoichiometries":[ 0.388,0.304,0.308 ],
               "counts":[
                  [ 0,1,0,23,0,0,33,0,54,0,5,0,0,...,301,0,334,0,3,1,0,0,0,0,0,0 ],
                  [ 0,6,0,0,0,0,0,0,0,0,36,0,0,2,...,0,0,0,0,0,0,134,0,86,0,0,62,63 ],
                  [ 0,1,0,17,0,0,57,0,0,0,75,0,11,...,181,0,0,0,0,0,0,0,0,118,0,0,0,0 ]
               ],
               "coverage":[
                  [ 40,69,121,157,194,245,297,337,378,416,465,...,521,482,462,430,404,359,311,287,250,211 ],
                  [ 37,73,108,135,163,198,238,279,317,340,369,...,534,512,481,463,440,414,380,240,204,173 ],
                  [ 35,66,107,141,168,206,250,292,337,380,415,...,428,394,364,336,309,271,240,199,181,111 ]
               ],
               "weights":[
                  [ 0.000,1.000,0.000,0.086,0.000,0.055,0.000,0.000,...,0.110,0.110,0.000,0.000,0.000,0.000,0.000,0.000 ],
                  [ 0.039,0.035,0.000,0.000,0.000,0.000,0.000,0.000,...,0.000,0.922,0.000,0.000,0.035,0.925,0.000,0.000 ],
                  [ 0.165,0.000,0.000,0.855,0.000,0.020,0.000,0.855,...,0.000,0.000,0.000,0.055,0.000,0.055,0.000,0.835 ]
               ],
               "preCoverage":[ 23979,24261,24506,24751,24996,25257,25539,...,25813,26078,26340,26616,26889,27182,27441,27719,28009 ]
            }]
         ]
      }
   ]
}

The transcripts array contains all the transcripts processed by DRACO. Each transcript includes the following keys:

Key	Description
id	Transcript's ID
sequence	Transcript's sequence
nReads	An array containing the total number of reads mapping to the transcript in each replicate
windows	An array of arrays containing the deconvolution results for each window set across each replicate being analyzed

The windows array is further structured as follows:

Key	Description
start	Start position of the window (0-based)
end	End position of the window (1-based)
stoichiometries	An array containing the relative abundances for the reconstructed conformations
counts	An array of arrays, each containing the number of mutations per base for each of the reconstructed conformations
coverage	An array of arrays, each containing the coverage per base for each of the reconstructed conformations, calculated using all the assigned reads post-filtering
weights	An array of arrays, each containing the weight per base from the Graph-Cut analysis
preCoverage	An array containing the total coverage per base along the window, calculated using only the reads pre-filtering, covering the entire window, that have been used to perform the spectral analysis (therefore, it can be < than the sum of the coverage arrays)

Since v1.3, in case multiple replicates are being analyzed, an additional clusterConfidence array is present right after the windows array, which is structured as it follows:

Key	Description
nClusters	Number of detected clusters/conformations
confidence	A value between 0 and 1 representing the fraction of replicates for which the spectral analysis found the above number of conformations (e.g., if nClusters is 2 and out of two replicates one was found to form 2 conformations and the other 3 conformations, confidence will be 0.5)
start	Start position of the window (0-based)
end	End position of the window (1-based)

Post-processing of DRACO output

DRACO JSON output files can be somehow hard to interpret and process. To facilitate this operation, we developed the rf-json2rc module, that has been included in the RNA Framework. The module allows converting the reactivity profiles deconvolved by DRACO into RNA Count (RC) files, that can be subsequently normalized with the rf-norm module, and then fed into the rf-fold module for secondary structure modeling. For additional details, please refer to the rf-json2rc docs.