consecutive_patterns


This function identifies patterns in the strand orientation of consecutive occurrences of a feature.


Required inputs:

Argument

Explanation

paths

One or more BED-formatted files.


Optional inputs:

Argument

Explanation

--names


Assign a name to each input BED file to make the output more readable.

--min_distance

Minimum distance between consecutive occurrences to consider in the analysis. Default is 0.

--max_distance

Maximum distance between consecutive occurrences to consider in the analysis. Default is 100.

--patterns


Patterns to search for, comma-separated. Default is the same/opposite strand orientation analysis.

--orientation

Orient file(s) relative to annotated BED-formatted file(s) and perform the analysis for the unannotated file using the new annotations.

--bins

Number of bins into which the analysis is subdivided. Default is 1, which does not perform this analysis.

--threshold

P-value threshold for the consecutive patterns to save in a new BED file.

--simulations

Number of simulations from which the empirical p-value is derived. Default is 100.

--plots

Returns the associated plots of the asymmetries for each file.
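
The distance window and the default same/opposite orientation analysis can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function names, the gap definition (start of the next record minus end of the previous one), and strand shuffling as the null model for the empirical p-value are all assumptions made for illustration.

```python
import random

def count_orientations(records, min_distance=0, max_distance=100):
    """Count same/opposite strand pairs among consecutive records.

    records: list of (chrom, start, end, strand) tuples sorted by chrom, start.
    The gap is taken as next.start - previous.end (an assumption).
    """
    counts = {"same": 0, "opposite": 0}
    for prev, nxt in zip(records, records[1:]):
        if prev[0] != nxt[0]:
            continue  # never pair records across chromosomes
        gap = nxt[1] - prev[2]
        if min_distance <= gap <= max_distance:
            key = "same" if prev[3] == nxt[3] else "opposite"
            counts[key] += 1
    return counts

def empirical_p_value(records, simulations=100, seed=0, **window):
    """Share of strand-shuffled datasets with at least as many 'same' pairs."""
    rng = random.Random(seed)
    observed = count_orientations(records, **window)["same"]
    strands = [r[3] for r in records]
    hits = 0
    for _ in range(simulations):
        rng.shuffle(strands)  # hypothetical null model: permute strand labels
        shuffled = [(c, s, e, st) for (c, s, e, _), st in zip(records, strands)]
        if count_orientations(shuffled, **window)["same"] >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (simulations + 1)

# Toy input: only the first two pairs fall inside the default 0-100 window.
records = [
    ("chr1", 100, 110, "+"),
    ("chr1", 150, 160, "+"),
    ("chr1", 180, 190, "-"),
    ("chr1", 900, 910, "-"),
    ("chr2", 50, 60, "+"),
]
counts = count_orientations(records)
p_same = empirical_p_value(records, simulations=50)
```

With more simulations the empirical p-value has finer resolution, since its smallest attainable value in this sketch is 1 / (simulations + 1).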


Outputs:

  • Table of strand asymmetries for all patterns for each motif.

  • BED files with the statistically significant consecutive regions, one file per pattern per input file.

  • Barplots of expected and observed consecutive occurrences of each of the patterns.

  • Plots showing the distribution of consecutive occurrences for each pattern.

  • Heatmap of the signal subdivided by distance bin across all patterns, if --bins is selected.


Note

The results from files consisting of palindromic motifs will not show strand asymmetries.


Note

  • The default strand asymmetry estimation is obtained as shown in the tutorial of this function. When the option for custom patterns is selected, strand asymmetries are instead estimated with the following procedure.

We define p as the probability that the motif is found in strand An, calculated as the number of appearances of the motif in strand An divided by the total number of appearances of the motif in the genome. Similarly, we define q as the probability that the motif is found in strand Bn. Since there are only two strands, p = 1 - q. The probability that any given pattern emerges by chance is estimated from the number of “+” and “-” signs it contains: a pattern that contains u “+” signs and v “-” signs should appear with probability p^u * q^v. We also apply the Bonferroni correction, based on the total number of motif occurrences. We can thus determine the p-value of any given pattern emerging.
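
The custom-pattern estimate described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's code: the function name `pattern_p_value` is hypothetical, "+" is taken to be the strand with probability p, and the Bonferroni correction is applied by multiplying the pattern probability by the total number of motif occurrences and capping the result at 1.

```python
def pattern_p_value(pattern, plus_count, minus_count):
    """Bonferroni-corrected chance probability of a strand pattern such as '++-'."""
    total = plus_count + minus_count
    if total == 0:
        raise ValueError("at least one motif occurrence is required")
    p = plus_count / total  # probability of the '+' strand
    q = 1 - p               # probability of the '-' strand
    u = pattern.count("+")
    v = pattern.count("-")
    if u + v != len(pattern):
        raise ValueError("pattern may contain only '+' and '-'")
    raw = (p ** u) * (q ** v)
    # Assumed form of the correction: multiply by the number of occurrences.
    return min(1.0, raw * total)

# Ten consecutive '+' occurrences of a motif seen 50 times on each strand.
long_run = pattern_p_value("+" * 10, 50, 50)
# A short pattern is not significant after correction under this assumption.
short_run = pattern_p_value("++--", 50, 50)
```

Under this reading of the correction, short patterns in large datasets are capped at 1, so only longer runs can reach significance.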