CRISPR-associated systems are potent tools for genome manipulation. From gene knockouts to whole-genome screening to precise genome editing with an ever-growing set of nucleases with different activities, they support an extremely broad spectrum of applications. Whether the desired effect is regulation of gene expression or a therapeutic single-nucleotide modification, the crucial step is a careful selection of one or more guide RNA (gRNA) sequences.
In this article, inspired by our recent experience in the computationally supported design of gRNAs, we discuss the main areas where bioinformatics accelerates that process and improves gRNA selection.
To obtain precise effects of genome manipulation, one should consider several factors. The main issue is the experiment design. Where should the nuclease bind? Should it cut the DNA molecule to introduce an insertion or deletion at a particular position? Is there a need for more than one gRNA sequence to obtain the expected effect? Or is the goal to achieve a regulatory effect on gene expression?
Knowing the potential location of the target binding sequence, however, does not suffice. The experiment’s success is also determined by the properties of the selected gRNA sequence. Depending on its sequence features, a gRNA can show different levels of on-target activity, which can be estimated by analyzing the sequence and its neighboring nucleotide patterns.
The other major question about a gRNA is how specific it is: how many alternative sites the genome contains where it can bind, and whether they overlap with coding regions. In other words, it is necessary to estimate the risk of unwanted modifications.
Because protospacer adjacent motif (PAM) sequences are highly abundant in a reference genome, and their abundance depends on the nuclease type, the selection of gRNAs must be supported by computational tools. These tools aim to provide the user with a selection of gRNA sequences based on their location, specificity within the genome, and sequence properties.
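To make the abundance of PAM sites concrete, here is a minimal Python sketch of how candidate target sites could be enumerated. It assumes an NGG PAM (as used by SpCas9) and a 20-nucleotide spacer; the function names are our own illustration and do not come from any particular design tool.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (non-ACGT symbols are kept)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_candidates(seq: str, spacer_len: int = 20):
    """Yield (strand, start, spacer, pam) for every NGG PAM with a full spacer upstream.

    `start` is the 0-based spacer position on the scanned strand, so for the
    minus strand it refers to the reverse-complemented sequence.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        # The lookahead lets the regex report overlapping PAMs as well.
        for match in re.finditer(r"(?=([ACGT]GG))", s):
            pam_start = match.start()
            if pam_start < spacer_len:
                continue  # not enough sequence upstream for a full spacer
            spacer = s[pam_start - spacer_len:pam_start]
            if set(spacer) <= set("ACGT"):  # skip spacers containing N or other symbols
                yield strand, pam_start - spacer_len, spacer, match.group(1)
```

Running such a scan over a whole chromosome yields a very large number of candidates, which is why the later filtering and scoring steps matter.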
Several online tools, such as ATUM (https://www.atum.bio/) or E-CRISP (http://www.e-crisp.org/), allow the user to look for the gRNAs most suitable for the purposes of the experiment. They usually offer a choice of PAM sequences, reference genomes, and experimental designs. Each tool also incorporates its own way of scoring on-target and off-target activity.
Nonetheless, they don’t exhaust all the possibilities of evaluating CRISPR usage, nor do they cover all the knowledge about its mechanisms, e.g., new PAM sequences or new information about gRNA activity prediction. Therefore, to obtain a comprehensive design solution for a given experimental setup, additional work is likely unavoidable.
Preparing a custom database of gRNA sequences involves several challenges. Below, we cover the main issues that such databases encounter and present potential solutions to each of them.
First, we will address the problem concerning size. Managing millions of entries for one type of nuclease and one reference genome requires well-optimized tools. Then, the application of gRNA scoring algorithms should be aimed at gathering the maximum amount of information valuable for the given experiment design, for example, by including published knowledge about nuclease and gRNA interaction properties.
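As a rough illustration of the storage side, the sketch below keeps gRNA candidates in an indexed SQLite table using Python's standard sqlite3 module. The table layout, column names, and helper functions are hypothetical choices made for this example; a production database with millions of entries would need a more careful schema and engine selection.

```python
import sqlite3


def build_db(rows):
    """Create an in-memory gRNA table (hypothetical schema) and index it by locus."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE grna (
               id INTEGER PRIMARY KEY,
               chrom TEXT NOT NULL,
               start INTEGER NOT NULL,
               strand TEXT NOT NULL,
               spacer TEXT NOT NULL,
               pam TEXT NOT NULL,
               on_target REAL
           )"""
    )
    conn.executemany(
        "INSERT INTO grna (chrom, start, strand, spacer, pam, on_target) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        rows,
    )
    # The index lets region queries avoid scanning the whole table.
    conn.execute("CREATE INDEX idx_grna_locus ON grna (chrom, start)")
    conn.commit()
    return conn


def query_region(conn, chrom, begin, end):
    """Return (spacer, start, on_target) for gRNAs in a region, best score first."""
    cursor = conn.execute(
        "SELECT spacer, start, on_target FROM grna "
        "WHERE chrom = ? AND start BETWEEN ? AND ? "
        "ORDER BY on_target DESC",
        (chrom, begin, end),
    )
    return cursor.fetchall()
```

Each row passed to build_db is a tuple of (chrom, start, strand, spacer, pam, on_target); the scores themselves would come from whichever scoring algorithm is chosen for the experiment.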
To evaluate the on-target activity of a given gRNA, most of the known gRNA design tools apply machine-learning models based on extensive experimental studies published by Doench [2] or Xu [3]. Their results provide profiles of target sequence preferences and also evaluate the PAM sequence and flanking nucleotides. The studies greatly support the understanding of how Cas9 interacts with DNA molecules. However, their results can be only partially extended to the analysis of other nucleases.
Additional features that are also important for Cas9 binding, like GC content and flanking nucleotide preferences, were presented in several articles [4,6] and can be used as additional scoring factors. One possible approach for a gRNA design tool is to combine all the potentially important factors into hand-crafted scores, which provide additional ranking options but should be treated carefully, as they are mostly theoretical predictions [7].
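A hand-crafted score of this kind could look like the following Python sketch. The GC window (40% to 70%) and the penalty weight are illustrative assumptions, not values taken from the cited studies, and the function names are our own.

```python
def gc_content(spacer: str) -> float:
    """Fraction of G and C nucleotides in the spacer."""
    if not spacer:
        raise ValueError("spacer must not be empty")
    spacer = spacer.upper()
    return sum(base in "GC" for base in spacer) / len(spacer)


def gc_penalty(spacer: str, low: float = 0.40, high: float = 0.70) -> float:
    """Distance of the GC content from an assumed acceptable window (0 inside it)."""
    gc = gc_content(spacer)
    if gc < low:
        return low - gc
    if gc > high:
        return gc - high
    return 0.0


def handcrafted_score(spacer: str, on_target: float, gc_weight: float = 1.0) -> float:
    """Combine a model-based on-target score with the GC penalty (hypothetical rule)."""
    return on_target - gc_weight * gc_penalty(spacer)
```

Such a score is useful for ranking candidates against each other, but it remains a heuristic and says little about absolute editing efficiency.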
Once the gRNA candidate sequence is established, there is another challenge: estimating its specificity. To do so, the gRNA sequence is typically aligned against the whole genome. The aim is to find not only the perfectly matching off-targets (i.e., alternative but unwanted locations) but also sites within some mismatch range that can compete with the target. The effect of mismatches on the probability of Cas9 binding to off-target locations has been investigated experimentally, and many design tools use the published scoring algorithms [7, 8].
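The idea of searching within a mismatch range can be sketched in Python as a simple Hamming-distance filter over a list of PAM-adjacent sites. This is a deliberately naive linear scan with hypothetical function names; genome-scale searches usually rely on indexed alignment rather than comparing every site one by one.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def off_targets(guide: str, sites, max_mismatches: int = 3):
    """Return (site, mismatches) for every site within the allowed mismatch range.

    `sites` is any iterable of spacer-length sequences found next to a PAM.
    The intended target itself appears with 0 mismatches and should be
    excluded by the caller.
    """
    hits = []
    for site in sites:
        distance = hamming(guide, site)
        if distance <= max_mismatches:
            hits.append((site, distance))
    return hits
```

The number of hits returned for a guide, grouped by mismatch count, is already a usable first approximation of its specificity.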
One of the most popular scores is the CFD score defined by Doench [9]. Simplified methods count the number of off-targets with an acceptable number of mismatches, which varies from one to three. In addition, the effect of each mismatch is weighted by its distance from the PAM sequence, because mismatches located in the seed sequence, defined as the 8 to 10 nucleotides preceding the PAM sequence, disrupt the enzyme binding mechanism with greater probability [10].
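A simplified position-weighted scheme, in the spirit of the seed-region argument above, might be sketched as follows in Python. This is not the CFD score, which uses an empirically derived weight table; the two weights here are arbitrary illustrative values, and the spacer is assumed to be written 5' to 3' with the PAM immediately after its last nucleotide.

```python
def weighted_mismatch_penalty(
    guide: str,
    site: str,
    seed_len: int = 10,
    seed_weight: float = 2.0,
    distal_weight: float = 1.0,
) -> float:
    """Sum mismatch penalties, weighting PAM-proximal (seed) mismatches more heavily.

    A higher penalty suggests the site is less likely to compete with the
    intended target. The weights are hypothetical, not published values.
    """
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    seed_start = len(guide) - seed_len  # first index of the seed region
    penalty = 0.0
    for index, (g, s) in enumerate(zip(guide, site)):
        if g != s:
            penalty += seed_weight if index >= seed_start else distal_weight
    return penalty
```

With these defaults, a single mismatch next to the PAM counts twice as much as one at the far end of the spacer, so two off-target sites with the same mismatch count can still be ranked differently.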
Finally, each gRNA should be ranked by a combined score measuring off-target activity, defined as a hand-crafted rule for counting and weighting competing locations. The effectiveness of the final score depends on several factors, including the alignment method and the incorporated scoring algorithms [7]. Therefore, it should be treated as a rough estimate of gRNA specificity.
To sum up, we have enumerated the points of the gRNA design process that need computational aid. The existing bioinformatics tools still need to be perfected, so their usage depends on the desired experiment outcome and requires some expertise. However, growing research on CRISPR mechanisms and effectiveness keeps improving the existing algorithms, which translates into better designs. That knowledge can already be turned into custom design tools or databases that significantly increase the efficiency of conducted experiments.