CRISPR-associated systems are potent tools for genome manipulation. From gene knockouts to whole-genome screening to precise genome editing with an ever-growing set of nucleases with different activities, they support an extremely broad spectrum of applications. Whether the expected effect is regulation of gene expression or a therapeutic single-nucleotide modification, the crucial step is a careful selection of one or more guide RNA (gRNA) sequences.
In this article, inspired by our recent experience in the computationally supported design of gRNAs, we discuss the main areas where bioinformatics accelerates that process and how it improves gRNA selection.
Optimizing Genome Manipulation – What Are the Challenges?
To obtain precise effects of genome manipulation, one should consider several factors. The main issue is the experiment design. Where should the nuclease bind? Should it cut the DNA molecule to introduce an insertion or deletion at a particular position? Is more than one gRNA sequence needed to obtain the expected effect? Or is the goal to achieve a regulatory effect?
However, knowing the potential location of the target binding sequence does not suffice. The experiment's success is also determined by the properties of the selected gRNA sequence. Depending on its sequence features, a gRNA can show different levels of on-target activity, which can be estimated by analyzing the sequence and its neighboring nucleotide patterns.
The other major question about a gRNA is how specific it is: how many alternative sites in the genome it can bind, and whether those sites overlap with coding regions. In other words, it is necessary to estimate the risk of unwanted modifications.
E-CRISP and ATUM – Online Computational Tools for gRNA Selection
Because protospacer adjacent motif (PAM) sequences are highly abundant in a reference genome, and their abundance depends on the nuclease type, the selection of gRNA must be supported by computational tools. These tools aim to provide the user with a selection of gRNA sequences based on their location, specificity within the genome, and sequence properties.
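To illustrate why PAM abundance quickly produces a large candidate pool, the following minimal sketch enumerates every 20-nt protospacer followed by a PAM on both strands of a sequence. It assumes an SpCas9-like 3' NGG PAM; the function names and the default spacer length are illustrative choices, not part of any tool mentioned here.

```python
import re

# Translation table used to complement DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_candidates(seq: str, spacer_len: int = 20):
    """Yield (strand, start, protospacer, pam) for every NGG PAM site.

    Assumption: an SpCas9-like 3' PAM of the form NGG. Other nucleases
    need a different pattern and possibly a 5' PAM.
    For the "-" strand, `start` is a coordinate on the reverse complement.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping sites are all reported.
    pattern = re.compile(rf"(?=([ACGT]{{{spacer_len}}})([ACGT]GG))")
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for m in pattern.finditer(s):
            yield strand, m.start(), m.group(1), m.group(2)


if __name__ == "__main__":
    for hit in find_candidates("A" * 20 + "TGG"):
        print(hit)
```

Applied to a whole reference genome, a scan like this yields millions of candidates, which is why ranking by location, specificity, and sequence properties is needed.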
Several online tools, such as ATUM (https://www.atum.bio/) or E-CRISP (http://www.e-crisp.org/), allow the user to look for the gRNA most suitable for the purposes of the experiment. They usually offer a choice of PAM sequences, reference genomes, and experimental designs. Each tool also incorporates its own way of scoring on-target and off-target activity.
Nonetheless, they do not cover every aspect of CRISPR usage or the latest knowledge about its mechanisms, e.g. new PAM sequences or new findings on gRNA activity prediction. Therefore, to obtain a comprehensive design solution for a given experimental setup, additional work is likely unavoidable.
Custom gRNA Sequences Database – Problems and Solutions
Large-scale gRNA Sequence Databases – Management Optimization
First, we will address the problem concerning size. Managing millions of entries for one type of nuclease and one reference genome requires well-optimized tools. Then, the application of gRNA scoring algorithms should be aimed at gathering the maximum amount of information valuable for the given experiment design, for example, by including published knowledge about nuclease and gRNA interaction properties.
gRNA On-Target Activity Evaluation with Machine Learning Tools – Insights from Doench and Xu
To evaluate the on-target activity of a given gRNA, most known gRNA design tools apply machine-learning models based on the extensive experimental studies published by Doench [2] or Xu [3]. Their results provide profiles of target sequence preferences that also cover the PAM sequence and flanking nucleotides. These studies substantially support the understanding of how Cas9 interacts with DNA molecules. However, their results can be only partially extended to the analysis of other nucleases.
Evaluating GC Content and Flanking Nucleotide Preferences as Supplementary Factors
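Sequence features of this kind, together with the position-specific nucleotide patterns that on-target models consume, are simple to derive from a target-site context. The following is a minimal sketch, not the feature set or weights of any published model. It assumes a context laid out as upstream flank, protospacer, PAM, and downstream flank (by default 4 + 20 + 3 nt plus whatever follows); `gc_content` and `extract_features` are hypothetical helper names.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA string."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def extract_features(context: str, upstream: int = 4, spacer_len: int = 20,
                     pam_len: int = 3) -> dict:
    """Turn a target-site context into a flat feature dictionary.

    Assumed layout (illustrative): `upstream` flanking bases, the
    protospacer, the PAM, then any remaining downstream flanking bases.
    """
    context = context.upper()
    start = upstream
    end = upstream + spacer_len
    if len(context) < end + pam_len:
        raise ValueError("context too short for the assumed layout")
    spacer = context[start:end]
    features = {"gc_content": gc_content(spacer)}
    # Position-specific one-hot encoding over the whole context, so the
    # flanking and PAM nucleotides are represented next to the protospacer.
    for i, base in enumerate(context):
        for b in "ACGT":
            features[f"pos{i}_{b}"] = 1.0 if base == b else 0.0
    return features


if __name__ == "__main__":
    ctx = "ACGT" + "G" * 10 + "A" * 10 + "TGG" + "CAT"
    feats = extract_features(ctx)
    print(feats["gc_content"], len(feats))
```

A dictionary like this can be passed to any regression or classification model; which features actually matter for a given nuclease has to come from experimental data.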
Additional features that are also important for Cas9 binding, such as GC content and flanking nucleotide preferences, were presented in several articles [4,6] and can be used as additional scoring factors. One possible approach for a gRNA design tool is to combine all the potentially important factors into hand-crafted scores. These provide additional ranking options but should be treated carefully, as they are mostly theoretical predictions [7].
gRNA Specificity – Genome-wide Alignment and Mismatch Investigations in the Context of Cas9 Binding Probability
Once the gRNA candidate sequence is established, there is another challenge: estimating its specificity. To do so, the gRNA sequence is typically aligned against the whole genome. The aim is to find not only the perfectly matching off-targets (i.e. alternative but unwanted locations) but also sites within some mismatch range that can compete with the target. The effect of mismatches on the probability of Cas9 binding to off-target locations has been investigated experimentally, and many design tools use the published scoring algorithms [7, 8].
Off-Target Effects Quantification – A Focus on the CFD Score in gRNA Design
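Before off-target effects can be scored, candidate sites within the allowed mismatch range have to be enumerated. The sketch below shows that search step as a brute-force scan, assuming an NGG PAM and the forward strand only. It stands in for the indexed, genome-wide alignment a real pipeline would use, and the function names are illustrative.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_off_targets(guide: str, genome: str, max_mismatches: int = 3):
    """Return (position, site, mismatches) for NGG-adjacent sites near `guide`.

    Simplifications (assumptions of this sketch): forward strand only,
    NGG PAM only, no gaps or bulges, linear scan instead of an index.
    The intended target itself is reported with zero mismatches.
    """
    guide, genome = guide.upper(), genome.upper()
    n = len(guide)
    hits = []
    # Stop early enough that a full 3-nt PAM always fits after the site.
    for i in range(len(genome) - n - 2):
        site = genome[i:i + n]
        pam = genome[i + n:i + n + 3]
        if pam[1:] != "GG":
            continue
        mm = hamming(guide, site)
        if mm <= max_mismatches:
            hits.append((i, site, mm))
    return hits


if __name__ == "__main__":
    guide = "ACGTACGTACGTACGTACGT"
    genome = (guide + "AGG" + "TT"
              + "ACGTACGTACGTACGTACGA" + "TGG"
              + "ACGTTTTTACGTACGTACGT" + "CGG")
    for hit in find_off_targets(guide, genome):
        print(hit)
```

Lowering `max_mismatches` shrinks the list of competing sites, which is the trade-off between sensitivity and run time that the alignment step has to balance.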
One of the most popular scores is the cutting frequency determination (CFD) score defined by Doench [9]. Simpler methods count the number of off-targets with an acceptable number of mismatches, which varies from one to three. In addition, the effect of each mismatch is weighted by its distance from the PAM sequence. This is because mismatches located in the seed sequence, defined as the 8 to 10 nucleotides preceding the PAM sequence, are more likely to disrupt the enzyme binding mechanism [10].
Balancing Counting and Weighting Rules for Off-Target Locations
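Counting off-targets per mismatch number and weighting mismatches by their distance from the PAM can be sketched as follows. This is a toy, hand-crafted rule and not the CFD score, which relies on empirically derived weights from [9]. The seed length, the two weights, and the way sites are summed into a single risk value are all assumptions made for illustration.

```python
from typing import Sequence


def weighted_mismatch_penalty(guide: str, site: str, seed_len: int = 10,
                              seed_weight: float = 1.0,
                              distal_weight: float = 0.3) -> float:
    """Sum mismatch penalties, weighting PAM-proximal (seed) positions more.

    Both sequences are written 5'->3' with the PAM following the last base,
    so the seed is the final `seed_len` positions. The weights are
    placeholders, not published values.
    """
    if len(guide) != len(site):
        raise ValueError("guide and site must have equal length")
    seed_start = len(guide) - seed_len
    penalty = 0.0
    for i, (g, s) in enumerate(zip(guide.upper(), site.upper())):
        if g != s:
            penalty += seed_weight if i >= seed_start else distal_weight
    return penalty


def summarize_off_targets(guide: str, sites: Sequence[str],
                          max_mismatches: int = 3):
    """Return (counts per mismatch number, toy aggregate risk).

    `sites` should hold off-target candidates only, without the
    intended target.
    """
    counts = {k: 0 for k in range(max_mismatches + 1)}
    risk = 0.0
    for site in sites:
        mm = sum(a != b for a, b in zip(guide.upper(), site.upper()))
        if mm > max_mismatches:
            continue
        counts[mm] += 1
        # A low penalty means the site closely resembles the guide,
        # so it contributes more to the toy risk total.
        risk += 1.0 / (1.0 + weighted_mismatch_penalty(guide, site))
    return counts, risk


if __name__ == "__main__":
    guide = "ACGTACGTACGTACGTACGT"
    sites = ["TCGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGA"]
    print(summarize_off_targets(guide, sites))
```

In this sketch a single mismatch far from the PAM leaves a site almost as risky as a perfect match, while the same mismatch in the seed reduces its contribution more strongly.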
Finally, each gRNA should be ranked by a combined score that measures off-target activity, defined as a hand-crafted rule for counting and weighting competing locations. The effectiveness of the final score depends on several factors, including the alignment method and the incorporated scoring algorithms [7]. Therefore, it should be treated as a rough estimate of gRNA specificity.
The Bioinformatics of CRISPR gRNA Design: Conclusion
- Addgene, CRISPR 101: A Desktop Resource, retrieved from www.addgene.org
- Doench, J. G. et al. Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation. Nat. Biotechnol. 32, 1262-7 (2014)
- Xu H. et al. Sequence determinants of improved CRISPR sgRNA design. Genome Res. 25, 1147-57 (2015)
- Wang, T., Wei, J. J., Sabatini, D. M., Lander, E. S. Genetic screens in human cells using the CRISPR/Cas9 system. Science 343, 80-84 (2014)
- Farboud, B., Meyer, B.J., Dramatic Enhancement of Genome Editing by CRISPR/Cas9 Through Improved Guide RNA Design. Genetics 199, 959-971 (2015)
- Sander, J.D., Joung, J.K. CRISPR-Cas systems for genome editing, regulation and targeting, Nat. Biotechnol. 32, 347-355 (2014)
- Listgarten, J. et al. Prediction of off-target activities for the end-to-end design of CRISPR guide RNAs, Nature Biomedical Engineering 2, 38–47 (2018)
- Hsu, P. D. et al. DNA targeting specificity of RNA-guided Cas9 nucleases, Nat. Biotechnol. 31, 827-832 (2013)
- Doench, J.G. et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9, Nat.Biotechnol. 34, 184-191 (2016)
- Jiang, F., Doudna, J.A. CRISPR-Cas9 Structures and Mechanisms, Annu. Rev. Biophys. 46, 505-529 (2017)