29 May 2018
Author: Barbara Kalinowska, PhD

The bioinformatics of CRISPR gRNA design

CRISPR-associated systems are potent tools for genome manipulation. From gene knockouts to whole-genome screening to precise genome editing with an ever-growing set of nucleases with different activities, they cover an extremely broad spectrum of applications. Whether the expected effect is regulation of gene expression or a therapeutic single-nucleotide modification, the crucial step is a careful selection of one or more guide RNA (gRNA) sequences.

In this article, inspired by our recent experience in the computationally supported design of gRNAs, we discuss the main areas where bioinformatics accelerates that process and improves gRNA selection.

What you will learn from this article:

  1. What are the challenges of the genome manipulation optimization process?
  2. Why are E-CRISP and ATUM suitable for gRNA selection?
  3. What are the issues with a custom gRNA sequence database?
  4. Conclusion

Optimizing Genome Manipulation – What Are the Challenges?

To obtain precise effects of genome manipulation, one should consider several factors. The main issue is the experiment design. Where should the nuclease bind? Should it cut the DNA molecule to introduce an insertion or deletion at a particular position? Is more than one gRNA sequence needed to obtain the expected effect? Or is the goal a regulatory effect, such as a change in gene expression?

Knowing the potential location of the target binding sequence is not sufficient, however. The experiment's success is also determined by the properties of the selected gRNA sequence. Depending on its sequence features, a gRNA can show different levels of on-target activity, which can be estimated by analyzing the sequence and its neighboring nucleotide patterns.

The other major question about a gRNA is how specific it is: how many alternative sites in the genome it can bind to, and whether those sites overlap with coding regions. In other words, it is necessary to estimate the risk of unwanted modifications.

E-CRISP and ATUM – Online Computational Tools for gRNA Selection

Because protospacer adjacent motif (PAM) sequences are highly abundant in a reference genome, and their abundance also depends on the nuclease type, the selection of gRNAs must be supported by computational tools. These tools aim to provide the user with a selection of gRNA sequences based on their location, specificity within the genome, and sequence properties.
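To illustrate how quickly PAM sites accumulate, here is a minimal sketch that enumerates candidate target sites on both strands of a sequence. It assumes the NGG PAM of SpCas9 and a 20-nucleotide protospacer; other nucleases would need a different motif and length. The function names `reverse_complement` and `find_candidates` are ours, chosen for illustration only.

```python
import re


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_candidates(seq: str, guide_len: int = 20):
    """Yield (strand, start, protospacer, pam) for every NGG PAM site.

    Assumption: SpCas9-style NGG PAM located directly 3' of the protospacer.
    `start` is 0-based and refers to the strand being scanned, i.e. for "-"
    it is a position on the reverse-complemented sequence.
    """
    seq = seq.upper()
    # A lookahead is used so that overlapping sites are all reported.
    pattern = re.compile(rf"(?=([ACGT]{{{guide_len}}})([ACGT]GG))")
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for match in pattern.finditer(strand_seq):
            yield strand, match.start(), match.group(1), match.group(2)


if __name__ == "__main__":
    example = "ACGT" * 5 + "AGG"
    for hit in find_candidates(example):
        print(hit)
```

On a real reference genome the same scan returns millions of candidates, which is why location, specificity, and sequence properties are needed to narrow the list.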

Several online tools, such as ATUM (https://www.atum.bio/) or E-CRISP (http://www.e-crisp.org/), allow the user to look for the gRNAs most suitable for the purposes of the experiment. They usually offer a choice of PAM sequences, reference genomes, and experimental designs. Each tool also incorporates its own way of scoring on-target and off-target activity.

Nonetheless, they do not cover all the possibilities of CRISPR usage or all current knowledge about its mechanisms, e.g. new PAM sequences or new findings on gRNA activity prediction. Therefore, to obtain a comprehensive design solution for a given experimental setup, additional work is likely unavoidable.

Custom gRNA Sequences Database – Problems and Solutions

Preparing a custom database of gRNA sequences poses several challenges. Below, we cover the main issues that such databases encounter and present potential solutions to each of them.

Large-scale gRNA Sequence Databases – Management Optimization 

First, we will address the problem of size. Managing millions of entries for one type of nuclease and one reference genome requires well-optimized tools. Second, the scoring algorithms applied should capture as much information relevant to the given experiment design as possible, for example by incorporating published knowledge about nuclease and gRNA interaction properties.
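One common way to keep millions of short sequences manageable is to store each one as a compact integer rather than a string. The sketch below packs a DNA sequence into two bits per base, so a 20-nucleotide gRNA fits into 40 bits. This is only one possible optimization, shown under our own assumptions; the helper names `encode` and `decode` are hypothetical and the approach ignores ambiguous bases such as N.

```python
# Two bits per base: A=00, C=01, G=10, T=11 (an arbitrary but fixed mapping).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(seq: str) -> int:
    """Pack an ACGT-only sequence into an integer, 2 bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODE[base]  # raises KeyError on non-ACGT input
    return value


def decode(value: int, length: int = 20) -> str:
    """Unpack an integer produced by `encode` back into a sequence."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 3])  # read the lowest two bits
        value >>= 2
    return "".join(reversed(bases))


if __name__ == "__main__":
    guide = "GACCTTGAACGGTCATCAGT"
    packed = encode(guide)
    print(packed, decode(packed, len(guide)))
```

Integer keys of this kind can be indexed and compared cheaply in most databases and array libraries, at the cost of having to carry the sequence length separately.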

gRNA On-Target Activity Evaluation with Machine Learning Tools – Insights from Doench and Xu

To evaluate the on-target activity of a given gRNA, most known gRNA design tools apply machine-learning models based on the extensive experimental studies published by Doench [2] or Xu [3]. Their results provide profiles of target sequence preferences and also evaluate the PAM sequence and flanking nucleotides. These studies greatly support the understanding of how Cas9 interacts with DNA molecules. However, their results can be only partially extended to other nucleases.

Evaluating GC Content and Flanking Nucleotide Preferences as Supplementary Factors

Additional features that are also important for Cas9 binding, such as GC content and flanking nucleotide preferences, were presented in several articles [4,6] and can be used as additional scoring factors. One possible approach for a gRNA design tool is to combine all the potentially important factors into hand-crafted scores, which provide additional ranking options but should be treated carefully, as they are mostly theoretical predictions [7].
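As a small illustration of such a supplementary factor, the sketch below computes the GC content of a candidate sequence and applies a simple range filter. The 40% to 80% window is an illustrative assumption of ours, not a threshold taken from the cited studies, and should be tuned to the nuclease and experiment at hand.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_gc_filter(seq: str, low: float = 0.40, high: float = 0.80) -> bool:
    """Check whether GC content lies within [low, high].

    The default bounds are placeholders for illustration only.
    """
    return low <= gc_content(seq) <= high


if __name__ == "__main__":
    for guide in ("GACCTTGAACGGTCATCAGT", "A" * 20):
        print(guide, gc_content(guide), passes_gc_filter(guide))
```

A filter like this can serve as one term of a hand-crafted score, alongside model-based on-target predictions.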

gRNA Specificity – Genome-wide Alignment and Mismatch Investigations in the Context of Cas9 Binding Probability

Once the gRNA candidate sequence is established, there is another challenge: estimating its specificity. To do so, the gRNA sequence is typically aligned against the whole genome. The aim is to find not only perfectly matching off-targets (i.e. alternative but unwanted locations) but also sites within some mismatch range that can compete with the target. The effect of mismatches on the probability of Cas9 binding to off-target locations has been investigated experimentally, and many design tools use the published scoring algorithms [7, 8].
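The idea of searching within a mismatch range can be sketched with a naive scan. The example below slides the guide along a sequence, keeps only positions followed by an NGG PAM (an SpCas9 assumption), and reports sites whose Hamming distance to the guide is within a limit. It is a toy, forward-strand-only illustration; real tools rely on indexed aligners to cover a whole genome, and the function names here are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_off_targets(guide: str, genome: str, max_mismatches: int = 3):
    """Return (position, site, mismatches) for PAM-adjacent sites.

    Assumptions: NGG PAM directly 3' of the site, forward strand only,
    substitutions only (no bulges or indels).
    """
    guide, genome = guide.upper(), genome.upper()
    n = len(guide)
    hits = []
    # A site needs n bases plus 3 PAM bases to fit inside the sequence.
    for i in range(len(genome) - n - 2):
        if genome[i + n + 1:i + n + 3] != "GG":
            continue  # no NGG PAM after this window
        site = genome[i:i + n]
        mismatches = hamming(guide, site)
        if mismatches <= max_mismatches:
            hits.append((i, site, mismatches))
    return hits


if __name__ == "__main__":
    guide = "GACCTTGAACGGTCATCAGT"
    near_match = "TACCTTGAACTGTCATCAGT"  # differs at two positions
    genome = guide + "AGG" + "TTTTT" + near_match + "CGG"
    for hit in find_off_targets(guide, genome):
        print(hit)
```

The perfect match at the intended target is reported as well, so a caller would normally exclude the on-target position before counting off-targets.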

Off-Target Effects Quantification – A Focus on the CFD Score in gRNA Design

One of the most popular scores is the CFD (cutting frequency determination) score defined by Doench [9]. Simplified methods count the number of off-targets with an acceptable number of mismatches, which varies from one to three. In addition, the effect of each mismatch is weighted by its distance from the PAM sequence. This is because mismatches located in the seed sequence, defined as the 8 to 10 nucleotides preceding the PAM sequence, are more likely to disrupt the enzyme binding mechanism [10].
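The distance-based weighting can be sketched as follows. Note that this is not the CFD score, which relies on an experimentally derived table that we do not reproduce here. It is a toy score in which each mismatch multiplies the result by a penalty factor, with a harsher factor inside the seed region. The seed length of 10 and both penalty values are arbitrary assumptions chosen for illustration.

```python
def weighted_mismatch_score(
    guide: str,
    site: str,
    seed_len: int = 10,
    seed_penalty: float = 0.2,
    distal_penalty: float = 0.7,
) -> float:
    """Return a toy binding score in (0, 1]; 1.0 means a perfect match.

    Both sequences are written 5'->3' with the PAM following the last base,
    so the seed region is the final `seed_len` positions. Penalty values are
    illustrative placeholders, not published parameters.
    """
    score = 1.0
    n = len(guide)
    for i, (g, s) in enumerate(zip(guide.upper(), site.upper())):
        if g != s:
            in_seed = i >= n - seed_len
            score *= seed_penalty if in_seed else distal_penalty
    return score


if __name__ == "__main__":
    guide = "GACCTTGAACGGTCATCAGT"
    distal = "TACCTTGAACGGTCATCAGT"  # mismatch far from the PAM
    seed = "GACCTTGAACGGTCATCAGA"    # mismatch next to the PAM
    print(weighted_mismatch_score(guide, distal))
    print(weighted_mismatch_score(guide, seed))
```

With these settings, a single mismatch next to the PAM lowers the score much more than one at the PAM-distal end, which mirrors the qualitative behavior described above.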

Balancing Counting and Weighting Rules for Off-Target Locations

Finally, each gRNA should be ranked by a combined score measuring off-target activity, defined as a hand-crafted rule for counting and weighting competing locations. The effectiveness of the final score depends on several factors, including the alignment method and the incorporated scoring algorithms [7]. Therefore, it should be treated as a rough estimate of gRNA specificity.
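A hand-crafted combination rule of this kind might look like the sketch below. It assumes that each off-target site already has a per-site score between 0 and 1 (higher meaning stronger competition) and aggregates them as 1 / (1 + sum). The formula and the names `specificity` and `rank_guides` are our own illustrative choices, not the rule used by any particular tool.

```python
def specificity(off_target_scores) -> float:
    """Aggregate per-site off-target scores into one value in (0, 1].

    A guide with no off-targets gets 1.0; more or stronger off-targets
    push the value toward 0. The formula is an illustrative assumption.
    """
    return 1.0 / (1.0 + sum(off_target_scores))


def rank_guides(candidates: dict) -> list:
    """Sort guide identifiers from most to least specific.

    `candidates` maps a guide identifier to its list of off-target scores.
    """
    return sorted(
        candidates,
        key=lambda guide: specificity(candidates[guide]),
        reverse=True,
    )


if __name__ == "__main__":
    candidates = {"g1": [0.7, 0.2], "g2": [], "g3": [0.2]}
    for guide in rank_guides(candidates):
        print(guide, round(specificity(candidates[guide]), 3))
```

Because the ranking depends on both the aggregation rule and the upstream per-site scores, two tools can order the same guides differently.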

The Bioinformatics of CRISPR gRNA Design: Conclusion

To sum up, we have enumerated the points of the gRNA design process that need computational aid. The existing bioinformatics tools still need to be perfected, so their usage depends on the desired experiment outcome and requires some expertise. However, growing research on CRISPR mechanisms and effectiveness keeps improving the existing algorithms, which translates into better designs. That knowledge can already be turned into custom design tools or databases that significantly increase the efficiency of the experiments conducted.

Works Cited:

  1. Addgene. CRISPR 101: A Desktop Resource. Retrieved from www.addgene.org
  2. Doench, J. G. et al. Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation. Nat. Biotechnol. 32, 1262-1267 (2014)
  3. Xu, H. et al. Sequence determinants of improved CRISPR sgRNA design. Genome Res. 25, 1147-1157 (2015)
  4. Wang, T., Wei, J. J., Sabatini, D. M., Lander, E. S. Genetic screens in human cells using the CRISPR/Cas9 system. Science 343, 80-84 (2014)
  5. Farboud, B., Meyer, B. J. Dramatic enhancement of genome editing by CRISPR/Cas9 through improved guide RNA design. Genetics 199, 959-971 (2015)
  6. Sander, J. D., Joung, J. K. CRISPR-Cas systems for editing, regulating and targeting genomes. Nat. Biotechnol. 32, 347-355 (2014)
  7. Listgarten, J. et al. Prediction of off-target activities for the end-to-end design of CRISPR guide RNAs. Nat. Biomed. Eng. 2, 38-47 (2018)
  8. Hsu, P. D. et al. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat. Biotechnol. 31, 827-832 (2013)
  9. Doench, J. G. et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol. 34, 184-191 (2016)
  10. Jiang, F., Doudna, J. A. CRISPR-Cas9 structures and mechanisms. Annu. Rev. Biophys. 46, 505-529 (2017)