CRISPR-associated systems are powerful tools for genome manipulation. They support an incredibly wide spectrum of applications, from gene knockouts through genome-wide screens to precise genome editing, thanks to a still-growing set of nucleases with various activities. Whether the expected effect is regulation of gene expression or a therapeutic single-nucleotide modification, the crucial point is a well-prepared selection of one or more guide RNA (gRNA) sequences. In this article, inspired by our recent experience in computationally supported gRNA design, we discuss the main areas where bioinformatics accelerates that process and improves gRNA selection.
To obtain precise effects of genome manipulation, several aspects must be considered. The main issue is, of course, the experiment design: where should the nuclease bind? Should it cut the DNA molecule in order to introduce an insertion or deletion at a particular position? Is more than one gRNA sequence needed to obtain the expected effect? Or is a regulatory effect desired instead?
Knowing the potential location of the target binding sequence is not enough. The success of the experiment also depends on the properties of the selected gRNA sequence. Depending on its sequence features, a gRNA can show varying on-target activity, which can be estimated by analysing the sequence and its neighbouring nucleotide patterns. The other important question is how specific the gRNA is, meaning how many alternative locations in the genome it can bind and whether they overlap any coding regions. In other words, it is necessary to assess the risk of unwanted modifications.
Because protospacer adjacent motif (PAM) sequences are highly abundant in reference genomes, with the abundance depending on the nuclease type, gRNA selection must be supported by computational tools. Their aim is to provide the user with a selection of gRNA sequences based on location, specificity within the genome and sequence properties. Several online tools (such as ATUM and E-CRISP) allow the user to look for the gRNAs most suitable for the purposes of the experiment. They usually offer a choice of PAM sequences, reference genomes and experimental designs. Each tool also incorporates its own way of scoring on-target and off-target activity. Nonetheless, they do not exhaust all the possibilities of evaluating CRISPR usage or all the knowledge about its mechanisms, e.g. new PAM sequences or new findings on gRNA activity prediction. Therefore, a comprehensive design solution for a given experimental setup will probably require extra work.
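To illustrate how quickly PAM sites accumulate, the following minimal sketch enumerates every candidate protospacer on both strands of a sequence. It assumes a 20-nucleotide guide followed by an NGG PAM (the motif commonly associated with SpCas9); the function names and the default guide length are illustrative choices, not taken from any of the tools mentioned above.

```python
import re


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_candidates(seq, guide_len=20):
    """List (strand, start, protospacer, pam) for every NGG PAM.

    Assumption: the guide directly precedes an NGG PAM. Start positions
    on the "-" strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    hits = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        # The lookahead makes the match zero-width, so overlapping
        # candidate sites are all reported.
        for match in pattern.finditer(strand_seq):
            hits.append((strand, match.start(), match.group(1), match.group(2)))
    return hits
```

Calling `find_candidates` on a gene-sized sequence typically returns many candidates, which is why the filtering and scoring steps discussed in this article matter.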
Furthermore, preparing a custom database of gRNA sequences poses several challenges. The first is size: managing millions of entries for one type of nuclease and one reference genome requires well-optimized tools. Then, gRNA scoring algorithms should be applied so as to maximize the information valuable for the given experiment design, for example by including published knowledge about the properties of nuclease and gRNA interaction.
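As a rough sketch of what such a database could look like, the example below stores candidate gRNAs in SQLite with an index on genomic position, so that region queries do not scan the whole table. The schema, column names and function names are hypothetical; a production system handling millions of entries would need further tuning.

```python
import sqlite3


def build_guide_db(rows, path=":memory:"):
    """Create a guide table (hypothetical schema) and load rows into it.

    Each row is (chrom, start, strand, sequence, pam).
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE guides ("
        "chrom TEXT, start INTEGER, strand TEXT, sequence TEXT, pam TEXT)"
    )
    # Index on locus so that range lookups stay fast as the table grows.
    conn.execute("CREATE INDEX idx_guides_locus ON guides (chrom, start)")
    conn.executemany("INSERT INTO guides VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def guides_in_region(conn, chrom, begin, end):
    """Return all guides whose start lies in [begin, end] on chrom."""
    cursor = conn.execute(
        "SELECT chrom, start, strand, sequence, pam FROM guides "
        "WHERE chrom = ? AND start BETWEEN ? AND ? ORDER BY start",
        (chrom, begin, end),
    )
    return cursor.fetchall()
```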
To evaluate the on-target activity of a given gRNA, most known gRNA design tools apply machine learning models based on extensive experimental studies published by Doench et al. or Xu et al. Their results provide profiles of target sequence preferences, also evaluating the PAM sequence and flanking nucleotides. These studies greatly support the understanding of how Cas9 interacts with the DNA molecule, but their results can be only partially extended to the analysis of other nucleases. Additional features that are also important for Cas9 binding, such as GC content and flanking nucleotide preferences, were presented in several articles [4-6] and can be used as additional scoring factors. One possible approach for a gRNA design tool is to combine all the potentially important factors into hand-crafted scores, which provide an additional ranking option but should be treated carefully, as they are mostly theoretical predictions.
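A hand-crafted score of this kind can be very simple. The sketch below computes GC content and turns it into a toy ranking value; the 40-60% window, the linear penalty and the function names are assumptions made purely for illustration and do not reproduce any published model.

```python
def gc_content(guide):
    """Fraction of G and C nucleotides in the guide sequence."""
    guide = guide.upper()
    return (guide.count("G") + guide.count("C")) / len(guide)


def toy_on_target_score(guide, gc_low=0.4, gc_high=0.6, falloff=0.4):
    """Hypothetical score in [0, 1] based only on GC content.

    Guides inside the assumed GC window score 1.0; outside it the score
    drops linearly and reaches 0.0 at `falloff` away from the window.
    """
    gc = gc_content(guide)
    if gc_low <= gc <= gc_high:
        return 1.0
    distance = gc_low - gc if gc < gc_low else gc - gc_high
    return max(0.0, 1.0 - distance / falloff)
```

In practice such a rule would be one factor among several, and its weights would have to be justified against experimental data before it is trusted for ranking.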
Once the candidate gRNA sequence is established, another challenge is to estimate its specificity. To do so, the gRNA sequence is usually aligned against the whole genome. The aim is to find not only perfectly matching off-targets (i.e. alternative but unwanted locations) but also sites within some mismatch range that can compete with the target. The effect of mismatches on the probability of Cas9 binding to off-target locations has been investigated experimentally, and many design tools use the published scoring algorithms [7, 8]. One of the most popular is the CFD score defined by Doench et al. Simplified methods count the number of off-targets with an acceptable number of mismatches, which varies from one to three. Additionally, the effect of a mismatch is weighted by its distance from the PAM sequence, because mismatches located in the seed sequence, defined as the 8 to 10 nucleotides preceding the PAM, are more likely to disrupt the zipping mechanism of enzyme binding. Finally, each gRNA should be ranked by a combined score measuring off-target activity, which is often defined as a hand-crafted rule for counting and weighting competing locations. The effectiveness of the final score strongly depends on several factors, including the alignment method and the incorporated scoring algorithms; it should therefore be treated as a rough estimate of gRNA specificity.
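The simplified counting approach and the seed weighting described above can be sketched as follows. This is not the CFD score: the seed length of 10, the weight of 2.0 and all function names are assumptions chosen for illustration, and the candidate sites are expected to come from a separate genome alignment step.

```python
def mismatch_positions(guide, site):
    """Return 0-based positions where guide and site differ."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    pairs = zip(guide.upper(), site.upper())
    return [i for i, (a, b) in enumerate(pairs) if a != b]


def weighted_mismatch_penalty(guide, site, seed_len=10, seed_weight=2.0):
    """Sum mismatch weights; PAM-proximal (seed) mismatches count more.

    Assumption: the PAM follows the last nucleotide, so the seed is the
    final `seed_len` positions of the guide.
    """
    seed_start = len(guide) - seed_len
    return sum(
        seed_weight if pos >= seed_start else 1.0
        for pos in mismatch_positions(guide, site)
    )


def count_off_targets(guide, sites, max_mismatches=3):
    """Count candidate sites within the accepted mismatch range.

    `sites` should exclude the intended on-target location.
    """
    return sum(
        1 for site in sites
        if len(mismatch_positions(guide, site)) <= max_mismatches
    )
```

A higher penalty means the site is less likely to compete with the target, so a combined specificity score could, for example, down-weight off-targets by their penalty before summing them.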
In conclusion, there are clear points in the gRNA design process that need computational aid. The existing bioinformatics tools are not yet perfect, so their usage depends on the desired experiment outcome and requires some expertise. However, growing research on CRISPR mechanisms and effectiveness provides improvements to the existing algorithms that translate into better designs. That knowledge can already be turned into custom design tools or databases that significantly increase the efficiency of experiments.
- Addgene. CRISPR 101: A Desktop Resource. Retrieved from www.addgene.org
- Doench, J. G. et al. Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation. Nat. Biotechnol. 32, 1262-1267 (2014)
- Xu, H. et al. Sequence determinants of improved CRISPR sgRNA design. Genome Res. 25, 1147-1157 (2015)
- Wang, T., Wei, J. J., Sabatini, D. M., Lander, E. S. Genetic screens in human cells using the CRISPR/Cas9 system. Science 343, 80-84 (2014)
- Farboud, B., Meyer, B. J. Dramatic enhancement of genome editing by CRISPR/Cas9 through improved guide RNA design. Genetics 199, 959-971 (2015)
- Sander, J. D., Joung, J. K. CRISPR-Cas systems for editing, regulating and targeting genomes. Nat. Biotechnol. 32, 347-355 (2014)
- Listgarten, J. et al. Prediction of off-target activities for the end-to-end design of CRISPR guide RNAs. Nat. Biomed. Eng. 2, 38-47 (2018)
- Hsu, P. D. et al. DNA targeting specificity of RNA-guided Cas9 nucleases. Nat. Biotechnol. 31, 827-832 (2013)
- Doench, J. G. et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol. 34, 184-191 (2016)
- Jiang, F., Doudna, J. A. CRISPR-Cas9 structures and mechanisms. Annu. Rev. Biophys. 46, 505-529 (2017)