Frequency of nucleotide sequences of specific length in DNA

1. How frequently would you find the nucleotide sequence GGATATCCGC (5' to 3' direction) by chance in a DNA molecule?
2. On average, how many times do you expect to find a specific 20 nucleotide sequence in a genome with a total size of 4 × 10^9 base pairs?

Since the probability of finding each base in its given position is 1 in 4, the probability of finding that particular sequence starting at any given position would be $$0.25^{10}$$

But how do I approach the second problem?

TIA
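One way to approach the second problem, sketched in R under the assumptions that all four bases are equally likely and independent at every position and that only one strand is scanned: multiply the per-position probability by the number of positions at which a 20-mer can start. The variable names are illustrative.

```r
# Question 1: probability of one specific 10-mer at a given position
p10 <- 0.25^10

# Question 2: expected number of hits for one specific 20-mer
genome_size <- 4e9                            # base pairs
word_length <- 20
n_positions <- genome_size - word_length + 1  # possible start positions
expected_hits <- n_positions * 0.25^word_length

p10
expected_hits
```

Under these assumptions the expected count is roughly 0.0036, so a specific 20-mer would not be expected to occur even once by chance. Scanning both strands would roughly double that figure.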

DNA Sequence Statistics (2)

In the chapter on How to install R, you learnt about variables in R, such as scalars, vectors, and lists. You also learnt how to use functions to carry out operations on variables, for example, using the log10() function to calculate the log to the base 10 of a scalar variable x, or using the mean() function to calculate the average of the values in a vector variable myvector:
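A minimal sketch of both calls; the values of x and myvector are chosen here purely for illustration.

```r
x <- 100
log10(x)          # log to the base 10 of the scalar x

myvector <- c(30, 16, 303, 99, 11, 111)
mean(myvector)    # average of the values in myvector
```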

You also learnt that you can extract an element of a vector by typing the vector name with the index of that element given in square brackets. For example, to get the value of the 3rd element in the vector myvector, we type:
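For instance, assuming myvector holds the illustrative values below:

```r
myvector <- c(30, 16, 303, 99, 11, 111)
third <- myvector[3]   # square brackets select the 3rd element
third
```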

A useful function in R is the seq() function, which can be used to create a sequence of numbers that run from a particular number to another particular number. For example, if we want to create the sequence of numbers from 1 to 100 in steps of 1 (i.e. 1, 2, 3, 4, ..., 97, 98, 99, 100), we can type:
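A short sketch of the call (the variable name numbers is illustrative):

```r
numbers <- seq(1, 100, by = 1)   # from 1 to 100 in steps of 1
numbers
```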

We can change the step size by altering the value of the “by” argument given to the function seq(). For example, if we want to create a sequence of numbers from 1 to 100 in steps of 2 (i.e. 1, 3, 5, 7, ..., 97, 99), we can type:
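A short sketch of the call (the variable name odd_numbers is illustrative):

```r
odd_numbers <- seq(1, 100, by = 2)   # from 1 upwards in steps of 2
odd_numbers
```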

In R, just as in programming languages such as Python, it is possible to write a for loop to carry out the same command several times. For example, if we want to print out the square of each number between 1 and 10, we can write the following for loop:
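A minimal sketch of such a loop:

```r
for (i in 1:10) {
  print(i * i)   # prints the square of the counter i
}
```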

In the for loop above, the variable i is a counter for the number of cycles through the loop. In the first cycle through the loop, the value of i is 1, and so i * i =1 is printed out. In the second cycle through the loop, the value of i is 2, and so i * i =4 is printed out. In the third cycle through the loop, the value of i is 3, and so i * i =9 is printed out. The loop continues until the value of i is 10. In the tenth cycle through the loop, the value of i is 10, and so i * i =100 is printed out.

Note that the commands that are to be carried out at each cycle of the for loop must be enclosed within curly brackets (“{” and “}”).

You can also give a for loop a vector of numbers containing the values that you want the counter i to take in subsequent cycles. For example, you can make a vector avector containing the numbers 2, 9, 100, and 133, and write a for loop to print out the square of each number in vector avector:
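A minimal sketch, using the four numbers named above:

```r
avector <- c(2, 9, 100, 133)
for (i in avector) {
  print(i * i)   # i takes each value of avector in turn
}
```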

How can we use a for loop to print out the square of every second number between 1 and 10? The answer is to use the seq() function to tell the for loop to take every second number between 1 and 10:
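A minimal sketch of this loop:

```r
for (i in seq(1, 10, by = 2)) {
  print(i * i)   # i takes the values 1, 3, 5, 7, 9
}
```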

In the first cycle of this loop, the value of i is 1, and so i * i =1 is printed out. In the second cycle through the loop, the value of i is 3, and so i * i =9 is printed out. The loop continues until the value of i is 9. In the fifth cycle through the loop, the value of i is 9, and so i * i =81 is printed out.

R allows the production of a variety of plots, including scatterplots, histograms, piecharts, and boxplots. For example, if you have two vectors of numbers myvector1 and myvector2, you can plot a scatterplot of the values in myvector1 against the values in myvector2 using the plot() function. If you want to label the axes on the plot, you can do this by giving the plot() function values for its optional arguments xlab and ylab:
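A minimal sketch; the values in the two vectors and the axis labels are made up for illustration.

```r
myvector1 <- c(10, 15, 22, 35, 43)
myvector2 <- c(3, 3.2, 3.9, 4.1, 5.2)
plot(myvector1, myvector2, xlab = "myvector1", ylab = "myvector2")
```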

If you look at the help page for the plot() function, you will see that there are lots of optional arguments (inputs) that it can take. For example, one optional argument is the type argument, which determines the type of the plot. By default, plot() will draw a dot at each data point, but if we set type to be “b”, then it will also draw a line between each subsequent data point:
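A minimal sketch, again with made-up values for the two vectors:

```r
myvector1 <- c(10, 15, 22, 35, 43)
myvector2 <- c(3, 3.2, 3.9, 4.1, 5.2)
# type = "b" draws both points and connecting lines
plot(myvector1, myvector2, xlab = "myvector1", ylab = "myvector2", type = "b")
```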

We have been using built-in R functions such as mean(), length(), print(), plot(), etc. We can also create our own functions in R to do calculations that we want to carry out very often on different input data sets. For example, we can create a function to calculate the value of 20 plus the square of some input number:
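A minimal sketch of such a function, using the name myfunction that the text refers to further on:

```r
myfunction <- function(x) {
  return(20 + (x * x))   # 20 plus the square of the input
}
```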

This function will calculate the square of a number (x), and then add 20 to that value. The return() statement returns the calculated value. Once you have typed in this function, the function is then available for use. For example, we can use the function for different input numbers (e.g. 10, 25):
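A short sketch of both calls, with the function defined as described above:

```r
myfunction <- function(x) {
  return(20 + (x * x))
}

result10 <- myfunction(10)   # 20 + 10 * 10
result25 <- myfunction(25)   # 20 + 25 * 25
result10
result25
```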

You can view the code that makes up a function by typing its name (without any parentheses). For example, we can try this by typing “myfunction”:
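A short sketch, assuming myfunction has been defined as described above:

```r
myfunction <- function(x) {
  return(20 + (x * x))
}

myfunction   # no parentheses: R prints the function's code instead of calling it
```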

When you are typing R commands, you can add comments by writing the comment text after the “#” sign. This can be useful if you want to write some R commands that other people need to read and understand. R will ignore the comments when it is executing the commands. For example, you may want to write a comment to explain what the function log10() does:
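A short sketch, with an illustrative value for x:

```r
x <- 100
log10(x)   # Finds the log to the base 10 of variable x.
```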

Introduction

Studies of genomes based on linguistic approaches date back a few decades (Brendel et al. 1986; Pevzner et al. 1989; Searls 1992; Botstein and Cherry 1997; Gimona 2006; Faltýnek et al. 2019; Ji 2020). An interplay with methods of statistical physics as well as the theory of complex systems brought new insights into biology (Dehmer et al. 2009; Qian 2013). Studies range from attempted n-gram-based classification of genomes (Tomović et al. 2006; Huang and Yu 2016) to algorithms for optimal segmentation of RNAs in secondary structure predictions (Licon et al. 2010) and analysis of substitution rates of coding genes during evolution (Lin et al. 2019), just to mention a few. Various types of sequences in genomes are related to multiple genetic codes (Trifonov et al. 2012) and can be studied both from a quantitative linguistic point of view (Ferrer-i-Cancho et al. 2013; Ferrer-i-Cancho et al. 2014) and from a wider perspective, within more abstract approaches (Neuman and Nave 2008; Barbieri 2012). Recently, neural networks and deep learning algorithms emerged as new tools to analyze nucleotide sequences (Fang et al. 2019; Singh et al. 2019; Melkus et al. 2020; Ren et al. 2020), offering wider prospects for studies of genomes. Viruses, balancing on the fuzzy border between non-alive and alive, hence remaining on the verge of life (Villarreal 2004; Kolb 2007; Carsetti 2020), are among the most interesting subjects of study.

The aim of the present Letter is to draw attention to simple treatments of nucleotide sequences in viral RNAs by means of new parameters, which can be immediately extracted from genome data. We expect that such parameters can potentially be used as an auxiliary tool in the classification of viruses (cf., in particular, Wang 2013). The idea of this study is linked to the recent COVID-19 outbreak, and the analysis started from comparing human coronaviruses (Su et al. 2016; Wu et al. 2020) and some other viruses. To achieve relative homogeneity of the material, we restrict our sample to single-stranded RNA viruses only. Both positive- and negative-sense RNAs are considered. For future reference, we also include two retroviruses, HIV-1 and HIV-2.

The paper is organized as follows. A summary of the data and a description of the methods are given in the “Data and Methods” section. Results are presented in the “Results” section. Finally, a brief discussion is given in the “Discussion” section.

3. Results and discussion

3.1. Palindrome occurrence across the tree of life

We counted the occurrence of the 16 palindromic words of length 4 (Table 1), along with an equal number of non-palindromic words of length 4 (Table 1), in DNA sequences of selected genomes. Twenty different species were analysed for each of 10 different phylogenetic groups, i.e. the vertebrates, invertebrates, fungi, plants, protozoa, mitochondria, bacteria, archaea, double-stranded DNA viruses, and retroviruses. Perfect repeats were removed from the input sequences to avoid introduction of a trivial bias from regions of extremely low complexity such as telomeric or centromeric repeats. For each input DNA sequence and each 4-mer word, we then calculated the ratio R of the actual number of occurrences of the word to the expected number of occurrences, given its GC content and that of the input DNA sequence. Most of the palindromes were underrepresented (R < 1) across all genomes analysed. Overall, the palindromes exhibited a mean R of 0.86, in contrast to a mean R of 1.08 for the non-palindromic controls (Table 1). The underrepresentation of palindromes was most pronounced in vertebrate genomes, plants, double-stranded DNA viruses, and retroviruses (Fig. 1). Contrary to previous reports [20], palindromes were underrepresented even in mitochondrial genomes, demonstrating that the infrequence of palindromes in prokaryote genomes cannot solely be explained by the selective pressure exerted by restriction enzymes. Additional selective forces against palindromes might comprise their impact on DNA structure or their role as transcription factor-binding sites [17]. Whatever the underlying forces, short palindromes are underrepresented in all kinds of genomes (Fig. 1). Exactly which palindromes are underrepresented, and how strongly, depends on the source of the DNA. Interestingly, the inter-genome frequencies of short palindromes exhibit more than twice the variance of the non-palindromic control sequences (22 versus 9%; Table 1), whereas the intra-genome frequencies, e.g. between different chromosomes of the same organism, are uniform (Figs 2–4). This renders short palindromes optimally suited for the typing of DNA.
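The ratio R can be sketched in R as follows. The text does not spell out how the expected count is obtained, so the null model here is an assumption: bases are independent, with P(G) = P(C) = GC/2 and P(A) = P(T) = (1 − GC)/2 taken from the GC content of the input sequence. The function names and the random test sequence are illustrative only.

```r
bases <- c("A", "C", "G", "T")
comp  <- c(A = "T", C = "G", G = "C", T = "A")

# The 16 palindromic 4-mers: words equal to their own reverse complement.
halves <- expand.grid(b1 = bases, b2 = bases, stringsAsFactors = FALSE)
palindromes <- paste0(halves$b1, halves$b2, comp[halves$b2], comp[halves$b1])

# Count overlapping occurrences of a word in a sequence.
count_word <- function(dna, word) {
  starts <- seq_len(nchar(dna) - nchar(word) + 1)
  sum(substring(dna, starts, starts + nchar(word) - 1) == word)
}

# Ratio R of observed to expected occurrences (assumed null model, see above).
ratio_R <- function(dna, word) {
  chars <- strsplit(dna, "")[[1]]
  gc <- mean(chars %in% c("G", "C"))
  p <- c(A = (1 - gc) / 2, T = (1 - gc) / 2, G = gc / 2, C = gc / 2)
  n_positions <- nchar(dna) - nchar(word) + 1
  expected <- n_positions * prod(p[strsplit(word, "")[[1]]])
  count_word(dna, word) / expected
}

# A random sequence carries no selection, so R should scatter around 1.
set.seed(1)
random_dna <- paste(sample(bases, 20000, replace = TRUE), collapse = "")
R_values <- sapply(palindromes, ratio_R, dna = random_dna)
round(R_values, 2)
```

For a real genome, values of R below 1 would indicate underrepresentation of the corresponding palindrome.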

Frequency of palindromes throughout a diverse selection of genomes. Palindrome frequency is expressed as the ratio (R) of occurrence to expectancy. Palindromes are underrepresented (R < 1, dotted line) in all kinds of genomes, most strongly in vertebrates, plants, and viruses, and they exhibit about twice the inter-species variance in frequency (error bars) than non-palindromes. Twenty different genomes were analysed per group (see Section 2).

Examples of palindrome frequency patterns. Frequency of the 16 palindromes of length 4 in selected genomes, expressed as log2 of ratio (R) of actual to expected occurrence. Hierarchical clustering was performed based on the city-block distance [23]. (Top) Mean and variance by palindrome. (Bottom) The signals from three random sequences are shown for comparison.

Variance of palindrome frequencies in random DNA sequence of different lengths (n = 20 for each). The mean variance for each palindrome of length 4 across the 20 different sequences is compared with those across the first 20 human chromosomes (dotted grey line) and across the 20 different vertebrate chromosomes analysed in Fig. 1 (see Supplementary Table S1).

Case studies on Caenorhabditis spp. (A), mammalian chromosomes (B), and sensu stricto yeasts (C). Most of the chromosomes are correctly resolved by clustering based on palindrome frequency. Perfect tandem repeats were removed prior to analysis to avoid trivial differences from repetitive regions. Note the striking difference between vertebrate and invertebrate DNA.

3.2. Clustering of DNA based on palindrome frequency

Here, we represent a given DNA sequence by a vector of 16 numbers: for each of the 16 palindromes of length 4, the log2 of the ratio R of actual to expected frequency (given the GC content of the analysed DNA and that of the palindrome). When such vectors, generated from a diverse selection of DNA sequences, were aligned and hierarchically clustered based on the city-block distance, different DNA sequences of the same species readily grouped together (see Fig. 2 for a representative set of diverse genomes). The clustering worked for all kinds of genome sequences tested—eukaryote, prokaryote, plastid, or virus—but the topology of the resulting tree was not phylogenetically meaningful (Fig. 2). The lack of a large-scale phylogenetic signal was equally apparent from the analysis of the complete set of 200 genomes (Supplementary Fig. S1). The resolution of palindrome frequency clustering would increase further by using the 64 different palindromes of length 6. However, this would also require the input sequences to be longer. On the basis of the random sequences included in Fig. 2, the present approach appeared to work for sequences longer than about 10 kb. To obtain a better estimate on the minimally required size of input DNA, we analysed randomly generated sequences of increasing length (Fig. 3). Above 9 kb, the average variance of R per palindrome dropped below the value obtained for different vertebrate chromosomes (0.025, dashed grey line in Fig. 3). For comparison, the average variance of R per palindrome across human chromosomes was 0.0008 (dotted grey line in Fig. 3), demonstrating again that the variance of palindrome frequency is much lower intra- than inter-genome.
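The representation and clustering described here can be sketched in R with dist(method = "manhattan"), which is the city-block distance, and hclust(). Several details are assumptions rather than the authors' procedure: the expected counts use an independent-base model derived from GC content, average linkage is chosen arbitrarily, and the two toy "species" are simulated (species B loses 60% of its CG dinucleotides to TG) purely to give the clustering something to separate.

```r
bases <- c("A", "C", "G", "T")
comp  <- c(A = "T", C = "G", G = "C", T = "A")
halves <- expand.grid(b1 = bases, b2 = bases, stringsAsFactors = FALSE)
palindromes <- paste0(halves$b1, halves$b2, comp[halves$b2], comp[halves$b1])

# Vector of 16 numbers: log2 of observed/expected count for each palindrome.
palindrome_profile <- function(chars) {
  n <- length(chars)
  words <- paste0(chars[1:(n - 3)], chars[2:(n - 2)], chars[3:(n - 1)], chars[4:n])
  gc <- mean(chars %in% c("G", "C"))
  p <- c(A = (1 - gc) / 2, T = (1 - gc) / 2, G = gc / 2, C = gc / 2)
  sapply(palindromes, function(w) {
    expected <- (n - 3) * prod(p[strsplit(w, "")[[1]]])
    log2(sum(words == w) / expected)
  })
}

# Simulated fragment; optionally mutate 60% of CG dinucleotides to TG.
make_fragment <- function(n, deplete_cg) {
  chars <- sample(bases, n, replace = TRUE)
  if (deplete_cg) {
    cg <- which(chars[-n] == "C" & chars[-1] == "G")
    chars[cg[runif(length(cg)) < 0.6]] <- "T"
  }
  chars
}

set.seed(42)
fragments <- list(
  A1 = make_fragment(30000, FALSE), A2 = make_fragment(30000, FALSE),
  A3 = make_fragment(30000, FALSE), B1 = make_fragment(30000, TRUE),
  B2 = make_fragment(30000, TRUE),  B3 = make_fragment(30000, TRUE)
)

profiles <- t(sapply(fragments, palindrome_profile))  # one row per fragment
d <- dist(profiles, method = "manhattan")             # city-block distance
tree <- hclust(d, method = "average")                 # linkage is an assumption
groups <- cutree(tree, k = 2)
groups
plot(tree)
```

In this toy setting the fragments of each simulated species should fall into the same cluster, mirroring the grouping of sequences from the same species reported above.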

Since invertebrates exhibited the smallest inter-genome variance of palindrome frequency (Fig. 1), we chose Caenorhabditis species to challenge the method's power of discrimination. The complete nuclear genomes of C. briggsae and C. elegans were compared as described above and all the chromosomes were correctly resolved in spite of the weak patterns (Fig. 4A). Clustering based on palindrome frequency also segregated different mammalian chromosomes which, in contrast to invertebrate DNA, showed the characteristic pattern caused by strong underrepresentation of palindromes containing a CG dinucleotide (ACGT, TCGA, CCGG, GCGC, and CGCG; Fig. 4B). This is in agreement with the model that in vertebrates, DNA methylation is restricted to cytosines followed by guanine (CpG), whereas in invertebrates, cytosines are methylated in a wider context [25]. Spontaneous mutation of the palindromic CG to the non-palindromic TG by deamination of methylated cytosine thus eliminates short palindromes from vertebrate DNA. The limit of resolution of palindrome frequency clustering was reached with a data set of highly similar sensu stricto yeasts [26]. The different chromosomes of the closely related species Saccharomyces cerevisiae, S. bayanus, S. mikatae, and S. kudriavzevii did not segregate perfectly, whereas those of the more distantly related S. castellii did (Fig. 4C).

Clustering based on palindrome frequency also worked for prokaryotes, generating species-specific patterns for archaea as well as bacteria. Prokaryote genomes exhibited highly diverse patterns (Supplementary Fig. S1). Natural plasmids of Escherichia coli clearly clustered with the host DNA (Fig. 5A). The same applied to certain dsDNA bacteriophages such as Lambda or P2. However, other dsDNA phages such as T3, as well as all analysed ssDNA phages, did not exhibit the same palindrome frequency patterns as E. coli (Fig. 5A). An interesting picture emerged when comparing human viruses: while all ssRNA minus-strand viruses and the retro-transcribing HIV clustered with human DNA, dsDNA viruses and ssRNA plus-strand viruses did not (Fig. 5B).

Palindrome frequency patterns of host genomic DNA (A, E. coli; B, Homo sapiens; labelled in black) and associated viruses (colour-coded according to nucleic acid type of the genome) or plasmids (grey).

3.3. Potential application to metagenomics

The quickly developing field of environmental shotgun sequencing allows metagenomic analyses of communities of microorganisms, the majority of which cannot be cultured in the lab and have therefore remained undetected until recently [27]. A key challenge in interpreting environmental shotgun sequencing data is the binning of non-overlapping DNA scaffolds into groups which, ideally, correspond to the different species of microorganisms present [28]. Standard methods such as similarity searches to known genomes or phylogenetic analysis of marker genes are of limited use when dealing with DNA fragments sampled from previously undescribed species [28]. Di-, tri- and tetra-nucleotide frequencies have been proposed to provide DNA signatures [29–31]. Since palindrome frequencies carry a species-specific signal (Figs 2 and 4), the ratios of occurrence to expectancy as applied here may also be useful to bin environmental shotgun sequencing data, provided that the contigs to be analysed are longer than 9 kb (Fig. 3). From the 2007 Sorcerer II Global Ocean Sampling Expedition, which at that time predominantly produced novel sequences [32], the hundred largest contigs, sized between 11 and 59 kb, were analysed as described above. This revealed a diverse picture of palindrome frequency patterns with several major clusters (Supplementary Fig. S2). However, the analysed sequences still did not return high-quality hits when searched with blastn [33] against the NCBI non-redundant nucleotide collection, with only one exception of 99% identity to Prochlorococcus phage P-SSM4 (GenBank accession no. AY940168). Thus, it was not possible to assess the benefit of palindrome frequency clustering with this data set. To nevertheless test the potential of the method, we randomly selected 10 non-overlapping fragments of length 10 kb from each of the 20 different bacterial genomes analysed in Fig. 1 (Supplementary Table S1). When these 200 sequences were clustered according to palindrome frequency patterns, over 90% of them correctly assembled according to species of origin.

Results

HT-SELEX experimental data provide accurate M-word scores for diverse TF families

We analyzed HT-SELEX data, including 548 experiments covering 410 human and mouse proteins from 40 different TF families, to produce M-word binding scores. Increased sequencing depth allowed us to derive accurate scores for longer M-words. This aspect is particularly important because DNA shape is affected by the flanking regions of TFBSs. Therefore, we augmented the original dataset (Jolma et al, 2013) with additional sequencing to increase the read depth of the experiments by almost 10-fold (from an average of 168,000 reads per sequencing file to 1,656,000 reads). Experimental data were filtered by rigorous quality control (QC) criteria to identify cases with sufficient library complexity and read counts to allow the building of multiparametric models. A total of 218 TFs from 29 families passed the first filter based on high variability and large sample size of the data, and a total of 215 TFs from 27 different families passed the QC step based on regression performance (Fig 1).

Figure 1. Pipeline used to generate HT-SELEX M-word scores and filter datasets

For each TF, we selected a core-binding motif, to enable identification of the most probable binding site within M-words and filter out oligonucleotides that are likely to be unbound. The motifs used were derived from a previous study (Jolma et al, 2013). These motifs generally contain long flanks in addition to the core consensus sequence, which would prevent us from getting robust M-word scores due to low read coverage for long sequences. To overcome this difficulty, we used motifs from the catalogue compiled by Weirauch and Hughes (Weirauch & Hughes, 2011) to identify and use only the core positions. We calculated the binding score for each M-word that included the core motif in the center (allowing for a few mismatches) and any possible flanking sequences 5′ and 3′ of the motif. We sought to avoid the possibility of cooperative TF–DNA binding, in which multiple copies of the TF occupy different DNA-binding sites (BSs) on the same sequence, as well as to minimize noise caused by inaccurate alignment of M-words based on the core motif. Thus, we excluded HT-SELEX reads that contained multiple instances of the core motifs.

Next, we derived M-word binding scores based on observed experimental enrichment. Each HT-SELEX experiment included several rounds of binding site (BS) selection by the TF, with the binding specificity of selected DNA sequences increasing in each round. We calculated the M-word score as the ratio of the frequency of the M-word in round i over its estimated frequency in the initial round, using a fifth-order Markov model (Slattery et al, 2011). The final output of this process was the M-word scores of the core sequence and its flanks for each HT-SELEX experiment (Appendix Fig S1A).
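The scoring step can be illustrated with a small R sketch. It is a simplified reading of the description above and not the authors' pipeline: the reads are simulated, the 12-mer and all function names are hypothetical, and the Markov model is fitted by plain counting without smoothing or pseudocounts. The estimated initial frequency of a word is P(first five bases) multiplied by P(next base | previous five bases) for each remaining position.

```r
set.seed(11)
bases <- c("A", "C", "G", "T")

# Toy stand-ins for sequencing reads (not real HT-SELEX data).
random_reads <- function(n, len) {
  replicate(n, paste(sample(bases, len, replace = TRUE), collapse = ""))
}

# Table of all overlapping words of length k across a set of reads.
count_words <- function(reads, k) {
  table(unlist(lapply(reads, function(r) {
    starts <- seq_len(nchar(r) - k + 1)
    substring(r, starts, starts + k - 1)
  })))
}

# Look up a count by word, returning 0 for unseen words.
lookup <- function(tab, key) if (key %in% names(tab)) as.numeric(tab[key]) else 0

# Estimated frequency of a word under a Markov model of the given order,
# trained on the initial-round reads.
markov_prob <- function(word, reads0, order = 5) {
  ext <- count_words(reads0, order + 1)                              # (order+1)-mers
  ctx <- tapply(as.vector(ext), substr(names(ext), 1, order), sum)  # contexts
  prob <- lookup(ctx, substr(word, 1, order)) / sum(ctx)
  for (j in seq_len(nchar(word) - order)) {
    kmer <- substr(word, j, j + order)
    prob <- prob * lookup(ext, kmer) / lookup(ctx, substr(kmer, 1, order))
  }
  prob
}

# Observed frequency of a word among all windows of the same length.
observed_freq <- function(word, reads) {
  counts <- count_words(reads, nchar(word))
  lookup(counts, word) / sum(counts)
}

m_word  <- "ATCACGTGATCG"                  # hypothetical 12-mer
round0  <- random_reads(5000, 20)          # simulated initial library
round_i <- random_reads(2000, 20)          # simulated selected library
substr(round_i[1:200], 5, 16) <- m_word    # plant the word in 10% of reads

score <- observed_freq(m_word, round_i) / markov_prob(m_word, round0)
score
```

A score well above 1 indicates that the word is enriched in round i relative to what the initial library would predict.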

To evaluate the accuracy of our M-word scoring scheme and the value of deeper sequencing, we compared scores derived by HT-SELEX to those measured by genomic-context PBMs (gcPBMs). The gcPBMs use arrays specifically designed with the core sequence in the center, flanked by a genomic context (Gordân et al, 2013). These probes are intended to measure the effect of flanking sequences and, therefore, provide an accurate gold standard for long M-word (M ≥ 12) binding scores. The only protein for which both gcPBM and HT-SELEX experimental data exist was the Max homodimer (Zhou et al, 2015). Appendix Fig S1B shows the good correlation (r = 0.64) of 12-word scores produced by the two technologies, demonstrating the accuracy of our process in producing M-word scores from HT-SELEX data. To test how much we gain with respect to gcPBM binding scores by using the new data, we examined three different M-word scores: frequency, ratio compared with the initial round, and ratio compared with the estimated initial round. Deeper sequencing improved the correlation of these three scores to gcPBM 12-word scores, and the ratio-to-estimated score achieved the highest correlation (Appendix Fig S1C). Notably, when processing the data previously published in Jolma et al (2013) with the same pipeline, only 22 proteins passed the quality control, compared with 218 with the higher coverage, showing the advantage of deeper sequencing.

Principal component analysis (PCA) reveals TF family-specific DNA-binding specificities and heterogeneities within TF families

We performed PCA to visualize TF family-specific DNA-binding specificities. The DNA-binding preference of each TF was represented by the DNA M-word with the highest binding affinity for this TF. We encoded this M-word into numeric feature vectors that included (i) only mononucleotide (i.e., 1-mer) features, and (ii) both 1-mer and DNA shape features. DNA shape features include minor groove width (MGW), Roll, propeller twist (ProT), and helix twist (HelT) and are predicted with our DNAshape approach (Zhou et al, 2013). Figure 2A and B shows the first two principal components obtained using each feature vector.

Figure 2. PCA reveals different DNA-binding specificities between TF families

A. PCA using 1-mer features. Each dot represents a TF. Dots of the same color belong to the same TF family. An ellipse was drawn for each TF family. The ellipse is a contour of a fitted two-variate normal distribution that encloses 0.68 probability (R package default).
B. PCA using 1-mer and shape features, annotated in the same way as described in (A).
C. Boxplots of inter- and intra-family TF distances derived from (A). Difference between medians of inter- and intra-family distances is 2.02 (red).
D. Boxplots of inter- and intra-family TF distances derived from (B). Difference between medians of inter- and intra-family distances is 3.68 (red).

Different TF families tended to form distinct clusters in the PCA scatter plots. To compare the clustering quality in the two plots, we obtained the two-dimensional Euclidean distances between all pairs of TFs from Fig 2A and B. Distances were classified as intra- or inter-family and visualized as boxplots (Fig 2C and D). Inter-family distances were generally larger than intra-family distances. When we used both 1-mer and DNA shape features, the difference between the medians of the inter- and intra-family groups was slightly larger than the difference obtained when using 1-mer features alone (Fig 2C and D). This result was consistent with Fig 2A and B, indicating that more variance could be explained by introducing DNA shape features, in part due to the better separation of the homeodomain family (Fig 2B). To test whether such effects were simply due to the higher dimensionality introduced by the additional DNA shape features, we added randomly generated shape features based on Gaussian distribution with mean and standard deviation of the original shape features. Both the variance explained and the distance between intra- and inter-family groups were lower in this test (Appendix Fig S2).
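The distance comparison can be sketched in R with prcomp() and dist(). The feature matrix below is simulated and the family labels are hypothetical; centering and scaling before PCA are also assumptions, since the text does not state the preprocessing.

```r
set.seed(7)

# Toy feature matrix: 3 hypothetical TF families, 10 TFs each, 8 features.
family   <- rep(c("famA", "famB", "famC"), each = 10)
centers  <- matrix(rnorm(3 * 8, sd = 3), nrow = 3)
features <- centers[rep(1:3, each = 10), ] + matrix(rnorm(30 * 8), nrow = 30)

# First two principal components for each TF.
pcs <- prcomp(features, center = TRUE, scale. = TRUE)$x[, 1:2]

# Pairwise two-dimensional Euclidean distances, each pair counted once.
d <- as.matrix(dist(pcs, method = "euclidean"))
pair <- upper.tri(d)
same_family <- outer(family, family, "==")

intra <- d[pair & same_family]    # both TFs from the same family
inter <- d[pair & !same_family]   # TFs from different families

median_gap <- median(inter) - median(intra)
median_gap
boxplot(list(intra = intra, inter = inter), ylab = "Distance in PC1-PC2 plane")
```

A larger median_gap corresponds to better separation of families in the PCA plane, which is the quantity compared between the two feature sets.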

DNA shape features improve modeling of DNA-binding specificities across TF families

We tested the importance of the recognition of DNA shape by each TF through quantitative modeling of DNA-binding specificities and comparison of model performance in terms of the R^2 between predicted and experimental M-word scores. Similar to the methodology in Yang et al (2014) and Zhou et al (2015), we built regression models that used only DNA mononucleotide features (i.e., 1mer models) or that combined DNA mononucleotide and shape features (i.e., 1mer+shape models). A result in which the 1mer+shape model outperforms the 1mer model indicates that DNA shape readout might play a role in TF binding.

Based on an analysis of 215 TFs from 27 different families, we found that 1mer+shape models generally outperformed 1mer models (Fig 3A), indicating the prevalence of DNA shape readout across different TF families (for a complete list of datasets used in Fig 3, see Table EV1). With DNA sequence readout playing a dominant role in TF binding, the importance of DNA shape recognition as an additional contribution varied both between and within TF families. For example, model performance for homeodomain TFs was generally more substantially improved than for C2H2 TFs. Within the homeodomain TF family, there was a large variance among individual members. Homeodomain and bHLH TFs have been previously observed to be sensitive to DNA shape features (Slattery et al, 2011; Gordân et al, 2013; Yang et al, 2014; Zhou et al, 2015). Here, we confirmed and extended this observation to the bZIP, CENPB, CP2, CUT, ETS, HSF, IRF, MYB, NFAT, nuclear receptor, PAX, POU, PROX, TBX, and TEA TF families. At least half of the members in each of these families, covered by our data, showed greater than 10% performance improvement when DNA shape features were added to the model. However, some families were underrepresented in the data with only one TF present (Table EV1; for full names and detailed information of the TF families, see Table EV2).

Figure 3. Performance comparisons between models using different features

A. Comparison between 1mer and 1mer+shape models.
B. Comparison between shape models that are based on the original DNAshape method (Zhou et al, 2013) and randomly shuffled pentamer query tables.
C. Comparison between 1mer+2mer+3mer and 1mer+shape models.
D. Comparison between 1mer+2mer+3mer and 1mer+shape+3merE2 models. The label 3merE2 represents 3mer features from the two end positions at the 5′ and 3′ termini of each DNA sequence.
E. Comparison between 1mer+2merNoE2+3merNoE2 and 1mer+shape models. The labels 2merNoE2 and 3merNoE2 indicate that 2mer and 3mer features, respectively, were removed from the end positions.
F. Comparison between 1mer+shape and 1mer+shape+3merE2 models.

To test the robustness of the experimental data and our computational pipeline, we repeated the above analysis on replicate experimental data for three TFs from the bHLH and homeodomain families. Our results consistently showed contributions of DNA shape readout for these two families (Appendix Fig S3A). To test whether the performance gain is simply a result of the increased number of model parameters due to the added DNA shape features, we shuffled the query table for DNA shape features. Shape models based on the shuffled query table generally have poorer performance than those based on the original query table (Fig 3B). We also tested whether the results were robust to the motif seeds used during data preprocessing. We repeated the above analyses using the Weirauch and Hughes seeds (Weirauch & Hughes, 2011) as the final seeds instead of using them for identifying the core positions of the HT-SELEX-based motifs published by Jolma et al (2013). We calculated Pearson's correlation coefficients between the performance of models that were based on the Weirauch and Hughes seeds (Weirauch & Hughes, 2011) and the Jolma et al (2013) seeds. The high correlation between the two sets of motif seeds indicated that the results were robust to the choice of motif seeds (Appendix Fig S3B). We also tested the robustness of the results under slight changes in the mismatch threshold (see 4) and length of the flanking regions. Both tests showed high correlation between different parameter settings, demonstrating sufficient robustness (Appendix Fig S3C and D).

The homeodomain TFs in this study presumably bind DNA as monomers, whereas our previous studies demonstrated the importance of DNA shape for Exd–Hox heterodimers (Slattery et al, 2011). X-ray and nuclear magnetic resonance (NMR) structures of homeodomain DNA-binding domains in complex with DNA repeatedly show that the N-terminal tail of the homeodomain DNA-binding domain interacts with the DNA through minor groove and backbone contacts, which is a signature of DNA shape readout (Joshi et al, 2007).

DNA shape features in flanking regions are important for different TF families

We previously observed that 1mer+2mer+3mer models usually outperform 1mer+shape models (Zhou et al, 2015). Here, we gained additional clues for possible explanations of this observation. As noted previously (Zhou et al, 2015), both 2-mer and 3-mer features are indirect representations of DNA shape characteristics. The 2-mer features describe stacking interactions between adjacent base pairs, whereas 3-mer features describe short structural elements, such as A-tracts that tend to form narrow minor groove regions. Thus, it is not surprising that 1mer+2mer+3mer models can capture TF–DNA binding specificities with high accuracy.

Using our high-quality HT-SELEX data, we observed that, for most TFs, 1mer+2mer+3mer models outperformed 1mer+shape models (Fig 3C). As our prediction of local DNA shape features was based on a sliding window of 5 base pairs (Zhou et al, 2013 ), we were unable to predict shape features for the two extreme positions at the 5′ and 3′ ends of each DNA sequence. This limitation could give an edge to 1mer+2mer+3mer models. However, we could encode 2-mer and 3-mer features for those terminal positions, which in turn would work as a proxy for DNA shape. To test this hypothesis, we added 3-mer features from only the two end (E2) positions (i.e., 3merE2 features) to the 1mer+shape model. Performance of the resulting 1mer+shape+3merE2 model was indeed comparable to that of the 1mer+2mer+3mer model (Fig 3D). As an additional test, we removed 2-mer and 3-mer features at the end positions from the 1mer+2mer+3mer model, which resulted in the 1mer+2merNoE2+3merNoE2 model that showed similar performance to the 1mer+shape model (Fig 3E).
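The end-position limitation described above can be sketched in a few lines. This is a minimal illustration, assuming 0-based positions and a window centered on each nucleotide. The function names are hypothetical and are not part of the authors' pipeline.

```python
def shape_positions(seq_len: int, window: int = 5) -> list[int]:
    """Positions that have a full window of neighbours on both sides."""
    half = window // 2
    return list(range(half, seq_len - half))


def end_positions(seq_len: int, window: int = 5) -> list[int]:
    """Positions too close to either end for a centered window to fit."""
    half = window // 2
    return list(range(half)) + list(range(seq_len - half, seq_len))


# For a 10-bp sequence and a 5-bp window, two positions are lost at each end.
covered = shape_positions(10)
missing = end_positions(10)
print(covered)  # [2, 3, 4, 5, 6, 7]
print(missing)  # [0, 1, 8, 9]
```

In this picture, the 3merE2 features are sequence features taken only from the positions returned by `end_positions`.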

We also hypothesized that if longer flanking sequences were available for predicting shape features, then 1mer+shape models would perform similarly to 1mer+2mer+3mer models without adding 3merE2 features. To verify this possibility, we used an independent dataset generated by the gcPBM platform (Zhou et al, 2015). As expected, 1mer+shape models performed comparably to 1mer+2mer+3mer models for the data without additional 3merE2 features (Appendix Fig S3E). These results imply that DNA shape features in the flanking regions contribute to TF–DNA binding specificities, which was previously known for bHLH TFs (Gordân et al, 2013; Yang et al, 2014; Zhou et al, 2015). Here, we showed for the first time that this phenomenon is of general nature, as adding 3merE2 features as a proxy for missing DNA shape features consistently improved the model performance for various TF families (Fig 3F).

Beyond better interpretability of shape-augmented models, an important distinction between the models is the different number of features required to achieve similar performance. The 1mer+shape model requires 12 features (including second-order DNA shape features) per nucleotide position compared with the 84 features required by the 1mer+2mer+3mer model per nucleotide position (Zhou et al, 2015). Although we previously included lower-order 1-mers and 2-mers in our 1mer+2mer+3mer models for reasons of interpretability, the 3-mer features actually contain all of the information of the 1-mers and 2-mers. Thus, a 3mer model is equivalent to a 1mer+2mer+3mer model (4 and Appendix Fig S3F). This choice, however, would still leave the 3mer model with 64 required features per nucleotide position compared with a maximum of only 12 features in the 1mer+shape model.
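The feature counts quoted above follow from simple arithmetic, sketched here. The split of the 12 shape-model features into 4 one-hot 1-mer features, 4 first-order shape features, and 4 second-order shape features is an assumption made for illustration. The text states only the total.

```python
ALPHABET_SIZE = 4  # A, C, G, T


def kmer_feature_count(k: int) -> int:
    """Number of one-hot features needed to encode one k-mer."""
    return ALPHABET_SIZE ** k


def model_feature_count(ks) -> int:
    """Features per position for a model that combines several k-mer orders."""
    return sum(kmer_feature_count(k) for k in ks)


# Assumed breakdown of the shape part (not stated explicitly in the text).
SHAPE_FIRST_ORDER = 4   # MGW, Roll, ProT, HelT
SHAPE_SECOND_ORDER = 4  # one second-order term per shape category

counts = {
    "1mer+2mer+3mer": model_feature_count([1, 2, 3]),
    "3mer": model_feature_count([3]),
    "1mer+shape": model_feature_count([1]) + SHAPE_FIRST_ORDER + SHAPE_SECOND_ORDER,
}
print(counts)  # {'1mer+2mer+3mer': 84, '3mer': 64, '1mer+shape': 12}
```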

Feature selection can provide insights into TF–DNA readout mechanisms

We performed feature selection to identify BS positions where DNA shape features contribute to TF-binding specificities. The method is similar to the one we previously introduced for the analysis of SELEX-seq data for Hox proteins (Abe et al, 2015). For each TF, we evaluated the $$R^2$$ performance of the baseline 1mer model, denoted $$R^2_{1mer}$$. Next, we evaluated models that combined 1-mer features with DNA shape features individually at single nucleotide positions i, denoted 1mer+shape_i models. We denoted the performance as $$R^2_{1mer+shape_i}$$. We calculated the difference in model performance $$\Delta R^2_i = R^2_{1mer+shape_i} - R^2_{1mer}$$ for each nucleotide position i (Fig 4A). The ratio $$\Delta R^2_i / R^2_{1mer}$$ indicates the percentage change in performance due to the availability of DNA shape features at nucleotide position i, with a positive ratio suggesting performance gain. The ratio at position i compared with other positions reflects the relative importance of DNA shape features at different nucleotide positions. We visualized the ratio as a function of position i for each TF in the form of a heat map (Fig 5A and Appendix Fig S4).
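A minimal sketch of the per-position ratio described above, assuming the R² values have already been obtained from fitted models. The function name and the example numbers are hypothetical and serve only to show the arithmetic.

```python
def shape_gain_ratios(r2_1mer: float, r2_with_shape_at: list[float]) -> list[float]:
    """Relative change in R^2 when shape features are added at each position i.

    r2_1mer          -- R^2 of the baseline sequence-only model
    r2_with_shape_at -- R^2 of the model with shape features added at position i
    """
    if r2_1mer <= 0:
        raise ValueError("baseline R^2 must be positive to form a ratio")
    return [(r2_i - r2_1mer) / r2_1mer for r2_i in r2_with_shape_at]


# Hypothetical values: the baseline explains 50% of the variance.
ratios = shape_gain_ratios(0.50, [0.50, 0.55, 0.60])
print([round(r, 3) for r in ratios])  # [0.0, 0.1, 0.2]
```

One row of such ratios per TF, stacked across TFs, gives the kind of matrix that a heat map would display.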

Figure 4. Schematic representation of feature-selection process

1. Feature-selection scheme for adding DNA shape features at one individual position to a sequence-only model.
2. Feature-selection scheme for removing DNA shape features from one single position from a shape-only model.

Figure 5. Importance of DNA shape features as a function of nucleotide positions revealed by feature selection with machine learning

1. Heat map based on adding DNA shape features to a sequence-only model.
2. Heat map based on removing DNA shape features from a shape-only model.
3. Combined heat map that takes cell-by-cell minimum of heat maps in (A and B).

To avoid interference from DNA sequence information, we devised a second feature-selection approach in which we removed DNA shape features at individual positions from a shape-only model. The ratio $$\Delta R^2_{-i} / R^2_{shape}$$ was then used for generating the heat map (Figs 4B and 5B, and Appendix Fig S4), where $$\Delta R^2_{-i} = R^2_{shape} - R^2_{shape-shape_i}$$ is the performance lost when the shape features at position i are removed. These two different approaches can sometimes yield conflicting heat maps as discussed below. To address such cases and facilitate the use of these heat maps, we also generated a combined heat map based on the cell-by-cell minimum of the two heat maps (Fig 5C and Appendix Fig S4). Quantitative information about the importance of the position-dependent DNA shape in TF–DNA recognition at single-base pair resolution provides the means to determine the structural protein–DNA readout mechanisms based on sequence data. To achieve this goal, we further expanded our feature-selection method to test each individual DNA shape feature category, which enabled us to gauge the importance of each DNA shape feature, that is, MGW, Roll, ProT, or HelT, at every position (Appendix Fig S5). To date, obtaining such information required experimentally solved structures.
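The combined heat map is a cell-by-cell minimum of two equally sized matrices, which can be sketched as follows. Rows are taken to be TFs and columns to be nucleotide positions. The function name and the example values are hypothetical.

```python
def combine_heat_maps(added: list[list[float]], removed: list[list[float]]) -> list[list[float]]:
    """Cell-by-cell minimum of two heat maps with identical dimensions."""
    if len(added) != len(removed) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(added, removed)
    ):
        raise ValueError("heat maps must have the same shape")
    return [
        [min(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(added, removed)
    ]


# Hypothetical importance values for two TFs at three positions.
added_map = [[0.10, 0.30, 0.05], [0.20, 0.00, 0.15]]
removed_map = [[0.02, 0.25, 0.08], [0.20, 0.10, 0.01]]
print(combine_heat_maps(added_map, removed_map))
# [[0.02, 0.25, 0.05], [0.2, 0.0, 0.01]]
```

Taking the minimum is conservative: a cell stays prominent in the combined map only when both approaches agree that the position matters.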

Figure 5 shows the position-dependent DNA shape importance for homeodomain TFs that recognize a TAAT motif. For most of these TFs, DNA shape was more important at the 3′ side of the core motif, as indicated by the darkness of colors (Fig 5). Homeodomain TFs that recognize a different motif, for example, TCRTAAA, were shown to have a different positional DNA shape preference (Appendix Fig S4F). Positional preferences were also protein-family specific. For example, for bHLH TFs DNA shape features in both flanking regions were important, whereas for nuclear receptors that bind to an ACANNNTGT motif the central motif region was generally important (Appendix Fig S4A and H). In comparison, bZIP TFs that bind to a TTRCGC motif and homeodomain TFs were generally sensitive to DNA shape features at only one flanking side of the core motif (Appendix Fig S4B and F).

The exact positions where DNA shape features are important were not unambiguously pinpointed for the bHLH TFs and the nuclear receptors that bind to an ACANNNTGT motif (Appendix Fig S4A and H). Both Appendix Fig S4A and H relate to a scenario where the red heat map shows prominent shape effects in multiple consecutive positions, whereas the blue heat map shows almost no effects. We believe that this is due to false positives in the red heat map, that is, positions that are not important for shape readout but identified as such, and false negatives in the blue heat map, that is, positions that are important for shape readout that were not identified. We conclude in this case that DNA shape is important in some positions in the consecutively red regions, but we failed to locate it, even with the help of the blue heat map.

We illustrated the relevance of feature importance heat maps derived from feature-selection approaches by considering experimental structures of the homeodomain proteins PITX2 (PDB ID 2LKX) and GBX1 (PDB ID 2ME6) in complex with DNA (Fig 6A and B). These structures provide possible explanations for entries representing PITX3 and GBX1 on the heat maps (Fig 5). As no experimental structure for PITX3 is available, we used an NMR structure for PITX2 (Chaney et al, 2005 ), which shares the same DNA-binding domain as PITX3. In the heat maps, PITX3 has darker colors at the 3′ side of the TAAT motif, indicating a more important role of DNA shape at these positions. In the PITX2 structure, the N-terminal tail of the protein interacts with DNA in the minor groove of the TAAT motif. The structure contains a narrow minor groove region near the second A within the TAAT motif (Fig 6A). In this case, the protein might exploit the DNA structural characteristics at positions highlighted in the heat maps to achieve its binding specificity.

Figure 6. Three-dimensional structure and DNA sequence and shape logos for the homeodomain TFs PITX2/PITX3 and GBX1

1. NMR structure of PITX2 in complex with DNA (PDB ID 2LKX) and the CURVES (Lavery & Sklenar, 1989 ) derived plot for the MGW of the bound DNA.
2. NMR structure of GBX1 in complex with DNA (PDB ID 2ME6) and the CURVES (Lavery & Sklenar, 1989 ) derived plot for the MGW of the bound DNA.
3. DNA sequence and shape logos for PITX3.
4. DNA sequence and shape logos for GBX1.

We observed similar concurrence between heat map and structural analyses for the TF GBX1, where the structure has a narrow minor groove region at the 3′ flank (Fig 6B). Although the positions indicated by the heat maps do not match the positions in the structure exactly, the heat maps successfully highlighted those nearby positions. Moreover, the heat maps were consistent with our conclusion that DNA shape features in flanking regions are important for TF–DNA binding specificities (Fig 3D–F). In addition to the homeodomain family, we used a structure of the human progesterone receptor (PDB ID 2C7A) from the nuclear receptor family to illustrate how the heat maps can provide hints to the structural mechanisms of protein–DNA binding. In the structure (Roemer et al, 2006), MGW, Roll, and ProT show distinct characteristics in the central region of the DNA-binding site, which potentially explains the central “red” regions in the heat maps (Appendix Fig S6).

DNA shape logos represent structural readout mechanisms

To visualize the detailed DNA shape preferences of individual TFs, we propose a new visualization, DNA shape logos, analogous to sequence logos for PWMs. In these logos, we used the letters H, M, P, and R to represent DNA shape features HelT, MGW, ProT, and Roll, respectively. The height of each letter indicates the importance derived from the feature-selection analysis for the corresponding DNA shape feature at a specific position (Fig 6). As an example, we used $$\Delta R^2$$, that is, the performance gain due to adding an individual DNA shape feature to a 1mer model, to generate shape logos for PITX3 and GBX1 (Fig 6C and D). For PITX3, a prominent M at positions 7, 8, 9, and 10 overlaps with the narrow minor groove region in the structure. Similarly, for GBX1, a prominent M at positions 7 and 8 overlaps with the narrow minor groove in the structure. DNA shape information was missing for the two nucleotide positions at each end of the TFBS; thus, no letters are shown at these positions in the shape logo. DNA shape logos can facilitate the integration of structural information in motif finding tools. Sequence and shape logos for all the TFs studied in this work are provided as Datasets EV1 and EV2, respectively.


Repetitive elements in DNA sequences consist of two or more copies of approximate patterns of nucleotides and are abundant in both prokaryotic and eukaryotic genomes. Over two-thirds of the human genome and 5–10% of bacterial genomes are repetitive regions (de Koning et al., 2011). Repetitive elements play important roles in genome structure and in functions such as nucleoprotein complex formation, chromosome structure, and gene expression. Various diseases, including cancer and neurodegenerative disease, can also arise from changes in repetitive elements. The distribution of repetitive DNA sequences can be used as a fingerprint of bacterial genomes (Versalovic et al., 1991) and of human individuals.

Repetitive elements are complex structures. They may exist as imperfect tandem repeats, insertion and deletions in repeats, interspersed repeats, and palindromic sequences, etc. These partial and hidden repeat signals in DNA sequences are difficult to analyze through straightforward observation and sequence comparison.

Currently, repetitive elements and hidden periodicities of DNA and protein sequences are primarily detected by digital signal processing and statistical approaches (Treangen and Salzberg, 2011). In most signal processing methods, DNA sequences are converted to numerical sequences, and the hidden periodicities arising from repetitive elements can be identified by the Fourier power spectrum at specific periodicities (Yin and Wang, 2016). Commonly used signal processing methods based on the Fourier transform include SRF maps (Sharma et al., 2004), spectral analysis (Buchner and Janjarasjitt, 2003), the Ramanujan-Fourier transform (Yin et al., 2015), and the periodic power spectrum method (Yin and Wang, 2016). The statistical methods are based on distribution analysis of nucleotides in DNA sequences. Common statistical methods for repeat finding include tandem repeats finder (Benson, 1999), statistical spectrum (Epps et al., 2011), the approach of Arora and Sethares (2007), and information decomposition (Korotkov et al., 2003). Besides signal processing and statistical approaches, sequence alignment tools such as RepeatMasker are also used to identify repetitive patterns in genomes, but they require a known reference repeat sequence.
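As a rough illustration of the signal-processing route described above, the sketch below converts a DNA sequence into four binary indicator sequences (one per nucleotide), computes a plain discrete Fourier transform of each, and sums the power. It is a generic textbook construction, not an implementation of any of the cited methods, and the function names are hypothetical. The direct O(n²) transform is only suitable for short sequences.

```python
import cmath


def power_spectrum(seq: str) -> list[float]:
    """Summed DFT power of the A, C, G and T indicator sequences at each frequency k."""
    n = len(seq)
    spectrum = []
    for k in range(n):
        total = 0.0
        for base in "ACGT":
            # DFT coefficient of the indicator sequence for this base.
            coeff = sum(
                cmath.exp(-2j * cmath.pi * k * t / n)
                for t, symbol in enumerate(seq)
                if symbol == base
            )
            total += abs(coeff) ** 2
        spectrum.append(total)
    return spectrum


def dominant_period(seq: str) -> float:
    """Period n/k of the strongest non-zero frequency (k = 0 is the mean and is skipped)."""
    n = len(seq)
    spectrum = power_spectrum(seq)
    k_best = max(range(1, n // 2 + 1), key=lambda k: spectrum[k])
    return n / k_best


print(dominant_period("ACG" * 10))  # 3.0
```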

Despite significant advances in repeat finding, it is still difficult to precisely capture the essential features of repetitive elements such as consensus patterns, perfect levels and copy numbers of repeats. For example, while the Fourier transform is the most commonly used approach for finding repeats, the strength of the Fourier power spectrum may not correlate exactly with the perfect level of repeat patterns. Furthermore, since the Fourier power spectrum is weak for short DNA sequences and long harmonious periodicities are embedded in short periodicities, the Fourier transform cannot capture repeats in short DNA sequences or long harmonious periodicities. Moreover, the relationship between repetitive elements and periodicities of genomes is not fully understood. Thus there is considerable room for improving the accuracy of identifying repetitive elements and for better understanding the relationship between periodicities and repeats in DNA sequences (Suvorova et al., 2014; Epps et al., 2011; Illingworth et al., 2008).

In this paper, we present an ab initio method to quantitatively identify repetitive sequences and periodicities in DNA sequences. The method is based on the nucleotide distribution uniformity at periodic positions in DNA sequences or genomes. The distribution uniformity of nucleotides reflects the imbalance of nucleotide frequencies at periodic positions and thus can indicate the strength of periodic signals in DNA sequences. The method can also reveal the consensus repeat pattern for the major periodicity of DNA sequences, and quantitatively determine the perfect level and copy numbers of repetitive sequences. The proposed method also formulates the relationship between repetitive elements and the corresponding periodicities in DNA sequences.
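The idea of looking at nucleotide frequencies at periodic positions can be illustrated with a small sketch. The score below (mean squared deviation of per-phase frequencies from the overall frequencies) is an assumption chosen for illustration only. It is not the uniformity measure defined by the authors, and all function names are hypothetical.

```python
from collections import Counter


def periodic_counts(seq: str, period: int) -> list[Counter]:
    """counts[j][base] = occurrences of base at positions t with t % period == j."""
    counts = [Counter() for _ in range(period)]
    for t, base in enumerate(seq):
        counts[t % period][base] += 1
    return counts


def nonuniformity(seq: str, period: int) -> float:
    """Illustrative score: how far per-phase base frequencies deviate from overall ones."""
    overall = Counter(seq)
    score = 0.0
    for phase in periodic_counts(seq, period):
        phase_total = sum(phase.values())
        for base in "ACGT":
            diff = phase[base] / phase_total - overall[base] / len(seq)
            score += diff * diff
    return score / period


def consensus(seq: str, period: int) -> str:
    """Most frequent base at each phase of the given period."""
    return "".join(phase.most_common(1)[0][0] for phase in periodic_counts(seq, period))


repeat = "ACG" * 10
print(nonuniformity(repeat, 3) > nonuniformity(repeat, 2))  # True
print(consensus("ACGACGACTACG", 3))  # ACG
```

A perfect 3-periodic repeat places a single base at each phase, so the score is high for period 3 and essentially zero for period 2, where every phase sees the same base mixture as the whole sequence.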

Experimental procedures

Bacterial strains and growth conditions

Helicobacter pylori strains (Table S1) were grown on solid horse blood agar (HB) plates containing 4% Columbia agar base (Oxoid), 5% defibrinated horse blood (HemoStat Laboratories), 0.2% β-cyclodextrin (Sigma), 10 µg ml⁻¹ vancomycin (Sigma), 5 µg ml⁻¹ cefsulodin (Sigma), 2.5 U ml⁻¹ polymyxin B (Sigma), 5 µg ml⁻¹ trimethoprim (Sigma), and 8 µg ml⁻¹ amphotericin B (Sigma) at 37°C either under a microaerobic atmosphere generated using a CampyGen sachet (Oxoid) in a gas pack jar or in an incubator equilibrated with 14% CO2 and 86% air. For liquid culture, H. pylori was grown in Brucella broth (Difco) containing 10% fetal bovine serum (BB10, Invitrogen) with shaking in a gas pack jar containing a CampyGen sachet. For resistance marker selections, bacterial media were additionally supplemented with 15 µg ml⁻¹ chloramphenicol (Cm, Sigma), 25 µg ml⁻¹ kanamycin (Kan, Fisher Scientific), 2.5 µg ml⁻¹ erythromycin (Ery, Fisher Scientific) or 36 µg ml⁻¹ metronidazole (Mtz, Sigma).

DNA manipulations

DNA manipulations, such as restriction digestion, PCR and agarose gel electrophoresis, were performed according to standard procedures (Ausubel et al., 1997). H. pylori genomic DNA (gDNA) was prepared using Wizard genomic DNA preparation kits (Promega). Primers used for PCR and sequencing are described in Table S2. Plasmid DNA (Table S3) was isolated and prepared from E. coli using a Qiagen Maxiprep kit (Qiagen). The FHCRC Genomics Shared Resource performed the sequencing of plasmid DNA and PCR products and the resulting sequences were analysed using Sequencher (Gene Codes Corporation).

Generation of H. pylori knockout isogenic mutants

Knockout alleles were constructed in H. pylori NSH57 using a vector-free allelic replacement strategy to generate alleles in which a non-polar kanamycin resistance (aphA3) cassette (Menard et al., 1993), an erm cassette conferring resistance to erythromycin (Lampson and Parisi, 1986; Dailidiene et al., 2006), or a chloramphenicol acetyl transferase (cat) resistance cassette fused to a sucrose sensitivity marker (sacB) (Copass et al., 1997; Humbert and Salama, 2008) replaced 80–90% of the coding sequence of the gene while preserving the start and stop codons. The primers used for this procedure are designated as 1 through 4 and are given in Table S2. After natural transformation with the appropriate PCR product and selection on Kan-, Ery- or Cm-containing media, four clones were evaluated by PCR to confirm replacement of the WT allele with the null allele. The ΔrecJ::kanΔaddA 852-2540 double mutant was generated by transforming strain ΔrecJ::kanΔaddA::catsacB with a PCR product digested with SspI (New England Biolabs) and ligated with T4 DNA ligase (Invitrogen) to delete a 1.7 kbp intergenic region in addA. Transformants were selected on sucrose-containing HB plates, screened on Cm-containing media and checked by PCR to confirm the addA deletion. Urease activity and flagella-based motility were confirmed for all the clones generated. Single clones were used for transformation experiments.

Generation of H. pylori complemented mutants

Constructs for chromosomal complementation at the rdxA locus were made by cloning each gene individually into pLC292 ( Terry et al., 2005 ), which were then introduced into H. pylori NSH57 by natural transformation and selection on Mtz-containing media ( Dailidiene et al., 2006 ). Each gene was amplified using primers -XbaI and -SalI (Table S2) from H. pylori NSH57 gDNA using high-fidelity Taq polymerase (Platinum Taq, Invitrogen). The resulting PCR product was digested with XbaI and SalI (New England Biolabs), ligated into pLC292, and electroporated in E. coli strain DH10B or XA90 ( Ezaz-Nikpay et al., 1994 ) for pOH10 (Table S3). All inserted genes were fully sequenced and contained the expected nucleotide sequences.

Natural transformation

To generate knockout and complemented mutant strains of H. pylori, bacteria were freshly grown for 24–32 h on HB plates, transferred as patches onto fresh plates and grown for an additional 6–8 h. DNA (plasmid or PCR product) was diluted as appropriate in distilled water and 10 µl was added to each patch and incubated overnight. The mixture was harvested from the plate surface, resuspended in 350 µl phosphate-buffered saline (PBS) and plated onto selective HB plates.

To assess the frequency of natural transformation, recipient H. pylori bacteria freshly grown on HB plates were resuspended in 350 µl BB10 media and used to inoculate a 5 ml liquid culture grown for 6–8 h. The optical density at 600 nm (OD600) of this culture was measured and the culture was diluted back to OD600 0.015 to reach logarithmic phase of growth (OD600 ∼ 1) after overnight incubation. One hundred microlitres of recipient bacteria was dispensed in a flat-bottom 96-well plate and transformed in duplicate or triplicate with 10 µl of 1 ng µl⁻¹ donor gDNA. Donor gDNA was constructed by inserting the cat resistance cassette at bp 483 in gene cagH of H. pylori strain NSH57 and J99 (hpG27-499 and jhp0489 respectively). To measure transformation of the ΔdprA mutant, donor gDNA was isolated from the G27 cag2::aphA3-sacB clone (Pinto-Santini and Salama, 2009). After 3 h incubation, 50 µl and 5 µl of the mixture were plated on Cm or Kan HB plates and 20 µl of a 10⁻⁵ dilution was plated on plain HB plates to determine the total number of viable bacteria. Transformation frequency was calculated as the number of Cm or Kan resistant colonies per colony-forming unit.
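The final calculation can be sketched as follows, assuming that colony counts are first normalised to the plated volume and dilution so that both numbers refer to the same volume of the transformation mixture. The function names and colony counts are hypothetical. Only the plated volumes and the 10⁻⁵ dilution come from the protocol above.

```python
def cfu_per_ul(colonies: int, plated_volume_ul: float, dilution: float = 1.0) -> float:
    """Colony-forming units per microlitre of the undiluted mixture."""
    return colonies / (plated_volume_ul * dilution)


def transformation_frequency(
    resistant_colonies: int,
    selective_volume_ul: float,
    viable_colonies: int,
    viable_volume_ul: float,
    viable_dilution: float,
) -> float:
    """Resistant colonies per viable colony-forming unit."""
    transformants = cfu_per_ul(resistant_colonies, selective_volume_ul)
    viable = cfu_per_ul(viable_colonies, viable_volume_ul, viable_dilution)
    return transformants / viable


# Hypothetical counts: 120 Cm-resistant colonies from 50 µl of undiluted mixture,
# 80 colonies from 20 µl of a 1e-5 dilution on plain plates.
freq = transformation_frequency(120, 50, 80, 20, 1e-5)
print(f"{freq:.1e}")  # 6.0e-06
```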

In the co-culture experiment, NSH57 and J99 ΔcomB10::ermΔcagH::cat were used as donor strains. To maximize DNA released in the culture media, we grew donor bacteria to stationary phase before mixing them with the recipient strain. ΔcomB10 strains show no detectable transformation (Dorer et al., 2010), ensuring unidirectional transformation in the co-culture assay. Recipient strains NSH57 hp0203-hp0204::aphA3 (Langford et al., 2006) and Δhpy188IIIR::aphA3ΔhpyCH4VR::erm were grown to logarithmic phase as described above and mixed at equal volume with the donor strains in a flat-bottom 96-well plate. After 3 h co-incubation, 100 µl of the mixture was plated on Cm + Kan HB plates to select for recombinant clones and 20 µl of a 10⁻⁵ dilution was plated on Kan HB plates to determine the total number of recipient bacteria.

Mapping of integration end-points

Chromosomal DNA of the transformants was prepared and 5–7 kbp of the regions upstream and downstream of the cat marker were amplified by PCR using primer pairs -6FcagH/cagH::cat-3 and cagH::cat-4/5RcagH (Table S2) respectively. The resulting PCR products were purified with the DNA clean and concentrator-5 kit (Zymo Research) and digested with the appropriate restriction enzymes for a minimum of 4 h (New England Biolabs) or sequenced by the FHCRC Genomics Shared Resource.

Sensitivity to UV and antimicrobial agents

UV sensitivity assays were carried out as described previously ( Amundsen et al., 2008 ). For antimicrobial sensitivity testing, H. pylori were grown overnight in liquid culture to OD600 = 0.3, and 200 µl was plated on solid medium lacking all other antimicrobials, and incubated for 30 min in a CO2 incubator. E-test strips (AB Biodisk) were then placed on the plates, which were further incubated for two days and read according to the manufacturer's instructions.

Statistical analysis

A t-test was used to compare the mean of integration lengths or transformation frequency between WT bacteria and mutant clones and those comparisons resulting in a P-value of < 0.05 were considered significant. All statistical analyses were performed using the SAS version 9.1 software (SAS Institute, Cary, NC, USA).

In silico genomic analysis

Helicobacter pylori sequences were retrieved from the H. pylori genome browser http://hpylori.ucsc.edu/. For H. pylori strain NSH57, the sequence of the parent strain G27 was used ( Baltrus et al., 2009 ). The distribution of restriction sites and single nucleotide polymorphism was analysed with Sequencher (Gene Codes Corporation).

6.4: Restriction Mapping

• Contributed by Michael Blaber
• Professor (Biomedical Sciences) at Florida State University

The restriction/modification system in bacteria is a small-scale immune system for protection from infection by foreign DNA.

In the late 1960's it was discovered that E. coli contains enzymes that will methylate specific nucleotide bases in DNA.

• Different strains of E. coli contained different types of these methylases.

• Typical sites of methylation include the N6 position of adenine, the N4 position of cytosine, or the C5 position of cytosine.

Figure 6.4.1: Methylation sites

• In addition, only a fractional percentage of bases were methylated (i.e. not every adenine was methylated, for example) and these occurred at very specific sites in the DNA.
• A characteristic feature of the sites of methylation, was that they involved palindromic DNA sequences.
• Here is an example from a particular E. coli strain R1:

Figure 6.4.2: Palindromic DNA

(EcoR1 methylase specificity. Rubin and Modrich, 1977)
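A DNA sequence is palindromic in this sense when it reads the same 5' to 3' on both strands, that is, when it equals its own reverse complement. A minimal sketch of that check follows. The function names are illustrative only.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(seq: str) -> bool:
    """True when both strands read the same in the 5' to 3' direction."""
    return seq.upper() == reverse_complement(seq)


print(is_palindromic("GAATTC"))  # True
print(is_palindromic("GAATTA"))  # False
```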

• In addition to possessing a particular methylase, individual bacterial strains also contained accompanying specific endonuclease activities.
• The endonucleases cleaved at or near the methylation recognition site.

Figure 6.4.3: Cleavage near methylation site

• These specific nucleases, however, would not cleave at these specific palindromic sequences if the DNA was methylated.

Thus, this combination of a specific methylase and associated endonuclease functioned as a type of immune system for individual bacterial strains, protecting them from infection by foreign DNA (e.g. viruses).

• In E. coli strain R1, the sequence GAATTC will be methylated at the internal adenine base (by the EcoR1 methylase).
• The EcoR1 endonuclease within the same bacteria will not cleave the methylated DNA.
• Foreign viral DNA, which is not methylated at the sequence "GAATTC", will therefore be recognized as "foreign" DNA and will be cleaved by the EcoR1 endonuclease.
• Cleavage of the viral DNA renders it non-functional.
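The logic of the bullets above can be sketched as a toy model: every GAATTC site is cut unless it is marked as methylated. This is a simplification that tracks a single strand and records methylation per site (by the index where the site starts) rather than per base. The function names and the example sequence are hypothetical.

```python
SITE = "GAATTC"
CUT_OFFSET = 1  # EcoR1 cuts between the G and the first A (G^AATTC)


def find_sites(dna: str, site: str = SITE) -> list[int]:
    """Start indices of every occurrence of the recognition site."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


def digest(dna: str, methylated_sites=frozenset()) -> list[str]:
    """Fragments left after cutting every unmethylated recognition site."""
    cuts = [p + CUT_OFFSET for p in find_sites(dna) if p not in methylated_sites]
    fragments, previous = [], 0
    for cut in cuts:
        fragments.append(dna[previous:cut])
        previous = cut
    fragments.append(dna[previous:])
    return fragments


dna = "AAGAATTCTTGAATTCGG"
print(digest(dna))                             # ['AAG', 'AATTCTTG', 'AATTCGG']
print(digest(dna, methylated_sites={2, 10}))   # ['AAGAATTCTTGAATTCGG']
```

Unmethylated ("foreign") DNA is cut at both sites, while the same sequence with both sites methylated ("self") stays intact.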

Such endonucleases are referred to as "restriction endonucleases" because they restrict the DNA within the cell to being "self".

The combination of restriction endonuclease and methylase is termed the "restriction-modification" system.

Since different bacterial strains and species have potentially different R/M systems, their characterization has made available hundreds of endonucleases with different sequence specific cleavage sites.

• They are one of the primary tools in modern molecular biology for the manipulation and identification of DNA sequences.
• Restriction endonucleases are commonly named after the bacterium from which they were isolated.

Arthrobacter luteus

"Four cutter". Leaves blunt ends to the DNA.

Bacteroides fragilis

"Four cutter". Leaves 5' overhang.

Neisseria cinerea

"Five cutter". Middle base can be either cytosine or guanine. Leaves 5' overhang. Different recognition sites may have non-complementary sequences.

"Six cutter". Leaves 5' overhang. Behaves like a "four cutter" ('star' activity) in high salt buffer. \$44 for 10,000 units.

Haemophilus aegyptius

"Six cutter". Pu is any purine, Py is any pyrimidine. Leaves 3' overhang.

"Seven cutter". Pu is any purine, Py is any pyrimidine, N is any base. Leaves 5' overhang. Different recognition sites may have non-complementary sequences.

"Six cutter with interrupted palindrome". Leaves 5' overhang. Different recognition sites may have non-complementary sequences.

Bacillus stearothermophilus

"Six cutter". Different recognition sites will be complementary.

Acetobacter aceti

"Six cutter" with 3' overhang. Same recognition sequence as Bsa HI, but different cleavage position.

Non-palindrome, distal cleavage. Leaves 3' overhang. \$50 for 50 units.

Nocardia otitidis-caviarum

"Eight cutter". Leaves 5' overhang.

Bacillus stearothermophilus

• The utility of restriction endonucleases lies in their specificity and the frequency with which their recognition sites occur within any given DNA sample.
• If there is a 25% probability for a specific base at any given site, then the frequency with which different restriction endonuclease sites will occur can be easily calculated ($$0.25^n$$ for a recognition site of n bases):

Frequency of Occurrence

1 Alu I site in every 256 bases (0.25 Kb)

1 Nci I site in every 1024 bases (1.0 Kb)

1 EcoR1 site in every 4,096 bases (4.1 Kb)

1 EcoO109I site in every 16,384 bases (16.4 Kb)

1 Not I site in every 65,536 bases (65.5 Kb)

Thus, on average, any given DNA will contain an Alu I site every 0.25 kilobases, whereas a Not I site occurs once about every 65.5 kilobases.
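The numbers in the table follow directly from the 0.25 per-base probability. A short sketch, assuming equal base frequencies, independence between positions, and a recognition site in which every position is fully specified. The function names are illustrative.

```python
def site_probability(n: int, p_base: float = 0.25) -> float:
    """Probability that a given position starts a specific n-base site."""
    return p_base ** n


def expected_spacing(n: int, p_base: float = 0.25) -> float:
    """Average number of bases between occurrences of an n-base site."""
    return 1 / site_probability(n, p_base)


def expected_sites(length: int, n: int, p_base: float = 0.25) -> float:
    """Expected number of occurrences in a sequence of the given length."""
    return (length - n + 1) * site_probability(n, p_base)


for n in (4, 5, 6, 7, 8):
    print(n, expected_spacing(n))
# 4 256.0
# 5 1024.0
# 6 4096.0
# 7 16384.0
# 8 65536.0
```

Under the same assumptions, `expected_sites(length, n)` gives the expected count of a specific n-base sequence in a molecule of a given length. Real genomes deviate from this because base composition is rarely uniform.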

• Not I is therefore a very useful enzyme for isolating large regions of DNA, typically in research involving genomic DNA manipulations.
• Alu I would be expected to digest a DNA sample into lots of little pieces.

The assortment of DNA fragments would represent a specific "fingerprint" of the particular DNA being digested. Different DNA would not yield the same collection of fragment sizes. Thus, DNA from different sources can be either matched or distinguished based on the assembly of fragments after restriction endonuclease treatment. These are termed "Restriction Fragment Length Polymorphisms", or RFLP's. This simple analysis is used in various aspects of molecular biology as well as in law enforcement and genealogy. For example, genetic variations that distinguish individuals also may result in fewer or additional restriction endonuclease recognition sites.
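A toy illustration of how a single base change can alter the fragment pattern. The two sequences are invented for the example, the function name is hypothetical, and the cut position (one base into GAATTC) is the EcoR1 cut used purely for illustration.

```python
def fragment_lengths(dna: str, site: str, cut_offset: int) -> list[int]:
    """Sorted fragment lengths after cutting a linear molecule at every site."""
    lengths, previous = [], 0
    start = dna.find(site)
    while start != -1:
        cut = start + cut_offset
        lengths.append(cut - previous)
        previous = cut
        start = dna.find(site, start + 1)
    lengths.append(len(dna) - previous)
    return sorted(lengths)


allele_a = "TTGAATTCAAAAGAATTCTT"  # two recognition sites
allele_b = "TTGAATTCAAAAGAGTTCTT"  # a single base change destroys the second site

print(fragment_lengths(allele_a, "GAATTC", 1))  # [3, 7, 10]
print(fragment_lengths(allele_b, "GAATTC", 1))  # [3, 17]
```

The two alleles differ at one base, yet they give different sets of fragment sizes, which is what a gel-based comparison would reveal.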

Introduction

Comparative sequence analysis has had a major impact on molecular biology and genetics. Comparison of the sequences of protein-coding genes between multiple species has enabled prediction of gene function [1], identification of protein domains [2], prediction of functional amino acid residues [3,4], and detection of signals of natural selection at the level of whole genes [5] and individual codons [6,7]. Inferring non-neutral sequence elements in the human genome is of considerable interest even without a specific a priori hypothesis concerning their possible functional role(s). On a general level, for example, sequence conservation may considerably inform human genetic studies seeking to identify allelic variants associated with disease phenotypes, particularly in noncoding regions [8]. The effect of human SNPs at the level of molecular function and phenotype depends on the importance of the individual nucleotide position, whereas the information of the sequence region as a whole is not necessarily relevant. For example, about half of human SNPs within protein coding genes are represented by synonymous variants, which are likely to be of limited importance, even though they are embedded within highly conserved exonic sequences. In addition, a subset of individual nucleotides conserved in four mammalian genomes were shown to be under selective pressure [9]. A position-specific measure of selective constraint is therefore highly suitable for analysis of positions that are polymorphic within the human population.

Several algorithms have been developed for detection and scoring of sequence conservation in the context of a multispecies sequence alignment. However, to date these approaches have been applied almost exclusively to detect discrete regions with elevated average sequence conservation that typically extend for up to hundreds of contiguous bases [10–14]. Such regions encompass canonical coding exons, as well as so-called “conserved noncoding sequences” that presumably result from purifying selection, and are thereby indicative of functional importance [15,16].

Recently, comparative genomic sequence of unprecedented depth has been generated by sequencing of multiple mammalian and other vertebrate genomes orthologous to 1% of the human genome defined by the ENCODE regions [17,18]. Several alignment techniques have been applied to construct multiple sequence alignments within ENCODE regions [18]. These alignments have in turn been subjected to analysis with existing sequence conservation detection algorithms, including phastCons [10], GERP [11], and BinCons [13]. The conserved regions identified by these analyses show statistically significant overlap with experimentally identified coding and noncoding functional elements. However, the majority of experimentally characterized noncoding functional elements fall outside of currently delineated conserved regions, and, conversely, most conserved regions were located outside of experimentally detected elements [18]. The fact that many functional elements reside in noncoding regions that do not exhibit uniformly high conservation is perhaps not surprising given that binding sites for transcription factors that mediate many biological processes are quite plastic evolutionarily [19]. Conversely, many individual nucleotides located outside of well-defined conserved regions exhibit sequence conservation across multiple species. Such conservation may be due to mere chance or, for a certain fraction of these nucleotides, may reflect their importance for fitness and hence function. The aforementioned observations emphasize the need for higher resolution methods for analysis of evolutionary conservation within functional elements and generally across the genome.

Here we develop an approach for analyzing sequence conservation at the individual base-pair level, with an aim toward correlating conservation with human genetic variation and with functional genomic annotations. We present a new probabilistic conservation score, SCONE (Sequence Conservation Evaluation). SCONE provides conservation scores for individual nucleotide positions, and can be applied to predict continuous sequence regions with an elevated level of conservation.

We apply SCONE to the study of annotated functional elements and human sequence polymorphism. We focus on the statistical distribution of position-specific conservation scores rather than on the bulk overlap between conserved regions and functional features. It is clear from the outset that the power to detect conservation at the single base-pair resolution is limited, even when comparing multiple species [20]. We surmount this obstacle by deriving considerable statistical power from combined analysis of numerous individual nucleotide positions from many genomic regions. While this analysis does not allow us to detect individual functional positions accurately, we can show that, collectively, a subset of noncontiguous individual positions are important. A key advantage of the analysis of the distribution of position-specific scores is that it is unbiased with respect to the pattern of conservation along a given sequence region. SCONE thus has the potential to analyze putative functional elements in which the conservation signal is not homogeneous or manifested by exon-like contiguous conserved stretches.
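The idea of pooling weak per-position signals can be illustrated with a toy example. The sketch below is not SCONE: it assumes a much simpler score (the fraction of non-reference species that match the reference base in each alignment column), and the alignment, species names, and position sets are invented for illustration. It only shows how scores that are noisy at single positions can be summarized over a set of positions.

```python
from statistics import mean

def column_identity_scores(alignment, reference="human"):
    """Per-column fraction of non-reference species matching the reference base.

    This is a stand-in score for illustration, not the SCONE model.
    Assumes all sequences are aligned and have equal length.
    """
    ref = alignment[reference]
    others = [seq for name, seq in alignment.items() if name != reference]
    scores = []
    for i, ref_base in enumerate(ref):
        # Skip gapped bases in the other species for this column.
        observed = [seq[i] for seq in others if seq[i] != "-"]
        if ref_base == "-" or not observed:
            scores.append(None)  # no usable information at this column
            continue
        scores.append(sum(b == ref_base for b in observed) / len(observed))
    return scores

def pooled_mean(scores, positions):
    """Average the per-position scores over a (possibly noncontiguous) set."""
    values = [scores[p] for p in positions if scores[p] is not None]
    return mean(values)

# Hypothetical four-species alignment (made-up sequences).
alignment = {
    "human": "ACGTACGTAC",
    "mouse": "ACGTTCGTAC",
    "dog":   "ACGAACGTAC",
    "chick": "ACCTACGT-C",
}

scores = column_identity_scores(alignment)
annotated = [0, 1, 5, 6, 7]    # hypothetical "functional" positions
background = [2, 3, 4, 8, 9]   # hypothetical comparison positions

print(pooled_mean(scores, annotated), pooled_mean(scores, background))
```

With only three comparison species, a single column can take just a few distinct score values, which is why individual positions carry little information on their own. Comparing the pooled summaries across many positions is where the statistical power comes from.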

We report herein on the relationship between sequence conservation, functional sequence elements, and human allelic variation, as revealed by single-nucleotide conservation analysis.


Extended Data Fig. 1 Highly efficient base editing by A3A-BE4max or hyA3A-BE4max in mouse embryos.

(a, b) Genotyping of F0 generation pups by A3A-BE4max and hyA3A-BE4max. The frequencies of WT and mutant alleles were determined by analyzing HTS using BE-analyzer. The percentage on the right represents the frequency of the indicated mutant allele with the corresponding mutation-induced amino acid conversion shown in parentheses. The frequency of the wild-type allele was omitted. Wt, wild-type.

Extended Data Fig. 2 Off-target analysis and germline transmission of the founders derived from hyA3A-BE4max injection.

(a) HTS was performed with mouse tails to determine editing efficiencies at 15 potential off-target sites in three Dmd mutant F0 mice (#BD03, #BD04 and #BD07). Mismatched nucleotide letters are indicated in lowercase. Data are means ± SD (n = 3 mice). (b) HTS alignments of mutant sequences from F1 generated by mating founder #BD12(♀) with Wt (♂). The column on the right indicates frequencies of mutant alleles. Wt, wild-type. Statistical source data are provided in Source Data Extended Data Fig. 2.

Extended Data Fig. 3 Comparison of base editing efficiency and protein levels by CBEs and hyCBEs in HEK293T cells.

(a) Comparison of base editing efficiency induced by A3A-BE4max or hyeA3A-BE4max in HEK293T cells. The average mutation percentage derived from three independent experiments of A3A-BE4max and hyeA3A-BE4max at the same site is listed. Some of the data (hyeA3A-BE4max) are the same as presented in Fig. 4a. Statistical source data are provided in Source Data Extended Data Fig. 3. (b) The protein levels of BE4max, hyBE4max, A3A-BE4max, hyA3A-BE4max, eA3A-BE4max and hyeA3A-BE4max were determined by Western blotting in HEK293T cells 3 days after transfection of similar amounts of plasmid DNA. Specific antibodies against Cas9 (top) or GAPDH (bottom) were used. Western blotting images are representative of three independent experiments. Unprocessed blots are shown in Source Data Extended Data Fig. 3.

Extended Data Fig. 4 Comparison of base editing product purity induced by variant base editors in HEK293T cells.

(a) Comparison of base editing products induced by BE4max vs hyBE4max. HTS data were analyzed and the ratio of each nucleotide type is listed at each target position. Data are means ± SD (n = 3 independent experiments). (b) Comparison of base editing products induced by A3A-BE4max vs hyA3A-BE4max. HTS data were analyzed and the ratio of each nucleotide type is listed at each target position. Data are means ± SD (n = 3 independent experiments). (c) Comparison of base editing products induced by eA3A-BE4max vs hyeA3A-BE4max. HTS data were analyzed and the ratio of each nucleotide type is listed at each target position. The individual data points are shown as black (C > T), light green (C > A) and light red (C > G) dots. Data are means ± SD (n = 3 independent experiments). Statistical source data are provided in Source Data Extended Data Fig. 4.

Extended Data Fig. 5 Whole genome sequencing of Dmd F0 (#DD11) and wild-type (Wt) mice.

(a) Summary of genome sequencing analysis. WGS was performed for a Dmd mutant mouse (#DD11) and a wild-type mouse (Wt). A total of 82,573 and 62,359 SNPs were identified for #DD11 and Wt, respectively. After filtering out dbSNP (naturally occurring variants in the SNP database), 20,387 SNPs were obtained in the #DD11 genome. Then the sequences at the remaining SNP sites were compared with all on-/off-target sequences (20 bp). (b) Summary of on-/off-target site information. A total of 175,058 sites, including 1 on-target site and 20, 374, 2,869, 22,335 and 148,569 off-target sites with 3, 4, 5, 6, or 7 mismatches, respectively, were analyzed. (c) Summary of the whole-genome sequencing. (d) Summary of off-target analysis. After comparing the sequences at the remaining SNP sites with the 175,058 on-/off-target sequences (20 bp), the C-to-T substitution was only detected within the on-target sequence in #DD11. (e) Validation of the off-target candidate site determined in (d) using targeted deep sequencing of genomic DNA isolated from various #DD11 organs (heart, liver, lung and tail). Mismatched nucleotides and PAM sequences are shown in red and in blue, respectively. Data represent the mean of two independent experiments. Statistical source data are provided in Source Data Extended Data Fig. 5.
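The filtering logic described in panels (a) and (d) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout, the helper names (`filter_known`, `overlapping_site`, `is_c_to_t`), and all coordinates are hypothetical. Treating G-to-A as the opposite-strand equivalent of C-to-T is also an assumption of this sketch.

```python
def filter_known(snps, known):
    """Drop SNP calls already present in a known-variant set (e.g. a dbSNP-like list)."""
    return [s for s in snps if (s["chrom"], s["pos"], s["alt"]) not in known]

def overlapping_site(snp, sites):
    """Return the first candidate site whose window contains the SNP, else None."""
    for site in sites:
        same_chrom = site["chrom"] == snp["chrom"]
        in_window = site["start"] <= snp["pos"] < site["start"] + site["length"]
        if same_chrom and in_window:
            return site
    return None

def is_c_to_t(snp):
    """C-to-T on the reported strand, or G-to-A on the opposite strand (assumption)."""
    return (snp["ref"], snp["alt"]) in {("C", "T"), ("G", "A")}

# Hypothetical SNP calls from a sequenced sample.
snps = [
    {"chrom": "chr1", "pos": 100, "ref": "C", "alt": "T"},
    {"chrom": "chr1", "pos": 500, "ref": "A", "alt": "G"},
    {"chrom": "chr2", "pos": 300, "ref": "C", "alt": "T"},
    {"chrom": "chr3", "pos": 50, "ref": "A", "alt": "C"},
]
# Hypothetical known variants, keyed by (chromosome, position, alternate allele).
known = {("chr1", 500, "G")}
# Hypothetical 20 bp on-/off-target windows.
sites = [
    {"name": "on-target", "chrom": "chr1", "start": 90, "length": 20},
    {"name": "off-target-1", "chrom": "chr3", "start": 40, "length": 20},
]

novel = filter_known(snps, known)
hits = []
for snp in novel:
    site = overlapping_site(snp, sites)
    # Keep only C-to-T-type changes that fall inside a candidate window.
    if site is not None and is_c_to_t(snp):
        hits.append((snp, site))

for snp, site in hits:
    print(site["name"], snp["chrom"], snp["pos"], snp["ref"], ">", snp["alt"])
```

A linear scan over the sites is fine for a toy list. With on the order of 10^5 candidate windows, as in the caption, an interval index or a sort-and-sweep approach would be the more practical choice.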