Exons in eukaryotes identified by improved 3mer detector

DNA is mostly aperiodic, and various tools have been developed to detect even slight periodicity in regions of DNA. Of special interest is 3-periodicity, also called the 3mer pattern.

If the following DNA strand were hypothetically encountered, mathematical tools would indicate a strong 3-periodicity score (spacing and bolding added for clarity):


The periodicity of A is especially prominent if the two adjacent nucleotides are not A. The following pattern also has an A at every 3rd position, but it is not considered a 3-periodicity pattern because the nucleotides adjacent to every 3rd nucleotide are also A’s.


Such a pattern would not trigger the 3mer Display in Skittle, nor would 3mers be indicated in the Skittle Repeat map. No other 3mer detection algorithm should detect such a pattern either.
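The distinction can be checked numerically. The following is a minimal sketch, not the detector used in Skittle: it assumes a plain Discrete Fourier Transform of a 0/1 indicator for one nucleotide, evaluated at a frequency of one cycle per 3 positions. The function name `period3_power` and both test sequences are hypothetical and chosen only for illustration.

```python
import cmath

def period3_power(seq, base):
    """Squared magnitude of the DFT of the 0/1 indicator of `base` at frequency 1/3."""
    total = 0j
    for n, ch in enumerate(seq):
        if ch == base:
            # Each occurrence adds a unit vector whose angle depends only on n mod 3.
            total += cmath.exp(-2j * cmath.pi * n / 3)
    return abs(total) ** 2

# Hypothetical sequences, chosen only for illustration.
isolated_a = "ACG" * 10   # A at every 3rd position, neighbours are not A
all_a = "A" * 30          # A at every position, so also at every 3rd one

power_isolated = period3_power(isolated_a, "A")
power_all_a = period3_power(all_a, "A")
print(round(power_isolated, 6), round(power_all_a, 6))
```

In the first sequence the ten unit vectors all point the same way, so the power is about 100. In the run of A's the vectors spread evenly over three directions and cancel, so the power is essentially zero even though A still occurs at every 3rd position.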

Hypothetically, the strongest possible 3-periodicity would be found in DNA with patterns like:





Such patterns will max out the 3mer detector in Skittle and will yield the strongest possible 3mer pattern on the Skittle Repeat map. Let us call such patterns ideal 3mer patterns. To the extent a fragment of DNA approaches these ideal 3mer patterns, it can be said that such a fragment of DNA has some degree of 3 periodicity.

Unfortunately, the scoring procedures for specifying how closely or distantly a stretch of DNA approaches the ideal 3mer pattern can range from very simple to almost impenetrably complex. Some of the more complex scoring strategies borrow from the discipline of Electrical Engineering in the processing of sound waves, radio waves, and other types of signals. Exotic concepts like Discrete Fourier Transforms and Markov models are often used to develop 3mer scoring systems. But even the most complex scoring systems must incorporate fudge factors to force fit a correlation with exonic regions in the DNA.

Much of the 3mer literature is very difficult to understand because of the tedious math involved, but at the heart of 3mer detection algorithms are fairly simple concepts familiar from everyday experience. To that end, using mostly plain English and almost no math, I wrote a Short and Simple Introduction to 3mers and DFTs in Engineering and DNA.
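One such simple concept can be sketched without any complex numbers. For each nucleotide, count how often it falls on position 0, 1, or 2 of successive triplets; the more lopsided the three counts, the stronger the 3mer signal. For counts c0, c1, c2, the quantity c0² + c1² + c2² − c0·c1 − c1·c2 − c0·c2 equals the DFT power of that nucleotide at period 3. The function names and the 0-to-1 scaling below are assumptions made for this illustration, not the scoring used in Skittle or in any published tool.

```python
def codon_position_counts(seq, base):
    """Count occurrences of `base` at positions 0, 1, 2 of successive triplets."""
    counts = [0, 0, 0]
    for n, ch in enumerate(seq):
        if ch == base:
            counts[n % 3] += 1
    return counts

def simple_3mer_score(seq):
    """Imbalance of the position counts, summed over A, C, G, T and scaled to 0..1.

    The scaling is an assumption of this sketch: a perfect repeat of a triplet
    of three different bases scores 1.
    """
    n_triplets = len(seq) // 3
    if n_triplets == 0:
        return 0.0
    seq = seq[: n_triplets * 3]   # drop any incomplete trailing triplet
    total = 0
    for base in "ACGT":
        c0, c1, c2 = codon_position_counts(seq, base)
        total += c0 * c0 + c1 * c1 + c2 * c2 - c0 * c1 - c1 * c2 - c0 * c2
    return total / (3 * n_triplets ** 2)

print(simple_3mer_score("ACG" * 10))   # perfect triplet repeat
print(simple_3mer_score("A" * 30))     # homopolymer run
print(simple_3mer_score("ACGT" * 6))   # period 4, not period 3
```

The first call gives the maximum score, while the homopolymer and the period-4 repeat both score zero because every nucleotide is spread evenly over the three positions.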

With regard to Prokaryotes and Archaea, the Z-curve scoring system works extremely well for finding coding regions. The search for 3-periodicity is at the heart of the Z-curve scoring system. The Z-curve system claims 99% or better identification of coding regions in Prokaryotes. It is described in the paper ZCURVE: a new system for recognizing protein coding genes in bacterial and archaeal genomes. I speculate that the Z-curve method has a high success rate for two reasons: the algorithm explores only open reading frames, and there are no introns to deal with, unlike in Eukaryotes.

I further speculate that if the 3mer scoring implemented within DNA Skittle were based on regions defined by open reading frames instead of a fixed number of nucleotides, DNA Skittle would identify coding regions in prokaryotes with a high level of success. This hypothesis has not yet been tested, but it could be a subject of future research. Fundamentally, I don’t think much is necessarily gained by using more complex math to implement a 3mer scoring system…
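As a rough illustration of that untested idea, the sketch below scores open reading frames rather than fixed-size windows. Everything in it is an assumption made for illustration: the names `forward_orfs` and `period3_score`, the `min_codons` cutoff, the restriction to the forward strand, and the test sequence. It uses the standard ATG start codon and the TAA, TAG, TGA stop codons, and it is not how Skittle or ZCURVE work.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def forward_orfs(seq, min_codons=10):
    """Yield (start, end) of ATG...stop stretches in the three forward frames."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield (start, i + 3)   # end index includes the stop codon
                start = None

def period3_score(region):
    """Position-count imbalance summed over the four bases, scaled to 0..1.

    Assumes len(region) is a non-zero multiple of 3, which holds for the ORFs above.
    """
    total = 0
    for base in "ACGT":
        c0, c1, c2 = (region[j::3].count(base) for j in range(3))
        total += c0 * c0 + c1 * c1 + c2 * c2 - c0 * c1 - c1 * c2 - c0 * c2
    return total / (3 * (len(region) // 3) ** 2)

# Hypothetical toy sequence: one ORF in the third forward frame.
seq = "CC" + "ATG" + "GCA" * 12 + "TAA" + "CC"
orfs = list(forward_orfs(seq))
for start, end in orfs:
    print(start, end, round(period3_score(seq[start:end]), 3))
```

Each region is scored over its own length, so the score is not diluted by non-coding sequence that a fixed window would happen to include.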

The following is a graph of a statistically improved 3mer intensity versus nucleotide position for a section of DNA in C. elegans. The 3mer intensity is in blue and the green bars indicate the exon positions. As can be seen, there is good correlation between 3mer intensity and exon positions.


The statistically improved 3mer intensity can be compared to a plot of 3mer intensity using an unadjusted Discrete Fourier Transform (DFT). As can be seen below, the 3mer intensity calculated by the standard DFT does not correlate as well with exon positions:

[Figure: 3mer intensity calculated with the standard DFT]
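For readers who want to reproduce a plot of this kind, an unadjusted DFT intensity profile can be sketched as a sliding window. This is only the baseline method, not the improved algorithm from the paper cited below. The window size, step size, function names, and the synthetic sequence are all assumptions chosen for illustration; real exon data is not used here.

```python
import cmath

def dft_3mer_intensity(window):
    """Unadjusted spectral power at period 3, summed over the four bases."""
    power = 0.0
    for base in "ACGT":
        total = 0j
        for n, ch in enumerate(window):
            if ch == base:
                total += cmath.exp(-2j * cmath.pi * n / 3)
        power += abs(total) ** 2
    return power

def intensity_profile(seq, window=60, step=12):
    """List of (window start, intensity) for windows at 0, step, 2*step, ..."""
    return [
        (start, dft_3mer_intensity(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]

# Synthetic sequence (not real data): period-4 flanks around a period-3 core.
flank = "ACGT" * 30   # 120 nt with no period-3 signal
core = "GCA" * 40     # 120 nt with a strong period-3 signal
seq = flank + core + flank

profile = intensity_profile(seq)
for start, value in profile:
    print(start, round(value, 1))
```

Windows that lie entirely in the flanks score essentially zero, windows entirely inside the core score about 1200, and windows straddling a boundary fall in between. On real DNA the contrast is far weaker, which is why adjusted scoring schemes are of interest.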

The following paper describes the improved algorithm:
A Novel Fast Algorithm for Exon Prediction in Eukaryotic Genes using Linear Predictive Coding Model and Goertzel Algorithm based on the Z-Curve

I speculate that 3mers in prokaryotes are mostly a consequence of codon bias. The cause of 3mers in eukaryotes is a more difficult question to answer.