3mers from synthetic genomes vs. 3mers from real genomes as viewed in Skittle

[wpdm_file id=6] [NOTE: The download may trigger a security warning because antivirus software does not recognize Skittle as safe, so ignore the message. Depending on the browser, the file might have to be unzipped after download. To run Skittle, run the file “SkittleToo”, NOT “Skittle.exe”.]
[wpdm_file id=7] [NOTE: It might have to be unzipped after download depending on the browser.]

Below are screen captures of the Skittle repeat map and 3mer display of a synthetic DNA sequence designed to generate the theoretical maximum 3mer pattern. The maximum 3mer pattern is produced by “ATG” repeated throughout a hypothetical synthetic genome. Other repeated codons that would also give the theoretical maximum 3mer pattern are:
“ATC”, “ACG”, “ACT”, “AGT”, “AGC”,
“CAG”, “CAT”, “CGA”, “CGT”, “CTA”, “CTG”,
“GAC”, “GAT”, “GCA”, “GCT”, “GTA”, “GTC”,
“TAC”, “TAG”, “TCA”, “TCG”, “TGA”, “TGC”.

Nucleotide display set at 128bp:
atg 128

Nucleotide display set at 2048bp:
atg 2048


Mission of BioLanguages Website

To explore the linguistic constructs found at the molecular and structural levels
of individual organisms and groups of organisms.

The discovery of the triplet DNA code was the first major step in relating biological systems to linguistic constructs. Since then, more features of biology have been seen to conform to linguistic constructs beyond the triplet DNA code.

This website reports and comments on developments related to biological languages.

Exons in eukaryotes identified by improved 3mer detector

DNA is mostly aperiodic and various tools have been developed to detect any slight periodicity in regions of DNA. Of special interest is the 3 periodicity or 3mer pattern in DNA.

If the following DNA strand were hypothetically encountered, mathematical tools would indicate a strong 3-periodicity score (spacing and bolding added for clarity):

ATG ACC ATT AGG ACG AGC ATC ACT …..

The periodicity of A is especially prominent if the two adjacent nucleotides are not A. By contrast, the following pattern does, strictly speaking, repeat A every 3 nucleotides, but it is not considered a 3 periodicity pattern because the nucleotides adjacent to every 3rd nucleotide are also A’s.

AAA AAA AAA AAA ……

Such a pattern will not trigger the 3mer Display in Skittle, nor will 3mers be indicated in the Skittle Repeat map; neither should such a pattern be detected by any other 3mer detection algorithm.

Hypothetically, the strongest possible 3 periodicity would be found in DNA that had patterns like:

ATG ATG ATG ATG…..

or

GTC GTC GTC GTC …..

etc.

Such patterns will max out the 3mer detector in Skittle and will yield the strongest possible 3mer pattern on the Skittle Repeat map. Let us call such patterns ideal 3mer patterns. To the extent a fragment of DNA approaches these ideal 3mer patterns, it can be said that such a fragment of DNA has some degree of 3 periodicity.
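As a rough illustration of "approaching the ideal 3mer pattern", the following is a minimal sketch of one possible score. It is not Skittle's scoring procedure; it assumes a plain Fourier-style measure evaluated only at period 3, and the function name `period3_score` is made up for this example. Each of the four bases is turned into a 0/1 indicator sequence, the strength of the period-3 component is measured for each, and the total is divided by the value an ideal pattern such as ATG ATG ATG would produce.

```python
import cmath

def period3_score(seq):
    """Illustrative period-3 score in the range 0..1 (hypothetical helper)."""
    seq = seq.upper()
    n = len(seq) - len(seq) % 3          # use whole triplets only
    if n == 0:
        return 0.0
    w = cmath.exp(-2j * cmath.pi / 3)    # one third of a turn on the unit circle
    rot = [1, w, w * w]                  # phase for codon positions 0, 1, 2
    power = 0.0
    for base in "ACGT":
        # Period-3 Fourier component of this base's 0/1 indicator sequence
        x = sum(rot[i % 3] for i in range(n) if seq[i] == base)
        power += abs(x) ** 2
    # n*n/3 is the power of an ideal pattern (three distinct bases repeated)
    return power / (n * n / 3)

print(period3_score("ATG" * 20))   # ideal pattern: approximately 1.0
print(period3_score("AAA" * 20))   # all A's: approximately 0.0
print(period3_score("ATGACCATTAGGACGAGCATCACT"))  # somewhere in between
```

Under this particular measure the all-A strand scores zero, because a base that appears equally at all three codon positions cancels itself out, while ATG or GTC repeated scores the maximum.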

Unfortunately, the scoring procedures for specifying how closely or distantly a stretch of DNA approaches the ideal 3mer pattern can range from very simple to almost impenetrably complex. Some of the more complex scoring strategies borrow from the discipline of Electrical Engineering in the processing of sound waves, radio waves, and other types of signals. Exotic concepts like Discrete Fourier Transforms and Markov models are often used to develop 3mer scoring systems. But even the most complex scoring systems must incorporate fudge factors to force fit a correlation with exonic regions in the DNA.

Much of the 3mer literature is very difficult to understand because of the tedious math involved, but at the heart of 3mer detection algorithms are somewhat simple concepts familiar to everyday experience. To that end, using mostly plain English and almost no math, I wrote a Short and Simple Introduction to 3mers and DFTs in Engineering and DNA.

With regard to bacteria and archaea, the Z-curve scoring system works extremely well for finding coding regions. The search for 3 periodicity is at the heart of the Z-curve scoring system. The Z-curve system claims 99% or better identification of coding regions in prokaryotes. It is described in the paper ZCURVE: a new system for recognizing protein coding genes in bacterial and archaeal genomes. I speculate the Z-curve method has a high success rate because the Z-curve algorithm explores only open reading frames and because there are no introns to deal with, as there are in eukaryotes.
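To make the link between the Z-curve and 3 periodicity concrete, here is a minimal sketch of the three Z-curve components computed separately for each codon position of a candidate open reading frame. The x, y, z formulas are the standard Z-curve definitions (purine vs. pyrimidine, amino vs. keto, weak vs. strong hydrogen bonding). The function name `zcurve_phase_components` is hypothetical, and this is only the feature calculation as I understand it, not the ZCURVE program or its classifier.

```python
def zcurve_phase_components(orf):
    """Return 9 numbers: (x, y, z) for codon positions 1, 2 and 3."""
    orf = orf.upper()
    n_codons = len(orf) // 3
    if n_codons == 0:
        raise ValueError("need at least one full codon")
    features = []
    for phase in range(3):
        # Every third nucleotide, starting at this codon position
        column = orf[phase:n_codons * 3:3]
        a, c, g, t = (column.count(b) / n_codons for b in "ACGT")
        features.extend([
            (a + g) - (c + t),   # x: purine vs. pyrimidine
            (a + c) - (g + t),   # y: amino vs. keto
            (a + t) - (g + c),   # z: weak vs. strong hydrogen bonds
        ])
    return features

print(zcurve_phase_components("ATG" * 10))
print(zcurve_phase_components("AAA" * 10))
```

When the three codon positions give different (x, y, z) values, as for ATG repeated, the strand has a 3-periodic composition. When all three positions give the same values, as for the all-A strand, it does not.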

I further speculate that if the 3mer scoring implemented within DNA Skittle were based on regions defined by open reading frames instead of a fixed number of nucleotides, DNA Skittle would actually identify coding regions with a high level of success for prokaryotes. This hypothesis has not yet been tested, but could be a subject of future research. Fundamentally, I don’t think much is necessarily gained by using more complex math to implement a 3mer scoring system…
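The difference between the two ways of choosing regions can be sketched as follows. This is only an illustration of the untested idea above, not anything implemented in Skittle: the function names are hypothetical, only the forward strand is scanned, and an open reading frame is assumed to run from ATG through the first in-frame stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def orf_regions(seq, min_codons=2):
    """(start, end) spans from ATG through the first in-frame stop codon."""
    seq = seq.upper()
    regions = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    regions.append((start, i + 3))
                start = None                   # close it at the stop codon
    return sorted(regions)

def fixed_windows(seq, size):
    """(start, end) spans of a fixed number of nucleotides."""
    return [(i, i + size) for i in range(0, len(seq) - size + 1, size)]

dna = "CC" + "ATGAAACCCTAG" + "GG"
print(orf_regions(dna))        # [(2, 14)]
print(fixed_windows(dna, 8))   # [(0, 8), (8, 16)]
```

A 3mer score would then be computed over each span. The fixed windows cut across the reading frame boundary, while the ORF span covers exactly the candidate coding region.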

The following is a graph of a statistically improved 3mer intensity versus nucleotide position for a section of DNA in C. elegans. The 3mer intensity is in blue and the green bars indicate the exon positions. As can be seen, there is good correlation between 3mer intensity and exons.

http://www.biolanguages.com/main/wp-content/uploads/2014/08/improved_3mer.png

The statistically improved 3mer can be compared to a plot of 3mer intensity using an unadjusted Discrete Fourier Transform (DFT). As can be seen below, the 3mers calculated by the standard DFT do not correlate as well with exon positions:

3mer dft

The following paper describes the improved algorithm:
A Novel Fast Algorithm for Exon Prediction in Eukaryotic Genes using Linear Predictive Coding Model and Goertzel Algorithm based on the Z-Curve

I speculate 3mers are mostly a consequence of codon bias in prokaryotes. The cause of 3mers in eukaryotes is a more difficult question to answer.

Short and Simple Introduction to 3mers and DFTs in Engineering and DNA

It is amazing that a Cable TV wire can simultaneously carry hundreds of TV channels. This is made possible because several signals can be merged into a single signal by the transmitter and then separated by the receiver at the end. The process of doing this involves very complex levels of coding and decoding machinery.

Some aspects of the coding and decoding machinery of the modern internet and cable TV culture are related to the detection of 3mers in biology. I explore them briefly here.

Suppose I pressed middle C on an electronic piano keyboard that was set to produce pure sine waves. If I had an oscilloscope picture of the sound, it would look like:

middle c

If I pressed the C that is 3 octaves above middle C, it would have a frequency that is 2^3 (or 8) times higher. Thus on the oscilloscope, the sound would look like:

3 octaves above

If I pressed both keys at the same time, the resulting sound is simply the addition of both notes, and the combined sound, merged into one waveform, would look like:

merged sounds

If all I had was the merged signal, would I be able to separate the merged signal into the individual parts? The answer is “yes”. Joseph Fourier formalized the math of such a procedure and created the equations associated with his name such as the Fourier Transform:

fourier slide

In principle, an infinite number of “notes” can be merged into one signal by a hypothetical transmitter and then separated back into the constituents by a receiver. In practice, this cannot happen because of physical limitations.
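The two-note example can be sketched numerically. This is a minimal illustration using a naive Discrete Fourier Transform written out by hand. The numbers are stand-ins, not real piano frequencies: the low note is assumed to complete 4 cycles in the sampled window and the high note 32 cycles, which keeps the 8-to-1 ratio of three octaves.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Strength of each whole-number frequency, up to half the sample count."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)))
        for k in range(n // 2)
    ]

N = 128                                   # number of samples in the window
low = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]    # "middle C"
high = [math.sin(2 * math.pi * 32 * t / N) for t in range(N)]  # 3 octaves up
merged = [a + b for a, b in zip(low, high)]   # both keys pressed together

magnitudes = dft_magnitudes(merged)
peaks = [k for k, m in enumerate(magnitudes) if m > N / 4]
print(peaks)   # prints [4, 32]
```

Given only the merged signal, the transform recovers the two frequencies that went into it.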

In nature, there are mechanisms which can implement Fourier decompositions such as the human ear for decomposing sound waves into various periodicity components (aka notes):

ear

or prisms for decomposing light waves into various periodicity components (aka colors):

prism

In the world of engineering designs, Fourier decompositions are accomplished by circuits:

equalizer

or computer algorithms implementing Fourier Transform math:

dft

As for DNA, detecting periodicities (like 3mers) using Fourier transforms is a bit forced, in my opinion needlessly complex, and perhaps in the end will not be the best tool for detecting exons. More mathematical complexity does not necessarily create a more accurate tool for identifying exons. In fact, 3mers can be seen even in short stretches of synthetic genomes created by random number generators.
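The point about random sequences can be checked with a small experiment. This is a sketch under stated assumptions: the "genome" is 3000 nucleotides drawn uniformly at random, the window length of 30 is arbitrary, and `period3_score` is an illustrative Fourier-style measure (0 for no period-3 signal, 1 for an ideal pattern like ATG repeated), not Skittle's detector.

```python
import cmath
import random

def period3_score(seq):
    """Illustrative period-3 score in the range 0..1 (hypothetical helper)."""
    n = len(seq) - len(seq) % 3
    if n == 0:
        return 0.0
    w = cmath.exp(-2j * cmath.pi / 3)
    rot = [1, w, w * w]
    power = 0.0
    for base in "ACGT":
        power += abs(sum(rot[i % 3] for i in range(n) if seq[i] == base)) ** 2
    return power / (n * n / 3)

rng = random.Random(0)   # fixed seed only so the run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(3000))

window = 30
scores = [period3_score(genome[i:i + window])
          for i in range(0, len(genome), window)]
mean_short = sum(scores) / len(scores)
whole = period3_score(genome)

print("strongest short window:", max(scores))
print("average short window:  ", mean_short)
print("whole random genome:   ", whole)
```

Some short windows score well above the average purely by chance, while the score over the whole random sequence is far smaller. That is the sampling noise that makes short-window 3mer signals hard to trust.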

One of the reasons the Fourier transform is an imperfect tool for detecting periodicities in DNA is that the periodicities in DNA are so short-lived and only approximately periodic. Worse, noise due to sampling DNA will create lots of false periodicity signals. And worse yet, each nucleotide position presents only two possible discrete states for a given nucleotide, i.e. Guanine or not-Guanine, Adenine or not-Adenine, which is essentially like having only two possible amplitudes of 0 or 1. Fourier analysis normally assumes that amplitudes can take a continuous range of values. The fact that there are only 2 discrete amplitudes makes it virtually impossible to overlap information by adding periodic signals together, as can easily be done with sound waves (like the example above). One cannot take a DNA strand with a strong 3mer pattern and somehow add it to a DNA strand with a 4mer pattern and get a DNA strand containing both patterns.
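A small sketch shows why the addition fails. The two strands below are made up for illustration: one places A at every 3rd position and the other at every 4th. The helper name `indicator` is hypothetical.

```python
def indicator(seq, base):
    """1 where the strand has the given base, 0 everywhere else."""
    return [1 if ch == base else 0 for ch in seq.upper()]

three = "ACC" * 4    # A at every 3rd position, 12 nucleotides
four = "ACCC" * 3    # A at every 4th position, 12 nucleotides

summed = [x + y for x, y in zip(indicator(three, "A"), indicator(four, "A"))]
print(summed)   # prints [2, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0]

# A real strand can only say "A" or "not A" at each position
is_valid_strand = all(value in (0, 1) for value in summed)
print(is_valid_strand)   # prints False
```

The sum contains a 2 at the first position, which has no meaning for DNA: a position cannot hold two A's at once. Two sound waves, by contrast, add to another perfectly valid sound wave.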

The search for overlapping information in DNA must go beyond the use of Discrete Fourier Transforms and toward more accessible constructs. The BioLanguages website hopes to explore methods of identifying overlapping information in DNA.