The Theoretical Prediction of the Mutagenic Regions of the p53 Tumor Suppressor Gene

For more than 25 years, we have been searching for biomathematical laws that control and structure DNA, genes, proteins, chromosomes and genomes [1,2]. In 1997 we discovered a simple numerical law, based solely on atomic masses, that unifies the three languages of biology: DNA, RNA, and amino acids. This law, “the Master Code of Biology”, was published in 2009 in the book Codex Biogenesis [3] and then in 2015 in a peer-reviewed reference article [4]. In 2017 we published several applications: HIV, SNPs, and brain genetics (prions, amyloids, DUF1220). On the other hand, the TP53 tumor suppressor gene plays a central role in a majority of cancers [5,6]. Between January 1993 and July 1996, more than 4300 papers were published with the term “p53” in the title. p53 was named Molecule of the Year by the editors of the journal Science [13], and has been called the “guardian of the genome” by many researchers.


Introduction
For more than 25 years, we have been searching for biomathematical laws that control and structure DNA, genes, proteins, chromosomes and genomes [1,2]. In 1997 we discovered a simple numerical law, based solely on atomic masses, that unifies the three languages of biology: DNA, RNA, and amino acids. This law, "the Master Code of Biology", was published in 2009 in the book Codex Biogenesis [3] and then in 2015 in a peer-reviewed reference article [4]. In 2017 we published several applications: HIV, SNPs, and brain genetics (prions, amyloids, DUF1220). On the other hand, the TP53 tumor suppressor gene plays a central role in a majority of cancers [5,6]. Between January 1993 and July 1996, more than 4300 papers were published with the term "p53" in the title. p53 was named Molecule of the Year by the editors of the journal Science [13], and has been called the "guardian of the genome" by many researchers.
Somatic TP53 mutations occur in almost every type of cancer, at rates ranging from 38%-50% in ovarian, esophageal, colorectal, head and neck, larynx, and lung cancers to about 5% in primary leukemia, sarcoma, testicular cancer, malignant melanoma, and cervical cancer [5]. The mutations of TP53 are well known and catalogued, classified by frequency and by the types of cancer in which they occur [6]. Based on studies that examined the whole coding sequence, 86% of mutations cluster between codons 125 and 300, corresponding mainly to the DNA binding domain. The results presented here, which have never been published, were obtained in 2001 from the IARC TP53 mutation databases available at that time [6].

Master Code summary
Starting from the atomic masses of the atoms that make up nucleotides and amino acids, we obtain an integer scale of codes that characterizes each bio-atom, each DNA base (TCAG), each RNA base (UCAG), and each amino acid. As demonstrated below, for each double-stranded DNA sequence to be analyzed, we compute 2 patterned signatures: the "Genomics pattern" and the "Proteomics pattern". A remarkable fact is that this proteomics pattern still exists even for regions not translated into proteins (any coding or non-coding DNA region). The computational methodology of the Master Code [3,4] then produces 2 patterned images (2D curves, Figure 1) which are always very strongly correlated. This would mean that beyond the visible sequence of DNA there is a kind of numerical "MASTER CODE", manifested by two physical and chemical supports of biological information: the DNA sequences and the amino acid sequences. The RNA image constitutes a kind of neutral element, like the zero of mathematics (see the coding step below) [7].
Cancer Therapy & Oncology International Journal
Our Master Code analyses of thousands of genes and genomes (viruses, archaea, bacteria, eukaryotes) indicate that the extrema (maxima and minima) mark functional regions, such as protein active sites, and fragility points, such as chromosome breakpoints.
Figure 1: (a) The "Master Code of Biology" and Great Unification show an equivalence of the Genomics (DNA) and Proteomics (amino acids) signatures, while the RNA signature is a neutral area like a "zero". (b) A typical correlation between Genomics and Proteomics signatures for the Prion protein, the whole Malaria chromosome 2, and the whole HIV1 genome.

Detailing the Master Code computing
A quick presentation of the formula for life: In [3,4] we introduced the law we call the Formula for Life. This law unifies all the components of living matter, from the bio-atoms C, O, N, H, S, P and their various isotopes to genes, RNA, DNA, amino acids, chromosomes and whole genomes. It is the result of a simple non-linear projection formula applied to the atomic masses. The result of this projection is then organized on a linear scale of integer codes (e.g., -2, -1, 0, 1, 2, 3...) coding regular multiples of Pi/10. These codes are called Pi-masses. The projection itself yields a real number, of which we retain only the residue (decimal remainder).
Detailing the "PPI (mass)" projection: Consider any atomic mass « m », which may be that of a bio-atom, a nucleotide, a codon, an amino acid or any other genetic compound based on bio-atoms, or even that of any atom (Mendeleiëv Table, [8]). In this projection, P = 0.742340663. The process works especially well on the average masses (Tables 1 & 2), but it may also be applied to a particular isotope or to any derived atomic mass based on specific proportions of the various isotopes.
An overview of the "Biology Master Code" Great Unification of DNA, RNA and amino acids: It may seem surprising that a process as finely tuned as the biology of life requires three languages as diverse and heterogeneous as DNA, with its alphabet of four bases TCAG; RNA, with its alphabet of four bases UCAG; and proteins, with their language of 20 amino acids. Starting only from the double-stranded DNA sequence data, the "Master Code" is a digital language unifying DNA, RNA and proteins. It provides a common alphabet (see the Pi-mass scale below) for the three fundamental languages of genetics, biology and genomics. The construction method of the "Master Code" is fully described below.
This construction highlights a significant discovery, which we summarize as follows: "Above the 3 languages of Biology (DNA, RNA and amino acids), there is a universal common code that unifies, connects and contains all three languages". We call this code the "Master Code of Biology." Here is a brief description of our process for computing the Master Code.
The coding step: First, we apply the process to any DNA sequence, whether it encodes a gene or is non-coding (formerly mislabelled as junk DNA). The sequence may therefore be a gene, a DNA contig, a whole chromosome or a whole genome. In this sequence we always consider double-stranded DNA, and we explore the three codon reading frames and the two possible directions of strand reading (3' ==> 5' or 5' ==> 3'). In all cases, the base unit is the triplet codon consisting of three bases.
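The enumeration of reading frames and strand directions described in the coding step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the input sequence is a toy example, and the second reading direction is assumed here to be the reverse complement of the input strand.

```python
# Minimal sketch: split a DNA sequence into triplet codons for the three
# reading frames and the two strand directions (six readings in total).
# Assumption: the second direction is taken as the reverse complement.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in the opposite direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def codons(seq: str, frame: int) -> list[str]:
    """Split seq into complete triplets, starting at offset frame (0, 1 or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def six_frames(seq: str) -> dict[tuple[str, int], list[str]]:
    """Map (direction, frame) to the list of codons read in that frame."""
    strands = {"forward": seq.upper(), "reverse": reverse_complement(seq)}
    return {
        (name, frame): codons(strand, frame)
        for name, strand in strands.items()
        for frame in range(3)
    }


frames = six_frames("ATGGAGGAGCCGCAG")  # toy 15-base sequence
for key, triplets in frames.items():
    print(key, triplets)
```

Incomplete trailing triplets are dropped, so frames 1 and 2 of a 15-base sequence hold four codons each instead of five.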
As shown in the sample above, we calculate the Pi-masses of double-stranded DNA base triplets, double-stranded RNA base triplets, and double-stranded pseudo amino acids. For each single DNA triplet codon, we deduce the complementary codon by Watson-Crick base pairing. We run the same process for the RNA pseudo triplet codon pairs and then, similarly, for the amino acid translations of these DNA codon pairs, using the Universal Genetic Code table. We thus obtain 3 series of paired codes (DNA, RNA and amino acids), and we do so systematically, whether the DNA region is gene-coding or junk DNA. This produces three digital vectors, one for each of the DNA, RNA, and proteomics coded images. At this point we already reach an absolutely remarkable result, as symbolized in Figure 1. From here on we focus only on the DNA code (genomics) and the amino acid code (proteomics), since the RNA code plays a neutral role in this context.
The globalization and integration step: To these two numeric vectors we apply a simple linear globalization, or integration, operator. It "spreads" the code of each triplet position across short, medium and long distances, producing an impact or "resonance" at each position and also at the most distant positions, and reciprocally by feedback. This gives a new digital image, from which we retain not the values but their rankings, obtained by sorting them.
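The pairing of each codon with its complementary codon, and the derivation of the RNA and amino acid views of that pair, can be sketched as below. This is an illustration under stated assumptions: the name `codon_views` is hypothetical, the complementary codon is read base by base in the same positional order as the original codon, and the Pi-mass values themselves are not computed because the projection formula is not reproduced here. The compact table is the standard genetic code, with `*` marking STOP codons.

```python
# Minimal sketch: for one DNA codon, derive the Watson-Crick partner codon,
# the RNA form of both, and the amino acids both would translate to.
# Assumption: the partner codon is read base by base in the same order.

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code: 64 codons, '*' marks the three STOP codons.
GENETIC_CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

PAIR = str.maketrans("ACGT", "TGCA")


def codon_views(codon: str) -> dict[str, tuple[str, str]]:
    """Return the (codon, partner) pair in DNA, RNA and amino acid form."""
    codon = codon.upper()
    partner = codon.translate(PAIR)
    return {
        "dna": (codon, partner),
        "rna": (codon.replace("T", "U"), partner.replace("T", "U")),
        "aa": (GENETIC_CODE[codon], GENETIC_CODE[partner]),
    }


views = codon_views("ATG")
print(views)
```

Because every codon has a partner and every partner has an entry in the table, an amino acid view exists for any DNA region, coding or not, which is why the text speaks of "pseudo" amino acids. A partner codon may also be a STOP codon.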
We run this process for each codon triplet position, for each of the three codon reading frames and for the two sequence reading directions (3' ==> 5' and 5' ==> 3').
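The integration and ranking step can be illustrated with a short sketch. The exact operator is not reproduced in this text, so the code uses a stand-in that is purely an assumption: the potential of position i is taken as the sum of all running partial sums of the codes from i to the end of the sequence, so that each code contributes at short, medium and long range. The function names and the toy input are hypothetical.

```python
# Minimal sketch of a globalization step followed by ranking.
# Assumption: the "long range potential" of position i is modelled as the
# sum of every running partial sum of codes[i:]. This is a stand-in
# operator, not the published formula.
from itertools import accumulate


def cumulative_potential(codes: list[int]) -> list[int]:
    """For each start position i, add up all partial sums of codes[i:]."""
    return [sum(accumulate(codes[i:])) for i in range(len(codes))]


def ranks(values: list[int]) -> list[int]:
    """Replace each value by its rank (1 = smallest); ties keep input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


toy_codes = [1, -2, 3]  # toy integer codes, one per codon position
potentials = cumulative_potential(toy_codes)
signature = ranks(potentials)
print(potentials, signature)
```

Keeping only the ranks, as the text describes, makes the resulting signature insensitive to the scale of the underlying potentials.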
For example, to summarize this method: on the starting area of the GENOMICS (DNA) code of the Prion above, the "radiation" of triplet codon number 1 propagates as follows:
[Illustration: propagation of codon 1 in the Genomics code]
Similarly, applying the same process to the starting area of the PROTEOMICS code of the Prion above, the "radiation" of triplet codon number 1 propagates as follows:
[Illustration: propagation of codon 1 in the Proteomics code]
To be complete, the same work must also be carried out for each codon reading frame. Meanwhile, a more synthetic way to compute these "long range potentials" for each codon position is the following formula:
[Equation: cumulated potential of codon location "i"]
Then finally:
[Equation: example for the Genomics image of codon "i"]
The initial computing method described above provides:
[Illustration: result of the initial computing method]
The Great Unification between the Genomics and Proteomics Master Code images: When the process described above is applied to any sequence (gene-coding, DNA contig, junk DNA, whole chromosome or genome), a second surprise appears, just as stunning as that of the RNA neutral element. We find that for one of the three codon reading frames, the Genomics patterned signature and the Proteomics patterned signature are highly correlated.
Whereas the three genomics signatures are correlated with one another in all cases, the proteomics signatures are correlated with the genomics signatures for only one codon reading frame, and are generally in dissonance for the two remaining reading frames. There are also areas of perfect local matching, concentrated on the functional sites of proteins, hot-spots, chromosome breakpoints, etc. Figure 1 summarizes this result for the general case and for three representative cases: the Prion protein [9,10], a whole chromosome of the Malaria parasite, and a complete HIV1 genome [5]. It is important to note the universal character of this genomics/proteomics coupling: for example, for the roughly three billion base pairs of the whole human genome, we have verified this law across the entire genome, for all its chromosomes and in all its regions, with a global correlation of about 99%.
Within this global correlation, specific codon positions match perfectly. This is especially remarkable where the matching regions correspond to biologically functional areas: hot-spots, the active sites of proteins, breakpoints and chromosome fragility regions (e.g., the Fragile X genetic disease), etc.
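The comparison of Genomics and Proteomics signatures across reading frames can be sketched as follows. This is an illustration only: the text does not state which correlation measure is used, so the Pearson coefficient is an assumption, and the function names and toy signatures are hypothetical.

```python
# Minimal sketch: correlate the Genomics and Proteomics signatures frame by
# frame and pick the frame where they agree best.
# Assumption: agreement is measured with the Pearson coefficient.
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def best_frame(genomics: dict, proteomics: dict) -> tuple[int, dict]:
    """Return the frame with the highest correlation, plus all scores."""
    scores = {f: pearson(genomics[f], proteomics[f]) for f in genomics}
    return max(scores, key=scores.get), scores


# Toy rank signatures for the three reading frames (made-up values).
toy_genomics = {0: [1, 2, 3, 4], 1: [1, 2, 3, 4], 2: [1, 2, 3, 4]}
toy_proteomics = {0: [1, 2, 3, 4], 1: [4, 3, 2, 1], 2: [2, 1, 4, 3]}

frame, scores = best_frame(toy_genomics, toy_proteomics)
print(frame, scores)
```

In the toy data only frame 0 is strongly correlated, mirroring the situation described above where a single reading frame couples the two signatures.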

"Gen Lights": A Method for Mapping Regions of Mutability of a Gene
The output of our basic Master Code research can predict the mutability level of each codon, by value and by position, from its effect on the Genomics/Proteomics unification at the level of the whole structure.
We can then associate with each codon position a "Mutability Coefficient" varying from 1 to 61:
i. 1 if the actual codon is the best of the 61 possible values (STOP codons excluded).
ii. 61 if any codon change increases the global Genomics/Proteomics coupling ratio of the whole gene.
We note that regions with low coefficients correspond to optimal, "CONSERVED" regions. Codon scores near 1 reveal conserved, optimal codons associated with a good Master Code coupling ratio. Conversely, codon scores near 61 reveal a poor Master Code coupling ratio, and therefore a codon location with a probably high mutagenic potential (Figure 4).
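The definition of the Mutability Coefficient can be read as a rank among the 61 sense codons, which the sketch below makes explicit. It is a hedged illustration: the global Genomics/Proteomics coupling ratio is not implemented here, so it is passed in as a caller-supplied function, and `toy_coupling` is a made-up placeholder used only to exercise the ranking logic. The sketch assumes the actual codon is itself a sense codon.

```python
# Minimal sketch: Mutability Coefficient as the rank (1..61) of the actual
# codon among all 61 sense codons at one position.
# Assumption: coupling(codons) returns the global coupling ratio of the
# whole gene; it is supplied by the caller and not implemented here.
from itertools import product
from typing import Callable, Sequence

STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = [
    codon
    for codon in ("".join(p) for p in product("TCAG", repeat=3))
    if codon not in STOP_CODONS
]


def mutability_coefficient(
    codons: Sequence[str],
    position: int,
    coupling: Callable[[Sequence[str]], float],
) -> int:
    """1 if no substitution improves the coupling, 61 if every one does."""
    actual_score = coupling(codons)
    better = 0
    for candidate in SENSE_CODONS:
        if candidate == codons[position]:
            continue
        mutated = list(codons)
        mutated[position] = candidate
        if coupling(mutated) > actual_score:
            better += 1
    return 1 + better


def toy_coupling(codons: Sequence[str]) -> float:
    """Placeholder score (fewer T bases is 'better'); not the Master Code."""
    return -sum(codon.count("T") for codon in codons)


gene = ["ATG", "GGG", "TTT"]  # toy codon list
print([mutability_coefficient(gene, i, toy_coupling) for i in range(len(gene))])
```

With the placeholder score, the codon TTT is the worst possible choice at its position and receives 61, while GGG cannot be improved and receives 1.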

Results and Discussion
P53 is a 393 amino acid protein encoded by a gene about 20 kilobases long, with 11 exons, located on chromosome 17p13. The P53 protein has blocks of highly conserved regions, including residues 117-142, 171-181, 234-258 and 270-286. These highly conserved regions coincide with the mutation clusters and HOTSPOTS found in p53 in human cancers, most of which lie within exons 5-8. Mutations are highly frequent at the four mutational "hotspots" at codons 175, 245, 248 and 273. In Figure 2, Cho Y et al illustrate the complex interaction between P53 and the DNA molecule (Figure 2). "The DNA (blue) and core domain (turquoise) are shown with the zinc atom (red), with the position of the six hot spot amino acid residues (yellow). Mutations in hot spot amino acids either interfere with protein-DNA contacts, or disrupt integrity of the domain. Thus, all naturally occurring mutations in p53 directly or indirectly affect the interaction of p53 with DNA, demonstrating that sequence-specific DNA binding is central to the normal functioning of p53 as a tumour suppressor." On the other hand, these hotspot mutations are present in all kinds of cancers (Figure 3).
How well, then, does our theoretical method predict these mutations?
Figure 4 below shows a typical output. Every codon position is assigned a Mutability Coefficient in the range 1 to 61. A value of 1 signifies that the actual codon is the optimal one possible. A value of 61 signifies that any mutation at this position increases the global Genomics/Proteomics organization of the gene; such codons therefore have a high mutability power. In the figure, the four hotspot codon positions 175, 245, 248 and 273 are shown in red. Green bars mark high mutability codons (coef > 55).
In contrast, yellow bars mark highly conserved, optimal codons.
The predicted mutability coefficients of the 4 hotspots agree with the observations:
iv. Hotspot 273: coef = 57 (very high mutability coefficient, on the 1-61 range) (Figure 4).
Below, in another simulation run (Figure 5), we analyse the 622 GERMLINE single mutations reported in [8]. The agreement between experiment and prediction is striking. Horizontally, we represent all mutations sorted by decreasing frequency: on the left, the high frequency mutations (codons 248, 245, 175, 273…); on the right, the rare mutation points. The red bars represent the effect of each mutation on the Genomics/Proteomics ratio: on the left this ratio increases, while on the right it decreases. In blue, we plot the evolution of the mutability coefficient: high for frequent mutations, low or flat for the others. This is good evidence for the predictive power of our Master Code based theoretical mutagenesis method, tested on the best known example of mutability: P53, the "King" of cancer genes (Figure 5).
Another powerful representation of the "Mutability Global Space" is the following. In this kind of figure, we smooth the basic Master Code score outputs. We then compare the patterns related to high mutability (green) with the patterns related to low mutability (blue). The superposition of both graphs produces "patterned ISLANDS". Green islands indicate a globally highly mutagenic region. Conversely, regions where blue predominates correspond to globally optimal, and therefore conserved, regions. In this graph, we note that the four hotspots are located in, or close to, regions of high global mutability. We also see, around codon 80, an optimal region which would normally be highly conserved (Figure 6).
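The smoothing and "island" reading of a mutability profile can be sketched as below. This is an illustration under assumptions: the text does not specify the smoothing method, so a centred moving average with a hypothetical window size is used, and an island is taken to be a contiguous run of positions above a threshold. The threshold of 55 echoes the "coef > 55" criterion used for the green bars; the profile values are made up.

```python
# Minimal sketch: smooth a per-codon profile and extract contiguous
# "islands" above a threshold.
# Assumptions: centred moving average, window size chosen arbitrarily;
# an island is a run of consecutive positions with value > threshold.


def moving_average(values: list[float], window: int = 5) -> list[float]:
    """Centred moving average; the window shrinks at both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def islands(values: list[float], threshold: float) -> list[tuple[int, int]]:
    """Return inclusive (start, end) index pairs of runs above threshold."""
    runs, start = [], None
    for i, value in enumerate(values):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(values) - 1))
    return runs


toy_profile = [10, 58, 60, 57, 12, 5, 59]  # made-up coefficients
print(islands(toy_profile, 55))
print(moving_average([1, 2, 3], window=3))
```

Applying `islands` to the smoothed profile rather than the raw one gives broader regions, which is closer in spirit to the global view described above.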

Conclusion
The theoretical method for predicting the mutagenic regions of TP53 [9][10][11][12][13] is shown to correlate closely with the mutagenic hotspots and codons recorded from thousands of individual cancer cases (IARC database). The method is likely to be universal, in that it can be applied to predict the mutagenic regions of any other gene or protein, whether human, animal or plant.