The Theoretical Prediction of the Mutagenic Regions of the p53 Tumor Suppressor Gene
Jean-Claude Perez*
Retired interdisciplinary Researcher (IBM), France
Submission: November 08, 2017; Published: January 05, 2018
*Corresponding author: Jean-Claude Perez, Retired interdisciplinary Researcher (IBM), 7 avenue de terre-rouge, F33127 Martignas, Bordeaux metropole, France, Email: firstname.lastname@example.org
How to cite this article: Jean c P. The Theoretical Prediction of the Mutagenic Regions of the p53 Tumor Suppressor Gene. Canc Therapy & Oncol Int J. 2018; 8(5): 555750. DOI: 10.19080/CTOIJ.2018.08.555750
Mutations in the TP53 gene are found in about one in every two cases of cancer. The locations and frequencies of these mutations are well known and catalogued. We therefore use these TP53 mutations to validate a theoretical method for predicting the mutagenic regions of TP53. The method uses the Master Code of Biology, which reveals a coupling and unification between the Genomics and Proteomics codes for any DNA sequence analyzed. The "score" of these couplings highlights the functional regions of genes, proteins, chromosomes and genomes. For each of the 393 codons of TP53, and for each of the 61 codon values allowed by the genetic code (i.e. 393 x 61 simulated genes), we rank the corresponding Master Code scores. Codons with scores close to 1 correspond to conserved regions, whereas codons with scores close to 61 reveal highly mutagenic regions. The method is then validated against, and correlated with, the real mutations observed experimentally in hundreds of cases.
For more than 25 years, we have been looking for biomathematical laws that control and structure DNA, genes, proteins, chromosomes and genomes [1,2]. In 1997 we discovered a simple numerical law, based solely on atomic masses, that unifies the three languages of biology: DNA, RNA, and amino acids. This law, "the Master Code of Biology", was published in 2009 in the book Codex Biogenesis and then in 2015 in a peer-reviewed reference article. In 2017 we published several applications: HIV, SNPs, and brain genetics (Prions, Amyloids, DUF1220). On the other hand, the TP53 tumor suppressor gene plays a central role in a majority of cancers [5,6]. Between January 1993 and July 1996, more than 4300 papers were published in which the term "p53" appears in the title. p53 was also named Molecule of the Year by the editors of the journal Science (13), and the "guardian of the genome" by many researchers.
Somatic TP53 mutations occur in almost every type of cancer, at rates ranging from 38%-50% in ovarian, esophageal, colorectal, head and neck, larynx, and lung cancers to about 5% in primary leukemia, sarcoma, testicular cancer, malignant melanoma, and cervical cancer. The mutations of TP53 are well known and catalogued, classified by frequency and by the types of cancer in which they occur. Based on studies that examined the whole coding sequence, 86% of mutations cluster between codons 125 and 300, corresponding mainly to the DNA binding domain. The results presented here, never published before, were obtained in 2001 from the IARC TP53 mutation databases available at that time.
Starting from the atomic masses of the atoms that make up nucleotides and amino acids, we obtain a numerical scale of integer codes characterizing each bioatom, each TCAG DNA base, each UCAG RNA base, and each amino acid. As demonstrated below, for each double-stranded DNA sequence to be analyzed, we compute two patterned signatures: the "Genomics pattern" and the "Proteomics pattern". Remarkably, the proteomics pattern exists even for regions that are not translated into proteins (any coding or non-coding DNA region). The computational methodology of the Master Code [3,4] then produces two patterned images (2D curves, Figure 1) which are always very strongly correlated. This would mean that, beyond the visible DNA sequence, there is a kind of numerical "MASTER CODE" manifested through two physical and chemical supports of biological information: the DNA sequence and the amino acid sequence. The RNA image constitutes a kind of neutral element, like zero in mathematics (see the coding step below). Our thousands of Master Code analyses of genes and genomes (viruses, archaea, bacteria, eukaryotes) demonstrated that the extrema (maxima and minima) signify functional regions, such as the active sites of proteins, and fragility points, such as chromosome breakpoints.
A quick presentation of the formula for life: In [3,4] we introduced the law we call the Formula for Life. This law unifies all the components of living matter, from the bio-atoms C, O, N, H, S, P and their various isotopes to genes, RNA, DNA, amino acids, chromosomes and whole genomes. It results from a simple non-linear projection formula applied to atomic masses. The result of this projection is then organized on a linear scale of integer codes (e.g., -2, -1, 0, 1, 2, 3...) that encode regular multiples of Pi/10. These codes are called Pi-masses. The projection itself yields a real number, of which we retain only the residue (decimal remainder). Detailing the "PPI (mass)" projection: Consider any atomic mass « m », which may be that of a bio-atom, a nucleotide, a codon, an amino acid, any other genetic compound based on bio-atoms, or even any atom of Mendeleev's periodic table. In this projection, P = 0.742340663.
This process works primarily on average atomic masses (Tables 1 & 2), but it may also be applied to a particular isotope or to any specific atomic mass derived from given proportions of the various isotopes. An overview of the "Biology Master Code", the great unification of DNA, RNA and amino acids: It may seem surprising that a process as finely tuned as the biology of life requires three languages as diverse and heterogeneous as DNA, with its alphabet of four bases TCAG; RNA, with its alphabet of four bases UCAG; and proteins, with their language of 20 amino acids. Starting only from double-stranded DNA sequence data, the "Master Code" is a digital language unifying DNA, RNA and proteins that provides a common alphabet (the Pi-mass scale, see below) for the three fundamental languages of Genetics, Biology and Genomics.
The construction method of "the Master Code" is described below. It highlights a significant discovery, which we summarize as follows: "Above the 3 languages of Biology - DNA, RNA and amino acids - there is a universal common code that unifies, connects and contains all three languages". We call this code the "Master Code of Biology." Here is a brief description of our process for computing the Master Code. The coding step: First, we apply it to any DNA sequence, whether it encodes a gene or is a non-coding sequence (formerly mislabelled as junk DNA). It may thus be a gene, a DNA contig, or a whole chromosome or genome. In this sequence, we always consider double-stranded DNA, and we explore the three codon reading frames and the two possible directions of strand reading (3' ==> 5' or 5' ==> 3'). In all cases, the base unit is the triplet codon consisting of three bases.
As shown in the sample below, we calculate the Pi-masses of double-stranded DNA base triplets, double-stranded RNA base triplets, and double-stranded pseudo amino acids. For each single DNA triplet codon, we deduce the complementary codon from Watson-Crick base pairing. We run the same process for pairs of RNA pseudo triplet codons and then, similarly, for the amino acids translated from these DNA codon pairs using the Universal Genetic Code table. We thus obtain three sets of paired codes (DNA, RNA and amino acids), and this systematically, whether the DNA region is gene-coding or junk DNA.
A simple example: the starting region of Prion gene:
DNA image coding:
ATG CTG GTT CTC TTT...
-1 -1 -1 0 0...
TAC GAC CAA GAG AAA...
0 -1 0 -2 0...
RNA image coding:
AUG CUG GUU CUC UUU...
-2 -2 -3 -1 -3...
UAC GAC CAA GAG AAA...
-1 -1 0 -2 0...
Proteomics image coding:
MET LEU VAL LEU PHE...
4 3 3 4 3...
TYR ASP GLN GLU LYS...
2 -1 1 0 4...
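The pairing and translation used in the Prion example above can be reproduced with a short sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the paired codon is taken base by base without reversing the strand (as the example appears to do), and codons that translate to a stop are shown as "*", since the text does not say how the Master Code treats them. The Pi-mass values themselves are not computed here.

```python
# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
PAIR = str.maketrans("ATCG", "TAGC")  # Watson-Crick base pairing


def split_codons(dna):
    """Cut a DNA string into complete triplets, dropping trailing bases."""
    return [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]


def paired_codon(codon):
    """Base-by-base complement of one triplet (assumption: no reversal)."""
    return codon.translate(PAIR)


def to_rna(codon):
    """Replace T by U to obtain the RNA spelling of a codon."""
    return codon.replace("T", "U")


def images(dna):
    """Return the DNA, RNA and amino-acid pairs for each codon position."""
    rows = []
    for codon in split_codons(dna):
        mate = paired_codon(codon)
        rows.append({
            "dna": (codon, mate),
            "rna": (to_rna(codon), to_rna(mate)),
            "aa": (CODON_TABLE[codon], CODON_TABLE[mate]),
        })
    return rows


prion_start = "ATGCTGGTTCTCTTT"  # first five codons of the example
for row in images(prion_start):
    print(row)
```

The one-letter amino acids printed for the two strands are M L V L F and Y D Q E K, which correspond to the MET LEU VAL LEU PHE and TYR ASP GLN GLU LYS rows of the example.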
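Pi-masses corresponding to the two strands are then added for each triplet. For the five Prion codons above, this gives -1 -2 -1 -2 0 for the DNA image, -3 -3 -3 -3 -3 for the RNA image, and 6 2 4 4 7 for the proteomics image.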
This produces three digital vectors, one for each of the DNA, RNA, and proteomics coded images. At this point we already reach a remarkable result, as symbolized in Figure 1. From now on we focus only on the DNA code (genomics) and the amino acid code (proteomics), the RNA code playing a neutral role in this context.
The globalization and integration step: To these two numeric vectors we apply a simple linear globalization (or integration) operator. It "spreads" the code of each triplet position across short, medium and long distances, producing an impact or "resonance" from each position onto even the most distant positions, and reciprocally by feedback. This gives a new digital image in which we retain not the values but their rankings, obtained by sorting them.
We run this process for each codon triplet position, for each of the three codon reading frames and for the two sequence reading directions (3' ==> 5' and 5' ==> 3').
For example, to summarize this method: on the starting area of the GENOMICS (DNA) code of Prion above, the "radiation" of triplet codon number 1 propagates as follows:
-1 -2 -1 -2 0... ==>
-1 -3 -4 -6 -6...
We then sum these values: -20. The values are thus accumulated gradually.
The same operation starting from codon number 2 produces:
-1 -2 -1 -2 0... ==>
-2 -3 -5 -5...
We then sum these values: -15, and so on.
Similarly, applying the same process to the starting area of the PROTEOMICS code of Prion above, the "radiation" of triplet codon number 1 propagates as follows:
6 2 4 4 7... ==>
6 8 12 16 23...
We then sum these values: 65. The values are thus accumulated gradually.
The same operation starting from codon number 2 produces:
6 2 4 4 7... ==>
2 6 10 17...
We then sum these values: 35, and so on.
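The arithmetic of the two worked examples above can be checked with a minimal sketch. The function names are hypothetical, and the sketch covers only the forward accumulation shown in the examples, truncated to the five codons listed; on a real sequence the propagation would continue along the whole sequence.

```python
from itertools import accumulate


def add_strands(strand_a, strand_b):
    """Add the Pi-mass codes of the two strands, codon by codon."""
    return [a + b for a, b in zip(strand_a, strand_b)]


def radiation(codes, start):
    """Running sums of the codes from position `start` (0-based) onward."""
    return list(accumulate(codes[start:]))


def global_signature(codes):
    """One value per codon position: the total of its radiation."""
    return [sum(radiation(codes, i)) for i in range(len(codes))]


# Per-strand values copied from the Prion example in the text.
dna = add_strands([-1, -1, -1, 0, 0], [0, -1, 0, -2, 0])
protein = add_strands([4, 3, 3, 4, 3], [2, -1, 1, 0, 4])

print(dna)                       # [-1, -2, -1, -2, 0]
print(radiation(dna, 0))         # [-1, -3, -4, -6, -6]
print(sum(radiation(dna, 0)))    # -20
print(sum(radiation(dna, 1)))    # -15
print(protein)                   # [6, 2, 4, 4, 7]
print(radiation(protein, 0))     # [6, 8, 12, 16, 23]
print(sum(radiation(protein, 0)))  # 65
print(sum(radiation(protein, 1)))  # 35
```

`global_signature` applies the same accumulation from every codon position, giving one "global signature" value per position for the later ranking step.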
Finally, after computing these "global signatures" for each codon position at the Genomics and Proteomics levels, we sort each genomic and proteomic vector to obtain the ranking of the codon positions. For example, as illustrated below, the Genomics ranking patterned signature is 2 1 4 3 5 for this mini subset of the first 5 codon positions of Prion (arbitrary values). To summarize the Master Code computing method on these first 5 codon positions of the Prion protein sequence:
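The ranking step can be sketched as follows. This is an illustration only: the function name is hypothetical, the input values are arbitrary (like those in the text), and ranking from smallest to largest is an assumption, since the text does not state the sort direction.

```python
def ranking(values):
    """Rank of each position, 1 = smallest value; ties keep sequence order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, position in enumerate(order, start=1):
        ranks[position] = rank
    return ranks


signature = [35, 10, 80, 60, 95]  # arbitrary global signature values
print(ranking(signature))  # [2, 1, 4, 3, 5]
```

Only the order of the values matters after this step, which is why the patterned signatures are compared as rankings and not as raw sums.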
The great Unification between Genomics and Proteomics Master Code images: When the process described above is applied to any sequence (gene coding, DNA contig, junk DNA, whole chromosome or genome), a second surprise appears, just as stunning as that of the RNA neutral element. We find that, for one of the three codon reading frames, the Genomics patterned signature and the Proteomics patterned signature are highly correlated.
Unlike the three genomics signatures, which are correlated in all cases, the proteomics signatures are correlated with the genomics signatures for only one codon reading frame, and are generally in dissonance for the two remaining frames. In addition, there are local areas of perfect matching that coincide with functional sites of proteins, hot-spots, chromosome breakpoints, etc. Figure 1 summarizes this universal breakthrough for the general case and for three representative cases: the Prion protein [9,10], a whole chromosome of the malaria parasite, and a complete HIV1 genome. It is important to note the universal character of this genomics/proteomics coupling: for example, for the roughly three billion base pairs of the whole human genome, we have verified this law across the entire genome, for all its chromosomes and in all its regions, with a global correlation of about 99%.
Within this global correlation, specific codon positions match perfectly. This is remarkable because these regions correspond to biologically functional areas: hot-spots, the active sites of proteins, breakpoints and chromosome fragility regions (e.g., the Fragile X genetic disease), etc.
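One way to quantify the coupling described above is to correlate the genomic and proteomic rankings for each reading frame and to list the positions where the two rankings coincide. The sketch below assumes a plain Pearson correlation on the rank vectors and treats "perfect match" as equal ranks; the text does not specify either choice, and the rankings used here are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def matching_positions(genomic_rank, proteomic_rank):
    """1-based codon positions where both rankings agree exactly."""
    pairs = zip(genomic_rank, proteomic_rank)
    return [i for i, (g, p) in enumerate(pairs, start=1) if g == p]


def best_frame(genomic_rank, proteomic_rank_by_frame):
    """Return the best-correlated frame index and all per-frame scores."""
    scores = [pearson(genomic_rank, r) for r in proteomic_rank_by_frame]
    best = max(range(len(scores)), key=lambda f: scores[f])
    return best, scores


# Made-up rankings, not results from the paper.
genomic = [2, 1, 4, 3, 5]
proteomic_frames = [
    [2, 1, 5, 3, 4],
    [5, 4, 3, 2, 1],
    [3, 5, 1, 4, 2],
]
frame, scores = best_frame(genomic, proteomic_frames)
print(frame, scores)
print(matching_positions(genomic, proteomic_frames[frame]))
```

With these toy inputs, one frame correlates positively with the genomic ranking and the other two do not, which mimics the one-frame coupling described in the text.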
The output of our Master Code basic research can predict the mutability level of each codon value and position, in terms of its effect on the Genomics/Proteomics unification of the whole structure.
We can then associate with each codon position a "Mutability Coefficient" varying from 1 to 61:
i. 1 if the codon is the best of the 61 possible values (STOP codons excluded).
ii. 61 if any codon change increases the global Genomics/Proteomics coupling ratio of the whole gene.
We note that low-coefficient regions correspond to optimal, "CONSERVED" regions.
Conversely, high-coefficient regions correspond to regions of high MUTABILITY observed experimentally.
To summarize, in the case of the p53 gene, 393 amino acids long:
For each codon position « i » from i = 1 to 393,
DO: build 61 pseudo sequences in which gene (i) takes each possible coding codon in turn.
Compute score (real codon i) = rank of the real codon i in the hierarchy of the 61 scores (i).
The final result is an array of 393 scores with values between 1 and 61.
Low codon scores, near 1, reveal conserved, optimal codons with a good Master Code coupling ratio. Conversely, high codon scores, near 61, reveal a poor Master Code coupling ratio, and therefore a codon location that is probably highly mutagenic (Figure 4).
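The scan summarized above can be sketched as a loop over positions and candidate codons. This is a minimal illustration, not the author's implementation: `coupling_score` is a hypothetical caller-supplied function standing in for the Master Code Genomics/Proteomics coupling (higher meaning better coupling), and the toy scorer at the end exists only to make the example runnable.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
# The 61 codons that encode an amino acid in the standard genetic code.
CODING_CODONS = [
    "".join(c) for c in product("TCAG", repeat=3)
    if "".join(c) not in STOP_CODONS
]


def mutability_coefficients(codons, coupling_score):
    """Rank each real codon among the 61 coding codons at its position.

    Assumes every codon in `codons` is a coding codon.
    Coefficient 1: no substitution scores higher than the real codon.
    Coefficient 61: every other coding codon scores higher.
    """
    real_score = coupling_score(codons)
    coefficients = []
    for i, real in enumerate(codons):
        better = 0
        for candidate in CODING_CODONS:
            if candidate == real:
                continue
            variant = codons[:i] + [candidate] + codons[i + 1:]
            if coupling_score(variant) > real_score:
                better += 1
        coefficients.append(1 + better)
    return coefficients


def toy_score(codons):
    """Placeholder scorer (NOT the Master Code): counts G bases."""
    return sum(codon.count("G") for codon in codons)


print(len(CODING_CODONS))
print(mutability_coefficients(["GGG", "ATG", "TTT"], toy_score))
```

For a gene of 393 codons this evaluates the scoring function 393 x 60 times plus once for the real sequence, which matches the 393 x 61 simulated genes mentioned in the text when the real codon is counted at each position.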
P53 is a 393-amino-acid protein encoded by an 11-exon gene about 20 kb long located on chromosome 17p13. The P53 protein has 5 blocks of highly conserved regions at residues 117-142, 171-181, 234-258 and 270-286. These highly conserved regions coincide with the mutation clusters and HOTSPOTS found in p53 in human cancers, most of which lie within exons 5 - 8. These mutations are highly frequent at the four mutational "hotspots" at codons 175, 245, 248 and 273. In Figure 2, Cho Y et al illustrate the complex interaction between P53 and the DNA molecule (Figure 2). "The DNA (blue) and core domain (turquoise) are shown with the zinc atom (red), with the position of the six hot spot amino acid residues (yellow). Mutations in hot spot amino acids either interfere with protein-DNA contacts, or disrupt integrity of the domain. Thus, all naturally occurring mutations in p53 directly or indirectly affect the interaction of p53 with DNA, demonstrating that sequence-specific DNA binding is central to the normal functioning of p53 as a tumour suppressor."
Moreover, these hotspot mutations are present in all kinds of cancers (Figure 3).
How well, then, does our theoretical predictive method predict these mutations?
In Figure 4 below, we show a typical output: every codon position is assigned a Mutability Coefficient in the range 1 to 61. A value of 1 signifies that the real codon value is the optimal one. A value of 61 signifies that any mutation at this position increases the global Genomics/Proteomics organization of the gene; such codons therefore have a high mutability power. In the figure, the four hotspot codon positions 175, 245, 248 and 273 are shown in red. Green bars mark high-mutability codons (coef >55).
In contrast, yellow bars mark highly conserved, optimal codons.
The mutability coefficients predicted for the four hotspots are all at or near the maximum:
i. Hotspot 175: coef=61 (the highest possible mutability level).
ii. Hotspot 245: coef=61 (the highest possible mutability level).
iii. Hotspot 248: coef=57 (very high mutability on the 1-61 scale).
iv. Hotspot 273: coef=57 (very high mutability on the 1-61 scale) (Figure 4).
Below, in another simulation run (Figure 5), we analyse the 622 reported GERMLINE single mutations. The correlation between experiment and prediction is perfect:
Horizontally, we plot all mutations sorted by decreasing frequency: on the left, the high-frequency mutations (codons 248, 245, 175, 273...); on the right, the rare mutation points. The red bars represent the effect of each mutation on the Genomics/Proteomics coupling: on the left this ratio increases, whereas on the right it decreases. In blue, we plot the evolution of the mutability coefficient: high for frequent mutations, low or flat for the others. We now have good evidence of the predictive power of our Master Code based theoretical mutagenesis prediction method, demonstrated on the best-known example of mutability: P53, the "King" of cancer genes (Figure 5).
Another powerful representation of the "Mutability Global Space" is the following.
In this kind of figure, we smooth the basic Master Code score outputs. We then compare the patterns related to high mutability (green) with those related to low mutability (blue). Superimposing the two graphs yields "patterned ISLANDS". Green islands indicate a globally highly mutagenic region. Conversely, regions where blue predominates correspond to globally optimal (and therefore conserved) regions. In this graph, we note that the four hotspots are located in or close to regions of high global mutability. We also see, around codon 80, an optimal region which should normally be highly conserved (Figure 6).
The theoretical method for predicting the mutagenic regions of TP53 [9-13] is shown to be perfectly correlated with the mutagenic hotspots and codons reported from thousands of observed individual cancer cases (IARC database). The method is likely to be universal, insofar as it can be applied to predict the mutagenic regions of any other gene or protein, whether human, animal or plant.
We especially thank Dr. Robert Friedman (M.D., who practiced nutritional and preventive medicine in Santa Fe, New Mexico). We also thank the mathematician Pr. Diego Lucio Rapoport (Buenos Aires), Marco Francisco Paya Torres (M.D., Alicante), the French biologist Pr. Francois Gros (Pasteur Institute, co-discoverer of messenger RNA with James Watson and Walter Gilbert), Professor Sergey V. Petoukhov (Dr. Phys.-Math. Sci, Grand Ph.D., Full Professor, Laureate of the State prize of the USSR), and Luc Montagnier, Nobel prizewinner in medicine, for their interest in my research on the biomathematical laws of genomes.