In Silico Modelling of 2D and 3D Molecular Descriptors for Prediction of Anticancer Activities of Luteolin and Daidzin from the Plants Perilla ocymoides L. and Glycine max L.



Introduction
Natural products from plants are of interest in the search for new anticancer drugs because they can act directly on the HeLa cell line while reducing side effects. Recently, we isolated several flavonoids from Perilla ocymoides L. and Glycine max L. [1], and in vitro tests indicated relatively strong activity against HeLa cancer cells [2]. Flavonoids are polyphenolic compounds found in most plants [3][4][5]. The flavonoids from Perilla ocymoides L. and Glycine max L. were also tested for biological activity in several other cancer cell lines. The activities of flavonoids and the role of flavonoid-containing foods in cancer inhibition have been widely studied [6][7][8]. In recent years, computational methods have been applied widely to study chemical properties and to design new drugs, and in silico drug design has become an important tool. In silico studies of quantitative structure-activity relationships (QSAR) of natural products are of interest to drug researchers and pharmaceutical manufacturers. In Viet Nam, a number of studies by scientists from universities and institutions have also been published in Vietnamese journals [9][10][11]. Previous studies of 3-aminoflavonoids focused on semi-empirical calculations [11]. Those studies showed an effective route to computer-assisted drug design. An in silico model can be used to predict the biological activity of new drugs from atomic charges and molecular descriptors. This approach also allows the active centre of a molecule to be identified.
Flavones and isoflavones are known to have important activity against cervical cancer cells [12][13][14]. This flavonoid group is also of current interest in several research directions, such as synthesis, metabolism and the isolation of natural products from plants [1]. An in silico model of the quantitative relationship between the structure of flavones and isoflavones and their anticancer activity is therefore important for a valid search for new flavonoid derivatives. In this paper, we report the use of semi-empirical quantum calculations and the construction of quantitative structure-activity relationship (QSAR) models using 32 flavone and isoflavone derivatives [15]. The geometries of the flavones and isoflavones are optimized by molecular mechanics (MM+). The 2D and 3D molecular descriptors calculated from the optimized geometries are used to establish multivariate models: multiple linear regression (MLR), principal component regression (PCR) and artificial neural network (ANN). The anticancer activities GI50 (µM) predicted by the in silico models for the flavones and isoflavones in the test group, and for the two new flavonoids luteolin and daidzin from Perilla ocymoides L. and Glycine max L. [1], are compared with experimental data.

a. Materials: To ensure the accuracy of the QSAR models, the dataset used for building and validating them consists of 32 compounds whose anticancer activities GI50 (µM) against the HeLa cell line (GI50 is the concentration that gives 50% of the maximal inhibition of cell proliferation) were reported by Wang et al. [2], as shown in Figure 1. The value log GI50 is the dependent variable that defines the biological parameter of the QSAR model.
Quantitative structure-activity relationship (QSAR) studies have often been used to find correlations between biological activities and the 2D and 3D molecular descriptors of compounds. We used the flavones and isoflavones reported by Wang et al. to calculate 2D and 3D molecular descriptors. The molecular descriptors are calculated with the QSARIS program [16]. The multiple linear regression (QSAR MLR) and principal component regression (QSAR PCR) models are constructed with XLSTAT 2014 [17]. Artificial neural networks are artificial intelligence systems that use a large number of interconnected data-processing neurons to emulate the function of the brain. The artificial neural network model (QSAR PCA-ANN) is constructed with the program Visual Gene Developer 1.7 [18].
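The logarithmic dependent variable described above can be illustrated with a short sketch. It assumes the common convention pGI50 = -log10(GI50 in mol/L); the text does not state the exact definition used, and the function name and the sample values below are hypothetical, not taken from the dataset of Wang et al.

```python
import math


def pgi50_from_micromolar(gi50_um: float) -> float:
    """Convert a GI50 given in micromolar to pGI50.

    Assumption: pGI50 = -log10(GI50 expressed in mol/L).
    """
    if gi50_um <= 0:
        raise ValueError("GI50 must be positive")
    return -math.log10(gi50_um * 1e-6)


# Hypothetical GI50 values in micromolar, for illustration only.
for gi50 in (1.0, 10.0, 100.0):
    print(f"GI50 = {gi50:6.1f} uM -> pGI50 = {pgi50_from_micromolar(gi50):.3f}")
```

Under this convention a lower GI50 (a more potent compound) gives a higher pGI50, which is why the transformed value is convenient as a regression target.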

b. Constructing QSAR models: Linear regression is without doubt the most frequently used statistical method. Multiple regression (several explanatory variables) and simple linear regression are identical in overall concept as well as in calculation technique. The principle of linear regression is to model a quantitative dependent variable Y through a linear combination of k quantitative explanatory variables x1, x2, …, xk. For N observations, the predicted value of the dependent variable Y is given by [17,19]:
$$\hat{Y} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j$$
where Y is the experimental activity pGI50,exp and x_j is the j-th molecular descriptor.
The values R²train and R²pred are calculated by
$$R^2 = 1 - \frac{\sum_{i=1}^{N} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}$$
Here Y_i and Ŷ_i are the experimental pGI50,exp and predicted pGI50,pred values, and Ȳ is the mean of the experimental values.
The predicted results derived from the QSAR models are validated and compared with experimental data on the basis of the relative errors (ARE, %) [3,7]:
$$ARE,\% = \frac{100}{N} \sum_{i=1}^{N} \left| \frac{Y_i - \hat{Y}_i}{Y_i} \right|$$
where N is the number of activity values.
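The regression and validation statistics described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the XLSTAT procedure used in this work: the function names are hypothetical, the data are synthetic, R²pred is computed here by leave-one-out refitting with the mean of all experimental values in the denominator (one common convention), and ARE is assumed to be the average absolute relative error in percent.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least squares with intercept; returns [b0, b1, ..., bk]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    """Predicted activity for one row (1-D) or many rows (2-D) of descriptors."""
    return coef[0] + np.asarray(X) @ coef[1:]


def r_squared(y, y_hat):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of y."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)


def loo_r2_pred(X, y):
    """Leave-one-out R2: each case is predicted by a model fitted without it."""
    n = len(y)
    y_loo = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        y_loo[i] = predict_mlr(fit_mlr(X[keep], y[keep]), X[i])
    return r_squared(y, y_loo)


def are_percent(y, y_hat):
    """Average absolute relative error in percent (assumed definition)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 100.0 * np.mean(np.abs((y - y_hat) / y))


# Synthetic descriptors and activities, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(24, 3))
y_demo = 5.0 + X_demo @ np.array([0.6, -0.4, 0.2]) + rng.normal(scale=0.1, size=24)
coef_demo = fit_mlr(X_demo, y_demo)
print("R2_train:", r_squared(y_demo, predict_mlr(coef_demo, X_demo)))
print("R2_pred (LOO):", loo_r2_pred(X_demo, y_demo))
print("ARE, %:", are_percent(y_demo, predict_mlr(coef_demo, X_demo)))
```

R²pred is normally lower than R²train because each prediction comes from a model that has not seen the predicted case.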
Principal component regression (PCR) [17,20] is a regression technique that uses principal component analysis (PCA) when estimating the regression coefficients. PCA is a technique for finding structure in datasets. Its aim is to group correlated variables, replacing the original descriptors with a new set called principal components (PCs). These PCs are uncorrelated and are formed as simple linear combinations of the original variables. PCA moves the data onto a new set of axes, chosen in decreasing order of variance, so that the first few axes capture most of the variation within the data. The first PC (PC1) lies in the direction of maximum variance of the whole dataset. The second PC (PC2) is the direction of maximum variance in the subspace orthogonal to PC1. Subsequent components are taken orthogonal to those already chosen and describe the largest share of the remaining variance. Projecting the data onto the new axes reveals its major underlying structure. The value of each point when projected onto a given axis is called its PC value. The aim of PCR is to compute the values of a response variable from the chosen PCs of the independent variables [16,17,20].
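The PCR procedure described above can be sketched as follows. This is an illustrative NumPy implementation under stated assumptions, not the XLSTAT routine used in this work: descriptors are autoscaled (mean 0, unit variance) before PCA, the components are obtained by singular value decomposition, and the function names and data are hypothetical.

```python
import numpy as np


def fit_pcr(X, y, n_components):
    """Autoscale X, project onto the leading PCs, then regress y on the scores."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mean) / std
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components].T
    scores = Z @ loadings  # PC values of each compound
    A = np.column_stack([np.ones(len(scores)), scores])
    gamma, *_ = np.linalg.lstsq(A, y, rcond=None)
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return {"mean": mean, "std": std, "loadings": loadings,
            "gamma": gamma, "explained": explained}


def predict_pcr(model, X):
    """Apply the stored scaling and loadings, then the score regression."""
    Z = (np.asarray(X, float) - model["mean"]) / model["std"]
    scores = Z @ model["loadings"]
    return model["gamma"][0] + scores @ model["gamma"][1:]


# Synthetic example with 6 descriptors, for illustration only.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(20, 6))
y_demo = 5.0 + 0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 2] + rng.normal(scale=0.1, size=20)
pcr_demo = fit_pcr(X_demo, y_demo, n_components=6)
print("explained variance ratios:", np.round(pcr_demo["explained"], 3))
print("first prediction:", predict_pcr(pcr_demo, X_demo[:1]))
```

When all components are retained, as in this demo, PCR gives the same fitted values as MLR on the same descriptors; the benefit of PCR appears when correlated descriptors are compressed into fewer components.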

a. Calculation of molecular descriptors: The program HyperChem 8.05 [21] was used to build the flavonoid molecules. The molecular structures were optimized by MM+ molecular mechanics. The molecular descriptors were calculated from the optimized geometries with QSARIS [16]. These descriptors were used to construct the multiple linear regression (QSAR MLR), principal component regression (QSAR PCR) and artificial neural network (QSAR PCA-ANN) models [4,5].

b. Development of QSAR MLR model: Before building the QSAR MLR model, the activity values GI50 (µM) are transformed into pGI50 values to improve their statistical properties. The pGI50 values are the most appropriate for modelling the relationships between molecular descriptors and activities. The QSAR MLR models were established from the relationship between the geometric predictors and the biological activities pGI50 [16]. The original data were divided into two groups: (i) cases selected randomly for the training set, and (ii) the remaining cases, used to validate predictability (test set). There are several methods for selecting the training set; the simplest is random selection. The predictive accuracy of a QSAR model is evaluated by comparing the predicted and observed activities of the substances in the test set, which were not used for training.
Table 1: Molecular structures and activities GI50 (µM) of flavones and isoflavones [2].

Columns: Substance; Name; Substitution site; Substituent R; GI50 (µM). Row groups: training set for establishing QSAR models; test set for validating QSAR models.
In recent years, theoretical and experimental researchers have paid increasing attention to finding the most efficient tools for selecting molecular descriptors in QSAR studies [3,4,10]. The changes in R²train, R²pred and SE (standard error) of the QSAR MLR models with 2D and 3D predictors are shown in Table 2. To obtain those QSAR MLR models, the 2D and 3D molecular descriptors were selected using forward and backward algorithms. The selection of 2D and 3D descriptors was based on the change in the statistical values R²train, SE and R²pred. The R²pred values of the QSAR MLR models were calculated by cross-validation with the leave-one-out (LOO) method. The 9 fitted models are shown in Table 2.
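The descriptor selection described above can be sketched as a forward search scored by leave-one-out R²pred. This is an assumption-laden illustration, not the selection routine of the software used in this work: only the forward direction is shown, the stopping rule (stop when R²pred no longer improves) is a choice made for the sketch, and the function names and synthetic data are hypothetical.

```python
import numpy as np


def loo_r2_pred(X, y):
    """Leave-one-out R2 of an OLS model with intercept."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        A = np.column_stack([np.ones(n - 1), X[keep]])
        coef, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        preds[i] = coef[0] + X[i] @ coef[1:]
    return 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)


def forward_select(X, y, max_k):
    """Greedily add the descriptor column that most improves LOO R2."""
    selected, history = [], []
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_k:
        scores = {j: loo_r2_pred(X[:, selected + [j]], y) for j in remaining}
        best = max(scores, key=scores.get)
        # Stop when the best candidate no longer improves R2_pred.
        if history and scores[best] <= history[-1][1]:
            break
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), scores[best]))
    return selected, history


# Synthetic data: only columns 1 and 4 carry signal (illustration only).
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(30, 6))
y_demo = 2.0 * X_demo[:, 1] - 1.5 * X_demo[:, 4] + rng.normal(scale=0.01, size=30)
chosen, trace = forward_select(X_demo, y_demo, max_k=4)
for columns, q2 in trace:
    print(f"k = {len(columns)}  columns = {columns}  R2_pred = {q2:.4f}")
```

Tracking R²pred rather than R²train at each step guards against adding descriptors that only improve the fit to the training cases.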

Organic and Medicinal Chemistry International Journal
The QSAR MLR models (with k from 2 to 10) are arranged according to the change in R²train, SE and R²pred, as given in Table 2. The values of R²train and R²pred for the QSAR MLR models with k from 5 to 7 are higher than for the others. In particular, the model with k = 6 gives the highest values, R²train = 0.854 and R²pred = 0.812. The three best models (k = 5, 6 and 7) are therefore chosen to determine the significant contributions of the 2D and 3D descriptors. The contribution percentages MPmxk, % and GMPmxk, % [3], together with the statistical parameters of the three models, are given in Table 3.
The contribution percentages MPmxk, % and GMPmxk, % [3,7,11] of the models with k = 5, 6 and 7 are calculated by formula ( ), where N is the total number of cases and m is the number of variables. The global average contribution percentage GMPmxk, % of each independent variable over the 3 models is determined by the corresponding formula, where n is the number of models. The GMPmxk, % values in Table 3 show the level of importance of the 2D and 3D molecular descriptors for the flavonoid compounds. For the QSAR MLR models in Table 3, the descriptors rank by GMPmxk, % as follows: MaxQp > ABSQ > ka2 > MaxNeg > LogP > ka3 > SdssC > SdO > Ovality > ABSQon. The descriptors MaxQp, ABSQ, ka2, MaxNeg and LogP can be considered the most important contributors for each molecule. These descriptors also reflect the important nature of the carbonyl group C4=O11 and the atom O1. These atoms carry free electron pairs that conjugate with the π bonds C2=C3 and C4=O11 to form a conjugated system. The carbonyl group C4=O11 exhibits the full reactivity of a carbonyl compound [2]. The charge descriptors ABSQ, MaxQp and MaxNeg therefore describe the total charges on the molecule quantitatively through their GMPmxk, % values, and this is consistent with the conclusions of experimental evaluation [16,23].
Furthermore, the atomic positions C6 and C3' are vacant and can be explored for attaching new functional groups [9,23,24]. These atoms appear to have an important impact on the biological activity GI50, so these sites are chosen for attaching new substituents to construct new flavonoids. Similarly, the atom C2' is a vacant position and can also be used to attach a new functional group. Substitution at those sites is expected to yield new compounds with higher activity than the template compound. In the same way, the new flavonoids isolated from the leaves of Perilla ocymoides L. and Glycine max L. are used as lead compounds to design new drugs, as discussed below.
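Because the contribution formulas are not reproduced in the text above, the following sketch shows one plausible way to compute such percentages and should be read as an assumption, not as the method of this work. It takes MP for a descriptor as the average over the N cases of the share |b_k x_k| / Σ|b_k x_k| expressed in percent, and GMP as the mean of MP over the n models, counting a descriptor absent from a model as zero. The function names, coefficients and descriptor values are all hypothetical.

```python
import numpy as np


def mp_percent(coefs, descriptors):
    """Assumed MP, %: mean share of |b_k * x_k| in each case's total, times 100.

    coefs: dict descriptor name -> regression coefficient of one model
    descriptors: dict descriptor name -> sequence of N descriptor values
    """
    names = list(coefs)
    terms = np.column_stack(
        [np.abs(coefs[name] * np.asarray(descriptors[name], float)) for name in names]
    )
    shares = terms / terms.sum(axis=1, keepdims=True)
    return {name: float(v) for name, v in zip(names, 100.0 * shares.mean(axis=0))}


def gmp_percent(models, descriptors):
    """Assumed GMP, %: mean of MP over all models (absent descriptor counts as 0)."""
    per_model = [mp_percent(coefs, descriptors) for coefs in models]
    names = sorted({name for mp in per_model for name in mp})
    return {name: sum(mp.get(name, 0.0) for mp in per_model) / len(per_model)
            for name in names}


# Hypothetical coefficients and descriptor values for three cases.
descriptors_demo = {
    "MaxQp": [0.42, 0.45, 0.40],
    "LogP": [2.1, 2.8, 1.9],
    "Ovality": [1.45, 1.50, 1.48],
}
models_demo = [
    {"MaxQp": 3.0, "LogP": -0.4},
    {"MaxQp": 2.5, "LogP": -0.3, "Ovality": 0.8},
]
for name, value in gmp_percent(models_demo, descriptors_demo).items():
    print(f"{name}: {value:.1f} %")
```

With this definition the percentages of each model sum to 100, so descriptors can be ranked directly by their GMP values.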

c. Development of QSAR PCR model: The molecular descriptors were subjected to the principal component regression (PCR) technique to create the QSAR PCR model, with simulated annealing variable selection [17]. The best QSAR MLR model (k = 6) was selected to generate the QSAR PCR model [16,17]. Its 6 independent variables MaxQp, SdO, ka3, LogP, Ovality and SdssC were used for the principal component analysis. The QSAR PCR model is generated with 6 principal components corresponding to the original descriptors of the QSAR MLR model (k = 6), as given in equation (8). The number of principal components is extracted by principal component analysis, and the correlation between the pGI50 and pGI50,pred values is shown in Figure 2.
d. Building QSAR PCA-ANN model: The QSAR PCA-ANN model is built by the neuro-fuzzy technique with genetic algorithms using the program Visual Gene Developer v1.7 [18]. The artificial neural network has the architecture I(6)-HL(9)-O(1); it consists of an input layer I(6) with 6 neurons, which are the 6 principal components in equation (8)

e. Isolation of luteolin and daidzin from plants
i. Chemicals and equipment: In this work, we used chemicals and equipment to isolate and purify the two flavonoids luteolin and daidzin before determining their structures by 1H-NMR and 13C-NMR spectra [25]. The following materials were used to isolate the flavonoids:
ii. Silica gel with a particle size of 0.04 to 0.06 mm was used for normal-phase and RP-18 phase chromatography.
iii. Thin-layer chromatography was performed on DC-Alufolien F254 plates (Merck) for the normal phase and RP-18 F254s plates (Merck) for the reverse phase.
ix. Analytical balance AND HR 200.

f. Isolation process of luteolin and daidzin:
To isolate and purify luteolin and daidzin from the leaves of Perilla ocymoides L. and Glycine max L., we used thin-layer and column chromatography [25], as shown in Figures 3-4. After isolation, the structures of the compounds were identified using the following methods:

i. Melting points were measured on an Electrothermal IA 9000 series apparatus, using uncorrected capillaries.
ii. Column chromatography used silica gel for the normal phase, RP-18 for the reverse phase, and Sephadex, combined with thin-layer chromatography.
iii. Substances were detected under ultraviolet light at wavelengths of 254 nm and 365 nm, or with H2SO4/EtOH or FeCl3/EtOH as the reagent.
iv. Nuclear magnetic resonance spectrum (NMR) 1