Machine Learning Models for Corn Yield Prediction: A Survey of Literature
Mohsen Shahhosseini and Guiping Hu*
Department of Industrial and Manufacturing Systems Engineering, Iowa State University, USA
Submission: July 27, 2020; Published: August 06, 2020
*Corresponding author: Guiping Hu, Department of Industrial and Manufacturing Systems Engineering, Iowa State University, Ames, Iowa, USA
How to cite this article: Mohsen S, Guiping H. Machine Learning Models for Corn Yield Prediction: A Survey of Literature. Int J Environ Sci Nat Res. 2020; 25(3): 556161. DOI: 10.19080/IJESNR.2020.25.556161
The ability to predict crop yields enables timely and effective decision making for crop management and regional agriculture system planning. Corn is the largest field crop in the U.S., and hence significant efforts have been devoted to predicting corn yields through various means. The present survey reviews studies that used machine learning models and their variations to predict corn yield.
Keywords: Agriculture system planning; Crop management; Environmental data; Deep neural networks; Spatial resolution
Agriculture and its related industries contribute significantly to the U.S. economy, providing 11% of total U.S. employment and $1.05 trillion of U.S. gross domestic product (GDP) in 2017. Crop yield prediction is of great importance because it can deliver insightful information for improving crop management and, in turn, the U.S. and global economies. In 2019, corn was the most-produced crop in the U.S., and with the increasing demand for corn throughout the country, predicting corn production is essential. The present survey summarizes multiple well-known studies on predicting corn yield using machine learning (ML) models. We first present the most common data preprocessing tasks performed in the literature, and then provide a brief summary of the developed ML models as well as numerical results.
The most common data preprocessing tasks in the corn yield prediction literature include handling the yearly increasing trend in corn yields, feature selection, imputing missing data, and reconciling the different spatial resolutions of environmental data sets (soil and weather).
Historical corn yields throughout the country demonstrate an increasing trend. This trend derives from improved genetics (cultivars), improved management, and other technological advances such as farming equipment. Generally, the yearly trend in corn yields is addressed with one of two approaches. The first adds the trend back into the developed model as a linear component [2-7]. The second uses recurrent neural network variations, which are inherently able to capture the time dependency in the response variable [8,9].
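One possible reading of the first approach can be sketched as follows: fit a linear trend of yield against year, model the detrended residual separately, and add the trend back at prediction time. This is a minimal illustration, not the procedure of any cited study; the function names, the yield values, and the zero residual prediction are all hypothetical.

```python
from statistics import mean

def fit_linear_trend(years, yields):
    """Ordinary least-squares fit of yield = intercept + slope * year."""
    year_mean, yield_mean = mean(years), mean(yields)
    sxx = sum((y - year_mean) ** 2 for y in years)
    sxy = sum((y - year_mean) * (v - yield_mean) for y, v in zip(years, yields))
    slope = sxy / sxx
    intercept = yield_mean - slope * year_mean
    return intercept, slope

def trend_value(year, intercept, slope):
    """Evaluate the fitted linear trend at a given year."""
    return intercept + slope * year

# Hypothetical yields (bushels per acre), for illustration only.
years = [2010, 2011, 2012, 2013, 2014]
yields = [150.0, 153.0, 151.0, 158.0, 160.0]

intercept, slope = fit_linear_trend(years, yields)

# Residuals after removing the trend; an ML model would be trained on these.
detrended = [v - trend_value(y, intercept, slope) for y, v in zip(years, yields)]

# Assumed stand-in for an ML model's prediction of the 2015 residual.
predicted_residual = 0.0

# The linear trend component is added back to obtain the final forecast.
forecast_2015 = trend_value(2015, intercept, slope) + predicted_residual
```

Keeping the trend as a separate linear term lets the ML model focus on year-to-year deviations driven by weather, soil, and management rather than on the long-run technological gain.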
Missing-data treatment strategies depend on the nature of the developed data sets. Some studies impute the missing data with statistical measures [9,10], whereas others use expert knowledge either to impute the missing data through data aggregation or to remove the affected records from the data set [4,7,11].
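As a minimal sketch of imputation with a statistical measure, the example below fills missing entries of one feature with the median of its observed values. The choice of the median, the function name, and the precipitation values are assumptions made for illustration; the cited studies may use other statistics.

```python
from statistics import median

def impute_with_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical weekly precipitation readings; None marks a missing value.
precipitation = [3.1, None, 2.7, 4.0, None, 3.5]
imputed = impute_with_median(precipitation)
```

The median is less sensitive to outliers than the mean, which can matter for skewed weather variables such as precipitation.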
A common issue when developing the initial data sets arises from ingesting data from different sources, each with a different spatial resolution. Hence, an important preprocessing task is spatial aggregation to reconcile the resolutions of the different data sets. The most common solution in the literature is to use a statistical average or median of the nearest neighbors' values to bring the data sets to a common spatial resolution [3-8, 12-16].
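The nearest-neighbor averaging idea can be sketched as below: the value assigned to a target location is the mean of the k closest grid points from a finer-resolution data set. This is an illustrative sketch under simplifying assumptions: coordinates are treated as planar (no great-circle distance), and the function name, k, and soil values are hypothetical.

```python
from math import hypot
from statistics import mean

def knn_average(target, points, k=3):
    """Average the values of the k points nearest to target.

    target is an (x, y) pair; points is a list of (x, y, value) tuples.
    Distances are Euclidean in planar coordinates (an assumption).
    """
    ranked = sorted(
        points,
        key=lambda p: hypot(p[0] - target[0], p[1] - target[1]),
    )
    return mean([value for _, _, value in ranked[:k]])

# Hypothetical soil measurements on a grid around a county centroid at (0, 0).
soil_points = [(0, 1, 10.0), (1, 0, 20.0), (-1, 0, 30.0), (5, 5, 100.0)]
county_value = knn_average((0, 0), soil_points, k=3)
```

Swapping `mean` for `statistics.median` gives the median variant mentioned above.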
Assuming a linear relationship between the independent and dependent variables, some studies built linear regression models to predict corn yield [6,16]. Other regression-based models in the literature include stepwise linear regression and the linear discriminant analysis (LDA) model.
The use of single tree models in the literature has been limited due to the superior performance of tree ensemble models. The most common tree-based model has been the M5 prime regression model, an extension of the regression tree model that allows linear regression functions at the nodes [18,19].
Tree ensemble models provide better prediction accuracy and are able to capture nonlinear patterns. Random forest and extreme gradient boosting (XGBoost) have been used more often than other tree ensemble models in the literature [3,20].
Like tree ensemble models, neural networks can handle nonlinear patterns while producing reasonably accurate predictions. Many recent studies use variations of neural network models, ranging from back-propagation neural networks (BPNN) to deep neural networks (DNN) [5,11-13,15], long short-term memory (LSTM), and convolutional neural network (CNN) (Khaki et al., 2020) models.
Some studies have attempted to combine several ML models into an ensemble that outperforms its individual base models. The base models can be as simple as regression trees or as complex as deep neural networks [4,7].
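One simple way to combine base models is a weighted average of their predictions, sketched below. This is only an illustration of the general idea and not necessarily the ensembling scheme used in the cited studies; the function name, the weights, and the base-model predictions are hypothetical.

```python
def weighted_ensemble(predictions, weights):
    """Combine base-model predictions with a normalized weighted average.

    predictions is a list with one list of predicted yields per base model;
    weights holds one non-negative weight per base model.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per base model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(predictions[0])
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_samples)
    ]

# Hypothetical predictions from three base models for two counties.
base_predictions = [
    [150.0, 160.0],  # e.g. a regression tree
    [154.0, 164.0],  # e.g. a random forest
    [158.0, 156.0],  # e.g. a neural network
]
# Assumed weights; in practice they might be tuned on validation data.
ensemble = weighted_ensemble(base_predictions, weights=[1.0, 1.0, 2.0])
```

Setting all weights equal reduces this to a plain average of the base models.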
We presented a summary of studies that use machine learning models to predict corn yields. We explained the most common preprocessing tasks that are performed to prepare the data for building machine learning models. The ML models developed throughout the literature were categorized into five general groups, and a summary of the studies that attempted to predict U.S. corn yields was presented. Reviewing the studies that used simulation crop models and remote sensors to predict corn yields can be considered a future research direction.