Flour concentration prediction using GAPLS and GAWLS focused on data sampling issues and applicability domain

Matheus S. Escobar, Hiromasa Kaneko, Kimito Funatsu

Research output: Contribution to journal › Article

8 Citations (Scopus)

Abstract

In statistical analysis, several issues arise from inadequate data splitting, such as misinterpretation of the data and misconceptions about predictive quality. This work discusses the implications of poor training and test data splitting, focusing on key aspects of the applicability domain (AD). This matter is widely overlooked, despite its basic and, to a certain extent, straightforward nature. While it is true that poorly chosen training and test data result in a poor model, the main idea of this work is to discuss how such splitting should be done and how counterintuitive and misleading the approaches to data splitting taken by several researchers are. Relying on fluorescence and near-infrared (NIR) spectroscopy data sets, prediction of protein concentration for a particular group of flour samples is presented via six different data splitting scenarios. Multiple scenarios allow AD and model predictive power to be evaluated in different ways from a single data set. The regression models were constructed using three related regression methods: partial least squares (PLS), genetic algorithm-based partial least squares (GAPLS) and genetic algorithm-based wavelength selection (GAWLS). The merits and demerits of GAWLS and GAPLS contribute to the assessment of AD, in the same way that using two distinct data sets prevents the work from being biased by a single case study. NIR gives better results overall than fluorescence, since it has more information available for modeling, and the GA methods show better model performance than PLS. To evaluate the different data splits, the T² and Q indexes are used alongside prediction errors to assess model performance and determine data reliability. T² and Q values can determine the similarity between training and test data, indicating how predictive a model will be. Standard deviation helps to identify how reliable a sample is for modeling within a given data set. When it comes to model assessment and anomaly detection, the standard deviation of prediction errors gave the most consistent results, diagnosing which model had better prediction capabilities. In the end, for a given data set, arbitrary data splitting can be dangerous, since it can generate models that do not represent the entire nature of the data set.
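
As a rough illustration of how T² and Q can indicate whether a test sample resembles the training data, the sketch below fits a PLS model with a single latent variable and computes both indexes for new samples. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the number of latent variables, the preprocessing, or any control limits, and the function names (`fit_one_component_pls`, `t2_and_q`) and the toy data are hypothetical.

```python
import math
from statistics import mean, variance


def fit_one_component_pls(X, y):
    """Fit a PLS1 model with one latent variable on mean-centered data."""
    x_mean = [mean(col) for col in zip(*X)]
    y_mean = mean(y)
    Xc = [[v - m for v, m in zip(row, x_mean)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: X-space direction with maximal covariance with y.
    w = [sum(xv * yv for xv, yv in zip(col, yc)) for col in zip(*Xc)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Training scores, X-loadings and the regression coefficient on the score.
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    tt = sum(v * v for v in t)
    p = [sum(xv * tv for xv, tv in zip(col, t)) / tt for col in zip(*Xc)]
    b = sum(tv * yv for tv, yv in zip(t, yc)) / tt
    return {"x_mean": x_mean, "y_mean": y_mean, "w": w, "p": p, "b": b,
            "t_var": variance(t)}


def t2_and_q(model, x):
    """Return (T2, Q) for one sample x.

    T2 is the squared score scaled by the training score variance
    (distance inside the model); Q is the squared reconstruction
    residual (distance from the model).
    """
    xc = [v - m for v, m in zip(x, model["x_mean"])]
    score = sum(a * b for a, b in zip(xc, model["w"]))
    t2 = score ** 2 / model["t_var"]
    residual = [v - score * pv for v, pv in zip(xc, model["p"])]
    q_stat = sum(r * r for r in residual)
    return t2, q_stat


# Hypothetical toy data: three "wavelengths", five training samples.
X_train = [[1.0, 2.0, 3.0], [2.0, 4.1, 6.0], [3.0, 6.0, 9.1],
           [4.0, 8.0, 12.0], [5.0, 9.9, 15.0]]
y_train = [1.0, 2.0, 3.0, 4.0, 5.0]
model = fit_one_component_pls(X_train, y_train)

t2_in, q_in = t2_and_q(model, [3.0, 6.0, 9.0])        # resembles training data
t2_far, q_far = t2_and_q(model, [10.0, 20.0, 30.0])   # same pattern, extrapolated
t2_off, q_off = t2_and_q(model, [3.0, 9.0, 3.0])      # different spectral pattern
```

In this toy setting the extrapolated sample is flagged by a large T², while the sample with a different pattern is flagged by a large Q. In practice both values would be compared against limits derived from the training data, which this sketch leaves out.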

Original language: English
Pages (from-to): 33-46
Number of pages: 14
Journal: Chemometrics and Intelligent Laboratory Systems
Volume: 137
Publication status: Published - 15 Oct 2014

Keywords

  • Applicability domain
  • Data splitting
  • Fluorescence spectroscopy
  • NIR spectroscopy
  • Statistical modeling
