Data Quality Assessment

Data quality assessment is a burgeoning field within data mining for applications in system identification [3]. As the amount of data stored in industrial data historians grows exponentially, its usefulness becomes increasingly difficult to determine manually. Manually parsing large data sets consisting of fifty or more variables sampled every minute over one or two years is practically impossible. Furthermore, it may be useful to automatically extract valuable data regions from a given data set for online process modelling, for example, in just-in-time modelling. Data quality assessment finds applications in many different fields, including soft-sensor development [1, 6], process optimisation [1], control [5], and process monitoring. Figure 1 shows the general data quality assessment procedure.

Figure 1: Data Quality Assessment Framework

The data quality assessment procedure can be summarised as follows [3, 7, 6] (a simplified code sketch is given after the list):

  1. Preprocessing: In this step, the data set is loaded and analysed. Often, the data set will be mean centred and scaled. As well, known thresholds and limits will be considered at this point.
  2. Mode changes: It may be known how the operating modes of the process change; for example, the controller modes could change based on the relationship between the setpoint, r_t, and the input, u_t. As well, the specific mode may determine how much data is required to obtain good parameter estimates.
  3. Partitioning: For each identified mode, perform the following steps:
    1. Initialisation: If the length of the unanalysed data for the given mode is greater than the minimum required length, r, mark the start of a candidate region at the current data point, that is, k_init = k, and then set k = k + r. Otherwise, go to the next identified mode.
    2. Preprocessing: For certain types of processes, it may be necessary to perform additional manipulations; for example, for an integrating process, the input must be integrated.
    3. Computation: Compute the required values. In most cases, this will include the variances of the signals and the condition number of the information matrix.
      4. Comparison: Compare the variances, the condition number of the information matrix, and the significance of the parameters against their thresholds.
      1. Failure: If any of the thresholds fails to be met, go to the next data point, that is, k = k + 1, and return to Step 3.1.
      2. Success: Otherwise, set k = k + 1 and return to Step 3.2. The “good” data region is then [k_init, k].
      3. Termination: The procedure stops once k equals N, the total number of data points in the given mode. Repeat Step 3 for any remaining modes.
  4. Simplification: To improve performance and obtain longer data sets, adjacent regions can be compared to determine whether they could be considered to come from a single model.
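To make the partitioning loop concrete, the following Python sketch implements a simplified version of Steps 3.1 to 3.4 for a single operating mode. The ARX-style regressor, the default threshold values, and all function names are illustrative assumptions rather than the exact formulation of [3] or [7], and the significance test on the parameter estimates from Step 3.4 is omitted for brevity.

```python
import numpy as np

def information_matrix_cond(y, u, n_lags=2):
    """Condition number of the information matrix Phi'Phi built from an
    ARX-style regressor of lagged outputs and inputs (this regressor
    structure and n_lags are illustrative assumptions)."""
    rows = [np.r_[y[t - n_lags:t], u[t - n_lags:t]]
            for t in range(n_lags, len(y))]
    phi = np.asarray(rows)
    return np.linalg.cond(phi.T @ phi)

def region_is_good(y, u, var_min, cond_max):
    """Steps 3.3 and 3.4: compare the signal variances and the condition
    number of the information matrix against their thresholds."""
    return (np.var(y) >= var_min and np.var(u) >= var_min
            and information_matrix_cond(y, u) <= cond_max)

def find_good_regions(y, u, r_min=50, var_min=1e-4, cond_max=1e6):
    """Simplified version of Steps 3.1 to 3.4 for one operating mode.
    Returns a list of (k_init, k) index pairs marking "good" regions."""
    N, k, regions = len(y), 0, []
    while N - k >= r_min:                    # Step 3.1: enough unanalysed data?
        k_init, k = k, k + r_min             # initialise a candidate region
        grown = False
        while k < N and region_is_good(y[k_init:k], u[k_init:k],
                                       var_min, cond_max):
            grown = True                     # Step 3.4.2: success, extend
            k += 1
        if grown:
            regions.append((k_init, k - 1))  # record the "good" region
        k += 1                               # Step 3.4.1: failure, move on
    return regions
```

Step 4 would then correspond to a further pass over the returned regions that merges adjacent entries whose models are statistically indistinguishable.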
It should be noted that, in practice, the required variances are computed using a recursive method that involves tuning additional forgetting factors. Setting these tunable parameters and the various thresholds can be difficult, since the signal properties can vary greatly, making it hard to define general conditions that cover most, if not all, cases.
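As an illustration of such a recursive computation, the following sketch tracks an exponentially weighted variance with a single forgetting factor; the update form and the default value of the factor are common choices assumed here, not those of the cited papers.

```python
def recursive_variance(x, lam=0.99):
    """Exponentially weighted recursive estimate of the variance of x.

    lam is a forgetting factor in (0, 1]: the closer it is to 1, the more
    slowly old data is forgotten. Returns the variance estimate at each
    sample, which can then be compared against the variance thresholds."""
    mean, var, history = x[0], 0.0, []
    for xk in x:
        mean = lam * mean + (1.0 - lam) * xk              # recursive mean
        var = lam * var + (1.0 - lam) * (xk - mean) ** 2  # recursive variance
        history.append(var)
    return history
```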

Resources

The following digital resources are available for data quality assessment:

Comments and suggestions will always be entertained. Please e-mail Yuri A.W. Shardt directly.

References

  1. Yuri A.W. Shardt (2012). Data Quality Assessment for Closed-Loop System Identification and Forecasting with Application to Soft Sensors, Doctoral Thesis, University of Alberta. Online access.
  2. Kevin Brooks, Derik le Roux, Yuri A.W. Shardt, and Chris Steyn (2021). “Comparison of Semirigorous and Empirical Models Using Data Quality Assessment Methods,” Minerals, 11 (9), 954. doi: 10.3390/min11090954.
  3. Yuri A.W. Shardt, Xu Yang, Kevin Brooks, and Andrei Torgashov (2020). “Data Quality Assessment for System Identification in the Age of Big Data and Industry 4.0,” Proceedings of the 2020 IFAC World Congress, July 12-17, Berlin, Germany. doi: 10.1016/j.ifacol.2020.12.103.
  4. Yuri A.W. Shardt, Xu Yang, and Steven X. Ding (2016). “Quantisation and data quality: Implications for system identification,” Journal of Process Control, 40, pp. 13-23. doi: 10.1016/j.jprocont.2016.01.007.
  5. Yuri A.W. Shardt and Biao Huang (2013). “Data Quality Assessment of Routine Operating Data for Process Identification,” Computers & Chemical Engineering, 55 (8), pp. 19-27. doi: 10.1016/j.compchemeng.2013.03.029.
  6. Daniel Peretzki (2010). Data Mining for Process Identification, Diploma Thesis, Kassel, Germany.
  7. Daniel Peretzki, Alf J. Isaksson, André Carvalho Bittencourt, and Krister Forsman (2011). “Data Mining of Historic Data for Process Identification,” in Proceedings of the 2011 AIChE Annual Meeting, Minneapolis, Minnesota, United States of America.