ipyx.de

The MEDAS data auditing system

Background

In today's world, an organization's activities are increasingly controlled on the basis of data and information that has been thoroughly analyzed beforehand. Data and information are regarded as valuable commodities of great significance for important management decisions. To manage and store such information efficiently, the concept of the data warehouse has gained considerable importance in recent years. By definition, a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data. One of the main difficulties in building a data warehouse is integrating all relevant data from different and often heterogeneous data sources. This heterogeneity results predominantly from incompatible hardware platforms and software environments, distinct data models, schema conflicts and data conflicts.

The topic of data conflicts is closely related to the topic of data quality. Once data has been integrated into a data warehouse, deficiencies in its quality often become obvious. A detailed analysis of the information contained in a data warehouse, however, requires underlying data of high quality. One way to improve data quality is to apply data cleansing techniques to the data in the warehouse. The main objective of such techniques is to scrutinize the data thoroughly for errors and inconsistencies and, in a second step, to remove these deficiencies. One specific data cleansing technique is data auditing, which tries to identify regularities in large amounts of data; deviations from or exceptions to the identified regularities often point to correctable errors and interesting phenomena in the underlying data.

Project information

The objective of the MEDAS ("Metadata-based Data Auditing System") software tool is to improve the quality of data stored in a data warehouse. The system implements a data auditing procedure that automatically searches for regularities in the stored data and, on the basis of these regularities, identifies erroneous or inconsistent information and predicts missing values. By updating all correctable information, the quality of the stored data can ultimately be improved.
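The idea of the procedure described above can be illustrated with a minimal Python sketch. It is not the algorithm used in MEDAS (which was written in C++ on top of MLC++); the function names `learn_rules` and `audit`, the thresholds and the sample records are all assumptions made up for this illustration. The sketch learns simple rules of the form "cause value implies effect value" that hold for most records, then reports records that deviate from a rule and suggests values for records where the effect attribute is missing.

```python
from collections import Counter, defaultdict


def learn_rules(records, cause, effect, min_confidence=0.75, min_support=3):
    """Find regularities 'cause=value -> effect=value' that hold in most records.

    Hypothetical helper for illustration; thresholds are arbitrary assumptions.
    """
    counts = defaultdict(Counter)
    for row in records:
        # Only complete records contribute to the statistics.
        if row.get(cause) is not None and row.get(effect) is not None:
            counts[row[cause]][row[effect]] += 1

    rules = {}
    for cause_value, effects in counts.items():
        effect_value, hits = effects.most_common(1)[0]
        total = sum(effects.values())
        # Keep a rule only if it is backed by enough records and holds often enough.
        if total >= min_support and hits / total >= min_confidence:
            rules[cause_value] = effect_value
    return rules


def audit(records, cause, effect, rules):
    """Return (deviations, suggestions) for the given rules."""
    deviations, suggestions = [], []
    for index, row in enumerate(records):
        expected = rules.get(row.get(cause))
        if expected is None:
            continue  # no regularity known for this record
        if row.get(effect) is None:
            suggestions.append((index, expected))  # predicted missing value
        elif row[effect] != expected:
            deviations.append((index, row[effect], expected))  # possible error
    return deviations, suggestions


# Made-up sample data: one misspelled city and one missing city.
records = [
    {"zip": "26121", "city": "Oldenburg"},
    {"zip": "26121", "city": "Oldenburg"},
    {"zip": "26121", "city": "Oldenburg"},
    {"zip": "26121", "city": "Oldenburg"},
    {"zip": "26121", "city": "Oldenbrug"},
    {"zip": "26121", "city": None},
    {"zip": "28195", "city": "Bremen"},
    {"zip": "28195", "city": "Bremen"},
]

rules = learn_rules(records, "zip", "city")
deviations, suggestions = audit(records, "zip", "city", rules)
print(rules)
print(deviations)
print(suggestions)
```

In this sketch a deviation is only a candidate error: whether it is corrected automatically or shown to a person for review is a separate decision that the sketch leaves open.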

The data auditing procedure implemented in MEDAS is based on the process of knowledge discovery in databases (KDD), which has become increasingly popular in recent years. KDD is defined as the non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data. The central step of the complex KDD process is called data mining. Data mining is defined as the automatic process of discovering significant and potentially useful patterns in large or complex volumes of data.

Aside from being based on knowledge discovery in databases and data mining, a very important feature of MEDAS is that it is implemented in a domain-independent way. As a result, the software is applicable not just to one domain, e.g. economic data, but to any domain. This has been achieved by modelling domain-specific knowledge as metadata and by integrating this metadata into the functionality of the software tool. Moreover, MEDAS uses metadata to automate the data analysis processes.
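The following Python sketch illustrates, under stated assumptions, how metadata can keep auditing code independent of a domain. It does not reproduce how MEDAS stores or uses its metadata; the dictionary layout, the attribute names and the function `check_record` are hypothetical. The checking code knows nothing about the domain, and everything domain-specific lives in the metadata passed to it.

```python
# Hypothetical metadata for two unrelated domains (made up for illustration).
PATIENT_METADATA = {
    "age": {"kind": "numeric", "min": 0, "max": 120},
    "gender": {"kind": "nominal", "values": {"f", "m"}},
}

INVOICE_METADATA = {
    "amount": {"kind": "numeric", "min": 0, "max": 1_000_000},
    "currency": {"kind": "nominal", "values": {"EUR", "USD"}},
}


def check_record(record, metadata):
    """Check one record against attribute descriptions taken from metadata.

    The function contains no domain knowledge; it only interprets metadata.
    """
    problems = []
    for name, meta in metadata.items():
        value = record.get(name)
        if value is None:
            problems.append((name, "missing"))
        elif meta["kind"] == "numeric":
            if not meta["min"] <= value <= meta["max"]:
                problems.append((name, "out of range"))
        elif meta["kind"] == "nominal":
            if value not in meta["values"]:
                problems.append((name, "unknown value"))
    return problems


# The same function audits records from both domains.
patient_problems = check_record({"age": 130, "gender": "m"}, PATIENT_METADATA)
invoice_problems = check_record({"amount": 250, "currency": "DM"}, INVOICE_METADATA)
print(patient_problems)
print(invoice_problems)
```

Supporting a further domain in this sketch means writing another metadata description, not changing `check_record`.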

From an organizational point of view, MEDAS is the result of my master's thesis (August 1999 to March 2000), with which I completed the computer science part of my university studies. MEDAS is part of a broader Ph.D. dissertation on data quality management, which was being written at that time by the advisor who supervised my thesis. I was solely responsible for the design and development of MEDAS as well as for the concepts contained in the software. The software was implemented on a Windows operating system using Microsoft's Visual C++ development environment. During the implementation, the tools Microsoft Repository and ILOG Rules were used, and the publicly available data mining class library MLC++ was integrated into the project.

Related websites

Homepage of the University of Oldenburg.
http://www.uni-oldenburg.de/

Website regarding knowledge discovery in databases (KDD) and data mining.
http://www.kdnuggets.com/

Homepage of the MLC++ class library, which was used in the MEDAS project.
http://www.sgi.com/tech/mlc/
