Data Pre-Processing Overview

Real-world operational databases contain dirty data. Because such databases are typically huge (often several gigabytes or more) and usually draw on multiple, heterogeneous sources, their data is highly likely to be noisy, missing, redundant, or inconsistent. Low-quality data leads to a low-quality data warehouse, so it is important to pre-process the data before loading it into the warehouse.

Noisy data is meaningless or corrupt data. It includes any data that machines cannot correctly read and interpret, such as unstructured text.
  • Noisy data unnecessarily increases the amount of storage space required and can also adversely affect the results of any data analysis.
  • Noisy data can be caused by hardware failures, programming errors, and wrong input from speech recognition or optical character recognition (OCR) programs. Spelling errors, industry abbreviations, and slang can also impede machine reading (a small detection sketch follows this list).
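As a minimal illustration of spotting this kind of noise (my own sketch, not from the original material; the column name and values are hypothetical), pandas can flag entries that fail to parse as numbers:

```python
import pandas as pd

# Hypothetical raw column polluted by OCR errors and typos.
raw = pd.DataFrame({"age": ["42", "3l", "27", "??", "35"]})

# Coerce to numeric; anything that cannot be parsed becomes NaN.
raw["age_num"] = pd.to_numeric(raw["age"], errors="coerce")

# Rows where coercion failed are the noisy entries to inspect or drop.
print(raw[raw["age_num"].isna()])   # flags "3l" and "??"
```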

Inconsistent data contains discrepancies in codes or names, follows inconsistent naming patterns, or holds incorrectly derived values. It is usually the result of erroneous data entry or faulty code.
e.g., Age=“42” together with Birthday=“03/07/1997” is inconsistent: the stored age does not match the birth date. It is better to store only Birthday and calculate Age every time it is required.
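A minimal sketch of that "derive, don't store" idea (my own illustration; the function name is hypothetical, and the ambiguous date 03/07/1997 is read here as day/month/year):

```python
from datetime import date

def age_from_birthday(birthday, today=None):
    """Derive age on demand instead of storing it next to the birthday."""
    today = today or date.today()
    # Subtract one if this year's birthday has not yet occurred.
    return today.year - birthday.year - (
        (today.month, today.day) < (birthday.month, birthday.day)
    )

# Evaluated on a fixed date for reproducibility.
print(age_from_birthday(date(1997, 7, 3), today=date(2015, 1, 1)))  # -> 17
```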

Why Pre-Process Data?
Poor-quality data directly means poor-quality mining results!
  • Quality decisions are always based on quality data; e.g., duplicate or missing data may cause incorrect or even misleading statistics.
  • A data warehouse needs consistent integration of quality data.
  • Data extraction, cleaning, and transformation make up the majority of the time and effort spent building a data warehouse.
Attributes of Good Quality Data
  • Accuracy
  • Completeness
  • Consistency
  • Timeliness
  • Believability
  • Interpretability
  • Accessibility
Steps in Data Pre-Processing
  • Data Cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies (several of these steps are sketched in code after this list).
  • Data Integration: combine data from multiple databases, data cubes, or files.
  • Data Transformation: normalize and aggregate the data.
  • Data Reduction: reduce the data volume while producing the same or similar analytical results.
  • Data Discretization: part of data reduction; replace numerical attributes with nominal (categorical) ones.
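To make these steps concrete, here is a brief sketch (my own illustration, assuming a pandas workflow and a hypothetical income column) that applies cleaning, transformation, and discretization in sequence:

```python
import pandas as pd

df = pd.DataFrame({"income": [35000, None, 52000, 61000, None, 48000]})

# Data Cleaning: fill in missing values (here, with the column mean).
df["income"] = df["income"].fillna(df["income"].mean())

# Data Transformation: min-max normalization to the [0, 1] range.
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)

# Data Discretization: replace the numeric attribute with a nominal one.
df["income_band"] = pd.cut(df["income_norm"], bins=3,
                           labels=["low", "medium", "high"])
print(df)
```

Data integration and reduction would sit around this pipeline: the heterogeneous sources are merged before cleaning, and the cleaned result can be sampled or aggregated afterwards.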

