Real-world operational databases contain dirty data. Such data is highly
likely to be noisy, missing, redundant and inconsistent, because these
databases are typically huge (often several gigabytes or more) and often
draw on multiple, heterogeneous sources. Low-quality data leads to a
low-quality data warehouse, so it is very important to pre-process the data
before it is loaded into the data warehouse.
Noisy data is meaningless or corrupt data. It includes any data that cannot
be understood and interpreted correctly by machines, such as unstructured text.
- Noisy data unnecessarily increases the amount of storage space required and can also adversely affect the results of any data analysis.
- Noisy data can be caused by hardware failures, programming errors and wrong input from speech or optical character recognition (OCR) programs. Spelling errors, industry abbreviations and slang can also impede machine reading.
Inconsistent data contains discrepancies in codes or names, such as
inconsistent naming patterns or wrongly derived values. Inconsistent data is
the result of wrong data entry or wrong code.
e.g., Age = “42”, Birthday = “03/07/1997”. The stored age does not agree with the birthday. It is better to store only Birthday and calculate Age whenever it is required.
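To illustrate the point about derived data, here is a minimal Python sketch that keeps only the birthday and computes the age on demand. The function name `age_on` is hypothetical, and reading "03/07/1997" as day/month/year is an assumption, since the date format is not stated.

```python
from datetime import date, datetime


def age_on(birthday: date, today: date) -> int:
    """Return the age in whole years on the given day."""
    years = today.year - birthday.year
    # Subtract one year if the birthday has not occurred yet this year.
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years


# Assumption: the stored string is in day/month/year order.
birthday = datetime.strptime("03/07/1997", "%d/%m/%Y").date()
print(age_on(birthday, date(2024, 1, 1)))  # 26
```

Because the age is never stored, it cannot drift out of step with the birthday.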
Why Pre-Process data?
Poor quality data directly means poor quality mining results!
- Quality decisions must be based on quality data; e.g., duplicate or missing data may cause incorrect or even misleading statistics.
- A data warehouse needs consistent integration of quality data.
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse.
- Most of the time and effort goes into these steps.
Data quality is measured along several dimensions:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Interpretability
- Accessibility
The major tasks in data pre-processing are:
- Data Cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.
- Data Integration: combine data from multiple databases, data cubes, or files.
- Data Transformation: normalization and aggregation.
- Data Reduction: reduce the volume while producing the same or similar analytical results.
- Data Discretization: part of data reduction; replace numerical attributes with nominal ones.
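Three of the tasks listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`fill_missing`, `min_max_normalize`, `discretize_age`), the sample ages, and the cut-off points are all made up for the example. Filling gaps with the mean and min-max scaling are just one common choice among several.

```python
from statistics import mean


def fill_missing(values):
    """Data cleaning: replace None with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize_age(age):
    """Data discretization: map a numeric age to a nominal label."""
    # The cut-off points are arbitrary and chosen only for illustration.
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


ages = [10, None, 40, 70]                      # hypothetical sample with a gap
cleaned = fill_missing(ages)                   # [10, 40, 40, 70]
scaled = min_max_normalize(cleaned)            # [0.0, 0.5, 0.5, 1.0]
labels = [discretize_age(a) for a in cleaned]  # minor, adult, adult, senior
print(cleaned, scaled, labels)
```

In practice the fill strategy and the bin boundaries depend on the attribute and on the analysis the warehouse is meant to support.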