Data Warehouse & Mining: Data Discretization & Concept Hierarchy Generation

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data. This leads to a concise, easy-to-use, knowledge-level representation of mining results.
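As a rough illustration of replacing values with interval labels, the sketch below uses equal-width binning, which is only one of several possible discretization methods and is not prescribed by the text above. The function name `equal_width_discretize` and the sample ages are made up for the example.

```python
def equal_width_discretize(values, n_bins):
    """Replace each value with the (low, high) interval that contains it.

    Assumption: intervals have equal width; other methods (equal-frequency,
    entropy-based, clustering) would choose the boundaries differently.
    """
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0
    edges = [lo + i * width for i in range(n_bins + 1)]
    labels = []
    for v in values:
        # The maximum value would land on the last edge, so clamp it
        # into the final interval.
        idx = min(int((v - lo) / width), n_bins - 1)
        labels.append((edges[idx], edges[idx + 1]))
    return labels


# Made-up sample data: ten ages reduced to three interval labels.
ages = [13, 15, 22, 25, 33, 35, 40, 46, 52, 70]
age_intervals = equal_width_discretize(ages, n_bins=3)
distinct_labels = sorted(set(age_intervals))
```

Ten distinct ages collapse into three interval labels here, which is the reduction the paragraph describes.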

A concept hierarchy for a given numerical attribute defines a discretization of the attribute.

· Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numerical values for the attribute age) with higher-level concepts (such as youth, middle-aged, or senior).

· Although detail is lost by such data generalization, the generalized data may be more meaningful and easier to interpret. This contributes to a consistent representation of data mining results among multiple mining tasks, which is a common requirement.

· In addition, mining on a reduced data set requires fewer input/output operations and is more efficient than mining on a larger, un-generalized data set.

Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining as a pre-processing step, rather than during mining.
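The age example above can be sketched as a small pre-processing step. The cut-off ages below are hypothetical, since the text names the concepts (youth, middle-aged, senior) but not their boundaries, and `generalize_age` is an illustrative helper rather than part of any library.

```python
from collections import Counter

# Hypothetical boundaries: [low, high) -> higher-level concept.
AGE_HIERARCHY = [
    (0, 30, "youth"),
    (30, 60, "middle-aged"),
    (60, 200, "senior"),
]


def generalize_age(age):
    """Map a raw age to its higher-level concept."""
    for low, high, concept in AGE_HIERARCHY:
        if low <= age < high:
            return concept
    raise ValueError(f"age out of range: {age}")


# Made-up sample data, generalized before any mining step runs.
ages = [13, 15, 22, 25, 33, 35, 40, 46, 52, 70]
generalized = [generalize_age(a) for a in ages]
concept_counts = Counter(generalized)
```

The individual ages are no longer recoverable from `generalized`, which is the loss of detail noted above, but a mining step now only has to handle three values.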

Data Cube Generation

A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.

In general terms, dimensions are the perspectives or entities with respect to which an organization wants to keep records. For example, a store may create a sales data warehouse in order to keep records of the store’s sales with respect to the dimensions time, item, branch, and location. These dimensions allow the store to keep track of things like monthly sales of items and the branches and locations at which the items were sold.

Each dimension may have a table associated with it, called a dimension table, which further describes the dimension.

A multidimensional data model is typically organized around a central theme, such as sales. This theme is represented by a fact table. Facts are numerical measures: the quantities by which we want to analyze relationships between dimensions.

The fact table contains the names of the facts, or measures, as well as keys to each of the related dimension tables.
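A minimal sketch of this layout, using plain Python dictionaries in place of database tables. All table names, keys, measures, and values here are invented for illustration; a real warehouse would hold these in relational tables.

```python
# Hypothetical dimension tables: key -> descriptive attributes.
item_dim = {
    1: {"name": "laptop", "type": "computer"},
    2: {"name": "phone", "type": "mobile"},
}
location_dim = {
    10: {"city": "Chicago"},
    20: {"city": "Toronto"},
}

# Hypothetical fact table: one key per dimension plus the numerical measures.
sales_fact = [
    {"time_key": "Q1", "item_key": 1, "location_key": 10, "units_sold": 5, "dollars_sold": 5000.0},
    {"time_key": "Q1", "item_key": 2, "location_key": 10, "units_sold": 8, "dollars_sold": 4000.0},
    {"time_key": "Q2", "item_key": 1, "location_key": 20, "units_sold": 3, "dollars_sold": 3000.0},
]


def total_by(fact_rows, key_name, measure):
    """Sum one measure, grouped by one dimension key."""
    totals = {}
    for row in fact_rows:
        totals[row[key_name]] = totals.get(row[key_name], 0) + row[measure]
    return totals


# Follow the keys into the dimension tables to get readable labels.
units_by_item = {
    item_dim[key]["name"]: units
    for key, units in total_by(sales_fact, "item_key", "units_sold").items()
}
dollars_by_city = {
    location_dim[key]["city"]: dollars
    for key, dollars in total_by(sales_fact, "location_key", "dollars_sold").items()
}
```

The fact rows stay compact because they carry only keys and measures; descriptive attributes such as the item name or city live once in the dimension tables.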

Although we usually think of cubes as 3-D geometric structures, in data warehousing the data cube is n-dimensional.
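To make the n-dimensional view concrete, the sketch below aggregates one measure over every subset of the four dimensions from the store example, giving 2^4 = 16 group-bys (often called cuboids). It assumes no hierarchy within any dimension; the function name `build_cube` and the sample rows are invented, and a production system would not normally materialize every combination this naively.

```python
from itertools import combinations


def build_cube(fact_rows, dimensions, measure):
    """Aggregate `measure` for every subset of `dimensions`.

    Returns a dict mapping each tuple of dimension names to a dict of
    {cell coordinates: summed measure}.
    """
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            totals = {}
            for row in fact_rows:
                cell = tuple(row[d] for d in dims)
                totals[cell] = totals.get(cell, 0) + row[measure]
            cube[dims] = totals
    return cube


# Made-up sales rows over the four dimensions named in the text.
rows = [
    {"time": "Q1", "item": "laptop", "branch": "B1", "location": "Chicago", "units_sold": 5},
    {"time": "Q1", "item": "phone", "branch": "B1", "location": "Chicago", "units_sold": 8},
    {"time": "Q2", "item": "laptop", "branch": "B2", "location": "Toronto", "units_sold": 3},
    {"time": "Q2", "item": "phone", "branch": "B1", "location": "Chicago", "units_sold": 4},
]
dimensions = ("time", "item", "branch", "location")
cube = build_cube(rows, dimensions, "units_sold")

grand_total = cube[()][()]        # zero dimensions: everything summed
units_by_time = cube[("time",)]   # one dimension
```

Each key of `cube` is one way of viewing the same facts, from the fully summarized total down to the four-dimensional base cells.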
