MDDM Schemas

The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities and the relationships between them. Such a data model is appropriate for on-line transaction processing. A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line data analysis.
The most popular data model for a data warehouse is a multidimensional model. Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema. Let’s look at each of these schema types.

Star schema: The most common modelling paradigm is the star schema, in which the data warehouse contains
  • A large central table (fact table) containing the bulk of the data, with no redundancy, and
  • A set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a star, with the dimension tables displayed in a radial pattern around the central fact table.

Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes. This constraint may introduce some redundancy. For example, “Ludhiana” and “Jalandhar” are both cities in Punjab. Entries for such cities in the location dimension table will repeat the province and country values, creating redundancy among the attributes.
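This redundancy can be sketched with a toy, denormalized location dimension (the row values and column names here are illustrative, not taken from any real schema):

```python
# Toy star-schema tables. Because each dimension is one flat table,
# the province and country values repeat for every Punjab city.
location_dim = [
    {"location_key": 1, "street": "Mall Road", "city": "Ludhiana",
     "province": "Punjab", "country": "India"},
    {"location_key": 2, "street": "GT Road", "city": "Jalandhar",
     "province": "Punjab", "country": "India"},
]

# A fact row references the dimension only through its surrogate key.
sales_fact = [{"time_key": 10, "item_key": 20, "location_key": 1,
               "units_sold": 5, "dollars_sold": 2500.0}]

# Resolving a fact row's location needs a single lookup (one join in SQL).
by_key = {row["location_key"]: row for row in location_dim}
loc = by_key[sales_fact[0]["location_key"]]
print(loc["city"], loc["province"])  # Ludhiana Punjab
```

The single join per dimension is what makes the star schema fast to browse, at the cost of the repeated province/country values shown above.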

Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancies. Such a table is easy to maintain and saves storage space. However, this saving of space is negligible in comparison to the typical magnitude of the fact table. Furthermore, the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to execute a query. Consequently, the system performance may be adversely impacted. Hence, although the snowflake schema reduces redundancy, it is not as popular as the star schema in data warehouse design.
A snowflake schema for sales is given in this figure. Here, the sales fact table is identical to that of the star schema. The main difference between the two schemas is in the definition of dimension tables. The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables.
For example, the item dimension table now contains the attributes item key, item name, brand, type, and supplier key, where supplier key is linked to the supplier dimension table, containing supplier key and supplier type information. Similarly, the single dimension table for location in the star schema can be normalized into two new tables: location and city. The city key in the new location table links to the city dimension.
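The location normalization described above can be sketched as follows (again with hypothetical keys and column names): the flat location dimension is split into a location table and a city table linked by a city key.

```python
# Snowflake variant: province/country are stored once per city,
# in a separate city table referenced by city_key.
city_dim = [
    {"city_key": 100, "city": "Ludhiana", "province": "Punjab", "country": "India"},
    {"city_key": 101, "city": "Jalandhar", "province": "Punjab", "country": "India"},
]
location_dim = [
    {"location_key": 1, "street": "Mall Road", "city_key": 100},
    {"location_key": 2, "street": "GT Road", "city_key": 101},
]

# Answering the same question now needs an extra join: location -> city.
cities = {c["city_key"]: c for c in city_dim}
loc = location_dim[0]
print(loc["street"], cities[loc["city_key"]]["province"])  # Mall Road Punjab
```

The extra lookup per query is exactly the additional join cost the text attributes to the snowflake schema.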


Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.
A fact constellation schema is shown in this figure. This schema specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star schema. The shipping table has five dimensions, or keys: item key, time key, shipper key, from location, and to location, and two measures: cost and units shipped. A fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimension tables for time, item, and location are shared between the sales and shipping fact tables.
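The sharing of dimension tables between the two fact tables can be sketched as below (keys and measures are illustrative placeholders):

```python
# Fact constellation sketch: two fact tables, one shared item dimension.
time_dim = {1: {"quarter": "Q1"}}
item_dim = {20: {"item_name": "computer"}}

sales_fact = [{"time_key": 1, "item_key": 20, "location_key": 5,
               "dollars_sold": 2500.0}]
shipping_fact = [{"time_key": 1, "item_key": 20, "shipper_key": 7,
                  "from_location": 5, "to_location": 9,
                  "cost": 40.0, "units_shipped": 5}]

# Both fact tables resolve item_key against the same dimension table,
# so the item is described once, not once per star.
name = item_dim[sales_fact[0]["item_key"]]["item_name"]
assert name == item_dim[shipping_fact[0]["item_key"]]["item_name"]
```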


In data warehousing, there is a distinction between a data warehouse and a data mart. A data warehouse collects information about subjects that span the entire organization, such as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide. For a data warehouse, the fact constellation schema is commonly used, since it can model multiple, interrelated subjects.
A data mart, on the other hand, is a departmental subset of the data warehouse that focuses on selected subjects, and thus its scope is department-wide. For data marts, the star or snowflake schema is commonly used, since both are geared toward modelling single subjects, although the star schema is more popular and efficient.

Measures: Their Categorization and Computation
Note that a multidimensional point in the data cube space can be defined by a set of dimension-value pairs, for example, (time = “Q1”, location = “Vancouver”, item = “computer”). A data cube measure is a numerical function that can be evaluated at a particular point in the data cube space.
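Evaluating a measure at a point amounts to selecting the fact rows whose dimension values match the point and then aggregating; a minimal sketch with made-up fact rows:

```python
# Hypothetical fact rows with dimension values stored inline for clarity.
facts = [
    {"time": "Q1", "location": "Vancouver", "item": "computer", "dollars_sold": 1000.0},
    {"time": "Q1", "location": "Vancouver", "item": "computer", "dollars_sold": 500.0},
    {"time": "Q2", "location": "Vancouver", "item": "computer", "dollars_sold": 700.0},
]
point = {"time": "Q1", "location": "Vancouver", "item": "computer"}

# The measure total dollars_sold, evaluated at this cube point.
total = sum(f["dollars_sold"] for f in facts
            if all(f[dim] == val for dim, val in point.items()))
print(total)  # 1500.0
```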
Measures can be organized into three categories: distributive, algebraic, and holistic.

Distributive: An aggregate function is distributive if it can be computed in a distributed manner. That is, suppose the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate values. If the result derived by applying the function to the n aggregate values is the same as that derived by applying the function to the entire data set (without partitioning), the function can be computed in a distributed manner. For example, count() can be computed for a data cube by first partitioning the cube into a set of subcubes, computing count() for each subcube, and then summing up the counts obtained. Hence, count() is a distributive aggregate function. For the same reason, sum(), min(), and max() are distributive aggregate functions. A measure is distributive if it is obtained by applying a distributive aggregate function.
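The partition-then-combine property is easy to check directly; a small sketch with an arbitrary data set:

```python
# Partition the data, aggregate each partition, then combine the
# partial results. For a distributive function the combined result
# equals the aggregate over the whole data set.
data = [3, 1, 4, 1, 5, 9, 2, 6]
partitions = [data[:3], data[3:6], data[6:]]

assert sum(len(p) for p in partitions) == len(data)        # count()
assert sum(sum(p) for p in partitions) == sum(data)        # sum()
assert min(min(p) for p in partitions) == min(data)        # min()
assert max(max(p) for p in partitions) == max(data)        # max()
```

Note that count() combines with sum over the partial counts, while sum(), min(), and max() combine with the same function applied to the partial results.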

Algebraic: An aggregate function is algebraic if it can be computed by an algebraic function with M arguments (where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate function. For example, avg() (average) can be computed as sum()/count(), where both sum() and count() are distributive aggregate functions. Similarly, it can be shown that min_N() and max_N() (which find the N minimum and N maximum values, respectively, in a given set) and standard_deviation() are algebraic aggregate functions. A measure is algebraic if it is obtained by applying an algebraic aggregate function.
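The avg() case can be sketched concretely: each partition reports the bounded pair (sum, count), and the exact global average is reconstructed from those M = 2 distributive arguments.

```python
data = [3, 1, 4, 1, 5, 9, 2, 6]
partitions = [data[:3], data[3:6], data[6:]]

# Each partition contributes a fixed-size summary: (sum, count).
pairs = [(sum(p), len(p)) for p in partitions]

# Combine the distributive pieces, then apply the algebraic function.
total = sum(s for s, _ in pairs)
n = sum(c for _, c in pairs)
assert total / n == sum(data) / len(data)  # exact global average
```

The key point is that the per-partition summary has constant size regardless of how large the partition is, which is exactly what the holistic category below lacks.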

Holistic: An aggregate function is holistic if there is no constant bound on the storage size needed to describe a subaggregate. That is, there does not exist an algebraic function with M arguments (where M is a constant) that characterizes the computation. Common examples of holistic functions include median(), mode(), and rank(). A measure is holistic if it is obtained by applying a holistic aggregate function.
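A small counterexample shows why median() cannot be computed from fixed-size partition summaries: combining per-partition medians can give the wrong answer (the partition split below is chosen deliberately to make this visible).

```python
import statistics

data = [1, 2, 3, 4, 5, 100]
partitions = [[1, 2, 100], [3, 4, 5]]

# Naive combine: take the median of the per-partition medians.
partial_medians = [statistics.median(p) for p in partitions]  # [2, 4]
naive = statistics.median(partial_medians)                    # 3.0
exact = statistics.median(data)                               # 3.5

assert naive != exact  # no fixed-size summary per partition suffices
```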
