Data warehouse
From PHPBB用户手册
A data warehouse is a computer system designed for archiving and analyzing an organization's historical data, such as sales, salaries, or other information from day-to-day operations. Normally, an organization copies information from its operational systems (such as sales and human resources) to the data warehouse on a regular schedule, such as every night or every weekend; after that, management can perform complex queries and analysis (such as data mining) on the information without slowing down the operational systems.
Contents |
Definition
A data warehouse is the main repository of the organization's historical data, or its corporate memory. For example, an organization would use the information that's stored in its data warehouse to find out what day of the week they sold the most widgets in May 1992, or how employee sick leave the week before Christmas differed between California and Quebec from 2001-2005. In other words, the data warehouse contains the raw material for management's decision support system.
While operational systems are optimized for simplicity and speed of modification (online transaction processing, or OLTP) through heavy use of database normalization and an entity-relationship model, the data warehouse is optimized for reporting and analysis (online analytical processing, or OLAP). Frequently data in Data Warehouses is heavily denormalised, summarised and/or stored in a dimension-based model but this is not always required to achieve acceptable query response times.
More formally, Bill Inmon (one of the earliest and most influential practitioners) defined a data warehouse as follows:
- Subject-oriented, meaning that the data in the database is organized so that all the data elements relating to the same real-world event or object are linked together;
- Time-variant, meaning that the changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time;
- Non-volatile, meaning that data in the database is never over-written or deleted, but retained for future reporting; and,
- Integrated, meaning that the database contains data from most or all of an organization's operational applications, and that this data is made consistent.
History of data warehousing
Data Warehouses became a distinct type of computer database during the late 1980's and early 1990's. They developed to meet a growing demand for management information and analysis that could not be met by operational systems. Operational systems were unable to meet this need for a range of reasons:
- The processing load of reporting reduced the response time of the operational systems,
- The database designs of operational systems were not optimised for information analysis and reporting,
- Most organizations had more than one operational system, so company-wide reporting could not be supported from a single system, and
- Development of reports in operational systems often required writing specific computer programs which was slow and expensive
As a result, separate computer databases began to be built that were specifically designed to support management information and analysis purposes. These data warehouses were able to bring in data from a range of different data sources, such as mainframe computers, minicomputers, as well as personal computers and office automation software such as spreadsheet, and integrate this information in a single place. This capability, coupled with user-friendly reporting tools and freedom from operational impacts, has led to a growth of this type of computer system.
As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), data warehouses have evolved through several fundamental stages:
- Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server where the processing load of reporting does not impact on the operational system's performance.
- Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure
- Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)
- Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.
Components of a data warehouse
The primary components of the majority of data warehouses are shown in the attached diagram and described in more detail below:
Data Sources
Data sources refers to any electronic repository of information that contains data of interest for management use or analytics. This definition covers mainframe databases (e.g. IBM DB2, ISAM, Adabas, Teradata, etc.), client-server databases (e.g. Teradata, IBM DB2, Oracle database, Informix, Microsoft SQL Server, etc.), PC databases (e.g. Microsoft Access, Alpha Five), spreadsheets (e.g. Microsoft Excel) and any other electronic store of data. Data needs to be passed from these systems to the data warehouse either on a transaction-by-transaction basis for real-time data warehouses or on a regular cycle (e.g. daily or weekly) for offline data warehouses.
Data Transformation
The Data Transformation layer receives data from the data sources, cleans and standardises it, and loads it into the data repository. This is often called "staging" data as data often passes through a temporary database while it is being transformed. This activity of transforming data can be performed either by manually created code or a specific type of software could be used called an ETL tool. Regardless of the nature of the software used, the following types of activities occur during data transformation:
- comparing data from different systems to improve data quality (e.g. Date of birth for a customer may be blank in one system but contain valid data in a second system. In this instance, the data warehouse would retain the date of birth field from the second system)
- standardising data and codes (e.g. If one system refers to "Male" and "Female", but a second refers to only "M" and "F", these codes sets would need to be standardised)
- integrating data from different systems (e.g. if one system keeps orders and another stores customers, these data elements need to be linked)
- performing other system housekeeping functions such as determining change (or "delta") files to reduce data load times, generating or finding surrogate keys for data etc.
Data Warehouse
The data warehouse is normally (but does not have to be) a relational database. It must be organized to hold information in a structure that best supports not only query and reporting, but also advanced analysis techniques, like data mining. Most data warehouses hold information for at least 1 year and sometimes can reach half century, depending on the business/operations data retention requirement. As a result these databases can become very large.
Reporting
The data in the data warehouse must be available to the organisation's staff if the data warehouse is to be useful. There are a very large number of software applications that perform this function, or reporting can be custom-developed. Examples of types of reporting tools include:
- Business intelligence tools: These are software applications that simplify the process of development and production of business reports based on data warehouse data.
- Executive information systems (known more widely as Dashboard (business): These are software applications that are used to display complex business metrics and information in a graphical way to allow rapid understanding.
- OLAP Tools: OLAP tools form data into logical multi-dimensional structures and allow users to select which dimensions to view data by.
- Data Mining: Data mining tools are software that allow users to perform detailed mathematical and statistical calculations on detailed data warehouse data to detect trends, identify patterns and analyse data.
Metadata
Metadata, or "data about data", is used not only to inform operators and users of the data warehouse about its status and the information held within the data warehouse, but also as a means of integration of incoming data and a tool to update and refine the underlying DW model.
Examples of data warehouse metadata include table and column names, their detailed descriptions, their connection to business meaningful names, the most recent data load date, the business meaning of a data item and the number of users that are logged in currently.
Operations
Data warehouse operations comprises of the processes of loading, manipulating and extracting data from the data warehouse. Operations also cover user management, security, capacity management and related functions
Optional Components
In addition, the following components also exist in some data warehouses:
- Dependent Data Marts: A dependent data mart is a physical database (either on the same hardware as the data warehouse or on a separate hardware platform) that receives all its information from the data warehouse. The purpose of a Data Mart is to provide a sub-set of the data warehouse's data for a specific purpose or to a specific sub-group of the organisation. A data mart is exactly like a data warehouse technically, but it serves a different business purpose: it either holds information for only part of a company (such as a division), or it holds a small selection of information for the entire company (to support extra analysis without slowing down the main system). In either case, however, it is not the organization's official repository, the way a data warehouse is.
- Logical Data Marts: A logical data mart is a filtered view of the main data warehouse but does not physically exist as a separate data copy. This approach to data marts delivers the same benefits but has the additional advantages of not requiring additional (costly) disk space and it is always as current with data as the main data warehouse. The downside is that Logical Data Marts can have slower response times than physicalised ones.
- Operational Data Store: An ODS is an integrated database of operational data. Its sources include legacy systems and it contains current or near term data. An ODS may contain 30 to 60 days of information, while a data warehouse typically contains years of data. ODS's are used in some data warehouse architectures to provide near real time reporting capability in the event that the Data Warehouse's loading time or architecture prevents it being able to provide near real time reporting capability.
Different methods of storing data in a data warehouse
All data warehouses store their data grouped together by subject areas that reflect the general usage of the data (Customer, Product, Finance etc.). The general principle used in the majority of data warehouses is that data is stored at its most elemental level for use in reporting and information analysis.
Within this generic intent, there are two primary approaches to organising the data in a data warehouse.
The first is using a "dimensional" approach. In this style, information is stored as "facts" which are numeric or text data that capture specific data about a single transaction or event, and "dimensions" which contain reference information that allows each transaction or event to be classified in various ways. As an example, a sales transaction would be broken up into facts such as the number of products ordered, and the price paid, and dimensions such as date, customer, product, geographical location and sales person. The main advantages of a dimensional approach is that the Data Warehouse is easy for business staff with limited information technology experience to understand and use. Also, because the data is pre-processed into the dimensional form, the Data Warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to add or change later if the company changes the way in which it does business.
The second approach uses database normalisation. In this style, the data in the data warehouse is stored in third normal form. The main advantage of this approach is that it is quite straightforward to add new information into the database, whilst the primary disadvantage of this approach is that it can be quite slow to produce information and reports.
Advantages of using data warehouse
There are many advantages to using a data warehouse, some of them are:
- Enhances end-user access to a wide variety of data.
- Business decision makers can obtain various kinds of trend reports e.g. the item with the most sales in a particular area / country for the last two years. This may be helpful for future investments in a particular item.
- Increases data consistency.
- Increases productivity and decreases computing costs.
- Is able to combine data from different sources, in one place.
- It provides an infrastructure that could support changes to data and replication of the changed data back into the operational systems.
Concerns in using data warehouse
- Extracting, cleaning and loading data could be time consuming. But this can be made easy with the help of warehousing tools.
- Data warehousing project scope might increase.
- Problems with compatibility with systems already in place e.g. transaction processing system.
- Providing training to end-users, who end up not using the data warehouse.
- Security could develop into a serious issue, especially if the data warehouse is web accessible.
See also
- Business intelligence
- Business performance management
- Chief Performance Officer
- Data mart
- Data mining
- Compare with: Data integration
- Database management system
- Executive information system
- Extract, transform, load
- Intelligent document
- Master Data Management
- OLAP
- OLTP
- Operational data store
- Snowflake schema
- Star schema
References
- William H. Inmon, Richard D. Hackathorn: Using the Data Warehouse, John Wiley & Sons, ISBN 0-471-05966-8
- Pyle, Dorian. Business Modeling and Data Mining. Morgan Kaufmann, 2003. ISBN 1-55860-653-X
- Ralph Kimball, Margy Ross: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition), John Wiley & Sons, ISBN 0-471-20024-7
- Stephen Haag, Maeve Cummings, Donald J. McCubbery, Alain Pinsonneault,Richard Donvan:Managements Information System for the Information Age, Third Canadian Edition, McGraw-Hill Ryerson, ISBN 0-07-095569-7
- Data Management And Warehousing knowledge base
- BeyeNetwork – A vendor neutral resource for data management professionals
- Data Warehousing Development Process on gantthead.com
- Information Tecnology Data Warehouse
- Data Warehousing Gotchas
- ODP - Data Warehouse Resources
- ODP - Data Warehouse Articles
- SDN - Business Intelligence on the SAP Developer Network模板:Languages

