The Data Warehouse concept is built on the notion that all data related to the enterprise can be captured and centrally or holistically managed. This is a powerful idea, yet there is more than one way to achieve that goal. The traditional view of the EDW attacked the problem from a very DBMS-centric perspective, which is primarily why EDW projects became so expensive, difficult and ultimately hard to adopt. The typical EDW approach attempted to gather all of the data related to the enterprise and place it into one massive repository structure. Whether this was attempted in chunks or as a “Big Bang” effort made little difference in the long run, as the byproducts of the practice were the same; those byproducts included:
- A more bureaucratic management approach to the data layer in general.
- An added degree of separation between the data owners and the data developers.
- A certain level of inflexibility with regard to how data was updated, corrected or otherwise transformed.
- An added degree of separation between database developers and data exploitation developers.
- An added degree of separation between database developers and application developers.
- An inability to quickly respond to major changes in the business.
- Dependence upon a small sub-set of industry experts and upon equipment that is more expensive than the industry norm.
- A higher cost associated with scalability in general.
DBMS Focus – At the time when EDWs became popular, other areas of data architecture were only just beginning to blossom. Today’s Business Intelligence platforms represent much more than mere reporting engines; metadata management was only just beginning to be understood in the mid-1990s, and focus on Semantic technologies was virtually non-existent. The world according to the DBMS in 1995 had a relational management system in the middle, with ETL feeding data in and reports coming out. This might be thought of as a three-layer, stove-piped database systems view of the data architecture.
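To make that stove-piped pattern concrete, the sketch below walks one record set through the 1995-style pipeline: transform first, load into a single central relational store, and report by querying that same store. This is only an illustration; SQLite stands in for the central DBMS, and the table, column and sample values are invented for the example.

```python
# Minimal sketch of the stove-piped "ETL -> central DBMS -> report" pattern.
# SQLite stands in for the central relational store; the table, columns and
# sample rows are hypothetical, not drawn from any real EDW.
import sqlite3

# Extract: rows pulled from an operational source (hard-coded here)
source_rows = [
    ("2012-01-05", "emea", 1200.00),
    ("2012-01-06", "apac", 850.50),
]

# Transform: cleanse and reshape *before* the data ever reaches the DBMS
transformed = [(day, region.upper(), round(amount, 2))
               for day, region, amount in source_rows]

# Load: everything lands in the one central repository
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales_fact (sale_date TEXT, region TEXT, amount REAL)")
db.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", transformed)

# Report: the only way out is a query against that same repository
for region, total in db.execute(
        "SELECT region, SUM(amount) FROM sales_fact GROUP BY region"):
    print(region, total)
```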
The Enterprise Single Instance – While consolidating like capabilities into marts, stores or some other ‘functional single instance’ approach has achieved quite a bit of success over the past two decades, attempting to manage all data in one structure has proven much more difficult. This is why the notion of Massively Parallel Processing (MPP) was needed to make it viable back in the 1990s. MPP on proprietary hardware was expensive, though, and perhaps failed to recognize the power of networked processors running on inexpensive hardware (i.e., the Google scalability model). The other key consideration was the added steps needed to make such a system perform within reasonable parameters. So, the single-instance enterprise faced, and still faces, major hurdles in terms of cost, manageability and performance.
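The scale-out alternative can be illustrated with a toy example: hash-partition records across a handful of hypothetical commodity ‘nodes’, let each node aggregate only its own shard, then merge the partial results. The node count, keys and amounts are made up purely for illustration; real systems (MapReduce-style engines, distributed file systems) add fault tolerance and data locality that this sketch ignores.

```python
# Toy scatter/aggregate/gather sketch of the "networked commodity nodes" model.
# Everything here is illustrative; no real cluster framework is involved.
from collections import defaultdict

NUM_NODES = 4  # stand-ins for inexpensive commodity machines

records = [("EMEA", 1200.00), ("APAC", 850.50), ("EMEA", 300.00), ("AMER", 95.25)]

# Scatter: route each record to a node by hashing its key
shards = defaultdict(list)
for region, amount in records:
    shards[hash(region) % NUM_NODES].append((region, amount))

# Local work: each node aggregates only its own shard
partials = []
for rows in shards.values():
    local = defaultdict(float)
    for region, amount in rows:
        local[region] += amount
    partials.append(local)

# Gather: merge the partial results into the final answer
totals = defaultdict(float)
for local in partials:
    for region, amount in local.items():
        totals[region] += amount

print(dict(totals))
```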
Data is no longer confined within the context of single systems.
If we were to directly challenge the core EDW assumptions and illustrate the fallacies associated with the philosophy, our list would resemble the following:
- The Business will remain static over a relatively long period of time.
- The Enterprise will remain static over a relatively long period of time.
- That source data and data exploitation should not be managed synergistically; in other words, that Decision Support or Business Intelligence solutions built on top of EDW source data should be viewed as separate, albeit related, efforts.
- That the data layer and the application layer can or should be viewed or designed separately.
- That computer hardware would not catch up to the processing load – i.e., that the data layer would always require specialized Massively Parallel Processing (MPP) in order to manage very large quantities of data. Furthermore, this assumption also implied that the data would remain in a single-instance data source. Instead of parallel processors deployed in specialized equipment, Big Data now uses the cheapest processors and equipment possible in a commodity approach, with data spread out across sets of distributed file systems. In fact this has been taken even further, as this week the US Government announced the completion of the world's most powerful supercomputer, which achieved all of its latest gains by using commodity hardware (in this case GPUs, graphics processors for gaming that are widely available on the market).
- That network architecture, data architecture, application / SOA architecture and enterprise architecture are separate.
- That the Internet (Cloud) would not represent a viable mechanism for connecting to distributed data sources.
- That unstructured data was not as valid as structured data (mainly because no mechanism existed to incorporate it into traditional database management approaches).
- That most major transformations need to occur before data is placed into the primary storage / management entity (i.e. DBMS, warehouse); the opposite, schema-on-read pattern is sketched after this list.
- That there is a single version of the truth, period. This is perhaps the biggest fallacy behind all data warehouse, MDM and Governance solutions. Data can be managed, but it is dynamic and always will be. Viewing data as incontrovertible, orthodox truth immediately eliminates much of the value that data otherwise provides. Situations change, and every stakeholder views the whole from their own unique perspective. Yet there can still be order in a relativistic environment (much as there is in the real world). This doesn't mean data cannot be standardized or managed - it merely takes into consideration the inevitable evolution that will occur.
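As promised above, here is a hedged schema-on-read sketch of the opposite of the ‘transform before you store’ assumption: raw records land in storage untouched, and each consumer transforms them at read time into whatever shape it needs. The file name and record fields are hypothetical.

```python
# Schema-on-read / ELT sketch: load raw records first, transform when reading.
# The file path and fields are invented for the example.
import json
import os
import tempfile

raw_events = [
    '{"ts": "2012-01-05T10:00:00", "region": "emea", "amount": "1200"}',
    '{"ts": "2012-01-06T11:30:00", "region": "apac", "amount": "850.5"}',
]

# Load first: write the untouched source records to the storage layer
path = os.path.join(tempfile.gettempdir(), "events.jsonl")
with open(path, "w") as f:
    f.write("\n".join(raw_events))

# Transform later, at read time, into whatever shape this consumer needs
with open(path) as f:
    for line in f:
        event = json.loads(line)
        print(event["region"].upper(), float(event["amount"]))
```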
Copyright 2012 - Technovation Talks, Semantech Inc.