week 5

    Cards (33)

    • Data Warehouse
      Collection of data that helps analysts to make informed decisions in an organisation
    • Data Warehouse
      • OnLine Analytical Processing tools (OLAP)
      • Allows us to analyse data in a multi-dimensional space
      • Results in data generalisation and data mining
    • Data Warehouse vs Database
      • It is a database but there is no frequent updating
      • It houses historical data
      • Usually kept separate from the organisation's operational database
      • Helps in the integration of a diversity of applications
      • Can use the data to help make strategic decisions
    • Data Warehouse
      • Subject-oriented – provides information around a subject rather than ongoing operations. Focuses on modelling and analysis of data
      • Integrated – constructed by integrating data from many different sources such as relational databases or flat files
      • Time-variant – the data collected is identified with a particular time period – e.g. last 12 months
      • Non-volatile – previous data is not erased when new data is added
    • Types of data warehouses
      • Information processing
      • Analytical Processing
      • Data mining
    • Information processing
      Allows processing of data stored in it. Can be processed by means of querying, basic statistical analysis, reporting using crosstabs, tables, charts or graphs
    • Analytical Processing
      Supports analytical processing of the information stored in it by means of OLAP operations – slice and dice, drill down, drill up and pivoting
    • Data mining
      Supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction. Results can be presented using visualisation tools
    • Enterprise Data Warehouse (EDW)

      • Services entire enterprise
      • EDW environment may have an EDW, an Operational Data Store, Physical and Virtual Data Marts
    • Data Marts
      • Used by individual departments or groups
      • Use a dimensional data model to build
    • Operational Data Store
      Hybrid form of data warehouse containing integrated information
    • Data Warehouse (OLAP) vs Operational Database (OLTP)
      • Historical processing of data vs day to day processing
      • Analyse business vs run business
      • Based on Star Schema, Snowflake Schema, Fact Constellation Schema vs Based on Entity Relationship Model (commonly)
      • Summarised data vs highly detailed data
      • Number of users in 100s vs 1000s
      • Number of records accessed in the millions vs in 10s
      • Highly flexible vs high performance
      • Larger vs smaller
    • Query driven approach
      Query - client side - metadata dictionary -heterogeneous sites -queries are mapped -local processor - global answer set
    • Update driven approach
      • Alternative to traditional approach
      • Information from multiple heterogeneous sources are integrated in advance and available for direct querying and analysis
    • Advantages of update driven approach
      • High performance
      • Data is copied, processed, integrated, annotated, summarised and restructured in semantic data store in advance
    • 3 tier architecture
      • Top tier: data mining, reporting tool, analysis/other tools
      • Middle tier: OLAP Server
      • Bottom tier: ETL - data warehouse - data marts
    • Warehouse tools and what they do
      • Data extraction – gather data from many different sources
      • Data cleaning – finding and correcting errors in data
      • Data transformation – convert the data to warehouse format
      • Data loading – sorting, summarising, consolidating, checking integrity and building indices and patterns
      • Refreshing – updating data sources to warehouse
    • Process flow in a data warehouse
      1. Extract and load the data
      2. Clean and transform the data
      3. Backup and archive the data
      4. Manage Queries and direct them to the appropriate data sources
    • ETL - Load Manager
      • Controlling Process
      • Stored Procedures
      • Copy Management tools
      • Fast loader
    • ETL Vs ELT
      • ETL – extract data from source systems, convert the data to a structure suitable for querying and analysis and load data into data warehouse
      • ELT – takes data from various sources and loads it directly into the target system, then transforms the data on demand to enable analysis
    • Warehouse Manager
      • Controlling Process
      • Stored Procedures or C with SQL
      • Backup/Recovery Tool
      • SQL Scripts
    • Query Management Process
      1. Query Redirection via C tool or RDBMS
      2. Stored Procedures (generating views)
      3. Query management tool to monitor indexes and summaries
      4. Query Scheduling via C tool or RDBMS
    • Metadata
      • Data about data e.g. the index of a book is metadata for the contents of a book
      • Summarised data that leads us to detailed data
      • Roadmap to the data warehouse
      • Defines the warehouse objects
      • Acts as a directory – helps to locate the contents of a data warehouse
    • Data Warehouse
      • Use a dimensional data model
      • Data organised into dimension tables and fact tables using star and snowflake schemas
      • Traditional OLTP databases automate transactional operations – strive to eliminate data duplication (normalisation)
      • For a data warehouse - need denormalisation or grouping of data
      • Increases query performance
    • Ways to organise a data warehouse
      • Entity relationship modelling and normalisation
      • Dimensional Modelling
    • Normalisation
      • Normalisation is a process for converting complex data structures into simple and stable data structures
      • Models are robust and stable, have minimal redundancy, 3NF, No repeating groups, No partial dependencies, No transitive dependencies, Avoid anomalies
    • Dimensional Modelling
      • Purpose – structure data for easy and efficient retrieval and analysis
      • ER modelling creates a single model of data required to support the organisation's processes
      • Dimensional Modelling allows for different individual models of interest so we can make decisions
      • Denormalised data structure – fewer tables, better performance
    • Denormalisation
      • The more normalised your data is, the more complex the queries needed to read the data because a single query combines data from many tables. This puts a huge strain on computing resources.
      • The data in a data warehouse does not need to be organized for quick transactions. Therefore, data warehouses normally use a denormalised data structure.
      • A denormalised data structure uses fewer tables because it groups data and doesn't exclude data redundancies. Denormalisation offers better performance when reading data for analytical purposes.
    • Approaches to warehouse design
      • Top down - Enterprise Data Warehouse
      • Bottom up - Dimensional Data Warehouse
    • Top down vs Bottom up
      • Top down: Provides a definite and consistent view of information as information from the data warehouse is used to create Data Marts. Time, Cost and Maintenance is high.
      • Bottom up: Reports can be generated easily as Data marts are created first and it is relatively easy to interact with Data Marts. Time, Cost and Maintenance are low.
    • Components of dimensional model
      • Fact Tables – contain measurements of business e.g. sales, purchase orders, shipment (normalised). Can be quite large (50 billion records, 1-5 terabytes)
      • Dimension Tables – store the descriptions of the dimensions of the business e.g. product, customer, vendor, store. Smaller than Fact Tables (denormalised)
    • Types of fact tables
      • Transaction Fact Tables – record information related to events – individual product sales
      • Snapshot Fact Table – record information that applies to specific moments in time – year-end accounts
      • Accumulating snapshot tables – record information to a running tally of data – year to year sales figures
    • Dimensional tables and fact tables

      • Dimension tables list surrogate primary key – consists of a single column integer related to the natural primary key
      • Wherever you list the Store ID number on the Fact Table, it will map to that specific row of store data on the dimension table
    See similar decks