Data Mining: What’s Ahead

I’ve written on data mining before, as it is a fundamental step for higher-order predictive and prescriptive analytics work. Enterprise data warehouses are a trove of useful information, and data mining methods help to separate what is useful from what is not (Sharma, Sharma, & Sharma, 2013). Data mining is itself an analysis method; that is, “the analysis of data that was collected for other purposes but not the questions to be answered through the data mining process” (Maaß, Spruit, & de Waal, 2014, p. 2). Data mining takes on the unknown unknowns of the dataset and begins to make sense of the vast number of data points available. It involves both data transformation and reduction. These are necessary because “prediction algorithms have no control over the quality of the features and must accept it as a source of error” (Maaß, Spruit, & de Waal, 2014, p. 6). Data mining reduces the noise and keeps relevant data from being diluted by irrelevant covariates. It provides the business intelligence framework with usable data and a minimum of error.
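
As a concrete illustration of the transformation-and-reduction step, the minimal sketch below drops near-constant columns and near-duplicate covariates from a pandas DataFrame before they ever reach a prediction algorithm. The column names and thresholds are hypothetical and are not drawn from any of the cited studies.

    import numpy as np
    import pandas as pd

    def reduce_features(df, variance_floor=1e-3, corr_ceiling=0.95):
        """Drop near-constant and near-duplicate numeric columns."""
        numeric = df.select_dtypes("number")

        # Near-constant columns carry little signal but still add noise.
        low_variance = [c for c in numeric.columns if numeric[c].var() < variance_floor]
        reduced = df.drop(columns=low_variance)

        # One of each pair of highly correlated columns is redundant; near-duplicate
        # covariates dilute the contribution of the genuinely relevant ones.
        corr = reduced.select_dtypes("number").corr().abs()
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        redundant = [c for c in upper.columns if (upper[c] > corr_ceiling).any()]
        return reduced.drop(columns=redundant)

    raw = pd.DataFrame({
        "units_sold": [12, 15, 9, 20, 14],
        "revenue":    [120, 150, 90, 200, 140],  # perfectly correlated with units_sold
        "store_open": [1, 1, 1, 1, 1],           # constant, so it carries no information
        "region":     ["N", "S", "N", "E", "S"],
    })
    print(reduce_features(raw).columns.tolist())  # -> ['units_sold', 'region']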

Tembhurkar, Tugnayat, and Nagdive (2014) outline five stages for successful data-to-BI transformation (sketched in code after the list):

  1. Collection of raw data
  2. Data mining and cleansing
  3. Data warehousing
  4. Implementation of BI tools
  5. Analysis of outputs (p. 132).
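
A schematic rendering of those five stages as a Python pipeline follows; only the ordering of the stages comes from Tembhurkar, Tugnayat, and Nagdive (2014), while the function bodies and the toy data are hypothetical placeholders.

    import pandas as pd

    def collect_raw_data(sources):
        """Stage 1: pull raw records from each operational source."""
        return pd.concat([pd.DataFrame(rows) for rows in sources], ignore_index=True)

    def mine_and_cleanse(df):
        """Stage 2: deduplicate and drop empty rows before warehousing."""
        return df.drop_duplicates().dropna(how="all")

    def warehouse(df, path="warehouse.csv"):
        """Stage 3: persist the cleansed data in a central, queryable store."""
        df.to_csv(path, index=False)
        return path

    def run_bi_tool(path):
        """Stage 4: the BI layer reads from the warehouse, not the raw sources."""
        return pd.read_csv(path)

    def analyze(df):
        """Stage 5: analysis of outputs, here a simple aggregate."""
        return df.groupby("region")["revenue"].sum()

    sources = [
        [{"region": "N", "revenue": 120}, {"region": "S", "revenue": 90}],
        [{"region": "N", "revenue": 120}, {"region": "E", "revenue": 200}],  # one duplicate row
    ]
    print(analyze(run_bi_tool(warehouse(mine_and_cleanse(collect_raw_data(sources))))))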

Given the importance of data mining in the BI process, I do not see it going away or diminishing in stature. In fact, it may draw more attention because of the growing interest in data lakes and ELT over ETL (e.g., Meena & Vidhyameena, 2016; Rajesh & Ramesh, 2016). Increased attention will be paid to mining and cleansing practices. New developments will include advances in mining unstructured data, IoT data, data held in distributed systems, and natural-language and multimedia content.

References

Maaß, D., Spruit, M., & de Waal, P. (2014). Improving short-term demand forecasting for short-lifecycle consumer products with data mining techniques. Decision Analytics, 1(1), 1–17.

Meena, S. D., & Vidhyameena, S. (2016). Data lake – a new data repository for big data analytics workloads. International Journal of Advanced Research in Computer Science, 7(5), 65–67.

Rajesh, K. V. N., & Ramesh, K. V. N. (2016). An introduction to data lake. i-Manager’s Journal on Information Technology, 5(2), 1–4.

Sharma, S. A., Sharma, A. K., & Sharma, D. M. (2013). Using data mining for prediction: A conceptual analysis. i-Manager’s Journal on Information Technology, 2(1), 1–9.

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

Corporate Information Factories and Business Dimensional Models

Differentiating between a Corporate Information Factory (CIF) and a Business Dimensional Model (BDM) may come down to two different directions of strategic thought: top-down (CIF) or bottom-up (BDM).

In the BDM, otherwise known as the Kimball approach, data remain in their respective logical business units (e.g., Sales or Production) but are brought together into the data warehouse through a commonly defined bus architecture. This approach is most prevalent in the Microsoft BI stack. Star or snowflake schemas are utilized, and data are rarely normalized past 1NF, if at all. The logical focus is on the originating business units, and the goal is often to allow these units to share data across the organization more effectively. For presentation, fewer queries and joins are necessary than would be needed to make sense of CIF data.
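
As an illustration of why presentation requires so few joins, the sketch below builds a small star schema in pandas. The Sales tables and column names are hypothetical; a production Kimball bus would of course live in a relational database rather than in DataFrames.

    import pandas as pd

    # Denormalized dimension tables: descriptive attributes live on the dimension row.
    dim_date = pd.DataFrame({
        "date_key": [20240101, 20240102],
        "calendar_date": ["2024-01-01", "2024-01-02"],
        "month": ["2024-01", "2024-01"],
    })
    dim_product = pd.DataFrame({
        "product_key": [1, 2],
        "product_name": ["Widget", "Gadget"],
        "category": ["Hardware", "Hardware"],  # repeated per product rather than factored out
    })

    # Fact table: one row per measurable event, keyed to the dimensions.
    fact_sales = pd.DataFrame({
        "date_key": [20240101, 20240101, 20240102],
        "product_key": [1, 2, 1],
        "units_sold": [10, 4, 7],
    })

    # One join per dimension is enough to put the data in presentable form.
    report = (fact_sales
              .merge(dim_date, on="date_key")
              .merge(dim_product, on="product_key")
              .groupby(["month", "category"])["units_sold"].sum())
    print(report)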

The CIF, or Inmon approach, starts with the central data repository as the unit of focus, as opposed to the individual business units. Data are held in third normal form, and the business units can create data marts from the normalized tables. The most apparent disadvantage here is the amount of time and thought required to implement a true CIF, but the resulting product is a genuinely enterprise-wide information factory. More joins are needed, though, to put the data into presentable form.
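
The same sales question asked of normalized, 3NF-style tables shows where the extra joins come from; the tables below are hypothetical, and the final result is the kind of derived data mart a business unit might publish.

    import pandas as pd

    # Normalized tables: the category label is factored out into its own table
    # and referenced by a foreign key rather than repeated on every product row.
    category = pd.DataFrame({"category_id": [10], "category_name": ["Hardware"]})
    product = pd.DataFrame({
        "product_id": [1, 2],
        "product_name": ["Widget", "Gadget"],
        "category_id": [10, 10],
    })
    sale = pd.DataFrame({
        "sale_id": [1001, 1002, 1003],
        "product_id": [1, 2, 1],
        "sale_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
        "units_sold": [10, 4, 7],
    })

    # An extra join compared with the star schema above, resolved into a small
    # data mart the Sales unit could expose to the rest of the organization.
    sales_mart = (sale
                  .merge(product, on="product_id")
                  .merge(category, on="category_id")
                  .groupby("category_name")["units_sold"].sum()
                  .reset_index())
    print(sales_mart)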

Where Extract-Transform-Load or Extract-Load-Transform is concerned, the former (ETL) is the most conventional understanding of the process and is typically implemented in dimensional modeling. The transformation happens before the data reach the target system, so the data arrive already arranged, to some degree, by business unit or purpose. The latter (ELT) is utilized most often in more powerful analytics implementations or data lakes.
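
A minimal sketch of where the “T” falls is below; the transform, the in-memory “warehouse,” and the “lake” are all hypothetical stand-ins meant only to show the ordering, not a real integration tool.

    import pandas as pd

    def transform(df):
        """Conform types and trim to the columns the business model expects."""
        out = df.assign(sale_date=pd.to_datetime(df["sale_date"]))
        return out[["sale_date", "units_sold"]]

    def etl(extracted, warehouse):
        """ETL: shape the data first, then load only the conformed result."""
        warehouse.append(transform(extracted))

    def elt(extracted, lake):
        """ELT: land the raw extract as-is and defer the transformation to the
        target system, which usually has the compute power to run it later."""
        lake.append(extracted)
        return lambda: transform(lake[-1])

    raw = pd.DataFrame({"sale_date": ["2024-01-01"], "units_sold": [10], "scratch_note": ["n/a"]})
    warehouse, lake = [], []
    etl(raw, warehouse)          # the warehouse holds only conformed data
    run_later = elt(raw, lake)   # the lake holds the raw extract, transformed on demand
    print(warehouse[0], run_later(), sep="\n")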

Decision Support Systems, Data Warehouses, and OLAP Cubes

As Tembhurkar, Tugnayat, and Nagdive (2014) define it, BI is “a collection of tools and techniques [that] transforms raw data into significant information useful for analyzing business systems” (p. 128). BI has evolved from earlier incarnations of decision support systems (DSS), which served the same purposes but were much more rudimentary than today’s implementations. These DSS solutions often consisted of data warehouses (DWs) and online analytical processing (OLAP) engines. The two components worked together to serve the business’s needs: the data warehouse handled ETL and storage, while the OLAP system handled the front-end analysis.

The data warehouse serves as the central repository for multiple systems of record, often heterogeneous and disparate in the beginning. Data are typically replicated and stored in subject-area schemas (e.g., sales or employee data), most often in fact and dimension tables as part of a SQL-backed relational database. The data warehouse itself can offer pre-packaged views and data marts. It supports the OLAP system. Like the OLAP system in its original form, the data warehouse is starting to be eclipsed by data lakes in enterprise environments that deal with large amounts of heterogeneous data, often including unstructured data. The difference between the two, for purposes of this comparison, is where the “T” (transformation) falls in ETL or ELT. In a data warehouse, the transformation happens before loading into the warehouse, as its purpose is to serve as a central common repository. In a data lake, the transformation happens after loading, as the lake does not impose any schemas or restrictions in order to achieve a homogeneous state.

The OLAP system is multi-dimensional, not unlike a three-dimensional spreadsheet. It is not a relational database, but it enables the analysis of the data in the data warehouse, supporting what we typically understand as slicing and dicing the data. While cubes were sufficient in the early days of BI, the shift towards a DevOps culture and the proliferation of machine learning, predictive analysis, dashboarding, and envelope-pushing analytics capabilities have required more from a BI solution than rigid OLAP cubes.
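
To make slicing and dicing concrete, the sketch below uses a pandas pivot table as a stand-in for a cube; the sales data are hypothetical, and a real OLAP engine would pre-aggregate these cells rather than compute them on the fly.

    import pandas as pd

    sales = pd.DataFrame({
        "month":   ["2024-01", "2024-01", "2024-02", "2024-02"],
        "region":  ["North", "South", "North", "South"],
        "product": ["Widget", "Widget", "Gadget", "Widget"],
        "revenue": [120, 90, 200, 140],
    })

    # A month x (region, product) "cube" of summed revenue.
    cube = sales.pivot_table(index="month", columns=["region", "product"],
                             values="revenue", aggfunc="sum")

    slice_jan = cube.loc["2024-01"]                       # slice: fix one dimension
    dice = cube.loc[["2024-01"], [("North", "Widget")]]   # dice: restrict several dimensions
    print(slice_jan, dice, sep="\n\n")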

References

Felzke, M. (2014). Data warehouse vs. OLAP cube. Retrieved from https://www.solverglobal.com/blog/2014/04/data-warehouse-vs-olap-cube/

Harris, D. (n.d.). ETL vs. ELT: How to choose the best approach for your data warehouse. Retrieved from https://www.softwareadvice.com/resources/etl-vs-elt-for-your-data-warehouse/

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).