Data Mining: What’s Ahead

I’ve written on data mining before, as it is a fundamental step for higher-order predictive and prescriptive analytics work. Enterprise data warehouses are a trove of useful information, and data mining methods help to separate what is useful from what is not (Sharma, Sharma, & Sharma, 2013). Data mining is itself an analysis method; that is, “the analysis of data that was collected for other purposes but not the questions to be answered through the data mining process” (Maaß, Spruit, & de Waal, 2014, p. 2). Data mining takes on the unknown unknowns of the dataset and begins to make sense of the vast number of data points available. It involves both data transformation and reduction. Both are necessary because “prediction algorithms have no control over the quality of the features and must accept it as a source of error” (Maaß, Spruit, & de Waal, 2014, p. 6). Data mining reduces the noise and eliminates the dilution of relevant data by irrelevant covariates. It provides the business intelligence framework with usable data and a minimum of error.
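
To make that reduction step concrete, here is a minimal sketch in Python, assuming the warehouse extract arrives as a pandas DataFrame; the thresholds, column handling, and file name are purely illustrative, not a prescribed method:

```python
import pandas as pd

def reduce_features(df: pd.DataFrame, var_tol: float = 1e-3, corr_tol: float = 0.95) -> pd.DataFrame:
    """Drop near-constant and highly redundant numeric columns (illustrative thresholds)."""
    numeric = df.select_dtypes(include="number")

    # Near-constant columns carry almost no signal, so screen them out first.
    keep = [c for c in numeric.columns if numeric[c].var() > var_tol]

    # For each highly correlated pair, drop one column to limit dilution by redundant covariates.
    corr = numeric[keep].corr().abs()
    dropped = set()
    for i, a in enumerate(keep):
        for b in keep[i + 1:]:
            if a not in dropped and b not in dropped and corr.loc[a, b] > corr_tol:
                dropped.add(b)

    return df[[c for c in keep if c not in dropped]]

# Hypothetical usage: screen a warehouse extract before it feeds the BI layer.
# cleaned = reduce_features(pd.read_csv("warehouse_extract.csv"))
```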

Tembhurkar, Tugnayat, and Nagdive (2014) outline five stages for successful data-to-BI transformation (a structural sketch follows the list):

  1. Collection of raw data
  2. Data mining and cleansing
  3. Data warehousing
  4. Implementation of BI tools
  5. Analysis of outputs (p. 132).
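
As a structural illustration only, the first three stages might be stubbed out in Python as below; every function and file name is hypothetical, and stages 4 and 5 live in the BI platform rather than in code:

```python
import pandas as pd

def collect_raw_data(sources: list) -> pd.DataFrame:
    """Stage 1: pull raw extracts (hypothetical CSV sources) into one frame."""
    return pd.concat([pd.read_csv(s) for s in sources], ignore_index=True)

def mine_and_cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """Stage 2: basic cleansing; a real project adds transformation and reduction here."""
    return df.drop_duplicates().dropna(axis="columns", how="all")

def load_to_warehouse(df: pd.DataFrame, path: str) -> None:
    """Stage 3: persist the cleansed data; a stand-in for an actual warehouse load."""
    df.to_csv(path, index=False)

# Stages 4 and 5 (implementing BI tools and analyzing their outputs) happen in
# the BI platform itself, e.g. Tableau or Power BI, rather than in this script.
```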

Given the importance of data mining in the BI process, I do not see it going away or diminishing in stature. In fact, it may draw even more attention because of the growing interest in data lakes and ELT over ETL (e.g., Meena & Vidhyameena, 2016; Rajesh & Ramesh, 2016). Increased attention will be paid to mining and cleansing practices, and new developments will likely include advances in mining unstructured data, IoT data, data across distributed systems, and NLP and multimedia content.

References

Maaß, D., Spruit, M., & de Waal, P. (2014). Improving short-term demand forecasting for short-lifecycle consumer products with data mining techniques. Decision Analytics, 1(1), 1–17.

Meena, S. D., & Vidhyameena, S. (2016). Data lake – a new data repository for big data analytics workloads. International Journal of Advanced Research in Computer Science, 7(5), 65-67.

Rajesh, K. V. N., & Ramesh, K. V. N. (2016). An introduction to data lake. i-Manager’s Journal on Information Technology, 5(2), 1-4.

Sharma, S. A., Sharma, A. K., & Sharma, D. M. (2013). Using data mining for prediction: A conceptual analysis. i-Manager’s Journal on Information Technology, 2(1), 1–9.

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

Data Mining and the Enterprise BI Long Game

Data mining provides the foundational work for higher-order predictive and prescriptive analytics. Enterprise data warehouses are a trove of useful information, and data mining methods help to separate what is useful from what is not (Sharma, Sharma, & Sharma, 2013). Data mining is itself an analysis method; that is, “the analysis of data that was collected for other purposes but not the questions to be answered through the data mining process” (Maaß, Spruit, & de Waal, 2014, p. 2). Data mining takes on the unknown unknowns of the dataset and begins to make sense of the vast number of data points available. It involves both data transformation and reduction. Both are necessary because “prediction algorithms have no control over the quality of the features and must accept it as a source of error” (Maaß, Spruit, & de Waal, 2014, p. 6).


What is produced from these data mining efforts is a set of relevant data points that can be used for aggregate, predictive, and prescriptive analysis in the enterprise organization’s business intelligence platform(s). This is no different from avoiding the “garbage in, garbage out” mistake of simple reporting and visualization. Data mining reduces the noise and eliminates the dilution of relevant data by irrelevant covariates. It provides the business intelligence framework with usable data and a minimum of error.

For example, if I were to embark on a predictive modeling project to determine what factors influenced employee attrition at a large manufacturing company over the last five years, I would first want to do extensive data mining on the raw dataset. With more than 20,000 employees across every continent and hundreds of data points per employee, a rigorous data mining phase eliminates the variables that would introduce error into any predictive model, such as a decision tree or multiple regression.
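
A rough sketch of how the features surviving that mining phase might feed a decision tree follows; the file, the column names, and the model settings are hypothetical, and this is not a full modeling workflow:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical extract: one row per employee, 'attrited' flags whether they left.
employees = pd.read_csv("employee_history.csv")

# Keep only the features that survived the data mining phase (names are invented).
features = ["tenure_years", "pay_grade", "overtime_hours", "site_transfers", "engagement_score"]
X = employees[features]
y = employees["attrited"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)

print("Holdout accuracy:", tree.score(X_test, y_test))
print("Influence of each factor:", dict(zip(features, tree.feature_importances_)))
```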

References

Maaß, D., Spruit, M., & de Waal, P. (2014). Improving short-term demand forecasting for short-lifecycle consumer products with data mining techniques. Decision Analytics, 1(1), 1–17.

Sharma, S. A., Sharma, A. K., & Sharma, D. M. (2013). Using data mining for prediction: A conceptual analysis. i-Manager’s Journal on Information Technology, 2(1), 1–9.

Data Warehousing and Data Mining

The relationship between data mining tools and data warehousing systems can be most easily seen in the connector options of popular analytics software packages. Tableau Desktop (2018), for example, offers many source options for pulling data in from warehouse backends, and Microsoft Power BI includes similar interface options. There are countless packages in R for connecting to data warehouse backends, readily available online from proprietary and open-source vendors. Other proprietary packages such as SPSS, SAS, and JMP have similar interfaces.
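
The same connector idea can be reproduced in code rather than through a desktop interface. The sketch below assumes a PostgreSQL-style warehouse reachable through SQLAlchemy; the connection string, table, and columns are placeholders, not a real environment:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string: substitute the warehouse's actual dialect, driver,
# host, and credentials (a PostgreSQL URL also requires a driver such as psycopg2).
engine = create_engine("postgresql://analyst:secret@warehouse-host:5432/enterprise_dw")

# Pull a slice of warehouse data into a DataFrame for mining, much as a desktop tool's connector would.
sales = pd.read_sql("SELECT region, product_line, net_sales FROM fact_sales WHERE fiscal_year = 2018", engine)

print(sales.head())
```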

Simply put, a data mining tool enables insights into what is stored in the data warehouse, and is only as useful as the quality of the data it accesses. Power (2016) calls this discover, access, and distill. In professional practice, this author has often seen businesses focus erroneously on a particular data mining tool, believing the paid solution will provide immediate value, without ensuring the data warehouse (or equivalent) is in proper order first. Successful implementation of a data mining tool requires a number of preparatory steps, including (but not limited to):

  1. Identifying appropriate Systems of Record (SORs)
  2. Validating the SOR accuracy and alignment with business purposes
  3. Establishing a common understanding of the data points within each SOR and how they translate across business units within the organization (this often requires an organization-wide Data Governance Board or equivalent)
  4. Developing business goals, or questions the data mining tool can answer

These steps ensure the data is valid, useful, and actionable. Organizations that do not take the necessary steps to ensure data quality and develop a business case for the data mining tool run the risk of wasting time and resources on a solution in search of a problem (Gudfinnsson, Strand, & Berndtsson, 2015; LaValle, Lesser, Shockley, Hopkins, & Kruschwitz, 2011).

Consider an international manufacturing company that currently uses a number of disparate systems of record for its business: Cognos (not a true SOR but regarded as one), AS400, 3PL, SQL, Informix, and multiple warehouse management systems. The company has neither a unified data warehouse nor data governance procedures in place. In its current state, business units that rely on different systems of record cannot work together from a common understanding of the data. Attempts at data mining and even simple reporting have failed across business units because of data quality; for example, sales forecasting does not translate between Finance and Marketing because the basic figures from the disparate systems of record do not match. There can be no useful data mining of this data without significant transformation.
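
A short reconciliation check illustrates why the mining fails. Assuming each system can export a monthly forecast file (the file names, columns, and 5% tolerance below are invented for illustration), the disagreement can be quantified before any tool is pointed at the data:

```python
import pandas as pd

# Hypothetical monthly forecast extracts from two systems of record.
finance = pd.read_csv("finance_forecast.csv")      # columns: month, forecast_units
marketing = pd.read_csv("marketing_forecast.csv")  # columns: month, forecast_units

merged = finance.merge(marketing, on="month", suffixes=("_finance", "_marketing"))
merged["gap"] = (merged["forecast_units_finance"] - merged["forecast_units_marketing"]).abs()

# Flag the months where the two SORs disagree by more than 5%; these need
# transformation and governance work before any mining is worthwhile.
mismatched = merged[merged["gap"] > 0.05 * merged["forecast_units_finance"]]
print(mismatched[["month", "forecast_units_finance", "forecast_units_marketing", "gap"]])
```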

Assuming the foundational steps are done and the data mining tool is in production, new data points can be put back into the warehouse based on discovered insights. For example, consider a multi-level marketing company that has a number of data points on its associates: units sold, associates recruited, years in the program, rewards program tier, et cetera. The company knows the associates can be grouped into performance categories akin to “novice” and “expert” but is unclear on both how many categories to look at and which factors are important. Principal components analysis and k-means clustering can reveal how the associates differentiate themselves based on the available variables and suggest an appropriate number of categories within which to classify them. These classifications can then be put back into the data warehouse and used as covariates in other analysis work.
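
Here is a minimal sketch of that clustering step with scikit-learn, assuming the associate attributes sit in a single extract; the column names, the two-component projection, and the candidate range of cluster counts are illustrative choices, not the company’s actual method:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical extract: one row per associate with numeric performance attributes.
associates = pd.read_csv("associates.csv")
features = ["units_sold", "associates_recruited", "years_in_program", "rewards_tier"]

# Standardize, then project onto principal components to see how the variables separate associates.
X = StandardScaler().fit_transform(associates[features])
components = PCA(n_components=2).fit_transform(X)

# Try a range of cluster counts and keep the best silhouette score to suggest how many categories exist.
best_k, best_score = 2, -1.0
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(components)
    score = silhouette_score(components, labels)
    if score > best_score:
        best_k, best_score = k, score

associates["performance_segment"] = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(components)

# The new segment label can then be written back to the warehouse as a covariate for later analysis,
# e.g. associates.to_sql("associate_segments", engine, if_exists="replace") against a live connection.
```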

References

Brownlee, J. (2016, September 22). Supervised and unsupervised machine learning algorithms. Retrieved from https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

Gudfinnsson, K., Strand, M., & Berndtsson, M. (2015). Analyzing business intelligence maturity. Journal of Decision Systems, 24(1), 37-54. doi:10.1080/12460125.2015.994287

LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21-31.

Power, D. J. (2016). Data science: Supporting decision-making. Journal of Decision Systems, 25(4), 345-356.

Soni, D. (2018, March 22). Supervised vs. unsupervised learning. Towards Data Science. Retrieved from https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Tableau Desktop 2018.2 [Computer software]. (2018). Retrieved from http://www.tableau.com.

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).