Data Mining and the Enterprise BI Long Game

Data mining provides the foundational work for higher-order predictive and prescriptive analytics. Enterprise data warehouses are a trove of useful information, and data mining methods help to separate what is useful from what is not (Sharma, Sharma, & Sharma, 2013). Data mining is itself an analysis method; that is, “the analysis of data that was collected for other purposes but not the questions to be answered through the data mining process” (Maaß, Spruit, & de Waal, 2014, p. 2). Data mining takes on the unknown-unknowns of the dataset and begins to make sense of the vast amount of data points available. It involves both data transformation and reduction. These are necessary as “prediction algorithms have no control over the quality of the features and must accept it as a source of error” (Maaß, Spruit, & de Waal, 2014, p. 6).


What is produced from these data mining efforts is a set of relevant data points that can be used for aggregate, predictive, and prescriptive analysis in the enterprise organization’s business intelligence platform(s). It is no different than avoiding the “garbage-in, garbage-out” mistake of simple reporting and visualization. Data mining reduces the noise and eliminates the dilution of relevant data by irrelevant covariates. It provides the business intelligence framework with usable data and a minimum of error.

For example, if I were to embark on a predictive modeling project to determine what factors influenced employee attrition at a large manufacturing company over the last five years, I would first want to do extensive data mining on the raw dataset. With more than 20,000 employees spread across every continent and hundreds of data points per employee, a rigorous data mining phase eliminates the variables that would introduce error into any predictive model, such as a decision tree or multiple regression.
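
A minimal sketch of that screening step is shown below, assuming the raw extract already sits in a pandas DataFrame named `employees` with a binary `attrited` flag; the thresholds and column names are illustrative assumptions, not a prescribed recipe.

```python
# A minimal sketch of variable screening before predictive modeling, assuming a
# pandas DataFrame `employees` with one row per employee, a binary `attrited`
# flag, and hundreds of candidate feature columns. Column names are hypothetical.
import pandas as pd

def screen_features(employees: pd.DataFrame, target: str = "attrited",
                    min_variance: float = 0.01, max_corr: float = 0.95) -> pd.DataFrame:
    features = employees.drop(columns=[target]).select_dtypes("number")

    # Drop near-constant columns: they add noise without predictive signal.
    variances = features.var()
    features = features.drop(columns=variances[variances < min_variance].index)

    # Drop one column from each highly correlated pair to reduce redundancy.
    corr = features.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i, col_a in enumerate(cols):
        for col_b in cols[i + 1:]:
            if corr.loc[col_a, col_b] > max_corr and col_b not in to_drop:
                to_drop.add(col_b)
    features = features.drop(columns=list(to_drop))

    return features.join(employees[target])
```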

References

Maaß, D., Spruit, M., & de Waal, P. (2014). Improving short-term demand forecasting for short-lifecycle consumer products with data mining techniques. Decision Analytics, 1(1), 1–17.

Sharma, S. A., Sharma, A. K., & Sharma, D. M. (2013). Using data mining for prediction: A conceptual analysis. I-Manager’s Journal on Information Technology, 2(1), 1–9.

Zero-Latency Data and Business Intelligence

Business intelligence enables decision-makers and stakeholders to make strategic decisions based on the information available to them. Just as the quality of the data is critical, the timeliness of the data is equally so. Laursen & Thorlund (2010) identify three types of data:

  1. Lag information. This covers what happened previously and may be used to feed predictive models attempting to create lead information. Although the data is recorded in real time (e.g., by a flight data recorder), reading and reporting from the data is done ex post facto.
  2. Real-time data. This data shows what is happening at present. Continuing the aviation example, the ADS-B pings from aircraft are real-time data points collected by receivers across the globe and fed to flight tracking sites such as FlightAware.com for real-time reporting.
  3. Lead information. This data is often yielded from predictive models created by real-time or lag information. Airlines use a combination of flight, weather, and air traffic data to project an estimated arrival time for any given commercial aircraft at a particular destination.

There are appropriate instances for all three types. Real-time data tends to be the most desired, but decreased lag and on-demand availability come with trade-offs in processing power, vulnerability to errors, and cost. Somewhere between “very old” and “absolutely immediate” is the sweet spot of timeliness and cost-efficiency. In other words, the push for zero-latency data may be more costly than profitable. Businesses must develop their own cost/benefit models to determine how real-time their BI data should be.

One area of real-time necessity is item affinity analysis. Every day on Amazon, customers order items and are presented with other items that may be relevant to their purchase, based on purchasing patterns from other customers who have ordered the same thing as well as their own purchasing history (Pophal, 2014). This data must be zero-latency, as a recommendation must be posted almost immediately after the customer makes their initial order. A lag time of minutes, hours, or days would lose the potential sale.
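
As a toy illustration of the idea (not Amazon's actual method), the sketch below counts how often items co-occur in the same order and recommends the strongest co-purchases; the order data and item names are invented.

```python
# Toy item-affinity analysis: count how often pairs of items appear in the same
# order and recommend the strongest co-purchases. Real recommender systems are
# far more sophisticated; the order data here is hypothetical.
from collections import Counter
from itertools import combinations

orders = [
    {"router", "ethernet cable", "wall mount"},
    {"router", "ethernet cable"},
    {"router", "range extender"},
    {"ethernet cable", "switch"},
]

pair_counts = Counter()
for order in orders:
    for pair in combinations(sorted(order), 2):
        pair_counts[pair] += 1

def recommend(item: str, top_n: int = 3) -> list[str]:
    """Items most frequently bought together with `item`."""
    scored = Counter()
    for (a, b), count in pair_counts.items():
        if a == item:
            scored[b] += count
        elif b == item:
            scored[a] += count
    return [other for other, _ in scored.most_common(top_n)]

print(recommend("router"))  # ['ethernet cable', 'wall mount', 'range extender']
```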

References

Laursen, G. H. N., & Thorlund, J. (2010). Business analytics for managers: Taking business intelligence beyond reporting. Wiley & SAS Business Institute.

Pophal, L. (2014). The technology of contextualized content: What’s next on the horizon? Retrieved from http://www.econtentmag.com/Articles/Editorial/Feature/The-Technology-of-Contextualized-Content-Whats-Next-on-the-Horizon-99029.htm

Online Profiling and BI

First, we must define online profiling, also known as contextual marketing in advertising circles. It is a marketing and data-gathering process that “takes into account the users’ needs, habits, and goals to create a personalised web experience” (Elhawli, 2017). While it may not appear to be what conventional wisdom defines as business intelligence, the same principles are there: data is gathered, made sense of, and served to an information consumer in order to improve company performance. The information consumer, in this case, may be either the actual user of the website or the layer of web platform sitting between the data (recommendations) and the website user.

The steps necessary for making the items actionable follow typical business intelligence processes. First, the “sheer volume of data now available to marketers” requires focusing on what data is relevant to the desired outcomes (Pophal, 2014). These outcomes and relevant data vary by market and platform. In many cases, the data itself yields important clues on what exactly is important—akin to an exploratory data analysis in traditional business intelligence implementations.

Another similarity is the set of platforms on which the data must be presented. This may be considered a parallel to the various visualization and reporting platforms available to an information consumer. There are multiple data points involving user device and platform (e.g., Firefox on macOS or Chrome on Android), which must then be used to determine how the information is best served up.
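
A deliberately simplified, hypothetical sketch of that decision follows: inspect the device and platform signals (here just a user-agent string) and pick a presentation accordingly. Production systems rely on full user-agent parsing libraries and far richer context signals.

```python
# Hypothetical sketch: choose how content is served up based on device/platform
# signals. Layout names and the matching rules are invented for illustration.
def choose_layout(user_agent: str) -> str:
    ua = user_agent.lower()
    if "android" in ua or "iphone" in ua:
        return "mobile-card"       # compact, touch-friendly recommendations
    if "ipad" in ua or "tablet" in ua:
        return "tablet-grid"
    return "desktop-full"          # full dashboard-style presentation

print(choose_layout("Mozilla/5.0 (Linux; Android 13) Chrome/120.0"))  # mobile-card
```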

These data points also influence the intake process, parallel to the ETL stage in traditional business intelligence. Imagine the “multiple inputs around consumers and the devices they’re carrying-or wearing-that receive those inputs” as systems of record contributing to a master data aggregator (Pophal, 2014). These must work in near real-time, driving outputs and giving context to other inputs.

Moving from descriptive to predictive analytics is another similarity between online profiling and conventional business intelligence. Online profiling today “is largely driven by what consumers have done”; the future “will focus on what they will do” (Pophal, 2014). The transition from descriptive to predictive is a milestone in business intelligence maturity (LaValle et al., 2011), and this is no different in online profiling. Serving up mountains of What Happened? insights can only go so far. Transitioning from that to So What? and Now What? crosses a hurdle into the next realm of usefulness. This cannot come at the expense of best practices, though—as the celebrated rise and embarrassing fall of Google Flu Trends reminds us (Lazer & Kennedy, 2015).

References

Elhawli, A. (2017, October 12). Why content & contextual marketing are key to your success. Retrieved February 2, 2019, from https://medium.com/@aelhawli1/why-content-contextual-marketing-are-key-to-your-success-4509239b71b9

LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21–31.

Lazer, D., & Kennedy, R. (2015, October 1). What we can learn from the epic failure of Google Flu Trends. Retrieved February 2, 2019, from https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/

Pophal, L. (2014). The technology of contextualized content: What’s next on the horizon? Retrieved from http://www.econtentmag.com/Articles/Editorial/Feature/The-Technology-of-Contextualized-Content-Whats-Next-on-the-Horizon-99029.htm

Target Measures, Probability Mining, and Econometric Modeling

Target measures narrow down what must be defined in business intelligence, in the sense of dashboards or key performance indicators. In my professional practice we often use the phrase boiling the ocean to caution that the scope of a business intelligence effort cannot be a broad-brush approach; target measures provide focus and scope. Common target measures are ROI, net sales, efficiency rate, response time, et cetera (Liu, Laguna, Wright, & He, 2014). Probability mining uses predictive modeling methods such as logistic regression, neural networks, multiple regression, and recursive partitioning to predict the probability of particular events happening based on previous data. Econometric modeling is a combination of statistical analysis and economic theory, based on the idea that neither alone can provide an accurate or satisfactory picture of an economic phenomenon.

In terms of business intelligence, these may be combined in a particular approach. Econometric models are more specialized and may not apply to a given instance, but target measures and probability mining almost certainly will. Target measures should usually be defined before probability mining happens, in order to maintain the correct focus and scope.

Imagine a human resources predictive model that mines a swath of employee data from SAP to find key correlations, then uses both multiple regression and neural nets to find predictors of voluntary separation. There is a target measure involved: the resulting model is applied to current employees to estimate each one’s likelihood of leaving within the next two years, with a standard error of six months.
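
A hedged sketch of that probability-mining step appears below, using logistic regression from scikit-learn; the file names and feature columns are hypothetical stand-ins for a prepared SAP extract.

```python
# Sketch of probability mining: a logistic regression that scores current
# employees on their likelihood of voluntary separation. Column names
# (`tenure_years`, `commute_miles`, `separated`, ...) are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

history = pd.read_csv("employee_history.csv")      # labeled records of past employees
current = pd.read_csv("current_employees.csv")     # unlabeled records to be scored

features = ["tenure_years", "commute_miles", "age", "manager_age_gap"]
X_train, X_test, y_train, y_test = train_test_split(
    history[features], history["separated"], test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Holdout accuracy: {model.score(X_test, y_test):.2f}")

# Probability of leaving for each current employee (the target measure).
current["p_separation"] = model.predict_proba(current[features])[:, 1]
```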

Another target measure of note is sell-through for a consumer goods corporation. Sell-in captures sales to a particular distributor; sell-through captures how much product actually got out the door and into the hands of consumers. However, not all distributors currently provide their sales data back to the company. Sell-through is a target measure because the delta between sell-in and sell-through is helpful for the business to know, and timely sell-through data can help the sales, marketing, and customer service departments adjust their approaches based on customer habits.
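
The delta itself is a simple calculation once both feeds are available; the sketch below assumes two small, hypothetical DataFrames keyed by distributor and month.

```python
# Sketch of the sell-in vs. sell-through delta, using invented figures keyed by
# distributor and month. A real feed would come from the warehouse, not inline data.
import pandas as pd

sell_in = pd.DataFrame({
    "distributor": ["A", "A", "B"], "month": ["2019-01", "2019-02", "2019-01"],
    "units_sold_in": [1000, 1200, 800],
})
sell_through = pd.DataFrame({
    "distributor": ["A", "A", "B"], "month": ["2019-01", "2019-02", "2019-01"],
    "units_sold_through": [850, 700, 790],
})

report = sell_in.merge(sell_through, on=["distributor", "month"], how="left")
report["delta"] = report["units_sold_in"] - report["units_sold_through"]
print(report)
```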

References

Liu, Y., Laguna, J., Wright, M., & He, H. (2014). Media mix modeling – A Monte Carlo simulation study. Journal of Marketing Analytics, 2(3), 173–186.

Pophal, L. (2014). The technology of contextualized content: What’s next on the horizon? Retrieved from http://www.econtentmag.com/Articles/Editorial/Feature/The-Technology-of-Contextualized-Content-Whats-Next-on-the-Horizon-99029.htm

Reiss, P. C., & Wolak, F. A. (2007). Structural econometric modeling: Rationales and examples from industrial organization. Retrieved from https://web.stanford.edu/group/fwolak/cgi-bin/sites/default/files/files/Structural%20Econometric%20Modeling_Rationales%20and%20Examples%20From%20Industrial%20Organization_Reiss,%20Wolak.pdf

Data Warehousing and Data Mining

The relationship between data mining tools and data warehousing systems can be most easily seen in the connector options of popular analytics software packages. Tableau Desktop, for example, offers dozens of source options for pulling data in from warehouse backends, and Microsoft Power BI includes similar interface options. There are countless packages in R for connecting to data warehouse backends, readily available online from proprietary and open-source vendors. Other proprietary packages such as SPSS, SAS, and JMP have similar interfaces.
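
The same pattern applies when pulling warehouse data into a scripting environment. The sketch below uses SQLAlchemy and pandas; the connection string, table, and column names are placeholders for whatever backend the warehouse actually runs on.

```python
# Minimal sketch of pulling warehouse data into an analysis environment,
# analogous to the connector options in Tableau or Power BI. The connection
# string, driver, table, and columns are placeholder assumptions.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://analyst:secret@warehouse-host:5432/edw")

sales = pd.read_sql(
    "SELECT order_date, region, net_sales FROM sales_fact WHERE order_date >= '2019-01-01'",
    engine,
)
print(sales.head())
```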

Simply put, a data mining tool enables insights into what is stored in the data warehouse, and is only as useful as the quality of the data it accesses. Power (2016) calls this discover, access, and distill. In professional practice, this author has often seen businesses focus erroneously on a particular data mining tool, believing the paid solution will provide immediate value, without ensuring the data warehouse (or equivalent) is in proper order first. Successful implementation of a data mining tool requires a number of preparatory steps, including (but not limited to):

  1. Identifying appropriate Systems of Record (SORs)
  2. Validating the SOR accuracy and alignment with business purposes
  3. Establishing a common understanding of the data points within each SOR and how they translate across business units within the organization (this often requires an organization-wide Data Governance Board or equivalent)
  4. Developing business goals, or questions the data mining tool can answer

These steps ensure the data is valid, useful, and actionable. Organizations that do not take the necessary steps to ensure data quality and develop a business case for the data mining tool run a danger of wasting time and resources on a solution in search of a problem (Gudfinnsson, Strand, & Berndtsson, 2015; LaValle, Lesser, Shockley, Hopkins, & Kruschwitz, 2011).

Consider an international manufacturing company that currently uses a number of disparate systems of record for its business: Cognos (not a true SOR but regarded as one), AS400, 3PL, SQL, Informix, and multiple warehouse management systems. The company does not have a unified data warehouse or data governance procedures in place. In the current state, business units that use different systems of record are unable to work together from a common understanding of the data. Attempts at data mining and even simple reporting have failed across business units because of the quality of the data—for example, sales forecasting does not translate between Finance and Marketing because the basic figures from the disparate systems of record do not match. There can be no useful data mining from this data without significant transformation.

Assuming the foundational steps are done and the data mining tool is in production, new data points can be put back into the warehouse based on discovered insights. For example, consider a multi-level marketing company that has a number of data points on its associates: units sold, associates recruited, years in the program, rewards program tier, et cetera. They know the associates can be grouped into performance categories akin to “novice” and “expert” but are unclear on both how many categories to look at and what factors are important. Principal components analysis and k-means clustering can reveal how the associates differentiate themselves based on the available variables and suggest an appropriate number of categories within which to classify them. These classifications can be put back into the data warehouse and used as covariates in other analysis work.
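
A minimal sketch of that write-back step is shown below, using a local SQLite file as a stand-in for the real warehouse; the table and column names are hypothetical.

```python
# Sketch of writing discovered segment labels back into the warehouse so they
# can serve as covariates in later analyses. The associate data and the SQLite
# target are placeholders for the real warehouse tables and connection.
import pandas as pd
from sqlalchemy import create_engine

segments = pd.DataFrame({
    "associate_id": [101, 102, 103],
    "segment": ["novice", "expert", "novice"],    # labels produced by clustering
})

engine = create_engine("sqlite:///warehouse.db")  # stand-in for the real backend
segments.to_sql("associate_segments", engine, if_exists="replace", index=False)
```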

References

Brownlee, J. (2016, September 22). Supervised and unsupervised machine learning algorithms.  Retrieved from https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

Gudfinnsson, K., Strand, M., & Berndtsson, M. (2015). Analyzing business intelligence maturity. Journal of Decision Systems, 24(1), 37–54. doi:10.1080/12460125.2015.994287

LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21–31.

Power, D. J. (2016). Data science: Supporting decision-making. Journal of Decision Systems, 25(4), 345–356.

Soni, D. (2018, March 22). Supervised vs. Unsupervised learning – towards data science.  Retrieved from https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Tableau Desktop 2018.2 [Computer software]. (2018). Retrieved from http://www.tableau.com.

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

Machine Learning: Supervised and Unsupervised

Supervised learning typically takes the form of classification or regression. We know the input and output variables and try to make sense of the relationship between the two. Tembhurkar, Tugnayat, and Nagdive (2014) refer to this as descriptive mining. Common methods include decision trees, the k-nearest-neighbor (kNN) algorithm, regression, and discriminant analysis. The choice of method depends on the type of the target variable: a continuous target calls for regression methods, while a discrete target calls for classification methods.

For example, a human resources division in a large multinational company wants to determine what factors have contributed to employee attrition over the past two years. A decision tree methodology can produce a simple “if-then” map of what attributes combine and result in a separated employee. An example tree might point out that a male employee over the age of 45, working in Division X, who commutes more than 25 miles from home, has a manager 10 years or more his junior, and has been in the same unit for more than seven years is a prime candidate for attrition. Although many of the variables are continuous, a decision tree method makes the data manageable and actionable for human resources division use.
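
A hedged sketch of that approach with scikit-learn follows; the CSV extract and feature names are hypothetical, and the point is simply that the fitted tree can be printed as readable if-then rules for the HR team.

```python
# Sketch of a decision-tree model for employee attrition. The CSV extract and
# the feature columns are hypothetical stand-ins for the prepared HR data.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.read_csv("hr_attrition.csv")
features = ["age", "commute_miles", "manager_age_gap", "years_in_unit"]

tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=50, random_state=0)
tree.fit(data[features], data["attrited"])

# Print the if-then rules the tree learned, e.g.
# "if commute_miles > 25 and years_in_unit > 7 then attrited".
print(export_text(tree, feature_names=features))
```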

Unsupervised learning usually takes the form of clustering or association. The output variables are not known, and we rely on the algorithm to make sense of the data with no a priori knowledge. Tembhurkar et al. (2014) refer to this as prescriptive mining. Common methods include neural networks, anomaly detection, k-means clustering, and principal components analysis. The choice of method depends on the data: association methods typically work on discrete, transactional data, while clustering methods operate on continuous features.

For example, a multi-level marketing company has a number of data points on its associates: units sold, associates recruited, years in the program, rewards program tier, et cetera. They know the associates can be grouped into performance categories akin to novice and expert but are unclear on both how many categories to look at and what factors are important. Principal components analysis and k-means clustering can reveal how the associates differentiate themselves based on the available variables and suggest an appropriate number of categories within which to classify them.
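
A minimal sketch of that workflow appears below: standardize the associate metrics, inspect the principal components, and scan several cluster counts so the inertia "elbow" can suggest how many categories to keep. The column names are hypothetical.

```python
# Sketch of the unsupervised workflow: standardize the metrics, run PCA to see
# which dimensions carry the variance, then scan k-means cluster counts.
# The CSV and column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

associates = pd.read_csv("associates.csv")
metrics = ["units_sold", "associates_recruited", "years_in_program", "rewards_tier"]

X = StandardScaler().fit_transform(associates[metrics])

pca = PCA(n_components=2).fit(X)
print("Variance explained:", pca.explained_variance_ratio_)

# Scan a range of k values; the "elbow" in inertia suggests a category count.
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))

associates["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```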

References

Brownlee, J. (2016, September 22). Supervised and unsupervised machine learning algorithms.  Retrieved from https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

Soni, D. (2018, March 22). Supervised vs. Unsupervised learning – towards data science.  Retrieved from https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

Corporate Information Factories and Business Dimensional Models

Differentiating between a Corporate Information Factory (CIF) and a Business Dimensional Model (BDM) may come down to two different directions of strategic thought: top-down (CIF) or bottom-up (BDM).

In the BDM, otherwise known as the Kimball approach, data remain in their respective logical business units (e.g., Sales or Production) but are brought together into the data warehouse through a commonly defined bus architecture. This approach is most prevalent in the Microsoft BI stack. Star or snowflake schemas are utilized, and data are rarely normalized past 1NF, if at all. The logical focus is on the originating business units, and the goal is often to allow these units to share data across the organization more effectively. For presentation, fewer queries and joins are needed than would be required to make sense of CIF data.

The CIF, or Inmon approach, starts with the central data repository as the unit of focus, as opposed to the individual business units. The business units can create data marts from the normalized tables. Third normal form is required. The most apparent disadvantage here is the amount of time and thought required to implement a true CIF, but the resulting product is a true enterprise information factory. More joins are needed, though, to put the data into presentable form.

Where Extract-Transform-Load or Extract-Load-Transform is concerned, the former (ETL) is the more conventional understanding of the process and is typically implemented in dimensional modeling. The transformation happens before the data reaches the target system, and the data is already logically arranged—to some degree—by business unit or purpose. The latter (ELT) is utilized most often in more powerful analytics implementations or data lakes.
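
The contrast is easiest to see in miniature. The toy sketch below uses pandas and SQLite as stand-ins for a real integration tool and target warehouse: in the ETL path the cleansing happens before the load, while in the ELT path the raw data lands first and is shaped inside the target system.

```python
# Toy contrast between ETL and ELT. pandas plays the integration tool and an
# in-memory SQLite database plays the warehouse; the sales records are invented.
import pandas as pd
import sqlite3

raw = pd.DataFrame({"region": ["east ", "West", None], "net_sales": [100, 250, 75]})
conn = sqlite3.connect(":memory:")

# ETL: transform first, then load the conformed table.
cleaned = (raw.dropna(subset=["region"])
              .assign(region=lambda d: d["region"].str.strip().str.title()))
cleaned.to_sql("sales_conformed", conn, index=False)

# ELT: load the raw data as-is, then transform with the target system's engine.
raw.to_sql("sales_raw", conn, index=False)
conn.execute("""
    CREATE TABLE sales_shaped AS
    SELECT UPPER(TRIM(region)) AS region, net_sales
    FROM sales_raw
    WHERE region IS NOT NULL
""")
```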

References

Bethke, U. (2017, May 15). Dimensional modeling and Kimball data marts in the age of big data and Hadoop.  Retrieved from https://sonra.io/2017/05/15/dimensional-modeling-and-kimball-data-marts-in-the-age-of-big-data-and-hadoop/

Harris, D. (n.d.). ETL vs. ELT: How to choose the best approach for your data warehouse. Retrieved from https://www.softwareadvice.com/resources/etl-vs-elt-for-your-data-warehouse/

Kajeepeta, S. (2010, June 7). Is it time to switch to ELT? Intelligent Enterprise – Online. Retrieved from https://proxy.cecybrary.com/login?url=https://search.proquest.com/docview/365390283?accountid=144789

Kumar, G. (2017, Mar 14). Dimensional modelling vs corporate information factory. Retrieved from http://www.data-design.org/blog/dimensional-modelling-vs-corporate-information-factory

Decision Support Systems, Data Warehouses, and OLAP Cubes

As Tembhurkar, Tugnayat, and Nagdive define it, BI is “a collection of tools and techniques [that] transforms raw data into significant information useful for analyzing business systems” (2014, p. 128). BI has evolved from the earlier incarnations of Decision Support Systems, which served the same purpose(s) but were much more rudimentary compared to today’s implementations. These DSS solutions often consisted of data warehouses (DWs) and online analytical processing (OLAP) engines. The two components worked together to serve the business’s needs: the data warehouse handled ETL and storage, while the OLAP system handled front-end analysis.

The data warehouse serves as the central repository for multiple systems of record, which are often heterogeneous and disparate in the beginning. Data is typically replicated and stored in subject-area schemas (e.g., sales or employee data), most typically in fact and dimension tables as part of a SQL-backed relational database. The data warehouse itself can offer pre-packaged views and data marts, and it supports the OLAP system. Like the OLAP system in its original form, the data warehouse is starting to be eclipsed by data lakes in enterprise environments that deal with large amounts of heterogeneous data, often including unstructured data. The difference between the two, for purposes of this comparison, is where the “T” (transformation) falls in ETL or ELT. In a data warehouse, the transformation happens before loading into the warehouse, as its purpose is to serve as a central common repository. In a data lake, the transformation happens after loading, as the lake does not impose any schemas or restrictions in order to achieve any kind of homogeneous state.

The OLAP system is multi-dimensional, not unlike a three-dimensional spreadsheet. It is not a relational database but enables the analysis of the data in the data warehouse. The OLAP system enables what we typically understand as slicing and dicing the data. While these were sufficient in the early days of BI, the shift towards a DevOps culture and the proliferation of machine learning, predictive analysis, dashboarding, and envelope-pushing analytics capabilities have required more from a BI solution than rigid OLAP cubes.
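
For readers who have not worked with a cube directly, a pandas pivot table makes a serviceable stand-in for the slice-and-dice idea; the sales records below are invented.

```python
# Illustration of "slice and dice" using a pandas pivot table as a stand-in for
# an OLAP cube. The sales records are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2018, 2018, 2019, 2019, 2019],
    "region":  ["East", "West", "East", "West", "West"],
    "product": ["A", "A", "B", "B", "A"],
    "revenue": [100, 150, 120, 90, 200],
})

# "Dice" the data into a year x region view, summing over the product dimension.
cube = sales.pivot_table(values="revenue", index="year",
                         columns="region", aggfunc="sum")
print(cube)

# "Slice" a single member of the product dimension.
print(sales[sales["product"] == "A"].groupby("region")["revenue"].sum())
```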

References

Felzke, M. (2014). Data warehouse vs. OLAP cube. Retrieved from https://www.solverglobal.com/blog/2014/04/data-warehouse-vs-olap-cube/

Harris, D. (n.d.). ETL vs. ELT: How to choose the best approach for your data warehouse. Retrieved from https://www.softwareadvice.com/resources/etl-vs-elt-for-your-data-warehouse/

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

NXD and RDBMS Solutions

Comparing native XML database (NXD) and relational DBMS solutions is close to comparing apples and oranges. Both are spherical fruit, but they have very different flavors, applications, and characteristics. The RDBMS has been around for a long time and is much more established than the NXD; as a result, there is less collective knowledge around NXD and its implementations. RDBMS solutions are practically ubiquitous and have a number of different implementations, both open-source and proprietary. Tables are normalized and typically in a fact/dimension model or star schema.

On the other hand, comparative NXD solutions rely on containers and documents in a simple tree structure. Complex joins and queries that are allowable in an RDBMS are typically more difficult in an NXD (Pavlovic-Lazetic, 2007). One area in which NXD shows promise is Web-enabled data warehousing (Salem, Boussaïd, & Darmont, 2013). Bringing multiple sources of unstructured and structured data together in an Active XML Repository addresses data heterogeneity, distribution, and interoperability issues.

A typical RDBMS implementation for business is a data warehouse in which structured data from various systems of record are brought into a common area and reconciled. These other systems of record may include proprietary relational database systems, mainframe non-relational databases, data exported to delimited formats, et cetera. A data dictionary may be maintained and reconciliation policies may be drawn up by a central data governance board. The output from this data warehouse allows users from different divisions, using different systems of record, to work from a common organization-wide data taxonomy.

One possible NXD solution involves an IoT data environment. Imagine a number of environmental sensors (e.g., temperature, humidity, pressure) being read at regular intervals and pushed to a central web location. In a typical XML tree structure, readings from each sensor or central controller (handling multiple sensors) could be placed in an XML document. This data does not require complex joins and is much better suited to an NXD solution.
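
A sketch of such a sensor document, built with Python's standard library, is shown below; the element and attribute names are hypothetical, and the resulting document could be stored directly in an NXD container.

```python
# Sketch of a tree-structured sensor document for the IoT scenario above.
# Element and attribute names are hypothetical.
import xml.etree.ElementTree as ET

controller = ET.Element("controller", id="greenhouse-01")
for sensor_id, kind, value in [("t1", "temperature", "21.4"),
                               ("h1", "humidity", "63"),
                               ("p1", "pressure", "1013")]:
    reading = ET.SubElement(controller, "reading", sensor=sensor_id, type=kind)
    reading.text = value

# Serialize the document; an NXD would store it as-is, no joins required.
print(ET.tostring(controller, encoding="unicode"))
```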

References

Pavlovic-Lazetic, G. (2007). Native XML databases vs. relational databases in dealing with XML documents. Kragujevac Journal of Mathematics, 30, 181–199.

Salem, R., Boussaïd, O., & Darmont, J. (2013). Active XML-based web data integration. Information Systems Frontiers, 15(3), 371–398.

 

Comparing ADO.NET and Fusion Middleware

ADO.NET provides a distinct Connected and Disconnected layer for data access via the Web, allowing for the unpredictability of internet connections. It is optimized for SQL, being a Microsoft product, and has extensive XML support. In that respect it may act as a bridge between traditional RDBMS systems and XML, which is a material concern for Web-DBMS implementations (Barik & Ramesh, 2011). ADO.NET is feature-rich and has many uses, but the trade-off here is a relatively steep learning curve and complex customization.

Oracle’s Fusion Middleware is a much more sprawling collection of enterprise-level resources that provides a number of distinct applications and frameworks at the web, middle, and data tiers. At the core of the middle tier is the WebLogic server, which is Java-based and runs a number of open-standard technologies. There are a number of additional components in the collection, but of particular relevance in relation to ADO.NET is the Service-Oriented Architecture (SOA) Suite, which acts as an incrementally available bridge between heterogeneous systems.

As complex and comprehensive as the Oracle suite is, there are of course trade-offs. Successfully implemented, it is an efficient solution that can incorporate multiple different systems and move information effectively across the solution. However, developers and resources are relatively scarce compared with Microsoft ADO.NET, and it is a much bigger capital and operational investment.

These two may be in consideration where Web-DBMS data access is concerned. Neither is an actual programming language; rather, they should be viewed as middle-tier solutions that enable data access in a Web environment. The scale at which these solutions will be deployed is a primary consideration. ADO.NET is a more concise product that functions well within an existing Microsoft ecosystem, whereas Fusion Middleware is a much broader set of tools better suited to an enterprise environment. That isn’t to say Fusion Middleware cannot be suited to smaller environments.

References

Barik, P. K., & Ramesh, D. B. (2011). Design and development of virtual distributed database for web-based library resource sharing network for Orissa technical and management institutions. International Journal of Information Dissemination and Technology, 1(1), 51.

Connolly, T., & Begg, C. (2015). Database systems: A practical approach to design, implementation, and management (6th ed.). London, UK: Pearson.

Oracle. (n.d.). Oracle Fusion Middleware. Retrieved from https://www.oracle.com/technetwork/middleware/fusion-middleware/overview/index.html