From Decision Support to Business Intelligence

Decision support systems (DSS) predate business intelligence (BI) by several decades. Sprague and Carlson (1982) define a DSS as a “class of information system that draws on transaction processing systems and interacts with the other parts of the overall information system to support the decision-making activities of managers and other knowledge workers in organisations.” This definition is very nearly interchangeable with that of a business intelligence system. A DSS is better thought of as a framework and model than as a particular software package. These systems are often aided by computing resources such as databases and online analytical processing (OLAP), but they may also be offline. Any DSS involves a data or knowledge base, a set of business rules, and the interface itself. A DSS may be classified by one of the following drivers (Power, 2000):

  • Communication-Driven
  • Data-Driven
  • Document-Driven
  • Knowledge-Driven
  • Model-Driven

Business intelligence can be viewed as either the successor to DSS or its parent; I prefer to see it as a hybrid. As methods of collecting, storing, viewing, and analyzing data became more advanced, the DSS came to be a specific component within a larger BI framework. A DSS is always dependent on “access to accurate, well-structured, and organized data” (Felsberger, Oberegger, & Reiner, 2016, p. 3). The various functions of business intelligence that have grown up in recent years all serve to support the data points feeding the DSS.

In a manufacturing environment, a practical example might be the evaluation and assignment of work centers. The knowledge base may include data such as required inputs, required outputs, and the constraints in place. Production and diagnostic data from the different work centers would be integrated via the organization's BI capabilities, along with forecasted production and schedule data. Business rules such as employee labor hours and machine lifecycle may also be included. The DSS would use all of these data points to drive outputs; in this case, the desired outputs are the production labor and machine schedules that are most efficient for the company.
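
To make the idea concrete, here is a minimal sketch of such decision logic in Python, using invented work-center and job data: each job is routed to a compatible work center with enough remaining capacity, and anything that cannot be placed is flagged for review. A real DSS would pull these inputs from the warehouse and apply far richer business rules.

```python
# Minimal sketch of rule-based work-center assignment (hypothetical data).
# Each job goes to the compatible work center with the most remaining hours;
# jobs that cannot be placed are flagged for manual review.

work_centers = {                      # business rules: weekly capacity in hours
    "WC-100": {"capacity_hours": 80, "skills": {"milling", "drilling"}},
    "WC-200": {"capacity_hours": 60, "skills": {"assembly"}},
    "WC-300": {"capacity_hours": 40, "skills": {"milling", "assembly"}},
}

jobs = [                              # knowledge base: what must be produced
    {"job": "J1", "hours": 30, "skill": "milling"},
    {"job": "J2", "hours": 50, "skill": "assembly"},
    {"job": "J3", "hours": 45, "skill": "milling"},
    {"job": "J4", "hours": 25, "skill": "assembly"},
]

schedule, unassigned = {}, []
remaining = {wc: info["capacity_hours"] for wc, info in work_centers.items()}

for job in sorted(jobs, key=lambda j: j["hours"], reverse=True):
    # candidates must have the required skill and enough remaining hours
    candidates = [wc for wc, info in work_centers.items()
                  if job["skill"] in info["skills"] and remaining[wc] >= job["hours"]]
    if candidates:
        best = max(candidates, key=lambda wc: remaining[wc])
        schedule[job["job"]] = best
        remaining[best] -= job["hours"]
    else:
        unassigned.append(job["job"])

print(schedule)      # e.g. {'J2': 'WC-200', 'J3': 'WC-100', 'J1': 'WC-300'}
print(unassigned)    # e.g. ['J4'] -> flagged for the scheduler to review
```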

References

Felsberger, A., Oberegger, B., & Reiner, G. (2016). A review of decision support systems for manufacturing systems.

Power, D. J. (2000). Web-based and model-driven decision support systems: Concepts and issues. In Proceedings of the Americas Conference on Information Systems, Long Beach, California.

Sprague, R. H., & Carlson, E. D. (1982). Building effective decision support systems. Prentice Hall Professional Technical Reference.

2PC and 3PC (Commit Protocols) in DBMS

The Two-Phase Commit (2PC) and Three-Phase Commit (3PC) protocols are popular in distributed DBMS environments because they guarantee that either all nodes commit a transaction or none of them do. It is an all-or-nothing proposition. Both protocols share a prepare (voting) phase and a commit/abort phase, but 3PC adds a pre-commit phase in which every participating node must acknowledge the pending commit before it is actually carried out. Compared to 3PC, Two-Phase Commit may be characterized as sending the command and hoping for the best, since the bulk of the transaction (the instructions for what to actually do) is transmitted with the commit phase. The return message from each participant determines the global commit or abort status. The extra pre-commit step in 3PC is intended to clear up global commit/abort failures and blocking; it polls for availability before anything is done, and the nodes can “act independently in the event of a failure” (Connolly & Begg, 2015). This is an important distinction. In 2PC, a single abort vote or missing acknowledgement undoes the entire process. In 3PC, assuming the pre-commit phase came back with a global commit decision, even a timeout or network partition would not cause a global abort.
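
A bare-bones sketch of the 2PC coordinator logic illustrates the all-or-nothing decision. The participant objects and vote values here are hypothetical; a real implementation would also handle write-ahead logging, timeouts, and recovery, and 3PC would insert a pre-commit round between the two phases shown.

```python
# Bare-bones sketch of a two-phase commit coordinator (hypothetical participants).
# Phase 1: ask every participant to prepare (vote). Phase 2: commit only if the
# vote is unanimous; a single "no" (or missing reply) forces a global abort.

class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit

    def prepare(self):               # phase 1: do local work, return a vote
        return self.will_commit

    def commit(self):
        print(f"{self.name}: committed")

    def abort(self):
        print(f"{self.name}: rolled back")


def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]      # voting phase
    if all(votes):                                   # decision phase
        for p in participants:
            p.commit()
        return "GLOBAL COMMIT"
    for p in participants:
        p.abort()
    return "GLOBAL ABORT"


nodes = [Participant("node-A"), Participant("node-B"),
         Participant("node-C", will_commit=False)]
print(two_phase_commit(nodes))                       # -> GLOBAL ABORT
```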

Termination, according to Connolly & Begg (2015), is where the differences between these protocols are most critical. In 2PC it is possible to block: after the vote, the participants wait on a commit or abort message from the coordinator before making the global commit. If a partition occurs, they are stuck until the coordinator re-establishes communication. A power failure is more catastrophic, as it may take down multiple nodes and the coordinator. In both 2PC and 3PC, backup procedures are activated, but 2PC participants remain in a blocked state. There are trade-offs overall, of course. The major issue with 3PC is the communication overhead, which is to be expected with the extra phase (Kumar, 2016).

References

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Kumar, M. (2016). Commit protocols in distributed database system: A comparison. International Journal for Innovative Research in Science & Technology, 2(12), 277-281.

Concurrency: Optimistic or Pessimistic?

Optimistic concurrency control is the more complex of the two concurrency control methods. The start of a transaction is timestamped, the work is performed, and the change is then validated. If another transaction has modified the same record and committed since this transaction’s start time, this transaction is aborted. In other words, the original record is no longer current because someone else got to it first and completed their transaction. The risk here is working from a read that has since gone stale, as it is possible for more than one person to have access to a record at a time. Change validation is done at the end of the transaction block.
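
A toy illustration of that validation step, using a hypothetical in-memory record store with version numbers, looks like this:

```python
# Toy illustration of optimistic concurrency control (hypothetical in-memory store).
# Each record carries a version number; a write is validated at commit time and
# aborted if someone else committed a change after our read.

store = {"acct-42": {"balance": 100, "version": 1}}

def read(key):
    rec = store[key]
    return rec["balance"], rec["version"]        # remember the version we saw

def commit(key, new_balance, version_seen):
    rec = store[key]
    if rec["version"] != version_seen:           # validation: someone beat us to it
        raise RuntimeError("conflict detected - transaction aborted, retry")
    rec["balance"] = new_balance
    rec["version"] += 1                          # bump version for the next reader

balance, version = read("acct-42")
commit("acct-42", balance + 50, version)          # succeeds: version still matches

balance, version = read("acct-42")
store["acct-42"]["version"] += 1                  # simulate a competing committed write
try:
    commit("acct-42", balance - 20, version)      # validation fails
except RuntimeError as err:
    print(err)                                    # conflict detected - transaction aborted
```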

Conservative (or pessimistic) concurrency control is akin to checking out a book at the library, and it is the simpler of the two methods. Once a transaction begins, the record is locked, and no one else can modify it. In the library example, I go to the library to check out a book (the record) to read it (modify it); if it is there (no one has initiated a change), I may check it out. If the book is not there (someone has locked it and is modifying it), I cannot check it out. It is a first-come, first-served method that ensures no two people modify a record at the same time.
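
By contrast, a minimal sketch of the pessimistic checkout model, using a hypothetical per-record lock table, turns the second writer away up front instead of validating afterward:

```python
# Minimal sketch of pessimistic (lock-based) concurrency control.
# The first writer takes the record's lock; anyone else is refused (or made to
# wait) until the lock is released - the library checkout model described above.

import threading

locks = {"acct-42": threading.Lock()}             # hypothetical per-record lock table

def try_update(key, who, new_value, store):
    lock = locks[key]
    if not lock.acquire(blocking=False):          # the "book" is already checked out
        print(f"{who}: {key} is locked - cannot modify")
        return False
    try:
        store[key] = new_value                    # exclusive access while holding the lock
        print(f"{who}: updated {key} to {new_value}")
        return True
    finally:
        lock.release()                            # return the book to the shelf

store = {"acct-42": 100}
locks["acct-42"].acquire()                        # user-1 checks the record out first...
try_update("acct-42", "user-2", 80, store)        # ...so user-2 is turned away
locks["acct-42"].release()                        # user-1 finishes and releases
try_update("acct-42", "user-2", 80, store)        # now the update succeeds
```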

Each method has its risks and rewards. Optimistic concurrency control tends to be used in environments without much contention for a single record of truth, and it allows a higher volume of transactions per hour. However, as the name implies, the method essentially hopes for the best and then deals with conflicts if and when they arise. Pessimistic concurrency control, on the other hand, virtually guarantees that all transactions will be executed correctly and that the database remains stable. It is a simpler decision tree: abort if locked, commit if unlocked. The drawbacks of pessimistic concurrency control all lie in timing: fewer transactions per hour and limited access to the data, depending on the number of users making transactions.

One specific advantage of optimistic locking that is not always thought of immediately shows up when a user cannot maintain a consistent connection to the database. Assume for a moment that a user locks a table in a remote database for updating and the connection is severed (through a server reset, ISP trouble, et cetera). The user reconnects and is back in the database. However, the previous session was not properly closed, so we now have a phantom user with the record still locked open. Optimistic control avoids this entirely, because no lock was ever held.

Reference

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Database Indexing and Clusters

A useful analogy for database indexing is highlighting an article or book. We highlight or mark passages so they are easier to find later, when we pick up the book and want to locate something quickly. Likewise, an index provides quick access to key information in a database table. It would be silly to highlight an entire book or mark up all of its pages; by the same token, it would be functionally useless to index a large number of columns in a table. There is a point of diminishing returns here.

It is generally recommended to index columns that are involved in WHERE or JOIN clauses (Larsen, 2010). These columns are frequently sought out by multiple query operations and are typically as critical to the table as the primary key. It is important to choose wisely here, because every insert, update, or delete on the table also requires an index update. That work multiplies when several indexes are placed on a single table. Again, we come back to the principle of diminishing returns.
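
The effect is easy to demonstrate with SQLite (the table and column names here are purely illustrative): once the column used in the WHERE clause is indexed, the optimizer switches from a full table scan to an index search.

```python
# Quick SQLite illustration: indexing the column used in the WHERE clause
# changes the query plan from a full table scan to an index search.
# Table and column names are illustrative only.

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(i, i % 500, i * 1.5) for i in range(10_000)])

query = "SELECT total FROM orders WHERE customer_id = ?"

print(cur.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
# plan before indexing: a full scan of the orders table

cur.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

print(cur.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
# plan after indexing: a search using idx_orders_customer (customer_id=?)
```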

There is also the matter of choosing between clustered and nonclustered indexes. The former typically reads like browsing through a telephone directory: in order. Primary keys are commonly used for clustered indexes. One drawback here is the need to re-order the index when information is updated, added, or deleted. A nonclustered index, on the other hand, operates much like the index in the back of a textbook, or like a dimension table in a star-schema database. While the latter may seem more advantageous at all times, it really shines when values are constantly updated. In situations where most of the table data is returned in a query, or a primary key is the natural identifier, a clustered index is the better choice.

References

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Larsen, G. A. (2010). The dos and don’ts of database indexing. Retrieved from https://www.databasejournal.com/features/mssql/article.php/3860851/The-Dos-and-Donts-of-Database-Indexing.htm

Wagner, B. (2017). Clustered vs nonclustered: What index is right for my data? Retrieved from https://hackernoon.com/clustered-vs-nonclustered-what-index-is-right-for-my-data-717b329d042c

Data Mining: What’s Ahead

I’ve written on Data Mining before, as it is a fundamental step for higher-order predictive and prescriptive analytics work. Enterprise data warehouses are a trove of useful information, and data mining methods help to separate what is useful from what is not (Sharma, Sharma, & Sharma, 2013). Data mining is itself an analysis method; that is, “the analysis of data that was collected for other purposes but not the questions to be answered through the data mining process” (Maaß, Spruit, & de Waal, 2014, p. 2). Data mining takes on the unknown-unknowns of the dataset and begins to make sense of the vast amount of data points available. It involves both data transformation and reduction. These are necessary as “prediction algorithms have no control over the quality of the features and must accept it as a source of error” (Maaß, Spruit, & de Waal, 2014, p. 6). Data mining reduces the noise and eliminates the dilution of relevant data by irrelevant covariates. It provides the business intelligence framework with usable data and a minimum of error.
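
As a small, hypothetical illustration of the reduction step, the sketch below drops near-constant columns and one of each highly correlated pair before any modeling; the column names and thresholds are invented for the example.

```python
# Hypothetical illustration of data reduction before modeling: drop near-constant
# columns and one of each highly correlated pair so the surviving features carry
# real signal into downstream prediction work.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "units_sold": rng.integers(50, 500, 1000),
    "returns":    rng.integers(0, 20, 1000),
    "plant_code": 1,                                   # constant -> no information
})
df["units_shipped"] = df["units_sold"] + rng.integers(0, 5, 1000)  # near-duplicate column

# 1) drop columns with (near) zero variance
low_variance = [c for c in df.columns if df[c].nunique() <= 1]

# 2) drop one column from each highly correlated pair (|r| > 0.95)
corr = df.drop(columns=low_variance).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]

reduced = df.drop(columns=low_variance + redundant)
print(reduced.columns.tolist())    # e.g. ['units_sold', 'returns']
```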

Tembhurkar, Tugnayat, & Nagdive (2014) outline five stages for successful data-to-BI transformation:

  1. Collection of raw data
  2. Data mining and cleansing
  3. Data warehousing
  4. Implementation of BI tools
  5. Analysis of outputs (p. 132).

Given the importance of Data Mining in the BI process, I do not see it going away or diminishing in stature. In fact, more attention may be coming to it because of the growing interest in data lakes and ELT over ETL (e.g., Meena & Vidhyameena, 2016; Rajesh & Ramesh, 2016). Increased attention will be paid to mining and cleansing practices. New developments will include advances in unstructured data, IoT data, distributed systems data mining, and NLP/multimedia data mining.

References

Maaß, D., Spruit, M., & de Waal, P. (2014). Improving short-term demand forecasting for short-lifecycle consumer products with data mining techniques. Decision Analytics, 1(1), 1–17.

Meena, S. D., & Vidhyameena, S. (2016). Data lake – a new data repository for big data analytics workloads. International Journal of Advanced Research in Computer Science, 7(5), 65-67.

Rajesh, K. V. N., & Ramesh, K. V. N. (2016). An introduction to data lake. i-Manager’s Journal on Information Technology, 5(2), 1-4.

Sharma, S. A., Sharma, A. K., & Sharma, D. M. (2013). Using Data Mining for Prediction: A Conceptual Analysis. I-Manager’s Journal on Information Technology, 2(1), 1–9.

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

Data Mining and the Enterprise BI Long Game

Data mining provides the foundational work for higher-order predictive and prescriptive analytics. Enterprise data warehouses are a trove of useful information, and data mining methods help to separate what is useful from what is not (Sharma, Sharma, & Sharma, 2013). Data mining is itself an analysis method; that is, “the analysis of data that was collected for other purposes but not the questions to be answered through the data mining process” (Maaß, Spruit, & de Waal, 2014, p. 2). Data mining takes on the unknown-unknowns of the dataset and begins to make sense of the vast amount of data points available. It involves both data transformation and reduction. These are necessary as “prediction algorithms have no control over the quality of the features and must accept it as a source of error” (Maaß, Spruit, & de Waal, 2014, p. 6).


What is produced from these data mining efforts is a set of relevant data points that can be used for aggregate, predictive, and prescriptive analysis in the enterprise’s business intelligence platform(s). This is no different from avoiding the “garbage in, garbage out” trap of simple reporting and visualization. Data mining reduces the noise and eliminates the dilution of relevant data by irrelevant covariates. It provides the business intelligence framework with usable data and a minimum of error.

For example, if I were to embark on a predictive modeling project to determine what factors influenced employee attrition at a large manufacturing company over the last five years, I would first want to do extensive data mining on the raw dataset. With over 20,000 employees across every continent and hundreds of data points per employee, a rigorous data mining phase eliminates the variables that would introduce error into any predictive model, such as a decision tree or multiple regression.
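
A compressed sketch of that modeling step on synthetic stand-in data follows; the feature names and the rule generating the labels are invented, and a real project would of course begin with mining and validating actual HR records.

```python
# Compressed sketch of an attrition model on synthetic stand-in data; a real
# project would first mine and validate actual HR records as described above.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 5000
tenure_years = rng.uniform(0, 20, n)
overtime_hrs = rng.uniform(0, 15, n)
pay_ratio    = rng.uniform(0.7, 1.3, n)           # pay vs. market rate (invented)

# synthetic ground truth: short tenure, heavy overtime, low pay -> more attrition
logit = -1.0 - 0.15 * tenure_years + 0.20 * overtime_hrs - 2.0 * (pay_ratio - 1.0)
left_company = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([tenure_years, overtime_hrs, pay_ratio])
X_train, X_test, y_train, y_test = train_test_split(X, left_company, random_state=0)

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
print("holdout accuracy:", round(model.score(X_test, y_test), 3))
print("feature importances:", model.feature_importances_.round(3))
```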

References

Maaß, D., Spruit, M., & de Waal, P. (2014). Improving short-term demand forecasting for short-lifecycle consumer products with data mining techniques. Decision Analytics, 1(1), 1–17.

Sharma, S. A., Sharma, A. K., & Sharma, D. M. (2013). Using Data Mining for Prediction: A Conceptual Analysis. I-Manager’s Journal on Information Technology, 2(1), 1–9.

Zero-Latency Data and Business Intelligence

Business intelligence enables decision-makers and stakeholders to make strategic decisions based on the information available to them. Just as the quality of the data is critical, the timeliness of the data is equally so. Laursen & Thorlund (2010) identify three types of data:

  1. Lag information. This covers what happened previously and may be used to feed predictive models attempting to create lead information. Although the data is recorded in real time (e.g., by a flight data recorder), reading and reporting from the data is done ex post facto.
  2. Real-time data. This data shows what is happening at present. Continuing the aviation example, the ADS-B pings from aircraft are real-time data points collected by receivers across the globe and fed to flight-tracking sites such as FlightAware.com for real-time reporting.
  3. Lead information. This data is often yielded from predictive models created by real-time or lag information. Airlines use a combination of flight, weather, and air traffic data to project an estimated arrival time for any given commercial aircraft at a particular destination.

There are appropriate instances for all three types. Real-time data tends to be the most desired, but with decreased lag and immediate demand comes a trade-off in processing power, vulnerability to errors, and cost. Somewhere between “very old” and “absolutely immediate” lies the sweet spot of timeliness and cost-efficiency. In other words, the push for zero-latency data may be more costly than profitable. Businesses must develop their own cost/benefit models to determine how real-time their BI data should be.

One area of real-time necessity is item affinity analysis. Every day on Amazon, customers order items and are presented with other items that may be relevant to their purchase, based on purchasing patterns from other customers who have ordered the same thing as well as their own purchasing history (Pophal, 2014). This data must be zero-latency, as a recommendation must be posted almost immediately after the customer makes their initial order. A lag time of minutes, hours, or days would lose the potential sale.
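
At its simplest, the underlying affinity calculation is a co-occurrence count, as in the toy sketch below with made-up orders. Amazon's production systems are far more sophisticated; the real challenge is serving such recommendations at near-zero latency.

```python
# Toy item-affinity calculation over made-up orders: count how often items are
# purchased together, then recommend the top co-occurring items.

from collections import defaultdict
from itertools import combinations

orders = [
    {"router", "ethernet cable", "switch"},
    {"router", "ethernet cable"},
    {"laptop", "laptop sleeve", "mouse"},
    {"router", "switch"},
    {"laptop", "mouse"},
]

co_occurrence = defaultdict(lambda: defaultdict(int))
for order in orders:
    for a, b in combinations(sorted(order), 2):
        co_occurrence[a][b] += 1
        co_occurrence[b][a] += 1

def recommend(item, k=2):
    partners = co_occurrence[item]
    return sorted(partners, key=partners.get, reverse=True)[:k]

print(recommend("router"))   # e.g. ['ethernet cable', 'switch']
```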

References

Laursen, G. H. N., & Thorlund, J. (2010). Business Analytics for Managers: Taking Business Intelligence Beyond Reporting. Wiley & SAS Business Institute.

Pophal, L. (2014). The technology of contextualized content: What’s next on the horizon? Retrieved from http://www.econtentmag.com/Articles/Editorial/Feature/The-Technology-of-Contextualized-Content-Whats-Next-on-the-Horizon-99029.htm

Online Profiling and BI

First, we must define online profiling, also known as contextual marketing in advertising circles. It is a marketing and data-gathering process that “takes into account the users’ needs, habits, and goals to create a personalised web experience” (Elhawli, 2017). While it may not appear to be what conventional wisdom defines as business intelligence, the same principles are there: data is gathered, made sense of, and served to an information consumer in order to improve company performance. The information consumer, in this case, may be either the actual user of the website or the layer of the web platform that sits between the data (the recommendations) and the website user.

The steps necessary for making these items actionable follow typical business intelligence processes. First, the “sheer volume of data now available to marketers” requires focusing on the data that is relevant to the desired outcomes (Pophal, 2014). These outcomes and relevant data vary by market and platform. In many cases, the data itself yields important clues about what matters most, much like an exploratory data analysis in a traditional business intelligence implementation.

Another similarity is the set of platforms on which the data must be presented. This may be considered a parallel to the various visualization and reporting platforms available to an information consumer. There are multiple data points involving user device and platform (e.g., Firefox on macOS or Chrome on Android), which must then be used to determine how the information is best served up.

These data points also influence the intake process, a parallel to the ETL stage in traditional business intelligence. Imagine the “multiple inputs around consumers and the devices they’re carrying—or wearing—that receive those inputs” as systems of record contributing to a master data aggregator (Pophal, 2014). These must work in near real time, driving outputs and giving context to other inputs.

Moving from descriptive to predictive analytics is another similarity between online profiling and conventional business intelligence. Current online profiling “is largely driven by what consumers have done,” while “the future will focus on what they will do” (Pophal, 2014). The transition from descriptive to predictive is a milestone in business intelligence maturity (LaValle et al., 2011), and this is no different in online profiling. Serving up mountains of What Happened? insights can only go so far. Transitioning from that to So What? and Now What? crosses a hurdle into the next realm of usefulness. This cannot come at the expense of best practices, though, as the celebrated rise and embarrassing fall of Google Flu Trends reminds us (Lazer & Kennedy, 2015).

References

Elhawli, A. (2017, October 12). Why content & contextual marketing are key to your success. Retrieved February 2, 2019, from https://medium.com/@aelhawli1/why-content-contextual-marketing-are-key-to-your-success-4509239b71b9

LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21–31.

Lazer, D., & Kennedy, R. (2015, October 1). What we can learn from the epic failure of Google Flu Trends. Retrieved February 2, 2019, from https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/

Pophal, L. (2014). The technology of contextualized content: What’s next on the horizon? Retrieved from http://www.econtentmag.com/Articles/Editorial/Feature/The-Technology-of-Contextualized-Content-Whats-Next-on-the-Horizon-99029.htm

Target Measures, Probability Mining, and Econometric Modeling

Target measures narrow down what must be defined in business intelligence, in the sense of dashboards or key performance metrics. In my professional practice we often use the phrase “boiling the ocean” to suggest that the scope of a business intelligence effort cannot be a broad-brush approach; target measures provide focus and scope. Common target measures are ROI, net sales, efficiency rate, response time, et cetera (Liu, Laguna, Wright, & He, 2014). Probability mining uses predictive modeling methods such as logistic regression, neural networks, multiple regression, and recursive partitioning to predict the probability of particular events happening based on previous data. Econometric modeling is a combination of statistical analysis and economic theory, based on the idea that neither alone can provide an accurate or satisfactory picture of an economic phenomenon.
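
A minimal probability-mining sketch using logistic regression (one of the methods named above) is shown below. The data is synthetic and the feature names are illustrative, but the pattern of fitting on historical records and then scoring new ones with event probabilities is the general idea.

```python
# Minimal probability-mining sketch: fit a logistic regression on historical
# (here, synthetic) data, then score new records with the probability of the
# event of interest. Feature names are illustrative only.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 2000
recency_days = rng.uniform(1, 365, n)           # days since last purchase
order_count  = rng.poisson(5, n)

# synthetic ground truth: recent, frequent customers respond more often
logit = -2.0 - 0.01 * recency_days + 0.4 * order_count
responded = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([recency_days, order_count])
model = LogisticRegression().fit(X, responded)

new_customers = np.array([[10, 8], [300, 1]])   # [recency, orders]
print(model.predict_proba(new_customers)[:, 1]) # P(event) for each new record
```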

In business intelligence terms, these may be combined for a particular approach. Econometric models are more specialized and may not apply in a given instance, but target measures and probability mining certainly may. Target measures should usually be defined before probability mining happens, in order to maintain the correct focus and scope.

Imagine a human resources predictive model that mines a swath of employee data from SAP to find key correlations, then uses both multiple regression and neural nets to find predictors of voluntary separation. There is a target measure involved: the model is applied to current employees in order to estimate each one’s likelihood of leaving within the next two years, with a standard error of six months.

Another target measure of note is sell-through for a consumer goods corporation. Sell-in shows sales to a particular distributor; sell-through shows how much product actually got out the door and into the hands of consumers. However, not all distributors currently provide their sales data back to the company. Sell-through is a target measure because the delta between sell-in and sell-through is valuable for the business to know, and timely sell-through data can help the sales, marketing, and customer service departments adjust their approaches based on customer habits.
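
In code, the measure itself is a simple join and delta; the distributor names and figures below are invented.

```python
# Sell-in vs. sell-through delta per distributor (invented figures): the gap
# shows how much product is sitting in the channel rather than with consumers.

import pandas as pd

sell_in = pd.DataFrame({
    "distributor": ["Acme", "Globex", "Initech"],
    "sell_in_units": [12_000, 8_500, 4_000],
})
sell_through = pd.DataFrame({
    "distributor": ["Acme", "Globex"],          # Initech does not report back
    "sell_through_units": [9_500, 8_100],
})

report = sell_in.merge(sell_through, on="distributor", how="left")
report["channel_inventory"] = report["sell_in_units"] - report["sell_through_units"]
print(report)       # Initech shows NaN: no sell-through data provided
```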

References

Liu, Y., Laguna, J., Wright, M., & He, H. (2014). Media mix modeling – A Monte Carlo simulation study. Journal of Marketing Analytics, 2(3), 173–186.

Pophal, L. (2014). The technology of contextualized content: What’s next on the horizon? Retrieved from http://www.econtentmag.com/Articles/Editorial/Feature/The-Technology-of-Contextualized-Content-Whats-Next-on-the-Horizon-99029.htm

Reiss, P. C. & Wolak, F. A. (2007) Structural Econometric Modeling: Rationales and Examples from Industrial Organizations. Retrieved from https://web.stanford.edu/group/fwolak/cgi-bin/sites/default/files/files/Structural%20Econometric%20Modeling_Rationales%20and%20Examples%20From%20Industrial%20Organization_Reiss,%20Wolak.pdf

Data Warehousing and Data Mining

The relationship between data mining tools and data warehousing systems can be most easily seen in the connector options of popular analytics software packages. Tableau Desktop, for example, offers a long list of source options for pulling data in from warehouse back ends, and Microsoft Power BI includes similar interface options. There are countless packages in R for connecting to data warehouse back ends, readily available online from proprietary and open-source vendors. Other proprietary packages such as SPSS, SAS, and JMP have similar interfaces.

Simply put, a data mining tool enables insights into what is stored in the data warehouse, and is only as useful as the quality of the data it accesses. Power (2016) calls this discover, access, and distill. In professional practice, this author has often seen businesses focus erroneously on a particular data mining tool, believing the paid solution will provide immediate value, without ensuring the data warehouse (or equivalent) is in proper order first. Successful implementation of a data mining tool requires a number of preparatory steps, including (but not limited to):

  1. Identifying appropriate Systems of Record (SORs)
  2. Validating the SOR accuracy and alignment with business purposes
  3. Establishing a common understanding of the data points within each SOR and how they translate across business units within the organization (this often requires an organization-wide Data Governance Board or equivalent)
  4. Developing business goals, or questions the data mining tool can answer

These steps ensure the data is valid, useful, and actionable. Organizations that do not take the necessary steps to ensure data quality and develop a business case for the data mining tool run a danger of wasting time and resources on a solution in search of a problem (Gudfinnsson, Strand, & Berndtsson, 2015; LaValle, Lesser, Shockley, Hopkins, & Kruschwitz, 2011).

Consider an international manufacturing company that currently uses a number of disparate systems of record for its business: Cognos (not a true SOR, but regarded as one), AS400, 3PL, SQL, Informix, and multiple warehouse management systems. The company has neither a unified data warehouse nor data governance procedures in place. In the current state, business units that use different systems of record cannot work together from a common understanding of the data. Attempts at data mining and even simple reporting have failed across business units because of data quality; for example, sales forecasts do not reconcile between Finance and Marketing because the basic figures from the disparate systems of record do not match. There can be no useful data mining from this data without significant transformation.

Assuming the foundational steps are done and the data mining tool is in production, new data points can be fed back into the warehouse based on discovered insights. For example, consider a multi-level marketing company that has a number of data points on its associates: units sold, associates recruited, years in the program, rewards program tier, et cetera. The company knows the associates can be grouped into performance categories akin to “novice” and “expert,” but is unclear both on how many categories to look at and on which factors are important. Principal components analysis and k-means clustering can reveal how the associates differentiate themselves based on the available variables and suggest an appropriate number of categories within which to classify them. These classifications can be written back into the data warehouse and used as covariates in other analysis work.
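
A sketch of that segmentation step on synthetic associate data (column names invented, three clusters assumed for brevity) might look like the following:

```python
# Sketch of the segmentation step on synthetic associate data (invented columns):
# standardize, project with PCA, cluster with k-means, then write the resulting
# segment label back alongside the source records.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
associates = pd.DataFrame({
    "units_sold":       rng.gamma(2.0, 50, 2000),
    "recruits":         rng.poisson(1.5, 2000),
    "years_in_program": rng.uniform(0, 10, 2000),
    "rewards_tier":     rng.integers(1, 5, 2000),
})

X = StandardScaler().fit_transform(associates)
components = PCA(n_components=2).fit_transform(X)     # main axes of variation

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(components)
associates["segment"] = kmeans.labels_                # label to load back into the warehouse

print(associates.groupby("segment").mean().round(1))  # profile of each segment
```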

References

Brownlee, J. (2016, September 22). Supervised and unsupervised machine learning algorithms.  Retrieved from https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

Gudfinnsson, K., Strand, M., & Berndtsson, M. (2015). Analyzing business intelligence maturity. Journal of Decision Systems, 24(1), 37-54. doi:10.1080/12460125.2015.994287

LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21-31.

Power, D. J. (2016). Data science: Supporting decision-making. Journal of Decision Systems, 25(4), 345-356.

Soni, D. (2018, March 22). Supervised vs. Unsupervised learning – towards data science.  Retrieved from https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Tableau Desktop 2018.2 [Computer software]. (2018). Retrieved from http://www.tableau.com.

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).