Database Indexing and Clusters

A useful analogy for database indexing is highlighting an article or book. We highlight or mark passages to make them easier to retrieve later, when we pick up the book and want to find something quickly. Likewise, an index provides quick access to key information in the database tables. It would be silly to highlight an entire book or mark up every page; by the same token, it would be functionally useless to index a large number of columns in a table. There is a point of diminishing returns.

It is generally recommended to index columns that are involved in WHERE or JOIN clauses (Larsen, 2010). These columns are sought out frequently by multiple query operations and are typically as critical to the table as the Primary Key. It is important to choose wisely here because every insert, update, or delete on the table also requires an update to each index, and that overhead multiplies as more indexes are placed on a single table. Again, we come back to the principle of diminishing returns.

There is also the matter of choosing between clustered and nonclustered indexes. The former reads like a telephone directory: the rows are physically stored in order, and the Primary Key is typically used as the clustering key. One drawback is the need to re-order the index when information is updated, added, or deleted. A nonclustered index, on the other hand, operates much like the index in the back of a textbook, or like a dimension table in a star-schema database. While the nonclustered index may seem preferable across the board, it shines most when values are constantly being updated, since changes do not force a physical reordering of the table. In situations where most of the table’s data is returned in a query, or the Primary Key is the natural identifier, a clustered index is the better choice.
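
To make the distinction concrete, here is a minimal, hedged sketch in Python against a hypothetical SQL Server table dbo.Employee; the connection string, table, and column names are placeholders, and in practice the clustered index is usually created implicitly by the Primary Key constraint.

    import pyodbc

    # Placeholder connection string for a hypothetical SQL Server instance.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbhost;DATABASE=hr;Trusted_Connection=yes;"
    )
    cursor = conn.cursor()

    # Clustered index: the table rows themselves are stored in EmployeeID order.
    # Only one clustered index is allowed per table.
    cursor.execute(
        "CREATE CLUSTERED INDEX IX_Employee_EmployeeID ON dbo.Employee (EmployeeID);"
    )

    # Nonclustered index: a separate lookup structure, suited to a column that
    # appears frequently in WHERE or JOIN clauses.
    cursor.execute(
        "CREATE NONCLUSTERED INDEX IX_Employee_LastName ON dbo.Employee (LastName);"
    )

    conn.commit()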

References

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Larsen, G. A. (2010). The dos and don’ts of database indexing. Retrieved from https://www.databasejournal.com/features/mssql/article.php/3860851/The-Dos-and-Donts-of-Database-Indexing.htm

Wagner, B. (2017). Clustered vs nonclustered: What index is right for my data? Retrieved from https://hackernoon.com/clustered-vs-nonclustered-what-index-is-right-for-my-data-717b329d042c

Data Mining: What’s Ahead

I’ve written on Data Mining before, as it is a fundamental step for higher-order predictive and prescriptive analytics work. Enterprise data warehouses are a trove of useful information, and data mining methods help to separate what is useful from what is not (Sharma, Sharma, & Sharma, 2013). Data mining is itself an analysis method; that is, “the analysis of data that was collected for other purposes but not the questions to be answered through the data mining process” (Maaß, Spruit, & de Waal, 2014, p. 2). Data mining takes on the unknown-unknowns of the dataset and begins to make sense of the vast amount of data points available. It involves both data transformation and reduction. These are necessary as “prediction algorithms have no control over the quality of the features and must accept it as a source of error” (Maaß, Spruit, & de Waal, 2014, p. 6). Data mining reduces the noise and eliminates the dilution of relevant data by irrelevant covariates. It provides the business intelligence framework with usable data and a minimum of error.

Tembhurkar, Tugnayat, & Nagdive (2014) outline five stages for successful data-to-BI transformation:

  1. Collection of raw data
  2. Data mining and cleansing
  3. Data warehousing
  4. Implementation of BI tools
  5. Analysis of outputs (p. 132).

Given the importance of Data Mining in the BI process, I do not see it going away or diminishing in stature. In fact, more attention may be coming to it because of the growing interest in data lakes and ELT over ETL (e.g., Meena & Vidhyameena, 2016; Rajesh & Ramesh, 2016). Increased attention will be paid to mining and cleansing practices. New developments will include advances in unstructured data, IoT data, distributed systems data mining, and NLP/multimedia data mining.

References

Maaß, D., Spruit, M., & de Waal, P. (2014). Improving short-term demand forecasting for short-lifecycle consumer products with data mining techniques. Decision Analytics, 1(1), 1–17.

Meena, S. D., & Vidhyameena, S. (2016). Data lake – a new data repository for big data analytics workloads. International Journal of Advanced Research in Computer Science, 7(5), 65-67.

Rajesh, K. V. N., & Ramesh, K. V. N. (2016). An introduction to data lake. i-Manager’s Journal on Information Technology, 5(2), 1-4.

Sharma, S. A., Sharma, A. K., & Sharma, D. M. (2013). Using data mining for prediction: A conceptual analysis. i-Manager’s Journal on Information Technology, 2(1), 1–9.

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

Data Mining and the Enterprise BI Long Game

Data mining provides the foundational work for higher-order predictive and prescriptive analytics. Enterprise data warehouses are a trove of useful information, and data mining methods help to separate what is useful from what is not (Sharma, Sharma, & Sharma, 2013). Data mining is itself an analysis method; that is, “the analysis of data that was collected for other purposes but not the questions to be answered through the data mining process” (Maaß, Spruit, & de Waal, 2014, p. 2). Data mining takes on the unknown-unknowns of the dataset and begins to make sense of the vast amount of data points available. It involves both data transformation and reduction. These are necessary as “prediction algorithms have no control over the quality of the features and must accept it as a source of error” (Maaß, Spruit, & de Waal, 2014, p. 6).

What is produced from these data mining efforts is a set of relevant data points that can be used for aggregate, predictive, and prescriptive analysis in the enterprise organization’s business intelligence platform(s). It is no different than avoiding the “garbage-in, garbage-out” mistake of simple reporting and visualization. Data mining reduces the noise and eliminates the dilution of relevant data by irrelevant covariates. It provides the business intelligence framework with usable data and a minimum of error.

For example, if I were to embark on a predictive modeling project to determine what factors influenced employee attrition at a large manufacturing company over the last five years, I would first want to do extensive data mining on the raw dataset. With over 20,000 employees across every continent and hundreds of data points per employee, a rigorous data mining phase eliminates the variables that would introduce error into any predictive model, such as a decision tree or multiple regression.
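
As a hedged illustration of that mining phase, the sketch below assumes a hypothetical employees.csv extract with an attrited flag and several hundred feature columns; the file name, column names, and thresholds are invented. It simply drops near-constant and highly redundant numeric columns before any modeling begins.

    import pandas as pd

    df = pd.read_csv("employees.csv")    # hypothetical raw HR extract
    target = df.pop("attrited")          # label set aside for the modeling stage

    numeric = df.select_dtypes(include="number")

    # Drop near-constant columns: they carry almost no signal.
    variances = numeric.var()
    low_variance = variances[variances < 0.01].index

    # Drop one of each pair of highly correlated columns (|r| > 0.95)
    # so redundant covariates do not dilute the relevant data.
    corr = numeric.drop(columns=low_variance).corr().abs()
    redundant = [
        col for i, col in enumerate(corr.columns)
        if (corr.iloc[:i, i] > 0.95).any()
    ]

    mined = numeric.drop(columns=list(low_variance) + redundant)
    print(f"kept {mined.shape[1]} of {numeric.shape[1]} numeric columns")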

References

Maaß, D., Spruit, M., & de Waal, P. (2014). Improving short-term demand forecasting for short-lifecycle consumer products with data mining techniques. Decision Analytics, 1(1), 1–17.

Sharma, S. A., Sharma, A. K., & Sharma, D. M. (2013). Using data mining for prediction: A conceptual analysis. i-Manager’s Journal on Information Technology, 2(1), 1–9.

Zero-Latency Data and Business Intelligence

Business intelligence enables decision-makers and stakeholders to make strategic decisions based on the information available to them. Just as the quality of the data is critical, the timeliness of the data is equally so. Laursen & Thorlund (2010) identify three types of data:

  1. Lag information. This covers what happened previously, and may be used to feed predictive models attempting to create lead information. Although the data is recorded in real time (e.g., by a flight data recorder), reading and reporting from the data is done ex post facto.
  2. Real-time data. This data shows what is happening at present. Continuing the aviation example, the ADS-B pings from aircraft are real-time data points collected by receivers across the globe and fed to flight tracking sites such as FlightAware.com for real-time reporting.
  3. Lead information. This data is often yielded from predictive models created by real-time or lag information. Airlines use a combination of flight, weather, and air traffic data to project an estimated arrival time for any given commercial aircraft at a particular destination.

There are appropriate instances for all three types. Real-time data tends to be the most desired, but with decreased lag and immediate demand come trade-offs in processing power, vulnerability to errors, and cost. Somewhere between “very old” and “absolutely immediate” is the sweet spot of timeliness and cost-efficiency. In other words, the push for zero-latency data may be more costly than profitable. Businesses must develop their own cost/benefit models to determine how real-time their BI data should be.

One area of real-time necessity is item affinity analysis. Every day on Amazon, customers order items and are presented with other items that may be relevant to their purchase, based on purchasing patterns from other customers who have ordered the same thing as well as their own purchasing history (Pophal, 2014). This data must be zero-latency, as a recommendation must be posted almost immediately after the customer makes their initial order. A lag time of minutes, hours, or days would lose the potential sale.
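
A hedged sketch of the underlying affinity idea, using a handful of invented in-memory orders; a production system would compute and serve these pairings continuously rather than in a batch script like this.

    from collections import Counter
    from itertools import combinations

    # Hypothetical historical orders; each order is a set of items.
    orders = [
        {"coffee maker", "filters", "mug"},
        {"coffee maker", "filters"},
        {"mug", "tea sampler"},
        {"coffee maker", "mug"},
    ]

    # Count how often each pair of items is purchased together.
    pair_counts = Counter()
    for order in orders:
        for pair in combinations(sorted(order), 2):
            pair_counts[pair] += 1

    def recommend(item, top_n=3):
        """Return the items most often co-purchased with `item`."""
        related = Counter()
        for (a, b), count in pair_counts.items():
            if item == a:
                related[b] += count
            elif item == b:
                related[a] += count
        return [other for other, _ in related.most_common(top_n)]

    print(recommend("coffee maker"))   # e.g., ['filters', 'mug']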

References

Laursen, G. H. N., & Thorlund, J. (2010). Business Analytics for Managers: Taking Business Intelligence Beyond Reporting. Wiley & SAS Business Institute.

Pophal, L. (2014). The technology of contextualized content: What’s next on the horizon? Retrieved from http://www.econtentmag.com/Articles/Editorial/Feature/The-Technology-of-Contextualized-Content-Whats-Next-on-the-Horizon-99029.htm

Online Profiling and BI

First, we must define online profiling, also known as contextual marketing in advertising circles. It is a marketing and data-gathering process which “takes into account the users’ needs, habits, and goals to create a personalised web experience” (Elhawli, 2017). While it may not appear to be what conventional wisdom defines as business intelligence, the same principles are there: data is gathered, made sense of, and served to an information consumer in order to improve company performance. The information consumer, in this case, may be either the actual user of the website or the layer of web platform that sits between the data (recommendations) and the website user.

The steps necessary for making the items actionable follow typical business intelligence processes. First, the “sheer volume of data now available to marketers” requires focusing on what data is relevant to the desired outcomes (Pophal, 2014). These outcomes and relevant data vary by market and platform. In many cases, the data itself yields important clues on what exactly is important—akin to an exploratory data analysis in traditional business intelligence implementations.

Another similarity is the platforms upon which the data must be presented. This may be considered a parallel to the various visualization and reporting platforms available to an information consumer. There are multiple data points involving user device and platform (e.g., Mozilla Firefox on macOS or Chrome on Android), which must then be used to determine how the information is best served up.

These data points also influence the intake process, parallel to the ETL stage in traditional business intelligence. Imagine the “multiple inputs around consumers and the devices they’re carrying-or wearing-that receive those inputs” as systems of record contributing to a master data aggregator (Pophal, 2014). These must work in near real-time, driving outputs and giving context to other inputs.

Moving from descriptive to predictive analytics is another similarity between online profiling and conventional business intelligence. Pophal (2014) observes that while current online profiling is largely driven by what consumers have done, “the future will focus on what they will do.” The transition from descriptive to predictive is a milestone in business intelligence maturity (LaValle et al., 2011), and online profiling is no different. Serving up mountains of What Happened? insights can only go so far. Transitioning from that to So What? and Now What? crosses a hurdle into the next realm of usefulness. This cannot come at the expense of best practices, though, as the celebrated rise and embarrassing fall of Google Flu Trends reminds us (Lazer & Kennedy, 2015).

References

Elhawli, A. (2017, October 12). Why content & contextual marketing are key to your success. Retrieved February 2, 2019, from https://medium.com/@aelhawli1/why-content-contextual-marketing-are-key-to-your-success-4509239b71b9

LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21–31.

Lazer, D., & Kennedy, R. (2015, October 1). What we can learn from the epic failure of Google Flu Trends. Retrieved February 2, 2019, from https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/

Pophal, L. (2014). The technology of contextualized content: What’s next on the horizon? Retrieved from http://www.econtentmag.com/Articles/Editorial/Feature/The-Technology-of-Contextualized-Content-Whats-Next-on-the-Horizon-99029.htm

Target Measures, Probability Mining, and Econometric Modeling

Target measures narrow down what must be defined in business intelligence, in the sense of dashboards or Key Performance Metrics. The term “boiling the ocean” comes up often in my professional practice to suggest that the scope of a business intelligence effort cannot be a broad-brush approach. Target measures provide focus and scope. Common target measures are ROI, Net Sales, Efficiency Rate, Response Time, et cetera (Liu, Laguna, Wright, & He, 2014). Probability mining uses predictive modeling methods such as logistic regression, neural networks, multiple regression, and recursive partitioning to predict the probability of particular events happening based on previous data. Econometric modeling is a combination of statistical analysis and economic theory, based on the idea that neither alone can provide an accurate or satisfactory picture of an economic phenomenon.

In terms of business intelligence, these may be combined for a particular approach. Econometric models are more specialized and may not apply to a given instance, but target measures and probability mining almost always do. Target measures should generally be defined before probability mining happens, in order to maintain the correct focus and scope.

Imagine a human resources predictive model that mines a swath of employee data from SAP to find key correlations, then uses both multiple regression and neural nets to find predictors of voluntary separation. There is a target measure involved there: the new model is applied to the current employees in order to find their percentage likelihood of leaving in the next two years, with a standard error of 6 months.
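
A hedged sketch of the probability-mining step in that scenario, assuming a hypothetical, already-mined extract of numeric employee features with a binary left_within_2yrs label (file and column names are invented); scikit-learn’s logistic regression returns the per-employee probability that feeds the target measure.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("employee_features.csv")            # hypothetical mined extract
    X = df.drop(columns=["employee_id", "left_within_2yrs"])
    y = df["left_within_2yrs"]                            # 1 = voluntary separation

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y
    )

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("holdout accuracy:", model.score(X_test, y_test))

    # Probability of leaving within two years for each current employee.
    df["p_leave_2yrs"] = model.predict_proba(X)[:, 1]
    print(df[["employee_id", "p_leave_2yrs"]].head())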

Another target measure of note is sell-through for a consumer goods corporation. Sell-in captures sales to a particular distributor; sell-through captures how much product actually went out the door into consumers’ hands. However, not all distributors currently report that sales data back to the company. The delta between sell-in and sell-through is a target measure the business needs to know, and timely sell-through data can help the sales, marketing, and customer service departments adjust their approaches based on customer habits.
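
A minimal sketch of that delta as a target measure, assuming hypothetical sell-in and sell-through tables keyed by SKU; all names and figures are invented.

    import pandas as pd

    sell_in = pd.DataFrame({
        "sku": ["A100", "A200", "B300"],
        "units_sold_in": [1200, 800, 450],       # units shipped to distributors
    })
    sell_through = pd.DataFrame({
        "sku": ["A100", "A200", "B300"],
        "units_sold_through": [950, 790, 210],   # units reported sold to consumers
    })

    measure = sell_in.merge(sell_through, on="sku")
    measure["delta"] = measure["units_sold_in"] - measure["units_sold_through"]
    measure["sell_through_rate"] = (
        measure["units_sold_through"] / measure["units_sold_in"]
    )
    print(measure)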

References

Liu, Y., Laguna, J., Wright, M., & He, H. (2014). Media mix modeling – A Monte Carlo simulation study. Journal of Marketing Analytics, 2(3), 173–186.

Pophal, L. (2014). The technology of contextualized content: What’s next on the horizon? Retrieved from http://www.econtentmag.com/Articles/Editorial/Feature/The-Technology-of-Contextualized-Content-Whats-Next-on-the-Horizon-99029.htm

Reiss, P. C. & Wolak, F. A. (2007) Structural Econometric Modeling: Rationales and Examples from Industrial Organizations. Retrieved from https://web.stanford.edu/group/fwolak/cgi-bin/sites/default/files/files/Structural%20Econometric%20Modeling_Rationales%20and%20Examples%20From%20Industrial%20Organization_Reiss,%20Wolak.pdf

Data Warehousing and Data Mining

The relationship between data mining tools and data warehousing systems can be most easily seen in the connector options of popular analytics software packages. Tableau Desktop, for example, offers a long list of source options for pulling data in from warehouse backends, and Microsoft Power BI includes similar interface options. There are countless packages in R for connecting to data warehouse backends, readily available online from proprietary and open-source vendors. Other proprietary packages such as SPSS, SAS, and JMP have similar interfaces.
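
In the same spirit as those connectors, a minimal Python sketch using SQLAlchemy and pandas to pull a table from a hypothetical PostgreSQL-backed warehouse; the connection string, schema, and query are placeholders.

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string for a hypothetical warehouse.
    engine = create_engine("postgresql://analyst:secret@warehouse-host:5432/edw")

    # Pull a subject-area table into a DataFrame for mining.
    sales = pd.read_sql(
        "SELECT order_date, region, product_id, net_sales FROM sales.fact_sales",
        engine,
    )
    print(sales.head())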

Simply put, a data mining tool enables insights into what is stored in the data warehouse, and is only as useful as the quality of the data it accesses. Power (2016) calls this discover, access, and distill. In professional practice, this author has often seen businesses focus erroneously on a particular data mining tool, believing the paid solution will provide immediate value, without ensuring the data warehouse (or equivalent) is in proper order first. Successful implementation of a data mining tool requires a number of preparatory steps, including (but not limited to):

  1. Identifying appropriate Systems of Record (SORs)
  2. Validating the SOR accuracy and alignment with business purposes
  3. Establishing a common understanding of the data points within each SOR and how they translate across business units within the organization (this often requires an organization-wide Data Governance Board or equivalent)
  4. Developing business goals, or questions the data mining tool can answer

These steps ensure the data is valid, useful, and actionable. Organizations that do not take the necessary steps to ensure data quality and develop a business case for the data mining tool run a danger of wasting time and resources on a solution in search of a problem (Gudfinnsson, Strand, & Berndtsson, 2015; LaValle, Lesser, Shockley, Hopkins, & Kruschwitz, 2011).

Consider an international manufacturing company that currently uses a number of disparate systems of record for its business: Cognos (not a true SOR but regarded as one), AS400, 3PL, SQL, Informix, and multiple warehouse management systems. The company does not have a unified data warehouse or data governance procedures in place. In current state, different business units that use different systems of record are unable to successfully work together with common understandings of data. Attempts at data mining and even simple reporting have failed across business units because of the quality of data—for example, sales forecasting does not translate between Finance and Marketing because the basic figures from the disparate systems of record do not match. There can be no useful data mining from this data without significant transformation.

Assuming the foundational steps are done and the data mining tool is in production, new data points can be put back into the warehouse based on discovered insights. For example, consider a multi-level marketing company that has a number of data points on its associates: units sold, associates recruited, years in the program, rewards program tier, et cetera. The company knows the associates can be grouped into performance categories akin to “novice” and “expert” but is unclear on both how many categories to look at and what factors are important. Principal components analysis and k-means clustering can reveal how the associates differentiate themselves based on the available variables and suggest an appropriate number of categories within which to classify them. These classifications can be put back into the data warehouse and used as covariates in other analysis work.
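
A hedged sketch of that round trip, assuming a hypothetical associates table already in the warehouse; the connection string, table, and column names are invented, and scikit-learn handles the PCA and k-means before pandas writes the segment labels back.

    import pandas as pd
    from sqlalchemy import create_engine
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    engine = create_engine("postgresql://analyst:secret@warehouse-host:5432/edw")
    associates = pd.read_sql("SELECT * FROM mlm.associate_metrics", engine)

    features = associates[["units_sold", "recruits", "years_in_program", "reward_tier"]]
    scaled = StandardScaler().fit_transform(features)

    # Reduce correlated measures to a few components, then cluster on them.
    components = PCA(n_components=2).fit_transform(scaled)
    associates["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(components)

    # Write the derived classifications back to the warehouse so they can be
    # reused as covariates in other analysis work.
    associates[["associate_id", "segment"]].to_sql(
        "associate_segments", engine, schema="mlm", if_exists="replace", index=False
    )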

References

Brownlee, J. (2016, September 22). Supervised and unsupervised machine learning algorithms.  Retrieved from https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

Gudfinnsson, K., Strand, M., & Berndtsson, M. (2015). Analyzing business intelligence maturity. Journal of Decision Systems, 24(1), 37-54. doi:10.1080/12460125.2015.994287

LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21-31.

Power, D. J. (2016). Data science: Supporting decision-making. Journal of Decision Systems, 25(4), 345-356.

Soni, D. (2018, March 22). Supervised vs. Unsupervised learning – towards data science.  Retrieved from https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Tableau Desktop 2018.2 [Computer software]. (2018). Retrieved from http://www.tableau.com.

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

Machine Learning: Supervised and Unsupervised

Supervised learning typically takes the form of classification or regression. We know the input and output variables, and we try to make sense of the relationships between the two. Tembhurkar, Tugnayat, & Nagdive (2014) refer to this as Descriptive mining. Common methods include decision trees, the kNN algorithm, regression, and discriminant analysis. The choice of method depends on the type of output variable: continuous outputs call for regression methods, while discrete (categorical) outputs call for classification methods.

For example, a human resources division in a large multinational company wants to determine what factors have contributed to employee attrition over the past two years. A decision tree methodology can produce a simple “if-then” map of what attributes combine and result in a separated employee. An example tree might point out that a male employee over the age of 45, working in Division X, who commutes more than 25 miles from home, has a manager 10 years or more his junior, and has been in the same unit for more than seven years is a prime candidate for attrition. Although many of the variables are continuous, a decision tree method makes the data manageable and actionable for human resources division use.
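
A minimal sketch of such a tree, assuming a hypothetical attrition extract with a few of the numeric attributes mentioned above (the file name, columns, and tree depth are illustrative); scikit-learn’s export_text prints the if-then rules the paragraph describes.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    df = pd.read_csv("attrition.csv")    # hypothetical HR extract
    features = ["age", "commute_miles", "manager_age_gap", "years_in_unit"]
    X = df[features]
    y = df["separated"]                  # 1 = employee has left

    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

    # Print the "if-then" map of attribute splits associated with attrition.
    print(export_text(tree, feature_names=features))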

Unsupervised learning usually takes the form of clustering or association. The output variables are not known, and we rely on the system to make sense of the data with no a priori knowledge. Tembhurkar et al. (2014) refer to this as Prescriptive mining. Common methods include neural networks, anomaly detection, k-means clustering, and principal components analysis. Here, too, the choice of method depends on the data: clustering methods such as k-means operate on continuous features, while association methods operate on discrete, transactional data.

For example, a multi-level marketing company has a number of data points on its associates: units sold, associates recruited, years in the program, rewards program tier, et cetera. The company knows the associates can be grouped into performance categories akin to novice and expert but is unclear on both how many categories to look at and what factors are important. Principal components analysis and k-means clustering can reveal how the associates differentiate themselves based on the available variables and suggest an appropriate number of categories within which to classify them.
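
On the “how many categories” question specifically, a common heuristic is to compare k-means inertia across candidate cluster counts and look for an elbow; a hedged sketch on stand-in (randomly generated) associate features follows.

    import numpy as np
    from sklearn.cluster import KMeans

    # Stand-in for a scaled feature matrix: rows are associates, columns are
    # standardized metrics (units sold, recruits, years in program, tier).
    rng = np.random.default_rng(0)
    scaled = rng.normal(size=(500, 4))

    # Inertia always falls as k grows; the "elbow" where the drop flattens
    # suggests a reasonable number of performance categories.
    for k in range(2, 8):
        inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled).inertia_
        print(k, round(inertia, 1))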

References

Brownlee, J. (2016, September 22). Supervised and unsupervised machine learning algorithms.  Retrieved from https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

Soni, D. (2018, March 22). Supervised vs. Unsupervised learning – towards data science.  Retrieved from https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

Corporate Information Factories and Business Dimensional Models

Differentiating between a Corporate Information Factory (CIF) and a Business Dimensional Model (BDM) may come down to two different directions of strategic thought: top-down (CIF) or bottom-up (BDM).

In the BDM, otherwise known as the Kimball approach, data remain in their respective logical business units (e.g., Sales or Production) but are brought together into the data warehouse through a commonly defined bus architecture. This approach is most prevalent in the Microsoft BI stack. Star or snowflake schemas are utilized, and data are rarely normalized past 1NF, if at all. The logical focus is on the originating business units, and the goal is often to allow these units to share data across the organization more effectively. For presentation, fewer queries and joins are necessary than one would need to make sense of CIF data.

The CIF, or Inmon approach, starts with the central data repository as the unit of focus as opposed to the individual business units. The business units can create data marts from the normalized tables. Third normal form is required. The most apparent disadvantage here is the amount of time and thought required to implement a true CIF, but the resulting product is a true enterprise data factory. More joins are needed, though, to put the data into presentable form.

Where Extract-Transform-Load (ETL) or Extract-Load-Transform (ELT) is concerned, the former is the more conventional understanding of the process and is typically implemented in dimensional modeling. The transformation happens before the data reaches the target system, so the data arrives already arranged, to some degree, by business unit or purpose. The latter (ELT) is utilized most often in more powerful analytics implementations or in data lakes.
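
A hedged, minimal sketch of that ordering difference only; the extract, transform, and load steps are stand-in functions with invented data, not a real pipeline.

    import pandas as pd

    def extract() -> pd.DataFrame:
        # Stand-in for pulling rows from a source system.
        return pd.DataFrame({"region": [" east", "West "], "net_sales": ["1200", "900"]})

    def transform(df: pd.DataFrame) -> pd.DataFrame:
        # Stand-in cleanup: trim and conform text, cast types.
        out = df.copy()
        out["region"] = out["region"].str.strip().str.title()
        out["net_sales"] = out["net_sales"].astype(int)
        return out

    def load(df: pd.DataFrame, target: list) -> None:
        # Stand-in for writing to a warehouse table or lake zone.
        target.extend(df.to_dict("records"))

    warehouse, lake = [], []

    # ETL: transform first, so the warehouse only ever holds conformed data.
    load(transform(extract()), warehouse)

    # ELT: land the raw extract first; transform later, inside the target platform.
    load(extract(), lake)
    conformed_later = transform(pd.DataFrame(lake))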

Decision Support Systems, Data Warehouses, and OLAP Cubes

As Tembhurkar, Tugnayat, & Nagdive define it, BI is “a collection of tools and techniques [that] transforms raw data into significant information useful for analyzing business systems” (2014, p. 128). BI has evolved from earlier incarnations of Decision Support Systems (DSS), which served the same purpose(s) but were much more rudimentary compared to today’s implementations. These DSS solutions were often composed of data warehouses (DWs) and online analytical processing (OLAP) engines. The two components worked together to serve the business needs: the data warehouse handled ETL and storage, while the OLAP engine handled the front-end analysis.

The data warehouse serves as the central repository for multiple systems of record, which are often heterogeneous and disparate in the beginning. Data is replicated and stored in subject-area schemas (e.g., sales or employee data), most typically in fact and dimension tables as part of a SQL-backed relational database. The data warehouse itself can offer pre-packaged views and data marts, and it supports the OLAP system. Like the OLAP system in its original form, the data warehouse is starting to be eclipsed by data lakes in enterprise environments that deal with large amounts of heterogeneous, often unstructured data. The difference between the two, for purposes of this comparison, is where the “T” (transformation) falls in ETL or ELT. In a data warehouse, the transformation happens before loading, as its purpose is to serve as a central common repository. In a data lake, the transformation happens after loading, as the lake does not impose schemas or restrictions to achieve any kind of homogeneous state.

The OLAP system is multi-dimensional, not unlike a three-dimensional spreadsheet. It is not a relational database, but it enables analysis of the data in the warehouse and supports what we typically understand as slicing and dicing the data. While OLAP cubes were sufficient in the early days of BI, the shift toward a DevOps culture and the proliferation of machine learning, predictive analysis, dashboarding, and envelope-pushing analytics capabilities now demand more from a BI solution than rigid cubes.
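
As a loose, hedged analogy for slicing and dicing, a pandas pivot table can stand in for the cube; the fact rows below are invented.

    import pandas as pd

    # Invented fact rows: one row per region, quarter, and product line.
    facts = pd.DataFrame({
        "region":    ["East", "East", "West", "West", "East", "West"],
        "quarter":   ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
        "product":   ["Widgets", "Widgets", "Widgets", "Gears", "Gears", "Gears"],
        "net_sales": [120, 135, 90, 75, 60, 80],
    })

    # "Cube": net sales by region, broken out by quarter and product.
    cube = facts.pivot_table(
        index="region", columns=["quarter", "product"],
        values="net_sales", aggfunc="sum", fill_value=0,
    )

    # Slice: hold one dimension fixed (Q1 only).
    print(cube.xs("Q1", axis=1, level="quarter"))

    # Dice: pick a sub-cube (East region, Widgets only).
    print(facts.query("region == 'East' and product == 'Widgets'"))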

References

Felzke, M. (2014). Data warehouse vs. OLAP cube. Retrieved from https://www.solverglobal.com/blog/2014/04/data-warehouse-vs-olap-cube/

Harris, D. (n.d.). ETL vs. ELT: How to choose the best approach for your data warehouse. Retrieved from https://www.softwareadvice.com/resources/etl-vs-elt-for-your-data-warehouse/

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).