What Makes Big Data “Big”?

I’ve never been a fan of buzzwords. The latest source of my discomfort is the term thought leader, which is one of those ubiquitous but necessary phrases in almost every professional space. That hasn’t kept me from poking fun at it, though, as I believe we should be able to laugh at ourselves and not take things too seriously.


Big Data is a buzzword. But it’s also my career.

What is the difference between regular, conventional, garden-variety data and Big Data? There’s a lot we could say here, but the key differences that come to mind for me are use, size, scope, and storage. I immediately think of two specific datasets I’ve used for teaching purposes: LendingClub and Stattleship.

LendingClub posts their loan history (anonymized, of course) for public consumption so that any audience may feed it into an engine or tool of their choice for analysis. I’ve used this dataset before to demonstrate predictive modeling and how financial institutions use it to aid decision-making in loan approvals. Stattleship is a sports data service with an API that allows access to a myriad of major league sports data. They also provide a custom wrapper to be used in R, and I’ve used these tools to teach R.
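
To make the LendingClub example concrete, here is a minimal sketch of the kind of loan-approval model I walk through in class. The file name and column names below are hypothetical stand-ins, not the actual LendingClub export fields, and a real exercise involves far more cleaning and feature work.

```python
# A minimal sketch of a loan-default model on LendingClub-style data.
# File name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

loans = pd.read_csv("lendingclub_loans.csv")                 # hypothetical path
loans = loans.dropna(subset=["annual_income", "dti", "loan_amount", "loan_status"])

X = loans[["annual_income", "dti", "loan_amount"]]           # hypothetical features
y = (loans["loan_status"] == "Charged Off").astype(int)      # 1 = defaulted

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```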

One of the primary differences between big data and conventional data is use case. Take these two datasets, for example. The architects of these sets understand that a variety of users will be downloading the data for various reasons, and there is no specific use case intended for either set. The possibilities are endless. With smaller troves of data, we typically have an intended use attached, and the data is specific to that use. Not so with big data.

These datasets illustrate two other key factors in big data: size and scope. Again, the datasets are not at all meant to answer one specific question or have a narrow focus. Sizing is often at least in gigabytes or terabytes, and in many cases tips over into petabytes. Big data sets carry an inherent freedom to explore multiple lines of inquiry, without any sort of restriction on scope.

Finally, the storage and maintenance of big data is another key difference that sets it apart from conventional datasets. The trend of moving database operations offsite and using Database-as-a-Service models has enabled the growth of big data, as has the development of distributed computing and storage. Smaller conventional datasets do not require such an infrastructure and are not quite as impactful on a company’s bottom line.

Future of BI: Opportunities, Pitfalls, and Threats

Opportunities

Master data management (MDM). A few years ago this was thought to be a dead concept and I wonder how much of that sentiment was driven by the advent of data lakes, unstructured processing, artificial intelligence, et cetera. We have come far enough now to know that (a) the two do not have to be mutually exclusive, and (b) MDM is seeing a resurgence as the importance of data governance and quality management grows. Regardless of how the data is used, it must be clean and relevant.

Ethics. Cambridge Analytica should not have been the first watershed moment in the ethics of big data and business intelligence. While a number of industries have established sub-disciplines in ethics, data science and business intelligence are young, and their ethical sub-disciplines will continue to grow. That particular scandal did peel back a layer of collective public naivete. We are more attuned now to the potential pitfalls of big data in the hands of companies with less-than-best intentions. However, willful ignorance remains, and this is a major opportunity for growth.

Data-driven cultures and citizen data scientists. Business intelligence has expanded from a small cadre of statisticians and developers to include more subject-area experts and regular business users. This democratization of data science is largely due to the ease of use of popular analytics packages such as Tableau and Qlik. As the black box of analytics is demystified and the power is put in the hands of more users, data-driven cultures will become easier to create in organizations.

Pitfalls

Over-reliance on the next-best-thing. Let’s admit it: there are some impressive analytics packages on the market right now. The innovations in data science are exciting. But without a focus on less-flashy elements such as data governance and the right people-processes, whatever the next best thing might be will fail. It is tempting to get caught up in the continuous cycle of innovation and forget about these critical elements.

De-valuing BI talent. The release of analytics packages that an average business user can pilot without the need for a dedicated statistician or business intelligence developer has done many good things for the discipline, but going too far in this direction is a potential pitfall. Socially, we are in an era of experts and scientists being ignored in favor of what people believe they know (Nichols, 2017). Between this predisposition and more functions being within the reach of regular business users, there is a potential for BI experts to be brushed aside and their talent de-valued.

Checking our brains at the door. As useful and amazing as business intelligence has become in organizations, it may be tempting to put more and more decision-making power on artificial intelligence at the expense of human intelligence. Plenty of films have used this premise as fodder for apocalyptic computers-take-over-the-world stories. But on a more practical level, business intelligence is all about serving up the right information so decision-makers can make the right calls—not making all the decisions for them.

Threats

Inflexible organizations. Organizational culture can be a great asset or opportunity, but it can also be an incredible hindrance. Even the best deployments with the best intentions can be rendered useless if an organization is not willing to embrace whatever change is necessary to take advantage of it all. This is not a new threat, per se, but one that will always be around.

Bad actors. We like to believe that big data and the algorithms that drive how we interact with it are neutral at best. However, as McNamee (2019) notes, it is possible for bad actors to utilize otherwise benign data and algorithms for nefarious purposes. As collections of data and the algorithms that drive outcomes or profit continue to grow, the chances that bad actors will exploit them grow as well.

Lack of transparency. This may be more of a threat in the big data realm generally than in business intelligence specifically, but it bears highlighting within this context. Businesses use proprietary algorithms and logic that turn troves of data into consequential decisions about our lives. These also shape the world we see through our consumption of news and social media websites. Do we remain in willful ignorance of how those are served up to us, or do we push for more transparency?

References

Graham, M. (2018). Facebook, Big Data, and the Trust of the Public. Retrieved from http://blog.practicalethics.ox.ac.uk/2018/04/facebook-big-data-and-the-trust-of-the-public/

Jürgensen, K. (2016). Master Data Management (MDM): Help or Hindrance? Retrieved from https://www.red-gate.com/simple-talk/sql/database-delivery/master-data-management-mdm-help-or-hindrance/

McNamee, R. (2019). Zucked: Waking up to the Facebook catastrophe. New York: Penguin.

Nichols, T. (2017). The death of expertise: The campaign against established knowledge and why it matters. New York: Oxford UP.

Pyramid Analytics. The Business Intelligence Trends of 2019 Discussed. Retrieved from https://www.pyramidanalytics.com/blog/details/blog-guest-bi-trends-of-2019-discussed

Rees, G., & Colqhuon, L. (2017). Predict future trends with business intelligence. Retrieved from https://www.intheblack.com/articles/2017/12/01/future-trends-business-intelligence

CRM, OLAP Cubes, and Business Intelligence

Customer Relationship Management, as a concept, brings together various systems from functions across the business (sales, marketing, operations, external, etc.) that allow the enterprise to create, maintain, and grow positive and productive relationships with customers. We might think of it as the glue that brings front office and back office together and allows the business to de-silo what would otherwise be proprietary information across the organization.


But what good are all these data points if they aren’t utilized effectively? It would be easy to fall victim to information overload if we tried to explore the data one axis or angle at a time. This is where classic data mining and online analytical processing (OLAP) come in. If we think of the various systems of record as one-dimensional axes on a graph, bringing them together into a three-dimensional cube and taking a particular block within that cube to analyze is much more efficient. Rather than starting with the data and searching for questions to answer that might involve those points (as is tempting to do at times), we are able to start with a specific business question and use OLAP to answer it.
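
To make the cube idea concrete, here is a toy sketch using pandas as a stand-in for a real OLAP engine. The data is invented, but the slice operation mirrors what an OLAP tool does when we hold one dimension fixed and read off a block.

```python
# A toy "cube": facts with three dimensions, then a slice of that cube.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["Lipstick", "Mascara", "Lipstick", "Mascara", "Lipstick", "Lipstick"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "units":   [120, 80, 95, 60, 130, 105],
})

# Roll the flat records up into a region x product x quarter cube.
cube = sales.pivot_table(index="region", columns=["product", "quarter"],
                         values="units", aggfunc="sum", fill_value=0)

# "Slice": hold one dimension fixed (quarter == Q2) and analyze that block.
q2_slice = cube.xs("Q2", axis=1, level="quarter")
print(q2_slice)
```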

For example, assume I am a cosmetics manufacturer and want to know how much of my product actually goes out the door to consumers after it is sold to a distributor. I want to use that information to adjust my marketing efforts and potentially re-evaluate my production line. I have the following data points available by way of my existing business intelligence environment:

  • Production line data
  • Inventory balances in my warehouse
  • Marketing campaign data
  • Sales data from my company to the distributor
  • Sales data from the distributor to the end consumer

Rather than starting from one or two of these data points and throwing things against the wall to see what might stick, I can use OLAP capabilities to find the relationships between these points, eventually driving my answer. Understand here that answering the initial question is simply a matter of reading one data point (the last one in this case); the end goal, however, is a strategic approach that addresses the customer relationship.
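
As a rough sketch of that workflow, the snippet below joins two of the data points listed above to compute a sell-through rate per SKU, the kind of derived measure that would feed the marketing and production decisions. The table and column names are invented for illustration.

```python
# Hedged sketch: compare what was sold into the distributor with what the
# distributor sold through to consumers. Data and names are invented.
import pandas as pd

sales_to_distributor = pd.DataFrame({
    "sku": ["A100", "A200", "A300"],
    "units_sold_in": [5000, 3000, 1200],
})
distributor_to_consumer = pd.DataFrame({
    "sku": ["A100", "A200", "A300"],
    "units_sold_through": [4100, 1500, 1150],
})

merged = sales_to_distributor.merge(distributor_to_consumer, on="sku")
merged["sell_through_rate"] = merged["units_sold_through"] / merged["units_sold_in"]

# Low sell-through SKUs are candidates for revised marketing or production plans.
print(merged.sort_values("sell_through_rate"))
```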

One caveat here. OLAP may be considered a predecessor to data mining as it is currently understood, depending on which view of business intelligence you find most appealing. Strictly speaking, traditional OLAP has been used for years in marketing, forecasting, and sales. Data mining capabilities at present far surpass what has traditionally been available in the OLAP sense.

Reference

Connolly, T. & Begg, C. (2015).  Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

From Decision Support to Business Intelligence

Decision support systems (DSS) predate business intelligence (BI) by several decades. Sprague and Carlson (1982) define a DSS as a “class of information system that draws on transaction processing systems and interacts with the other parts of the overall information system to support the decision-making activities of managers and other knowledge workers in organisations.” This definition is very nearly interchangeable with that of a business intelligence system. We can think of a DSS as more of a framework and model than an actual software package. These have often been aided by computer resources, such as databases and online analytical processing (OLAP), but they may also be offline. Any DSS involves a data or knowledge base, the business rules, and the interface itself. A DSS may be classified by one of the following drivers (Power, 2000):

  • Communication-Driven
  • Data-Driven
  • Document-Driven
  • Knowledge-Driven
  • Model-Driven

Business intelligence can be viewed as the successor of DSS or as its parent. I prefer to see it as a hybrid. As methods of collecting, storing, viewing, and analyzing data became more advanced, the DSS came to be a specific part of a larger BI framework. A DSS is always dependent on “access to accurate, well-structured, and organized data” (Felsberger, Oberegger, & Reiner, 2016, p. 3). The various functions of business intelligence that have grown in recent years all serve to support the data points going into the DSS.

In a manufacturing environment, a practical example might be the evaluation and assignment of work centers. The knowledge base may include data such as what must go in, what must be produced, what constraints are in place, et cetera. Production and diagnostic data from the different work centers would be integrated via the BI capabilities of the organization, as well as forecasted production and schedule data. Business rules such as employee labor hours and machine lifecycle may also be included. The DSS would use all these data points to drive outputs; in this case, the desired outputs and decisions include the production labor and machine scheduling that are most efficient for the company.
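
A toy, model-driven sketch of that work-center decision might look like the following. The jobs, capacities, and greedy business rule are invented purely for illustration; a real DSS would pull these inputs from the BI environment and apply far richer constraints.

```python
# Toy model-driven DSS: assign jobs to work centers under a capacity constraint.
jobs = [("Job-1", 6), ("Job-2", 4), ("Job-3", 8)]      # (job, machine hours needed)
work_centers = {"WC-A": 10, "WC-B": 8}                  # remaining capacity in hours

assignments = {}
for job, hours in sorted(jobs, key=lambda j: j[1], reverse=True):
    # Business rule: place the largest jobs first, into the center with the
    # most remaining slack that can still fit them.
    candidates = {wc: cap for wc, cap in work_centers.items() if cap >= hours}
    if not candidates:
        assignments[job] = "unscheduled (capacity constraint)"
        continue
    chosen = max(candidates, key=candidates.get)
    work_centers[chosen] -= hours
    assignments[job] = chosen

for job, wc in assignments.items():
    print(job, "->", wc)
```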

References

Felsberger, A., Oberegger, B., & Reiner, G. (2016). A review of decision support systems for manufacturing systems.

Power, D. J. (2000). Web-based and model-driven decision support systems: concepts and issues. In proceedings of the Americas Conference on Information Systems, Long Beach, California.

Sprague, R. H., & Carlson, E. D. (1982). Building effective decision support systems. Prentice Hall Professional Technical Reference.

2PC and 3PC (Commit Protocols) in DBMS

The Two-Phase Commit (2PC) and Three-Phase Commit (3PC) protocols are popular in distributed DBMSs because they guarantee that either all nodes commit a transaction or none of them do. It is an all-or-nothing proposition. Both protocols share a prepare (voting) phase and a commit/abort phase; 3PC inserts an additional pre-commit phase between them, in which the coordinator confirms that every participant voted yes and collects acknowledgements before the commit is actually carried out. Compared to 3PC, Two-Phase Commit may be characterized as sending the command and hoping for the best, since the bulk of the transaction (the instructions for what to actually do) is transmitted with the commit phase. The return message from each participant determines the global commit or abort status. The extra pre-commit step in 3PC is intended to remove the blocking that can follow a coordinator or network failure. This step polls for availability before anything is done, so the nodes can “act independently in the event of a failure” (Connolly & Begg, 2015). This is an important distinction. In 2PC, a lost vote or acknowledgement can leave participants in doubt and undo the entire process. In 3PC, once the pre-commit phase has returned a global commit decision, even a timeout or network partition would not cause a global abort.
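
A minimal, in-memory sketch of the 2PC voting pattern might look like this. It is illustrative only: a real implementation adds write-ahead logging, timeouts, and recovery, which is exactly where the blocking issues discussed here arise.

```python
# Toy 2PC: collect votes in phase 1, then issue a global commit or abort.
class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit

    def prepare(self):
        # Phase 1 (voting): each node votes commit or abort.
        return self.will_commit

    def commit(self):
        print(f"{self.name}: committed")

    def abort(self):
        print(f"{self.name}: rolled back")


def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]      # Phase 1: gather votes
    if all(votes):                                   # Phase 2: global decision
        for p in participants:
            p.commit()
        return "GLOBAL COMMIT"
    for p in participants:
        p.abort()
    return "GLOBAL ABORT"


print(two_phase_commit([Participant("node-1"), Participant("node-2")]))
print(two_phase_commit([Participant("node-1"), Participant("node-2", will_commit=False)]))
```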

Terminating a process, according to Connolly & Begg (2015), is where the differences between these protocols are most critical. In 2PC it is possible to block: after the vote, the nodes wait on a commit or abort message from the coordinator before completing the global commit. If a partition occurs, they are stuck until the coordinator re-establishes communication. A power failure is more catastrophic, as it may involve multiple nodes and the coordinator. In both 2PC and 3PC, backup procedures are activated, but 2PC participants remain in a blocked state. Of course, overall, there are tradeoffs. The major issue with 3PC is the communication overhead, which is to be expected with the extra phase (Kumar, 2016).

References

Connolly, T. & Begg, C. (2015).  Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Kumar, M. (2016). Commit protocols in distributed database system: A comparison. International Journal for Innovative Research in Science & Technology, 2(12), 277-281.

Concurrency: Optimistic or Pessimistic?

Optimistic concurrency control is the more complex of the two concurrency control methods. A transaction is timestamped when it begins, the work is performed, and the change is validated at the end. If another transaction completed a change to the same record after this transaction’s start time, this transaction is aborted. In other words, the original record is unavailable because someone got to it first and completed their transaction. The risk here is working from a stale copy of the record, as more than one person may have access to it at a time; the conflict only surfaces when change validation is done at the end of the transaction block.

Conservative (or pessimistic) concurrency control is akin to checking out a book at the library, and is the simpler of the two methods. Once a transaction begins, the record is locked, and no one else can modify it. In the library example, I would go to the library to check out a book (record) to read it (modify it); if it is there (no one has initiated a change), I may check it out. If the book (record) is not there (someone has locked it and is modifying it), I cannot check it out. It is a first-come, first-served method that ensures no two people can modify a record at the same time.

Each has its risks and rewards. Optimistic concurrency control tends to be used in environments without much contention for a single record of truth. It allows a higher volume of transactions per hour. However, as the name implies, the method essentially hopes for the best then deals with the problem if and when it arises. On the other hand, pessimistic concurrency control virtually guarantees that all transactions will be executed correctly and that the database is stable. It is a simpler decision tree: either abort if locked or commit if unlocked. All the drawbacks of pessimistic concurrency control lie in timing: fewer transactions per hour and limited access to the data depending on the number of users making transactions.
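
As a rough illustration of the optimistic approach, the sketch below uses a version column in SQLite so that the UPDATE only succeeds if the row still carries the version that was read. The schema is invented, but the pattern is the validation step described above.

```python
# Optimistic concurrency via a version column (schema invented for illustration).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INT, version INT)")
conn.execute("INSERT INTO accounts VALUES (1, 100, 1)")

def optimistic_update(conn, account_id, new_balance):
    # Read the record and remember its version (no lock is taken).
    _, read_version = conn.execute(
        "SELECT balance, version FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    # ... other work happens here; a competing session may change the row ...
    cur = conn.execute(
        "UPDATE accounts SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, read_version),
    )
    # rowcount == 0 means someone else committed first: abort and retry.
    return cur.rowcount == 1

print(optimistic_update(conn, 1, 90))   # True: validation passed, change kept
```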

One specific advantage of optimistic locking that isn’t always considered shows up when a user cannot maintain a consistent connection to the database. Assume for a moment that a user locks a table in a remote database for updating and the connection is severed (through a server reset, ISP woes, et cetera). The user reconnects and is back in the database. However, the previous session was not properly closed, so we have a phantom user with the record still open. Under pessimistic locking, that orphaned lock keeps everyone else out until it times out or is cleared; under optimistic locking, no lock was ever taken, so other users can carry on working with the record.

Reference

Connolly, T. & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Database Indexing and Clusters

A useful analogy for database indexing is highlighting an article or book. We highlight or mark passages to make them easier for retrieval at a later date, when we pick up the book and want to find something quickly. Likewise, an index provides quick access to key information in the database tables. It would be silly to highlight an entire book, or mark up all of the pages; by the same token, it would be functionally useless to index a large number of columns in a table. There is a point of diminishing returns here.

It is generally recommended to index columns that are involved in WHERE or JOIN clauses (Larsen, 2010). These columns are frequently sought out by multiple query operations and are typically as critical to the table as the Primary Key. It is important to choose wisely here because every write to the table also requires an index update, and that work multiplies if several indexes are placed on a single table. Again, we come back to the principle of diminishing returns.
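
A small SQLite sketch of that advice: index the column used in the WHERE clause and let EXPLAIN QUERY PLAN confirm the planner actually uses it. The table and data are invented, and other engines report plans differently, but the before-and-after contrast is the point.

```python
# Index the WHERE-clause column and inspect the query plan before and after.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i % 500, i * 1.5) for i in range(10_000)])

query = "SELECT total FROM orders WHERE customer_id = ?"

print("Before index:")
for row in conn.execute("EXPLAIN QUERY PLAN " + query, (42,)):
    print(" ", row)        # expect a full table scan

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

print("After index:")
for row in conn.execute("EXPLAIN QUERY PLAN " + query, (42,)):
    print(" ", row)        # expect a search using idx_orders_customer
```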

There is also the matter of choosing between clustered and nonclustered indexes. The former typically reads like browsing through a telephone directory: in order. Primary Keys are typically used in clustered indexes. One drawback here is the need to re-order the index when information is updated, added, or deleted. A non-clustered index, on the other hand, operates much like an index in the back of a textbook, or like a dimension table in a star-schema database. The latter may seem preferable in every case, but it shines most when values are constantly updated. In situations where most of the table data is returned in a query, or a Primary Key is the rational identifier, a clustered index is the better choice.

References

Connolly, T. & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Larsen, G. A. (2010). The dos and don’ts of database indexing. Retrieved from https://www.databasejournal.com/features/mssql/article.php/3860851/The-Dos-and-Donts-of-Database-Indexing.htm

Wagner, B. (2017). Clustered vs nonclustered: What index is right for my data? Retrieved from https://hackernoon.com/clustered-vs-nonclustered-what-index-is-right-for-my-data-717b329d042c

Data Mining: What’s Ahead

I’ve written on Data Mining before, as it is a fundamental step for higher-order predictive and prescriptive analytics work. Enterprise data warehouses are a trove of useful information, and data mining methods help to separate what is useful from what is not (Sharma, Sharma, & Sharma, 2013). Data mining is itself an analysis method; that is, “the analysis of data that was collected for other purposes but not the questions to be answered through the data mining process” (Maaß, Spruit, & de Waal, 2014, p. 2). Data mining takes on the unknown-unknowns of the dataset and begins to make sense of the vast amount of data points available. It involves both data transformation and reduction. These are necessary as “prediction algorithms have no control over the quality of the features and must accept it as a source of error” (Maaß, Spruit, & de Waal, 2014, p. 6). Data mining reduces the noise and eliminates the dilution of relevant data by irrelevant covariates. It provides the business intelligence framework with usable data and a minimum of error.

Tembhurkar, Tugnayat, & Nagdive (2014) outline five stages for successful data-to-BI transformation:

  1. Collection of raw data
  2. Data mining and cleansing
  3. Data warehousing
  4. Implementation of BI tools
  5. Analysis of outputs (p. 132).

Given the importance of Data Mining in the BI process, I do not see it going away or diminishing in stature. In fact, more attention may be coming to it because of the growing interest in data lakes and ELT over ETL (e.g., Meena & Vidhyameena, 2016; Rajesh & Ramesh, 2016). Increased attention will be paid to mining and cleansing practices. New developments will include advances in unstructured data, IoT data, distributed systems data mining, and NLP/multimedia data mining.

References

Maaß, D., Spruit, M., & de Waal, P. (2014). Improving short-term demand forecasting for short-lifecycle consumer products with data mining techniques. Decision Analytics, 1(1), 1–17.

Meena, S. D., & Vidhyameena, S. (2016). Data lake – a new data repository for big data analytics workloads. International Journal of Advanced Research in Computer Science, 7(5), 65-67.

Rajesh, K. V. N., & Ramesh, K. V. N. (2016). An introduction to data lake. i-Manager’s Journal on Information Technology, 5(2), 1-4.

Sharma, S. A., Sharma, A. K., & Sharma, D. M. (2013). Using Data Mining for Prediction: A Conceptual Analysis. I-Manager’s Journal on Information Technology, 2(1), 1–9.

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

Data Mining and the Enterprise BI Long Game

Data mining provides the foundational work for higher-order predictive and prescriptive analytics. Enterprise data warehouses are a trove of useful information, and data mining methods help to separate what is useful from what is not (Sharma, Sharma, & Sharma, 2013). Data mining is itself an analysis method; that is, “the analysis of data that was collected for other purposes but not the questions to be answered through the data mining process” (Maaß, Spruit, & de Waal, 2014, p. 2). Data mining takes on the unknown-unknowns of the dataset and begins to make sense of the vast amount of data points available. It involves both data transformation and reduction. These are necessary as “prediction algorithms have no control over the quality of the features and must accept it as a source of error” (Maaß, Spruit, & de Waal, 2014, p. 6).


What is produced from these data mining efforts is a set of relevant data points that can be used for aggregate, predictive, and prescriptive analysis in the enterprise organization’s business intelligence platform(s). It is no different than avoiding the “garbage-in, garbage-out” mistake of simple reporting and visualization. Data mining reduces the noise and eliminates the dilution of relevant data by irrelevant covariates. It provides the business intelligence framework with usable data and a minimum of error.

For example, if I were to embark on a predictive modeling project to determine what factors influenced employee attrition from a large manufacturing company over the last five years, I would first want to do extensive data mining on the raw dataset. With over 20,000 employees on all continents across the world, and hundreds of data points per employee, a rigorous data mining phase eliminates the variables that would throw errors into any predictive model such as decision trees or multiple regression.
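
On synthetic stand-in data, that mining-before-modeling step might look like the sketch below: drop features that carry no information, then fit a simple decision tree on what survives. The columns, the attrition rule, and the variance-threshold filter are all invented purely for illustration, not a description of a real HR dataset.

```python
# Hedged sketch: prune uninformative columns, then fit a simple tree.
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
hr = pd.DataFrame({
    "tenure_years":   rng.integers(0, 20, n),
    "overtime_hours": rng.integers(0, 30, n),
    "site_code":      np.ones(n),                  # constant column: pure noise
    "salary_band":    rng.integers(1, 6, n),
})
attrited = (hr["overtime_hours"] + rng.normal(0, 5, n) > 20).astype(int)

# "Mining" step: drop features that carry no information before modeling.
selector = VarianceThreshold(threshold=0.0)
X = selector.fit_transform(hr)
kept = hr.columns[selector.get_support()]
print("Features kept:", list(kept))                # site_code is dropped

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, attrited)
print("Training accuracy:", round(tree.score(X, attrited), 3))
```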

References

Maaß, D., Spruit, M., & de Waal, P. (2014). Improving short-term demand forecasting for short-lifecycle consumer products with data mining techniques. Decision Analytics, 1(1), 1–17.

Sharma, S. A., Sharma, A. K., & Sharma, D. M. (2013). Using Data Mining for Prediction: A Conceptual Analysis. I-Manager’s Journal on Information Technology, 2(1), 1–9.

Zero-Latency Data and Business Intelligence

Business intelligence enables decision-makers and stakeholders to make strategic decisions based on the information available to them. Just as the quality of the data is critical, the timeliness of the data is equally so. Laursen & Thorlund (2010) identify three types of data:

  1. Lag information. This covers what happened previously, and may be used to feed predictive models attempting to create lead information. Although the data is recorded in real time (e.g., by a flight data recorder), reading and reporting from the data is done ex post facto.
  2. Real-time data. This data shows what is happening at present. Continuing the aviation example, the ADS-B pings from aircraft are real-time data points collected by receivers across the globe and fed to flight-tracking sites such as FlightAware.com for real-time reporting.
  3. Lead information. This data is often yielded by predictive models built from real-time or lag information. Airlines use a combination of flight, weather, and air traffic data to project an estimated arrival time for any given commercial aircraft at a particular destination (a toy version is sketched below).
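
As a toy illustration of turning real-time data into lead information, the arithmetic below derives a naive ETA from a single invented position report. A real flight-tracking service blends far more inputs than this; the point is only that lead information is computed from the real-time feed.

```python
# Naive ETA from the latest (invented) position report.
from datetime import datetime, timedelta, timezone

distance_remaining_nm = 430      # real-time input: miles left to the destination
ground_speed_kts = 455           # real-time input: current ground speed

hours_remaining = distance_remaining_nm / ground_speed_kts
eta = datetime.now(timezone.utc) + timedelta(hours=hours_remaining)
print("Naive ETA (UTC):", eta.strftime("%H:%M"))
```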

There are appropriate instances for all three types. Real-time tends to be the most desired, but of course with decreased lag and immediate demand comes a trade-off of processing power, vulnerability to errors, and cost. Somewhere between “very old” and “absolutely immediate” is the sweet spot of timeliness and cost-efficiency. In other words, the push for zero-latency data may be more costly than profitable. Businesses must develop their own cost/benefit models to determine how real-time their BI data should be.

One area of real-time necessity is item affinity analysis. Every day on Amazon, customers order items and are presented with other items that may be relevant to their purchase, based on purchasing patterns from other customers who have ordered the same thing as well as their own purchasing history (Pophal, 2014). This data must be zero-latency, as a recommendation must be posted almost immediately after the customer makes their initial order. A lag time of minutes, hours, or days would lose the potential sale.
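
A minimal sketch of that affinity signal, using nothing more than co-purchase counts on an invented set of orders. Production recommenders are far more sophisticated and run on streaming infrastructure, but the underlying “bought together” idea looks something like this.

```python
# Toy item affinity from co-purchase counts on invented orders.
from collections import Counter
from itertools import combinations

orders = [
    {"book", "reading lamp"},
    {"book", "bookmark"},
    {"book", "reading lamp", "bookmark"},
    {"coffee", "mug"},
]

pair_counts = Counter()
for basket in orders:
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1

def recommend(item, k=2):
    # Items most often bought together with `item`, ranked by co-occurrence.
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    return scores.most_common(k)

print(recommend("book"))   # e.g. [('bookmark', 2), ('reading lamp', 2)]
```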

References

Laursen, G. H. N., & Thorlund, J. (2010). Business Analytics for Managers: Taking Business Intelligence Beyond Reporting. Wiley & SAS Business Institute.

Pophal, L. (2014). The technology of contextualized content: What’s next on the horizon? Retrieved from http://www.econtentmag.com/Articles/Editorial/Feature/The-Technology-of-Contextualized-Content-Whats-Next-on-the-Horizon-99029.htm