Where Clinical, Genomic, and Big Data Collide

One of the early proving grounds of big data is healthcare, and the constant cycle of insights catching up to volume hasn’t changed since the early days of the electronic patient record. Early healthcare data typically involved structured metrics such as ICD-9 codes and other billing data, which yielded very little clinical detail. The introduction of new data points, both structured and unstructured, has opened the door to many new analytics possibilities. While the possibilities are there, “few viable automated processes” exist that can “extract meaning from data that is diverse, complex, and often unstructured” (Barlow, 2014, p. 18). Indeed, the gap continues to widen between the “rapid technological process in data acquisition and the comparatively slow functional characterization of biomedical information” (Cirillo & Valencia, 2019, p. 161).

With so much available, a hospital or healthcare provider may find it difficult to determine a place to start, and either ignore the possibilities altogether or engage in initiatives that are not impactful to clinical quality or costs. There are five broad areas in which value can be delivered: clinical operations, payment & pricing, R&D, new business models, and public health; data are gathered from four broad sources including clinical, pharmaceutical, administrative, and consumer (Barlow, 2014, p. 21).

As of late, genomics has entered the conversation as both a consumer product (e.g., 23andMe or Ancestry, known as personal genomic testing) and a clinical practice. It is one thing to prescribe a medication based on a patient’s chart history, but an entirely different patient experience when a prescription is tailored to a patient’s particular metabolism, genetic predispositions, and risks (Barlow, 2014, p. 19). The wealth of patient-generated health data from a growing number of consumer devices has already contributed to the rise of “Personalized Medicine” (Cirillo & Valencia, 2019, p. 162) and the introduction of genomic data will move the needle even further. One can’t get much more personalized than a genetic footprint.

One debate around personal genomic testing is the value it provides when given directly to consumers without the benefit of clinician involvement. While the benefits of such testing include lifestyle changes that mitigate future disease risk, consumers are also prone to misinterpretation that may lead to unnecessary medical treatment (Meisel et al., 2015, p. 1). Beyond future risk, a recent study found the interest around personal genomic testing had a great deal to do with family or individual history of a particular affliction (Meisel et al., 2015). Consumers are mindful of explaining current risks and phenomena, not just predicting them.

References

Barlow, R. D. (2014). Great expectations for big data. Health Management Technology, 35(3), 18-21.

Cirillo, D., & Valencia, A. (2019). Big data analytics for personalized medicine. Current Opinion in Biotechnology, 58, 161-167.

Meisel, S. F., Carere, D. A., Wardle, J., Kalia, S. S., Moreno, T. A., Mountain, J. L., . . . Green, R. C. (2015). Explaining, not just predicting, drives interest in personal genomics. Genome Medicine, 7(1), 74.

Big Data: Human vs Material Agency

Lehrer, Wieneke, Vom Brocke, Jung, and Seidel (2018) studied four companies and their use of big data analytics in the business. Common to all companies in the case study was a two-layer service innovation process: first, automated customer-oriented actions based on trigger actions and preferences; and second, the combination of human and material agencies to produce customer-oriented interactions. The latter is of particular interest, as popular opinion sometimes tends to totalize big data as a replacement for human interaction. As illustrated in this study, the material agency (technology) exists to supplement the human agency.

One particular illustration is Company A, “the Swiss subsidiary of a multinational insurance firm that offers private individuals and corporate customers a broad range of personal, property, liability, and motor vehicle insurance” (Lehrer et al., 2018). Through a recent implementation of big data analytics tools and methodologies, the company has created new ways of more efficient interaction and supplemented employees’ customer service with better insights. In the latter case, the material agency guides employees’ own interactions with customers. That is, “the employees’ skill sets, experiences, and customer contact strategies [interact] with the material features of BDA to create new practices” (Lehrer et al., 2018, p. 438). This may include a number of sales- and service-oriented cues, such as social media or online shopping data points pointing to a major life event. On the other front, consider how the stream of data from various customer devices (e.g., home security systems, automobile OBD-II data trackers, smartphone location data) provides a wealth of data points that can be utilized by various machine learning methods to understand what typical behavior looks like for a customer and then know when anomalies show up. Personally, my home security system now knows it is an unusual occurrence for me to go outside a particular geographic region without arming the system. When that does occur, I receive an alert reminding me to arm it.
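
The "learn what is typical, then flag what is not" idea can be sketched with a very simple statistical baseline. This is only an illustration and not the method used by Company A or by any security vendor: the sample distances, the is_anomalous helper, and the threshold of three standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev

# Hypothetical history: distance from home (km) on past occasions when the
# owner left the house without arming the system.
unarmed_trip_km = [0.4, 1.2, 0.8, 2.1, 1.5, 0.9, 1.1, 1.8, 0.6, 1.3]


def is_anomalous(distance_km, history, z_threshold=3.0):
    """Flag a trip whose distance is far beyond the customer's usual range."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No variation in the history: anything different is unusual.
        return distance_km != mu
    z_score = (distance_km - mu) / sigma
    return z_score > z_threshold


nearby_errand = is_anomalous(1.7, unarmed_trip_km)  # close to the usual range
long_trip = is_anomalous(25.0, unarmed_trip_km)     # far outside the usual range
```

A production system would likely use richer features (time of day, day of week, location clusters) and a trained model, but the shape of the logic is the same: establish a baseline, then alert on large deviations from it.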

Reference

Lehrer, C., Wieneke, A., Vom Brocke, J. A. N., Jung, R., & Seidel, S. (2018). How big data analytics enables service innovation: Materiality, affordance, and the individualization of service. Journal of Management Information Systems, 35(2), 424-460. doi:10.1080/07421222.2018.1451953

What Makes Big Data “Big?”

I’ve never been a fan of buzzwords. The latest source of my discomfort is the term thought leader, which is one of those ubiquitous but necessary phrases in almost every professional space. That hasn’t kept me from poking fun at it, though, as I believe we should be able to laugh at ourselves and not take things too seriously.

Big Data is a buzzword. But it’s also my career.

What is the difference between regular, conventional, garden-variety data and Big Data? There’s a lot we could say here, but the key differences that come to mind for me are use, size, scope, and storage. I immediately think of two specific datasets I’ve used for teaching purposes: LendingClub and Stattleship.

LendingClub posts their loan history (anonymized, of course) for public consumption so that any audience may feed it into an engine or tool of their choice for analysis. I’ve used this dataset before to demonstrate predictive modeling and how financial institutions use it to aid decision-making in loan approvals. Stattleship is a sports data service with an API that allows access to a myriad of major league sports data. They also provide a custom wrapper to be used in R, and I’ve used these tools to teach R.

One of the primary differences between big data and conventional data is use case. Take these two datasets, for example. The architects of these sets understand that a variety of users will be downloading the data for various reasons, and there is no specific use case intended for either set. The possibilities are endless. With smaller troves of data, we typically have an intended use attached, and the data is specific to that use. Not so with big data.

These datasets illustrate two other key factors in big data: size and scope. Again, the datasets are not at all meant to answer one specific question or have a narrow focus. Sizing is often at least in gigabytes or terabytes—and in many cases tipping over into petabytes. The freedom to explore multiple lines of inquiry is inherent in big data sets without any sort of restriction on scope.

Finally, the storage and maintenance of big data is another key difference that sets it apart from conventional datasets. The trend of moving database operations offsite and using Database-as-a-Service models has enabled the growth of big data, as has the development of distributed computing and storage. Smaller conventional datasets do not require such an infrastructure and are not quite as impactful on a company’s bottom line.

Future of BI: Opportunities, Pitfalls, and Threats

Opportunities

Master data management (MDM). A few years ago this was thought to be a dead concept and I wonder how much of that sentiment was driven by the advent of data lakes, unstructured processing, artificial intelligence, et cetera. We have come far enough now to know that (a) the two do not have to be mutually exclusive, and (b) MDM is seeing a resurgence as the importance of data governance and quality management grows. Regardless of how the data is used, it must be clean and relevant.

Ethics. Cambridge Analytica should not have been the first watershed moment in the ethics of big data and business intelligence. While a number of industries have established sub-disciplines in ethics, data science and business intelligence are young, and this will continue to grow. That particular scandal did peel the layer of collective public naivete back. We are more attuned now to the potential pitfalls of big data in the hands of companies with less-than-best intentions. However, willful ignorance does remain and this is a major opportunity for growth.

Data-driven cultures and citizen data scientists. Business intelligence has expanded from a small cadre of statisticians and developers to include more subject-area experts and regular business users. This democratization of data science is largely due to the ease of use of popular analytics packages such as Tableau and Qlik. As the black box of analytics is demystified and the power is put in the hands of more users, data-driven cultures will become easier to create in organizations.

Pitfalls

Over-reliance on the next-best-thing. Let’s admit it: there are some impressive analytics packages on the market right now. The innovations in data science are exciting. But without a focus on less-flashy elements such as data governance and the right people-processes, whatever the next best thing might be will fail. It is tempting to get caught up in the continuous cycle of innovation and forget about these critical elements.

De-valuing BI talent. The release of analytics packages that an average business user can pilot without the need of a dedicated statistician or business intelligence developer has done many good things for the discipline, but going too far in this direction is a potential pitfall. Socially, we are in the era of experts and scientists being ignored in favor of what people believe they know (Nichols, 2017). Between this predisposition and more functions being in the reach of regular business users, there is a potential for BI experts to be brushed aside and their talent de-valued.

Checking our brains at the door. As useful and amazing as business intelligence has become in organizations, it may be tempting to put more and more decision-making power on artificial intelligence at the expense of human intelligence. Plenty of films have used this premise as fodder for apocalyptic computers-take-over-the-world stories. But on a more practical level, business intelligence is all about serving up the right information so decision-makers can make the right calls—not making all the decisions for them.

Threats

Inflexible organizations. Organizational culture can be a great asset or opportunity, but it can also be an incredible hindrance. Even the best deployments with the best intentions can be rendered useless if an organization is not willing to embrace whatever change is necessary to take advantage of it all. This is not a new threat, per se, but one that will always be around.

Bad actors. We like to believe that big data and the algorithms that drive how we interact with it are neutral at best. However, as McNamee (2019) notes, it is possible for bad actors to utilize otherwise benign data and algorithms for nefarious purposes. As collections of data grow and algorithms to drive outcomes or profit grow, the chances of these bad actors to utilize them become more and more likely.

Lack of transparency. This may be considered a threat in the big data realm in general more so than in business intelligence, but it does bear highlighting within this context. Businesses use proprietary algorithms and logic that turn troves of data into consequential decisions about our lives. These also shape the world that we see through our consumption of news and social media websites. Do we remain in willful ignorance of how those are served up to us, or do we push for more transparency there?

References

Graham, M. (2018). Facebook, Big Data, and the Trust of the Public. Retrieved from http://blog.practicalethics.ox.ac.uk/2018/04/facebook-big-data-and-the-trust-of-the-public/

Jürgensen, K. (2016). Master Data Management (MDM): Help or Hindrance? Retrieved from https://www.red-gate.com/simple-talk/sql/database-delivery/master-data-management-mdm-help-or-hindrance/

McNamee, R. (2019). Zucked: Waking up to the Facebook catastrophe. New York: Penguin.

Nichols, T. (2017). The death of expertise: The campaign against established knowledge and why it matters. New York: Oxford UP.

Pyramid Analytics. The Business Intelligence Trends of 2019 Discussed. Retrieved from https://www.pyramidanalytics.com/blog/details/blog-guest-bi-trends-of-2019-discussed

Rees, G., & Colqhuon, L. (2017). Predict future trends with business intelligence. Retrieved from https://www.intheblack.com/articles/2017/12/01/future-trends-business-intelligence

CRM, OLAP Cubes, and Business Intelligence

Customer Relationship Management, as a concept, brings together a number of systems from functions across the business (sales, marketing, operations, external, etc.) that allow the enterprise to create, maintain, and grow positive and productive relationships with customers. We might think of it as being the glue that brings front office and back office together and allows the business to de-silo what would otherwise be proprietary information across the organization.

But what good are all these data points if they aren’t utilized effectively? It would be easy to fall victim to information overload if we tried to explore the data from a particular axis or angle. This is where classic data mining and online analytical processing (OLAP) come in. If we think of various systems of record as one-dimensional axes on a graph, bringing these together in a three-dimensional cube and taking a particular block within that cube to analyze would be much more efficient. Rather than starting with the data and searching for questions to answer that might involve those points (as is tempting to do at times), we are able to start with a specific business question and use OLAP to answer it.

For example, assume I am a cosmetics manufacturer and want to know how much of my product actually goes out the door to consumers after it is sold to a distributor. I want to use that information to adjust my marketing efforts and potentially re-evaluate my production line. I have the following data points available by way of my existing business intelligence environment:

  • Production line data
  • Inventory balances in my warehouse
  • Marketing campaign data
  • Sales data from my company to the distributor
  • Sales data from the distributor to the end consumer

Rather than starting from one or two of these data points and throwing things against the wall to see what might stick, I can use OLAP capabilities to find the different relationships between these points, eventually driving my answer. Understand here that answering the initial question is simply a matter of reading one data point (the last one in this case)—however, a strategic approach that addresses the customer relationship is the end goal.
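
To make the cube idea concrete, here is a minimal sketch in plain Python of two basic OLAP operations, roll-up and slice, applied to the cosmetics question. The products, months, unit counts, and the helper names roll_up, slice_cube, and sell_through are all invented for illustration; a real OLAP engine would precompute and store these aggregates rather than loop over rows.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, stage, month, units).
# "stage" separates sales to the distributor from sales to the end consumer.
facts = [
    ("lipstick", "to_distributor", "2019-01", 1000),
    ("lipstick", "to_consumer", "2019-01", 640),
    ("lipstick", "to_distributor", "2019-02", 800),
    ("lipstick", "to_consumer", "2019-02", 720),
    ("mascara", "to_distributor", "2019-01", 500),
    ("mascara", "to_consumer", "2019-01", 150),
]

DIMENSIONS = ("product", "stage", "month")


def roll_up(rows, keep):
    """Sum units over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[3]
    return dict(totals)


def slice_cube(rows, dimension, value):
    """Keep only the rows where one dimension is fixed to a single value."""
    position = DIMENSIONS.index(dimension)
    return [row for row in rows if row[position] == value]


by_product_stage = roll_up(facts, keep=("product", "stage"))


def sell_through(product):
    """Share of units shipped to the distributor that reached consumers."""
    shipped = by_product_stage[(product, "to_distributor")]
    sold = by_product_stage[(product, "to_consumer")]
    return sold / shipped


# Slice on one month, then roll up to product level.
january_units = roll_up(slice_cube(facts, "month", "2019-01"), keep=("product",))
```

Starting from the business question (how much product actually reaches consumers) determines which dimensions to keep and which to aggregate away, which is the point of the paragraph above.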

One caveat here. OLAP may be considered a predecessor to currently-understood data mining, depending on which view of business intelligence you find appealing. Strictly speaking, traditional OLAP has been used for a number of years already for marketing, forecasting, and sales. Data mining capabilities at present far surpass what has been traditionally available in the OLAP sense.

Reference

Connolly, T. & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

From Decision Support to Business Intelligence

Decision support systems (DSS) predate business intelligence (BI) by several decades. Sprague and Carlson (1982) define a DSS as a “class of information system that draws on transaction processing systems and interacts with the other parts of the overall information system to support the decision-making activities of managers and other knowledge workers in organisations.” This definition is very nearly interchangeable with that of a business intelligence system. We can think of DSS as more of a framework and model than an actual software package. These have often been aided by computer resources, such as databases and online analytical processing (OLAP), but they may also be offline. Any DSS involves a data or knowledge base, the business rules, and the interface itself. A DSS may be classified by one of the following drivers (Power, 2000):

  • Communication-Driven
  • Data-Driven
  • Document-Driven
  • Knowledge-Driven
  • Model-Driven

Business intelligence can be viewed as the successor of DSS or the parent of it. I prefer to see it as a hybrid. As methods of collecting, storing, viewing, and analyzing data became more advanced, DSS came to be a specific part in a larger BI framework. A DSS is always dependent on “access to accurate, well-structured, and organized data” (Felsberger, Oberegger, & Reiner, 2016, p. 3). The various functions of business intelligence that have grown in recent years all serve to support the data points going into the DSS.

In a manufacturing environment, a practical example might be the evaluation and assignment of work centers. The knowledge base may include data such as what must go in, what must be produced, what constraints are in place, et cetera. Production and diagnostic data from the different work centers would be integrated via the BI capabilities of the organization, as well as forecasted production and schedule data. Business rules such as employee labor hours and machine lifecycle may also be included. The DSS would use all these data points to drive outputs; in this case, the desired outputs and decisions include production labor and machine scheduling that are most efficient to the company.
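
The three DSS ingredients in this example (knowledge base, business rules, output) can be sketched in a few lines. Everything here is hypothetical: the work centers, rates, and costs are made up, and recommend is a deliberately naive rule that picks the cheapest feasible center rather than a real scheduling model.

```python
from dataclasses import dataclass


@dataclass
class WorkCenter:
    name: str
    can_make: set           # parts this center is able to produce
    hours_available: float  # remaining labor/machine hours this period
    units_per_hour: float
    cost_per_hour: float


# Hypothetical knowledge base, as it might be assembled from BI sources.
work_centers = [
    WorkCenter("WC-1", {"bracket", "hinge"}, 16.0, 50.0, 90.0),
    WorkCenter("WC-2", {"bracket"}, 8.0, 80.0, 120.0),
    WorkCenter("WC-3", {"hinge"}, 24.0, 30.0, 60.0),
]


def recommend(part, quantity, centers):
    """Return (center name, hours, cost) for the cheapest feasible center.

    Returns None when no single center satisfies the constraints.
    """
    options = []
    for wc in centers:
        if part not in wc.can_make:
            continue  # constraint: the center must be able to make the part
        hours = quantity / wc.units_per_hour
        if hours > wc.hours_available:
            continue  # business rule: do not exceed available hours
        options.append((hours * wc.cost_per_hour, wc.name, hours))
    if not options:
        return None
    cost, name, hours = min(options)
    return name, hours, cost


bracket_plan = recommend("bracket", 600, work_centers)
hinge_plan = recommend("hinge", 900, work_centers)
```

In this made-up data the bracket order fits on a single center, while the hinge order fits on none, which is the kind of result a decision-maker would then act on (split the order, add a shift, or reschedule).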

References

Felsberger, A., Oberegger, B., & Reiner, G. (2016). A review of decision support systems for manufacturing systems.

Power, D. J. (2000). Web-based and model-driven decision support systems: concepts and issues. In proceedings of the Americas Conference on Information Systems, Long Beach, California.

Sprague, R. H., & Carlson, E. D. (1982). Building effective decision support systems. Prentice Hall Professional Technical Reference.

2PC and 3PC (Commit Protocols) in DBMS

Both the Two-Phase Commit (2PC) protocol and the Three-Phase Commit (3PC) protocol are popular with distributed DBMS instances because they guarantee that either all nodes commit a transaction or none of them do. It is an all-or-nothing proposition. Both protocols share a Prepare (Voting) phase and a Commit/Abort phase, but 3PC inserts a pre-commit phase between them: once every participating node has voted yes, the coordinator announces that outcome and collects acknowledgements before the commit is actually done. Compared to 3PC, Two-Phase Commit may be characterized as sending the command and hoping for the best, since participants learn the outcome of the vote only when the global commit or abort message arrives. The votes returned by the participants determine commit or abort status globally. The 3PC extra step of pre-commit is intended to clear up blocking and other global commit/abort failure issues. Because every node knows the result of the vote before anything is committed, the nodes can “act independently in the event of a failure” (Connolly & Begg, 2015). This is an important distinction. In both protocols, a single abort vote undoes the entire process. In 3PC, however, once the pre-commit phase has come back with a global commit vote, a timeout or coordinator failure does not by itself force a global abort or leave the participants waiting.
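
The voting and decision phases of 2PC can be sketched as a small in-process simulation. This is a teaching sketch only: there is no network, logging, or timeout handling, and the Participant class and its state names are invented for the example rather than taken from Connolly & Begg or any particular DBMS.

```python
class Participant:
    """A node that holds one part of a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "INITIAL"

    def prepare(self):
        """Phase 1: vote on whether the local work can be committed."""
        self.state = "READY" if self.can_commit else "ABORTED"
        return self.can_commit

    def finish(self, decision):
        """Phase 2: apply the coordinator's global decision."""
        self.state = decision


def two_phase_commit(participants):
    """Coordinator logic: commit only if every participant votes yes."""
    # Phase 1 (voting): ask every participant, even after a "no" vote.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): broadcast one global outcome to all participants.
    decision = "COMMITTED" if all(votes) else "ABORTED"
    for p in participants:
        p.finish(decision)
    return decision


healthy = [Participant("A"), Participant("B"), Participant("C")]
all_yes = two_phase_commit(healthy)

mixed = [Participant("A"), Participant("B", can_commit=False), Participant("C")]
one_no = two_phase_commit(mixed)
```

A participant sitting in the READY state has promised to commit but does not yet know the global outcome. If the coordinator disappears at that moment, the participant can only wait, which is the blocking problem that the extra pre-commit round in 3PC is meant to address.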

Terminating a process, according to Connolly & Begg (2015), is where the differences between these protocols are most critical. In 2PC it is possible to have a block because after the vote, the nodes are waiting on a commit or abort message from the coordinator before making the global commit. If a partition occurs, they are stuck until the coordinator re-establishes communication. A power failure is more catastrophic, as it may involve multiple nodes and the coordinator. In both 2PC and 3PC, backup procedures are activated, but 2PC participants remain in a blocked state in the meantime. Of course, overall, there are tradeoffs. The major issue with 3PC is the communication overhead, which is to be expected with the extra phase (Kumar, 2016).

References

Connolly, T. & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Kumar, M. (2016). Commit protocols in distributed database system: A comparison. International Journal for Innovative Research in Science & Technology, 2(12), 277-281.

Concurrency: Optimistic or Pessimistic?

Optimistic concurrency control is the more complex of the two concurrency control methods. The beginning of a transaction is timestamped, a process is run, and the change is validated. If another transaction completed after this transaction’s start time, this transaction is aborted. In other words, the original record is unavailable because someone got to it first and completed the transaction. The risk here is working from stale data, as it is possible for more than one person to have access to a record at a time. Change validation is done at the end of the transaction block.
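
The timestamp-and-validate cycle can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: a logical counter stands in for real timestamps, validation looks at one record only, and the OptimisticStore and Conflict names are invented for the example.

```python
import itertools


class Conflict(Exception):
    """Raised when validation finds a competing commit."""


class OptimisticStore:
    def __init__(self):
        self._clock = itertools.count(1)  # logical timestamps
        self._data = {}                   # key -> (value, commit timestamp)

    def begin(self):
        """Start a transaction and return its start timestamp."""
        return next(self._clock)

    def read(self, key):
        return self._data.get(key, (None, 0))[0]

    def commit(self, start_ts, key, new_value):
        """Validate at the end of the transaction, then write."""
        _, last_commit_ts = self._data.get(key, (None, 0))
        if last_commit_ts > start_ts:
            # Someone else completed a change after this transaction began.
            raise Conflict(f"{key} changed after this transaction started")
        self._data[key] = (new_value, next(self._clock))


store = OptimisticStore()
store.commit(store.begin(), "balance", 100)

# Two users start transactions and read the same starting value.
t1, t2 = store.begin(), store.begin()
seen_by_1 = store.read("balance")
seen_by_2 = store.read("balance")

store.commit(t1, "balance", seen_by_1 + 50)      # first to finish wins
try:
    store.commit(t2, "balance", seen_by_2 - 30)  # validated against t1's commit
    second_outcome = "committed"
except Conflict:
    second_outcome = "aborted"
```

The second transaction did all of its work before discovering the conflict, so the cost of optimism is the wasted effort and the retry.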

Conservative (or pessimistic) concurrency control is akin to checking out a book at the library, and is the simpler of the two methods. Once a transaction begins, the record is locked, and no one else can modify it. In the library example, I would go to the library to check out a book (record) to read it (modify it); if it is there (no one has initiated a change), I may check it out. If the book (record) is not there (someone has locked it, modifying it), I cannot check it out. It is a first-come, first-served method that ensures no two people have concurrent access to a record at a time.
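
The library analogy maps directly onto a lock. The sketch below uses Python's threading.Lock with a non-blocking acquire to stand in for a record lock; the LockedRecord class and its check_out and check_in methods are names made up for the illustration, and a real DBMS would manage locks per transaction rather than per object.

```python
import threading


class LockedRecord:
    """A record that only one user may hold at a time."""

    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def check_out(self):
        """Try to take the lock without waiting; True means we now hold it."""
        return self._lock.acquire(blocking=False)

    def check_in(self):
        """Release the lock so the next user can take the record."""
        self._lock.release()


book = LockedRecord("first edition")

first_reader = book.check_out()    # the book is on the shelf
second_reader = book.check_out()   # already checked out, so this is refused

book.value = "first edition, annotated"  # only the lock holder modifies it
book.check_in()

third_reader = book.check_out()    # available again after check-in
```

Here a refused request simply returns False. A database would more commonly make the second user wait for the lock, which is where the throughput cost of the pessimistic approach comes from.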

Each has its risks and rewards. Optimistic concurrency control tends to be used in environments without much contention for a single record of truth. It allows a higher volume of transactions per hour. However, as the name implies, the method essentially hopes for the best then deals with the problem if and when it arises. On the other hand, pessimistic concurrency control virtually guarantees that all transactions will be executed correctly and that the database is stable. It is a simpler decision tree: either abort if locked or commit if unlocked. All the drawbacks of pessimistic concurrency control lie in timing: fewer transactions per hour and limited access to the data depending on the number of users making transactions.

One specific advantage of optimistic locking that isn’t always thought of immediately is evident in the scenario when a user cannot maintain a consistent connection to the database. Assume for a moment that a user locks a table in a remote database for updating and the connection is severed (through a server reset, ISP woes, et cetera). The user reconnects and is back in the database. However, the previous session was not properly closed, so under pessimistic control we have a phantom user with the record still locked. Under optimistic control no lock is held while the user works, so a dropped session leaves nothing behind to block anyone else.

Reference

Connolly, T. & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Database Indexing and Clusters

A useful analogy for database indexing is highlighting an article or book. We highlight or mark passages to make them easier for retrieval at a later date, when we pick up the book and want to find something quickly. Likewise, an index provides quick access to key information in the database tables. It would be silly to highlight an entire book, or mark up all of the pages; by the same token, it would be functionally useless to index a large number of columns in a table. There is a point of diminishing returns here.

It is generally recommended to index columns that are involved in WHERE or JOIN clauses (Larsen, 2010). These columns are frequently sought out by multiple query operations and are typically as critical to the table as the Primary Key. It is important to choose wisely here because for every insert, update, or delete done on the table, each index must be updated as well. This work adds up quickly if multiple indexes are placed on a single table. Again, we come back to the principle of diminishing returns.
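
The effect of indexing a WHERE column can be seen with SQLite, which ships with Python. This is a small sketch rather than a benchmark: the orders table, its columns, and the index name idx_orders_customer are invented for the example, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

QUERY = "SELECT id, total FROM orders WHERE customer_id = 42"


def query_plan(connection, sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(conn, QUERY)  # no index yet: a full table scan

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

plan_after = query_plan(conn, QUERY)   # the plan now names the new index
matching_rows = conn.execute(QUERY).fetchall()
```

Printing plan_before and plan_after shows the planner moving from scanning the whole table to searching through the index. The trade-off from the paragraph above still applies: every later insert, update, or delete on orders now has to maintain that index too.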

There is also the matter of choosing between clustered and nonclustered indexes. The former typically reads like browsing through a telephone directory: in order. Primary Keys are typically used in clustered indexes. One drawback here is the need to re-order the index when information is updated, added, or deleted. On the other hand, a non-clustered index operates much like an index in the back of a textbook, or like a dimension table in a star-schema database. While the latter may seem more advantageous at all times, it usually shines when values are constantly updated. In situations where most of the table data is returned in a query, or a Primary Key is the rational identifier, a clustered index is the type of choice.
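
The telephone-directory and back-of-the-book analogies can be mimicked with ordinary Python data structures. This is a conceptual sketch only, with invented customer rows: real database engines store indexes as B-trees on disk pages, not as Python lists and dictionaries.

```python
import bisect

# "Clustered": the rows themselves are stored in primary-key order.
rows = [
    (101, "Garcia", "Austin"),
    (104, "Okafor", "Denver"),
    (109, "Lindqvist", "Boston"),
    (115, "Tanaka", "Austin"),
]
primary_keys = [row[0] for row in rows]


def find_by_key(key):
    """Binary search directly over the ordered rows."""
    pos = bisect.bisect_left(primary_keys, key)
    if pos < len(rows) and rows[pos][0] == key:
        return rows[pos]
    return None


def key_range(low, high):
    """Range reads are cheap because neighboring keys sit side by side."""
    start = bisect.bisect_left(primary_keys, low)
    end = bisect.bisect_right(primary_keys, high)
    return rows[start:end]


# "Non-clustered": a separate structure that points back at row positions.
city_index = {}
for position, row in enumerate(rows):
    city_index.setdefault(row[2], []).append(position)


def find_by_city(city):
    """Look up positions in the side index, then fetch each row."""
    return [rows[pos] for pos in city_index.get(city, [])]
```

Inserting a row with key 105 into the ordered list means shifting everything after it, which mirrors the re-ordering cost of a clustered index. Adding an entry to city_index leaves the stored rows where they are, at the price of an extra hop on every lookup.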

References

Connolly, T. & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Larsen, G. A. (2010). The dos and don’ts of database indexing. Retrieved from https://www.databasejournal.com/features/mssql/article.php/3860851/The-Dos-and-Donts-of-Database-Indexing.htm

Wagner, B. (2017). Clustered vs nonclustered: What index is right for my data? Retrieved from https://hackernoon.com/clustered-vs-nonclustered-what-index-is-right-for-my-data-717b329d042c

Data Mining: What’s Ahead

I’ve written on Data Mining before, as it is a fundamental step for higher-order predictive and prescriptive analytics work. Enterprise data warehouses are a trove of useful information, and data mining methods help to separate what is useful from what is not (Sharma, Sharma, & Sharma, 2013). Data mining is itself an analysis method; that is, “the analysis of data that was collected for other purposes but not the questions to be answered through the data mining process” (Maaß, Spruit, & de Waal, 2014, p. 2). Data mining takes on the unknown-unknowns of the dataset and begins to make sense of the vast amount of data points available. It involves both data transformation and reduction. These are necessary as “prediction algorithms have no control over the quality of the features and must accept it as a source of error” (Maaß, Spruit, & de Waal, 2014, p. 6). Data mining reduces the noise and eliminates the dilution of relevant data by irrelevant covariates. It provides the business intelligence framework with usable data and a minimum of error.
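
One simple form of the reduction step is a correlation filter that drops covariates showing almost no linear relationship with the target. The sketch below is an illustration of that general idea and not the technique used by Maaß, Spruit, and de Waal; the weekly figures, the feature names, and the 0.3 cutoff are all assumptions made for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly data: the target and three candidate covariates.
demand = [120, 135, 150, 160, 180, 195]
features = {
    "ad_spend": [10, 12, 15, 16, 19, 21],
    "store_count": [40, 40, 40, 40, 40, 40],     # never changes
    "warehouse_temp": [20, 23, 18, 18, 23, 20],  # unrelated noise
}


def select_features(candidates, target, min_abs_corr=0.3):
    """Keep only covariates whose correlation with the target is strong enough."""
    return {
        name: values
        for name, values in candidates.items()
        if abs(pearson(values, target)) >= min_abs_corr
    }


kept = select_features(features, demand)
```

A filter like this only catches linear, one-variable relationships, so in practice it would be one of several reduction steps rather than the whole of the mining stage.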

Tembhurkar, Tugnayat, & Nagdive (2014) outline five stages for successful data-to-BI transformation:

  1. Collection of raw data
  2. Data mining and cleansing
  3. Data warehousing
  4. Implementation of BI tools
  5. Analysis of outputs (p. 132).

Given the importance of Data Mining in the BI process, I do not see it going away or diminishing in stature. In fact, more attention may be coming to it because of the growing interest in data lakes and ELT over ETL (e.g., Meena & Vidhyameena, 2016; Rajesh & Ramesh, 2016). Increased attention will be paid to mining and cleansing practices. New developments will include advances in unstructured data, IoT data, distributed systems data mining, and NLP/multimedia data mining.

References

Maaß, D., Spruit, M., & de Waal, P. (2014). Improving short-term demand forecasting for short-lifecycle consumer products with data mining techniques. Decision Analytics, 1(1), 1–17.

Meena, S. D., & Vidhyameena, S. (2016). Data lake – a new data repository for big data analytics workloads. International Journal of Advanced Research in Computer Science, 7(5), 65-67.

Rajesh, K. V. N., & Ramesh, K. V. N. (2016). An introduction to data lake. i-Manager’s Journal on Information Technology, 5(2), 1-4.

Sharma, S. A., Sharma, A. K., & Sharma, D. M. (2013). Using Data Mining for Prediction: A Conceptual Analysis. I-Manager’s Journal on Information Technology, 2(1), 1–9.

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).