Variables and Measures, or People and Goals?

Just as any IT implementation shouldn’t exist for its own sake—that is, it should serve a business purpose within the sponsoring organization and not simply be a cost center—quantitative analysis within the context of an organization should likewise serve a business purpose. For example, there must be some reason a widget manufacturer commissions a study of its customer base; the study wasn’t commissioned just to keep the research division busy. There are typically research questions and hypotheses that exist and guide the methodology.

In my own research consulting work, I have often started with broad research questions that then drive more narrow research questions and/or particular segment analyses. At the analysis level, the variables and desired outcomes are examined in order to determine what test to use. From that point, it is easy to get lost in the vocabulary of quantitative analysis and forget that the work is being done to answer a business question.


For example, assuming the National Widget Company commissioned that study of its customer base, I could simply report the measures of central tendency and leave them to interpret why there’s a difference between the mean and median ages. But a true data scientist/analyst helps explain why the numbers mean what they do, and ensures the business users don’t get lost in the lingo. I would take the time to explain that the mean age is 42.5, the median age is 37, and that difference indicates there are more instances of older customers than younger and possibly some outliers bringing that mean age up. I would then turn back to them and ask what this means for their business. Remember that as the analyst, we are not the business subject-matter experts. Offering the numbers to the business and asking them to provide context creates more opportunities for synergy.
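The mean-versus-median conversation above can be reproduced in a few lines of Python. The ages below are invented solely to match the figures in the text (mean 42.5, median 37); the standard-library `statistics` module does the rest.

```python
import statistics

# Hypothetical customer ages chosen to reproduce the article's figures:
# a handful of much older customers pulls the mean above the median.
ages = [25, 28, 31, 33, 35, 37, 37, 41, 46, 58, 66, 73]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)

print(f"mean={mean_age:.1f}, median={median_age}")  # mean=42.5, median=37.0
# When mean > median, the distribution is skewed toward higher values:
# report both numbers, then hand the "why" back to the business.
```

The point of the sketch is the interpretation step, not the arithmetic: the two numbers together say more than either alone.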

Consider another example involving correlation. Two variables, or points of interest as we would call them: widget sales and distance from a major airport. A moderate negative correlation (r = -0.49) is found. First we must caution against equating correlation with causation. We would then pivot away from the r-value and put the focus back on the variables of interest: it appears that an individual who lives closer to a major airport is more likely to buy these widgets. Again, we would put the question back on the business to open a conversation about why these variables might be related and what the possible covariates are.
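The mechanics of that r-value can be sketched with invented distance/sales pairs. The data below are hypothetical and will not reproduce the -0.49 from the example; the point is the sign of the coefficient and how it is computed.

```python
import math

# Hypothetical paired observations: distance from a major airport (miles)
# and widget units purchased. Chosen so the relationship is negative.
distance = [2, 5, 8, 12, 15, 20, 25, 30, 40, 50]
sales    = [48, 45, 50, 38, 30, 35, 22, 28, 15, 12]

def pearson_r(x, y):
    """Pearson's r: covariance of x and y over the product of their spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

r = pearson_r(distance, sales)
print(f"r = {r:.2f}")  # negative: sales fall as distance grows
```

A negative r simply says the two series move in opposite directions; why they do is the business conversation that follows.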

In either case, and in any analytics situation, proper use of visualization is paramount. In the latter example it is much easier to show what an r-value means on a scatterplot than to explain it verbally. Data visualization bridges many gaps that numbers and words simply cannot fill. These are the languages of dashboards, executive roll-ups, and KPIs.

Overall, the primary thing to remember in keeping an audience engaged in a discussion around quantitative research is this: the variables of interest are the reason for the study, not the numbers themselves. Keep the focus on what matters.

The Privacy Divide: Social Media and Personal Genomic Testing


With every advance in technology comes a trade-off of some kind. Where the use of personally identifiable information is concerned, the trade-offs typically involve the exchange of privacy and confidentiality for a non-monetary benefit. In the early days of social media, conventional wisdom said the product was the service. However, we have seen over the last decade that the users of such platforms are the products, the perceived benefits merely carrots on sticks to keep the products (users) engaged in the cycle. We willfully pour details of ourselves into various social media outlets, despite the documented bad behaviors by giants like Facebook, and largely remain complacent as our personal data is packaged and leveraged against us by various business interests.

However, in conversations I’ve had around personal genomic testing (PGT), I’ve noticed that many are quick to cite data privacy and risk as a key reason not to participate. Think about this. On one hand, we have evidence that Facebook has been using our data in dubious ways, yet we keep pouring ourselves into it (McNamee, 2019). On the other hand, the potential benefits of PGT are outweighed by a fear of that data potentially being misused.

My purpose is not to minimize the potential hazards around PGT. Consider the following risks: (a) hacking; (b) profit or misuse by the company or partners; (c) limited protection from a narrow scope of laws; (d) requests from state and federal authorities; and (e) changing privacy policies or company use due to mergers, acquisitions, bankruptcies, et cetera (Rosenbaum, 2018). In the face of potential benefits from PGT, these are serious caveats. But read that list outside of this context, and it is equally applicable to the data we generate and provide to social media outlets on a daily basis.

As of yet, privacy regulations around social media use exist only within the context of the company itself—that is, there are no substantial federal regulations in the US on the matter, only the GDPR in the EU (St. Vincent, 2018). Where health information is concerned, the US does have slightly more mature federal regulation. The Health Insurance Portability and Accountability Act (HIPAA) requires confidentiality for all individually identifiable health information, and the Genetic Information Nondiscrimination Act (GINA) of 2008 prohibits the use of genetic information for underwriting purposes; in 2013, HIPAA’s protections were formally extended to genetic information. There is, however, no restriction on the sharing or use of genetic information that has been de-identified (National Human Genome Research Institute, 2015). De-identification is not entirely foolproof; there are cases in which the data can be re-identified (Rosenbaum, 2018).

The incongruence is puzzling. In the case of social media, users willfully provide a wealth of data points on a regular basis to companies that repackage and monetize that data for dubious purposes, in the absence of meaningful US legislation to protect it. In the case of PGT, where at least HIPAA and GINA have a rudimentary level of codified protection, users’ hesitance appears to be much more pronounced.

References

McNamee, R. (2019). Zucked: Waking up to the Facebook catastrophe. New York: Penguin.

National Human Genome Research Institute. (2015). Privacy in genomics. Retrieved from https://www.genome.gov/about-genomics/policy-issues/Privacy

Rosenbaum, E. (2018). Five biggest risks of sharing your DNA with consumer genetic-testing companies. Retrieved from https://www.cnbc.com/2018/06/16/5-biggest-risks-of-sharing-dna-with-consumer-genetic-testing-companies.html

St. Vincent, S. (2018). US should create laws to protect social media users’ data. Retrieved from https://www.hrw.org/news/2018/04/05/us-should-create-laws-protect-social-media-users-data

Where Clinical, Genomic, and Big Data Collide

One of the early proving grounds of big data is healthcare, and the constant cycle of insights catching up to volume hasn’t changed since the early days of the electronic patient record. Early healthcare data typically involved structured metrics such as ICD-9 codes and other billing data, which yielded very little clinical detail. The introduction of new data points, both structured and unstructured, has opened the door to many new analytics possibilities. While the possibilities are there, “few viable automated processes” exist that can “extract meaning from data that is diverse, complex, and often unstructured” (Barlow, 2014, p. 18). Indeed, the gap continues to widen between the “rapid technological progress in data acquisition and the comparatively slow functional characterization of biomedical information” (Cirillo & Valencia, 2019, p. 161).

With so much available, a hospital or healthcare provider may find it difficult to determine a place to start, and either ignore the possibilities altogether or engage in initiatives that are not impactful to clinical quality or costs. There are five broad areas in which value can be delivered: clinical operations, payment & pricing, R&D, new business models, and public health; data are gathered from four broad sources including clinical, pharmaceutical, administrative, and consumer (Barlow, 2014, p. 21).

As of late, genomics has entered the conversation as both a consumer product (e.g., 23andMe or Ancestry, known as personal genomic testing) and a clinical practice. It is one thing to prescribe a medication based on a patient’s chart history, but an entirely different patient experience when a prescription is tailored to a patient’s particular metabolism, genetic predispositions, and risks (Barlow, 2014, p. 19). The wealth of patient-generated health data from a growing number of consumer devices has already contributed to the rise of “Personalized Medicine” (Cirillo & Valencia, 2019, p. 162), and the introduction of genomic data will move the needle even further. One can’t get much more personalized than a genetic footprint.

One debate around personal genomic testing is the value it provides when given directly to consumers without the benefit of clinician involvement. While the benefits of such testing include lifestyle changes that mitigate future disease risk, consumers are also prone to misinterpretation that may lead to unnecessary medical treatment (Meisel et al., 2015, p. 1). Beyond future risk, a recent study found the interest around personal genomic testing had a great deal to do with family or individual history of a particular affliction (Meisel et al., 2015). Consumers are mindful of explaining current risks and phenomena, not just predicting them.

References

Barlow, R. D. (2014). Great expectations for big data. Health Management Technology, 35(3), 18-21.

Cirillo, D., & Valencia, A. (2019). Big data analytics for personalized medicine. Current Opinion in Biotechnology, 58, 161-167.

Meisel, S. F., Carere, D. A., Wardle, J., Kalia, S. S., Moreno, T. A., Mountain, J. L., . . . Green, R. C. (2015). Explaining, not just predicting, drives interest in personal genomics. Genome Medicine, 7(1), 74.

Big Data: Human vs Material Agency


Lehrer, Wieneke, Vom Brocke, Jung, and Seidel (2018) studied four companies and their use of big data analytics in the business. Common to all companies in the case study was a two-layer service innovation process: first, automated customer-oriented actions based on trigger events and preferences; and second, the combination of human and material agencies to produce customer-oriented interactions. The latter is of particular interest, as popular opinion sometimes casts big data as a wholesale replacement for human interaction. As illustrated in this study, the material agency (technology) exists to supplement the human agency.

One particular illustration is Company A, “the Swiss subsidiary of a multinational insurance firm that offers private individuals and corporate customers a broad range of personal, property, liability, and motor vehicle insurance” (Lehrer et al., 2018). Through a recent implementation of big data analytics tools and methodologies, the company has created new ways of more efficient interaction and supplemented employees’ customer service with better insights. In the latter case, the material agency guides employees’ own interactions with customers. That is, “the employees’ skill sets, experiences, and customer contact strategies [interact] with the material features of BDA to create new practices” (Lehrer et al., 2018, p. 438). This may include a number of sales- and service-oriented cues, such as social media or online shopping data points pointing to a major life event. On the other front, consider how the stream of data from various customer devices (e.g., home security system, automobile OBD data trackers, smartphone location data) provides a wealth of data points that can be utilized by various machine learning methods to understand what typical behavior looks like for a customer and then know when anomalies show up. Personally, my home security system now knows it is an unusual occurrence for me to go outside a particular geographic region without arming the system. When that does occur, I receive an alert reminding me to arm it.
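The home-security anecdote above amounts to simple anomaly detection: learn a typical region from history, then flag departures from it. The coordinates, threshold rule, and alert logic below are entirely hypothetical, a toy stand-in for whatever the vendor actually does.

```python
import math

HOME = (36.16, -86.78)  # hypothetical home coordinates (lat, lon)

def distance_from_home(point, home=HOME):
    # crude straight-line distance in degrees; fine for a toy threshold
    return math.hypot(point[0] - home[0], point[1] - home[1])

# historical trip endpoints recorded while the system was unarmed
history = [(36.17, -86.77), (36.15, -86.80), (36.18, -86.76)]
typical_radius = max(distance_from_home(p) for p in history)

def should_alert(current_location, armed):
    # anomaly: outside the learned region while the system is not armed
    return (not armed) and distance_from_home(current_location) > typical_radius

print(should_alert((36.95, -86.10), armed=False))  # True: unusual and unarmed
print(should_alert((36.17, -86.77), armed=False))  # False: within usual region
```

Real systems would use proper geodesic distance and richer behavioral features, but the shape of the logic (learn "typical," flag deviations) is the same.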

Reference

Lehrer, C., Wieneke, A., vom Brocke, J., Jung, R., & Seidel, S. (2018). How big data analytics enables service innovation: Materiality, affordance, and the individualization of service. Journal of Management Information Systems, 35(2), 424-460. doi:10.1080/07421222.2018.1451953

What Makes Big Data “Big?”

I’ve never been a fan of buzzwords. The latest source of my discomfort is the term thought leader, one of those seemingly necessary phrases that is now ubiquitous in almost every professional space. That hasn’t kept me from poking fun at it, though, as I believe we should be able to laugh at ourselves and not take things too seriously.


Big Data is a buzzword. But it’s also my career.

What is the difference between regular, conventional, garden-variety data and Big Data? There’s a lot we could say here, but the key differences that come to mind for me are use, size, scope, and storage. I immediately think of two specific datasets I’ve used for teaching purposes: LendingClub and Stattleship.

LendingClub posts their loan history (anonymized, of course) for public consumption so that any audience may feed it into an engine or tool of their choice for analysis. I’ve used this dataset before to demonstrate predictive modeling and how financial institutions use it to aid decision-making in loan approvals. Stattleship is a sports data service with an API that allows access to a myriad of major league sports data. They also provide a custom wrapper to be used in R, and I’ve used these tools to teach R.

One of the primary differences between big data and conventional data is use case. Take these two datasets, for example. The architects of these sets understand that a variety of users will be downloading the data for various reasons, and there is no specific use case intended for either set. The possibilities are endless. With smaller troves of data, we typically have an intended use attached, and the data is specific to that use. Not so with big data.

These datasets illustrate two other key factors in big data: size and scope. Again, the datasets are not at all meant to answer one specific question or have a narrow focus. Sizes often run to gigabytes or terabytes, and in many cases tip over into petabytes. The freedom to explore multiple lines of inquiry is inherent in big data sets, without any restriction on scope.

Finally, the storage and maintenance of big data is another key difference that sets it apart from conventional datasets. The trend of moving database operations offsite and using Database-as-a-Service models has enabled the growth of big data, as has the development of distributed computing and storage. Smaller conventional datasets do not require such an infrastructure and are not quite as impactful on a company’s bottom line.

Future of BI: Opportunities, Pitfalls, and Threats

Opportunities

Master data management (MDM). A few years ago this was thought to be a dead concept, and I wonder how much of that sentiment was driven by the advent of data lakes, unstructured processing, artificial intelligence, et cetera. We have come far enough now to know that (a) MDM and these newer approaches do not have to be mutually exclusive, and (b) MDM is seeing a resurgence as the importance of data governance and quality management grows. Regardless of how the data is used, it must be clean and relevant.

Ethics. Cambridge Analytica should not have been the first watershed moment in the ethics of big data and business intelligence. While a number of industries have established sub-disciplines in ethics, data science and business intelligence are young, and this will continue to grow. That particular scandal did peel back a layer of collective public naivete. We are more attuned now to the potential pitfalls of big data in the hands of companies with less-than-best intentions. However, willful ignorance does remain, and this is a major opportunity for growth.

Data-driven cultures and citizen data scientists. Business intelligence has expanded from a small cadre of statisticians and developers to include more subject-area experts and regular business users. This democratization of data science is largely due to the ease of use of popular analytics packages such as Tableau and Qlik. As the black box of analytics is demystified and the power is put in the hands of more users, data-driven cultures will become easier to create in organizations.

Pitfalls

Over-reliance on the next best thing. Let’s admit it: there are some impressive analytics packages on the market right now. The innovations in data science are exciting. But without a focus on less-flashy elements such as data governance and the right people-processes, whatever the next best thing might be will fail. It is tempting to get caught up in the continuous cycle of innovation and forget about these critical elements.

De-valuing BI talent. The release of analytics packages that an average business user can pilot without a dedicated statistician or business intelligence developer has done many good things for the discipline, but going too far in this direction is a potential pitfall. Socially, we are in an era of experts and scientists being ignored in favor of what people believe they know (Nichols, 2017). Between this predisposition and more functions being in the reach of regular business users, there is a potential for BI experts to be brushed aside and their talent de-valued.

Checking our brains at the door. As useful and amazing as business intelligence has become in organizations, it may be tempting to put more and more decision-making power on artificial intelligence at the expense of human intelligence. Plenty of films have used this premise as fodder for apocalyptic computers-take-over-the-world stories. But on a more practical level, business intelligence is all about serving up the right information so decision-makers can make the right calls—not making all the decisions for them.

Threats

Inflexible organizations. Organizational culture can be a great asset or opportunity, but it can also be an incredible hindrance. Even the best deployments with the best intentions can be rendered useless if an organization is not willing to embrace whatever change is necessary to take advantage of it all. This is not a new threat, per se, but one that will always be around.

Bad actors. We like to believe that big data and the algorithms that drive how we interact with it are neutral. However, as McNamee (2019) notes, it is possible for bad actors to use otherwise benign data and algorithms for nefarious purposes. As collections of data grow and algorithms increasingly drive outcomes and profit, the chances of bad actors exploiting them grow as well.

Lack of transparency. This may be more of a threat in the big data realm generally than in business intelligence specifically, but it bears highlighting in this context. Businesses use proprietary algorithms and logic that turn troves of data into consequential decisions about our lives. These also shape the world we see through our consumption of news and social media websites. Do we remain in willful ignorance of how those are served up to us, or do we push for more transparency?

References

Graham, M. (2018). Facebook, big data, and the trust of the public. Retrieved from http://blog.practicalethics.ox.ac.uk/2018/04/facebook-big-data-and-the-trust-of-the-public/

Jürgensen, K. (2016). Master Data Management (MDM): Help or Hindrance? Retrieved from https://www.red-gate.com/simple-talk/sql/database-delivery/master-data-management-mdm-help-or-hindrance/

McNamee, R. (2019). Zucked: Waking up to the Facebook catastrophe. New York: Penguin.

Nichols, T. (2017). The death of expertise: The campaign against established knowledge and why it matters. New York: Oxford UP.

Pyramid Analytics. (n.d.). The business intelligence trends of 2019 discussed. Retrieved from https://www.pyramidanalytics.com/blog/details/blog-guest-bi-trends-of-2019-discussed

Rees, G., & Colqhuon, L. (2017). Predict future trends with business intelligence. Retrieved from https://www.intheblack.com/articles/2017/12/01/future-trends-business-intelligence

CRM, OLAP Cubes, and Business Intelligence

Customer Relationship Management, as a concept, brings together various systems from functions across the business (sales, marketing, operations, external, etc.) to allow the enterprise to create, maintain, and grow positive and productive relationships with customers. We might think of it as the glue that brings front office and back office together and allows the business to de-silo what would otherwise be proprietary information across the organization.


But what good are all these data points if they aren’t utilized effectively? It would be easy to fall victim to information overload if we tried to explore the data without a particular axis or angle in mind. This is where classic data mining and online analytical processing (OLAP) come in. If we think of various systems of record as one-dimensional axes on a graph, bringing them together into a three-dimensional cube and analyzing a particular block within that cube is much more efficient. Rather than starting with the data and searching for questions to answer that might involve those points (as is tempting at times), we are able to start with a specific business question and use OLAP to answer it.

For example, assume I am a cosmetics manufacturer and want to know how much of my product actually goes out the door to consumers after it is sold to a distributor. I want to use that information to adjust my marketing efforts and potentially re-evaluate my production line. I have the following data points available by way of my existing business intelligence environment:

  • Production line data
  • Inventory balances in my warehouse
  • Marketing campaign data
  • Sales data from my company to the distributor
  • Sales data from the distributor to the end consumer

Rather than starting from one or two of these data points and throwing things against the wall to see what might stick, I can use OLAP capabilities to find the different relationships between these points, eventually driving my answer. Understand here that answering the initial question is simply a matter of reading one data point (the last one in this case)—however, a strategic approach that addresses the customer relationship is the end goal.
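The slice and roll-up operations described above can be sketched with plain Python dicts standing in for a cube. The product, months, and unit counts below are invented for illustration; a real OLAP engine would do this over far larger, multidimensional data.

```python
# A toy cube keyed by (measure, product, month); values are units.
cube = {
    ("sold_to_distributor", "lipstick", "Jan"): 1000,
    ("sold_to_distributor", "lipstick", "Feb"): 1200,
    ("sold_to_consumer",    "lipstick", "Jan"): 700,
    ("sold_to_consumer",    "lipstick", "Feb"): 900,
}

def slice_measure(cube, measure):
    """Slice: fix one dimension (the measure) and return the sub-cube."""
    return {k[1:]: v for k, v in cube.items() if k[0] == measure}

def rollup(sub_cube):
    """Roll up: aggregate away the remaining dimensions."""
    return sum(sub_cube.values())

to_distributor = rollup(slice_measure(cube, "sold_to_distributor"))
to_consumer = rollup(slice_measure(cube, "sold_to_consumer"))

# sell-through: share of distributor shipments that reached consumers
sell_through = to_consumer / to_distributor
print(f"sell-through: {sell_through:.0%}")
```

The business question ("how much goes out the door to consumers?") maps directly to one slice-and-compare operation, which is exactly the efficiency OLAP offers over unfocused exploration.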

One caveat here. OLAP may be considered a predecessor to data mining as currently understood, depending on which view of business intelligence you find appealing. Strictly speaking, traditional OLAP has been used for years in marketing, forecasting, and sales. Data mining capabilities at present far surpass what has traditionally been available in the OLAP sense.

Reference

Connolly, T. & Begg, C. (2015).  Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

From Decision Support to Business Intelligence

Decision support systems (DSS) predate business intelligence (BI) by several decades. Sprague and Carlson (1982) define a DSS as a “class of information system that draws on transaction processing systems and interacts with the other parts of the overall information system to support the decision-making activities of managers and other knowledge workers in organisations.” This definition is very nearly interchangeable with that of a business intelligence system. We can think of a DSS as more of a framework and model than an actual software package. DSS have often been aided by computer resources, such as databases and online analytical processing (OLAP), but they may also be offline. Any DSS involves a data or knowledge base, the business rules, and the interface itself. A DSS may be classified by one of the following drivers (Power, 2000):

  • Communication-Driven
  • Data-Driven
  • Document-Driven
  • Knowledge-Driven
  • Model-Driven

Business intelligence can be viewed as either the successor of DSS or the parent of it. I prefer to see it as a hybrid. As methods of collecting, storing, viewing, and analyzing data became more advanced, DSS came to be a specific part of a larger BI framework. A DSS is always dependent on “access to accurate, well-structured, and organized data” (Felsberger, Oberegger, & Reiner, 2016, p. 3). The various functions of business intelligence that have grown in recent years all serve to support the data points going into the DSS.

In a manufacturing environment, a practical example might be the evaluation and assignment of work centers. The knowledge base may include data such as what must go in, what must be produced, what constraints are in place, et cetera. Production and diagnostic data from the different work centers would be integrated via the BI capabilities of the organization, as well as forecasted production and schedule data. Business rules such as employee labor hours and machine lifecycle may also be included. The DSS would use all these data points to drive outputs; in this case, the desired outputs and decisions include production labor and machine scheduling that are most efficient to the company.
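A minimal sketch of that work-center decision might look like the following. The knowledge base is a dict of capacities and unit costs, the business rule is "fill the cheapest centers first," and the output is an assignment plan; all names and numbers are hypothetical, and a real DSS would optimize over far more constraints (labor hours, machine lifecycle, schedules).

```python
# Hypothetical knowledge base: each work center's capacity and unit cost.
work_centers = {
    "WC-1": {"capacity_units": 500, "cost_per_unit": 2.0},
    "WC-2": {"capacity_units": 300, "cost_per_unit": 1.5},
    "WC-3": {"capacity_units": 400, "cost_per_unit": 2.5},
}
demand = 700  # units to produce this period

def assign(work_centers, demand):
    """Business rule: allocate demand to the cheapest centers first."""
    plan = {}
    remaining = demand
    for name, wc in sorted(work_centers.items(),
                           key=lambda kv: kv[1]["cost_per_unit"]):
        take = min(wc["capacity_units"], remaining)
        if take > 0:
            plan[name] = take
        remaining -= take
        if remaining == 0:
            break
    return plan

plan = assign(work_centers, demand)
print(plan)  # cheapest-first allocation covering the 700-unit demand
```

The BI layer's job, in this picture, is to keep `work_centers` and `demand` populated with accurate, current data so the decision rule has something trustworthy to act on.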

References

Felsberger, A., Oberegger, B., & Reiner, G. (2016). A review of decision support systems for manufacturing systems.

Power, D. J. (2000). Web-based and model-driven decision support systems: concepts and issues. In proceedings of the Americas Conference on Information Systems, Long Beach, California.

Sprague, R. H., & Carlson, E. D. (1982). Building effective decision support systems. Prentice Hall Professional Technical Reference.

2PC and 3PC (Commit Protocols) in DBMS

Both the Two-Phase Commit (2PC) and Three-Phase Commit (3PC) protocols are popular in distributed DBMSs because they guarantee that either all nodes commit a transaction or none of them do. It is an all-or-nothing proposition. Both protocols share a Prepare (Voting) phase and a Commit/Abort phase, but 3PC adds a pre-commit phase in between: after every participating node has voted yes, the coordinator instructs participants to prepare to commit and collects acknowledgments before the commit is actually done. Compared to 3PC, Two-Phase Commit may be characterized as sending the command and hoping for the best, since the coordinator’s decision is transmitted with the commit phase. The return message after the transaction, from each participant, determines commit or abort status globally. The extra pre-commit step in 3PC is intended to clear up global commit/abort failure issues and blocking. This step polls for availability before anything is done, and the nodes can “act independently in the event of a failure” (Connolly & Begg, 2015). This is an important distinction. In 2PC, a single abort vote or missing acknowledgement undoes the entire process. In 3PC, assuming the pre-commit phase came back with a global commit vote, even a timeout or network partition would not cause a global abort.

Termination is where, according to Connolly and Begg (2015), the differences between these protocols are most critical. In 2PC it is possible to block: after the vote, the nodes wait on a commit or abort message from the coordinator before making the global commit. If a partition occurs, they are stuck until the coordinator re-establishes communication. A power failure is more catastrophic, as it may involve multiple nodes and the coordinator itself. In both 2PC and 3PC, backup procedures are activated, but 2PC participants remain in a blocked state. Of course, there are tradeoffs overall. The major issue with 3PC is communication overhead, which is to be expected with the extra phase (Kumar, 2016).
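The voting and decision phases of 2PC can be sketched in a few lines. The participants here are in-process objects standing in for networked nodes, so none of the timeout, partition, or blocking behavior discussed above appears; this only shows the unanimous-vote rule.

```python
def run_2pc(participants):
    # Phase 1 (voting): each participant prepares and votes yes/no
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): unanimous yes -> global commit, else global abort
    if all(votes):
        for p in participants:
            p.commit()
        return "COMMIT"
    for p in participants:
        p.abort()
    return "ABORT"

class Node:
    """Toy participant: votes as configured, tracks its local state."""
    def __init__(self, will_vote_yes):
        self.will_vote_yes = will_vote_yes
        self.state = "ready"
    def prepare(self):
        self.state = "prepared" if self.will_vote_yes else "refused"
        return self.will_vote_yes
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

print(run_2pc([Node(True), Node(True)]))   # COMMIT
print(run_2pc([Node(True), Node(False)]))  # ABORT: one "no" undoes it all
```

3PC would insert a round between the two phases in which the coordinator announces the pending commit and collects acknowledgments, which is exactly what lets participants proceed safely after certain failures.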

References

Connolly, T. & Begg, C. (2015).  Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Kumar, M. (2016). Commit protocols in distributed database system: A comparison. International Journal for Innovative Research in Science & Technology, 2(12), 277-281.

Concurrency: Optimistic or Pessimistic?

Optimistic concurrency control is the more complex of the two concurrency control methods. A transaction is timestamped when it begins, a process is run, and the change is validated at the end of the transaction block. If another transaction completed a conflicting change after this transaction’s start time, this transaction is aborted. In other words, the original record is unavailable because someone got to it first and completed their transaction. The risk here is wasted work: because more than one user may access a record at a time, conflicts are discovered only at validation, and the losing transaction must be rolled back and retried.

Conservative (or pessimistic) concurrency control is akin to checking out a book at the library, and is the simpler of the two methods. Once a transaction begins, the record is locked and no one else can modify it. In the library example, I go to the library to check out a book (the record) in order to read it (modify it); if it is there (no one has initiated a change), I may check it out. If the book is not there (someone has locked it and is modifying it), I cannot. It is a first-come, first-served method that ensures no two users modify a record at the same time.

Each has its risks and rewards. Optimistic concurrency control tends to be used in environments without much contention for a single record of truth. It allows a higher volume of transactions per hour. However, as the name implies, the method essentially hopes for the best then deals with the problem if and when it arises. On the other hand, pessimistic concurrency control virtually guarantees that all transactions will be executed correctly and that the database is stable. It is a simpler decision tree: either abort if locked or commit if unlocked. All the drawbacks of pessimistic concurrency control lie in timing: fewer transactions per hour and limited access to the data depending on the number of users making transactions.

One specific advantage of optimistic locking that isn’t always thought of immediately is evident when a user cannot maintain a consistent connection to the database. Assume for a moment that a user locks a table in a remote database for updating and the connection is severed (through a server reset, ISP woes, et cetera). The user reconnects and is back in the database, but the previous session was not properly closed, so we have a phantom user with the record still locked, blocking everyone else until the orphaned lock is cleared. Under optimistic control, no lock was ever held, so other users carry on unaffected and the reconnecting user simply validates again at commit time.
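One common way to implement optimistic control is a version (or timestamp) column checked at commit time. This is a generic sketch, not any particular DBMS's mechanism: read the record and its version, do the work without holding a lock, and validate at commit that the version has not moved on.

```python
class VersionConflict(Exception):
    """Raised when validation fails: someone else committed first."""

record = {"value": "original", "version": 1}

def read(record):
    return record["value"], record["version"]

def commit(record, new_value, version_seen):
    # validation: the version moved on if another commit happened first
    if record["version"] != version_seen:
        raise VersionConflict("record changed since read; retry")
    record["value"] = new_value
    record["version"] += 1

# Two sessions read the same version; the second commit fails validation.
_, v_a = read(record)
_, v_b = read(record)
commit(record, "session A's change", v_a)   # succeeds, version -> 2
try:
    commit(record, "session B's change", v_b)
except VersionConflict:
    print("session B must re-read and retry")
```

Note that a dropped connection leaves nothing behind here: no lock is ever taken, so the failure mode is simply a retry at validation rather than an orphaned lock.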

Reference

Connolly, T. & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.