The Role of Data Brokers in Healthcare

In courses I’ve led before, we looked at the disjointed data privacy regulations in the United States alongside current events in data privacy (Facebook, Cambridge Analytica, personal genomics testing, and the like). The underlying issue recurs in any setting: giving a single entity a large amount of data inevitably raises questions of ethics, privacy, security, and motivation.

Where healthcare data brokers are concerned, the stated goals differ by type of data. Where patients interact with the data directly, the goal is to give them “more control over the data” (Klugman, 2018) and perhaps bypass the clunky patient portals set up by providers. For data that is not personally identifiable, the goals can be far less altruistic, such as becoming a player in a multi-billion-dollar market (Patientory, 2018) or contributing to health insurance discrimination (Butler, 2018). I am not naïve enough to think that every exercise in healthcare must be altruistic, and the concept of insurance has a certain modicum of discrimination at its core; however, weaponizing the data to aid unfair practices is beyond the pale.


From a data engineering perspective, a broker in the truest sense of the word may act as a clearinghouse between providers with disparate systems, enabling the seamless transfer of patient data between those providers without putting the burden of ETL on either of them. Whereas XML formatting and other portability developments have allowed providers using different EHR systems to port patient data, a data brokerage would be an independent party working on the patient’s behalf, handling the technical details of integrating their data across all providers and interested parties. Beyond holding the data, the broker would be responsible for ensuring each provider and biller has access to the same single source of truth on that particular patient.
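
To make the clearinghouse idea concrete, here is a minimal sketch in Python, assuming two hypothetical provider feeds with made-up field names (no real EHR schema is implied), of how a broker might normalize both into one canonical record per patient:

# Minimal sketch: normalize two hypothetical provider formats into one
# canonical patient record. All field names are illustrative.

def from_provider_a(rec):
    return {"patient_id": rec["mrn"], "dob": rec["birth_date"], "dx_codes": set(rec["diagnoses"])}

def from_provider_b(rec):
    return {"patient_id": rec["PatientID"], "dob": rec["DOB"], "dx_codes": set(rec["ICD"].split(";"))}

def single_source_of_truth(normalized_records):
    # Collapse normalized records into one entry per patient.
    truth = {}
    for rec in normalized_records:
        entry = truth.setdefault(rec["patient_id"], {"dob": rec["dob"], "dx_codes": set()})
        entry["dx_codes"].update(rec["dx_codes"])
    return truth

truth = single_source_of_truth([
    from_provider_a({"mrn": "123", "birth_date": "1980-02-14", "diagnoses": ["E11.9"]}),
    from_provider_b({"PatientID": "123", "DOB": "1980-02-14", "ICD": "E11.9;I10"}),
])
print(truth)  # one merged record: dob plus the union of diagnosis codes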

This would, of course, require a data warehouse of sorts to hold the single source, and it puts the questions of security, privacy, transparency, and ethics on the broker. The broker has to make money to survive, so a business model must emerge, and it would not be immune to market forces. The aggregation of so much patient data in one place would be too great a temptation to leave unmonetized as de-identified commodities, so a secondary market would emerge and lead to the same issues cited above. Call me pessimistic, but the best predictor of future actions is past behavior, and thus far the companies holding massive amounts of data about our lives either can’t keep it secure from breaches or are perfectly happy selling it while turning a blind eye to what is done with it.


Butler, M. (2018). Data brokers and health insurer partnerships could result in insurance discrimination. Retrieved from

Klugman, C. (2018). Hospitals selling patient records to data brokers: A violation of patient trust and autonomy. Retrieved from

Patientory. (2018). Data brokers have access to your information, do you? Retrieved from

Data-in-Motion or Data-at-Rest?

Reading the available material on data-in-motion reminds me of when I first read about data lakes over data warehouses, or NoSQL over SQL: the urgency of the former and the outright danger of the latter are both overblown. Put simply, data-in-motion provides real-time insights. Most of our analytics efforts across data science apply to stored data, be it years, weeks, or hours old. Analyzing data-in-motion means extracting insights as the data rolls in, without storing it first. This is one workstream of dual efforts that together tell the whole picture: historical data provides insight into what happened and material for training future detection, while real-time data shows what’s happening right now.
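
As a toy illustration of analytics on the stream, a few lines of pure Python (no Spark or Kafka; the window size and threshold are arbitrary) that flag anomalies as values arrive rather than after storage:

from collections import deque

def rolling_alerts(readings, window=5, threshold=2.0):
    # Flag readings that deviate sharply from the recent rolling mean,
    # processing each value as it arrives instead of storing the stream.
    recent = deque(maxlen=window)
    for value in readings:
        if len(recent) == window:
            mean = sum(recent) / window
            if abs(value - mean) > threshold:
                yield value, mean
        recent.append(value)

feed = [20.1, 20.3, 20.2, 20.4, 20.3, 25.9, 20.2]  # a sudden spike
for value, mean in rolling_alerts(feed):
    print(f"anomaly: {value} vs rolling mean {mean:.1f}")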

Churchward (2018) argues a fair point here: once data is stored, it isn’t real-time by definition. But taking that argument to its logical extreme by asking whether we would like to make decisions on data three months old is a stretch. While it is true that matters such as security and intrusion detection demand real-time detection, categorically dismissing data-at-rest analytics is reckless. It vilifies practices that are the foundation of any comprehensive analytics strategy. Both data-at-rest and data-in-motion are valuable drivers of any business intelligence effort that seeks to paint a total picture of a phenomenon.

There are, of course, less frantic cases to be made for data-in-motion. Patel (2018) illustrates a critical situation on an oil drilling rig, in which information even a few minutes old can be life-threatening. In this piece, written to showcase Spotfire X, there may be some conflation of monitoring and analytics. The dashboard shown on the website and the written scenario paint more a picture of monitoring and dashboarding than the sort of analytics we would deploy Spark or Kafka for. I don’t need a lot of processing power to tell me that a temperature sensor’s readings are increasing.

Performing real-time analytics on data-in-motion is an intensive task requiring considerable computing resources. Scalable solutions such as Spark and Kafka are available but may eventually hit a wall. Providers such as Logtrust (2017) differentiate themselves by pointing out the potential shortfalls of those solutions and offering a single platform for both data-in-motion and data-at-rest.


Churchward, G. (2018). Why “true” real-time data matters: The risks of processing data-at-rest rather than data-in-motion. Retrieved from

Logtrust. (2017). Real-time IoT big data-in-motion analytics.

Patel, M. (2018). A new era of analytics: Connect and visually analyze data in motion. Retrieved from

Challenges of Health Informatics in the Cloud

Alghatani and Rezgui (2019) present a framework for remote patient monitoring via cloud architecture. The primary intention is to consolidate disparate data sources and break down walls between data silos, improving cost effectiveness, response time, and quality of care. The cloud architecture involves the database itself, user interface(s), and artificial intelligence. The cloud is used by four primary groups: patients, hospitals, insurance companies, and controllers (system stewards).

The authors outline a number of advantages here. Telemedicine can be a great thing but has a number of barriers to overcome, not the least of which are cost, culture, political environment, and infrastructure. The cloud architecture seeks to mitigate the cost and infrastructure issues. IT resources can be extended dynamically based on need and the decentralized nature of the system allows for better scalability, flexibility, and reliability.

There are a number of challenges to be considered. The authors highlight seven:

  1. Security
  2. Data management
  3. Governance
  4. Control
  5. Reliability
  6. Availability
  7. Business continuity

An extensive discussion on data collection challenges is presented, outlining a number of possible methods for collection and synchronization. There must be an assumption that no device in this architecture will maintain constant contact with the cloud, so consistency models must be taken into consideration. One option is for each device to maintain local storage and upload to the cloud once a stable connection is available, as sketched below. Another option is a side network of the devices’ own, much like the Whispernet connectivity on early Amazon Kindle devices. A third and final option—also the authors’ proposal—is the use of fog computing as a layer between these devices and the cloud.
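
A minimal sketch of the first option, store-and-forward with local persistence (the send_to_cloud uplink is a hypothetical stand-in, and eventual consistency is assumed):

import json, time

class BufferedUplink:
    # Device-side store-and-forward: readings persist locally and are
    # flushed to the cloud only when a stable connection is available.
    def __init__(self, path="readings.jsonl"):
        self.path = path

    def record(self, reading):
        reading["ts"] = time.time()
        with open(self.path, "a") as f:
            f.write(json.dumps(reading) + "\n")

    def flush(self, send_to_cloud, connected):
        if not connected:
            return  # keep buffering; the cloud copy is eventually consistent
        with open(self.path) as f:
            for line in f:
                send_to_cloud(json.loads(line))
        open(self.path, "w").close()  # clear the local buffer after upload

uplink = BufferedUplink()
uplink.record({"pulse": 72})
uplink.flush(print, connected=True)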

Privacy is always an issue, and cloud architecture muddies the waters a bit: there is no locked-down on-premises server holding the personally identifiable information. Banks and hospitals have typically been the slowest to adopt cloud computing, in my experience. As Alghatani and Rezgui (2019) note, governance and control are concerns here. The Health Insurance Portability and Accountability Act (HIPAA) requires confidentiality for all individually identifiable health information; in 2013, this protection was extended to genetic information by way of the Genetic Information Nondiscrimination Act (GINA). While the rules prohibit use of genetic information for underwriting purposes, there is no restriction on the sharing or use of genetic information that has been de-identified (National Human Genome Research Institute, 2015). De-identification is not entirely foolproof; there are cases in which the data can be re-identified (Rosenbaum, 2018).


Alghatani, K., & Rezgui, A. (2019). A cloud-based intelligent remote patient monitoring architecture. Paper presented at the International Conference on Health Informatics & Medical Systems (HIMS’19), Las Vegas, NV.

National Human Genome Research Institute. (2015). Privacy in genomics. Retrieved from

Rosenbaum, E. (2018). Five biggest risks of sharing your DNA with consumer genetic-testing companies. Retrieved from

Single-Node Hadoop Installation on Ubuntu 16.04

When embarking on a new build of most anything, I tend to use online how-to guides published by other bloggers who have encountered specific issues. I recently completed a single-node Hadoop installation on a Linode Ubuntu box, after several unsuccessful attempts at a three-node setup, and am posting my steps here in case they help someone attempting the same.

I relied heavily on Parth Goel’s work and this guide follows it nearly verbatim.

Part 1: Provision the server, harden, and install Java

In my case, I have a number of Linode boxes running already and added one more, running Ubuntu 16.04. I also followed the guide for securing a Linode server.

After provisioning and booting up, I completed the following steps:

Set hostname in /etc/hosts (hadoop-master)
sudo nano /etc/hosts

Add the following line:

<your server ip> hadoop-master
Set hostname in /etc/hostname (hadoop-master)
sudo nano /etc/hostname

Replace the contents with just the hostname:

hadoop-master
Harden per Linode recommendations

Log in to hadoop-master as root.

adduser hduser
adduser hduser sudo
sudo addgroup hadoop
sudo usermod -a -G hadoop hduser

Create SSH key pairs on your local machine (OS X for me) and copy the public key to the Linode box. On OS X, after creating the key pair, run the following command:

ssh-copy-id -i <key name> hduser@<server ip>

Next, disable root login, change the SSH port, restrict SSH to IPv4, and disable password login. You would be surprised at how many brute-force attacks a server is subjected to every minute. If you wind up locking yourself out, there is emergency LISH access.

sudo nano /etc/ssh/sshd_config

In the config file, you’ll change a few lines to look like this:

Port 2222
PermitRootLogin no
PasswordAuthentication no
UsePAM no

Save and exit the text editor. One last line restricts SSH to IPv4, then restart sshd:

echo 'AddressFamily inet' | sudo tee -a /etc/ssh/sshd_config
sudo systemctl restart sshd

Next, install UFW. In my case, I have a static IP I can connect from, so I whitelisted it; nothing else can hit that server. I suppose I could have skipped some of the hardening since I was only whitelisting one IP address, but better safe than sorry. The last thing I want is my little sandbox being drafted into some DoS botnet.

sudo apt-get install ufw
sudo ufw allow from <vpn ip>
sudo ufw enable
Install Java and reboot

This was one of the biggest issues I ran into with various online how-to guides: it was nearly impossible to find a working combination of Hadoop, Ubuntu, and Java versions, particularly when many guides relied on the Oracle JDK. This step uses the distribution default and doesn’t mess around with custom packages.

sudo apt-get update
sudo apt-get install default-jdk
sudo reboot

Part 2: Create localhost SSH access for Hadoop and install

Once this server boots up again, log in as hduser. The next step creates a key pair on hadoop-master. Leave the filename and passphrase prompts blank; otherwise you’ll have more work ahead of you. Note that we add the SSH port to our SSH command, since we changed it earlier, and we will have to add it to the Hadoop environment variables as well.

ssh-keygen -t rsa 

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

chmod 0600 ~/.ssh/authorized_keys 

ssh -p 2222 localhost

Once you’ve confirmed that works, exit the SSH session and get back to your hadoop-master hduser command line. Now it’s time to install Hadoop.



First download the Hadoop 2.7.3 archive (the Apache archive still hosts it), then unpack it:

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz

tar xvzf hadoop-2.7.3.tar.gz

sudo mkdir -p /usr/local/hadoop

cd hadoop-2.7.3/

sudo mv * /usr/local/hadoop

sudo chown -R hduser:hadoop /usr/local/hadoop

Part 3: Hadoop configuration

Variables configuration

Here, you have to do some checking to make sure your Java installation path is what Hadoop expects.

update-alternatives --config java

In this case we are looking for /usr/lib/jvm/java-8-openjdk-amd64.

Edit your bashrc file first.

sudo nano ~/.bashrc

Add the following:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

The PATH line lets you call hdfs and the start/stop scripts directly in Part 4.

Now edit your hadoop-env file.

sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Edit the following line to look like this:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Add this line:

export HADOOP_SSH_OPTS="-p 2222"
Hadoop XML configuration files
sudo mkdir -p /app/hadoop/tmp

sudo chown hduser:hadoop /app/hadoop/tmp

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following between the <configuration> tags. (These are the stock single-node values; adjust the filesystem URI if your setup differs.)

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml

Add the following between the <configuration> tags. (localhost:54311 is the value single-node guides of this vintage commonly use; adjust if yours differs.)

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode

sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode

sudo chown -R hduser:hadoop /usr/local/hadoop_store

sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add the following between the <configuration> tags, pointing the name and data directories at the folders just created.

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

Add the following between the <configuration> tags; this is the minimal setting that lets MapReduce jobs shuffle under YARN.

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

Part 4: Reboot and fire it up!

After reboot, format HDFS.

hdfs namenode -format

Start the HDFS and YARN services (the scripts live in $HADOOP_HOME/sbin, which the PATH addition in Part 3 covers), then confirm the daemons are running:

start-dfs.sh
start-yarn.sh
jps

Test with a simple job.

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 5

You will want to visit http://<server-ip>:50070/ and http://<server-ip>:8088 to see the NameNode and ResourceManager web consoles, respectively.

MongoDB and CouchDB in Healthcare Applications


Both MongoDB and CouchDB are document databases, are often regarded in similar fashion, and have been used widely in healthcare applications. Their similarity to relational database systems usually allows for an easier learning curve and integration with in-place systems. They have been tested against XML and relational databases (e.g., Freire et al., 2016) and used in conjunction with them (e.g., Groce, 2015).

With respect to electronic health record (EHR) management, Freire et al. (2016) tested Couchbase performance with millions of EHRs covering both administrative and epidemiological data points. The authors noted that Couchbase is specifically designed for distributed computing, a strength in this case. A number of datasets were set up for benchmarking, and specific queries were written in each database language to answer health-specific questions. Response times varied widely, but the XML-based solutions consistently underperformed both MySQL and Couchbase. Against MySQL, Couchbase delivered faster response times. Despite its space and indexing-time requirements, Couchbase emerged as the top performer in the test.

MongoDB may be used to supplement and scale up SQL-based deployments, as outlined by Groce (2015). In this case, MongoDB was used to cut down on latency and performance overhead at Doctoralia, a company that connects patients with medical providers. Prior to the deployment, a single SQL server in one geographic location handled all the load. As the organization expanded to different countries and data volume increased, it became clear that a scaled approach was needed.

MongoDB allowed Doctoralia to deploy servers to each geographic location (reducing geographic latency) and frontload queries and aggregates to these servers (reducing processing latency). This precompute process also took much of the load off the central SQL server. The distributed framework allows Doctoralia to scale hardware needs up or down as demand requires, and replication allows for high availability with little to no downtime or lack of response seen by end users. Deploying a new server to handle new load is done in a matter of minutes. Doctoralia measures the MongoDB deployment in terms of speed and availability, and has considered it a great success.
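
A hedged sketch of the pattern in Python with pymongo (the hosts, database, and collection names are all hypothetical, not Doctoralia’s actual deployment): route reads to the nearest replica-set member, and precompute an aggregate into a summary collection so the hot read path never touches the raw data.

from pymongo import MongoClient

# Read from the nearest replica-set member to cut geographic latency
# (hosts and replica-set name are invented for illustration).
client = MongoClient("mongodb://eu-node,us-node/?replicaSet=rs0&readPreference=nearest")
db = client["doctoralia_like"]

# Precompute provider rating averages into a summary collection ($out
# materializes the result), taking load off the primary at query time.
db.reviews.aggregate([
    {"$group": {"_id": "$provider_id", "avg_rating": {"$avg": "$rating"}}},
    {"$out": "provider_ratings"},
])

# End users read the cheap, precomputed view.
for doc in db.provider_ratings.find().sort("avg_rating", -1).limit(10):
    print(doc)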


Couchbase. (2017). NoSQL for healthcare. Retrieved from

Freire, S. M., Teodoro, D., Wei-Kleiner, F., Sundvall, E., Karlsson, D., & Lambrix, P. (2016). Comparing the performance of NoSQL approaches for managing archetype-based electronic health record data. PLoS ONE, 11(3).

Groce, D. (2015). How MongoDB helped a healthcare firm scale horizontally. Retrieved from

MongoDB. (2019). Healthcare. Retrieved from

Analytics Theories for Medical Diagnosis

Khivsara (2018) presents a number of basic analytics theories. Of these, I believe four are most relevant for medical diagnosis: clustering, association rules, regression, and textual analysis.


Association rules are nothing more than finding causal structures and patterns between objects in order to establish some sort of logical relationship. It is a machine learning analog to what doctors do on a regular basis in making diagnoses. Picture an emergency room triage area, where patients are sorted and prioritized based on symptoms. In place of a nurse, perhaps on particularly busy nights, a self-service kiosk would let patients select all the symptoms they are exhibiting; those symptoms would generate potential diagnoses, the severity of which would determine priority in the night’s order.
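
A toy sketch of the mechanics (the symptom sets and the support/confidence arithmetic are illustrative, not clinical):

# Hypothetical triage visits: each set holds symptoms reported at a kiosk.
visits = [
    {"fever", "cough", "fatigue"},
    {"fever", "cough"},
    {"chest pain", "shortness of breath"},
    {"fever", "fatigue"},
    {"fever", "cough", "shortness of breath"},
]

def support(itemset):
    # Fraction of visits containing every symptom in the itemset.
    return sum(itemset <= v for v in visits) / len(visits)

# Confidence of the rule {fever, cough} -> {fatigue}.
antecedent, consequent = {"fever", "cough"}, {"fatigue"}
confidence = support(antecedent | consequent) / support(antecedent)
print(f"support={support(antecedent | consequent):.2f}, confidence={confidence:.2f}")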

Moving a step beyond simple associations, let us examine clustering. Assume two risk factors for chronic disease (e.g., unhealthy diet and tobacco use) were quantified for a population of patients and plotted on a two-axis graph. A simple review of the graph would show individuals plotted along the spectra of diet and tobacco use. Rather than being evenly dispersed, the data points would fall into two or more groupings, depending upon the population. K-means clustering would classify those data points (the individuals) into different risk groups based on where they fall on the chart. K-means clustering is most useful in healthcare applications where similarities between patients must be quantified and cohorts established.
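
A brief sketch with scikit-learn, using synthetic risk scores in place of real patient data:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 0-10 risk scores for unhealthy diet and tobacco use, drawn
# as two loose groups of patients.
X = np.vstack([
    rng.normal([2, 1], 1.0, size=(50, 2)),
    rng.normal([7, 8], 1.0, size=(50, 2)),
])

# K-means assigns each patient to the nearest of k=2 cluster centers.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(labels))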

Going a step further and putting quantitative measures on the relationship between variables and predicted values, we have regression. In healthcare, the most common use of regression relates to healthcare costs. As for making diagnoses, logistic regression in particular can help by classifying patients based on a number of known factors. Imagine a known regression equation for predicting diabetes risk from multiple input variables.
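
A minimal sketch of the idea, assuming made-up coefficients and synthetic inputs (BMI, age, fasting glucose) rather than any validated clinical model:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic patients: BMI, age, fasting glucose; the labels come from a
# made-up rule purely for illustration.
X = np.column_stack([
    rng.normal(27, 5, 200),      # BMI
    rng.integers(20, 80, 200),   # age
    rng.normal(100, 20, 200),    # fasting glucose (mg/dL)
])
y = (0.1 * X[:, 0] + 0.02 * X[:, 1] + 0.05 * X[:, 2] + rng.normal(0, 1, 200) > 9).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
print("predicted risk:", model.predict_proba([[31.0, 55, 130.0]])[0, 1])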

Finally, let us examine textual analysis. The other three theories mentioned here rely on structured data, yet structured data is only a fraction of what is collected when a patient sees a provider. The ability to utilize unstructured data, rife with context and nuance, is perhaps the biggest untapped potential in healthcare analytics. The confluence of textual analysis and natural language processing (NLP) allows unstructured data from sources such as patient records and provider dictation to become part of the picture in predictive modeling and coexist with structured data.
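
As a small sketch of the first step, turning free-text notes into features a model can use (the dictation snippets are invented):

from sklearn.feature_extraction.text import TfidfVectorizer

# Invented snippets of provider dictation, standing in for real notes.
notes = [
    "patient reports persistent cough and mild fever for three days",
    "follow-up for type 2 diabetes, fasting glucose improved",
    "chest pain on exertion, referred for cardiac stress test",
]

# TF-IDF converts unstructured text into weighted feature vectors that
# can sit alongside structured fields in a predictive model.
X = TfidfVectorizer(stop_words="english").fit_transform(notes)
print(X.shape)  # (3 documents, vocabulary-sized feature space)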



Variables and Measures, or People and Goals?

Just as any IT implementation shouldn’t exist for its own sake—that is, it should serve a business purpose within the sponsoring organization rather than simply be a cost center—quantitative analysis within an organization should likewise serve a business purpose. There must be some reason a widget manufacturer commissions a study of its customer base; it isn’t done just to keep the research division busy. Research questions and hypotheses typically exist and guide the methodology.

In my own research consulting work, I have often started with broad research questions that then drive narrower research questions and/or particular segment analyses. At the analysis level, the variables and desired outcomes are examined to determine which test to use. From that point, it is easy to get lost in the vocabulary of quantitative analysis and forget that the work is being done to answer a business question.


For example, assuming the National Widget Company commissioned that study of its customer base, I could simply report the measures of central tendency and leave them to interpret why there’s a difference between the mean and median ages. But a true data scientist/analyst helps explain why the numbers mean what they do, and ensures the business users don’t get lost in the lingo. I would take the time to explain that the mean age is 42.5, the median age is 37, and that the gap indicates the age distribution is skewed: a tail of older customers, and possibly some outliers, is pulling the mean up. I would then turn back to them and ask what this means for their business. Remember that as the analyst, we are not the business subject-matter experts. Offering the numbers to the business and asking them to provide context creates more opportunities for synergy.
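
A two-line illustration of that mean/median gap (the ages are made up):

from statistics import mean, median

# Hypothetical customer ages with a right tail of older customers.
ages = [24, 28, 31, 33, 35, 37, 39, 41, 44, 52, 61, 68, 77]
print(mean(ages), median(ages))  # the mean sits above the median under right skew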

Consider another example involving correlation. Two variables, or points of interest as we would call them: widget sales and distance from a major airport. A moderate negative correlation (r = -0.49) is found. First we must caution against equating correlation with causation. We would then pivot away from the r-value and put the focus back on the variables of interest: it appears that individuals who live closer to a major airport are more likely to buy these widgets. Again, we would put the question back to the business to open a conversation about why these variables might be related and what the possible covariates are.
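
Computing the r-value itself is the trivial part; Python 3.10+ can do it from the standard library (the paired observations here are invented):

from statistics import correlation

# Invented pairs: distance to a major airport (km) vs widgets purchased.
distance = [2, 5, 8, 12, 15, 20, 25, 30, 40, 55]
sales = [14, 12, 13, 9, 10, 8, 9, 5, 6, 3]
print(round(correlation(distance, sales), 2))  # negative: closer means more sales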

In either case, and in any analytics situation, proper use of visualization is paramount. In the latter example, it is much easier to see what an r-value means on a scatterplot than to have it explained verbally. Data visualization bridges many gaps that numbers and words simply cannot fill. These are the languages of dashboards, executive roll-ups, and KPIs.

Overall, the primary thing to remember in keeping an audience engaged in a discussion around quantitative research is this: the variables of interest are the reason for the study, not the numbers themselves. Keep the focus on what matters.

The Privacy Divide: Social Media and Personal Genomic Testing


With every advance in technology comes a trade-off of some kind. Where the use of personally-identifiable information is concerned, the trade-offs typically involve the exchange of privacy and confidentiality for a non-monetary benefit. In the early days of social media, conventional wisdom said the product was the service. However, we have seen over the last decade that the users of such platforms are the products, the perceived benefits merely carrots on sticks to keep the products (users) engaged in the cycle. We willfully pour details of ourselves into various social media outlets, despite the documented bad behaviors by giants like Facebook, and mostly remain complacent in having our personal data packaged and leveraged against us by various business interests.

However, in conversations I’ve had around personal genomic testing (PGT), I’ve noticed that many are quick to cite data privacy and risk as a key reason not to participate. Think about this: on one hand, we have evidence that Facebook has been using our data in dubious ways, yet we keep pouring ourselves into it (McNamee, 2019). On the other hand, the potential benefits of PGT are outweighed by a fear of that data potentially being misused.

My purpose is not to minimize the potential hazards around PGT. Consider the following risks: (a) hacking; (b) profit or misuse by the company or partners; (c) limited protection from a narrow scope of laws; (d) requests from state and federal authorities; and (e) changing privacy policies or company use due to mergers, acquisitions, bankruptcies, et cetera (Rosenbaum, 2018). In the face of potential benefits from PGT, these are serious caveats. But read that list outside of this context, and it is equally applicable to the data we generate and provide to social media outlets on a daily basis.

As of yet, privacy regulations around social media exist only within the context of the company itself—that is, there are no substantial federal regulations in the US on the matter, only the GDPR in the EU (St. Vincent, 2018). Where health information is concerned, the US does have slightly more mature federal regulation. The Health Insurance Portability and Accountability Act (HIPAA) requires confidentiality for all individually identifiable health information; in 2013, this protection was extended to genetic information by way of the Genetic Information Nondiscrimination Act (GINA). While the rules prohibit use of genetic information for underwriting purposes, there is no restriction on the sharing or use of genetic information that has been de-identified (National Human Genome Research Institute, 2015). De-identification is not entirely foolproof; there are cases in which the data can be re-identified (Rosenbaum, 2018).

The incongruence is puzzling. In the case of social media, users willfully provide a wealth of data points on a regular basis to companies that repackage and monetize that data for dubious purposes, in the absence of meaningful US legislation to protect it. In the case of PGT, where at least HIPAA and GINA have a rudimentary level of codified protection, users’ hesitance appears to be much more pronounced.


McNamee, R. (2019). Zucked: Waking up to the Facebook catastrophe. New York: Penguin.

National Human Genome Research Institute. (2015). Privacy in genomics. Retrieved from

Rosenbaum, E. (2018). Five biggest risks of sharing your DNA with consumer genetic-testing companies. Retrieved from

St. Vincent, S. (2018). US should create laws to protect social media users’ data. Retrieved from

Where Clinical, Genomic, and Big Data Collide

One of the early proving grounds of big data is healthcare, and the constant cycle of insights catching up to volume hasn’t changed since the early days of the electronic patient record. Early healthcare data typically involved structured metrics such as ICD-9 codes and other billing data, which yielded very little clinical detail. The introduction of new data points, both structured and unstructured, has opened the door to many new analytics possibilities. While the possibilities are there, “few viable automated processes” exist that can “extract meaning from data that is diverse, complex, and often unstructured” (Barlow, 2014, p. 18). Indeed, the gap continues to widen between the “rapid technological progress in data acquisition and the comparatively slow functional characterization of biomedical information” (Cirillo & Valencia, 2019, p. 161).

With so much available, a hospital or healthcare provider may find it difficult to determine a place to start, and either ignore the possibilities altogether or engage in initiatives that are not impactful to clinical quality or costs. There are five broad areas in which value can be delivered: clinical operations, payment & pricing, R&D, new business models, and public health; data are gathered from four broad sources including clinical, pharmaceutical, administrative, and consumer (Barlow, 2014, p. 21).

As of late, genomics has entered the conversation as both a consumer product (e.g., 23andMe or Ancestry, known as personal genomic testing) and clinical practice. It is one thing to prescribe a medication based on a patient’s chart history, but an entirely different patient experience when a prescription is tailored to a patient’s particular metabolism, genetic predispositions, and risks (Barlow, 2014, p. 19). The wealth of patient-generated health data from a growing number of consumer devices has already contributed to the rise of “Personalized Medicine” (Cirillo & Valencia, 2019, p. 162), and the introduction of genomic data will move the needle even further. One can’t get much more personalized than a genetic footprint.

One debate around personal genomic testing is the value it provides when given directly to consumers without the benefit of clinician involvement. While the benefits of such testing include lifestyle changes that mitigate future disease risk, consumers are also prone to misinterpretation that may lead to unnecessary medical treatment (Meisel et al., 2015, p. 1). Beyond future risk, a recent study found the interest around personal genomic testing had a great deal to do with family or individual history of a particular affliction (Meisel et al., 2015). Consumers are mindful of explaining current risks and phenomena, not just predicting them.


Barlow, R. D. (2014). Great expectations for big data. Health Management Technology, 35(3), 18-21.

Cirillo, D., & Valencia, A. (2019). Big data analytics for personalized medicine. Current Opinion in Biotechnology, 58, 161-167.

Meisel, S. F., Carere, D. A., Wardle, J., Kalia, S. S., Moreno, T. A., Mountain, J. L., . . . Green, R. C. (2015). Explaining, not just predicting, drives interest in personal genomics. Genome Medicine, 7(1), 74.

Big Data: Human vs Material Agency


Lehrer, Wieneke, vom Brocke, Jung, and Seidel (2018) studied four companies and their use of big data analytics in the business. Common to all companies in the case study was a two-layer service innovation process: first, automated customer-oriented actions based on triggers and preferences; and second, the combination of human and material agency to produce customer-oriented interactions. The latter is of particular interest, as popular opinion sometimes casts big data as a wholesale replacement for human interaction. As this study illustrates, the material agency (technology) exists to supplement the human agency.

One particular illustration is Company A, “the Swiss subsidiary of a multinational insurance firm that offers private individuals and corporate customers a broad range of personal, property, liability, and motor vehicle insurance” (Lehrer et al., 2018). Through a recent implementation of big data analytics tools and methodologies, the company has created more efficient ways of interacting and has supplemented employees’ customer service with better insights. In the latter case, the material agency guides employees’ own interactions with customers. That is, “the employees’ skill sets, experiences, and customer contact strategies [interact] with the material features of BDA to create new practices” (Lehrer et al., 2018, p. 438). This may include a number of sales- and service-oriented cues, such as social media or online shopping data points that signal a major life event. On the other front, consider how the stream of data from various customer devices (e.g., a home security system, automobile OBD-II data trackers, smartphone location data) provides a wealth of data points that machine learning methods can use to learn what typical behavior looks like for a customer and then flag anomalies. Personally, my home security system now knows it is unusual for me to leave a particular geographic region without arming the system; when that occurs, I receive an alert reminding me to arm it.
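
A minimal sketch of that anomaly idea (the readings and threshold are invented; a production system would learn a much richer baseline):

import statistics

# Hypothetical history of how far from home the system is disarmed (km).
history = [1.2, 0.8, 2.5, 1.0, 3.1, 1.8, 2.2, 0.9, 1.5, 2.8]
mu, sigma = statistics.mean(history), statistics.stdev(history)

def is_unusual(reading, threshold=3.0):
    # Flag readings more than `threshold` standard deviations from baseline.
    return abs(reading - mu) / sigma > threshold

print(is_unusual(25.0))  # far outside the usual region: send an arming reminder
print(is_unusual(2.0))   # within normal behavior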


Lehrer, C., Wieneke, A., vom Brocke, J., Jung, R., & Seidel, S. (2018). How big data analytics enables service innovation: Materiality, affordance, and the individualization of service. Journal of Management Information Systems, 35(2), 424-460. doi:10.1080/07421222.2018.1451953