Single-Node Hadoop Installation on Ubuntu 16.04

When embarking on a new build for most anything, I tend to use online how-to guides published by other bloggers who have encountered specific issues. I recently completed a single-node Hadoop installation on a Linode Ubuntu box, after several unsuccessful attempts at a three-node setup, and am posting my steps here in case they will be of help to someone attempting to do what I did.

I relied heavily on Parth Goel’s work and this guide follows it nearly verbatim.

Part 1: Provision the server, harden, and install Java

In my case, I have a number of Linode boxes running already and added one more, running Ubuntu 16.04. I also followed the guide for securing a Linode server.

After provisioning and booting up, I completed the following steps:

Set hostname in /etc/hosts (hadoop-master)
sudo nano /etc/hosts

Add the following line:

<your server ip> hadoop-master
Set hostname in /etc/hostname (hadoop-master)
sudo nano /etc/hostname

Replace the contents of this file with the hostname by itself:

hadoop-master
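To apply the new hostname without a reboot, this should work on Ubuntu 16.04 (hostnamectl ships with systemd):

sudo hostnamectl set-hostname hadoop-master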
Harden per Linode recommendations

Log in to hadoop-master as root.

adduser hduser
adduser hduser sudo
sudo addgroup hadoop
sudo usermod -a -G hadoop hduser
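A quick check confirms hduser ended up in both the sudo and hadoop groups:

id hduser
# the groups list should include both sudo and hadoop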

Create an SSH key pair on your local machine (OS X for me) and copy the public key to the Linode box; a key-generation example follows the next command. On OS X, after creating the key pair, run the following command:

ssh-copy-id -i <key name> hduser@<server ip>
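If you haven't generated the key pair yet, something like this on the local machine will do (the key name is your choice):

ssh-keygen -t rsa -b 4096 -f ~/.ssh/<key name>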

Next, disable root login, change the SSH port, disable password login, and restrict SSH to IPv4. You would be surprised at how many brute-force attacks a server is subjected to every minute. If you wind up locking yourself out, there is emergency Lish console access.

sudo nano /etc/ssh/sshd_config

In the config file, you’ll change a few lines to look like this:

Port 2222
PermitRootLogin no
PasswordAuthentication no
UsePAM no

Save and exit the text editor. One last command restricts SSH to IPv4 (effectively disabling IPv6 for sshd), then restart sshd:

echo 'AddressFamily inet' | sudo tee -a /etc/ssh/sshd_config
sudo systemctl restart sshd
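Before closing your current session, it's worth confirming that the configuration parses and that you can still get in on the new port from a second terminal:

sudo sshd -t
# no output means the config is valid
ssh -p 2222 -i ~/.ssh/<key name> hduser@<server ip>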

Next, install UFW. In my case, I have a static IP I can connect from, so I whitelisted it. Nothing else can hit that server. I suppose I could have avoided all the hardening since I was going to only whitelist from one IP address, but better safe than sorry. The last thing I want is my little sandbox being used for some DoS bot attack.

sudo apt-get install ufw
sudo ufw allow from <vpn ip>
sudo ufw enable
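Verify the rules took effect; the allow-from rule covers every port from that address, including SSH on 2222 and the Hadoop web consoles we will use later:

sudo ufw status verbose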
Install Java and reboot

This was one of the biggest issues I ran into with various online how-to guides. It was nearly impossible to find a working combination of Hadoop, Ubuntu, and Java versions, particularly when many guides relied on the Oracle JDK. This step uses the default OpenJDK package and doesn't mess around with custom repositories.

sudo apt-get update
sudo apt-get install default-jdk
sudo reboot
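When the box comes back up, a quick check confirms which JDK was installed and where it lives:

java -version
readlink -f "$(which java)"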

Part 2: Create localhost SSH access for Hadoop and install

Once this server boots up again, log in as hduser. The next step creates a key pair on hadoop-master. Leave the filename and passphrase prompts blank; otherwise you'll have more work ahead of you. Note that we specify the SSH port in the ssh command, since we changed it earlier, and we will also have to add it to the Hadoop environment variables.

ssh-keygen -t rsa 

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 

chmod 0600 ~/.ssh/authorized_keys 

ssh -p 2222 localhost

Once you’ve confirmed that works, exit the SSH session and get back to your hadoop-master hduser command line. Now it’s time to install Hadoop.

cd

wget http://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

tar xvzf hadoop-2.7.3.tar.gz

sudo mkdir -p /usr/local/hadoop

cd hadoop-2.7.3/

sudo mv * /usr/local/hadoop

sudo chown -R hduser:hadoop /usr/local/hadoop
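Confirm the ownership change took and the Hadoop tree is where we expect it:

ls -ld /usr/local/hadoop
ls /usr/local/hadoop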

Part 3: Hadoop configuration

Variables configuration

Here, you have to do some checking to make sure your Java installation path matches what we will put in the environment variables.

update-alternatives --config java

In this case we are looking for /usr/lib/jvm/java-8-openjdk-amd64.

Edit your bashrc file first.

sudo nano ~/.bashrc

Add the following:

#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
#HADOOP VARIABLES END
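Reload your shell so the new variables take effect, then confirm Hadoop is on the PATH:

source ~/.bashrc
hadoop version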

Now edit your hadoop-env file.

sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Edit the following line to look like this:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Add this line:

export HADOOP_SSH_OPTS="-p 2222"
Hadoop XML configuration files
core-site.xml
sudo mkdir -p /app/hadoop/tmp

sudo chown hduser:hadoop /app/hadoop/tmp

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
mapred-site.xml
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml

Add the following.

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-master:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
hdfs-site.xml
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode

sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode

sudo chown -R hduser:hadoop /usr/local/hadoop_store

sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add the following.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
  </property>
</configuration>
yarn-site.xml
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

Add the following.

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Part 4: Reboot and fire it up!

Reboot the server one more time (sudo reboot). Once it comes back up and you are logged in as hduser, format HDFS.

hdfs namenode -format

Start the services.

start-dfs.sh

start-yarn.sh
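Before kicking off a test job, jps is a quick way to confirm everything came up; on a single-node setup you should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager (plus Jps itself):

jps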

Test with a simple job.

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 5
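For one more sanity check, a quick round trip through HDFS works as well (the path here is just an example):

hdfs dfs -mkdir -p /user/hduser
echo "hello hadoop" | hdfs dfs -put - /user/hduser/hello.txt
hdfs dfs -cat /user/hduser/hello.txt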

You will want to visit http://<server-ip>:50070/ (the NameNode web UI) and http://<server-ip>:8088 (the YARN ResourceManager UI) to see the consoles. Because UFW only allows traffic from your whitelisted IP, these will only be reachable from that address.

MongoDB and CouchDB in Healthcare Applications

Both MongoDB and CouchDB are regarded in similar fashion—as they are document databases—and have been used widely in healthcare applications. The similarity to relational database systems usually allows for an easier learning curve and integration with in-place systems. They have been tested against XML and relational databases (e.g., Freire et al., 2016) and used in conjunction with them (e.g., Groce, 2015).

With respect to electronic health record (EHR) management, Freire et al. (2016) tested CouchDB performance with millions of EHRs, including both administrative and epidemiological data points. It was noted that CouchBase is specifically designed for distributed computing, which is a strength in this case. A number of datasets were set up for benchmarking, and specific queries were written in each database's language to answer health-specific questions. Response times varied widely, but the XML-based solutions consistently underperformed both MySQL and CouchBase. Against MySQL, CouchBase delivered faster response times. Despite higher space and indexing-time requirements, CouchBase emerged as the top performer in the test.

MongoDB may be used to supplement and scale up SQL-based deployments, as outlined by Groce (2015). In this case, MongoDB was used to cut down on latency and performance overhead in Doctoralia, a company that connects patients with medical providers. Prior to the deployment, a single SQL server in one geographic location was utilized to handle all the load. As the organizational needs expanded to different countries and data volume increased, it became clear that a scaled approach was needed.

MongoDB allowed Doctoralia to deploy servers to each geographic location (reducing geographic latency) and frontload queries and aggregates to these servers (reducing processing latency). This precompute process also took much of the load off the central SQL server. The distributed framework allows Doctoralia to scale hardware needs up or down as demand requires, and replication allows for high availability with little to no downtime or lack of response seen by end users. Deploying a new server to handle new load is done in a matter of minutes. Doctoralia measures the MongoDB deployment in terms of speed and availability, and has considered it a great success.

References

CouchBase. (2017). NoSQL for healthcare. Retrieved from https://www.couchbase.com/solutions/nosql-for-healthcare

Freire, S. M., Teodoro, D., Wei-Kleiner, F., Sundvall, E., Karlsson, D., & Lambrix, P. (2016). Comparing the performance of NoSQL approaches for managing archetype-based electronic health record data. PLoS ONE, 11(3).

Groce, D. (2015). How MongoDB helped a healthcare firm scale horizontally. Retrieved from https://dzone.com/articles/leaf-in-the-wild-doctoralia-scales-patient-service

MongoDB. (2019). Healthcare. Retrieved from https://www.mongodb.com/industries/healthcare

Analytics Theories for Medical Diagnosis

Khivsara (2018) presents a number of basic analytics theories. Of these, I believe four are most relevant for medical diagnosis: clustering, association rules, regression, and textual analysis.

Association rules are nothing more than finding causal structures and patterns between objects in order to establish some sort of logical relationship. It is a machine learning analog to what doctors do on a regular basis in making diagnoses. Picture an emergency room triage area, where patients are sorted and prioritized based on symptoms. In place of a nurse, perhaps on particularly busy nights, a self-service kiosk would allow patients to select all the symptoms they are exhibiting, and those symptoms would generate potential diagnoses, the severity of which would determine priority in the night's order.

Moving a step beyond simple associations, let us examine clustering. Assume two risk factors for chronic disease (e.g., unhealthy diet and tobacco use) were quantified for a population of patients and plotted on a two-axis graph. A simple review of the graph would show plots of individuals on the spectra of diet and tobacco use. Rather than being evenly dispersed across the graph, the data points would be arranged in two or more groupings depending upon the population. K-means clustering would classify those data points (the individuals) into different risk groups depending upon where they fell on the chart. K-means clustering is most useful in healthcare applications where similarities between patients must be quantified and cohorts established.

Going a step further, regression puts quantitative measures on the relationship between sets of variables and uses them to predict values. In healthcare, the most common use of regression relates to healthcare costs. As for diagnosis, logistic regression in particular can help estimate the likelihood of a condition based on a number of known factors. Imagine a known regression equation for predicting diabetes risk from multiple input variables.

Finally, let us examine textual analysis. The other three theories mentioned here rely on structured data. However, that structured data is only a fraction of the data collected when a patient sees a provider. The ability to utilize the unstructured data, rife with context and nuance, is perhaps the biggest untapped potential in healthcare analytics. The confluence of textual analysis and natural language processing (NLP) allows unstructured data from sources such as patient records and provider dictation to become part of predictive modeling and to coexist with structured data.

References

EMC Services. (2018). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. Retrieved from https://bhavanakhivsara.files.wordpress.com/2018/06/data-science-and-big-data-analy-nieizv_book.pdf

Healthcare.AI. (2017). Step by step to K-Means clustering. Retrieved from https://healthcare.ai/step-step-k-means-clustering/

HealthCatalyst. (2019). How to use text analytics in healthcare to improve outcomes. Retrieved from https://www.healthcatalyst.com/how-to-use-text-analytics-in-healthcare-to-improve-outcomes

Kulkarni, A. R., & Mundhe, S. D. (2017). Data mining technique: An implementation of association rule mining in healthcare. International Advanced Research Journal in Science, Engineering and Technology, 4(7), 62-65.

World Health Organization. (2005). Chronic diseases and their common risk factors. Retrieved from https://www.who.int/chp/chronic_disease_report/media/Factsheet1.pdf

Variables and Measures, or People and Goals?

Just as any IT implementation shouldn’t be for its own sake—that is, it should serve a business purpose within the sponsoring organization and not simply be a cost center—quantitative analysis within the context of an organization should likewise serve a business purpose. For example, there must be some reason a widget manufacturer commissions a study of its customer base. It wasn’t brought up just to keep the research division busy. There are typically research questions and hypotheses that exist and guide the methodology.

In my own research consulting work, I have often started with broad research questions that then drive more narrow research questions and/or particular segment analyses. At the analysis level, the variables and desired outcomes are examined in order to determine what test to use. From that point, it is easy to get lost in the vocabulary of quantitative analysis and forget that the work is being done to answer a business question.

For example, assuming the National Widget Company commissioned that study of its customer base, I could simply report the measures of central tendency and leave them to interpret why there's a difference between the mean and median ages. But a true data scientist/analyst helps explain why the numbers mean what they do, and ensures the business users don't get lost in the lingo. I would take the time to explain that the mean age is 42.5, the median age is 37, and that the gap indicates the age distribution is skewed toward the high end, with some notably older customers (and possibly outliers) pulling the mean up. I would then turn back to them and ask what this means for their business. Remember that as the analyst, we are not the business subject-matter experts. Offering the numbers to the business and asking them to provide context creates more opportunities for synergy.

Consider another example involving correlation. Two variables, or points of interest as we would call them: widget sales and distance from a major airport. A moderately strong negative correlation (r = -0.49) is found. First, we must caution against equating correlation and causation. We would then pivot away from the r-value and put the focus back on the variables of interest: it appears that people who live closer to a major airport are more likely to buy these widgets. Again, we would put the question back to the business to have a conversation about why these variables might be related and what the possible covariates are.

In either case, and in any analytics situation, proper use of visualization is paramount. In the latter example, it is much easier to see what an r-value of that size means on a scatterplot than to explain it verbally. Data visualization bridges many gaps that numbers and words simply cannot fill. These are the languages of dashboards, executive roll-ups, and KPIs.

Overall, the primary thing to remember in keeping an audience engaged in a discussion around quantitative research is this: the variables of interest are the reason for the study, not the numbers themselves. Keep the focus on what matters.

The Privacy Divide: Social Media and Personal Genomic Testing

With every advance in technology comes a trade-off of some kind. Where the use of personally-identifiable information is concerned, the trade-offs typically involve the exchange of privacy and confidentiality for a non-monetary benefit. In the early days of social media, conventional wisdom said the product was the service. However, we have seen over the last decade that the users of such platforms are the products, the perceived benefits merely carrots on sticks to keep the products (users) engaged in the cycle. We willfully pour details of ourselves into various social media outlets, despite the documented bad behaviors by giants like Facebook, and mostly remain complacent in having our personal data packaged and leveraged against us by various business interests.

However, in the conversations I've had around personal genomic testing (PGT), I've noticed that many are quick to cite data privacy and risk as a key reason not to participate. Think about this. On one hand, we have evidence that Facebook has been using our data in dubious ways, yet we keep pouring ourselves into it (McNamee, 2019). On the other hand, the potential benefits of PGT are outweighed, in many minds, by a fear of that data potentially being misused.

My purpose is not to minimize the potential hazards around PGT. Consider the following risks: (a) hacking; (b) profit or misuse by the company or partners; (c) limited protection from a narrow scope of laws; (d) requests from state and federal authorities; and (e) changing privacy policies or company use due to mergers, acquisitions, bankruptcies, et cetera (Rosenbaum, 2018). In the face of potential benefits from PGT, these are serious caveats. But read that list outside of this context, and it is equally applicable to the data we generate and provide to social media outlets on a daily basis.

As of yet the privacy regulations around social media use only exist within the context of the company itself—that is, there are no substantial federal regulations in the US on the matter, only the GDPR in the EU (St. Vincent, 2018). Where health information is concerned, the US does have slightly more mature federal regulation. The Health Insurance Portability and Accountability Act (HIPAA) requires confidentiality in all individually-identifiable health information; in 2013, this law was extended to genetic information by way of the Genetic Information Nondiscrimination Act (GINA). While the rules prohibit use of genetic information for underwriting purposes, there is no restriction on the sharing or use of genetic information that has been de-identified (National Human Genome Research Institute, 2015). De-identification is not entirely foolproof. There are cases in which the data can be re-identified (Rosenbaum, 2018).

The incongruence is puzzling. In the case of social media, users willfully provide a wealth of data points on a regular basis to companies that repackage and monetize that data for dubious purposes, in the absence of meaningful US legislation to protect it. In the case of PGT, where at least HIPAA and GINA have a rudimentary level of codified protection, users’ hesitance appears to be much more pronounced.

References

McNamee, R. (2019). Zucked: Waking up to the Facebook catastrophe. New York: Penguin.

National Human Genome Research Institute. (2015). Privacy in genomics. Retrieved from https://www.genome.gov/about-genomics/policy-issues/Privacy

Rosenbaum, E. (2018). Five biggest risks of sharing your DNA with consumer genetic-testing companies. Retrieved from https://www.cnbc.com/2018/06/16/5-biggest-risks-of-sharing-dna-with-consumer-genetic-testing-companies.html

St. Vincent, S. (2018). US should create laws to protect social media users’ data. Retrieved from https://www.hrw.org/news/2018/04/05/us-should-create-laws-protect-social-media-users-data

Where Clinical, Genomic, and Big Data Collide

One of the early proving grounds of big data is healthcare, and the constant cycle of insights catching up to volume hasn't changed since the early days of the electronic patient record. Early healthcare data typically involved structured metrics such as ICD-9 codes and other billing data, which yielded very little clinical detail. The introduction of new data points, both structured and unstructured, has opened the door to many new analytics possibilities. While the possibilities are there, "few viable automated processes" exist that can "extract meaning from data that is diverse, complex, and often unstructured" (Barlow, 2014, p. 18). Indeed, the gap continues to widen between the "rapid technological progress in data acquisition and the comparatively slow functional characterization of biomedical information" (Cirillo & Valencia, 2019, p. 161).

With so much available, a hospital or healthcare provider may find it difficult to determine a place to start, and either ignore the possibilities altogether or engage in initiatives that are not impactful to clinical quality or costs. There are five broad areas in which value can be delivered: clinical operations, payment & pricing, R&D, new business models, and public health; data are gathered from four broad sources including clinical, pharmaceutical, administrative, and consumer (Barlow, 2014, p. 21).

As of late, genomics has entered the conversation as both a consumer product (e.g., 23andMe or Ancestry, known as personal genomic testing) and a clinical practice. It is one thing to prescribe a medication based on a patient's chart history, but an entirely different patient experience when a prescription is tailored to a patient's particular metabolism, genetic predispositions, and risks (Barlow, 2014, p. 19). The wealth of patient-generated health data from a growing number of consumer devices has already contributed to the rise of "Personalized Medicine" (Cirillo & Valencia, 2019, p. 162), and the introduction of genomic data will move the needle even further. One can't get much more personalized than a genetic footprint.

One debate around personal genomic testing is the value it provides when given directly to consumers without the benefit of clinician involvement. While the benefits of such testing include lifestyle changes that mitigate future disease risk, consumers are also prone to misinterpretation that may lead to unnecessary medical treatment (Meisel et al., 2015, p. 1). Beyond future risk, a recent study found that interest in personal genomic testing had a great deal to do with family or individual history of a particular affliction (Meisel et al., 2015). Consumers want to explain current risks and conditions, not just predict future ones.

References

Barlow, R. D. (2014). Great expectations for big data. Health Management Technology, 35(3), 18-21.

Cirillo, D., & Valencia, A. (2019). Big data analytics for personalized medicine. Current Opinion in Biotechnology, 58, 161-167.

Meisel, S. F., Carere, D. A., Wardle, J., Kalia, S. S., Moreno, T. A., Mountain, J. L., . . . Green, R. C. (2015). Explaining, not just predicting, drives interest in personal genomics. Genome Medicine, 7(1), 74.

Big Data: Human vs Material Agency

Lehrer, Wieneke, Vom Brocke, Jung, and Seidel (2018) studied four companies and their use of big data analytics in the business. Common to all companies in the case study was a two-layer service innovation process: first, automated customer-oriented actions based on trigger actions and preferences; and second, the combination of human and material agencies to produce customer-oriented interactions. The latter is of particular interest, as popular opinion sometimes tends to totalize big data as a replacement for human interaction. As illustrated in this study, the material agency (technology) exists to supplement the human agency.

One particular illustration is Company A, "the Swiss subsidiary of a multinational insurance firm that offers private individuals and corporate customers a broad range of personal, property, liability, and motor vehicle insurance" (Lehrer et al., 2018). Through a recent implementation of big data analytics tools and methodologies, the company has created new ways of interacting more efficiently and has supplemented employees' customer service with better insights. In the latter case, the material agency guides employees' own interactions with customers. That is, "the employees' skill sets, experiences, and customer contact strategies [interact] with the material features of BDA to create new practices" (Lehrer et al., 2018, p. 438). This may include a number of sales- and service-oriented cues, such as social media or online shopping data points pointing to a major life event. On the other front, consider how the stream of data from various customer devices (e.g., home security systems, automobile OBD-II data trackers, smartphone location data) provides a wealth of data points that machine learning methods can use to understand what typical behavior looks like for a customer and to flag anomalies when they show up. Personally, my home security system now knows it is unusual for me to leave a particular geographic region without arming the system. When that does occur, I receive an alert reminding me to arm it.

Reference

Lehrer, C., Wieneke, A., Vom Brocke, J. A. N., Jung, R., & Seidel, S. (2018). How big data analytics enables service innovation: Materiality, affordance, and the individualization of service. Journal of Management Information Systems, 35(2), 424-460. doi:10.1080/07421222.2018.1451953

What Makes Big Data “Big?”

I’ve never been a fan of buzzwords. The latest source of my discomfort is the term thought leader, which is one of those ubiquitous but necessary phrases in almost every professional space. That hasn’t kept me from poking fun at it, though, as I believe we should be able to laugh at ourselves and not take things too seriously.

Big Data is a buzzword. But it’s also my career.

What is the difference between regular, conventional, garden-variety data and Big Data? There’s a lot we could say here, but the key differences that come to mind for me are use, size, scope, and storage. I immediately think of two specific datasets I’ve used for teaching purposes: LendingClub and Stattleship.

LendingClub posts their loan history (anonymized, of course) for public consumption so that any audience may feed it into an engine or tool of their choice for analysis. I’ve used this dataset before to demonstrate predictive modeling and how financial institutions use it to aid decision-making in loan approvals. Stattleship is a sports data service with an API that allows access to a myriad of major league sports data. They also provide a custom wrapper to be used in R, and I’ve used these tools to teach R.

One of the primary differences between big data and conventional data is use case. Take these two datasets, for example. The architects of these sets understand that a variety of users will be downloading the data for various reasons, and there is no specific use case intended for either set. The possibilities are endless. With smaller troves of data, we typically have an intended use attached, and the data is specific to that use. Not so with big data.

These datasets illustrate two other key factors in big data: size and scope. Again, the datasets are not at all meant to answer one specific question or have a narrow focus. Sizing is often at least in gigabytes or terabytes—and in many cases tipping over into petabytes. The freedom to explore multiple lines of inquiry is inherent in big data sets without any sort of restriction on scope.

Finally, the storage and maintenance of big data is another key difference that sets it apart from conventional datasets. The trend of moving database operations offsite and using Database-as-a-Service models has enabled the growth of big data, as has the development of distributed computing and storage. Smaller conventional datasets do not require such an infrastructure and are not quite as impactful on a company’s bottom line.

Future of BI: Opportunities, Pitfalls, and Threats

Opportunities

Master data management (MDM). A few years ago this was thought to be a dead concept and I wonder how much of that sentiment was driven by the advent of data lakes, unstructured processing, artificial intelligence, et cetera. We have come far enough now to know that (a) the two do not have to be mutually exclusive, and (b) MDM is seeing a resurgence as the importance of data governance and quality management grows. Regardless of how the data is used, it must be clean and relevant.

Ethics. Cambridge Analytica should not have been the first watershed moment in the ethics of big data and business intelligence. While a number of industries have established sub-disciplines in ethics, data science and business intelligence are young fields, and this area will continue to grow. That particular scandal did peel back a layer of collective public naivete. We are more attuned now to the potential pitfalls of big data in the hands of companies with less-than-best intentions. However, willful ignorance does remain, and this is a major opportunity for growth.

Data-driven cultures and citizen data scientists. Business intelligence has expanded from a small cadre of statisticians and developers to include more subject-area experts and regular business users. This democratization of data science is largely due to the ease of use of popular analytics packages such as Tableau and Qlik. As the black box of analytics is demystified and the power is put in the hands of more users, data-driven cultures will become easier to create in organizations.

Pitfalls

Over-reliance on the next-best-thing. Let’s admit it: there are some impressive analytics packages on the market right now. The innovations in data science are exciting. But without a focus on less-flashy elements such as data governance and the right people-processes, whatever the next best thing might be will fail. It is tempting to get caught up in the continuous cycle of innovation and forget about these critical elements.

De-valuing BI talent. The release of analytics packages that an average business user can pilot without the need for a dedicated statistician or business intelligence developer has done many good things for the discipline, but going too far in this direction is a potential pitfall. Socially, we are in an era of experts and scientists being ignored in favor of what people believe they know (Nichols, 2017). Between this predisposition and more functions being within reach of regular business users, there is a potential for BI experts to be brushed aside and their talent de-valued.

Checking our brains at the door. As useful and amazing as business intelligence has become in organizations, it may be tempting to put more and more decision-making power on artificial intelligence at the expense of human intelligence. Plenty of films have used this premise as fodder for apocalyptic computers-take-over-the-world stories. But on a more practical level, business intelligence is all about serving up the right information so decision-makers can make the right calls—not making all the decisions for them.

Threats

Inflexible organizations. Organizational culture can be a great asset or opportunity, but it can also be an incredible hindrance. Even the best deployments with the best intentions can be rendered useless if an organization is not willing to embrace whatever change is necessary to take advantage of it all. This is not a new threat, per se, but one that will always be around.

Bad actors. We like to believe that big data and the algorithms that drive how we interact with it are neutral at best. However, as McNamee (2019) notes, it is possible for bad actors to utilize otherwise benign data and algorithms for nefarious purposes. As collections of data grow and more algorithms are built to drive outcomes or profit, the chances of bad actors exploiting them become greater and greater.

Lack of transparency. This may be more of a threat in the big data realm generally than in business intelligence specifically, but it bears highlighting within this context. Businesses use proprietary algorithms and logic that turn troves of data into consequential decisions about our lives. These also shape the world we see through our consumption of news and social media websites. Do we remain in willful ignorance of how those are served up to us, or do we push for more transparency there?

References

Graham, M. (2018). Facebook, big data, and the trust of the public. Retrieved from http://blog.practicalethics.ox.ac.uk/2018/04/facebook-big-data-and-the-trust-of-the-public/

Jürgensen, K. (2016). Master Data Management (MDM): Help or Hindrance? Retrieved from https://www.red-gate.com/simple-talk/sql/database-delivery/master-data-management-mdm-help-or-hindrance/

McNamee, R. (2019). Zucked: Waking up to the Facebook catastrophe. New York: Penguin.

Nichols, T. (2017). The death of expertise: The campaign against established knowledge and why it matters. New York: Oxford UP.

Pyramid Analytics. (n.d.). The business intelligence trends of 2019 discussed. Retrieved from https://www.pyramidanalytics.com/blog/details/blog-guest-bi-trends-of-2019-discussed

Rees, G., & Colqhuon, L. (2017). Predict future trends with business intelligence. Retrieved from https://www.intheblack.com/articles/2017/12/01/future-trends-business-intelligence

CRM, OLAP Cubes, and Business Intelligence

Customer Relationship Management (CRM), as a concept, brings together a number of systems from functions across the business (sales, marketing, operations, external partners, etc.) that allow the enterprise to create, maintain, and grow positive and productive relationships with customers. We might think of it as the glue that brings front office and back office together and allows the business to de-silo what would otherwise be proprietary information across the organization.

But what good are all these data points if they aren’t utilized effectively? It would be easy to fall victim to information overload without a way to explore the data from a particular axis or angle. This is where classic data mining and online analytical processing (OLAP) come in. If we think of various systems of record as one-dimensional axes on a graph, bringing these together in a three-dimensional cube and analyzing a particular block within that cube is much more efficient. Rather than starting with the data and searching for questions to answer that might involve those points (as is tempting to do at times), we are able to start with a specific business question and use OLAP to answer it.

For example, assume I am a cosmetics manufacturer and want to know how much of my product actually goes out the door to consumers after it is sold to a distributor. I want to use that information to adjust my marketing efforts and potentially re-evaluate my production line. I have the following data points available by way of my existing business intelligence environment:

  • Production line data
  • Inventory balances in my warehouse
  • Marketing campaign data
  • Sales data from my company to the distributor
  • Sales data from the distributor to the end consumer

Rather than starting from one or two of these data points and throwing things against the wall to see what might stick, I can use OLAP capabilities to find the different relationships between these points, eventually driving my answer. Understand here that answering the initial question is simply a matter of reading one data point (the last one in this case)—however, a strategic approach that addresses the customer relationship is the end goal.

One caveat here. OLAP may be considered a predecessor to currently-understood data mining, depending on which view of business intelligence you find appealing. Strictly speaking, traditional OLAP has been used for a number of years already for marketing, forecasting, and sales. Data mining capabilities at present far surpass what has been traditionally available in the OLAP sense.
