Centralized and Distributed DBMS

Perhaps the best place to start comparing centralized and distributed DBMS instances is the architecture itself. As the names suggest, the distinction is chiefly whether the data resides in one physical location or in multiple locations with an underlying controller to bring it all together; the distinction is physical rather than logical, so multiple volumes within a single location do not qualify as a distributed DBMS. It might be compared to disk RAID options, in which data on a storage system is mirrored or striped across multiple physical drives.


We can continue the RAID analogy in discussing replication and partitioning. Much like a distributed DBMS architecture, RAID storage allows disks to be seamlessly duplicated for high fault tolerance, or the data itself to be written across multiple disks to increase storage capacity and throughput. In RAID 0, data is striped across multiple disks; this is the equivalent of DDBMS partitioning, in which the nodes store different parts of the complete database. This may be accomplished by horizontal partitioning (all columns are stored, but different nodes hold different subsets of the records) or vertical partitioning (different nodes hold different subsets of the columns, for all records). Alternatively, in RAID 1, a disk is mirrored to another disk; this is the equivalent of replication.
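To make the two schemes concrete, here is a toy Python sketch (not a real DDBMS; the table, column, and node names are invented) showing the same logical table split horizontally and vertically:

```python
# Toy illustration of DDBMS partitioning: one logical table of
# customer records split across nodes two different ways.
records = [
    {"id": 1, "name": "Ada", "region": "east", "balance": 120},
    {"id": 2, "name": "Ben", "region": "west", "balance": 340},
    {"id": 3, "name": "Cam", "region": "east", "balance": 560},
]

# Horizontal partitioning: every node stores all columns,
# but only a subset of the rows (here, sharded by region).
horizontal = {}
for row in records:
    horizontal.setdefault(row["region"], []).append(row)

# Vertical partitioning: every node stores all rows,
# but only a subset of the columns (keyed by id for reassembly).
vertical = {
    "node_a": [{"id": r["id"], "name": r["name"]} for r in records],
    "node_b": [{"id": r["id"], "balance": r["balance"]} for r in records],
}
```

Reassembling the full table would mean a union of shards in the horizontal case and a join on `id` in the vertical case.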

A common misconception with DDBMS instances involves the CAP theorem. There is an assumption that while CDBMS instances enjoy Consistency, Availability, and Partition Tolerance all at once (the latter being moot in a single location), DDBMS administrators must choose CP, CA, or AP up front. It is more accurate to say that a DDBMS administrator, in the event of a network partition, must choose between availability and consistency: choosing availability may sacrifice consistency, and choosing consistency may sacrifice availability.

In terms of applications, a DDBMS is most appropriate for large volumes of data or for users spread across a large geographic area. A partitioned DDBMS architecture might be optimized to store specific columns on nodes local to the user groups that query those columns more frequently than other groups do. Geographic spread is a relevant use case due to the network hops and latency differences that may exist between an otherwise central data center and users worldwide.


Connolly, T. & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Mehra, A. (2017). Understanding the CAP theorem. Retrieved from https://dzone.com/articles/understanding-the-cap-theorem

On DBaaS migrations

An increasing number of enterprise systems are moving to as-a-service models, reducing a company’s overhead and turning traditional facets of information technology—those that have consumed both real estate and capital expenditures—into outsourced subscriptions managed by outside companies. Infrastructure, networking, and reporting as a service are already popular. Moving the databases themselves off a company’s property and balance sheet into a cloud architecture is known as Database-as-a-Service (DBaaS) (Bonthu, Thammiraju, & Murthy, 2014). Many factors are involved in establishing the DBaaS environment and migrating the data from on-premise boxes to the cloud.

There are typically eight steps involved in moving from on-premise to cloud databases:

  1. Define the scope of migration
  2. Ensure data security
  3. Select service provider
  4. Map the data
  5. Schedule the migration
  6. Select tools for migration or develop migration scripts
  7. Test before (and after) the migration
  8. Actual data migration

The actual migration, insofar as relational databases are concerned, typically consists of three steps:

  1. Relational schema migration, which includes the migration of tables, indexes, and views.
  2. Data migration, done via tools or migration scripts. The time required depends on the size of the database.
  3. Database stored-program migration, i.e., the migration of stored procedures and triggers. (Vodomin & Andročec, 2015)
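As a rough illustration of those three steps, here is a hedged sketch using two SQLite databases as stand-ins for the on-premise source and the cloud target (the table, index, and trigger are invented for the example; real migrations use vendor tooling):

```python
# Minimal sketch of the three relational migration steps.
import sqlite3

source = sqlite3.connect(":memory:")   # stand-in for on-premise DB
target = sqlite3.connect(":memory:")   # stand-in for cloud DB

source.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE INDEX idx_name ON customers (name);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
""")

# Step 1: schema migration -- copy table, index, and view definitions.
for (ddl,) in source.execute(
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL"):
    target.execute(ddl)

# Step 2: data migration -- bulk-copy rows table by table.
rows = source.execute("SELECT id, name FROM customers").fetchall()
target.executemany("INSERT INTO customers VALUES (?, ?)", rows)

# Step 3: stored-program migration -- triggers travel as DDL too.
target.execute("""
    CREATE TRIGGER audit AFTER INSERT ON customers
    BEGIN SELECT 1; END;
""")

migrated = target.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

The same shape holds at enterprise scale; only the tooling and the data volumes change.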

The different types of cloud databases available, relational and non-relational, make for a variety of migration paths and a number of considerations for enterprise migration. Regardless, a one-time expenditure on migration can save countless dollars and hours otherwise lost to ballooning infrastructure and database sprawl. It is much easier to handle such sprawl with both storage and virtual machine elasticity than by investing more in onsite resources (Bonthu, Thammiraju, & Murthy, 2014). Further research in this space is warranted as the options for cloud architecture increase and companies have more options for service-based managed IT.


Bonthu, S., Thammiraju, S. D. M., & Murthy, Y. S. S. R. (2014). Study on database virtualization for database as a service (DBaaS). International Journal of Advanced Research in Computer Science, 5(2), 31-34.

Vodomin, G., & Andročec, D. (2015). Problems during database migration to the cloud. Paper presented at the Central European Conference on Information and Intelligence Systems, Varaždin, Croatia.

MetroMaps and T-Cubes: Beyond Gantt Charts

Martínez, Dolado, & Presedo (2010) discuss two visual modeling tools for software development and planning, MetroMap and T-Cube. This discussion arises in the context of greater attention being paid to the development process and its metrics, not just the software engineering itself. A concession the authors make very early on is that Gantt charts are the prevalent method for project mapping in organizations, and that the research to date shows they are not effective communication tools, especially when different groups are involved. Enter the MetroMap, a way of visualizing abstract, train-of-thought information that communicates both high-level and detailed information to viewers.

Image courtesy of Martínez, Dolado, & Presedo (2010)

T-Cube visualization is reminiscent of a Rubik’s Cube, utilizing the three-dimensional nature of a physical cube, the individual cubes making up the whole, and the facets (colors) on each individual cube, which correspond to tasks and their attributes. The authors illustrate these concepts with a specific software toolset, represented in the article. Because the tasks and attributes are recorded independently, they can be represented by workgroup, type of task, module, or time.

These two methods have their strengths and weaknesses, both individually and together. At first glance, it is obvious that the MetroMap can represent many indicators at once while the T-Cube can only show one at a time. MetroMap uses a variety of icons and styles to represent information while the T-Cube uses traditional treemaps. The authors size up the tools in a simple comparison table, noting that MetroMap generally has the edge on viewing a lot of information at once.

Features and benefits are great, but how does actual use differ? Is one easier than the other in practice? The authors examined the shortest-path route to accomplishing the same task in both tools, and found that MetroMap was the more efficient in multiple scenarios; in all cases its actions were more basic and straightforward. Overall, either tool is more informative and effective than a Gantt chart. Access to information and the ability to understand it are paramount in any planning and development exercise, and these are two tools that better enable both.


Martínez, A., Dolado, J., & Presedo, C. (2010). Software Project Visualization Using Task Oriented Metaphors. JSEA, 3, 1015-1026.

Delphi Methods and Ensemble Classifiers

Ensemble classifiers are a bit like the Delphi methodology, in that they utilize multiple models (or experts) to arrive at better predictive performance than a single model would offer (Dalkey & Helmer, 1963; Acharya, 2019). Independent, parallel classifiers implement a majority vote, much as the Delphi method polls its experts. A variety of individual classifiers can be used, including logistic regression, nearest neighbor methods, decision trees, Bayesian analysis, or discriminant analysis. According to Dietterich (2002), ensemble classification overcomes three major problems: statistical, computational, and representational. The statistical problem involves the hypothesis space being too large for the data itself, producing multiple accurate hypotheses of which only one is chosen. The computational problem involves the algorithm’s inability to guarantee the best hypothesis. The representational problem involves the hypothesis space being devoid of any good approximation of the target.
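A minimal sketch of the majority-vote idea, with three trivial rule-based "classifiers" standing in for trained models (the rules and labels are invented for illustration):

```python
# Hand-rolled majority vote across independent classifiers, in the
# spirit of a Delphi panel polling its experts.
from collections import Counter

def majority_vote(classifiers, x):
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three toy base classifiers labeling a number as "high" or "low".
panel = [
    lambda x: "high" if x > 10 else "low",
    lambda x: "high" if x > 20 else "low",
    lambda x: "high" if x > 30 else "low",
]

label = majority_vote(panel, 25)   # two of three vote "high"
```

In practice the panel would hold trained models (logistic regression, trees, and so on), but the voting logic is the same.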

Ensemble methods include bagging, boosting, and stacking. Bagging is considered a parallel or independent method; boosting and stacking are both sequential or dependent methods. Parallel methods are used when the independence between the base classifiers is advantageous, including error reduction; sequential methods are used when dependence between the classifiers is advantageous, such as correcting mislabeled examples or converting weak learners (Smolyakov, 2017).

Random forests are themselves an ensemble method: they produce results from multiple decision trees and aggregate those results, much like bagging (Liberman, 2017). The trees train on different subsets of the data and of the features, both randomly selected. Bias and variance errors are mitigated by the low correlation between the models. Again, as with ensemble classifiers and even Delphi-style decision-making, learners operating as a committee should outperform any of the individual learners.
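The bootstrap-aggregation idea behind bagging and random forests can be sketched in a few lines. The one-split "stump" learner here is a deliberate simplification, not a real decision tree:

```python
# Sketch of bagging: each learner trains on a random resample of the
# data, and predictions are combined by majority vote.
import random

random.seed(0)
data = [(x, int(x >= 5)) for x in range(10)]  # label is 1 when x >= 5

def fit_stump(sample):
    # "Train" a one-split stump: pick the threshold with fewest errors.
    best_t, best_err = 0, len(sample)
    for t in range(11):
        err = sum((x >= t) != bool(y) for x, y in sample)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Bootstrap-resample the data, fitting one stump per resample.
stumps = [fit_stump(random.choices(data, k=len(data))) for _ in range(25)]

def predict(x):
    votes = sum(x >= t for t in stumps)
    return int(votes > len(stumps) / 2)
```

Each stump sees a slightly different view of the data, so the committee's vote is more stable than any single stump.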


Acharya, Tarun (2019). Advanced ensemble classifiers. Retrieved from https://towardsdatascience.com/advanced-ensemble-classifiers-8d7372e74e40

Connolly, T. & Begg, C. (2015).  Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson. 

Dalkey, N., & Helmer, O. (1963). An experimental application of the Delphi method to the use of experts. Management Science, 9(3), 458-467.

Dietterich, T. G. (2000). Ensemble methods in machine learning. International workshop on multiple classifier systems (pp. 1-15). Springer Berlin Heidelberg.

Dietterich, T. G. (2002). Ensemble Learning. In The Handbook of Brain Theory and Neural Networks, Second Edition, (M.A. Arbib, Ed.), (pp. 405-408). Cambridge, MA: The MIT Press.

Liberman, N. (2017). Decision trees and random forests. Retrieved from https://towardsdatascience.com/decision-trees-and-random-forests-df0c3123f991

Smolyakov, V. (2017). Ensemble learning to improve machine learning results. Retrieved from https://blog.statsbot.co/ensemble-learning-d1dcd548e936

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

Thick Data and Big Data

In March 1968, Robert F. Kennedy said of the Gross Domestic Product index: “It measures neither our wit nor our courage, neither our wisdom nor our learning, neither our compassion nor our devotion to our country; it measures everything, in short, except that which makes life worthwhile.”

“What is measurable is not always what is valuable.” With this, Wang (2016b) paraphrased Kennedy’s point about GDP and its inability to measure the qualitative human condition. With the exponential increase in attention to big data of late, the focus on speed and scale has left out things that are “sticky” or “difficult to quantify” (Wang, 2016b). This disparity reflects the traditional gap between qualitative and quantitative research. In fact, Wang found that referring to the qualitative efforts in traditional terms (e.g., ethnography) was met with enough skepticism and pushback that a new term friendly to data jargon had to emerge—and thus thick data was born.

Courtesy Tricia Wang

At first glance, thick data is not attractive in the traditional sense of big data. It is inefficient, does not scale up, and is usually not reproducible. However, when combined with big data, it fills the gaps that the quantitative measures leave open. While big data can identify patterns, it cannot explain why those patterns exist. If big data can go broad, thick data can go deep. Thick data relies on human learning and complements the findings from machine learning that big data cannot provide adequate context for. It shows the social context of specified patterns and is able to handle irreproducible complexity. It is the qualitative complement to quantitative data, the color and nuance to a black-and-white picture.

Forces against the adoption of thick data typically stem from bias against qualitative data. Again, it is messy: inefficient, sticky, complicated, and nuanced. Most of the big data world values what can be quantified and the relationships that can be mapped. As Wang (2016a) notes, quantifying is addictive, and it can be easy to throw out data that doesn’t fit a numerical value. It isn’t a zero-sum game, however; big data and thick data complement each other. But “silo culture”—the same phenomenon that disrupts data integration and wreaks havoc across enterprise data environments—threatens the symbiosis between the two (Riskope, 2017). While thick data is not an innovation in the same sense as cutting-edge artificial intelligence or new developments in IoT technology, it is an innovation in how we think about the world around us and what is important when studying that world.


Riskope. (2017). Big data or thick data: Two faces of a coin.  Retrieved from https://www.riskope.com/2017/05/24/big-data-or-thick-data-two-faces-of-a-coin/

Wang, T. (2016a). The human insights missing from big data.  Retrieved from https://www.ted.com/talks/tricia_wang_the_human_insights_missing_from_big_data

Wang, T. (2016b). Why big data needs thick data.  Retrieved from https://medium.com/ethnography-matters/why-big-data-needs-thick-data-b4b3e75e3d7

XML and Standardization

XML is a true double-edged sword in the data analytics world, with both advantages and disadvantages, not unlike relational databases or NoSQL. The global advantages and disadvantages inherent in XML are just as applicable in the healthcare field. For example, consider the flexibility of creating tags on the fly—something that is both an advantage (ease of use, compatibility, expandability, et cetera) and a disadvantage (lack of standardization, potential incompatibility with user interfaces, et cetera) in the global sphere. These are equally applicable in healthcare settings. Consider an electronic health record (EHR): different providers and points of care may add to the EHR without having to conform to the standards of other providers; that is, data from a rheumatologist may be added to the patient record with the same ease as data from a general practitioner or psychologist. The portability of the XML format means that the record can be exchanged amongst providers or networks as long as the recipient can read XML. This versatility comes at a price, however: the lack of standardization means that all tags and fields in any given record must be known prior to query, which can be quite time-consuming.

Considering an analogy to a different industry, think of a consumer packaged goods (CPG) manufacturer. The CPG has its own internal master data schemas in relational databases and reserves XML for its reseller data interface, so that the different wholesalers and retail network can share sales data back to the CPG in a common format. While all participants use a handful of core attributes (e.g., manufacturer SKU and long description), each wholesaler and retailer has its own set of attributes that are proprietary. XML allows the different participants to feed data back to the CPG without conforming to a schema imposed across the entire retail network and allows the CPG to glean the requisite data shared amongst all participants. However, the process requires setting up the known tags for each new participant so that the CPG knows ahead of time what specific tags are relevant to each participant.
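A small sketch of that arrangement, with invented element names: the CPG reads only the shared core tags and simply ignores each reseller's proprietary fields rather than rejecting the feed.

```python
# Two resellers report sales in XML, sharing a couple of core tags
# while each carries its own proprietary fields.
import xml.etree.ElementTree as ET

feed = """
<sales>
  <item><sku>A-100</sku><description>Widget</description>
        <acme_store_code>17</acme_store_code></item>
  <item><sku>B-200</sku><description>Gadget</description>
        <globex_promo_flag>Y</globex_promo_flag></item>
</sales>
"""

core = []
for item in ET.fromstring(feed):
    # Read only the core attributes every participant shares; unknown,
    # participant-specific tags are ignored.
    core.append({
        "sku": item.findtext("sku"),
        "description": item.findtext("description"),
    })
```

The flip side, as noted above, is that any participant-specific tag the CPG does want must be registered ahead of time for that participant.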


Brewton, J., Yuan, X., & Akowuah, F. (2012). XML in health information systems. Paper presented at the World Congress in Computer Science, Computer Engineering, and Applied Computing, Las Vegas, NV.

Jumaa, H., Rubel, P., & Fayn, J. (2010, 1-3 July 2010). An XML-based framework for automating data exchange in healthcare. Paper presented at the The 12th IEEE International Conference on e-Health Networking, Applications and Services.

Stockemer, M. (2007). How Do HL7 and XML Co-Exist in Clinical Interfacing? Retrieved from https://healthstandards.com/blog/2007/08/10/how-do-hl7-and-xml-coexist-in-clinical-

The Role of Data Brokers in Healthcare

In courses I’ve led before, we looked at the disjointed data privacy regulations in the United States and current events in data privacy (e.g., Facebook, Cambridge Analytica, personal genomics testing, etc.). The overall issue recurs in any setting: giving a single entity a large amount of data inevitably raises questions of ethics, privacy, security, and motivation.

Where healthcare data brokers are concerned, the stated goals differ by type of data. Where direct patient interaction with the data is concerned, the goal is to give patients “more control over the data” (Klugman, 2018) and perhaps bypass the clunky patient portals set up by providers. For data that is not personally identifiable, the goals can be much less altruistic, such as being a player in a multi-billion-dollar market (Patientory, 2018) or contributing to health insurance discrimination (Butler, 2018). I am not naïve enough to think that all exercises in healthcare should be altruistic, and the concept of insurance itself has a modicum of discrimination at its core; however, weaponizing the data to aid in unfair practices is beyond the pale.


From a data engineering perspective, a broker in the truest sense of the word may act as a clearinghouse between providers with disparate systems, enabling the seamless transfer of patient data between those providers without putting the burden of ETL on either of them. Whereas XML formatting and other portability developments have allowed providers using different EHR systems to port patient data, a data brokerage would act as an independent party on the patient’s behalf, handling the technical details of integrating their data across all providers and interested parties. Beyond holding the data, the broker would be responsible for ensuring each provider and biller has access to the same single source of truth on that particular patient.

This would, of course, require a data warehouse of sorts for the single source to be held, and puts the questions of security, privacy, transparency, and ethics on the broker. The broker has to make money to survive and a business model must emerge, so it would not be immune to market forces. The aggregation of so much patient data in one place would be too great a temptation to let sit and not make money as de-identified commodities, so a secondary market would emerge and lead to the same issues cited above. Call me pessimistic, but the best predictor of future actions is past behavior, and thus far the companies holding massive amounts of data about our lives either can’t keep it secure from breaches or are perfectly happy selling it while turning a blind eye to what is done with it.


Butler, M. (2018). Data brokers and health insurer partnerships could result in insurance discrimination. Retrieved from https://journal.ahima.org/2018/07/24/data-brokers-and-health-insurer-partnerships-could-result-in-insurance-discrimination/

Klugman, C. (2018). Hospitals selling patient records to data brokers: A violation of patient trust and autonomy. Retrieved from http://www.bioethics.net/2018/12/hospitals-selling-patient-records-to-data-brokers-a-violation-of-patient-trust-and-autonomy/

Patientory. (2018). Data brokers have access to your information, do you? Retrieved from https://medium.com/@patientory/data-brokers-have-access-to-your-health-information-do-you-562b0584e17e

Data-in-Motion or Data-at-Rest?

Reading the available material on data-in-motion reminds me of when I first read about data lakes over data warehouses, or NoSQL over SQL: the urgency of the former and the outright danger of the latter are both overblown. Put simply, data-in-motion provides real-time insights. Most of our analytics efforts across data science apply to stored data, be it years, weeks, or hours old. Working with data-in-motion means extracting insights as the data rolls in, without storing it first. These are dual workstreams that together tell the whole picture: historical data provides insight on what happened (and the potential to train for future detection), while real-time data shows what is happening right now.
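The contrast can be sketched simply: the same average computed once over stored data versus emitted incrementally as each event arrives (a toy example, not a streaming framework):

```python
# Toy contrast between at-rest and in-motion analytics: the same
# running average, computed over a stored batch versus per event.
def batch_mean(stored):
    # Data-at-rest: everything is already persisted before we analyze.
    return sum(stored) / len(stored)

def streaming_means(events):
    # Data-in-motion: emit an updated insight as each event arrives,
    # without waiting for (or requiring) storage.
    total = 0
    for n, value in enumerate(events, start=1):
        total += value
        yield total / n

readings = [10, 20, 30, 40]
live = list(streaming_means(iter(readings)))
```

Both paths converge on the same final answer; the difference is when the insight becomes available.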

Churchward (2018) makes a fair point: once data is stored, it isn’t real-time by definition. But taking that argument to its logical extreme by asking whether we would like to make decisions on data three months old is a stretch. While it is true that matters such as security and intrusion detection demand real-time detection, categorically dismissing data-at-rest analytics is reckless; it vilifies practices that are the foundation of any comprehensive analytics strategy. Both data-at-rest and data-in-motion are valuable drivers of any business intelligence effort that seeks to paint a total picture of a phenomenon.

There are, of course, less frantic cases to be made for data-in-motion. Patel (2018) illustrates a critical situation on an oil drilling rig, in which information even a few minutes old can be life-threatening. In this case, written for Spotfire X, there may be some confusion of monitoring versus analytics: the dashboard shown on the website and the written scenario paint more of a picture of monitoring and dashboarding than the sort of analytics we would deploy Spark or Kafka for. I don’t need a lot of processing power to tell me that a temperature sensor’s readings are increasing.

Performing real-time analytics on data-in-motion is an intensive task, requiring considerable computing resources. Scalable solutions such as Spark or Kafka are available but may eventually hit a wall. Providers such as Logtrust (2017) differentiate themselves as real-time analytics providers by pointing out the potential shortfalls of those solutions and offering a single platform for both data-in-motion and data-at-rest.


Churchward, G. (2018). Why “true” real-time data matters: The risks of processing data-at-rest rather than data-in-motion. Retrieved from https://insidebigdata.com/2018/03/22/true-real-time-data-matters-risks-processing-data-rest-rather-data-motion/

Logtrust. (2017). Real-time IoT big data-in-motion analytics.

Patel, M. (2018). A new era of analytics: Connect and visually analyze data in motion. Retrieved from https://www.tibco.com/blog/2018/12/17/a-new-era-of-analytics-connect-and-visually-analyze-data-in-motion/

Challenges of Health Informatics in the Cloud

Alghatani and Rezgui (2019) present a framework for remote patient monitoring via cloud architecture. The primary intention is to consolidate disparate data sources and remove the walls between various data silos, increasing cost-effectiveness, response time, and quality of care. The cloud architecture involves the database itself, user interface(s), and artificial intelligence. This cloud is used by four primary groups: patients, hospitals, insurance companies, and controllers (system stewards).

The authors outline a number of advantages here. Telemedicine can be a great thing but has a number of barriers to overcome, not the least of which are cost, culture, political environment, and infrastructure. The cloud architecture seeks to mitigate the cost and infrastructure issues. IT resources can be extended dynamically based on need and the decentralized nature of the system allows for better scalability, flexibility, and reliability.

There are a number of challenges to be considered. The authors highlight seven:

  1. Security
  2. Data management
  3. Governance
  4. Control
  5. Reliability
  6. Availability
  7. Business continuity

An extensive discussion of data collection challenges is presented, outlining a number of possible methods for collection and synchronization. There must be an assumption that no device on this architecture will maintain constant contact with the cloud, and consistency models must be taken into consideration. One option is for each device to maintain local storage and upload to the cloud once a stable connection is available. Another option is a dedicated side network of its own, much like the Whispernet on early Amazon Kindle devices. A third and final option—also the authors’ proposal—is the utilization of fog computing as a layer between these devices and the cloud.
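The first option (buffer locally, upload on reconnect) might be sketched like this; the class and its fields are invented for illustration:

```python
# Toy store-and-forward model: a device buffers readings locally and
# flushes them to the cloud only when a connection is available.
class Device:
    def __init__(self):
        self.buffer = []      # local storage on the device

    def record(self, reading):
        self.buffer.append(reading)

    def sync(self, cloud, connected):
        # Upload and clear the buffer only on a stable connection.
        if connected and self.buffer:
            cloud.extend(self.buffer)
            self.buffer.clear()

cloud_store = []
dev = Device()
dev.record(98.6)
dev.sync(cloud_store, connected=False)   # offline: nothing uploads
dev.record(99.1)
dev.sync(cloud_store, connected=True)    # online: both readings flush
```

The consistency question is visible even in this toy: between syncs, the cloud's view of the patient lags the device's.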

Privacy is always an issue, and cloud architecture muddies the waters a bit, as there is no locked-down on-premise server holding the personally identifiable information. Banks and hospitals have typically been the slowest to adopt cloud computing, in my experience. As Alghatani and Rezgui (2019) note, governance and control are concerns here. The Health Insurance Portability and Accountability Act (HIPAA) requires confidentiality for all individually identifiable health information; in 2013, these protections were extended to genetic information, complementing the Genetic Information Nondiscrimination Act (GINA) of 2008. While the rules prohibit the use of genetic information for underwriting purposes, there is no restriction on the sharing or use of genetic information that has been de-identified (National Human Genome Research Institute, 2015). De-identification is not entirely foolproof, however; there are cases in which the data can be re-identified (Rosenbaum, 2018).
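A toy illustration of such re-identification via quasi-identifiers, with entirely fabricated data: joining "de-identified" rows against a public roster on ZIP code, birth year, and sex.

```python
# Linkage attack sketch: rows stripped of names can sometimes be
# re-identified by joining on quasi-identifiers. All data is invented.
deidentified = [
    {"zip": "30301", "birth_year": 1980, "sex": "F", "dx": "J45"},
    {"zip": "30302", "birth_year": 1975, "sex": "M", "dx": "E11"},
]
public_roster = [
    {"name": "J. Doe", "zip": "30301", "birth_year": 1980, "sex": "F"},
]

reidentified = [
    (p["name"], d["dx"])
    for d in deidentified
    for p in public_roster
    if (d["zip"], d["birth_year"], d["sex"]) ==
       (p["zip"], p["birth_year"], p["sex"])
]
```

The rarer the quasi-identifier combination, the more likely a single match, which is exactly the risk the de-identification rules try to manage.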


Alghatani, K., & Rezgui, A. (2019). A cloud-based intelligent remote patient monitoring architecture. Paper presented at the International Conference on Health Informatics & Medical Systems, HIMS’19, Las Vegas, NV.

National Human Genome Research Institute. (2015). Privacy in genomics. Retrieved from https://www.genome.gov/about-genomics/policy-issues/Privacy

Rosenbaum, E. (2018). Five biggest risks of sharing your DNA with consumer genetic-testing companies. Retrieved from https://www.cnbc.com/2018/06/16/5-biggest-risks-of-sharing-dna-with-consumer-genetic-testing-companies.html

Single-Node Hadoop Installation on Ubuntu 16.04

When embarking on a new build for almost anything, I tend to use online how-to guides published by other bloggers who have encountered specific issues. I recently completed a single-node Hadoop installation on a Linode Ubuntu box, after several unsuccessful attempts at a three-node setup, and am posting my steps here in case they help someone attempting the same.

I relied heavily on Parth Goel’s work and this guide follows it nearly verbatim.

Part 1: Provision the server, harden, and install Java

In my case, I have a number of Linode boxes running already and added one more, running Ubuntu 16.04. I also followed the guide for securing a Linode server.

After provisioning and booting up, I completed the following steps:

Set hostname in /etc/hosts (hadoop-master)
sudo nano /etc/hosts

Add the following line:

<your server ip> hadoop-master
Set hostname in /etc/hostname (hadoop-master)
sudo nano /etc/hostname

Replace the contents with the hostname itself:

hadoop-master
Harden per Linode recommendations

Login to hadoop-master as root.

adduser hduser
adduser hduser sudo
sudo addgroup hadoop
sudo usermod -a -G hadoop hduser

Create SSH key pairs on your local machine (OS X for me) and copy the public key to the Linode box. On OS X, after creating the key pair, run the following command:

ssh-copy-id -i <key name> hduser@<server ip>

Next, disable root login, change SSH port, disable IPV6, and disable password login. You would be surprised at how many brute-force attacks a server is subjected to every minute. If you wind up locking yourself out, there is emergency LISH access.

sudo nano /etc/ssh/sshd_config

In the config file, you’ll change a few lines to look like this:

Port 2222
PermitRootLogin no
PasswordAuthentication no
UsePAM no

Save and exit text editor. One last command to disable IPV6, then restart SSHD:

echo 'AddressFamily inet' | sudo tee -a /etc/ssh/sshd_config
sudo systemctl restart sshd

Next, install UFW. In my case, I have a static IP I can connect from, so I whitelisted it. Nothing else can hit that server. I suppose I could have avoided all the hardening since I was going to only whitelist from one IP address, but better safe than sorry. The last thing I want is my little sandbox being used for some DoS bot attack.

sudo apt-get install ufw
sudo ufw allow from <vpn ip>
sudo ufw enable
Install Java and reboot

This was one of the biggest issues I ran into with various online how-to guides. It was nearly impossible to get the combination of Hadoop, Ubuntu, and Java versions right, particularly when many guides went with the Oracle JDK. This step uses the default and doesn’t mess around with custom packages.

sudo apt-get update
sudo apt-get install default-jdk
sudo reboot

Part 2: Create localhost SSH access for Hadoop and install

Once this server boots up again, you will log in as hduser. The next step will create a key pair on hadoop-master. Leave the filename and prompts blank, otherwise you’ll have more work ahead of you. Note that we are adding the SSH port in our SSH command, as we changed it earlier, and will have to add this to the Hadoop environment variables.

ssh-keygen -t rsa 

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys 

chmod 0600 ~/.ssh/authorized_keys 

ssh -p 2222 localhost

Once you’ve confirmed that works, exit the SSH session and get back to your hadoop-master hduser command line. Now it’s time to install Hadoop.


wget http://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

tar xvzf hadoop-2.7.3.tar.gz

sudo mkdir -p /usr/local/hadoop

cd hadoop-2.7.3/

sudo mv * /usr/local/hadoop

sudo chown -R hduser:hadoop /usr/local/hadoop

Part 3: Hadoop configuration

Variables configuration

Here, you have to do some checking to make sure your Java library is what is expected.

update-alternatives --config java

In this case we are looking for /usr/lib/jvm/java-8-openjdk-amd64

Edit your bashrc file first.

sudo nano ~/.bashrc

Add the following:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

Now edit your hadoop-env file.

sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh

Edit the following line to look like this:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Add this line:

export HADOOP_SSH_OPTS="-p 2222"
Hadoop XML configuration files
sudo mkdir -p /app/hadoop/tmp

sudo chown hduser:hadoop /app/hadoop/tmp

sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml

Add the following between the <configuration> tags. (The property names and values below follow the stock single-node setup; in particular, the filesystem URI and port are the common defaults, so adjust them to your environment.)

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml

sudo nano /usr/local/hadoop/etc/hadoop/mapred-site.xml

Add the following between the <configuration> tags. (The job tracker value assumes the stock single-node port; adjust as needed.)

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode

sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode

sudo chown -R hduser:hadoop /usr/local/hadoop_store

sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Add the following between the <configuration> tags. (The paths match the directories created above; replication is 1 since this is a single node.)

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml

Add the following between the <configuration> tags. (This is the standard auxiliary shuffle service for running MapReduce on YARN.)

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
Part 4. Reboot and fire it up!

After reboot, format HDFS.

hdfs namenode -format

Start the services. (Full paths are used here since Hadoop’s sbin directory was not added to PATH.)

/usr/local/hadoop/sbin/start-dfs.sh
/usr/local/hadoop/sbin/start-yarn.sh
Test with a simple job.

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 2 5

You will want to visit http://<server-ip>:50070/ and http://<server-ip>:8088 to see the consoles.