Machine Learning: Supervised and Unsupervised

Supervised learning typically takes the form of classification or regression. We know both the input and output variables and try to make sense of the relationships between the two; this corresponds to what Tembhurkar, Tugnayat, and Nagdive (2014) call predictive mining. Common methods include decision trees, the k-nearest neighbors (kNN) algorithm, regression, and discriminant analysis. The choice of method depends on the type of output variable: a continuous target calls for regression methods, while a discrete target calls for classification methods.

For example, a human resources division in a large multinational company wants to determine what factors have contributed to employee attrition over the past two years. A decision tree methodology can produce a simple “if-then” map of the attributes that combine to produce a separated employee. An example tree might point out that a male employee over the age of 45, working in Division X, who commutes more than 25 miles from home, has a manager 10 years or more his junior, and has been in the same unit for more than seven years is a prime candidate for attrition. Although many of the variables are continuous, a decision tree method makes the data manageable and actionable for the human resources division.
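
To make the decision tree example concrete, the following is a minimal sketch in Python using scikit-learn. The file name hr_attrition.csv and the column names (age, commute_miles, manager_age_gap, years_in_unit, division, separated) are hypothetical stand-ins for an actual HR extract, not a prescribed schema.

    # Minimal sketch: fit a shallow decision tree to a hypothetical HR extract
    # and print the resulting "if-then" rules.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    hr = pd.read_csv("hr_attrition.csv")                      # hypothetical file
    X = pd.get_dummies(hr[["age", "commute_miles", "manager_age_gap",
                           "years_in_unit", "division"]])     # hypothetical columns
    y = hr["separated"]                                        # 1 = left, 0 = stayed

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

    print(export_text(tree, feature_names=list(X.columns)))    # the "if-then" map
    print("holdout accuracy:", tree.score(X_test, y_test))

Limiting the tree depth keeps the resulting rules short enough to be read and acted upon by a non-technical audience.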

Unsupervised learning usually takes the form of clustering or association. The output variables are not known, and we rely on the system to make sense of the data with no a priori knowledge; Tembhurkar et al. (2014) call this descriptive mining. Common methods include neural networks (such as self-organizing maps), anomaly detection, k-means clustering, and principal components analysis. Here, too, the choice of method depends on the data: association rule mining is typically applied to discrete, transactional data, while clustering methods operate on continuous feature spaces.

For example, a multi-level marketing company has a number of data points on its associates: units sold, associates recruited, years in the program, rewards program tier, et cetera. They know the associates can be grouped into performance categories akin to novice and expert but are unclear on both how many categories to look at and what factors are important. Principal components analysis and k-means clustering can reveal how the associates differentiate themselves based on the available variables and suggest an appropriate number of categories within which to classify them.
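
A rough sketch of this workflow, again in Python with scikit-learn, might look like the following; the file associates.csv and the feature names are hypothetical. The silhouette score is used here simply as one common way to suggest a reasonable number of clusters.

    # Sketch: standardize the associate metrics, inspect the principal
    # components, and compare a few candidate cluster counts.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    df = pd.read_csv("associates.csv")                         # hypothetical file
    features = df[["units_sold", "recruits", "years_in_program", "reward_tier"]]
    scaled = StandardScaler().fit_transform(features)

    # PCA shows which combinations of variables drive the differences.
    pca = PCA(n_components=2).fit(scaled)
    print("variance explained:", pca.explained_variance_ratio_)

    # Try several cluster counts and let the silhouette score suggest one.
    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
        print(k, "clusters -> silhouette", round(silhouette_score(scaled, labels), 3))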

References

Brownlee, J. (2016, September 22). Supervised and unsupervised machine learning algorithms.  Retrieved from https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

Soni, D. (2018, March 22). Supervised vs. Unsupervised learning – towards data science.  Retrieved from https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

Corporate Information Factories and Business Dimensional Models

Differentiating between a Corporate Information Factory (CIF) and a Business Dimensional Model (BDM) may come down to two different directions of strategic thought: top-down (CIF) or bottom-up (BDM).

In the BDM, otherwise known as the Kimball approach, data remain in their respective logical business units (e.g., Sales or Production) but are brought together into the data warehouse through a commonly defined bus architecture. This approach is most prevalent in the Microsoft BI stack. Star or snowflake schemas are utilized and data are rarely normalized past 1NF, if at all. The logical focus is on the originating business units and the goal is often to allow these units to more effectively share data across the organization. For presentation, fewer queries and joins are necessary than one would need to make sense of CIF data.

The CIF, or Inmon approach, starts with the central data repository as the unit of focus as opposed to the individual business units. The business units can create data marts from the normalized tables. Third normal form is required. The most apparent disadvantage here is the amount of time and thought required to implement a true CIF, but the resulting product is a true enterprise data factory. More joins are needed, though, to put the data into presentable form.
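
As an illustration of the join-count difference, the toy sketch below (Python with pandas) answers the same question against a Kimball-style star schema and against more normalized tables; all table and column names are invented for the example.

    # Toy illustration of the join-count difference: a star-schema query needs
    # one join per dimension, while the same question against normalized (3NF)
    # tables may need several more. All table and column names are made up.
    import pandas as pd

    # Kimball-style: one fact table joined to a denormalized dimension.
    fact_sales = pd.DataFrame({"product_key": [1, 2], "amount": [100, 250]})
    dim_product = pd.DataFrame({"product_key": [1, 2],
                                "product_name": ["Widget", "Gadget"],
                                "category_name": ["Hardware", "Hardware"]})
    star = fact_sales.merge(dim_product, on="product_key")     # single join

    # Inmon-style 3NF: category split into its own table, so one more join.
    product = pd.DataFrame({"product_id": [1, 2],
                            "product_name": ["Widget", "Gadget"],
                            "category_id": [10, 10]})
    category = pd.DataFrame({"category_id": [10], "category_name": ["Hardware"]})
    sales = pd.DataFrame({"product_id": [1, 2], "amount": [100, 250]})
    normalized = (sales.merge(product, on="product_id")
                       .merge(category, on="category_id"))     # two joins

    print(star[["product_name", "category_name", "amount"]])
    print(normalized[["product_name", "category_name", "amount"]])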

Where Extract-Transform-Load (ETL) versus Extract-Load-Transform (ELT) is concerned, the former is the more conventional understanding of the process and is typically implemented in dimensional modeling. The transformation happens before the data reach the target system, so the data arrive already arranged, to some degree, by business unit or purpose. The latter (ELT) is utilized most often in more powerful analytics implementations or data lakes.
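
The contrast is easier to see in code. The sketch below uses pandas and SQLite as stand-ins for a source extract and a target platform; the file name, table names, column names, and the cleanup rule are all hypothetical.

    # Sketch contrasting ETL and ELT order of operations; sqlite3 stands in
    # for the target warehouse or data lake.
    import sqlite3
    import pandas as pd

    source = pd.read_csv("daily_sales_extract.csv")    # hypothetical source extract
    target = sqlite3.connect("warehouse.db")

    # ETL: transform first, then load the conformed result (dimensional model).
    conformed = (source.rename(columns=str.lower)
                       .assign(sale_date=lambda d: pd.to_datetime(d["sale_date"])))
    conformed.to_sql("fact_sales", target, if_exists="replace", index=False)

    # ELT: load the raw extract as-is, then transform inside the target,
    # as is typical of data lakes and high-powered analytics platforms.
    source.to_sql("raw_sales", target, if_exists="replace", index=False)
    target.execute("""
        CREATE TABLE IF NOT EXISTS clean_sales AS
        SELECT * FROM raw_sales WHERE amount IS NOT NULL
    """)
    target.commit()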

References

Bethke, U. (2017, May 15). Dimensional modeling and Kimball data marts in the age of big data and Hadoop.  Retrieved from https://sonra.io/2017/05/15/dimensional-modeling-and-kimball-data-marts-in-the-age-of-big-data-and-hadoop/

Harris, D. (n.d.). ETL vs. ELT: How to choose the best approach for your data warehouse. Retrieved from https://www.softwareadvice.com/resources/etl-vs-elt-for-your-data-warehouse/

Kajeepeta, S. (2010, June 7). Is it time to switch to ELT? Intelligent Enterprise – Online. Retrieved from https://proxy.cecybrary.com/login?url=https://search.proquest.com/docview/365390283?accountid=144789

Kumar, G. (2017, March 14). Dimensional modelling vs corporate information factory. Retrieved from http://www.data-design.org/blog/dimensional-modelling-vs-corporate-information-factory

Decision Support Systems, Data Warehouses, and OLAP Cubes

As Tembhurkar, Tugnayat, and Nagdive (2014) define it, BI is “a collection of tools and techniques [that] transforms raw data into significant information useful for analyzing business systems” (p. 128). BI has evolved from the earlier incarnations of Decision Support Systems (DSS), which served the same purpose(s) but were much more rudimentary than today’s implementations. These DSS solutions often comprised data warehouses (DWs) and online analytical processing (OLAP) engines. The two components worked together to serve the business needs: ETL and storage were handled by the data warehouse, and the front-end analysis was handled by the OLAP system.

The data warehouse serves as the central repository for multiple systems of record, which are often heterogeneous and disparate in the beginning. Data are typically replicated and stored in subject-area schemas (e.g., sales or employee data), most typically in fact and dimension tables as part of a SQL-backed relational database. The data warehouse itself can offer pre-packaged views and data marts, and it supports the OLAP system. Like the OLAP system in its original form, the data warehouse is starting to be eclipsed by data lakes in enterprise environments that deal with a large amount of heterogeneous data, often including unstructured data. The difference between the two, for purposes of this comparison, is where the “T” (transformation) falls in ETL or ELT. In a data warehouse, the transformation happens before loading into the warehouse, as its purpose is to serve as a central common repository. In a data lake, the transformation happens after loading, as the lake does not impose any schemas or restrictions in order to achieve any kind of homogeneous state.

The OLAP system is multi-dimensional, not unlike a three-dimensional spreadsheet. It is not a relational database, but it enables the analysis of the data in the data warehouse and supports what we typically understand as slicing and dicing the data. While OLAP cubes were sufficient in the early days of BI, the shift towards a DevOps culture and the proliferation of machine learning, predictive analysis, dashboarding, and envelope-pushing analytics capabilities have come to require more from a BI solution than rigid cubes.
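
A small pandas example can approximate what slicing and dicing looks like, with a pivot table standing in for the cube; the data below are made up purely for illustration.

    # Rough illustration of "slicing and dicing" with a pandas pivot table
    # standing in for an OLAP cube; the data are invented.
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["East", "East", "West", "West", "East", "West"],
        "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
        "product": ["A", "A", "A", "B", "B", "B"],
        "revenue": [100, 120, 90, 80, 60, 110],
    })

    # "Dice": revenue by region and quarter across all products.
    cube = sales.pivot_table(values="revenue", index="region",
                             columns="quarter", aggfunc="sum")
    print(cube)

    # "Slice": hold one dimension fixed (product A only) and re-aggregate.
    print(sales[sales["product"] == "A"]
          .pivot_table(values="revenue", index="region",
                       columns="quarter", aggfunc="sum"))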

References

Felzke, M. (2014). Data warehouse vs. OLAP cube. Retrieved from https://www.solverglobal.com/blog/2014/04/data-warehouse-vs-olap-cube/

Harris, D. (n.d.). ETL vs. ELT: How to choose the best approach for your data warehouse. Retrieved from https://www.softwareadvice.com/resources/etl-vs-elt-for-your-data-warehouse/

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

NXD and RDBMS Solutions

Comparing native XML database (NXD) and relational DBMS solutions is close to comparing apples and oranges. Both are spherical fruit, but they have very different flavors, applications, and characteristics. The RDBMS has been around for a long time and is much more established than the NXD; as a result, there is less collective knowledge around NXD and its implementations. RDBMS solutions are practically ubiquitous and have a number of different implementations, both open-source and proprietary. Tables are normalized for transactional use or, in data warehousing contexts, organized into fact/dimension (star schema) models.

On the other hand, comparative NXD solutions rely on containers and documents in a simple tree structure. Complex joins and queries that are allowable in an RDBMS are typically more difficult in an NXD (Pavlovic-Lazetic, 2007). One area in which the NXD shows promise is Web-enabled data warehousing (Salem, Boussaïd, & Darmont, 2013). Bringing multiple sources of unstructured and structured data together in an Active XML Repository addresses data heterogeneity, distribution, and interoperability issues.

A typical RDBMS implementation for business is a data warehouse in which structured data from various systems of record are brought into a common area and reconciled. These other systems of record may include proprietary relational database systems, mainframe non-relational databases, data exported to delimited formats, et cetera. A data dictionary may be maintained and reconciliation policies may be drawn up by a central data governance board. The output from this data warehouse allows users from different divisions, using different systems of record, to understand a common organization-wide data taxonomy.

One possible NXD solution involves an IoT data environment. Imagine a number of environmental sensors (e.g., temperature, humidity, pressure) being read at regular intervals and pushed to a central web location. In a typical XML tree structure, readings from each sensor or central controller (handling multiple sensors) could be placed in an XML document. These data do not require complex joins and are much better suited to an NXD solution.
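
One possible shape for such a document, built with Python's standard xml.etree.ElementTree module, is sketched below; the element names, attribute names, and reading values are hypothetical.

    # Sketch of a sensor-reading document as a simple XML tree; the element
    # and attribute names are invented for illustration.
    import xml.etree.ElementTree as ET

    controller = ET.Element("controller", id="building-7")
    for sensor_id, kind, value in [("t-01", "temperature", "21.4"),
                                   ("h-01", "humidity", "48"),
                                   ("p-01", "pressure", "1013")]:
        reading = ET.SubElement(controller, "reading",
                                sensor=sensor_id, type=kind,
                                timestamp="2019-03-01T12:00:00Z")
        reading.text = value

    print(ET.tostring(controller, encoding="unicode"))

    # Retrieval stays simple tree navigation rather than a relational join.
    for r in controller.findall("reading[@type='temperature']"):
        print(r.get("sensor"), r.text)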

References

Pavlovic-Lazetic, G. (2007). Native XML databases vs. relational databases in dealing with XML documents. Kragujevac Journal of Mathematics, 30, 181-199.

Salem, R., Boussaïd, O., & Darmont, J. (2013). Active XML-based web data integration. Information Systems Frontiers, 15(3), 371-398.

 

Comparing ADO.NET and Fusion Middleware

ADO.NET provides distinct Connected and Disconnected layers for data access via the Web, allowing for the unpredictability of internet connections. As a Microsoft product, it is optimized for SQL Server, and it has extensive XML support. In that respect it may act as a bridge between traditional RDBMS systems and XML, which is a material concern for Web-DBMS implementations (Barik & Ramesh, 2011). ADO.NET is feature-rich and has many uses, but the trade-off is a relatively steep learning curve and complex customization.

Oracle’s Fusion Middleware is a much more sprawling collection of enterprise-level resources that provides a number of distinct applications and frameworks at the web, middle, and data tiers. At the core of the middle tier is the WebLogic server, which is Java-based and runs a number of open-standard technologies. There are many additional components in the collection, but of particular relevance in relation to ADO.NET is the Service-Oriented Architecture (SOA) Suite, which acts as an incrementally available bridge between heterogeneous systems.

As complex and comprehensive as the Oracle suite is, there are of course trade-offs. Successfully implemented, it is an efficient solution that can incorporate multiple different systems and move information effectively across the solution. However, developers and resources are relatively scarce compared to those for Microsoft’s ADO.NET; beyond that, it is a much larger capital and operational investment than ADO.NET.

These two may both be in consideration where Web-DBMS data access is concerned. Neither is an actual programming language; rather, they should be viewed as middle-tier solutions that enable data access in a Web environment. The scale at which a solution will be deployed is a primary consideration. ADO.NET is a more compact product that functions well within an existing Microsoft ecosystem, whereas Fusion Middleware is a much broader set of tools better suited to an enterprise environment. That is not to say Fusion Middleware cannot be suited to smaller environments.

References

Barik, P. K., & Ramesh, D. B. (2011). Design and development of virtual distributed database for web-based library resource sharing network for Orissa technical and management institutions. International Journal of Information Dissemination and Technology, 1(1), 51.

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Oracle. (n.d.). Oracle Fusion Middleware. Retrieved from https://www.oracle.com/technetwork/middleware/fusion-middleware/overview/index.html

Web-DBMS Solutions: Advantages and Disadvantages

Web-DBMS solutions are attractive for a number of reasons, not the least of which is the ever-increasing ubiquity of high-speed internet access. Mobile database architecture leverages increasingly reliable and fast cellular data connections. Web-DBMS solutions are similar in concept, as they provide access to corporate databases from virtually anywhere, with the added features of being platform-independent and easily accessible via a Web URL.

In times past, corporate databases were typically accessible only to those onsite and through a specific application or platform. As the demands for mobility and the prevalence of heterogeneous user platforms (e.g., Windows, Mac OS, Linux, iOS, Android) have grown, Web-DBMS solutions have risen to meet the resulting challenges. Platform independence is a key advantage of Web-DBMS solutions. A web browser and a Java client are typically all that are required, both of which can be acquired through various methods free of charge. This virtually guarantees the user has the basic technical ability to access the Web-DBMS portal. In addition, there are no proprietary front-end applications that require a specific operating system or libraries to be installed. Even in an internal, platform-homogeneous environment, Web-DBMS solutions simplify access to corporate databases, requiring no special deployments or configurations by a central IT department.

Scalability and deployment are double-edged swords here. An advantage of housing the application layer server-side is the ability to rapidly deploy user access simply by providing the Web URL for the front-end portal. However, given the ease of deployment, scaling up the number of users must be balanced with scaling up the back-end application servers to handle the increased load. In addition to load handling, the solution must also take replication into account and address the issues inherent in Distributed DBMS implementations.

The move to a Web interface includes the adoption of HTML as the front-end markup language, which brings standardization into the picture as a key advantage of Web-DBMS solutions. HTML is ubiquitous across all web browsers. More recently, XML has emerged as a standard for data exchange, following the path of HTML. Of course, particular vendors are introducing proprietary features into this standard landscape, which may impact the availability of a Web-DBMS solution if not properly handled (Connolly & Begg, 2015). Another feature in the scope of standardization is the ability to deploy to each user in exactly the same way by providing a single Web URL. Companies may also implement single sign-on capabilities, which further increases the measure of standardization in a deployment.

When a user visits that particular Web URL, they may be presented with a simple graphical user interface (GUI) and do not have to be familiar with the inner workings of how that GUI interacts with the various components of the solution. A user simply sees the designed GUI for their application. There is no need for various end-user network configurations beyond perhaps a simple proxy URL, and the GUI itself can be designed with the end user’s needs specifically in mind. Of course, the development tools for the GUI and various layers of the solution are relatively immature. The majority of internet development environments to date are little more than text editors (Connolly & Begg, 2015). This disadvantage will likely disappear quickly as technology catches up.

Some key disadvantages present in Web-DBMS implementations are the same issues present in Mobile-DBMS implementations; that is, bandwidth, reliability, and security. Although high speed internet access has spread quickly across the country, dead spots and dropped connections are still a major constraint. With these risks come questions of transaction management and disaster recovery in the event of a network partition. In terms of security, when a Web-DBMS solution is location-agnostic and requires only a web browser, the solution must address questions of basic web security as well as the security standards of the organization.

References

Barik, P. K., & Ramesh, D. B. (2011). Design and development of virtual distributed database for web-based library resource sharing network for Orissa technical and management institutions. International Journal of Information Dissemination and Technology, 1(1), 51.

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

CAP and ACID Principles in Mobile Database Solutions

The concept of a mobile database brings some characteristics from distributed systems and incorporates the growing developments in wireless technology and mobile devices. At a practical level, a mobile database solution allows a business user to have connectivity to a corporate central database while in the field, without a dedicated link to the corporate database server. The method of communication is not unlike asynchronous replication, wherein the updates between mobile node and corporate node are handled on a schedule as opposed to instantaneously. According to Connolly & Begg (2015), mobile database solutions typically include:

  1. a corporate database server and DBMS that manages and stores the corporate data and provides corporate applications;
  2. a remote database and DBMS that manages and stores the mobile data and provides mobile applications;
  3. a mobile database platform that includes laptop, smartphone, or other Internet access devices;
  4. two-way communication links between the corporate and mobile DBMS.

Beyond the functions of a standard DBMS, mobile database solutions also require the ability to:

  1. communicate with the centralized database server through modes such as wireless or Internet access;
  2. replicate data on the centralized database server and mobile device;
  3. synchronize data on the centralized database server and mobile device;
  4. capture data from various sources such as the Internet;
  5. manage data on the mobile device;
  6. analyze data on a mobile device;
  7. create customized mobile applications (Connolly & Begg, 2015).

Common issues in mobile database solutions include security, network partitioning, cellular handoff, and ACID transaction management (Connolly & Begg, 2015; Ibikunle & Adegbenjo, 2013). Security, partitioning, and handoff are all inherent in the mobile nature of the solution; that is, the idea of mobile nodes roaming around the country, with the connections being handed off between cellular towers as the user traverses a route, obviously carries with it the possibilities of signal loss or physical loss of an unsecured device.

ACID principles, which address the cluster as a whole entity, must be relaxed and adapted for mobile database solutions (Connolly & Begg, 2015). Assume for a moment that a large number of modifications are made by a user on the mobile node. Those transactions must be committed to the central database at the next update, but if the connection is lost or weakened as a cellular signal is handed off, the batch may not complete. In that case, according to strict Atomicity rules, the entire set of transactions must be rolled back. This is not optimal, and thus a new approach to Atomicity must be defined for mobile solutions. While the batch remains uncommitted, Isolation is also an issue, because a resource is blocked until the transaction is released. This also raises questions of Consistency and Durability: what happens when connectivity is lost and the mobile database is inconsistent with the central database? The inconsistency is not discoverable until the mobile database is able to re-establish connectivity. What happens if one or more components of the mobile database solution experiences a failure?
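
One way to picture this relaxed approach is a local outbox on the mobile device: updates commit locally first and are pushed individually when the link allows, rather than rolling back the whole batch when the connection drops. The Python sketch below is illustrative only; send_to_central and ConnectionLost are placeholders, not a real mobile DBMS API.

    # Sketch of relaxed atomicity on a mobile node: a local queue of pending
    # transactions is drained when connectivity allows; a dropped link pauses
    # the sync rather than forcing a rollback of the entire batch.
    import random
    from collections import deque

    class ConnectionLost(Exception):
        pass

    def send_to_central(txn):
        # Placeholder for the real upload call; randomly fails to simulate
        # a cellular handoff or signal loss.
        if random.random() < 0.3:
            raise ConnectionLost()
        print("committed centrally:", txn)

    pending = deque()              # local outbox on the mobile device

    def record_update(txn):
        pending.append(txn)        # commit locally first

    def sync():
        """Push queued transactions; pause (not roll back) if the link drops."""
        while pending:
            try:
                send_to_central(pending[0])
            except ConnectionLost:
                return False       # remaining items wait for the next sync window
            pending.popleft()      # acknowledged: safe to remove from the outbox
        return True

    for i in range(5):
        record_update({"txn_id": i})
    while not sync():
        pass                       # in practice, wait for the next connectivity window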

Similarly, just as CAP concerns are raised when a network partition occurs, we must take into account the additional partition likelihood introduced by mobile handoff or signal loss. The mobile database solution must choose between consistency and availability in the event of a partition. In choosing consistency, none of the nodes will be available until they are all back online. In choosing availability, the nodes will be available but not necessarily consistent until connectivity is re-established between all nodes.

The shortcomings of both ACID and CAP for mobile use fall mostly on the CAP side, which applies to the database solution as a whole. The system overall must remain available. Consistency, however, is possible in a somewhat more relaxed form (just as ACID properties tend to be relaxed for mobile database solutions); Ramakrishnan (2012) acknowledges that consistency can exist on a spectrum.

References

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Frank, L., Pedersen, R. U., Frank, C. H., & Larsson, N. J. (2014). The CAP theorem versus databases with relaxed ACID properties. Paper presented at the Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication, Siem Reap, Cambodia.

Greiner, R. (2014). CAP theorem: Revisited. Retrieved from http://robertgreiner.com/2014/08/cap-theorem-revisited/

Ibikunle, F. A., & Adegbenjo, A. A. (2013). Management issues and challenges in mobile database system. International Journal of Engineering Sciences & Emerging Technologies, 5(1).

Shapiro, M., Preguiça, N., Baquero, C., & Zawirski, M. (2011). Conflict-free replicated data types. Paper presented at the Stabilization, Safety, and Security of Distributed Systems, Berlin, Heidelberg.

Ulnes, S. A. (2017). Eventually consistent: A mobile-first distributed system.  Retrieved from https://academy.realm.io/posts/eventually-consistent-making-a-mobile-first-distributed-system/

Breaking down Synchronous and Asynchronous Database Replication

Synchronous replication writes the data to the source and target(s) simultaneously, as a single, all-or-nothing transaction. Asynchronous replication writes to the source first and then propagates the changes to the target(s) at regular intervals; because it runs on a schedule, there is a lag between the local commit and replication to the remote nodes. The asynchronous approach is most commonly used in cloud backup situations.

Breaking it down to the host-storage relationship:

Synchronous

  1. Source Host sends write request to Source Storage
  2. Source Storage writes data and sends to Target Storage
  3. Target Storage writes data and sends acknowledgement to Source Storage
  4. Source Storage sends acknowledgement to Source Host

Asynchronous

  1. Source Host sends write request to Source Storage
  2. Source Storage writes data and sends acknowledgement to Source Host
  3. The update is held in queue until a specified time, at which point the Source Storage sends the update to Target Storage
  4. Target Storage writes data and sends acknowledgement to Source Storage.

The key difference in the chain is where/when the acknowledgement is sent. In Synchronous, the write-to-target action must complete successfully before a Source Host receives acknowledgement. In Asynchronous, the acknowledgement is sent upon writing to source, without confirming an immediate write to target.
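
A toy model of where the acknowledgement happens in each mode is sketched below in Python; in-memory dictionaries stand in for the source and target storage, and a queue plays the role of the asynchronous schedule.

    # Toy model of acknowledgement order in synchronous vs. asynchronous
    # replication; dicts stand in for source and target storage.
    import queue

    source_storage, target_storage = {}, {}
    replication_queue = queue.Queue()       # used only by the asynchronous path

    def synchronous_write(key, value):
        source_storage[key] = value
        target_storage[key] = value          # remote write must finish first...
        return "ack"                         # ...before the host gets its ack

    def asynchronous_write(key, value):
        source_storage[key] = value
        replication_queue.put((key, value))  # held until the scheduled interval
        return "ack"                         # host is acknowledged immediately

    def drain_replication_queue():
        """Runs on a schedule: push queued writes to the target storage."""
        while not replication_queue.empty():
            key, value = replication_queue.get()
            target_storage[key] = value

    synchronous_write("order-1", "shipped")
    asynchronous_write("order-2", "shipped")
    print(target_storage)                    # order-2 not yet replicated
    drain_replication_queue()
    print(target_storage)                    # now both writes are present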

The primary concerns in choosing between these two methods are data integrity and performance. Synchronous replication may guarantee no data loss, but it consumes much more bandwidth and carries a higher cost than an asynchronous solution. Asynchronous replication tends to be more cost-effective, uses fewer resources, and is more resilient by design; however, data loss at the point of write is more likely. If data must match in real time across nodes, synchronous replication is preferable, as even the small delay between local and remote writes in asynchronous replication may not be allowable.

References

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.). London, UK: Pearson.

Vembu. (2016). Synchronous (vs) Asynchronous Replication. Retrieved from https://www.vembu.com/blog/synchronous-vs-asynchronous-replication/

Competing Values: The “Why” Behind the “What and How”

In my current pre-research for eventual dissertation work, I explore the inability of companies to capitalize on analytics capabilities due to the lack of a data-centric culture, and I seek to identify a number of key measures for implementing one. This is admittedly an interdisciplinary endeavor. Büschgens, Bausch, & Balkin (2013) focus on the broader organizational culture phenomenon, but their theoretical approach and meta-analysis are relevant: the same principles that foster a successful organizational culture may also be useful for implementing the subcultures or niche cultures that form part of that broader culture.

The authors introduce two important theoretical constructs: the Measurement of Behavior and Output (Ouchi, 1979) and the Competing Values Framework (Quinn & Rohrbaugh, 1983; Quinn & Spreitzer, 1991). Ouchi’s model, for our purposes, begins with a low ability to measure outputs; the organizational process is therefore classified according to knowledge of the transformation process. For implementing a data-centric culture, our hope is that stakeholders are closely engaged, but that is not always within our control. Even in a worst-case scenario (i.e., clan control), it is possible to “[align] the individual’s objectives with those of the organization” (Büschgens, Bausch, & Balkin, 2013, p. 766).

The Competing Values Framework is a useful tool for quantifying the specific means and ends with which each part of an organization most closely identifies. This is of particular importance when a data-centric culture spans multiple internal entities. Finance might be more Hierarchical in its approach, while IT may be more Rational. Appealing to why the data-centric culture is important will therefore require different foci for each department, based on where it falls within the Competing Values Framework. Such is the focus of the meta-analysis: the authors investigate the relationship between innovation and the four major cultural traits and outline their findings. Of particular interest are (a) the findings on that relationship and (b) the finding that “organizations that create radical innovations do not exhibit different organizational cultures than those that are rather oriented at incremental innovations” (Büschgens, Bausch, & Balkin, 2013, p. 775). This is encouraging: organizations may feel overwhelmed or pushed to make great strides in change when the current climate would not support such radical change, and this finding suggests that an incremental pace is accessible and relevant to all of the major cultural types.

Those of us in IT would be wise to don a management consulting hat once in a while, seeking to understand our customers and the “why” that drives their daily productivity.

References

Büschgens, T., Bausch, A., & Balkin, D. B. (2013). Organizational culture and innovation: A meta-analytic review. Journal of Product Innovation Management, 30(4), 763–781. doi: 10.1111/jpim.12021.

Ouchi, W. G. (1979). A conceptual framework for the design of organizational control mechanisms. Management Science, 25(9), 833–48.

Quinn, R. E., & Rohrbaugh, J. (1983). A spatial model of effectiveness criteria: Towards a competing values approach to organizational analysis. Management Science, 29(3), 363–77.

Quinn, R. E., & Spreitzer, G. M. (1991). The psychometrics of the competing values culture instrument and an analysis of the impact of organizational culture on quality of life. Research in Organizational Change and Development, 5, 15–42.

This post originally appeared on my LinkedIn page: https://www.linkedin.com/pulse/competing-values-why-behind-what-how-jonathan-fowler/