Data Warehousing and Data Mining

The relationship between data mining tools and data warehousing systems is easiest to see in the connector options of popular analytics software packages. Tableau Desktop, for example, offers dozens of source options for pulling data from warehouse backends, and Microsoft Power BI includes similar interface options. R has countless packages for connecting to data warehouse backends, readily available online from proprietary and open-source vendors, and other proprietary packages such as SPSS, SAS, and JMP offer similar interfaces.
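
As a minimal sketch of what such a connection looks like from R, the snippet below uses the DBI and RMySQL packages against a MySQL-backed warehouse; the host, credentials, database, and table names are placeholders, and any of the many other backend packages could be swapped in.

# Minimal sketch: connecting R to a MySQL-based warehouse via DBI/RMySQL.
# Host, credentials, database, and table names below are placeholders.
library(DBI)
library(RMySQL)

warehouse <- dbConnect(RMySQL::MySQL(),
                       host     = "warehouse.example.com",
                       port     = 3306,
                       dbname   = "sales_dw",
                       username = "analyst",
                       password = "********")

# Pull a slice of a warehouse table into R for analysis
sales <- dbGetQuery(warehouse, "SELECT * FROM fact_sales LIMIT 1000")

dbDisconnect(warehouse)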

Simply put, a data mining tool enables insights into what is stored in the data warehouse, and it is only as useful as the quality of the data it accesses. Power (2016) calls this discover, access, and distill. In professional practice, this author has often seen businesses focus erroneously on a particular data mining tool, believing the paid solution will provide immediate value, without first ensuring the data warehouse (or equivalent) is in proper order. Successful implementation of a data mining tool requires a number of preparatory steps, including (but not limited to):

  1. Identifying appropriate Systems of Record (SORs)
  2. Validating the SOR accuracy and alignment with business purposes
  3. Establishing a common understanding of the data points within each SOR and how they translate across business units within the organization (this often requires an organization-wide Data Governance Board or equivalent)
  4. Developing business goals, or questions the data mining tool can answer

These steps ensure the data is valid, useful, and actionable. Organizations that do not take the necessary steps to ensure data quality and develop a business case for the data mining tool run the risk of wasting time and resources on a solution in search of a problem (Gudfinnsson, Strand, & Berndtsson, 2015; LaValle, Lesser, Shockley, Hopkins, & Kruschwitz, 2011).

Consider an international manufacturing company that currently uses a number of disparate systems of record for its business: Cognos (not a true SOR but regarded as one), AS400, 3PL, SQL, Informix, and multiple warehouse management systems. The company has no unified data warehouse and no data governance procedures in place. As it stands, business units that rely on different systems of record cannot work from a common understanding of the data. Attempts at data mining and even simple reporting have failed across business units because of the quality of the data; for example, sales forecasts do not translate between Finance and Marketing because the basic figures from the disparate systems of record do not match. There can be no useful data mining from this data without significant transformation.

Assuming the foundational steps are done and the data mining tool is in production, new data points can be fed back into the warehouse based on discovered insights. For example, consider a multi-level marketing company that has a number of data points on its associates: units sold, associates recruited, years in the program, rewards program tier, et cetera. The company knows the associates can be grouped into performance categories akin to “novice” and “expert” but is unclear on both how many categories to use and which factors matter. Principal components analysis and k-means clustering can reveal how the associates differentiate themselves on the available variables and suggest an appropriate number of categories within which to classify them. These classifications can be written back into the data warehouse and used as covariates in other analysis work.
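
A rough sketch of that workflow in R follows; the associates data frame and its column names are hypothetical stand-ins for the data points described above, while prcomp and kmeans are the base R functions for PCA and k-means.

# Sketch only: the 'associates' data frame and its columns are hypothetical
# stand-ins for the data points described above.
associates <- data.frame(
  units_sold    = c(120, 45, 300, 80, 15, 210),
  recruits      = c(3, 0, 12, 2, 0, 7),
  years_in_prog = c(2, 1, 6, 3, 1, 5),
  rewards_tier  = c(2, 1, 4, 2, 1, 3)
)

# Principal components analysis on the centered and scaled variables
pca <- prcomp(associates, center = TRUE, scale. = TRUE)
summary(pca)   # variance explained by each component

# k-means on the first two components; k = 2 here ("novice" vs. "expert"),
# but k should be chosen by inspection (e.g., an elbow plot of
# within-cluster sums of squares)
set.seed(42)
clusters <- kmeans(pca$x[, 1:2], centers = 2, nstart = 25)

# The resulting classification can be written back to the warehouse
# and used as a covariate in later analyses
associates$segment <- clusters$cluster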

References

Brownlee, J. (2016, September 22). Supervised and unsupervised machine learning algorithms. Retrieved from https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

Gudfinnsson, K., Strand, M., & Berndtsson, M. (2015). Analyzing business intelligence maturity. Journal of Decision Systems, 24(1), 37-54. doi:10.1080/12460125.2015.994287

LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21-31.

Power, D. J. (2016). Data science: Supporting decision-making. Journal of Decision Systems, 25(4), 345-356.

Soni, D. (2018, March 22). Supervised vs. unsupervised learning. Retrieved from https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Tableau Desktop 2018.2 [Computer software]. (2018). Retrieved from http://www.tableau.com.

Tembhurkar, M. P., Tugnayat, R. M., & Nagdive, A. S. (2014). Overview on data mining schemes to design business intelligence framework for mobile technology. International Journal of Advanced Research in Computer Science, 5(8).

Wagging the Dog: BI Tools, Not Solutions

If you’re familiar with the expression, or perhaps have seen the eponymous film, you understand the idea of something with far less importance or weight driving a much bigger process. In the film’s case, the expression characterized a completely fabricated war shifting attention away from an actual scandal. For our purposes here, consider it this way: a business purchasing its end-use BI tool before crafting the strategy behind what it wants and how it wants to use it.

It’s a tempting situation. Vendors do a very good job of promoting their business intelligence tools, and there’s nothing wrong with that. But a company can’t rely on that alone to answer the big questions. You wouldn’t buy a dishwasher and then build a house around it, so why rush to invest in a BI tool before you’ve determined exactly what you want out of it and what questions the business wants to answer?

This over-reliance on proprietary tools has, at least for me, encouraged a focus on open-source BI tools. My most common tools of choice are MySQL for relational databases, RStudio for ETL and analytics, Shiny for R-based deployable visualizations, Orange for GUI-based analytics, and Git for source control. There are other tools, to be sure, and the beauty of the open-source sphere is its constant evolution. Beyond that, you avoid sinking money into a proprietary solution that may be obsolete in a few years.

But more importantly, and this is where wagging the dog comes in, an open-source solution allows your company to pilot potential tools and solutions without the level of risk and investment a proprietary solution may require. I have seen companies invest plenty of money in proprietary solutions before thinking through the business process, then spend a tremendous amount of time and money trying to make that solution work for what they needed even after they realized the tool was not right for them. They let the tail wag the dog.

Software is a tool, not a solution. Be sure you know what a tool needs to do for you before you choose it.

For further reading:

Statsbot – Open Source Business Intelligence

Big Data Made Simple – Top 10 free and open source business intelligence software

Doppelgänger search with R and MatchIt

In his book Everybody Lies, Seth Stephens-Davidowitz discusses the Doppelgänger Discovery method used most notably in baseball, in the case of slugger David Ortiz. Doppelgänger Discovery is a way to load up a model with as many data points about a person as possible and find their statistical twins. In the case of David Ortiz, it proved that he wasn’t quite out of his prime, based on the career arcs of other players just like him.

We are slightly modifying the scenario here. Let’s assume you are charged with selecting participants for a particularly difficult professional development program, one that requires a specific personality profile and resume to truly get the most out of it. You have 3 spots open and 3 idealized candidate profiles representing the individuals best suited to participate. There are 4 key factors to match on, and simply sorting names in a spreadsheet doesn’t cut it. As with most analytics scenarios, there’s an R package for that; there are several, in fact, and the one I’ve used and prefer is MatchIt.

First, get your data straight. In this case, we want a spreadsheet with our individual identifiers (names, Person X, or participant numbers), groups (control vs selection), and the factors to match on. Something like this:

Group  ID        Factor1  Factor2  Factor3  Factor4
0      Person A  .333     .2       .571     3
0      Person B  .667     .2       .571     4
0      Person C  .667     .6       -.285    -2
0      Person D  .333     1.2      .571     6
0      Person E  .000     .8       -.285    8
0      Person F  .000     .4       -.285    -5
1      Person G  .333     1.4      -.285    -1
1      Person H  .667     .6       -.571    0
1      Person I  .000     .2       .285     6

Let’s figure out which candidates best match those ideal profiles. First, install the MatchIt package via your package loader. Next, load your spreadsheet (assuming a CSV format) as a data frame named matching.
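
A quick sketch of those two steps; the file name matching.csv is just a placeholder for wherever your own spreadsheet lives.

# One-time install, then load the spreadsheet;
# "matching.csv" is a placeholder for your own file name
install.packages("MatchIt")
matching <- read.csv("matching.csv", stringsAsFactors = FALSE)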

The following script calls the MatchIt package and performs the matching:

# Call the library
library(MatchIt)

# Set the random seed so the matching is reproducible
set.seed(1234)

# Run the matching function: 1:1 nearest-neighbor matching,
# with all 4 factors equally weighted
match.it <- matchit(Group ~ Factor1 + Factor2 + Factor3 + Factor4,
                    data = matching, method = "nearest", ratio = 1)
a <- summary(match.it)

# Put the matched set in a new data frame, keeping only the original columns
df.match <- match.data(match.it)[1:ncol(matching)]

# Plot the results
plot(match.it, type = 'jitter', interactive = FALSE)

Now you have a data frame with the 3 prototypical candidates and the 3 chosen candidates. Keep in mind these are nearest-neighbor matches, not exact one-to-one twins. See the MatchIt documentation for more information on alternate methods and exact matching.
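
For instance, switching the method argument to "exact" keeps only candidates whose values match an ideal profile on every factor, which can return few or no matches when the factors are continuous. A quick sketch, reusing the same matching data frame (match.exact and df.exact are just names chosen here):

# Exact-matching variant: only candidates identical to an ideal profile
# on all four factors are retained, so the matched set may be smaller
match.exact <- matchit(Group ~ Factor1 + Factor2 + Factor3 + Factor4,
                       data = matching, method = "exact")
summary(match.exact)
df.exact <- match.data(match.exact)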

SurveyGizmo, R, and MySQL on Amazon AWS

Those who have been in any sort of sociological research field should be very familiar with the survey platforms available on the web now (e.g., SurveyMonkey, SurveyGizmo, or LimeSurvey). Getting your results usually involves a multi-step generate/export/import cycle. Is there a better way?

I asked the question when using R to digest a survey deployed on SurveyGizmo. With so many R packages out there, I had a hunch there was something to help me get my results from SG into R without having to run through the generate/export/import cycle. Enter RSurveyGizmo, a package that does exactly that.

Beyond aggregates and analytics, the survey results in SurveyGizmo should be stored elsewhere for future use. This raises more questions about ETL from the website itself to your database of choice. In this case, let’s assume we have a MySQL database running on Amazon AWS. I recommend this over an MSSQL instance because of the difficulty of using an ODBC connection on anything other than Windows (though it can be done).

Assumptions

  • SurveyGizmo account with surveys already active
  • MySQL database established on Amazon AWS
  • You know your host, port, dbname, username, and password for your MySQL database on Amazon AWS
  • R version 3.4.2

Part I: SurveyGizmo

  1. Log into your SurveyGizmo account and head over to your API access options. Find that under Account > Integrations > Manage API.
  2. If you don’t have an active API key listed, Create an API Key. You will then see the API key listed for your user account. Copy that key to a text editor, as you will need it momentarily.
  3. Go back to your SurveyGizmo home page and view the surveys you have out there. Choose one and click on it.
  4. You’ll be taken to the survey build page and the address will be something like https://app.surveygizmo.com/builder/build/id/xxxxxxx where xxxxxxx is a unique number. Copy that number to a text editor, as you will need it momentarily too.

Part II: R + SurveyGizmo

  1. Install RSurveyGizmo via devtools.
    library(devtools)
    install_github(repo="DerekYves/rsurveygizmo")
  2. Construct the script to grab your survey. You will need the API key and survey number.
    library(Rsurveygizmo)
    api <- "your_api_key"
    my.data <- pullsg(survey_number, api, completes_only=T)
  3. You will see loading progress and, depending on the size of your survey, will have a frame full of data in just a few moments. (Sometimes I get a JSON error, but it resolves itself in a few minutes.) SurveyGizmo does have API call limits, so please be judicious with how many times you do this. It’s generally good to run the process once you have enough data to start writing your analytics scripts, then again once the survey is closed; a simple local caching approach is sketched just after this list.
  4. This is the simplest of the methods in the RSurveyGizmo package. You will want to explore the package documentation to learn all it can do for you.
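
As noted in step 3, SurveyGizmo enforces API call limits, so it helps to cache a pull locally and only hit the API when you actually need fresh data. A minimal sketch, reusing the pullsg call from step 2; the cache file name is arbitrary.

# Cache the SurveyGizmo pull so repeated script runs don't burn API calls;
# "survey_cache.rds" is an arbitrary file name
cache_file <- "survey_cache.rds"

if (file.exists(cache_file)) {
  my.data <- readRDS(cache_file)
} else {
  my.data <- pullsg(survey_number, api, completes_only = TRUE)
  saveRDS(my.data, cache_file)
}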

Part III: R + MySQL

  1. Install the RMySQL package via your package loader.
  2. Construct the script to establish your connection, filling in your specific details.
    # load RMySQL
    library(RMySQL)
    
    # establish the MySQL connection
    con <- dbConnect(RMySQL::MySQL(),
     username = "user",
     password = "password",
     host = "name.something.zone.rds.amazonaws.com",
     port = 3306,
     dbname = "mydb"
    )
  3. Now con will serve as your pipeline for the RMySQL calls.
  4. Two common functions are dbWriteTable and dbGetQuery. As you might expect, to write an R data frame to a table in your MySQL database, you use dbWriteTable:
    dbWriteTable(con, "table_name", dataframe.name, overwrite=TRUE)

    Using overwrite=TRUE means your table is essentially dropped and recreated, rather than appended.
    To get an existing MySQL table into a new R data frame, you’d use dbGetQuery (dbSendQuery only submits the statement; you would still need dbFetch to retrieve the rows):

    newframe <- dbGetQuery(con, "SELECT * FROM mydb.mytable")
  5. Here’s a wrinkle, though. SurveyGizmo downloads come with concatenated column names that may not be very helpful. I prefer to convert all my column names to a standard format and establish a reference table with all the original questions matched up. The following script grabs all the column names from an existing data frame and creates a table with a standard “qxxx” format matched to the original question name.
    # get question text into vector
    Question_Text <- colnames(mydata.original)
    
    # get length of that vector
    sq <- length(Question_Text)
    
    # generate sequence based on that length
    QKey <- sprintf("q%03d", seq_len(sq))
    
    # make a new data frame with the QKeys matched to the original question text
    mydata.questions <- data.frame(QKey, Question_Text)
    
    # replace the original question text with those keys
    colnames(mydata.original) <- QKey

    Now you have two frames: mydata.original with uniform column names, and mydata.questions with those column names matched to the original text.

    Assuming you want to get those frames into your MySQL database, use the following:

    dbWriteTable(con, "mydata_questions",mydata.questions, overwrite=TRUE)
    dbWriteTable(con, "mydata_original",mydata.original, overwrite=TRUE)

R, Azure SQL Server, and Mac OS X

There are a few options out there for those of us who work (mostly or solely) on a Mac and need to access MSSQL databases through R.

RODBC and RSqlServer are the two I’ve worked with. RODBC requires some additional workarounds with a Unix ODBC driver; RSqlServer has issues with rJava on OS X High Sierra. Overall, I’ve found RODBC to require the fewest workarounds and the least frustration. Another wrinkle here is having an Azure SQL Server as the endpoint with SQL Authentication (not Windows Authentication).

First, you will need to update your Mac with Homebrew and then add a few packages. This page discusses how, but it did not work for me out of the box; there are extra steps.

1. Install Homebrew via Terminal (see https://brew.sh for the current install command)

2. Install unixODBC via Terminal
brew update
brew install unixodbc

3. Install freeTDS via Terminal

brew install freetds --with-unixodbc

4. Add to /usr/local/etc/odbcinst.ini
[MSSQL]
Description = Microsoft SQL Server driver
Driver      = /usr/local/Cellar/freetds/0.95.18/lib/libtdsodbc.so

5. Add to /usr/local/etc/freetds.conf
[MY_SQL_SERVER]
host = myazureserver.database.windows.net
port = 1433
tds version = 7.0

6. Add to /usr/local/etc/odbc.ini
[myazureserver]
Driver=/usr/local/lib/libtdsodbc.so
Trace=No
Server=myazureserver.database.windows.net
Port=1433
TDS_Version=8.0
Database=myazuredatabase

7. In Terminal, enter (with user and pass replaced by your credentials):
isql -v myazureserver user pass
You should see a success message and a new prompt. See the Connecting section on the page for what it should look like. Type quit to exit that SQL shell.

8. If that is successful, you know the system-level configuration is complete. Run the following commands in Terminal to create symbolic links:
ln -vs /usr/local/Cellar/freetds/0.95.18/etc/freetds.conf ~/.freetds.conf
ln -vs /usr/local/Cellar/unixodbc/2.3.2_1/etc/odbc.ini ~/.odbc.ini
ln -vs /usr/local/Cellar/unixodbc/2.3.2_1/etc/odbcinst.ini ~/.odbcinst.ini
9. In R, run this script (after replacing user and pass with your credentials)
# install RODBC package (can comment this out once run)
install.packages("RODBC", type = "source")

# call RODBC package
library(RODBC)

# create a connection
mycon <- odbcConnect("myazureserver", uid="user", pwd="pass")

# see what it looks like:
mycon 

# Select the top 100 records from table dbo.Table and load into dataframe "rs"
rs <- sqlQuery(mycon, "SELECT TOP (100) * FROM dbo.Table")

Now you should have a data frame named “rs” with 100 rows of data.
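
From here the usual RODBC helpers apply. For example, you can list the tables visible through the connection and close it when you are done; sqlTables and odbcClose are standard RODBC functions, shown here as a quick sketch.

# List the tables visible through this connection
sqlTables(mycon)

# Close the connection when finished
odbcClose(mycon)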