In 2007, researchers started warning about ways that social interaction data might be used to predict and manipulate behavior. See the full report here.
Wagging the Dog: BI Tools, Not Solutions
If you’re familiar with the expression, or perhaps have seen the eponymous film, you understand the idea of something with far less importance or weight driving a much bigger process. In the film’s case, the expression was used to characterize a completely fabricated war shifting attention away from an actual scandal. For our purposes here, consider it this way: a business purchasing their end-use BI tool before crafting the strategy behind what they want and how they want to use it.
It’s a tempting situation. Vendors do a very good job of promoting their business intelligence tools, and there’s nothing wrong with that. But a company can’t rely on that alone to solve the big questions. You wouldn’t buy a dishwasher and then build a house around it…so why rush to invest in a BI tool before you’ve determined exactly what you want out of it and what questions the business wants to answer?
This over-reliance on proprietary tools has, at least for me, encouraged a focus on open-source BI tools. My most common tools of choice are MySQL for relational databases, RStudio for ETL and analytics, Shiny for R-based deployable visualizations, Orange for GUI-based analytics, and Git for source control. There are other tools, to be sure, and the beauty of the open-source sphere is the constant evolution. Beyond that, you avoid sinking money into a proprietary solution that may be obsolete in a few years.
But more importantly, and where this fits into my point about wagging the dog, an open-source solution allows your company to pilot potential tools and solutions without the same level of risk and investment a proprietary solution may require. I have seen companies invest plenty of money in proprietary solutions before they thought through the business process. They then wound up spending a tremendous amount of time and money trying to make that solution work for what they needed, even after they realized the tool was not right for them. They let the tail wag the dog.
Software is a tool, not a solution. Be sure you know what a tool needs to do for you before you choose it.
For further reading:
Statsbot – Open Source Business Intelligence
Big Data Made Simple – Top 10 free and open source business intelligence software
Open Source 101: Columbia
Being platform-agnostic, and not letting the tail wag the dog, is a critical part of Business Intelligence efforts. I’ve written on this before, and firmly believe that choosing a particular software solution should be done after the strategies and business cases for the BI efforts are crafted.
To that end, open-source software is a favorite. There’s a conference coming up near me in April. I’ve registered, and I hope to see you there.
Doppelgänger search with R and MatchIt
In his book Everybody Lies, Seth Stephens-Davidowitz discusses the Doppelgänger Discovery method used most notably in baseball, in the case of slugger David Ortiz. Doppelgänger Discovery is a way to load up a model with as many data points about a person as possible and find their statistical twins. In the case of David Ortiz, it suggested that he wasn’t quite out of his prime, based on the career arcs of other players just like him.
We are slightly modifying the scenario here. Let’s assume you are charged with selecting participants for a particularly difficult professional development program that requires a specific personality profile and resume for someone to truly get the most out of it. You have 3 spots open, and 3 idealized candidate profiles that represent those individuals who would be best suited to participate. There are 4 key factors to match on, and just sorting names in a spreadsheet doesn’t really cut it. As with most analytics scenarios, there’s an R package for that. There are several. I’ve used and prefer MatchIt.
First, get your data straight. In this case, we want a spreadsheet with our individual identifiers (names, Person X, or participant numbers), groups (control vs selection), and the factors to match on. Something like this:
Group | ID | Factor1 | Factor2 | Factor3 | Factor4 |
---|---|---|---|---|---|
0 | Person A | .333 | .2 | .571 | 3 |
0 | Person B | .667 | .2 | .571 | 4 |
0 | Person C | .667 | .6 | -.285 | -2 |
0 | Person D | .333 | 1.2 | .571 | 6 |
0 | Person E | .000 | .8 | -.285 | 8 |
0 | Person F | .000 | .4 | -.285 | -5 |
1 | Person G | .333 | 1.4 | -.285 | -1 |
1 | Person H | .667 | .6 | -.571 | 0 |
1 | Person I | .000 | .2 | .285 | 6 |
Let’s figure out who would be our ideal candidates. First, install the MatchIt library via your package loader. Next, load your spreadsheet (assuming a CSV format) as a dataframe named matching.
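As a minimal sketch of the loading step, the snippet below rebuilds the example table, writes it to a CSV, and reads it back as `matching`. The temporary file path is only an assumption for illustration; in practice you would point `read.csv` at your own file. The install line is commented out because it only needs to run once.

```r
# install.packages("MatchIt")  # run once per machine; left commented out here

# Rebuild the example table as a data frame (values copied from the table above)
example <- data.frame(
  Group   = c(0, 0, 0, 0, 0, 0, 1, 1, 1),
  ID      = paste("Person", LETTERS[1:9]),
  Factor1 = c(.333, .667, .667, .333, .000, .000, .333, .667, .000),
  Factor2 = c(.2, .2, .6, 1.2, .8, .4, 1.4, .6, .2),
  Factor3 = c(.571, .571, -.285, .571, -.285, -.285, -.285, -.571, .285),
  Factor4 = c(3, 4, -2, 6, 8, -5, -1, 0, 6)
)

# Hypothetical file location, used only so the example is self-contained
csv_path <- file.path(tempdir(), "matching.csv")
write.csv(example, csv_path, row.names = FALSE)

# Load the spreadsheet as a data frame named matching
matching <- read.csv(csv_path, stringsAsFactors = FALSE)

# Quick sanity check on column types and group sizes
str(matching)
table(matching$Group)
```

Checking the group counts before matching is a cheap way to confirm that the prototype rows (Group = 1) and the candidate pool (Group = 0) were read in as expected.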
The following script calls the MatchIt package and performs the matching:
    # Call the library
    library(MatchIt)

    # Initialize
    set.seed(1234)

    # Run matching function; all 4 factors are equally weighted
    match.it <- matchit(Group ~ Factor1 + Factor2 + Factor3 + Factor4,
                        data = matching, method = "nearest", ratio = 1)
    a <- summary(match.it)

    # Put matched set in a new data frame
    df.match <- match.data(match.it)[1:ncol(matching)]

    # Plot the results
    plot(match.it, type = 'jitter', interactive = FALSE)
Now you have a data frame with the 3 prototypical candidates and the 3 chosen candidates. Keep in mind that these are nearest-neighbor matches, not exact ones, and that this data frame does not show which chosen candidate was paired with which prototype. See the MatchIt documentation for more information on alternate methods and exact matching.
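If you want to see which candidate was paired with which prototype, one option is the `match.matrix` element of the fitted object. The sketch below repeats the example data so it runs on its own, and it assumes the data frame keeps its default row names ("1" through "9"). The `pairs` data frame is a name introduced here for illustration. With a sample this small, the underlying logistic regression may emit warnings.

```r
library(MatchIt)

# Example data repeated so the snippet is self-contained
matching <- data.frame(
  Group   = c(0, 0, 0, 0, 0, 0, 1, 1, 1),
  ID      = paste("Person", LETTERS[1:9]),
  Factor1 = c(.333, .667, .667, .333, .000, .000, .333, .667, .000),
  Factor2 = c(.2, .2, .6, 1.2, .8, .4, 1.4, .6, .2),
  Factor3 = c(.571, .571, -.285, .571, -.285, -.285, -.285, -.571, .285),
  Factor4 = c(3, 4, -2, 6, 8, -5, -1, 0, 6)
)

set.seed(1234)
match.it <- matchit(Group ~ Factor1 + Factor2 + Factor3 + Factor4,
                    data = matching, method = "nearest", ratio = 1)

# match.matrix: one row per Group = 1 unit; the cell holds the row name
# of the Group = 0 unit it was matched to
mm <- match.it$match.matrix

# Translate row names back to the ID column (hypothetical helper table)
pairs <- data.frame(
  prototype = matching[rownames(mm), "ID"],
  matched   = matching[mm[, 1], "ID"]
)
print(pairs)
```

This keeps the pairing separate from `df.match`, so the matched data frame can stay limited to the original columns.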