Doppelgänger search with R and MatchIt

In his book Everybody Lies, Seth Stephens-Davidowitz discusses the Doppelgänger Discovery method, used most notably in baseball in the case of slugger David Ortiz. Doppelgänger Discovery is a way to load up a model with as many data points about a person as possible and find their statistical twins. In the case of David Ortiz, it suggested that he wasn’t quite out of his prime, based on the career arcs of other players just like him.

We are slightly modifying the scenario here. Let’s assume you are charged with selecting participants for a particularly difficult professional development program, one that requires a specific personality profile and resume for someone to truly get the most out of it. You have 3 spots open, and 3 idealized candidate profiles that represent the individuals who would be best suited to participate. There are 4 key factors to match on, and just sorting names in a spreadsheet doesn’t really cut it. As with most analytics scenarios, there’s an R package for that. In fact, there are several; I’ve used and prefer MatchIt.

First, get your data straight. In this case, we want a spreadsheet with our individual identifiers (names, Person X, or participant numbers), groups (control vs selection), and the factors to match on. Something like this:

Group  Name      Factor1  Factor2  Factor3  Factor4
0      Person A  .333     .2       .571     3
0      Person B  .667     .2       .571     4
0      Person C  .667     .6       -.285    -2
0      Person D  .333     1.2      .571     6
0      Person E  .000     .8       -.285    8
0      Person F  .000     .4       -.285    -5
1      Person G  .333     1.4      -.285    -1
1      Person H  .667     .6       -.571    0
1      Person I  .000     .2       .285     6

Let’s figure out who would be our ideal candidates. First, install the MatchIt package (for example, with install.packages("MatchIt")). Next, load your spreadsheet (assuming a CSV format) as a data frame named matching.
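If you don't have a CSV handy, here is a minimal sketch that types the sample rows straight into a data frame instead. The column names (Group, Name, Factor1 through Factor4) are my assumption, chosen to line up with the formula used in the matching script; your spreadsheet may label them differently.

```r
# Build the sample data by hand; column names are assumed, not prescribed.
# Group 1 = idealized profiles, Group 0 = the candidate pool.
matching <- data.frame(
  Group   = c(0, 0, 0, 0, 0, 0, 1, 1, 1),
  Name    = paste("Person", LETTERS[1:9]),
  Factor1 = c(.333, .667, .667, .333, .000, .000, .333, .667, .000),
  Factor2 = c(.2, .2, .6, 1.2, .8, .4, 1.4, .6, .2),
  Factor3 = c(.571, .571, -.285, .571, -.285, -.285, -.285, -.571, .285),
  Factor4 = c(3, 4, -2, 6, 8, -5, -1, 0, 6)
)

# Quick sanity check: how many rows fall in each group
table(matching$Group)
```

With a real file you would swap the data.frame() call for read.csv() pointed at your own path.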

The following script calls the MatchIt package and performs the matching:

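# Call the library
library(MatchIt)

# Initialize (the seed value is arbitrary; it only keeps runs reproducible)
set.seed(1234)

# Run matching function; all 4 factors are equally weighted
# (match.it is simply the name used here for the fitted object)
match.it <- matchit(Group ~ Factor1 + Factor2 + Factor3 + Factor4, data = matching, method = "nearest", ratio = 1)
a <- summary(match.it)

# Put matched set in a new data frame, keeping only the original columns
df.match <- match.data(match.it)[1:ncol(matching)]

# Plot the results
plot(match.it, type = "jitter", interactive = FALSE)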

Now you have a data frame with the 3 prototypical candidates and the 3 chosen candidates. Keep in mind that the candidates do not correspond 1:1 to the profiles in the sense of exact matches; these are nearest-neighbor matches, so each chosen candidate is only the closest available fit. See the MatchIt documentation for more information on alternate methods and exact matching.
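The matched data frame keeps only the original columns, so it does not say which candidate was paired with which profile. One way to recover that, as a sketch: the fitted object carries a match.matrix component whose row names are the rows of the selection group and whose entries are the row names of their matched controls. The sample data, the column names, and the object names below (match.it, pairs) are assumptions for illustration.

```r
library(MatchIt)

# Sample data with assumed column names; Group 1 = profiles, 0 = candidates
matching <- data.frame(
  Group   = c(0, 0, 0, 0, 0, 0, 1, 1, 1),
  Name    = paste("Person", LETTERS[1:9]),
  Factor1 = c(.333, .667, .667, .333, .000, .000, .333, .667, .000),
  Factor2 = c(.2, .2, .6, 1.2, .8, .4, 1.4, .6, .2),
  Factor3 = c(.571, .571, -.285, .571, -.285, -.285, -.285, -.571, .285),
  Factor4 = c(3, 4, -2, 6, 8, -5, -1, 0, 6)
)

match.it <- matchit(Group ~ Factor1 + Factor2 + Factor3 + Factor4,
                    data = matching, method = "nearest", ratio = 1)

# One row per profile, one column per match (a single column when ratio = 1)
mm <- match.it$match.matrix

# Look up names by row name to list each profile beside its matched candidate
pairs <- data.frame(
  Profile   = matching[rownames(mm), "Name"],
  Candidate = matching[as.character(mm[, 1]), "Name"]
)
print(pairs)
```

The lookup relies on the default row names of the data frame, so it would need adjusting if your data carries custom row names.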
