Reproducibility in Action
26 July 2017
Richard Schwinn, PhD
A Breakthrough at Major University (Pile of Cash)
- In 2006, a major university saw a billion dollar industry on the horizon.
- One of its primary researchers, we’ll call him Dr. P., designed a test to show that
- DNA signatures could predict whether patients would respond to specific chemotherapy drugs (P. et al., 2006).
- Such a test would be hugely valuable.
- If it said a particular treatment wouldn’t work, it could be skipped,
- instead of uselessly poisoning the patient for 6 weeks.
- With this test, patients would get the treatments most likely to work for them.
MD Anderson’s Dr. Baggerly Attempts to Verify (Picture of Dr. Baggerly)
- A competing hospital, MD Anderson Cancer Center, was also excited about the possible treatment.
- But before investing in it, they wanted to be sure that they completely understood the underlying research.
- So they assigned Dr. Keith Baggerly to replicate the empirical results.
- Everyone was very optimistic, especially because
- Dr. P. had provided data along with his paper.
Meanwhile Major University Jumped Straight into Clinical Trials
- While Baggerly was trying to make sense of the data, Major University jumped straight into clinical trials. (PDF and Excel icons)
- After trying to replicate the findings for several months and noticing massive errors, Dr. Baggerly contacted Major University’s Dr. P.
- Within 6 months, Dr. P. cut off all communication.
The First Investigation (magnifying glass)
- In response to Dr. Baggerly’s persistence, Major University convened an outside panel to investigate the results
Results Vindicated
- P.’s findings were vindicated
- The panel concluded that the research was sound!
- They said that they could find nothing wrong with it.
The Clinical Trials Continued
- Here are two screenshots from a 60 Minutes special on this story.
- The husband of one of Dr. P.’s patients from the clinical trial asks Dr. P. for his
- permission to record their conversation.
- Dr. P. responds, “Absolutely. That’s a good thing, ’cause you’re going to miss a lot.”
- This is an important point. Dr. P. would say the same to anyone asking about his research, even his colleagues and those investigating his work, like Baggerly.
Undeterred
Baggerly repeatedly showed that Dr. P.’s research was intentionally bad.
- Yet his analyses were either ignored or criticized for being too technical.
- Like the Erin Brockovich of health research, Dr. Baggerly didn’t stop raising the red flag.
Finally… After 5 Years of Fighting (Rhodes Scholar Image)
Paul Goldberg, an editor of The Cancer Letter, a small trade publication,
- who had been previously contacted by Baggerly, noticed that Dr. P. had lied about being a Rhodes Scholar.
- This was a lie everyone could understand.
- Finally, a story about Dr. P. got traction!
Follow up
- In support of his friend, Dr. P.’s co-author, Dr. N.,
- attempted to reproduce the research from the raw data.
- And he immediately discovered data manipulations that
- in his own words, “Couldn’t be inadvertent.”
Aftermath
- Some clinical trial patients, who were denied standard treatments based on this research, died.
- Today, 11 lawsuits have been settled. Others are still pending.
- Two thirds of Dr. P.’s papers have been retracted
- One of the most shocking aspects of this story is that
- Brad Perez, a student in Dr. P.’s lab at Major University at the time, had written a thorough whistleblower report discrediting the research back in 2006… and it was ignored!
- Today Dr. Perez is an oncologist in Tampa, Florida.
The Audience for Cutting Edge Research is Small
- Cutting-edge research is
- written for a small group of experts.
- In many fields, peer reviewers qualified enough to understand cutting-edge research
- face high opportunity costs and
- almost never have time to verify research based on raw data.
- These barriers open a door for those willing to commit outright fraud to go unnoticed.
What can be done?
- In the past, the only solution was to rely on
- trust among patients and
- honor among doctors.
- An economist, however, would recommend
- increasing the expected cost of deception
- \(E[C(\omega)] =\) expected cost
- \(\omega =\) level of deception
- or the probability of detection.
\[E[C] = \int_\Omega C(\omega)P(d\omega)\]
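As a minimal worked case (an illustrative assumption, not part of the original slide), suppose the only uncertainty is whether the deception is detected: detection happens with probability \(p\) and carries a penalty \(S\), while going undetected costs nothing. The symbols \(p\) and \(S\) are introduced only for this example. The integral then reduces to

\[E[C] = p \cdot S + (1 - p) \cdot 0 = pS\]

so the expected cost rises when either the penalty \(S\) or the probability of detection \(p\) increases.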
Retraction Watch
Websites like retractionwatch.com spread the word
- Thus increasing the severity of the stigma and reputational effects for bad researchers.
- The other option is to increase the probability of detection.
Combating Deceptive Research
- By increasing the probability of detection, reproducible research
- reduces the incentive to commit fraud and
- it makes identifying subtle, unintentional errors easier.
- Reproducible research has a precise definition:
- Research is considered reproducible if
- it is published with both the underlying data and the code used to analyze it
- so that it is easy for a non-expert to reproduce the results.
Software Options
- Literate programming languages
- Combined with statistical software
- like SAS
- and IPython Notebook
- RStudio integrates a number of programming languages under the extremely easy-to-use RMarkdown language.
RMarkdown Cheatsheet
- RMarkdown can be learned by a person with no programming experience in one afternoon.
- RStudio provides a cheat sheet covering all of its important commands that fits on one sheet of paper. (double sided)
R Work Flow
- Generating reports in R and RMarkdown decreases the cost of investigating the accuracy of academic research.
- Data is cheaper than ever.
- A little study of basic Application Programming Interface (API) documentation enables
- researchers to do things that were previously impossible to all but a small group of computer scientists.
- Reproducibility offers tremendous time-savings opportunities for regional and periodical reporting through parameterized reports.
- RMarkdown lets you rerun your data processing and, with a little work, update even the textual content of your document with one click of the Knit button.
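A parameterized report might be driven by a short script along these lines. This is only a sketch: the template name `profile.Rmd`, its `state` parameter, and the helpers `report_file()` and `render_state()` are hypothetical. The one real call is `rmarkdown::render()` with its `params` argument. The render call is left commented out so the sketch runs without a template on disk.

```r
# Hypothetical helper: build an output file name for one state
report_file = function(state) {
  paste0("profile_", gsub(" ", "_", tolower(state)), ".html")
}

# Hypothetical helper: knit the (assumed) template once for one state
render_state = function(state, template = "profile.Rmd") {
  rmarkdown::render(
    input = template,
    params = list(state = state),      # read inside the Rmd as params$state
    output_file = report_file(state)
  )
}

states = c("Florida", "New York")
outputs = vapply(states, report_file, character(1))
print(outputs)

# Uncomment once a template declaring a `state` parameter exists:
# invisible(lapply(states, render_state))
```

Rerunning the same script next period regenerates every report from the updated data.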
R Output
RStudio Output
- RStudio unifies several programming languages
- through RMarkdown to easily output content to multiple formats.
US SBA
- These tools were created to ensure the quality of academic research
- but they are incredibly useful outside of academia
- I am an economist at the U.S. Small Business Administration.
- One of our stated objectives is to report on the status of small businesses.
- We issue an annual small business profile to meet this goal.
US Small Business GDP
- Another one of our statutory goals is to report on the contributions of small business.
- Using R Markdown allows us to create
- an adaptive report
- with new graphical elements
- such as a modified Tufte layout
- hierarchical maps
- spark graphs
- waffle charts
- and parameterized industry level profiles
- which provide dozens of statistics and graphics
Faux Market Research: Car Sharing
- 2 questions
- By a show of hands, who has never used a car sharing platform?
- Can I also ask, who has wanted to use a car sharing platform but it was unavailable in their area?
- Suppose you work for Uber.
- You want to pitch the directors on expanding to new areas in Florida
Step 1: Load Libraries and knitr options
library(knitr) # Generates report
library(dplyr) # Wrangles data
library(choroplethr) # Creates maps
library(choroplethrMaps) # County data
library(ggplot2) # Creates graphics
library(gridExtra) # Arranges graphics
library(acs) # Downloads data
library(stringr) # Wraps labels
knitr::opts_chunk$set(...)
Step 2: Data Downloading
demo_df = acs_data_prep(c("B01003", "B19301"))
commute_df = acs_data_prep("B08534", 1:10)
transport_df = acs_data_prep("B08301", c(2,10,16:20))
aggregate_df = acs_data_prep("B08135", 1)
df = rbind(demo_df, commute_df, transport_df, aggregate_df)
Step 3: Create Statewide Maps
maps_list =
  c("B01003", # Total Population
    "B19301", # Income
    "B08534", # Number of commuters
    "B08135") # Aggregate Travel Time to Work
# Maps table maps_list[x]; state_name is assumed to be defined earlier in the report
plot_maps = function(x) {
  it = filter(df, table_number == maps_list[x], index == 1)
  county_choropleth(it, state_zoom = tolower(state_name)) +
    scale_fill_brewer(palette = x) + # x also selects a different Brewer palette per map
    ggtitle(it$table_title[1]) + # use a single title rather than one per row
    theme(legend.position = "bottom")
}
Step 4: Create County Level Reports
state_counties = filter(df, state.name == tolower(state_name), table_number == "B08135") %>% arrange(state_rank) # Selects county for the state
make_county_reports = function(x) {...}
county_reports = lapply((1:nrow(state_counties)), make_county_reports)
county_reports[1:nrow(state_counties)]
Thank You
- We have seen that
- Reproducibility tools can combat falsified research
- That these tools can be used profitably for regional and periodical reporting
- and that complex and useful reports can be created in a matter of minutes
Outline
- Definitions
- History
- Uses of SUSB
- Challenges to Accessing SUSB
- Future Availability via the SUSB Data Explorer
Here is a quick outline of my introduction to the Statistics of U.S. Businesses
- At first glance, you might think these topics are a little unexciting.
- But if you give it a moment, and look closely at this list,
- you will notice that, indeed, it is not exciting.
- Luckily for you, after my discussion on the programmatic manipulation of the underlying data
- you’ll hear from Miriam, whose detailed mastery of the nuances of SUSB makes all of this possible
- Miriam can tell you about which data exists and why or why not
- she can explain why reported totals and table sums do not necessarily align, and so much more
- and today she will tell you about some existing OER tools on the Advocacy webpage
- Jonathon will show you OER’s new SUSB visualizations for analysis, which should provide a nice change of pace.
What is SUSB?
The Statistics of U.S. Businesses (SUSB) is an annual dataset that provides data on
- Numbers of businesses
- Employment
- Revenues
- Births and deaths
- Expansions and contractions
- Payroll
- For firms and for establishments
- By size, by industry, and by geography
History and Generation of SUSB
- SUSB began in 1989 as an extension of the County Business Patterns (CBP) which itself began in 1964 and also grew out of similar data projects dating back to 1946
- The data underlying SUSB come from the Business Register (BR), which is a database of all known employers maintained by the U.S. Census Bureau
- Additional data come from various Census Bureau programs, such as
- the Economic Census
- the Annual Survey of Manufactures
- the Current Business Surveys
- the Company Organization Survey
- Noise infusion methodology is applied to protect individual business establishments from disclosure
- Advocacy funds SUSB through an inter-agency-agreement (IAA)
Uses of SUSB 1
- SUSB is the premier source of data on small businesses
- The data are referenced extensively
- by Congress
- in leading newspapers
- and, most prominently, in academic research
Uses of SUSB 2
A quick Google Scholar search yields
- Over 19,000 references in academic publications
- And there are already more than 2,500 citations this year alone
- The data are regularly incorporated into research from leading institutions like the University of Chicago (Hurst), Columbia (Benjamin Pugsley), Harvard, etc.
- You might recognize the author of the top search result for SUSB
- It’s a major credit to our office that Advocacy’s own Brian Headd appears in the top rank
- Between his book, An Analysis of Small Business and Jobs, and his other publications, Brian has hundreds of small business citations for his work based on SUSB
SUSB Data Challenges 1
The Office of Advocacy disseminates the SUSB datasets jointly with the SUSB office at Census
- As a result we can request
- special tabulations
- changes in the tables such as breakdowns by various new employment size increments
- Over time, these little internal changes
- coupled with external changes like
- changes to the definitions of industry names and
- disclosure policies (noise infusion, complementary cell suppression)
- have resulted in inconsistency in the format of the data over time
- this makes the data difficult to organize and to use for analyses involving time series
SUSB Data Challenges 2
- 4162 Geographies
- 1 national stats
- 51 states
- 917 metropolitan statistical areas
- 3193 counties
- 26 firm sizes
- 2016 industries
- 7 variables (employment, number of firms, etc.)
- 20+ years of data
If data were provided for all permutations, it would represent well over 30 billion elements (4162 × 26 × 2016 × 7 × 20). The existing source tables consist of only between 300 and 400 million cells.
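The size of the full cross-product can be checked directly from the counts listed on this slide. The sketch below assumes exactly 20 years of data, the lower bound of "20+".

```r
# Counts taken from the slide
geographies = 1 + 51 + 917 + 3193   # nation, states, MSAs, counties
firm_sizes  = 26
industries  = 2016
variables   = 7
years       = 20                    # assumed lower bound of "20+ years"

cells = geographies * firm_sizes * industries * variables * years
format(cells, big.mark = ",")       # "30,541,754,880"
```

Each additional year of data adds roughly 1.5 billion more possible cells.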
SUSB Data Challenges 2
- The real problem is that the data are both big and complex
- The raw SUSB data are about 13 GB and are spread across more than 800 different tables
- I should note that while it doesn’t qualify as “BigData” since it fits on a single computer
- at more than 10 GB, and with single tables too large for Excel to open, it passes the threshold into the realm of “MediumData”
- We want to make the data accessible to more people, so we have devised a way to sensibly organize the entire database into 12 tables without sacrificing any information
- Each of the 12 tables views the information from a different perspective
- such as by various geography
- or by levels of industry detail
SUSB Data Challenges 3
- To achieve this, I used a set of free, open-source tools, originally intended for big data, to wrangle the data into a normalized database
- The great thing about these BigData tools is that they’re
- somehow ready for all the problems which are unknown to me but seem to arise anyway
- After a productive discussion with the SUSB office in April, they sent us a set of the original Access source files
- I then migrated these files into an SQL database
- Next, I normalized naming conventions and merged the key tables with data tables
- So that the original data are now fully represented by ~160 tables complete with descriptive language for key variables
- So for example, alongside the standard column of state codes numbered 1 to 51 is a column of state names
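Merging a key table with a data table can be sketched in base R as follows. The table names, column names, and values here are toy assumptions made up for illustration; they are not actual SUSB data or the code used for the project.

```r
# Toy key table: code-to-name lookup (illustrative values only)
state_key = data.frame(
  state_code = c(1, 2),
  state_name = c("Alabama", "Alaska")
)

# Toy data table that carries only the numeric code
data_tbl = data.frame(
  state_code = c(2, 1, 2),
  firms      = c(10, 20, 30)   # made-up numbers
)

# Left join: keep every data row and attach the descriptive name
merged = merge(data_tbl, state_key, by = "state_code", all.x = TRUE)
print(merged)
```

Using `all.x = TRUE` keeps any data row whose code is missing from the key table, so mismatches show up as `NA` names instead of silently disappearing.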
SUSB Data Challenges 4
- The data are now organized into roughly 160 tables, down from >800, with homogenized column names and formatting
- This was achieved
- using regular expressions (UNIX programming conventions)
- and open source software
- what makes this approach special is that none of the steps I’ve taken are point and click
- thus every single command is documented so that if I’d made any error, and thankfully I did not,
- it would be easy to identify, correct, and rerun the normalization
- Were we so inclined, we could recreate the normalized data from the Census source files (which represents months of work)
- during the time it takes to listen to this presentation
- For easy distribution I’ve compressed the data into 160 binary Microsoft Excel files so the entire database is less than one GB
- In the R environment the complete database is <300 MB.
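Homogenizing column names with regular expressions might look like the sketch below. The function name `normalize_names()` and the sample headers are hypothetical, chosen only to illustrate the idea of a scripted, rerunnable cleanup step.

```r
# Hypothetical helper: turn inconsistent headers into snake_case names
normalize_names = function(x) {
  x = tolower(trimws(x))
  x = gsub("[^a-z0-9]+", "_", x)   # runs of non-alphanumerics become one underscore
  gsub("^_+|_+$", "", x)           # drop leading and trailing underscores
}

raw_names = c("State Code", "EMPL.", "firm-size  (class)")  # made-up examples
clean_names = normalize_names(raw_names)
print(clean_names)
```

Because the rule lives in one function, correcting it and rerunning it over every table takes seconds, which is the advantage over point-and-click edits.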
Future Availability via the SUSB Data Explorer
- Inspired by OER’s Elizabeth Glass, the OER data team has held several meetings with data specialists to discuss the possibility of creating an online databank
- Ultimately none were able to provide the solution we wanted
- So last week I decided to make our own web interface
- Its working name is the SUSB Data Explorer
- It allows users to get directly to the data that they want
- and to download it, and only it, instantly
- Pat and I, as well as potentially Miriam and Sarah,
- have just enrolled in the Office of Entrepreneurial Development’s Human Centered Design course
- to ensure that the data explorer’s interface is as useful as possible to our stakeholders.
- Demo