Reproducibility in Action
26 July 2017
Richard Schwinn, PhD
A Breakthrough at Major University
MD Anderson’s Dr. Baggerly Attempts to Verify
Meanwhile Major University Jumped Straight into Clinical Trials
The Clinical Trials Continued
Finally… After 5 Years of Fighting
The Audience for Cutting Edge Research is Small
What can be done?
- In the past, the only solution was to rely on
  - trust among patients and
  - honor among doctors.
- An economist, however, would recommend
  - increasing the expected cost of deception
    - \(E[C(\omega)] =\) expected cost
    - \(\omega =\) level of deception
  - or the probability of detection.
\[E[C] = \int_\Omega C(\omega)P(d\omega)\]
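A discrete version of the expected-cost formula can make the two levers concrete. The sketch below is illustrative only: the deception levels, their probabilities, the penalties, and the assumption that the cost of a given level of deception equals the detection probability times a penalty are all hypothetical and are not taken from the talk.

```r
# All numbers below are hypothetical, for illustration only.
omega   <- c(0, 1, 2)           # levels of deception (0 = honest)
p_omega <- c(0.90, 0.08, 0.02)  # assumed P(omega); sums to 1
penalty <- c(0, 50, 500)        # assumed penalty if the deception is detected

# Discrete analogue of E[C] = integral of C(omega) P(d omega),
# assuming C(omega) = p_detect * penalty(omega)
expected_cost <- function(p_detect, penalty_scale = 1) {
  cost <- p_detect * penalty_scale * penalty  # C(omega) for each level
  sum(cost * p_omega)                         # weighted by P(omega)
}

baseline         <- expected_cost(p_detect = 0.1)
harsher_penalty  <- expected_cost(p_detect = 0.1, penalty_scale = 2)
better_detection <- expected_cost(p_detect = 0.2)

print(c(baseline = baseline,
        harsher_penalty = harsher_penalty,
        better_detection = better_detection))
```

Under these assumptions, doubling the penalty and doubling the probability of detection raise the expected cost by the same factor, which is why either lever can deter deception.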
Retraction Watch Increases the Cost of Deception
Retraction Watch
Websites like retractionwatch.com spread the word
- They increase the severity of the stigma and reputational effects for fraudulent researchers.
- The other option is to increase the probability of detection.
\[E[C] = \int_\Omega C(\omega)P(d\omega)\]
Combating Deceptive Research
- By increasing the probability of detection, reproducible research
  - reduces the incentive to commit fraud and
  - makes it easier to identify subtle, unintentional errors.
- Reproducible research has a precise definition:
  - Research is considered reproducible if
    - it is published with both its data and its code
    - so that it is easy for a non-expert to reproduce the results.
\[E[C] = \int_\Omega C(\omega)P(d\omega)\]
Software Options
- Literate programming languages
- Combined with statistical software
- like SAS
- and IPython Notebook
- RStudio integrates a number of programming languages under the easy-to-use R Markdown format.
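As a rough illustration of literate programming with knitr, the sketch below knits a tiny R Markdown document held in a character vector. The document text and the variable names (`fence`, `rmd`, `md`) are made up for this example; only `knitr::knit()` and its `text` and `quiet` arguments come from the knitr package.

```r
library(knitr)

# Chunk delimiter (three backticks), built with strrep() so this
# example does not need to nest code fences
fence <- strrep("`", 3)

# A toy R Markdown document: prose, one code chunk, one inline result
rmd <- c(
  "# Toy report",
  "",
  paste0(fence, "{r}"),
  "x <- c(2, 4, 6)",
  "mean(x)",
  fence,
  "",
  "The mean of x is `r mean(x)`."
)

# knit() runs the chunk and the inline code, returning Markdown text
md <- knit(text = rmd, quiet = TRUE)
cat(md)
```

Because the numbers in the prose are computed when the document is knit, the text and the analysis cannot drift apart.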
R Work Flow
R Output
Faux Market Research: Car Sharing
- 2 questions
- By a show of hands, who has never used a car sharing platform?
- Can I also ask, who has wanted to use a car sharing platform but it was unavailable in their area?
- Suppose you work for Uber.
- You want to pitch the directors on expanding to new areas in Florida.
Step 1: Load Libraries and knitr options
library(knitr) # Generates report
library(dplyr) # Wrangles data
library(choroplethr) # Creates maps
library(choroplethrMaps) # County data
library(ggplot2) # Creates graphics
library(gridExtra) # Arranges graphics
library(acs) # Downloads data
library(stringr) # Wraps labels
knitr::opts_chunk$set(...)
Step 2: Data Downloading
demo_df = acs_data_prep(c("B01003", "B19301"))
commute_df = acs_data_prep("B08534", 1:10)
transport_df = acs_data_prep("B08301", c(2,10,16:20))
aggregate_df = acs_data_prep("B08135", 1)
df = rbind(demo_df, commute_df, transport_df, aggregate_df)
Step 3: Create Statewide Maps
# Tables to map, one statewide choropleth per table
maps_list =
  c("B01003", # Total Population
    "B19301", # Income
    "B08534", # Number of commuters
    "B08135") # Aggregate Travel Time to Work
# Draws the statewide map for the x-th table in maps_list
plot_maps = function(x) {
  it = filter(df, table_number == maps_list[x], index == 1)
  county_choropleth(it, state_zoom = tolower(state_name)) +
    scale_fill_brewer(palette = x) + # x also selects the Brewer palette
    ggtitle(it$table_title[1]) + # a single title per map
    theme(legend.position = "bottom")
}
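The talk does not show how the individual maps are combined, so the following is only a sketch of one plausible pattern: build one plot per table with `lapply()` and arrange them with gridExtra, which is loaded in Step 1. The toy data frame, `toy_tables`, and `plot_toy()` are hypothetical stand-ins, and a bar chart replaces `county_choropleth()` so that the example runs without Census data.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical toy data standing in for the ACS data frame
toy_df <- data.frame(
  region       = rep(c("a", "b", "c"), times = 2),
  bucket       = factor(rep(c("low", "mid", "high"), times = 2),
                        levels = c("low", "mid", "high")),
  table_number = rep(c("T1", "T2"), each = 3),
  stringsAsFactors = FALSE
)
toy_tables <- c("T1", "T2")

# Same shape as plot_maps(): the index picks both the table and the palette
plot_toy <- function(i) {
  it <- toy_df[toy_df$table_number == toy_tables[i], ]
  ggplot(it, aes(x = region, y = 1, fill = bucket)) +
    geom_col() +
    scale_fill_brewer(palette = i) +
    ggtitle(toy_tables[i]) +
    theme(legend.position = "bottom")
}

# One plot per table, then a two-column panel
toy_plots <- lapply(seq_along(toy_tables), plot_toy)
panel <- arrangeGrob(grobs = toy_plots, ncol = 2)
# grid::grid.draw(panel) would render the panel on the current device
```

Passing the index rather than the table code lets one argument drive both the data filter and the color palette, so each map gets a distinct palette without extra bookkeeping.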
Step 4: Create County Level Reports
# Selects the state's counties from the travel-time table, ordered by state_rank
state_counties = filter(df, state.name == tolower(state_name), table_number == "B08135") %>%
  arrange(state_rank)
make_county_reports = function(x) {...}
# seq_len() stays correct even if no counties match
county_reports = lapply(seq_len(nrow(state_counties)), make_county_reports)
county_reports[seq_len(nrow(state_counties))]
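The body of `make_county_reports` is not shown in the slides, so the sketch below does not reproduce it. It only illustrates the general pattern of applying a report function to every row of a county table with `lapply()`. The table `toy_counties` and the function `make_toy_report()` are hypothetical.

```r
# Hypothetical county table; the names and ranks are made up
toy_counties <- data.frame(
  county     = c("Alpha", "Beta", "Gamma"),
  state_rank = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for a report function:
# takes a row index and returns a one-line summary
make_toy_report <- function(i) {
  row <- toy_counties[i, ]
  sprintf("%s County: rank %d of %d",
          row$county, as.integer(row$state_rank), nrow(toy_counties))
}

# One report per county, in table order
toy_reports <- lapply(seq_len(nrow(toy_counties)), make_toy_report)
print(toy_reports)
```

Because the report function depends only on the row index, adding a county to the table produces an additional report with no change to the code.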
Thank You
- We have seen that
  - reproducibility tools can combat falsified research,
  - these tools can be used profitably for regional and periodic reporting, and
  - complex and useful reports can be created in a matter of minutes.
Outline
- What is SUSB?
- History and Generation of SUSB
- Uses of SUSB
- SUSB Data Challenges
- Future Availability via the SUSB Data Explorer
What is SUSB?
The Statistics of U.S. Businesses (SUSB) is an annual dataset that provides data on
- Numbers of businesses
- Employment
- Revenues
- Births and deaths
- Expansions and contractions
- Payroll
- For firms and for establishments
- By size, by industry, and by geography
History and Generation of SUSB
SUSB Data Challenges
- 4162 geographies
  - 1 national total
  - 51 states
  - 917 metropolitan statistical areas
  - 3193 counties
- 26 firm sizes
- 2016 industries
- 7 variables (employment, number of firms, etc.)
- 20+ years of data
If data were provided for every combination of these dimensions, they would amount to more than 30 billion elements (4162 × 26 × 2016 × 7 × 20 ≈ 30.5 billion). The existing source tables contain only between 300 and 400 million cells.
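The size of the full cross-product can be checked directly from the counts listed above. The sketch below uses exactly 20 years as a lower bound for "20+ years"; the variable names are chosen for this example.

```r
# Dimension counts taken from the list above
dims <- c(
  geographies = 1 + 51 + 917 + 3193,  # national + states + MSAs + counties
  firm_sizes  = 26,
  industries  = 2016,
  variables   = 7,
  years       = 20                    # lower bound for "20+ years"
)

# Number of cells if every combination were published
total_cells <- prod(dims)

# Share of that space covered by the existing 300 to 400 million cells
share_populated <- c(low = 300e6, high = 400e6) / total_cells

print(format(total_cells, big.mark = ","))
print(round(100 * share_populated, 2))  # percent of combinations populated
```

With these counts the product is about 30.5 billion cells, so the existing tables fill roughly 1% of the possible combinations.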
Future Availability via the SUSB Data Explorer