Reproducibility in Action

26 July 2017

Richard Schwinn, PhD

A Breakthrough at Major University

A Breakthrough at Major University (Pile of Cash)

MD Anderson’s Dr. Baggerly Attempts to Verify

MD Anderson’s Dr. Baggerly Attempts to Verify (Picture of Dr. Baggerly)

Meanwhile Major University Jumped Straight into Clinical Trials

The First Investigation

The First Investigation (magnifying glass)

Results Vindicated

The Clinical Trials Continued

Video

60 Minutes Video

The Clinical Trials Continued

Undeterred

Baggerly repeatedly showed that Dr. P.’s research was intentionally bad.

Finally… After 5 Years of Fighting

Finally… After 5 Years of Fighting (Rhodes Scholar Image)

Paul Goldberg, an editor of The Cancer Letter, a small trade publication,

Follow up

Aftermath - Brad Perez

Aftermath

Why?

The Audience for Cutting Edge Research is Small

What can be done?

\[E[C] = \int_\Omega C(\omega)P(d\omega)\]

Retraction Watch

Websites like retractionwatch.com spread the word about retracted work. There are two ways to raise the expected cost of bad research:

  1. Increase the severity of the stigma and reputational effects for bad researchers, which is what this publicity does.
  2. Increase the probability of detection.
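
The two levers in the list above can be read off a two-outcome version of the expected-cost formula: if bad research is detected with probability p and detection carries a reputational cost c, the expected cost is p times c. A minimal sketch in R follows. The function name `expected_cost` and every number in it are hypothetical, chosen only to make the arithmetic visible, not estimates from any data.

```r
# Expected cost under a two-outcome model:
# detected (probability p_detect, cost `penalty`) or undetected (cost 0).
# All inputs below are invented for illustration.
expected_cost = function(p_detect, penalty) {
  stopifnot(p_detect >= 0, p_detect <= 1, penalty >= 0)
  p_detect * penalty + (1 - p_detect) * 0
}

baseline        = expected_cost(p_detect = 0.05, penalty = 100)
harsher_stigma  = expected_cost(p_detect = 0.05, penalty = 200)  # lever 1: severity
easier_to_check = expected_cost(p_detect = 0.10, penalty = 100)  # lever 2: detection

c(baseline = baseline,
  harsher_stigma = harsher_stigma,
  easier_to_check = easier_to_check)
```

In this toy setup, doubling the penalty and doubling the detection probability raise the expected cost by the same factor, which is why either lever can act as a deterrent.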

Combating Deceptive Research

What are the tools of Reproducibility?

Software Options

RMarkdown Cheatsheet

R Work Flow

  1. Generating reports in R and RMarkdown decreases the cost of investigating the accuracy of academic research.
  2. Data is cheaper than ever.
    • A little study of basic Application Programming Interface (API) documentation enables researchers to do things that were previously impossible for all but a small group of computer scientists.
  3. Reproducibility offers tremendous time-saving opportunities for regional and periodic reporting through parameterized reports.
    • RMarkdown lets you rerun your data processing and, with a little work, update even the textual content of your document with one click of the Knit button.
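
As a rough illustration of the parameterized-report idea, the sketch below renders one report per state from a single template. It assumes a template file named `report.Rmd` whose YAML header declares a `state` parameter; both names, and the helper `render_state_reports`, are hypothetical. The only library call used is `rmarkdown::render()` with its `params` and `output_file` arguments.

```r
# Render one report per state from a single parameterized template.
# `template` and the `state` parameter are assumptions: the template's YAML
# header would need a matching `params:` entry named `state`.
render_state_reports = function(states, template = "report.Rmd",
                                render_fun = rmarkdown::render) {
  vapply(states, function(s) {
    # Build an output file name such as "report-new-york.html"
    out = paste0("report-", gsub(" ", "-", tolower(s)), ".html")
    render_fun(template, params = list(state = s), output_file = out)
    out
  }, character(1))
}

# Example call (needs the rmarkdown package and a real template on disk):
# render_state_reports(c("Texas", "New York"))
```

Passing the renderer in as an argument is a design choice for this sketch only: it lets the loop be tried with a stand-in function before a real template exists.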

R Work Flow

1

R Output

2

RStudio Output

US SBA

US Small Business GDP

Faux Market Research: Car Sharing

Car Sharing Demo

Step 1: Load Libraries and knitr options

library(knitr)              # Generates report
library(dplyr)              # Wrangles data
library(choroplethr)        # Creates maps
library(choroplethrMaps)    # County data
library(ggplot2)            # Creates graphics
library(gridExtra)          # Arranges graphics
library(acs)                # Downloads data
library(stringr)            # Wraps labels

knitr::opts_chunk$set(...)

Step 2: Data Downloading

demo_df = acs_data_prep(c("B01003", "B19301"))
commute_df = acs_data_prep("B08534", 1:10)
transport_df = acs_data_prep("B08301", c(2,10,16:20))
aggregate_df = acs_data_prep("B08135", 1)
df = rbind(demo_df, commute_df, transport_df, aggregate_df)

Step 3: Create Statewide Maps

maps_list = 
c("B01003",    # Total Population
  "B19301",    # Income
  "B08534",    # Number of commuters
  "B08135")    # Aggregate Travel Time to Work

plot_maps = function(x) {
  # Keep the first series (index == 1) of the x-th table in maps_list
  it = filter(df, table_number == maps_list[x], index == 1)
  county_choropleth(it, state_zoom = tolower(state_name)) +
    scale_fill_brewer(palette = x) +    # x also selects the x-th Brewer palette
    ggtitle(it$table_title[1]) +        # one title per map, not one per row
    theme(legend.position = "bottom")
}
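
The slides define the plotting function but do not show it being called. One plausible way to use a function of this shape is to map it over the table indices and arrange the results in a grid. The sketch below demonstrates that pattern with invented data and ordinary `ggplot2` bar charts standing in for `county_choropleth()`; the names `toy`, `tables`, and `plot_toy` are hypothetical.

```r
library(ggplot2)
library(gridExtra)

# Invented long-format data: two "tables", three regions each.
toy = data.frame(
  table_number = rep(c("T1", "T2"), each = 3),
  table_title  = rep(c("Toy table one", "Toy table two"), each = 3),
  region       = rep(c("a", "b", "c"), times = 2),
  value        = c(1, 2, 3, 3, 1, 2),
  stringsAsFactors = FALSE
)
tables = unique(toy$table_number)

# Same structure as plot_maps: subset by index, plot, style.
plot_toy = function(i) {
  it = toy[toy$table_number == tables[i], ]
  ggplot(it, aes(x = region, y = value, fill = region)) +
    geom_col() +
    scale_fill_brewer(palette = i) +   # numeric index picks the i-th palette
    ggtitle(it$table_title[1]) +
    theme(legend.position = "bottom")
}

# One plot per table, arranged two per row.
plots = lapply(seq_along(tables), plot_toy)
grid.arrange(grobs = plots, ncol = 2)
```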

Step 4: Create County Level Reports

# Select the state's counties from the aggregate travel time table, ordered by rank
state_counties = filter(df, state.name == tolower(state_name), table_number == "B08135") %>%
  arrange(state_rank)

make_county_reports = function(x) {...}

# seq_len() also behaves correctly when the state has no matching rows
county_reports = lapply(seq_len(nrow(state_counties)), make_county_reports)
county_reports
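
The body of the county report function is elided in the slides, so the sketch below only illustrates the surrounding pattern: apply a function to each row index of a ranked table and collect the results in a list. The data, `toy_counties`, and `make_toy_report` are invented stand-ins, not the author's code.

```r
# Toy stand-in for state_counties; names and ranks are invented.
toy_counties = data.frame(
  county     = c("Alpha", "Beta", "Gamma"),
  state_rank = c(2L, 1L, 3L),
  stringsAsFactors = FALSE
)
toy_counties = toy_counties[order(toy_counties$state_rank), ]

# Hypothetical per-county "report": a one-line summary string.
make_toy_report = function(i) {
  row = toy_counties[i, ]
  sprintf("%s County ranks #%d in the state", row$county, row$state_rank)
}

reports = lapply(seq_len(nrow(toy_counties)), make_toy_report)

# seq_len() gives an empty sequence for a zero-row table, so lapply() does
# nothing; 1:nrow(x) would instead count down from 1 to 0.
empty = toy_counties[0, ]
length(lapply(seq_len(nrow(empty)), make_toy_report))   # 0
length(1:nrow(empty))                                   # 2
```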

Knit

Thank You

Outline

Here is a quick outline of my introduction to the Statistics of U.S. Businesses

What is SUSB?

The Statistics of U.S. Businesses (SUSB) is an annual dataset that provides data on

History and Generation of SUSB

Census logo

History and Generation of SUSB

Uses of SUSB 1

Uses of SUSB 2

A quick Google Scholar search yields

SUSB Data Challenges 1

The Office of Advocacy disseminates the SUSB datasets jointly with the SUSB office at Census

SUSB Data Challenges 2

If data were provided for all permutations, the tables would hold well over 30 trillion elements. The existing source tables contain only 300 to 400 million cells, fewer than 1 in 75,000 of that total.

SUSB Data Challenges 3

SUSB Data Challenges 4

Future Availability via the SUSB Data Explorer


  1. Simon Munzert, 2014

  2. Simon Munzert, 2014