Reproducibility in Action

26 July 2017

Richard Schwinn, PhD

A Breakthrough at Major University

MD Anderson’s Dr. Baggerly Attempts to Verify

Meanwhile Major University Jumped Straight into Clinical Trials

The First Investigation

Results Vindicated

The Clinical Trials Continued

Video

60 Minutes Video

Undeterred

Finally… After 5 Years of Fighting

Follow up

Aftermath - Brad Perez

Why?

The Audience for Cutting Edge Research is Small

What can be done?

\[E[C] = \int_\Omega C(\omega)P(d\omega)\]

Retraction Watch Increases the Cost of Deception

Retraction Watch

Websites like retractionwatch.com spread the word about retracted research.

  1. They increase the severity of the stigma and reputational damage that fraudulent researchers face.
  2. The other way to raise the expected cost of deception is to increase the probability of detection.

\[E[C] = \int_\Omega C(\omega)P(d\omega)\]
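The two levers, severity and probability of detection, are easier to see in a simplified special case. As a sketch under assumptions not stated on the slides, suppose detection is a single yes-or-no event with probability \(p\), that a detected researcher bears a cost \(c\), and that an undetected one bears none. The symbols \(p\) and \(c\) are introduced here only for illustration. The expectation then reduces to

\[E[C] = p \cdot c + (1 - p) \cdot 0 = p\,c\]

Raising \(c\) corresponds to harsher stigma and reputational effects, while raising \(p\) corresponds to better detection. Either change increases the expected cost of deception.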

Combating Deceptive Research

\[E[C] = \int_\Omega C(\omega)P(d\omega)\]

What are the tools of Reproducibility?

Software Options

RMarkdown Cheatsheet

R Work Flow

1

R Output

2

US SBA

US Small Business GDP

Faux Market Research: Car Sharing

Car Sharing Demo

Step 1: Load Libraries and knitr options

library(knitr)              # Generates report
library(dplyr)              # Wrangles data
library(choroplethr)        # Creates maps
library(choroplethrMaps)    # County data
library(ggplot2)            # Creates graphics
library(gridExtra)          # Arranges graphics
library(acs)                # Downloads data
library(stringr)            # Wraps labels

knitr::opts_chunk$set(...)

Step 2: Data Downloading

demo_df = acs_data_prep(c("B01003", "B19301"))
commute_df = acs_data_prep("B08534", 1:10)
transport_df = acs_data_prep("B08301", c(2,10,16:20))
aggregate_df = acs_data_prep("B08135", 1)
df = rbind(demo_df, commute_df, transport_df, aggregate_df)
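`acs_data_prep()` is not defined on these slides. The `rbind()` call only works if every call returns a data frame with the same columns, so a stand-in with a made-up layout can illustrate the long format that the later steps filter on. The helper name `fake_acs_prep`, the regions, the values, and every column other than `table_number` and `index` are assumptions for illustration only.

```r
# Hypothetical stand-in for acs_data_prep(): returns one row per
# county and table column, in a long format that rbind() can stack.
fake_acs_prep = function(table_number, index = 1) {
  counties = c("county a", "county b")           # made-up regions
  out = expand.grid(region = counties, index = index,
                    stringsAsFactors = FALSE)
  out$table_number = table_number
  out$value = seq_len(nrow(out))                 # placeholder values
  out
}

demo_df = fake_acs_prep("B01003")
commute_df = fake_acs_prep("B08534", 1:3)
df = rbind(demo_df, commute_df)                  # same columns, so rows stack

# Later steps pick one table and one column from the stacked frame
one_table = subset(df, table_number == "B08534" & index == 1)
one_table
```

Keeping every table in one long data frame means a single filter on `table_number` and `index` is enough to pull out the series for a map or a report.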

Step 3: Create Statewide Maps

maps_list =
  c("B01003",    # Total Population
    "B19301",    # Income
    "B08534",    # Number of commuters
    "B08135")    # Aggregate Travel Time to Work

# Draws the county map for the x-th table in maps_list
plot_maps = function(x) {
  it = filter(df, table_number == maps_list[x], index == 1)
  county_choropleth(it, state_zoom = tolower(state_name)) +
    scale_fill_brewer(palette = x) +    # x also selects the palette
    ggtitle(it$table_title[1]) +        # one title, not one per row
    theme(legend.position = "bottom")
}

Step 4: Create County Level Reports

# Selects the counties for the state, ordered by rank
state_counties = filter(df, state.name == tolower(state_name),
                        table_number == "B08135") %>%
  arrange(state_rank)

make_county_reports = function(x) {...}

county_reports = lapply(seq_len(nrow(state_counties)), make_county_reports)
county_reports    # Prints every county report
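The body of `make_county_reports()` is elided on the slide. The pattern around it, one report per row of `state_counties` collected in a list, can be sketched with a toy data frame. The column names, the values, and the one-line "report" below are placeholders, not the author's report.

```r
# Toy stand-in for state_counties; names and values are made up
state_counties = data.frame(
  county_name = c("county a", "county b", "county c"),
  value = c(120, 95, 60),
  stringsAsFactors = FALSE
)

# Placeholder report: one sentence per county instead of the real output
make_county_reports = function(x) {
  row = state_counties[x, ]
  sprintf("%s: aggregate travel time %.0f", row$county_name, row$value)
}

# seq_len() is safe even if the data frame has zero rows
county_reports = lapply(seq_len(nrow(state_counties)), make_county_reports)
county_reports
```

Because the function takes a row index rather than a county name, the same call works unchanged for any state, however many counties it has.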

Knit

Thank You

Outline

What is SUSB?

The Statistics of U.S. Businesses (SUSB) is an annual dataset that provides data on

History and Generation of SUSB

Uses of SUSB

SUSB Data Challenges

If data were provided for all permutations, the dataset would contain well over 30 trillion elements. The existing source tables contain only 300 to 400 million cells.
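The scale of that gap is easier to see as a ratio. Taking 30 trillion as a lower bound on the possible elements and 400 million as the upper end of the published cells, a quick calculation gives the largest share of permutations the source tables could cover. The variable names are illustrative.

```r
possible_elements = 30e12   # "well over 30 trillion", so a lower bound
published_cells = 400e6     # upper end of the 300 to 400 million range

# Upper bound on the share of possible elements that are published
share = published_cells / possible_elements
share_pct = share * 100     # expressed as a percentage
share_pct
```

Under these figures, less than 0.0014% of the possible elements appear in the source tables.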

Future Availability via the SUSB Data Explorer


  1. Simon Munzert, 2014

  2. Simon Munzert, 2014