Reproducible research

Stephen J Eglen

Encouraging code and data sharing in neuroscience


Stephen J Eglen
Cambridge Computational Biology Institute, University of Cambridge
https://sje30.github.io
sje30@cam.ac.uk
@StephenEglen

Slides: http://bit.ly/eglen-futurepub


Acknowledgements

Software Sustainability Institute

These slides are available under a Creative Commons CC-BY license.

Inverse problems are hard

Score (%)   Grade
70-100      A
60-69       B
50-59       C
40-49       D
0-39        F

Forward problem

I scored 68, what was my grade?

Inverse problem

I got a B, what was my score?
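
The asymmetry can be sketched in a few lines of Python using the grade table above. The names GRADE_BANDS, grade_for and scores_for are made up for this illustration: the forward problem returns exactly one grade, while the inverse problem can only return a range of candidate scores.

```python
# Grade bands from the table above: (lowest score, highest score, grade).
GRADE_BANDS = [
    (70, 100, "A"),
    (60, 69, "B"),
    (50, 59, "C"),
    (40, 49, "D"),
    (0, 39, "F"),
]


def grade_for(score):
    """Forward problem: each score maps to exactly one grade."""
    for low, high, grade in GRADE_BANDS:
        if low <= score <= high:
            return grade
    raise ValueError(f"score out of range: {score}")


def scores_for(grade):
    """Inverse problem: a grade only narrows the score down to a range."""
    for low, high, band_grade in GRADE_BANDS:
        if band_grade == grade:
            return range(low, high + 1)
    raise ValueError(f"unknown grade: {grade}")


if __name__ == "__main__":
    print(grade_for(68))              # one answer
    print(list(scores_for("B")))      # ten equally plausible answers
```

Going from a paper back to the code and data that produced it is the same kind of problem: many different analyses are consistent with the published summary.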

Research sharing: the inverse problem


Where is the scholarship?

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

[Buckheit and Donoho 1995, after Claerbout]

Moral or selfish approach?


Paper

Selfish reasons to share

Why not align what is good for science with what is good for scientists?

  1. Funding mandates (REF + enforcement from Wellcome Trust)
  2. Credit through data papers
  3. Leads to further collaborations (e.g. “EPAmeadev”)
  4. Fixes data bugs / errors in analysis
  5. Prevents data loss (Vines et al. 2014), e.g. students have a habit of leaving…
  6. Your future self is probably one of the main beneficiaries of sharing.
  7. Now is a very good time to be an open scientist.

Code review pilot

10.1038/nn.4550

Specific recommendations

  1. Include enough code to reproduce a key figure or result from your paper (“modeldb”).
  2. Provide toy examples if your project is too computationally intensive for others to run within a few hours.
  3. Use version control (GitHub)
  4. Add a licence (MIT)
  5. Provide data
  6. Provide tests
  7. Use standards
  8. Use permanent URLs (Zenodo/figshare)

10.1038/nn.4550
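
As a hedged illustration of the "toy examples" and "provide tests" recommendations, the sketch below pairs a small analysis function with tests on data small enough to check by hand. The function mean_firing_rate and its test data are hypothetical and are not taken from the paper cited above.

```python
def mean_firing_rate(spike_times, duration_s):
    """Return spikes per second for one recording lasting duration_s seconds."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(spike_times) / duration_s


def test_mean_firing_rate():
    # Toy data: 4 spikes in 2 seconds gives 2 spikes per second.
    assert mean_firing_rate([0.1, 0.5, 1.2, 1.9], 2.0) == 2.0
    # An empty recording has a rate of zero.
    assert mean_firing_rate([], 10.0) == 0.0


def test_rejects_bad_duration():
    # A non-positive duration is an input error, not a rate of infinity.
    try:
        mean_firing_rate([0.1], 0.0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for zero duration")


if __name__ == "__main__":
    test_mean_firing_rate()
    test_rejects_bad_duration()
    print("all tests passed")
```

Tests like these run in well under a second, so a reader can confirm the code works on their machine before attempting the full analysis.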

Simple example

Paper Info

New tools

Docker

Docker can bundle an entire open-source environment for others to share:

(start docker)
docker run -d -p 8787:8787 sje30/eglen2015
open http://192.168.99.100:8787/

This should launch a web page …

A dirty secret

Jupyter notebooks

https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

Binder = Docker + Jupyter + cloud compute

https://github.com/sofroniewn/2pRAM-paper

Binder: other examples

Small example GitHub repo

LIGO experiments

Old tools

Find a code buddy

Third most important file in a GitHub repo

(After Arfon Smith)

Makefile

Learn Make if you don’t know it already.

Practical tips

Reproducible papers

  1. Eglen et al. (2014)
  2. Eglen (2015)

Lessons I learnt

  1. Editors loved this.
  2. Reviewers engaged, edited code/figures.
  3. Brittle. Paper 2 broke within 6 months!

Challenges for a reproducible paper

  1. Technical aspects are mostly there: Docker, R Markdown, Jupyter notebooks, Zenodo, GitHub.
  2. Sustainable compute platform required. mybinder.org was a victim of its own success; see beta.mybinder.org (Julia example).
  3. Long (hours upwards) compute jobs are clearly not interactive.
  4. Social challenges are much harder. How to incentivise/require authors to work like this?
  5. Publisher workflow should work with the author workflow (which starts before, and may continue after, the publisher workflow finishes).
  6. Will reviewers be expected to do more than read the paper?

eLife developments

What would the workflow look like for an “executable” paper? eLife publishers, developers and users “solved” it:

Further info

eLife questionnaire

Survey on demand for reproducible research (RR)

  1. Respondents: Neuroscience (34%), Biochemistry (28%), Systems Biology (28%)
  2. R Markdown / Jupyter as leading technologies
  3. Key wishes: accessing data/code, interacting with figures
  4. ~70% of respondents have shared data/code

Desirable features for a journal

  1. Indexed in PubMed
  2. Open access
  3. Low cost
  4. Decent formatting
  5. Publish reviews, possibly with “eLife” consensus

Not yet convinced about post-publication peer review (PPPR).

Overlay journals can work

Discrete Analysis launched in 2016 by Tim Gowers (Cambridge).

  1. Strong editorial board and lead editor.
  2. Traditional reviewing process.
  3. Papers live on arXiv.
  4. Editor writes a summary of papers.
  5. Low cost (currently free to authors).
  6. Cambridge has worked through issues.

Would this work in Computational Neuroscience?

Things we can do tomorrow

Format-free submissions

Source

Don’t format manuscripts: Brischoux & Legagneux

End embargo periods?

Ask editors why we can’t share author accepted manuscripts (AAMs) immediately upon acceptance.

Embargo periods suit publishers, not us.

Install unpaywall

http://unpaywall.org/

Summary

  1. Share your code and data for everyone to benefit.
  2. Learn the new tools, but don’t forget the old tools.
  3. Support the journals that work with the community.
  4. The biggest beneficiary of reproducible research may be your future self.