Reproducible research
Stephen J Eglen
Encouraging code and data sharing in neuroscience
Stephen J Eglen
Cambridge Computational Biology Institute, University of Cambridge
https://sje30.github.io
sje30@cam.ac.uk
@StephenEglen
Slides: http://bit.ly/eglen-futurepub
Acknowledgements
Software Sustainability Institute
These slides are available under a Creative Commons CC-BY license.
Inverse problems are hard
Score  | Grade
70-100 | A
60-69  | B
50-59  | C
40-49  | D
0-39   | F
Forward problem
I scored 68, what was my grade?
Inverse problem
I got a B, what was my score?
Research sharing: the inverse problem
Where is the scholarship?
An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions that generated the figures.
[Buckheit and Donoho 1995, after Claerbout]
Moral or selfish approach?
Selfish reasons to share
Why not align what is good for science with what is good for scientists?
- Funding mandates (REF + enforcement from Wellcome Trust)
- Credit through data papers
- Leads to further collaborations (e.g. “EPAmeadev”)
- Fixes data bugs / errors in analysis
- Prevents data loss (Vines et al 2014). e.g. students have a habit of leaving…
- Your future self is probably one of the main beneficiaries of sharing.
- Now is a very good time to be an open scientist.
Code review pilot
Specific recommendations
- Include enough code to reproduce a key figure/result from your paper ("ModelDB").
- Provide toy examples if your project is too computationally intensive for others to run in a few hours (a sketch repository layout follows this list).
- Version control (github)
- Licence (MIT)
- Provide data
- Provide tests
- Use standards
- Use permanent URLs (Zenodo/figshare)
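A minimal repository layout that follows these recommendations might look like this (file and directory names are illustrative, not prescriptive):

    project/
      LICENSE          (e.g. MIT)
      README.md        (how to install and run the code)
      Makefile         (one command rebuilds the key figure)
      R/               (analysis code, under version control on GitHub)
      data/            (raw data, or a pointer to a Zenodo/figshare DOI)
      tests/           (small checks that the code still runs)
      toy-example.R    (runs in minutes, not hours)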
Simple example
Docker
Can bundle an entire open-source environment for others to share:
(start docker)
docker run -d -p 8787:8787 sje30/eglen2015
open http://192.168.99.100:8787/
This should launch a web page …
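For illustration, an image like this could be built from a Dockerfile along the following lines. This is a minimal sketch, not the actual definition of sje30/eglen2015; rocker/rstudio is a public base image serving RStudio on port 8787, and the package names are placeholders:

    # Base image: R + RStudio Server, listening on port 8787.
    FROM rocker/rstudio
    # Install the packages the analysis needs (placeholder names).
    RUN R -e "install.packages(c('knitr', 'ggplot2'))"
    # Bake the analysis code and data into the image.
    COPY . /home/rstudio/paper

Anyone with Docker installed then gets the identical environment from the single docker run command above.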
Jupyter notebooks
- Embed code within the manuscript; figures/tables are dynamically regenerated (see the sketch below).
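The same idea applies in R Markdown; a minimal sketch of a dynamic figure chunk, with hypothetical file and object names:

    ```{r figure1, echo=FALSE}
    cells <- read.csv("data/cells.csv")   # hypothetical data file
    plot(cells$x, cells$y, asp = 1)       # Figure 1 is redrawn on every build
    ```
    We counted `r nrow(cells)` cells (Figure 1).

Re-running the document regenerates the figure and the in-text count directly from the data, so they cannot drift out of sync with the analysis.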
Binder = Docker + Jupyter + cloud compute
- mybinder.org was developed and supported by the Freeman lab, Janelia Farm.
- Allows Jupyter notebooks to be dynamically evaluated (not just rendered) online; see the configuration sketch below.
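To make a repository launchable on mybinder.org, its dependencies are described in a few conventional configuration files; a sketch for an R-based project (contents are indicative only, and package names are placeholders):

    runtime.txt    -- e.g. "r-2018-02-05", pins the R version/snapshot date
    install.R      -- install.packages(c("knitr", "ggplot2"))
    index.ipynb    -- the notebook that readers open and run in the browser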
Find a code buddy
- We ask our students to submit a .Rnw file rather than a PDF. They get a zero if we can't compile the PDF.
- So, ask someone else if they can run your code (see the sketch below).
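The check itself is a one-liner in R; a sketch, assuming the submission is called paper.Rnw (a hypothetical file name):

    # Rebuild the PDF from the .Rnw source; if this fails on your
    # buddy's machine, the document is not reproducible as submitted.
    knitr::knit2pdf("paper.Rnw")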
Third most important file in a GitHub repo
(After Arfon Smith)
- First: LICENSE
- Second: README.md
- Third: ???
Makefile
Learn Make if you don’t know it already.
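A minimal sketch of what such a Makefile might contain (file names are illustrative; recipe lines must start with a tab):

    # `make` rebuilds the figure and the paper only when their inputs change.
    all: paper.pdf

    figure1.pdf: analysis.R data/cells.csv
    	Rscript analysis.R

    paper.pdf: paper.Rnw figure1.pdf
    	Rscript -e 'knitr::knit2pdf("paper.Rnw")'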
Practical tips
- Lobby journals about their code-sharing practices.
- Lobby funders likewise.
- When reviewing articles, ask for code to be made available.
- When starting on a new project, assume code will be public at some point in the future.
Reproducible papers
- Eglen et al. (2014)
- Eglen (2015)
Lessons I learnt
- Editors loved this.
- Reviewers engaged, edited code/figures.
- Brittle. Paper 2 broke within 6 months!
Challenges for a reproducible paper
- Technical aspects are mostly there: Docker, Rmd, Jupyter notebooks, Zenodo, GitHub.
- A sustainable compute platform is required; mybinder.org was a victim of its own success (see the beta.mybinder.org Julia example).
- Long (hours upwards) compute jobs are clearly not interactive.
- Social challenges much harder. How to incentivise/require authors to work like this?
- The publisher workflow should work with the author workflow (which starts before, and may continue after, the publisher workflow finishes).
- Will reviewers be expected to do more than read the paper?
eLife developments
What would the workflow look like for an “executable” paper? eLife publishers, developers and users “solved” it:
Further info
eLife questionnaire
Survey on demand for reproducible research (RR)
- Respondents: Neuro (34%), Biochemistry (28%), Systems Biology (28%)
- R markdown / Jupyter as leading technologies
- Key wishes: accessing data/code, interacting with figures
- ~70% of respondents have shared data/code
Desirable features for a journal
- Indexed in PubMed
- Open access
- Low cost
- Decent formatting
- Publish reviews, possibly with an eLife-style consensus
Not yet convinced re: post-publication peer review (PPPR).
Overlay journals can work
Discrete Analysis was launched in 2016 by Tim Gowers (Cambridge).
- Strong editorial board and lead editor.
- Traditional reviewing process.
- Papers live on arXiv.
- Editor writes a summary of papers.
- Low cost (currently free to authors).
- Cambridge has worked through issues.
Would this work in Computational Neuroscience?
Things we can do tomorrow
End embargo periods?
Ask editors why we can’t share author accepted manuscripts (AAMs) immediately upon acceptance.
Embargo periods suit publishers, not us.
Summary
- Share your code and data for everyone to benefit.
- Learn the new tools, but don’t forget the old tools.
- Support the journals that work with the community.
- The biggest beneficiary of reproducible research may be your future self.