1.7 Reproducibility differences of articles published in various journals and using R or Python language

Authors: Bartłomiej Eljasiak, Konrad Komisarczyk, Mariusz Słapek (Warsaw University of Technology)

1.7.1 Abstract

The aim of this article is to present differences in reproducing both the code and the results of papers from three popular journals: the Journal of Machine Learning Research, the Journal of Open Source Software and the Journal of Statistical Software. Additionally, within every journal we distinguish two categories, for Python and R articles respectively. We then propose two general methods of scoring reproducibility and, based on them, we mark several dozen articles. We have observed distinct traits of papers from each journal. R is shown to be the more reproducible language; moreover, the Journal of Open Source Software, the journal with the least scientific credit of the three, received the highest marks in our comparison.

1.7.2 Introduction and Motivation

Due to the growing number of research publications and open-source solutions, the importance of repeatability and reproducibility is increasing. Although reproducibility is a cornerstone of science, a large share of published research results cannot be reproduced (Gundersen and Kjensmo 2018). Repeatability and reproducibility are closely related, but distinct, concepts.

“Reproducibility of a method/test can be defined as the closeness of the agreement between independent results obtained with the same method on the identical subject(s) (or object, or test material) but under different conditions (different observers, laboratories etc.). (…) On the other hand, repeatability denotes the closeness of the agreement between independent results obtained with the same method on the identical subject(s) (or object or test material), under the same conditions.” (Slezak and Waczulikova 2011)

Reproducibility is crucial since it is what a researcher can guarantee about their research. It not only helps ensure that the results are correct, but also ensures transparency and gives scientists confidence in understanding exactly what was done (Eisner 2018). It allows science to progress by building on previous work. What is more, it is necessary to prevent scientific misconduct. The increasing number of such cases is causing a crisis of confidence in science (Drummond 2012).

In psychology the problem has already been addressed: from 2011 to 2015 over two hundred scientists cooperated to reproduce the results of one hundred psychological studies (Anderson et al. 2019). In computer science (and data science) researchers have noticed the need for tools and guidelines that help guarantee the reproducibility of solutions (Biecek and Kosinski 2017, @Stodden1240). Some such solutions have already been developed and are being tested in practice (Elmenreich et al. 2018).

Reproducibility can concern different aspects of a publication, including code, analysis results and data collection methods. This work focuses mainly on the code: the results produced by evaluating functions and chunks of code from the analysed publications.

In this paper we want to compare journals on the reproducibility of their articles. Moreover, we will present the reproducibility differences between R and Python, two of the most popular programming languages in data science publications. There is an ongoing discussion between proponents of these two languages about which one is more convenient to use in data science. Different journals also compete with each other, and many metrics have already been devised to assess and compare them (“Journal Metrics - Impact, Speed and Reach,” n.d.). However, there are no publications on reproducibility that compare different journals and languages, although there are some exploring reproducibility within one specific journal (Stodden, Seiler, and Ma 2018). What is more, journals notice the importance of this subject (McNutt 2014), and according to scientists they should take some responsibility for it (Eisner 2018).

1.7.3 Methodology

We decided to focus on three journals:

  • The Journal of Statistical Software (JSTAT)
  • The Journal of Machine Learning Research (JMLR)
  • The Journal of Open Source Software (JOSS)

The Journal of Statistical Software and the Journal of Machine Learning Research are well known among scientists in the field of data science. The Journal of Open Source Software is relatively new; it was established in 2016.

We chose articles randomly from the time frame 2010 to the present. From every journal we chose around 10 articles, of which around 5 introduce an R package and around 5 introduce a Python library. We chose only articles that have tests in their GitHub repositories. For our metrics we test the following:

  • Tests provided by the authors on GitHub (see the sketch after this list) - for R packages, testthat tests; for Python libraries, pytest and unittest tests.
  • Examples from the article - we test whether the chunks of code included in the article produce the same results. The number of examples per article varies a lot; in particular, none of the articles from the Journal of Open Source Software contain any examples.
  • Examples provided by the authors in the examples directory of the GitHub repository.
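
Below is a minimal sketch of how such author-provided test suites can be run locally; the paths are placeholders and the calls only illustrate our general procedure, not the exact commands used for every article.

```r
# Minimal sketch, assuming a package repository has been cloned locally;
# the paths below are placeholders.

# R packages: testthat suites are usually run through devtools.
library(devtools)
devtools::test("path/to/r-package")          # prints the pass/fail summary

# Python libraries: pytest (or unittest) is invoked on the repository;
# an exit status of 0 means that all tests passed.
system2("pytest", args = "path/to/python-library")
```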

1.7.3.1 How to compare articles?

1.7.3.1.1 Insight to the problem

Before anything can be said about the differences between journals and languages, there has to be a measure in which they can be compared. Each journal generally prefers articles of a particular structure, which means that articles from different journals can vary substantially. This includes not only topics but also the number of pages, the style of writing and, most importantly for this article, the way code is presented. Thus it comes as no surprise that there are many ways in which the code can be reproduced. Every so often an article presenting a package contains no examples and only unit tests; naturally, the opposite can also occur. The obvious conclusion is that the proposed measure must not favour any particular way of presenting code. The problem of defining the right measure of article reproducibility deserves a separate article in itself, and it should be stated that the metrics used by us are certainly not without flaws. We do not claim that they are unbiased, but we believe they are accurate enough to draw conclusions from.

1.7.3.1.2 Proposed metrics

First of all, we ran all tests provided by the authors in the article or located in an online repository. However, if an example had no direct connection to the article, it was not included in our reproduction process, because in our opinion it is not part of the article and therefore not part of the journal. Because of what has been said in the previous paragraph, we decided to look at articles from two perspectives: one is more biased, the other is true to the sheer number of reproducible examples and passing tests.

1.7.3.1.3 Student’s Mark

Analysis of the problem led us to the conclusion that a purely numerical assessment of article reproducibility has too many flaws and does not represent well the problems that occur while recreating the results of an article. What we propose instead is a four-level mark ranging from 2 to 5. The highest mark, 5, is given if the authors provided all results for the code shown in the article and repository and if those results can be fully reproduced. The article is scored 4 when there are some minor problems with the reproducibility of the code; for example, it may lack the outputs of some part of the shown code, or there are some errors in tests. The main rule is that the code should still do what it was meant to do: if some errors occur but they do not affect the results, they are negligible. The article is scored 3 in a few cases: if it lacks all or the vast majority of code outputs but, when reproduced, still produces reasonable results; or when we observe non-negligible differences in some tests or examples, provided this does not happen to a key element of the article. For example, suppose the method proposed in an article about training a machine learning model works and the model is trained well, but there are errors in the part of the code where the model is used by some different library; if we had to score the article based only on this example, we would give it a 3. The article is scored 2 if there are visible differences when reproducing the results of key elements of the article, or if the code from the article did not work even though we had all the dependencies.
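
To make the rubric explicit, here is our own paraphrase of it as a simple decision rule; the function and argument names are illustrative and do not come from the article.

```r
# Illustrative paraphrase of the 'Student's Mark' rubric as a decision rule.
# The arguments are logical flags describing what was observed during reproduction.
students_mark <- function(code_runs,             # the code works with all dependencies installed
                          key_results_differ,    # non-negligible differences in a key element
                          fully_reproduced,      # all outputs provided and reproduced exactly
                          only_minor_problems) { # missing outputs / test errors not affecting results
  if (!code_runs || key_results_differ) return(2)
  if (fully_reproduced)                 return(5)
  if (only_minor_problems)              return(4)
  3  # e.g. most outputs missing but reasonable results, or non-key differences
}

students_mark(code_runs = TRUE, key_results_differ = FALSE,
              fully_reproduced = FALSE, only_minor_problems = TRUE)  # returns 4
```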

1.7.3.1.4 Reproducibility value

The second metric we used to analyse articles is simple and puts the same weight on the reproducibility problems of the tests and of the examples.

\[ R_{val} = 1 - \frac{\text{negative tests} + \text{negative examples}}{\text{all tests} + \text{all examples}} \]

So a score of 0 represents an article that failed all of its tests and had only non-working examples, while a score of 1 means that every test passed and every example worked.
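
For illustration, here is a short computation of \(R_{val}\) for a hypothetical article; the counts below are made up and do not correspond to any of the scored papers.

```r
# Hypothetical article: 20 tests of which 2 fail, 5 examples of which 1 does not work.
negative_tests    <- 2
negative_examples <- 1
all_tests         <- 20
all_examples      <- 5

r_val <- 1 - (negative_tests + negative_examples) / (all_tests + all_examples)
r_val  # 1 - 3/25 = 0.88
```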

1.7.4 Results

The results of reproducing all chosen articles are presented in the following table:

| Journal | Language | Title | Student's Mark | Reproducibility Value |
|---------|----------|-------|----------------|-----------------------|
| JOSS | Python | Autorank: A Python package for automated ranking of classifiers (Herbold 2020) | 4 | 0.95 |
| JSTAT | R | Beyond Tandem Analysis: Joint Dimension Reduction and Clustering in R (Markos, D’Enza, and Velden 2019) | 5 | 1 |
| JSTAT | Python | CoClust: A Python Package for Co-Clustering (Role, Morbieu, and Nadif 2019) | 2 | 0.3 |
| JSTAT | R | corr2D: Implementation of Two-Dimensional Correlation Analysis in R (Geitner et al. 2019) | 4 | 1 |
| JSTAT | R | frailtyEM: An R Package for Estimating Semiparametric Shared Frailty Models (Balan and Putter 2019) | 4 | 0.79 |
| JOSS | Python | Graph Transliterator: A graph-based transliteration tool (Pue 2019) | 4 | 0.93 |
| JMLR | Python | HyperTools: a Python Toolbox for Gaining Geometric Insights into High-Dimensional Data (Heusser et al. 2018) | 5 | 0.96 |
| JOSS | R | iml: An R package for Interpretable Machine Learning (Molnar 2018) | 5 | 1 |
| JOSS | R | learningCurve: An implementation of Crawford’s and Wright’s learning curve production functions (Boehmke and Freels 2017) | 5 | 1 |
| JOSS | R | mimosa: A Modern Graphical User Interface for 2-level Mixed Models (Titz 2020) | 3 | 0.67 |
| JMLR | R | mlr: Machine Learning in R (Bischl, Lang, et al. 2016a) | 4 | 0.98 |
| JMLR | R | Model-based Boosting 2.0 (Hothorn et al. 2010) | 4 | 0.95 |
| JSTAT | Python | Natter: A Python Natural Image Statistics Toolbox (Sinz et al. 2014) | 2 | 0 |
| JSTAT | R | Network Coincidence Analysis: The netCoin R Package (Escobar and Martinez-Uribe 2020) | 4 | 0.91 |
| JMLR | Python | OpenEnsembles: A Python Resource for Ensemble Clustering (Ronan et al. 2018) | 2 | 0.78 |
| JOSS | R | origami: A Generalized Framework for Cross-Validation in R (Coyle and Hejazi 2018) | 5 | 1 |
| JOSS | Python | py-pde: A Python package for solving partial differential equations (Zwicker 2020) | 2 | 0.86 |
| JOSS | Python | PyEscape: A narrow escape problem simulator package for Python (Hughes, Morris, and Tomkins 2020) | 2 | 0.8 |
| JOSS | Python | Pyglmnet: Python implementation of elastic-net regularized generalized linear models (Jas et al. 2020) | 5 | 0.98 |
| JSTAT | Python | pyParticleEst: A Python Framework for Particle-Based Estimation Methods (Nordh 2017) | 3 | 0.67 |
| JSTAT | Python | PypeR, A Python Package for Using R in Python (Xia, McClelland, and Wang 2010) | 3 | 1 |
| JOSS | R | Rclean: A Tool for Writing Cleaner, More Transparent Code (Lau, Pasquier, and Seltzer 2020) | 5 | 1 |
| JMLR | Python | RLPy: A Value-Function-Based Reinforcement Learning Framework for Education and Research (Geramifard et al. 2015) | 2 | 0.59 |
| JSTAT | R | rmcfs: An R Package for Monte Carlo Feature Selection and Interdependency Discovery (Dramiński and Koronacki 2018) | 2 | 0.57 |
| JMLR | Python | Seglearn: A Python Package for Learning Sequences and Time Series (Burns and Whyne 2018) | 4 | 0.88 |
| JSTAT | Python | Simulated Data for Linear Regression with Structured and Sparse Penalties: Introducing pylearn-simulate (Löfstedt et al. 2018) | 2 | 0.3 |
| JOSS | R | tacmagic: Positron emission tomography analysis in R (Brown 2019) | 5 | 1 |
| JMLR | Python | TensorLy: Tensor Learning in Python (Kossaifi et al. 2019) | 3 | 1 |
| JMLR | R | The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R (Li et al. 2015) | 3 | 1 |
| JMLR | R | The huge Package for High-dimensional Undirected Graph Estimation in R (Zhao et al. 2012) | 3 | 1 |

To better present the obtained results, the plots below show the distribution of marks within each journal and language:

FIGURE 1.4: Distribution of ‘Student’s Mark’ score of reproduced articles within each journal.

FIGURE 1.5: Distribution of ‘Student’s Mark’ score of reproduced articles within each language.

The following plots show the mean ‘Student’s Mark’ scores of articles within each journal and each language:

FIGURE 1.6: Comparison of mean ‘Student’s Mark’ score of reproduced articles between journals.

FIGURE 1.7: Comparison of mean ‘Student’s Mark’ score of reproduced articles between languages.

Based on the plots we can see that R articles had a better mean score and that the Journal of Open Source Software had the best mean score among the journals.
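
These group means can be recomputed directly from the results table; below is a minimal sketch, assuming the table has been loaded into a data frame called results with columns Journal, Language, StudentsMark and ReproducibilityValue.

```r
library(dplyr)

# Mean scores per journal (assumes a data frame `results` built from the table above).
results %>%
  group_by(Journal) %>%
  summarise(mean_students_mark   = mean(StudentsMark),
            mean_reproducibility = mean(ReproducibilityValue))

# The same summary per language.
results %>%
  group_by(Language) %>%
  summarise(mean_students_mark   = mean(StudentsMark),
            mean_reproducibility = mean(ReproducibilityValue))
```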

Similar plots below show the means of ‘Reproducibility Value’ scores:

FIGURE 1.8: Comparison of mean ‘Reproducibility Value’ score of reproduced articles between journals.

FIGURE 1.9: Comparison of mean ‘Reproducibility Value’ score of reproduced articles between languages.

As with the ‘Student’s Mark’, we can see that R articles had a better mean score and the Journal of Open Source Software had the best mean score among the journals.

We also examined similar statistics for six groups - one for every journal-language pair.

FIGURE 1.10: Distribution of ‘Student’s Mark’ score of reproduced articles within each journal-language pair.

FIGURE 1.11: Comparison of mean ‘Student’s Mark’ score of reproduced articles between journal-language pairs.

FIGURE 1.12: Comparison of mean ‘Reproducibility Value’ score of reproduced articles between journal-language pairs.

These plots show that within each journal R always has a higher mean score than Python in both metrics. On the other hand, the difference between the languages within JMLR and JOSS is not as large as within JSTAT.

1.7.5 Summary and conclusions

In this article, we presented the differences between three scientific journals and the two most popular data science programming languages in the context of the reproducibility of articles. Based on our research, we have come to the following conclusions:

  • The Journal of Open Source Software has the highest score in both metrics. Articles in this journal are often published by professional developers, in contrast to the other journals, which are dominated by theoretical scientists. That is why, in our opinion, the quality of code in JOSS may be higher. Similarly, the Journal of Machine Learning Research attracts authors who are closer to professional developers than those of the Journal of Statistical Software, because, as the names suggest, JMLR is connected more to machine learning and JSTAT to statistics. This may be, from our point of view, why JSTAT scored the lowest in both metrics.

  • According to our research, R is more reproducible than Python. The difference between the two languages in JMLR and JOSS is not as large as in JSTAT. This may be due to the fact that R is a language “made by statisticians for statisticians” and JSTAT articles are focused on statistics more than articles from the other journals.

  • Some of the Python packages were created in Python 2, which is no longer supported. This caused many problems when reproducing them.

  • Most of the articles from JOSS had their requirements specified, in contrast to the other two journals. This contributed to the higher results of JOSS.

To sum up, our research suggests that articles published in JOSS and using the R language are the most reproducible, but research on a bigger sample is needed to confirm our results. Moreover, a similar comparison could be carried out for other journals and languages.

References

Anderson, Christopher, Joanna Anderson, Marcel van Assen, Peter Attridge, Angela Attwood, Jordan Axt, Molly Babel, et al. 2019. “Reproducibility Project: Psychology.” https://doi.org/10.17605/OSF.IO/EZCUJ.

Balan, Theodor, and Hein Putter. 2019. “FrailtyEM: An R Package for Estimating Semiparametric Shared Frailty Models.” Journal of Statistical Software, Articles 90 (7): 1–29. https://doi.org/10.18637/jss.v090.i07.

Biecek, Przemyslaw, and Marcin Kosinski. 2017. “archivist: An R Package for Managing, Recording and Restoring Data Analysis Results.” Journal of Statistical Software 82 (11): 1–28. https://doi.org/10.18637/jss.v082.i11.

Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016a. “Mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5. http://jmlr.org/papers/v17/15-066.html.

Boehmke, Bradley, and Jason Freels. 2017. “LearningCurve: An Implementation of Crawford’s and Wright’s Learning Curve Production Functions.” The Journal of Open Source Software 2 (May). https://doi.org/10.21105/joss.00202.

Brown, Eric. 2019. “Tacmagic: Positron Emission Tomography Analysis in R.” The Journal of Open Source Software 4 (February): 1281. https://doi.org/10.21105/joss.01281.

Burns, David M., and Cari M. Whyne. 2018. “Seglearn: A Python Package for Learning Sequences and Time Series.” Journal of Machine Learning Research 19 (83): 1–7. http://jmlr.org/papers/v19/18-160.html.

Coyle, Jeremy, and Nima Hejazi. 2018. “Origami: A Generalized Framework for Cross-Validation in R.” The Journal of Open Source Software 3 (January): 512. https://doi.org/10.21105/joss.00512.

Dramiński, Michał, and Jacek Koronacki. 2018. “Rmcfs: An R Package for Monte Carlo Feature Selection and Interdependency Discovery.” Journal of Statistical Software, Articles 85 (12): 1–28. https://doi.org/10.18637/jss.v085.i12.

Drummond, Chris. 2012. “Reproducible Research: A Dissenting Opinion.” In.

Eisner, D. A. 2018. “Reproducibility of Science: Fraud, Impact Factors and Carelessness.” Journal of Molecular and Cellular Cardiology 114 (January): 364–68. https://doi.org/10.1016/j.yjmcc.2017.10.009.

Elmenreich, Wilfried, Philipp Moll, Sebastian Theuermann, and Mathias Lux. 2018. “Making Computer Science Results Reproducible - a Case Study Using Gradle and Docker,” August. https://doi.org/10.7287/peerj.preprints.27082v1.

Escobar, Modesto, and Luis Martinez-Uribe. 2020. “Network Coincidence Analysis: The netCoin R Package.” Journal of Statistical Software, Articles 93 (11): 1–32. https://doi.org/10.18637/jss.v093.i11.

Geitner, Robert, Robby Fritzsch, Jürgen Popp, and Thomas Bocklitz. 2019. “Corr2D: Implementation of Two-Dimensional Correlation Analysis in R.” Journal of Statistical Software, Articles 90 (3): 1–33. https://doi.org/10.18637/jss.v090.i03.

Geramifard, Alborz, Christoph Dann, Robert H. Klein, William Dabney, and Jonathan P. How. 2015. “RLPy: A Value-Function-Based Reinforcement Learning Framework for Education and Research.” Journal of Machine Learning Research 16 (46): 1573–8. http://jmlr.org/papers/v16/geramifard15a.html.

Gundersen, Odd Erik, and Sigbjørn Kjensmo. 2018. “State of the Art: Reproducibility in Artificial Intelligence.” https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17248.

Herbold, Steffen. 2020. “Autorank: A Python Package for Automated Ranking of Classifiers.” Journal of Open Source Software 5 (48): 2173. https://doi.org/10.21105/joss.02173.

Heusser, Andrew C., Kirsten Ziman, Lucy L. W. Owen, and Jeremy R. Manning. 2018. “HyperTools: A Python Toolbox for Gaining Geometric Insights into High-Dimensional Data.” Journal of Machine Learning Research 18 (152): 1–6. http://jmlr.org/papers/v18/17-434.html.

Hothorn, Torsten, Peter Bühlmann, Thomas Kneib, Matthias Schmid, and Benjamin Hofner. 2010. “Model-Based Boosting 2.0.” Journal of Machine Learning Research 11 (71): 2109–13. http://jmlr.org/papers/v11/hothorn10a.html.

Hughes, Nathan, Richard Morris, and Melissa Tomkins. 2020. “PyEscape: A Narrow Escape Problem Simulator Package for Python.” Journal of Open Source Software 5 (47): 2072. https://doi.org/10.21105/joss.02072.

Jas, Mainak, Titipat Achakulvisut, Aid Idrizović, Daniel Acuna, Matthew Antalek, Vinicius Marques, Tommy Odland, et al. 2020. “Pyglmnet: Python Implementation of Elastic-Net Regularized Generalized Linear Models.” Journal of Open Source Software 5 (47): 1959. https://doi.org/10.21105/joss.01959.

Kossaifi, Jean, Yannis Panagakis, Anima Anandkumar, and Maja Pantic. 2019. “TensorLy: Tensor Learning in Python.” Journal of Machine Learning Research 20 (26): 1–6. http://jmlr.org/papers/v20/18-277.html.

Lau, Matthew, Thomas F. J.-M Pasquier, and Margo Seltzer. 2020. “Rclean: A Tool for Writing Cleaner, More Transparent Code.” Journal of Open Source Software 5 (46): 1312. https://doi.org/10.21105/joss.01312.

Li, Xingguo, Tuo Zhao, Xiaoming Yuan, and Han Liu. 2015. “The Flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R.” Journal of Machine Learning Research 16 (18): 553–57. http://jmlr.org/papers/v16/li15a.html.

Löfstedt, Tommy, Vincent Guillemot, Vincent Frouin, Edouard Duchesnay, and Fouad Hadj-Selem. 2018. “Simulated Data for Linear Regression with Structured and Sparse Penalties: Introducing Pylearn-Simulate.” Journal of Statistical Software, Articles 87 (3): 1–33. https://doi.org/10.18637/jss.v087.i03.

Markos, Angelos, Alfonso D’Enza, and Michel van de Velden. 2019. “Beyond Tandem Analysis: Joint Dimension Reduction and Clustering in R.” Journal of Statistical Software, Articles 91 (10): 1–24. https://doi.org/10.18637/jss.v091.i10.

McNutt, Marcia. 2014. “Journals Unite for Reproducibility.” Science 346 (6210): 679. https://doi.org/10.1126/science.aaa1724.

Molnar, Christoph. 2018. “Iml: An R Package for Interpretable Machine Learning.” Journal of Open Source Software 3 (June): 786. https://doi.org/10.21105/joss.00786.

Nordh, Jerker. 2017. “pyParticleEst: A Python Framework for Particle-Based Estimation Methods.” Journal of Statistical Software 78 (3). https://doi.org/10.18637/jss.v078.i03.

Pue, A. 2019. “Graph Transliterator: A Graph-Based Transliteration Tool.” Journal of Open Source Software 4 (44): 1717. https://doi.org/10.21105/joss.01717.

Role, François, Stanislas Morbieu, and Mohamed Nadif. 2019. “CoClust: A Python Package for Co-Clustering.” Journal of Statistical Software 88 (7). https://doi.org/10.18637/jss.v088.i07.

Ronan, Tom, Shawn Anastasio, Zhijie Qi, Pedro Henrique S. Vieira Tavares, Roman Sloutsky, and Kristen M. Naegle. 2018. “OpenEnsembles: A Python Resource for Ensemble Clustering.” Journal of Machine Learning Research 19 (26): 1–6. http://jmlr.org/papers/v19/18-100.html.

Sinz, Fabian, Joern-Philipp Lies, Sebastian Gerwinn, and Matthias Bethge. 2014. “Natter: A Python Natural Image Statistics Toolbox.” Journal of Statistical Software 61 (October): 1–34. https://doi.org/10.18637/jss.v061.i05.

Slezak, Peter, and Iveta Waczulikova. 2011. “Reproducibility and Repeatability.” Physiological Research / Academia Scientiarum Bohemoslovaca 60 (April): 203–4; author reply 204.

Stodden, Victoria, Jennifer Seiler, and Zhaokun Ma. 2018. “An Empirical Analysis of Journal Policy Effectiveness for Computational Reproducibility.” Proceedings of the National Academy of Sciences 115 (11): 2584–9. https://doi.org/10.1073/pnas.1708290115.

Titz, Johannes. 2020. “Mimosa: A Modern Graphical User Interface for 2-Level Mixed Models.” Journal of Open Source Software 5 (49): 2116. https://doi.org/10.21105/joss.02116.

Xia, Xiao-Qin, Michael McClelland, and Yipeng Wang. 2010. “PypeR, a Python Package for Using R in Python.” Journal of Statistical Software, Code Snippets 35 (2): 1–8. http://www.jstatsoft.org/v35/c02.

Zhao, Tuo, Han Liu, Kathryn Roeder, John Lafferty, and Larry Wasserman. 2012. “The Huge Package for High-Dimensional Undirected Graph Estimation in R.” Journal of Machine Learning Research 13 (37): 1059–62. http://jmlr.org/papers/v13/zhao12a.html.

Zwicker, David. 2020. “Py-Pde: A Python Package for Solving Partial Differential Equations.” Journal of Open Source Software 5 (48): 2158. https://doi.org/10.21105/joss.02158.