1.2 Aging articles. How time affects reproducibility of scientific papers?

Authors: Paweł Morgen, Piotr Sieńko, Konrad Welkier (Warsaw University of Technology)

1.2.1 Abstract

Reproduction of a code presented in scientific papers tend to be a laborious yet important process since it enables readers a better understanding of the tools proposed by the authors. While recreating an article various difficulties are faced what can result in calling the paper irreproducible. Some reasons why such situations occur stem from the year when the article was published (for example usage of no more supported packages). The purpose of the following paper is to prove whether this is a general trend which means answering the question: is the year when the article was published related to the reproducibility of the paper. To do so a package CodeExtractorR was created that enables extracting code from PDF files. By using this tool a significant number of articles could be analyzed and therefore results received enabled us to give an objective answer to the stated question.

1.2.2 Introduction

Every article published in a scientific journal is aimed at improving our knowledge in a certain field. To prove their theories, authors should provide papers with detailed, working examples and extensive supplementary materials to reproduce results. Unfortunately, these conditions are not always fulfilled. In such a case, other researchers are not able to verify and accept the solutions presented by the author. Moreover, the article is not only useless for the scientific community but also for business recipients.

Over the years, several different definitions of reproducibility have been proposed. According to Gentleman and Temple Lang (2007), reproducible research are papers with accompanying software tools that allow the reader to directly reproduce methods that are presented in the research paper. Other authors suggest that scientific paper is reproducible only if text, data and code are made available and allow an independent researcher to recreate the results (Vandewalle, Kovacevic, and Vetterli 2009). Second definition emphasizes the importance of accessibility to data used in researches, therefore it seems to be more suitable and complete interpretation of reproducibility. In addition, in this article, we used scale based on the spectrum of reproducibility, proposed by Peng (2011). In his work, he also mentioned reproducibility as a minimal requirement for assessing the scientific value of the paper. In the past few years, computing has become an essential part of scientific workflow. Some best practices for writing and publishing reproducible scientific article were presented by Stodden et al. (2013). Furthermore, she made a brief overview of existing tools and software that facilitate this task. Similar issue was closely described by Kitzes, Turek, and Deniz (2017). Tools created solely for reproducibility in R were proposed by Marwick (n.d.) in package rrtools.

Although many articles focus on software or framework solutions for reproducibility problems, analysis of scientific papers reproducibility in the context of release date has, to the best of the authors’ knowledge, not been described before. The intention of such research is to find correlations between age of article and its reproducibility. Authors believe that finding these dependencies will allow to calculate the estimated life span of data science article. Furthermore, as replicability helps with applying proposed methods and tools, its approximated level might be helpful in estimating usefulness of every scientific article.

1.2.3 CodeExtractoR package

To examine sufficent number of articles, package CodeExtractoR has been used. Its main functions - extract_code_from_pdf() and extract_code_from_html(), automatically extract code snippets from pdf and html files into R files.

# install.packages("devtools")
devtools::install_github("MrDomani/CodeExtractoR")
library(CodeExtractoR)

# Path to file
path_file <- "filepath.html"
  
# Path to R file output
output <- "output.R"

# Extracting code
extract_code_from_html(input_file = path_file, output_file = output, console_char = ">")

To extract code from pdf, API key to https://cloudconvert.com/ is required.

1.2.4 Methodology

The first issue that should be touched upon, while considering the methodology behind preparing this article, is the scale used to assess the reproducibility of the papers. In the Introduction it was already mentioned how the scale was created but a more detailed description is required. The authors decided that the scale should consist of 4 levels (from 0 up to 3):

  1. The 0 grade was given in case when no chunk of code gave the anticipated results and no figure was reproduced successfully (in practice such situation occurred mainly when the package described in the article was no more available).
  2. The 1 grade means that at least one example gave the results that the authors waited for, while it also includes situations when about half of the code in the chunks behaved as expected.
  3. The 2 grade was awarded to the articles that were reproducible “in the majority” what also means that they were not reproducible in 100%.
  4. The 3 grade was received by the articles that were fully reproducible and no problems were encountered in the process. Such a result was highly anticipated by the authors but the criteria for this grade were rather strict.

The second issue that also played a vital role in the authors’ work was the scope of the analysis. It was decided that in order to maintain “other thing equal” according to a well-known Latin phrase “ceteris paribus” only one online journal – The R Journal – was taken into account. Being equipped with a tool for faster reproduction of articles – the CodeExtractoR package – the authors agreed to examine about 20 articles that were published across a few years. Such a great number of papers meant that the approach taken could be described as holistic. It is also worth mentioning that usually 30 articles from each year were analyzed (at least whenever it was possible). Finally, it should be noted that in the case of the date when they were published the examined articles range from 2009 up to 2019. The third and final issue that should be considered in this part of the article focuses on the measures undertaken by the authors in order to tackle the problem of biased assessment. Although the scale that was proposed was not totally dependent on the person who was using it, it still left someplace for personal liking and disliking of the paper. As a way of marginalization of this trend the authors have taken part in many conversations when the facts that led to specific grading of the articles were shared. This enabled awarding the grades even more fairly. However, the final measure was much more simple and it was believed to be much more effective as well when compared to the previous one. The articles for each year were divided into 3 groups and assigned to one author each. Thus each author has examined the papers from the whole range of release dates that were taken into account.

1.2.5 Results

Specific results are presented in Table 1.1, which shows the number of examined articles from 2009 up to 2019, grouped by received grade. The column “Grade” represents the 0 - III scale of reproducibility. The rest of the columns shows a number of papers that achieved a particular grade in each year.

 

Grade 09’ 10’ 11’ 12’ 13’ 14’ 15’ 16’ 17’ 18’ 19’
0 3 2 6 4 6 8 5 9 5 9 9
I 4 2 0 2 3 6 6 10 3 2 2
II 3 7 8 4 13 6 10 7 14 6 6
III 6 7 6 5 8 10 9 4 8 13 13
SUM 16 18 20 15 30 30 30 30 30 30 30

 

 

Figure 1.1 shows the results in the original 4-level scale. Although, the number of papers varies throughout the years in every reproducibility class, it is observable that intermediate ones - I and II, are less common in the oldest and newest papers. In addition, results in 2018 and 2019 are identical. It is important to remember that from 2009 to 2012, the overall number of papers oscillated around 18 per year. After 2013, the number of researched articles was constant.

 

Number of papers by class and publication year

FIGURE 1.1: Number of papers by class and publication year

 

After calculating the percentage of each class in a specific year ( Figure 1.2, Figure 1.3), it is observed, that in the two oldest examined years - 2009 and 2010 - a ratio of completely unreproducible papers (with 0 or I class ) is surprisingly low. Furthermore, papers with III class of reproducibility are nearly 40% of all articles in these years.

 

Ratio of each class throughout years

FIGURE 1.2: Ratio of each class throughout years

 

Except for 2019, 2018 and 2016, percentage of fully reproducible papers (III class) is stable. In the newest articles, this percentage is slightly higher. Year 2016 is the only one, where unreproducible papers were in the majority. Only in 3 cases, percentage of reproducible articles dropped below 60%.

 

Summarized results throughout years

FIGURE 1.3: Summarized results throughout years

 

1.2.6 Summary and conclusions

This article has presented an analysis of scientific papers reproducibility in the context of publication date. As a part of this work, 280 articles from RJournal have been verified. Collected results show that even though recently published articles are more likely to be fully reproducible than older ones, time does not affect reproducibility as we could have expected. A higher ratio of complete reproducibility occurs only for papers from the last 2 years. In articles from 2009 – 2017 there is no evident correlation between reproducibility and year of publication. However, several patterns are possible to recognize. Firstly, except for the year 2016, in each year percentage of reproducible articles (class II or III) was higher than unreproducible ones (class 0 or I). Secondly, the proportion of fully reproducible articles seems to oscillate between 30% - 40%. Moreover, the earliest years have a similar or even lower ratio of complete irreproducibility than later ones. These unexpected observations might be explained by the fact that older packages appear to be less dependent on other libraries. Additionally, modern packages contain more complex and dedicated functions, therefore they are more vulnerable to changes. Undoubtedly, this paper does not cover all aspects of scientific articles obsolescence, however, we believe that the data we have collected might be a valuable starting point for further research.

References

Gentleman, Robert, and Duncan Temple Lang. 2007. “Statistical Analyses and Reproducible Research.” Journal of Computational and Graphical Statistics 16 (1): 1–23.

Kitzes, Justin, Daniel Turek, and Fatma Deniz. 2017. The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Univ of California Press.

Marwick, B. n.d. “Rrtools: Creates a Reproducible Research Compendium (2018).”

Peng, Roger D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226–7. https://doi.org/10.1126/science.1213847.

Stodden, Victoria, David H. Bailey, Jonathan M. Borwein, Randall J. LeVeque, William J. Rider, and William Stein. 2013. “Setting the Default to Reproducible Reproducibility in Computational and Experimental Mathematics.” In.

Vandewalle, Patrick, Jelena Kovacevic, and Martin Vetterli. 2009. “Reproducible Research in Signal Processing.” IEEE Signal Processing Magazine 26 (3): 37–47.