4.4 Reproducibility analysis of articles covering the RMDL and UNet++ architectures

Authors: Marceli Korbin, Szymon Szmajdziński, Paweł Wojciechowski (Warsaw University of Technology)

4.4.1 Abstract

Science has recently been facing a reproducibility crisis. A large share of scientific papers is hard, if not impossible, to reproduce, which makes the issue especially important to discuss today. The main subject of this article is the reproducibility of scientific papers, together with a brief analysis of the papers RMDL: Random multimodel deep learning for classification and UNet++: A Nested U-Net Architecture for Medical Image Segmentation. Two experiments are conducted in order to check the reproducibility of these papers, following all the steps they describe wherever possible. Although the final results indicate at least partial reproducibility of both articles, the problems encountered along the way show that the process is harder than it seems.

4.4.2 Reproduction

Reproducibility is the ability of a result to be recreated or copied. In other words, the goal of reproducing a paper is to obtain results as similar as possible to those reported in it, using the method it describes. As it turns out, this is not easy to achieve.

Reproducibility is an essential aspect of a scientific paper for several reasons. The most important one is assurance that the results are correct (Tatman et al. 2018). By just reading a paper one cannot be sure about the accuracy of the results – there is always a chance that the researcher made a mistake. Moreover, machine learning models are usually at least partly random (e.g., splitting data into training and test sets, initializing parameters). Promising results may therefore be, to some extent, a matter of coincidence. Reproducing the paper increases the reliability of its results.
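
One common way to limit this randomness – a minimal sketch, assuming a TensorFlow/Keras workflow like the ones reproduced below – is to fix every relevant random seed before training:

```python
import random

import numpy as np
import tensorflow as tf

# Fixing all relevant seeds makes the stochastic parts of an experiment
# (data shuffling, train/test splits, weight initialization) repeatable.
random.seed(42)         # Python's built-in RNG
np.random.seed(42)      # NumPy, used e.g. by scikit-learn splits
tf.random.set_seed(42)  # TensorFlow/Keras weight initialization
```

Even with fixed seeds, nondeterministic GPU kernels can still cause small run-to-run differences.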

Another advantage of reproducibility is transparency. There is a chance, albeit small, that a researcher manipulated the data to achieve better results. Reproducing a paper can ensure that no such unethical incident has taken place.

Lastly, reproducing a paper can definitely help one understand the subject better. By running the code oneself, one can find lines that are unclear and work through them.

4.4.3 Random Multimodel Deep Learning for Classification

The first article to discuss is RMDL: Random multimodel deep learning for classification (Kowsari et al. 2018b), which presents RMDL – a voting ensemble technique that uses randomly generated deep neural networks with various architectures. The idea of this approach is to benefit from the advantages of each neural network architecture used; RMDL is therefore versatile and suitable for various types of classification problems, such as text, video, or image classification. In the article, the model's text and image classification performance is demonstrated on popular datasets: MNIST, CIFAR-10, WOS, Reuters, IMDB, and 20NewsGroup.

RMDL generates multiple neural network models based on three architectures: Multi-Layer Perceptron, Convolutional Neural Network, and Recurrent Neural Network (more precisely, RNNs with LSTM and GRU units). The number of generated Random Deep Learning (RDL) models is explicitly defined with a parameter, while the number of layers and nodes in each RDL model is generated randomly; all the models are trained in parallel.
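
To illustrate the random generation step, below is a minimal sketch of building one such RDL model as a Keras MLP. The depth and width ranges, dropout rate, and optimizer are illustrative assumptions, not the paper's exact defaults.

```python
import random

from tensorflow import keras

def build_random_mlp(input_dim, n_classes,
                     max_layers=5, min_nodes=64, max_nodes=512):
    """Build one Random Deep Learning (RDL) model: an MLP whose depth
    and layer widths are drawn at random."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))
    # Randomly chosen depth; each hidden layer gets a random width.
    for _ in range(random.randint(1, max_layers)):
        model.add(keras.layers.Dense(random.randint(min_nodes, max_nodes),
                                     activation="relu"))
        model.add(keras.layers.Dropout(0.5))
    # Softmax output layer, as described in the article.
    model.add(keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```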

For text feature extraction, RMDL can use various methods such as word embeddings, TF-IDF, and n-grams; in the text classification examples, GloVe word embeddings are used. Every RDL model has a softmax output layer that produces its class prediction, and the final prediction of the ensemble is determined by a majority vote over all RDL models.
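
The voting step itself is simple. The following sketch (an illustration with NumPy, not the RMDL library's actual code) shows how per-model softmax outputs can be combined into a final prediction:

```python
import numpy as np

def majority_vote(model_probs):
    """Combine per-model softmax outputs into one prediction per sample.

    model_probs: array of shape (n_models, n_samples, n_classes) with
    each RDL model's softmax probabilities.
    """
    # Each model votes for its most probable class.
    votes = np.argmax(model_probs, axis=-1)        # (n_models, n_samples)
    n_classes = model_probs.shape[-1]
    # Count the votes each class received for every sample; ties are
    # resolved in favour of the lower class index.
    counts = np.apply_along_axis(
        lambda v: np.bincount(v, minlength=n_classes), 0, votes
    )                                              # (n_classes, n_samples)
    return np.argmax(counts, axis=0)               # (n_samples,)

# Example: three models, two samples, two classes.
probs = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # model 1
    [[0.6, 0.4], [0.3, 0.7]],   # model 2
    [[0.4, 0.6], [0.9, 0.1]],   # model 3
])
print(majority_vote(probs))     # -> [0 1]
```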

Each generated RDL model can also use a different optimizer; models that do not provide a good fit can then simply be ignored at voting time. Each of the architectures, feature engineering techniques, and optimizers used is described briefly in the article. The presented results show that RMDL performed better than the baseline models on every exemplary dataset. On Google Scholar the paper has 81 citations, which indicates mediocre relevance: the solution is neither commonly used nor widely quoted.

4.4.3.1 Results

Experiments were conducted to check whether the same results as those present in the paper could be obtained. The first noticeable thing was that training the models consumed a lot of time; therefore only a part of the datasets was chosen: IMDB, 20NewsGroup, two Web of Science datasets (WOS-5736 and WOS-11967), and MNIST. The tables below show the results. Model size denotes how many voting submodels RMDL is composed of.

Model size   IMDB (paper)   IMDB (reproduction)   20NewsGroup (paper)   20NewsGroup (reproduction)
3 RDLs       89.91          89.39                 86.73                 82.46
9 RDLs       90.13          89.32                 87.62                 85.77
15 RDLs      90.79          89.54                 87.91                 85.94

Table 1: Accuracy for text datasets.

Model size   WOS-5736 (paper)   WOS-5736 (reproduction)   WOS-11967 (paper)   WOS-11967 (reproduction)
3 RDLs       90.86              87.10                     87.39               78.15
9 RDLs       92.60              91.02                     90.65               84.88
15 RDLs      92.66              89.98                     -                   -

Table 2: Accuracy for Web of Science text datasets.

Model size   Loss (paper)   Loss (reproduction)
3 RDLs       0.51           1.37
9 RDLs       0.41           0.67

Table 3: Loss for MNIST dataset.

In all experiments the obtained results were worse than those in the paper. In most cases the difference is not substantial, although for 3 RDLs on WOS-11967 it reaches 9.23 percentage points. Possible reasons include randomness and differences in hardware. Nevertheless, the accuracy of the new models was still relatively high.

4.4.3.2 Problems

Some problems occurred during the experiments. The first and hardest one was an insufficient amount of processing power. At first, the RNN models were considerably difficult to train, since the default parameters did not allow the use of a GPU; as a result, each training epoch took several minutes. The issue was fixed by modifying the parameters of the RNN models, after which they could be trained in a reasonable amount of time.
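
One plausible culprit – an assumption based on how TensorFlow 2 Keras dispatches recurrent layers, not a detail confirmed by the article – is that non-default options such as recurrent dropout disable the fast cuDNN kernel:

```python
from tensorflow import keras

# Slow: recurrent_dropout > 0 forces the generic (non-fused) RNN kernel,
# which cannot use the fast cuDNN path even when a GPU is available.
slow_lstm = keras.layers.LSTM(128, recurrent_dropout=0.3)

# Fast: with default activations and recurrent_dropout=0, Keras can
# dispatch the layer to the fused cuDNN implementation on a GPU,
# cutting per-epoch training time dramatically.
fast_lstm = keras.layers.LSTM(128, recurrent_dropout=0.0)
```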

During the attempt to conduct experiments on the Web of Science datasets, it turned out that the website which was supposed to offer the data for download no longer existed. The data were eventually found in other sources, downloaded, and loaded manually. Another difficulty was insufficient RAM, which made checking the WOS-46985 results impossible: the data was too large, and the only output of the code covering this part consisted of error messages.

4.4.4 A Nested U-Net Architecture for Medical Image Segmentation

The second article to be analyzed is UNet++: A Nested U-Net Architecture for Medical Image Segmentation (Z. Zhou et al. 2019), which presents an architecture based on encoder-decoder networks with skip pathways between them, meant to reduce the semantic distance between the two components. Such a solution is claimed to simplify the learning task when the feature maps of the decoder and encoder networks are semantically similar. The architecture is evaluated on medical image segmentation tasks involving the segmentation of lung nodules, cell nuclei, the liver, and colon polyps.

UNet++ consists of an encoder and decoder, connected through dense convolutional blocks and skip pathways between them with additional layers. After data enters the encoder, a feature map is processed in a block, whose number of layers is dependent on the network level. The skip pathways in blocks are meant to shorten the distance between semantic levels of the feature maps in the encoder and decoder, improve gradient flow and make the overall optimization process easier. Another important feature of UNet++ is optional deep supervision, which works in either of two modes: accurate mode and fast mode. It serves to enable model pruning and is based on building full resolution feature maps at chosen semantic levels of the network.
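
The core of the redesign is the dense skip pathway. The following is a simplified Keras sketch of a single nested node; the function names and filter handling are illustrative, not the authors' implementation:

```python
from tensorflow.keras import layers

def conv_block(x, filters):
    """Two 3x3 convolutions, as in a plain U-Net block."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def nested_node(same_level_maps, lower_map, filters):
    """One UNet++ node: concatenate all earlier feature maps from the
    same level with the upsampled map from the level below, then apply
    a convolution block. This dense skip pathway is what shortens the
    semantic gap between encoder and decoder feature maps."""
    up = layers.UpSampling2D(size=2)(lower_map)
    merged = layers.Concatenate()(same_level_maps + [up])
    return conv_block(merged, filters)
```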

All these elements are illustrated in figures and described as improvements over the previously introduced U-Net and wide U-Net architectures. The three models are then compared in a series of experiments, which show that UNet++ outperforms the other two architectures on each dataset.

4.4.4.1 Problems

The second experiment brought some problems as well; the first of them was downloading the necessary data. The authors of the original article shared links to the datasets; however, not all of them were public. In several cases special permission was needed, and despite requests being sent, access was not granted in time.

This might be explained by the fact that medical data usually requires proper anonymization and is rarely public. Such a problem did not occur during the analysis of the RMDL article, since it can be perceived as an example of basic research, where new architectures are introduced without a concrete application; popular, openly available datasets are usually used for that type of task.

Eventually, three datasets containing pictures of Lungs, Nuclei, and Polyps were used in the experiments. The next problem to overcome was data preprocessing. There were open issues about preprocessing on the authors' GitHub profile, for which the authors gave no instructions; this prompted an improvised approach. Only the Nuclei and Polyp datasets could be preprocessed, while the Lung dataset turned out to be too complicated to prepare: its size prevented loading the whole set at once and, moreover, its images were three-dimensional.

It was also discovered that the test dataset had no ground-truth masks, which are necessary to calculate the accuracy of the models (further explained in the 'Results' subsection). This issue was overcome by dividing the original training dataset into training and test subsets; such a solution turned out later to have been used by the authors of the article as well.
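
A minimal sketch of such a split, assuming scikit-learn and placeholder arrays (the shapes and the 80/20 ratio are assumptions, not values from the article):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the preprocessed images and their
# ground-truth masks (the shapes are purely illustrative).
images = np.random.rand(100, 96, 96, 3)
masks = np.random.randint(0, 2, size=(100, 96, 96, 1))

# Hold out part of the original training set as a test set, since the
# official test set ships without ground-truth masks.
train_imgs, test_imgs, train_masks, test_masks = train_test_split(
    images, masks, test_size=0.2, random_state=42
)
```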

With the data preprocessed, the next step was getting the code to run, which was not simple either. The authors used older versions of several libraries, which produced many errors under current ones. The first attempted solution was to modify the authors' code to work with the newer library versions; this failed due to the complexity of the task, and the problem was instead solved by downgrading the libraries to the versions the authors had used. Finding a way to use the GPU likewise required installing the correct versions of the libraries. Even with the GPU enabled, the learning process still consumed a considerable amount of time.

4.4.4.2 Results

The models were eventually ready to be trained in an attempt to reconstruct the authors' results. The authors suggested two metrics to evaluate the effectiveness of a model (both are sketched in code after this list):

  • Dice coefficient – the area of overlap multiplied by two and divided by the total number of pixels in both images.

  • Intersection over Union (IoU) – the area of overlap divided by the area of union.
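
Both definitions translate directly into code. A minimal NumPy sketch for binary masks (an illustration, not the authors' evaluation code):

```python
import numpy as np

def iou(pred, target):
    """Intersection over Union of two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return intersection / union if union > 0 else 1.0

def dice(pred, target):
    """Dice coefficient: twice the overlap divided by the total number
    of positive pixels in both masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 2.0 * intersection / total if total > 0 else 1.0
```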

Metric   Nuclei (paper)   Nuclei (reproduction)   Polyp (paper)   Polyp (reproduction)
IoU      94.00            85.74                   33.45           24.59
Dice     95.80            79.88                   -               34.27

Table 4: Dice and IoU measures for Nuclei and Polyp datasets.

Both metrics describe how similar two pictures are. Two experiments were conducted, on the Nuclei and Polyp datasets. The results turned out to be fairly similar to the paper's, although all the reproduced scores were lower; the largest difference occurred for the Dice coefficient on the Nuclei dataset. The most likely source of these differences is an insufficient amount of processing power: it was impossible to use the same batch size or number of epochs as the authors, and tuning those parameters would most likely have improved the new results. Despite the differences, the results of the article can be considered reproducible.

4.4.5 Conclusions

Reproduction might be seen as simply running premade code, a task which should bear no difficulties. However, the whole process turned out to be much more complicated. Several kinds of trouble are likely to occur along the way; in this study, different issues arose with the datasets of both analyzed articles.

Likewise, data preprocessing is problematic. It is crucial to describe this process; otherwise, reproduction becomes more difficult, if possible at all. Sharing code with preprocessing performed on exemplary datasets would be a good solution to this problem. Another obstacle, encountered with both articles, was processing power; although it is not the authors' fault, it remains very important during the reproduction of an article.

Contacting the authors might solve at least some of these problems. The ability to ask about a specific method or to consult a line of code is a simple way of clarifying things. That said, authors do not always respond: there were many issues on GitHub concerning one of the analyzed articles, of which only a few were answered.

References

Kowsari, K., Heidarysafa, M., Brown, D. E., Meimandi, K. J., & Barnes, L. E. (2018b). RMDL: Random multimodel deep learning for classification. In Proceedings of the 2nd International Conference on Information System and Data Mining (pp. 19–28).
Tatman, R., VanderPlas, J., & Dane, S. (2018). A practical taxonomy of reproducibility for machine learning research.
Zhou, Z., Siddiquee, M. M. R., Tajbakhsh, N., & Liang, J. (2019). UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Transactions on Medical Imaging.