ML Completeness Checklist

With the aim of increasing reproducibility and enabling others to more easily build on published work, we present the ML Code Completeness Checklist. The checklist evaluates a code repository based on the scripts and artifacts it provides.

Introduction

Last year, Joelle Pineau released a reproducibility checklist to facilitate reproducible research presented at major AI conferences (NeurIPS, ICML, ...). Most of the items on the checklist focus on components of the paper itself. One item is "provide a link to the source code," but beyond that, little guidance was given on what that code should contain.

Best practices have been summarized in the ML Code Completeness Checklist, which is now part of the official NeurIPS 2020 code submission process and is available for reviewers to use at their discretion.

ML Completeness Checklist

The ML Code Completeness Checklist checks a code repository for:

  1. Dependencies - Does the repository have dependency information or instructions on how to set up the environment?
  2. Training scripts - Does the repository contain a way to train/fit the model(s) described in the paper?
  3. Evaluation scripts - Does the repository contain a script to calculate the performance of the trained model(s) or run the experiments on the models?
  4. Pretrained models - Does the repository provide free access to pretrained model weights?
  5. Results - Does the repository contain a table/plot of the main results and a script to reproduce those results?

Each repository can receive from 0 (has none) to 5 (has all) ticks. More information on the criteria for each item can be found in the GitHub repository.
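
As a rough illustration, the sketch below shows how the five items might be checked heuristically against a local clone of a repository. The file-name patterns and README keywords are assumptions made for this example, not the official scoring criteria.

```python
# Hypothetical sketch: heuristic check of the five checklist items against a
# local clone of a repository. Patterns below are illustrative assumptions.
from pathlib import Path

def checklist_ticks(repo_dir: str) -> dict:
    repo = Path(repo_dir)
    files = [p.name.lower() for p in repo.rglob("*") if p.is_file()]

    # Read the README, if one exists, for keyword checks.
    readme = ""
    for name in ("README.md", "README.rst", "README.txt"):
        path = repo / name
        if path.exists():
            readme = path.read_text(errors="ignore").lower()
            break

    checks = {
        "dependencies": any(f in files for f in
                            ("requirements.txt", "environment.yml", "setup.py", "dockerfile")),
        "training_script": any("train" in f and f.endswith(".py") for f in files),
        "evaluation_script": any(("eval" in f or "test" in f) and f.endswith(".py") for f in files),
        "pretrained_models": "pretrained" in readme or "checkpoint" in readme,
        "results": "results" in readme or "leaderboard" in readme,
    }
    checks["ticks"] = sum(checks.values())  # 0 to 5
    return checks

print(checklist_ticks("path/to/local/repo"))
```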

What is the evidence that checklist items contribute to more useful repositories?

The community generally uses GitHub stars as a proxy for a repository's usefulness. If that is the case, repositories that score higher on the ML Code Completeness Checklist should also tend to have more GitHub stars. To test this hypothesis, 884 GitHub repositories submitted as official implementations of NeurIPS 2019 papers were collected. A random 25% subset of these 884 repositories was manually scored against the checklist. The sampled repositories were then grouped by the number of ticks they received, and the median number of GitHub stars was computed for each group. The result is shown below:

[Figure: median GitHub stars for the sampled NeurIPS 2019 repositories, grouped by number of checklist ticks]

NeurIPS 2019 repos with 0 ticks had a median of 1.5 GitHub stars. In contrast, repos with 5 ticks had a median of 196.5 GitHub stars. Only 9% of repos had 5 ticks, and most (70%) had 3 ticks or fewer. A Wilcoxon rank-sum test showed that the number of stars in the 5-tick class is significantly higher (p-value < 1e-4) than in all other classes, except for the 5-versus-4 comparison, where the p-value was borderline at 0.015. The data and code for this figure are available in the GitHub repository.
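
For readers who want to run this kind of comparison on their own data, here is a minimal sketch of the grouping and rank-sum test, assuming a data frame with one row per repository. The star counts below are made-up placeholders, not the actual NeurIPS 2019 data.

```python
# Illustrative sketch: median stars per tick count, plus a Wilcoxon rank-sum
# test comparing 5-tick repositories against the rest. Numbers are placeholders.
import pandas as pd
from scipy.stats import ranksums

repos = pd.DataFrame({
    "ticks": [0, 1, 2, 3, 4, 5, 5, 3, 2, 0, 4, 5],
    "stars": [1, 4, 10, 25, 80, 150, 243, 30, 12, 2, 95, 310],
})

# Median GitHub stars per number of checklist ticks.
print(repos.groupby("ticks")["stars"].median())

# Wilcoxon rank-sum test: 5-tick repositories vs. everything else.
five = repos.loc[repos.ticks == 5, "stars"]
rest = repos.loc[repos.ticks < 5, "stars"]
stat, p_value = ranksums(five, rest)
print(f"rank-sum statistic: {stat:.3f}, p-value: {p_value:.4f}")
```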

To test whether this relationship extends more broadly, a script was created to automate the computation of the checklist from a repository's README and associated code. The analysis was then repeated on the entire set of 884 NeurIPS 2019 repositories, as well as on a broader set of 8,926 code repositories for all ML papers published in 2019. In both cases the result was qualitatively the same: median stars increase monotonically with the number of ticks, in a statistically significant way (p-value < 1e-4). Finally, using robust linear regression, pretrained models and results were found to have the greatest positive impact on GitHub stars.
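
The robust-regression step could be sketched along the following lines; the data are synthetic stand-ins for the surveyed repositories, and the column names and coefficients are assumptions chosen for illustration only.

```python
# Hedged sketch: regress (log) star counts on five binary checklist indicators
# with a robust linear model, to see which items carry the most weight.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame(
    rng.integers(0, 2, size=(n, 5)),
    columns=["dependencies", "training", "evaluation", "pretrained", "results"],
)
# Synthetic response on a log-star scale; coefficients here are arbitrary.
log_stars = (0.5 + 0.2 * X.dependencies + 0.3 * X.training + 0.3 * X.evaluation
             + 0.8 * X.pretrained + 0.7 * X.results + rng.normal(0, 0.5, n))

robust_fit = sm.RLM(log_stars, sm.add_constant(X), M=sm.robust.norms.HuberT()).fit()
print(robust_fit.summary())
```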

This is taken as evidence that encouraging researchers to include all of the components covered by the ML Code Completeness Checklist leads to more useful repositories, and that the checklist score is an indicator of higher-quality submissions.

This is not to claim that the proposed five checklist items are the only, or even the most significant, factor in a repository's popularity. Other factors also influence popularity, such as the size of the scientific contribution, marketing (e.g. blog posts and tweets), documentation (comprehensive READMEs, tutorials, and API documentation), code quality, and previous work.

Some examples of NeurIPS 2019 repositories with 5 ticks:

Although the checklist was made as general as possible, it may not be fully applicable to all types of papers, for example theoretical papers or dataset papers. However, even if the main purpose of a paper is to introduce a dataset, it can still benefit from releasing baseline models, including training scripts, evaluation scripts, and results.

Getting started

To make it easier for reviewers and users to understand what is in a repository and to evaluate it correctly, a collection of best practices is provided for writing README.md files, defining dependencies, and releasing pretrained models, datasets, and results. It is recommended that you clearly define these five elements in your repository and link them to any external resources, such as papers and leaderboards, to provide more context and clarity for your users. These are the official guidelines for submitting code to NeurIPS 2020.