The rise and fall (and rise) of datasets


Growing criticism of datasets built from user-generated data retrieved from the web has led to many popular benchmarks being withdrawn or redacted. Their afterlife, in copies or subsets that continue to be used, is concerning.

The rapid pace of development in machine learning research over the past two decades has been fueled, in large part, by the availability of large reference datasets of images, videos, text and more. These make it possible to compare and evaluate algorithms, and help to define research objectives. However, in recent years, the machine learning community has identified an alarming number of potential legal and ethical issues with many of the most popular image datasets, such as representational harms, bias effects, violations of privacy and unclear or questionable downstream use1,2.

Widely used datasets such as ImageNet, Tiny Images, MegaFace and MS-Celeb-1M usually contain images pulled from the internet, especially from sharing platforms such as Flickr. This often happens without the explicit permission or even knowledge of the people who generated the data. Training machine learning algorithms on copyrighted data is generally considered “fair use” on the grounds that it amounts to a transformative use of the original data. This principle was reinforced by a 2015 US court decision in Authors Guild v. Google. The Authors Guild challenged Google’s right to scan books for its book search algorithms, but the court ruled that it was not illegal to scan copyrighted books in order to mine data and develop search algorithms.

Additionally, photos and other user-generated data are often published on platforms such as Flickr with Creative Commons licenses, which go beyond restrictive copyright and encourage sharing and reuse. However, neither the principle of fair use nor Creative Commons licenses should be taken to imply that this content is up for grabs, as there are many ethical, legal and technical issues to consider beyond copyright.

Several recent surveys1,2,3,4 point to a range of concerns. Consider the case of ImageNet, which was established over a decade ago and is one of the most influential computer vision datasets. It contains 14 million images in over 20,000 categories, hand-annotated by Amazon Mechanical Turk (MTurk) workers. Recent analysis, including by the creators of ImageNet themselves5, revealed many problematic annotations, especially ones that are offensive or biased. More than half of the tags in the people subtree were considered potentially dangerous, and as a result 600,000 images were removed from ImageNet.

A fundamental underlying problem that has become clear over the years is that datasets are not neutral, but represent particular social and political norms, which may specifically affect marginalized groups4. In hindsight, there should have been concerns about the ethics of taking user-generated data from the web, relying on non-expert crowdsourced annotators, and giving unrestricted access to developers, including those working on sensitive applications such as facial recognition and biometric monitoring.

Take, for example, Microsoft Celeb (MS-Celeb-1M), a dataset of 10 million face images taken from the internet. Although most of the images are photos of actors, many other people who have a professional presence online are included, such as journalists, human rights activists, academics and authors. A recent report, after which Microsoft deleted the dataset, said that the images were being used in facial recognition applications by various organizations, including Huawei, SenseTime and IBM, without the individuals’ knowledge or consent.

Several datasets have now been deleted or, as in the case of ImageNet, heavily redacted. In practice, however, they remain widely available and in use, either in their original form, such as via online torrents, or in derivative form, as subsets or modifications of the original, or as models pretrained on the deprecated dataset1. In many cases, the deprecation was silent or the status of the dataset remained ambiguous. For example, Microsoft took down the MS-Celeb-1M dataset website, stating that the project was complete, but to date there has been no clear public announcement and the dataset still exists in various repositories. Another example is MegaFace, whose landing page states that the dataset is decommissioned but does not mention the ethical concerns raised about it, for instance in a recent New York Times article.

A more encouraging example is the Tiny Images dataset, for which the hosts at MIT announce on the landing page that the dataset has been retired, clearly citing the ethical concerns raised in a recent analysis3, and ask researchers not to use the dataset. However, Correy et al.4 report that many large retracted datasets still have an active afterlife, which perpetuates the harms that have been identified, and they argue that a consistent approach to retraction is needed. For example, hosts must make a clear announcement that outlines the reasons for deprecation, and they must have a clear execution plan and timeline for it. The authors further argue that a central repository, maintained by the machine learning community, is needed to house deprecated datasets.

There is also a role for conferences and journals. In particular, submission guidelines should require authors to list and describe the datasets generated and analyzed, and authors should ensure that none of the datasets they have used have been withdrawn. Like the other Nature Research journals, Nature Machine Intelligence pays particular attention to data citation and data availability statements, but it will also monitor the use of withdrawn datasets in manuscripts that are sent for peer review and, if necessary, ask authors to use alternative datasets. There may be exceptions where the use of deprecated datasets is permitted, for example when the biases and adverse effects of the datasets themselves are being investigated. We will seek expert advice in such cases.

Moving forward, a fundamental shift in dataset culture is needed1,2. Peng et al.1 emphasize that harm mitigation and management are necessary throughout the lifecycle of a dataset, as ethical impacts are difficult to anticipate and address at the time of dataset creation, and ethical and social standards may also change over time. Creators should monitor usage of their datasets, update licenses and documentation, and limit access as necessary. We will closely follow developing community standards and support authors in reporting responsibly on datasets.

References

  1. Peng, K., Mathur, A. & Narayanan, A. Preprint at (2021).

  2. Paullada, A. et al. Patterns (N Y) 2, 100336 (2021).

  3. Prabhu, V. U. & Birhane, A. Preprint at (2020).

  4. Correy, F. et al. Preprint at (2021).

  5. Yang, K. et al. in Proc. 2020 Conference on Fairness, Accountability and Transparency 547–558 (2020).

Cite this article

The rise and fall (and rise) of datasets.
Nat. Mach. Intell. (2022).
