‘Cheap tricks’ and data mismanagement stymie AI progress

By Paddy Smith
Researchers say machine learning models are compromised by poor data management and by ‘cheap tricks’ that pass for reasoning.

Researchers say “cheap tricks” and data mismanagement are hobbling AI’s ability to live up to its full potential and are promoting bias.

The comments come from Data and Its (Dis)Contents, a survey of dataset development and use in machine learning research authored by University of Washington linguists Amandalynne Paullada and Emily M Bender, the Mozilla Foundation’s Inioluwa Deborah Raji, and Emily Denton and Alex Hanna of Google Research.

Foundational role

In the paper, they say, “Datasets have played a foundational role in the advancement of machine learning research… However, recent work from a breadth of perspectives has revealed the limitations of predominant practices in dataset collection and use.”

The team points to visual and natural language processing as areas where bias has been identified in ML datasets. Underrepresentation of dark skin tones and of female pronouns has been found in western data catalogues, while image classification datasets have been found to contain pornographic images, prompting their removal.
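As a rough illustration of the kind of representation audit the researchers describe, the sketch below counts gendered pronouns in a small text corpus. The corpus, pronoun lists and tokenisation are hypothetical simplifications for the example; a real audit of the datasets the paper surveys would be considerably more involved.

```python
from collections import Counter
import re

# Hypothetical stand-in for a large text dataset.
corpus = [
    "He presented the results and she reviewed them.",
    "He said the model was ready.",
    "They asked him to rerun the experiment.",
]

# Deliberately crude pronoun groups, used only for this illustration.
FEMALE = {"she", "her", "hers"}
MALE = {"he", "him", "his"}

counts = Counter()
for text in corpus:
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in FEMALE:
            counts["female"] += 1
        elif token in MALE:
            counts["male"] += 1

total = sum(counts.values()) or 1
for group, n in counts.items():
    print(f"{group}: {n} ({n / total:.0%} of gendered pronouns)")
```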

The report says: “While deep learning models have seemed to achieve remarkable performance on challenging tasks in artificial intelligence, recent work has illustrated how these performance gains may be due largely to ‘cheap tricks’ rather than human-like reasoning capabilities.”

‘Much to learn’

It goes on to say that “the machine learning community still has much to learn from other disciplines with respect to how they handle the data of human subjects”.

In a conclusion bearing an epigraph from Toni Cade Bambara (“Not all speed is movement”), the researchers said, “We argue that fixes that focus narrowly on improving datasets by making them more representative or more challenging might miss the more general point raised by these critiques, and we’ll be trapped in a game of dataset whack-a-mole rather than making progress, so long as notions of ‘progress’ are largely defined by performance on datasets.

“At the same time, we wish to recognize and honor the liberatory potential of datasets, when carefully designed, to make visible patterns of injustice in the world such that they may be addressed (see, for example, the work of Data for Black Lives). Recent work by Register and Ko [2020] illustrates how educational interventions that guide students through the process of collecting their own personal data and running it through machine learning pipelines can equip them with skills and technical literacy toward self-advocacy – a promising lesson for the next generation of machine learning practitioners and for those impacted by machine learning systems.”

The team advocates a more careful approach to data collection in future, even at the expense of building large catalogues, in order to protect individual liberties and give proper weight to the technology’s effect on the people it touches.
