‘Cheap tricks’ and data mismanagement stymie AI progress

By Paddy Smith
Researchers say machine learning models are compromised by poor data management and crude extrapolation...

Researchers say “cheap tricks” and data mismanagement are hobbling AI’s ability to live up to its full potential and are entrenching bias.

The comments come from Data and Its (Dis)contents, a survey of dataset development and use in machine learning research, authored by University of Washington linguists Amandalynne Paullada and Emily M Bender, the Mozilla Foundation’s Inioluwa Deborah Raji, and Emily Denton and Alex Hanna of Google Research.

Foundational role

In the paper, they say, “Datasets have played a foundational role in the advancement of machine learning research… However, recent work from a breadth of perspectives has revealed the limitations of predominant practices in dataset collection and use.”

The team points to computer vision and natural language processing as areas where bias has been identified in ML datasets. Underrepresentation of dark skin tones and of female pronouns has been found in western data catalogues, while image classification datasets have been found to contain pornographic images, leading to their removal.

The report says: “While deep learning models have seemed to achieve remarkable performance on challenging tasks in artificial intelligence, recent work has illustrated how these performance gains may be due largely to ‘cheap tricks’ rather than human-like reasoning capabilities.”

‘Much to learn’

It goes on to say that “the machine learning community still has much to learn from other disciplines with respect to how they handle the data of human subjects”.

In a conclusion bearing an epigraph from Toni Cade Bambara (“Not all speed is movement”), the researchers write, “We argue that fixes that focus narrowly on improving datasets by making them more representative or more challenging might miss the more general point raised by these critiques, and we’ll be trapped in a game of dataset whack-a-mole rather than making progress, so long as notions of ‘progress’ are largely defined by performance on datasets.

“At the same time, we wish to recognize and honor the liberatory potential of datasets, when carefully designed, to make visible patterns of injustice in the world such that they may be addressed (see, for example, the work of Data for Black Lives). Recent work by Register and Ko [2020] illustrates how educational interventions that guide students through the process of collecting their own personal data and running it through machine learning pipelines can equip them with skills and technical literacy toward self-advocacy – a promising lesson for the next generation of machine learning practitioners and for those impacted by machine learning systems.”

The team advocates a more careful approach to future data collection, even at the expense of large catalogues, in order to protect individual liberties and to mitigate technology’s effects on the people it touches.
