‘Cheap tricks’ and data mismanagement stymie AI progress

By Paddy Smith
Researchers say machine learning models are compromised by poor data management and crude extrapolation...

Researchers say “cheap tricks” and data mismanagement are hobbling AI’s ability to live up to its full potential and are promoting bias.

The comments come from a survey of dataset development and its use in ML research, Data and its (dis)contents, authored by University of Washington linguists Amandalynne Paullada and Emily M Bender, the Mozilla Foundation’s Inioluwa Deborah Raji, and Emily Denton and Alex Hanna of Google Research.

Foundational role

In the paper, they say, “Datasets have played a foundational role in the advancement of machine learning research… However, recent work from a breadth of perspectives has revealed the limitations of predominant practices in dataset collection and use.”

The team points to visual and natural language processing as areas where bias has been identified in ML datasets. Underrepresentation of dark skin tones and of female pronouns has been found in western data catalogues, while image classification datasets have been found to contain pornographic images, leading to their removal.

The report says: “While deep learning models have seemed to achieve remarkable performance on challenging tasks in artificial intelligence, recent work has illustrated how these performance gains may be due largely to ‘cheap tricks’ rather than human-like reasoning capabilities.”

'Much to learn'

It goes on to say that “the machine learning community still has much to learn from other disciplines with respect to how they handle the data of human subjects”.

In a conclusion bearing an epigraph from Toni Cade Bambara (“Not all speed is movement”), the researchers said, “We argue that fixes that focus narrowly on improving datasets by making them more representative or more challenging might miss the more general point raised by these critiques, and we’ll be trapped in a game of dataset whack-a-mole rather than making progress, so long as notions of ‘progress’ are largely defined by performance on datasets.

“At the same time, we wish to recognize and honor the liberatory potential of datasets, when carefully designed, to make visible patterns of injustice in the world such that they may be addressed (see, for example, the work of Data for Black Lives). Recent work by Register and Ko [2020] illustrates how educational interventions that guide students through the process of collecting their own personal data and running it through machine learning pipelines can equip them with skills and technical literacy toward self-advocacy – a promising lesson for the next generation of machine learning practitioners and for those impacted by machine learning systems.”

The team advocates a more careful approach to data collection in future, even at the expense of large catalogues, in order to protect individual liberties and to account for technology’s effects on the people it touches.

