‘Cheap tricks’ and data mismanagement stymie AI progress

Researchers say “cheap tricks” and data mismanagement are hobbling AI’s ability to live up to its full potential and are promoting bias.
The comments come from a survey of dataset development and use in machine learning research, Data and Its (Dis)contents, authored by University of Washington linguists Amandalynne Paullada and Emily M Bender, the Mozilla Foundation’s Inioluwa Deborah Raji, and Emily Denton and Alex Hanna from Google Research.
Foundational role
In the paper, they say, “Datasets have played a foundational role in the advancement of machine learning research… However, recent work from a breadth of perspectives has revealed the limitations of predominant practices in dataset collection and use.”
The team points to computer vision and natural language processing as areas where bias has been identified in ML datasets. Underrepresentation of dark skin tones and of female pronouns has been found in western data catalogues, while image classification datasets have been found to include pornographic images, prompting their removal.
The report says: “While deep learning models have seemed to achieve remarkable performance on challenging tasks in artificial intelligence, recent work has illustrated how these performance gains may be due largely to ‘cheap tricks’ rather than human-like reasoning capabilities.”
'Much to learn'
It goes on to say that “the machine learning community still has much to learn from other disciplines with respect to how they handle the data of human subjects”.
In a conclusion bearing an epigraph from Toni Cade Bambara (“Not all speed is movement”), the researchers said, “We argue that fixes that focus narrowly on improving datasets by making them more representative or more challenging might miss the more general point raised by these critiques, and we’ll be trapped in a game of dataset whack-a-mole rather than making progress, so long as notions of ‘progress’ are largely defined by performance on datasets.
“At the same time, we wish to recognize and honor the liberatory potential of datasets, when carefully designed, to make visible patterns of injustice in the world such that they may be addressed (see, for example, the work of Data for Black Lives). Recent work by Register and Ko [2020] illustrates how educational interventions that guide students through the process of collecting their own personal data and running it through machine learning pipelines can equip them with skills and technical literacy toward self-advocacy – a promising lesson for the next generation of machine learning practitioners and for those impacted by machine learning systems.”
The team advocates a more careful approach to data collection in future, even at the expense of large catalogues, in order to protect individual liberties and to account for the effect of the technology, and the solutions built on it, on people.