Is your AI training data contaminating your model?

By Erik Vogt, VP Enterprise Solutions, Appen

November 19, 2022


Deploying and utilising AI data is an involved process, and there are several common mistakes companies make. Here are five things to watch out for.

In the age of artificial intelligence, market leaders are scrambling for the competitive edge, and deploying AI/ML models is one way to do that, but it’s not always as easy as they would like. Deploying and utilising AI data is an involved process, and there are several common mistakes companies make when getting started.

AI teams tend to run into roadblocks in five key areas. Let’s break down what you can do in each to boost efficiency and ensure you’re training your AI models to create more equitable, safer, and more efficient AI for all.

1. Data Without Biases

Social sciences show that socioeconomic and demographic biases are unfortunately built into the data we use when training any AI models that involve people. Bias is, and always will be, a risk associated with all forms of data, but given that data used in many AI applications originates from real humans and human interactions, biases are both natural and inevitable. Left unchecked, bias can result in undesirable, and in some cases illegal, consequences.

Bias can take many forms, including selection bias, reporting bias, systematic bias, and implicit bias. These biases show up as a difference between what your model predicts and what the true value should be, and understanding them is critical to interpreting any data used in modelling.
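To make the idea of "difference between prediction and true value" concrete, here is a minimal sketch that averages the signed error separately for each group. The function name, the group labels, and the numbers are all hypothetical, and a real audit would use far more data and more careful statistics.

```python
from collections import defaultdict


def mean_signed_error_by_group(records):
    """Return the average (predicted - actual) for each group.

    `records` is an iterable of (group, predicted, actual) tuples.
    A value far from zero for one group suggests the model is
    systematically over- or under-predicting for that group.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += predicted - actual
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical data: (group label, model prediction, true value)
sample = [
    ("A", 10.0, 10.0),
    ("A", 12.0, 11.0),
    ("B", 8.0, 11.0),
    ("B", 9.0, 12.0),
]
errors = mean_signed_error_by_group(sample)
print(errors)
```

In this made-up sample the model is close to the truth for group A but consistently under-predicts for group B, which is the kind of gap worth investigating before deployment.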

One goal for successful modelling is understanding and managing biases in a way that mitigates risk. Unpacking this complex issue first requires AI practitioners to balance their data, its characteristics, and its relevance against the expected outcome. Getting perfect data is not always (ahem, ever) the most cost-effective solution. But the downstream costs of a bad model offering misleading results, such as an unusable product or even damage to your brand, should also be part of the overall ROI analysis.

Start by collecting representative samples. For speech recognition models, for example, consider collecting data from a broad range of your target demographic, including various regional dialects, ages, genders, cultural groups, and even speech impediments. This can be complicated, as your current users may only represent a portion of your target demographics, but not accounting for it can lead to an ouroboros mess: a model that can only perpetuate itself.
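One rough way to check whether a collection is representative is to compare each group's share of the collected samples with its share of the target population. The sketch below assumes you already have a group label per sample and a set of target shares; the region names, the shares, and the 5-point threshold are illustrative assumptions, not recommendations.

```python
from collections import Counter


def representation_gaps(samples, target_shares):
    """Compare observed group shares in `samples` with target shares.

    Returns {group: observed_share - target_share}. Negative values
    mean the group is under-represented relative to the target.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    counts = Counter(samples)
    total = len(samples)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in target_shares.items()
    }


# Hypothetical dialect regions for a speech dataset
collected = ["north"] * 75 + ["south"] * 20 + ["west"] * 5
target = {"north": 0.5, "south": 0.3, "west": 0.2}

gaps = representation_gaps(collected, target)
# Flag groups that fall more than 5 percentage points short of target
under = sorted(group for group, gap in gaps.items() if gap < -0.05)
print(under)
```

Groups flagged this way are candidates for targeted collection or, where that is not practical, for the synthetic data approach discussed next.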

Another strategy is the careful use of synthetic data. If we can’t cost-effectively collect real-world human interactions in some use cases, synthetic data can be generated artificially to train the model on the missing parts, or at least to mitigate the bias in the data you have. In a perfect world, AI models take on a new, holistic role that is grounded in accuracy and true representation, with a mix of synthetic and human data.

2. Focus on Edge Cases

Edge cases are examples you want your model to account for that are either too rare or too ambiguous to capture easily, and they can be especially important if your model variance is too high. A high-variance model doesn’t generalise well and tends not to do well with cases it has not seen before. Edge cases may be scenarios you want your model to handle but that occur too infrequently to collect in a natural environment.

When we do not have sufficient real-world data, because the data would be rare, dangerous, or unethical to create, we are dealing with an edge case. It’s easy to imagine training a hands-free vehicle on sunny days in normal traffic conditions, but training it to handle severe weather, an accident, or a child suddenly running into the road is all but impossible. The problem is that there are potentially a LOT of edge cases, and there is never going to be enough controlled data to cover them.

This is another good use of synthetic data. It can be created in a variety of ways, from something as simple as having actors play out a scene to using platforms designed for gaming or for creating special effects in movies. The resulting data may not be quite the same as the real thing, but it can be close enough to dramatically improve model performance. What seems to be the best solution in many cases is a careful synthesis of real-world data and synthetic data, so if you are asking yourself whether you should collect or create, the answer may be both.

3. The Never-ending Learning Cycle

The reality of life is that the world is constantly changing. User behaviour and expectations change, markets grow and collapse, new products emerge and others die out, markets are occasionally shocked by Black Swan events, as brilliantly described by Nassim Nicholas Taleb, new data becomes available, languages and semantics drift, macroeconomic trends happen under our noses, and even the climate changes.

In our recent State of AI survey, we found that over 90% of respondents retrained their models at least quarterly. An AI model that was built just a few quarters ago could start degrading in predictive accuracy, so it's imperative that we support and empower continued learning.

To counter these inevitable trends, start with implementing an ongoing monitoring system that can evaluate, and reevaluate, AI performance. This can be done in conjunction with iterative improvements. Evaluate your model against expected behaviour regularly, adjust as needed, test again, and repeat. This gives data scientists a fighting chance to create and utilise a continuous feedback loop that works to detect deviations and decide when to update or enrich your training corpora. As organisations, we are expected to adapt to and be prepared for any industry change, so why should our AI data be any different?
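The evaluate, adjust, and repeat loop described above can be sketched as a small monitor that tracks accuracy over the most recent predictions and flags when it drifts below an agreed baseline. The class name, the baseline of 0.90, the tolerance, and the window size are all assumptions chosen for illustration; real monitoring would also watch input distributions and other metrics.

```python
from collections import deque


class AccuracyMonitor:
    """Track rolling accuracy and flag when it drops below a baseline."""

    def __init__(self, baseline, tolerance=0.05, window=100):
        self.baseline = baseline      # accuracy measured at deployment
        self.tolerance = tolerance    # allowed drop before flagging
        self.outcomes = deque(maxlen=window)  # keeps only recent results

    def record(self, predicted, actual):
        """Store whether one prediction matched the confirmed label."""
        self.outcomes.append(predicted == actual)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        """True when recent accuracy is below baseline minus tolerance."""
        accuracy = self.rolling_accuracy()
        if accuracy is None:
            return False
        return accuracy < self.baseline - self.tolerance


monitor = AccuracyMonitor(baseline=0.90, tolerance=0.05, window=10)

# Nine correct predictions and one miss: still within tolerance
for predicted, actual in [(1, 1)] * 9 + [(1, 0)]:
    monitor.record(predicted, actual)
healthy = monitor.needs_review()

# Four more misses push the oldest results out of the window
for predicted, actual in [(1, 0)] * 4:
    monitor.record(predicted, actual)
degraded = monitor.needs_review()
print(healthy, degraded, monitor.rolling_accuracy())
```

A flag from a monitor like this is a prompt for a human decision about whether to update or enrich the training corpora, not an automatic trigger.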

4. Safeguarding the Data Privacy Lifecycle

The issue of privacy and fair use of data is big – really big – and is only getting bigger. Privacy is at the centre of major legislation such as GDPR, HIPAA, and the California Privacy Rights Act of 2020, and the legal consequences of mismanaging private data can keep risk analysts up at night, but privacy is also of utmost importance to many brands. Public expectations for data usage will only continue to grow.

One of the promises of AI is personalisation, but personalisation is based on personal data, as are many AI models. Some consumers already distrust technology, and reports of the misuse of data will amplify that over time. Consumers increasingly expect AI leaders to be vocal about how user data is being used, to give them control over how it is used, and to explain how their privacy will be protected.

To address these very legitimate concerns, organisations must ensure that the data they hold from or about identifiable individuals is properly anonymised or fiercely protected. Consumers have various legal rights, and brand owners have a significant fiduciary obligation to properly safeguard personally identifiable information. Unsurprisingly, ensuring this protection can be an uphill battle, but companies can do a lot to mitigate these concerns throughout the development, deployment, and updating of their AI models.

When consent is needed, the conditions attached to using the data should be clear; and when consent is revoked, data scientists must know how to remove the data. This doesn’t just apply to the first interaction, but to the entire AI lifecycle, from collection and storage to usage and destruction.

One good way to develop new models while protecting privacy is to start with anonymous, pre-labelled datasets. Another is to filter personal data through named entity recognition and perhaps replace personal information with entity tags. Anonymised data can inherently protect user privacy, and, when deployed in large amounts, ensures that there is enough data available to make accurate and well-informed AI models. In other cases, synthetic data, or in this case “fake PII”, can be used to make the training data realistic, without anything being traceable to the identity of source contributors.
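As a rough illustration of replacing personal information with entity tags, the sketch below swaps email addresses and phone numbers for `[EMAIL]` and `[PHONE]` placeholders. It uses simple regular expressions as a stand-in for a trained named entity recognition model, so it is an assumption-laden simplification: the patterns, tag names, and sample message are hypothetical, and pattern matching alone will miss names, addresses, and many other identifiers.

```python
import re

# Illustrative patterns only. A production pipeline would rely on a
# trained named entity recognition model and cover far more entity types.
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("PHONE", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def redact(text):
    """Replace each matched span with its entity tag, e.g. [EMAIL]."""
    for tag, pattern in PATTERNS:
        text = pattern.sub(f"[{tag}]", text)
    return text


message = "Contact jane.doe@example.com or call +1 555 010 9999 today."
redacted = redact(message)
print(redacted)
```

Keeping the tag in place of the removed value preserves the sentence structure, so the text remains useful for training while the original identifier is no longer present.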

5. Fostering a Holistic Approach

Compiling and utilising data for AI/ML models is no small feat, as it can require large collections of datasets spread across multiple regions, use cases, industries and scenarios. Speech and language recognition software requires training data that covers dialects, accents and impediments from a wide range of users. Facial recognition, from security scanners to in-cabin interfaces, requires data from many shapes and sizes of people to be inclusive and useful.

Add anti-discrimination regulation and the processes needed to comply with it, and there are many disparate data sources (including augmented or synthetic data) that challenge data scientists to bring it all together to deliver meaningful business value. To streamline this, work with a single provider that can help you integrate data efficiently, manage the data lifecycle, and reduce the total cost of ownership needed to train, deploy, and retrain models that address real business problems.

While we now know that building, deploying and maintaining artificial intelligence and machine learning models is not an easy task, it’s usually not impossible. At the very least, solving focused problems with good planning will help improve the AI training process and, in turn, help businesses that are leveraging AI and machine learning become more operationally efficient in the future.

#AI #Training #data #bias