Issues with unstructured data in AI’s quest for more input

Data Processing
The art of refining raw data to power AI models often goes unnoticed, until a lack of robust processing leads to poor outputs fuelled by low-quality input

In the relentless pursuit of AI dominance, where the world's tech giants and innovative startups alike are locked in a data arms race, the industry risks discovering that its eyes are bigger than its stomach.

Companies are seeking more and more data to fill ever-growing, complex AI models, models which are currently revolutionising businesses across sectors through the likes of GenAI trailblazer ChatGPT.

But what is the fuel propelling this revolution? Data! Like the Wizard of Oz working behind the curtain, data is being taken in, meticulously processed and refined to feed the insatiable appetites of AI models.

Yet the art of data processing often goes unnoticed, overshadowed by the dazzling achievements of AI itself. For many working behind the curtain, a problem is looming.

For businesses, it is a financial one; for users of these systems, it can mean poor-quality output.

Details of data processing 

Data processing is the bridge between the digital world and the insights we seek to uncover. It takes the raw data generated and transforms it into a format that AI systems can understand and utilise in tasks like machine learning, where models learn patterns and make predictions, or natural language processing, where AI systems understand and generate human language.

The data processing process

This can be an easy task if, for instance, you want to extract data from an Excel sheet. With numbers neatly placed in their respective labelled columns and rows, getting a machine to extract, interpret and act on the data is relatively straightforward.

“The problem on the other hand, let’s say we want to use AI to provide a score for how positive or negative each comment on X (formerly Twitter) was about a specific company,” says Alan Jacobson, Chief Data and Analytics Officer, Alteryx. “We likely would not have shared agreements as humans assessing the words, let alone agreeing with a computer method of scoring. This fuzziness is where the challenges arise.”
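The fuzziness Jacobson describes can be made concrete with a toy experiment. Below is an illustrative sketch (not Alteryx's method, and using made-up word lists) of two deliberately naive sentiment scorers that reach opposite conclusions about the same comment, simply because they weight the words differently:

```python
# Two deliberately naive sentiment scorers, to illustrate how even simple
# methods can disagree about the same comment. The word lists and weighting
# schemes here are illustrative assumptions, not any standard lexicon.

POSITIVE = {"great", "love", "solid", "improved"}
NEGATIVE = {"overpriced", "slow", "buggy", "disappointing"}

def score_by_count(comment: str) -> float:
    """Score = (#positive words - #negative words) / total words."""
    words = comment.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / max(len(words), 1)

def score_last_clause(comment: str) -> float:
    """Weight only the final clause, on the theory that it carries the verdict."""
    last = comment.lower().split(",")[-1].split()
    pos = sum(w in POSITIVE for w in last)
    neg = sum(w in NEGATIVE for w in last)
    return float(pos - neg)

comment = "Great hardware, great screen, but support was disappointing"
print(score_by_count(comment))    # > 0: counting every word reads it as positive
print(score_last_clause(comment)) # < 0: the closing clause reads as negative
```

Two reasonable-sounding rules, one comment, opposite signs; scale that disagreement across millions of posts and the challenge of scoring unstructured text becomes clear.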

This is the issue with unstructured and heterogeneous data sources in AI projects: there are insights to be gained, but feeding them into a system unrefined can skew models.

To tackle this, companies can employ data cleaning methods. These tackle errors and inconsistencies, ensuring the information is accurate and complete before it goes to the system.
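What a cleaning pass looks like in practice can be sketched in a few lines. The field names and rules below are illustrative assumptions (real pipelines use far richer validation), but they show the typical checks: completeness, consistency, accuracy and uniqueness:

```python
# A minimal data-cleaning sketch: validate and normalise raw records before
# they reach a model. Field names and rules are illustrative assumptions.

def clean(records: list[dict]) -> list[dict]:
    seen = set()
    cleaned = []
    for rec in records:
        # Completeness: drop records missing required fields.
        if not rec.get("user_id") or rec.get("amount") is None:
            continue
        # Consistency: normalise inconsistent formatting.
        rec = {**rec, "country": rec.get("country", "").strip().upper()}
        # Accuracy: coerce numeric strings; drop records that fail.
        try:
            rec["amount"] = float(rec["amount"])
        except (TypeError, ValueError):
            continue
        # Uniqueness: drop exact duplicates.
        key = (rec["user_id"], rec["amount"], rec["country"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned

raw = [
    {"user_id": "u1", "amount": "19.99", "country": " uk "},
    {"user_id": "u1", "amount": "19.99", "country": "UK"},   # duplicate
    {"user_id": "",   "amount": "5.00",  "country": "FR"},   # missing id
    {"user_id": "u2", "amount": "oops",  "country": "DE"},   # bad number
]
print(clean(raw))  # only one clean, normalised record survives
```

Of the four raw records, only one reaches the model, which is exactly the point: the system sees accurate, complete data or nothing.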

Yet, AI systems are data hungry. “By their nature AI models must be trained on very large datasets, so the data management capacity required is exponentially larger,” explains Siddharth Rajagopal, Chief Architect EMEA, Informatica.

With hundreds of terabytes of data now at play in large language models, it is no longer possible to check each data element to the degree that smaller, structured models have historically been reviewed.

To process data at this scale, companies must therefore call upon ever-greater computational resources, which can be both time-consuming and expensive.

The ability to refine data for these ever-growing AI models may therefore be out of reach for some.

The multimodal future

Equally, as AI models evolve, they are increasingly processing image and audio data to support language models.

Although this brings a wealth of new data and understanding, this multimodal approach requires the development of more sophisticated data processing techniques to handle the diverse data types effectively.

"Processing unstructured data like images and audio introduces additional challenges, such as feature extraction, representation learning, and integration with textual data," notes Richard Fayers, Head of the Data and Analytics Practice at Slalom. "Techniques like computer vision, speech recognition, and multimodal fusion will become increasingly important to enable AI models to understand and generate rich, multimodal content."

The balancing act: quantity vs. quality

While the volume of data available is staggering, the true challenge lies in striking a balance between quantity and quality. 

“Without the right amount and quality of data, there’s no AI,” explains Rajagopal.

Poor data quality can have severe consequences, with MIT estimating it to cost most companies a staggering 15-20% of revenue. 

However, being too puritanical in pursuit of data quality can also lead to slower development rollouts or more limited models, which could prove costly in the midst of an AI race that is seemingly fuelled by sentiments similar to Mark Zuckerberg’s motto of “Move fast and break things”.

“It’s much more about prioritisation and taking the right proactive steps, than balancing the needs,” says Fayers. “You shouldn’t sacrifice the pre-processing to enable (say) larger more complex models, as this can compromise the quality and accuracy of the data feeding into them, ultimately affecting their performance and reliability.”

Safeguarding the ecosystem

As the proliferation of AI-generated content continues, robust data processing becomes increasingly crucial to ensure model quality.

“We believe that one of the major drivers of data processing in AI will involve the detection and filtering out of AI-generated content from training datasets,” says Fayers. “Think of it as 'Mad Cow Disease' for AI – we need data processing to detect and stop AI Farms pumping out low-quality content, otherwise we will enter into an ultimate Vicious AI Circle resulting in disinformation, conspiracy theories, propaganda, and misinformation informing our decision-making process."

A report from Europol, the European law enforcement agency, states that 90% of online content will be AI-generated by 2026. In addition to distorting quality, some researchers believe this AI-feeding-AI phenomenon could lead to model collapse: a gradual degradation in output quality.
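One crude way to screen a training corpus, sketched below under strong assumptions, is to flag documents with unusually repetitive phrasing, a pattern common in low-quality machine-generated spam. The trigram heuristic and the threshold are purely illustrative; production detectors combine far stronger signals such as trained classifiers, provenance metadata and watermark checks:

```python
# Illustrative sketch: flag suspiciously repetitive text before it enters a
# training set. The trigram heuristic and 0.2 threshold are assumptions for
# demonstration only, not a real detector for AI-generated content.

def trigram_repetition(text: str) -> float:
    """Fraction of word trigrams that repeat one already seen."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    return 1 - len(set(trigrams)) / len(trigrams)

def keep_for_training(text: str, threshold: float = 0.2) -> bool:
    """Keep a document only if its repetition score is below the threshold."""
    return trigram_repetition(text) < threshold

spammy = "best deal today best deal today best deal today best deal today"
natural = "the report examines how data quality shapes model behaviour over time"
print(keep_for_training(spammy))   # False: 70% of its trigrams are repeats
print(keep_for_training(natural))  # True: every trigram is unique
```

Heuristics like this are cheap to run at corpus scale, which is why simple filters often sit in front of the heavier, model-based detectors Fayers alludes to.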

Safeguarding models is therefore key, and Rajagopal believes the cloud could play a central role here. “For AI to be successful, it needs access to an intelligent data management cloud capable of quickly identifying all the necessary features for the model.”

Although, Rajagopal notes, "Figuring out how to provision the huge amount of compute power needed is a challenge.” In lieu of such a cloud, responsible data management practices implemented at the human or organisational level will become increasingly important as AI advances.
“The future of data processing in building AI models primarily concerns ensuring that a business using AI has a robust understanding of its data estate and an adequate data governance system in place,” explains Andy Crisp, SVP Global Data Owner, Dun & Bradstreet. “AI is only as smart as the data that fuels it, and so introducing policies and adhering to data quality standards is imperative to preventing the influx of substandard data that could compromise AI outputs. A solid data foundation not only enhances the accuracy of predictions but also fortifies the insights derived from AI algorithms, creating a robust framework for smarter decision-making.”
