Generative AI and ML fuelling a revolution in data quality
A new data quality revolution is underway, powered by models such as ChatGPT that use generative AI and machine learning techniques. Although the use of AI in the data industry has, until now, mainly focused on predictive analytics, today we are entering an era of creative generative AI, where a powerful tool for NLP, data analysis, and automation will shape the future of data management and data quality.
AI has been used in the data industry since the 1950s and 1960s, when early AI programs were developed to process and analyse data. Those programs used rule-based systems, symbolic reasoning, and expert systems to make inferences and generate insights from data. Today, use of AI has accelerated dramatically: according to the Data and Analytics Leadership Annual Executive Survey 2023, 80.5% of data executives indicate that AI/ML will be an area of increased data and analytics investment during 2023, and it will be the highest investment priority for 16.3% of them, followed by data quality for 10.6% of organisations.
“Data quality is a make-or-break aspect of data management,” explains Davide Pelosi, Manager, Solutions Engineering at data integration and data integrity leader Talend. “It ensures that businesses can make informed decisions based on accurate, complete, and consistent information. When data quality is poor, it can lead to errors in decision-making, loss of revenue, and damage to a brand's reputation.
“Fortunately, software solutions providers are leading the charge in innovative data quality tools and techniques that help businesses identify and fix data quality problems quickly and efficiently,” he says. “However, there's still much work to be done. In a recent survey, 97% of the people Talend surveyed indicated they face challenges in using data, and their top concern is ensuring data quality, coming in first with almost half of all respondents (49%). That means there's a massive opportunity for improvement - and the rewards can be huge for businesses that get it right.”
Data quality workflows of the future
According to a report by Gartner, at least 50% of all data management tasks will be automated by 2025. Most will be completed using AI/ML-powered automation, such as generative language models. It's time for old-fashioned data management techniques to move aside, as ChatGPT and other generative language applications promise to shake up the market.
From content creation to development task automation, these technologies are already making waves in the business world – and their impact on data management and data quality initiatives is, quite frankly, exciting.
“By automating and simplifying data management tasks like never before, these technologies promise to revolutionise how organisations handle their data,” Pelosi comments. “The prospect of next-level automation and efficiency means it's easier than ever for businesses to ensure their data's accuracy, completeness, and consistency. Let's take the example of a data quality workflow.
“First, a technical data quality assessment is conducted using machine-learning algorithms to identify anomalies and quantify the severity of the issues. Then, based on the assessment results, generative language models can be used to suggest data quality rules and transformations in natural language text that business stakeholders can easily understand.”
From there, these proposed rules can be reviewed and validated by data quality experts and business stakeholders, who may accept or reject them or suggest modifications to better align with their business requirements.
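The first step of this workflow, the technical assessment, can be illustrated with a small sketch. It is not Talend's method: it assumes a single numeric column, swaps in a simple robust statistic (the modified z-score, based on the median and the median absolute deviation) in place of a trained machine-learning model, and uses hypothetical names such as `assess_column` and the sample `ages` list. It flags missing and outlying values and reports a score as a rough measure of severity.

```python
from statistics import median


def assess_column(values, threshold=3.5):
    """Flag missing values and outliers in a numeric column.

    Outliers are scored with the modified z-score (median and MAD).
    Assumes the column holds at least one non-missing value.
    """
    present = [v for v in values if v is not None]
    med = median(present)
    mad = median(abs(v - med) for v in present)

    findings = []
    for row, value in enumerate(values):
        if value is None:
            findings.append({"row": row, "issue": "missing", "score": None})
            continue
        # 0.6745 rescales the MAD so scores are comparable to a z-score.
        score = 0.0 if mad == 0 else 0.6745 * abs(value - med) / mad
        if score > threshold:
            findings.append({"row": row, "issue": "outlier", "score": round(score, 2)})

    return {
        "rows": len(values),
        "issue_rate": len(findings) / len(values),
        "findings": findings,
    }


# Hypothetical sample: one missing age and one implausible age.
ages = [34, 29, 41, None, 38, 420, 27, 33]
report = assess_column(ages)
```

A report of this kind (which rows are affected, what kind of issue, how severe) is the sort of input that a generative language model could then be asked to turn into proposed rules in plain language.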
“Businesses can also create additional Business Rules simply by asking in natural language, without needing development or complex UIs,” Pelosi adds. “For example, a business user might ask, ‘Please raise the acceptable age to drink alcohol to 18 and mark all the people not following the rule as not being targeted for the spring marketing campaign’, like we do today with Alexa. Once the rules are accepted, they can be converted into executable code, such as Python or SQL, using a similar, template-based approach.
“Of course, before deploying the code to production, it will need to be tested and validated using a sample of data to ensure the rules are working as expected and the data quality metrics are being met. But, once done, the cleaned data can be used for various downstream tasks, from data analysis and visualisation to machine learning and business intelligence.
“Picture this: the world of data management and quality is about to undergo a significant transformation, and we've got a sneak peek at what's coming. Although the use of generative language models in this field is still in its infancy and is being researched by industry experts, there are already some jaw-dropping research projects and prototypes out there that show the mind-boggling potential of this technology.”
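The rule-to-code and sample-testing steps Pelosi describes can be sketched roughly as follows, using his spring marketing campaign example. Everything here is an assumption made for illustration: the template text, the `customers` table, its column names, and the helpers `render_rule` and `validate_on_sample` are hypothetical and do not describe any vendor's product. The sketch fills a SQL template with an accepted rule, then runs it against a small in-memory sample before anything reaches production.

```python
import sqlite3
from string import Template

# Hypothetical template for a rule of the form
# "exclude from a campaign everyone below a minimum age".
RULE_TEMPLATE = Template(
    "UPDATE $table SET $flag_column = 0 WHERE $age_column < :min_age"
)


def render_rule(table, flag_column, age_column):
    """Fill the template with identifiers; the threshold stays a bound parameter."""
    for name in (table, flag_column, age_column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    return RULE_TEMPLATE.substitute(
        table=table, flag_column=flag_column, age_column=age_column
    )


def validate_on_sample(sql, min_age, sample_rows):
    """Apply the rule to an in-memory sample table and return the outcome."""
    conn = sqlite3.connect(":memory:")
    # The sample schema is assumed; every row starts as targeted (flag = 1).
    conn.execute(
        "CREATE TABLE customers (name TEXT, age INTEGER, spring_campaign INTEGER)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, 1)", sample_rows)
    conn.execute(sql, {"min_age": min_age})
    rows = conn.execute(
        "SELECT name, spring_campaign FROM customers ORDER BY name"
    ).fetchall()
    conn.close()
    return rows


sql = render_rule("customers", "spring_campaign", "age")
result = validate_on_sample(sql, 18, [("Ana", 17), ("Ben", 18), ("Cleo", 45)])
```

In this sample, only the 17-year-old is removed from the campaign. Keeping the age threshold as a bound parameter rather than pasting it into the SQL text is a design choice: the same reviewed statement can be reused if the business later changes the number.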
Great care needed to deal with current limitations
With human-level performance on various professional and academic benchmarks, the latest version of OpenAI’s GPT technology, GPT-4, is highly impressive.
Despite its capabilities, however, GPT-4 has limitations similar to those of earlier GPT models. “Most importantly, it still is not fully reliable,” the OpenAI team says. “Great care should be taken when using language model outputs, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matching the needs of a specific use-case.”
Aaron Kalb, Chief Strategy Officer and Co-Founder at Alation, reiterates this final point: tools like GPT should not yet be trusted to advise on important decisions.
“That’s because it’s designed to generate content that simply looks correct with great flexibility and fluency, which creates a false sense of credibility and can result in so-called AI ‘hallucinations’.”
As Kalb – who, when working at Apple, was part of the founding team behind its groundbreaking Siri voice assistant – explains, the authenticity and ease of use that make GPT so alluring are also its most glaring limitation: “Only if and when a GPT model is fed knowledge with metadata context – so essentially contextual data about where it’s located, how trustworthy it is, and whether it is of high quality – can these hallucinations or inaccurate responses be fixed and GPT trusted as an AI advisor.”
“GPT is incredibly impressive in its ability to sound smart. The problem is that it still has no idea what it’s saying. It doesn’t have the knowledge it tries to put into words. It’s just really good at knowing which words ‘feel right’ to come after the words before, since it has effectively read and memorised the whole internet. It often gets the right answer since, for many questions, humanity has collectively posted the answer repeatedly online.”