Video analytics as the most immediate form of AI
Today, artificial intelligence (AI) is everywhere. Machine learning models are finding applications in medicine and transport, education and finance. Yet there is one use case that towers above the rest, one that is rapidly growing in popularity and is set to deliver results in the near term: video analytics.
Video contains a wealth of useful data for all sorts of organisations, from security firms to retailers, logistics companies, and local authorities. Making sense of what exactly is happening on video used to be a job for people. However, computer vision models are quickly replacing human observers.
In fact, software for computer vision applications is expected to be one of the most prominent areas of AI adoption. According to a forecast by analyst firm Omdia, pure-play computer vision is expected to generate $52 billion in cumulative revenues between 2019 and 2025, while software that combines computer vision with analytics is expected to earn another $51 billion over the same period.
Machine learning researchers are applying computer vision in an incredibly broad range of scenarios: from recognising ‘occult’ bone fractures in X-rays that are nearly invisible to the human eye, to estimating the number of trees in any region of the planet from satellite imagery. Global Forest Change, for example, is a project by the University of Maryland that tracks deforestation by applying machine learning models to satellite data to map tree cover. It might seem like a strange undertaking, but despite technological progress, we have a surprisingly poor understanding of how many trees cover the Earth.
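To make the idea concrete, here is a minimal sketch – not the Global Forest Change methodology, simply an illustration of how a small convolutional network might classify satellite image tiles as forest or non-forest. The model, tile size, and class count are all assumptions chosen for illustration.

# Illustrative only: a tiny convolutional network that classifies
# 64x64 satellite image tiles into two classes (e.g. forest / non-forest).
import torch
import torch.nn as nn

class TileClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 32x32 -> 16x16
        )
        self.head = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.head(x.flatten(start_dim=1))

model = TileClassifier()
tiles = torch.rand(8, 3, 64, 64)       # a dummy batch of RGB tiles
logits = model(tiles)                  # shape: (8, 2)
print(logits.argmax(dim=1))            # predicted class per tile

In practice, projects of this kind train far deeper networks on vast archives of labelled imagery, but the principle – learning visual patterns from examples – is the same.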
Using the project’s interactive map, visitors can browse ‘Example Locations’: images accompanied by explanations detailing, for instance, the effects of forest fires near Yakutsk in Siberia or the expansion of palm oil plantations in Kalimantan, Borneo. Frankly, it’s a striking way of telling a story through data.
Autonomous machine learning
In the world of commercial applications, computer vision is already being used on building sites to compare images of the work in progress against digital models of the design, otherwise known as digital twins, and in hospitals and care homes to ensure that patients and residents are safe and well.
Digital twins have become more prominent in recent years, and some operators consider them pivotal to data centre design, offering data-driven insights on which to base decisions about performance, cooling optimisation, and rack and room layouts. At Kao Data, for example, we chose Future Facilities’ computational fluid dynamics (CFD) software to create a digital twin of our London-1 facility, modelling and evaluating its performance against external weather and operating conditions at both full and partial utilisation. Using a digital twin was essential in helping us refine our decision-making around the provision of cooling infrastructure, and it was also crucial in the deployment of NVIDIA’s Cambridge-1 supercomputer.
Of course, video analytics is also powering advances in self-driving cars, where it helps the vehicle’s sensors make sense of the surrounding environment. Volkswagen recently announced that it was testing Level 4 vehicles – those capable of performing all driving tasks under specific conditions – on a special circuit in Munich, with hands-free vans expected on public roads as early as 2025.

Running ready-to-use machine learning models, in a process called inferencing, is comparatively simple and can be accomplished on edge hardware: smart CCTV cameras equipped with purpose-built chips to track customers, for example, or smart speakers that can recognise speech.
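As a rough illustration of how lightweight inferencing can be, the sketch below loads an off-the-shelf classifier from the torchvision library and runs it on a single image. The choice of mobilenet_v3_small and the file path are illustrative assumptions, not a reference to any particular deployment.

# Illustrative inference sketch: an off-the-shelf, pretrained classifier
# applied to one image. 'photo.jpg' is a placeholder path.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.mobilenet_v3_small(weights="DEFAULT")
model.eval()                                    # inference mode, no training

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():                           # no gradients needed at inference
    probabilities = model(image).softmax(dim=1)
print(probabilities.argmax(dim=1))              # index of the predicted class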
However, before such models become useful, they need to be trained on hundreds of thousands, sometimes millions, of examples in an incredibly compute-intensive mathematical exercise – and here, unfortunately, size matters. The more data a model consumes, and the larger its computational and memory budgets, the more accurate it will be at understanding text, images, or speech. Google Brain’s Big Transfer (BiT), one of the largest pre-trained computer vision models ever made, involved more than 315 million labelled images in its creation. Selecting the right data centre – one with customisable architectures and the technical expertise to help configure this new scale of computing, ensuring scalability and low application latency – is therefore becoming an imperative.
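To see why training is so demanding, here is a minimal sketch using a tiny stand-in model and random data in place of a real dataset: every labelled example must pass forwards through the model and then backwards to update its parameters, batch after batch, epoch after epoch.

# Illustrative training loop: random data and a tiny model stand in for
# the millions of images and large networks used in real training runs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

images = torch.rand(1024, 3, 32, 32)              # stand-in for a real dataset
labels = torch.randint(0, 10, (1024,))
loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):                            # real runs last far longer
    for batch_images, batch_labels in loader:
        optimiser.zero_grad()
        loss = criterion(model(batch_images), batch_labels)
        loss.backward()                           # the expensive backward pass
        optimiser.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")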
Emerging video analytics workloads in particular are some of the most resource-intensive applications to appear in the data centre. According to Omdia’s latest report, video data is extremely bandwidth-intensive, requiring more compute horsepower, training time, and data storage than either audio data (used in voice and speech recognition and conversation) or basic operational data streams, such as machine operational data or consumer data like browsing history.
Interestingly, the hyperscale view is that these compute-intensive applications are best served by massive arrays of standardised servers spread across multiple locations. However, vision model users often find that industrial-scale data centres located close to their applications provide a far more effective platform for their requirements.
Advancing computer vision
Two things, as ever, are essential to advance the state of the art in computer vision: infrastructure and human labour. Traditional CPUs are somewhat useful for training machine learning models, but what this technology likes best is massively parallel processing involving thousands of compute cores, rather than a few dozen. Running AI at scale requires rethinking infrastructure in a way not seen for decades and embracing the principles behind high performance computing (HPC), which means hotter, heavier, and more power-hungry machines.
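A simple, hedged illustration of the point, assuming a machine with PyTorch installed and an NVIDIA GPU available: the same matrix multiplication – the basic operation underpinning neural networks – is timed on the CPU and on the GPU’s thousands of cores. The exact speed-up depends entirely on the hardware in question.

# Illustrative only: time a large matrix multiplication on CPU and,
# where available, on a GPU.
import time
import torch

def time_matmul(device: str, size: int = 4096) -> float:
    a = torch.rand(size, size, device=device)
    b = torch.rand(size, size, device=device)
    if device == "cuda":
        torch.cuda.synchronize()                 # wait for the GPU to be ready
    start = time.perf_counter()
    _ = a @ b
    if device == "cuda":
        torch.cuda.synchronize()                 # wait for the result
    return time.perf_counter() - start

print(f"CPU: {time_matmul('cpu'):.3f}s")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.3f}s")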
To a degree, this is already happening. NVIDIA, for example, which recently deployed the UK’s fastest AI supercomputer, Cambridge-1, at Kao Data, has always designed chips with an abundance of cores. The company has rapidly transformed itself from a gaming brand into an enterprise juggernaut, and is taking its leadership position in AI hardware very seriously. As such, the latest crop of NVIDIA accelerators includes the A30 card for ‘mainstream’ enterprise compute workloads, such as recommendation systems, conversational AI, and computer vision.
The A30 is smaller and cheaper than the A100, which powers the DGX supercomputing appliances and is a favourite among the hyperscalers. Its arrival may signal that AI has gone from being the prerogative of major corporations to something viable for small and medium-sized businesses.
Challenging for the top spot
NVIDIA is, of course, not the only company to have noticed the sea change; challengers vying for the parallel processing crown include Intel’s Habana, AMD’s Instinct, British hopeful Graphcore, newer entrants SambaNova and Groq, and of course Cerebras – creator of a massive 72-square-inch chip with 850,000 ‘AI-optimised’ cores packed into its latest design.
Another factor determining the speed of progress in computer vision is the fact that data labelling remains, very much, a task for humans. Before a machine can understand what is happening in a diagnostic image or a video stream, it needs to see carefully selected examples that have been tagged by human annotators. Data labelling is a rapidly growing industry, but it remains largely hidden from view, with low-skill, low-income jobs often outsourced to developing economies in Asia.
The quality of data labels also remains a hot topic. Incorrect labels will produce inaccurate models, and even the benchmarks used to test AI models are not, unfortunately, free of error. A recent study analysed ten of the most popular machine learning benchmark datasets, image datasets among them, and found that an average of 3.4 percent of labels were incorrect.
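A much cruder version of the idea behind such audits can be sketched in a few lines: flag any example where a trained model disagrees with the given label and does so with high confidence, then send those examples back for human review. The arrays and the 0.9 threshold below are purely illustrative.

# Illustrative label audit: compare given labels against (stand-in)
# model predictions and flag confident disagreements for review.
import numpy as np

rng = np.random.default_rng(0)
given_labels = rng.integers(0, 3, size=1000)          # labels as annotated
pred_probs = rng.dirichlet(np.ones(3), size=1000)     # stand-in model outputs

predicted = pred_probs.argmax(axis=1)
confidence = pred_probs.max(axis=1)

suspect = (predicted != given_labels) & (confidence > 0.9)
print(f"{suspect.sum()} of {len(given_labels)} labels flagged for review")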
Obtaining image and video data is not hard, especially when you consider that more than three billion images are shared online every day. But advancing computer vision will require armies of data labellers to process and tag this information before it becomes truly useful. Even so, existing computer vision and video analytics applications offer a tantalising glimpse of what is truly possible with AI.
Advances in hardware, access to more labelled data, and much larger machine learning models will inevitably produce much smarter systems. But to get there, we need to solve the challenge of training – and that means addressing both the infrastructure and the human component of data preparation.