It is time to shift from a model-centric mindset to a data-centric approach.
AI is a massive part of human life today and is woven into the fabric of everyday society. From medical image analysis to the facial recognition software in our cell phones, AI is everywhere. Until now, however, the focus of AI has largely been on improving code and algorithms, not on improving the data used to train the models. Another way to put it? AI has taken a model-centric approach while hoping and praying that data quality is up to par.
But we are currently in the midst of a data revolution. There has never been so much data available to us, yet this is part of the problem: data science teams need to carefully select which data will make models perform best.
Currently, most AI models are jump-started from “off-the-shelf” models that give a significant head start, and seldom do you need to build a model from a clean slate. While that is a great way to get started, these models are trained on large public data sets that may not be close to the intended use. To move ahead and improve model accuracy, we must begin to focus on improving the data used. Three issues stand out.
First, there’s often no point in training models on data similar to what they’ve already “seen”. The model won’t improve this way, yet the data will still incur compute and storage costs, and labeling costs where labels are needed. It’s the net-new data, also known as “novelty data sets”, that accelerates model performance, yet finding these valuable novelty data sets can be hard, manual work.
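One way to picture the search for novelty data is as a similarity filter over embeddings. The sketch below is only an illustration under assumptions not made in this article: each image is assumed to already have an embedding vector from some model, and the `select_novel` helper, the file names, and the 0.9 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_novel(candidates, seen, threshold=0.9):
    """Return names of candidates that are not close to any seen embedding.

    `threshold` is a hypothetical cut-off: a candidate counts as novel when
    its best similarity to the already-seen data stays below it.
    """
    novel = []
    for name, vec in candidates.items():
        best = max((cosine_similarity(vec, s) for s in seen), default=0.0)
        if best < threshold:
            novel.append(name)
    return novel


# Toy 2-D embeddings standing in for real image embeddings (assumed data).
seen_embeddings = [[1.0, 0.0], [0.9, 0.1]]
candidate_embeddings = {
    "near_duplicate.jpg": [0.95, 0.05],
    "new_scene.jpg": [0.0, 1.0],
}

novel_items = select_novel(candidate_embeddings, seen_embeddings)
print(novel_items)  # ['new_scene.jpg']
```

Only the candidates that survive the filter would then be sent on for labeling and training, which is where the cost savings would come from.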
Second, implicit in data selection is the potential for introducing bias. Selecting data sets that represent the ideal training conditions is the objective, yet this is difficult to achieve, especially with visual data, where the implicit distributions are not always obvious.
Finally, once models are being trained, finding the root cause of underperformance, whether class imbalance or something else, is often iterative and extensive manual work. In all of these cases, it’s the choice of data and the lack of sophisticated tools that slow model improvement efforts, not necessarily the underlying model.
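Class imbalance, at least, is cheap to check for once labels exist. The following is a minimal sketch, not a prescribed method: the `underrepresented` helper, the 10% minimum share, and the toy label list are assumptions made for illustration.

```python
from collections import Counter


def class_distribution(labels):
    """Return each class's share of the data set as a fraction of 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def underrepresented(labels, min_share=0.1):
    """List classes whose share falls below `min_share` (hypothetical cut-off)."""
    shares = class_distribution(labels)
    return sorted(label for label, share in shares.items() if share < min_share)


# Toy label list standing in for a real annotated data set (assumed data).
labels = ["car"] * 90 + ["truck"] * 8 + ["bicycle"] * 2

rare = underrepresented(labels)
print(rare)  # ['bicycle', 'truck']
```

A check like this only flags the symptom; deciding which additional data to collect for the rare classes is still a curation question.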
Why don’t we put more emphasis on data?
We all know data is expansive, everywhere, and increasingly important to our lives, systems, and communities. So why isn’t there more of an emphasis placed on data within AI?
To start, visual data sets are typically very large. Think: traffic cameras, medical imaging analysis, YouTube videos, satellite imagery. There is no easy way to query and search through these huge data sets unless they were previously annotated or you’re willing to write a custom detector for each item you’re looking for. This requires a lot of time and effort from already strapped data scientists.
Unfortunately, no formalized tools exist for this sifting, and there are very few solid tools (open source or otherwise) to systematically build training data sets. To build good training data sets, one needs not only to understand the relationships, structure, and implicit biases in the data, but also to search and curate, at large scale, data that is yet to be annotated. This laborious manual process is simply accepted as an unfortunate part of visual data management.
Another reason AI’s emphasis has been on the model rather than the data is that data work tends to be done with one-off preprocessing code or bespoke offline models. These attack a very specific problem, and they often don’t scale to real-world data applications.
In addition, there are security concerns involving the use of data to build AI systems. Visual data is often confidential and cannot be shared outside an organization’s firewall (or similar boundary), yet teams must work with it more closely in order to unlock the potential of AI. That’s a real Catch-22 for many organizations and data scientists.
But how could better data curation address these common issues?
How can data scientists improve data curation?
Data scientists are already stretched, but improving data curation is critical to unleashing the full potential of AI.
First, data scientists need to allocate effort to data curation in proportion to its value. Historically, working on the model was (and still is) considered the more appealing and glamorous work of AI. But given that these models are maturing, and that by an often-cited estimate 80% of a data scientist’s work involves collecting, cleaning, and organizing data, it’s time to make a cultural shift toward investing more time in data and data curation.
There are many good models available that can give data scientists a head start. But going from 80% accuracy to 95% or higher is far more difficult than reaching 80% in the first place, and it’s no longer enough to just tweak or tune the model to get there. You need good training data and, implicitly, a good way to select your data.
One of the best ways to do this? The deployment of tools that can simplify and reduce the manual labor component of data management. For example, by deploying visual data search, clustering, and model audit tools, data science teams can radically speed up the model training process, while also bolstering model accuracy.
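As a rough idea of what the simplest form of a model audit could look like, the sketch below breaks accuracy down by class so that weak spots in the data become visible. It is an illustrative assumption rather than a description of any particular tool; the `per_class_accuracy` helper and the toy labels are hypothetical.

```python
from collections import defaultdict


def per_class_accuracy(true_labels, predicted_labels):
    """Return accuracy for each ground-truth class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


# Toy ground truth and predictions standing in for a real validation set.
truth = ["cat", "cat", "dog", "dog", "dog", "bird"]
preds = ["cat", "cat", "dog", "cat", "dog", "cat"]

report = per_class_accuracy(truth, preds)
weakest = min(report, key=report.get)
print(weakest)  # bird
```

An overall accuracy figure would hide the fact that one class is never predicted correctly here; a per-class view points directly at the data that needs attention.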
The future of visual data
The volume of visual data is only set to increase across industries as varied as healthcare, security, surveillance, sports and entertainment, manufacturing, automotive, and retail. And as more and more cameras are deployed all over the world, the cost of hardware will continue to decrease, while what we ask of all that visual data will rise.
Storing everything is not a realistic option. Finding algorithmic means to evaluate, store, curate, and select visual data will become a critical component of MLOps for these use cases.
The volume of visual data will only continue to grow, along with the potential uses and advancements we can make by harnessing its value. But for these applications to become reality, we must first invest in the tools and processes that can help us manage visual data.