How to choose a subset of images for training?
A dataset of images used for computer vision tasks can be the key to success or failure. A clean dataset can lead the way to a great algorithm, model and ultimately system, while a poor one cannot be rescued by even the best model or algorithm: junk in, junk out.
Even if the data is clean, do you need to annotate all images and train on everything?
This is a tricky question with no definitive answer, but the best answer for now is:
Look No Further Than Data Explorer
So how do you know which images should be kept?
Data Explorer is a platform that was built to allow us to focus on the data, curate it, clean it and make sure we start the development cycles with a great foundation.
In previous blogs, we saw how a dataset could be visualized and explored, and how image-based search could be used to find elusive examples.
In this blog we will see how to automatically select a subset of images for training. To emphasize, the selection is based on the images alone, with no metadata or prior knowledge involved.
Data Explorer provides six different sampling methods:
- Inlier: Points similar to a cluster of points
- Outlier: Points dissimilar to all other points
- Bimodal: Strong inlier or outlier points
- Normal: Normal distribution
- Uniform: Pick one sample after every ’n’ samples
- Coreset: A representative sample that attempts to preserve the cluster structure in the data
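To make a few of these ideas concrete, here is a minimal sketch in plain Python. It is not Data Explorer's implementation, which is not described in this post. It assumes each image has already been turned into an embedding vector, and the function names (`uniform_sample`, `knn_scores`, `sample_by_score`) and the toy data are hypothetical. Uniform sampling keeps one item out of every n. The inlier/outlier sketch scores each point by its mean distance to its nearest neighbors, which is one common way to express "similar to a cluster" versus "dissimilar to all other points".

```python
import math


def uniform_sample(items, n):
    """Keep one item out of every n (the first of each group of n)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return items[::n]


def knn_scores(points, k=2):
    """Mean distance from each point to its k nearest neighbors.

    A low score means the point sits inside a cluster (inlier-like);
    a high score means it is far from everything else (outlier-like).
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def sample_by_score(points, fraction, outliers=False, k=2):
    """Return indices of the `fraction` most inlier- or outlier-like points."""
    keep = max(1, round(fraction * len(points)))
    scores = knn_scores(points, k)
    # Ascending order puts inliers first; descending puts outliers first.
    order = sorted(range(len(points)), key=lambda i: scores[i], reverse=outliers)
    return sorted(order[:keep])


# Hypothetical 2-D "embeddings": a tight cluster plus one far-away point.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

every_third = uniform_sample(list(range(10)), 3)
outlier_idx = sample_by_score(embeddings, fraction=0.2, outliers=True)
inlier_idx = sample_by_score(embeddings, fraction=0.8)
```

On this toy data the outlier selection picks the lone far-away point, while the inlier selection keeps the four clustered points.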
Applying any of these methods to the data is trivial. Click the “filter” icon, then choose the relevant sampling method and how much of the data to keep.
In the example below, we see how simple it is to keep 40% of the images using the “coreset” option:
Click the “Filter” icon, then choose the sampling method and the sampling fraction (how much of the data to keep)
Choosing the best option and how much data to keep depends on many factors, but Data Explorer provides a very simple interface for running various experiments.
The Coreset option tries to preserve the data structure, so even small clusters should be represented after sampling the data. As such, it may be a good place to start the process.
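For intuition about why a coreset keeps small clusters, here is a short sketch of greedy k-center (farthest-point) selection, a widely used way to build a coreset. This post does not state which algorithm Data Explorer uses, so treat it as an illustration only. The function name `k_center_greedy`, the starting index and the toy embeddings are assumptions made for the example.

```python
import math


def k_center_greedy(points, fraction, start=0):
    """Greedy k-center selection over embedding vectors.

    Repeatedly adds the point farthest from everything selected so far,
    so isolated points and small clusters are picked up early.
    Returns the sorted indices of the selected points.
    """
    n = len(points)
    keep = min(n, max(1, round(fraction * n)))
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < keep:
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return sorted(selected)


# Hypothetical embeddings: a large cluster, a small cluster and a lone point.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),  # large cluster
    (3.0, 3.0), (3.1, 3.0),                                        # small cluster
    (0.0, 6.0),                                                    # lone point
]

subset = k_center_greedy(embeddings, fraction=0.4)
```

With roughly 40% of the points kept, the selection contains one point from the large cluster, one from the small cluster and the lone point, whereas a random 40% sample could easily miss the two smaller groups.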
In this blog, we saw how a dataset of images could be sampled. By doing so, we eliminate the need to annotate the entire dataset, which saves money and time. In addition, by training on only some of the images, training cycles are shortened, and accuracy may even improve.
In future blogs, we will see how Data Explorer leverages metadata to prepare the best dataset for further development steps.