How Much Data Do We Need?

I’ve spent hundreds of hours talking with potential clients about how Synaptiq can help unlock value in their data.  I really enjoy those conversations; each one is different and I always learn something new, especially about the amazingly creative ways that people come up with to leverage data.  I often click the button on Zoom to disconnect from the call and say “I wish I had thought of that!”  

Despite all of the things that are different from call to call, two questions always come up: “how much will it cost?” and “how much data do we need?”  The first of those is usually much easier to answer than the second, because what people are really saying when they ask the second question is something like this: “We’ve got some data, but it’s a bit of a mess.  It’ll take some effort to get it into shape.  We can get more data, but that’ll take time as well and we’ve got limited resources.  How much work are we going to have to do to ensure that this project can succeed?”

If there’s anything that’s universally true in AI and machine learning, it is that more data is better. 

A 2009 paper entitled “The Unreasonable Effectiveness of Data” argued that differences in accuracy between algorithms shrink dramatically as the amount of training data increases.  Which algorithm you use matters when you don’t have much data, but it matters far less when you have lots of it.  Said differently, the more data you have, the less clever you need to be.

In this and the next few blog posts, I won’t answer the question of how much data you need, because the real answer is “it depends”, and I’d need to ask you some questions - which is impossible in a blog post - to give a solid answer.  Rather, I’ll talk about what you can do to get the best results possible from the data that you do have, focusing on the low data case.  That is, I’ll try to help you be clever in the right ways when there’s not much data.

I’ll limit the discussion to supervised learning, where your data consists of things and labels: a lung X-ray (the thing) and whether there is evidence of cancer (the label); a restaurant review and whether it's fake; clickstream data and whether the user will buy something.  I’ll also assume neural networks as the type of model to learn, but most of the ideas apply to many different types of models.  Within this broad umbrella, there are different low data cases, each of which requires or enables different approaches.

Below I give a quick overview of the low data cases and the relevant approaches.  Read my next blog posts for the details with lots of examples!

Low Data Case I: You just don’t have much data

Sometimes, you just don’t have much data and it’s really hard to get more.  That’s more often the case with early-stage companies that, for example, may not yet have many customers and thus have limited data to use to make predictions.  But it can also be the result of getting data from physical, as opposed to digital, sources: images from drones flying over oil pipelines in remote locations, manual inspection of airplane engines that failed, and so on.  The good news is that this is such a common case that there are many things you can do to make the most of your limited data.

Transfer learning

The last time you learned something completely new you were a baby.  Everything you learn today is informed by what you learned previously, and is easier and faster because of it.  Imagine that you want to learn to play tennis, but getting time on the court is hard.  You do have easy access to a squash court at the gym, so you play lots of squash, which is like tennis.  That way you acquire some useful skills that you can just fine-tune on the tennis court, needing far less time than if you had never played squash.

Likewise, machine learning systems can learn from scratch, or they can learn something new by fine-tuning something they learned before.  And just like with people, it takes less data and less time given that prior experience.  Suppose you want to train a model to detect people wearing glasses. You could train it from scratch, and the network will have to learn, from images, what people look like, what the places where people can be found look like, what glasses look like, and where glasses are in images when worn by people.  That’s a lot to learn and, as you can imagine, might need lots of data.  Or you could get a network that was trained to detect people, which has already learned what they look like and where they can be found and has already seen lots of examples of people with glasses, and then fine-tune it just specifically to detect glasses.

That is transfer learning, and it is exceedingly common these days.  In fact, it is rare to train a neural model from scratch at all, especially in visual domains.  You look for a model trained on data similar to yours, modify it slightly, and then continue training with just your data.  
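Here is a minimal sketch of that recipe in plain Python, under some loud assumptions.  The function pretrained_features is a hypothetical stand-in for a network that someone else already trained; it is frozen and never updated.  The only thing learned from your small dataset is a new logistic-regression “head” on top of those features.  A real project would use a deep learning library and an actual pre-trained model, so treat this as an illustration of the idea, not an implementation.

```python
import math

def pretrained_features(x):
    """Hypothetical stand-in for a frozen, pre-trained network.

    In practice this would be the early layers of a model trained on a
    large dataset similar to yours. It is never updated below.
    """
    return [x[0] + x[1], x[0] - x[1]]

def train_head(examples, labels, epochs=200, lr=0.5):
    """Train only a new logistic-regression head on the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            feats = pretrained_features(x)
            z = sum(w * f for w, f in zip(weights, feats)) + bias
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of label 1
            err = p - y                       # gradient of the log loss w.r.t. z
            weights = [w - lr * err * f for w, f in zip(weights, feats)]
            bias -= lr * err
    return weights, bias

def predict(head, x):
    """Classify x with the frozen features plus the fine-tuned head."""
    weights, bias = head
    feats = pretrained_features(x)
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1 if z > 0 else 0

# A deliberately tiny labeled dataset: just six examples.
examples = [(0.1, 0.2), (0.9, 0.8), (0.3, 0.1), (0.7, 0.9), (0.2, 0.4), (0.8, 0.6)]
labels = [0, 1, 0, 1, 0, 1]
head = train_head(examples, labels)
```

The design choice to notice is that the training loop touches only the head’s weights and bias.  Everything the “pre-trained” part knows is reused as-is, which is why so few labeled examples are needed.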

Open source datasets

But what if you can’t find a pre-trained network?  An alternative is to look for a dataset that is similar to yours, but that is much larger, train a network on it, and then see “Transfer Learning” above.  There won’t be a blog post on this topic, but I’ve often been on calls with prospective clients and googled “X dataset” where X is whatever domain they are working in.  Rarely will that search return nothing useful.  Google even has a dataset search tool that is a good fallback if the generic “X dataset” query fails.  Kaggle also has a broad range of datasets that you can search, many inspired by business problems.

Synthetic data

But what if you really can’t find any similar data?  Maybe you’re building a video surveillance system to detect people carrying weapons in public places.  You got permission from local law enforcement to gather some data by carrying guns and knives in public places while they stood by.  But that approach clearly does not scale.

You can take the few images you have and create more by a process known as data augmentation: changing the contrast, adding salt and pepper noise, random cropping, flipping around the vertical axis, and so on.  Each of these augmentations yields a new example that has the same label as the original image.
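The augmentations listed above can be sketched in a few lines of plain Python.  This is a toy version that assumes a grayscale image stored as a list of rows of pixel values from 0 to 255; the function names are my own, and a real pipeline would normally use an image library instead.

```python
import random

def flip_horizontal(image):
    """Mirror the image left-to-right (a flip around the vertical axis)."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window out of the image at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

def salt_and_pepper(image, amount, rng):
    """Set roughly a fraction `amount` of pixels to pure black or pure white."""
    noisy = [row[:] for row in image]
    for r in range(len(noisy)):
        for c in range(len(noisy[r])):
            if rng.random() < amount:
                noisy[r][c] = rng.choice([0, 255])
    return noisy

def adjust_contrast(image, factor):
    """Stretch pixel values away from mid-gray (128), clamped to 0..255."""
    return [[max(0, min(255, round(128 + factor * (p - 128)))) for p in row]
            for row in image]

def augment(image, label, rng):
    """Return new (image, label) pairs; every variant keeps the original label."""
    h, w = len(image), len(image[0])
    variants = [
        flip_horizontal(image),
        random_crop(image, h - 1, w - 1, rng),
        salt_and_pepper(image, 0.1, rng),
        adjust_contrast(image, 1.5),
    ]
    return [(variant, label) for variant in variants]

image = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
new_examples = augment(image, "weapon", random.Random(0))
```

One labeled image becomes five labeled images (the original plus four variants) without anyone having to label anything new.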

An alternative is to create fake data.  In the surveillance example, you can find pictures of weapons and digitally paste them into pictures of public spaces, which are readily available.  The challenge there is to make the resulting images look like they would if the person were actually carrying the weapon, and not look like, for example, someone has a gun glued to their forehead.  The fake data should look as close to the real data as possible.  We’ve even worked with clients who created data using gaming engines because the scenes they render are so realistic.  There are a number of gotchas here.  For example, if you’re not careful about how you paste objects into pictures, the neural network can learn that a “seam” in the image means that there is a weapon, and completely ignore what was pasted.  That is, if you pasted a carrot into an image, the network would see the seam and say that the carrot is a weapon.

Web scraping

Another approach we’ve used with very good results is submitting queries to Google’s image search engine.  For example, if you want to classify license plates by state you can submit queries like “license plate north carolina” and “license plate california” and “license plate michigan”.  Most of the hits that come back will be images of license plates from the specified state, so the query tells you the correct label!  Of course, you’re not guaranteed that the images will be of license plates from the target state, so there may be some noise in the data.  But a little manual curation with this approach can go a long way towards gathering relevant data.

Few-shot learning

Finally, there is an entire class of learning algorithms that are built around the idea that you have just a handful of labeled instances.  I’ll write a long blog post or series of posts on this topic, but here is the basic idea.  Suppose you want to train a network to let people into your health club using just their faces.  To enroll, the members have to submit two different pictures of themselves.  You build a neural network that takes as input two images, and the target output is whether they are images of the same person.  Select a picture of Bob and a picture of Janet, give them to the network and tell it that the correct answer is “different”.  Select the two pictures of Janet and tell the network that the correct answer is “same”.  This forces the network to learn what features of people are common across different pictures of them, and what features are discriminating in pictures of two different people.  When Bob shows up at the club and stands in front of the camera, his picture is handed to the network along with pictures of each of the club’s members, one at a time, to see if the network ever says “same”.  If it doesn’t, the door stays locked, and if it does, a visit by Bob is recorded in the database.  Again, we’ll dive deep into this topic soon.
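To make the health club example concrete, here is a small sketch of its two pieces: building the “same” and “different” training pairs, and checking a new picture against each member.  Everything here is an assumption for illustration.  Pictures are represented as short feature vectors, and toy_same_person is a hypothetical stand-in for the trained network that simply thresholds the distance between two vectors.

```python
import math
from itertools import combinations

def make_training_pairs(enrolled):
    """Build (picture_a, picture_b, target) triples from {member: [pictures]}."""
    pairs = []
    members = sorted(enrolled)
    # Two pictures of the same member: the correct answer is "same".
    for name in members:
        for a, b in combinations(enrolled[name], 2):
            pairs.append((a, b, "same"))
    # Pictures of two different members: the correct answer is "different".
    for first, second in combinations(members, 2):
        for a in enrolled[first]:
            for b in enrolled[second]:
                pairs.append((a, b, "different"))
    return pairs

def toy_same_person(picture_a, picture_b, threshold=0.3):
    """Hypothetical stand-in for the trained two-input network."""
    return math.dist(picture_a, picture_b) < threshold

def identify(new_picture, enrolled, same_person):
    """Compare a new picture with each member's pictures, one at a time."""
    for name, pictures in enrolled.items():
        if any(same_person(new_picture, p) for p in pictures):
            return name   # record the visit and unlock the door
    return None           # nobody matched, so the door stays locked

enrolled = {
    "Bob": [(0.9, 0.1), (0.8, 0.2)],
    "Janet": [(0.1, 0.9), (0.2, 0.8)],
}
pairs = make_training_pairs(enrolled)
visitor = identify((0.85, 0.15), enrolled, toy_same_person)
```

Notice that two members with two pictures each already yield six training pairs, and the number of “different” pairs grows quickly as members are added.  That is part of why this setup can work with only a handful of pictures per person.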

Low Data Case II: You’ve got lots of things, but few labels

Sometimes you’ve got lots of data, but not much of it is labeled.  That can happen when the knowledge required to label things correctly is specialized.  A summer intern can identify X-ray images of luggage with concealed guns, but you need a radiologist to tell you whether X-ray images of lungs show signs of cancer.  Or it may be the case that your ability to collect things far exceeds your capacity to label them.  You can write code to download hundreds of thousands of posts from Reddit or Stack Overflow, or a highway surveillance camera may stream images of thousands of license plates per hour, but in both cases a human must read a post or look at an image to assign a label.  

Like the previous case, this one is common as well, often with companies that have been gathering data, perhaps for some other purpose, for a long time and now have ideas about new ways to use their data.  Also like the previous case, there are many things you can do, with most of them focusing on how to best leverage the unlabeled data.

Active learning

Think back to when you learned math - fractions, or algebra, or calculus - in school.  Some homework problems made sense and could be solved easily, but others were hard and confusing.  One way of dealing with the hard problems is to ask the teacher for help.  It doesn’t make sense to ask for help with the things you already understand because the teacher’s time is limited.

Machine learning systems that engage in active learning do the same thing.  They can identify unlabeled examples that are confusing and ask the teacher, in this case a human, for the right answer, i.e., the correct label.  Rather than asking humans to label data chosen randomly, where many of the chosen items may already be well-understood by the algorithm, a much better use of precious human time (e.g., expensive radiologists) is to ask them to label examples that will actually teach the algorithm something new.
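One simple way to decide which examples are “confusing” is uncertainty sampling, sketched below.  It assumes a binary classifier that outputs a probability for each unlabeled example, and it sends the examples closest to 0.5 to the human.  The function name and the example identifiers are made up for illustration, and this is only one of several selection strategies.

```python
def select_for_labeling(unlabeled_ids, predicted_probs, budget):
    """Pick the `budget` examples whose predicted probability is closest to 0.5."""
    ranked = sorted(
        zip(unlabeled_ids, predicted_probs),
        key=lambda pair: abs(pair[1] - 0.5),   # small value = model is unsure
    )
    return [example_id for example_id, _ in ranked[:budget]]

ids = ["xray_1", "xray_2", "xray_3", "xray_4", "xray_5"]
probs = [0.98, 0.52, 0.03, 0.40, 0.75]   # model's probability of "cancer"
to_label = select_for_labeling(ids, probs, budget=2)
```

With a budget of two, the radiologist is asked about the X-rays scored 0.52 and 0.40, and never sees the ones the model already scores at 0.98 or 0.03.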

Semi-supervised learning

What can you do with tons of unlabeled examples other than label them?  It turns out that you can do a lot!

Suppose you’ve trained a model using whatever labeled data is available.  As with active learning, when the model sees unlabeled examples it will be certain about its answer for some and uncertain about its answer for others.  Bootstrapping is the process of using the labels for which the model is fairly certain as ground truth, in effect pretending that they came from a human and using this expanded labeled dataset to learn a new model.  A few iterations of this process can greatly increase the number of labeled examples without involving a human and, due to the unreasonable effectiveness of data, the resulting model should be more accurate.  Of course, the model can be certain and wrong, so there are details to work out which will be covered in a future blog post.
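The bootstrapping loop (often called self-training or pseudo-labeling) can be sketched as follows.  The assumptions are mine: labels are binary, a “model” is just a function that returns the probability of the positive class, and toy_train is a hypothetical trainer for one-dimensional examples that exists only so the sketch runs.  The 0.95 confidence threshold is likewise an arbitrary choice.

```python
import math

def pseudo_label(unlabeled, predict_proba, threshold=0.95):
    """Split unlabeled examples into confidently pseudo-labeled ones and the rest."""
    newly_labeled, still_unlabeled = [], []
    for example in unlabeled:
        p = predict_proba(example)          # probability of the positive class
        if p >= threshold:
            newly_labeled.append((example, 1))
        elif p <= 1 - threshold:
            newly_labeled.append((example, 0))
        else:
            still_unlabeled.append(example)  # too uncertain to trust
    return newly_labeled, still_unlabeled

def bootstrap(labeled, unlabeled, train, rounds=3, threshold=0.95):
    """Repeatedly train, pseudo-label the confident examples, and retrain."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = train(labeled)
    for _ in range(rounds):
        new, unlabeled = pseudo_label(unlabeled, model, threshold)
        if not new:
            break                            # nothing confident left to add
        labeled.extend(new)
        model = train(labeled)
    return model, labeled, unlabeled

def toy_train(labeled):
    """Hypothetical trainer: a sigmoid centered between the two class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    midpoint = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 / (1 + math.exp(-4 * (x - midpoint)))

model, labeled, leftover = bootstrap(
    labeled=[(0.0, 0), (4.0, 1)],
    unlabeled=[-1.0, 0.5, 1.9, 3.5, 5.0],
    train=toy_train,
)
```

In this toy run the labeled set grows from two examples to six, while the one example near the decision boundary stays unlabeled.  Nothing in the loop checks whether a confident prediction is actually right, which is exactly the “certain and wrong” risk mentioned above.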

Now imagine that you, not being a radiologist, have been put in the awkward position of labeling lung X-rays.  You don’t know what lung cancer looks like, but you can look at two different X-rays and judge how similar they are.  If two X-rays are very similar then you reason that they must have the same label - cancer or not - even if you don’t know which one is correct.  Said differently, labels that you assign for very similar examples should be consistent.  It turns out that penalizing a model for being inconsistent on unlabeled data as it learns using a small labeled dataset leads to significantly better outcomes.  Again, the devil is in the details, but that will be the subject of another blog post.
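One way to express “penalize inconsistency” is as an extra term in the training loss, sketched below.  The details are assumptions on my part: the model is a function returning a probability, perturb is a hypothetical function that makes a slightly altered copy of an example, and the penalty is the mean squared difference between the two predictions.  Published methods differ in how they create the similar examples and how they measure disagreement.

```python
import math

def consistency_penalty(predict_proba, unlabeled, perturb):
    """Mean squared disagreement between an example and a slightly altered copy."""
    diffs = [(predict_proba(x) - predict_proba(perturb(x))) ** 2 for x in unlabeled]
    return sum(diffs) / len(diffs)

def supervised_loss(predict_proba, labeled):
    """Average log loss on the small labeled dataset (labels are 0 or 1)."""
    eps = 1e-12   # avoids log(0)
    total = 0.0
    for x, y in labeled:
        p = predict_proba(x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labeled)

def total_loss(predict_proba, labeled, unlabeled, perturb, weight=1.0):
    """Labeled data says what is correct; unlabeled data says 'be consistent'."""
    return (supervised_loss(predict_proba, labeled)
            + weight * consistency_penalty(predict_proba, unlabeled, perturb))

def nudge(x):
    """Hypothetical perturbation: a tiny shift that should not change the label."""
    return x + 0.1

def jumpy_model(x):
    """A model whose answer flips abruptly at zero."""
    return 1.0 if x > 0 else 0.0

penalty = consistency_penalty(jumpy_model, [-0.05, 1.0], nudge)
```

The jumpy model is penalized because a tiny nudge to the first example flips its answer, even though no label for that example is ever consulted.  That is how unlabeled data gets to influence training.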

I hope this discussion has done a few things.  First, it should be clear that low data cases are common and that having small amounts of data does not mean that you cannot extract significant value from it.  Second, not all low data cases are the same, and the way you tackle them depends on which case you’re in.  Finally, I’d like you to come away with an intuitive understanding of some of the approaches to low data, and hope that you’ll read the blog posts that follow to get more of the technical details as well as advice on how to make them work in practical settings.
