In machine learning lingo, “labeled” data means we have the data and we also know the output associated with each data point. For example, suppose you’re trying to predict house prices based on features like the square footage and neighborhood of a house. You have labeled data in this case if you also have the price of each house.
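To make the distinction concrete, here is a minimal sketch in plain Python. The house records are made-up numbers, and the names (labeled_houses, fit_line, predict_price) are hypothetical. For brevity it fits a straight line on square footage only and ignores the neighborhood, so treat it as an illustration of what labels make possible rather than a realistic pricing model.

```python
# Labeled data: each example pairs features with a known output (the label).
# All values below are invented for illustration.
labeled_houses = [
    ({"sqft": 1000, "neighborhood": "Eastside"}, 200_000),
    ({"sqft": 1500, "neighborhood": "Eastside"}, 300_000),
    ({"sqft": 2000, "neighborhood": "Riverside"}, 400_000),
]

# Unlabeled data: the same features, but no prices to learn from.
unlabeled_houses = [features for features, _price in labeled_houses]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Training is only possible because the prices (labels) are available.
sqft = [features["sqft"] for features, _price in labeled_houses]
prices = [price for _features, price in labeled_houses]
slope, intercept = fit_line(sqft, prices)


def predict_price(square_feet):
    """Predict a price from square footage using the fitted line."""
    return slope * square_feet + intercept


print(predict_price(1200))
```

With only unlabeled_houses there would be nothing to pass as ys, which is exactly the situation the rest of this post deals with.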

Without these labels it will be impossible to predict house prices or, more precisely, to train a machine learning model to predict house prices. But we don’t always have the luxury of labeled data.

Imagine you have a data pipeline that delivers text documents at a certain rate. The content of these documents can vary. Can these documents be categorized, labeled, or put into groups? The group into which a document is classified is a type of prediction — give me a document and I’ll tell you which group it belongs to.

So how do we go about solving this unlabeled data problem? (You’ll also see this described as an unsupervised learning problem.)

Unsupervised and Semi-Supervised Machine Learning Solutions

There are a handful of approaches we can use to solve this problem.

1) Topic models such as LSI (Latent Semantic Indexing) and LDA (Latent Dirichlet Allocation) can extract semantic information to categorize documents in terms of higher-level concepts. The documents can then be further classified/tagged into categories using clustering techniques such as K-Means.

2) Text summarization techniques can also help classify the documents.

3) Once we have a classification, deep learning techniques, especially recurrent neural networks (RNNs), can be applied to generate text like the text in the cluster. This makes it easier for humans to recognize the semantic characteristics of the cluster/category and measure the performance of the classification schemes.

4) Once we have a classification and a way to easily put some examples from each cluster in front of humans, we can bootstrap the solution by having humans label a small subset of the documents. This labeling can then be used in a collaborative filtering recommender system model to fill in more gaps and steadily build up a labeled dataset of document categories.
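As a rough illustration of the clustering idea in the first approach, here is a minimal sketch in plain Python that turns documents into TF-IDF vectors and groups them with K-Means. It deliberately skips the LSI/LDA step, which in practice you would likely take from a library. The sample documents, the function names, the whitespace tokenization, and the choice of the first k documents as starting centroids are all assumptions made to keep the example small and repeatable.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn whitespace-tokenized documents into L2-normalized TF-IDF vectors."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant, chosen here).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vec = [term_freq[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iterations=20):
    """Basic K-Means (Lloyd's algorithm).

    Assumption: the first k vectors seed the centroids so the result is
    deterministic; real code would normally use random or k-means++ seeding.
    """
    centroids = [list(v) for v in vectors[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged: nothing moved
        assignments = new_assignments
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


# Hypothetical documents arriving from the pipeline.
documents = [
    "invoice payment due amount",
    "football match goal score",
    "payment invoice overdue amount",
    "goal score football league",
]

cluster_ids = kmeans(tfidf_vectors(documents), k=2)
print(cluster_ids)
```

The cluster ids carry no meaning on their own. They only say which documents ended up together, which is why the later steps bring humans in to name the groups.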

To summarize, we begin by considering a variety of embedding schemes, apply a handful of unsupervised NLP models, bootstrap using select human input, and build up a labeled training dataset which then gives us more options to improve the sophistication of the classification scheme.
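The bootstrapping step can be sketched in a few lines of plain Python. Note that this is a much simpler stand-in than the collaborative filtering recommender described above: it just spreads each human label to the rest of that document's cluster by majority vote. The function name, the cluster ids, and the labels are hypothetical.

```python
from collections import Counter


def propagate_labels(cluster_ids, human_labels):
    """Spread a few human labels across clusters by majority vote.

    cluster_ids:  one cluster id per document (e.g. from K-Means).
    human_labels: dict mapping document index -> label, covering only the
                  small subset of documents that people have reviewed.
    Returns one label per document, or None when the document's cluster
    has not received any human labels yet.
    """
    votes = {}
    for doc_index, label in human_labels.items():
        votes.setdefault(cluster_ids[doc_index], Counter())[label] += 1
    # The most frequent human label within a cluster becomes that cluster's label.
    cluster_label = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return [
        # A human label for a specific document always takes precedence.
        human_labels.get(i, cluster_label.get(c))
        for i, c in enumerate(cluster_ids)
    ]


# Hypothetical input: five documents in three clusters, two labeled by humans.
example_clusters = [0, 1, 0, 1, 2]
example_human_labels = {0: "finance", 1: "sports"}

propagated = propagate_labels(example_clusters, example_human_labels)
print(propagated)
```

Documents that come back as None belong to clusters nobody has looked at yet, which makes them natural candidates for the next round of human labeling.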

Have you used these techniques? How well have they worked for you? Are there others that are not mentioned above? Let us know!

Lots of Lingo

I’ve used a lot of terms in this post — LSI, LDA, text summarization, deep learning, recommender systems, embedding, and more. I’ll explain these in a future post. Stay tuned.

For more recent thoughts from Jitendra, read Machine Learning for Better Invoice Routing.