What is Positive-Unlabeled Learning

Positive-Unlabeled Learning: What It Is and How It Works

Introduction

Machine learning algorithms are used for many purposes and can give excellent results when appropriately trained and validated. However, most supervised algorithms rest on a fundamental assumption: they need a fully labeled dataset. A labeled dataset pairs input features with assigned labels, which lets the algorithm learn how different features affect the output.

For instance, to build a spam detector, we would normally train on emails already marked as spam or non-spam, so the algorithm learns to distinguish between the two. In practice, the labels are often incomplete: users may flag some emails as spam, but nobody confirms that the remaining emails are legitimate. Those remaining emails are unlabeled, and the algorithm has no confirmed non-spam examples to use as reference points. In such situations, a machine learning technique known as positive-unlabeled learning comes into play.

What is Positive-Unlabeled Learning?

Positive-Unlabeled (PU) learning is a machine learning technique for training a classifier when the dataset contains only labeled positive examples and unlabeled examples; no labeled negative examples are available. It is widely used when negative labels are difficult or costly to obtain, or when the negatives are so numerous and varied that labeling a representative sample of them is impractical.

The assumption in PU learning is that the unlabeled data contains both positive and negative instances. Labeled positive data refers to instances whose class is known for certain to be positive, whereas unlabeled data refers to instances whose class is unknown: each one may be positive or negative.

For instance, if we want to classify tumors as malignant or benign, we may have a set of tumors confirmed as malignant by biopsy, while the remaining tumors were never examined and so have no label. The confirmed malignant cases are the positive data, the unexamined tumors are the unlabeled data, and positive-unlabeled learning is used to train on both.
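To make the data setting concrete, here is a minimal Python sketch of how a PU dataset relates to the true classes. The function name `make_pu_labels` and the `label_frequency` parameter are illustrative assumptions, not part of any library, and in a real project the true labels would of course not be available.

```python
import random

def make_pu_labels(true_labels, label_frequency, seed=0):
    """Hide the true classes and return PU labels (hypothetical helper).

    true_labels: list of 0/1 ground-truth classes (unknown in practice).
    label_frequency: assumed fraction of positives that receive a label.
    Returns a list s where s[i] == 1 means "labeled positive" and
    s[i] == 0 means "unlabeled" (the instance may be positive or negative).
    """
    rng = random.Random(seed)
    return [1 if y == 1 and rng.random() < label_frequency else 0
            for y in true_labels]

true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pu_labels = make_pu_labels(true_labels, label_frequency=0.5)

# Every labeled example is a true positive; no negative is ever labeled.
labeled = [y for y, s in zip(true_labels, pu_labels) if s == 1]
# The unlabeled pool holds all negatives plus the positives nobody labeled.
unlabeled = [y for y, s in zip(true_labels, pu_labels) if s == 0]
```

The point of the sketch is the asymmetry: a label of 1 is trustworthy, while a label of 0 only means "not labeled".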

How Positive-Unlabeled Learning Works

PU learning involves the following steps:

Separate the data into two sets: labeled positive and unlabeled.
Train an initial model using just the positive data and the unlabeled data, provisionally treating the unlabeled data as negative.
Once the model has learned the characteristics of the positive data, use it to identify the unlabeled instances it scores most confidently as probable positives.
Retrain the model with the labeled positive data plus the newly identified probable positives.
Use the retrained model to identify additional probable positives among the remaining unlabeled data, and retrain again.
Repeat until the results stop improving or no further probable positives are found. Whatever remains unlabeled at that point is treated as likely negative.
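The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a simple nearest-centroid rule stands in for "the model", and the names `iterative_pu`, `margin`, and `max_rounds` are hypothetical choices made for this example rather than a standard API.

```python
def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def iterative_pu(positives, unlabeled, margin=0.5, max_rounds=10):
    """Grow the positive set by repeatedly promoting probable positives.

    margin is an assumed confidence knob: an unlabeled point is promoted
    only if its squared distance to the positive centroid is below
    margin times its squared distance to the centroid of the remaining
    unlabeled points.
    """
    positives = list(positives)
    remaining = list(unlabeled)
    for _ in range(max_rounds):
        if not remaining:
            break
        # "Train": summarize each set by its centroid.
        pos_c = centroid(positives)
        unl_c = centroid(remaining)
        # Identify probable positives among the unlabeled points.
        promoted = [x for x in remaining
                    if sq_dist(x, pos_c) < margin * sq_dist(x, unl_c)]
        if not promoted:
            break  # nothing confident enough is left, so stop
        # Retrain next round with the enlarged positive set.
        positives.extend(promoted)
        remaining = [x for x in remaining if x not in promoted]
    return positives, remaining

known_pos = [[1.0, 1.0], [1.2, 0.9]]
pool = [[0.9, 1.1], [1.1, 1.0], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
grown, rest = iterative_pu(known_pos, pool)
```

On this toy data the two unlabeled points near the known positives are promoted in the first round, and the three distant points stay in `rest` as likely negatives. A stricter `margin` promotes fewer points per round, which trades coverage for a lower risk of absorbing negatives.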

Pros and Cons of Positive-Unlabeled Learning

Pros:

PU learning is relatively easy to implement and practical when dealing with large datasets.
It can outperform standard supervised learning when the available negative samples are too few or unrepresentative.
It does not require additional human labeling or manual quality control for the negative class.
PU learning can provide good results at low cost, and because it uses the labeled positives, its models are typically more accurate than purely unsupervised algorithms.
PU learning can be more robust to unreliable negative labels, which are common in large-scale datasets, because uncertain examples are left unlabeled instead of being trusted as negatives.

Cons:

PU learning requires a labeled positive set that is large and representative enough, because the positive data provides the reference point for identifying probable positives in the unlabeled data. If the positive dataset is too small, positive-unlabeled learning may not work effectively.
Iterative training is costly in terms of computational time and resources.
PU learning still depends on reliably labeled positives, so it is not practical where even those are too costly to obtain.

Applications of Positive-Unlabeled Learning

PU learning is widely used in various fields such as:

Fraud detection - finding customers who exhibit fraudulent behavior from a set of confirmed fraud cases, without any confirmed example of what an honest customer looks like.
Biomedical research - discovering rare disease cases when only a few confirmed cases are available and no patients are confirmed to be disease-free.
Online advertising - identifying click-bots from a few known bot clicks, without labeled data on what a genuine human click looks like.
Sentiment analysis - labeling user-generated content such as reviews and tweets when only examples of one sentiment have been labeled.

Challenges in Positive-Unlabeled Learning

Positive-Unlabeled learning comes with its own set of challenges. The biggest challenge is to ensure that the identification of probable positives is accurate. If the algorithm misses true positives or misclassifies a negative instance as a positive, the error is fed back into training and the model's performance will degrade.

Another problem is to ensure that the model does not learn to treat hidden positives as negatives. Because the unlabeled set contains positive instances, a model that treats everything unlabeled as negative will assign genuine positive examples to the negative class, defeating the purpose of the technique.
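A short derivation shows how strong this effect can be. It relies on an additional assumption not made elsewhere in this article: that the labeled positives are a random sample of all positives. The symbols are introduced here for illustration: y is the true class, s = 1 means the instance is labeled, and c is the fraction of positives that received a label.

```latex
\begin{align*}
P(s = 1 \mid x) &= P(y = 1 \mid x)\, P(s = 1 \mid y = 1, x) \\
                &= c \, P(y = 1 \mid x), \qquad c = P(s = 1 \mid y = 1).
\end{align*}
% Worked example: if only a quarter of positives are labeled (c = 0.25),
% an instance with P(y = 1 | x) = 0.9 gets P(s = 1 | x) = 0.25 * 0.9 = 0.225,
% which falls below a 0.5 decision threshold.
```

In other words, a model trained only to separate labeled from unlabeled data would place this almost certainly positive instance in the negative class unless its scores are rescaled or its threshold is lowered.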

Conclusion

Positive-Unlabeled learning is a useful machine learning technique when labeled negative instances are either unavailable or costly to obtain. It can cope with large-scale datasets and is effective where human resources for labeling are limited and the cost of labeling is high. Despite its challenges, positive-unlabeled learning can provide good results at low cost and has applications in many fields.
