What is Bayesian Optimization?
Understanding Bayesian Optimization and its application in Machine Learning
Bayesian Optimization, also known as Sequential Model-Based Optimization (SMBO), is a probabilistic search technique predominantly used to find the global maximum or minimum of an unknown objective function that is costly to evaluate. It is a robust and efficient approach to optimizing black-box functions with high-dimensional, non-linear, or non-convex optimization landscapes. Bayesian Optimization is a much sought-after technique, particularly in the field of Machine Learning (ML), when it comes to hyperparameter tuning.
Hyperparameters are model settings that cannot be learned from the training data and must be set before training. They include learning rates, regularization coefficients, weight initialization schemes, and so on. Selecting hyperparameters accurately is a complex and time-consuming process, since even small changes can significantly affect the performance of a machine learning model. That is why Bayesian Optimization comes into the picture for hyperparameter tuning in Machine Learning.
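To make the distinction concrete, here is a minimal sketch, using scikit-learn's SVC purely as an illustrative example, of hyperparameters that are fixed before training versus parameters that are learned from the data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hyperparameters: chosen by the practitioner *before* training.
model = SVC(C=1.0, kernel="rbf", gamma=0.1)

# Parameters: learned *from* the training data during fit().
model.fit(X, y)
print(model.support_vectors_.shape)  # learned support vectors, not set by hand
```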
In this article, we will cover the basics of Bayesian Optimization, the Bayesian Optimization process, its application in Machine Learning, and some examples to explore various Bayesian Optimization techniques.
Bayesian Optimization process
Bayesian Optimization works by constructing a probability model that describes how the objective function behaves as a function of the hyperparameters. This model then guides the selection of hyperparameters at which to evaluate the objective function. The model is updated iteratively, by evaluating the objective function at promising hyperparameters at each iteration, until either the allowed number of iterations is exhausted or the model can no longer obtain further improvement.
To be more precise, the Bayesian Optimization process can be summarized as:
- Step 1 - Define a probability model that estimates the objective function over the range of available hyperparameters;
- Step 2 - Sample a few points and evaluate the objective function to update the probability model;
- Step 3 - Identify the hyperparameters that maximize the probability of obtaining the highest objective value, and sample new points based on that;
- Step 4 - Evaluate the objective function at the new sample points and update the probability model;
- Step 5 - Repeat Steps 3 and 4 until no better objective values can be obtained or the iteration budget is exhausted. (A minimal code sketch of this loop is shown below.)
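The following sketch illustrates these five steps on a toy one-dimensional objective. It uses scikit-learn's GaussianProcessRegressor as the probability model and Expected Improvement as the acquisition function; the toy function, candidate grid, and iteration counts are arbitrary choices for illustration, not a specific library's built-in optimizer.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Toy "expensive" black-box function we want to maximize.
    return -(x - 2.0) ** 2 + 3.0

# Step 2: sample a few initial points and evaluate the objective.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(4, 1))
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
candidates = np.linspace(-5, 5, 500).reshape(-1, 1)

for _ in range(15):                        # Step 5: iterate
    gp.fit(X, y)                           # Steps 1 and 4: update the probability model
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y.max()
    # Step 3: Expected Improvement acquisition over the candidate grid.
    imp = mu - best
    z = imp / np.maximum(sigma, 1e-9)
    ei = imp * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    # Step 4: evaluate the objective at the proposed point and record it.
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

print("best x:", X[np.argmax(y)], "best value:", y.max())
```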
Bayesian Optimization in Machine Learning
Bayesian Optimization for Machine Learning is an iterative process used to find the optimal set of hyperparameters for a specific task. In ML, Bayesian Optimization has been applied for hyperparameter tuning in various models, including neural networks, Random Forests, Gradient Boosting, SVMs, etc.
Hyperparameter tuning is the process of trying out different hyperparameter settings in a machine learning model to get the best possible results. When training machine learning models, even subtle changes in hyperparameters can lead to significant changes in the score. Traditional hyperparameter optimization methods such as random search, grid search, and manual search cannot guarantee optimal results or guard against overfitting, and are hence often wasteful and time-consuming. Bayesian Optimization, on the other hand, is much more effective when it comes to hyperparameter optimization.
Let’s take an example of hyperparameter tuning for a Support Vector Machine (SVM) model on a classification task. The SVM algorithm has several hyperparameters that need to be set, including the kernel type, regularization parameter, and gamma. To find the best hyperparameters, we can use the Bayesian Optimization process as follows:
- Step 1 - Choose an acquisition function, for example the Gaussian Process Upper Confidence Bound (UCB), which quantifies the trade-off between exploration and exploitation.
- Step 2 - Identify hyperparameters, define their range and choose prior distributions for them.
- Step 3 - Train the SVM for the initial set of hyperparameters selected randomly, and evaluate the model.
- Step 4 - Use the acquisition function to estimate how likely each candidate set of hyperparameters is to beat the current best set, then evaluate the most promising candidate and update the probability model.
- Step 5 - Iterate this process until you achieve the desired improvement.
The above five steps form a complete Bayesian Optimization process applied to solve the hyperparameter tuning problem in SVMs. The process can be modified and applied to any other models as well.
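As an illustration of the steps above, here is a hedged sketch using the scikit-optimize library (skopt), assuming it is installed, to tune an SVM's C, gamma, and kernel with cross-validation. The search ranges, dataset, and iteration count are arbitrary choices for the example; "LCB" is skopt's lower-confidence-bound acquisition, the minimization analogue of UCB since we minimize the negative accuracy.

```python
from skopt import gp_minimize
from skopt.space import Real, Categorical
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Step 2: hyperparameters, their ranges, and (log-uniform) priors.
space = [
    Real(1e-3, 1e3, prior="log-uniform", name="C"),
    Real(1e-4, 1e1, prior="log-uniform", name="gamma"),
    Categorical(["rbf", "poly"], name="kernel"),
]

def objective(params):
    C, gamma, kernel = params
    model = SVC(C=C, gamma=gamma, kernel=kernel)
    # gp_minimize minimizes, so return the negative cross-validated accuracy.
    return -cross_val_score(model, X, y, cv=3).mean()

# Steps 3-5: fit the surrogate, propose new points, and iterate.
result = gp_minimize(objective, space, acq_func="LCB", n_calls=30, random_state=0)
print("best params:", result.x, "best CV accuracy:", -result.fun)
```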
Types of Acquisition Functions
Acquisition functions drive the Bayesian Optimization process by guiding the selection of the next set of hyperparameters to evaluate. There are numerous acquisition functions available; the most popular ones are briefly discussed here (a short code sketch computing all three follows the list):
- Upper Confidence Bound (UCB) - UCB is widely used in classical bandit problems. It attempts to balance exploitation and exploration by calculating the upper confidence limit of the objective function at each iteration. It is given by:
$x_{t+1} = \arg\max_{x} \{\mu_t(x) + \beta\,\sigma_t(x)\}$, where $x_{t+1}$ is the next hyperparameter setting to test, $\mu_t(x)$ is the posterior mean of the objective function at time $t$, $\sigma_t(x)$ is the posterior standard deviation at time $t$, and $\beta$ is the exploration parameter that balances the exploration-exploitation tradeoff.
- Probability of Improvement (POI) - POI is an acquisition function that maximizes the probability of improvement over the current best value of the objective function. It is given by:
$x_{t+1} = \arg\max_{x} P\big(f(x) \ge m + k\big)$, where $k$ is a positive constant that controls the amount of improvement to achieve and $m$ is the current best objective value.
- Expected Improvement (EI) - EI is the expectation of improvement over the current best hyperparameter setting. It is given by:
$x_{t+1} = \arg\max_{x} \mathrm{EI}(x)$, with $\mathrm{EI}(x) = \mathbb{E}\big[\max(f(x) - y_{\max},\, 0)\big]$, where $y_{\max}$ is the maximum objective value achieved so far.
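For reference, all three acquisition functions can be computed directly from a surrogate's posterior mean and standard deviation under a Gaussian assumption. The sketch below does so with NumPy and SciPy; the values of β, k, and the example predictions are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

def ucb(mu, sigma, beta=2.0):
    # Upper Confidence Bound: mu(x) + beta * sigma(x)
    return mu + beta * sigma

def probability_of_improvement(mu, sigma, best, k=0.01):
    # P(f(x) >= best + k) under a Gaussian posterior
    sigma = np.maximum(sigma, 1e-9)
    return norm.cdf((mu - best - k) / sigma)

def expected_improvement(mu, sigma, best):
    # E[max(f(x) - best, 0)] under a Gaussian posterior
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best) / sigma
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

# Example: score three candidate points given a surrogate's predictions.
mu = np.array([0.80, 0.85, 0.75])
sigma = np.array([0.05, 0.02, 0.10])
best = 0.84
print(ucb(mu, sigma))
print(probability_of_improvement(mu, sigma, best))
print(expected_improvement(mu, sigma, best))
```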
Examples of Bayesian Optimization
Below are a few examples of Bayesian Optimization in action for hyperparameter tuning in different ML models.
Gradient Boosting: It is a popular ensemble-based ML method which relies on decision trees, and Bayesian Optimization can be used to identify the best set of hyperparameters for an effective Gradient Boosting model.
To identify the optimal hyperparameters for Gradient Boosting, one can use Bayesian Optimization with Expected Improvement (EI) as the acquisition function, as follows (a code sketch appears after these steps):
- Step 1 - Define hyperparameters and their ranges; popular Gradient Boosting parameters include the learning rate, number of trees, subsample ratio of the training instances, maximum allowed depth, minimum number of samples required to split an internal node, etc.
- Step 2 - Initialize the objective-function results, setting the best value seen so far to the minimum (worst-case) value before any evaluations.
- Step 3 - Set the iteration budget and initialize the count of completed iterations to zero.
- Step 4 - Fit a Gaussian Process regressor to the historical data points collected so far, and use the Expected Improvement acquisition function to propose new test points at which to evaluate the objective function. Train and evaluate the Gradient Boosting model with the proposed configurations and append the results to the historical data points.
- Step 5 - Choose the point that corresponds to the highest evaluated value and return the maximum value for the objective function.
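A hedged sketch of these steps, again using scikit-optimize's gp_minimize with acq_func="EI" (assuming skopt is installed) and scikit-learn's GradientBoostingClassifier; the dataset, ranges, and iteration count are illustrative choices:

```python
from skopt import gp_minimize
from skopt.space import Integer, Real
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Step 1: hyperparameters and their ranges.
space = [
    Real(0.01, 0.3, prior="log-uniform", name="learning_rate"),
    Integer(50, 500, name="n_estimators"),
    Real(0.5, 1.0, name="subsample"),
    Integer(2, 8, name="max_depth"),
    Integer(2, 20, name="min_samples_split"),
]

def objective(params):
    lr, n_est, subsample, depth, min_split = params
    model = GradientBoostingClassifier(
        learning_rate=lr, n_estimators=n_est, subsample=subsample,
        max_depth=depth, min_samples_split=min_split, random_state=0)
    # Minimize the negative cross-validated accuracy.
    return -cross_val_score(model, X, y, cv=3).mean()

# Steps 2-5: GP surrogate plus Expected Improvement, iterated for n_calls evaluations.
result = gp_minimize(objective, space, acq_func="EI", n_calls=30, random_state=0)
print("best params:", result.x, "best CV accuracy:", -result.fun)
```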
Random Forests: Random Forest is a robust, tree-based ensemble ML method that has several hyperparameters, including the number of estimators, split criterion, maximum depth, maximum features, whether to bootstrap samples, etc. We can use Bayesian Optimization techniques to find the optimal set of hyperparameters for the Random Forest model.
To identify the optimal hyperparameters, one can again use Bayesian Optimization with Expected Improvement (EI) as the acquisition function, as follows (a code sketch appears after these steps):
- Step 1 - Define hyperparameters and their ranges for the Random Forest model.
- Step 2 - Initialize the objective-function results with a minimum (worst-case) value.
- Step 3 - Fit a surrogate model (e.g. a Gaussian Process regressor) to the historical data points and use the Expected Improvement acquisition function to propose new test points at which to evaluate the objective function. Test the Random Forest model using the proposed configurations and append the results to the historical data points.
- Step 4 - Repeat the evaluation loop for the allotted number of iterations, tracking the point that corresponds to the highest value evaluated so far.
- Step 5 - Return the maximum value for the objective function.
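A sketch of the same idea for a Random Forest, this time using skopt's BayesSearchCV wrapper, which runs the surrogate-model loop internally (assuming scikit-optimize is installed; the dataset, ranges, and iteration count are illustrative):

```python
from skopt import BayesSearchCV
from skopt.space import Categorical, Integer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 1: hyperparameter ranges for the Random Forest.
search_spaces = {
    "n_estimators": Integer(100, 500),
    "max_depth": Integer(3, 20),
    "max_features": Categorical(["sqrt", "log2"]),
    "criterion": Categorical(["gini", "entropy"]),
    "bootstrap": Categorical([True, False]),
}

# Steps 2-5: BayesSearchCV proposes and cross-validates one configuration
# per iteration, updating its surrogate model after each evaluation.
opt = BayesSearchCV(RandomForestClassifier(random_state=0), search_spaces,
                    n_iter=25, cv=3, random_state=0)
opt.fit(X, y)
print("best params:", opt.best_params_, "best CV accuracy:", opt.best_score_)
```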
Advantages and Disadvantages of Bayesian Optimization
Bayesian Optimization has numerous advantages, including:
- Works well with black-box functions that are expensive to evaluate.
- It is model-based, making it a more efficient optimization tool.
- Maintains a probability distribution over the objective across the entire hyperparameter space; based on past evaluations, the BO process fits a regression model (the surrogate) to represent this distribution.
- Provides the optimal solution that satisfies the trade-off between exploration and exploitation effectively.
- BO explicitly handles uncertainty, probabilities, and prior assumptions, which is helpful for applications that require quantified confidence in the results.
- BO requires far fewer objective evaluations than brute-force techniques such as grid search, since it focuses its evaluations on the most promising regions of the search space.
However, Bayesian Optimization has a few disadvantages which must be considered, including:
- The per-iteration overhead tends to be higher than for simpler methods such as random or grid search, especially with complex surrogate models or high-dimensional search spaces.
- The initial assumptions encoded in the prior influence the optimization results, and picking the right prior distribution can be a daunting task.
- Bayesian Optimization performance is highly sensitive to the acquisition function chosen, which can lead to slow convergence or, worse, getting stuck in local optima.
- Theoretical understanding of convergence rates is still limited, particularly for high-dimensional problems.
Conclusion
Bayesian Optimization is an efficient, model-based approach to optimizing complex objective functions with costly function evaluations. It is a much sought-after technique, particularly in the field of Machine Learning for hyperparameter tuning. The Bayesian Optimization process involves several steps, including defining a probability model, selecting a suitable acquisition function, and iterating until either the desired improvement is achieved or the maximum number of iterations is reached. There are several types of acquisition functions available, each with its strengths and weaknesses. Bayesian Optimization, like any other technique, has its advantages and drawbacks, but the benefits outweigh the shortcomings when used correctly. Remember, the faster we can tune hyperparameters, the faster we can iterate on model improvements and achieve better overall model performance.