How to Fine-Tune and Deploy Mistral AI on Google Cloud Using Colab

Imagine teaching a powerful AI model like Mistral AI new skills without costly hardware or an intricate setup. Until recently, fine-tuning large language models (LLMs) required both multiple expensive GPUs and considerable technical expertise. Thanks to techniques like QLoRA and free platforms like Google Colab, anyone can now fine-tune a powerful AI model on a single GPU.

In this beginner-friendly post, I will walk you through fine-tuning the Mistral AI model on your own data using Google Colab, entirely for free. After fine-tuning, we will deploy your customized model to Google Cloud Platform (Vertex AI), making it quickly accessible for personal or business use.

Ready to dive in and build your own powerful AI model effortlessly? Let's start!

Why Mistral AI?

Mistral AI is an advanced open-source language model that can handle tasks such as content creation, summarization, chatbot building and question answering. Fine-tuning teaches it your particular use case, making it smarter and more accurate for your needs.

General Overview

We'll cover everything step-by-step:

  • Loading and Quantizing the Mistral AI model using BitsAndBytes.
  • Fine-tuning using QLoRA (Quantized Low-Rank Adaptation).
  • Saving and exporting the model.
  • Deploying the fine-tuned model on Google Cloud's Vertex AI.

Google Colab is free, user-friendly and provides access to GPUs powerful enough to fine-tune large language models without any complex configuration.
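
Before starting, it's worth confirming that your Colab runtime actually has a GPU attached (Runtime → Change runtime type → GPU). A quick check:

import torch

# Verify that Colab has given us a GPU runtime
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))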

Why Choose Google Colab & Google Cloud?

Google Colab provides free GPU access, making it a great place to test your ideas quickly. It is ideal for small-scale experiments or first runs on public or private datasets, and it makes debugging easy.

When you're ready to deploy your fine-tuned model, Google Cloud Platform (GCP) provides powerful tools, including:

  • Vertex AI: Easy, managed deployment of machine learning models.
  • Model Registry: Central place to manage your trained models.
  • Model Garden: Ready-to-use, scalable AI models.

Fine-tuning with QLoRA: A Simple Explanation

Fine-tuning typically involves adjusting millions or billions of parameters in a model, making it resource-intensive. To overcome this, we use Quantization and QLoRA:

  • Quantization shrinks model weights from 32-bit precision down to just 4 bits, drastically reducing memory use: Mistral 7B goes from roughly 24GB down to about 4GB, which lets you fine-tune it even on a free Colab GPU.
  • QLoRA streamlines fine-tuning further by freezing the quantized model weights and training only small low-rank "adapter" matrices. These adapters contain far fewer parameters, which keeps resource requirements low while preserving accuracy.
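
To see why the adapters are so cheap, here is a small back-of-the-envelope sketch. The 4096x4096 layer size and rank 16 are illustrative assumptions, not values read from the model config:

# Rough parameter count for one LoRA-adapted projection layer
d_in, d_out, rank = 4096, 4096, 16           # illustrative sizes, not model-specific
full_params = d_in * d_out                   # ~16.8M frozen weights in the layer
lora_params = rank * (d_in + d_out)          # ~131K trainable adapter weights (A and B)
print(f"trainable fraction: {lora_params / full_params:.2%}")   # roughly 0.78%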

How to Fine-tune Mistral AI in Colab

Step 1: Install Necessary Libraries in Google Colab

! pip install bitsandbytes transformers peft accelerate
! pip install datasets trl ninja packaging
! pip install bitsandbytes --no-cache-dir --force-reinstall --index-url https://download.pytorch.org/whl/cu118
!pip install flash-attn --no-build-isolation

Importing Necessary Libraries

import torch
import os
import sys
import json
import IPython
from datetime import datetime
from datasets import load_dataset
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from transformers import (
   AutoModelForCausalLM,
   AutoTokenizer,
   BitsAndBytesConfig,
   TrainingArguments,
)
from trl import SFTTrainer

Step 2: Load Model and Tokenizer

Here, we're selecting and loading our base model called Mistral-7B. We use the tokenizer from HuggingFace to convert our text data into numbers (tokens) that the model understands. We're also defining a special token for padding, which helps ensure all input sequences have the same length during training.

# Choose the base model you want
model_name = "Hugofernandez/Mistral-7B-v0.1-colab-sharded"
# Set the device
device = 'cuda'
# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.unk_token_id

Configure Model Quantization (4-bit)

To run large models efficiently, we reduce their memory usage using 4-bit quantization.

compute_dtype = getattr(torch, "float16")
print(compute_dtype)
bnb_config = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_quant_type="nf4",
       bnb_4bit_compute_dtype=compute_dtype,
       bnb_4bit_use_double_quant=True,
)

Step 3: Fine-Tuning the Mistral AI Model Using QLoRA

Now, we'll set up the model for fine-tuning. Here's exactly what's happening:

Load the Quantized Model:

# Load the base model and quantize it to 4-bit
model = AutoModelForCausalLM.from_pretrained(
         model_name,
         quantization_config=bnb_config,
         use_flash_attention_2=False,
         device_map={"": 0},
)

Configure QLoRA for Fine-tuning

We'll use QLoRA, a method that efficiently fine-tunes only specific small parts of the model instead of the whole thing:

peft_config = LoraConfig(
       lora_alpha=16,
       lora_dropout=0.05,
       r=16,
       bias="none",
       task_type="CAUSAL_LM",
       target_modules=['k_proj', 'q_proj', 'v_proj', 'o_proj', 'gate_proj', 'down_proj', 'up_proj', 'lm_head']
)

Preparing Model for Fine-tuning:

Next, prepare the model to work smoothly with quantized training:

model = prepare_model_for_kbit_training(model)
model.config.pad_token_id = tokenizer.pad_token_id
model.config.use_cache = False

Training Configuration:

Finally, define exactly how training will happen:

training_arguments = TrainingArguments(
       output_dir="./results",
       evaluation_strategy="epoch",
       optim="paged_adamw_8bit",
       per_device_train_batch_size=4,
       per_device_eval_batch_size=4,
       gradient_accumulation_steps=1,
       log_level="debug",
       save_steps=500,
       logging_steps=20,
       learning_rate=4e-4,
       num_train_epochs=1,
       warmup_steps=100,
       lr_scheduler_type="constant",
)

Step 4: Downloading Your Dataset (IMDB example)

We'll now download a small dataset to use for fine-tuning. Here we're using a simple dataset called tiny-imdb, perfect for testing and demonstration:

!git clone https://huggingface.co/datasets/iamholmes/tiny-imdb

Load Your Dataset

We load our dataset (tiny-imdb) using Hugging Face's easy-to-use load_dataset function. Since the dataset files are in .parquet format, we specify 'parquet' explicitly. Now our data is ready for fine-tuning the Mistral AI model!

from datasets import load_dataset
data_files = {'train': "/content/tiny-imdb/data/train-00000-of-00001.parquet", 'test': "/content/tiny-imdb/data/test-00000-of-00001.parquet"}
# Use 'parquet' instead of 'csv' to load parquet files
dataset = load_dataset('parquet', data_files=data_files)
print(dataset)

Now let's make sure the additional libraries (transformers, datasets and trl) are installed (they already are if you ran the earlier install cell) and import DataCollatorForLanguageModeling, a helpful tool that neatly groups our data into batches during fine-tuning, keeping training smooth and efficient.

!pip install transformers datasets trl
from transformers import DataCollatorForLanguageModeling

Step 5: Setting Up the Trainer

We use SFTTrainer to fine-tune the Mistral AI model efficiently. It manages training, evaluation and optimization while applying QLoRA for memory-efficient fine-tuning. The DataCollatorForLanguageModeling ensures smooth text batching and mlm=False sets it for causal language modeling instead of masked modeling.

trainer = SFTTrainer(
       model=model,
       train_dataset=dataset['train'],
       eval_dataset=dataset['test'],
       peft_config=peft_config,
       # dataset_text_field="text",
       # packing=True,
       # max_seq_length=512,
       tokenizer=tokenizer,
       args=training_arguments,
       data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False), # Add data_collator
)

Step 6: Checking Trainable Parameters and Evaluating the Model

Before starting training, we check how many parameters are being fine-tuned using print_trainable_parameters(). This function counts the total parameters and the ones being updated during training. Since we're using QLoRA, only a small fraction of parameters are fine-tuned, making it memory-efficient. Finally, we run trainer.evaluate() to assess the model's initial performance before training.

def print_trainable_parameters(model):
   """
   Prints the number of trainable parameters in the model.
   """
   trainable_params = 0
   all_param = model.num_parameters()
   for _, param in model.named_parameters():
       if param.requires_grad:
           trainable_params += param.numel()
   print(
       f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
   )

print_trainable_parameters(trainer.model)
trainer.evaluate()
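
Once you've confirmed that only a small fraction of the parameters are trainable, start the actual fine-tuning run; the remaining steps assume it has completed. With the settings above, this trains for a single epoch:

# Launch fine-tuning (one epoch, as configured in TrainingArguments)
trainer.train()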

Step 7: Testing the Fine-Tuned Model

After training, we test how well the fine-tuned Mistral AI model generates responses. We provide a sample prompt (eval_prompt) asking about neural networks, tokenize it and move it to the GPU. The model is set to evaluation mode (model.eval()), ensuring it doesn't update weights during inference. Using torch.no_grad(), we generate a response, decode it back into readable text and print the result. Finally, we switch the model back to training mode (model.train()) for further fine-tuning if needed.

eval_prompt = """<s>[INST]What is a Neural Network and how does it work?[/INST]"""
model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")
model.eval()
with torch.no_grad():
   print(tokenizer.decode(model.generate(**model_input, max_new_tokens=256, pad_token_id=2)[0], skip_special_tokens=True))
model.train()

Step 8: Saving and Merging the Fine-Tuned Model

After fine-tuning, we save the updated model weights and merge them with the base model for deployment. First, we store the QLoRA fine-tuned model under the name 'MistralAI_QLORA'. Then, we reload the base model and apply the fine-tuned LoRA weights using PeftModel. The merge_and_unload() function combines the fine-tuned layers into the main model, removing the need for separate LoRA adapters. Finally, the merged model is saved to a directory (MistralAI_finetuned) for easy deployment.

new_model = 'MistralAI_QLORA'
trainer.model.save_pretrained(new_model)
# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(model_name)
# Load fine-tuned LoRA model
peft_model = PeftModel.from_pretrained(base_model, new_model)
# Merge LoRA weights into the base model
merged_model = peft_model.merge_and_unload()
# Save the final fine-tuned model
output_merged_dir = "/content/MistralAI_finetuned"
os.makedirs(output_merged_dir, exist_ok=True)
merged_model.save_pretrained(output_merged_dir, safe_serialization=False)
tokenizer.save_pretrained(output_merged_dir)

Step 9: Preparing for Deployment on Google Cloud (Vertex AI)

Before deploying the fine-tuned Mistral AI model, you need to set up Google Cloud Platform (GCP). Follow these steps:

  • Create a Google Cloud Account (if you don't have one).
  • Set up a new project in GCP.
  • Create a Service Account under IAM & Admin → Service Accounts.
  • Assign Permissions: add the following roles to the service account:
      • Vertex AI User
      • Storage Object Admin
  • Enable Vertex AI APIs: go to Vertex AI in GCP and click Enable Recommended APIs.
  • Create a Cloud Storage Bucket: navigate to Cloud Storage and create an empty bucket to store your model.

Once your GCP setup is complete, install the required libraries in your Colab notebook to interact with Vertex AI:

! pip3 install --upgrade google-cloud-aiplatform
! pip3 install ipython pandas[output_formatting] google-cloud-language==2.10.0

Restarting the Notebook Kernel

After installing new packages, it's important to restart the Colab notebook kernel to apply the changes. This step ensures that all newly installed dependencies (like Google Cloud AI Platform) work correctly without conflicts.

Use the following command to restart the kernel automatically:

# Restart the notebook kernel after installs.
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

Importing Google Cloud Libraries

import os
from google.cloud import aiplatform, language, storage
from google.colab import auth as google_auth

Step 10: Configuring Google Cloud Variables

Now, we define key variables needed to interact with Google Cloud Platform (GCP) and Vertex AI for deploying our fine-tuned Mistral AI model.

# Set your Google Cloud project ID
PROJECT_ID = "singular-antler-421904"  # Replace with your actual GCP project ID
# Set the region where your Vertex AI resources will be deployed
REGION = "{REGION}"  # Example: "us-central1"
# Define the Google Cloud Storage bucket where the model will be stored
BUCKET_URI = "gs://{YOUR_BUCKET_NAME}"  # Replace with your bucket name
# Define paths for storing the model and staging files
BASE_MODEL_BUCKET = os.path.join(BUCKET_URI, "final_merged_mistral")  # Folder for the final fine-tuned model
STAGING_BUCKET = os.path.join(BUCKET_URI, "staging")  # Folder for temporary files during deployment
# Set the service account used for Vertex AI deployment
SERVICE_ACCOUNT = "{YOUR_SERVICE_ACCOUNT}"  # Replace with your service account email

Authenticate Google Cloud Account

Before interacting with Google Cloud Platform (GCP), we need to authenticate our Google account in Google Colab. This procedure allows us to access Vertex AI and Cloud Storage securely.

google_auth.authenticate_user()

Set Up Google Cloud Project and Enable APIs

After authenticating your Google Cloud account, we need to set the active project and enable necessary APIs for deployment.

! gcloud config set project $PROJECT_ID
! gcloud services enable language.googleapis.com
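
The Vertex AI SDK also needs to know which project, region and staging bucket to work with before it can create models and endpoints. If you haven't initialized it elsewhere in your notebook, a minimal setup using the variables defined above looks like this:

# Point the Vertex AI SDK at your project, region and staging bucket
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)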

Upload the Fine-Tuned Model to Google Cloud Storage

Now that we have authenticated and configured our Google Cloud project, we need to upload the fine-tuned Mistral AI model to Google Cloud Storage (GCS). This step ensures that the model is stored securely and is ready for deployment on Vertex AI.

# The kernel restart above cleared earlier variables, so redefine the model directory
output_merged_dir = "/content/MistralAI_finetuned"
# Upload the directory to Google Cloud Storage
!gsutil -m cp -r {output_merged_dir} {BASE_MODEL_BUCKET}

Define the Docker Image for Model Deployment

To deploy the fine-tuned Mistral AI model on Vertex AI, we need a pre-configured Docker image that provides the necessary environment for serving the model.

PREDICTION_DOCKER_URI = "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-peft-serve:20231026_1907_RC00"

Deploying the Fine-Tuned Model to Vertex AI

Now that our fine-tuned Mistral AI model is uploaded to Google Cloud Storage, we define a function to deploy it on Vertex AI. This function will create an endpoint, upload the model and deploy it with the required resources.

def deploy_model(
   model_name: str,
   base_model_id: str,
   finetuned_lora_model_path: str,
   service_account: str,
   precision_loading_mode: str = "float16",
   machine_type: str = "n1-standard-8",
   accelerator_type: str = "NVIDIA_TESLA_V100",
   accelerator_count: int = 1,
) -> tuple[aiplatform.Model, aiplatform.Endpoint]:
   """Deploys trained models into Vertex AI."""
   endpoint = aiplatform.Endpoint.create(display_name=f"{model_name}-endpoint")
   serving_env = {
       "BASE_MODEL_ID": base_model_id,
       "PRECISION_LOADING_MODE": precision_loading_mode,
   }
   if finetuned_lora_model_path:
       serving_env["FINETUNED_LORA_MODEL_PATH"] = finetuned_lora_model_path
   model = aiplatform.Model.upload(
       display_name=model_name,
       serving_container_image_uri=PREDICTION_DOCKER_URI,
       serving_container_ports=[7080],
       serving_container_predict_route="/predictions/peft_serving",
       serving_container_health_route="/ping",
       serving_container_environment_variables=serving_env,
   )
   model.deploy(
       endpoint=endpoint,
       machine_type=machine_type,
       accelerator_type=accelerator_type,
       accelerator_count=accelerator_count,
       deploy_request_timeout=1800,
       service_account=service_account,
   )
   return model, endpoint

When deploying your fine-tuned Mistral AI model on Vertex AI, you can customize the machine type and GPU accelerator based on your budget and performance needs. The default setup uses an N1-standard-8 machine (8 vCPUs, 30GB RAM) with an NVIDIA Tesla V100 GPU, which offers a balance of speed and cost. If you're looking for a cheaper option, the NVIDIA Tesla T4 is available, though it may be slower for large models. For high-performance tasks, the NVIDIA A100 provides the fastest processing but comes at a significantly higher cost. Depending on your requirements, you can adjust the accelerator type and GPU count to optimize for efficiency and budget. You can check Vertex AI pricing to compare costs in your region.

machine_type = "n1-standard-8"
accelerator_type = "NVIDIA_TESLA_V100"
accelerator_count = 1

Deploying the Fine-Tuned Model on Vertex AI

Now, we deploy our Mistral AI model to Google Vertex AI, making it accessible via an endpoint. We set the precision loading mode to "float16" for efficient inference and define a custom model name and version to manage deployments easily.

precision_loading_mode = "float16"
model_name = "MistralIA_finetuned" # give any name you want
version = "1" # you can increment this number each time you want to do a new deployment
model_vertex, endpoint_vertex = deploy_model(
   model_name=model_name+version,
   base_model_id=BASE_MODEL_BUCKET,
   finetuned_lora_model_path="",  # Empty because the LoRA weights were already merged into the base model
   service_account=SERVICE_ACCOUNT,
   precision_loading_mode=precision_loading_mode,
   machine_type=machine_type,
   accelerator_type=accelerator_type,
   accelerator_count=accelerator_count,
)
print("endpoint_name:", endpoint_vertex.name)

Testing the Deployed Mistral AI Model on Vertex AI

Once your model is deployed, you can send a test request to the Vertex AI endpoint to verify its performance. The following code sends a prompt asking, "What is a Neural Network and how does it work?" and retrieves the model's response.

instances = [
   {
       "prompt": "What is a Neural Network and how does it work?",
       "max_tokens": 500,
       "temperature": 1,
       "top_p": 1.0,
       "top_k": 10,
   },
]
response = endpoint_vertex.predict(instances=instances)
print(response)
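
The exact structure of the returned predictions depends on the serving container, so inspect the response before relying on any particular field. With the Vertex AI SDK, the generated text is typically available through response.predictions:

# Each entry in predictions holds the container's output for one input instance
for prediction in response.predictions:
    print(prediction)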

Note: The code is available as a Google Colab notebook. Run the notebook on a T4 GPU.

Conclusion

Fine-tuning and deploying Mistral AI has never been easier, thanks to QLoRA, Google Colab and Vertex AI. With this step-by-step guide, you've learned how to efficiently fine-tune Mistral AI using 4-bit quantization, reducing the need for expensive hardware while maintaining high performance.

We started by setting up Google Colab, loading the Mistral AI model and applying QLoRA for efficient fine-tuning. We then trained the model using SFTTrainer, evaluated its performance and merged the fine-tuned weights. Finally, we deployed the model on Google Cloud's Vertex AI, enabling real-time inference through a cloud-based API.

By leveraging Google Colab for training and Vertex AI for deployment, you now have a powerful, scalable AI model accessible from anywhere. Whether you're working on chatbots, content generation, NLP applications or other AI projects, this approach lets you customize Mistral AI to fit your needs without requiring high-end local GPUs.

Next Steps

  • Try fine-tuning on a different dataset.
  • Experiment with temperature, top-p and top-k values to optimize responses (see the sketch after this list).
  • Deploy the model into production and integrate it with real-world applications.
  • Monitor performance and update your fine-tuned model as needed.
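
As a starting point for that experimentation, here is a small, hypothetical sweep over sampling temperatures against the deployed endpoint (reusing endpoint_vertex from the deployment step); higher temperatures generally produce more varied, creative text:

# Compare responses at different sampling temperatures
for temperature in [0.2, 0.7, 1.0]:
    instances = [{
        "prompt": "Summarize what a neural network is in two sentences.",
        "max_tokens": 200,
        "temperature": temperature,
        "top_p": 0.95,
        "top_k": 40,
    }]
    response = endpoint_vertex.predict(instances=instances)
    print(f"temperature={temperature}:", response.predictions)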

With this guide, you now have everything you need to train, fine-tune and deploy your custom AI model. The possibilities are endless-go ahead and build something amazing!
