Building a Robust GraphRAG System for a Specific Use Case - Part Two
Part Two: The Fine-Tuning Process
In the first part of this series, we embarked on the crucial task of preparing a custom dataset for fine-tuning a large language model (LLM) for text-to-Cypher translation. We generated a diverse set of questions tailored to our specific Neo4j graph database schema and then leveraged another LLM to translate these questions into their corresponding Cypher queries. This process resulted in a high-quality dataset of question-Cypher pairs, which will serve as the foundation for fine-tuning our target LLM in this part.
Fine-Tuning Llama 3.1 with QLoRA
Our objective in this part is to fine-tune a Llama 3.1 8B model using the dataset generated in the previous step. To do this efficiently, we will use a Parameter-Efficient Fine-Tuning (PEFT) method called QLoRA (Quantized Low-Rank Adaptation), implemented with the Unsloth framework.
PEFT (Parameter-Efficient Fine-Tuning)
Parameter-Efficient Fine-Tuning (PEFT) techniques address the challenge of fine-tuning large language models, which is computationally expensive and memory-hungry. Instead of updating all the parameters of a pre-trained LLM, PEFT methods keep the original weights frozen and train only a small set of additional or selected parameters. This drastically reduces the computational burden and memory footprint of fine-tuning, making it feasible to adapt large LLMs even on resource-constrained hardware.
QLoRA (Quantized Low-Rank Adaptation)
QLoRA is a particularly efficient PEFT technique that combines the benefits of Low-Rank Adaptation (LoRA) with quantization. LoRA fine-tunes models by adding small, low-rank matrices to the existing layers, effectively injecting task-specific knowledge without modifying the original model's weights. QLoRA further enhances this approach by applying 4-bit quantization to the pre-trained model's weights, drastically reducing memory consumption while maintaining performance comparable to full 16-bit fine-tuning. This combination of techniques allows for fine-tuning large LLMs, such as Llama 3.1, on relatively modest hardware, even a single GPU with limited memory.
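To make the low-rank idea concrete, here is a minimal, illustrative PyTorch sketch (not Unsloth's internal implementation) showing how a LoRA update adds a pair of small matrices A and B on top of a frozen weight matrix, and how few extra parameters that introduces:
import torch
d, r = 4096, 16                  # hidden size of a Llama layer and the LoRA rank r
alpha = 16                       # scaling factor (lora_alpha)
W = torch.randn(d, d)            # frozen pre-trained weight; QLoRA stores this in 4-bit
A = torch.randn(r, d) * 0.01     # LoRA "A" matrix, trainable (small random init)
B = torch.zeros(d, r)            # LoRA "B" matrix, trainable (zero init, so the update starts at zero)
# Effective weight applied at forward time: W + (alpha / r) * B @ A
W_effective = W + (alpha / r) * (B @ A)
print(f"full weight parameters: {W.numel():,}")              # 16,777,216
print(f"LoRA parameters (A+B):  {A.numel() + B.numel():,}")  # 131,072 (~0.8%)
Only A and B receive gradients during training; the large matrix W stays frozen (and, in QLoRA, quantized to 4-bit), which is where the memory savings come from.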
Unsloth Framework
The Unsloth framework is an open-source solution designed to streamline and simplify the fine-tuning and training of LLMs like Llama 3, Mistral, and Gemma. Developed by a team with experience at NVIDIA, Unsloth focuses on making the fine-tuning process faster, more efficient, and less resource-intensive. It achieves this by incorporating advanced techniques like LoRA and quantization, providing a user-friendly interface, and offering seamless integration with popular tools like Google Colab.
Unsloth's primary goal is to democratize the creation of custom AI models, enabling developers to efficiently build and deploy models tailored to specific needs, regardless of their computational resources. By utilizing Unsloth, we can leverage the power of QLoRA to fine-tune our Llama 3.1 model for text-to-Cypher translation effectively and efficiently.
Installing Dependencies
Here we install the necessary packages for fine-tuning.
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# We have to check which Torch version for Xformers (2.3 -> 0.0.27)
from torch import __version__; from packaging.version import Version as V
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton
Loading the Model
Here we load the pre-trained Llama 3.1 8B model and its tokenizer using the FastLanguageModel class from Unsloth. We specify the maximum sequence length (max_seq_length) to restrict the model's context window, which impacts both compute and memory usage.
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Restricts the context window. Most models support longer contexts, but choose this based on your needs, since longer sequences consume more compute and VRAM.
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
We set the data type (dtype) to None for automatic detection, or we can explicitly choose float16 for older GPUs or bfloat16 for newer Ampere GPUs. We enable 4-bit quantization (load_in_4bit) to reduce memory usage during fine-tuning.
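If you prefer to set dtype explicitly rather than rely on auto-detection, a small check like the following (plain PyTorch calls, not part of Unsloth) picks the right precision for the current GPU:
import torch
# Ampere and newer GPUs support bfloat16 natively;
# older cards such as the T4 or V100 should fall back to float16.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print(f"Selected dtype: {dtype}")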
Configuring the Model for Training
Here we prepare the loaded model for PEFT using QLoRA. We use the get_peft_model function to apply LoRA to specific layers of the model. We define the rank (r) of the LoRA matrices, which determines the number of trainable parameters. Higher ranks can store more information but increase computational and memory costs.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Suggested values: 8, 16, 32, 64, 128. Higher ranks can store more information but increase the computational and memory cost of LoRA.
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16, # Scaling factor for the LoRA updates. A common rule of thumb is to set it equal to r or to double it.
    lora_dropout = 0, # Probability of zeroing out elements in the low-rank matrices for regularization. Any value is supported, but 0 is optimized.
    random_state = 3407,
    use_rslora = True, # Rank-stabilized LoRA, which has been shown to work better ( https://arxiv.org/pdf/2312.03732 )
)
We specify the target_modules, which are the layers where LoRA will be applied. We set lora_alpha (the scaling factor for the updates), lora_dropout (the dropout probability used for regularization), and random_state for reproducibility. We also enable use_rslora, which has been shown to improve performance.
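As a quick sanity check, we can count how many parameters LoRA actually made trainable; with r = 16 the trainable fraction should be well under 1% of the 8B parameters. This is a plain-PyTorch sketch, and the total is approximate for a 4-bit model because quantized weights are stored packed:
# Count trainable vs. total parameters after applying LoRA.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")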
Instruction Tuning Prompt Formatter
Here we define the prompt format and a function for preparing training data. We use the Alpaca format, which consists of an instruction, input, and response. We customize the instruction to guide the model to convert text to Cypher queries based on the provided graph schema.
prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    # `graph.schema` is the Neo4j schema string from the graph object created in Part One.
    # The instruction is the same for every example (instead of reading examples["instruction"]).
    instructions = f"Convert text to cypher query based on this schema: {graph.schema}"
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for input, output in zip(inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = prompt.format(instructions, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
The formatting_prompts_func takes a batch of examples and formats them according to the Alpaca prompt template, adding the end-of-sequence token (EOS_TOKEN) to ensure proper termination of the generated sequences.
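As a quick check (the exact string depends on the tokenizer), you can print the EOS token; it is the same marker we strip from generations later during inference:
print(repr(EOS_TOKEN)) # e.g. '<|end_of_text|>' for the base Llama 3.1 tokenizer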
Loading and Preparing the Dataset
Here we load the dataset generated in the first part, filter out rows with syntax errors or timeouts, and rename the columns to match the expected format for the fine-tuning process.
import pandas as pd
df = pd.read_csv('final_text2cypher.csv')
df = df[(df['syntax_error'] == False) & (df['timeout'] == False)]
df = df[['question','cypher']]
df.rename(columns={'question': 'input','cypher':'output'}, inplace=True)
df.reset_index(drop=True, inplace=True)
We then convert the Pandas DataFrame into a Hugging Face Dataset object and apply formatting_prompts_func to format the examples according to the Alpaca prompt template.
from datasets import Dataset
dataset = Dataset.from_pandas(df)
dataset = dataset.map(formatting_prompts_func, batched = True)
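Before training, it is worth printing one formatted example to confirm that the instruction, the schema, the question, the Cypher query, and the EOS token all appear where expected (a simple sanity check, not required by Unsloth):
# Inspect one fully formatted training example and the dataset size.
print(dataset[0]["text"])
print(f"\nNumber of training examples: {len(dataset)}")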
Creating the Supervised Fine-Tuning Trainer
Here we create the SFTTrainer from the TRL library to fine-tune the model using the prepared dataset. We provide the model, tokenizer, training dataset, text field name, maximum sequence length, and other configurations.
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Set to True to pack short sequences together; can make training up to 5x faster.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # max_steps = 60,
        num_train_epochs = 1,
        learning_rate = 2e-4, # step size used when updating the model's parameters
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
We use the TrainingArguments class from Transformers to define the training parameters, including the batch size, gradient accumulation steps, warmup steps, learning rate, optimizer, weight decay, and other hyperparameters.
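One detail worth spelling out: the effective batch size is the per-device batch size multiplied by the gradient accumulation steps, so each optimizer update here sees 8 examples. A small sketch of the arithmetic:
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
# Each optimizer step accumulates gradients over this many examples.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(f"Effective batch size: {effective_batch_size}") # 8
# With num_train_epochs = 1, the number of optimizer steps is roughly len(dataset) / effective_batch_size.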
Starting the Training
trainer_stats = trainer.train()
This will start the training loop, iterating over the training dataset and updating the model's parameters based on the defined training arguments.
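trainer.train() returns a TrainOutput object whose metrics dictionary follows the Hugging Face Trainer conventions, so a quick way to summarize the run looks like this (a hedged sketch; field names as in Transformers):
# Summarize the finished training run.
print(f"Optimizer steps:      {trainer_stats.global_step}")
print(f"Final training loss:  {trainer_stats.training_loss:.4f}")
print(f"Training runtime (s): {trainer_stats.metrics.get('train_runtime')}")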
Inference
Here we enable native faster inference for the fine-tuned model and define a function for generating Cypher queries.
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
def generate_cypher_query(question):
    inputs = tokenizer(
        [
            prompt.format(
                f"Convert text to cypher query based on this schema: {graph.schema}", # instruction
                question, # input
                "", # output - leave this blank for generation!
            )
        ], return_tensors = "pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True) # increase max_new_tokens for longer queries
    result = tokenizer.batch_decode(outputs)
    # Keep only the text after "### Response:" and strip special tokens and escaped braces.
    cypher_query = (
        result[0]
        .split("### Response:")[1]
        .split("###")[0]
        .strip()
        .replace("<|end_of_text|>", "")
        .replace("<eos>", "")
        .replace("{{", "{")
        .replace("}}", "}")
    )
    return cypher_query
question = "Write your question here .."
cypher_query = generate_cypher_query(question)
The generate_cypher_query function takes a natural language question, formats it according to the Alpaca prompt template, and uses the fine-tuned model to generate a Cypher query. The generated query is then extracted from the model's output and cleaned up.
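To close the loop, the generated query can be executed directly against the database. A hedged sketch, assuming graph is the LangChain Neo4jGraph object from Part One (its query method runs raw Cypher); if you use the official neo4j driver instead, run the query inside a session:
# Run the generated Cypher against Neo4j and print the returned rows.
rows = graph.query(cypher_query)
for row in rows:
    print(row)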
Saving the Model
Here we save the fine-tuned model in the GGUF format. We can choose to save the model in 8-bit quantized format (Q8_0), 16-bit format (f16), or other quantized formats like q4_k_m, depending on the desired trade-off between model size and performance.
# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("model", tokenizer,)
# or Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
# or Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m") # you can use any other format not only "q4_k_m"
Deploying the Model and Creating an OpenAI-Compatible API Endpoint
Installing and Creating the Model with Ollama
Here we install Ollama, a tool for serving LLMs.
curl -fsSL https://ollama.com/install.sh | sh
And create a Modelfile that specifies the path to the saved GGUF model.
touch Modelfile
The Modelfile contains:
FROM /path/to/model.gguf
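Optionally, the Modelfile can also pin generation settings so the served model behaves predictably for query generation. A hedged example (FROM and PARAMETER are standard Ollama Modelfile directives; the values are illustrative). Because the model was fine-tuned on the Alpaca prompt format, you may additionally want a TEMPLATE directive, or simply include the instruction and schema in the messages you send, so inference prompts match the training format:
FROM /path/to/model.gguf
# Deterministic decoding is usually preferable for text-to-Cypher.
PARAMETER temperature 0
# Match the context window used during fine-tuning.
PARAMETER num_ctx 2048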
We then use the ollama create command to build the model for serving.
ollama create llama3.1-cypher -f Modelfile
Starting the Server
Here we start the Ollama server, which will make the fine-tuned model accessible via an API endpoint.
ollama serve
Testing the API
Here we demonstrate how to interact with the deployed model using the OpenAI API client. We initialize the client with the URL of the Ollama server and send a chat completion request with a natural language question.
# pip install openai
from openai import OpenAI
client = OpenAI(
    base_url = 'http://127.0.0.1:11434/v1',
    api_key = 'ollama', # required by the client, but unused by Ollama
)
response = client.chat.completions.create(
    model = "llama3.1-cypher",
    messages = [
        {"role": "user", "content": "Write your question here .. "},
    ]
)
print(response.choices[0].message.content)
The server will use the fine-tuned model to generate a Cypher query and return it as part of the API response.
With our Llama 3.1 model now fine-tuned and deployed as an OpenAI-compatible API endpoint, we possess a powerful tool for translating natural language questions into Cypher queries. This capability lays the groundwork for building a sophisticated question-answering system capable of extracting valuable insights from our graph database. In the final part of this series, we'll explore how to integrate this fine-tuned model with a knowledge extraction component to create a comprehensive Q&A system that empowers users to interact with their data using natural language. Join us as we complete the journey from raw data to insightful answers.