2024-10-14
In this post, I go over zero-shot sentiment analysis using Llama 3.2. First, I wanted to find a good dataset, so I browsed Kaggle and came across this dataset.
The dataset contains user reviews of the Spotify app on Google Play Store.
I love Small Language Models. I can run them with relatively low GPU requirements, or even on CPU, without worrying too much about how slow they're gonna be. So I wanted to compare Llama 3.2 1B and 3B on zero-shot classification on this dataset to get a feel for how well they perform and follow instructions. I use the lovely Ollama API to interact with these models.
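Before running anything, both models need to be available locally. Here's a minimal sketch, assuming Ollama is installed and its server is running; `ollama.pull` from the official Python client downloads a model ahead of time so the first chat request isn't painfully slow.
import ollama

# Pull both models up front (assumes the Ollama server is running locally).
for model in ["llama3.2:1b", "llama3.2:3b"]:
    ollama.pull(model)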
The dataset comes in a single CSV file with two columns: `Review` and `label`.
Here's the distribution of `label`:
Here are all the imports we're gonna need:
import pandas as pd
import ollama
from tqdm.auto import tqdm
from dataclasses import dataclass
from tabulate import tabulate
df = pd.read_csv("data/DATASET.csv").dropna()
Since I just want to get a feel for how it works and don't really have time to run on all 51K examples in the dataset, I sample a small subset of it.
SAMPLE_SIZE = 500
df = df.sample(SAMPLE_SIZE, random_state=5578416).reset_index(drop=True)
Here's the distribution of `label` in the sample:
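If you'd rather check the distribution with code than eyeball a plot, a quick `value_counts` does the trick (just a sketch of how I'd inspect it):
# Share of each label value in the 500-review sample.
print(df["label"].value_counts(normalize=True))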
Then I define `num_samples` to be the length of my final dataframe:
num_samples = len(df)
I define two prompts to compare against each other:
PROMPT_1 = "You are a helpful AI assistant who helps users in sentiment analysis. Given a review in the user message, respond with either `POSITIVE` if the review is positive, or `NEGATIVE` if the review is negative. Respond with ONLY the sentiment and without any elaboration."
PROMPT_2 = "Given the review in the user message, respond with either `POSITIVE` if the review is positive, or `NEGATIVE` if the review is negative. Respond with ONLY the sentiment and without any elaboration."
I define a helper function that returns either `POSITIVE`, `NEGATIVE`, or `FAILED` in case the response wasn't either `POSITIVE` or `NEGATIVE` (this can happen if the SLM doesn't strictly follow our instructions).
def run(model: str, system_prompt: str, review: str) -> str:
    # Query the model with temperature 0 so the output is deterministic.
    response = ollama.chat(
        model=model,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": review,
            },
        ],
        options={"temperature": 0},
    )
    answer = response["message"]["content"].strip()
    if answer == "POSITIVE":
        return "POSITIVE"
    if answer == "NEGATIVE":
        return "NEGATIVE"
    # The model didn't strictly follow the instructions.
    return "FAILED"
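A quick smoke test of the helper on a made-up review (the review text here is just an example of mine, not from the dataset):
# Should print POSITIVE if the model follows the instructions.
print(run("llama3.2:1b", PROMPT_2, "Great app, my playlists sync perfectly everywhere."))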
Then I run a loop over each model, each prompt, and each sample to get predictions.
models = ["llama3.2:1b", "llama3.2:3b"]
prompts = {"prompt_1": PROMPT_1, "prompt_2": PROMPT_2}
for model in tqdm(models):
    for prompt_name, prompt in prompts.items():
        df[model + "::" + prompt_name] = ""
        for i in tqdm(range(len(df))):
            df.at[i, model + "::" + prompt_name] = run(model, prompt, df["Review"][i])
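This loop is the slow part (2 models × 2 prompts × 500 reviews = 2,000 chat calls), so it's worth persisting the predictions to rerun the analysis without hitting the models again. The file name below is my own choice:
# Checkpoint the raw predictions so the analysis can be rerun from disk.
df.to_csv("data/predictions.csv", index=False)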
I define a class that will hold relevant stats.
@dataclass
class ModelStats:
    model_name: str
    prompt_name: str
    following_instructions_and_correct_count: int
    following_instructions_and_incorrect_count: int
    not_following_instructions_count: int

    @property
    def following_instructions_and_correct(self):
        return self.following_instructions_and_correct_count / num_samples

    @property
    def following_instructions_and_incorrect(self):
        return self.following_instructions_and_incorrect_count / num_samples

    @property
    def not_following_instructions(self):
        return self.not_following_instructions_count / num_samples
Then I compute these stats:
model_stats = []
for model in models:
    for prompt_name in prompts.keys():
        col = model + "::" + prompt_name
        correct_count = (df[col] == df["label"]).sum()
        # Exclude FAILED responses so "incorrect" only counts predictions
        # that followed the instructions but got the label wrong.
        incorrect_count = ((df[col] != df["label"]) & (df[col] != "FAILED")).sum()
        not_following_instructions_count = (df[col] == "FAILED").sum()
        model_stats.append(
            ModelStats(
                model_name=model,
                prompt_name=prompt_name,
                following_instructions_and_correct_count=correct_count,
                following_instructions_and_incorrect_count=incorrect_count,
                not_following_instructions_count=not_following_instructions_count,
            )
        )
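Since `FAILED` responses are excluded from the incorrect count, the three counts should partition the sample (assuming the `label` column only contains `POSITIVE` and `NEGATIVE`). A quick sanity check I like to add:
# Every sample should fall into exactly one of the three buckets.
for s in model_stats:
    assert (
        s.following_instructions_and_correct_count
        + s.following_instructions_and_incorrect_count
        + s.not_following_instructions_count
    ) == num_samples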
Now, I can print all the collected results in a nice format:
print(
    tabulate(
        [
            ["Model", *[s.model_name for s in model_stats]],
            ["Prompt", *[s.prompt_name for s in model_stats]],
            [
                "Following Instructions & Correct",
                *[f"{s.following_instructions_and_correct:2.2%}" for s in model_stats],
            ],
            [
                "Following Instructions & Incorrect",
                *[
                    f"{s.following_instructions_and_incorrect:2.2%}"
                    for s in model_stats
                ],
            ],
            [
                "Not Following Instructions",
                *[f"{s.not_following_instructions:2.2%}" for s in model_stats],
            ],
            [
                "Correct (Count)",
                *[s.following_instructions_and_correct_count for s in model_stats],
            ],
            [
                "Incorrect (Count)",
                *[s.following_instructions_and_incorrect_count for s in model_stats],
            ],
            [
                "Not Following Instructions (Count)",
                *[s.not_following_instructions_count for s in model_stats],
            ],
        ],
        tablefmt="fancy_grid",
    )
)
And this produces:
We can see that `prompt_2` produces slightly better results than `prompt_1`.
We can also see that the 3B model consistently outperforms its 1B sibling, and that both of them very rarely fail to follow the instructions.
Here's some analysis of correct vs. incorrect predictions across different sequence lengths:
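For the curious, here's a rough sketch of how such a breakdown could be computed. The column choice (the 3B model with `prompt_2`) and measuring length in words are my own assumptions, not necessarily what produced the plot above:
# Compare review lengths (in words) for correct vs. incorrect predictions
# of one model/prompt combination.
col = "llama3.2:3b::prompt_2"
lengths = df["Review"].str.split().str.len()
correct_mask = df[col] == df["label"]

print("mean length (correct):  ", lengths[correct_mask].mean())
print("mean length (incorrect):", lengths[~correct_mask & (df[col] != "FAILED")].mean())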
Here's a link to the GitHub repo that has the code.