2024-10-14
In this post, I go over zero-shot sentiment analysis using Llama 3.2. First, I wanted to find a good dataset, so I browsed Kaggle and came across this dataset.
The dataset contains user reviews of the Spotify app on Google Play Store.
I love Small Language Models. I can run them with relatively low GPU requirements, or even on CPU, without worrying too much about how slow they're gonna be. So I wanted to compare Llama 3.2 1B and 3B on zero-shot classification on this dataset to get a feel for how well they perform and follow instructions. I use the lovely Ollama API to interact with these models.
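Before running anything, both models need to be available locally. Here's a minimal sketch, assuming Ollama is installed and its server is running; `ollama.pull` from the official Python client downloads a model ahead of time so the first chat request isn't painfully slow.
import ollama

# Pull both models up front (assumes the Ollama server is running locally).
for model in ["llama3.2:1b", "llama3.2:3b"]:
    ollama.pull(model)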
The dataset comes in a single CSV file with two columns: `Review` and `label`.
Here's the distribution of `label`:
Here are all the imports we're gonna need:
import pandas as pd
import ollama
from tqdm.auto import tqdm
from dataclasses import dataclass
from tabulate import tabulate
df = pd.read_csv("data/DATASET.csv").dropna()
Since I just want to get a feel for how it works and don't really have time to run on all 51K examples in the dataset, I sample a small subset of it.
SAMPLE_SIZE = 500
df = df.sample(SAMPLE_SIZE, random_state=5578416).reset_index(drop=True)
Here's the distribution of `label` in the sample:
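If you'd rather check the distribution with code than eyeball a plot, a quick `value_counts` does the trick (just a sketch of how I'd inspect it):
# Share of each label value in the 500-review sample.
print(df["label"].value_counts(normalize=True))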
Then I define `num_samples` to be the length of my final dataframe:
num_samples = len(df)
I define two prompts to compare against each other:
PROMPT_1 = "You are a helpful AI assistant who helps users in sentiment analysis. Given a review in the user message, respond with either `POSITIVE` if the review is positive, or `NEGATIVE` if the review is negative. Respond with ONLY the sentiment and without any elaboration."
PROMPT_2 = "Given the review in the user message, respond with either `POSITIVE` if the review is positive, or `NEGATIVE` if the review is negative. Respond with ONLY the sentiment and without any elaboration."
I define a helper function that returns either `POSITIVE`, `NEGATIVE`, or `FAILED` in case the response wasn't either `POSITIVE` or `NEGATIVE` (this can happen if the SLM doesn't strictly follow our instructions).
def run(model: str, system_prompt: str, review: str) -> str:
    # Query the model with temperature 0 so the output is deterministic.
    response = ollama.chat(
        model=model,
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": review,
            },
        ],
        options={"temperature": 0},
    )
    answer = response["message"]["content"].strip()
    if answer == "POSITIVE":
        return "POSITIVE"
    if answer == "NEGATIVE":
        return "NEGATIVE"
    # The model didn't strictly follow the instructions.
    return "FAILED"
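A quick smoke test of the helper on a made-up review (the review text here is just an example of mine, not from the dataset):
# Should print POSITIVE if the model follows the instructions.
print(run("llama3.2:1b", PROMPT_2, "Great app, my playlists sync perfectly everywhere."))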
Then I run a loop over each model, each prompt, and each sample to get predictions.
models = ["llama3.2:1b", "llama3.2:3b"]
prompts = {"prompt_1": PROMPT_1, "prompt_2": PROMPT_2}
for model in tqdm(models):
    for prompt_name, prompt in prompts.items():
        df[model + "::" + prompt_name] = ""
        for i in tqdm(range(len(df))):
            df.at[i, model + "::" + prompt_name] = run(model, prompt, df["Review"][i])
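This loop is the slow part (2 models × 2 prompts × 500 reviews = 2,000 chat calls), so it's worth persisting the predictions to rerun the analysis without hitting the models again. The file name below is my own choice:
# Checkpoint the raw predictions so the analysis can be rerun from disk.
df.to_csv("data/predictions.csv", index=False)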
I define a class that will hold relevant stats.
@dataclass
class ModelStats:
    model_name: str
    prompt_name: str
    following_instructions_and_correct_count: int
    following_instructions_and_incorrect_count: int
    not_following_instructions_count: int

    @property
    def following_instructions_and_correct(self):
        return self.following_instructions_and_correct_count / num_samples

    @property
    def following_instructions_and_incorrect(self):
        return self.following_instructions_and_incorrect_count / num_samples

    @property
    def not_following_instructions(self):
        return self.not_following_instructions_count / num_samples
Then I compute these stats:
model_stats = []
for model in models:
    for prompt_name in prompts.keys():
        col = model + "::" + prompt_name
        correct_count = (df[col] == df["label"]).sum()
        # Exclude FAILED responses so "incorrect" only counts predictions
        # that followed the instructions but got the label wrong.
        incorrect_count = ((df[col] != df["label"]) & (df[col] != "FAILED")).sum()
        not_following_instructions_count = (df[col] == "FAILED").sum()
        model_stats.append(
            ModelStats(
                model_name=model,
                prompt_name=prompt_name,
                following_instructions_and_correct_count=correct_count,
                following_instructions_and_incorrect_count=incorrect_count,
                not_following_instructions_count=not_following_instructions_count,
            )
        )
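Since `FAILED` responses are excluded from the incorrect count, the three counts should partition the sample (assuming the `label` column only contains `POSITIVE` and `NEGATIVE`). A quick sanity check I like to add:
# Every sample should fall into exactly one of the three buckets.
for s in model_stats:
    assert (
        s.following_instructions_and_correct_count
        + s.following_instructions_and_incorrect_count
        + s.not_following_instructions_count
    ) == num_samples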
Now, I can print all the collected results in a nice format:
print(
    tabulate(
        [
            ["Model", *[s.model_name for s in model_stats]],
            ["Prompt", *[s.prompt_name for s in model_stats]],
            [
                "Following Instructions & Correct",
                *[f"{s.following_instructions_and_correct:2.2%}" for s in model_stats],
            ],
            [
                "Following Instructions & Incorrect",
                *[
                    f"{s.following_instructions_and_incorrect:2.2%}"
                    for s in model_stats
                ],
            ],
            [
                "Not Following Instructions",
                *[f"{s.not_following_instructions:2.2%}" for s in model_stats],
            ],
            [
                "Correct (Count)",
                *[s.following_instructions_and_correct_count for s in model_stats],
            ],
            [
                "Incorrect (Count)",
                *[s.following_instructions_and_incorrect_count for s in model_stats],
            ],
            [
                "Not Following Instructions (Count)",
                *[s.not_following_instructions_count for s in model_stats],
            ],
        ],
        tablefmt="fancy_grid",
    )
)
And this produces:
We can see that `prompt_2` produces slightly better results than `prompt_1`.
We can also see that the 3B model consistently outperforms its 1B sibling, and that both of them very rarely fail to follow the instructions.
Here's some analysis of correct vs. incorrect predictions across different sequence lengths:
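For the curious, here's a rough sketch of how such a breakdown could be computed. The column choice (the 3B model with `prompt_2`) and measuring length in words are my own assumptions, not necessarily what produced the plot above:
# Compare review lengths (in words) for correct vs. incorrect predictions
# of one model/prompt combination.
col = "llama3.2:3b::prompt_2"
lengths = df["Review"].str.split().str.len()
correct_mask = df[col] == df["label"]

print("mean length (correct):  ", lengths[correct_mask].mean())
print("mean length (incorrect):", lengths[~correct_mask & (df[col] != "FAILED")].mean())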
Here's a link to the GitHub repo that has the code.