Evaluate
The TABLET package offers several useful features for evaluating the performance of LLMs and instructions on tabular datasets. TABLET provides code to evaluate arbitrary HuggingFace models on tasks, and it also provides tools to simply fetch the HuggingFace dataset for a particular task so you can perform whatever evaluation you want.
Task Storage
First, let’s look at how the task datasets are stored in TABLET. All the tasks are stored in
Tablet/data/benchmark/performance
For example, the Adult task is stored at
Tablet/data/benchmark/performance/Adult
Within this directory, there are different directories for each instruction annotation for the Adult task. For example, let’s look at one of the prototypes-generated instructions. This instruction is stored at
Tablet/data/benchmark/performance/Adult/prototypes-synthetic-performance-0
Instructions collected through other sources have different paths. The ruleset-generated instructions all have the directory name
ruleset-synthetic-performance-*
And the naturally occurring instructions have the directory name
prototypes-naturallanguage-performance-*
Note, the use of prototypes here is just to keep the naming consistent with the other directory names.
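If you want to see every instruction variant available for a task, you can simply list the task’s directory. Here is a minimal sketch using only the standard library; it assumes you are running from the Tablet directory, as in the examples below:

import os

adult_dir = "./data/benchmark/performance/Adult"

# Each subdirectory holds one instruction annotation, e.g.
# prototypes-synthetic-performance-0 or ruleset-synthetic-performance-1.
for name in sorted(os.listdir(adult_dir)):
    print(name)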
Within each directory, there are four files
../test.csv
../test.json
../train.csv
../train.json
These are the training and testing sets, stored both in their tabular format (the .csv files) and their natural language format (the .json files). The .json files contain each prompt component, such as the header, data point serialization, and instruction.
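As a quick sanity check, you can load these files directly. The following is a minimal sketch, assuming pandas is installed and that you are running from the Tablet directory; it does not rely on any particular key names inside the .json file:

import json
import pandas as pd

task_dir = "./data/benchmark/performance/Adult/prototypes-synthetic-performance-0"

# Tabular format: one row per instance.
test_df = pd.read_csv(task_dir + "/test.csv")
print(test_df.shape)

# Natural language format: prompt components for each instance.
with open(task_dir + "/test.json") as f:
    test_prompts = json.load(f)
print(type(test_prompts))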
Getting a HuggingFace Dataset for a Task
Here’s how to use the TABLET package to get a HuggingFace dataset for a particular task. Let’s say we want to get the Adult and Whooping Cough datasets at these locations
Tablet/data/benchmark/performance/Adult/prototypes-synthetic-performance-0
Tablet/data/benchmark/performance/A37/prototypes-synthetic-performance-0
We can get the test datasets as follows
from Tablet import evaluate

benchmark_path = "./data/benchmark/performance/"
tasks = ['A37/prototypes-synthetic-performance-0',
         'Adult/prototypes-synthetic-performance-0']
evaluator = evaluate.Evaluator(benchmark_path=benchmark_path,
                               tasks_to_run=tasks,
                               encoding_format="flan",
                               k_shot=0)
whooping_cough, adult = evaluator.get_test_hf_datasets()
We can specify k_shot here to control how many few-shot instances are sampled from the training data and included in the prompts. Then, we can access the Adult test data and labels as
test_data, ground_truth_labels = adult['text'], adult['label']
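To inspect what the model would actually see, you can print the first prompt and its label. This is a minimal usage sketch; it assumes the returned objects behave like standard HuggingFace datasets, so the 'text' and 'label' columns index as lists:

# Inspect the first serialized prompt and its ground truth label.
print(test_data[0])
print(ground_truth_labels[0])

# Number of test instances in the Adult task.
print(len(test_data))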
Evaluating Performance on a Task
We can also directly evaluate performance on tasks. For instance, evaluating 2-shot Flan-T5 small performance on Adult with prototypes-generated instructions over 3 seeds is as follows
from Tablet import evaluate

benchmark_path = "./data/benchmark/performance/"
tasks = ['Adult/prototypes-synthetic-performance-0']
evaluator = evaluate.Evaluator(benchmark_path=benchmark_path,
                               tasks_to_run=tasks,
                               encoding_format="flan",
                               results_file="my_cool_results.txt",
                               k_shot=2)
evaluator.run_eval(how_many=3)
The results will be appended to my_cool_results.txt.
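Because each run appends to this file, you can collect results from several configurations in one place. Here is a minimal sketch for reading them back; the exact line format is whatever TABLET writes, so this simply prints the raw lines:

# Print the raw results appended by run_eval.
with open("my_cool_results.txt") as f:
    for line in f:
        print(line.rstrip())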