Advent of Haystack - Day 7

Santa collapsed in his chair in a huff. “What’s wrong?” asked Mrs Claus.

“There’s just too many toys to check and not enough time! Christmas is almost here!”

“Well can’t you just check some of them?”

“I wish it were that easy! But my elves make so many different toys, and we have to make sure every kid gets the right one!”

Elf Jane couldn’t help overhearing from the next room. A regular attendee at the local North Pole hackathon, she thought she might have a solution: she’d learned a lot about evaluation recently and could build an LLM Judge to help.

For this challenge, help Elf Jane by completing the code cells marked with #TODO.

Installation

! pip install -q arize-phoenix==6.1.0 haystack-ai==2.7.0 openinference-instrumentation-haystack==0.1.13 'httpx<0.28'

Data

Elf Jane started by checking out the Big Elf Database of Christmas Wishlists (aka the BEDCW).

children = [
    {
        "name": "Timmy",
        "age": 7,
        "likes": "Lego",
        "dislikes": "Vegetables",
        "list": "nice",
    },
    {
        "name": "Tommy",
        "age": 9,
        "likes": "Sports Equipment",
        "dislikes": "Reading",
        "list": "naughty",
    },
    {
        "name": "Tammy",
        "age": 8,
        "likes": "Art Supplies",
        "dislikes": "Loud Noises",
        "list": "nice",
    },
    {
        "name": "Tina",
        "age": 6,
        "likes": "Science Kits",
        "dislikes": "Spicy Food",
        "list": "nice",
    },
    {
        "name": "Toby",
        "age": 10,
        "likes": "Video Games",
        "dislikes": "Early Mornings",
        "list": "nice",
    },
    {
        "name": "Tod",
        "age": 5,
        "likes": "Musical Instruments",
        "dislikes": "Bath Time",
        "list": "nice",
    },
    {
        "name": "Todd",
        "age": 8,
        "likes": "Remote Control Cars",
        "dislikes": "Homework",
        "list": "naughty",
    },
    {
        "name": "Tara",
        "age": 7,
        "likes": "Magic Sets",
        "dislikes": "Thunder",
        "list": "nice",
    },
    {
        "name": "Teri",
        "age": 9,
        "likes": "Building Blocks",
        "dislikes": "Broccoli",
        "list": "nice",
    },
    {
        "name": "Trey",
        "age": 6,
        "likes": "Board Games",
        "dislikes": "Bedtime",
        "list": "nice",
    },
    {
        "name": "Tyler",
        "age": 8,
        "likes": "Action Figures",
        "dislikes": "Cleaning",
        "list": "nice",
    },
    {"name": "Tracy", "age": 7, "likes": "Dolls", "dislikes": "Dark", "list": "nice"},
    {
        "name": "Tony",
        "age": 9,
        "likes": "Chemistry Sets",
        "dislikes": "Dentist",
        "list": "nice",
    },
    {"name": "Theo", "age": 6, "likes": "Puzzles", "dislikes": "Shots", "list": "nice"},
    {
        "name": "Terry",
        "age": 10,
        "likes": "Model Trains",
        "dislikes": "Chores",
        "list": "naughty",
    },
    {
        "name": "Tessa",
        "age": 5,
        "likes": "Stuffed Animals",
        "dislikes": "Time Out",
        "list": "nice",
    },
    {"name": "Troy", "age": 8, "likes": "Robots", "dislikes": "Naps", "list": "nice"},
    {
        "name": "Talia",
        "age": 7,
        "likes": "Craft Kits",
        "dislikes": "Spinach",
        "list": "nice",
    },
    {
        "name": "Tyson",
        "age": 9,
        "likes": "Microscopes",
        "dislikes": "Cold",
        "list": "nice",
    },
    {
        "name": "Tatum",
        "age": 6,
        "likes": "Drawing Sets",
        "dislikes": "Shots",
        "list": "nice",
    },
]

1. Adding Tracing 📝

Elf Jane knew that the elves were busy and didn’t always log their toy-making process, so her first step was to trace that process using Arize Phoenix.

from getpass import getpass
import phoenix as px
from openinference.instrumentation.haystack import HaystackInstrumentor
from phoenix.otel import register

px.launch_app().view()

tracer_provider = register()
🌍 To view the Phoenix app in your browser, visit https://hkj442j0hfq5-496ff2e9c6d22116-6006-colab.googleusercontent.com/
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix
📺 Opening a view to the Phoenix app. The app is running at https://hkj442j0hfq5-496ff2e9c6d22116-6006-colab.googleusercontent.com/
🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: default
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: localhost:4317
|  Transport: gRPC
|  Transport Headers: {'user-agent': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.
# Use Phoenix's autoinstrumentor to automatically track traces from Haystack
HaystackInstrumentor().instrument(tracer_provider=tracer_provider, skip_dep_check=True)

2. Trace Toy Making Process 🚂

With tracing in place, Elf Jane had some of her closest elf friends build a batch of toys she could trace.

⭐️ Feel free to replace OpenAIChatGenerator with other ChatGenerators supported in Haystack

import os

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
Enter your OpenAI API key: ··········
from haystack.dataclasses import ChatMessage
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import ChatPromptBuilder
from haystack import Pipeline

messages = [
    ChatMessage.from_system(
        "You are a toy maker elf. Your job is to make toys for the nice kids on the nice list. If the child is on the naughty list, give them a 'Rabbit R1'. {{name}} is on the {{list}} list"
    ),
    ChatMessage.from_user(
        "Create a toy for {{name}} that they will like. {{name}} is {{age}} years old and likes {{likes}} and dislikes {{dislikes}}."
    ),
]

builder = ChatPromptBuilder(messages)
chat_generator = OpenAIChatGenerator(model="gpt-4o")

pipeline = Pipeline()
pipeline.add_component("builder", builder)
pipeline.add_component("chat_generator", chat_generator)

pipeline.connect("builder", "chat_generator")


def make_toy(child):
    return pipeline.run({"builder": {**child}})["chat_generator"]["replies"]
pipeline.show()

for child in children:
    make_toy(child)
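
To make the templating step easier to picture: ChatPromptBuilder renders the `{{...}}` placeholders with Jinja2, and `make_toy` passes each child's fields in as the template variables. The sketch below imitates that substitution with a plain regular expression instead of Jinja2, so it is only an illustration of what the user message looks like after rendering, not Haystack's actual code path. The `render` helper is hypothetical.

```python
import re

# Same user-message template as in the pipeline above.
user_template = (
    "Create a toy for {{name}} that they will like. "
    "{{name}} is {{age}} years old and likes {{likes}} and dislikes {{dislikes}}."
)

# First entry of the `children` list.
child = {"name": "Timmy", "age": 7, "likes": "Lego", "dislikes": "Vegetables", "list": "nice"}


def render(template: str, variables: dict) -> str:
    """Replace each {{key}} with str(variables[key]); a toy stand-in for Jinja2."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(variables[match.group(1)]),
        template,
    )


rendered = render(user_template, child)
print(rendered)
# Create a toy for Timmy that they will like. Timmy is 7 years old and likes Lego and dislikes Vegetables.
```

Extra keys such as `list` are simply ignored by this template; in the pipeline, that key is consumed by the system message.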

3. Evaluate Toy Correctness 🔬

Elf Jane was now ready to evaluate the toys her friends had made. She knew that she could use an LLM Judge to check whether each toy matched the child’s wishlist. She started by building a judge.

llm_judge_prompt = """
Evaluate the toy for this child, based on their likes and dislikes

All children on the naughty list get a 'Rabbit R1'. Any other toy given to a naughty child is incorrect.

Respond with a single word: 'correct' or 'incorrect'. Also include a short explanation for your answer.

Description of the child: {description}
Toy: {toy}

*****
Example output:
label: 'correct'
explanation: 'The toy is a Lego set, which is one of the child's likes.'
*****
"""
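
The `{description}` and `{toy}` placeholders in the prompt matter because the evaluation dataframe built later has columns with exactly those names, and the judge prompt is filled in once per row. As a rough sketch of that fill-in step, plain `str.format` is used below on a shortened copy of the prompt. Phoenix does its own templating internally, and the `row` values here are hypothetical.

```python
# Shortened copy of the judge prompt; same placeholders as llm_judge_prompt.
judge_prompt = """
Evaluate the toy for this child, based on their likes and dislikes

Description of the child: {description}
Toy: {toy}
"""

# Hypothetical row standing in for one span's extracted data.
row = {
    "description": "Timmy is 7 years old and likes Lego and dislikes Vegetables.",
    "toy": "A custom Lego Adventure Set",
}

# Each placeholder is replaced by the value of the matching key.
rendered = judge_prompt.format(**row)
print(rendered)
```

If a placeholder name and a column name drift apart (for example `{toys}` versus `toy`), the fill-in step has nothing to substitute, so it is worth keeping the two in sync.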
import phoenix as px

# Download the traces from Phoenix
# HINT: https://docs.arize.com/phoenix/evaluation/how-to-evals/evaluating-phoenix-traces#download-trace-dataset-from-phoenix
spans_df = px.Client().get_spans_dataframe()
spans_df
/usr/local/lib/python3.10/dist-packages/phoenix/trace/dsl/query.py:741: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  df_attributes = pd.DataFrame.from_records(
name span_kind parent_id start_time end_time status_code status_message events context.span_id context.trace_id ... attributes.output.mime_type attributes.input.mime_type attributes.output.value attributes.openinference.span.kind attributes.llm.token_count.completion attributes.llm.output_messages attributes.llm.token_count.prompt attributes.llm.model_name attributes.llm.token_count.total attributes.llm.input_messages
context.span_id
51890ed9b9d0b4e0 haystack.tracing.auto_enable UNKNOWN None 2024-12-30 21:19:10.741445+00:00 2024-12-30 21:19:10.749995+00:00 ERROR ImportError: cannot import name 'Span' from pa... [{'name': 'exception', 'timestamp': '2024-12-3... 51890ed9b9d0b4e0 0db288f7eec1d71a0675b9ac131f90ac ... None None None None NaN None NaN None NaN None
054f323f6b6b7af3 haystack.tracing.auto_enable UNKNOWN None 2024-12-30 21:19:15.862320+00:00 2024-12-30 21:19:15.862497+00:00 UNSET [] 054f323f6b6b7af3 076657b901d1317e5f966f35b5fd475b ... None None None None NaN None NaN None NaN None
2b5d4d9230514da2 haystack.component.run UNKNOWN 50065d7e957f21ab 2024-12-30 21:19:39.937971+00:00 2024-12-30 21:19:39.941460+00:00 UNSET [] 2b5d4d9230514da2 34cca62a58dd4b542b34851d06805599 ... None None None None NaN None NaN None NaN None
50065d7e957f21ab ChatPromptBuilder (builder) CHAIN e292f10918ed018f 2024-12-30 21:19:39.937541+00:00 2024-12-30 21:19:39.944100+00:00 OK [] 50065d7e957f21ab 34cca62a58dd4b542b34851d06805599 ... application/json application/json {"prompt": ["ChatMessage(content=\"You are a t... CHAIN NaN None NaN None NaN None
04378f01a108e263 haystack.component.run UNKNOWN a974808d61e79f67 2024-12-30 21:19:39.946261+00:00 2024-12-30 21:19:54.614621+00:00 UNSET [] 04378f01a108e263 34cca62a58dd4b542b34851d06805599 ... None None None None NaN None NaN None NaN None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8c4c4b36b90cdf3b ChatPromptBuilder (builder) CHAIN 36b0833942399225 2024-12-30 21:21:16.383451+00:00 2024-12-30 21:21:16.403037+00:00 OK [] 8c4c4b36b90cdf3b c958063ecd6eb75228ef7db19d2077af ... application/json application/json {"prompt": ["ChatMessage(content=\"You are a t... CHAIN NaN None NaN None NaN None
64605bc0478476ff haystack.component.run UNKNOWN db791b193994469b 2024-12-30 21:21:16.415345+00:00 2024-12-30 21:21:20.190329+00:00 UNSET [] 64605bc0478476ff c958063ecd6eb75228ef7db19d2077af ... None None None None NaN None NaN None NaN None
db791b193994469b OpenAIChatGenerator (chat_generator) LLM 36b0833942399225 2024-12-30 21:21:16.414693+00:00 2024-12-30 21:21:20.193359+00:00 OK [] db791b193994469b c958063ecd6eb75228ef7db19d2077af ... application/json application/json {"replies": ["ChatMessage(content='For Tatum, ... LLM 250.0 [{'message.role': 'assistant', 'message.conten... 83.0 gpt-4o-2024-08-06 333.0 [{'message.role': 'system', 'message.content':...
36b0833942399225 haystack.pipeline.run UNKNOWN b8f5499376727ae4 2024-12-30 21:21:16.382693+00:00 2024-12-30 21:21:20.195741+00:00 UNSET [] 36b0833942399225 c958063ecd6eb75228ef7db19d2077af ... None None None None NaN None NaN None NaN None
b8f5499376727ae4 Pipeline CHAIN None 2024-12-30 21:21:16.371890+00:00 2024-12-30 21:21:20.197737+00:00 OK [] b8f5499376727ae4 c958063ecd6eb75228ef7db19d2077af ... application/json application/json {"chat_generator": {"replies": ["ChatMessage(c... CHAIN NaN None NaN None NaN None

122 rows × 22 columns

spans_df.span_kind.value_counts()
span_kind  count
UNKNOWN       62
CHAIN         40
LLM           20

spans_df[spans_df.span_kind == "LLM"]
name span_kind parent_id start_time end_time status_code status_message events context.span_id context.trace_id ... attributes.output.mime_type attributes.input.mime_type attributes.output.value attributes.openinference.span.kind attributes.llm.token_count.completion attributes.llm.output_messages attributes.llm.token_count.prompt attributes.llm.model_name attributes.llm.token_count.total attributes.llm.input_messages
context.span_id
a974808d61e79f67 OpenAIChatGenerator (chat_generator) LLM e292f10918ed018f 2024-12-30 21:19:39.945971+00:00 2024-12-30 21:19:54.617055+00:00 OK [] a974808d61e79f67 34cca62a58dd4b542b34851d06805599 ... application/json application/json {"replies": ["ChatMessage(content='For Timmy, ... LLM 292.0 [{'message.role': 'assistant', 'message.conten... 83.0 gpt-4o-2024-08-06 375.0 [{'message.role': 'system', 'message.content':...
831170dd879999a9 OpenAIChatGenerator (chat_generator) LLM 8361954531f1e967 2024-12-30 21:19:54.629977+00:00 2024-12-30 21:19:55.574859+00:00 OK [] 831170dd879999a9 09857b0927c3881e79cc34a13b1cfef2 ... application/json application/json {"replies": ["ChatMessage(content=\"Since Tomm... LLM 47.0 [{'message.role': 'assistant', 'message.conten... 80.0 gpt-4o-2024-08-06 127.0 [{'message.role': 'system', 'message.content':...
fddbd864dee1b913 OpenAIChatGenerator (chat_generator) LLM 814338ad2143bc23 2024-12-30 21:19:55.590187+00:00 2024-12-30 21:19:58.798686+00:00 OK [] fddbd864dee1b913 6fd145dba7576ba486b2f11585c44b1f ... application/json application/json {"replies": ["ChatMessage(content='For Tammy, ... LLM 222.0 [{'message.role': 'assistant', 'message.conten... 82.0 gpt-4o-2024-08-06 304.0 [{'message.role': 'system', 'message.content':...
25a0815c0b072cd6 OpenAIChatGenerator (chat_generator) LLM 1ed7156b276ee864 2024-12-30 21:19:58.813465+00:00 2024-12-30 21:20:04.502611+00:00 OK [] 25a0815c0b072cd6 64435ef6a0f3d1d12dcd1486eaab1ba6 ... application/json application/json {"replies": ["ChatMessage(content='For Tina, a... LLM 351.0 [{'message.role': 'assistant', 'message.conten... 82.0 gpt-4o-2024-08-06 433.0 [{'message.role': 'system', 'message.content':...
f30c75e575a2890a OpenAIChatGenerator (chat_generator) LLM 603293a3bb4a7b60 2024-12-30 21:20:04.515530+00:00 2024-12-30 21:20:10.315617+00:00 OK [] f30c75e575a2890a c5361039d9da60af1f9fcd3eaa93468d ... application/json application/json {"replies": ["ChatMessage(content='For Toby, I... LLM 252.0 [{'message.role': 'assistant', 'message.conten... 83.0 gpt-4o-2024-08-06 335.0 [{'message.role': 'system', 'message.content':...
7641b23bb934ba18 OpenAIChatGenerator (chat_generator) LLM ac52c4b71265c7ad 2024-12-30 21:20:10.328041+00:00 2024-12-30 21:20:13.537578+00:00 OK [] 7641b23bb934ba18 7747de68109a211b910a211a0379bdbf ... application/json application/json {"replies": ["ChatMessage(content='For Tod, wh... LLM 306.0 [{'message.role': 'assistant', 'message.conten... 81.0 gpt-4o-2024-08-06 387.0 [{'message.role': 'system', 'message.content':...
c993b4c410f6a4d8 OpenAIChatGenerator (chat_generator) LLM 5b7b0c18058c9b03 2024-12-30 21:20:13.549354+00:00 2024-12-30 21:20:15.277630+00:00 OK [] c993b4c410f6a4d8 9b05bb38e3709dc9323a2a25ac1556ba ... application/json application/json {"replies": ["ChatMessage(content='Since Todd ... LLM 116.0 [{'message.role': 'assistant', 'message.conten... 81.0 gpt-4o-2024-08-06 197.0 [{'message.role': 'system', 'message.content':...
29aefc6f40ef5bd4 OpenAIChatGenerator (chat_generator) LLM 9e674fbecb0749a9 2024-12-30 21:20:15.293944+00:00 2024-12-30 21:20:19.675824+00:00 OK [] 29aefc6f40ef5bd4 30ba1e88a5ed38f0792e8fcbb0df8c7b ... application/json application/json {"replies": ["ChatMessage(content='For Tara, w... LLM 235.0 [{'message.role': 'assistant', 'message.conten... 80.0 gpt-4o-2024-08-06 315.0 [{'message.role': 'system', 'message.content':...
fb0445e28a1cf2b2 OpenAIChatGenerator (chat_generator) LLM 6687f610d8452989 2024-12-30 21:20:19.688252+00:00 2024-12-30 21:20:24.570018+00:00 OK [] fb0445e28a1cf2b2 a3df719ace9e1a4a948456b973011513 ... application/json application/json {"replies": ["ChatMessage(content='For 9-year-... LLM 244.0 [{'message.role': 'assistant', 'message.conten... 84.0 gpt-4o-2024-08-06 328.0 [{'message.role': 'system', 'message.content':...
abfeea386d9417e2 OpenAIChatGenerator (chat_generator) LLM c8914043f7f7a5b7 2024-12-30 21:20:24.581873+00:00 2024-12-30 21:20:32.002100+00:00 OK [] abfeea386d9417e2 e7abf926ac9384659204c6e644b47bf0 ... application/json application/json {"replies": ["ChatMessage(content='For Trey, w... LLM 368.0 [{'message.role': 'assistant', 'message.conten... 81.0 gpt-4o-2024-08-06 449.0 [{'message.role': 'system', 'message.content':...
117e832a18b4ccd6 OpenAIChatGenerator (chat_generator) LLM 35e4299a30e8406a 2024-12-30 21:20:32.018025+00:00 2024-12-30 21:20:35.238899+00:00 OK [] 117e832a18b4ccd6 e016f46a5cbc0b45e3decd966423198d ... application/json application/json {"replies": ["ChatMessage(content='For Tyler, ... LLM 246.0 [{'message.role': 'assistant', 'message.conten... 80.0 gpt-4o-2024-08-06 326.0 [{'message.role': 'system', 'message.content':...
7b9aabc099e936bf OpenAIChatGenerator (chat_generator) LLM 46669109f9ae00e6 2024-12-30 21:20:35.250643+00:00 2024-12-30 21:20:39.544956+00:00 OK [] 7b9aabc099e936bf 489affb18bb8164e20bbfc0945cdba4f ... application/json application/json {"replies": ["ChatMessage(content='For Tracy, ... LLM 309.0 [{'message.role': 'assistant', 'message.conten... 79.0 gpt-4o-2024-08-06 388.0 [{'message.role': 'system', 'message.content':...
321d2e909a2da0ad OpenAIChatGenerator (chat_generator) LLM 7062dcfa11f36863 2024-12-30 21:20:39.569210+00:00 2024-12-30 21:20:44.072822+00:00 OK [] 321d2e909a2da0ad fdd294173534687528e97780d86f7e44 ... application/json application/json {"replies": ["ChatMessage(content='For Tony, I... LLM 277.0 [{'message.role': 'assistant', 'message.conten... 80.0 gpt-4o-2024-08-06 357.0 [{'message.role': 'system', 'message.content':...
22ddb78f8ec8bde2 OpenAIChatGenerator (chat_generator) LLM 54ee0c08cde66f15 2024-12-30 21:20:44.085434+00:00 2024-12-30 21:20:48.891916+00:00 OK [] 22ddb78f8ec8bde2 8ee6e037a4ca76c71585fb6df2d560e2 ... application/json application/json {"replies": ["ChatMessage(content='For Theo, w... LLM 156.0 [{'message.role': 'assistant', 'message.conten... 80.0 gpt-4o-2024-08-06 236.0 [{'message.role': 'system', 'message.content':...
792cc8df916a4e4f OpenAIChatGenerator (chat_generator) LLM 2117c6fbe9f4d85b 2024-12-30 21:20:48.905360+00:00 2024-12-30 21:20:51.411726+00:00 OK [] 792cc8df916a4e4f a55ed617c5d49ad4daa3cbe51a511e0f ... application/json application/json {"replies": ["ChatMessage(content=\"Since Terr... LLM 94.0 [{'message.role': 'assistant', 'message.conten... 82.0 gpt-4o-2024-08-06 176.0 [{'message.role': 'system', 'message.content':...
473d797dc3c07cfb OpenAIChatGenerator (chat_generator) LLM da06f15713cc600d 2024-12-30 21:20:51.428047+00:00 2024-12-30 21:20:55.472863+00:00 OK [] 473d797dc3c07cfb 8909e460eace4d04563dce3bbb21a308 ... application/json application/json {"replies": ["ChatMessage(content='For Tessa, ... LLM 231.0 [{'message.role': 'assistant', 'message.conten... 85.0 gpt-4o-2024-08-06 316.0 [{'message.role': 'system', 'message.content':...
2a4002fc523cda89 OpenAIChatGenerator (chat_generator) LLM c3e63026a60725db 2024-12-30 21:20:55.484989+00:00 2024-12-30 21:21:02.448618+00:00 OK [] 2a4002fc523cda89 95c5cc4ad95d4720f4fb77a3a337c679 ... application/json application/json {"replies": ["ChatMessage(content='For Troy, I... LLM 300.0 [{'message.role': 'assistant', 'message.conten... 80.0 gpt-4o-2024-08-06 380.0 [{'message.role': 'system', 'message.content':...
ce6a5eca6905780b OpenAIChatGenerator (chat_generator) LLM f0c26cbaf321e1a2 2024-12-30 21:21:02.464233+00:00 2024-12-30 21:21:08.158557+00:00 OK [] ce6a5eca6905780b 2afbc4a4939e859c115ca8da6c7cf87f ... application/json application/json {"replies": ["ChatMessage(content='For Talia, ... LLM 237.0 [{'message.role': 'assistant', 'message.conten... 84.0 gpt-4o-2024-08-06 321.0 [{'message.role': 'system', 'message.content':...
a6ae3e015c0301d4 OpenAIChatGenerator (chat_generator) LLM 944e31ed8abc33a9 2024-12-30 21:21:08.187750+00:00 2024-12-30 21:21:16.353757+00:00 OK [] a6ae3e015c0301d4 04a4ac8b09c3d58c2d03c06a67da1337 ... application/json application/json {"replies": ["ChatMessage(content='For Tyson, ... LLM 339.0 [{'message.role': 'assistant', 'message.conten... 81.0 gpt-4o-2024-08-06 420.0 [{'message.role': 'system', 'message.content':...
db791b193994469b OpenAIChatGenerator (chat_generator) LLM 36b0833942399225 2024-12-30 21:21:16.414693+00:00 2024-12-30 21:21:20.193359+00:00 OK [] db791b193994469b c958063ecd6eb75228ef7db19d2077af ... application/json application/json {"replies": ["ChatMessage(content='For Tatum, ... LLM 250.0 [{'message.role': 'assistant', 'message.conten... 83.0 gpt-4o-2024-08-06 333.0 [{'message.role': 'system', 'message.content':...

20 rows × 22 columns

# .copy() avoids pandas' SettingWithCopyWarning when new columns are added below
df = spans_df[spans_df.span_kind == "LLM"][
    ["attributes.input.value", "attributes.output.value"]
].copy()
df.head()
context.span_id   attributes.input.value                              attributes.output.value
a974808d61e79f67  {"messages": ["ChatMessage(content=\"You are a...  {"replies": ["ChatMessage(content='For Timmy, ...
831170dd879999a9  {"messages": ["ChatMessage(content=\"You are a...  {"replies": ["ChatMessage(content=\"Since Tomm...
fddbd864dee1b913  {"messages": ["ChatMessage(content=\"You are a...  {"replies": ["ChatMessage(content='For Tammy, ...
25a0815c0b072cd6  {"messages": ["ChatMessage(content=\"You are a...  {"replies": ["ChatMessage(content='For Tina, a...
f30c75e575a2890a  {"messages": ["ChatMessage(content=\"You are a...  {"replies": ["ChatMessage(content='For Toby, I...
import json
json.loads(df["attributes.input.value"].values[0])["messages"]
'ChatMessage(content="You are a toy maker elf. Your job is to make toys for the nice kids on the nice list. If the child is on the naughty list, give them a \'Rabbit R1\'. Timmy is on the nice list", role=<ChatRole.SYSTEM: \'system\'>, name=None, meta={})'
json.loads(df["attributes.output.value"].values[0])["replies"]
['ChatMessage(content=\'For Timmy, I\\\'d create a custom Lego Adventure Set tailored to his interests! Here\\\'s a description of the set:\\n\\n**Timmy\\\'s Jurassic Adventure Lego Set:**\\n\\n1. **Dinosaur Safari:** This set comes with buildable dinosaur figures like a T-Rex and Triceratops, perfect for Timmy\\\'s adventurous spirit. He can explore and create daring scenarios with these prehistoric giants.\\n\\n2. **Expedition Vehicle:** A cool off-road vehicle with a mini scientist figure that Timmy can use to navigate through the Lego jungle, capturing exciting moments and venturing through imaginative landscapes.\\n\\n3. **Mystery Fossil Site:** An interactive dig site where Timmy can discover hidden "fossils" (special brick pieces) and learn about dinosaurs in a fun, engaging way.\\n\\n4. **Jungle Hut Hideout:** A detailed jungle hut where mini-figures can rest and strategize their next adventure. It includes accessories like binoculars, a map, and a treasure chest.\\n\\n5. **Bonus Feature:** Since Timmy likes Lego and might need a nudge to warm up to vegetables, include a fun mini-veg garden as a bonus side build, where Lego mini-figures can grow their own food. It\\\'s optional to include in his adventure world, but it adds a healthy twist to his playset!\\n\\nThis Lego set encourages creativity, imaginative play, and story-building, sure to provide hours of joy for Timmy!\', role=<ChatRole.ASSISTANT: \'assistant\'>, name=None, meta={\'model\': \'gpt-4o-2024-08-06\', \'index\': 0, \'finish_reason\': \'stop\', \'usage\': {\'completion_tokens\': 292, \'prompt_tokens\': 83, \'total_tokens\': 375, \'completion_tokens_details\': CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), \'prompt_tokens_details\': PromptTokensDetails(audio_tokens=0, cached_tokens=0)}})']
df["description"] = df["attributes.input.value"].apply(
    lambda x: json.loads(x)["messages"]
)
df["toy"] = df["attributes.output.value"].apply(lambda x: json.loads(x)["replies"])
df.drop(["attributes.input.value", "attributes.output.value"], axis=1, inplace=True)
df.head()
context.span_id   description                                         toy
a974808d61e79f67  [ChatMessage(content="You are a toy maker elf....  [ChatMessage(content='For Timmy, I\'d create a...
831170dd879999a9  [ChatMessage(content="You are a toy maker elf....  [ChatMessage(content="Since Tommy is on the na...
fddbd864dee1b913  [ChatMessage(content="You are a toy maker elf....  [ChatMessage(content='For Tammy, who is 8 year...
25a0815c0b072cd6  [ChatMessage(content="You are a toy maker elf....  [ChatMessage(content='For Tina, a curious 6-ye...
f30c75e575a2890a  [ChatMessage(content="You are a toy maker elf....  [ChatMessage(content='For Toby, I\'ll create t...
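
Note that each cell in `description` and `toy` is a list of `ChatMessage(...)` repr strings, so the judge also sees the role, metadata and token usage alongside the text. If only the message text is wanted, one option is to pull the `content=` string literal out of the repr. The sketch below is an optional cleanup under that assumption: `extract_content` is a hypothetical helper, the sample string is a shortened stand-in for a real span, and the regular expression relies on the repr format shown in the outputs above.

```python
import ast
import json
import re

# Matches the first single- or double-quoted Python string literal after "content=".
_CONTENT_RE = re.compile(
    r"""content=('(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*")""", re.DOTALL
)


def extract_content(message_repr: str) -> str:
    """Return the content text from a ChatMessage repr, or the input unchanged."""
    match = _CONTENT_RE.search(message_repr)
    if match is None:
        return message_repr
    # literal_eval turns the quoted literal (with its escapes) back into a str.
    return ast.literal_eval(match.group(1))


# Hypothetical, shortened stand-in for one `attributes.output.value` cell.
sample_reply = (
    "ChatMessage(content='For Timmy, I\\'d create a Lego set.', "
    "role=<ChatRole.ASSISTANT: 'assistant'>, name=None, meta={})"
)
sample_output_value = json.dumps({"replies": [sample_reply]})

replies = json.loads(sample_output_value)["replies"]
toy_text = extract_content(replies[0])
print(toy_text)
```

In the notebook, something like `df["toy"].apply(lambda msgs: extract_content(msgs[0]))` could apply this per row; whether the cleaner input changes the judge's labels is not something this notebook measured.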
import nest_asyncio

nest_asyncio.apply()
from phoenix.evals import (
    llm_classify,
    OpenAIModel,  # can swap for another model supported by Phoenix or run open-source models through LiteLLM and Ollama: https://docs.arize.com/phoenix/evaluation/evaluation-models
)
from phoenix.evals.templates import ClassificationTemplate

# Evaluate the traces with the LLM Judge
# HINT: https://docs.arize.com/phoenix/evaluation/how-to-evals/bring-your-own-evaluator#categorical-llm_classify
# HINT: For evaluation, try using a different language model than the one you used for toy matching
# Reference: https://github.com/deepset-ai/haystack/discussions/8579#discussioncomment-11649855
eval_results = llm_classify(
    dataframe=df,
    template=ClassificationTemplate(
        template=llm_judge_prompt,
        rails=["correct", "incorrect"],
        explanation_template="Explanation: ",
    ),
    model=OpenAIModel(model="gpt-4o-mini"),
    rails=["correct", "incorrect"],
    provide_explanation=True,
)
eval_results["score"] = eval_results["label"].apply(
    lambda x: 1 if x == "correct" else 0
)
eval_results.head()
context.span_id   label      explanation                                         exceptions  execution_status  execution_seconds  score
a974808d61e79f67  correct    The statement is correct because it accurately...  []          COMPLETED         2.970437           1
831170dd879999a9  incorrect  The question asks for the correct response to ...  []          COMPLETED         1.046979           0
fddbd864dee1b913  correct    The statement is correct because it accurately...  []          COMPLETED         1.313569           1
25a0815c0b072cd6  correct    The statement is correct because it accurately...  []          COMPLETED         1.931698           1
f30c75e575a2890a  correct    The response is correct because it accurately ...  []          COMPLETED         1.026723           1
eval_results.score.value_counts()
score  count
1         17
0          3

(17 / 20) * 100
85.0
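
The percentage above is typed in by hand from the counts. A small sketch of deriving it from the labels instead is shown below, using a plain list that mirrors the 17 correct and 3 incorrect results. In the notebook itself, `eval_results["score"].mean() * 100` should give the same figure.

```python
# Stand-in for eval_results["label"]: 17 correct and 3 incorrect, as counted above.
labels = ["correct"] * 17 + ["incorrect"] * 3

# Same mapping as the score column: 1 for "correct", 0 otherwise.
scores = [1 if label == "correct" else 0 for label in labels]

# Share of toys the judge accepted, as a percentage.
accuracy = 100 * sum(scores) / len(scores)
print(accuracy)  # 85.0
```

Computing the figure from the data keeps it correct if the toys are regenerated and the counts change.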
from phoenix.trace import SpanEvaluations

# Upload results into Phoenix
# HINT: https://docs.arize.com/phoenix/evaluation/how-to-evals/evaluating-phoenix-traces#download-trace-dataset-from-phoenix
px.Client().log_evaluations(
    SpanEvaluations(eval_name="evaluate_toy", dataframe=eval_results)
)

[Screenshots: Phoenix UI, traces and spans views]

4. View the results in the Arize Phoenix UI 🐦‍🔥

And just like that, Elf Jane had saved Santa hours of time and made sure every kid got the right toy!

In Phoenix, she could see “correct” and “incorrect” labels on all the traces, and even see the explanations for each label!

She couldn’t wait to show Santa, and all her friends at the hackathon.