Advent of Haystack - Day 7

Santa collapsed in his chair in a huff. “What’s wrong?” asked Mrs Claus.

“There’s just too many toys to check and not enough time! Christmas is almost here!”

“Well can’t you just check some of them?”

“I wish it were that easy! But my elves make so many different toys, and we have to make sure every kid gets the right one!”

Elf Jane couldn’t help overhearing from the next room. She was a regular attendee at the local North Pole hackathon and thought she might have a solution. She’d learned a lot about evaluation recently and figured she could build an LLM Judge to help.

For this challenge, help Elf Jane by completing the code cells marked with #TODO.

Installation

! pip install -q arize-phoenix==6.1.0 haystack-ai==2.7.0 openinference-instrumentation-haystack==0.1.13 'httpx<0.28'

Data

Elf Jane started by checking out the Big Elf Database of Christmas Wishlists (aka the BEDCW).

children = [
    {
        "name": "Timmy",
        "age": 7,
        "likes": "Lego",
        "dislikes": "Vegetables",
        "list": "nice",
    },
    {
        "name": "Tommy",
        "age": 9,
        "likes": "Sports Equipment",
        "dislikes": "Reading",
        "list": "naughty",
    },
    {
        "name": "Tammy",
        "age": 8,
        "likes": "Art Supplies",
        "dislikes": "Loud Noises",
        "list": "nice",
    },
    {
        "name": "Tina",
        "age": 6,
        "likes": "Science Kits",
        "dislikes": "Spicy Food",
        "list": "nice",
    },
    {
        "name": "Toby",
        "age": 10,
        "likes": "Video Games",
        "dislikes": "Early Mornings",
        "list": "nice",
    },
    {
        "name": "Tod",
        "age": 5,
        "likes": "Musical Instruments",
        "dislikes": "Bath Time",
        "list": "nice",
    },
    {
        "name": "Todd",
        "age": 8,
        "likes": "Remote Control Cars",
        "dislikes": "Homework",
        "list": "naughty",
    },
    {
        "name": "Tara",
        "age": 7,
        "likes": "Magic Sets",
        "dislikes": "Thunder",
        "list": "nice",
    },
    {
        "name": "Teri",
        "age": 9,
        "likes": "Building Blocks",
        "dislikes": "Broccoli",
        "list": "nice",
    },
    {
        "name": "Trey",
        "age": 6,
        "likes": "Board Games",
        "dislikes": "Bedtime",
        "list": "nice",
    },
    {
        "name": "Tyler",
        "age": 8,
        "likes": "Action Figures",
        "dislikes": "Cleaning",
        "list": "nice",
    },
    {"name": "Tracy", "age": 7, "likes": "Dolls", "dislikes": "Dark", "list": "nice"},
    {
        "name": "Tony",
        "age": 9,
        "likes": "Chemistry Sets",
        "dislikes": "Dentist",
        "list": "nice",
    },
    {"name": "Theo", "age": 6, "likes": "Puzzles", "dislikes": "Shots", "list": "nice"},
    {
        "name": "Terry",
        "age": 10,
        "likes": "Model Trains",
        "dislikes": "Chores",
        "list": "naughty",
    },
    {
        "name": "Tessa",
        "age": 5,
        "likes": "Stuffed Animals",
        "dislikes": "Time Out",
        "list": "nice",
    },
    {"name": "Troy", "age": 8, "likes": "Robots", "dislikes": "Naps", "list": "nice"},
    {
        "name": "Talia",
        "age": 7,
        "likes": "Craft Kits",
        "dislikes": "Spinach",
        "list": "nice",
    },
    {
        "name": "Tyson",
        "age": 9,
        "likes": "Microscopes",
        "dislikes": "Cold",
        "list": "nice",
    },
    {
        "name": "Tatum",
        "age": 6,
        "likes": "Drawing Sets",
        "dislikes": "Shots",
        "list": "nice",
    },
]

1. Adding Tracing 📝

Elf Jane knew that the elves were busy, and didn’t always log their toy making process. She knew that she’d first need to trace the toy making process using Arize Phoenix.

from getpass import getpass
import phoenix as px
from openinference.instrumentation.haystack import HaystackInstrumentor
from phoenix.otel import register

px.launch_app().view()

tracer_provider = register()

🌍 To view the Phoenix app in your browser, visit https://hkj442j0hfq5-496ff2e9c6d22116-6006-colab.googleusercontent.com/
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix
📺 Opening a view to the Phoenix app. The app is running at https://hkj442j0hfq5-496ff2e9c6d22116-6006-colab.googleusercontent.com/
🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: default
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: localhost:4317
|  Transport: gRPC
|  Transport Headers: {'user-agent': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.

# Use Phoenix's autoinstrumentor to automatically track traces from Haystack
HaystackInstrumentor().instrument(tracer_provider=tracer_provider, skip_dep_check=True)

2. Trace Toy Making Process 🚂

With tracing in place, Elf Jane had some of her closest elf friends build a batch of toys she could trace.

⭐️ Feel free to replace OpenAIChatGenerator with other ChatGenerators supported in Haystack

import os

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

Enter your OpenAI API key: ··········

from haystack.dataclasses import ChatMessage
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import ChatPromptBuilder
from haystack import Pipeline

messages = [
    ChatMessage.from_system(
        "You are a toy maker elf. Your job is to make toys for the nice kids on the nice list. If the child is on the naughty list, give them a 'Rabbit R1'. {{name}} is on the {{list}} list"
    ),
    ChatMessage.from_user(
        "Create a toy for {{name}} that they will like. {{name}} is {{age}} years old and likes {{likes}} and dislikes {{dislikes}}."
    ),
]

builder = ChatPromptBuilder(messages)
chat_generator = OpenAIChatGenerator(model="gpt-4o")

pipeline = Pipeline()
pipeline.add_component("builder", builder)
pipeline.add_component("chat_generator", chat_generator)

pipeline.connect("builder", "chat_generator")


def make_toy(child):
    return pipeline.run({"builder": {**child}})["chat_generator"]["replies"]

pipeline.show()

for child in children:
    make_toy(child)
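
Before spending API calls, it can help to preview what the builder will send for a single child. Haystack's prompt builders fill Jinja-style `{{ }}` placeholders; the sketch below only approximates that with a regular expression, so it handles plain variable names and nothing else. `USER_TEMPLATE` is a copy of the user message above, and `render_preview` is a hypothetical helper, not part of Haystack.

```python
import re

# Copy of the user message template from the pipeline above.
USER_TEMPLATE = (
    "Create a toy for {{name}} that they will like. {{name}} is {{age}} years old "
    "and likes {{likes}} and dislikes {{dislikes}}."
)


def render_preview(template: str, variables: dict) -> str:
    """Hypothetical helper: substitute simple {{var}} placeholders (not real Jinja2)."""

    def substitute(match: re.Match) -> str:
        # Look up the placeholder name in the child record.
        return str(variables[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


child = {"name": "Timmy", "age": 7, "likes": "Lego", "dislikes": "Vegetables", "list": "nice"}
preview = render_preview(USER_TEMPLATE, child)
print(preview)
```

Extra keys such as `list` are simply ignored here because this template never references them.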

3. Evaluate Toy Correctness 🔬

Elf Jane was now ready to evaluate the toys she made. She knew that she could use an LLM Judge to evaluate whether the toys matched the child’s wishlist. She started by building a judge.

llm_judge_prompt = """
Evaluate the toy for this child, based on their likes and dislikes

All children on the naughty list get a 'Rabbit R1'. Any other toy given to a naughty child is incorrect.

Respond with a single word: 'correct' or 'incorrect'. Also include a short explanation for your answer.

Description of the child: {description}
Toy: {toy}

*****
Example output:
label: 'correct'
explanation: 'The toy is a Lego set, which is one of the child's likes.'
*****
"""
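
The `{description}` and `{toy}` placeholders in the judge prompt are meant to line up with DataFrame columns of the same names, one filled prompt per row. As a rough sketch of that idea, the snippet below fills a shortened stand-in template with `str.format`. Phoenix does its own templating internally, so treat this as an approximation; `JUDGE_TEMPLATE` and the two rows are made up for illustration.

```python
# Shortened stand-in for llm_judge_prompt; only the placeholders matter here.
JUDGE_TEMPLATE = "Description of the child: {description}\nToy: {toy}\n"

# Illustrative rows; in the notebook these values come from the traced spans.
rows = [
    {"description": "Timmy is 7 and on the nice list", "toy": "A Lego adventure set"},
    {"description": "Tommy is 9 and on the naughty list", "toy": "A Rabbit R1"},
]

# One filled-in prompt per row, keyed by the placeholder names.
prompts = [JUDGE_TEMPLATE.format(**row) for row in rows]
for prompt in prompts:
    print(prompt)
```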

import phoenix as px

# Download the traces from Phoenix
# HINT: https://docs.arize.com/phoenix/evaluation/how-to-evals/evaluating-phoenix-traces#download-trace-dataset-from-phoenix
spans_df = px.Client().get_spans_dataframe()
spans_df

/usr/local/lib/python3.10/dist-packages/phoenix/trace/dsl/query.py:741: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  df_attributes = pd.DataFrame.from_records(

	name	span_kind	parent_id	start_time	end_time	status_code	status_message	events	context.span_id	context.trace_id	...	attributes.output.mime_type	attributes.input.mime_type	attributes.output.value	attributes.openinference.span.kind	attributes.llm.token_count.completion	attributes.llm.output_messages	attributes.llm.token_count.prompt	attributes.llm.model_name	attributes.llm.token_count.total	attributes.llm.input_messages
context.span_id
51890ed9b9d0b4e0	haystack.tracing.auto_enable	UNKNOWN	None	2024-12-30 21:19:10.741445+00:00	2024-12-30 21:19:10.749995+00:00	ERROR	ImportError: cannot import name 'Span' from pa...	[{'name': 'exception', 'timestamp': '2024-12-3...	51890ed9b9d0b4e0	0db288f7eec1d71a0675b9ac131f90ac	...	None	None	None	None	NaN	None	NaN	None	NaN	None
054f323f6b6b7af3	haystack.tracing.auto_enable	UNKNOWN	None	2024-12-30 21:19:15.862320+00:00	2024-12-30 21:19:15.862497+00:00	UNSET		[]	054f323f6b6b7af3	076657b901d1317e5f966f35b5fd475b	...	None	None	None	None	NaN	None	NaN	None	NaN	None
2b5d4d9230514da2	haystack.component.run	UNKNOWN	50065d7e957f21ab	2024-12-30 21:19:39.937971+00:00	2024-12-30 21:19:39.941460+00:00	UNSET		[]	2b5d4d9230514da2	34cca62a58dd4b542b34851d06805599	...	None	None	None	None	NaN	None	NaN	None	NaN	None
50065d7e957f21ab	ChatPromptBuilder (builder)	CHAIN	e292f10918ed018f	2024-12-30 21:19:39.937541+00:00	2024-12-30 21:19:39.944100+00:00	OK		[]	50065d7e957f21ab	34cca62a58dd4b542b34851d06805599	...	application/json	application/json	{"prompt": ["ChatMessage(content=\"You are a t...	CHAIN	NaN	None	NaN	None	NaN	None
04378f01a108e263	haystack.component.run	UNKNOWN	a974808d61e79f67	2024-12-30 21:19:39.946261+00:00	2024-12-30 21:19:54.614621+00:00	UNSET		[]	04378f01a108e263	34cca62a58dd4b542b34851d06805599	...	None	None	None	None	NaN	None	NaN	None	NaN	None
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8c4c4b36b90cdf3b	ChatPromptBuilder (builder)	CHAIN	36b0833942399225	2024-12-30 21:21:16.383451+00:00	2024-12-30 21:21:16.403037+00:00	OK		[]	8c4c4b36b90cdf3b	c958063ecd6eb75228ef7db19d2077af	...	application/json	application/json	{"prompt": ["ChatMessage(content=\"You are a t...	CHAIN	NaN	None	NaN	None	NaN	None
64605bc0478476ff	haystack.component.run	UNKNOWN	db791b193994469b	2024-12-30 21:21:16.415345+00:00	2024-12-30 21:21:20.190329+00:00	UNSET		[]	64605bc0478476ff	c958063ecd6eb75228ef7db19d2077af	...	None	None	None	None	NaN	None	NaN	None	NaN	None
db791b193994469b	OpenAIChatGenerator (chat_generator)	LLM	36b0833942399225	2024-12-30 21:21:16.414693+00:00	2024-12-30 21:21:20.193359+00:00	OK		[]	db791b193994469b	c958063ecd6eb75228ef7db19d2077af	...	application/json	application/json	{"replies": ["ChatMessage(content='For Tatum, ...	LLM	250.0	[{'message.role': 'assistant', 'message.conten...	83.0	gpt-4o-2024-08-06	333.0	[{'message.role': 'system', 'message.content':...
36b0833942399225	haystack.pipeline.run	UNKNOWN	b8f5499376727ae4	2024-12-30 21:21:16.382693+00:00	2024-12-30 21:21:20.195741+00:00	UNSET		[]	36b0833942399225	c958063ecd6eb75228ef7db19d2077af	...	None	None	None	None	NaN	None	NaN	None	NaN	None
b8f5499376727ae4	Pipeline	CHAIN	None	2024-12-30 21:21:16.371890+00:00	2024-12-30 21:21:20.197737+00:00	OK		[]	b8f5499376727ae4	c958063ecd6eb75228ef7db19d2077af	...	application/json	application/json	{"chat_generator": {"replies": ["ChatMessage(c...	CHAIN	NaN	None	NaN	None	NaN	None

122 rows × 22 columns

spans_df.span_kind.value_counts()

	count
span_kind
UNKNOWN	62
CHAIN	40
LLM	20

dtype: int64

spans_df[spans_df.span_kind == "LLM"]

	name	span_kind	parent_id	start_time	end_time	status_code	events	context.span_id	context.trace_id	...	attributes.output.mime_type	attributes.input.mime_type	attributes.output.value	attributes.openinference.span.kind	attributes.llm.token_count.completion	attributes.llm.output_messages	attributes.llm.token_count.prompt	attributes.llm.model_name	attributes.llm.token_count.total	attributes.llm.input_messages
context.span_id
a974808d61e79f67	OpenAIChatGenerator (chat_generator)	LLM	e292f10918ed018f	2024-12-30 21:19:39.945971+00:00	2024-12-30 21:19:54.617055+00:00	OK	[]	a974808d61e79f67	34cca62a58dd4b542b34851d06805599	...	application/json	application/json	{"replies": ["ChatMessage(content='For Timmy, ...	LLM	292.0	[{'message.role': 'assistant', 'message.conten...	83.0	gpt-4o-2024-08-06	375.0	[{'message.role': 'system', 'message.content':...
831170dd879999a9	OpenAIChatGenerator (chat_generator)	LLM	8361954531f1e967	2024-12-30 21:19:54.629977+00:00	2024-12-30 21:19:55.574859+00:00	OK	[]	831170dd879999a9	09857b0927c3881e79cc34a13b1cfef2	...	application/json	application/json	{"replies": ["ChatMessage(content=\"Since Tomm...	LLM	47.0	[{'message.role': 'assistant', 'message.conten...	80.0	gpt-4o-2024-08-06	127.0	[{'message.role': 'system', 'message.content':...
fddbd864dee1b913	OpenAIChatGenerator (chat_generator)	LLM	814338ad2143bc23	2024-12-30 21:19:55.590187+00:00	2024-12-30 21:19:58.798686+00:00	OK	[]	fddbd864dee1b913	6fd145dba7576ba486b2f11585c44b1f	...	application/json	application/json	{"replies": ["ChatMessage(content='For Tammy, ...	LLM	222.0	[{'message.role': 'assistant', 'message.conten...	82.0	gpt-4o-2024-08-06	304.0	[{'message.role': 'system', 'message.content':...
25a0815c0b072cd6	OpenAIChatGenerator (chat_generator)	LLM	1ed7156b276ee864	2024-12-30 21:19:58.813465+00:00	2024-12-30 21:20:04.502611+00:00	OK	[]	25a0815c0b072cd6	64435ef6a0f3d1d12dcd1486eaab1ba6	...	application/json	application/json	{"replies": ["ChatMessage(content='For Tina, a...	LLM	351.0	[{'message.role': 'assistant', 'message.conten...	82.0	gpt-4o-2024-08-06	433.0	[{'message.role': 'system', 'message.content':...
f30c75e575a2890a	OpenAIChatGenerator (chat_generator)	LLM	603293a3bb4a7b60	2024-12-30 21:20:04.515530+00:00	2024-12-30 21:20:10.315617+00:00	OK	[]	f30c75e575a2890a	c5361039d9da60af1f9fcd3eaa93468d	...	application/json	application/json	{"replies": ["ChatMessage(content='For Toby, I...	LLM	252.0	[{'message.role': 'assistant', 'message.conten...	83.0	gpt-4o-2024-08-06	335.0	[{'message.role': 'system', 'message.content':...
7641b23bb934ba18	OpenAIChatGenerator (chat_generator)	LLM	ac52c4b71265c7ad	2024-12-30 21:20:10.328041+00:00	2024-12-30 21:20:13.537578+00:00	OK	[]	7641b23bb934ba18	7747de68109a211b910a211a0379bdbf	...	application/json	application/json	{"replies": ["ChatMessage(content='For Tod, wh...	LLM	306.0	[{'message.role': 'assistant', 'message.conten...	81.0	gpt-4o-2024-08-06	387.0	[{'message.role': 'system', 'message.content':...
c993b4c410f6a4d8	OpenAIChatGenerator (chat_generator)	LLM	5b7b0c18058c9b03	2024-12-30 21:20:13.549354+00:00	2024-12-30 21:20:15.277630+00:00	OK	[]	c993b4c410f6a4d8	9b05bb38e3709dc9323a2a25ac1556ba	...	application/json	application/json	{"replies": ["ChatMessage(content='Since Todd ...	LLM	116.0	[{'message.role': 'assistant', 'message.conten...	81.0	gpt-4o-2024-08-06	197.0	[{'message.role': 'system', 'message.content':...
29aefc6f40ef5bd4	OpenAIChatGenerator (chat_generator)	LLM	9e674fbecb0749a9	2024-12-30 21:20:15.293944+00:00	2024-12-30 21:20:19.675824+00:00	OK	[]	29aefc6f40ef5bd4	30ba1e88a5ed38f0792e8fcbb0df8c7b	...	application/json	application/json	{"replies": ["ChatMessage(content='For Tara, w...	LLM	235.0	[{'message.role': 'assistant', 'message.conten...	80.0	gpt-4o-2024-08-06	315.0	[{'message.role': 'system', 'message.content':...
fb0445e28a1cf2b2	OpenAIChatGenerator (chat_generator)	LLM	6687f610d8452989	2024-12-30 21:20:19.688252+00:00	2024-12-30 21:20:24.570018+00:00	OK	[]	fb0445e28a1cf2b2	a3df719ace9e1a4a948456b973011513	...	application/json	application/json	{"replies": ["ChatMessage(content='For 9-year-...	LLM	244.0	[{'message.role': 'assistant', 'message.conten...	84.0	gpt-4o-2024-08-06	328.0	[{'message.role': 'system', 'message.content':...
abfeea386d9417e2	OpenAIChatGenerator (chat_generator)	LLM	c8914043f7f7a5b7	2024-12-30 21:20:24.581873+00:00	2024-12-30 21:20:32.002100+00:00	OK	[]	abfeea386d9417e2	e7abf926ac9384659204c6e644b47bf0	...	application/json	application/json	{"replies": ["ChatMessage(content='For Trey, w...	LLM	368.0	[{'message.role': 'assistant', 'message.conten...	81.0	gpt-4o-2024-08-06	449.0	[{'message.role': 'system', 'message.content':...
117e832a18b4ccd6	OpenAIChatGenerator (chat_generator)	LLM	35e4299a30e8406a	2024-12-30 21:20:32.018025+00:00	2024-12-30 21:20:35.238899+00:00	OK	[]	117e832a18b4ccd6	e016f46a5cbc0b45e3decd966423198d	...	application/json	application/json	{"replies": ["ChatMessage(content='For Tyler, ...	LLM	246.0	[{'message.role': 'assistant', 'message.conten...	80.0	gpt-4o-2024-08-06	326.0	[{'message.role': 'system', 'message.content':...
7b9aabc099e936bf	OpenAIChatGenerator (chat_generator)	LLM	46669109f9ae00e6	2024-12-30 21:20:35.250643+00:00	2024-12-30 21:20:39.544956+00:00	OK	[]	7b9aabc099e936bf	489affb18bb8164e20bbfc0945cdba4f	...	application/json	application/json	{"replies": ["ChatMessage(content='For Tracy, ...	LLM	309.0	[{'message.role': 'assistant', 'message.conten...	79.0	gpt-4o-2024-08-06	388.0	[{'message.role': 'system', 'message.content':...
321d2e909a2da0ad	OpenAIChatGenerator (chat_generator)	LLM	7062dcfa11f36863	2024-12-30 21:20:39.569210+00:00	2024-12-30 21:20:44.072822+00:00	OK	[]	321d2e909a2da0ad	fdd294173534687528e97780d86f7e44	...	application/json	application/json	{"replies": ["ChatMessage(content='For Tony, I...	LLM	277.0	[{'message.role': 'assistant', 'message.conten...	80.0	gpt-4o-2024-08-06	357.0	[{'message.role': 'system', 'message.content':...
22ddb78f8ec8bde2	OpenAIChatGenerator (chat_generator)	LLM	54ee0c08cde66f15	2024-12-30 21:20:44.085434+00:00	2024-12-30 21:20:48.891916+00:00	OK	[]	22ddb78f8ec8bde2	8ee6e037a4ca76c71585fb6df2d560e2	...	application/json	application/json	{"replies": ["ChatMessage(content='For Theo, w...	LLM	156.0	[{'message.role': 'assistant', 'message.conten...	80.0	gpt-4o-2024-08-06	236.0	[{'message.role': 'system', 'message.content':...
792cc8df916a4e4f	OpenAIChatGenerator (chat_generator)	LLM	2117c6fbe9f4d85b	2024-12-30 21:20:48.905360+00:00	2024-12-30 21:20:51.411726+00:00	OK	[]	792cc8df916a4e4f	a55ed617c5d49ad4daa3cbe51a511e0f	...	application/json	application/json	{"replies": ["ChatMessage(content=\"Since Terr...	LLM	94.0	[{'message.role': 'assistant', 'message.conten...	82.0	gpt-4o-2024-08-06	176.0	[{'message.role': 'system', 'message.content':...
473d797dc3c07cfb	OpenAIChatGenerator (chat_generator)	LLM	da06f15713cc600d	2024-12-30 21:20:51.428047+00:00	2024-12-30 21:20:55.472863+00:00	OK	[]	473d797dc3c07cfb	8909e460eace4d04563dce3bbb21a308	...	application/json	application/json	{"replies": ["ChatMessage(content='For Tessa, ...	LLM	231.0	[{'message.role': 'assistant', 'message.conten...	85.0	gpt-4o-2024-08-06	316.0	[{'message.role': 'system', 'message.content':...
2a4002fc523cda89	OpenAIChatGenerator (chat_generator)	LLM	c3e63026a60725db	2024-12-30 21:20:55.484989+00:00	2024-12-30 21:21:02.448618+00:00	OK	[]	2a4002fc523cda89	95c5cc4ad95d4720f4fb77a3a337c679	...	application/json	application/json	{"replies": ["ChatMessage(content='For Troy, I...	LLM	300.0	[{'message.role': 'assistant', 'message.conten...	80.0	gpt-4o-2024-08-06	380.0	[{'message.role': 'system', 'message.content':...
ce6a5eca6905780b	OpenAIChatGenerator (chat_generator)	LLM	f0c26cbaf321e1a2	2024-12-30 21:21:02.464233+00:00	2024-12-30 21:21:08.158557+00:00	OK	[]	ce6a5eca6905780b	2afbc4a4939e859c115ca8da6c7cf87f	...	application/json	application/json	{"replies": ["ChatMessage(content='For Talia, ...	LLM	237.0	[{'message.role': 'assistant', 'message.conten...	84.0	gpt-4o-2024-08-06	321.0	[{'message.role': 'system', 'message.content':...
a6ae3e015c0301d4	OpenAIChatGenerator (chat_generator)	LLM	944e31ed8abc33a9	2024-12-30 21:21:08.187750+00:00	2024-12-30 21:21:16.353757+00:00	OK	[]	a6ae3e015c0301d4	04a4ac8b09c3d58c2d03c06a67da1337	...	application/json	application/json	{"replies": ["ChatMessage(content='For Tyson, ...	LLM	339.0	[{'message.role': 'assistant', 'message.conten...	81.0	gpt-4o-2024-08-06	420.0	[{'message.role': 'system', 'message.content':...
db791b193994469b	OpenAIChatGenerator (chat_generator)	LLM	36b0833942399225	2024-12-30 21:21:16.414693+00:00	2024-12-30 21:21:20.193359+00:00	OK	[]	db791b193994469b	c958063ecd6eb75228ef7db19d2077af	...	application/json	application/json	{"replies": ["ChatMessage(content='For Tatum, ...	LLM	250.0	[{'message.role': 'assistant', 'message.conten...	83.0	gpt-4o-2024-08-06	333.0	[{'message.role': 'system', 'message.content':...

20 rows × 22 columns

# Copy the slice so that adding columns later does not modify a view of spans_df
df = spans_df[spans_df.span_kind == "LLM"][
    ["attributes.input.value", "attributes.output.value"]
].copy()

df.head()

	attributes.input.value	attributes.output.value
context.span_id
a974808d61e79f67	{"messages": ["ChatMessage(content=\"You are a...	{"replies": ["ChatMessage(content='For Timmy, ...
831170dd879999a9	{"messages": ["ChatMessage(content=\"You are a...	{"replies": ["ChatMessage(content=\"Since Tomm...
fddbd864dee1b913	{"messages": ["ChatMessage(content=\"You are a...	{"replies": ["ChatMessage(content='For Tammy, ...
25a0815c0b072cd6	{"messages": ["ChatMessage(content=\"You are a...	{"replies": ["ChatMessage(content='For Tina, a...
f30c75e575a2890a	{"messages": ["ChatMessage(content=\"You are a...	{"replies": ["ChatMessage(content='For Toby, I...

import json

json.loads(df["attributes.input.value"].values[0])["messages"]

'ChatMessage(content="You are a toy maker elf. Your job is to make toys for the nice kids on the nice list. If the child is on the naughty list, give them a \'Rabbit R1\'. Timmy is on the nice list", role=<ChatRole.SYSTEM: \'system\'>, name=None, meta={})'

json.loads(df["attributes.output.value"].values[0])["replies"]

['ChatMessage(content=\'For Timmy, I\\\'d create a custom Lego Adventure Set tailored to his interests! Here\\\'s a description of the set:\\n\\n**Timmy\\\'s Jurassic Adventure Lego Set:**\\n\\n1. **Dinosaur Safari:** This set comes with buildable dinosaur figures like a T-Rex and Triceratops, perfect for Timmy\\\'s adventurous spirit. He can explore and create daring scenarios with these prehistoric giants.\\n\\n2. **Expedition Vehicle:** A cool off-road vehicle with a mini scientist figure that Timmy can use to navigate through the Lego jungle, capturing exciting moments and venturing through imaginative landscapes.\\n\\n3. **Mystery Fossil Site:** An interactive dig site where Timmy can discover hidden "fossils" (special brick pieces) and learn about dinosaurs in a fun, engaging way.\\n\\n4. **Jungle Hut Hideout:** A detailed jungle hut where mini-figures can rest and strategize their next adventure. It includes accessories like binoculars, a map, and a treasure chest.\\n\\n5. **Bonus Feature:** Since Timmy likes Lego and might need a nudge to warm up to vegetables, include a fun mini-veg garden as a bonus side build, where Lego mini-figures can grow their own food. It\\\'s optional to include in his adventure world, but it adds a healthy twist to his playset!\\n\\nThis Lego set encourages creativity, imaginative play, and story-building, sure to provide hours of joy for Timmy!\', role=<ChatRole.ASSISTANT: \'assistant\'>, name=None, meta={\'model\': \'gpt-4o-2024-08-06\', \'index\': 0, \'finish_reason\': \'stop\', \'usage\': {\'completion_tokens\': 292, \'prompt_tokens\': 83, \'total_tokens\': 375, \'completion_tokens_details\': CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), \'prompt_tokens_details\': PromptTokensDetails(audio_tokens=0, cached_tokens=0)}})']

df["description"] = df["attributes.input.value"].apply(
    lambda x: json.loads(x)["messages"]
)
df["toy"] = df["attributes.output.value"].apply(lambda x: json.loads(x)["replies"])

df.drop(["attributes.input.value", "attributes.output.value"], axis=1, inplace=True)

df.head()

	description	toy
context.span_id
a974808d61e79f67	[ChatMessage(content="You are a toy maker elf....	[ChatMessage(content='For Timmy, I\'d create a...
831170dd879999a9	[ChatMessage(content="You are a toy maker elf....	[ChatMessage(content="Since Tommy is on the na...
fddbd864dee1b913	[ChatMessage(content="You are a toy maker elf....	[ChatMessage(content='For Tammy, who is 8 year...
25a0815c0b072cd6	[ChatMessage(content="You are a toy maker elf....	[ChatMessage(content='For Tina, a curious 6-ye...
f30c75e575a2890a	[ChatMessage(content="You are a toy maker elf....	[ChatMessage(content='For Toby, I\'ll create t...
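
Both columns still hold the full `ChatMessage(...)` repr, including role and metadata. If the judge should see only the message text, one option is to pull the `content` string literal out of the repr. The sketch below assumes the repr layout shown in the outputs above (`content=<string literal>, role=...`); `extract_content` is a hypothetical helper and falls back to the raw string when the pattern does not match.

```python
import ast
import re

# Matches the Python string literal that follows "content=" in a ChatMessage repr.
CONTENT_LITERAL = re.compile(
    r"""content=('(?:[^'\\]|\\.)*'|"(?:[^"\\]|\\.)*"), role=""",
    re.DOTALL,
)


def extract_content(message_repr: str) -> str:
    """Hypothetical helper: return only the message text from a ChatMessage repr."""
    match = CONTENT_LITERAL.search(message_repr)
    if match is None:
        return message_repr  # unknown layout: keep the original string
    # literal_eval safely turns the quoted literal back into a plain string.
    return ast.literal_eval(match.group(1))


# Made-up sample in the same shape as the traced output.
sample = (
    "ChatMessage(content='For Timmy, I\\'d build a Lego set.\\nEnjoy!', "
    "role=<ChatRole.ASSISTANT: 'assistant'>, name=None, meta={})"
)
print(extract_content(sample))
```

Applied to the DataFrame, this could look like `df["toy"].apply(lambda msgs: extract_content(msgs[0]))`, again assuming each cell is a list of repr strings.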

import nest_asyncio

nest_asyncio.apply()

from phoenix.evals import (
    llm_classify,
    OpenAIModel,  # can swap for another model supported by Phoenix or run open-source models through LiteLLM and Ollama: https://docs.arize.com/phoenix/evaluation/evaluation-models
)
from phoenix.evals.templates import ClassificationTemplate

# Evaluate the traces with the LLM Judge
# HINT: https://docs.arize.com/phoenix/evaluation/how-to-evals/bring-your-own-evaluator#categorical-llm_classify
# HINT: For evaluation, try using a different language model than the one you used for toy matching
# referred: https://github.com/deepset-ai/haystack/discussions/8579#discussioncomment-11649855
eval_results = llm_classify(
    dataframe=df,
    # The judge prompt is passed as a plain string; its {description} and {toy}
    # placeholders are filled from the DataFrame columns of the same names.
    template=llm_judge_prompt,
    model=OpenAIModel(model="gpt-4o-mini"),
    rails=["correct", "incorrect"],
    provide_explanation=True,
)
eval_results["score"] = eval_results["label"].apply(
    lambda x: 1 if x == "correct" else 0
)
eval_results.head()

	label	explanation	exceptions	execution_status	execution_seconds	score
context.span_id
a974808d61e79f67	correct	The statement is correct because it accurately...	[]	COMPLETED	2.970437	1
831170dd879999a9	incorrect	The question asks for the correct response to ...	[]	COMPLETED	1.046979	0
fddbd864dee1b913	correct	The statement is correct because it accurately...	[]	COMPLETED	1.313569	1
25a0815c0b072cd6	correct	The statement is correct because it accurately...	[]	COMPLETED	1.931698	1
f30c75e575a2890a	correct	The response is correct because it accurately ...	[]	COMPLETED	1.026723	1

eval_results.score.value_counts()

	count
score
1	17
0	3

dtype: int64

# Percentage of toys the judge labeled 'correct'
eval_results["score"].mean() * 100

85.0
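
An LLM judge can be wrong too, so a cheap rule-based cross-check is a useful companion for the one rule that is fully deterministic: naughty children must receive a 'Rabbit R1'. The sketch below is an assumption about how such a check might look; `violates_naughty_rule` is a hypothetical helper, and the sample toys are made up rather than taken from the traces.

```python
def violates_naughty_rule(list_status: str, toy_text: str) -> bool:
    """Hypothetical helper: True when a naughty child got something other than a Rabbit R1."""
    if list_status != "naughty":
        return False  # the rule only constrains children on the naughty list
    return "rabbit r1" not in toy_text.lower()


# Made-up sample outputs for illustration.
samples = [
    ("nice", "A custom Lego adventure set"),
    ("naughty", "Since Tommy is on the naughty list, he gets a Rabbit R1."),
    ("naughty", "A remote control car"),
]

flags = [violates_naughty_rule(status, toy) for status, toy in samples]
print(flags)
```

Rows where this check and the judge disagree are good candidates for manual review.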

from phoenix.trace import SpanEvaluations

# Upload results into Phoenix
# HINT: https://docs.arize.com/phoenix/evaluation/how-to-evals/evaluating-phoenix-traces#download-trace-dataset-from-phoenix
px.Client().log_evaluations(
    SpanEvaluations(eval_name="evaluate_toy", dataframe=eval_results)
)

4. View the results in the Arize Phoenix UI 🐦‍🔥

And just like that, Elf Jane had saved Santa hours of time and made sure every kid got the right toy!

In Phoenix, she could see “correct” and “incorrect” labels on all the traces, and even see the explanations for each label!

She couldn’t wait to show Santa, and all her friends at the hackathon.