! pip install -q arize-phoenix==6.1.0 haystack-ai==2.7.0 openinference-instrumentation-haystack==0.1.13 'httpx<0.28'
Advent of Haystack - Day 7
Santa collapsed in his chair in a huff. “What’s wrong?” asked Mrs Claus.
“There’s just too many toys to check and not enough time! Christmas is almost here!”
“Well can’t you just check some of them?”
“I wish it were that easy! But my elves make so many different toys, and we have to make sure every kid gets the right one!”
Elf Jane couldn’t help overhearing from the next room. She was a regular attendee at the local North Pole hackathon and had learned a lot about evaluation recently, so she thought she could build an LLM Judge to help.
For this challenge, you need to help Elf Jane by completing the code cells marked with #TODO.

Installation
Data
Elf Jane started by checking out the big elf database of Christmas wishlists (aka the BEDCW).
children = [
{
"name": "Timmy",
"age": 7,
"likes": "Lego",
"dislikes": "Vegetables",
"list": "nice",
},
{
"name": "Tommy",
"age": 9,
"likes": "Sports Equipment",
"dislikes": "Reading",
"list": "naughty",
},
{
"name": "Tammy",
"age": 8,
"likes": "Art Supplies",
"dislikes": "Loud Noises",
"list": "nice",
},
{
"name": "Tina",
"age": 6,
"likes": "Science Kits",
"dislikes": "Spicy Food",
"list": "nice",
},
{
"name": "Toby",
"age": 10,
"likes": "Video Games",
"dislikes": "Early Mornings",
"list": "nice",
},
{
"name": "Tod",
"age": 5,
"likes": "Musical Instruments",
"dislikes": "Bath Time",
"list": "nice",
},
{
"name": "Todd",
"age": 8,
"likes": "Remote Control Cars",
"dislikes": "Homework",
"list": "naughty",
},
{
"name": "Tara",
"age": 7,
"likes": "Magic Sets",
"dislikes": "Thunder",
"list": "nice",
},
{
"name": "Teri",
"age": 9,
"likes": "Building Blocks",
"dislikes": "Broccoli",
"list": "nice",
},
{
"name": "Trey",
"age": 6,
"likes": "Board Games",
"dislikes": "Bedtime",
"list": "nice",
},
{
"name": "Tyler",
"age": 8,
"likes": "Action Figures",
"dislikes": "Cleaning",
"list": "nice",
},
{"name": "Tracy", "age": 7, "likes": "Dolls", "dislikes": "Dark", "list": "nice"},
{
"name": "Tony",
"age": 9,
"likes": "Chemistry Sets",
"dislikes": "Dentist",
"list": "nice",
},
{"name": "Theo", "age": 6, "likes": "Puzzles", "dislikes": "Shots", "list": "nice"},
{
"name": "Terry",
"age": 10,
"likes": "Model Trains",
"dislikes": "Chores",
"list": "naughty",
},
{
"name": "Tessa",
"age": 5,
"likes": "Stuffed Animals",
"dislikes": "Time Out",
"list": "nice",
},
{"name": "Troy", "age": 8, "likes": "Robots", "dislikes": "Naps", "list": "nice"},
{
"name": "Talia",
"age": 7,
"likes": "Craft Kits",
"dislikes": "Spinach",
"list": "nice",
},
{
"name": "Tyson",
"age": 9,
"likes": "Microscopes",
"dislikes": "Cold",
"list": "nice",
},
{
"name": "Tatum",
"age": 6,
"likes": "Drawing Sets",
"dislikes": "Shots",
"list": "nice",
},
]
1. Adding Tracing 📝
Elf Jane knew that the elves were busy and didn’t always log their toy-making process, so she would first need to trace it using Arize Phoenix.
from getpass import getpass
import phoenix as px
from openinference.instrumentation.haystack import HaystackInstrumentor
from phoenix.otel import register
px.launch_app().view()
tracer_provider = register()
🌍 To view the Phoenix app in your browser, visit https://hkj442j0hfq5-496ff2e9c6d22116-6006-colab.googleusercontent.com/
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix
📺 Opening a view to the Phoenix app. The app is running at https://hkj442j0hfq5-496ff2e9c6d22116-6006-colab.googleusercontent.com/
🔭 OpenTelemetry Tracing Details 🔭
| Phoenix Project: default
| Span Processor: SimpleSpanProcessor
| Collector Endpoint: localhost:4317
| Transport: gRPC
| Transport Headers: {'user-agent': '****'}
|
| Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|
| `register` has set this TracerProvider as the global OpenTelemetry default.
| To disable this behavior, call `register` with `set_global_tracer_provider=False`.
# Use Phoenix's autoinstrumentor to automatically track traces from Haystack
HaystackInstrumentor().instrument(tracer_provider=tracer_provider, skip_dep_check=True)
2. Trace Toy Making Process 🚂
With tracing in place, Elf Jane had some of her closest elf friends build a batch of toys she could trace.
⭐️ Feel free to replace OpenAIChatGenerator with other ChatGenerators supported in Haystack
import os
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")
Enter your OpenAI API key: ··········
from haystack.dataclasses import ChatMessage
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import ChatPromptBuilder
from haystack import Pipeline
messages = [
ChatMessage.from_system(
"You are a toy maker elf. Your job is to make toys for the nice kids on the nice list. If the child is on the naughty list, give them a 'Rabbit R1'. {{name}} is on the {{list}} list"
),
ChatMessage.from_user(
"Create a toy for {{name}} that they will like. {{name}} is {{age}} years old and likes {{likes}} and dislikes {{dislikes}}."
),
]
builder = ChatPromptBuilder(messages)
chat_generator = OpenAIChatGenerator(model="gpt-4o")
pipeline = Pipeline()
pipeline.add_component("builder", builder)
pipeline.add_component("chat_generator", chat_generator)
pipeline.connect("builder", "chat_generator")
def make_toy(child):
    # Fill the prompt template with the child's fields and return the generator's replies
    return pipeline.run({"builder": {**child}})["chat_generator"]["replies"]
pipeline.show()
for child in children:
    make_toy(child)
3. Evaluate Toy Correctness 🔬
Elf Jane was now ready to evaluate the toys her friends had made. She knew that she could use an LLM Judge to check whether each toy matched the child’s wishlist, so she started by building a judge.
llm_judge_prompt = """
Evaluate the toy for this child, based on their likes and dislikes
All children on the naughty list get a 'Rabbit R1'. Any other toy given to a naughty child is incorrect.
Respond with a single word: 'correct' or 'incorrect'. Also include a short explanation for your answer.
Description of the child: {description}
Toy: {toy}
*****
Example output:
label: 'correct'
explanation: 'The toy is a Lego set, which is one of the child's likes.'
*****
"""
import phoenix as px
# Download the traces from Phoenix
# HINT: https://docs.arize.com/phoenix/evaluation/how-to-evals/evaluating-phoenix-traces#download-trace-dataset-from-phoenix
spans_df = px.Client().get_spans_dataframe()
spans_df
/usr/local/lib/python3.10/dist-packages/phoenix/trace/dsl/query.py:741: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
df_attributes = pd.DataFrame.from_records(
| name | span_kind | parent_id | start_time | end_time | status_code | status_message | events | context.span_id | context.trace_id | ... | attributes.output.mime_type | attributes.input.mime_type | attributes.output.value | attributes.openinference.span.kind | attributes.llm.token_count.completion | attributes.llm.output_messages | attributes.llm.token_count.prompt | attributes.llm.model_name | attributes.llm.token_count.total | attributes.llm.input_messages | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| context.span_id | |||||||||||||||||||||
| 51890ed9b9d0b4e0 | haystack.tracing.auto_enable | UNKNOWN | None | 2024-12-30 21:19:10.741445+00:00 | 2024-12-30 21:19:10.749995+00:00 | ERROR | ImportError: cannot import name 'Span' from pa... | [{'name': 'exception', 'timestamp': '2024-12-3... | 51890ed9b9d0b4e0 | 0db288f7eec1d71a0675b9ac131f90ac | ... | None | None | None | None | NaN | None | NaN | None | NaN | None |
| 054f323f6b6b7af3 | haystack.tracing.auto_enable | UNKNOWN | None | 2024-12-30 21:19:15.862320+00:00 | 2024-12-30 21:19:15.862497+00:00 | UNSET | [] | 054f323f6b6b7af3 | 076657b901d1317e5f966f35b5fd475b | ... | None | None | None | None | NaN | None | NaN | None | NaN | None | |
| 2b5d4d9230514da2 | haystack.component.run | UNKNOWN | 50065d7e957f21ab | 2024-12-30 21:19:39.937971+00:00 | 2024-12-30 21:19:39.941460+00:00 | UNSET | [] | 2b5d4d9230514da2 | 34cca62a58dd4b542b34851d06805599 | ... | None | None | None | None | NaN | None | NaN | None | NaN | None | |
| 50065d7e957f21ab | ChatPromptBuilder (builder) | CHAIN | e292f10918ed018f | 2024-12-30 21:19:39.937541+00:00 | 2024-12-30 21:19:39.944100+00:00 | OK | [] | 50065d7e957f21ab | 34cca62a58dd4b542b34851d06805599 | ... | application/json | application/json | {"prompt": ["ChatMessage(content=\"You are a t... | CHAIN | NaN | None | NaN | None | NaN | None | |
| 04378f01a108e263 | haystack.component.run | UNKNOWN | a974808d61e79f67 | 2024-12-30 21:19:39.946261+00:00 | 2024-12-30 21:19:54.614621+00:00 | UNSET | [] | 04378f01a108e263 | 34cca62a58dd4b542b34851d06805599 | ... | None | None | None | None | NaN | None | NaN | None | NaN | None | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8c4c4b36b90cdf3b | ChatPromptBuilder (builder) | CHAIN | 36b0833942399225 | 2024-12-30 21:21:16.383451+00:00 | 2024-12-30 21:21:16.403037+00:00 | OK | [] | 8c4c4b36b90cdf3b | c958063ecd6eb75228ef7db19d2077af | ... | application/json | application/json | {"prompt": ["ChatMessage(content=\"You are a t... | CHAIN | NaN | None | NaN | None | NaN | None | |
| 64605bc0478476ff | haystack.component.run | UNKNOWN | db791b193994469b | 2024-12-30 21:21:16.415345+00:00 | 2024-12-30 21:21:20.190329+00:00 | UNSET | [] | 64605bc0478476ff | c958063ecd6eb75228ef7db19d2077af | ... | None | None | None | None | NaN | None | NaN | None | NaN | None | |
| db791b193994469b | OpenAIChatGenerator (chat_generator) | LLM | 36b0833942399225 | 2024-12-30 21:21:16.414693+00:00 | 2024-12-30 21:21:20.193359+00:00 | OK | [] | db791b193994469b | c958063ecd6eb75228ef7db19d2077af | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Tatum, ... | LLM | 250.0 | [{'message.role': 'assistant', 'message.conten... | 83.0 | gpt-4o-2024-08-06 | 333.0 | [{'message.role': 'system', 'message.content':... | |
| 36b0833942399225 | haystack.pipeline.run | UNKNOWN | b8f5499376727ae4 | 2024-12-30 21:21:16.382693+00:00 | 2024-12-30 21:21:20.195741+00:00 | UNSET | [] | 36b0833942399225 | c958063ecd6eb75228ef7db19d2077af | ... | None | None | None | None | NaN | None | NaN | None | NaN | None | |
| b8f5499376727ae4 | Pipeline | CHAIN | None | 2024-12-30 21:21:16.371890+00:00 | 2024-12-30 21:21:20.197737+00:00 | OK | [] | b8f5499376727ae4 | c958063ecd6eb75228ef7db19d2077af | ... | application/json | application/json | {"chat_generator": {"replies": ["ChatMessage(c... | CHAIN | NaN | None | NaN | None | NaN | None |
122 rows × 22 columns
spans_df.span_kind.value_counts()
| span_kind | count |
|---|---|
| UNKNOWN | 62 |
| CHAIN | 40 |
| LLM | 20 |
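The counts above can be cross-checked with a little arithmetic. The following is a minimal sketch, assuming each pipeline run emits the six spans visible in the table rows (the `(builder)` and `(chat_generator)` suffixes on `haystack.component.run` are labels added here for illustration) and that the two `haystack.tracing.auto_enable` rows are the only spans outside a pipeline run.

```python
from collections import Counter

# Span kinds assumed for ONE pipeline run, inferred from the rows shown above.
SPANS_PER_RUN = {
    "Pipeline": "CHAIN",
    "haystack.pipeline.run": "UNKNOWN",
    "ChatPromptBuilder (builder)": "CHAIN",
    "haystack.component.run (builder)": "UNKNOWN",
    "OpenAIChatGenerator (chat_generator)": "LLM",
    "haystack.component.run (chat_generator)": "UNKNOWN",
}
NUM_RUNS = 20       # one pipeline run per child in the wishlist
STARTUP_SPANS = 2   # the two haystack.tracing.auto_enable rows (both UNKNOWN)


def expected_span_counts(num_runs, startup_spans):
    """Tally the expected number of spans per kind across all runs."""
    counts = Counter()
    for kind in SPANS_PER_RUN.values():
        counts[kind] += num_runs
    counts["UNKNOWN"] += startup_spans
    return counts


counts = expected_span_counts(NUM_RUNS, STARTUP_SPANS)
total = sum(counts.values())
print(dict(counts), total)
```

Under these assumptions the tally comes to 62 UNKNOWN, 40 CHAIN, and 20 LLM spans, 122 in total, which matches the dataframe. Only the 20 LLM spans carry the prompt and the generated toy, which is why the next step filters on `span_kind == "LLM"`.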
spans_df[spans_df.span_kind == "LLM"]
| name | span_kind | parent_id | start_time | end_time | status_code | status_message | events | context.span_id | context.trace_id | ... | attributes.output.mime_type | attributes.input.mime_type | attributes.output.value | attributes.openinference.span.kind | attributes.llm.token_count.completion | attributes.llm.output_messages | attributes.llm.token_count.prompt | attributes.llm.model_name | attributes.llm.token_count.total | attributes.llm.input_messages | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| context.span_id | |||||||||||||||||||||
| a974808d61e79f67 | OpenAIChatGenerator (chat_generator) | LLM | e292f10918ed018f | 2024-12-30 21:19:39.945971+00:00 | 2024-12-30 21:19:54.617055+00:00 | OK | [] | a974808d61e79f67 | 34cca62a58dd4b542b34851d06805599 | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Timmy, ... | LLM | 292.0 | [{'message.role': 'assistant', 'message.conten... | 83.0 | gpt-4o-2024-08-06 | 375.0 | [{'message.role': 'system', 'message.content':... | |
| 831170dd879999a9 | OpenAIChatGenerator (chat_generator) | LLM | 8361954531f1e967 | 2024-12-30 21:19:54.629977+00:00 | 2024-12-30 21:19:55.574859+00:00 | OK | [] | 831170dd879999a9 | 09857b0927c3881e79cc34a13b1cfef2 | ... | application/json | application/json | {"replies": ["ChatMessage(content=\"Since Tomm... | LLM | 47.0 | [{'message.role': 'assistant', 'message.conten... | 80.0 | gpt-4o-2024-08-06 | 127.0 | [{'message.role': 'system', 'message.content':... | |
| fddbd864dee1b913 | OpenAIChatGenerator (chat_generator) | LLM | 814338ad2143bc23 | 2024-12-30 21:19:55.590187+00:00 | 2024-12-30 21:19:58.798686+00:00 | OK | [] | fddbd864dee1b913 | 6fd145dba7576ba486b2f11585c44b1f | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Tammy, ... | LLM | 222.0 | [{'message.role': 'assistant', 'message.conten... | 82.0 | gpt-4o-2024-08-06 | 304.0 | [{'message.role': 'system', 'message.content':... | |
| 25a0815c0b072cd6 | OpenAIChatGenerator (chat_generator) | LLM | 1ed7156b276ee864 | 2024-12-30 21:19:58.813465+00:00 | 2024-12-30 21:20:04.502611+00:00 | OK | [] | 25a0815c0b072cd6 | 64435ef6a0f3d1d12dcd1486eaab1ba6 | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Tina, a... | LLM | 351.0 | [{'message.role': 'assistant', 'message.conten... | 82.0 | gpt-4o-2024-08-06 | 433.0 | [{'message.role': 'system', 'message.content':... | |
| f30c75e575a2890a | OpenAIChatGenerator (chat_generator) | LLM | 603293a3bb4a7b60 | 2024-12-30 21:20:04.515530+00:00 | 2024-12-30 21:20:10.315617+00:00 | OK | [] | f30c75e575a2890a | c5361039d9da60af1f9fcd3eaa93468d | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Toby, I... | LLM | 252.0 | [{'message.role': 'assistant', 'message.conten... | 83.0 | gpt-4o-2024-08-06 | 335.0 | [{'message.role': 'system', 'message.content':... | |
| 7641b23bb934ba18 | OpenAIChatGenerator (chat_generator) | LLM | ac52c4b71265c7ad | 2024-12-30 21:20:10.328041+00:00 | 2024-12-30 21:20:13.537578+00:00 | OK | [] | 7641b23bb934ba18 | 7747de68109a211b910a211a0379bdbf | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Tod, wh... | LLM | 306.0 | [{'message.role': 'assistant', 'message.conten... | 81.0 | gpt-4o-2024-08-06 | 387.0 | [{'message.role': 'system', 'message.content':... | |
| c993b4c410f6a4d8 | OpenAIChatGenerator (chat_generator) | LLM | 5b7b0c18058c9b03 | 2024-12-30 21:20:13.549354+00:00 | 2024-12-30 21:20:15.277630+00:00 | OK | [] | c993b4c410f6a4d8 | 9b05bb38e3709dc9323a2a25ac1556ba | ... | application/json | application/json | {"replies": ["ChatMessage(content='Since Todd ... | LLM | 116.0 | [{'message.role': 'assistant', 'message.conten... | 81.0 | gpt-4o-2024-08-06 | 197.0 | [{'message.role': 'system', 'message.content':... | |
| 29aefc6f40ef5bd4 | OpenAIChatGenerator (chat_generator) | LLM | 9e674fbecb0749a9 | 2024-12-30 21:20:15.293944+00:00 | 2024-12-30 21:20:19.675824+00:00 | OK | [] | 29aefc6f40ef5bd4 | 30ba1e88a5ed38f0792e8fcbb0df8c7b | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Tara, w... | LLM | 235.0 | [{'message.role': 'assistant', 'message.conten... | 80.0 | gpt-4o-2024-08-06 | 315.0 | [{'message.role': 'system', 'message.content':... | |
| fb0445e28a1cf2b2 | OpenAIChatGenerator (chat_generator) | LLM | 6687f610d8452989 | 2024-12-30 21:20:19.688252+00:00 | 2024-12-30 21:20:24.570018+00:00 | OK | [] | fb0445e28a1cf2b2 | a3df719ace9e1a4a948456b973011513 | ... | application/json | application/json | {"replies": ["ChatMessage(content='For 9-year-... | LLM | 244.0 | [{'message.role': 'assistant', 'message.conten... | 84.0 | gpt-4o-2024-08-06 | 328.0 | [{'message.role': 'system', 'message.content':... | |
| abfeea386d9417e2 | OpenAIChatGenerator (chat_generator) | LLM | c8914043f7f7a5b7 | 2024-12-30 21:20:24.581873+00:00 | 2024-12-30 21:20:32.002100+00:00 | OK | [] | abfeea386d9417e2 | e7abf926ac9384659204c6e644b47bf0 | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Trey, w... | LLM | 368.0 | [{'message.role': 'assistant', 'message.conten... | 81.0 | gpt-4o-2024-08-06 | 449.0 | [{'message.role': 'system', 'message.content':... | |
| 117e832a18b4ccd6 | OpenAIChatGenerator (chat_generator) | LLM | 35e4299a30e8406a | 2024-12-30 21:20:32.018025+00:00 | 2024-12-30 21:20:35.238899+00:00 | OK | [] | 117e832a18b4ccd6 | e016f46a5cbc0b45e3decd966423198d | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Tyler, ... | LLM | 246.0 | [{'message.role': 'assistant', 'message.conten... | 80.0 | gpt-4o-2024-08-06 | 326.0 | [{'message.role': 'system', 'message.content':... | |
| 7b9aabc099e936bf | OpenAIChatGenerator (chat_generator) | LLM | 46669109f9ae00e6 | 2024-12-30 21:20:35.250643+00:00 | 2024-12-30 21:20:39.544956+00:00 | OK | [] | 7b9aabc099e936bf | 489affb18bb8164e20bbfc0945cdba4f | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Tracy, ... | LLM | 309.0 | [{'message.role': 'assistant', 'message.conten... | 79.0 | gpt-4o-2024-08-06 | 388.0 | [{'message.role': 'system', 'message.content':... | |
| 321d2e909a2da0ad | OpenAIChatGenerator (chat_generator) | LLM | 7062dcfa11f36863 | 2024-12-30 21:20:39.569210+00:00 | 2024-12-30 21:20:44.072822+00:00 | OK | [] | 321d2e909a2da0ad | fdd294173534687528e97780d86f7e44 | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Tony, I... | LLM | 277.0 | [{'message.role': 'assistant', 'message.conten... | 80.0 | gpt-4o-2024-08-06 | 357.0 | [{'message.role': 'system', 'message.content':... | |
| 22ddb78f8ec8bde2 | OpenAIChatGenerator (chat_generator) | LLM | 54ee0c08cde66f15 | 2024-12-30 21:20:44.085434+00:00 | 2024-12-30 21:20:48.891916+00:00 | OK | [] | 22ddb78f8ec8bde2 | 8ee6e037a4ca76c71585fb6df2d560e2 | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Theo, w... | LLM | 156.0 | [{'message.role': 'assistant', 'message.conten... | 80.0 | gpt-4o-2024-08-06 | 236.0 | [{'message.role': 'system', 'message.content':... | |
| 792cc8df916a4e4f | OpenAIChatGenerator (chat_generator) | LLM | 2117c6fbe9f4d85b | 2024-12-30 21:20:48.905360+00:00 | 2024-12-30 21:20:51.411726+00:00 | OK | [] | 792cc8df916a4e4f | a55ed617c5d49ad4daa3cbe51a511e0f | ... | application/json | application/json | {"replies": ["ChatMessage(content=\"Since Terr... | LLM | 94.0 | [{'message.role': 'assistant', 'message.conten... | 82.0 | gpt-4o-2024-08-06 | 176.0 | [{'message.role': 'system', 'message.content':... | |
| 473d797dc3c07cfb | OpenAIChatGenerator (chat_generator) | LLM | da06f15713cc600d | 2024-12-30 21:20:51.428047+00:00 | 2024-12-30 21:20:55.472863+00:00 | OK | [] | 473d797dc3c07cfb | 8909e460eace4d04563dce3bbb21a308 | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Tessa, ... | LLM | 231.0 | [{'message.role': 'assistant', 'message.conten... | 85.0 | gpt-4o-2024-08-06 | 316.0 | [{'message.role': 'system', 'message.content':... | |
| 2a4002fc523cda89 | OpenAIChatGenerator (chat_generator) | LLM | c3e63026a60725db | 2024-12-30 21:20:55.484989+00:00 | 2024-12-30 21:21:02.448618+00:00 | OK | [] | 2a4002fc523cda89 | 95c5cc4ad95d4720f4fb77a3a337c679 | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Troy, I... | LLM | 300.0 | [{'message.role': 'assistant', 'message.conten... | 80.0 | gpt-4o-2024-08-06 | 380.0 | [{'message.role': 'system', 'message.content':... | |
| ce6a5eca6905780b | OpenAIChatGenerator (chat_generator) | LLM | f0c26cbaf321e1a2 | 2024-12-30 21:21:02.464233+00:00 | 2024-12-30 21:21:08.158557+00:00 | OK | [] | ce6a5eca6905780b | 2afbc4a4939e859c115ca8da6c7cf87f | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Talia, ... | LLM | 237.0 | [{'message.role': 'assistant', 'message.conten... | 84.0 | gpt-4o-2024-08-06 | 321.0 | [{'message.role': 'system', 'message.content':... | |
| a6ae3e015c0301d4 | OpenAIChatGenerator (chat_generator) | LLM | 944e31ed8abc33a9 | 2024-12-30 21:21:08.187750+00:00 | 2024-12-30 21:21:16.353757+00:00 | OK | [] | a6ae3e015c0301d4 | 04a4ac8b09c3d58c2d03c06a67da1337 | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Tyson, ... | LLM | 339.0 | [{'message.role': 'assistant', 'message.conten... | 81.0 | gpt-4o-2024-08-06 | 420.0 | [{'message.role': 'system', 'message.content':... | |
| db791b193994469b | OpenAIChatGenerator (chat_generator) | LLM | 36b0833942399225 | 2024-12-30 21:21:16.414693+00:00 | 2024-12-30 21:21:20.193359+00:00 | OK | [] | db791b193994469b | c958063ecd6eb75228ef7db19d2077af | ... | application/json | application/json | {"replies": ["ChatMessage(content='For Tatum, ... | LLM | 250.0 | [{'message.role': 'assistant', 'message.conten... | 83.0 | gpt-4o-2024-08-06 | 333.0 | [{'message.role': 'system', 'message.content':... |
20 rows × 22 columns
df = spans_df[spans_df.span_kind == "LLM"][
    ["attributes.input.value", "attributes.output.value"]
].copy()  # copy so the columns added below go into an independent frame
df.head()
| context.span_id | attributes.input.value | attributes.output.value |
|---|---|---|
| a974808d61e79f67 | {"messages": ["ChatMessage(content=\"You are a... | {"replies": ["ChatMessage(content='For Timmy, ... |
| 831170dd879999a9 | {"messages": ["ChatMessage(content=\"You are a... | {"replies": ["ChatMessage(content=\"Since Tomm... |
| fddbd864dee1b913 | {"messages": ["ChatMessage(content=\"You are a... | {"replies": ["ChatMessage(content='For Tammy, ... |
| 25a0815c0b072cd6 | {"messages": ["ChatMessage(content=\"You are a... | {"replies": ["ChatMessage(content='For Tina, a... |
| f30c75e575a2890a | {"messages": ["ChatMessage(content=\"You are a... | {"replies": ["ChatMessage(content='For Toby, I... |
import json
json.loads(df["attributes.input.value"].values[0])["messages"]
'ChatMessage(content="You are a toy maker elf. Your job is to make toys for the nice kids on the nice list. If the child is on the naughty list, give them a \'Rabbit R1\'. Timmy is on the nice list", role=<ChatRole.SYSTEM: \'system\'>, name=None, meta={})'
json.loads(df["attributes.output.value"].values[0])["replies"]
['ChatMessage(content=\'For Timmy, I\\\'d create a custom Lego Adventure Set tailored to his interests! Here\\\'s a description of the set:\\n\\n**Timmy\\\'s Jurassic Adventure Lego Set:**\\n\\n1. **Dinosaur Safari:** This set comes with buildable dinosaur figures like a T-Rex and Triceratops, perfect for Timmy\\\'s adventurous spirit. He can explore and create daring scenarios with these prehistoric giants.\\n\\n2. **Expedition Vehicle:** A cool off-road vehicle with a mini scientist figure that Timmy can use to navigate through the Lego jungle, capturing exciting moments and venturing through imaginative landscapes.\\n\\n3. **Mystery Fossil Site:** An interactive dig site where Timmy can discover hidden "fossils" (special brick pieces) and learn about dinosaurs in a fun, engaging way.\\n\\n4. **Jungle Hut Hideout:** A detailed jungle hut where mini-figures can rest and strategize their next adventure. It includes accessories like binoculars, a map, and a treasure chest.\\n\\n5. **Bonus Feature:** Since Timmy likes Lego and might need a nudge to warm up to vegetables, include a fun mini-veg garden as a bonus side build, where Lego mini-figures can grow their own food. It\\\'s optional to include in his adventure world, but it adds a healthy twist to his playset!\\n\\nThis Lego set encourages creativity, imaginative play, and story-building, sure to provide hours of joy for Timmy!\', role=<ChatRole.ASSISTANT: \'assistant\'>, name=None, meta={\'model\': \'gpt-4o-2024-08-06\', \'index\': 0, \'finish_reason\': \'stop\', \'usage\': {\'completion_tokens\': 292, \'prompt_tokens\': 83, \'total_tokens\': 375, \'completion_tokens_details\': CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), \'prompt_tokens_details\': PromptTokensDetails(audio_tokens=0, cached_tokens=0)}})']
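As the output shows, each attribute value is a JSON string whose `messages` or `replies` field holds stringified `ChatMessage` objects, metadata included. The notebook passes these strings to the judge as they are. If only the message text is wanted, one optional approach is sketched below; it assumes the `ChatMessage(content=..., role=...)` layout seen above, and both the `message_texts` helper and the shortened sample value are hypothetical. Parsing a repr with a regular expression is fragile, so the helper falls back to the full string when the pattern does not match.

```python
import json
import re

# Hypothetical, shortened attribute value in the same shape as the output above.
raw = json.dumps(
    {
        "replies": [
            "ChatMessage(content='For Timmy, a Lego set.', "
            "role=<ChatRole.ASSISTANT: 'assistant'>, name=None, meta={})"
        ]
    }
)

# Capture whatever sits between the opening quote of `content=` and `, role=`.
CONTENT_RE = re.compile(r"ChatMessage\(content=(['\"])(.*?)\1, role=", re.DOTALL)


def message_texts(attribute_value, key):
    """Return the content of each stringified ChatMessage stored under `key`."""
    texts = []
    for message in json.loads(attribute_value)[key]:
        match = CONTENT_RE.search(message)
        # Fall back to the full string if the repr does not look as expected.
        texts.append(match.group(2) if match else message)
    return texts


texts = message_texts(raw, "replies")
print(texts)
```

The same helper could be applied to the `messages` key of the input attribute, again only under the assumption that the stored repr keeps this layout.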
df["description"] = df["attributes.input.value"].apply(
    lambda x: json.loads(x)["messages"]
)
df["toy"] = df["attributes.output.value"].apply(lambda x: json.loads(x)["replies"])
df.drop(["attributes.input.value", "attributes.output.value"], axis=1, inplace=True)
df.head()
| context.span_id | description | toy |
|---|---|---|
| a974808d61e79f67 | [ChatMessage(content="You are a toy maker elf.... | [ChatMessage(content='For Timmy, I\'d create a... |
| 831170dd879999a9 | [ChatMessage(content="You are a toy maker elf.... | [ChatMessage(content="Since Tommy is on the na... |
| fddbd864dee1b913 | [ChatMessage(content="You are a toy maker elf.... | [ChatMessage(content='For Tammy, who is 8 year... |
| 25a0815c0b072cd6 | [ChatMessage(content="You are a toy maker elf.... | [ChatMessage(content='For Tina, a curious 6-ye... |
| f30c75e575a2890a | [ChatMessage(content="You are a toy maker elf.... | [ChatMessage(content='For Toby, I\'ll create t... |
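The column names `description` and `toy` are chosen to match the `{description}` and `{toy}` placeholders in the judge prompt, since the evaluator fills the template from each dataframe row. The sketch below illustrates that mapping with plain `str.format`; it is a stand-in for illustration only, not how Phoenix performs the substitution internally, and the shortened prompt and sample row are hypothetical.

```python
import string

# Shortened stand-in for the judge prompt; only the placeholders matter here.
judge_prompt = "Description of the child: {description}\nToy: {toy}\n"

# One hypothetical row, keyed by the same column names as the dataframe.
row = {"description": "Timmy is 7 and likes Lego.", "toy": "A Lego adventure set"}


def placeholders(template):
    """Collect the named fields that appear in a str.format-style template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


# Any placeholder without a matching column would show up here.
missing = placeholders(judge_prompt) - set(row)
rendered = judge_prompt.format(**row)
print(missing)
print(rendered)
```

Running a check like this before the evaluation can catch a misspelled column name early, before any judge calls are made.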
import nest_asyncio
nest_asyncio.apply()
from phoenix.evals import (
llm_classify,
OpenAIModel, # can swap for another model supported by Phoenix or run open-source models through LiteLLM and Ollama: https://docs.arize.com/phoenix/evaluation/evaluation-models
)
from phoenix.evals.templates import ClassificationTemplate
# Evaluate the traces with the LLM Judge
# HINT: https://docs.arize.com/phoenix/evaluation/how-to-evals/bring-your-own-evaluator#categorical-llm_classify
# HINT: For evaluation, try using a different language model than the one you used for toy matching
# referred: https://github.com/deepset-ai/haystack/discussions/8579#discussioncomment-11649855
eval_results = llm_classify(
dataframe=df,
template=ClassificationTemplate(
template=llm_judge_prompt,
rails=["correct", "incorrect"],
explanation_template="Explanation: ",
),
model=OpenAIModel(model="gpt-4o-mini"),
rails=["correct", "incorrect"],
provide_explanation=True,
)
eval_results["score"] = eval_results["label"].apply(
    lambda x: 1 if x == "correct" else 0
)
eval_results.head()
| context.span_id | label | explanation | exceptions | execution_status | execution_seconds | score |
|---|---|---|---|---|---|---|
| a974808d61e79f67 | correct | The statement is correct because it accurately... | [] | COMPLETED | 2.970437 | 1 |
| 831170dd879999a9 | incorrect | The question asks for the correct response to ... | [] | COMPLETED | 1.046979 | 0 |
| fddbd864dee1b913 | correct | The statement is correct because it accurately... | [] | COMPLETED | 1.313569 | 1 |
| 25a0815c0b072cd6 | correct | The statement is correct because it accurately... | [] | COMPLETED | 1.931698 | 1 |
| f30c75e575a2890a | correct | The response is correct because it accurately ... | [] | COMPLETED | 1.026723 | 1 |
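The naughty-list rule in the judge prompt is deterministic, so it can also be cross-checked without an LLM. The following is a minimal sketch, assuming the generated text mentions "Rabbit R1" verbatim when that toy is given; `rule_check` and the sample strings are hypothetical and not part of the original notebook.

```python
def rule_check(child_list, toy_text):
    """Apply the naughty-list rule from the judge prompt without an LLM.

    Returns 'correct' or 'incorrect' for naughty children, and None for
    nice children, whose toys still need the LLM judge.
    """
    if child_list == "naughty":
        return "correct" if "rabbit r1" in toy_text.lower() else "incorrect"
    return None


# Hypothetical toy descriptions, for illustration only.
print(rule_check("naughty", "Since this child is on the naughty list, they get a Rabbit R1."))
print(rule_check("naughty", "A shiny new football"))
print(rule_check("nice", "A Lego adventure set"))
```

Comparing these rule-based labels with the judge's labels for the naughty children is one way to spot cases where the judge itself may be wrong.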
eval_results.score.value_counts()
| score | count |
|---|---|
| 1 | 17 |
| 0 | 3 |
(17 / 20) * 100
85.0
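The same percentage can be derived from the labels rather than typed in by hand. A minimal sketch, using a plain list that mirrors the 17 "correct" and 3 "incorrect" labels counted above (the list itself is a stand-in for the `label` column):

```python
# Stand-in for eval_results["label"]: 17 correct and 3 incorrect, as counted above.
labels = ["correct"] * 17 + ["incorrect"] * 3

# Same scoring rule as the notebook: 1 for "correct", 0 otherwise.
scores = [1 if label == "correct" else 0 for label in labels]

accuracy = 100 * sum(scores) / len(scores)
print(accuracy)  # 85.0
```

Computing the figure this way keeps it in step with the data if the evaluation is rerun and the counts change.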
from phoenix.trace import SpanEvaluations
# Upload results into Phoenix
# HINT: https://docs.arize.com/phoenix/evaluation/how-to-evals/evaluating-phoenix-traces#download-trace-dataset-from-phoenix
px.Client().log_evaluations(
SpanEvaluations(eval_name="evaluate_toy", dataframe=eval_results)
)
4. View the results in the Arize Phoenix UI 🐦🔥
And just like that, Elf Jane had saved Santa hours of time and made sure every kid got the right toy!
In Phoenix, she could see “correct” and “incorrect” labels on all the traces, and even see the explanations for each label!
She couldn’t wait to show Santa, and all her friends at the hackathon.
