# Analyzing Agent Behavior

> Use LangSmith traces to understand and improve agent performance

## Why Tracing Matters

AI agents are complex systems that make multiple LLM calls, use various tools, and maintain conversation state. When something goes wrong, or right, you need to understand exactly what happened.

Tracing lets you:

* See every step: LLM calls, tool invocations, inputs, outputs, and timing
* Understand why an agent made a particular decision or used a specific tool
* Measure latency, token usage, and cost per interaction
* Identify common failure modes or inefficient behavior patterns

## Setting Up Tracing

The OfficeFlow agent uses LangSmith for tracing. The setup involves three steps:

### 1. Wrap the OpenAI Client

```python
import os

from langsmith.wrappers import wrap_openai
from openai import AsyncOpenAI

# Wrap client for automatic LLM call tracing
client = wrap_openai(AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY")))
```

```typescript
import { wrapOpenAI } from "langsmith/wrappers";
import OpenAI from "openai";

// Wrap client for automatic LLM call tracing
const client = wrapOpenAI(new OpenAI({ apiKey: process.env.OPENAI_API_KEY }));
```

This automatically traces all calls to `client.chat.completions.create()` and `client.embeddings.create()`.

### 2. Decorate Tools

```python
import sqlite3

from langsmith import traceable


@traceable(name="query_database", run_type="tool")
def query_database(query: str, db_path: str) -> str:
    """Execute SQL query against the inventory database."""
    try:
        conn = sqlite3.connect(db_path)
        cursor = conn.cursor()
        cursor.execute(query)
        results = cursor.fetchall()
        conn.close()
        return str(results)
    except Exception as e:
        # Return the error as text so the LLM can see what went wrong
        return f"Error: {str(e)}"


@traceable(name="search_knowledge_base", run_type="tool")
async def search_knowledge_base(query: str, top_k: int = 2) -> str:
    """Search knowledge base using semantic similarity."""
    # ... implementation
    pass
```

```typescript
import { traceable } from "langsmith/traceable";
// Assumption: Database comes from the better-sqlite3 package
import Database from "better-sqlite3";

const queryDatabase = traceable(
  (query: string, dbPath: string): string => {
    try {
      const db = new Database(dbPath);
      const results = db.prepare(query).all();
      db.close();
      return JSON.stringify(results);
    } catch (e: any) {
      // Return the error as text so the LLM can see what went wrong
      return `Error: ${e.message}`;
    }
  },
  { name: "query_database", run_type: "tool" }
);

const searchKnowledgeBase = traceable(
  async (query: string, topK: number = 2): Promise<string> => {
    // ... implementation
  },
  { name: "search_knowledge_base", run_type: "tool" }
);
```

### 3. Trace the Main Chat Function

```python
from langsmith import traceable, uuid7

thread_id = str(uuid7())  # Unique ID for this conversation


@traceable(name="Emma", metadata={"thread_id": thread_id})
async def chat(question: str) -> str:
    """Process a user question and return assistant response."""
    # ... chat logic
    pass
```

```typescript
import { uuid7 } from "langsmith";
import { traceable } from "langsmith/traceable";

const threadId = String(uuid7()); // Unique ID for this conversation

const chat = traceable(
  async (question: string): Promise<{ messages: any[]; output: string }> => {
    // ... chat logic
  },
  { name: "Emma", metadata: { thread_id: threadId } }
);
```

### Environment Configuration

Set these environment variables to enable LangSmith:

```bash
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your_api_key_here
LANGCHAIN_PROJECT=officeflow-agent  # Optional: organize traces by project
```

## Anatomy of a Trace

When you run the agent and ask a question, LangSmith creates a hierarchical trace:

```
Emma (parent trace)
├── ChatOpenAI (LLM call #1)
│   ├── Input: system prompt + user question
│   └── Output: assistant message with tool_calls
│
├── query_database (tool)
│   ├── Input: {query: "SELECT name FROM sqlite_master WHERE type='table'"}
│   └── Output: "[('products',), ('inventory',)]"
│
├── query_database (tool)
│   ├── Input: {query: "PRAGMA table_info(products)"}
│   └── Output: "[(0, 'id', 'INTEGER', ...), ...]"
│
├── query_database (tool)
│   ├── Input: {query: "SELECT * FROM products WHERE category='Paper'"}
│   └── Output: "[('P001', 'Copy Paper', 'Paper', ...), ...]"
│
└── ChatOpenAI (LLM call #2)
    ├── Input: previous messages + tool results
    └── Output: final user-facing response
```

### Key Components

**Parent trace**

* Represents the entire interaction
* Contains metadata like `thread_id` for grouping conversations
* Shows total latency and cost for the full response

**LLM calls**

* Includes full prompt (system + history + user message)
* Shows model parameters (temperature, model name, etc.)
* Displays token counts (prompt tokens, completion tokens)
* Records latency and cost
* Contains the raw response including tool calls

**Tool calls**

* Shows which tool was called and why
* Displays input arguments (e.g., the SQL query)
* Records the output returned to the LLM
* Captures errors if the tool failed
* Measures tool execution time

## Common Debugging Scenarios

### Scenario 1: Wrong Tool Called

**Problem**: Agent uses `search_knowledge_base` when it should use `query_database`.

**How to diagnose**:

1. Open the trace in LangSmith
2. Look at the first LLM call's output
3. Check the `tool_calls` array to see which tool was selected
4. Examine the LLM's reasoning by looking at any `content` before tool calls
5. Review your tool descriptions. Are they clear and distinct?

**Common causes**:

* Tool descriptions are too similar
* System prompt doesn't clearly delineate when to use each tool
* User query is ambiguous

**Fix**: Update tool descriptions or add examples to the system prompt.

### Scenario 2: Tool Returns Error

**Problem**: Tool call fails with an error like "no such column: product\_name".

**How to diagnose**:

1. Click on the `query_database` node in the trace tree
2. Look at the SQL query the agent generated. Does it match your schema?
3. Read the error message. Does it indicate a schema mismatch, permission issue, or syntax error?
4. Look at previous tool calls in the trace. Did the agent discover the schema first?

**Common causes**:

* Agent didn't discover schema (missing from tool description)
* Agent hallucinated column names
* Database connection issues

**Fix**: Add schema discovery instructions to tool description (see v2 in [Agent Versions](/agents/agent-versions)).

### Scenario 3: Poor Response Quality

**Problem**: Agent's final response is too verbose, inaccurate, or unhelpful.

**How to diagnose**:

1. Open the final LLM call in the trace
2. Examine the full prompt passed to the LLM:
   * System prompt
   * Conversation history
   * Tool results
   * User question
3. Check if tool results contained the right information
4. Look for context window issues (truncation, too much irrelevant data)
5. Review the completion to see if the LLM properly synthesized the tool results

**Common causes**:

* System prompt doesn't include relevant guidelines
* Tool returned too much or too little information
* No examples of good responses in the prompt
* Agent is using stale conversation history

**Fix**:

* Add specific instructions to system prompt (e.g., conciseness directive)
* Improve tool output formatting
* Add few-shot examples

### Scenario 4: Excessive Tool Use

**Problem**: Agent makes 5+ tool calls for a simple question.

**How to diagnose**:

1. Count tool call nodes in the trace
2. Check if tool calls are redundant (same query multiple times)
3. Look for schema discovery happening multiple times
4. See if agent is exploring different tables unnecessarily

**Common causes**:

* Agent doesn't remember it already discovered schema
* Tool description encourages exploration
* Agent is uncertain and tries multiple approaches

**Fix**:

* Cache schema information in conversation history
* Provide schema upfront in system prompt for small databases
* Add instruction to minimize tool calls

### Scenario 5: Ignoring Tool Results

**Problem**: Agent calls a tool but doesn't use the results in its response.

**How to diagnose**:

1. Find the tool call in the trace
2. Note what data was returned
3. Look at the final LLM call
4. Check if the tool result is in the prompt but not referenced in the completion
5. See if there's a prompt engineering issue causing the LLM to ignore tool results

**Common causes**:

* Tool result format is hard to parse (e.g., deeply nested JSON)
* Tool returned error but system prompt doesn't handle errors well
* System prompt doesn't emphasize using tool results

**Fix**:

* Format tool results more clearly (e.g., use markdown tables)
* Add explicit instruction: "Base your answer on the tool results"
* Handle tool errors gracefully in the tool function

## Analyzing Patterns Across Traces

### Filters and Search

LangSmith lets you filter and search traces in several ways:

```python
# All traces for a specific thread
thread_id = "abc-123-def-456"

# Filter in LangSmith UI: metadata.thread_id = "abc-123-def-456"
```

View all interactions in a single conversation to understand context.

```
# Show only errors
status = "error"

# Show only successful traces
status = "success"
```

Focus on problematic interactions.

```
# Find all queries about "paper"
input contains "paper"

# Find responses that mentioned returns
output contains "returns@officeflow.com"
```

Identify how the agent handles specific topics.

### Metrics and Analytics

LangSmith provides aggregate metrics:

**Latency**

* P50, P95, P99 latency
* Identify slow traces
* Compare versions

**Tokens and cost**

* Total tokens used
* Cost per trace
* Cost breakdown by model

**Error rate**

* Percentage of failed traces
* Common error types
* Trends over time

**Tool usage**

* How often each tool is called
* Average calls per trace
* Tool success rate

## Comparing Agent Versions

When you improve your agent (e.g., v4 → v5), use traces to measure impact:

### 1. Create Separate Projects

```bash
# v4 traces
LANGCHAIN_PROJECT=officeflow-agent-v4

# v5 traces
LANGCHAIN_PROJECT=officeflow-agent-v5
```

### 2. Run the Same Test Cases

Create a test set and run it against both versions:

```python
import asyncio

test_cases = [
    "Do you have copy paper?",
    "What's your return policy?",
    "I need 500 staplers for my office",
    "Are you open on weekends?",
]


async def run_test_cases() -> None:
    for question in test_cases:
        response = await chat(question)
        # Automatically traced to current project


asyncio.run(run_test_cases())
```

```typescript
const testCases = [
  "Do you have copy paper?",
  "What's your return policy?",
  "I need 500 staplers for my office",
  "Are you open on weekends?",
];

for (const question of testCases) {
  const response = await chat(question);
  // Automatically traced to current project
}
```

### 3. Compare Metrics

| Metric                  | v4       | v5       | Change |
| ----------------------- | -------- | -------- | ------ |
| Avg latency             | 2.3s     | 1.9s     | -17%   |
| Avg tokens (completion) | 156      | 98       | -37%   |
| Avg cost per trace      | \$0.0023 | \$0.0015 | -35%   |
| Tool calls per trace    | 2.1      | 2.1      | 0%     |
| Error rate              | 2.3%     | 2.1%     | -0.2pp |

This shows v5's conciseness directive reduced completion tokens by 37% without affecting tool usage or increasing errors.

### 4. Qualitative Analysis

Beyond metrics, manually review traces:

* [ ] Do responses sound natural and helpful?
* [ ] Is tool usage logical and efficient?
* [ ] Are errors handled gracefully?
* [ ] Does the agent follow all instructions in the system prompt?
* [ ] Are there any unexpected behaviors?
* [ ] Does the agent maintain context across turns?
* [ ] Are there any prompt injection vulnerabilities?

## Debugging Workflow

When investigating an issue:

1. Run the agent with the problematic input. Note the trace URL.
2. Click through to LangSmith and open the trace tree.
3. Locate the relevant node:
   * If wrong tool: Look at first LLM call's tool selection
   * If tool error: Find the failing tool node
   * If bad output: Examine final LLM call
4. Click on the problematic node and review:
   * Input: What data did this step receive?
   * Output: What did it produce?
   * Metadata: Timing, model parameters, etc.
5. Based on the trace, what caused the issue?
   * Unclear instructions?
   * Missing context?
   * Tool implementation bug?
   * Model limitation?
6. Apply a fix:
   * Update system prompt
   * Improve tool description
   * Fix tool implementation
   * Add error handling
7. Run the same input again and compare new trace to old trace.

## Advanced: Custom Metadata

Add custom metadata to traces for richer analysis:

```python
from langsmith import traceable


@traceable(
    name="Emma",
    metadata={
        "thread_id": thread_id,
        "customer_id": "CUST-12345",  # If authenticated
        "version": "v5",
        "environment": "production",
    },
)
async def chat(question: str) -> str:
    # ... implementation
    pass
```

```typescript
import { traceable } from "langsmith/traceable";

const chat = traceable(
  async (question: string): Promise<{ messages: any[]; output: string }> => {
    // ... implementation
  },
  {
    name: "Emma",
    metadata: {
      thread_id: threadId,
      customer_id: "CUST-12345", // If authenticated
      version: "v5",
      environment: "production",
    },
  }
);
```

Then filter by custom metadata in LangSmith:

* `metadata.version = "v5"`
* `metadata.environment = "production"`
* `metadata.customer_id = "CUST-12345"`

## Best Practices

* Wrap all LLM calls, tools, and major functions. More visibility is better.
* Name traces and tools clearly so you can understand traces at a glance.
* Include context like user IDs, versions, and environments for better filtering.
* Don't just check traces when things break. Review periodically to find improvements.
* LangSmith traces are shareable URLs. Use them in bug reports and code reviews.
* Traces show what happened. Evaluations measure if it was good.

## Tracing in Production

Be mindful of these considerations when tracing in production:

* **Privacy**: Traces contain user inputs and agent outputs. Ensure compliance with privacy policies.
* **Cost**: LangSmith charges based on trace volume. Monitor usage.
* **Performance**: Tracing adds minimal latency (\~10-50ms) but test in your environment.
* **Sampling**: For high-volume applications, consider sampling (trace 10% of requests).

Example sampling implementation:

```python
import random

from langsmith import traceable

SAMPLE_RATE = 0.1  # Trace 10% of requests


async def chat(question: str) -> str:
    should_trace = random.random() < SAMPLE_RATE
    if should_trace:
        return await chat_traced(question)
    else:
        return await chat_untraced(question)


@traceable(name="Emma", metadata={"thread_id": thread_id})
async def chat_traced(question: str) -> str:
    # ... implementation
    pass


async def chat_untraced(question: str) -> str:
    # Same implementation without @traceable
    pass
```

```typescript
import { traceable } from "langsmith/traceable";

const SAMPLE_RATE = 0.1; // Trace 10% of requests

async function chat(question: string): Promise<{ messages: any[]; output: string }> {
  const shouldTrace = Math.random() < SAMPLE_RATE;
  if (shouldTrace) {
    return await chatTraced(question);
  } else {
    return await chatUntraced(question);
  }
}

const chatTraced = traceable(
  async (question: string): Promise<{ messages: any[]; output: string }> => {
    // ... implementation
  },
  { name: "Emma", metadata: { thread_id: threadId } }
);

async function chatUntraced(question: string): Promise<{ messages: any[]; output: string }> {
  // Same implementation without traceable wrapper
}
```

## Next Steps

* Get hands-on experience with the OfficeFlow agent and generate your own traces
* Use traces to create datasets and build automated evaluations
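
### Sketch: Sampling by Conversation

The random sampling in the Tracing in Production section decides per request, so a sampled conversation can end up with only some of its turns traced. One possible variation, offered as a sketch and not as a LangSmith feature, is to derive the decision from the thread ID so that every turn in a conversation gets the same outcome. The helper name `should_trace` and the hashing scheme are assumptions made for illustration; only the 10% rate comes from the example above.

```python
import hashlib

SAMPLE_RATE = 0.1  # Same 10% rate as the random sampling example


def should_trace(thread_id: str, sample_rate: float = SAMPLE_RATE) -> bool:
    """Hypothetical helper: make one stable tracing decision per thread."""
    # Hash the thread ID so the same conversation always maps to the same bucket.
    digest = hashlib.sha256(thread_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # Integer in [0, 2**64)
    # Trace the thread if its bucket falls in the lowest sample_rate fraction.
    return bucket < sample_rate * 2**64


# Every call with the same thread ID returns the same answer.
print(should_trace("abc-123-def-456") == should_trace("abc-123-def-456"))
```

A `chat` dispatcher like the one in the sampling example could call `should_trace(thread_id)` in place of `random.random() < SAMPLE_RATE`. The trade-off is that the traced share is only approximately 10% of threads, and busy threads can skew the share of traced requests.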