Chain Evaluation

Chain evaluation is a key part of developing LangChain applications: it measures a chain's performance and quality. This section explains in detail how to evaluate chains in LangChain 1.2.

What Is Chain Evaluation?

Chain evaluation means assessing a chain's performance and quality on a specific task through a set of metrics and tests. It helps developers understand a chain's strengths and weaknesses so they can optimize it in a targeted way.

In LangChain 1.2, the recommended approach is to build chains with LCEL syntax and evaluate them with the corresponding evaluation methods.

Evaluation Metrics

Chain evaluation metrics are similar to model evaluation metrics and include:

| Metric | Description | Use case |
| --- | --- | --- |
| Accuracy | Proportion of predictions the chain gets right | Classification |
| Precision | Proportion of samples predicted positive that are actually positive | Classification |
| Recall | Proportion of actual positives that the chain predicts as positive | Classification |
| F1 score | Harmonic mean of precision and recall | Classification |
| BLEU | Similarity between generated text and a reference text | Text generation |
| ROUGE | Similarity between generated text and a reference text | Text summarization |
| Response time | How long the chain takes to run | Performance |
| Memory usage | How much memory the chain consumes | Performance |
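
As a concrete illustration of the classification metrics above, here is a minimal sketch that computes accuracy, precision, recall, and F1 by hand for hypothetical binary labels (1 = positive, 0 = negative):

```python
# Hypothetical ground-truth labels and chain predictions
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count true positives, false positives, false negatives, true negatives
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, "
      f"Recall: {recall:.2f}, F1: {f1:.2f}")
```

In practice you would use a library such as scikit-learn for these, but the arithmetic is worth seeing once.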

Evaluation Methods

1. Task-Based Evaluation

Task-based evaluation measures a chain's performance on a specific task, such as question answering, text classification, or text generation.

Example: evaluating a QA chain (with LCEL)

python
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.evaluation.qa import QAEvalChain

# Initialize the language model
llm = OpenAI(api_key="your-api-key")

# Create the prompt template
template = """You are a helpful assistant. Answer the following question:

Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Create the output parser
output_parser = StrOutputParser()

# Build the chain with LCEL syntax
chain = prompt | llm | output_parser

# Create the evaluation chain
eval_chain = QAEvalChain.from_llm(llm)

# Prepare the test data
test_data = [
    {
        "question": "What is LangChain?",
        "answer": "LangChain is a framework for building LLM applications."
    },
    {
        "question": "What are the core components of LangChain?",
        "answer": "The core components of LangChain are models, prompts, chains, agents, tools, memory, and vector stores."
    }
]

# Run the chain and collect predictions
examples = []
predictions = []
for item in test_data:
    result = chain.invoke({"question": item["question"]})
    examples.append({"query": item["question"], "answer": item["answer"]})
    predictions.append({"result": result})

# Evaluate the results (QAEvalChain.evaluate expects examples and
# predictions as separate lists; the default keys are "query",
# "answer", and "result")
evaluations = eval_chain.evaluate(examples, predictions)

# Print the evaluation results
for i, evaluation in enumerate(evaluations):
    print(f"Example {i+1}:")
    print(f"Question: {examples[i]['query']}")
    print(f"Reference Answer: {examples[i]['answer']}")
    print(f"Chain Answer: {predictions[i]['result']}")
    print(f"Evaluation: {evaluation}")
    print()

2. Benchmark-Based Evaluation

Benchmark-based evaluation measures a chain's performance on a standardized benchmark dataset.

Example: evaluating a chain with a custom benchmark (with LCEL)

python
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
import time

# Initialize the language model
llm = OpenAI(api_key="your-api-key")

# Create the prompt template
template = """You are a helpful assistant. Answer the following question:

Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Create the output parser
output_parser = StrOutputParser()

# Build the chain with LCEL syntax
chain = prompt | llm | output_parser

# Prepare the benchmark data
benchmark_data = [
    "What is LangChain?",
    "What are the core components of LangChain?",
    "How to use LangChain to build a chatbot?",
    "What is the difference between LangChain and other LLM frameworks?",
    "How to integrate LangChain with external tools?"
]

# Evaluate the chain's performance
total_time = 0
results = []

for question in benchmark_data:
    start_time = time.time()
    result = chain.invoke({"question": question})
    end_time = time.time()
    execution_time = end_time - start_time
    total_time += execution_time
    results.append({
        "question": question,
        "answer": result,
        "execution_time": execution_time
    })

# Print the evaluation results
print(f"Total execution time: {total_time:.2f} seconds")
print(f"Average execution time: {total_time / len(benchmark_data):.2f} seconds per question")
print()

for i, item in enumerate(results):
    print(f"Question {i+1}: {item['question']}")
    print(f"Answer: {item['answer']}")
    print(f"Execution time: {item['execution_time']:.2f} seconds")
    print()
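
The benchmark above measures response time; the memory-usage metric from the table can be measured similarly with Python's built-in `tracemalloc`. A minimal sketch, using a stand-in function in place of a real `chain.invoke` call so that it runs without an API key:

```python
import time
import tracemalloc

def profile_call(fn, *args, **kwargs):
    """Measure wall-clock time and peak allocated memory of one call."""
    tracemalloc.start()
    start = time.time()
    result = fn(*args, **kwargs)
    elapsed = time.time() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

# Stand-in for chain.invoke; in practice, pass the real chain's
# invoke method and its input dict here instead.
def fake_chain_invoke(inputs):
    return f"Answer to: {inputs['question']}"

result, elapsed, peak_bytes = profile_call(
    fake_chain_invoke, {"question": "What is LangChain?"}
)
print(f"Result: {result}")
print(f"Time: {elapsed:.4f} s, peak memory: {peak_bytes / 1024:.1f} KiB")
```

Note that `tracemalloc` only tracks Python-level allocations; memory used by native extensions or by the remote model service is not included.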

3. Human Evaluation

Human evaluation means having human raters score the chain's outputs; it is the most direct and reliable evaluation method.

Example: human evaluation of a chain (with LCEL)

python
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Initialize the language model
llm = OpenAI(api_key="your-api-key")

# Create the prompt template
template = """You are a helpful assistant. Answer the following question:

Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Create the output parser
output_parser = StrOutputParser()

# Build the chain with LCEL syntax
chain = prompt | llm | output_parser

# Prepare the test data
test_data = [
    "What is LangChain?",
    "What are the core components of LangChain?",
    "How to use LangChain to build a chatbot?"
]

# Run the chain and collect results
results = []
for question in test_data:
    result = chain.invoke({"question": question})
    results.append({
        "question": question,
        "answer": result
    })

# Print the results for human evaluation
print("Please evaluate the following responses:")
print("Rate each response on a scale of 1-5 for relevance, accuracy, clarity, and completeness.")
print()

for i, item in enumerate(results):
    print(f"Example {i+1}:")
    print(f"Question: {item['question']}")
    print(f"Answer: {item['answer']}")
    print()
    print("Relevance: ")
    print("Accuracy: ")
    print("Clarity: ")
    print("Completeness: ")
    print()
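
Once raters have filled in the 1-5 scores, they can be aggregated per criterion. A minimal sketch, assuming the scores were collected into dictionaries (the ratings below are hypothetical):

```python
# Hypothetical human ratings (1-5) for one example, one dict per rater;
# in practice these would come from a form, spreadsheet, or labeling tool.
ratings = [
    {"relevance": 5, "accuracy": 4, "clarity": 5, "completeness": 4},
    {"relevance": 4, "accuracy": 4, "clarity": 3, "completeness": 5},
    {"relevance": 5, "accuracy": 5, "clarity": 4, "completeness": 4},
]

criteria = ["relevance", "accuracy", "clarity", "completeness"]

# Average each criterion across raters
averages = {c: sum(r[c] for r in ratings) / len(ratings) for c in criteria}

for criterion, score in averages.items():
    print(f"{criterion}: {score:.2f} / 5")
```

Averaging across several raters also lets you spot criteria where raters disagree, which often signals an ambiguous question or rubric.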

Evaluation Tools

LangChain provides several evaluation tools; some commonly used ones are described below:

1. QAEvalChain

QAEvalChain evaluates the performance of question-answering chains.

python
from langchain_openai import OpenAI
from langchain.evaluation.qa import QAEvalChain

# Initialize the language model
llm = OpenAI(api_key="your-api-key")

# Create the evaluation chain
eval_chain = QAEvalChain.from_llm(llm)

# Evaluate the results (examples and predictions are separate lists)
examples = [
    {
        "query": "What is LangChain?",
        "answer": "LangChain is a framework for building LLM applications."
    }
]
predictions = [
    {
        "result": "LangChain is a framework for developing applications powered by language models."
    }
]

evaluations = eval_chain.evaluate(examples, predictions)
print(evaluations)

2. StringEvaluator

String evaluators score the quality of string outputs, for example by grading a prediction against a reference answer.

python
from langchain_openai import OpenAI
from langchain.evaluation import load_evaluator

# Initialize the language model (used as the grader)
llm = OpenAI(api_key="your-api-key")

# Load a reference-based string evaluator
evaluator = load_evaluator("labeled_criteria", criteria="correctness", llm=llm)

# Evaluate the result (string evaluators expose evaluate_strings)
result = evaluator.evaluate_strings(
    input="What is LangChain?",
    prediction="LangChain is a framework for building LLM applications.",
    reference="LangChain is a framework for developing applications powered by language models."
)
print(result)

3. Evaluating Any Runnable

Because every LCEL chain is a runnable, any chain can be evaluated with the same generic pattern: run it over a dataset with `batch`, then score the outputs with an evaluator such as QAEvalChain.

python
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.evaluation.qa import QAEvalChain

# Initialize the language model
llm = OpenAI(api_key="your-api-key")

# Create the prompt template
template = """You are a helpful assistant. Answer the following question:

Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Create the output parser
output_parser = StrOutputParser()

# Build the chain with LCEL syntax
chain = prompt | llm | output_parser

# Prepare the test dataset (questions with reference answers)
dataset = [
    {"question": "What is LangChain?",
     "answer": "LangChain is a framework for building LLM applications."},
    {"question": "What are the core components of LangChain?",
     "answer": "The core components of LangChain are models, prompts, chains, agents, tools, memory, and vector stores."}
]

# Run the chain over the whole dataset in a single call
results = chain.batch([{"question": item["question"]} for item in dataset])

# Score the outputs with QAEvalChain
eval_chain = QAEvalChain.from_llm(llm)
examples = [{"query": item["question"], "answer": item["answer"]} for item in dataset]
predictions = [{"result": r} for r in results]
evaluations = eval_chain.evaluate(examples, predictions)
print(evaluations)

Best Practices for Evaluation

1. Define Clear Evaluation Goals

Before evaluating a chain, define the evaluation's goals and metrics so that the results reflect the chain's real performance.

2. Use Diverse Test Data

Use diverse test data, covering samples of different types and difficulty levels, to make the evaluation comprehensive.

3. Combine Multiple Evaluation Methods

Combine automatic and human evaluation to obtain more comprehensive and accurate results.

4. Evaluate Regularly

Evaluate the chain's performance regularly to track changes and improvements over time.
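
One way to support regular evaluation is to persist each run's aggregate score and flag regressions between runs. A minimal sketch, assuming a local `eval_history.json` file and a hypothetical score-drop threshold:

```python
import json
import time
from pathlib import Path

HISTORY_FILE = Path("eval_history.json")  # hypothetical location

def record_run(score: float) -> None:
    """Append one evaluation run (timestamp + aggregate score)."""
    history = (
        json.loads(HISTORY_FILE.read_text()) if HISTORY_FILE.exists() else []
    )
    history.append({"timestamp": time.time(), "score": score})
    HISTORY_FILE.write_text(json.dumps(history, indent=2))

def check_regression(threshold: float = 0.05) -> bool:
    """Return True if the latest score dropped by more than `threshold`."""
    history = json.loads(HISTORY_FILE.read_text())
    if len(history) < 2:
        return False
    previous, latest = history[-2]["score"], history[-1]["score"]
    return previous - latest > threshold

HISTORY_FILE.unlink(missing_ok=True)  # start fresh for this demo
record_run(0.82)  # hypothetical score from an earlier run
record_run(0.74)  # hypothetical score from the latest run
print(f"Regression detected: {check_regression()}")
```

Hooking a check like this into CI turns evaluation from a one-off activity into a guardrail against silent quality drops.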

5. Compare Different Chain Implementations

Compare different chain implementations and choose the one best suited to the task.

Example: comparing different chain implementations

python
from langchain_openai import OpenAI, ChatOpenAI
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.evaluation.qa import QAEvalChain

# Initialize the language models
llm = OpenAI(api_key="your-api-key")
chat_model = ChatOpenAI(api_key="your-api-key")

# Create the prompt template
template1 = """You are a helpful assistant. Answer the following question:

Question: {question}
Answer:"""
prompt1 = PromptTemplate(template=template1, input_variables=["question"])

# Create the chat prompt template
chat_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant"),
    ("human", "{question}")
])

# Create the output parser
output_parser = StrOutputParser()

# Build the chains with LCEL syntax
chain1 = prompt1 | llm | output_parser
chain2 = chat_template | chat_model | output_parser

# Create the evaluation chain
eval_chain = QAEvalChain.from_llm(chat_model)  # use the chat model as the grader

# Prepare the test data
test_data = [
    {
        "question": "What is LangChain?",
        "answer": "LangChain is a framework for building LLM applications."
    },
    {
        "question": "What are the core components of LangChain?",
        "answer": "The core components of LangChain are models, prompts, chains, agents, tools, memory, and vector stores."
    },
    {
        "question": "How to use LangChain to build a chatbot?",
        "answer": "To build a chatbot with LangChain, you can use the RunnableWithMessageHistory or ChatOpenAI with memory."
    }
]

# Test chain 1
print("Testing chain 1 (OpenAI with LCEL):")
chain1_examples = []
chain1_predictions = []
for item in test_data:
    result = chain1.invoke({"question": item["question"]})
    chain1_examples.append({"query": item["question"], "answer": item["answer"]})
    chain1_predictions.append({"result": result})

chain1_evaluations = eval_chain.evaluate(chain1_examples, chain1_predictions)
for i, evaluation in enumerate(chain1_evaluations):
    print(f"Example {i+1}: {evaluation}")

# Test chain 2
print("\nTesting chain 2 (ChatOpenAI with LCEL):")
chain2_examples = []
chain2_predictions = []
for item in test_data:
    result = chain2.invoke({"question": item["question"]})
    chain2_examples.append({"query": item["question"], "answer": item["answer"]})
    chain2_predictions.append({"result": result})

chain2_evaluations = eval_chain.evaluate(chain2_examples, chain2_predictions)
for i, evaluation in enumerate(chain2_evaluations):
    print(f"Example {i+1}: {evaluation}")

Summary

Chain evaluation is a key part of developing LangChain applications: it measures a chain's performance and quality. In LangChain 1.2, the recommended approach is to build chains with LCEL syntax and evaluate them with the corresponding evaluation methods.

You should now understand how to use LangChain's evaluation tools and methods to assess a chain's performance. In practice, choose the evaluation methods and metrics that fit your needs and evaluate regularly so you can keep improving your LLM application.