Chain Evaluation
Chain evaluation is an important part of LangChain application development: it measures a chain's performance and quality. This section explains how to evaluate chains in LangChain 1.2.
What Is Chain Evaluation?
Chain evaluation means assessing a chain's performance and quality on a specific task through a set of metrics and tests. It helps developers understand a chain's strengths and weaknesses so they can optimize it in a targeted way.
In LangChain 1.2, the recommended approach is to build chains with LCEL syntax and evaluate them with the corresponding evaluation methods.
Evaluation Metrics
Chain evaluation metrics are similar to model evaluation metrics and include:
| Metric | Description | Typical use |
|---|---|---|
| Accuracy | Proportion of the chain's predictions that are correct | Classification tasks |
| Precision | Proportion of samples predicted positive that are actually positive | Classification tasks |
| Recall | Proportion of actual positives that the chain predicts as positive | Classification tasks |
| F1 score | Harmonic mean of precision and recall | Classification tasks |
| BLEU | Similarity between generated text and reference text | Text generation tasks |
| ROUGE | Similarity between generated text and reference text | Text summarization tasks |
| Response time | Time the chain takes to execute | Performance evaluation |
| Memory usage | Memory consumed during chain execution | Performance evaluation |
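To make the classification metrics concrete, the sketch below computes accuracy, precision, recall, and F1 from a pair of label lists in plain Python (no LangChain dependency); the label lists are made-up illustration data.

```python
# Minimal sketch: computing the classification metrics from the table
# for a binary task. The label lists are made-up illustration data.
reference = [1, 0, 1, 1, 0, 1, 0, 0]    # ground-truth labels
predictions = [1, 0, 1, 0, 0, 1, 1, 0]  # labels produced by the chain

# Count true positives, false positives, and false negatives
tp = sum(1 for r, p in zip(reference, predictions) if r == 1 and p == 1)
fp = sum(1 for r, p in zip(reference, predictions) if r == 0 and p == 1)
fn = sum(1 for r, p in zip(reference, predictions) if r == 1 and p == 0)

accuracy = sum(1 for r, p in zip(reference, predictions) if r == p) / len(reference)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, "
      f"Recall: {recall:.2f}, F1: {f1:.2f}")
# → Accuracy: 0.75, Precision: 0.75, Recall: 0.75, F1: 0.75
```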
Evaluation Methods
1. Task-Based Evaluation
Task-based evaluation measures a chain's performance on a specific task, such as question answering, text classification, or text generation.
Example: Evaluating a QA Chain (Using LCEL)
```python
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.evaluation.qa import QAEvalChain

# Initialize the language model
llm = OpenAI(api_key="your-api-key")

# Create the prompt template
template = """You are a helpful assistant. Answer the following question:
Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Create the output parser
output_parser = StrOutputParser()

# Build the chain with LCEL syntax
chain = prompt | llm | output_parser

# Create the evaluation chain
eval_chain = QAEvalChain.from_llm(llm)

# Prepare test data
test_data = [
    {
        "question": "What is LangChain?",
        "answer": "LangChain is a framework for building LLM applications."
    },
    {
        "question": "What are the core components of LangChain?",
        "answer": "The core components of LangChain are models, prompts, chains, agents, tools, memory, and vector stores."
    }
]

# Run the chain and collect results; QAEvalChain.evaluate takes the
# reference examples and the chain's predictions as separate lists
examples = []
predictions = []
for item in test_data:
    result = chain.invoke({"question": item["question"]})
    examples.append({"query": item["question"], "answer": item["answer"]})
    predictions.append({"result": result})

# Evaluate the results
evaluations = eval_chain.evaluate(examples, predictions)

# Print the evaluation results
for i, evaluation in enumerate(evaluations):
    print(f"Example {i+1}:")
    print(f"Question: {examples[i]['query']}")
    print(f"Reference Answer: {examples[i]['answer']}")
    print(f"Chain Answer: {predictions[i]['result']}")
    print(f"Evaluation: {evaluation}")
    print()
```
2. Benchmark-Based Evaluation
Benchmark-based evaluation uses a standardized benchmark dataset to measure a chain's performance.
Example: Evaluating a Chain with a Custom Benchmark (Using LCEL)
```python
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
import time

# Initialize the language model
llm = OpenAI(api_key="your-api-key")

# Create the prompt template
template = """You are a helpful assistant. Answer the following question:
Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Create the output parser
output_parser = StrOutputParser()

# Build the chain with LCEL syntax
chain = prompt | llm | output_parser

# Prepare benchmark data
benchmark_data = [
    "What is LangChain?",
    "What are the core components of LangChain?",
    "How to use LangChain to build a chatbot?",
    "What is the difference between LangChain and other LLM frameworks?",
    "How to integrate LangChain with external tools?"
]

# Measure the chain's performance
total_time = 0
results = []
for question in benchmark_data:
    start_time = time.time()
    result = chain.invoke({"question": question})
    end_time = time.time()
    execution_time = end_time - start_time
    total_time += execution_time
    results.append({
        "question": question,
        "answer": result,
        "execution_time": execution_time
    })

# Print the evaluation results
print(f"Total execution time: {total_time:.2f} seconds")
print(f"Average execution time: {total_time / len(benchmark_data):.2f} seconds per question")
print()
for i, item in enumerate(results):
    print(f"Question {i+1}: {item['question']}")
    print(f"Answer: {item['answer']}")
    print(f"Execution time: {item['execution_time']:.2f} seconds")
    print()
```
3. Human Evaluation
Human evaluation means having human raters score the chain's outputs; it is the most direct and reliable evaluation method.
Example: Human Evaluation of a Chain (Using LCEL)
```python
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Initialize the language model
llm = OpenAI(api_key="your-api-key")

# Create the prompt template
template = """You are a helpful assistant. Answer the following question:
Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Create the output parser
output_parser = StrOutputParser()

# Build the chain with LCEL syntax
chain = prompt | llm | output_parser

# Prepare test data
test_data = [
    "What is LangChain?",
    "What are the core components of LangChain?",
    "How to use LangChain to build a chatbot?"
]

# Run the chain and collect results
results = []
for question in test_data:
    result = chain.invoke({"question": question})
    results.append({
        "question": question,
        "answer": result
    })

# Print the results for human evaluation
print("Please evaluate the following responses:")
print("Rate each response on a scale of 1-5 for relevance, accuracy, clarity, and completeness.")
print()
for i, item in enumerate(results):
    print(f"Example {i+1}:")
    print(f"Question: {item['question']}")
    print(f"Answer: {item['answer']}")
    print()
    print("Relevance: ")
    print("Accuracy: ")
    print("Clarity: ")
    print("Completeness: ")
    print()
```
Evaluation Tools
LangChain provides a variety of evaluation tools. Some commonly used ones are listed below.
1. QAEvalChain
QAEvalChain is used to evaluate the performance of question-answering chains.
```python
from langchain_openai import OpenAI
from langchain.evaluation.qa import QAEvalChain

# Initialize the language model
llm = OpenAI(api_key="your-api-key")

# Create the evaluation chain
eval_chain = QAEvalChain.from_llm(llm)

# Reference examples and chain predictions are passed as separate lists
examples = [
    {
        "query": "What is LangChain?",
        "answer": "LangChain is a framework for building LLM applications."
    }
]
predictions = [
    {"result": "LangChain is a framework for developing applications powered by language models."}
]
evaluations = eval_chain.evaluate(examples, predictions)
print(evaluations)
```
2. StringEvaluator
StringEvaluator is used to evaluate the quality of string outputs.
```python
from langchain.evaluation import load_evaluator

# Load a string evaluator (string distance needs no LLM)
evaluator = load_evaluator("string_distance")

# Evaluate a prediction against a reference answer
result = evaluator.evaluate_strings(
    prediction="LangChain is a framework for building LLM applications.",
    reference="LangChain is a framework for developing applications powered by language models."
)
print(result)
```
3. runnable_evaluate
runnable_evaluate is a generic evaluation function for evaluating any runnable object.
```python
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.evaluation import runnable_evaluate

# Initialize the language model
llm = OpenAI(api_key="your-api-key")

# Create the prompt template
template = """You are a helpful assistant. Answer the following question:
Question: {question}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])

# Create the output parser
output_parser = StrOutputParser()

# Build the chain with LCEL syntax
chain = prompt | llm | output_parser

# Prepare test data
dataset = [
    {"question": "What is LangChain?"},
    {"question": "What are the core components of LangChain?"}
]

# Evaluate the chain
results = runnable_evaluate(
    chain,
    dataset=dataset,
    evaluation_name="qa"
)
print(results)
```
Best Practices for Evaluation
1. Define Clear Evaluation Goals
Before evaluating a chain, define the goals and metrics of the evaluation so that the results reflect the chain's actual performance.
2. Use Diverse Test Data
Use diverse test data, covering different sample types and difficulty levels, so the evaluation is comprehensive.
3. Combine Multiple Evaluation Methods
Combine automatic and human evaluation to obtain more comprehensive and accurate results.
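One simple way to combine the two is a weighted average of an automatic score and normalized human ratings. The sketch below is plain Python; the 0.4/0.6 weighting and the sample scores are arbitrary choices for illustration, not a LangChain convention.

```python
# Sketch: merging an automatic score with human ratings into one number.
# The weighting and sample values are made up for illustration.
def combined_score(auto_score, human_ratings, auto_weight=0.4):
    """auto_score is in [0, 1]; human_ratings use a 1-5 scale."""
    human_avg = sum(human_ratings) / len(human_ratings)
    human_norm = (human_avg - 1) / 4  # map the 1-5 scale onto [0, 1]
    return auto_weight * auto_score + (1 - auto_weight) * human_norm

# Example: automatic similarity score of 0.8, three human raters
score = combined_score(0.8, [4, 5, 4])
print(f"Combined score: {score:.2f}")  # → Combined score: 0.82
```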
4. Evaluate Regularly
Evaluate the chain's performance on a regular schedule so you can track changes and improvements over time.
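Regular evaluation is most useful when scores are recorded between runs so regressions surface quickly. The sketch below keeps run history in an in-memory list; in practice you might persist it to a file or database, and the 0.05 regression threshold is an arbitrary illustrative choice.

```python
# Sketch: recording per-run evaluation scores and flagging regressions.
# The in-memory history and the 0.05 threshold are illustrative choices.
from datetime import datetime, timezone

history = []  # one entry per evaluation run

def record_run(score, threshold=0.05):
    """Record a run's average score; return True if it dropped
    by more than `threshold` compared with the previous run."""
    entry = {"timestamp": datetime.now(timezone.utc).isoformat(), "score": score}
    regression = bool(history) and (history[-1]["score"] - score) > threshold
    history.append(entry)
    return regression

print(record_run(0.82))  # first run, no baseline → False
print(record_run(0.84))  # improved → False
print(record_run(0.70))  # dropped by more than 0.05 → True
```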
5. Compare Different Chain Implementations
Compare different chain implementations and choose the one best suited to the task at hand.
Example: Evaluating Different Chain Implementations
```python
from langchain_openai import OpenAI, ChatOpenAI
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.evaluation.qa import QAEvalChain

# Initialize the language models
llm = OpenAI(api_key="your-api-key")
chat_model = ChatOpenAI(api_key="your-api-key")

# Create the prompt template
template1 = """You are a helpful assistant. Answer the following question:
Question: {question}
Answer:"""
prompt1 = PromptTemplate(template=template1, input_variables=["question"])

# Create the chat prompt template
chat_template = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant"),
    ("human", "{question}")
])

# Create the output parser
output_parser = StrOutputParser()

# Build the chains with LCEL syntax
chain1 = prompt1 | llm | output_parser
chain2 = chat_template | chat_model | output_parser

# Create the evaluation chain (the chat model acts as the grader)
eval_chain = QAEvalChain.from_llm(chat_model)

# Prepare test data
test_data = [
    {
        "question": "What is LangChain?",
        "answer": "LangChain is a framework for building LLM applications."
    },
    {
        "question": "What are the core components of LangChain?",
        "answer": "The core components of LangChain are models, prompts, chains, agents, tools, memory, and vector stores."
    },
    {
        "question": "How to use LangChain to build a chatbot?",
        "answer": "To build a chatbot with LangChain, you can use the RunnableWithMessageHistory or ChatOpenAI with memory."
    }
]
examples = [{"query": item["question"], "answer": item["answer"]} for item in test_data]

# Test chain 1
print("Testing chain 1 (OpenAI with LCEL):")
chain1_predictions = [
    {"result": chain1.invoke({"question": item["question"]})} for item in test_data
]
chain1_evaluations = eval_chain.evaluate(examples, chain1_predictions)
for i, evaluation in enumerate(chain1_evaluations):
    print(f"Example {i+1}: {evaluation}")

# Test chain 2
print("\nTesting chain 2 (ChatOpenAI with LCEL):")
chain2_predictions = [
    {"result": chain2.invoke({"question": item["question"]})} for item in test_data
]
chain2_evaluations = eval_chain.evaluate(examples, chain2_predictions)
for i, evaluation in enumerate(chain2_evaluations):
    print(f"Example {i+1}: {evaluation}")
```
Summary
Chain evaluation is an important part of LangChain application development: it measures a chain's performance and quality. In LangChain 1.2, the recommended approach is to build chains with LCEL syntax and evaluate them with the corresponding evaluation methods.
This article has shown how to use the evaluation tools and methods LangChain provides to assess a chain's performance. In practice, choose the evaluation methods and metrics that fit your specific needs and evaluate regularly, so you can keep optimizing your LLM application.