索引优化

向量数据库的索引是影响检索性能的关键因素。合理的索引设计可以显著提高检索速度和准确性。本章节将介绍向量数据库的索引类型、选择策略和优化方法。

1. 索引类型

精确索引

暴力搜索（Brute-force）：

计算查询向量与所有存储向量的相似度
适用于小规模数据（<10,000个向量）
准确性100%，但速度随数据量线性增长

示例：

python

# FAISS中的暴力搜索索引
import faiss

# 创建暴力搜索索引
dimension = 768
index = faiss.IndexFlatL2(dimension)  # L2距离
# 或使用余弦相似度
# index = faiss.IndexFlatIP(dimension)  # 内积（用于余弦相似度）

# 添加向量
index.add(vectors)

# 搜索
k = 5
D, I = index.search(query_vector, k)

近似最近邻（ANN）索引

HNSW（Hierarchical Navigable Small World）：

基于图的索引结构
搜索速度快，准确性高
内存占用较大
构建时间较长

示例：

python

# FAISS中的HNSW索引
import faiss

# 创建HNSW索引
dimension = 768
index = faiss.IndexHNSWFlat(dimension, 16)  # 16是M参数，控制图的连接度
index.hnsw.efConstruction = 40  # 构建时的搜索宽度
index.hnsw.efSearch = 16  # 搜索时的搜索宽度

# 添加向量
index.add(vectors)

IVF（Inverted File Index）：

基于聚类的索引结构
将向量空间划分为多个聚类
搜索时只需检查最近的几个聚类
平衡了速度和准确性

示例：

python

# FAISS中的IVF索引
nlist = 100  # 聚类中心数量
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

# 需要训练
index.train(vectors)
index.add(vectors)

# 搜索（nprobe控制搜索的聚类数量）
index.nprobe = 10
D, I = index.search(query_vector, k)

PQ（Product Quantization）：

向量压缩技术
将高维向量压缩为低维表示
大幅减少内存占用
适合超大规模数据

示例：

python

# FAISS中的PQ索引
m = 16  # 子向量数量
dimension = 768
nbits = 8  # 每个子向量的比特数

index = faiss.IndexPQ(dimension, m, nbits)
index.train(vectors)
index.add(vectors)

组合索引

IVF + PQ：

结合IVF和PQ的优势
先通过IVF快速定位聚类
再在聚类内使用PQ进行精确搜索

示例：

python

# FAISS中的IVFPQ索引
nlist = 100
m = 16
nbits = 8

quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFPQ(quantizer, dimension, nlist, m, nbits)

index.train(vectors)
index.add(vectors)
index.nprobe = 10

2. 索引选择策略

根据数据规模选择

数据规模	推荐索引	说明
< 10k	Flat	暴力搜索足够快
10k - 100k	HNSW	快速且准确
100k - 1M	IVF 或 HNSW	平衡性能和内存
1M - 10M	IVF + PQ	压缩存储
> 10M	分层索引	多级索引结构

根据性能需求选择

高准确性需求：

使用HNSW，增大efSearch参数
或使用Flat索引（小规模数据）

高速度需求：

使用HNSW，减小efSearch参数
使用IVF，减小nprobe参数

低内存需求：

使用PQ或IVFPQ进行向量压缩
权衡准确性和内存占用

3. 参数调优

HNSW参数

python

# efConstruction: 构建时的搜索宽度
# 越大，索引质量越好，构建时间越长
index.hnsw.efConstruction = 40  # 默认值，可增大到100-200

# efSearch: 搜索时的搜索宽度
# 越大，准确性越高，搜索越慢
index.hnsw.efSearch = 16  # 默认值，可增大到50-100

# M: 每个节点的连接数
# 越大，图越稠密，搜索越快，内存占用越大
index = faiss.IndexHNSWFlat(dimension, M=16)  # 默认16，可增大到32-64

IVF参数

python

# nlist: 聚类中心数量
# 通常设置为数据量的平方根
nlist = int(np.sqrt(len(vectors)))

# nprobe: 搜索时检查的聚类数量
# 越大，准确性越高，搜索越慢
index.nprobe = 10  # 默认值，可增大到20-50

PQ参数

python

# m: 子向量数量
# 必须能整除维度
dimension = 768
m = 16  # 每个子向量48维

# nbits: 每个子向量的比特数
# 通常为8，表示256个聚类中心
nbits = 8

4. 性能测试

基准测试代码

python

import time
import numpy as np

def benchmark_index(index, query_vectors, k=10, warmup=5, repeat=10):
    """测试索引性能"""
    # 预热
    for _ in range(warmup):
        index.search(query_vectors[:10], k)
    
    # 正式测试
    times = []
    for _ in range(repeat):
        start = time.time()
        D, I = index.search(query_vectors, k)
        times.append(time.time() - start)
    
    return {
        'mean_time': np.mean(times),
        'std_time': np.std(times),
        'qps': len(query_vectors) / np.mean(times)
    }

# 生成测试数据
dimension = 768
n_vectors = 100000
n_queries = 1000

vectors = np.random.random((n_vectors, dimension)).astype('float32')
query_vectors = np.random.random((n_queries, dimension)).astype('float32')

# 测试不同索引
indexes = {
    'Flat': faiss.IndexFlatL2(dimension),
    'HNSW': faiss.IndexHNSWFlat(dimension, 16),
    'IVF': faiss.IndexIVFFlat(faiss.IndexFlatL2(dimension), dimension, 100)
}

for name, index in indexes.items():
    if name == 'IVF':
        index.train(vectors)
    index.add(vectors)
    
    results = benchmark_index(index, query_vectors)
    print(f"{name}: {results['qps']:.2f} QPS")

5. 最佳实践

索引构建流程

python

def build_optimized_index(vectors, dimension, target_recall=0.95):
    """构建优化的索引"""
    # 1. 根据数据规模选择索引类型
    n_vectors = len(vectors)
    
    if n_vectors < 10000:
        index = faiss.IndexFlatL2(dimension)
    elif n_vectors < 1000000:
        index = faiss.IndexHNSWFlat(dimension, 32)
        index.hnsw.efConstruction = 100
    else:
        nlist = int(np.sqrt(n_vectors))
        quantizer = faiss.IndexHNSWFlat(dimension, 32)
        index = faiss.IndexIVFFlat(quantizer, dimension, nlist)
        index.train(vectors)
    
    # 2. 添加向量
    index.add(vectors)
    
    return index

动态索引更新

python

class DynamicIndex:
    def __init__(self, dimension):
        self.dimension = dimension
        self.vectors = []
        self.index = None
        self.rebuild_threshold = 1000  # 触发重建的阈值
    
    def add(self, new_vectors):
        self.vectors.extend(new_vectors)
        
        # 检查是否需要重建索引
        if len(new_vectors) > self.rebuild_threshold:
            self.rebuild_index()
        elif self.index is not None:
            self.index.add(np.array(new_vectors))
    
    def rebuild_index(self):
        """重建索引以获得最佳性能"""
        self.index = build_optimized_index(
            np.array(self.vectors), 
            self.dimension
        )

索引优化 ​

1. 索引类型 ​

精确索引 ​

近似最近邻（ANN）索引 ​

组合索引 ​

2. 索引选择策略 ​

根据数据规模选择 ​

根据性能需求选择 ​

3. 参数调优 ​

HNSW参数 ​

IVF参数 ​

PQ参数 ​

4. 性能测试 ​

基准测试代码 ​

5. 最佳实践 ​

索引构建流程 ​

动态索引更新 ​

索引优化

1. 索引类型

精确索引

近似最近邻（ANN）索引

组合索引

2. 索引选择策略

根据数据规模选择

根据性能需求选择

3. 参数调优

HNSW参数

IVF参数

PQ参数

4. 性能测试

基准测试代码

5. 最佳实践

索引构建流程

动态索引更新