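Index Creation and Management

Index Overview

Indexes are the core mechanism behind efficient vector search in Milvus. Building an index on a vector field can speed up search considerably, especially on large datasets.

Why Indexes Are Needed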

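Faster search: avoids the O(n) full scan of brute-force search; graph indexes such as HNSW approach O(log n) per query
Large-scale data: makes collections with billions of vectors searchable
Accuracy/performance trade-off: index and search parameters let you tune recall against search speed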

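Index Types in Detail

FLAT (Brute-Force Search)

Characteristics: exact search with no loss of accuracy, but the slowest option

Use cases:

Small datasets (< 1 million vectors)
100% recall is required
Serving as a baseline for benchmarks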

python

index_params = {
    "index_type": "FLAT",
    "metric_type": "L2",      # or "IP", "COSINE"
    "params": {}
}
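To make "brute force" concrete, here is a minimal pure-Python sketch of what an exact scan does conceptually: compute the distance from the query to every stored vector and keep the k smallest. The function name `flat_search` and the sample data are illustrative assumptions and are not part of Milvus.

```python
import heapq
import math

def flat_search(vectors, query, k):
    """Return the k nearest (distance, index) pairs by scanning every vector."""
    scored = (
        (math.dist(vec, query), idx)  # Euclidean (L2) distance to the query
        for idx, vec in enumerate(vectors)
    )
    return heapq.nsmallest(k, scored)

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
hits = flat_search(vectors, query=[0.0, 0.0], k=2)
print(hits)  # [(0.0, 0), (0.5, 3)]
```

Every query touches all n vectors, which is why the cost grows linearly with the size of the collection.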

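IVF_FLAT (Inverted File Index)

Characteristics: a clustering-based index that balances speed and accuracy

How it works:

K-means partitions the vector space into nlist clusters, each represented by a centroid
At search time, the nprobe centroids closest to the query are located first
Only the vectors in those selected clusters are scanned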

python

index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {
        "nlist": 128             # number of clusters; a common starting point is 4 × sqrt(n)
    }
}

# Search parameters
search_params = {
    "metric_type": "L2",
    "params": {
        "nprobe": 8              # number of clusters to scan; larger is more accurate but slower
    }
}

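Parameters:

nlist: number of clusters; affects index build time and search speed
nprobe: number of clusters scanned per query (at most nlist); affects recall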
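The interplay between nlist and nprobe can be sketched in pure Python. The toy code below is only an illustration of the idea, not how Milvus implements IVF: it runs a basic k-means, buckets the vectors by nearest centroid, and then scans only the nprobe closest buckets. All function names (`build_ivf`, `ivf_search`, and so on) are hypothetical.

```python
import math
import random

def closest_centroids(centroids, vec, count=1):
    """Return the indices of the `count` centroids nearest to vec."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(centroids[c], vec))
    return order[:count]

def assign(vectors, centroids):
    """Group vector indices by their nearest centroid."""
    buckets = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        buckets[closest_centroids(centroids, vec)[0]].append(idx)
    return buckets

def build_ivf(vectors, nlist, iterations=5, seed=0):
    """Run a basic k-means and return (centroids, buckets)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    dim = len(vectors[0])
    for _ in range(iterations):
        for c, members in enumerate(assign(vectors, centroids)):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [
                    sum(vectors[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    return centroids, assign(vectors, centroids)

def ivf_search(vectors, centroids, buckets, query, k, nprobe):
    """Scan only the nprobe clusters whose centroids are nearest to query."""
    candidates = [i for c in closest_centroids(centroids, query, nprobe) for i in buckets[c]]
    candidates.sort(key=lambda i: math.dist(vectors[i], query))
    return candidates[:k]

rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(500)]
centroids, buckets = build_ivf(data, nlist=16)
query = [rng.random() for _ in range(8)]

approx = ivf_search(data, centroids, buckets, query, k=5, nprobe=4)
exact = ivf_search(data, centroids, buckets, query, k=5, nprobe=16)  # probing every cluster
print("nprobe=4 :", approx)
print("nprobe=16:", exact)
```

With nprobe equal to nlist every cluster is scanned, so the result matches an exhaustive search; smaller nprobe values scan fewer vectors and may miss some true neighbors.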

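IVF_SQ8 (Scalar Quantization Index)

Characteristics: compresses each vector dimension to 8 bits, reducing storage

Use cases:

Limited memory
A slight loss of accuracy is acceptable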

python

index_params = {
    "index_type": "IVF_SQ8",
    "metric_type": "L2",
    "params": {
        "nlist": 128
    }
}
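As a rough illustration of 8-bit scalar quantization, the sketch below maps each float to one byte and back. It assumes a single known value range [lo, hi] for all dimensions, which is a simplification; the helper names `sq8_encode` and `sq8_decode` are hypothetical and do not correspond to any Milvus API.

```python
def sq8_encode(vec, lo, hi):
    """Map each float in [lo, hi] to an integer code in 0..255."""
    scale = (hi - lo) / 255
    return bytes(min(255, max(0, round((x - lo) / scale))) for x in vec)

def sq8_decode(codes, lo, hi):
    """Recover approximate floats from the 8-bit codes."""
    scale = (hi - lo) / 255
    return [lo + c * scale for c in codes]

vec = [0.0, 0.25, 0.5, 1.0]
codes = sq8_encode(vec, lo=0.0, hi=1.0)
restored = sq8_decode(codes, lo=0.0, hi=1.0)
max_error = max(abs(a - b) for a, b in zip(vec, restored))

print(len(codes), "bytes instead of", len(vec) * 4)  # one byte per dimension vs four for float32
print("largest reconstruction error:", max_error)
```

Each dimension shrinks from 4 bytes (float32) to 1 byte, and the reconstruction error is bounded by half a quantization step, which is the "slight loss of accuracy" mentioned above.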

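IVF_PQ (Product Quantization Index)

Characteristics: splits each vector into sub-vectors and quantizes each one separately, giving a high compression ratio

Use cases:

Very large datasets
Severely constrained memory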

python

index_params = {
    "index_type": "IVF_PQ",
    "metric_type": "L2",
    "params": {
        "nlist": 128,
        "m": 8,                   # number of sub-vectors; the vector dimension must be divisible by m
        "nbits": 8                # number of bits used to encode each sub-vector
    }
}

Parameters:

m: number of sub-vectors, typically set between dim/8 and dim/4
nbits: number of quantization bits per sub-vector, typically 8
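The effect of m and nbits on storage follows from simple arithmetic: each vector is stored as m codes of nbits bits each. The helper below is a back-of-the-envelope sketch (the name `pq_code_bytes` is hypothetical) that also enforces the divisibility rule; it ignores codebooks and other index overhead.

```python
def pq_code_bytes(dim, m, nbits=8):
    """Bytes needed to store one PQ-encoded vector (codes only)."""
    if dim % m != 0:
        raise ValueError(f"dim={dim} is not divisible by m={m}")
    return (m * nbits + 7) // 8  # round up to whole bytes

dim, m = 128, 8
raw_bytes = dim * 4                  # float32 storage: 512 bytes
code_bytes = pq_code_bytes(dim, m)   # 8 sub-vectors x 8 bits = 8 bytes
ratio = raw_bytes / code_bytes

print(raw_bytes, code_bytes, ratio)  # 512 8 64.0
```

A larger m keeps more detail per vector at the cost of a lower compression ratio.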

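HNSW (Hierarchical Navigable Small World Graph)

Characteristics: a graph-based index that typically offers the fastest search among the indexes covered here, with high recall

Use cases:

Search latency requirements are very strict
High recall is required
Plenty of memory is available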

python

index_params = {
    "index_type": "HNSW",
    "metric_type": "L2",
    "params": {
        "M": 16,                  # maximum number of connections per node
        "efConstruction": 200     # search scope while building the graph
    }
}

# Search parameters
search_params = {
    "metric_type": "L2",
    "params": {
        "ef": 64                  # search scope at query time; larger is more accurate
    }
}

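Parameters:

M: maximum number of connections per node; recommended range 8-64; larger values give a bigger but more accurate index
efConstruction: search scope during index construction; recommended range 64-512
ef: search scope at query time; must be >= topk (the search limit)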
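Because ef must not be smaller than the number of results requested, it can help to derive the search parameters from the limit. The helper below is a small sketch of that idea; `hnsw_search_params` is a hypothetical name, and raising ef to the limit is one possible policy rather than anything Milvus does for you.

```python
def hnsw_search_params(limit, ef=64, metric_type="L2"):
    """Build HNSW search params, raising ef to `limit` when it is too small."""
    return {"metric_type": metric_type, "params": {"ef": max(ef, limit)}}

print(hnsw_search_params(limit=10))    # ef stays at 64
print(hnsw_search_params(limit=100))   # ef is raised to 100
```

The returned dictionary has the same shape as the search_params shown above, so it could be passed as the param argument of a search call.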

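ANNOY (Approximate Nearest Neighbor Trees)

Characteristics: a tree-based index that is fast to build (availability depends on the Milvus version, so check that your release still supports it)

Use cases:

Static datasets
The index must be built quickly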

python

index_params = {
    "index_type": "ANNOY",
    "metric_type": "L2",
    "params": {
        "n_trees": 8              # number of trees
    }
}

# Search parameters
search_params = {
    "metric_type": "L2",
    "params": {
        "search_k": -1            # number of nodes to search; -1 means n_trees × topk
    }
}

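DISKANN (On-Disk Index)

Characteristics: an index optimized for disk storage that supports very large datasets

Use cases:

The dataset is larger than available memory
Disk I/O latency is acceptable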

python

index_params = {
    "index_type": "DISKANN",
    "metric_type": "L2",
    "params": {}
}

# Search parameters
search_params = {
    "metric_type": "L2",
    "params": {
        "search_list": 16         # candidate list size; should be at least the search limit
    }
}

Creating an Index

Basic Workflow

python

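from pymilvus import Collection, connections, utility

# Connect to Milvus (adjust host and port for your deployment)
connections.connect(host="localhost", port="19530")

# Get the collection
collection = Collection("article_search")

# Define the index parameters
index_params = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {
        "nlist": 128
    }
}

# Create the index
collection.create_index(
    field_name="article_vector",   # name of the vector field
    index_params=index_params
)

# Wait for the index build to finish, then load the collection for searching
utility.wait_for_index_building_complete("article_search")
collection.load()
print("Index created successfully!")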

Complete Example

python

from pymilvus import connections, utility, Collection, FieldSchema, CollectionSchema, DataType
import random

def create_index_demo():
    """Complete index creation example."""

    # Connect to Milvus
    connections.connect(host="localhost", port="19530")

    # Create a test collection
    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=128)
    ]

    schema = CollectionSchema(fields, "Index test collection")
    collection = Collection("index_demo", schema)

    # Insert test data in column format: one list per field, skipping the auto-generated id
    print("Inserting test data...")
    vectors = [[random.random() for _ in range(128)] for _ in range(10000)]
    collection.insert([vectors])
    collection.flush()  # persist the inserted rows before building the index

    # Create an HNSW index
    print("Creating HNSW index...")
    index_params = {
        "index_type": "HNSW",
        "metric_type": "L2",
        "params": {
            "M": 16,
            "efConstruction": 200
        }
    }

    collection.create_index(
        field_name="vector",
        index_params=index_params,
        index_name="vector_hnsw_index"  # explicit index name
    )

    # Wait for the index build to finish
    print("Waiting for the index build to finish...")
    utility.wait_for_index_building_complete("index_demo", index_name="vector_hnsw_index")

    # Load the collection and run a test search
    collection.load()

    # Run the search
    search_params = {
        "metric_type": "L2",
        "params": {"ef": 64}
    }

    results = collection.search(
        data=[[random.random() for _ in range(128)]],
        anns_field="vector",
        param=search_params,
        limit=10
    )

    print(f"Search finished, found {len(results[0])} results")

    # Clean up
    utility.drop_collection("index_demo")

if __name__ == "__main__":
    create_index_demo()

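Index Management

Viewing Index Information

python

# List every index on the collection
index_info = collection.indexes
for index in index_info:
    print(f"Index name: {index.index_name}")
    print(f"Field: {index.field_name}")
    print(f"Params: {index.params}")

# Look up the index built on a specific field
index = next(idx for idx in collection.indexes if idx.field_name == "article_vector")
print(f"Index params: {index.params}")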

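Dropping an Index

python

# Release the collection first; recent Milvus versions refuse to drop the index of a loaded collection
collection.release()

# Drop the index that was created without an explicit index_name
collection.drop_index()

# Drop an index by name
collection.drop_index(index_name="vector_hnsw_index")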

Checking Index Status

python

from pymilvus import utility

# Check the index build progress
progress = utility.index_building_progress("article_search")
print(f"Index build progress: {progress}")

# Wait for the index build to finish
utility.wait_for_index_building_complete("article_search")
print("Index build complete!")
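If you want to report progress while waiting rather than block silently, a polling loop is one option. The sketch below assumes the progress value is a dictionary with "total_rows" and "indexed_rows" keys, which is how index_building_progress reports it in the pymilvus versions I am aware of; verify this against your version. The function `wait_for_index` is hypothetical, and the progress source is passed in as a callable so the example runs without a Milvus server.

```python
import time

def wait_for_index(get_progress, poll_seconds=1.0, timeout=600.0):
    """Poll get_progress() until indexed_rows reaches total_rows."""
    deadline = time.monotonic() + timeout
    while True:
        progress = get_progress()
        done, total = progress["indexed_rows"], progress["total_rows"]
        print(f"indexed {done}/{total} rows")
        if done >= total:
            return progress
        if time.monotonic() >= deadline:
            raise TimeoutError("index build did not finish in time")
        time.sleep(poll_seconds)

# Stand-in for: lambda: utility.index_building_progress("article_search")
fake_states = iter([
    {"total_rows": 100, "indexed_rows": 40},
    {"total_rows": 100, "indexed_rows": 100},
])
final = wait_for_index(lambda: next(fake_states), poll_seconds=0.01)
```

Against a real deployment, the lambda shown in the comment would replace the fake iterator.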

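Index Selection Guide

Choosing by Data Scale

Data scale	Recommended index	Notes
< 1 million	FLAT / IVF_FLAT	Small dataset; exact search is feasible
1 million - 10 million	IVF_FLAT / HNSW	Balances speed and accuracy
10 million - 100 million	HNSW / IVF_PQ	Needs a high-performance index
> 100 million	DISKANN / IVF_PQ	Very large-scale data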

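Choosing by Scenario

Scenario	Recommended index	Reason
100% recall required	FLAT	Exact search
Ample memory, speed is the priority	HNSW	Fastest search
Limited memory	IVF_SQ8 / IVF_PQ	Compressed storage
Data exceeds memory	DISKANN	Disk-based storage
Frequently updated data	IVF_FLAT	Supports incremental updates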
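The two tables can be folded into a small decision helper. The sketch below is one possible reading: the function name `recommend_index`, its flags, and the order in which the rules are checked (exactness first, then memory constraints, then data scale) are assumptions made for illustration, not guidance from Milvus.

```python
def recommend_index(num_vectors, need_exact=False, memory_limited=False, fits_in_memory=True):
    """Return candidate index types based on the selection tables above."""
    if need_exact:
        return ["FLAT"]                 # 100% recall required
    if not fits_in_memory:
        return ["DISKANN"]              # data exceeds memory
    if memory_limited:
        return ["IVF_SQ8", "IVF_PQ"]    # compressed storage
    if num_vectors < 1_000_000:
        return ["FLAT", "IVF_FLAT"]
    if num_vectors <= 10_000_000:
        return ["IVF_FLAT", "HNSW"]
    if num_vectors <= 100_000_000:
        return ["HNSW", "IVF_PQ"]
    return ["DISKANN", "IVF_PQ"]

print(recommend_index(500_000))                          # ['FLAT', 'IVF_FLAT']
print(recommend_index(50_000_000))                       # ['HNSW', 'IVF_PQ']
print(recommend_index(5_000_000, memory_limited=True))   # ['IVF_SQ8', 'IVF_PQ']
```

Treat the output as a shortlist to benchmark, since the best choice also depends on dimension, hardware, and recall targets.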

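Parameter Tuning Recommendations

IVF Parameters

python

# nlist is set when building the index; nprobe is set at search time

# Small scale (< 1 million)
{"nlist": 128, "nprobe": 8}

# Medium scale (1 million - 10 million)
{"nlist": 1024, "nprobe": 16}

# Large scale (> 10 million)
{"nlist": 4096, "nprobe": 64}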
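Since nlist belongs in the index parameters and nprobe in the search parameters, a helper that splits the tiers above into the two dictionaries may be convenient. This is a sketch under the assumption that IVF_FLAT is the chosen index type; `ivf_params` is a hypothetical name.

```python
def ivf_params(num_vectors, metric_type="L2"):
    """Return (index_params, search_params) for IVF_FLAT using the tiers above."""
    if num_vectors < 1_000_000:
        nlist, nprobe = 128, 8
    elif num_vectors <= 10_000_000:
        nlist, nprobe = 1024, 16
    else:
        nlist, nprobe = 4096, 64
    index_params = {
        "index_type": "IVF_FLAT",
        "metric_type": metric_type,
        "params": {"nlist": nlist},     # build-time parameter
    }
    search_params = {
        "metric_type": metric_type,
        "params": {"nprobe": nprobe},   # search-time parameter
    }
    return index_params, search_params

index_params, search_params = ivf_params(5_000_000)
print(index_params)
print(search_params)
```

The first dictionary would go to create_index and the second to search.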

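HNSW Parameters

python

# M and efConstruction are set when building the index; ef is set at search time

# Optimized for speed
{"M": 8, "efConstruction": 64, "ef": 32}

# Balanced
{"M": 16, "efConstruction": 200, "ef": 64}

# Optimized for accuracy
{"M": 32, "efConstruction": 400, "ef": 128}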

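Performance Optimization Recommendations

Index Build Optimization

Create the index after bulk insertion: avoids repeatedly rebuilding the index
Choose a suitable nlist: 4 × sqrt(n) is a common starting point
Use GPU acceleration: consider GPU indexes for large-scale data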
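The 4 × sqrt(n) rule of thumb for nlist is easy to compute. The helper below is a minimal sketch; the name `suggest_nlist` is hypothetical, and the clamp to the range 1 to 65536 is an assumption about the limits Milvus accepts that you should verify for your version.

```python
import math

def suggest_nlist(num_vectors, lo=1, hi=65536):
    """Return 4 * sqrt(n), rounded and clamped to [lo, hi]."""
    return max(lo, min(hi, round(4 * math.sqrt(num_vectors))))

for n in (10_000, 1_000_000, 100_000_000):
    print(n, suggest_nlist(n))
# 10000 400
# 1000000 4000
# 100000000 40000
```

Use the result as a starting value and adjust it after measuring build time and recall on your own data.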

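Search Optimization

Tune the search parameters: adjust nprobe or ef to meet your recall target
Use partitions: narrows the search scope
Batch your searches: send multiple query vectors in one call

python

# Batch search
query_vectors = [vector1, vector2, vector3]  # one list of floats per query; add more as needed
results = collection.search(
    data=query_vectors,
    anns_field="vector",
    param=search_params,
    limit=10
)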

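FAQ

Index Build Fails

python

from pymilvus import DataType

# Check the number of entities
if collection.num_entities < 1000:
    print("Dataset is very small; consider using a FLAT index")

# Check the dimension of each float vector field
for field in collection.schema.fields:
    if field.dtype == DataType.FLOAT_VECTOR:
        print(f"Vector dimension of {field.name}: {field.params.get('dim')}")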

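Low Recall

python

# Raise the search parameter that matches your index type
ivf_search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 128}    # IVF indexes: scan more clusters
}

hnsw_search_params = {
    "metric_type": "L2",
    "params": {"ef": 256}        # HNSW: widen the search scope
}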
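Before and after changing these parameters, it helps to measure recall directly by comparing the approximate results with exact ones for the same query. The function below is a minimal sketch of recall@k; `recall_at_k` is a hypothetical helper, and the id lists are made-up sample values.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k ids that the approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

exact_ids = [3, 7, 9, 12, 15]     # e.g. from a FLAT index or a brute-force scan
approx_ids = [3, 7, 12, 20, 21]   # e.g. from an IVF index with a small nprobe
print(recall_at_k(approx_ids, exact_ids))  # 0.6
```

Averaging this value over a set of representative queries gives a recall figure you can track as nprobe or ef changes.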

Out of Memory

python

# Use a quantized index
index_params = {
    "index_type": "IVF_SQ8",  # or IVF_PQ
    "metric_type": "L2",
    "params": {"nlist": 128}
}

Next Steps

Once you are comfortable with index management, you can:
