ElasticSearch-分词器介绍及测试Standard标准分词器English英文分词器Chinese中文分词器IKIK-分词器

2025-03-05 约 2801 字预计阅读 6 分钟

https://bing.ee123.net/img/rand?artid=146040899

ElasticSearch 分词器介绍及测试：Standard（标准分词器）、English（英文分词器）、Chinese（中文分词器）、IK（IK

分词器）

本文 ElasticSearch 版本为：7.17.9 ，为了对应 spring-boot-starter-parent 的 2.7.9 版本

ElasticSearch 分词器介绍及测试

ElasticSearch 提供了多种内置的分词器（Analyzer），用于文本的分析和分词。分词器是文本分析的核心，决定了如何把输入的文本字符串分解成一个个“词项”（token）。不同的分词器适用于不同的语言和场景，如中文、英文等。本文将介绍常用的分词器及其应用。

1 Standard Analyzer（标准分词器）

功能：standard 是 ElasticSearch 的默认分词器，基于 Unicode 文本分解标准，适用于多种语言。它会将文本中的标点符号、常见停用词移除，并将文本转化为小写。
用途：适用于大多数通用场景，尤其是处理混合语言或没有特殊分词需求的情况。
分词示例 ：
输入："The quick brown fox"
输出：["the", "quick", "brown", "fox"] 使用 ElasticSearch 的可视化界面 Kibana 的调试工具 Dev Tools 调用解析接口测试：

`standard` 是 ElasticSearch 的默认分词器，基于 Unicode 文本分解标准，适用于多种语言。它会将文本中的标点符号、常见停用词移除，并将文本转化为小写。

POST /_analyze { “analyzer”: “standard”, “text”: “The quick brown fox” } 解析结果： #! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See to enable security. { “tokens” : [ { “token” : “the”, “start_offset” : 0, “end_offset” : 3, “type” : “”, “position” : 0 }, { “token” : “quick”, “start_offset” : 4, “end_offset” : 9, “type” : “”, “position” : 1 }, { “token” : “brown”, “start_offset” : 10, “end_offset” : 15, “type” : “”, “position” : 2 }, { “token” : “fox”, “start_offset” : 16, “end_offset” : 19, “type” : “”, “position” : 3 } ] }

2 English Analyzer（英文分词器）

功能：english 分词器专用于英文文本的分析，除了进行基本的分词，还会进行停用词过滤，并将所有文本转换为小写字母。
用途：适用于英文文本的分析，特别是在英文搜索引擎或英文数据处理中。
分词示例 ：
输入："The quick brown fox"
输出：["quick", "brown", "fox"]（the 被移除作为停用词）使用 ElasticSearch 的可视化界面 Kibana 的调试工具 Dev Tools 调用解析接口测试：

`english` 分词器专用于英文文本的分析，除了进行基本的分词，还会进行停用词过滤，并将所有文本转换为小写字母。

POST /_analyze { “analyzer”: “english”, “text”: “The quick brown fox” } 解析结果： #! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See to enable security. { “tokens” : [ { “token” : “quick”, “start_offset” : 4, “end_offset” : 9, “type” : “”, “position” : 1 }, { “token” : “brown”, “start_offset” : 10, “end_offset” : 15, “type” : “”, “position” : 2 }, { “token” : “fox”, “start_offset” : 16, “end_offset” : 19, “type” : “”, “position” : 3 } ] }

3 Chinese Analyzer（中文分词器）

功能：chinese 分词器专为中文文本设计，基于分词字典并结合最大匹配法等技术，将中文文本分解成多个词项。
用途：适用于中文文本的分词处理，特别是中文搜索引擎和中文语料处理。对中文的解析很差 。
分词示例 ：
输入："今天天气很好"
期望的输出：["今天", "天气", "很", "好",]
实际的输出：["今","天", "天","气", "很", "好"] 使用 ElasticSearch 的可视化界面 Kibana 的调试工具 Dev Tools 调用解析接口测试：

`chinese` 分词器专为中文文本设计，基于分词字典并结合最大匹配法等技术，将中文文本分解成多个词项。

POST /_analyze { “analyzer”: “chinese”, “text”: “今天天气很好” } 解析结果： #! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See to enable security. { “tokens” : [ { “token” : “今”, “start_offset” : 0, “end_offset” : 1, “type” : “”, “position” : 0 }, { “token” : “天”, “start_offset” : 1, “end_offset” : 2, “type” : “”, “position” : 1 }, { “token” : “天”, “start_offset” : 2, “end_offset” : 3, “type” : “”, “position” : 2 }, { “token” : “气”, “start_offset” : 3, “end_offset” : 4, “type” : “”, “position” : 3 }, { “token” : “很”, “start_offset” : 4, “end_offset” : 5, “type” : “”, “position” : 4 }, { “token” : “好”, “start_offset” : 5, “end_offset” : 6, “type” : “”, “position” : 5 } ] }

4 IK Analyzer（IK 分词器）

官网资源 ：
功能：IK Analyzer 是一个开源的中文分词器，专门用于处理中文文本。它结合了多种中文分词技术，支持细粒度和粗粒度的分词。
安装：需要作为 ElasticSearch 插件安装，支持通过精确模式和智能模式两种分词策略。
分词示例 ：
输入："今天天气不错，适合出游"
ik_smart（最少切分） ：["今天天气", "不错", "适合", "出游"]
ik_max_word（最细切分） ：["今天天气", "今天", "天天", "天气", "不错", "适合", "合出", "出游"]
扩展词典 ：支持自定义扩展词典，用户可以添加特定词语、行业术语、网络热词等。【】使用 ElasticSearch 的可视化界面 Kibana 的调试工具 Dev Tools 调用解析接口测试：

`IK Analyzer` ik_smart（最少切分）。

POST /_analyze { “analyzer”: “ik_smart”, “text”: “今天天气不错，适合出游” } 解析结果： #! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See to enable security. { “tokens” : [ { “token” : “今天天气”, “start_offset” : 0, “end_offset” : 4, “type” : “CN_WORD”, “position” : 0 }, { “token” : “不错”, “start_offset” : 4, “end_offset” : 6, “type” : “CN_WORD”, “position” : 1 }, { “token” : “适合”, “start_offset” : 7, “end_offset” : 9, “type” : “CN_WORD”, “position” : 2 }, { “token” : “出游”, “start_offset” : 9, “end_offset” : 11, “type” : “CN_WORD”, “position” : 3 } ] } 使用 ElasticSearch 的可视化界面 Kibana 的调试工具 Dev Tools 调用解析接口测试：

`IK Analyzer` ik_smart（最少切分）。

POST /_analyze { “analyzer”: “ik_smart”, “text”: “今天天气不错，适合出游” } 解析结果： #! Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See to enable security. { “tokens” : [ { “token” : “今天天气”, “start_offset” : 0, “end_offset” : 4, “type” : “CN_WORD”, “position” : 0 }, { “token” : “今天”, “start_offset” : 0, “end_offset” : 2, “type” : “CN_WORD”, “position” : 1 }, { “token” : “天天”, “start_offset” : 1, “end_offset” : 3, “type” : “CN_WORD”, “position” : 2 }, { “token” : “天气”, “start_offset” : 2, “end_offset” : 4, “type” : “CN_WORD”, “position” : 3 }, { “token” : “不错”, “start_offset” : 4, “end_offset” : 6, “type” : “CN_WORD”, “position” : 4 }, { “token” : “适合”, “start_offset” : 7, “end_offset” : 9, “type” : “CN_WORD”, “position” : 5 }, { “token” : “合出”, “start_offset” : 8, “end_offset” : 10, “type” : “CN_WORD”, “position” : 6 }, { “token” : “出游”, “start_offset” : 9, “end_offset” : 11, “type” : “CN_WORD”, “position” : 7 } ] }

官网资源

你可以访问 ElasticSearch 官方文档页面，获取有关不同分词器和分析器的详细介绍，以及如何配置和使用它们：

小结

ElasticSearch 提供了多种内置分词器，能够适应不同语言和文本格式的需求。选择合适的分词器对于实现高效的搜索和分析至关重要。你可以根据实际的应用场景选择 standard、chinese、english 等分词器，或根据需要创建自定义分词器来满足特定的文本分析需求。如果你有特殊的需求，可以深入研究分词器的配置选项和扩展方式。

目录

ElasticSearch-分词器介绍及测试Standard标准分词器English英文分词器Chinese中文分词器IKIK-分词器

ElasticSearch 分词器介绍及测试：Standard（标准分词器）、English（英文分词器）、Chinese（中文分词器）、IK（IK

ElasticSearch 分词器介绍及测试

1 Standard Analyzer（标准分词器）

standard 是 ElasticSearch 的默认分词器，基于 Unicode 文本分解标准，适用于多种语言。它会将文本中的标点符号、常见停用词移除，并将文本转化为小写。

2 English Analyzer（英文分词器）

english 分词器专用于英文文本的分析，除了进行基本的分词，还会进行停用词过滤，并将所有文本转换为小写字母。

3 Chinese Analyzer（中文分词器）

chinese 分词器专为中文文本设计，基于分词字典并结合最大匹配法等技术，将中文文本分解成多个词项。

chinese 分词器专为中文文本设计，基于分词字典并结合最大匹配法等技术，将中文文本分解成多个词项。

4 IK Analyzer（IK 分词器）

IK Analyzer ik_smart（最少切分）。

IK Analyzer ik_smart（最少切分）。

官网资源

小结

`standard` 是 ElasticSearch 的默认分词器，基于 Unicode 文本分解标准，适用于多种语言。它会将文本中的标点符号、常见停用词移除，并将文本转化为小写。

`english` 分词器专用于英文文本的分析，除了进行基本的分词，还会进行停用词过滤，并将所有文本转换为小写字母。

`chinese` 分词器专为中文文本设计，基于分词字典并结合最大匹配法等技术，将中文文本分解成多个词项。

`chinese` 分词器专为中文文本设计，基于分词字典并结合最大匹配法等技术，将中文文本分解成多个词项。

`IK Analyzer` ik_smart（最少切分）。

`IK Analyzer` ik_smart（最少切分）。