
[C++ Project in Practice] Campus Announcement Search Engine: A Complete Implementation and Optimization Guide

https://i-blog.csdnimg.cn/direct/52bc67966cad45eda96494d9b411954d.png


https://i-blog.csdnimg.cn/direct/fc2b77a8d9c3456ea21d7e871906f5b0.jpeg



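📚 I. Project Overview

📖 1. Background

The website of the Academic Affairs Office of Hangzhou Normal University is the school's main platform for publishing announcements and is meant to give students and staff timely information. At present, however, the site has the following problems:

① Stale content: the home page mostly shows old announcements, so users cannot quickly find the latest news, which costs them time.

② Weak search: the site's search engine cannot sort results by time. Because announcements are time-sensitive, this fails to meet the users' core need.

③ Poor interface design: the search page is not visually appealing and the user experience is poor.

To address these problems, I decided to build a search engine for the announcements on the Academic Affairs Office website. The goal is to give students and staff a more efficient and more intuitive retrieval tool that helps them find the latest announcements quickly and improves the overall experience.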

https://i-blog.csdnimg.cn/direct/9026e65d5bd4424494a09d3090bb3cb4.png

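📖 2. Main Features

The campus announcement search engine is an information retrieval platform for the students and staff of this school. Its core feature is keyword search over the announcements and official documents published on the Academic Affairs Office website. Users type a keyword into the search box, browse the summaries of the matching announcements, and click a link to open the full text on the school website. The project's interface is shown below.

📖 3. Interface

The interface is simple and intuitive and contains the following elements:

① Search box: placed prominently at the top of the page; users enter keywords here to search the announcements;

② Sort-by-time option: placed next to the search box; it sorts the search results by publication time. Since announcements are time-sensitive, this feature is essential.

③ Pagination buttons: placed at the bottom of the page so that users can page through long result lists.

The school website has its own search engine, but it cannot sort by time. This causes a problem: a user wants the newest announcements for a keyword, but the server returns results sorted by keyword weight by default (how often the keyword appears in the article), so the user does not immediately get what they are looking for: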

Search results from the school website:

https://i-blog.csdnimg.cn/direct/2dc4a3a6826b4bf49d98e37a162305e9.png

Search results from this project's engine:

https://i-blog.csdnimg.cn/direct/86efac1b44444cb9a41baebfa6ca9f50.png
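The difference between the two orderings can be sketched in a few lines of C++. This is only an illustration, not the project's code: the `Hit` struct, its field names, and the assumption that dates are stored as "YYYY-MM-DD" strings (so that string comparison matches date order) are all hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical search hit; the field names are illustrative only.
struct Hit {
    std::string title;
    std::string date;  // assumed "YYYY-MM-DD", so string order equals date order
    int weight;        // keyword relevance score
};

// Relevance-first ordering: highest weight first.
void SortByWeight(std::vector<Hit>& hits) {
    std::stable_sort(hits.begin(), hits.end(),
                     [](const Hit& a, const Hit& b) { return a.weight > b.weight; });
}

// Time-first ordering: newest first, ties broken by weight.
void SortByTime(std::vector<Hit>& hits) {
    std::stable_sort(hits.begin(), hits.end(), [](const Hit& a, const Hit& b) {
        if (a.date != b.date) return a.date > b.date;
        return a.weight > b.weight;
    });
}
```

With relevance-first ordering an old announcement that repeats the keyword many times can outrank a recent one; the time-first comparator puts the recent one on top and only falls back to weight for announcements published on the same day.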

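Because all of the engine's data comes from the school website, the data set is actually small (only a little over two thousand announcements were crawled from the Academic Affairs Office site), so keyword coverage is limited. If a user enters a keyword that does not exist, the system shows a friendly notice and offers the following options:

① Go to the school website: view the latest announcements directly on the official site (the project still has a shortcoming here: online updating is not implemented yet and is left for future development);

② Visit the author's blog: a bit of self-promotion, haha;

③ View the project source code: if you are interested in the project, you can jump straight to the source.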

https://i-blog.csdnimg.cn/direct/d7bc6954a9574c65aef2a0008a7ff62d.png

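📚 II. Technical Background

📖 1. Tech Stack

This project uses the following tech stack:

Boost (quasi-standard library): efficient file operations and string processing.

cppjieba (word segmentation library): segments Chinese keywords to improve search accuracy.

jsoncpp (serialization tool): serializes search results to JSON for exchange between front end and back end.

httplib (server library): quickly sets up a lightweight HTTP server that handles search requests and responses.

The rest of this article walks through the implementation in detail.

📖 2. Core Logic

First we need to understand the core logic of a search engine: the client sends a search keyword, the server retrieves the results that match it, and the results are returned to the client. A search result usually consists of three parts:

① Document title: a concise summary of what the document is about;

② Document summary: an excerpt containing the keyword, which helps the user judge relevance quickly;

③ Document URL: a link to the full content.

So the key questions for returning search results are: how do we match content to a keyword, and where does that content come from? We cannot simply forward the client's keyword to another search engine, because the results must come from our own server. We therefore need to build a local database that stores the basic data unit of each document, namely its title, summary, and URL.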
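To make the idea of a "data unit" concrete, here is a minimal sketch of such a local store together with the most naive possible lookup, a linear scan over every unit. It is not the project's implementation: the names `DocUnit` and `NaiveSearch` are made up for illustration, and a scan like this is the kind of lookup that an index is meant to replace.

```cpp
#include <string>
#include <vector>

// One data unit in the local store (names are illustrative).
struct DocUnit {
    std::string title;
    std::string summary;
    std::string url;
};

// Naive lookup: return every unit whose title or summary contains the keyword.
// Cost grows with the total amount of text, which is why an index is built later.
std::vector<DocUnit> NaiveSearch(const std::vector<DocUnit>& store,
                                 const std::string& keyword) {
    std::vector<DocUnit> hits;
    if (keyword.empty()) return hits;  // an empty keyword matches nothing
    for (const DocUnit& d : store) {
        if (d.title.find(keyword) != std::string::npos ||
            d.summary.find(keyword) != std::string::npos) {
            hits.push_back(d);
        }
    }
    return hits;
}
```

For a couple of thousand announcements even this scan would probably be fast enough, but it gives no ranking and does not handle Chinese word segmentation, which is where the later sections come in.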

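Where does the document content come from? Mainstream search engines such as Google and Baidu use web crawlers to continuously fetch pages from the Internet, convert them into data units, and store them in a local database, which is what makes web-wide search possible.

Building a web-wide search engine is extremely expensive in cost and resources. For an individual developer in particular, the data storage, computing power, and network bandwidth required are far beyond what the author's current cloud server can handle. A search engine for campus website announcements, on the other hand, is a much more practical and feasible choice. The volume of campus announcements is relatively small and the cost of storage and retrieval is low, so it can run efficiently within the capacity of a cloud server. A small, focused project like this lowers the technical barrier and still brings real convenience to campus users, which makes it an ideal target for learning and practice.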

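📚 III. Back-End Implementation

With the basic principles and data structures of a search engine in mind, we can start on the back end. The design and implementation are described below.

📖 1. Building the Raw Database

🔖 Web Crawling

First, we need to crawl the content of the Academic Affairs Office website and save the results as local .shtml files. These files are the base data source for all later processing. We use the wget tool for the crawl:

wget is a powerful command-line tool for downloading files from the network. It supports recursive download, resuming interrupted downloads, rate limiting, and more, which makes it well suited to batch downloads and site crawling. We run the following command in the cloud server's terminal (the start URL, which wget expects as the last argument, is not shown here):

wget --recursive --no-clobber --no-parent --convert-links --domains jwc.hznu.edu.cn --directory-prefix=data/source_html -A .shtml

Parameters:

--recursive: download the content of the whole site recursively;

--no-clobber: do not download a file again if it already exists (avoids overwriting);

--no-parent: do not download content from parent directories, only the current directory and its subdirectories;

--convert-links: convert the links in the downloaded files to local links so they can be browsed locally;

--domains jwc.hznu.edu.cn: restrict the download to the given domain;

--directory-prefix=data/source_html: save the downloaded content under the data/source_html directory;

-A .shtml: download only .shtml files.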

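https://i-blog.csdnimg.cn/direct/e602d32fb6e7445184849afcf9d3791a.png

The files on the official site are in .shtml format, so this lets us crawl exactly the announcement files and ignore everything we do not need. On the official site all announcement files are stored under the /c directory; after wget downloads them and converts the links to local ones, they are likewise stored under a /c directory locally: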

https://i-blog.csdnimg.cn/direct/ea06f2aea1774833af6594245c283874.png

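As we can see, the announcement files are stored in directories organized by date. We can use the find command to count the total number of files: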

https://i-blog.csdnimg.cn/direct/db0661408fea4233b782a3daca01297d.png

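The next step is to gather the 2064 downloaded .shtml files together:

🔖 Data Organization

The crawled .shtml files are messy, so they need to be organized and the key information extracted. To do that, we first have to traverse the /c directory recursively and collect every .shtml file. Reading a single file with open is straightforward, but walking a directory tree recursively is a problem we still have to solve. Here we can use the Boost.Filesystem library.

Boost.Filesystem is a powerful file system library that provides cross-platform directory traversal, file operations, and more. However, Boost is not part of the official C++ standard library, so it has to be downloaded and installed locally before it can be used.

Below is how the boost::filesystem library is used: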

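#include <boost/filesystem.hpp>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Recursively walk a directory with Boost and collect the .shtml files
bool Enumfile(const string &src_path, vector<string> *files_list)
{
    namespace fs = boost::filesystem; // namespace alias to shorten the code
    fs::path root_path(src_path);     // convert the string path to a boost::filesystem::path

    // Check that the path exists
    if (!fs::exists(root_path))
    {
        cerr << src_path << " not exists!" << endl;
        return false;
    }

    // A default-constructed iterator serves as the end marker
    fs::recursive_directory_iterator end;

    // Walk the directory tree recursively
    for (fs::recursive_directory_iterator it(root_path); it != end; ++it)
    {
        // Skip anything that is not a regular file
        if (!fs::is_regular_file(*it))
            continue;

        // Skip files whose extension is not .shtml (the crawled pages are .shtml files)
        if (it->path().extension() != ".shtml")
            continue;

        // Append the file path to the list as a string
        files_list->push_back(it->path().string());
    }

    return true;
}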

With that, the paths of all the .shtml files are stored as strings in the vector.

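🔖 Tag Cleaning

A .shtml file contains the content of the entire web page, but we only need three things: the document title, the document summary, and the document URL. So the next step is to extract these three elements from the .shtml source and store them as one data unit. This step is called tag cleaning.

① Document title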

在html中,定义网页的标题,但是在官网公告中</p> <p><img class="lazyload" src="/TechBlog/svg/loading.min.svg" data-src="https://i-blog.csdnimg.cn/direct/173a4fe6800d4469aa4494684cfc80ee.png" data-srcset="https://i-blog.csdnimg.cn/direct/173a4fe6800d4469aa4494684cfc80ee.png, https://i-blog.csdnimg.cn/direct/173a4fe6800d4469aa4494684cfc80ee.png 1.5x, https://i-blog.csdnimg.cn/direct/173a4fe6800d4469aa4494684cfc80ee.png 2x" data-sizes="auto" alt="https://i-blog.csdnimg.cn/direct/173a4fe6800d4469aa4494684cfc80ee.png" title="https://i-blog.csdnimg.cn/direct/173a4fe6800d4469aa4494684cfc80ee.png" /></p> <p>将“杭州师范大学教务处”作为网页的标题,而并不是公告的标题,所以我们需要找到公告标题对应的标签是什么。在网页中,我们选中公告标题,选择网页检查,就可以看到源码中的位置:</p> <p><img class="lazyload" src="/TechBlog/svg/loading.min.svg" data-src="https://i-blog.csdnimg.cn/direct/178ec1a8438d4a308c28c32ba078935c.png" data-srcset="https://i-blog.csdnimg.cn/direct/178ec1a8438d4a308c28c32ba078935c.png, https://i-blog.csdnimg.cn/direct/178ec1a8438d4a308c28c32ba078935c.png 1.5x, https://i-blog.csdnimg.cn/direct/178ec1a8438d4a308c28c32ba078935c.png 2x" data-sizes="auto" alt="https://i-blog.csdnimg.cn/direct/178ec1a8438d4a308c28c32ba078935c.png" title="https://i-blog.csdnimg.cn/direct/178ec1a8438d4a308c28c32ba078935c.png" /></p> <p>原来标题被定义为了<h1>标签,也就是一级标题,所以我们在源文件中,只需要定位到<h1>标签就可以找到文档标题了。</p> <p><strong>②文档正文</strong></p> <p>文档正文内容位于标签<div class="sp-content">下,我们同样定位到该标签处,将后续标签清洗,保留正文部分。</p> <p><img class="lazyload" src="/TechBlog/svg/loading.min.svg" data-src="https://i-blog.csdnimg.cn/direct/934fbf635db84cb59b7c935076c4f634.png" data-srcset="https://i-blog.csdnimg.cn/direct/934fbf635db84cb59b7c935076c4f634.png, https://i-blog.csdnimg.cn/direct/934fbf635db84cb59b7c935076c4f634.png 1.5x, https://i-blog.csdnimg.cn/direct/934fbf635db84cb59b7c935076c4f634.png 2x" data-sizes="auto" alt="https://i-blog.csdnimg.cn/direct/934fbf635db84cb59b7c935076c4f634.png" title="https://i-blog.csdnimg.cn/direct/934fbf635db84cb59b7c935076c4f634.png" /></p> <p>标签清洗的核心逻辑是:在遍历 HTML 内容时,忽略掉所有被 <code><</code> 
和 <code>></code> 包围的部分(即标签),而保留未被 <code><</code> 和 <code>></code> 包围的部分(即正文内容)。为了实现这一逻辑,我们可以通过定义一个状态机来实现。</p> <p>状态机的设计如下:</p> <p>① <strong>初始状态为 <code>TAG</code></strong> ,因为遍历通常从 <code><</code> 开始;</p> <p>② 当遇到 <code>></code> 时,状态 <strong>从 <code>TAG</code> 切换到 <code>CONTENT</code></strong> ,表示接下来是正文部分;</p> <p>③ 当再次遇到 <code><</code> 时,状态 <strong>从 <code>CONTENT</code> 切换回 <code>TAG</code></strong> ,表示接下来的内容是标签;</p> <p>④ 重复上述过程,直到遍历完整个 HTML 内容。</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">bool</span> <span class="nf">ParseContent</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">file</span><span class="p">,</span> <span class="n">string</span> <span class="o">*</span><span class="n">content</span><span class="p">)</span> </span></span><span class="line"><span class="cl"><span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">size_t</span> <span class="n">begin</span> <span class="o">=</span> <span class="n">file</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="s">"<div class=</span><span class="se">\"</span><span class="s">sp-content</span><span class="se">\"</span><span class="s">>"</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span><span class="p">(</span><span class="n">begin</span> <span 
class="o">==</span> <span class="n">string</span><span class="o">::</span><span class="n">npos</span><span class="p">){</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="nb">false</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> <span class="n">begin</span> <span class="o">+=</span> <span class="n">string</span><span class="p">(</span><span class="s">"<div class=</span><span class="se">\"</span><span class="s">sp-content</span><span class="se">\"</span><span class="s">>"</span><span class="p">).</span><span class="n">size</span><span class="p">();</span> </span></span><span class="line"><span class="cl"> <span class="n">size_t</span> <span class="n">end</span> <span class="o">=</span> <span class="n">file</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="s">"<div class=</span><span class="se">\"</span><span class="s">foter</span><span class="se">\"</span><span class="s">>"</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span><span class="p">(</span><span class="n">end</span> <span class="o">==</span> <span class="n">string</span><span class="o">::</span><span class="n">npos</span><span class="p">){</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="nb">false</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> <span class="c1">// 使用状态机,去除标签 </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="k">enum</span> <span class="nc">status</span><span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">LABLE</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> 
<span class="n">CONTENT</span> </span></span><span class="line"><span class="cl"> <span class="p">};</span> </span></span><span class="line"><span class="cl"> <span class="k">enum</span> <span class="nc">status</span> <span class="n">s</span> <span class="o">=</span> <span class="n">LABLE</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">for</span><span class="p">(</span><span class="n">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="n">begin</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">end</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">){</span> </span></span><span class="line"><span class="cl"> <span class="c1">// cout << "curent status: " << s << endl; </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="k">switch</span><span class="p">(</span><span class="n">s</span><span class="p">){</span> </span></span><span class="line"><span class="cl"> <span class="k">case</span> <span class="nl">LABLE</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span><span class="p">(</span><span class="n">file</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="sc">'>'</span><span class="p">){</span> </span></span><span class="line"><span class="cl"> <span class="n">s</span> <span class="o">=</span> <span class="n">CONTENT</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> <span class="k">break</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">case</span> <span class="nl">CONTENT</span><span class="p">:</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span><span 
class="p">(</span><span class="n">file</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">==</span> <span class="sc">'<'</span><span class="p">){</span> </span></span><span class="line"><span class="cl"> <span class="n">s</span> <span class="o">=</span> <span class="n">LABLE</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> <span class="k">else</span><span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="kt">char</span> <span class="n">tmp</span> <span class="o">=</span> <span class="n">file</span><span class="p">[</span><span class="n">i</span><span class="p">];</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span><span class="p">(</span><span class="n">tmp</span> <span class="o">==</span> <span class="sc">'\n'</span><span class="p">)</span> <span class="n">tmp</span> <span class="o">=</span> <span class="sc">' '</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">content</span><span class="o">-></span><span class="n">push_back</span><span class="p">(</span><span class="n">tmp</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> <span class="k">break</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">default</span><span class="o">:</span> </span></span><span class="line"><span class="cl"> <span class="k">break</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="nb">true</span><span 
class="p">;</span> </span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><p><strong>③文档URL</strong></p> <p>在 HTML 源码中,通常不会直接包含网页的完整 URL 信息,因此我们需要通过其他方式推断出 URL。 网页在网站中的存储通常遵循一定的路径规则 。以教务处官网为例,所有公告网页都存储在 <code>/c</code> 目录下。当我们使用 <code>wget</code> 工具将这些网页下载到本地时,文件的路径结构与官网保持一致,即在本地也保留了 <code>/c</code> 目录下的相对路径。基于这一特性,我们可以通过以下步骤获取网页的完整 URL:即 将官网的基础路径与本地文件的相对路径拼接,就得到了完整的URL 。</p> <p><img class="lazyload" src="/TechBlog/svg/loading.min.svg" data-src="https://i-blog.csdnimg.cn/direct/5ee3efcbdb334c34ad0ad6688628e879.png" data-srcset="https://i-blog.csdnimg.cn/direct/5ee3efcbdb334c34ad0ad6688628e879.png, https://i-blog.csdnimg.cn/direct/5ee3efcbdb334c34ad0ad6688628e879.png 1.5x, https://i-blog.csdnimg.cn/direct/5ee3efcbdb334c34ad0ad6688628e879.png 2x" data-sizes="auto" alt="https://i-blog.csdnimg.cn/direct/5ee3efcbdb334c34ad0ad6688628e879.png" title="https://i-blog.csdnimg.cn/direct/5ee3efcbdb334c34ad0ad6688628e879.png" /></p> <h5 id="保存信息">🔖保存信息</h5> <p>我们将提取出的 <strong>文档标题</strong> 、 <strong>文档摘要</strong> 和 <strong>文档URL</strong> 这三个关键信息存储在一个 <code>DocInfo</code> 结构体中,作为基本的数据单元,然后将这些数据单元按行写入到一个文本文件( <code>raw.txt</code> )中,其中每个数据单元内部的字段之间用特殊分隔符(如 <code>\3</code> )分隔,不同数据单元之间用换行符 <code>\n</code> 分隔。这个 <code>raw.txt</code> 文件不仅实现了 数据的持久化存储 ,还为后续索引构建和搜索功能提供 基础数据源 。</p> <p><strong>下面是构建原生数据库的核心代码片段:</strong></p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="k">const</span> <span class="n">string</span> <span 
class="n">src_path</span> <span class="o">=</span> <span class="s">"data/source_html/test"</span><span class="p">;</span> </span></span><span class="line"><span class="cl"><span class="k">const</span> <span class="n">string</span> <span class="n">output</span> <span class="o">=</span> <span class="s">"data/raw_data/test.txt"</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="k">typedef</span> <span class="k">struct</span> <span class="nc">DocInfo</span><span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">string</span> <span class="n">title</span><span class="p">;</span> <span class="c1">// 文档标题 </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">string</span> <span class="n">content</span><span class="p">;</span> <span class="c1">// 文档的内容 </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">string</span> <span class="n">url</span><span class="p">;</span> <span class="c1">// 文档的url </span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="p">}</span><span class="n">DocInfo_t</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="c1">// const & 输入 </span></span></span><span class="line"><span class="cl"><span class="c1">// * 输出 </span></span></span><span class="line"><span class="cl"><span class="c1">// & 输入输出 </span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="kt">bool</span> <span class="nf">Enumfile</span> <span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">src_path</span><span class="p">,</span> <span class="n">vector</span><span class="o"><</span><span class="n">string</span><span class="o">></span> <span 
class="o">*</span><span class="n">files_list</span><span class="p">);</span> </span></span><span class="line"><span class="cl"><span class="kt">bool</span> <span class="nf">ParseHtml</span> <span class="p">(</span><span class="k">const</span> <span class="n">vector</span><span class="o"><</span><span class="n">string</span><span class="o">></span> <span class="o">&</span><span class="n">files_list</span><span class="p">,</span> <span class="n">vector</span><span class="o"><</span><span class="n">DocInfo_t</span><span class="o">></span> <span class="o">*</span><span class="n">results</span><span class="p">);</span> </span></span><span class="line"><span class="cl"><span class="kt">bool</span> <span class="nf">SaveHtml</span> <span class="p">(</span><span class="k">const</span> <span class="n">vector</span><span class="o"><</span><span class="n">DocInfo_t</span><span class="o">></span> <span class="o">&</span><span class="n">results</span><span class="p">,</span> <span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">output</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> </span></span><span class="line"><span class="cl"><span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="c1">// 1.遍历指定目录,将html文件汇总在列表里 </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">vector</span><span class="o"><</span><span class="n">string</span><span class="o">></span> <span class="n">files_list</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span><span class="p">(</span><span class="o">!</span><span class="n">Enumfile</span><span class="p">(</span><span class="n">src_path</span><span class="p">,</span> <span class="o">&</span><span 
class="n">files_list</span><span class="p">))</span> </span></span><span class="line"><span class="cl"> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">cerr</span> <span class="o"><<</span> <span class="s">"Enum file name error!"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="mi">1</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="c1">// 2.将列表中的每个文件进行解析,提取关键数据 </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">vector</span><span class="o"><</span><span class="n">DocInfo_t</span><span class="o">></span> <span class="n">results</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span><span class="p">(</span><span class="o">!</span><span class="n">ParseHtml</span><span class="p">(</span><span class="n">files_list</span><span class="p">,</span> <span class="o">&</span><span class="n">results</span><span class="p">))</span> </span></span><span class="line"><span class="cl"> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">cerr</span> <span class="o"><<</span> <span class="s">"Parse html error!"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="mi">2</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="c1">// 3.将解析后的数据保存到指定文件中 </span></span></span><span class="line"><span class="cl"><span 
class="c1"></span> <span class="k">if</span><span class="p">(</span><span class="o">!</span><span class="n">SaveHtml</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">output</span><span class="p">))</span> </span></span><span class="line"><span class="cl"> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">cerr</span> <span class="o"><<</span> <span class="s">"Save html error!"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="mi">3</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="mi">0</span><span class="p">;</span> </span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><h4 id="2构建索引">📖2.构建索引</h4> <p>在数据库建立完成后,我们可以编写程序处理搜索关键字并返回相关内容:首先 接收用户输入的关键字,然后在数据库中检索 <code>title</code> 和 <code>content</code> 字段包含该关键字的文档 ;由于一个关键字可能匹配多个文档,而数据库未对结果排序,我们需要将这些文档提取到 <code>vector</code> 容器中,按相关性或其他规则进行排序,最后将排序后的文档作为搜索结果返回给用户。</p> <p>所以第一步我们需要建立的是通过文档ID索引文档信息的 正排索引 。</p> <h5 id="正排索引">🔖正排索引</h5> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"> <span class="c1">// 正排索引数据节点 </span></span></span><span class="line"><span 
class="cl"><span class="c1"></span> <span class="k">struct</span> <span class="nc">DocInfo</span><span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">string</span> <span class="n">title</span><span class="p">;</span> <span class="c1">// 文档标题 </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">string</span> <span class="n">content</span><span class="p">;</span> <span class="c1">// 文档去标签后的内容 </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">string</span> <span class="n">url</span><span class="p">;</span> <span class="c1">// 官网的网址 </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">string</span> <span class="n">time</span><span class="p">;</span> <span class="c1">// 文档时间 </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="kt">uint64_t</span> <span class="n">doc_id</span><span class="p">;</span> <span class="c1">// 文档ID </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="p">};</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="c1">// 正排索引通过数组实现,下标天然为文档ID </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">vector</span><span class="o"><</span><span class="n">DocInfo</span><span class="o">></span> <span class="n">forward_index</span><span class="p">;</span></span></span></code></pre></div></div><p>构建正排索引的过程是 <strong>通过文档 ID 索引文档信息</strong> 。在 <code>vector</code> 容器中,下标天然可以作为文档 ID,而文档信息结构包括【title、content、URL、ID】。我们可以使用 <code>std::ifstream</code> 创建一个读取流,将文档内容写入流中,并通过 <code>std::getline</code> 方法循环读取,每次读取的恰好是一个文档(文档之间用 <code>\n</code> 分隔)。</p> <p>读取到的文档是一个字符串,信息段之间用 <code>\3</code> 分隔,因此需要对字符串进行分割。我们可以手动编写分割代码(遍历字符串,遇到 <code>\3</code> 时分割),也可以使用 <strong><code>boost::split</code></strong> 
方法,它能够根据指定字符分割字符串,并将结果存储到 <code>vector</code> 数组中。分割完成后,将数据段组合成 <code>DocInfo</code> 结构,并存储到正排索引的 <code>vector</code> 容器中。</p> <p><strong>下面是构建正排索引的代码实现:</strong></p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"> <span class="c1">// 创建正排索引 </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">DocInfo</span> <span class="o">*</span><span class="nf">BuildForwardInfo</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">line</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="c1">// 1.对字符串进行切分:title、content、url </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">vector</span><span class="o"><</span><span class="n">string</span><span class="o">></span> <span class="n">results</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">const</span> <span class="n">string</span> <span class="n">sep</span> <span class="o">=</span> <span class="s">"</span><span class="se">\3</span><span class="s">"</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">ns_util</span><span class="o">::</span><span class="n">StringUtil</span><span class="o">::</span><span class="n">CutString</span><span class="p">(</span><span 
class="n">line</span><span class="p">,</span> <span class="o">&</span><span class="n">results</span><span class="p">,</span> <span class="n">sep</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span><span class="p">(</span><span class="n">results</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o"><</span> <span class="mi">3</span><span class="p">){</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="k">nullptr</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> <span class="c1">// 2.字符串填充到DocInfo结构中 </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">DocInfo</span> <span class="n">doc</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">doc</span><span class="p">.</span><span class="n">title</span> <span class="o">=</span> <span class="n">results</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span> </span></span><span class="line"><span class="cl"> <span class="n">doc</span><span class="p">.</span><span class="n">content</span> <span class="o">=</span> <span class="n">results</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span> </span></span><span class="line"><span class="cl"> <span class="n">doc</span><span class="p">.</span><span class="n">url</span> <span class="o">=</span> <span class="n">results</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span> </span></span><span class="line"><span class="cl"> <span class="n">doc</span><span class="p">.</span><span class="n">doc_id</span> <span class="o">=</span> <span class="n">forward_index</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> </span></span><span 
class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="c1">// 从URL中提取时间 </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">doc</span><span class="p">.</span><span class="n">time</span> <span class="o">=</span> <span class="n">ExtractTimeFromUrl</span><span class="p">(</span><span class="n">doc</span><span class="p">.</span><span class="n">url</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="c1">// 3.插入到正排索引的vector中 </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">forward_index</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">move</span><span class="p">(</span><span class="n">doc</span><span class="p">));</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="o">&</span><span class="n">forward_index</span><span class="p">.</span><span class="n">back</span><span class="p">();</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span></span></span></code></pre></div></div><h5 id="倒排索引">🔖倒排索引</h5> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"> <span class="c1">// 倒排索引数据节点 </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span 
class="k">struct</span> <span class="nc">InvertedElem</span><span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="kt">uint64_t</span> <span class="n">doc_id</span><span class="p">;</span> <span class="c1">// document ID </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">string</span> <span class="n">word</span><span class="p">;</span> <span class="c1">// keyword </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="kt">int</span> <span class="n">weight</span><span class="p">;</span> <span class="c1">// weight of the keyword in this document </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="p">};</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="c1">// The inverted index is a key-value map: each keyword maps to a posting list holding one or more InvertedElem nodes </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">unordered_map</span><span class="o"><</span><span class="n">string</span><span class="p">,</span> <span class="n">InvertedList</span><span class="o">></span> <span class="n">inverted_index</span><span class="p">;</span></span></span></code></pre></div></div><p>To look up documents by keyword, we need an inverted index: a mapping from each keyword to the documents that contain it. Each entry records [doc_id, word, weight]. The document ID indexes into the forward-index container to fetch the full document, and weight is used to rank the documents. Because a keyword appears with a different frequency in each document, the retrieved documents are returned <strong>sorted by weight from high to low</strong>.</p> <p>Building the inverted index therefore requires enumerating every keyword. Keywords are extracted from each document's <code>title</code> and <code>content</code>, so whenever a forward-index entry is built, the inverted-index entries for all of that document's keywords can be built at the same time. What, then, is the rule for extracting keywords?</p> <p>We can use the <strong><code>cppjieba</code> tokenizer</strong>, whose <strong><code>Jieba::CutForSearch</code></strong> method is designed for segmenting search keywords. Because a keyword counts for more in <code>title</code> than in <code>content</code>, we define two <code>vector</code> containers to hold the tokens from <code>title</code> and from <code>content</code>, and count the occurrences of each keyword separately. Finally the weight is computed with a simple formula (title count × 10 plus content count × 1), which completes the inverted index.</p> <p><strong>The code below builds the inverted index:</strong></p> <div class="code-block code-line-numbers 
open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"> <span class="c1">// Build the inverted-index entries for one document </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="kt">bool</span> <span class="nf">BuildInvertedIndex</span><span class="p">(</span><span class="k">const</span> <span class="n">DocInfo</span> <span class="o">&</span><span class="n">doc</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="k">struct</span> <span class="nc">word_cnt</span><span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="kt">int</span> <span class="n">title_cnt</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="kt">int</span> <span class="n">content_cnt</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="n">word_cnt</span><span class="p">()</span><span class="o">:</span> <span class="n">title_cnt</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">content_cnt</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="p">{};</span> </span></span><span class="line"><span class="cl"> <span class="p">};</span> </span></span><span class="line"><span class="cl"> <span class="n">unordered_map</span><span class="o"><</span><span class="n">string</span><span 
class="p">,</span> <span class="n">word_cnt</span><span class="o">></span> <span class="n">word_map</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="c1">// Tokenize the title </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">vector</span><span class="o"><</span><span class="n">string</span><span class="o">></span> <span class="n">title_words</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">ns_util</span><span class="o">::</span><span class="n">JiebaUtil</span><span class="o">::</span><span class="n">CutString</span><span class="p">(</span><span class="n">doc</span><span class="p">.</span><span class="n">title</span><span class="p">,</span> <span class="o">&</span><span class="n">title_words</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="c1">// Count how often each keyword appears in the title </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="k">for</span><span class="p">(</span><span class="k">auto</span> <span class="o">&</span><span class="nl">it</span> <span class="p">:</span> <span class="n">title_words</span><span class="p">){</span> </span></span><span class="line"><span class="cl"> <span class="n">boost</span><span class="o">::</span><span class="n">to_lower</span><span class="p">(</span><span class="n">it</span><span class="p">);</span> <span class="c1">// lowercase so index keys match the lowercased query terms </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">word_map</span><span class="p">[</span><span class="n">it</span><span class="p">].</span><span class="n">title_cnt</span><span class="o">++</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="c1">// Tokenize the content </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">vector</span><span class="o"><</span><span class="n">string</span><span class="o">></span> <span 
class="n">content_words</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">ns_util</span><span class="o">::</span><span class="n">JiebaUtil</span><span class="o">::</span><span class="n">CutString</span><span class="p">(</span><span class="n">doc</span><span class="p">.</span><span class="n">content</span><span class="p">,</span> <span class="o">&</span><span class="n">content_words</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="c1">// Count how often each keyword appears in the content </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="k">for</span><span class="p">(</span><span class="k">auto</span> <span class="o">&</span><span class="nl">it</span> <span class="p">:</span> <span class="n">content_words</span><span class="p">){</span> </span></span><span class="line"><span class="cl"> <span class="n">boost</span><span class="o">::</span><span class="n">to_lower</span><span class="p">(</span><span class="n">it</span><span class="p">);</span> <span class="c1">// lowercase so index keys match the lowercased query terms </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">word_map</span><span class="p">[</span><span class="n">it</span><span class="p">].</span><span class="n">content_cnt</span><span class="o">++</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="k">constexpr</span> <span class="kt">int</span> <span class="n">X</span> <span class="o">=</span> <span class="mi">10</span><span class="p">;</span> <span class="c1">// weight of one occurrence in the title </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="k">constexpr</span> <span class="kt">int</span> <span class="n">Y</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// weight of one occurrence in the content </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="c1">// Compute each keyword's weight and append it to that keyword's InvertedList </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span 
class="k">for</span><span class="p">(</span><span class="k">auto</span> <span class="o">&</span><span class="nl">it</span> <span class="p">:</span> <span class="n">word_map</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">InvertedElem</span> <span class="n">word_elem</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">word_elem</span><span class="p">.</span><span class="n">doc_id</span> <span class="o">=</span> <span class="n">doc</span><span class="p">.</span><span class="n">doc_id</span><span class="p">;</span> <span class="c1">// document ID </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">word_elem</span><span class="p">.</span><span class="n">word</span> <span class="o">=</span> <span class="n">it</span><span class="p">.</span><span class="n">first</span><span class="p">;</span> <span class="c1">// keyword </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">word_elem</span><span class="p">.</span><span class="n">weight</span> <span class="o">=</span> <span class="n">it</span><span class="p">.</span><span class="n">second</span><span class="p">.</span><span class="n">title_cnt</span> <span class="o">*</span> <span class="n">X</span> <span class="o">+</span> <span class="n">it</span><span class="p">.</span><span class="n">second</span><span class="p">.</span><span class="n">content_cnt</span> <span class="o">*</span> <span class="n">Y</span><span class="p">;</span> <span class="c1">// simple weighting: title hits count X times as much as content hits </span></span></span><span class="line"><span class="cl"><span class="c1"></span> </span></span><span class="line"><span class="cl"> <span class="n">InvertedList</span> <span class="o">&</span><span class="n">inverted_list</span> <span class="o">=</span> <span class="n">inverted_index</span><span class="p">[</span><span 
class="n">it</span><span class="p">.</span><span class="n">first</span><span class="p">];</span> </span></span><span class="line"><span class="cl"> <span class="n">inverted_list</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">move</span><span class="p">(</span><span class="n">word_elem</span><span class="p">));</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="nb">true</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span></span></span></code></pre></div></div><p>In summary, the index is built as follows: read the data store record by record, build a forward-index entry for each document, then extract all keywords from that entry's <code>title</code> and <code>content</code> to build the inverted index.</p> <p>With that in place, we can write the program that runs the query flow.</p> <h4 id="3-编写查询程序">📖3. Writing the query program</h4> <h5 id="对搜索关键字分词">🔖Tokenizing the search query</h5> <p>The user's query cannot be looked up in the index directly; it has to be tokenized first. We segment the query with the <code>jieba</code> tokenizer and then look up each resulting token in the inverted index.</p> <p>There is a catch: the same document can be returned more than once. For example, if a document contains “小明来了北京” (“Xiao Ming came to Beijing”) and the user searches for that same phrase, segmentation yields “小明/来了/北京”, and each of the three tokens may hit the same document. Without deduplication, that document shows up repeatedly in the results.</p> <p>We tokenize <code>query</code> with <code>JiebaUtil::CutString</code> and store the tokens in <code>words</code>. Note that the helper shown here is the delimiter-based <code>StringUtil::CutString</code>, a thin wrapper over <code>boost::split</code>; <code>JiebaUtil::CutString</code> uses the same output-pointer convention but takes no separator argument:</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"> <span class="k">class</span> <span class="nc">StringUtil</span><span class="p">{</span> </span></span><span class="line"><span class="cl"> <span 
class="k">public</span><span class="o">:</span> </span></span><span class="line"><span class="cl"> <span class="k">static</span> <span class="kt">void</span> <span class="n">CutString</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">target</span><span class="p">,</span> <span class="n">vector</span><span class="o"><</span><span class="n">string</span><span class="o">></span> <span class="o">*</span><span class="n">out</span><span class="p">,</span> <span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">sep</span><span class="p">)</span> </span></span><span class="line"><span class="cl"> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">boost</span><span class="o">::</span><span class="n">split</span><span class="p">(</span><span class="o">*</span><span class="n">out</span><span class="p">,</span> <span class="n">target</span><span class="p">,</span> <span class="n">boost</span><span class="o">::</span><span class="n">is_any_of</span><span class="p">(</span><span class="n">sep</span><span class="p">),</span> <span class="n">boost</span><span class="o">::</span><span class="n">token_compress_on</span><span class="p">);</span> <span class="c1">// treat consecutive separators as one </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="p">}</span> </span></span><span class="line"><span class="cl"> <span class="p">};</span></span></span></code></pre></div></div><h5 id="对检索结果进行去重">🔖Deduplicating the results</h5> <p>To solve the duplicate-document problem, we use a hash table, <code>inverted_map</code>, keyed by document ID. When the same document is hit again, its weights are added together. Because these hits come from different keywords, the keywords have to be kept as well. For this we define an <code>InvertedElemPrint</code> struct that stores the document ID, the list of matching keywords, and the accumulated weight.</p> <p>We iterate over every token, fetch the matching documents from the inverted index, and merge them into <code>inverted_map</code>:</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span 
class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="n">unordered_map</span><span class="o"><</span><span class="kt">uint64_t</span><span class="p">,</span> <span class="n">InvertedElemPrint</span><span class="o">></span> <span class="n">inverted_map</span><span class="p">;</span> </span></span><span class="line"><span class="cl"><span class="k">for</span><span class="p">(</span><span class="n">string</span> <span class="nl">word</span> <span class="p">:</span> <span class="n">words</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">boost</span><span class="o">::</span><span class="n">to_lower</span><span class="p">(</span><span class="n">word</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="n">ns_index</span><span class="o">::</span><span class="n">InvertedList</span> <span class="o">*</span><span class="n">word_list</span> <span class="o">=</span> <span class="n">index</span><span class="o">-></span><span class="n">GetInvertedIndex</span><span class="p">(</span><span class="n">word</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span><span class="p">(</span><span class="k">nullptr</span> <span class="o">==</span> <span class="n">word_list</span><span class="p">)</span> <span class="k">continue</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">for</span><span class="p">(</span><span class="k">const</span> <span class="k">auto</span> <span class="o">&</span><span 
class="nl">elem</span><span class="p">:</span> <span class="o">*</span><span class="n">word_list</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="k">auto</span> <span class="o">&</span><span class="n">item</span> <span class="o">=</span> <span class="n">inverted_map</span><span class="p">[</span><span class="n">elem</span><span class="p">.</span><span class="n">doc_id</span><span class="p">];</span> </span></span><span class="line"><span class="cl"> <span class="n">item</span><span class="p">.</span><span class="n">doc_id</span> <span class="o">=</span> <span class="n">elem</span><span class="p">.</span><span class="n">doc_id</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">item</span><span class="p">.</span><span class="n">words</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">elem</span><span class="p">.</span><span class="n">word</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="n">item</span><span class="p">.</span><span class="n">weight</span> <span class="o">+=</span> <span class="n">elem</span><span class="p">.</span><span class="n">weight</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><h5 id="对检索结果排序">🔖Sorting the results</h5> <p>Once the deduplicated set of hits is ready, it has to be sorted by <code>weight</code> in descending order. We use <code>std::sort</code> for this, so that the most relevant documents come first.</p> <p>We move the entries of <code>inverted_map</code> into <code>gather</code> <strong>and sort them by weight</strong>:</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas 
fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="n">vector</span><span class="o"><</span><span class="n">InvertedElemPrint</span><span class="o">></span> <span class="n">gather</span><span class="p">;</span> </span></span><span class="line"><span class="cl"><span class="k">for</span><span class="p">(</span><span class="k">auto</span> <span class="o">&</span><span class="nl">item</span> <span class="p">:</span> <span class="n">inverted_map</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">gather</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">move</span><span class="p">(</span><span class="n">item</span><span class="p">.</span><span class="n">second</span><span class="p">));</span> <span class="c1">// item must be non-const, otherwise move() silently copies </span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="p">}</span> </span></span><span class="line"><span class="cl"><span class="n">sort</span><span class="p">(</span><span class="n">gather</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span> <span class="n">gather</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span> <span class="p">[](</span><span class="k">const</span> <span class="n">InvertedElemPrint</span> <span class="o">&</span><span class="n">e1</span><span class="p">,</span> <span class="k">const</span> <span class="n">InvertedElemPrint</span> <span class="o">&</span><span class="n">e2</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="n">e1</span><span class="p">.</span><span class="n">weight</span> <span 
class="o">></span> <span class="n">e2</span><span class="p">.</span><span class="n">weight</span><span class="p">;</span> </span></span><span class="line"><span class="cl"><span class="p">});</span></span></span></code></pre></div></div><h5 id="序列化与摘要生成">🔖Serialization and digest generation</h5> <p>The collected results cannot be sent over the network as in-memory structures; they have to be serialized on the server and deserialized on the client. We use <code>Json</code> as a general-purpose serialization tool and return a <code>json</code> string to the user. One more issue remains: the index stores each document's full content, and returning all of it to the search page would clearly be unfriendly. So we <strong>generate a digest</strong> of the content, which lets the user quickly judge what a document is about and decide whether to open it.</p> <p>The digest should contain the keyword the user searched for. The <code>GetDigest</code> method finds the first occurrence of the keyword in <code>content</code>, then takes up to 50 characters before it and up to 100 characters after its starting position. The window is measured in UTF-8 characters rather than bytes, so a multi-byte Chinese character is never cut in half.</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="n">string</span> <span class="nf">GetDigest</span><span class="p">(</span><span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">content</span><span class="p">,</span> <span class="k">const</span> <span class="n">string</span> <span class="o">&</span><span class="n">key</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="k">const</span> <span class="kt">int</span> <span class="n">prev_chars</span> <span class="o">=</span> <span class="mi">50</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">const</span> <span class="kt">int</span> <span class="n">next_chars</span> <span class="o">=</span> <span class="mi">100</span><span class="p">;</span> 
</span></span><span class="line"><span class="cl"> <span class="k">auto</span> <span class="n">itea</span> <span class="o">=</span> <span class="n">search</span><span class="p">(</span><span class="n">content</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span> <span class="n">content</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span> <span class="n">key</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span> <span class="n">key</span><span class="p">.</span><span class="n">end</span><span class="p">(),</span> <span class="p">[](</span><span class="kt">char</span> <span class="n">a</span><span class="p">,</span> <span class="kt">char</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="c1">// Cast to unsigned char: passing a negative char (any UTF-8 byte >= 0x80) to tolower is undefined behavior </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="k">return</span> <span class="p">(</span><span class="n">tolower</span><span class="p">((</span><span class="kt">unsigned</span> <span class="kt">char</span><span class="p">)</span><span class="n">a</span><span class="p">)</span> <span class="o">==</span> <span class="n">tolower</span><span class="p">((</span><span class="kt">unsigned</span> <span class="kt">char</span><span class="p">)</span><span class="n">b</span><span class="p">));</span> </span></span><span class="line"><span class="cl"> <span class="p">});</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span><span class="p">(</span><span class="n">content</span><span class="p">.</span><span class="n">end</span><span class="p">()</span> <span class="o">==</span> <span class="n">itea</span><span class="p">)</span> <span class="k">return</span> <span class="s">"未找到关键词"</span><span class="p">;</span> <span class="c1">// user-facing message: "keyword not found" </span></span></span><span class="line"><span class="cl"><span class="c1"></span> </span></span><span class="line"><span class="cl"> <span class="c1">// Record the byte offset of every UTF-8 character, then build the digest </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="n">string</span> <span class="n">utf8_content</span> <span class="o">=</span> <span class="n">content</span><span class="p">;</span> </span></span><span 
class="line"><span class="cl"> <span class="n">vector</span><span class="o"><</span><span class="n">size_t</span><span class="o">></span> <span class="n">char_positions</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">size_t</span> <span class="n">byte_pos</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">while</span> <span class="p">(</span><span class="n">byte_pos</span> <span class="o"><</span> <span class="n">utf8_content</span><span class="p">.</span><span class="n">size</span><span class="p">())</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">char_positions</span><span class="p">.</span><span class="n">push_back</span><span class="p">(</span><span class="n">byte_pos</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">c</span> <span class="o">=</span> <span class="n">utf8_content</span><span class="p">[</span><span class="n">byte_pos</span><span class="p">];</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="p">(</span><span class="n">c</span> <span class="o"><</span> <span class="mh">0x80</span><span class="p">)</span> <span class="n">byte_pos</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">c</span> <span class="o"><</span> <span class="mh">0xE0</span><span class="p">)</span> <span class="n">byte_pos</span> <span class="o">+=</span> <span class="mi">2</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">c</span> 
<span class="o"><</span> <span class="mh">0xF0</span><span class="p">)</span> <span class="n">byte_pos</span> <span class="o">+=</span> <span class="mi">3</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">else</span> <span class="n">byte_pos</span> <span class="o">+=</span> <span class="mi">4</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="kt">int</span> <span class="n">char_pos</span> <span class="o">=</span> <span class="n">distance</span><span class="p">(</span><span class="n">content</span><span class="p">.</span><span class="n">begin</span><span class="p">(),</span> <span class="n">itea</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="kt">int</span> <span class="n">char_index</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="p">(</span><span class="n">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">char_positions</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="p">(</span><span class="n">char_positions</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">>=</span> <span class="p">(</span><span class="n">size_t</span><span class="p">)</span><span class="n">char_pos</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span 
class="n">char_index</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">break</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="kt">int</span> <span class="n">start_char</span> <span class="o">=</span> <span class="n">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">char_index</span> <span class="o">-</span> <span class="n">prev_chars</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="kt">int</span> <span class="n">end_char</span> <span class="o">=</span> <span class="n">min</span><span class="p">((</span><span class="kt">int</span><span class="p">)</span><span class="n">char_positions</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="n">char_index</span> <span class="o">+</span> <span class="n">next_chars</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="n">size_t</span> <span class="n">start_byte</span> <span class="o">=</span> <span class="n">char_positions</span><span class="p">[</span><span class="n">start_char</span><span class="p">];</span> </span></span><span class="line"><span class="cl"> <span class="n">size_t</span> <span class="n">end_byte</span> <span class="o">=</span> <span class="p">(</span><span class="n">end_char</span> <span class="o">+</span> <span class="mi">1</span> <span class="o"><</span> <span class="n">char_positions</span><span class="p">.</span><span class="n">size</span><span class="p">())</span> <span class="o">?</span> <span 
class="n">char_positions</span><span class="p">[</span><span class="n">end_char</span> <span class="o">+</span> <span class="mi">1</span><span class="p">]</span> <span class="o">:</span> <span class="n">utf8_content</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="n">string</span> <span class="n">digest</span> <span class="o">=</span> <span class="n">utf8_content</span><span class="p">.</span><span class="n">substr</span><span class="p">(</span><span class="n">start_byte</span><span class="p">,</span> <span class="n">end_byte</span> <span class="o">-</span> <span class="n">start_byte</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="kt">bool</span> <span class="n">has_more_at_start</span> <span class="o">=</span> <span class="p">(</span><span class="n">start_char</span> <span class="o">></span> <span class="mi">0</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="kt">bool</span> <span class="n">has_more_at_end</span> <span class="o">=</span> <span class="p">(</span><span class="n">end_char</span> <span class="o"><</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">char_positions</span><span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="p">(</span><span class="n">has_more_at_start</span><span class="p">)</span> <span class="n">digest</span> <span class="o">=</span> <span class="s">"..."</span> <span class="o">+</span> <span class="n">digest</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="p">(</span><span 
class="n">has_more_at_end</span><span class="p">)</span> <span class="n">digest</span> <span class="o">=</span> <span class="n">digest</span> <span class="o">+</span> <span class="s">"..."</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"> <span class="k">return</span> <span class="n">digest</span><span class="p">;</span> </span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><h5 id="构建json结果">🔖Building the JSON result</h5> <p>Finally, we turn the sorted results into a <code>json</code> string and return it to the user.</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">BuildJsonResult</span><span class="p">(</span><span class="k">const</span> <span class="n">vector</span><span class="o"><</span><span class="n">InvertedElemPrint</span><span class="o">></span> <span class="o">&</span><span class="n">gather</span><span class="p">,</span> <span class="n">string</span> <span class="o">*</span><span class="n">json_string</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">Json</span><span class="o">::</span><span class="n">Value</span> <span class="n">root</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="k">for</span><span class="p">(</span><span class="k">auto</span> <span class="o">&</span><span class="nl">item</span> <span class="p">:</span> 
<span class="n">gather</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">ns_index</span><span class="o">::</span><span class="n">DocInfo</span> <span class="o">*</span><span class="n">doc</span> <span class="o">=</span> <span class="n">index</span><span class="o">-></span><span class="n">GetForwardIndex</span><span class="p">(</span><span class="n">item</span><span class="p">.</span><span class="n">doc_id</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span><span class="p">(</span><span class="k">nullptr</span> <span class="o">==</span> <span class="n">doc</span><span class="p">)</span> <span class="k">continue</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">Json</span><span class="o">::</span><span class="n">Value</span> <span class="n">elem</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">elem</span><span class="p">[</span><span class="s">"title"</span><span class="p">]</span> <span class="o">=</span> <span class="n">doc</span><span class="o">-></span><span class="n">title</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">elem</span><span class="p">[</span><span class="s">"digest"</span><span class="p">]</span> <span class="o">=</span> <span class="n">GetDigest</span><span class="p">(</span><span class="n">doc</span><span class="o">-></span><span class="n">content</span><span class="p">,</span> <span class="n">item</span><span class="p">.</span><span class="n">words</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span> </span></span><span class="line"><span class="cl"> <span class="n">elem</span><span class="p">[</span><span class="s">"url"</span><span class="p">]</span> <span class="o">=</span> <span class="n">doc</span><span class="o">-></span><span 
class="n">url</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">elem</span><span class="p">[</span><span class="s">"id"</span><span class="p">]</span> <span class="o">=</span> <span class="n">doc</span><span class="o">-></span><span class="n">doc_id</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">elem</span><span class="p">[</span><span class="s">"weight"</span><span class="p">]</span> <span class="o">=</span> <span class="n">item</span><span class="p">.</span><span class="n">weight</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">elem</span><span class="p">[</span><span class="s">"time"</span><span class="p">]</span> <span class="o">=</span> <span class="n">doc</span><span class="o">-></span><span class="n">time</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">root</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">elem</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"> <span class="n">Json</span><span class="o">::</span><span class="n">StyledWriter</span> <span class="n">writer</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="o">*</span><span class="n">json_string</span> <span class="o">=</span> <span class="n">writer</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">root</span><span class="p">);</span> </span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><h4 id="4编写服务器主程序">📖4. Writing the server main program</h4> <p>We use the <code>httplib</code> library to implement a simple HTTP server that handles users' search requests and returns the results. <code>httplib</code> is a lightweight C++ HTTP library that is easy to integrate and use.</p> <h5 id="初始化索引">🔖Initializing the index</h5> <p>At program startup, we first need to initialize the search engine's index. Calling the <code>InitIndex</code> method of the 
<code>ns_sercher::Sercher</code> class loads the data from the specified data file <code>data/raw_data/raw.txt</code> and builds the forward index and the inverted index.</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="n">ns_sercher</span><span class="o">::</span><span class="n">Sercher</span> <span class="n">serch</span><span class="p">;</span> </span></span><span class="line"><span class="cl"><span class="n">serch</span><span class="p">.</span><span class="n">InitIndex</span><span class="p">(</span><span class="n">input</span><span class="p">);</span></span></span></code></pre></div></div><h5 id="设置http服务器">🔖Setting up the HTTP server</h5> <p>We use the <strong><code>httplib</code> library</strong> to create an HTTP server and set its root directory to <code>./wwwroot</code>. This directory holds the static resource files (HTML, CSS, JavaScript, and so on) served to clients.</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="n">httplib</span><span class="o">::</span><span class="n">Server</span> <span class="n">svr</span><span class="p">;</span> </span></span><span class="line"><span class="cl"><span 
class="n">svr</span><span class="p">.</span><span class="n">set_base_dir</span><span class="p">(</span><span class="n">root_path</span><span class="p">.</span><span class="n">c_str</span><span class="p">());</span></span></span></code></pre></div></div><h5 id="处理搜索请求">🔖Handling search requests</h5> <p>We register a <code>/search</code> route on the server to handle users' search requests. The route receives the user's keyword via the <code>GET</code> method and runs different search logic depending on the request parameters.</p> <p>The value of the <code>time_priority</code> request parameter decides whether results are sorted by time or by weight:</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="k">if</span> <span class="p">(</span><span class="n">time_priority</span><span class="p">){</span> </span></span><span class="line"><span class="cl"> <span class="n">cout</span> <span class="o"><<</span> <span class="s">"按时间排序"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">serch</span><span class="p">.</span><span class="n">TimePrioritySerch</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="o">&</span><span class="n">json_string</span><span class="p">);</span> </span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="k">else</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">cout</span> <span class="o"><<</span> <span class="s">"按权重排序"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span> </span></span><span 
class="line"><span class="cl"> <span class="n">serch</span><span class="p">.</span><span class="n">CommonSerch</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="o">&</span><span class="n">json_string</span><span class="p">);</span> </span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><p>If the search keyword is empty, all documents are returned sorted by time, which makes it easy to browse the latest announcements:</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="c1">// Check whether the input is empty or contains only spaces </span></span></span><span class="line"><span class="cl"><span class="c1"></span><span class="k">if</span> <span class="p">(</span><span class="n">word</span><span class="p">.</span><span class="n">empty</span><span class="p">()</span> <span class="o">||</span> <span class="n">word</span><span class="p">.</span><span class="n">find_first_not_of</span><span class="p">(</span><span class="sc">' '</span><span class="p">)</span> <span class="o">==</span> <span class="n">string</span><span class="o">::</span><span class="n">npos</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">cout</span> <span class="o"><<</span> <span class="s">"返回所有文档信息"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="n">serch</span><span class="p">.</span><span class="n">GetAllDocuments</span><span class="p">(</span><span 
class="o">&</span><span class="n">json_string</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="n">resp</span><span class="p">.</span><span class="n">set_content</span><span class="p">(</span><span class="n">json_string</span><span class="p">,</span> <span class="s">"application/json; charset=utf-8"</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="k">return</span><span class="p">;</span> </span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><p>If the keyword matches nothing, the server returns an empty result list together with the promotional links:</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-cpp" data-lang="cpp"><span class="line"><span class="cl"><span class="k">if</span> <span class="p">(</span><span class="n">json_string</span><span class="p">.</span><span class="n">empty</span><span class="p">())</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="n">json_string</span> <span class="o">=</span> <span class="n">R</span><span class="s">"({"</span><span class="n">results</span><span class="s">": [], "</span><span class="n">ads</span><span class="s">": [</span> </span></span><span class="line"><span class="cl"> <span class="p">{</span><span class="s">"text"</span><span class="o">:</span> <span class="s">"进入校园官网:"</span><span class="p">,</span> <span class="s">"url"</span><span class="o">:</span> <span class="s">"https://jwc.hznu.edu.cn/"</span><span class="p">,</span> <span class="s">"linkText"</span><span 
class="o">:</span> <span class="s">"杭州师范大学教务处"</span><span class="p">},</span> </span></span><span class="line"><span class="cl"> <span class="p">{</span><span class="s">"text"</span><span class="o">:</span> <span class="s">"分享学习笔记,记录生活点滴,欢迎访问我的博客:"</span><span class="p">,</span> <span class="s">"url"</span><span class="o">:</span> <span class="s">"https://kanhai-night.blog.csdn.net"</span><span class="p">,</span> <span class="s">"linkText"</span><span class="o">:</span> <span class="s">"Kanhai's 技术博客"</span><span class="p">},</span> </span></span><span class="line"><span class="cl"> <span class="p">{</span><span class="s">"text"</span><span class="o">:</span> <span class="s">"本项目已开源:"</span><span class="p">,</span> <span class="s">"url"</span><span class="o">:</span> <span class="s">"https://gitee.com/HZNUYuwen/Linux_gitee/tree/master/HZNUSercher"</span><span class="p">,</span> <span class="s">"linkText"</span><span class="o">:</span> <span class="s">"查看项目源码"</span><span class="p">}</span> </span></span><span class="line"><span class="cl"> <span class="p">]})</span><span class="s">";</span> </span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><h3 id="四前端实现">📚IV. Frontend Implementation</h3> <p>The frontend page implements the core interaction logic of the search feature: keyword input, the search request, result display, and pagination. The main pieces are described below:</p> <h4 id="1页面结构">📖1. Page structure</h4> <p>The page consists of the following parts:</p> <p><strong>① Search box</strong>: the user enters a keyword and chooses whether to sort by time.</p> <p><strong>② Search results area</strong>: dynamically displays the title, digest, and link of each result.</p> <p><strong>③ Pagination controls</strong>: buttons for moving to the previous and next page.</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code 
class="language-html" data-lang="html"><span class="line"><span class="cl"><span class="p"><</span><span class="nt">div</span> <span class="na">class</span><span class="o">=</span><span class="s">"container initial-state"</span><span class="p">></span> </span></span><span class="line"><span class="cl"> <span class="p"><</span><span class="nt">div</span> <span class="na">class</span><span class="o">=</span><span class="s">"search"</span><span class="p">></span> </span></span><span class="line"><span class="cl"> <span class="p"><</span><span class="nt">input</span> <span class="na">type</span><span class="o">=</span><span class="s">"text"</span> <span class="na">value</span><span class="o">=</span><span class="s">"请输入搜索关键字"</span><span class="p">></span> </span></span><span class="line"><span class="cl"> <span class="p"><</span><span class="nt">div</span> <span class="na">class</span><span class="o">=</span><span class="s">"search-options"</span><span class="p">></span> </span></span><span class="line"><span class="cl"> <span class="p"><</span><span class="nt">label</span><span class="p">></span> </span></span><span class="line"><span class="cl"> <span class="p"><</span><span class="nt">input</span> <span class="na">type</span><span class="o">=</span><span class="s">"checkbox"</span> <span class="na">id</span><span class="o">=</span><span class="s">"time-priority"</span><span class="p">></span> 按时间先后 </span></span><span class="line"><span class="cl"> <span class="p"></</span><span class="nt">label</span><span class="p">></span> </span></span><span class="line"><span class="cl"> <span class="p"></</span><span class="nt">div</span><span class="p">></span> </span></span><span class="line"><span class="cl"> <span class="p"><</span><span class="nt">button</span> <span class="na">onclick</span><span class="o">=</span><span class="s">"Search()"</span><span class="p">></span>搜索一下<span class="p"></</span><span class="nt">button</span><span class="p">></span> 
</span></span><span class="line"><span class="cl"> <span class="p"></</span><span class="nt">div</span><span class="p">></span> </span></span><span class="line"><span class="cl"> <span class="p"><</span><span class="nt">div</span> <span class="na">class</span><span class="o">=</span><span class="s">"result hidden"</span><span class="p">></span> </span></span><span class="line"><span class="cl"> <span class="c"><!-- Page content is generated dynamically --></span> </span></span><span class="line"><span class="cl"> <span class="p"></</span><span class="nt">div</span><span class="p">></span> </span></span><span class="line"><span class="cl"> <span class="p"><</span><span class="nt">div</span> <span class="na">class</span><span class="o">=</span><span class="s">"pagination hidden"</span><span class="p">></span> </span></span><span class="line"><span class="cl"> <span class="p"><</span><span class="nt">button</span> <span class="na">onclick</span><span class="o">=</span><span class="s">"prevPage()"</span><span class="p">></span>上一页<span class="p"></</span><span class="nt">button</span><span class="p">></span> </span></span><span class="line"><span class="cl"> <span class="p"><</span><span class="nt">span</span> <span class="na">id</span><span class="o">=</span><span class="s">"page-info"</span><span class="p">></</span><span class="nt">span</span><span class="p">></span> </span></span><span class="line"><span class="cl"> <span class="p"><</span><span class="nt">button</span> <span class="na">onclick</span><span class="o">=</span><span class="s">"nextPage()"</span><span class="p">></span>下一页<span class="p"></</span><span class="nt">button</span><span class="p">></span> </span></span><span class="line"><span class="cl"> <span class="p"></</span><span class="nt">div</span><span class="p">></span> </span></span><span class="line"><span class="cl"><span class="p"></</span><span class="nt">div</span><span class="p">></span></span></span></code></pre></div></div><h4 id="2搜索功能">📖2. Search function</h4> <p>The 
<code>Search</code> function issues the search request, sends the user's keyword to the backend, and dynamically updates the search results:</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="cl"><span class="kd">function</span> <span class="nx">Search</span><span class="p">()</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="nx">currentQuery</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="s2">".container .search input"</span><span class="p">).</span><span class="nx">val</span><span class="p">().</span><span class="nx">trim</span><span class="p">();</span> <span class="c1">// Get the keyword </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="kd">let</span> <span class="nx">timePriority</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="s2">"#time-priority"</span><span class="p">).</span><span class="nx">is</span><span class="p">(</span><span class="s2">":checked"</span><span class="p">);</span> <span class="c1">// Whether to sort by time </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="nx">$</span><span class="p">.</span><span class="nx">ajax</span><span class="p">({</span> </span></span><span class="line"><span class="cl"> <span class="nx">type</span><span class="o">:</span> <span class="s2">"GET"</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nx">url</span><span class="o">:</span> <span 
class="s2">"/search?word="</span> <span class="o">+</span> <span class="nx">encodeURIComponent</span><span class="p">(</span><span class="nx">currentQuery</span><span class="p">)</span> <span class="o">+</span> <span class="s2">"&time_priority="</span> <span class="o">+</span> <span class="nx">timePriority</span><span class="p">,</span> </span></span><span class="line"><span class="cl"> <span class="nx">success</span><span class="o">:</span> <span class="kd">function</span><span class="p">(</span><span class="nx">data</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="nx">searchResults</span> <span class="o">=</span> <span class="nx">data</span><span class="p">;</span> <span class="c1">// Save the search results </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="nx">currentPage</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// Reset the page number </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="nx">BuildHtml</span><span class="p">();</span> <span class="c1">// Render the results </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="nx">setResultState</span><span class="p">();</span> <span class="c1">// Switch to the result state </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="p">}</span> </span></span><span class="line"><span class="cl"> <span class="p">});</span> </span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><h4 id="3结果展示">📖3. Displaying results</h4> <p>The <code>BuildHtml</code> function generates the search results dynamically and supports keyword highlighting:</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy 
fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="cl"><span class="kd">function</span> <span class="nx">BuildHtml</span><span class="p">()</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="kd">let</span> <span class="nx">result_lable</span> <span class="o">=</span> <span class="nx">$</span><span class="p">(</span><span class="s2">".container .result"</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="nx">result_lable</span><span class="p">.</span><span class="nx">empty</span><span class="p">();</span> <span class="c1">// Clear the previous results </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="kd">let</span> <span class="nx">start</span> <span class="o">=</span> <span class="nx">currentPage</span> <span class="o">*</span> <span class="nx">resultsPerPage</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="kd">let</span> <span class="nx">end</span> <span class="o">=</span> <span class="nx">start</span> <span class="o">+</span> <span class="nx">resultsPerPage</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="kd">let</span> <span class="nx">pageResults</span> <span class="o">=</span> <span class="nx">searchResults</span><span class="p">.</span><span class="nx">slice</span><span class="p">(</span><span class="nx">start</span><span class="p">,</span> <span class="nx">end</span><span class="p">);</span> <span class="c1">// Get the results for the current page </span></span></span><span class="line"><span class="cl"><span class="c1"></span> </span></span><span class="line"><span class="cl"> <span class="k">for</span> <span class="p">(</span><span class="kd">let</span> <span class="nx">elem</span> <span class="k">of</span> <span 
class="nx">pageResults</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="kd">let</span> <span class="nx">highlightedTitle</span> <span class="o">=</span> <span class="nx">highlightKeyword</span><span class="p">(</span><span class="nx">elem</span><span class="p">.</span><span class="nx">title</span><span class="p">,</span> <span class="nx">currentQuery</span><span class="p">);</span> <span class="c1">// Highlight the title </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="kd">let</span> <span class="nx">highlightedDigest</span> <span class="o">=</span> <span class="nx">highlightKeyword</span><span class="p">(</span><span class="nx">elem</span><span class="p">.</span><span class="nx">digest</span><span class="p">,</span> <span class="nx">currentQuery</span><span class="p">);</span> <span class="c1">// Highlight the digest </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="nx">result_lable</span><span class="p">.</span><span class="nx">append</span><span class="p">(</span><span class="sb">` </span></span></span><span class="line"><span class="cl"><span class="sb"> <div class="item"> </span></span></span><span class="line"><span class="cl"><span class="sb"> <a href="</span><span class="si">${</span><span class="nx">elem</span><span class="p">.</span><span class="nx">url</span><span class="si">}</span><span class="sb">" target="_blank"></span><span class="si">${</span><span class="nx">highlightedTitle</span><span class="si">}</span><span class="sb"></a> </span></span></span><span class="line"><span class="cl"><span class="sb"> <p></span><span class="si">${</span><span class="nx">highlightedDigest</span><span class="si">}</span><span class="sb"></p> </span></span></span><span class="line"><span class="cl"><span class="sb"> <i></span><span class="si">${</span><span class="nx">elem</span><span class="p">.</span><span class="nx">url</span><span 
class="si">}</span><span class="sb"></i> </span></span></span><span class="line"><span class="cl"><span class="sb"> <span style="display: block; color: #888; font-size: 12px; margin-top: 5px;"> </span></span></span><span class="line"><span class="cl"><span class="sb"> </span><span class="si">${</span><span class="nx">elem</span><span class="p">.</span><span class="nx">time</span> <span class="o">?</span> <span class="s2">"发布时间: "</span> <span class="o">+</span> <span class="nx">elem</span><span class="p">.</span><span class="nx">time</span> <span class="o">:</span> <span class="s2">""</span><span class="si">}</span><span class="sb"> </span></span></span><span class="line"><span class="cl"><span class="sb"> </span> </span></span></span><span class="line"><span class="cl"><span class="sb"> </div> </span></span></span><span class="line"><span class="cl"><span class="sb"> `</span><span class="p">);</span> </span></span><span class="line"><span class="cl"> <span class="p">}</span> </span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><h4 id="4分页功能">📖4. Pagination</h4> <p>The <code>prevPage</code> and <code>nextPage</code> functions implement paging through the results:</p> <div class="code-block code-line-numbers open" style="counter-reset: code-block 0"> <div class="code-header language-bash"> <span class="code-title"><i class="arrow fas fa-chevron-right fa-fw" aria-hidden="true"></i></span> <span class="ellipses"><i class="fas fa-ellipsis-h fa-fw" aria-hidden="true"></i></span> <span class="copy" title="复制到剪贴板"><i class="far fa-copy fa-fw" aria-hidden="true"></i></span> </div><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-javascript" data-lang="javascript"><span class="line"><span class="cl"><span class="kd">function</span> <span class="nx">prevPage</span><span class="p">()</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="p">(</span><span 
class="nx">currentPage</span> <span class="o">></span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="nx">currentPage</span><span class="o">--</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="nx">BuildHtml</span><span class="p">();</span> <span class="c1">// Refresh the results </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="p">}</span> </span></span><span class="line"><span class="cl"><span class="p">}</span> </span></span><span class="line"><span class="cl"> </span></span><span class="line"><span class="cl"><span class="kd">function</span> <span class="nx">nextPage</span><span class="p">()</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="k">if</span> <span class="p">((</span><span class="nx">currentPage</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="nx">resultsPerPage</span> <span class="o"><</span> <span class="nx">searchResults</span><span class="p">.</span><span class="nx">length</span><span class="p">)</span> <span class="p">{</span> </span></span><span class="line"><span class="cl"> <span class="nx">currentPage</span><span class="o">++</span><span class="p">;</span> </span></span><span class="line"><span class="cl"> <span class="nx">BuildHtml</span><span class="p">();</span> <span class="c1">// Refresh the results </span></span></span><span class="line"><span class="cl"><span class="c1"></span> <span class="p">}</span> </span></span><span class="line"><span class="cl"><span class="p">}</span></span></span></code></pre></div></div><h3 id="五完整演示">📚V. Full Demo</h3> <p>Below is a full demonstration of the features of the <strong>Campus Announcement Search Engine</strong> project:</p> <h4 id="1初始界面">📖1. Initial page</h4> <p><img class="lazyload" src="/TechBlog/svg/loading.min.svg" data-src="https://i-blog.csdnimg.cn/direct/455f1718fddd4d9abd9b4c0250475d60.png" 
data-srcset="https://i-blog.csdnimg.cn/direct/455f1718fddd4d9abd9b4c0250475d60.png, https://i-blog.csdnimg.cn/direct/455f1718fddd4d9abd9b4c0250475d60.png 1.5x, https://i-blog.csdnimg.cn/direct/455f1718fddd4d9abd9b4c0250475d60.png 2x" data-sizes="auto" alt="https://i-blog.csdnimg.cn/direct/455f1718fddd4d9abd9b4c0250475d60.png" title="https://i-blog.csdnimg.cn/direct/455f1718fddd4d9abd9b4c0250475d60.png" /></p> <h4 id="2无关键字输入">📖2. No keyword entered</h4> <p>When no keyword is entered, the user receives all documents, sorted by time.</p> <p><img class="lazyload" src="/TechBlog/svg/loading.min.svg" data-src="https://i-blog.csdnimg.cn/direct/6257a71f4111442abec60358db4b57e6.png" data-srcset="https://i-blog.csdnimg.cn/direct/6257a71f4111442abec60358db4b57e6.png, https://i-blog.csdnimg.cn/direct/6257a71f4111442abec60358db4b57e6.png 1.5x, https://i-blog.csdnimg.cn/direct/6257a71f4111442abec60358db4b57e6.png 2x" data-sizes="auto" alt="https://i-blog.csdnimg.cn/direct/6257a71f4111442abec60358db4b57e6.png" title="https://i-blog.csdnimg.cn/direct/6257a71f4111442abec60358db4b57e6.png" /></p> <h4 id="3输入关键字无返回">📖3. Keyword entered (no results)</h4> <h4 id="4输入关键字有返回">📖4. Keyword entered (with results)</h4> <h5 id="默认排序">🔖Default ordering</h5> <h5 id="按时间排序">🔖Sorted by time</h5> <p><img class="lazyload" src="/TechBlog/svg/loading.min.svg" data-src="https://i-blog.csdnimg.cn/direct/4c7978e22e154028b0b6b9e536dc2681.png" data-srcset="https://i-blog.csdnimg.cn/direct/4c7978e22e154028b0b6b9e536dc2681.png, https://i-blog.csdnimg.cn/direct/4c7978e22e154028b0b6b9e536dc2681.png 1.5x, https://i-blog.csdnimg.cn/direct/4c7978e22e154028b0b6b9e536dc2681.png 2x" data-sizes="auto" alt="https://i-blog.csdnimg.cn/direct/4c7978e22e154028b0b6b9e536dc2681.png" title="https://i-blog.csdnimg.cn/direct/4c7978e22e154028b0b6b9e536dc2681.png" /></p> <h3 id="六总结">📚VI. Summary</h3> <p>I will not go over the project itself again here. Below are its shortcomings and some directions for improvement:</p> <h4 id="1优化方向">📖1. Directions for improvement</h4> <p><strong>① Online updates:</strong> The project does not yet support online updates. The announcement data was collected up to March 14, 2025, so newer announcements on the official site are not synced into the search engine. A scheduled job or a real-time crawler could be introduced to keep the data up to date.</p> <p><strong>② Hot keywords:</strong> 
Showing popular search keywords while the user searches would further improve the experience.</p> <p><strong>③ Login system:</strong> Since the search engine mainly serves the university's own staff and students, login authentication could be added.</p> <p><strong>④ Response time:</strong> The server's response time still has room for improvement, for example by optimizing the index structure or introducing a cache.</p> <p><strong>⑤ Wider search scope:</strong> Beyond the Academic Affairs Office site, data from other university sites (colleges, the library, the admissions office, and so on) could be added as search targets.</p> <p><strong>⑥ Domain name:</strong> The server is currently reached through an IP address and port number, which is neither intuitive nor easy to remember; binding a domain name would address this.</p> <h4 id="2源码及网址">📖2. Source code and URL</h4> <p>The project source code and the site URL are given here:</p> <p> </p> <p> </p> <hr> <p>That is everything about the Campus Announcement Search Engine. Corrections are welcome.</p> <p><strong>Writing takes effort, so a follow is much appreciated; it is my biggest motivation to keep creating!</strong></p> </div><div class="post-footer" id="post-footer"> <div class="post-info"> <div class="post-info-line"> <div class="post-info-mod"> <span>Updated on 2025-03-14</span> </div></div> </div> <div class="post-info-more"> <section class="post-tags"><i class="fas fa-tags fa-fw" aria-hidden="true"></i> <a href="/TechBlog/tags/%E7%BD%91%E7%BB%9C/">Networking</a>, <a href="/TechBlog/tags/%E6%9C%8D%E5%8A%A1%E5%99%A8/">Servers</a>, <a href="/TechBlog/tags/%E6%90%9C%E7%B4%A2%E5%BC%95%E6%93%8E/">Search engines</a>, <a href="/TechBlog/tags/c/">C</a></section> </div> </div> <div id="comments"><div id="giscus" class="comment"></div><noscript> Please enable JavaScript to view the comments powered by <a href="https://giscus.app">Giscus</a>. 
</noscript></div></article></div> </main><footer class="footer"> <div class="footer-container"><div class="footer-line" itemscope itemtype="http://schema.org/CreativeWork"><i class="far fa-copyright fa-fw" aria-hidden="true"></i><span itemprop="copyrightYear">2018 - 2025</span><span class="author" itemprop="copyrightHolder"> <a href="/TechBlog/" target="_blank">JAY.LIN</a></span> | <span class="license"><a rel="license external nofollow noopener noreffer" href="https://creativecommons.org/licenses/by-nc/4.0/" target="_blank">CC BY-NC 4.0</a></span></div> </div> </footer></div> <div id="fixed-buttons"><a href="#" id="back-to-top" class="fixed-button" title="回到顶部"> <i class="fas fa-arrow-up fa-fw" aria-hidden="true"></i> </a> </div> <div id="fixed-buttons-hidden"><a href="#" id="view-comments" class="fixed-button" title="查看评论"> <i class="fas fa-comment fa-fw" aria-hidden="true"></i> </a> </div><script src="https://cdn.jsdelivr.net/npm/autocomplete.js@0.38.1/dist/autocomplete.min.js"></script><script src="https://cdn.jsdelivr.net/npm/algoliasearch@5.20.2/dist/lite/builds/browser.umd.min.js"></script><script src="https://cdn.jsdelivr.net/npm/lazysizes@5.3.2/lazysizes.min.js"></script><script src="https://cdn.jsdelivr.net/npm/clipboard@2.0.11/dist/clipboard.min.js"></script><script src="https://cdn.jsdelivr.net/npm/sharer.js@0.5.2/sharer.min.js"></script><script>window.config={"comment":{"giscus":{"category":"Announcements","categoryId":"DIC_kwDON5YUqc4Cm9Ln","darkTheme":"dark","emitMetadata":"0","inputPosition":"bottom","lang":"zh-CN","lazyLoading":false,"lightTheme":"light","mapping":"pathname","reactionsEnabled":"1","repo":"linjonh/blog_comment_repo","repoId":"R_kgDON5YUqQ"}},"search":{"algoliaAppID":"LWD9RXY9DT","algoliaIndex":"prod_techblog_index","algoliaSearchKey":"496be2c0d0b37c0789b0c2ffc9602471","highlightTag":"em","maxResultLength":10,"noResultsFound":"没有找到结果","snippetLength":50,"type":"algolia"}};</script><script 
src="/TechBlog/js/theme.min.js"></script><script> window.dataLayer=window.dataLayer||[];function gtag(){dataLayer.push(arguments);}gtag('js', new Date()); gtag('config', 'G-75ZRFNZ5Y8'); </script><script src="https://www.googletagmanager.com/gtag/js?id=G-75ZRFNZ5Y8" async></script><aside id="notification" class="toast" role="alert" aria-live="assertive" aria-atomic="true" data-bs-animation="true" data-bs-autohide="false"> <div class="toast-header"> <button type="button" class="btn-close ms-auto" data-bs-dismiss="toast" aria-label="Close"></button> </div> <div class="toast-body text-center pt-0"> <p class="px-2 mb-3">A new version of this content is available.</p> <button type="button" class="btn btn-primary" aria-label="Update"> Update </button> </div> </aside></body> </html>