Getting Started with Xapian, an Open-Source C++ Search Engine

news/2024/7/10 19:26:12 Tags: c++, open source, search engine, xapian

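There are many open-source search engine frameworks and products, such as Elasticsearch, Sphinx, Xapian, Lucene, Typesense, and MeiliSearch. They are implemented in different languages and offer similar but not identical features. Strictly speaking, they are not general-purpose search engines but index-based text retrieval systems.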
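
To make "index-based" concrete, here is a toy inverted index in plain C++. It is only a sketch of the general idea and not how Xapian is implemented; the class `ToyIndex` and its methods are hypothetical names made up for this illustration. Each lowercased word maps to the set of document numbers containing it, and OR/AND queries become a set union or intersection.

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Toy inverted index (hypothetical, for illustration only):
// term -> set of document numbers that contain the term.
class ToyIndex {
public:
    // Split text on whitespace, lowercase each word, and record doc_no for it.
    void add(int doc_no, const std::string& text) {
        std::istringstream in(text);
        std::string word;
        while (in >> word) {
            std::transform(word.begin(), word.end(), word.begin(),
                           [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
            postings_[word].insert(doc_no);
        }
    }

    // Documents containing the term; empty set if the term was never indexed.
    std::set<int> lookup(const std::string& term) const {
        auto it = postings_.find(term);
        return it == postings_.end() ? std::set<int>() : it->second;
    }

    // OR query: union of the two posting sets.
    std::set<int> search_or(const std::string& a, const std::string& b) const {
        std::set<int> x = lookup(a), y = lookup(b), out;
        std::set_union(x.begin(), x.end(), y.begin(), y.end(),
                       std::inserter(out, out.end()));
        return out;
    }

    // AND query: intersection of the two posting sets.
    std::set<int> search_and(const std::string& a, const std::string& b) const {
        std::set<int> x = lookup(a), y = lookup(b), out;
        std::set_intersection(x.begin(), x.end(), y.begin(), y.end(),
                              std::inserter(out, out.end()));
        return out;
    }

private:
    std::map<std::string, std::set<int>> postings_;
};
```

A real library adds ranking, stemming, persistent storage and much more on top of this basic structure. Note that lookups here expect lowercase terms, because terms are lowercased when they are added.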

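If you want to integrate such a retrieval system into your own project through code, rather than deploy and run a complete standalone system, Xapian is a good choice: it is written in C++ and can be used as a dependency library through its C++ API.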

The following simple demo shows how to use Xapian to build an index and run queries.

Project structure

xapian_starter
	- xapian-core-1.4.22
	- src
	  |- main.cpp
	- CMakeLists.txt

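Notes

  • The Xapian website only provides a build guide for Unix systems, so this demo only supports building and running on Unix.
  • In some environments the build also needs the zlib header files and library files.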

CMakeLists.txt

cmake_minimum_required(VERSION 3.0)

# Unix only: this demo does not cover building the Xapian sources on Windows

project(xapian_demo)

set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

include_directories(
    ${CMAKE_CURRENT_SOURCE_DIR}/xapian-core-1.4.22/include
)

link_directories(
    ${CMAKE_CURRENT_SOURCE_DIR}/xapian-core-1.4.22/.libs
)

file(GLOB SRC
    src/*.h
    src/*.cpp
)

add_executable(${PROJECT_NAME} ${SRC})

target_link_libraries(${PROJECT_NAME}
    xapian
)

main.cpp

#include <iostream>
#include <string>
#include "xapian.h"

const std::string index_data_path = "./index_data";
const std::string doc_id1 = "doc1";
const std::string doc_title1 = "How to build self search engine";
const std::string doc_content1 = "What is the search engine?\nMaybe you should ask baidu or google.\nBut I want to develop my own app.\nThen you may need the xapian source code.";
const std::string doc_id2 = "doc2";
const std::string doc_title2 = "Nex generation search platform";
const std::string doc_content2 = "Every one know search is use full\nIt can be done just by a PC or phone.\nPlatform is very important";

const int DOC_ID_FIELD = 101;

void save_data()
{
	std::cout << "--- save_data" << std::endl;

	Xapian::WritableDatabase db(index_data_path, Xapian::DB_CREATE_OR_OPEN);

	Xapian::TermGenerator indexer;

	Xapian::Document doc1;
	doc1.add_value(DOC_ID_FIELD, doc_id1); // custom attribute stored in value slot DOC_ID_FIELD
	doc1.set_data(doc_content1); // payload returned with each search result
	indexer.set_document(doc1);
	indexer.index_text(doc_title1); // any space-separated text works here: a list of terms or a whole article
	db.add_document(doc1);

	Xapian::Document doc2;
	doc2.add_value(DOC_ID_FIELD, doc_id2); // custom property
	doc2.set_data(doc_content2);
	indexer.set_document(doc2);
	indexer.index_text(doc_title2);
	db.add_document(doc2);

	db.commit();
}

void search_data1()
{
	std::cout << "--- search_data1" << std::endl;

	Xapian::Database db(index_data_path);

	Xapian::Enquire enquire(db);
	Xapian::QueryParser qp;

	// std::string query_str = "search engine";
	// Xapian::Query query = qp.parse_query(query_str);
	Xapian::Query term1("search");
	Xapian::Query term2("engine");
	Xapian::Query query = Xapian::Query(Xapian::Query::OP_OR, term1, term2);

	std::cout << "query is: " << query.get_description() << std::endl;

	enquire.set_query(query);

	Xapian::MSet matches = enquire.get_mset(0, 10); // find top 10 results
	std::cout << matches.get_matches_estimated() << " results found" << std::endl;
	std::cout << "matches 1-" << matches.size() << std::endl;

	for (Xapian::MSetIterator it = matches.begin(); it != matches.end(); ++it)
	{
		Xapian::Document doc = it.get_document();
		std::string doc_id = doc.get_value(DOC_ID_FIELD);
		// FIXME: documents stored without this value slot return an empty doc_id; filter them out afterwards
		std::cout << "rank: " << it.get_rank() + 1 << ", weight: " << it.get_weight() << ", match_ratio: " << it.get_percent() << "%, match_no: " << *it << ", doc_id: " << doc_id << ", doc content: [" << doc.get_data() << "]\n" << std::endl;
	}
}

void search_data2()
{
	std::cout << "--- search_data2" << std::endl;

	Xapian::Database db(index_data_path);

	Xapian::Enquire enquire(db);
	Xapian::QueryParser qp;

	Xapian::Query term1("search");
	Xapian::Query term2("platform");
	Xapian::Query query = Xapian::Query(Xapian::Query::OP_AND, term1, term2);

	std::cout << "query is: " << query.get_description() << std::endl;

	enquire.set_query(query);

	Xapian::MSet matches = enquire.get_mset(0, 10); // fetch the top 10 results, i.e. the first page
	std::cout << matches.get_matches_estimated() << " results found" << std::endl;
	std::cout << "matches 1-" << matches.size() << std::endl;

	for (Xapian::MSetIterator it = matches.begin(); it != matches.end(); ++it)
	{
		Xapian::Document doc = it.get_document();
		std::string doc_id = doc.get_value(DOC_ID_FIELD);
		// FIXME: documents stored without this value slot return an empty doc_id; filter them out afterwards
		std::cout << "rank: " << it.get_rank() + 1 << ", weight: " << it.get_weight() << ", match_ratio: " << it.get_percent() << "%, match_no: " << *it << ", doc_id: " << doc_id << ", doc content: [" << doc.get_data() << "]\n" << std::endl;
	}
}

int main(int argc, char** argv)
{
	std::cout << "hello xapian" << std::endl;

	save_data();
	search_data1();
	search_data2();

	return 0;
}
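
The `get_mset(0, 10)` calls in the listing take a zero-based offset of the first hit and a maximum number of hits, which is what makes paging possible. As a minimal sketch, the helper below converts a 1-based page number into that pair; `PageWindow` and `page_window` are hypothetical names introduced for this example and are not part of Xapian.

```cpp
#include <cassert>

// Hypothetical helper type: the two arguments to pass to get_mset().
struct PageWindow {
    unsigned first;     // zero-based offset of the first hit on the page
    unsigned maxitems;  // maximum number of hits on the page
};

// Convert a 1-based page number and a page size into an offset/count pair.
PageWindow page_window(unsigned page, unsigned page_size) {
    assert(page >= 1);  // pages are numbered from 1 in this sketch
    return PageWindow{(page - 1) * page_size, page_size};
}
```

With `PageWindow w = page_window(2, 10);`, calling `enquire.get_mset(w.first, w.maxitems)` would request hits 11 to 20. In the demo, `matches.size()` is the number of hits actually returned for the page, while `get_matches_estimated()` is an estimate of the total number of matching documents.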

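Notes on the demo:

  • Any file or data must be indexed in advance, so that it enters Xapian's local storage, before it can be searched.
  • The index can be built from the word list of an article's title or of its content. By default, space-separated strings are recognized. This works naturally for English; Chinese text must first be segmented by other code and then passed in.
  • To make it easier to combine with a database, you can attach an attribute value to the text at indexing time, so that each search result can use that value to fetch the complete record precisely from the actual business database.
  • Some search results may have no attribute value, so it is advisable to filter the results after searching.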
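
The pre-segmentation step for Chinese text can be sketched as follows. This is a crude stand-in and not a real word segmenter: it simply puts every non-ASCII UTF-8 character into its own space-separated token and leaves ASCII text untouched. The function names `utf8_len` and `space_out_non_ascii` are hypothetical, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence starting with lead byte c.
// Invalid lead bytes are treated as single bytes.
static std::size_t utf8_len(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c >> 5) == 0x6) return 2;   // 110xxxxx
    if ((c >> 4) == 0xE) return 3;   // 1110xxxx
    if ((c >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;
}

// Surround every multi-byte character with spaces so that each one becomes
// a separate whitespace-delimited token; runs of spaces are collapsed.
std::string space_out_non_ascii(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        std::size_t n = utf8_len(static_cast<unsigned char>(utf8[i]));
        if (i + n > utf8.size()) n = utf8.size() - i;  // truncated tail
        if (n == 1) {
            // Copy ASCII as-is, but skip a space that would double up.
            bool dup_space = utf8[i] == ' ' && !out.empty() && out.back() == ' ';
            if (!dup_space) out += utf8[i];
        } else {
            if (!out.empty() && out.back() != ' ') out += ' ';
            out.append(utf8, i, n);
            out += ' ';
        }
        i += n;
    }
    if (!out.empty() && out.back() == ' ') out.pop_back();  // trim trailing space
    return out;
}
```

The returned string could then be passed to `indexer.index_text(...)` in place of the raw title. Query strings would need the same treatment so that they produce matching terms. A dictionary-based segmenter would normally give better results than single characters.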

Output

--- save_data
--- search_data1
query is: Query((search OR engine))
19 results found
matches 1-10
rank: 1, weight: 0.354232, match_ratio: 100%, match_no: 4, doc_id: , doc content: [What is the search engine?
Maybe you should ask baidu or google.
But I want to develop my own app.
Then you may need the xapian source code.]

rank: 2, weight: 0.354232, match_ratio: 100%, match_no: 6, doc_id: , doc content: [What is the search engine?
Maybe you should ask baidu or google.
But I want to develop my own app.
Then you may need the xapian source code.]

rank: 3, weight: 0.354232, match_ratio: 100%, match_no: 8, doc_id: , doc content: [What is the search engine?
Maybe you should ask baidu or google.
But I want to develop my own app.
Then you may need the xapian source code.]

rank: 4, weight: 0.354232, match_ratio: 100%, match_no: 10, doc_id: doc1, doc content: [What is the search engine?
Maybe you should ask baidu or google.
But I want to develop my own app.
Then you may need the xapian source code.]

rank: 5, weight: 0.354232, match_ratio: 100%, match_no: 12, doc_id: doc1, doc content: [What is the search engine?
Maybe you should ask baidu or google.
But I want to develop my own app.
Then you may need the xapian source code.]

rank: 6, weight: 0.354232, match_ratio: 100%, match_no: 14, doc_id: doc1, doc content: [What is the search engine?
Maybe you should ask baidu or google.
But I want to develop my own app.
Then you may need the xapian source code.]

rank: 7, weight: 0.354232, match_ratio: 100%, match_no: 16, doc_id: doc1, doc content: [What is the search engine?
Maybe you should ask baidu or google.
But I want to develop my own app.
Then you may need the xapian source code.]

rank: 8, weight: 0.354232, match_ratio: 100%, match_no: 18, doc_id: doc1, doc content: [What is the search engine?
Maybe you should ask baidu or google.
But I want to develop my own app.
Then you may need the xapian source code.]

rank: 9, weight: 0.209633, match_ratio: 59%, match_no: 1, doc_id: , doc content: [What is the search engine?
Maybe you should ask baidu or google.
But I want to develop my own app.
Then you may need the xapian source code.]

rank: 10, weight: 0.209633, match_ratio: 59%, match_no: 2, doc_id: , doc content: [What is the search engine?
Maybe you should ask baidu or google.
But I want to develop my own app.
Then you may need the xapian source code.]

--- search_data2
query is: Query((search AND platform))
8 results found
matches 1-8
rank: 1, weight: 0.605063, match_ratio: 100%, match_no: 5, doc_id: , doc content: [Every one know search is use full
It can be done just by a PC or phone.
Platform is very important]

rank: 2, weight: 0.605063, match_ratio: 100%, match_no: 7, doc_id: , doc content: [Every one know search is use full
It can be done just by a PC or phone.
Platform is very important]

rank: 3, weight: 0.605063, match_ratio: 100%, match_no: 9, doc_id: , doc content: [Every one know search is use full
It can be done just by a PC or phone.
Platform is very important]

rank: 4, weight: 0.605063, match_ratio: 100%, match_no: 11, doc_id: doc2, doc content: [Every one know search is use full
It can be done just by a PC or phone.
Platform is very important]

rank: 5, weight: 0.605063, match_ratio: 100%, match_no: 13, doc_id: doc2, doc content: [Every one know search is use full
It can be done just by a PC or phone.
Platform is very important]

rank: 6, weight: 0.605063, match_ratio: 100%, match_no: 15, doc_id: doc2, doc content: [Every one know search is use full
It can be done just by a PC or phone.
Platform is very important]

rank: 7, weight: 0.605063, match_ratio: 100%, match_no: 17, doc_id: doc2, doc content: [Every one know search is use full
It can be done just by a PC or phone.
Platform is very important]

rank: 8, weight: 0.605063, match_ratio: 100%, match_no: 19, doc_id: doc2, doc content: [Every one know search is use full
It can be done just by a PC or phone.
Platform is very important]


