# 构建一个简单而有效的搜索引擎

# Building a Simple Search Engine That Actually Works

> by karboosx, Published November 9, 2025
>
> 作者:karboosx,发布于 2025 年 11 月 9 日

---

## 为什么要自己构建?

## Why Build Your Own?

我知道你在想什么。“为什么不直接用 Elasticsearch?” 或者 “Algolia 怎么样?” 这些都是有效的选择,但它们也带来了复杂性。你需要学习它们的 API,管理它们的基础设施,并处理它们的特殊问题。

Look, I know what you're thinking. "Why not just use Elasticsearch?" or "What about Algolia?" Those are valid options, but they come with complexity. You need to learn their APIs, manage their infrastructure, and deal with their quirks.

有时候,你只是想要一个:

Sometimes you just want something that:

- 能与你现有数据库配合使用的
- 不需要外部服务的
- 易于理解和调试的
- 能真正找到相关结果的

- Works with your existing database
- Doesn't require external services
- Is easy to understand and debug
- Actually finds relevant results

这就是我所构建的。一个使用你现有数据库、尊重你当前架构,并让你完全控制其工作方式的搜索引擎。

That's what I built. A search engine that uses your existing database, respects your current architecture, and gives you full control over how it works.

---

## 核心思想

## The Core Idea

这个概念很简单:**将所有内容分词,存储起来,然后在搜索时匹配词元**。

The concept is simple: **tokenize everything, store it, then match tokens when searching**.

它的工作原理如下:

Here's how it works:

1. **索引**:当你添加或更新内容时,我们将其拆分为词元(单词、前缀、n-grams),并带有权重地存储它们。
2. **搜索**:当有人搜索时,我们以同样的方式对他们的查询进行分词,找到匹配的词元,并对结果进行评分。
3. **评分**:我们使用存储的权重来计算相关性分数。

1. **Indexing**: When you add or update content, we split it into tokens (words, prefixes, n-grams) and store them with weights
2. **Searching**: When someone searches, we tokenize their query the same way, find matching tokens, and score the results
3. **Scoring**: We use the stored weights to calculate relevance scores

其中的奥妙在于分词和加权。让我来解释一下。

The magic is in the tokenization and weighting. Let me show you what I mean.

---

## 构建模块 1:数据库结构

## Building Block 1: The Database Schema

我们需要两个简单的表:`index_tokens` 和 `index_entries`。

We need two simple tables: `index_tokens` and `index_entries`.

### index_tokens

这个表存储所有唯一的词元及其分词器权重。每个词元名称可以有多个不同权重的记录——每个分词器一个。

This table stores all unique tokens with their tokenizer weights. Each token name can have multiple records with different weights—one per tokenizer.

```php
// index_tokens 表结构
id | name   | weight
---|--------|-------
1  | parser | 20   // 来自 WordTokenizer
2  | parser | 5    // 来自 PrefixTokenizer
3  | parser | 1    // 来自 NGramsTokenizer
4  | parser | 10   // 来自 SingularTokenizer
```

为什么每个权重都要单独存储词元?因为不同的分词器会产生相同但权重不同的词元。例如,来自 `WordTokenizer` 的 "parser" 权重为 20,但来自 `PrefixTokenizer` 的 "parser" 权重为 5。我们需要独立的记录来正确地为匹配项评分。

Why store separate tokens per weight? Different tokenizers produce the same token with different weights. For example, "parser" from WordTokenizer has weight 20, but "parser" from PrefixTokenizer has weight 5. We need separate records to properly score matches.

唯一约束作用于 `(name, weight)`,所以同一个词元名称可以以不同的权重多次存在。

The unique constraint is on `(name, weight)`, so the same token name can exist multiple times with different weights.

### index_entries

这个表通过特定字段的权重将词元与文档关联起来。

This table links tokens to documents with field-specific weights.

```php
// index_entries 表结构
id | token_id | document_type | field_id | document_id | weight
---|----------|---------------|----------|-------------|-------
1  | 1        | 1             | 1        | 42          | 2000
2  | 2        | 1             | 1        | 42          | 500
```

这里的 `weight` 是最终计算出的权重:`field_weight × tokenizer_weight × ceil(sqrt(token_length))`。这包含了我们评分所需的一切。我们会在本文后面讨论评分。

The `weight` here is the final calculated weight: `field_weight × tokenizer_weight × ceil(sqrt(token_length))`. This encodes everything we need for scoring. We will talk about scoring later in the post.

我们在以下字段上添加索引 (建表草图见下方):

We add indexes on (see the schema sketch after the list):

- `(document_type, document_id)` - 用于快速文档查找
- `token_id` - 用于快速词元查找
- `(document_type, field_id)` - 用于特定字段的查询
- `weight` - 用于按权重筛选

- `(document_type, document_id)` - for fast document lookups
- `token_id` - for fast token lookups
- `(document_type, field_id)` - for field-specific queries
- `weight` - for filtering by weight

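
下面是与上述结构和索引一致的一个建表草图。文章没有给出具体的 DDL,列类型(以及 MySQL 方言)都是假设,仅作示意。

Below is a schema sketch consistent with the structure and indexes described above. The article doesn't include the actual DDL, so the column types (and the MySQL flavor) are assumptions, for illustration only.

```sql
-- 假设的 MySQL 风格建表草图 / assumed MySQL-flavored sketch
CREATE TABLE index_tokens (
    id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name   VARCHAR(191) NOT NULL,
    weight INT NOT NULL,
    UNIQUE KEY uniq_name_weight (name, weight)   -- 唯一约束作用于 (name, weight)
);

CREATE TABLE index_entries (
    id            BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    token_id      INT UNSIGNED NOT NULL,
    document_type INT NOT NULL,
    field_id      INT NOT NULL,
    document_id   INT NOT NULL,
    weight        INT NOT NULL,
    KEY idx_doc    (document_type, document_id), -- 快速文档查找
    KEY idx_token  (token_id),                   -- 快速词元查找
    KEY idx_field  (document_type, field_id),    -- 特定字段查询
    KEY idx_weight (weight)                      -- 按权重筛选
);
```
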

为什么是这种结构?简单、高效,并且充分利用了数据库的优势。

Why this structure? Simple, efficient, and leverages what databases do best.

---

## 构建模块 2:分词

## Building Block 2: Tokenization

什么是分词?就是把文本分解成可搜索的小块。根据我们使用的分词器,单词 "parser" 可以变成 `["parser"]`、`["par", "pars", "parse", "parser"]` 或 `["par", "ars", "rse", "ser"]` 等词元。

What is tokenization? It's breaking text into searchable pieces. The word "parser" becomes tokens like `["parser"]`, `["par", "pars", "parse", "parser"]`, or `["par", "ars", "rse", "ser"]` depending on which tokenizer we use.

为什么需要多种分词器?不同的策略满足不同的匹配需求。一种用于精确匹配,一种用于部分匹配,还有一种用于处理拼写错误。

Why multiple tokenizers? Different strategies for different matching needs. One tokenizer for exact matches, another for partial matches, another for typos.

所有的分词器都实现一个简单的接口:

All tokenizers implement a simple interface:

```php
interface TokenizerInterface
{
    public function tokenize(string $text): array; // 返回 Token 对象数组
    public function getWeight(): int;              // 返回分词器权重
}
```

简单的契约,易于扩展。

Simple contract, easy to extend.

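
分词器返回的 `Token` 值对象在文章中没有单独给出。下面是一个与后文用法(`new Token($word, $this->weight)`、`$token->value`、`$token->weight`)一致的最小草图,仅供参考。

The `Token` value object returned by the tokenizers isn't shown in the article. Here is a minimal sketch consistent with how it's used later (`new Token($word, $this->weight)`, `$token->value`, `$token->weight`), for reference only.

```php
// 最小草图:一个不可变的词元值对象 / minimal sketch: an immutable token value object
final class Token
{
    public function __construct(
        public readonly string $value, // 词元文本 / the token text
        public readonly int $weight    // 产生它的分词器权重 / weight of the tokenizer that produced it
    ) {}
}
```
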

### 单词分词器 (Word Tokenizer)

这个很简单——它将文本分割成单个的单词。"parser" 就变成了 `["parser"]`。简单,但对于精确匹配非常强大。

This one is straightforward—it splits text into individual words. "parser" becomes just `["parser"]`. Simple, but powerful for exact matches.

首先,我们对文本进行规范化处理。全部转为小写,移除特殊字符,规范化空白符:

First, we normalize the text. Lowercase everything, remove special characters, normalize whitespace:

```php
class WordTokenizer implements TokenizerInterface
{
    public function tokenize(string $text): array
    {
        // 规范化:小写,移除特殊字符
        $text = mb_strtolower(trim($text));
        $text = preg_replace('/[^a-z0-9]/', ' ', $text);
        $text = preg_replace('/\s+/', ' ', $text);
```

接下来,我们将其分割成单词并过滤掉太短的:

Next, we split into words and filter out short ones:

```php
        // 分割成单词,过滤短词
        $words = explode(' ', $text);
        $words = array_filter($words, fn($w) => mb_strlen($w) >= 2);
```

为什么要过滤短词?单个字符的词通常太常见,没什么用。"a"、"I"、"x" 对搜索没有帮助。

Why filter short words? Single-character words are usually too common to be useful. "a", "I", "x" don't help with search.

最后,我们将唯一的单词作为 `Token` 对象返回:

Finally, we return unique words as Token objects:

```php
        // 以带权重的 Token 对象返回
        return array_map(
            fn($word) => new Token($word, $this->weight),
            array_unique($words)
        );
    }
}
```

权重:20 (精确匹配高优先级)

Weight: 20 (high priority for exact matches)

### 前缀分词器 (Prefix Tokenizer)

这个分词器生成单词的前缀。"parser" 会变成 `["par", "pars", "parse", "parser"]` (最小长度为 4)。这有助于部分匹配和类似自动补全的行为。

This generates word prefixes. "parser" becomes `["par", "pars", "parse", "parser"]` (with min length 4). This helps with partial matches and autocomplete-like behavior.

首先,我们提取单词 (与 `WordTokenizer` 相同的规范化处理):

First, we extract words (same normalization as WordTokenizer):

```php
class PrefixTokenizer implements TokenizerInterface
{
    public function __construct(
        private int $minPrefixLength = 4,
        private int $weight = 5
    ) {}

    public function tokenize(string $text): array
    {
        // 与 WordTokenizer 相同的规范化
        $words = $this->extractWords($text);
```

然后,对每个单词,我们从最小长度开始生成前缀,直到整个单词:

Then, for each word, we generate prefixes from the minimum length to the full word:

```php
        $tokens = [];
        foreach ($words as $word) {
            $wordLength = mb_strlen($word);

            // 从最小长度生成到整个单词的前缀
            for ($i = $this->minPrefixLength; $i <= $wordLength; $i++) {
                $prefix = mb_substr($word, 0, $i);
                $tokens[$prefix] = true; // 使用关联数组保证唯一性
            }
        }
```

为什么使用关联数组?它能确保唯一性。如果 "parser" 在文本中出现两次,我们只想要一个 "parser" 词元。

Why use an associative array? It ensures uniqueness. If "parser" appears twice in the text, we only want one "parser" token.

最后,我们将键转换为 `Token` 对象:

Finally, we convert the keys to Token objects:

```php
        return array_map(
            fn($prefix) => new Token($prefix, $this->weight),
            array_keys($tokens)
        );
    }
}
```

权重:5 (中等优先级)

Weight: 5 (medium priority)

为什么有最小长度?避免产生太多微小的词元。长度小于 4 的前缀通常太常见,用处不大。

Why a minimum length? It avoids too many tiny tokens. Prefixes shorter than 4 characters are usually too common to be useful.

### N-Grams 分词器

这会创建固定长度的字符序列 (我用的是 3)。"parser" 会变成 `["par", "ars", "rse", "ser"]`。这能捕捉到拼写错误和部分单词匹配。

This creates character sequences of a fixed length (I use 3). "parser" becomes `["par", "ars", "rse", "ser"]`. This catches typos and partial word matches.

首先,我们提取单词:

First, we extract words:

```php
class NGramsTokenizer implements TokenizerInterface
{
    public function __construct(
        private int $ngramLength = 3,
        private int $weight = 1
    ) {}

    public function tokenize(string $text): array
    {
        $words = $this->extractWords($text);
```

然后,对每个单词,我们用一个固定长度的窗口滑过它:

Then, for each word, we slide a window of fixed length across it:

```php
        $tokens = [];
        foreach ($words as $word) {
            $wordLength = mb_strlen($word);

            // 固定长度的滑动窗口
            for ($i = 0; $i <= $wordLength - $this->ngramLength; $i++) {
                $ngram = mb_substr($word, $i, $this->ngramLength);
                $tokens[$ngram] = true;
            }
        }
```

滑动窗口:对于长度为 3 的 "parser",我们得到:

The sliding window: for "parser" with length 3, we get:

- 位置 0: "par"
- 位置 1: "ars"
- 位置 2: "rse"
- 位置 3: "ser"

- Position 0: "par"
- Position 1: "ars"
- Position 2: "rse"
- Position 3: "ser"

为什么这能行?即使用户输入 "parsr" (拼写错误),我们仍然能得到 "par" 和 "ars" 词元,它们能匹配到拼写正确的 "parser"。

Why does this work? Even if someone types "parsr" (typo), we still get "par" and "ars" tokens, which match the correctly spelled "parser".

最后,我们转换为 `Token` 对象:

Finally, we convert to Token objects:

```php
        return array_map(
            fn($ngram) => new Token($ngram, $this->weight),
            array_keys($tokens)
        );
    }
}
```

权重:1 (低优先级,但能捕捉边缘情况)

Weight: 1 (low priority, but catches edge cases)

为什么是 3?在覆盖范围和噪音之间取得平衡。太短会匹配到太多东西,太长会错过拼写错误。

Why 3? It's a balance between coverage and noise. Too short and you get too many matches, too long and you miss typos.

### 规范化 (Normalization)

所有分词器都执行相同的规范化:

All tokenizers do the same normalization:

- 全部转为小写
- 移除特殊字符 (只保留字母和数字)
- 规范化空白 (多个空格转为单个空格)

- Lowercase everything
- Remove special characters (keep only alphanumeric characters)
- Normalize whitespace (multiple spaces to single space)

这确保了无论输入格式如何,都能有一致的匹配。

This ensures consistent matching regardless of input format.

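
`PrefixTokenizer` 和 `NGramsTokenizer` 都调用了 `extractWords()`,但文章没有单独展示它。下面是按上述规范化规则写出的一个可能草图;其中沿用 `WordTokenizer` 的“长度 >= 2”过滤属于假设。

Both `PrefixTokenizer` and `NGramsTokenizer` call `extractWords()`, which the article doesn't show. Here's a possible sketch that follows the normalization rules above; reusing WordTokenizer's "length >= 2" filter here is an assumption.

```php
// 可能的 extractWords() 草图 / possible sketch of extractWords()
private function extractWords(string $text): array
{
    // 与 WordTokenizer 相同的规范化:小写、去特殊字符、规范化空白
    $text = mb_strtolower(trim($text));
    $text = preg_replace('/[^a-z0-9]/', ' ', $text);
    $text = preg_replace('/\s+/', ' ', $text);

    $words = explode(' ', trim($text));

    // 过滤过短的词 (假设与 WordTokenizer 一致)
    return array_values(array_filter($words, fn($w) => mb_strlen($w) >= 2));
}
```
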

---

## 构建模块 3:权重系统

## Building Block 3: The Weight System

我们有三个层级的权重协同工作:

We have three levels of weights working together:

1. **字段权重**:标题 vs 内容 vs 关键词
2. **分词器权重**:单词 vs 前缀 vs n-gram (存储在 `index_tokens` 中)
3. **文档权重**:存储在 `index_entries` 中 (计算得出:`field_weight × tokenizer_weight × ceil(sqrt(token_length))`)

1. **Field weights**: Title vs content vs keywords
2. **Tokenizer weights**: Word vs prefix vs n-gram (stored in index_tokens)
3. **Document weights**: Stored in index_entries (calculated: `field_weight × tokenizer_weight × ceil(sqrt(token_length))`)

### 最终权重计算

### Final Weight Calculation

在索引时,我们这样计算最终权重:

When indexing, we calculate the final weight like this:

```php
$finalWeight = $fieldWeight * $tokenizerWeight * ceil(sqrt($tokenLength));
```

例如:

For example:

- 标题字段:权重 10
- 单词分词器:权重 20
- 词元 "parser":长度 6
- 最终权重:`10 × 20 × ceil(sqrt(6)) = 10 × 20 × 3 = 600`

- Title field: weight 10
- Word tokenizer: weight 20
- Token "parser": length 6
- Final weight: `10 × 20 × ceil(sqrt(6)) = 10 × 20 × 3 = 600`

为什么使用 `ceil(sqrt())`?较长的词元更具体,但我们不希望权重随着词元变得非常长而爆炸式增长。"parser" 比 "par" 更具体,但一个 100 个字符的词元不应该有 100 倍的权重。平方根函数给了我们递减的回报——较长的词元仍然得分更高,但不是线性的。我们使用 `ceil()` 向上取整,保持权重为整数。

Why use `ceil(sqrt())`? Longer tokens are more specific, but we don't want weights to blow up with very long tokens. "parser" is more specific than "par", but a 100-character token shouldn't have 100x the weight. The square root function gives us diminishing returns—longer tokens still score higher, but not linearly. We use `ceil()` to round up to the nearest integer, keeping weights as whole numbers.

### 调整权重

### Tuning Weights

你可以根据你的使用场景调整权重:

You can adjust weights for your use case:

- 如果标题最重要,增加标题的字段权重。
- 如果你想优先考虑精确匹配,增加精确匹配分词器的权重。
- 如果你希望长词元更重要或不那么重要,可以调整词元长度函数 (ceil(sqrt)、log 或线性)。

- Increase field weights for titles if titles are most important
- Increase tokenizer weights for exact matches if you want to prioritize exact matches
- Adjust the token length function (ceil(sqrt), log, or linear) if you want longer tokens to matter more or less

你可以精确地看到权重是如何计算的,并根据需要进行调整。

You can see exactly how weights are calculated and adjust them as needed.

---

## 构建模块 4:索引服务

## Building Block 4: The Indexing Service

索引服务接收一个文档,并将其所有词元存储到数据库中。

The indexing service takes a document and stores all its tokens in the database.

### 接口 (The Interface)

可被索引的文档实现 `IndexableDocumentInterface`:

Documents that can be indexed implement `IndexableDocumentInterface`:

```php
interface IndexableDocumentInterface
{
    public function getDocumentId(): int;
    public function getDocumentType(): DocumentType;
    public function getIndexableFields(): IndexableFields;
}
```

要使文档可搜索,你需要实现这三个方法:

To make a document searchable, you implement these three methods:

```php
class Post implements IndexableDocumentInterface
{
    public function getDocumentId(): int
    {
        return $this->id ?? 0;
    }

    public function getDocumentType(): DocumentType
    {
        return DocumentType::POST;
    }

    public function getIndexableFields(): IndexableFields
    {
        $fields = IndexableFields::create()
            ->addField(FieldId::TITLE, $this->title ?? '', 10)
            ->addField(FieldId::CONTENT, $this->content ?? '', 1);

        // 如果有关键词,则添加
        if (!empty($this->keywords)) {
            $fields->addField(FieldId::KEYWORDS, $this->keywords, 20);
        }

        return $fields;
    }
}
```

需要实现的三个方法:

Three methods to implement:

- `getDocumentType()`: 返回文档类型枚举
- `getDocumentId()`: 返回文档 ID
- `getIndexableFields()`: 使用流式 API 构建带权重的字段

- `getDocumentType()`: returns the document type enum
- `getDocumentId()`: returns the document ID
- `getIndexableFields()`: builds fields with weights using fluent API

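
文章没有给出 `DocumentType`、`FieldId` 和 `IndexableFields` 的定义。下面是与上面的用法一致的一个最小草图 (假设它们是以 int 为底层值的 PHP 8.1 枚举和一个简单的构建器),仅作示意。

The article doesn't show `DocumentType`, `FieldId`, or `IndexableFields`. Below is a minimal sketch consistent with the usage above (assuming int-backed PHP 8.1 enums and a simple builder), for illustration only.

```php
// 假设:以 int 为底层值的枚举 / assumption: int-backed enums
enum DocumentType: int
{
    case POST = 1;
}

enum FieldId: int
{
    case TITLE = 1;
    case CONTENT = 2;
    case KEYWORDS = 3;
}

// 假设的最小构建器,与后文 getFields()/getWeights() 的用法一致
// Assumed minimal builder, consistent with how getFields()/getWeights() are used later
final class IndexableFields
{
    /** @var array<int, string> field id => 内容 / content */
    private array $fields = [];

    /** @var array<int, int> field id => 字段权重 / field weight */
    private array $weights = [];

    public static function create(): self
    {
        return new self();
    }

    public function addField(FieldId $fieldId, string $content, int $weight): self
    {
        $this->fields[$fieldId->value] = $content;
        $this->weights[$fieldId->value] = $weight;

        return $this;
    }

    public function getFields(): array
    {
        return $this->fields;
    }

    public function getWeights(): array
    {
        return $this->weights;
    }
}
```
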

你可以通过以下方式索引文档:

You can index documents:

- 在创建/更新时 (通过事件监听器)
- 通过命令:`app:index-document`、`app:reindex-documents`
- 通过 cron (用于批量重新索引)

- On create/update (via event listeners)
- Via commands: `app:index-document`, `app:reindex-documents`
- Via cron (for batch reindexing)

### 工作原理

### How It Works

这是索引过程的逐步分解。

Here's the indexing process, step by step.

首先,我们获取文档信息:

First, we get the document information:

```php
class SearchIndexingService
{
    public function indexDocument(IndexableDocumentInterface $document): void
    {
        // 1. 获取文档信息
        $documentType = $document->getDocumentType();
        $documentId = $document->getDocumentId();
        $indexableFields = $document->getIndexableFields();
        $fields = $indexableFields->getFields();
        $weights = $indexableFields->getWeights();
```

文档通过 `IndexableFields` 构建器提供其字段和权重。

The document provides its fields and weights via the `IndexableFields` builder.

接下来,我们删除该文档的现有索引。这处理了更新操作——如果文档改变了,我们需要重新索引它:

Next, we remove the existing index for this document. This handles updates—if the document changed, we need to reindex it:

```php
        // 2. 删除该文档的现有索引
        $this->removeDocumentIndex($documentType, $documentId);

        // 3. 准备批量插入数据
        $insertData = [];
```

为什么先删除?如果我们只是添加新的词元,就会有重复。最好从头开始。

Why remove first? If we just add new tokens, we'll have duplicates. Better to start fresh.

现在,我们处理每个字段。对每个字段,我们运行所有的分词器:

Now, we process each field. For each field, we run all tokenizers:

```php
        // 4. 处理每个字段
        foreach ($fields as $fieldIdValue => $content) {
            if (empty($content)) {
                continue;
            }

            $fieldId = FieldId::from($fieldIdValue);
            $fieldWeight = $weights[$fieldIdValue] ?? 0;

            // 5. 在此字段上运行所有分词器
            foreach ($this->tokenizers as $tokenizer) {
                $tokens = $tokenizer->tokenize($content);
```

对于每个分词器,我们得到词元。然后,对于每个词元,我们在数据库中找到或创建它,并计算最终权重:

For each tokenizer, we get tokens. Then, for each token, we find or create it in the database and calculate the final weight:

```php
                foreach ($tokens as $token) {
                    $tokenValue = $token->value;
                    $tokenWeight = $token->weight;

                    // 6. 在 index_tokens 中查找或创建词元
                    $tokenId = $this->findOrCreateToken($tokenValue, $tokenWeight);

                    // 7. 计算最终权重
                    $tokenLength = mb_strlen($tokenValue);
                    $finalWeight = (int) ($fieldWeight * $tokenWeight * ceil(sqrt($tokenLength)));

                    // 8. 添加到批量插入
                    $insertData[] = [
                        'token_id' => $tokenId,
                        'document_type' => $documentType->value,
                        'field_id' => $fieldId->value,
                        'document_id' => $documentId,
                        'weight' => $finalWeight,
                    ];
                }
            }
        }
```

为什么批量插入?为了性能。我们收集所有行,然后在一个查询中插入,而不是一次插入一行。

Why batch insert? Performance. Instead of inserting one row at a time, we collect all rows and insert them in one query.

最后,我们批量插入所有数据:

Finally, we batch insert everything:

```php
        // 9. 为性能进行批量插入
        if (!empty($insertData)) {
            $this->batchInsertSearchDocuments($insertData);
        }
    }
```

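
文章没有展示 `batchInsertSearchDocuments()`。下面是基于 Doctrine DBAL (文中已用到 `executeStatement()`) 的一个可能草图:把所有行拼成一条多值 INSERT,并使用参数绑定。

The article doesn't show `batchInsertSearchDocuments()`. Here's a possible sketch using Doctrine DBAL (the article already uses `executeStatement()`): build a single multi-row INSERT with parameter binding.

```php
// 可能的批量插入草图 / possible batch-insert sketch
private function batchInsertSearchDocuments(array $rows): void
{
    $placeholders = [];
    $params = [];

    foreach ($rows as $row) {
        $placeholders[] = '(?, ?, ?, ?, ?)';
        $params[] = $row['token_id'];
        $params[] = $row['document_type'];
        $params[] = $row['field_id'];
        $params[] = $row['document_id'];
        $params[] = $row['weight'];
    }

    $sql = 'INSERT INTO index_entries (token_id, document_type, field_id, document_id, weight) VALUES '
        . implode(', ', $placeholders);

    // 一条查询插入所有行 / insert all rows in one query
    $this->connection->executeStatement($sql, $params);
}
```

实际使用时,如果单个文档产生的行数非常多,可能还需要分块插入,以避免超出数据库的占位符或包大小限制。

In practice, if a single document produces a very large number of rows, you may want to chunk the inserts to stay under the database's placeholder or packet-size limits.
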

`findOrCreateToken` 方法很简单:

The `findOrCreateToken` method is straightforward:

```php
    private function findOrCreateToken(string $name, int $weight): int
    {
        // 尝试查找具有相同名称和权重的现有词元
        $sql = "SELECT id FROM index_tokens WHERE name = ? AND weight = ?";
        $result = $this->connection->executeQuery($sql, [$name, $weight])->fetchAssociative();

        if ($result) {
            return (int) $result['id'];
        }

        // 创建新词元
        $insertSql = "INSERT INTO index_tokens (name, weight) VALUES (?, ?)";
        $this->connection->executeStatement($insertSql, [$name, $weight]);

        return (int) $this->connection->lastInsertId();
    }
}
```

为什么要查找或创建?词元在文档间共享。如果 "parser" 已经以权重 20 存在,我们就重用它。没必要创建重复的。

Why find or create? Tokens are shared across documents. If "parser" already exists with weight 20, we reuse it. No need to create duplicates.

关键点:

The key points:

- 我们首先删除旧索引 (处理更新)
- 我们为性能批量插入 (一个查询代替多个)
- 我们查找或创建词元 (避免重复)
- 我们动态计算最终权重

- We remove old index first (handles updates)
- We batch insert for performance (one query instead of many)
- We find or create tokens (avoids duplicates)
- We calculate final weight on the fly

---

## 构建模块 5:搜索服务

## Building Block 5: The Search Service

搜索服务接收一个查询字符串,并找到相关的文档。它以与索引文档时相同的方式对查询进行分词,然后在数据库中将这些词元与索引的词元进行匹配。结果按相关性评分,并以文档 ID 和分数的列表形式返回。

The search service takes a query string and finds relevant documents. It tokenizes the query the same way we tokenized documents during indexing, then matches those tokens against the indexed tokens in the database. The results are scored by relevance and returned as document IDs with scores.

### 工作原理

### How It Works

这是搜索过程的逐步分解。

Here's the search process, step by step.

首先,我们使用所有分词器对查询进行分词:

First, we tokenize the query using all tokenizers:

```php
class SearchService
{
    public function search(DocumentType $documentType, string $query, ?int $limit = null): array
    {
        // 1. 使用所有分词器对查询进行分词
        $queryTokens = $this->tokenizeQuery($query);

        if (empty($queryTokens)) {
            return [];
        }
```

如果查询没有产生任何词元 (例如,只有特殊字符),我们返回空结果。

If the query produces no tokens (e.g., only special characters), we return empty results.

### 为什么使用相同的分词器对查询进行分词?

### Why Tokenize the Query Using the Same Tokenizers?

不同的分词器产生不同的词元值。如果我们用一组分词器索引,用另一组搜索,就会错过匹配。

Different tokenizers produce different token values. If we index with one set and search with another, we'll miss matches.

例如:

Example:

- 用 `PrefixTokenizer` 索引创建词元:"par", "pars", "parse", "parser"
- 只用 `WordTokenizer` 搜索创建词元:"parser"
- 我们会找到 "parser",但找不到只有 "par" 或 "pars" 词元的文档
- 结果:不完整的匹配,丢失相关文档!

- Indexing with PrefixTokenizer creates tokens: "par", "pars", "parse", "parser"
- Searching with only WordTokenizer creates token: "parser"
- We'll find "parser", but we won't find documents that only have "par" or "pars" tokens
- Result: Incomplete matches, missing relevant documents!

**解决方案**:索引和搜索都使用相同的分词器。相同的分词策略 = 相同的词元值 = 完整的匹配。

**The solution**: Use the same tokenizers for both indexing and searching. Same tokenization strategy = same token values = complete matches.

这就是为什么 `SearchService` 和 `SearchIndexingService` 都接收同一组分词器的原因。

This is why the `SearchService` and `SearchIndexingService` both receive the same set of tokenizers.

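
`tokenizeQuery()` 在文章中没有展示。按照上面的描述 (对查询运行所有分词器并合并结果),一个可能的草图如下:

The article doesn't show `tokenizeQuery()`. Following the description above (run every tokenizer over the query and merge the results), a possible sketch:

```php
// 可能的 tokenizeQuery() 草图 / possible sketch of tokenizeQuery()
private function tokenizeQuery(string $query): array
{
    $tokens = [];

    // 与索引时相同的一组分词器 / the same set of tokenizers used for indexing
    foreach ($this->tokenizers as $tokenizer) {
        foreach ($tokenizer->tokenize($query) as $token) {
            $tokens[] = $token;
        }
    }

    return $tokens;
}
```
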

接下来,我们提取唯一的词元值。多个分词器可能会产生相同的词元值,所以我们去重:

Next, we extract unique token values. Multiple tokenizers might produce the same token value, so we deduplicate:

```php
        // 2. 提取唯一的词元值
        $tokenValues = array_unique(array_map(
            fn($token) => $token instanceof Token ? $token->value : $token,
            $queryTokens
        ));
```

为什么要提取值?我们按词元名称搜索,而不是按权重。我们需要唯一的词元名称来搜索。

Why extract values? We search by token name, not by weight. We need the unique token names to search for.

然后,我们按长度对词元进行排序 (最长的在前)。这优先考虑了特定的匹配:

Then, we sort tokens by length (longest first). This prioritizes specific matches:

```php
        // 3. 对词元排序 (最长的在前 - 优先考虑特定匹配)
        usort($tokenValues, fn($a, $b) => mb_strlen($b) <=> mb_strlen($a));
```

为什么要排序?较长的词元更具体。"parser" 比 "par" 更具体,所以我们想先搜索 "parser"。

Why sort? Longer tokens are more specific. "parser" is more specific than "par", so we want to search for "parser" first.

我们还限制了词元的数量,以防止利用巨型查询发起的 DoS 攻击:

We also limit the token count to prevent DoS attacks with huge queries:

```php
        // 4. 限制词元数量 (防止巨型查询的 DoS 攻击)
        if (count($tokenValues) > 300) {
            $tokenValues = array_slice($tokenValues, 0, 300);
        }
```

为什么要限制?恶意用户可能会发送一个产生数千个词元的查询,导致性能问题。我们保留最长的 300 个词元 (已经排序)。

Why limit? A malicious user could send a query that produces thousands of tokens, causing performance issues. We keep the longest 300 tokens (already sorted).

现在,我们执行优化的 SQL 查询。`executeSearch()` 方法构建并执行 SQL 查询:

Now, we execute the optimized SQL query. The `executeSearch()` method builds the SQL query and executes it:

```php
        // 5. 执行优化的 SQL 查询
        // 注:传入词元数量和最小词元权重,以匹配下面的方法签名 (最小权重 10,见后文的后备逻辑)
        // Note: pass the token count and minimum token weight to match the signature below
        // (minimum weight 10; see the fallback logic later in the post)
        $results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 10);
```

在 `executeSearch()` 内部,我们用参数占位符构建 SQL 查询,执行它,过滤低分结果,并转换为 `SearchResult` 对象:

Inside `executeSearch()`, we build the SQL query with parameter placeholders, execute it, filter low-scoring results, and convert to SearchResult objects:

```php
    private function executeSearch(DocumentType $documentType, array $tokenValues, int $tokenCount, ?int $limit, int $minTokenWeight): array
    {
        // 为词元值构建参数占位符
        $tokenPlaceholders = implode(',', array_fill(0, $tokenCount, '?'));

        // 构建 SQL 查询 (完整查询见下文"评分算法"部分)
        $sql = "SELECT sd.document_id, ... FROM index_entries sd ...";

        // 构建参数数组
        $params = [
            $documentType->value,  // document_type
            ...$tokenValues,       // IN 子句的词元值
            $documentType->value,  // 用于子查询
            ...$tokenValues,       // 子查询的词元值
            $minTokenWeight,       // 最小词元权重
            // ... 更多参数
        ];

        // 使用参数绑定执行查询
        $results = $this->connection->executeQuery($sql, $params)->fetchAllAssociative();

        // 过滤掉标准化分数低的结果 (低于阈值)
        $results = array_filter($results, fn($r) => (float) $r['score'] >= 0.05);

        // 转换为 SearchResult 对象
        return array_map(
            fn($result) => new SearchResult(
                documentId: (int) $result['document_id'],
                score: (float) $result['score']
            ),
            $results
        );
    }
```

SQL 查询完成了繁重的工作:找到匹配的文档,计算分数,并按相关性排序。我们使用原生 SQL 以获得性能和完全的控制——我们可以根据需要精确地优化查询。

The SQL query does the heavy lifting: finds matching documents, calculates scores, and sorts by relevance. We use raw SQL for performance and full control—we can optimize the query exactly how we need it.

查询使用 JOIN 连接词元和文档,使用子查询进行规范化,使用聚合进行评分,并利用词元名称、文档类型和权重上的索引。我们使用参数绑定来确保安全 (防止 SQL 注入)。

The query uses JOINs to connect tokens and documents, subqueries for normalization, aggregation for scoring, and indexes on token name, document type, and weight. We use parameter binding for security (prevents SQL injection).

我们将在下一节看到完整的查询。

We'll see the full query in the next section.

然后,主 `search()` 方法返回结果:

The main `search()` method then returns the results:

```php
        // 6. 返回结果
        return $results;
    }
}
```


### 评分算法

### The Scoring Algorithm

评分算法平衡了多个因素。让我们一步步分解。

The scoring algorithm balances multiple factors. Let's break it down step by step.

基础分数是所有匹配词元权重的总和:

The base score is the sum of all matched token weights:

```sql
SELECT
    sd.document_id,
    SUM(sd.weight) as base_score
FROM index_entries sd
INNER JOIN index_tokens st ON sd.token_id = st.id
WHERE sd.document_type = ?
  AND st.name IN (?, ?, ?)  -- 查询词元
GROUP BY sd.document_id
```

- `sd.weight`:来自 `index_entries` (field_weight × tokenizer_weight × ceil(sqrt(token_length)))

- `sd.weight`: from index_entries (field_weight × tokenizer_weight × ceil(sqrt(token_length)))

为什么不乘以 `st.weight`?分词器权重在索引时已经包含在 `sd.weight` 中了。来自 `index_tokens` 的 `st.weight` 仅在完整 SQL 查询的 WHERE 子句中用于过滤 (确保至少有一个权重 >= minTokenWeight 的词元)。

Why not multiply by `st.weight`? The tokenizer weight is already included in `sd.weight` during indexing. The `st.weight` from `index_tokens` is used only in the full SQL query's WHERE clause for filtering (ensures at least one token with weight >= minTokenWeight).

这给了我们原始分数。但我们需要的不仅仅是这个。

This gives us the raw score. But we need more than that.

我们添加一个词元多样性加成。匹配更多唯一词元的文档得分更高:

We add a token diversity boost. Documents matching more unique tokens score higher:

```sql
(1.0 + LOG(1.0 + COUNT(DISTINCT sd.token_id))) * base_score
```

为什么?一个匹配 5 个不同词元的文档比一个匹配同一词元 5 次的文档更相关。LOG 函数使这个加成呈对数增长——匹配 10 个词元不会得到 10 倍的加成。

Why? A document matching 5 different tokens is more relevant than one matching the same token 5 times. The LOG function makes this boost logarithmic—matching 10 tokens doesn't give 10x the boost.

我们还添加一个平均权重质量加成。具有更高质量匹配的文档得分更高:

We also add an average weight quality boost. Documents with higher quality matches score higher:

```sql
(1.0 + LOG(1.0 + AVG(sd.weight))) * base_score
```

为什么?一个具有高权重匹配 (例如,标题匹配) 的文档比一个具有低权重匹配 (例如,内容匹配) 的文档更相关。同样,LOG 使其呈对数增长。

Why? A document with high-weight matches (e.g., title matches) is more relevant than one with low-weight matches (e.g., content matches). Again, LOG makes this logarithmic.

我们应用一个文档长度惩罚,防止长文档占主导地位:

We apply a document length penalty. It prevents long documents from dominating:

```sql
base_score / (1.0 + LOG(1.0 + doc_token_count.token_count))
```

为什么?一个 1000 字的文档不会仅仅因为它有更多的词元就自动击败一个 100 字的文档。LOG 函数使这个惩罚呈对数增长——一个 10 倍长的文档不会受到 10 倍的惩罚。

Why? A 1000-word document doesn't automatically beat a 100-word document just because it has more tokens. The LOG function makes this penalty logarithmic—a 10x longer document doesn't get 10x the penalty.

最后,我们通过除以最大分数来进行归一化:

Finally, we normalize by dividing by the maximum score:

```sql
score / GREATEST(1.0, max_score) as normalized_score
```

这给了我们一个 0-1 的范围,使得不同查询的分数具有可比性。

This gives us a 0-1 range, making scores comparable across different queries.

完整的公式如下:

The full formula looks like this:

```sql
SELECT
    sd.document_id,
    (
        SUM(sd.weight) *                                    -- 基础分数
        (1.0 + LOG(1.0 + COUNT(DISTINCT sd.token_id))) *    -- 词元多样性加成
        (1.0 + LOG(1.0 + AVG(sd.weight))) /                 -- 平均权重质量加成
        (1.0 + LOG(1.0 + doc_token_count.token_count))      -- 文档长度惩罚
    ) / GREATEST(1.0, max_score) as score                   -- 归一化
FROM index_entries sd
INNER JOIN index_tokens st ON sd.token_id = st.id
INNER JOIN (
    SELECT document_id, COUNT(*) as token_count
    FROM index_entries
    WHERE document_type = ?
    GROUP BY document_id
) doc_token_count ON sd.document_id = doc_token_count.document_id
WHERE sd.document_type = ?
  AND st.name IN (?, ?, ?)  -- 查询词元
  AND sd.document_id IN (
      SELECT DISTINCT document_id
      FROM index_entries sd2
      INNER JOIN index_tokens st2 ON sd2.token_id = st2.id
      WHERE sd2.document_type = ?
        AND st2.name IN (?, ?, ?)
        AND st2.weight >= ?  -- 确保至少有一个有意义权重的词元
  )
GROUP BY sd.document_id
ORDER BY score DESC
LIMIT ?
```

为什么要有带 `st2.weight >= ?` 的子查询?这确保我们只包括那些至少有一个匹配词元具有有意义分词器权重的文档。没有这个过滤器,一个只匹配低优先级词元 (比如权重为 1 的 n-grams) 的文档,即使它不匹配任何高优先级词元 (比如权重为 20 的单词),也会被包括进来。这个子查询过滤掉了那些只匹配噪音的文档。我们想要的是至少匹配一个有意义词元的文档。

Why the subquery with `st2.weight >= ?`? This ensures we only include documents that have at least one matching token with a meaningful tokenizer weight. Without this filter, a document matching only low-priority tokens (like n-grams with weight 1) would be included even if it doesn't match any high-priority tokens (like words with weight 20). This subquery filters out documents that only match noise. We want documents that match at least one meaningful token.

为什么用这个公式?它平衡了多个相关性因素。精确匹配得分高,但匹配多个词元的文档也得分高。长文档不会占主导,但高质量的匹配会。

Why this formula? It balances multiple factors for relevance. Exact matches score high, but so do documents matching many tokens. Long documents don't dominate, but high-quality matches do.

如果没有权重为 10 的结果,我们会用权重 1 重试 (作为边缘情况的后备)。

If no results with weight 10, we retry with weight 1 (fallback for edge cases).

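
文章没有展示这个后备重试本身的代码。基于上面 `executeSearch()` 的签名,一个可能的写法如下 (其中最小权重 10 和 1 取自上文的描述):

The article doesn't show the fallback retry itself. Based on the `executeSearch()` signature above, one possible way to write it (the minimum weights 10 and 1 come from the description above):

```php
// 可能的后备逻辑草图 / possible sketch of the fallback logic
$results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 10);

if (empty($results)) {
    // 没有匹配到任何"有意义"的词元时,放宽最小分词器权重再试一次
    // If nothing matched a "meaningful" token, relax the minimum tokenizer weight and retry
    $results = $this->executeSearch($documentType, $tokenValues, count($tokenValues), $limit, 1);
}

return $results;
```
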

### 将 ID 转换为文档

### Converting IDs to Documents

搜索服务返回带有文档 ID 和分数的 `SearchResult` 对象:

The search service returns `SearchResult` objects with document IDs and scores:

```php
class SearchResult
{
    public function __construct(
        public readonly int $documentId,
        public readonly float $score
    ) {}
}
```

但我们需要的是实际的文档,而不仅仅是 ID。我们使用仓库 (repositories) 来转换它们:

But we need actual documents, not just IDs. We convert them using repositories:

```php
// 执行搜索
$searchResults = $this->searchService->search(
    DocumentType::POST,
    $query,
    $limit
);

// 从搜索结果中获取文档 ID (保留顺序)
$documentIds = array_map(fn($result) => $result->documentId, $searchResults);

// 根据 ID 获取文档 (保留搜索结果的顺序)
$documents = $this->documentRepository->findByIds($documentIds);
```

为什么要保留顺序?搜索结果是按相关性分数排序的。我们希望在显示结果时保持这个顺序。

Why preserve order? The search results are sorted by relevance score. We want to keep that order when displaying results.

仓库方法处理转换:

The repository method handles the conversion:

```php
public function findByIds(array $ids): array
{
    if (empty($ids)) {
        return [];
    }

    return $this->createQueryBuilder('d')
        ->where('d.id IN (:ids)')
        ->setParameter('ids', $ids)
        ->orderBy('FIELD(d.id, :ids)') // 保留 ID 数组的顺序
        ->getQuery()
        ->getResult();
}
```

`FIELD()` 函数保留了 ID 数组的顺序,所以文档会以与搜索结果相同的顺序出现。

The `FIELD()` function preserves the order from the IDs array, so documents appear in the same order as search results.

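
需要注意,`FIELD()` 是 MySQL 的函数,在 Doctrine 的 DQL 中通常需要注册自定义 DQL 函数才能使用。如果不想依赖它,一个可移植的替代做法是取回实体后在 PHP 里按 ID 顺序重排,示意如下 (这里假设实体提供 `getId()` 方法):

Note that `FIELD()` is a MySQL function, and using it in Doctrine's DQL usually requires registering a custom DQL function. If you'd rather not depend on it, a portable alternative is to reorder the entities in PHP after fetching them, roughly like this (assuming the entity exposes a `getId()` method):

```php
// 可移植的替代方案:在 PHP 中按 $ids 的顺序重排
// Portable alternative: reorder in PHP by the order of $ids
$positions = array_flip($ids); // id => 在搜索结果中的位置 / position in the search results

usort($documents, fn($a, $b) => $positions[$a->getId()] <=> $positions[$b->getId()]);
```
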

---

## 结果:你得到了什么

## The Result: What You Get

你得到的是一个这样的搜索引擎:

What you get is a search engine that:

- **快速找到相关结果** (利用数据库索引)
- **处理拼写错误** (n-grams 捕捉部分匹配)
- **处理部分单词** (前缀分词器)
- **优先考虑精确匹配** (单词分词器权重最高)
- **与现有数据库配合使用** (无需外部服务)
- **易于理解和调试** (一切都是透明的)
- **完全控制行为** (调整权重、添加分词器、修改评分)

- **Finds relevant results quickly** (leverages database indexes)
- **Handles typos** (n-grams catch partial matches)
- **Handles partial words** (prefix tokenizer)
- **Prioritizes exact matches** (word tokenizer has highest weight)
- **Works with existing database** (no external services)
- **Easy to understand and debug** (everything is transparent)
- **Full control over behavior** (adjust weights, add tokenizers, modify scoring)

---

## 扩展系统

## Extending the System

想添加一个新的分词器?实现 `TokenizerInterface`:

Want to add a new tokenizer? Implement `TokenizerInterface`:

```php
class StemmingTokenizer implements TokenizerInterface
{
    public function tokenize(string $text): array
    {
        // 你的词干提取逻辑在这里
        // 返回 Token 对象数组
    }

    public function getWeight(): int
    {
        return 15; // 你的权重
    }
}
```

在你的服务配置中注册它,它就会自动用于索引和搜索。

Register it in your services configuration, and it's automatically used for both indexing and searching.

想添加一个新的文档类型?实现 `IndexableDocumentInterface`:

Want to add a new document type? Implement `IndexableDocumentInterface`:

```php
class Comment implements IndexableDocumentInterface
{
    // getDocumentId() 和 getDocumentType() 的实现方式与 Post 类似,此处省略
    // getDocumentId() and getDocumentType() are implemented like in Post and omitted here

    public function getIndexableFields(): IndexableFields
    {
        return IndexableFields::create()
            ->addField(FieldId::CONTENT, $this->content ?? '', 5);
    }
}
```

想调整权重?更改配置。想修改评分?编辑 SQL 查询。一切都在你的掌控之中。

Want to adjust weights? Change the configuration. Want to modify scoring? Edit the SQL query. Everything is under your control.

---

## 结论

## Conclusion

所以,这就是它了。一个简单而真正有效的搜索引擎。它不花哨,也不需要大量的基础设施,但对于大多数用例来说,它是完美的。

So there you have it. A simple search engine that actually works. It's not fancy, and it doesn't need a lot of infrastructure, but for most use cases, it's perfect.

关键的见解?有时候最好的解决方案是你理解的那个。没有魔法,没有黑箱,只有直白的、说到做到的代码。

The key insight? Sometimes the best solution is the one you understand. No magic, no black boxes, just straightforward code that does what it says.

你拥有它,你控制它,你可以调试它。而这,价值连城。

You own it, you control it, you can debug it. And that's worth a lot.