简单的模式匹配技术,用于改进搜索框






4.33/5 (3投票s)
简单的模式匹配技术,用于搜索建议框
引言
毫无疑问,每个人都喜欢谷歌搜索引擎及其搜索框建议。智能友好的搜索匹配建议带来了绝佳的用户体验。我们很少看到没有搜索框的网站。讨论一些可以改善搜索体验的内容会更有价值。
模式匹配是一种在计算机时代占有一席之地的技术。模式匹配是我们大脑自己的方式,希望它能帮助我们的网站更好地与我们沟通。这里我们讨论一种可以用于搜索建议框的简单模式匹配技术。
背景
如果我们研究生物信息学,可能会熟悉一个词“序列比对”。它只是比对两个基因并比较它们的相似性。这对于搜索由字母序列组成的单词是正确的。事实上,生物信息学可能就是从计算机算法中借鉴了模式匹配。
一旦我们以这种方式比对两个单词,就可以根据匹配情况进行评分,并对不匹配的地方进行惩罚。通过这种方式,我们可以根据匹配分数对匹配的单词列表进行排序。
例如,让我们输入“sugest”,它可能与“suggest”或“sugging”匹配。让我们比对这些词。
sug_est
suggest
上面的词以最匹配的方式对齐,并且在搜索键中有一个间隙。当我们对齐另一对时,我们有四个间隙,如下所示。
sug____est
sugging
所以我们可以建议第一对更匹配。这种模式匹配不能通过简单的 SQL 'LIKE %sugest%' 或许多网站使用的预定义正则表达式模式来实现。在搜索时,尤其是在输入电影、产品、人物等名称时,经常会拼错或漏掉字母。
所以让我们尝试改进模式匹配搜索查询,下面进一步讨论。
Using the Code
此代码是 MySQL 存储过程。但这个概念不限于 MySQL,可以扩展到任何数据库,如 SQL Server、Oracle 等。一开始,代码使用精确的单词来搜索匹配项。
SET searchWord=CONCAT('%',' ', searchWord,'%');
这个搜索模式搜索包含用户输入的精确字符的单词,允许在其两侧有任何其他单词。如果搜索结果记录数量足够,它将不再进一步搜索。下一步,它会遍历用户输入的字符长度,并在字符之间插入百分号(%),然后在每次迭代中查询数据库。这允许在插入位置有多个字符,并在初始搜索失败时增加找到匹配项的机会。例如,每次迭代的搜索词可能看起来像 winnin%g, winni%ng, winn%ing.......w%inning。这里有两点需要注意:首先,插入从最后一个字符开始到第一个字符。这是为了保持初始输入字符的完整性。例如,如果用户输入“wining”,那么它可能是“从头开始赢”(由“win%ing”产生),而不是“战争正在开始”(由“w%ining”产生)。接下来,它不会保留上一次迭代中插入的百分号,因为过多的百分号会导致不相关的匹配。例如,“h%a%p%p%y”可以是任何东西,但不一定是“happy”。在每次迭代中,它都会检查是否获得了足够的结果并停止进一步处理。
loop1:WHILE (i<(str_len-3)) DO
/* Limit check to avoid unnecessary loops */
IF totalResultCount>resultLimit THEN
LEAVE loop1;
END IF;
SET i=i+1;
/*Addition of % in the search key */
SET formattedString=INSERT(searchWord,str_len-i,0,'%');
/*Get the matched ones. Watch for the result limit
Result has a score for each matched selection*/
CALL getResult(formattedString,totalResultCount,resultLimit);
END WHILE loop1;
代码还在每次迭代中对结果进行评分,如果插入的字符过多,可以使用阈值惩罚级别拒绝匹配。这可以防止结果中出现不匹配的匹配项。
/*at the procedure getResult */
IF score>lowScoreThreshold THEN
INSERT INTO RESULT (word,score) VALUES (tempWord,score);
END IF;
接下来,代码进入第二组迭代,其中一切都与第一次相同,但它用“%”替换一个字符,而不是仅仅在之间插入。此迭代考虑了单个字符的拼写错误,例如“scoreing”而不是“scoring”。之所以在单独的迭代中执行此任务,是因为我们假设拼写错误可能不会发生,并且不需要给予太多重视。因此,此迭代仅在以上所有搜索都未能产生足够的匹配项时才会发生。
SET i=0;
loop2:WHILE (i<(str_len-3)) DO
IF totalResultCount>resultLimit THEN
LEAVE loop2;
END IF;
SET i=i+1;
SET formattedString=INSERT(searchWord,str_len-i,1,'%');
CALL getResult(formattedString,totalResultCount,resultLimit);
END WHILE loop2;
在这些迭代中,会调用一个名为“getResult
”的存储过程来查询数据库。
DECLARE word_list CURSOR FOR SELECT word, scoreWords(sanitizeSearchKeyWord(searchWord),_
word) AS score FROM words WHERE word LIKE searchWord OR word LIKE CONCAT_
(sanitizeSearchKeyWord(searchWord) ,'%');
此查询包含其他存储函数,例如 scoreWord
,它对结果匹配进行评分。此查询在 where
子句中有两部分,一部分带有开头空格,另一部分没有。这是为了在搜索键为“Happ”时,同时获取“Are you Happy”(% Happ%)和“Happy X-Mas”(%Happ%)的匹配项。但存在一个问题,即“sin”会产生“basin”。无论如何,这在查询时是不可避免的,评分过程会考虑到它并给出较低的分数。结果词和相应的分数存储在结果表中。
OPEN word_list ;
loop1:LOOP
SET i=i+1;
FETCH word_list INTO tempWord,score;
IF i>resultLimit THEN
LEAVE loop1;
END IF;
IF done=1 THEN
LEAVE loop1;
END IF;
IF score>lowScoreThreshold THEN
INSERT INTO RESULT (word,score) VALUES (tempWord,score);
END IF;
IF duplicate_key<>1 THEN
SET totalCount=totalCount+1;
END IF;
END LOOP loop1;
CLOSE word_list;
让我们讨论一下评分函数。用户输入的可能是一个单词,但匹配结果可能是一个句子。将整个句子包含在计算匹配分数中是没有意义的。因此,需要提取匹配搜索键的单词。困难在于结果不包含精确的搜索键。所以它调用另一个函数“locateSearchWord
”。一旦找到单词,它会从搜索键中删除所有“%”并获取其长度。然后将搜索键的长度与结果匹配进行比较。长度上的差异等于结果匹配中存在的新字符,这些字符被视为惩罚并相应地计算分数。
SET resultPart= locateSearchWord(searchWord,resultWord);
/*Find the length of the search key and the matched word resulted */
SET count1=LENGTH(REPLACE(searchWord,'%',''));
SET count2=LENGTH(resultPart);
SET diffCount=count2-count1;
IF diffCount<0 THEN
SET diffCount=diffCount*-1;
END IF;
/*The difference between the lengths is the indication of unmatched
chars which are subject to penalty */
IF LEFT(REPLACE(searchWord,'%',''),1)<>LEFT(resultPart,1) THEN
SET penalty==1*count1;
ELSE
SET penalty=diffCount * gapPenalty;
END IF
SET score=count1+penalty;
/*if the search key and the matched result are same then the score
will be equal to the length of the search key */
/*score is converted to percentage */
SET score=(score/count1)*100;
RETURN(score);
得分高于阈值的匹配结果将存储在一个表中,并进行查询以在 Ajax 建议框中提供建议。这个临时表可能有一个会话 ID 列,这样就可以处理多个用户的结果。通过这种方式,这个小工具可以有效地用于 Ajax 建议框。
概念验证
附带了一个包含示例表和存储过程的 SQL 导出。设置好 MySQL 数据库后,您可以对其进行测试。如果您的表中有单词“winning”,那么您运行 call suggest('wining'),它仍然会建议“winning”;或者如果您的表中有单词“happy”,并且用户错误地输入了“happty”,它仍然会建议“happy”。请在 MySQL 提示符下尝试更多示例。
call suggest('Happi')
call suggest('Happty')
call suggest('Hapy') etc.,
存储过程的顶部有一个定义者声明,如 'CREATE DEFINER=`root`@`localhost` PROCEDURE `suggest`'
。请根据您的用户名和主机进行更改。
更多想法
当我们看 Google 时,它的建议框中有很多智能功能。所以搜索不限于少数逻辑。它对新想法持开放态度。让我们看几个。
可以形成一个将搜索键和用户选择的建议关联起来的神经网络。例如,用户输入“新年”,并从建议中选择“新年祝福”,然后我们可以衡量搜索键与结果的权重,当这个权重超过神经网络节点上的阈值时,搜索键将作为有效匹配存储起来。随着数据的增长,它可以用于建议。
当用户搜索某些内容时,该搜索键可以与其区域相关联并加权。如果使用该键的用户数量超过定义的阈值,那么该信息可以用于建议本地重要的匹配项,就像 Google 和 YouTube 所做的那样。
这里可以添加新的想法,让网站得到我们大脑想要的东西。
注意
这个小小的存储过程并非旨在处理像 Google 数据库那样数百万的数据。本文是针对普通网站而撰写的。
完整代码转储
/*PROCEDURE suggest
Create search keywords and pass to query */
CREATE DEFINER=`root`@`localhost` PROCEDURE `suggest`(searchWord VARCHAR(250))
BEGIN
DECLARE formattedString VARCHAR(250)DEFAULT '';
DECLARE resultString VARCHAR(250) DEFAULT '';
DECLARE str_len INT DEFAULT 0;
DECLARE str_part VARCHAR(250) DEFAULT '';
DECLARE str_part1 VARCHAR(250) DEFAULT '';
DECLARE str_part2 VARCHAR(250) DEFAULT '';
DECLARE i INT DEFAULT 1;
DECLARE totalResultCount INT DEFAULT 0;
DECLARE resultLimit INT DEFAULT 10;
/* Result Table is to store the search results temporarily.
Its field named 'word'is unique field.
So it may throw duplicate entry error.
To prevent that a error handler is declared */
/*Result Table may have an id column which can facilitate
different sessions can use the same table */
/*delete the previous results */
DELETE FROM RESULT;
/* Initial search pattern */
SET searchWord=CONCAT('%',' ', searchWord,'%');
/*Get the matched ones. Watch for the result limit */
CALL getResult(searchWord,totalResultCount,resultLimit);
SET str_len=LENGTH(searchWord);
loop1:WHILE (i<(str_len-3)) DO
/* Limit check to avoid unnecessary loops */
IF totalResultCount>resultLimit THEN
LEAVE loop1;
END IF;
SET i=i+1;
/*Addition of % in the search key */
SET formattedString=INSERT(searchWord,str_len-i,0,'%');
/*Get the matched ones. Watch for the result limit
Result has a score for each matched selection*/
CALL getResult(formattedString,totalResultCount,resultLimit);
END WHILE loop1;
/*This loop is similar to the above but replace the char at position with % */
/*For examble %hello viewers%, %hello viewer%%, %hello viewe%s%.... %h%llo viewers% */
SET i=0;
loop2:WHILE (i<(str_len-3)) DO
IF totalResultCount>resultLimit THEN
LEAVE loop2;
END IF;
SET i=i+1;
SET formattedString=INSERT(searchWord,str_len-i,1,'%');
CALL getResult(formattedString,totalResultCount,resultLimit)
END WHILE loop2;
END IF;
SELECT word,score FROM RESULT ORDER BY score DESC LIMIT 10;
END
/* PROCEDURE getResult
Execute the search query and store the results temporarily */
CREATE DEFINER=`root`@`localhost` PROCEDURE `getResult`_
(searchWord VARCHAR(250),INOUT totalCount INT,resultLimit INT)
MODIFIES SQL DATA
BEGIN
/*This routine form various search key pattern and store the results in a table */
DECLARE formatted_string VARCHAR(250);
DECLARE result_word VARCHAR(250) DEFAULT '';
DECLARE tempWord VARCHAR(250);
DECLARE done INT DEFAULT 0;
DECLARE score FLOAT DEFAULT 0;
DECLARE lowScoreThreshold INT DEFAULT 30;
DECLARE i INT DEFAULT 0;
DECLARE duplicate_key INT DEFAULT 0;
/* Here is the main search query with the search key passed. It is more
expensive query as it may results in many records and scoring each record */
DECLARE word_list CURSOR FOR SELECT word,scoreWords_
(sanitizeSearchKeyWord(searchWord),word) AS score FROM words _
WHERE word LIKE searchWord OR word LIKE CONCAT_
(sanitizeSearchKeyWord(searchWord) ,'%');
DECLARE CONTINUE HANDLER FOR 1062 /* Duplicate key*/
SET duplicate_key=1;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done=1;
/*Loops through the selected records and validate with a threshold.
This loop only iterate up to the result limit. So not very costly
This records the total records found*/
OPEN word_list ;
loop1:LOOP
SET i=i+1;
FETCH word_list INTO tempWord,score;
IF i>resultLimit THEN
LEAVE loop1;
END IF;
IF done=1 THEN
LEAVE loop1;
END IF;
IF score>lowScoreThreshold THEN
INSERT INTO RESULT (word,score) VALUES (tempWord,score);
END IF;
IF duplicate_key<>1 THEN
SET totalCount=totalCount+1;
END IF;
END LOOP loop1;
CLOSE word_list;
END
/*FUNCTION scoreWords
Calculate the score comparing the searchkey and the resulted match */
CREATE DEFINER=`root`@`localhost` FUNCTION `scoreWords`_
(searchWord VARCHAR(250),resultWord VARCHAR(250)) RETURNS float
BEGIN
/*This routine score the match results */
DECLARE gapPenalty FLOAT DEFAULT -0.5;
DECLARE count1 INT DEFAULT 0;
DECLARE count2 INT DEFAULT 0;
DECLARE diffCount INT DEFAULT 0;
DECLARE penalty FLOAT DEFAULT 0;
DECLARE score FLOAT DEFAULT 0;
DECLARE resultPart VARCHAR(250);
/*Identify the position of matched search key in the resulted sentence */
SET resultPart= locateSearchWord(searchWord,resultWord);
/*Find the length of the seach key and the matched word resulted */
SET count1=LENGTH(REPLACE(searchWord,'%',''));
SET count2=LENGTH(resultPart);
SET diffCount=count2-count1;
IF diffCount<0 THEN
SET diffCount=diffCount*-1;
END IF;
/*The difference between the lengths is the indication of unmatched chars
which are subject to penalty */
IF LEFT(REPLACE(searchWord,'%',''),1)<>LEFT(resultPart,1) THEN
SET penalty==1*count1;
ELSE
SET penalty=diffCount * gapPenalty;
END IF
SET score=count1+penalty;
/*if the search key and the matched result are same then the score
will be equal to the length of the search key */
/*score is converted to percentage */
SET score=(score/count1)*100;
RETURN(score);
END
/*Function locateSearchWord
Locate the matching word from the result to calculate the score */
CREATE DEFINER=`root`@`localhost` FUNCTION `locateSearchWord`_
(searchWord VARCHAR(250),resultWord VARCHAR(250)) RETURNS varchar(250) CHARSET latin1
BEGIN
/* This routine locates the matched searchkey
like part in the resulted sentence */
/*It pass the matched part to score it against the search key */
DECLARE strPart1 VARCHAR(250) DEFAULT '';
DECLARE strPart2 VARCHAR(250) DEFAULT '';
DECLARE tmpWord VARCHAR(250) DEFAULT '';
DECLARE spaceCount INT DEFAULT 0;
DECLARE wordPosition INT DEFAULT 1;
DECLARE resultPart VARCHAR(250) DEFAULT '';
/*The place where we include the % may have any number of characters.
Those may not match with the searchkey
So we take the left and right part of the % from the search key */
SET strPart1=SUBSTRING_INDEX(searchWord,'%',1);
SET strPart2=SUBSTRING_INDEX(searchWord,'%',-1);
/*Finding the spaces i.e. equivalent number of words in the seach key */
SET tmpWord=REPLACE(searchWord,' ','');
SET spaceCount =LENGTH(searchWord)-LENGTH(tmpWord);
/*To find the matched part of the search key in the result
it takes the part which is more lengthy */
IF LENGTH(strPart1)>LENGTH(strPart2) THEN
SET wordPosition=LOCATE(REVERSE(strPart1),REVERSE(resultWord));
IF LOCATE(' ',REVERSE(resultWord),_
wordPosition+LENGTH(strPart1))=0 THEN
SET wordPosition=LENGTH(resultWord);
ELSE
SET wordPosition= LOCATE(' ',_
REVERSE(resultWord),wordPosition+LENGTH(strPart1))-1;
END IF;
SET tmpWord=LEFT(REVERSE(resultWord),wordPosition);
SET resultPart=SUBSTRING_INDEX(tmpWord,' ',-1*(spaceCount+1));
IF resultPart='' THEN
SET resultPart=tmpWord;
END IF;
SET resultPart=REVERSE(resultPart);
ELSEIF LENGTH(strPart1)<LENGTH(strPart2) _
OR(LENGTH(strPart1)=LENGTH(strPart2) AND LENGTH(strPart1)<>LENGTH(searchWord) _
AND LENGTH(strPart2)<>0) THEN
SET wordPosition=LOCATE(strPart2,resultWord);
IF LOCATE(' ',resultWord,wordPosition+LENGTH(strPart2))=0 THEN
SET wordPosition=LENGTH(resultWord);
ELSE
SET wordPosition= LOCATE(' ',_
resultWord,wordPosition+LENGTH(strPart2))-1;
END IF;
SET tmpWord=LEFT(resultWord,wordPosition);
SET resultPart=SUBSTRING_INDEX(tmpWord,' ',-1*(spaceCount+1));
IF resultPart='' THEN
SET resultPart=tmpWord;
END IF;
ELSE
SET wordPosition=LOCATE(searchWord,resultWord);
SET tmpWord=LEFT(resultWord,(wordPosition-1)+LENGTH(searchWord));
SET resultPart=SUBSTRING_INDEX(tmpWord,' ',-1*(spaceCount+1));
IF resultPart='' THEN
SET resultPart=tmpWord;
END IF;
END IF;
RETURN(resultPart);
END
/*FUNCTION sanitizeSeachKeyWord
Just remove the start and trailing % */
CREATE DEFINER=`root`@`localhost` _
FUNCTION `sanitizeSearchKeyWord`(searchWord VARCHAR(250)) _
RETURNS varchar(250) CHARSET latin1
BEGIN
SET searchWord=MID(searchWord,3,LENGTH(searchWord)-3);
RETURN searchWord;
END
关注点
我深受谷歌搜索的启发。除了一个简单的搜索框,它并没有什么花哨的地方,却能做很多事情。这篇文章是我的一点思考,由谷歌启发。
历史
- 2010年12月20日:首次发布