How can I speed up matching sentences against search conditions in Python?

I need to write a small script for colleagues who can't program: they write their own search conditions, then run the script to get the results.

A search condition looks like this: 虚汗 or ((怕 or 畏) and (寒 or 冷))

So far I've tried two approaches, substring (`in`) matching (func1) and regex matching (func2). On a test corpus of 1220 sentences the timings were:

func1 running time: 26.843648433685303

func2 running time: 46.992613554000854

When the corpus is large and there are many conditions (or deeply nested ones), performance doesn't feel acceptable. Any ideas for speeding this up, or a better approach altogether?

Related code files, link: https://pan.baidu.com/s/1c2llxlu password: v7ia

My own code is pasted below:

# -*- coding: utf-8 -*-
import re
import time
import pandas as pd
from itertools import permutations

def str_to_list(p):
    """Convert a condition string into a nested list.

    e.g. 虚汗 or ((怕 or 畏) and (寒 or 冷))
    ->   ['虚汗', 'or', [['怕', 'or', '畏'], 'and', ['寒', 'or', '冷']]]
    """
    # Quote every token (keywords and the and/or operators alike).
    p = re.sub(r'([^()()\s]+)', r'"\1",', p)
    # Turn parentheses (full-width or half-width) into list brackets.
    p = p.replace('(', '[').replace('(', '[').replace(')', '],').replace(')', '],')
    p = '[%s]' % p
    return eval(p)

def list_to_regex(p):
    """Recursively convert a nested condition list into a regex string.

    e.g. ['虚汗', 'or', [['怕', 'or', '畏'], 'and', ['寒', 'or', '冷']]]
    ->   (虚汗|((怕|畏).*?(寒|冷)|(寒|冷).*?(怕|畏)))
    """
    tempP = [list_to_regex(x) if isinstance(x, list) else x
             for x in p if x not in ('or', 'and')]
    if 'and' in p:
        # 'and' is order-independent, so enumerate every ordering.
        tempP = permutations(tempP)
        return '(%s)' % '|'.join('.*?'.join(x) for x in tempP)
    else:
        return '(%s)' % '|'.join(tempP)

def match_sentence(p, sentence):
    """Match a sentence by rewriting the condition as `in` tests."""
    words = (p.replace(' ', '')
              .replace(',', '')
              .replace('(', '(').replace(')', ')')  # normalize full-width parens
              .replace('and', ',and,').replace('or', ',or,')
              .replace('(', ',(,').replace(')', ',),')
              .split(','))

    scriptStr = [w if w in 'and or ()'
                 else '"%s" in "%s"' % (w, sentence) for w in words]

    if eval(' '.join(scriptStr)):
        return True
    return False

def func1(patternFile, sentenceFile):
    """Convert the conditions to regexes, then match.

    patternFile  -- Excel file holding the search conditions
    sentenceFile -- Excel file holding the corpus
    """
    dfS = pd.read_excel(sentenceFile)
    dfP = pd.read_excel(patternFile)
    # Pre-compiled regex list, one pattern per condition
    regexList = [re.compile(list_to_regex(str_to_list(x)))
                 for x in dfP.iloc[:, -1]]
    resultFile = 'result1.txt'
    with open(resultFile, 'a', encoding='utf-8') as f:
        for senIdx in range(len(dfS)):
            sentence = dfS.iloc[senIdx, -1]
            for i, patt in enumerate(regexList):
                if patt.search(sentence):
                    keyword = dfP.iloc[i, -2]
                    f.write('%s\t%s\n' % (keyword, sentence))

def func2(patternFile, sentenceFile):
    """Match via the `in` rewriting.

    patternFile  -- Excel file holding the search conditions
    sentenceFile -- Excel file holding the corpus
    """
    dfS = pd.read_excel(sentenceFile)
    dfP = pd.read_excel(patternFile)
    resultFile = 'result2.txt'
    with open(resultFile, 'a', encoding='utf-8') as f:
        for senIdx in range(len(dfS)):
            sentence = dfS.iloc[senIdx, -1]
            for pattIdx in range(len(dfP)):
                if match_sentence(dfP.iloc[pattIdx, -1], sentence):
                    keyword = dfP.iloc[pattIdx, -2]
                    f.write('%s\t%s\n' % (keyword, sentence))

if __name__ == '__main__':
    # test run
    patternFile = '检索条件_测试.xlsx'
    sentenceFile = '语料_测试.xlsx'
    t1 = time.time()
    func2(patternFile, sentenceFile)
    t2 = time.time()
    print('func2 running time:', t2 - t1)
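As an aside on the eval() cost in match_sentence: instead of building and eval()-ing a fresh expression string for every (condition, sentence) pair, each condition could be parsed once into a predicate function and reused across the whole corpus. A minimal sketch, assuming the nested-list form produced by str_to_list above (compile_condition is a hypothetical helper, not part of the script):

```python
def compile_condition(node):
    """Recursively compile a nested condition list into a predicate.

    ['虚汗', 'or', [['怕', 'or', '畏'], 'and', ['寒', 'or', '冷']]]
    becomes a function sentence -> bool, built once per condition.
    """
    if isinstance(node, str):
        # Leaf: a plain keyword, matched by substring containment.
        return lambda s, word=node: word in s
    # Compile the operands; the remaining items are the operator tokens.
    preds = [compile_condition(x) for x in node if x not in ('and', 'or')]
    if 'and' in node:
        return lambda s: all(p(s) for p in preds)
    return lambda s: any(p(s) for p in preds)

cond = ['虚汗', 'or', [['怕', 'or', '畏'], 'and', ['寒', 'or', '冷']]]
pred = compile_condition(cond)
print(pred('他怕冷又怕风'))  # True: contains 怕 and 冷
print(pred('今天天气不错'))  # False: no keyword matches
```

The per-sentence work then shrinks to plain `in` tests plus function calls, with no string building or parsing inside the hot loop.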



2 replies

Build an inverted index with a dict, mapping each word to the indices of the sentences that contain it. At query time, tokenize the query and intersect the posting sets. That cuts the complexity from O(N*M) down to O(K), where K is the average number of index entries for the query words.

class SentenceSearcher:
    def __init__(self, sentences):
        self.sentences = sentences
        self.index = {}
        self.build_index()
    
    def build_index(self):
        """建立倒排索引"""
        for idx, sentence in enumerate(self.sentences):
            words = set(sentence.lower().split())  # naive whitespace tokenization, deduplicated
            for word in words:
                if word not in self.index:
                    self.index[word] = set()
                self.index[word].add(idx)
    
    def search(self, query):
        """检索包含所有查询词的句子"""
        query_words = set(query.lower().split())
        if not query_words:
            return []
        
        # Collect the posting set for each query word
        result_sets = []
        for word in query_words:
            if word in self.index:
                result_sets.append(self.index[word])
            else:
                return []  # a word is missing from the index, so nothing can match
        
        # Intersect the posting sets
        if result_sets:
            result_indices = set.intersection(*result_sets)
            return [self.sentences[i] for i in sorted(result_indices)]
        return []

# Usage example
sentences = [
    "Python is a great programming language",
    "I love Python programming",
    "Java is also a good language",
    "Python and Java are both popular"
]

searcher = SentenceSearcher(sentences)
print(searcher.search("Python language"))  # 输出包含这两个词的句子

The core idea is trading space for time: build the index during preprocessing, then just intersect sets at query time. For very large corpora you can also use more efficient structures such as a trie or a Bloom filter.
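The AND-only search above doesn't yet cover the asker's nested and/or conditions, but the same posting sets can be combined with set intersection (and) and union (or). A sketch under that assumption, where build_index, postings, and eval_condition are hypothetical names; the index here is character-level, so multi-character keywords are verified with a final substring check:

```python
def build_index(sentences):
    """Map each character to the set of sentence indices containing it."""
    index = {}
    for idx, sentence in enumerate(sentences):
        for ch in set(sentence):
            index.setdefault(ch, set()).add(idx)
    return index

def postings(index, sentences, word):
    """Sentence indices that contain `word` as a substring.

    Character sets give cheap candidates; the `in` check then rules out
    sentences where the characters appear but not contiguously.
    """
    sets = [index.get(ch, set()) for ch in set(word)]
    candidates = set.intersection(*sets)  # assumes word is non-empty
    return {i for i in candidates if word in sentences[i]}

def eval_condition(node, index, sentences):
    """Evaluate a nested and/or condition to a set of sentence indices."""
    if isinstance(node, str):
        return postings(index, sentences, node)
    parts = [eval_condition(x, index, sentences)
             for x in node if x not in ('and', 'or')]
    if 'and' in node:
        return set.intersection(*parts)
    return set.union(*parts)

sentences = ['平时怕冷出虚汗', '他畏寒肢冷', '今天天气不错']
index = build_index(sentences)
cond = ['虚汗', 'or', [['怕', 'or', '畏'], 'and', ['寒', 'or', '冷']]]
print(sorted(eval_condition(cond, index, sentences)))  # [0, 1]
```

Each condition then touches only the posting sets of its own keywords rather than scanning every sentence, which is where the speed-up over the O(N*M) loop comes from.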

Searching sentences with an inverted index is much faster.


Wouldn't a full-text search engine be a better fit?
