How to build an open-domain question answering system with DrQA in Python

DrQA is a reading comprehension system applied to open-domain question answering.

The project is published by https://github.com/facebookresearch
Project URL: https://github.com/facebookresearch/DrQA

DrQA is a system for reading comprehension applied to open-domain question answering. In particular, DrQA is targeted at the task of “machine reading at scale” (MRS). In this setting, we are searching for an answer to a question in a potentially very large corpus of unstructured documents (that may not be redundant). Thus the system has to combine the challenges of document retrieval (finding the relevant documents) with that of machine comprehension of text (identifying the answers from those documents).

Our experiments with DrQA focus on answering factoid questions while using Wikipedia as the unique knowledge source for documents. Wikipedia is a well-suited source of large-scale, rich, detailed information. In order to answer any question, one must first retrieve the few potentially relevant articles among more than 5 million, and then scan them carefully to identify the answer.

vueper, reply #1

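DrQA is a good choice for building an open-domain question answering system. It works in two main steps: document retrieval and reading comprehension.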

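First, the retriever module finds the relevant passages in a large document collection. By default DrQA uses TF-IDF over n-gram features (hashed bigrams); you can also swap in BM25 or vector retrieval. Then the reader module extracts the answer from the candidate passages. Note that DrQA's own reader is a multi-layer recurrent neural network, not BERT. You can substitute a pretrained BERT model for it, but that is a change you make yourself, not something DrQA ships.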
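To make the retrieval step more concrete, here is a minimal pure-Python sketch of TF-IDF ranking over unigrams and bigrams. It is not DrQA's implementation (DrQA hashes its n-grams into a sparse matrix); the names `ngrams`, `TfidfSketch` and `closest_docs` are made up for illustration, and the smoothed IDF formula is just one common choice.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Lowercased word unigrams and bigrams (hypothetical helper)."""
    words = [w.strip(".,?!") for w in text.lower().split()]
    words = [w for w in words if w]
    grams = list(words)
    for n in range(2, n_max + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams


class TfidfSketch:
    """Toy in-memory TF-IDF ranker; an illustration, not DrQA's ranker."""

    def __init__(self, docs):
        self.docs = docs
        self.doc_counts = [Counter(ngrams(d)) for d in docs]
        # Document frequency: in how many documents each n-gram occurs.
        df = Counter()
        for counts in self.doc_counts:
            df.update(counts.keys())
        n = len(docs)
        # Smoothed IDF (an assumed variant, always positive).
        self.idf = {g: math.log((n + 1) / (c + 1)) + 1.0 for g, c in df.items()}

    def _vector(self, counts):
        # Weight = log(1 + tf) * idf; n-grams unseen in the corpus are dropped.
        return {g: math.log1p(tf) * self.idf[g]
                for g, tf in counts.items() if g in self.idf}

    def closest_docs(self, query, k=1):
        """Return the ids of the top-k documents with a non-zero score."""
        q_vec = self._vector(Counter(ngrams(query)))
        scores = []
        for doc_id, counts in enumerate(self.doc_counts):
            d_vec = self._vector(counts)
            score = sum(w * d_vec.get(g, 0.0) for g, w in q_vec.items())
            scores.append((score, doc_id))
        scores.sort(key=lambda s: (-s[0], s[1]))
        return [doc_id for score, doc_id in scores[:k] if score > 0]


docs = [
    "Python is an interpreted programming language created by Guido van Rossum.",
    "DrQA is a question answering system open-sourced by Facebook.",
    "BERT is a pretrained language model proposed by Google.",
]
index = TfidfSketch(docs)
print(index.closest_docs("Who created Python?", k=1))
```

Rare n-grams get a higher IDF, so a question that shares an unusual word or word pair with a document pulls that document to the top. Bigrams add a little word-order information that plain bag-of-words matching lacks.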

Below is a simplified code example:

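from drqa.reader import Predictor

# 1. Prepare the document collection
documents = [
    "Python is an interpreted programming language created by Guido van Rossum.",
    "DrQA is a question answering system open-sourced by Facebook, with a retriever and a reader component.",
    "BERT is a pretrained language model proposed by Google that performs well on many NLP tasks."
]

# 2. Build a retriever (simplified here; a real system needs an index)
class SimpleRetriever:
    def __init__(self, docs):
        self.docs = docs

    def get_docs(self, question, k=2):
        # Naive keyword overlap; use TF-IDF or vector retrieval in practice
        q_words = set(question.lower().strip("?").split())
        scored = []
        for doc_id, doc in enumerate(self.docs):
            d_words = set(doc.lower().replace(",", " ").replace(".", " ").split())
            scored.append((len(q_words & d_words), doc_id))
        # Highest overlap first; ties broken by document order
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [(doc_id, self.docs[doc_id]) for overlap, doc_id in scored[:k] if overlap > 0]

# 3. Load the reading comprehension model
# Assumes DrQA is installed and that its pretrained reader model and tokenizer
# dependencies have been downloaded; None selects DrQA's defaults.
reader = Predictor(model=None, tokenizer=None)
# reader.cuda()  # uncomment to run the model on a GPU

doc_retriever = SimpleRetriever(documents)

# 4. Question answering flow
def answer_question(question):
    # Retrieve the relevant documents
    contexts = doc_retriever.get_docs(question)

    # Read each candidate document and keep the highest-scoring span
    best = None
    for doc_id, text in contexts:
        # predict(document, question) returns a list of (span, score) pairs
        for span, score in reader.predict(text, question, top_n=1):
            if best is None or score > best[1]:
                best = (span, score, doc_id)

    if best is not None:
        return best[0], best[2]
    return "No answer found", -1

# Test
question = "Who created Python?"
answer, doc_id = answer_question(question)
print(f"Question: {question}")
print(f"Answer: {answer} (from document {doc_id})")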

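For a real deployment, build a proper index for the retrieval part with DrQA's DocDB and TfidfDocRanker. The reader could also be replaced with a transformer model such as RoBERTa or ALBERT, although that goes beyond what DrQA itself provides. On the data side, I'd suggest fine-tuning the model on a dataset such as SQuAD or Natural Questions.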
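As far as I know, DrQA's DocDB keeps the article text in a SQLite database that the retriever looks up by document id. The sketch below shows that idea with the standard `sqlite3` module only. `TinyDocStore` and its method names are hypothetical and this is not DrQA's class, just a rough picture of what a document store does.

```python
import sqlite3


class TinyDocStore:
    """Hypothetical SQLite-backed document store (not DrQA's DocDB)."""

    def __init__(self, path=":memory:"):
        # ":memory:" keeps everything in RAM; pass a file path to persist.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, text TEXT)"
        )

    def add(self, doc_id, text):
        # Insert a document, overwriting any existing one with the same id.
        self.conn.execute(
            "INSERT OR REPLACE INTO documents VALUES (?, ?)", (doc_id, text)
        )
        self.conn.commit()

    def get_doc_ids(self):
        rows = self.conn.execute("SELECT id FROM documents ORDER BY id").fetchall()
        return [row[0] for row in rows]

    def get_doc_text(self, doc_id):
        row = self.conn.execute(
            "SELECT text FROM documents WHERE id = ?", (doc_id,)
        ).fetchone()
        return row[0] if row else None

    def close(self):
        self.conn.close()


store = TinyDocStore()
store.add("Python", "Python is an interpreted programming language created by Guido van Rossum.")
store.add("DrQA", "DrQA is a question answering system open-sourced by Facebook.")
print(store.get_doc_ids())
print(store.get_doc_text("DrQA"))
```

Keeping the text in a database and only the index in memory is what lets the retriever work over millions of articles: the ranker returns ids, and only the few top-ranked texts are fetched and handed to the reader.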

In short: first find the relevant documents, then use a model to read them and find the answer.