Converting doc to txt with antiword in Python: how do I fix garbled Chinese output?

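Problem description

I call antiword from Python to convert a doc file to txt, and the returned output is garbled. Running antiword directly in PowerShell is garbled too, but calling it from git bash works fine.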

Environment and what I have tried

Environment: Windows, Python 3.6. I tried changing the locale and also tried antiword's -m option, but neither helped.

Relevant code

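import subprocess

def doc_to_bytes(filename):  # placeholder name; the original post omitted the def line
    # Run antiword and capture its raw output; stdout is bytes, not str.
    pipe = subprocess.Popen(
        ['antiword', filename],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    stdout, stderr = pipe.communicate()
    return stdout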


yuanlaile, #1

pdf2text

nodeper, #2

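When antiword garbles a Chinese document, the thing to get right is the character mapping it uses for its output. antiword picks a default mapping file from the locale, and when it cannot determine one it falls back to a Western European table, which cannot represent Chinese characters.

Use the -m option to name the mapping file (the character set conversion table). To get UTF-8 output from a Simplified Chinese .doc file, the command is: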

antiword -m UTF-8.txt your_document.doc > output.txt

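Key points:

-m UTF-8.txt tells antiword to use the UTF-8.txt mapping file for character conversion. The file normally ships with antiword; on Linux/macOS it is usually in /usr/share/antiword or a similar path.
If your system does not have the file, take it from the antiword source package, or point -m at another mapping file if your system has one (for example a GBK.txt, though that file may not be part of a stock install).
If the document is in another encoding (such as GB2312), you may need to try a different mapping file.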
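
To check whether the mapping file is actually where antiword can find it, something like the following may help. It is only a diagnostic sketch: find_mapping_file is a made-up helper, and the directory list is an assumption based on the antiword man page (ANTIWORDHOME first, then ~/.antiword, then a global directory), with C:\antiword guessed as the usual Windows location.

```python
import os

def find_mapping_file(name="UTF-8.txt", extra_dirs=()):
    """Return the first existing path to an antiword mapping file, or None."""
    candidates = []
    # ANTIWORDHOME, when set, is documented as the first directory antiword searches.
    home_override = os.environ.get("ANTIWORDHOME")
    if home_override:
        candidates.append(home_override)
    # Caller-supplied directories (hypothetical, e.g. an unpacked source package).
    candidates.extend(extra_dirs)
    # Assumed default locations; these vary by platform and by how antiword was built.
    candidates.append(os.path.join(os.path.expanduser("~"), ".antiword"))
    candidates.append("/usr/share/antiword")
    candidates.append(r"C:\antiword")
    for directory in candidates:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    return None

print(find_mapping_file("UTF-8.txt"))
```

If this prints None, antiword probably cannot open UTF-8.txt either. Note that -m takes the file name, not a full path, so if the file lives in a non-standard directory, setting ANTIWORDHOME to that directory is likely the simpler route.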

A more robust Python wrapper:

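import subprocess
import os

def doc_to_txt(doc_path, txt_path=None, charset_map="UTF-8.txt"):
    """
    Convert a .doc file to .txt with antiword.
    :param doc_path: path of the input .doc file
    :param txt_path: path of the output .txt file (defaults to the same name with .txt)
    :param charset_map: antiword mapping file (UTF-8.txt by default)
    :return: the converted text, or None on failure
    """
    if txt_path is None:
        txt_path = os.path.splitext(doc_path)[0] + ".txt"
    
    # Build the command
    cmd = ["antiword", "-m", charset_map, doc_path]
    
    try:
        # Run the command and capture its output. Decode explicitly as UTF-8,
        # which matches UTF-8.txt (change it if you pick another mapping);
        # text=True would decode with the locale encoding, e.g. cp936 on
        # Chinese Windows. stdout=/stderr=PIPE and encoding= work on Python 3.6.
        result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                                encoding="utf-8", errors="replace", check=True)
        text_content = result.stdout
        
        # Write the text to the output file
        with open(txt_path, "w", encoding="utf-8") as f:
            f.write(text_content)
        
        return text_content
    except subprocess.CalledProcessError as e:
        print(f"Conversion failed: {e}")
        return None
    except FileNotFoundError:
        print("Error: antiword command not found; make sure antiword is installed")
        return None

# Usage example
if __name__ == "__main__":
    # Convert a single file
    text = doc_to_txt("your_document.doc", charset_map="UTF-8.txt")
    if text:
        print("Conversion succeeded, first 500 characters:")
        print(text[:500])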

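If the output is still garbled after this:

Try another mapping file, if your installation has one: GBK.txt, GB2312.txt, and so on.
Check the document's actual encoding; older documents may be GB2312.
Consider catdoc as a replacement (command: catdoc -d utf-8 your_document.doc > output.txt).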
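
One way to find out which encoding the captured bytes are really in is to try strict decoding with a few candidates. This is a minimal sketch, not part of antiword: decode_with_fallback is a made-up helper, and the candidate list (UTF-8 first, then GB18030, which covers GBK and GB2312) is an assumption about what a Chinese Windows setup is likely to produce.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gb18030")):
    """Decode bytes with the first candidate encoding that succeeds strictly.

    Returns (text, encoding_name); encoding_name is None when no candidate
    decoded cleanly and replacement characters had to be used.
    """
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate and mark the bad bytes.
    return raw.decode(encodings[0], errors="replace"), None

sample = "中文测试".encode("gbk")
text, used = decode_with_fallback(sample)
print(used, text)
```

Pass it the raw stdout bytes from subprocess (without text or encoding arguments). If the reported encoding is gb18030, the output was never UTF-8 in the first place, so the mapping was not applied.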

In one sentence: use -m UTF-8.txt to specify the encoding mapping.

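eggper, #3

Every process runs with its own context, environment variables for example. Print the environment variables from inside Python and look at the encoding-related settings.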
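
A quick way to do that check is sketched below. The variable list is an assumption: LC_ALL, LC_CTYPE and LANG are the ones the antiword man page says it reads to choose a default mapping file, and ANTIWORDHOME and HOME are where it looks for its files. encoding_context is a made-up helper name.

```python
import locale
import os
import sys

def encoding_context():
    """Collect settings that influence how text crosses the process boundary."""
    names = ("LC_ALL", "LC_CTYPE", "LANG", "ANTIWORDHOME", "HOME")
    # Unset variables are reported as None rather than skipped.
    info = {name: os.environ.get(name) for name in names}
    # Encoding Python uses by default when decoding subprocess output in text mode.
    info["preferred_encoding"] = locale.getpreferredencoding(False)
    # Encoding Python uses when printing to the console.
    info["stdout_encoding"] = getattr(sys.stdout, "encoding", None)
    return info

for key, value in encoding_context().items():
    print(f"{key} = {value!r}")
```

Running this once from PowerShell and once from git bash should show which variables differ between the two shells.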

sinazl, #4 (original poster)

I tried changing the locale, and it did not help.
Code:

import locale

locale.setlocale(locale.LC_ALL, 'zh_CN.UTF-8')
locale.setlocale(locale.LC_COLLATE, 'zh_CN.UTF-8')
locale.setlocale(locale.LC_CTYPE, 'zh_CN.UTF-8')
locale.setlocale(locale.LC_NUMERIC, 'zh_CN.UTF-8')
locale.setlocale(locale.LC_MONETARY, 'zh_CN.UTF-8')
locale.setlocale(locale.LC_TIME, 'zh_CN.UTF-8')

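phonegap100, #5

On Linux, and in git bash on Windows, the output is fine.
It really feels like this has more to do with antiword itself: after chcp 65001 in cmd, antiword's output is still garbled.
But when antiword is called from git bash, the output is not garbled.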
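
A plausible explanation is that git bash exports a UTF-8 locale through environment variables, while cmd and PowerShell do not, and chcp only changes the console code page. locale.setlocale in Python affects only the Python process itself; a child process inherits environment variables, not that setting. The sketch below imitates git bash by setting the variables for the child. Whether a given Windows build of antiword honors them is an assumption, and utf8_child_env and antiword_text are made-up helper names.

```python
import os
import subprocess

def utf8_child_env(base=None, locale_name="zh_CN.UTF-8"):
    """Copy an environment and set the locale variables antiword is documented to read."""
    env = dict(os.environ if base is None else base)
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        env[name] = locale_name
    return env

def antiword_text(filename):
    """Run antiword with a UTF-8 locale in its environment and decode the result as UTF-8."""
    raw = subprocess.check_output(["antiword", filename], env=utf8_child_env())
    # Decode the bytes ourselves so Python's locale default (e.g. cp936) is never used.
    return raw.decode("utf-8", errors="replace")
```

Calling antiword_text("your_document.doc") from PowerShell would then be the test: if the text comes back readable, the difference between the shells was the environment rather than antiword.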

wuwangju, #6

Probably a GB2312 issue.