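How do you implement a deep-learning-based neural chat model with TensorFlow in Python?
The "I" in the title refers to the author of the DeepQA project. This continues the previous post: https://www.v2ex.com/t/388328 , and translates part of the project's documentation.
Overview
This work tries to reproduce the results of the paper A Neural Conversational Model (aka the Google chatbot). It uses a recurrent neural network (seq2seq model) for sentence prediction. It is written in Python with TensorFlow.
The corpus-loading part of the program is based on the Torch neuralconvo project from macournoyer.
For now, DeepQA supports the following dialog corpora: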
- Cornell Movie Dialogs corpus (default). Already included when cloning the repository.
- OpenSubtitles (thanks to Eschnou). Much bigger corpus (but also noisier). To use it, follow its instructions and use the flag --corpus opensubs.
- Supreme Court Conversation Data (thanks to julien-c). Available using --corpus scotus. See the instructions for installation.
- Ubuntu Dialogue Corpus (thanks to julien-c). Available using --corpus ubuntu. See the instructions for installation.
- Your own data (thanks to julien-c) by using a simple custom conversation format (see the project documentation for more info).
To speed up the training, it's also possible to use pre-trained word embeddings (thanks to Eschnou). More info is available in the project documentation.
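The idea behind pre-trained word embeddings can be sketched as follows. This is a generic illustration, not DeepQA's own loading code: the function name build_embedding_matrix, the toy vocabulary and the random-initialisation range are all assumptions made for the example.

```python
import numpy as np


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Build an initial embedding matrix for a vocabulary.

    vocab: list of words, where the list index is the word id.
    pretrained: dict mapping a word to its pre-trained vector (hypothetical input).
    Words without a pre-trained vector keep a small random initialisation.
    """
    rng = np.random.default_rng(seed)
    matrix = rng.uniform(-0.25, 0.25, size=(len(vocab), dim)).astype(np.float32)
    found = 0
    for idx, word in enumerate(vocab):
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = vector  # copy the pre-trained vector into row idx
            found += 1
    return matrix, found


# Toy data: two of the four words have pre-trained vectors.
vocab = ["<pad>", "hello", "world", "zyzzyva"]
pretrained = {"hello": np.ones(4), "world": np.full(4, 2.0)}
matrix, found = build_embedding_matrix(vocab, pretrained, dim=4)
print(found, matrix.shape)
```

Such a matrix would then be used as the starting value of the model's embedding layer, so training begins from word vectors that already carry some meaning instead of from noise.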
Installation
The program requires the following dependencies (easy to install using pip: pip3 install -r requirements.txt):
- python 3.5
- tensorflow (tested with v1.0)
- numpy
- CUDA (for using GPU)
- nltk (natural language toolkit for tokenizing the sentences)
- tqdm (for the nice progress bars)
You may also need to download additional data for nltk to work properly.
python3 -m nltk.downloader punkt
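The Cornell dataset is already included. For the other datasets, see the readme files in their respective folders (inside data/).
The web interface requires some additional packages: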
- django (tested with 1.10)
- channels
- Redis (see here)
- asgi_redis (at least 1.0)
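Installation with Docker is also supported; see the project's Docker instructions for a more detailed guide.
Running
Chatbot
To train the model, simply run main.py. Once training is done, you can test the results with main.py --test
(results are written to 'save/model/samples_predictions.txt') or with main.py --test interactive (more fun).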
Here are some flags which could be useful. For more help and options, use python main.py -h:
- --modelTag <name>: gives a name to the current model so that models can be told apart when testing/training.
- --keepAll: use this flag when training if, when testing, you want to see the predictions at different steps (it can be interesting to see the program change its name and age as the training progresses). Warning: it can quickly take a lot of storage space if you don't increase the --saveEvery option.
- --filterVocab 20 or --vocabularySize 30000: limit the vocabulary size to optimize performance and memory usage. Replace the words used less than 20 times by the <unknown> token and set a maximum vocabulary size.
- --verbose: when testing, prints the sentences as they are computed.
- --playDataset: shows some dialogue samples from the dataset (can be used together with --createDataset if this is the only action you want to perform).
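What the two vocabulary flags describe can be illustrated with a small sketch. This is not DeepQA's implementation: the function names and the toy sentences are assumptions, and only the <unknown> token name and the two ideas (a minimum word count and a maximum vocabulary size) come from the flag description above.

```python
from collections import Counter

UNKNOWN = "<unknown>"  # token name taken from the flag description


def build_vocab(sentences, min_count=20, max_size=30000):
    """Keep words seen at least min_count times, capped at max_size words."""
    counts = Counter(word for sentence in sentences for word in sentence)
    # most_common() lists words from most to least frequent.
    frequent = [word for word, count in counts.most_common() if count >= min_count]
    return set(frequent[:max_size])


def filter_sentence(sentence, vocab):
    """Replace every out-of-vocabulary word with the unknown token."""
    return [word if word in vocab else UNKNOWN for word in sentence]


# Toy corpus with small thresholds so the effect is visible.
sentences = [["hi", "there"], ["hi", "you"], ["hi", "there", "friend"]]
vocab = build_vocab(sentences, min_count=2, max_size=2)
filtered = [filter_sentence(s, vocab) for s in sentences]
print(filtered)
# [['hi', 'there'], ['hi', '<unknown>'], ['hi', 'there', '<unknown>']]
```

A smaller vocabulary shrinks both the embedding table and the output layer, which is where the memory and speed gains come from.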
To visualize the computational graph and the cost with TensorBoard, just run tensorboard --logdir save/.
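By default, the network architecture is a standard encoder/decoder with two LSTM layers (hidden size 256) and a vocabulary embedding size of 32. The network is trained with ADAM. The maximum sentence length is set to 10 words, but it can be increased.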
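To get a feel for the size of this default configuration, the number of LSTM weights can be worked out by hand. The sketch below assumes a standard LSTM cell (four gates, one bias vector per gate, no peepholes or projection) and that the encoder and decoder each have their own two-layer stack. DeepQA's actual cells may differ, and the embedding table and output layer are left out because they depend on the vocabulary size.

```python
EMBEDDING_SIZE = 32  # from the default configuration described above
HIDDEN_SIZE = 256
NUM_LAYERS = 2


def lstm_params(input_size, hidden_size):
    """Weights of one standard LSTM layer.

    Each of the 4 gates has an input matrix (hidden x input),
    a recurrent matrix (hidden x hidden) and a bias (hidden).
    """
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)


def stack_params(embedding_size, hidden_size, num_layers):
    """Total weights of a stack of LSTM layers fed by the embeddings."""
    total = 0
    input_size = embedding_size
    for _ in range(num_layers):
        total += lstm_params(input_size, hidden_size)
        input_size = hidden_size  # upper layers read the layer below
    return total


encoder_params = stack_params(EMBEDDING_SIZE, HIDDEN_SIZE, NUM_LAYERS)
print(encoder_params)      # 821248 = 295936 (layer 1) + 525312 (layer 2)
print(2 * encoder_params)  # 1642496 for encoder plus decoder
```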
Results
Of course, this network is not very good at chatting:
Q: Hi
A: Hi.
Q: What is your name ?
A: Laura.
Q: What does that mean ?
A: I dunno.
Q: How old are you ?
A: thirty-five.
Q: Will Google hire me ?
A: No.
Q: Tell me the alphabet
A: Fuck you.
Q: That’s not nice
A: Yeah.
Here are some cases where it fails to answer correctly:
Q: Who is Laura ?
A: My brother.
Q: Say ‘goodbye’
A: Alright.
Q: What is cooking ?
A: A channel.
Q: Can you say no ?
A: No.
Q: Two plus two
A: Manny…
Pretrained model
[Screenshot: the project]
[Screenshot: a live test run]
Step-by-step tutorial:
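1. Download the project: https://github.com/Conchylicultor/DeepQA
2. Download the trained model: https://drive.google.com/file/d/0Bw-phsNSkq23OXRFTkNqN0JGUU0/view (if the link does not open, I will upload it to Baidu Netdisk tonight and share it at http://www.tensorflownews.com/)
3. Unzip it and place it in the project's save directory.
[Screenshot: the unzipped model inside save/]
4. Copy the file save/model-pretrainedv2/dataset-cornell-old-lenght10-filter0-vocabSize0.pkl to data/samples/
[Screenshot: the file inside data/samples/]
5. Run the following command in the project directory:
python3 main.py --modelTag pretrainedv2 --test interactive
[Screenshot: the program after it has loaded the pretrained model]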
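Step 4 above can also be scripted. The helper below is a minimal sketch: the function name install_pretrained_dataset is made up for this example, while the paths and the file name (including its "lenght" spelling) are copied from the steps as written.

```python
import shutil
from pathlib import Path

# File name exactly as given in step 4 (the spelling "lenght" is the original one).
DATASET_NAME = "dataset-cornell-old-lenght10-filter0-vocabSize0.pkl"


def install_pretrained_dataset(project_root):
    """Copy the dataset file shipped with the pretrained model into data/samples/."""
    root = Path(project_root)
    source = root / "save" / "model-pretrainedv2" / DATASET_NAME
    target_dir = root / "data" / "samples"
    if not source.is_file():
        raise FileNotFoundError(f"expected the unzipped model file at {source}")
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves file metadata and returns the destination path.
    return Path(shutil.copy2(source, target_dir / DATASET_NAME))
```

Calling install_pretrained_dataset(".") from the project directory after unzipping the model should leave the file in place for step 5.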

Chatbot resource collection
Projects, corpora, papers, tutorials: https://github.com/fendouai/Awesome-Chatbot
More tutorials:
http://www.tensorflownews.com/
DeepQA
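Note: to make the project easier to understand, the description above translates part of the project's readme; it mainly covers how to run the project with the pretrained model and its preprocessed dataset.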
Kind of silly, but I'd like to try :)
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Model


class NeuralChatModel:
    def __init__(self, vocab_size, embedding_dim=256, lstm_units=512):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.lstm_units = lstm_units

        # Encoder: only the final LSTM states are kept
        encoder_inputs = tf.keras.Input(shape=(None,))
        encoder_embedding = Embedding(vocab_size, embedding_dim)(encoder_inputs)
        encoder_lstm = LSTM(lstm_units, return_state=True)
        encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
        encoder_states = [state_h, state_c]

        # Decoder: starts from the encoder states
        decoder_inputs = tf.keras.Input(shape=(None,))
        decoder_embedding = Embedding(vocab_size, embedding_dim)(decoder_inputs)
        decoder_lstm = LSTM(lstm_units, return_sequences=True, return_state=True)
        decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
        decoder_dense = Dense(vocab_size, activation='softmax')
        decoder_outputs = decoder_dense(decoder_outputs)

        # Training model
        self.model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
        self.model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

        # Inference models (they reuse the trained layers)
        self.encoder_model = Model(encoder_inputs, encoder_states)
        decoder_state_input_h = tf.keras.Input(shape=(lstm_units,))
        decoder_state_input_c = tf.keras.Input(shape=(lstm_units,))
        decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
        decoder_outputs, state_h, state_c = decoder_lstm(
            decoder_embedding, initial_state=decoder_states_inputs)
        decoder_states = [state_h, state_c]
        decoder_outputs = decoder_dense(decoder_outputs)
        self.decoder_model = Model(
            [decoder_inputs] + decoder_states_inputs,
            [decoder_outputs] + decoder_states)

    def train(self, encoder_input, decoder_input, decoder_target, batch_size=64, epochs=10):
        self.model.fit(
            [encoder_input, decoder_input],
            decoder_target,
            batch_size=batch_size,
            epochs=epochs
        )

    def respond(self, input_seq, max_response_length=20):
        # Encode the input sentence into the initial decoder states
        states_value = self.encoder_model.predict(input_seq, verbose=0)

        # Generate the response one token at a time
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = 1  # start token (index 1 by convention here)
        response = []
        for _ in range(max_response_length):
            output_tokens, h, c = self.decoder_model.predict(
                [target_seq] + states_value, verbose=0)

            # Greedy decoding: pick the most likely next token
            sampled_token_index = int(np.argmax(output_tokens[0, -1, :]))
            response.append(sampled_token_index)
            if sampled_token_index == 2:  # end token (index 2 by convention here)
                break

            # Feed the chosen token and the new states back in
            target_seq = np.zeros((1, 1))
            target_seq[0, 0] = sampled_token_index
            states_value = [h, c]
        return response


# Usage example
vocab_size = 10000  # vocabulary size
model = NeuralChatModel(vocab_size)

# Prepare data (random placeholders, only to exercise the pipeline)
encoder_input_data = np.random.randint(0, vocab_size, size=(1000, 20))
decoder_input_data = np.random.randint(0, vocab_size, size=(1000, 15))
decoder_target_data = np.random.randint(0, vocab_size, size=(1000, 15, 1))

# Train
model.train(encoder_input_data, decoder_input_data, decoder_target_data, epochs=5)

# Generate a response
test_input = np.random.randint(0, vocab_size, size=(1, 20))
response = model.respond(test_input)
print("Generated response indices:", response)
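This is a basic seq2seq chat model implementation. The core is an encoder-decoder architecture: the encoder encodes the input sentence into a fixed-size vector, and the decoder generates the response from that vector. The LSTM layers process the sequence data, and the Embedding layers turn word indices into dense vectors.
For real use you need to: 1) prepare a real dialogue dataset; 2) add an attention mechanism to improve results on long text; 3) use a larger vocabulary and a deeper network. The training data must be paired question/answer sentences converted to integer sequences.
I suggest validating the pipeline on a small dataset first and then scaling up.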
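Converting paired question/answer sentences to integer sequences could look like the sketch below. It is an illustration under assumptions: ids 1 and 2 for the start and end tokens follow the code in this reply, while id 0 for padding, id 3 for unknown words, whitespace tokenization and all function names are choices made for the example.

```python
PAD, START, END, UNK = 0, 1, 2, 3  # 1 and 2 match the reply's code; 0 and 3 are assumptions


def build_word_index(pairs):
    """Assign an integer id to every word, starting after the reserved ids."""
    index = {}
    for question, answer in pairs:
        for word in (question + " " + answer).lower().split():
            index.setdefault(word, len(index) + 4)  # ids 0-3 are reserved
    return index


def encode(sentence, index):
    """Turn a sentence into a list of word ids (unknown words map to UNK)."""
    return [index.get(word, UNK) for word in sentence.lower().split()]


def pad(ids, length):
    """Right-pad with PAD, or truncate, to exactly `length` ids."""
    return (ids + [PAD] * length)[:length]


def make_training_arrays(pairs, index, enc_len, dec_len):
    """Build encoder inputs, decoder inputs and decoder targets.

    The decoder target is the decoder input shifted one step to the left:
    the input starts with START, the target ends with END.
    """
    enc_in, dec_in, dec_target = [], [], []
    for question, answer in pairs:
        answer_ids = encode(answer, index)
        enc_in.append(pad(encode(question, index), enc_len))
        dec_in.append(pad([START] + answer_ids, dec_len))
        dec_target.append(pad(answer_ids + [END], dec_len))
    return enc_in, dec_in, dec_target


pairs = [("hi", "hi there"), ("how are you", "fine")]
index = build_word_index(pairs)
enc_in, dec_in, dec_target = make_training_arrays(pairs, index, enc_len=4, dec_len=4)
print(enc_in)      # [[4, 0, 0, 0], [6, 7, 8, 0]]
print(dec_in)      # [[1, 4, 5, 0], [1, 9, 0, 0]]
print(dec_target)  # [[4, 5, 2, 0], [9, 2, 0, 0]]
```

Wrapped in np.array, these lists have the same layout as the random placeholder arrays used in the example above. Note that answers longer than dec_len - 1 words lose their END token to truncation, so the lengths should be chosen with the dataset in mind.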
just do it!
Q: Can you say no?
A: No.
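This one killed me
A paradox, haha
Neural chat [smirk]
Haha, does it understand Chinese?
Two lunatics chatting
Nothing wrong with that.
Can it speak Chinese?
OP, sorry.
I read it as "lunatics chatting"…
No, there is no Chinese dataset. I will post a Chinese chatbot later.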

