How to fix errors when a Python regular expression handles a Unicode range

On https://repl.it/languages/python, this re call runs without problems under both python and python3:

import re;re.findall(u'[\U00010000-\U0001FFFFF]', u'\U0001f61b',re.U)

But on Ubuntu 14.04 LTS, running it under python and python3.4 gives:

Python 3.4.0 (default, Jun 19 2015, 14:20:21) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re;re.findall(u'[\U00010000-\U0001FFFFF]', u'\U0001f61b',re.U)
['']
>>> 
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re;re.findall(u'[\U00010000-\U0001FFFFF]', u'\U0001f61b',re.U)
[u'\U0001f61b']
>>>

On CentOS:

Python 2.7.10 (default, Oct 21 2015, 19:55:03) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re 
>>> re.findall(u'[\U00010000-\U0001FFFFF]', u'\U0001f61b',re.U)  
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python2.7/lib/python2.7/re.py", line 181, in findall
	return _compile(pattern, flags).findall(string)
  File "/usr/local/python2.7/lib/python2.7/re.py", line 251, in _compile
	raise error, v # invalid expression
sre_constants.error: bad character range
>>> 
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re;re.findall(u'[\U00010000-\U0001FFFFF]', u'\U0001f61b',re.U)
[u'\U0001f61b']
>>>

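What I'd like to ask the experts here is what is going on. Comparing the two, the re source code in both 2.7 installs is identical, while the GCC versions clearly differ, yet Python 2.6 on the same CentOS machine works fine.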


vueper, reply 1

I've run into this before; Unicode ranges in regular expressions really are an easy place to trip up.

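In Python 2, the re module treats \w, \d, \s and \b as ASCII-only unless re.UNICODE is set; in Python 3, str patterns are Unicode-aware by default. Typical errors are bad character range or incomplete escape. The bad character range in the question most likely has a different cause: a narrow (UCS-2) Python 2 build stores every character above U+FFFF as a surrogate pair, so a range between two such characters is read as a range between two surrogates in the wrong order.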
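To check whether an interpreter is a narrow build, and to see why the range in the question breaks there, something like the following sketch may help. It assumes the failing CentOS 2.7.10 was compiled with the default UCS-2 setting, which the traceback suggests but does not prove; the helper name utf16_units is made up for illustration.

```python
import struct
import sys


def utf16_units(text):
    """Return the UTF-16 code units of text, which is what a narrow build stores."""
    data = text.encode('utf-16-be')
    return list(struct.unpack('>%dH' % (len(data) // 2), data))


# 0xffff means a narrow (UCS-2) build; 0x10ffff means a wide build or Python 3.3+.
print(hex(sys.maxunicode))

low = utf16_units(u'\U00010000')   # lower bound of the range in the question
high = utf16_units(u'\U0001FFFF')  # upper bound (first eight hex digits after \U)
print([hex(unit) for unit in low])
print([hex(unit) for unit in high])

# A narrow build sees the class as: low[0], the range low[1]-high[0], then high[1].
# If low[1] is greater than high[0], that range is reversed and re rejects it.
range_is_reversed = low[1] > high[0]
print(range_is_reversed)
```

If the first line prints 0xffff, the interpreter is a narrow build and any character class that spans characters above U+FFFF needs to be written in terms of surrogates, or the interpreter rebuilt with --enable-unicode=ucs4.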

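Solutions:

Use the re.UNICODE flag: in Python 2 this makes \w, \W, \b, \B, \d, \D, \s and \S follow the Unicode character database. In Python 3 it is already the default for str patterns, so passing it is harmless but redundant. It does not change how explicit character ranges are parsed.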

import re

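# Without the flag (on Python 3 the result is the same, because str patterns are Unicode by default)
# The raw-string \u escapes are interpreted by re itself, which requires Python 3.3+
pattern = r'[\u4e00-\u9fff]+'  # match Chinese characters
text = "你好世界 Hello World"
# result = re.findall(pattern, text)

# With the flag
pattern = r'[\u4e00-\u9fff]+'
text = "你好世界 Hello World"
result = re.findall(pattern, text, re.UNICODE)
print(result)  # output: ['你好世界']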
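To see what the flag actually changes, here is a small sketch, assuming Python 3, where re.ASCII is the opposite switch to re.UNICODE. The sample text and variable names are only for illustration.

```python
import re

text = u'你好 abc'

# Python 3 default for str patterns: \w follows the Unicode database.
unicode_words = re.findall(r'\w+', text)

# re.ASCII (Python 3 only) restricts \w to [a-zA-Z0-9_].
ascii_words = re.findall(r'\w+', text, re.ASCII)

# An explicit range matches the same way with or without re.UNICODE.
range_plain = re.findall(u'[\u4e00-\u9fff]+', text)
range_flag = re.findall(u'[\u4e00-\u9fff]+', text, re.UNICODE)

print(unicode_words)
print(ascii_words)
print(range_plain == range_flag)
```

The flag only affects the shorthand classes, which is why it cannot turn an invalid range into a valid one.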

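Use Unicode strings for the pattern: in Python 3 every str is already Unicode; in Python 2 both the pattern and the text need the u prefix, otherwise they are byte strings.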

import re

# Match every Chinese character
pattern = r'[\u4e00-\u9fff]'
text = "测试test123"
matches = re.findall(pattern, text, re.UNICODE)
print(matches)  # output: ['测', '试']

# Match characters in a specific Unicode range (emoticons here)
emoji_pattern = r'[\U0001F600-\U0001F64F]'  # Emoticons block
text = "Hello 😀 World 🌍"
emoji_matches = re.findall(emoji_pattern, text, re.UNICODE)
print(emoji_matches)  # output: ['😀']
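For code that also has to run on a narrow Python 2 build, one common workaround is to spell the range as surrogate pairs. The sketch below assumes the intended range is U+10000 to U+1FFFF: \U takes exactly eight hex digits, so the ninth F in the original pattern is read as a literal letter F inside the class. The name ASTRAL is hypothetical.

```python
import re
import sys

if sys.maxunicode > 0xFFFF:
    # Wide build or Python 3.3+: one code point per character.
    ASTRAL = re.compile(u'[\U00010000-\U0001FFFF]')
else:
    # Narrow build: U+10000..U+1FFFF is stored as a high surrogate
    # in D800..D83F followed by a low surrogate in DC00..DFFF.
    ASTRAL = re.compile(u'[\ud800-\ud83f][\udc00-\udfff]')

found = ASTRAL.findall(u'a\U0001f61bb')
print(found == [u'\U0001f61b'])
```

Only the branch that matches the running interpreter is ever compiled, so the same source file loads on both kinds of build.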

Use the regex library: if you need richer Unicode support, consider the third-party regex library, which handles Unicode better.

import regex  # install first: pip install regex

# The regex library supports Unicode properties
pattern = r'\p{Han}+'  # match Chinese characters by Unicode script property
text = "中文测试 English text"
result = regex.findall(pattern, text)
print(result)  # output: ['中文测试']

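Key points:

Pass re.UNICODE in Python 2 (it is already the default for str patterns in Python 3)
Make sure the pattern string uses the correct Unicode escape format
Consider the regex library for more complex needs

In short, adding re.UNICODE fixes most problems with \w, \d and \s in Python 2; a bad character range on characters above U+FFFF points to a narrow build instead.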

vueper, reply 2

wwq@ubuntu:~$ python3.5
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re;re.findall(u'[\U00010000-\U0001FFFFF]', u'\U0001f61b',re.U)
['😛']
>>>
wwq@ubuntu:~$ python3.6
Python 3.6.1 (default, Apr 22 2017, 20:17:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re;re.findall(u'[\U00010000-\U0001FFFFF]', u'\U0001f61b',re.U)
['😛']
>>>
wwq@ubuntu:~$ python2.7
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re;re.findall(u'[\U00010000-\U0001FFFFF]', u'\U0001f61b',re.U)
[u'\U0001f61b']
>>>