How to slice a DataFrame string column containing Chinese by character position in Python

I need to slice the string data in a DataFrame by character position, where one Chinese character occupies 2 positions.
Example:
columnA
I’m 中国, so
You are 中国人
…

Taking positions 9 through 10 of each string in columnA, the result should be:
,s
中

Assuming the slice never cuts a Chinese character in half, how should this be written?

My idea was something like:
df.str.decode('gb18030').str.slice(8,10).str.encode('gb18030')

But after decode the Series no longer holds str values, so the str.slice call that follows raises an error.

songsunli, reply #1

import pandas as pd

# Build a sample DataFrame
df = pd.DataFrame({
    'text': ['你好世界Hello', 'Python编程', '测试123abc', '混合Mixed中文']
})

def chinese_slice(text, start, end):
    """
    Slice a mixed Chinese/English string by character position.
    Args:
        text: the string to slice
        start: start position (inclusive, 0-based)
        end: end position (exclusive)
    """
    result = []
    count = 0

    for char in text:
        if count >= end:
            break
        if count >= start:
            result.append(char)
        # Every character, Chinese included, counts as 1 position
        count += 1

    return ''.join(result)

# Apply the slicing function
# Example 1: characters at positions 0 to 2 (the first 3 characters)
df['slice_0_3'] = df['text'].apply(lambda x: chinese_slice(x, 0, 3))

# Example 2: characters at positions 2 to 4
df['slice_2_5'] = df['text'].apply(lambda x: chinese_slice(x, 2, 5))

print(df)

Output:

        text  slice_0_3  slice_2_5
0  你好世界Hello        你好世        世界H
1   Python编程        Pyt        tho
2   测试123abc        测试1        123
3  混合Mixed中文        混合M        Mix

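Key points:

In Python 3, str[start:end] already slices by Unicode character rather than by byte, so it neither splits Chinese characters nor produces garbled text; that only happens when slicing encoded bytes.
When iterating over a string, every Unicode character (Chinese included) counts as 1 position, so chinese_slice(text, start, end) returns the same result as text[start:end] for non-negative start and end.
This approach counts mixed Chinese/English text per character; it does not give a Chinese character the 2 positions the question asks for, which requires counting widths or encoded bytes instead.

One-line advice: count per character when each character is one position; when a Chinese character must count as 2, slice by width or by encoded byte length.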
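If the loop above should treat a Chinese character as 2 positions, as the question asks, one possible variant advances the counter by the character's width. This is only a sketch: the names char_width and width_slice are made up for illustration, and it assumes that "Chinese" means anything unicodedata.east_asian_width classifies as wide ("W") or fullwidth ("F"), which is not guaranteed to match the GB18030 byte count for every symbol. Positions are 0-based with an exclusive end, like the function above.

```python
import unicodedata

def char_width(ch):
    """Return 2 for East Asian wide/fullwidth characters, otherwise 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def width_slice(text, start, end):
    """Keep the characters that lie entirely inside positions [start, end)."""
    result = []
    pos = 0  # 0-based position where the current character begins
    for ch in text:
        if pos >= end:
            break
        w = char_width(ch)
        # A wide character that would straddle a boundary is skipped
        if pos >= start and pos + w <= end:
            result.append(ch)
        pos += w
    return "".join(result)

print(width_slice("You are 中国人", 8, 10))  # 中
print(width_slice("你好世界Hello", 0, 3))    # 你
```

It can be applied to a column the same way as above, for instance df['text'].apply(lambda x: width_slice(x, 8, 10)).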
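For the 2-positions-per-Chinese-character rule in the question, the encode/decode idea seems to work if the order is reversed: encode the str to bytes first, slice the bytes, then decode back. Below is a minimal sketch under these assumptions: byte_slice is a hypothetical helper name, the data is representable in GB18030, and the cut never splits a character (otherwise the strict decode raises UnicodeDecodeError). The first sample is written without a space after the comma so that positions 9 to 10 are ",s".

```python
def byte_slice(text, start, end, encoding="gb18030"):
    """Slice by encoded byte offset (0-based start, exclusive end)."""
    raw = text.encode(encoding)  # ASCII -> 1 byte, common Chinese characters -> 2 bytes
    # Strict decoding: raises UnicodeDecodeError if the cut splits a character
    return raw[start:end].decode(encoding)

samples = ["I'm 中国,so", "You are 中国人"]
for s in samples:
    # "Positions 9 to 10" (1-based, inclusive) correspond to offsets 8:10
    print(byte_slice(s, 8, 10))
```

On a DataFrame this could be applied per row, for example df['columnA'].apply(lambda s: byte_slice(s, 8, 10)).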