Python之encoding

2024-12-06 约 1238 字预计阅读 3 分钟

https://bing.ee123.net/img/rand?artid=80271633

Python之encoding

encoding

根据Python官方文档中有关字符串的部分：

str.encode(encoding=”utf-8”, errors=”strict”)
Return an encoded version of the string as a bytes object. Default encoding is ‘utf-8’. errors may be given to set a different error handling scheme. The default for errors is ‘strict’, meaning that encoding errors raise a UnicodeError. Other possible values are ‘ignore’, ‘replace’, ‘xmlcharrefreplace’, ‘backslashreplace’ and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.

encoding是编码的意思，在python中，Unicode类型是作为编码的基础类型。

Unicode

Unicode是一种标准，包括了字符集、编码方案等。因为ASCII码只能编码英文字符，具有很大的局限性，而Unicode 为每种语言中的每个字符设定了统一并且唯一的二进制编码，以满足跨语言、跨平台进行文本转换、处理的要求。在可以查到所有的Unicode字符。

utf-8

UTF-8以字节为单位对Unicode进行编码。

utf-16

UTF-16编码以16位无符号整数为单位。

utf-32

UTF-32编码以32位无符号整数为单位。

gbk

GBK全称《汉字内码扩展规范》（GBK即“国标”、“扩展”汉语拼音的第一个字母，英文名称：Chinese Internal Code Specification）。GBK是采用单双字节变长编码，英文使用单字节编码，完全兼容ASCII字符编码，中文部分采用双字节编码。

ASCII 码

ASCII 码使用指定的7 位或8 位二进制数组合来表示128 或256 种可能的字符。

decoding

根据Python官方文档中有关字符串的部分：

bytes.decode(encoding=”utf-8”, errors=”strict”)
bytearray.decode(encoding=”utf-8”, errors=”strict”)
Return a string decoded from the given bytes. Default encoding is ‘utf-8’. errors may be given to set a different error handling scheme. The default for errors is ‘strict’, meaning that encoding errors raise a UnicodeError. Other possible values are ‘ignore’, ‘replace’ and any other name registered via codecs.register_error(), see section Error Handlers. For a list of possible encodings, see section Standard Encodings.
Note Passing the encoding argument to str allows decoding any bytes-like object directly, without needing to make a temporary bytes or bytearray object.

decode()就是将“字节流”按照某种规则转换成“文本”，可以理解为是encode()的逆过程。

例子

现在用一个小小的例子来体验一下编码和解码。

首先可以去上将上面的字符保存在一个txt文件中，并以utf-8的形式保存，命名为languages。写下代码：

#####encoding&decoding#####
def main(file,encoding,errors):
    line = file.readline()
    if line:
        print_file(line,encoding,errors)
        return main(file,encoding,errors)

def print_file(line,encoding,errors):
    next_lang = line.strip() ###strip() 方法用于移除字符串头尾指定的字符（默认为空格）
    raw_bytes = next_lang.encode(encoding, errors=errors) ###encode()函数中errors默认为strict，还可以设置为ignore,replace等
    cooked_string = raw_bytes.decode(encoding, errors=errors) ###decode()函数为解码

    print(raw_bytes, "<===>", cooked_string)


languages = open("languages.txt", encoding="utf-8")
main(languages,"utf-16","strict")

可以得到结果（如下只显示结果片段）：

左边为utf-8编码结果，右边为原字符，两者一一对应且可以进行互逆运算。

将代码中的“utf-8”改成“utf-16”，可以得到以下结果：

目录

Python之encoding

Python之encoding

encoding

Unicode

utf-8

utf-16

utf-32

gbk

ASCII 码

decoding

例子