Text Encodings

When accessing response.text, we need to decode the response bytes into a Unicode text representation.

By default, httpx uses the "charset" information included in the response's Content-Type header to determine how the response bytes should be decoded into text.

If the response includes no charset information, the default behaviour is to assume "utf-8" encoding, which is by far the most widely used text encoding on the internet.
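The lookup rule described above can be sketched with the standard library alone. This is only an illustration of the rule, not httpx's internal implementation; the helper names charset_from_content_type and decode_body are hypothetical.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type value, or the default."""
    if content_type is None:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the charset parameter and
    # returns failobj when the header carries no charset.
    return msg.get_content_charset(failobj=default)


def decode_body(body, content_type):
    """Decode response bytes using the header charset, else UTF-8."""
    return body.decode(charset_from_content_type(content_type))


# A header with an explicit charset wins over the default.
print(charset_from_content_type("text/html; charset=ISO-8859-1"))
# A header without a charset falls back to the default.
print(charset_from_content_type("application/json"))
```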

Using the default encoding

To understand this better, let's start by looking at the default behaviour for text decoding...

import httpx
# Instantiate a client with the default configuration.
client = httpx.Client()
# Using the client...
response = client.get(...)
print(response.encoding)  # Prints the charset given in the Content-Type
                          # header, or else "utf-8".
print(response.text)  # The text is decoded with the Content-Type charset,
                      # or else with "utf-8".

This is normally absolutely fine. Most servers respond with a properly formatted Content-Type header that includes a charset. And in most cases where no charset is included, UTF-8 is very likely to be the encoding in use, since it is so widely adopted.

Using an explicit encoding

In some cases we might be making requests to a site where the server does not set any character set information explicitly, but we know what the encoding is. In this case it's best to set the default encoding explicitly on the client.

import httpx
# Instantiate a client with a Japanese character set as the default encoding.
client = httpx.Client(default_encoding="shift-jis")
# Using the client...
response = client.get(...)
print(response.encoding)  # Prints the charset given in the Content-Type
                          # header, or else "shift-jis".
print(response.text)  # The text is decoded with the Content-Type charset,
                      # or else with "shift-jis".
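To see why the explicit default matters, the following standard-library sketch decodes the same Shift JIS bytes twice: once as UTF-8 and once with the correct codec. The sample string is an assumption chosen for illustration, and undecodable bytes are replaced here rather than raising an error.

```python
# Bytes as a server might send them, encoded as Shift JIS
# with no charset declared in the Content-Type header.
raw = "こんにちは".encode("shift_jis")

# Decoding with the wrong default yields replacement characters.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the known encoding recovers the original text.
right = raw.decode("shift_jis")

print(repr(wrong))
print(right)
```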

Using auto-detection

In cases where the server does not reliably include character set information, and where we don't know what encoding is being used, we can enable auto-detection to make a best-guess attempt when decoding from bytes to text.

To use auto-detection, set the default_encoding argument to a callable instead of a string. The callable should be a function that takes the input bytes as an argument and returns the character set to use for decoding those bytes to text.
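The contract for such a callable is small: bytes in, encoding name out. As a minimal hand-rolled sketch using only the standard library, the function below tries a short list of candidate encodings in order. The name guess_encoding, the candidate list, and the latin-1 last resort are all assumptions for illustration; a real detector is considerably more sophisticated.

```python
def guess_encoding(content: bytes) -> str:
    """Return the first candidate encoding that decodes the bytes cleanly."""
    # Hypothetical candidate list; order matters, strictest first.
    for candidate in ("utf-8", "shift_jis"):
        try:
            content.decode(candidate)
        except UnicodeDecodeError:
            continue
        return candidate
    # latin-1 maps every byte to a character, so it never fails.
    return "latin-1"


print(guess_encoding("héllo".encode("utf-8")))
print(guess_encoding("こんにちは".encode("shift_jis")))
```

A function with this shape could then be passed as the default_encoding argument when constructing the client.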

There are two widely used Python packages which both handle this functionality: chardet and charset-normalizer.

Let's take a look at installing autodetection using one of these packages...

$ pip install httpx
$ pip install chardet

Once chardet is installed, we can configure a client to use character-set autodetection.

import httpx
import chardet

def autodetect(content):
    return chardet.detect(content).get("encoding")

# Using a client with character-set autodetection enabled.
client = httpx.Client(default_encoding=autodetect)
response = client.get(...)
print(response.encoding)  # Prints the charset given in the Content-Type
                          # header, or else the auto-detected
                          # character set.
print(response.text)
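A detector may be unable to decide on an encoding, in which case the "encoding" entry of its result can be None. If a specific fallback is preferred in that situation, one option is to wrap the detector. The sketch below is an assumption-laden illustration: with_fallback is a hypothetical helper, and it only assumes the detector returns a dict with an optional "encoding" key, as chardet.detect does.

```python
def with_fallback(detect, fallback="utf-8"):
    """Wrap a detector so the callable always returns an encoding name."""

    def autodetect(content: bytes) -> str:
        result = detect(content) or {}
        # Use the fallback when the detector reports no encoding.
        return result.get("encoding") or fallback

    return autodetect


# Stand-in detector for demonstration; it never makes a guess.
def undecided(content):
    return {"encoding": None}


autodetect = with_fallback(undecided, fallback="shift_jis")
print(autodetect(b"example bytes"))
```

With chardet installed, with_fallback(chardet.detect) would produce a callable suitable for the default_encoding argument.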