PYTHON: A Summary of Encoding Handling

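When you use Python to process Chinese text, you read files, messages, HTTP parameters, and so on

Then you run the program and find garbled text (in string handling, file reads and writes, or print)

At that point most people debug by calling encode/decode by trial and error, without working out why the text was garbled in the first place

So these are the errors that show up most often while debugging

Error 1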

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

Error 2

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
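Both errors come from the ascii codec being applied to non-ASCII data. In Python 2 that usually happens implicitly; the sketch below triggers the same two exceptions explicitly so they can be studied in isolation. The helper names are hypothetical, and the literals use escapes so the snippet does not depend on the source file encoding.

```python
raw = b'\xe6\x96\x87'     # utf8 bytes of one Chinese character
text = u'\u4e2d\u6587'    # a two-character unicode string


def ascii_decode_error(data):
    """Return the message raised when decoding bytes as ascii, or None."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


def ascii_encode_error(value):
    """Return the message raised when encoding text as ascii, or None."""
    try:
        value.encode('ascii')
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


decode_message = ascii_decode_error(raw)
encode_message = ascii_encode_error(text)
print(decode_message)
print(encode_message)
```

Pure ASCII input passes through both helpers without an error, which is why these bugs often stay hidden until the first Chinese character arrives.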

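First

You need a general grasp of character sets and character encodings

ASCII | Unicode | UTF-8 | and so on

str and unicode

Both str and unicode are subclasses of basestring

So there is a way to check whether a value is a string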

def is_str(s):
    return isinstance(s, basestring)

str -> decode('the_coding_of_str') -> unicode
unicode -> encode('the_coding_you_want') -> str
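The two arrows form a round trip, which can be sketched as follows. The variable names are only illustrative, and the byte literal is assumed to be utf8.

```python
raw = b'\xe4\xb8\xad\xe6\x96\x87'   # a byte string, assumed to be utf8

text = raw.decode('utf-8')          # bytes -> unicode
back = text.encode('utf-8')         # unicode -> bytes

# Decoding and re-encoding with the same codec returns the original bytes.
print(back == raw)
```

The codec passed to decode has to be the one the bytes were actually written in; decode does not detect it for you.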

A str is a byte string: it holds the bytes produced by encoding (encode) a unicode string

s = '中文'
s = u'中文'.encode('utf-8')

>>> type('中文')
<type 'str'>

Length (returns the number of bytes)

>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
>>> len(u'中文'.encode('utf-8'))
6

unicode is the real string type, made up of characters

Ways to declare one

s = u'中文'
s = '中文'.decode('utf-8')
s = unicode('中文', 'utf-8')

>>> type(u'中文')
<type 'unicode'>

Length (returns the number of characters), which is what your logic actually wants

>>> u'中文'
u'\u4e2d\u6587'
>>> len(u'中文')
2
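The difference between counting bytes and counting characters matters most when slicing. As a rough sketch (is_valid_utf8 is a hypothetical helper), cutting a byte string at an arbitrary offset can split a character in half, while slicing unicode always keeps whole characters:

```python
text = u'\u4e2d\u6587'        # two characters
raw = text.encode('utf-8')    # six bytes, three per character


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as utf8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


first_char = text[:1]   # one whole character
first_byte = raw[:1]    # one third of a character, not decodable alone

print(is_valid_utf8(first_byte))
print(is_valid_utf8(raw[:3]))
```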

Conclusion:
Work out whether you are handling a str or a unicode, and use the matching method (str.decode / unicode.encode)

Here is how to check whether a value is unicode or str

>>> isinstance(u'中文', unicode)
True
>>> isinstance('中文', unicode)
False

>>> isinstance('中文', str)
True
>>> isinstance(u'中文', str)
False
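
Simple rule: do not call encode on a str, and do not call decode on a unicode (you can in fact call encode on a str: Python 2 first decodes it implicitly with the default ascii codec, which is why the first failure below is a UnicodeDecodeError; to keep things simple, this is not recommended)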

>>> '中文'.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

>>> u'中文'.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
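One way to follow the rule without checking types at every call site is a pair of small normalizing helpers. This is only a sketch: to_unicode and to_bytes are hypothetical names, and utf8 as the default is an assumption you should replace with the real encoding of your data.

```python
text_type = type(u'')   # unicode on Python 2


def to_unicode(value, encoding='utf-8'):
    """Return value as unicode, decoding only if it is a byte string."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)


def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding only if it is unicode."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value


print(to_unicode(b'\xe4\xb8\xad\xe6\x96\x87') == u'\u4e2d\u6587')
print(to_bytes(u'\u4e2d\u6587') == b'\xe4\xb8\xad\xe6\x96\x87')
```

Because each helper returns its input unchanged when it already has the right type, calling it twice is harmless, and the implicit ascii conversion never gets a chance to run.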

To convert between encodings, use unicode as the intermediate form

# s is a str encoded in code_A
s.decode('code_A').encode('code_B')
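As a concrete instance of this pattern, the sketch below converts gbk bytes to utf8. The transcode function is a hypothetical wrapper, not a standard library call.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string by passing through unicode."""
    return data.decode(source_encoding).encode(target_encoding)


gbk_bytes = b'\xd6\xd0\xce\xc4'   # two Chinese characters encoded as gbk
utf8_bytes = transcode(gbk_bytes, 'gbk', 'utf-8')

print(repr(utf8_bytes))
```

The same two characters take four bytes in gbk and six in utf8, so the byte strings differ even though the text is identical.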

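To sum up: this article applies only to Python 2.x

Default encoding and the declaration at the top of the file
First, declare the encoding at the top of the file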

# coding: utf8

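Its purpose is to declare that the source file is encoded as utf8 (it must appear within the first two lines). Without it, if the file contains Chinese text, for example either of

a = '美丽'
b = u'美丽'

Python reports SyntaxError: Non-ASCII character... before the program even runs, because Python 2.x assumes the dreaded ascii as the source file encoding.
With the declaration at the top, the source file is read as utf8. Note that the declaration only affects how the source file is parsed. The interpreter-wide default encoding used for implicit str/unicode conversions is a separate setting that stays ascii in Python 2.x. To get the current default encoding:

import sys
sys.getdefaultencoding()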
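unicode and utf8
In Python, the unicode type is the base type for encoding work. Encoding and decoding pass through it as the intermediate form, that is, conversions happen between str and unicode.
Decoding and then re-encoding is the str -> unicode -> str process. The intermediate result is called a unicode object.

What needs emphasizing here is that Unicode is a character set that assigns every character a code point, a "storage-independent representation", whereas utf8 is one way of encoding those code points as bytes for storage or transmission. gbk, gb2312, gb18030, utf8, and so on are different encodings, and transcoding means converting between any two of them.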
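The distinction can be seen on a single character. In this small sketch, the code point is one abstract number, while each encoding turns it into a different byte sequence:

```python
char = u'\u4e2d'

code_point = ord(char)               # the storage-independent number
utf8_bytes = char.encode('utf-8')    # three bytes in utf8
gbk_bytes = char.encode('gbk')       # two bytes in gbk

print(hex(code_point))
print(repr(utf8_bytes))
print(repr(gbk_bytes))
```

Decoding either byte sequence with its own codec yields the same unicode character again.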

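Console encoding

This is another confusing area: the console's encoding can cause garbled output or even exceptions. On most personal computers the console is utf8, although Windows consoles often use a legacy code page such as cp936 (gbk) instead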
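Rather than guessing, you can ask the interpreter what the console reports. The following is a sketch under assumptions: console_encoding and encode_for_console are hypothetical helpers, and the utf8 fallback is a guess for the case where sys.stdout.encoding is missing or None (which can happen when output is piped).

```python
import sys


def console_encoding(fallback='utf-8'):
    """Return the encoding reported by stdout, or the fallback if unknown."""
    return getattr(sys.stdout, 'encoding', None) or fallback


def encode_for_console(text, fallback='utf-8'):
    """Encode unicode text for the console, replacing unsupported characters."""
    return text.encode(console_encoding(fallback), 'replace')


print(console_encoding())
print(repr(encode_for_console(u'\u4e2d\u6587')))
```

Passing 'replace' trades an exception for a placeholder character, which is usually the better failure mode for log output.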

When saving a file in Sublime Text, you generally need to choose "UTF-8 with BOM".
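A file saved with a BOM starts with three extra bytes, which matters when the file is read back. As a brief sketch, decoding with plain utf-8 keeps the BOM as a leading U+FEFF character, while the utf-8-sig codec strips it:

```python
import codecs

text = u'\u4e2d\u6587'
with_bom = codecs.BOM_UTF8 + text.encode('utf-8')   # what such a file contains

kept = with_bom.decode('utf-8')          # BOM survives as u'\ufeff'
stripped = with_bom.decode('utf-8-sig')  # BOM removed while decoding

print(repr(kept[0]))
print(stripped == text)
```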


# coding=utf-8
import json

data1 = {'b': 789, 'c': 456, 'a': 123}
encode_json = json.dumps(data1)
print type(encode_json), encode_json

decode_json = json.loads(encode_json)
print type(decode_json)
print decode_json['a']
print decode_json
print json.dumps(decode_json)

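s = '中国'  # a str holding the raw utf8 bytes from this source file
# s.encode('gbk') would raise UnicodeDecodeError, because Python 2 first
# decodes the str as ascii; use s.decode('utf-8').encode('gbk') instead

print s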
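The script above only puts ASCII data through json. With Chinese text, the point to know is that json.dumps escapes non-ASCII characters by default. The sketch below uses unicode keys and values throughout (an assumption made to keep the behavior simple) and shows both the default output and the ensure_ascii=False variant:

```python
import json

data = {u'country': u'\u4e2d\u56fd'}

escaped = json.dumps(data)                        # non-ASCII becomes \uXXXX
readable = json.dumps(data, ensure_ascii=False)   # characters kept as-is
restored = json.loads(escaped)                    # escapes decoded again

print(escaped)
print(restored[u'country'] == data[u'country'])
```

The escaped form is pure ASCII and safe to write anywhere. The readable form still has to be encoded, for example with encode('utf-8'), before it is written to a byte stream.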