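The str and unicode types

First off, Python 2 has two types for handling strings: str and unicode.

As a rule, sticking to the unicode type seems to lead to fewer mistakes.

Let's look at each one in turn.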

The str type

The type of a plain literal such as "もじれつ". It holds raw bytes (UTF-8 encoded in this terminal session).

>>> "もじれつ"
'\xe3\x82\x82\xe3\x81\x98\xe3\x82\x8c\xe3\x81\xa4'

The unicode type

The type of a u-prefixed literal such as u"もじれつ". It holds Unicode code points rather than bytes.

>>> u"もじれつ"
u'\u3082\u3058\u308c\u3064'
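
A small sketch of the practical difference between the two types. It is written so that it should run on both Python 2 and Python 3: a b"..." literal stands in for Python 2's str, and escape sequences replace the Japanese literal so no source-encoding declaration is needed. The variable names are only for illustration.

```python
# Byte string (Python 2 str) vs. text string (Python 2 unicode).
# Same data as the REPL examples above, spelled with escapes.
as_bytes = b"\xe3\x82\x82\xe3\x81\x98\xe3\x82\x8c\xe3\x81\xa4"
as_text = u"\u3082\u3058\u308c\u3064"

byte_count = len(as_bytes)  # counts bytes (3 per character in UTF-8 here)
char_count = len(as_text)   # counts code points

print("%d bytes, %d characters" % (byte_count, char_count))
```

Length, indexing, and slicing all follow the same split: on str they operate on bytes, on unicode they operate on characters.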

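The unicode function

For turning a str into a unicode object, unicode(s, "utf-8") does the same job as s.decode("utf-8"), so you are probably better off not using it; having both around only causes confusion. For this kind of conversion there is probably nothing that only the unicode function can do.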
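
A quick check of that equivalence, as a sketch. The unicode built-in only exists on Python 2; the fallback to str on Python 3 is an assumption added here so the snippet runs on either version, and is not part of the original notes.

```python
try:
    unicode  # Python 2: the built-in discussed in this section
except NameError:
    # Python 3 fallback (assumption): str(bytes, encoding) decodes the same way
    unicode = str

raw = b"\xe3\x82\x82\xe3\x81\x98\xe3\x82\x8c\xe3\x81\xa4"

via_function = unicode(raw, "utf-8")  # the unicode function
via_method = raw.decode("utf-8")      # the decode method

print(via_function == via_method)
```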

The decode method

Use this on str values.

It returns a unicode object.

>>> "もじれつ".decode("utf-8")
u'\u3082\u3058\u308c\u3064'

unicode objects have a decode method too.
For a unicode object that contains only ASCII characters, calling it changes nothing.

>>> u"abcdef".decode("utf-8")
u'abcdef'

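For a unicode object that contains non-ASCII characters, it raises an exception. Python 2 first implicitly encodes the unicode object to str with the ascii codec before decoding, which is why the error is a UnicodeEncodeError.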

>>> u"もじれつ".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/var/pyenv/versions/teg_server2.7.12/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

To say it once more: decode is not something to use on unicode objects.
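
The traceback above can be reproduced by doing the hidden ascii step explicitly, as sketched below. The to_unicode helper is hypothetical (not from any library); it is one possible way to avoid calling decode on something that is already unicode. The snippet should run on both Python 2 and Python 3.

```python
text = u"\u3082\u3058\u308c\u3064"  # same value as u"もじれつ"

# What Python 2 effectively does first inside text.decode("utf-8"):
# an encode with the ascii codec, which cannot represent these characters.
try:
    text.encode("ascii")
    hidden_step_failed = False
except UnicodeEncodeError:
    hidden_step_failed = True


def to_unicode(value, encoding="utf-8"):
    """Hypothetical helper: decode byte strings, pass text through untouched."""
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value


print(hidden_step_failed)
print(to_unicode(b"\xe3\x82\x82") == to_unicode(u"\u3082"))
```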

The encode method

Use this on unicode values.

It returns a str.

>>> u"もじれつ".encode("utf-8")
'\xe3\x82\x82\xe3\x81\x98\xe3\x82\x8c\xe3\x81\xa4'

str objects have an encode method too.
For a str that contains only ASCII bytes, calling it changes nothing.

>>> "abcdef".encode("utf-8")
'abcdef'

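For a str that contains non-ASCII bytes, it raises an exception. Python 2 first implicitly decodes the str to unicode with the ascii codec before encoding, which is why the error is a UnicodeDecodeError.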

>>> "もじれつ".encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 0: ordinal not in range(128)
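
As a sketch, the hidden ascii decode can be triggered explicitly, followed by the pattern that sidesteps the whole problem: decode bytes once on the way in, work on unicode, and encode once on the way out. The variable names are illustrative, and the snippet should run on both Python 2 and Python 3.

```python
# Bytes as they might arrive from a file or socket (UTF-8 for "もじれつ").
raw = b"\xe3\x82\x82\xe3\x81\x98\xe3\x82\x8c\xe3\x81\xa4"

# The hidden step behind calling .encode("utf-8") on a Python 2 str:
# a decode with the ascii codec, which rejects the byte 0xe3.
try:
    raw.decode("ascii")
    hidden_step_failed = False
except UnicodeDecodeError:
    hidden_step_failed = True

# Decode at the input boundary, work in text, encode at the output boundary.
text = raw.decode("utf-8")
shortened = text[:2]                 # slices whole characters, not bytes
output = shortened.encode("utf-8")   # back to bytes for writing out

print(hidden_step_failed)
print(len(output))
```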