python unicode, encoding

halatha 2013. 4. 3. 09:35

http://docs.python.org/2/howto/unicode.html

e.g. UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 0: ordinal not in range(128)

e.g. UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
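
Both messages can be reproduced by calling the 'ascii' codec directly. The following is a minimal sketch, not taken from the linked HOWTO; the names encode_error and decode_error are illustrative. Python 2 triggers the same codec implicitly when it mixes str and unicode values.

```python
# -*- coding: utf-8 -*-
# Reproduce both error types explicitly with the 'ascii' codec.

try:
    u'\ua000'.encode('ascii')      # text -> bytes: U+A000 has no ASCII byte
except UnicodeEncodeError as exc:
    encode_error = exc

try:
    b'abc\x9f'.decode('ascii')     # bytes -> text: 0x9f is outside ASCII
except UnicodeDecodeError as exc:
    decode_error = exc

# Each exception records where the conversion failed.
print("%s at position %d" % (type(encode_error).__name__, encode_error.start))
print("%s at position %d" % (type(decode_error).__name__, decode_error.start))
```

Encode errors come from turning text into bytes, decode errors from turning bytes into text, so the exception name shows which direction went wrong.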


http://coreapython.hosting.paran.com/etc/Unicode%20in%20python.html


cp949 to utf-8

http://lilypad.egloos.com/651441
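
A rough sketch of the conversion itself, assuming the input is already available as CP949-encoded bytes. The helper name cp949_to_utf8 and the sample string are made up for illustration and are not taken from the linked post.

```python
# -*- coding: utf-8 -*-
def cp949_to_utf8(data):
    """Re-encode CP949 bytes as UTF-8 bytes by going through unicode text."""
    return data.decode('cp949').encode('utf-8')


# Stand-in for bytes that would normally be read from a CP949 file.
sample = u'한글'.encode('cp949')
converted = cp949_to_utf8(sample)

print("%d bytes in cp949, %d bytes in utf-8" % (len(sample), len(converted)))
```

The conversion always goes bytes -> unicode -> bytes; there is no direct byte-to-byte step.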


unicode objects

http://effbot.org/zone/unicode-objects.htm


Korean (Hangul) encoding

http://harebox.tistory.com/entry/%ED%8C%8C%EC%9D%B4%EC%8D%AC-%ED%95%9C%EA%B8%80-%EC%9D%B8%EC%BD%94%EB%94%A9%EC%97%90-%EB%8C%80%ED%95%B4


String conversion

http://ask.python.kr/question/69253/%EB%AC%B8%EC%9E%90%EC%97%B4-%EB%B3%80%ED%99%98%EC%97%90-%EA%B4%80%EB%A0%A8%EB%90%9C-%EC%A7%88%EB%AC%B8/


  • Situation
    • Parsing wiki text prints correctly to stdout, but redirecting the output to a file raises UnicodeEncodeError
      • $ cat readXml.py
        #-*- coding: utf-8 -*-
        from xml.etree.ElementTree import iterparse
        import re
        import sys

        fileName = 'example.xml'

        xmlns = "{http://www.mediawiki.org/xml/export-0.8/}"
        targetTag = 'page'

        iparse = iterparse(fileName, ['start', 'end'])
        for event, elem in iparse:
                #print event, elem.tag
                if event == 'start' and elem.tag == xmlns + targetTag:
                        print 'found', elem.tag
                        pageNode = elem
                        break
        pages = (elem for event, elem in iparse if event == 'end' and elem.tag == xmlns + targetTag)
        cnt = 0
        keys = [ u'|이름=', u'|출생일=', u'|출생지=', u'|키=', u'|체중=', u'|포지션=', u'|등번호=' ]
        for page in pages:
                revisionElem = page.find(xmlns + 'revision')
                textElem = revisionElem.find(xmlns + 'text')
                if textElem.text is not None and u'축구 선수 정보' in textElem.text:
                        cnt += 1
                        print cnt, page.find(xmlns + 'title').text
                        for datum in textElem.text.split('\n'):
                                for key in keys:
                                        if datum.startswith(key):
                                                print datum.replace(key, ''),
                        #print

                if page in pageNode:
                        pageNode.remove(page)
        print 'total', cnt
        $ python2.7 readXml.py
        found {http://www.mediawiki.org/xml/export-0.8/}page
        1 크리스티아누 호날두
        크리스티아누 호날두 {{출생일과 나이|1985|2|5}}  {{POR}} [[마데이라 제도]] [[푼샬]] 187cm  85kg  [[공격수|윙포워드]],[[공격수|스트라이커]]  7 total 1
        $ python2.7 readXml.py > data.result
        Traceback (most recent call last):
          File "readXml.py", line 27, in <module>
            print cnt, page.find(xmlns + 'title').text
        UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
    • Both the input file and the locale environment variable are UTF-8
      • $ file example.xml
        example.xml: UTF-8 Unicode English text, with very long lines
        $ env | grep -i lang
        LANG=ko_KR.utf8
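  • Solution: always encode explicitly with string.encode('utf-8', 'ignore') before printing, because Python 2 falls back to the 'ascii' codec when stdout is not a terminal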
    • http://stackoverflow.com/questions/2224130/unicodeencodeerror-when-redirecting-stdout
    • $ cat readXml.py
      #-*- coding: utf-8 -*-
      from xml.etree.ElementTree import iterparse
      import re
      import sys

      fileName = 'example.xml'
      #fileName = 'kowiki-latest-pages-meta-current.xml'

      xmlns = "{http://www.mediawiki.org/xml/export-0.8/}"
      targetTag = 'page'

      iparse = iterparse(fileName, ['start', 'end'])
      for event, elem in iparse:
              #print event, elem.tag
              if event == 'start' and elem.tag == xmlns + targetTag:
                      print 'found', elem.tag
                      pageNode = elem
                      break
      pages = (elem for event, elem in iparse if event == 'end' and elem.tag == xmlns + targetTag)
      cnt = 0
      keys = [ u'|이름=', u'|출생일=', u'|출생지=', u'|키=', u'|체중=', u'|포지션=', u'|등번호=' ]
      for page in pages:
              revisionElem = page.find(xmlns + 'revision')
              textElem = revisionElem.find(xmlns + 'text')
              if textElem.text is not None and u'축구 선수 정보' in textElem.text:
                      cnt += 1
                      # Encode explicitly so the output does not depend on the encoding of stdout
                      print cnt, page.find(xmlns + 'title').text.encode('utf-8', 'ignore')
                      for datum in textElem.text.split('\n'):
                              for key in keys:
                                      if datum.startswith(key):
                                              print datum.replace(key, '').encode('utf-8', 'ignore'),
                      #print

              if page in pageNode:
                      pageNode.remove(page)
      print 'total', cnt
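
An alternative to calling .encode() on every print is to wrap the output stream once in a UTF-8 writer. The following is a minimal sketch, assuming the standard codecs module. The helper name utf8_writer is hypothetical, and an in-memory buffer stands in for the real stdout so the result can be inspected.

```python
# -*- coding: utf-8 -*-
import codecs
import io


def utf8_writer(byte_stream):
    """Wrap a byte stream so unicode text written to it is encoded as UTF-8."""
    return codecs.getwriter('utf-8')(byte_stream)


# Demonstration against an in-memory byte buffer instead of the real stdout.
buf = io.BytesIO()
out = utf8_writer(buf)
out.write(u'1 크리스티아누 호날두\n')
encoded = buf.getvalue()

print("wrote %d bytes" % len(encoded))
```

In a Python 2 script the same idea would be applied as sys.stdout = codecs.getwriter('utf-8')(sys.stdout) near the top, after which unicode values can be printed without .encode(). Setting the environment variable PYTHONIOENCODING=utf-8 before running the script is another option that leaves the code untouched.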


