python unicode, encoding
http://docs.python.org/2/howto/unicode.html
e.g. UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 0: ordinal not in range(128)
e.g. UnicodeDecodeError: 'ascii' codec can't decode byte 0x9f in position 3: ordinal not in range(128)
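In Python 2 these two errors usually come from an implicit conversion with the default 'ascii' codec. As a minimal sketch, the snippet below triggers both explicitly so the failing position can be inspected. The variable names are illustrative only, and the code is written to run under both Python 2.7 and Python 3.

```python
# Sketch: reproduce both errors explicitly with the 'ascii' codec.
text = u'\ua000'       # one code point outside the ASCII range
data = b'abc\x9f'      # byte 0x9f at index 3 is not valid ASCII

try:
    text.encode('ascii')
except UnicodeEncodeError as err:
    encode_error = err  # err.start is the index of the offending character

try:
    data.decode('ascii')
except UnicodeDecodeError as err:
    decode_error = err  # err.start is the index of the offending byte

print('encode failed at %d, decode failed at %d'
      % (encode_error.start, decode_error.start))
```

The positions reported here (0 and 3) match the positions in the two example messages.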
http://coreapython.hosting.paran.com/etc/Unicode%20in%20python.html
cp949 to utf-8
http://lilypad.egloos.com/651441
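A CP949 to UTF-8 conversion can be sketched as decode first, then encode. The helper name `cp949_to_utf8` is hypothetical, and the sample bytes are generated in place rather than read from a real CP949 file.

```python
def cp949_to_utf8(raw):
    """Decode CP949 bytes to text, then re-encode the text as UTF-8."""
    return raw.decode('cp949').encode('utf-8')

# The Korean word "Hangul", written with escapes to keep the source ASCII-only.
korean = u'\ud55c\uae00'
cp949_bytes = korean.encode('cp949')   # stand-in for bytes read from a CP949 file
utf8_bytes = cp949_to_utf8(cp949_bytes)

print('%d bytes in CP949, %d bytes in UTF-8' % (len(cp949_bytes), len(utf8_bytes)))
```

Going through a unicode object in the middle is the important part: byte strings in two different encodings should not be converted into each other directly.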
unicode objects
http://effbot.org/zone/unicode-objects.htm
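As a small illustration of why unicode objects and byte strings should be kept apart, the sketch below slices the same text both ways. The names are illustrative and the code is meant to run under both Python 2.7 and Python 3.

```python
text = u'\ud55c\uae00'        # two code points
utf8 = text.encode('utf-8')   # the same text as six bytes

first_char = text[:1]         # slicing text keeps whole characters

try:
    utf8[:4].decode('utf-8')  # four bytes cut the second character in half
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

print('chars=%d bytes=%d truncated_ok=%s' % (len(text), len(utf8), truncated_ok))
```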
Korean (Hangul) encoding
String conversion
- Situation
- Parsing wiki text prints correctly to stdout, but redirecting the output to a file raises UnicodeEncodeError
- $ cat readXml.py
#-*- coding: utf-8 -*-
from xml.etree.ElementTree import iterparse
import re
import sys

fileName = 'example.xml'
xmlns = "{http://www.mediawiki.org/xml/export-0.8/}"
targetTag = 'page'

iparse = iterparse(fileName, ['start', 'end'])
for event, elem in iparse:
    #print event, elem.tag
    if event == 'start' and elem.tag == xmlns + targetTag:
        print 'found', elem.tag
        pageNode = elem
        break

pages = (elem for event, elem in iparse if event == 'end' and elem.tag == xmlns + targetTag)
cnt = 0
keys = [ u'|이름=', u'|출생일=', u'|출생지=', u'|키=', u'|체중=', u'|포지션=', u'|등번호=' ]

for page in pages:
    revisionElem = page.find(xmlns + 'revision')
    textElem = revisionElem.find(xmlns + 'text')
    if textElem.text is not None and u'축구 선수 정보' in textElem.text:
        cnt += 1
        print cnt, page.find(xmlns + 'title').text
        for datum in textElem.text.split('\n'):
            for key in keys:
                if datum.startswith(key):
                    print datum.replace(key, ''),
        #print
    if page in pageNode:
        pageNode.remove(page)
print 'total', cnt
$ python2.7 readXml.py
found {http://www.mediawiki.org/xml/export-0.8/}page
1 크리스티아누 호날두
크리스티아누 호날두 {{출생일과 나이|1985|2|5}} {{POR}} [[마데이라 제도]] [[푼샬]] 187cm 85kg [[공격수|윙포워드]],[[공격수|스트라이커]] 7 total 1
$ python2.7 readXml.py > data.result
Traceback (most recent call last):
  File "readXml.py", line 27, in <module>
    print cnt, page.find(xmlns + 'title').text
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
- Both the input file and the environment variables are UTF-8
- $ file example.xml
example.xml: UTF-8 Unicode English text, with very long lines
$ env | grep -i lang
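LANG=ko_KR.utf8
- Solution: when stdout is redirected, Python 2 has no terminal encoding to use and falls back to 'ascii', so always encode explicitly with string.encode('utf-8', 'ignore')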
- http://stackoverflow.com/questions/2224130/unicodeencodeerror-when-redirecting-stdout
- $ cat readXml.py
#-*- coding: utf-8 -*-
from xml.etree.ElementTree import iterparse
import re
import sys

fileName = 'example.xml'
#fileName = 'kowiki-latest-pages-meta-current.xml'
xmlns = "{http://www.mediawiki.org/xml/export-0.8/}"
targetTag = 'page'

# Advance the parser to the first <page> start event and remember that element.
iparse = iterparse(fileName, ['start', 'end'])
for event, elem in iparse:
    #print event, elem.tag
    if event == 'start' and elem.tag == xmlns + targetTag:
        print 'found', elem.tag
        pageNode = elem
        break

# Lazily yield each <page> element once it has been fully parsed.
pages = (elem for event, elem in iparse if event == 'end' and elem.tag == xmlns + targetTag)
cnt = 0
keys = [ u'|이름=', u'|출생일=', u'|출생지=', u'|키=', u'|체중=', u'|포지션=', u'|등번호=' ]

for page in pages:
    revisionElem = page.find(xmlns + 'revision')
    textElem = revisionElem.find(xmlns + 'text')
    if textElem.text is not None and u'축구 선수 정보' in textElem.text:
        cnt += 1
        # Encode explicitly so print never falls back to the 'ascii' codec.
        print cnt, page.find(xmlns + 'title').text.encode('utf-8', 'ignore')
        for datum in textElem.text.split('\n'):
            for key in keys:
                if datum.startswith(key):
                    print datum.replace(key, '').encode('utf-8', 'ignore'),
        #print
    if page in pageNode:
        pageNode.remove(page)
print 'total', cnt
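The fix above encodes every string at each print call. An alternative, sketched here under stated assumptions, is to wrap the output stream once so that everything written to it is encoded as UTF-8. The helper name `utf8_writer` is hypothetical, and `io.BytesIO` stands in for a redirected stdout so the result can be inspected.

```python
import codecs
import io

def utf8_writer(byte_stream):
    """Wrap a byte stream so that text written to it is encoded as UTF-8."""
    return codecs.getwriter('utf-8')(byte_stream)

buf = io.BytesIO()             # stand-in for a redirected stdout
out = utf8_writer(buf)
out.write(u'\ud55c\uae00\n')   # text goes in, UTF-8 bytes come out

print('%d bytes written' % len(buf.getvalue()))
```

In a Python 2 script the same wrapper could be applied to `sys.stdout` itself, and setting the `PYTHONIOENCODING=utf-8` environment variable before running the script is another option that needs no code change. Both are suggestions to try, not something the original script used.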