Py3k admits, fixes, unicode bad.

Py3k is here, and I didn't even find out about it until the day after release. So it's safe to say I'm not a total Python dork. The first bullet point under "Text Vs. Data Instead Of Unicode Vs. 8-bit" made my day:

Python 3.0 uses the concepts of text and (binary) data instead of Unicode strings and 8-bit strings. All text is Unicode; however encoded Unicode is represented as binary data. The type used to hold text is str, the type used to hold data is bytes. The biggest difference with the 2.x situation is that any attempt to mix text and data in Python 3.0 raises TypeError, whereas if you were to mix Unicode and 8-bit strings in Python 2.x, it would work if the 8-bit string happened to contain only 7-bit (ASCII) bytes, but you would get UnicodeDecodeError if it contained non-ASCII values. This value-specific behavior has caused numerous sad faces over the years. (bold text my emphasis)
Yes, I have had days of work wrecked by this very behavior. It was nice to hear them recognize it.
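Here's a minimal sketch of the new rule as I read that paragraph, for Python 3 only. The `can_mix` helper and the font names are made up for illustration; they aren't from my work code.

```python
def can_mix(text, data):
    """Return True if text + data works, False if Python raises TypeError."""
    try:
        text + data
    except TypeError:
        return False
    return True

# Hypothetical font names: one pure ASCII, one with an umlaut (\u00fc).
ascii_bytes = b"Arial"
umlaut_bytes = "B\u00fccher Sans".encode("utf-8")

# In Python 3 both are refused, whatever the bytes happen to contain.
print(can_mix("font: ", ascii_bytes))
print(can_mix("font: ", umlaut_bytes))

# Crossing the boundary has to be explicit, with a named encoding.
print("font: " + umlaut_bytes.decode("utf-8"))
```

The point is that the ASCII case fails too, so the mistake shows up the first time the code runs instead of the first time an umlaut wanders in.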
At work, I wrote a Python 2.5 module whose sole purpose was to list game files' dependencies, so I could write nag scripts that fired when references were broken.

For the most part, I naively treated the XML files as ASCII text, and things worked OK for a while. I could use amara, parse everything, and move around and/or rewrite the XML.

But then I was slaughtered by a font with an umlaut in it.

The stupid umlaut actually led me down the 'what is character encoding really?' road, which I probably still misunderstand. But I learned to encode/decode, use unicode raw strings, and prefer codecs.open() over file() / open() all over the place. I stuffed asserts in random locations where enforcing unicodeness was a requirement. I pushed it from the sore spots all the way through every corner of my stupid code, and once I did, the code and I were stronger for it.
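For the curious, here's roughly the shape of that discipline, sketched in Python 3 rather than the 2.5 I was actually using. Back then the decoding call was codecs.open(path, "r", "utf-8"); in Python 3 the built-in open() takes an encoding argument. The helper names, the font name, and the assumption that the files are UTF-8 are all mine for this example.

```python
import os
import tempfile

def write_text(path, text, encoding="utf-8"):
    # The "assert unicodeness" habit: refuse anything that is not text.
    assert isinstance(text, str), "expected text, got %r" % type(text)
    with open(path, "w", encoding=encoding) as handle:
        handle.write(text)

def read_text(path, encoding="utf-8"):
    # Decode once, at the file boundary; everything past here is str.
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()

# Hypothetical XML snippet with an umlaut (\u00fc) in the font name.
xml = '<font name="B\u00fccher Sans"/>'
path = os.path.join(tempfile.mkdtemp(), "font.xml")

write_text(path, xml)
assert read_text(path) == xml

# On disk it is bytes, and the umlaut takes two of them in UTF-8.
with open(path, "rb") as handle:
    raw = handle.read()
print(len(xml), len(raw))
```

Decode on the way in, encode on the way out, and keep text as text in between. That's the whole trick, as far as I understand it.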

In short, there were numerous sad faces.

Hopefully, I can figure out the new deal. And hopefully, it's less borked.
