The removal of implicit conversion between bytes and strings via ASCII did not introduce bugs, it showed where you already had bugs that you maybe had not yet noticed.
That's only true if you ever received Unicode input. There are plenty of uses of strings that never do - enums, DNS domain names, URLs, HTTP parsing, email addresses (from any sane provider) etc.
Strings are still strings in Python 3; if you do 'foo' == EnumValue then that will work fine in Python 2 and 3. If 'foo' is from an unknown source: yeah, you might get a bytes type in Python 3 and an error, but that's the entire point. Turns out that in practice, it can contain characters >0xff more often than you'd think.
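A minimal sketch of the behavior change being discussed (Python 3 shown; in Python 2 the bytes side was implicitly decoded as ASCII before comparison):

```python
# In Python 2, b'foo' == u'foo' was True thanks to implicit ASCII
# decoding. In Python 3 the comparison is simply False, which is
# what surfaces the latent type confusion.
assert b'foo' != 'foo'

# An explicit decode restores the old behavior, but now the encoding
# assumption is visible in the code:
assert b'foo'.decode('ascii') == 'foo'

# And non-ASCII input fails loudly instead of comparing wrongly:
try:
    '\xe9'.encode('ascii')   # 'é' has no ASCII encoding
except UnicodeEncodeError:
    pass
```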
Certainly today DNS domain names, URLs, and email addresses can – and do – contain >0xff input, and for some of these that was the case in 2008 as well (URLs and email addresses – IDN didn't come until a bit later).
The Python 2 situation was untenable and led to a great many bugs; "decode errors" were something I regularly encountered in Python programs as a user. In hindsight, the migration effort for existing codebases was understated and things could have been done better with greater attention to compatibility, but the problem it addressed was very real.
I've seen way too many cases (possibly resulting from 2to3 autoconversion) where the code ran without errors, you just couldn't log in because the xsrf token was "b'123456'" instead of "123456".
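That failure mode is easy to reproduce: formatting a bytes object with str() or an f-string embeds its repr, b'' wrapper and all, without raising anything (a sketch; the token value is made up):

```python
token = b'123456'  # e.g. read from a socket or cookie without decoding

# No error anywhere, but the "token" now literally contains b'':
assert str(token) == "b'123456'"
assert f"xsrf={token}" == "xsrf=b'123456'"

# The fix is an explicit decode at the bytes/str boundary:
assert token.decode('ascii') == '123456'
```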
DNS-the-protocol still doesn't support non-ASCII input, but DNS-as-people-use-it does. I expanded on that in another comment I just posted, so I won't repeat it here: https://news.ycombinator.com/item?id=34230218
But it's also very rare that you ask directly for domain name resolution in the first place; you are usually instead using something, directly or indirectly, that happens to be remote, and that eventually happens to have a punycode-encoded non-ASCII hostname or top-level domain. But there's no guarantee that you (or the libraries, or the libraries that your libraries use...) are only handling the ASCII punycode.
I can count on my fingers the number of places I invoked manual DNS lookups inside production code.
You can't use Unicode characters in HTTP messages - IDNs and IRIs are encoded into ASCII before being sent on the wire (using punycode for IDNs and percent encoding for IRIs).
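For the IRI side, the percent-encoding step can be sketched with the stdlib: urllib.parse.quote percent-encodes the UTF-8 bytes of each non-ASCII character before the path goes on the wire.

```python
from urllib.parse import quote, unquote

path = 'ουτοπία'
wire = quote(path)   # what actually appears in the HTTP request line
assert wire == '%CE%BF%CF%85%CF%84%CE%BF%CF%80%CE%AF%CE%B1'

# ...and decoded again for display:
assert unquote(wire) == path
```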
As for RFC6531, my understanding is that virtually no email provider implements it, because of the same risks that make browsers often show IDNs as their punycode version - Unicode is extremely easy to use to confuse people maliciously or accidentally, since it contains vast amounts of duplicate characters (e.g. Latin a and Cyrillic а), even larger amounts of similar looking characters (e.g. Latin r and Cyrillic г), and even characters that are ambiguous unless you choose a locale! (the CJK problem, where the same Unicode codepoint can represent different characters based on the locale - whose locale to use when communicating between two machines being the implementer's problem).
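The duplicate-character problem is easy to demonstrate: the two strings below render identically in most fonts, yet never compare equal, not even after Unicode normalization.

```python
import unicodedata

latin = 'a'       # U+0061
cyrillic = 'а'    # U+0430

assert latin != cyrillic
assert unicodedata.name(latin) == 'LATIN SMALL LETTER A'
assert unicodedata.name(cyrillic) == 'CYRILLIC SMALL LETTER A'

# Normalization does not unify confusables; they are distinct characters:
assert unicodedata.normalize('NFKC', cyrillic) != latin
```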
Also, I'm not saying that Python shouldn't have had proper support for separating byte arrays from encoded strings. I was only pointing out that there were actual legitimate use cases where a valid Python 2 program was broken by the Python 3 Unicode string transition, whereas the GP was claiming that the Python 2 code had to have been buggy already.
Edit: reading around more, it seems that RFC6531 is getting some traction, and many providers accept sending to/receiving from internationalized emails even if they don't themselves allow you to have a Unicode email (e.g. you can't have айа@gmail.com, but you can correspond with someone having such an email at a different provider). So, email was a bad example in my list. The rest still stand.
No doubt some things broke "needlessly", and some things could have been done better, but I don't see how it could have been avoided entirely, since there is no way to distinguish between "I know that this string will always be ASCII" vs. "this string can contain non-ASCII".
For example, what if I want to enter "ουτοπία.δπθ.gr" in an application, via direct user input, a config file, or something else? Or display that to the user from punycode? No one expects users to convert that manually to "xn--kxae4bafwg.xn--pxaix.gr", and no one will understand that when being displayed, so any generic IDN/domain name library will have to deal with non-ASCII character sets.
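The conversion itself can be sketched with the stdlib's 'idna' codec (which implements IDNA 2003; the newer rules live in the third-party idna package), using the example domain above:

```python
# Encoding for the wire, decoding for display:
wire = 'ουτοπία.δπθ.gr'.encode('idna')
assert wire == b'xn--kxae4bafwg.xn--pxaix.gr'
assert wire.decode('idna') == 'ουτοπία.δπθ.gr'
```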
The same holds true for email addresses: "ουτοπία <a@example.com>" is an email address. Sure, this may get encoded as "=?UTF-8?q?…?=" in the email header (not always, "bare" UTF-8 is becoming more common) but you still want to display that, accept it as user input, etc. People sometimes forget that the name part of an email address is widely used and that any serious email system will have to deal with it, and non-ASCII input has been common there for a long time.
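A sketch of the encoded-word round trip in the stdlib (email.header handles the MIME encoding in both directions):

```python
from email.header import Header, decode_header

# Encoding a non-ASCII display name into an encoded-word:
encoded = Header('ουτοπία', 'utf-8').encode()
assert encoded.startswith('=?utf-8?')

# Decoding it back for display:
raw, charset = decode_header(encoded)[0]
assert raw.decode(charset) == 'ουτοπία'
```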
In specific applications you can often "get away" with ignoring non-ASCII input because you sometimes don't need it. For example, I'm working on some domain name code right now which can, because everything is guaranteed to be ASCII or "punycode ASCII", so it's all good. But in generic libraries – such as those in Python's stdlib – that's much harder.
Even a decade ago, you would have needed to support Unicode if you handled any of those strings. IDN domain names existed as far back as 2003, so unless you could guarantee that everything was in A-label form already, you would need to worry about that (which affects URLs and email addresses as well). URL paths might be Unicode if no one normalized them to percent-encoding first. And HTTP headers – like email headers – could well be non-ASCII despite the standard prohibiting unencoded non-ASCII text, because the real world is full of shitty implementations that don't follow standards, and the internet community generally runs on the principle that it's better to force everybody else to try to make sense of the result than to tell those people to fix their code.
Indeed, I just had to port some code that had been running happily for years at the South Pole to py3 (due to an EL8 upgrade and not wanting to install the py2 world). It was something talking to an HV supply over a serial port, which of course only spits out bytes, but then needed to parse them using string handling. It wasn't that hard to port, but it took a few tries to find all the places necessary, requiring debugging over a rather slow ssh connection (that is only up when the satellite is up).
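A sketch of what that kind of port looks like in practice (the command name and response format here are invented for illustration): serial reads return bytes in Python 3, so all the string parsing has to happen on the str side of one explicit decode.

```python
# Hypothetical response from an HV supply over a serial line:
raw = b'VMON 1250.3 V\r\n'   # e.g. what serial.Serial.readline() returns

# Decode once at the boundary, then parse as str:
line = raw.decode('ascii', errors='replace').strip()
fields = line.split()         # ['VMON', '1250.3', 'V']
voltage = float(fields[1])
assert voltage == 1250.3
```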