
It really is a central example of the bell curve meme, isn't it?

The reason we tell people not to parse HTML/XML/whatever with regular expressions isn't so much that you can't use regular (CS sense) patterns to extract information from regular (CS sense) strings* that happen to be drawn from a language that can also express non-regular strings. It's that when you let the median programmer try, he'll screw it up.

So we tell people you "can't" parse XML with regular expressions, even though the claim is nonsense if you think about it, so that the ones who aren't smart and independent-minded enough to see through the false impossibility claim don't create messes the rest of us have to clean up.

One of the most disappointing parts of becoming an adult is realizing the whole world is built this way: see https://en.wikipedia.org/wiki/Lie-to-children

(* That is, strings belonging to some regular language L_r (which you can parse with a state machine), L_r being a subset of the L you really want to parse (which you can't). L_r can be a surprisingly large subset of L, e.g. all XML with nesting depth of at most 1,000. The result isn't necessarily a practical engineering solution, but it's a CS possibility, and sometimes more practical than you think, especially because in many cases nesting depth is schema-limited.)

Concrete example: "JSON" in general isn't a regular language, but JavaScript-ecosystem package.json, constrained by its schema, IS.

Likewise, XML isn't a regular language in general, but AndroidManifest.xml specifically is!
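
To make the package.json case concrete, here is a minimal Python sketch (my own illustration, not a recommendation; it assumes the top-level "version" key is the only key by that name, which typical package.json files satisfy but the format doesn't strictly guarantee):

    import re

    # A typical, schema-constrained package.json fragment.
    manifest = '''{
      "name": "example",
      "version": "1.2.3",
      "private": true
    }'''

    # Because the schema keeps "version" a flat quoted string at the top level,
    # a plain regular expression is enough to pull it out.
    match = re.search(r'"version"\s*:\s*"([^"]+)"', manifest)
    print(match.group(1))  # 1.2.3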

Is it a good idea to use "regex" (whatever that means in your language) to parse either kind of file? No, probably not. But it's just not honest to tell people it can't be done. It can be.



Can regular expressions parse XML? No.

Can regular expressions parse the subset of XML that I need to pull something out of a document? Maybe.

We have enough library "ergonomics" now that a full XML parser is no more difficult to use than a regex in dynlangs. Back when this wasn't the case, it really did mean the difference between a one- or two-line solution and about 300 lines of SAX boilerplate.
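
For instance, in Python (a sketch of my own; the attribute and document are made up), both routes are about one line these days:

    import re
    import xml.etree.ElementTree as ET

    doc = '<manifest package="com.example.app"><application/></manifest>'

    # Regex one-liner: fine while the attribute stays this simple.
    print(re.search(r'package="([^"]*)"', doc).group(1))   # com.example.app

    # The real parser is now just as short.
    print(ET.fromstring(doc).get("package"))                # com.example.app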


It's always the edge cases that make this a pain.

The less like 'random' XML the document is, the better the extraction will work. As soon as something oddball gets tossed in that drifts from the expected pattern, things will break.


Of course. But the mathematical, computer-science-level truth is that you can make a regular pattern that recognizes the strings of any context-free language so long as you're willing to place a bound on the length (or the nesting depth) of those strings. Everything else is a lie-to-children (https://en.wikipedia.org/wiki/Lie-to-children).


You can, but you probably shouldn't, since said regex is likely to be very hard to work with due to the number of redundant states involved.


Our discourse does a terrible job of distinguishing impossible things from things that are merely ill-advised. Intellectual honesty requires us to be up front about the difference.

Yeah, I'd almost certainly reject a code review using, say, Python's re module to extract stuff from XML, but while doing so, I would give every reason except "you can't do that".


This reminds me of cleaning a toaster with a dishwasher: https://news.ycombinator.com/item?id=41235662


If I’m not mistaken, even JSON couldn’t be parsed by a regex due to the recursive nature of nested objects.

But in general we aren’t trying to parse arbitrary documents, we are trying to parse a document with a somewhat-known schema. In this sense, we can parse them so long as the input matches the schema we implicitly assumed.


> If I’m not mistaken, even JSON couldn’t be parsed by a regex due to the recursive nature of nested objects.

You can parse ANY context-free language with regex so long as you're willing to put a cap on the maximum nesting depth and length of constructs in that language. You can't parse "JSON" but you can, absolutely, parse "JSON with up to 1000 nested brackets" or "JSON shorter than 10GB". The lexical complexity is irrelevant. Mathematically, whether you have JSON, XML, sexps, or whatever is irrelevant: you can describe any bounded-nesting context-free language as a regular language and parse it with a state machine.
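
A toy illustration of that construction in Python (my own sketch, restricted to integer arrays with no whitespace so the unrolled pattern stays readable; the same expansion works for any fixed depth cap):

    import re

    def bounded_array_pattern(max_depth):
        # Depth 0: bare integers only.
        value = r'\d+'
        # Each unrolling step allows one more level of [...] around the
        # previous level, so the final pattern is an ordinary (if large)
        # regular expression: bounded nesting depth makes the language regular.
        for _ in range(max_depth):
            value = r'(?:\[(?:{v}(?:,{v})*)?\]|\d+)'.format(v=value)
        return re.compile(value)

    deep = bounded_array_pattern(3)
    print(bool(deep.fullmatch('[[1,2],[3,[4]]]')))  # True: nesting depth 3
    print(bool(deep.fullmatch('[[[[1]]]]')))        # False: depth 4, over the cap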

It is dangerous to tell the wrong people this, but it is true.

(Similarly, you can use a context-free parser to understand a context-sensitive language provided you bound that language in some way: one example is the famous C "lexer hack" that allows a simple LALR(1) parser to understand C, which, properly understood, is a context-sensitive language in the Chomsky sense.)

The best experience for the average programmer is describing their JSON declaratively in something like Zod and having their language runtime either build the appropriate state machine (or "regex") to match that schema or, if it truly is recursive, fall back to something else to parse it, all transparently to the programmer.


What everyone forgets is that regexes as implemented in most programming languages are a strict superset of mathematical regular expressions. E.g., PCRE has "subroutine references" that can be used to match balanced brackets, and .NET has "balancing groups" that can similarly be used to do so. In general, the regex flavors in most programming languages can recognize at least the context-free languages.
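
For example, with Python's third-party regex module (which, as far as I know, supports PCRE-style recursion via (?R), unlike the standard library's re), matching arbitrarily deep balanced parentheses is a one-liner; a sketch:

    import regex  # third-party "regex" module; the stdlib re has no recursion

    # (?R) re-enters the whole pattern, so this recognizes balanced parentheses
    # at any depth, which is strictly beyond a true regular language.
    balanced = regex.compile(r'\((?:[^()]|(?R))*\)')

    print(bool(balanced.fullmatch('(a(b(c)d)e)')))  # True
    print(bool(balanced.fullmatch('(a(b)')))        # False: unbalanced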


It's impossible to parse arbitrary XML with regex. But it's perfectly reasonable to parse a subset of XML with regex, which is a very important distinction.



