It’s gotten a little old for me, just because it still buoys a wave of “solve a problem with a regex, now you’ve got two problems, hehe” types, which has become just thinly veiled “you can’t make me learn new things, damn you”. Like all tools, its actual usefulness lies somewhere in the vast middle ground between angelic and demonic, and while 16 years ago, when this was written, the world may have needed more reminding of damnation, today the message the world needs is firmly: yes, regex is sometimes a great solution, learn it!
I agree that people should learn how regular expressions work. They should also learn how SQL works. People get scared of these things, then hide them behind an abstraction layer in their tools, and never really learn them.
But, more than most tools, it is important to learn what regular expressions are and are not for. They are for scanning and extracting text. They are not for parsing complex formats. If you need to actually parse complex text, you need a parser in your toolchain.
This doesn't necessarily require the hair pulling that the article indicates. Python's BeautifulSoup library does a great job of giving you both convenience and real parsing.
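BeautifulSoup isn't in the standard library, but the same "real parser, not regex" idea can be sketched with Python's built-in `html.parser` (the link-collecting task here is just an illustration):

```python
from html.parser import HTMLParser

# BeautifulSoup has a friendlier API, but the stdlib parser makes the
# same point: let a real HTML parser do the tokenizing for you.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

collector = LinkCollector()
collector.feed('<p>See <a href="/docs">docs</a> and <a href="/faq">FAQ</a>.</p>')
print(collector.links)  # ['/docs', '/faq']
```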
Also, if you write a complicated regular expression, I suggest looking into the /x modifier. How you enable it varies by language, but it allows you to put comments inside of your regular expression, which turns it from cryptic code that scares your maintenance programmer into something that is easy to understand. Plus, if the expression is complicated enough, you might be that maintenance programmer! (Try writing a tokenizer as a regular expression. Internal comments pay off quickly!)
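In Python, for example, the /x idea is spelled `re.VERBOSE` (or inline `(?x)`): whitespace in the pattern is ignored and `#` starts a comment, so the expression can document itself. A small sketch with a made-up date pattern:

```python
import re

# re.VERBOSE is Python's equivalent of /x: the pattern can carry
# comments and whitespace, so it reads like annotated code.
iso_date = re.compile(r"""
    (?P<year>\d{4})   # four-digit year
    -
    (?P<month>\d{2})  # two-digit month
    -
    (?P<day>\d{2})    # two-digit day
""", re.VERBOSE)

m = iso_date.fullmatch("2009-11-12")
print(m.group("year"))  # 2009
```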
With regex, I won’t. I rarely include much regex in my PRs — usually just small filters for text inputs. More complex regexes are saved for my own use to either parse out oddly formatted data, or as vim find/replace commands (or both!).
When I do use a complex regex, I document it thoroughly - not only for those unfamiliar, but also for my future self so I have a head-start when I come back to it. Usually when I get called out on it in a PR, it’s one of two things:
- “Does this _need_ to be a regex?” I’m happy to justify it, and it’s a question I ask teammates too, especially when I see a sufficiently complex expression
- “What’s that do?” This rarely comes from an “I don’t know regex” situation, and more often from an “I’m unfamiliar with this specific part of regex” one, e.g. back references.
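Back references are a fair example of that kind of head-scratcher. A minimal sketch of what `\1` does — the doubled-word example here is my own illustration, not from the thread:

```python
import re

# \1 matches whatever group 1 actually matched, so this finds
# doubled words like "the the" -- a classic back-reference use.
doubled = re.compile(r"\b(\w+)\s+\1\b")

print(bool(doubled.search("it was the the best of times")))  # True
print(bool(doubled.search("it was the best of times")))      # False
```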
I think the former is 100% valid - it’s easy to use too much regex, or to use it where there are better methods that might not be the first place one goes: need to ensure a text field always displays numbers? Use a type=number input; need to ensure a phone number is a valid NANP number? Regex, baby!
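A sketch of the NANP case, with a deliberately loose pattern of my own — it only enforces the rule that area code and exchange start with 2–9; real NANP validation is stricter than this:

```python
import re

# Hypothetical NANP check: area code and exchange must start with 2-9,
# optional separators and parens. Looser than the real numbering rules.
nanp = re.compile(r"""
    ^\(?([2-9]\d{2})\)?    # area code, optional parens
    [\s.-]?([2-9]\d{2})    # exchange
    [\s.-]?(\d{4})$        # subscriber number
""", re.VERBOSE)

print(bool(nanp.match("(212) 555-0123")))  # True
print(bool(nanp.match("123-456-7890")))    # False: area code can't start with 1
```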
The latter is of course valid too, and I try to approach any question about why a regex was used, or what it does, with a link to a regex web interface and an explanation of my thinking. I’ve had coworkers occasionally start using more regex in daily tasks as a result, and that’s great! It can really speed up tasks that would otherwise be crummy to do by hand or when finagling with a parser.
Bonus: some of my favorite regex adventures:
- Parsing out a heavily customizable theme’s ACF data stuffed into custom fields in a WordPress database, only to shove them into a new database with a new and %better% ACF structure
- Taking PDF bank statements in Gmail, copying the text, and using a handful of painstakingly written find/replace vim regexes to parse the goop into a CSV format because why would banks ever provide structured data??
- Copy/pasting all of the Music League votes and winners from a like 20-person season into a text doc and converting it to a JSON format via regex that I could use to create a visualization of stats.
- Not parsing HTML (again, anyways)
It gets buried in the rant, but this part is the key:
> HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.
The first sentence is correct, but the second is wrong. A regex can be used to break HTML into lexical tokens like start tags and end tags, which is what the question asks about.
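A sketch of that tokenizing-vs-parsing distinction — deliberately naive, and the pattern is my own illustration (it ignores comments, CDATA, and `>` inside attribute values):

```python
import re

# Tokenizing, not parsing: split HTML into tags and text runs.
# No tree, no nesting checks -- just lexical pieces.
TOKEN = re.compile(r"</?[a-zA-Z][^>]*>|[^<]+")

tokens = TOKEN.findall("<p>Hello <b>world</b></p>")
print(tokens)  # ['<p>', 'Hello ', '<b>', 'world', '</b>', '</p>']
```

Deciding whether those tags nest properly is the part that genuinely needs a parser.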
Fair enough. GP is right in that there's a lot of absolutism with regards to what regex can solve. I first learned recursive-descent parsing from Destroy All Software, where he used regex for the lexing stage by trying to match the start of the buffer for each token. I'm glad I learned it that way, otherwise I probably would have gotten lost in the character-by-character lexing as a beginner and would've never considered using regex. Now I use regex in most of my parsers, to various degrees.
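That match-at-the-start-of-the-buffer style can be sketched like this — the token set and grammar here are invented for illustration, not from the screencast:

```python
import re

# Lexing by matching each token pattern at the current position,
# the style described above. Token names are made up for the example.
TOKENS = [
    ("NUMBER", re.compile(r"\d+")),
    ("IDENT",  re.compile(r"[a-z]+")),
    ("PLUS",   re.compile(r"\+")),
    ("SPACE",  re.compile(r"\s+")),
]

def lex(src):
    pos, out = 0, []
    while pos < len(src):
        for name, pattern in TOKENS:
            m = pattern.match(src, pos)  # anchored at pos: "start of the buffer"
            if m:
                if name != "SPACE":      # drop whitespace tokens
                    out.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise SyntaxError(f"unexpected character at {pos}")
    return out

print(lex("x + 42"))  # [('IDENT', 'x'), ('PLUS', '+'), ('NUMBER', '42')]
```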
As for GP's "solve a problem with a regex, now you’ve got two problems, hehe", I remember for years trying to use regex and never being able to get it to work for me. I told my friends such, "I've literally never had regex help in a project, it always bogs me down for hours then I give up". I'm not sure what happened, but one day I just got it, and I've never had much issue with regex again and use it everywhere.
I had a hard time with complex regex until I started using them more in vim - the ability to see what your regex matches as you work is really helpful. Of course, this is even better now with sites like regexr and regex101
Regex101 is always open when I'm doing regexes, what a great tool. Occasionally I use sites that build the graph so you can visualize the matching behaviour.
There are even tools to generate matching text from a regex pattern. Rust's Proptest (property-based testing) library uses this to generate test inputs from a regex pattern, then shrinks failures toward minimal counterexamples. The tooling around regex can be pretty awesome.
I swapped out a "proper" parser for a regex parser for one particular thing we have at work that was too slow with the original parser. The format it is parsing is very simple: one top-level tag, no nested keys, no comments, no attributes, nor any of the other weird things you can do in XML. We needed to get the value of one particular tag in a potentially huge file. As far as I can tell this format has been unchanged for the past 25 years ... It took me 10 minutes to write the regex parser, and it sped up the execution by 10-100x. If the format changes unannounced tomorrow and it breaks this, we'll deal with it - until then, YAGNI
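The shape of that trick might look like this — the tag name `checksum` is hypothetical, and it assumes exactly the flat, no-nesting, no-attributes format described above:

```python
import re

# For a flat, stable format, grab one tag's value directly instead of
# building a full XML tree. Tag name is hypothetical.
VALUE = re.compile(r"<checksum>([^<]*)</checksum>")

def find_checksum(text):
    m = VALUE.search(text)
    return m.group(1) if m else None

print(find_checksum("<doc><id>7</id><checksum>ab12</checksum></doc>"))  # ab12
```

The speedup comes from never materializing a tree for a huge document you only need one value from.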
Is it really? Maybe I'm blessed with innocence, but I've never been tempted to read it as anything but a humorous commentary on formal language theory.
If you stop learning the basics, you will never know when the sycophantic AI happily lures you down a dark alley because it was the only way you discovered on your own. You’ll forever be limited to a rehashing of the bland code slop the majority of the training material contained. Like a carpenter who’s limited to drilling Torx screws.
If that’s your goal in life, don’t let me bother you.
That's not entirely fair. It's relatively easy to learn the basics of regular expressions. But it's also relatively easy, with that knowledge, to write regular expressions that
- don't work the way you want them to (miss edge cases, etc)
- fail catastrophically (i.e., catastrophic backtracking), which can lead to denial-of-service vulnerabilities (ReDoS)
- are hard to read/maintain
I love regular expressions, but they're very easy to use poorly.
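The second bullet's classic shape is nested quantifiers: a backtracking engine proves a *non*-match by trying exponentially many ways to split the input. A small sketch (don't actually run the slow case):

```python
import re

# Classic catastrophic-backtracking shape: (a+)+ nests quantifiers,
# so a failing input forces exponentially many split attempts.
evil = re.compile(r"^(a+)+$")

print(evil.match("a" * 30) is not None)  # True, and fast: it matches greedily
# evil.match("a" * 40 + "b") would NOT match, but a backtracking engine
# needs roughly 2^40 attempts to prove it -- so this line stays commented out.

safe = re.compile(r"^a+$")                 # same language, no nesting
print(safe.match("a" * 40 + "b") is None)  # True, instantly
```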
> If they don't work the way you want, you just keep refining it. This is easy if you actually test your regex in real data.
There can be edge cases in both your data and in the regular expression itself. It's not as easy as "write your code correctly and test it". Although that's true of programming in general, regular expressions tend to add an extra "layer" to it.
I don't know if you meant it to be that way, but your comment sounds a lot like "it's easy to program without bugs if you test your code". It's pretty much a given that that's not the case.
I didn’t get the “it’s easy to program without bugs” vibe at all, and OP even mentioned an edge case that took their parser down (BUG!)
Neither the human nor the AI will catch every edge case, especially if the data can be irregular. I think the point they were making is more along the lines of “when you do it yourself, you can understand and refine and fix it more easily.”
If an LLM had done my regular expressions early in my career, I’d only !maybe! have learned just what I saw and needed to know. I’m almost certain the first time I saw (?:…) I’d have given up and leaned into the AI heavily.