This is a false dichotomy. Not every technique needs to work in all cases in order to be useful.
This seems analogous to arguing that because seat belts don't save the lives of all people involved in car crashes, and they're kind of annoying, then they shouldn't be factory-standard.
This is a case of a feature that is actively harmful for the things it tries to prevent, because it increases the risk in practice of panics "spreading" throughout a system, even after the programmer thought she had finished handling it, and because it gives a false impression of what kind of guarantee you actually have.
That’s not surprising to me, but it’s not much of an argument for changing the default to be less safe. Most people want poisoning to propagate fatal errors and avoid reading corrupted data, not to recover from panics.
Edit: isn’t that an argument not to change the default? If people were recovering from poison a lot and that was painful, that’s one thing. But if people aren’t doing that, why is this a problem?
If the issue is that everyone has to write an extra unwrap, then a good step would be to make lock panic automatically in the 2027 edition, and add a lock_or_poison method for the current behavior. But I think removing poisoning altogether from the default mutex, such that it silently unlocks on panic, would be very bad. The silent-unlock behavior is terrible with async cancellations and terrible with panics.
You seem to keep making the implicit assumption that because people are using `unwrap()`, they must not care about the poisoning behavior. I really don't understand where this assumption is coming from. I explicitly want to propagate panics from contexts that hold locks to contexts that take locks. The way to write that is `lock().unwrap()`. I get that some people might write `lock().unwrap()` not because they care about propagating panics, but because they don't care either way and it's easy. But why are you assuming that that's most people?
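For concreteness, here's a minimal sketch of the propagation I mean (illustrative code, not from any real project):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let data = Arc::new(Mutex::new(vec![1, 2, 3]));

    // A thread panics while holding the lock, which poisons the mutex.
    let d = Arc::clone(&data);
    let _ = thread::spawn(move || {
        let _guard = d.lock().unwrap();
        panic!("invariant violated while the lock was held");
    })
    .join();

    // `lock()` now returns Err(PoisonError), so this `unwrap()` propagates the
    // failure into the context that takes the lock, instead of silently handing
    // out data that may have been left in an inconsistent state.
    let _guard = data.lock().unwrap(); // panics here
}
```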
I'm suggesting that the balance of pain to benefit is not working out enough to inflict it on everyone by default. I'm not suggesting it has no value, just not enough to be worth it.
I hear that, but it feels kind of empty because I haven't seen much discussion of that cost/benefit analysis (both of poisoning itself and of the change to the default behavior, which has its own costs and benefits).
I take it as uncontroversial that an important function of Mutexes is to ensure that invariants about data are maintained when the data is modified and that very bad things can happen when a program's data invariants are violated at runtime and the program doesn't notice. Maybe folks disagree about whether a program should always panic when invariants are violated at runtime (though there's certainly plenty of precedent in Rust itself for doing this, like with array bounds checking). Probably the bigger question mark is that panicking with a Mutex held doesn't necessarily mean an invariant is violated. But it does mean that the mechanism for ensuring the invariant has itself failed. I can see different choices about what to do here. For myself, the event itself is so rare and the impact of getting an invariant wrong so high that I absolutely do want to panic -- the false positive rate is just too small to matter.
At my last job, we only updated dependencies when there was a compelling reason. It was awful.
What would happen from time to time was that an important reason did come up, but the team was now many releases behind. Whoever was unlucky enough to sign up for the project that needed the updated dependency now had to do all those updates of the dependency, including figuring out how they affected a bunch of software that they weren't otherwise going to work on. (e.g., for one code path, I need a bugfix that was shipped three years ago, but pulling that into my component affects many other code paths.) They now had to go figure out what would break, figure out how to test it, etc. Besides being awful for them, it creates bad incentives (don't sign up for those projects; put in hacks to avoid having to do the update), and it's also just plain bad for the business because it means almost any project, however simple it seems, might wind up running into this pit.
I now think of it this way: either you're on the dependency's release train or you jump off. If you're on the train, you may as well stay pretty up to date. It doesn't need to be every release the minute it comes out, but nor should it be "I'll skip months of work and several major releases until something important comes out". So if you decline to update to a particular release, you've got to ask: am I jumping off forever, or am I just deferring work? If you think you're just deferring the decision until you know if there's a release worth updating to, you're really rolling the dice.
(edit: The above experience was in Node.js. Every change in a dynamically typed language introduces a lot of risk. I'm now on a team that uses Rust, where knowing that the program compiles and passes all tests gives us a lot of confidence in the update. So although there's a lot of noise with regular dependency updates, it's not actually that much work.)
I think it also depends on the community. Last time I touched Node.js and JavaScript-related things, every time I tried to update something, it practically guaranteed something would explode for no reason.
Meanwhile, my recent legacy Java project migration from JDK 8 to 21, plus a ton of dependency upgrades, has been a pretty smooth experience so far.
Yeah, along with any community's attitudes to risk and quality, there is also a varying, er, chronological component.
I'd prefer to upgrade around the time most of the nasty surprises have already been discovered by somebody else, preferably with workarounds developed.
At the same time, you don't want to be so far back that upgrading uncovers novel migration problems, or issues that nobody else cares about anymore.
Yeah, the JavaScript/Node.js ecosystem is pain. Lots of tooling (ORMs, queue/workflow frameworks, templating) is new-ish or quickly changing. I've also had minor updates cause breakages; semver is best-effort at best.
I don't like Java but sometimes I envy their ecosystem.
There is a reason most stable companies use Java: stability. Outside of startups and SV, there are few reasons to avoid such a robust system.
Plus, you can find an endless stream of experienced devs for it, who also tend to be more stable job-wise than those who come and go every 6-12 months. Stability. Top management barely cares about anything else from IT.
Yes, I’ve had exactly the same experience. Once you get off the dependency train, it’s almost impossible to get back on.
I don’t think this is specific to any one language or environment, it just gets more difficult the larger your project is and the longer you go without updating dependencies.
I’ve experienced this with NPM projects, with Android projects, and with C++ (neglecting to merge upstream changes from a private fork).
It does seem likely that dynamic languages make this problem worse, but I don’t think very strict statically typed languages completely avoid it.
> I'm now on a team that uses Rust, where knowing that the program compiles and passes all tests gives us a lot of confidence in the update.
That's been my experience as well. In addition, the ecosystem largely holds to semver, which means a non-major upgrade tends to be painless, and conversely, if there's a major upgrade, you know not to put it off for too long because it'll involve some degree of migration.
It's a similar experience in Go, especially because imports are done by URL and major versions higher than v1.x are forced to add a `/vN` suffix to the import path.
Although this is true, any large ecosystem will have some popular packages not holding to semver properly. Also, the biggest downside is when your `>=v1` dependency depends - usually indirectly - on a `v0` dependency, which is allowed to make breaking changes.
You can choose to either live at the slightly-bleeding edge (as determined by “stable” releases, etc), or to live on the edge of end-of-life, as discussed here: <https://news.ycombinator.com/item?id=21785399>
OP wisely said that critical vulnerabilities are where the actual exposure needs to be assessed, in order to justify an exception to a rule like “install the latest release of things that’s been published for X length of time.”
For instance if you use a package that provides a calendar widget and your app uses only the “western” calendar and there is a critical vulnerability that only manifests in the Islamic calendar, you have zero reason to worry about an exploit.
That's indeed reasonable, but the opposite can happen just as well: there is a vulnerability in the western calendar, but I'm on an old major.minor version that receives no security patches anymore. So now I have to upgrade that dependency, potentially triggering an avalanche of incompatibilities with other packages, leading to further upgrades and associated breakages. Oopsie.
My current employer publishes "staleness" metrics at the project level. It's imperfect because it weights all the dependencies the same, but it's better than nothing.
I wonder, are there tools to help you automate this? I.e. to assign a value to the staleness of each package instead of a simple "outdated" boolean, and also a weight to each package.
I’d strongly recommend the documentary “Behind the Curve”. The close look at people in something that’s not quite a cult gave me a visceral appreciation for what draws people to it (it provides acceptance for people who sorely lack it) and why it can be so hard to leave (one’s identity becomes so tied up in it).
I'm not sure if this is serious or not, but to take it at face value: the value of this sort of thing in Rust is not that it prevents crashes altogether but rather that it prevents _implicit_ failures. It forces a programmer to make the explicit choice of whether to crash.
There's lots of useful code where `unwrap()` makes sense. On my team, we first try to avoid it (and there are many patterns where you can do this). But when you can't, we leave a comment explaining why it's safe.
The language semantics do not surface `unwrap` usage or make any guarantees. It should be limited to use in `unsafe` blocks.
> There's lots of useful code where `unwrap()` makes sense. On my team, we first try to avoid it (and there are many patterns where you can do this). But when you can't, we leave a comment explaining why it's safe.
I would prefer the boilerplate of a match / if-else / if let, etc. to call attention to it, if you absolutely must explicitly panic. Or better: just return an error Result.
It doesn't matter how smart your engineers are. A bad unwrap can sneak in through refactors, changing business logic, changing preconditions, new data, etc.
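For what it's worth, here's a sketch of the kind of explicit handling I mean (names made up for illustration):

```rust
use std::num::ParseIntError;

// Instead of `raw.parse::<u16>().unwrap()`, the explicit form calls attention
// to the failure path and leaves a natural place to document the decision.
fn parse_port(raw: &str) -> Result<u16, ParseIntError> {
    match raw.parse::<u16>() {
        Ok(port) => Ok(port),
        // Or panic here, with a comment explaining why that's acceptable --
        // but returning the error keeps the choice in the caller's hands.
        Err(e) => Err(e),
    }
}

fn main() {
    assert_eq!(parse_port("8080"), Ok(8080));
    assert!(parse_port("not-a-port").is_err());
}
```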
Restricting unwrap to unsafe blocks adds negative value to the language. It won't prevent unwrap mistakes (people who play fast and loose with it today will just switch to "foo = unsafe { bar.unwrap() };" instead). And it'll muddy the purpose of unsafe by adding in a use that has nothing to do with memory safety. It's not a good idea.
That would be a fairly significant expansion of what `unsafe` means in Rust, to put it lightly. Not to mention that I think doing so would not really accomplish anything; marking unwrap() `unsafe` would not "surface `unwrap` usage" or "make any guarantees", as it's perfectly fine for safe functions to contain `unsafe` blocks with zero indication of that in the function signature.
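For example, this is perfectly legal today, and nothing in the signature hints at the `unsafe` inside (a standard illustration, not specific to unwrap):

```rust
// A safe function whose body uses `unsafe`; callers only ever see the safe signature.
fn first_byte(bytes: &[u8]) -> u8 {
    assert!(!bytes.is_empty()); // we uphold the precondition ourselves...
    // ...so this unchecked access is sound, but nothing forces callers to know about it.
    unsafe { *bytes.get_unchecked(0) }
}

fn main() {
    assert_eq!(first_byte(b"hi"), b'h');
}
```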
> fairly significant expansion of what `unsafe` means in Rust
I want an expansion of panic-free behavior. We'll never get all the way there due to allocations etc., but this is the class of error the language is intended to fix.
This turned into a null pointer, which is exactly what Rust is supposed to quench.
I'll go as far as saying I would like to statically guarantee none of my dependencies use the unwrap() methods. We should be able to design libraries that provably avoid panics to the greatest extent possible.
Sure, and I'd hardly be one to disagree that a first-party method to guarantee no panics would be nice, but marking unwrap() `unsafe` is definitely not an effective way to go about it.
> but this is the class of error the language is intended to fix.
Is it? I certainly don't see any memory safety problems here.
> This turned into a null pointer, which is exactly what Rust is supposed to quench.
There's some subtlety here - Rust is intended to eliminate UB due to null pointer dereferences. I don't think Rust was ever intended to eliminate panics. A panic may still be undesirable in some circumstances, but a panic is not the same thing as unrestricted UB.
> We should be able to design libraries that provably avoid panics to the greatest extent possible.
Yes, this would be nice indeed. But again, marking unwrap() `unsafe` is not an effective way to do so.
dtolnay's no_panic is the best we have right now IIRC, and there are some prover-style tools in an experimental stage that can accomplish something similar. I don't think either of those is polished enough for first-party adoption, though.
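Roughly, a usage sketch (assuming the `no_panic` crate as a dependency; it relies on optimizations removing the panic path, so it works best in release builds):

```rust
use no_panic::no_panic;

// If the optimizer can't eliminate every panic path from this function,
// the build fails with a linker error instead of shipping a hidden panic.
#[no_panic]
fn first_or_default(xs: &[u32]) -> u32 {
    // An `unwrap()` here would break the build; this version is provably panic-free.
    xs.first().copied().unwrap_or(0)
}

fn main() {
    assert_eq!(first_or_default(&[7, 8]), 7);
    assert_eq!(first_or_default(&[]), 0);
}
```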
I would say it can go further than that: Rust enables you to construct many APIs in a way that can’t be misused. It’s not at all unique in this way, but compared with C or Go or the like, you can encode so many more constraints in types.
> So how exactly did they bring about the impossible? They put an await call inside the critical section. The part of the code base that is not allowed to be subject to arbitrary delays. Massive facepalm.
I'm not sure where you got the impression that the example code was where we found the problem. That's a minimal reproducer trying to explain the problem from first principles because most people look at that code and think "that shouldn't deadlock". It uses a Mutex because people are familiar with Mutexes and `sleep` just to control the interleaving of execution. The RFD shows the problem in other examples without Mutexes. Here's a reproducer that futurelocks even though nobody uses `await` with the lock held:
https://play.rust-lang.org/?version=stable&mode=debug&editio...
> I was honestly wondering how you could possibly cause this in any sane code base.
The actual issue is linked at the very top of the RFD. In our case, we had a bounded mpsc channel used to send messages to an actor running in a separate task. That actor was working fine. But the channel did become briefly saturated (i.e., at capacity) at a point where someone tried to send on it via a `tokio::select!` similar to the one in the example.
Don't do ... what, exactly? The RFD answers this more precisely and provides suggestions for alternatives. But it's not very simple because the things that can cause this are all common patterns individually and it's only the confluence (which can be spread across layers of the program) that introduces this problem. In our case, it wasn't a Mutex, but an mpsc channel (that was working correctly! it just got very briefly saturated) and it was 3-4 modules lower in the stack than the code with the `tokio::select!` that induced this.
> It's more typical in my experience that the act of granting the lock to a thread is what makes it runnable, and it runs right then.
This gets at why this felt like a big deal when we ran into this. This is how it would work with threads. Tasks and futures hook into our intuitive understanding of how things work with threads. (And for tasks, that's probably still a fair mental model, as far as I know.) But futures within a task are different because of the inversion of control: tasks must poll them for them to keep running. The problem here is that the task that's responsible for polling this future has essentially forgotten about it. The analogous thing with threads would seem to be something like if the kernel forgot to enqueue some runnable thread on a run queue.
So async Rust introduces an entire novel class of subtle concurrent programming errors? Ugh, that's awful.
> The analogous thing with threads would seem to be something like if the kernel forgot to enqueue some runnable thread on a run queue.
Yes. But I've never written code in a preemptible protected mode environment like Linux userspace where it is possible to make that mistake. That's nuts to me.
From my POV this seems like a fundamental design flaw in async rust. Like, on a bare metal thing I expect to deal with stuff like this... but code running on a real OS shouldn't have to.
To keep it in perspective, though: we've been operating a pretty good size system that's heavy on async Rust for a few years now and this is the first we've seen this problem. Hitting it requires a bunch of things (programming patterns and runtime behavior) to come together. It's really unfortunate that there aren't guard rails here, but it's not like people are hitting this all over the place.
The thing is that the alternatives all have tradeoffs, too. With threaded systems, there's no distinction in code between stuff that's quick vs. stuff that can block, and that makes it easy to accidentally do time-consuming (blocking) work in contexts that don't expect it (e.g., a lock held). With channels / message passing / actors, having the receiver/actor go off and do something expensive is just as bad as doing something expensive with a lock held. There are environments that take this to the extreme where you can't even really block or do expensive things as an actor, but there the hidden problem is often queueing and backpressure (or lack thereof). There's just no free lunch.
I'd certainly think carefully in choosing between sync vs. async Rust. But we've had a lot fewer issues with both of these than I've had in my past experience working on threaded systems in C and Java and event-oriented systems in C and Node.js.
I think you misunderstand the problem. The only purpose of the sleep in this example is to control interleaving of execution to ensure the problem happens. Here's a version where the background task (the initial lock holder) only runs a bounded number of instructions with the lock held, just as you suggest:
> After that process has finished, you release the lock. Then you return to the scheduler and execute the next future. The next future cannot be blocked because the lock has already been released. It's simply impossible.
This is true with threads and with tasks that only ever poll futures sequentially. It is not true in the various cases mentioned in this RFD (notably `tokio::select!`, but also others). Intuitively: when you have one task polling on multiple futures concurrently, you're essentially adding another layer to the scheduler (kernel thread scheduler, tokio task scheduler, now some task is acting as its own future scheduler). The problem is it's surprisingly easy to (1) not realize that and (2) accidentally have that "scheduler" not poll the next runnable future and then get stuck, just like if the kernel scheduler didn't wake up a runnable thread.
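To make that concrete, here's a stripped-down sketch of the shape of the failure (illustrative names and timings; this is not the actual code from the RFD): when the background task releases the lock, it's handed to `fut1`, but nothing ever polls `fut1` again, so everything queued behind it waits forever.

```rust
use std::sync::Arc;
use tokio::sync::Mutex;
use tokio::time::{sleep, Duration};

async fn do_stuff(lock: Arc<Mutex<()>>) {
    let _guard = lock.lock().await;
    // bounded work with the lock held; no awaits here
}

#[tokio::main]
async fn main() {
    let lock = Arc::new(Mutex::new(()));

    // A background task holds the lock briefly so that `fut1` has to queue behind it.
    let bg = Arc::clone(&lock);
    tokio::spawn(async move {
        let _guard = bg.lock().await;
        sleep(Duration::from_millis(100)).await;
    });
    sleep(Duration::from_millis(10)).await;

    let mut fut1 = Box::pin(do_stuff(Arc::clone(&lock)));
    let mut fut2 = Box::pin(sleep(Duration::from_millis(20)));

    tokio::select! {
        _ = &mut fut1 => {}
        _ = &mut fut2 => {
            // This branch wins while `fut1` is still parked in the mutex's wait
            // queue. `fut1` stays alive (we only lent it to select! by reference),
            // but this task never polls it again -- so when the background task
            // releases the lock, it's handed to `fut1`, which never actually runs
            // to acquire or release it. This await therefore hangs forever.
            let _guard = lock.lock().await;
        }
    }
}
```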