
The interesting part is that it occurred to no one to just reboot and see if that fixed it. Apparently systems were a lot more stable back then.


Calling IT is the right move here. Could have been an intruder or a remote user doing something important.

It's a different relationship. The department workstations were treated much more like the refrigerator or the copy machine: if it's broken, you don't touch it, you just call somebody.

In modern money these machines were between $7,000-$10,000 each depending on configuration.

To put it in context, pretend you work somewhere where they provide you with a half-height rack and a Precision 7960 https://www.dell.com/en-us/shop/workstations-isv-certified/p...

And it starts acting up. What do you do?

(As an aside, I've always wondered how many maxed out configuration orders they get - you know, when you kick that price up to $100k - what's the threshold where they ask if they could put someone on a plane to visit you? 10 of them?)


I'd guess orders for these probably skew to the higher end.

If you're putting workstations in racks, it's either to share them or for power/cooling/noise reasons, and the fact that you've got a workload that justifies those kinds of problems probably means all your other costs will still dwarf the hardware.

There's usually a large premium on whatever the current largest-size DIMMs and SSDs are, and on the top 10% or so of CPUs and video cards. So I expect they sell a lot of machines that are "50%" size (either max physical capacity with 50%-size components, or half the physical capacity with 90-100% components), and a fair number of maxed-out ones, just because it will often be cheaper to have one maxed-out machine than three smaller ones, and budgets don't matter except when they do.

Places that cost engineering time at $100k/hour won't blink at $100k computers if it gets the job done.


I'd probably reboot it every day (or every week) as a matter of course. Not just when there are problems.

Just so that I know there are no surprises waiting when I need to reboot it in an emergency.
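
To make "regularly" concrete, here's a minimal sketch of what that could look like, assuming Linux, root privileges, and a hypothetical once-a-week policy (none of which come from the comment above); it could be run from a daily cron job:

    # minimal sketch: reboot once uptime exceeds a week (assumed policy)
    import subprocess

    MAX_UPTIME_SECONDS = 7 * 86400  # hypothetical threshold: one week

    # /proc/uptime is Linux-specific; the first field is uptime in seconds
    with open("/proc/uptime") as f:
        uptime_seconds = float(f.read().split()[0])

    if uptime_seconds > MAX_UPTIME_SECONDS:
        # shutdown -r +5 schedules an orderly reboot with a five-minute warning
        subprocess.run(["shutdown", "-r", "+5", "routine maintenance reboot"],
                       check=True)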


Somewhat against UNIX culture. There used to be pride in very long system uptimes. The modern security arms race has basically killed that.


I don't think long uptimes are Unix culture at all. Unix was always about being small, simple, and fun to use; a place where having something now is much more valuable than being correct later. A hacker's OS. This is also where most of the sins of Unix come from.

"We went to lunch afterward, and I remarked to Dennis that easily half the code I was writing in Multics was error recovery code. He said, "We left all that stuff out. If there's an error, we have this routine called panic(), and when it is called, the machine crashes, and you holler down the hall, 'Hey, reboot it.'"

https://multicians.org/unix.html


Not sure it's just security. I wonder if it's also that people don't host important services on non-redundant machines as much anymore.


That sounds needlessly disruptive. It is a workstation after all. I restart mine as little as I'm allowed to, and once a month already sounds like way too much.


What is the negative?


At least on older hardware, the number of reboots correlated more strongly with failure than hours of operation did.

I'll readily admit this may be apocryphal. It was a common adage when I was a child in the 80s, and even now that I'm actually qualified to evaluate such a claim, I've never cracked open the historical literature at archive.org to check.

It could just be a carryover from incandescent light bulbs (where this is true) and older cars (where this is also true). It's understandable that non-technical people would assume the magical computer dust has the same problem.


One of the highest stresses on passives and power components occurs when there is an inrush of current (di/dt) or a voltage spike (dv/dt), which can happen on power cycling or plug-in. So it is not a myth that hard reboots can be stressful on older hardware, but there is a certain amount of red herring here, because power-on is also the moment when an aged or degraded component is most likely to show a failure it accumulated over its hours of service.

Modern devices and standards implement inrush limiting, reverse-voltage protection, transient suppression, ripple filtering and the like at very low cost; with stuff like USB it's essentially built in. So failures are more isolated, and nowadays it's mostly going to be the power supplies.
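
To put rough numbers on the inrush point, here's a back-of-envelope sketch with idealized values I'm assuming purely for illustration (not a real supply design):

    # first-cycle inrush into a discharged bulk capacitor, with and without an NTC limiter
    V_PEAK = 170.0       # volts, peak of ~120 V RMS mains (assumed)
    R_SERIES = 0.5       # ohms, capacitor ESR plus wiring (assumed)
    R_NTC_COLD = 10.0    # ohms, cold resistance of a typical NTC inrush limiter (assumed)

    i_unlimited = V_PEAK / R_SERIES                 # ~340 A for a fraction of a millisecond
    i_limited = V_PEAK / (R_SERIES + R_NTC_COLD)    # ~16 A with the limiter in place

    print(f"without limiter: {i_unlimited:.0f} A, with NTC limiter: {i_limited:.0f} A")

Even a crude limiter knocks the first-cycle stress down by more than an order of magnitude, which is part of why modern supplies tolerate power cycling so much better.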


Power bricks are certainly a huge failure point. Luckily they're cheap. It's a great place to fail, if the failure has to land somewhere.


You have to close all windows (and possibly tabs in your editor), restart long-running jobs you have in the background, restart your SSH sessions, lose your undo tree, lose all the variables you have loaded in your interpreter or bash shell, among other things I have probably forgotten.

All recoverable, but annoying. I can't imagine doing that every day. It's fine for a home computer, but for a workstation, I just want it always on. Though these days even my personal laptop is essentially always on.


For a personal machine, it's fine to leave it always on.

> All recoverable, but annoying.

For a machine that other people are supposed to rely on, I'd rather exercise the recovery you're talking about regularly, so I know it works when I need it.

For a production system, I'd rather live through its first day of uptime 10,000 days in a row than set new uptime records every day. In production, you don't want to do anything for the first time if you can avoid it.


For production it's highly dependent on the business needs. But restarting the entire estate every day is, at the very least, a big enough hit to capacity that it may already be prohibitive without any further consideration.

Not to mention that it would require every service to be prepared to restart daily, which could mean a more complex system than you'd need otherwise.


I'd want to restart individual components often, probably not the whole system at once.

Basically, whatever e.g. Google is doing.


I doubt Google restarts all their machines once a day. Obviously not all at once, otherwise they'd have massive downtime. But anyway, Google's needs are very different from those of just about any other company on earth (except for a handful of others like Facebook and Amazon), so they're usually not the best example.


Yes, once a day was an example. Google uses different frequencies. However I do remember getting automated nagging messages at Google when a service hadn't been redeployed (and thus restarted) for more than two weeks.

Google as a whole might be different from other companies, yes. But smaller departments in Google aren't that different from other companies.


Getting those messages is very different from (and a lot more reasonable than) forcefully restarting the services, which was the initial suggestion.

The restart risk is normally so small that several other things are more important than constantly restarting just to test that restarts work. Continuous delivery, security patching, and hardware failures will likely cause enough restarts anyway.


Downtime.


A hard reboot on those old Sun systems usually meant a dirty filesystem and a telling-off from the admin, as it needed an fsck.


No matter how many times I see it, I always read fsck as "(for) fuck's sake" and then internally correct it to "file system check." I think I've got a stressful flashbulb memory floating around in there.



