
Re: how to make Debian less fragile (long and philosophical)



On Mon, Aug 16, 1999 at 10:45:17PM -0400, Michael Stone wrote:

> Most of the stuff in /sbin only relies on a couple of libraries. If
> those couple of libraries get nailed, it's very likely that at least one
> of the static binaries you need will also get nailed.

This is misleading. The package manager relies on things like "bash",
which is incredibly complex and routinely updated. It also relies
on things like "libc", which is incredibly complex and routinely
updated.

Like it or not, reliance on any dynamically linked library massively
increases the failure points for a program.
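
The point is easy to see on any running system. ldd lists every file
that must be present and intact before a dynamically linked program
will even start (the exact paths vary from system to system; these
are just the obvious candidates):

    # Everything ldd prints is a file that must exist and be healthy
    # before the program will even run. A static binary needs none of
    # them--only itself.
    ldd /sbin/fsck /bin/mount /usr/bin/dpkg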

And no I don't think it's likely that the static binaries also got 
nailed. I think it's possible, but I've seen the opposite happen 
on numerous systems over the many years that I've administered 
many different kinds of Unix boxes. 

Possible reasons why your libraries are hosed, in order of probability:

   #1 -- a careless sysadmin rm -rf *'s the wrong thing and wipes 
         out something critical. happens more often than we like
         to think. you might wipe out the static binaries, you might
         wipe out the libraries, but the odds that you'll wipe out
         both are fairly low

   #2 -- a careless sysadmin is playing around with the libraries in
         a dangerous way; doing something they shouldn't in order to
         get something working faster. they destroy their libraries,
         and yes it's their own damn fault. but they still need to 
         fix the machine and the OS shouldn't hang them out to dry. 
         more insightfully: the OS shouldn't hang the senior sysadmin
         out to dry because of the junior sysadmin's goof.

   #3 -- a careless sysadmin misuses the package manager in a way
         that causes it to destroy something important

   #4 -- there is (gasp) a bug in the stable release of apt-get or
         dpkg and the libraries go down. it's happened a few times
         in unstable--sure, good testing makes it less likely in
         stable, but not impossible. or are you claiming that all
         Debian software is 100% bug free?

Now yes it's true that all of the above errors could have nailed 
the static binaries too--but my own two eyes have seen the dynamic
libraries fail far more often than the entire disk gets wiped. 

The basic principle here is that dynamic linking is actually a fairly
complex process, and it's therefore easier to disrupt than the
process of loading and executing a static binary. Much smaller
failures can have a much more devastating effect on dynamically
linked programs than on statically linked ones. That's the theory;
in practice, for whatever reason, it happens a lot (meaning I see it
a couple of times every couple of years--and that is a LOT when your
systems are important).

And here is #5, which I have separated because it requires more 
explanation than the above four reasons:

   #5 -- a hardware error occurs and it corrupts a few files. you
         don't know how extensive the problem is, but libc is
         at least one of the files that's been hosed

Your response to this was "well then what do you trust?" and you 
sort of assume that it's OK to give up on the system at this point,
throw in the towel, insert the boot floppy, and scrub the whole
damn thing. 

Well guess what? Sometimes something really bloody important is
running on a machine, and if you can get the thing limping along
for another couple of hours your company and your job are safe,
and if you can't it's a big, big problem.

One of the early experiences I had with Linux that convinced me that
eventually it would be worthy of performing important services 
was when the IDE cable on my root drive fell out, and I was able to
keep the machine alive and functional for a full half hour after
that. Long enough to finish several important tasks, save a 
lot of critical data on non-affected drives, and calmly close
down my shells that were running jobs on remote machines.

By the time the eventual kernel panic hit, everything of value had 
been copied off the machine. 

So no, just because you have experienced catastrophic failure,
you don't have to throw in the towel and give up. You can fight
on and frequently win some major victories before you finally
power-down the machine and address the hardware fault.

That was a slackware box (Debian didn't exist back then). If you 
are telling me that the design of Debian should be such that I
would have been screwed in the above scenario, then I have to 
wonder if Debian is worth putting on a machine that is running 
anything really important to a company.


> > Debian has callously thrown away 30 years of hard won knowledge here, 
> > because for some reason people believe the intricate dependency manager
> > is a replacement for common sense.
> 
> No. Using dynamic libs was a decision made after weighing the advantages
> and disadvantages of static linking. It wasn't done on a whim. It wasn't
> done by a pack of fools with no common sense. You know, the more I
> reread that paragraph, the more insulting it becomes.

What is the advantage of a statically linked restore? Think about 
what restore does before composing your answer.


> Dynamic linking provides these benefits:
> 1) Saves space.
> 2) Removes the need for seperate rescue and run-time binaries.
> 3) Easier to update critical binaries in the event that a library flaw
> is discovered. (E.g., security problem or correctness issue.)

1) is not an issue since we are talking about a very small number
of programs. 2) none of the binaries involved are performance 
critical, so why bother having dynamic versions? 3) a good make
script will solve this problem.
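
To be concrete about 3), here is the sort of rebuild script I have in
mind--a minimal sketch only, and the directory, the tool list, and the
assumption that each package honours LDFLAGS are all illustrative,
not Debian's actual layout:

    #!/bin/sh
    # Rebuild the small set of static rescue binaries whenever libc
    # (or one of the tools themselves) picks up a security or bug fix.
    set -e

    STATIC_DIR=/sbin/static        # hypothetical home for the statics
    TOOLS="mount fsck mkfs fdisk ifconfig route"

    mkdir -p "$STATIC_DIR"
    for tool in $TOOLS; do
        srcdir="/usr/src/$tool"    # wherever the source is unpacked
        ( cd "$srcdir" &&
          make clean &&
          make LDFLAGS=-static &&
          install -m 755 "$tool" "$STATIC_DIR/$tool" )
    done

Run it once when the tools are first packaged, and again on the rare
occasion that the library underneath them changes.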

These seem like small potatoes compared to having a reliable system.

> 
> and has this flaw:
> 4) Library damage can prevent binaries from working.
> 
> Point 2 is a little harder.
> This is a volunteer effort. It's hard enough sometimes for people to
> maintain their packages without maintaining an extra set of them. You
> could put the static versions in /sbin instead of breaking them off in a
> seperate dir, but then you waste RAM instead of hd space. (Dynamic libs
> are loaded once, static libs are loaded for each binary.) It's a
> tradeoff, just like everything else in a distribution.

At the very least, you cannot convince me that there is a RAM problem
relating to the use of mkfs, fsck, fdisk, apt-get, dpkg, route,
ifconfig, etc. When was the last time you ran an fdisk server
where hundreds of users ran thousands of copies of fdisk?

Even the commonly used binaries in /bin are unlikely to cause any 
severe performance problem. Most users will have a dynamic bash as 
their shell, and just about everything commonly used is built into
it. The rest (ls, dd, cat, etc.) are used so infrequently and are
so small that you would have to be running a shell server with 
thousands of users before it would become an issue. In that case
your system would look nothing like a standard system, and you 
would have already adopted the practice of installing specialized
software all over the place--compiling a dynamic version of ls 
and putting it in /usr/bin would not be a big hardship.


>  Point 3 is more
> thought-provoking. If you statically link everything then any libc
> update means updating 30 packages by 30 different maintainers rather
> than a updating a single libc package. 

This seems like a complete non-issue, as dpkg is perfectly capable
of installing 30 packages simultaneously. In fact, the truth seems
to be the contrary:

Every time you change your C library you risk creating potentially 
dangerous security bugs in all of these static binaries. That means
you have to do much more frequent security auditing, which on a 
volunteer project is unlikely to happen. 

With simple static versions of these tools, they only have to be 
recompiled once in a blue moon--when someone notices something funny
in their particular implementation, or more rarely, in the stable
C library they've been compiled against. 

You would likely want to compile them against libc5 for size and 
stability reasons.

You've presented a lot of valid engineering considerations, but
nothing that can't be addressed with a little thought, and in
a way that is more fundamentally reliable than the dynamic alternative.

> Against the pros of dynamic linking, you have a single con: a damaged
> library can prevent programs from loading. But as I said earlier: what
> are the odds that a massive failure will affect only libraries? If even
> one of your static binaries is destroyed, you're in the same place that
> you were with a broken library. (E.g., a disk problem or a bad case of
> the dumbs wiped out /dev and /lib. You've got static bins, but mknod
> also got wiped out. Bummer.)

But I might be able to nfs mount a drive on another machine that has it. 
I have many options when I have several working commands. I have no 
options when I have no working commands.
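
For example (the hostname and paths are made up for illustration),
with a working shell and a working mount I can often borrow the
missing pieces from a healthy box running the same release:

    # Mount the healthy machine's root filesystem somewhere handy.
    mkdir /tmp/rescue
    mount -t nfs goodhost:/ /tmp/rescue

    # Copy back just what was destroyed--say, a wiped-out mknod.
    cp /tmp/rescue/sbin/mknod /sbin/mknod
    chmod 755 /sbin/mknod

None of that is possible if every last command on the box needs a
library that no longer exists.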


> For static bins to be useful you need a
> particular combination of disaster and luck.  Optimizing for that
> combination is like writing an algorithm for best-case performance: I
> can't say that it never helps, but it buys you nothing when you really
> need it. If a particular machine has to be highly available, it needs
> more than static binaries. 

You're talking to someone who knows how to list files when "ls" is
broken ("echo *" is remarkably effective). It's been my unfortunate
task to recover numerous broken Unix systems, and I have to say that
dynamic libraries seem to break a lot more often than you seem to
think they do.
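
A few more tricks of the same sort, all plain Bourne-shell builtins
and nothing Debian-specific, for anyone who hasn't had the pleasure:

    # List files when ls is broken:
    echo *

    # Read a file when cat is broken:
    while IFS= read -r line; do echo "$line"; done < /etc/fstab

    # Crude copy of a text file when cp is broken:
    while IFS= read -r line; do echo "$line"; done </etc/fstab >/tmp/copy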

Admittedly, we're talking about events with very low probabilities
of happening. It's not like I see it every day, or even every year.
But in my experience, and in the experience of the many admins that
I've known over the years, dynamic libraries are just not that 
reliable a technology. They're fundamentally complex, and easily
broken, and I've seen them go down. 

When my Debian system (OK, it was unstable, but it illustrates the
point here) went down in flames because its linker broke, I thought
to myself: "OK, sigh... now where does Debian put the statics?" and
I was utterly shocked to discover that it DOES NOT.

Worse, upon closer inspection I noticed that the flaw that brought
me down was a result of dynamic linking. 

This just confirmed my many years of previous experience, demonstrating
to me what happens when a system does not deploy the tools and 
methods I've come to expect.

> What do I think a machine needs to be reliable? Here's a couple for
> starters:
> 1) A backup root disk. It's easy, it's cheap in today's market, and it
> buys me protection from both hardware and human error. (Even static
> binaries quail in the face of rm.)
> 2) A serial console. If some disaster is pulling the rug out from under
> my server, I don't want to rely on its network interface. There's too
> much that can go wrong, and I can't boot over it.
> 3) A failover machine. Sometimes things really do break.

All three of your points assume that it is OK to reboot. Often it is 
not. Often rebooting is to be avoided at all costs. 

I know of a machine that was tracking satellite data, and rebooting 
was absolutely unacceptable. It might lose track of the satellite, 
and cost millions of dollars in re-orientation. It was only one of
many redundant machines, but the thought of losing a little bit 
of that redundancy for even a brief while gave everyone heartburn.

Secondly, even on much more ordinary machines, the cost of fixing a
problem with a reboot and a reinstall is usually much more, just in
terms of wages, than the cost of the failure if you can recover by
copying a few files over in a few minutes (from another machine on
the network with the same OS and version, for example).

Thirdly, uptime is an absolute point of pride for any respectable
Unix administrator, and the idea that you would lose your 274 days
of uptime because you can't fix a problem without a reboot is
something that should just stick in your craw. Sure, now I've gone
into the realm of coolness, and its practicality can be questioned
in many cases--but dammit, if you cost me my 274 days of continuous
uptime you cost me my bragging rights.

> I know that 3) can be prohibitive. But 1 & 2 don't really cost that much
> if reliability is important. (And if it's not, then why worry about
> static binaries at all?) There is no combination of static binaries that
> can give me the reliability of booting off a known good drive.

Assuming you are physically there to swap in the known good drive
(or insert the boot floppy), and that rebooting is acceptable. That
strikes you out for several important applications for which Unix
servers are commonly deployed. A lot of people drop their servers
in hosting locations and attempt to remotely administer them.

If you screw up your machine you frequently have to pay a fee to have
a tape monkey stick the backup tape in your drive and re-install, or 
pay for a plane ticket to the location where your server is hosted.

>  That's
> the reasoning that really underlies the choice of dynamic libs--there is
> no benefit to pinning anyone's hopes on a false promise of reliability,
> even if there weren't some drawbacks inherent in static linking.

Obviously there are no promises. There are only probabilities, and I 
think the probabilities favour static binaries by a wide margin.

> That's exactly right. That's why you don't screw around with a
> production system unless you have a way out. And that's why you don't
> run a production system unless you have a way to compensate for
> catastrophic failure. And that has nothing to do with static binaries.

So your opinion is that if the failure is my fault, I should be hung
out to dry. What if the failure is my assistant's fault, because today
was their day to learn why "rm -rf *" and root don't mix? Should I
still be hung out to dry?

Justin


