[OLUG] Isolating flaky hardware problems

Thu Feb 10 19:26:19 UTC 2000

A heat issue would be a good guess, but I wouldn't discount RAM. I've seen a
lot of RAM act strangely, even fail intermittently.
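
For the repetitive read/write test asked about further down, here is a rough user-space sketch in Python. It is only an illustration: the names `pattern_test` and `find_mismatches` are made up for this example, and a process like this can only exercise whatever memory the kernel hands it, so it cannot cover all of the RAM. The offsets it reports are positions inside its own buffer, not physical addresses, so it will not point at a particular SIMM.

```python
def find_mismatches(buf, pattern):
    """Return (offset, actual_value) for every byte that differs from pattern."""
    return [(offset, value) for offset, value in enumerate(buf) if value != pattern]


def pattern_test(size_bytes, passes=1, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern over a buffer, read it back, and collect mismatches.

    Returns a list of (pass_number, offset, expected, actual) tuples.
    Offsets are relative to this process's buffer, not physical RAM.
    """
    buf = bytearray(size_bytes)
    errors = []
    for pass_number in range(passes):
        for pattern in patterns:
            # Write phase: fill the whole buffer with the pattern byte.
            buf[:] = bytes([pattern]) * size_bytes
            # Read phase: a quick count first, then a slow scan only on failure.
            if buf.count(pattern) != size_bytes:
                for offset, actual in find_mismatches(buf, pattern):
                    errors.append((pass_number, offset, pattern, actual))
    return errors


if __name__ == "__main__":
    # 16 MiB and 2 passes are arbitrary example values.
    bad = pattern_test(16 * 1024 * 1024, passes=2)
    print("mismatches found:", len(bad))
```

Running it in a loop for a few hours while the case is closed up would at least show whether errors appear only once things get warm.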

"Rogers, John C NWD02" wrote:

> I vote for the CPU overheating.  We have seen similar problems here when the
> fan stops spinning or the fan is clogged with dust and barely moves.  This
> was noticeable on cases that fit tight and do not pull air off the CPU
> directly.  I have never seen RAM work and then stop working after some
> time.  It either worked or it did not, and when it did not it always gave
> parity errors or something bad.  The CPUs have always worked at first,
> then after a few hours bad things started happening.  Usually with a
> winblows box the registry starts dying and then the machine blows up.
> Shut it off and let it cool and all was well for a while.  Then the
> problems happened again after things heated up.
>
> Just my thoughts,
> John
>
> -----Original Message-----
> From: Vincent [mailto:vraffensberger at home.com]
> Sent: Wednesday, February 09, 2000 9:12 PM
> To: olug at bstc.net
> Subject: Re: [OLUG] Isolating flaky hardware problems
>
> Dave Burchell wrote:
> >
> > I've got some hardware that may be flaky, and I need some advice on
> > narrowing down the problem.
> >
> > Long story short, how do I isolate possible CPU or RAM intermittent
> > failures?
> >
> > Here's why I'm asking.
> >
> > One of my users has an NT box.  It died one day, and I decided its SCSI
> > card might be bad because it would try to boot from the SCSI disk but
> > wouldn't make it past a certain point in the NT boot sequence (where
> > I think it was trying to initialize the SCSI devices).  I got the BSOD
> > each time.
> >
> > To test NT boxes I like to load up Linux.  Booting Linux (Debian 2.1
> > rescue floppy) went ok at first, but where it should have listed all
> > of the SCSI devices it hung.  It found the SCSI card but could not list
> > the devices on the SCSI bus.
> >
> > Replacing the SCSI card with another allowed the machine to boot the
> > rescue floppy and install Linux (on a Jaz disk because the SCSI HD was
> > (still is) full of NTFS partitions).  I built a 2.2.14 kernel with NTFS
> > support, booted it, and backed up the NTFS partitions across the LAN to
> > a tape drive using ssh.
> >
> > I should say I _tried_ to back it up, but before the backup was done
> > the machine hung with the error message
> >
> > Kernel panic: VFS: LRU block list corrupted
> >
> > I brought it up again, this time running a Perl script to figure sines
> > and thus stress the CPU some.  It hung again after a few hours, this
> > time with no error message.
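
The Perl script itself was not posted. As a rough stand-in, here is a Python sketch of the same idea: compute a long run of sines over and over, and compare every round against a reference result so that a silent miscalculation gets noticed instead of just waiting for a hang. The function names and the chunk size are made up for this example.

```python
import math
import time


def sine_checksum(n):
    """Sum sin(i * 0.001) for i in 0..n-1; deterministic on a healthy CPU."""
    total = 0.0
    for i in range(n):
        total += math.sin(i * 0.001)
    return total


def stress(seconds, chunk=100_000):
    """Repeat the sine sum until the time is up.

    Returns (rounds_completed, ok). ok is False as soon as one round
    disagrees with the reference value computed at the start.
    """
    reference = sine_checksum(chunk)
    rounds = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        if sine_checksum(chunk) != reference:
            return rounds, False
        rounds += 1
    return rounds, True


if __name__ == "__main__":
    # 60 seconds is an arbitrary example; a real soak test would run for hours.
    rounds, ok = stress(60)
    print("rounds:", rounds, "consistent:", ok)
```

A mismatch here would point toward the CPU (or the RAM holding the intermediate values) rather than the disk or the SCSI chain.
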
> >
> > Finally last night it did the backup without crashing.
> >
> > So, what's the problem?  I suspect, based on some traffic on the Linux
> > kernel mailing list, that it could be a SIMM or a CPU going out, perhaps
> > in part due to overheating.  What's everyone think of this theory?
> >
> > Let's say we do think it may be RAM.  Can I boot Linux with a "mem=32M"
> > option to limit the memory that is used by Linux?  Or would I be better
> > off removing half the SIMMs and narrowing it down that way?
> >
> > Is there any good software to run if I want to stress components of the
> > system to try to induce a failure?
> >
> > How could I do a repetitive read/write test of all the RAM on the system?
> > Could I get the memory address of a bad piece of RAM?
> >
> > It seems to me I remember doing this with a Unixware '486 years back
> > but I don't remember the utils and I don't think they were running
> > under Unixware.
> >
> > The machine in question here is an all-SCSI (disabled on-board IDE) 200
> > MHz Pentium with 128 MB RAM and video, sound, ethernet, and modem
> > cards.
> >
> > (If I'm missing something obvious here I'm sure I'll be notified,
> > right? :-)
> >
> > --
>
> Well, that wasn't a very good attempt at making a long story short :)
>
> Given the choice, you're probably right to assume it's RAM and not the
> CPU.  If it were the CPU, I doubt it would have been able to give you an
> error message at all.
> Considering it's a 200 MHz system, I doubt all 128 MB of RAM is original
> and matching.  I would take a look at it and see if the SIMM pairs in
> each bank are identical.  I mean truly identical, not just in speed and
> capacity.  Mismatched pairs can cause some weird problems.  Also make
> sure that all banks are the same type.  You can also get some issues if
> one bank is ECC or parity and another isn't.  Once you've eliminated
> those possibilities, I would try locating the bad bank, and then the bad
> SIMM, by deduction.
>
> -------------------------------------------------------------------------
> Sent by OLUG Mailing list Manager, run by ezmlm.  http://olug.bstc.net/
> To unsubscribe: `echo unsubscribe | mail olug-unsubscribe at bstc.net`

--
God, root, what is difference? - Pitr
