[OLUG] Isolating flaky hardware problems

Thu Feb 10 03:11:48 UTC 2000

Dave Burchell wrote:
> 
> I've got some hardware that may be flaky, and I need some advice on
> narrowing down the problem.
> 
> Long story short, how do I isolate possible intermittent CPU or RAM
> failures?
> 
> Here's why I'm asking.
> 
> One of my users has an NT box.  It died one day, and I decided its SCSI
> card might be bad because it would try to boot from the SCSI disk but
> wouldn't make it past a certain point in the NT boot sequence (where
> I think it was trying to initialize the SCSI devices).  I got the BSOD
> each time.
> 
> To test NT boxes I like to load up Linux.  Booting Linux (Debian 2.1
> rescue floppy) went ok at first, but where it should have listed all
> of the SCSI devices it hung.  It found the SCSI card but could not list
> the devices on the SCSI bus.
> 
> Replacing the SCSI card with another allowed the machine to boot the
> rescue floppy and install Linux (on a Jaz disk because the SCSI HD was
> (still is) full of NTFS partitions).  I built a 2.2.14 kernel with NTFS
> support, booted it, and backed up the NTFS partitions across the LAN to
> a tape drive using ssh.
> 
> I should say I _tried_ to back it up, but before the backup was done
> the machine hung with the error message
> 
> Kernel panic: VFS: LRU block list corrupted
> 
> I brought it up again, this time running a Perl script to figure sines
> and thus stress the CPU some.  It hung again after a few hours, this
> time with no error message.
> 
> Finally last night it did the backup without crashing.
> 
> So, what's the problem?  I suspect, based on some traffic on the Linux
> kernel mailing list, that it could be a SIMM or a CPU going out, perhaps
> in part due to overheating.  What's everyone think of this theory?
> 
> Let's say we do think it may be RAM.  Can I boot Linux with a "mem=32M"
> option to limit the memory that is used by Linux?  Or would I be better
> off removing half the SIMMs and narrowing it down that way?
> 
> Is there any good software to run if I want to stress components of the
> system to try to induce a failure?
> 
> How could I do a repetitive read/write test of all the RAM on the system?
> Could I get the memory address of a bad piece of RAM?
> 
> It seems to me I remember doing this with a Unixware '486 years back
> but I don't remember the utils and I don't think they were running
> under Unixware.
> 
> The machine in question here is an all-SCSI (disabled on-board IDE) 200
> MHz Pentium with 128 MB RAM and video, sound, ethernet, and modem
> cards.
> 
> (If I'm missing something obvious here I'm sure I'll be notified,
> right? :-)
> 
> --

Well, that wasn't a very good attempt at making a long story short :)

Given the choice, you're probably right to suspect the RAM rather than
the CPU.  If it were the CPU, I doubt the machine would have been able
to give you an error message at all.

Considering it's a 200 MHz system, I doubt all 128 MB of RAM is original
and matching.  I would take a look and see whether the SIMMs in each
bank are identical.  I mean identical, too, not just the same speed and
capacity.  Mismatched pairs can cause some weird problems.  Also make
sure every bank holds the same type of memory.  You can also get some
issues if one bank is ECC or parity and another isn't.  Once you've
eliminated those possibilities, I would try to locate the bad bank, and
then the bad SIMM, by elimination: pull one bank at a time and rerun
whatever made it crash.
