[OLUG] Isolating flaky hardware problems

Dave Burchell burchell at inetnebr.com
Thu Feb 10 05:23:56 UTC 2000


Vincent says:

> Dave Burchell wrote:
> > 
> > I've got some hardware that may be flaky, and I need some advice on
> > narrowing down the problem.
> > 
> > Long story short, how do I isolate possible CPU or RAM intermittent
> > failures?

> Well, that wasn't a very good attempt at making a long story short :)

Doh! I meant to say that the above sentence was long-story-short, and
the long story was everything else.  (Now my _posts_ are flaky...)

> Given the choice, you're probably right to assume it's RAM and not the
> CPU.  If it were the CPU, I doubt it would have been able to give you an
> error message at all.
> Considering it's a 200MHz system, I doubt all 128MB of RAM is original and
> matching.  I would take a look at it and see if the SIMM pairs in each
> bank are identical.  I mean truly identical, not just in speed and
> capacity.  Mismatches can cause some weird problems.  Also make sure
> that all the banks are the same type.  You can also get some issues if
> one bank is ECC or parity and another isn't.  Once you've eliminated
> those possibilities, I would try locating the bad bank, and then the
> bad SIMM, by deduction.

Thanks for the ideas, V.  I'm going to check the SIMMs for uniformity.
I'll also check for mismatched gold/silver contacts now that I think of
it.  If I try to locate the bad bank and SIMM by deduction, what can I
use to really hammer on the memory?  Should I just write a Perl script
that generates a huge dataset to suck up all the memory?  Should I
disable the swap?
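
A minimal sketch of that "hammer on the memory" idea, written in Python
rather than Perl.  It is only an illustration under stated assumptions:
the names (fill, find_mismatches, hammer), the bit patterns, and the
32 MB default are all made up for this example, not taken from any
existing tool.  It fills one large buffer with a pattern, reads it back,
and reports any byte that changed.

```python
#!/usr/bin/env python3
"""Rough user-space memory exerciser (illustrative sketch only)."""

CHUNK = 1 << 20                       # work in 1 MB pieces
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)   # zeros, ones, alternating bits
SIZE_MB = 32                          # assumed default; raise toward free RAM


def fill(buf, value):
    """Write the byte `value` into every position of `buf`."""
    chunk = bytes([value]) * CHUNK
    for start in range(0, len(buf), CHUNK):
        end = min(start + CHUNK, len(buf))
        buf[start:end] = chunk[:end - start]


def find_mismatches(buf, value):
    """Return the offsets in `buf` that do not read back as `value`."""
    expected = bytes([value]) * CHUNK
    bad = []
    for start in range(0, len(buf), CHUNK):
        end = min(start + CHUNK, len(buf))
        block = buf[start:end]
        if block != expected[:end - start]:
            # Only walk byte by byte when the fast comparison fails.
            bad.extend(start + i for i, b in enumerate(block) if b != value)
    return bad


def hammer(size_bytes, passes=1):
    """Fill and verify a buffer with each pattern; return a list of
    (pass_number, pattern, offset) tuples for every bad byte seen."""
    buf = bytearray(size_bytes)
    errors = []
    for n in range(passes):
        for value in PATTERNS:
            fill(buf, value)
            for offset in find_mismatches(buf, value):
                errors.append((n, value, offset))
    return errors


if __name__ == "__main__":
    errors = hammer(SIZE_MB * 1024 * 1024, passes=2)
    for n, value, offset in errors:
        print("pass %d: pattern 0x%02X wrong at offset %d" % (n, value, offset))
    print("%d mismatches" % len(errors))
```

Two caveats on the design.  A user-space program like this only touches
the pages the kernel hands it, so it cannot exercise memory the kernel
itself occupies, and a clean run does not prove the RAM is good.  With
swap enabled, part of the buffer may be paged out to disk instead of
sitting in RAM, so turning swap off for the test (swapoff -a) and sizing
the buffer close to free memory should make it more likely that the
SIMMs are what is actually being tested.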

I'd guess the machine is about 3 years old.  My user has been using NT
pretty much all this time without many complaints (that I know of; I'll
press him for more background).  If a machine _has_ been working mostly
OK with NT, does that mean it most likely was actually OK and has only
recently developed a problem?  Or could it be that NT just didn't fully
use the system (or stress the system in the same way Linux does) and
thus didn't uncover a problem that was there from the start?  Is
this problem really new at all?

I would think that mismatched SIMMs would be trouble from the start,
right?  Or could mismatched SIMMs cause more trouble over time?

Another idea: I've heard that some motherboards contain components,
capacitors I think, that go bad over time due to the way they are
made.  They reach the end of their useful lives and don't perform up to
spec, introducing weird problems.  Anyone know more about this?

I'm tempted to simply reload NT, give it back to the user, and just see
what happens.  Or perhaps I should beg the boss to buy him a new
machine.  I hate these intermittent problems.

-- 
Dave Burchell                                          40.49'N, 96.41'W
Free your mind and your software will follow.              402-467-1619
http://incolor.inetnebr.com/burchell/                  burchell at acm.org     

-------------------------------------------------------------------------
Sent by OLUG Mailing list Manager, run by ezmlm.  http://olug.bstc.net/ 
To unsubscribe: `echo unsubsribe | mail olug-unsubscribe at bstc.net` 