[OLUG] Isolating flaky hardware problems

tetherow at nol.org
Thu Feb 10 17:37:40 UTC 2000


On  9 Feb, Dave Burchell wrote:
> Vincent says:
> 
>> Dave Burchell wrote:
>> > 
>> > I've got some hardware that may be flaky, and I need some advice on
>> > narrowing down the problem.
>> > 
>> > Long story short, how do I isolate possible CPU or RAM intermittent
>> > failures?
> 
>> Well, that wasn't a very good attempt at making a long story short :)
> 
> Doh! I meant to say that the above sentence was long-story-short, and
> the long story was everything else.  (Now my _posts_ are flaky...)
> 
>> Given the choice, you're probably right to assume it's RAM and not the
>> CPU.  If it were the cpu, I doubt it would have been able to give you an
>> error message at all.
>> Considering it's a 200MHz system, I doubt all 128MB of RAM is original
>> and matching.  I would take a look at it and see if the SIMM pairs in
>> each bank are identical.  I mean truly identical, not just the same
>> speed and capacity; mismatched pairs can cause some weird problems.
>> Also make sure that all banks are the same type.  You can also get
>> issues if one bank is ECC or parity and another isn't.  Once you've
>> eliminated those possibilities, I
>> would try locating the bad bank, and then the bad SIMM by deduction.
> 
> Thanks for the ideas, V.  I'm going to check the SIMMs for uniformity.
> I'll also check for mismatched gold/silver contacts now that I think of
> it.  If I try to locate the bad bank and SIMM by deduction, what can I
> use to really hammer on the memory?  Should I just write a Perl script
> that generates a huge dataset to suck up all the memory?  Should I
> disable the swap?

Check out the following two from Freshmeat:

memtester is a user-space utility for testing the memory subsystem in a
computer to determine if it is faulty.  It does a reasonably good job of
finding intermittent and non-deterministic faults.  It has many tests to
help catch borderline memory, and generates a verbose report of faults
found, tests run, and time taken.

Download: http://www.qcc.sk.ca/~charlesc/software/memtester/#download
Homepage: http://www.qcc.sk.ca/~charlesc/software/memtester/  
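
To give a rough idea of what a user-space pattern test does, here is a
minimal sketch in Python.  It is not how memtester is implemented; the
function names, the pattern list, and the 16 MiB buffer size are all
made up for illustration.  It fills a buffer with a few bit patterns,
reads each one back, and reports any byte that changed.  A script like
this cannot choose which physical pages it is given, so treat it as a
toy and use one of the real testers for actual diagnosis.

```python
# Toy user-space memory pattern test (illustrative only, not memtester).
# All names and sizes here are assumptions chosen for the example.

PATTERNS = (0x00, 0xFF, 0xAA, 0x55)  # all zeros, all ones, alternating bits


def fill(buf, pattern):
    """Overwrite every byte of buf with the given pattern byte."""
    buf[:] = bytes([pattern]) * len(buf)


def find_faults(buf, pattern):
    """Return (offset, expected, actual) for each byte that differs."""
    if buf == bytes([pattern]) * len(buf):
        return []  # fast path: the whole buffer matches
    return [(i, pattern, v) for i, v in enumerate(buf) if v != pattern]


def pattern_test(size_bytes, patterns=PATTERNS):
    """Write each pattern to a fresh buffer and verify it on read-back."""
    buf = bytearray(size_bytes)
    faults = []
    for pattern in patterns:
        fill(buf, pattern)
        faults.extend(find_faults(buf, pattern))
    return faults


if __name__ == "__main__":
    size = 16 * 1024 * 1024  # 16 MiB; raise this to cover more memory
    faults = pattern_test(size)
    print("tested %d bytes, %d fault(s) found" % (size, len(faults)))
    for offset, expected, actual in faults[:20]:
        print("offset %#x: wrote %#04x, read %#04x" % (offset, expected, actual))
```

Running it in a loop for a long time is the only way a sketch like this
would ever catch an intermittent fault, and a clean run proves very
little, since the buffer only covers whatever pages the kernel hands out.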

Memtest-86 is a very thorough, stand-alone memory test for x86 and
Pentium systems (and compatibles).

Download: http://reality.sgi.com/cbrady_denver/memtest86/memtest86-2.1.tar.gz
Homepage: http://reality.sgi.com/cbrady_denver/memtest86/


> I'd guess the machine is about 3 years old.  My user has been using NT
> pretty much all this time without many complaints (that I know of; I'll
> press him for more background).  If a machine _has_ been working mostly
> ok with NT then does that mean it most likely was actually ok and
> developed a recent problem?  Or could it be that NT just didn't fully
> use the system (or stress the system in the same way Linux does) and
> thus didn't uncover the problem, which was there from the start?  Is
> this problem really new at all?

How would you know?  Stuff dies all the time in the MS world ;)
 
------------------------------------------------------------------------
Sam Tetherow                           tetherow at nol.org
Director of Development
Nebrask@ Online                        http://www.nol.org/
