[olug] Fails Over, but does Not Fail Back

Jeff Hinrichs - DM&T jeffh at dundeemt.com
Wed Apr 28 10:21:56 CDT 2021


I cannot tell you how often
>
> so we put in a trusted spare,
>
is the key to successfully troubleshooting or repairing an issue. I have a
shelf where I keep known-good drives, switches, etc., so when a RAID
reports an issue or a switch starts flaking, I know that I'm at least
starting with good equipment.

"Our job is to rein in chaos and entropy and, through force of will, impose
order on our part of the multiverse."
Should be the opening line of all IT job descriptions.

0-J

On Wed, Apr 28, 2021 at 8:36 AM Joel B <joel at kansaslinuxfest.org> wrote:

> Hi Rob,
> It sounds like you are using completely different vendors than we are,
> but we have similar experiences. fun :-)
>
> Our firewall cluster loosely uses routing rules & route priorities to
> fail over between our internet connections.  Sometimes when the primary
> returns, we still see traffic on the backup, but in our case that is
> because the firewall lets already-established sessions stay on the
> secondary; over time (usually an hour or so), as those sessions expire,
> traffic returns to the primary. Primary -> secondary is a hard change,
> but secondary -> primary is more gradual.
>
> Troubleshooting MSTP issues in the moment is quite fun, fun like "take
> my phone off the hook and close my door - I know it's down" :-). We rely
> a bit on our Cacti instance to help show which links between switches
> are actually passing traffic at a given time.  In a prolonged/stable
> event, Cacti usually gives an idea of where to look to see what has
> changed.  In something major, or an outage where links are continuously
> changing, Cacti isn't always useful and we do the same log searching you
> mentioned.
>
> A war story: A few years ago we had a switch flake out and start
> flapping its links, causing STP to block/unblock in a rolling fashion
> and "renegotiate" the STP root.  After a few minutes of not getting
> anywhere, I cut the network in half. One half was good and services were
> restored; then I could plug into the orphaned half, where the volume of
> "noise" was low enough that I could see what was happening and quickly
> identify the misbehaving switch. It was older, not really supported
> anymore, and a good candidate to be replaced anyway, so we put in a
> trusted spare, pulled the misbehaving one, and ordered a replacement
> (installed later in a proper maintenance window). The switch PROBABLY
> would have been fine after a reboot or reflash, but the event was
> damaging & stressful enough that we no longer trusted it and it was
> retired. It's funny how outages can turn into budget item approvals :)
>
> -Joel
>
> On 4/27/2021 8:03 PM, Rob Townley wrote:
> > On Tue, Apr 27, 2021 at 8:08 AM Joel B <joel at kansaslinuxfest.org> wrote:
> >
> >> Hi Rob,
> >> I've seen this both ways, and it seems to depend on the equipment.
> >> Examples:
> >> The firewall units we have at my work run in a cluster (active-passive).
> >> They do not "fail back", but the vendor explains that one of the data
> >> points used in deciding which unit is "active" is the uptime of the
> >> device (longer uptime weights the device as more likely to be the
> >> "active" unit). I just reboot the now-active unit to restore the
> >> original order. (Often this happens during a maintenance window, so
> >> it's a quick check and no problem rebooting.)
> >
> > Joel,
> >
> > you magically knew firewalls and spanning tree were some sore points
> > right now.  Uptime of the device used in the decision process makes
> > sense.  Contrasting the role of uptime between firewalls and spanning
> > tree helps elucidate the problem domain.
> >
> > Our Untangle firewalls are not “equal” because one has many more licenses
> > than the other.  I have not tested fail-back to the primary with recent
> > versions, but I suppose VRRP does not bring licensing into its decision
> > process and needs to be told which one to favor.  Maybe the bias setting
> > does not make it down from the upper-level configuration in the database
> > to the virtual router redundancy running in the datalink layer.  Maybe a
> > bash script could intervene here after the better link has been up and
> > stable.
> >
> > Our Ubiquiti EdgeRouter (want to say kernel 4.11) does not fail back to
> > the better internet connection even when the backup is software downed,
> > but that is most likely user error at this point.  I also tend to stay
> > away from dynamic routing for security reasons.
> >
> > Our Linux-based switches are great in every way, especially at MSTP,
> > but they are dealing with a totally different class of switches that
> > have very limited MSTP support.  I kinda wish STP took uptime into
> > account in its decision process.  The switches navigate the spaghetti
> > with ease whereas I get lost.  Grepping the Linux-based switch log files
> > helps tremendously when trying to find out what is going on.
> >
> >
> >> In our networking switches (different vendor than the firewall units) we
> >> use MSTP (Spanning-Tree). The links & switches have priority settings
> >> that do not depend on device uptime, so if a "spanning-tree event"
> >> occurs (link/switch/etc. failure), when things recover they restore
> >> to the desired setup based on those priorities. No extra intervention
> >> required.
> >>
> >> So I see it happen both ways.
> >> -Joel
> >>
> >>
> >> On 4/27/2021 3:45 AM, Rob Townley wrote:
> >>> tldr; Systems that reliably fail over to a redundant system, but
> >>> absolutely refuse to revert back to the primary system.
> >>>
> >>> Looking for general guidelines on troubleshooting the fail-back-to-primary
> >>> pathway in systems (primarily networking).
> >>>
> >>> The failover happens reliably.  The problem is actually reverting, aka
> >>> “Failing Back”, to the primary path when the primary comes back up.
> >>>
> >>> Have experienced this failure to fail back too many times across a
> >>> variety of equipment and systems.  Looking for general guidelines.
> >>> What do noobs usually miss?
> >>>
> >>> Also, is it a common problem or just me?
> >>> _______________________________________________
> >>> OLUG mailing list
> >>> OLUG at olug.org
> >>> https://www.olug.org/mailman/listinfo/olug
>
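
To make the contrast in the thread concrete (fixed priorities restore the
original order, uptime weighting does not), here is a toy sketch in Python.
It is not any vendor's algorithm: the Unit class, both function names, and
the "longest uptime wins" rule are assumptions made up for illustration.
The only real-world fact it leans on is that STP prefers the lowest bridge
priority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    priority: int  # lower wins, as with an STP bridge priority
    uptime: int    # seconds since last boot


def elect_by_priority(units):
    """Configured priority decides; the name only breaks ties."""
    return min(units, key=lambda u: (u.priority, u.name))


def elect_by_uptime(units):
    """Toy uptime-weighted rule (an assumption): longest-running unit wins."""
    return max(units, key=lambda u: u.uptime)


if __name__ == "__main__":
    # The primary has just rebooted; the backup has been carrying traffic.
    units = [Unit("primary", 10, 60), Unit("backup", 20, 86400)]
    print(elect_by_priority(units).name)  # primary: fails back on its own
    print(elect_by_uptime(units).name)    # backup: stays put until rebooted
```

Under the second rule the only way back to the original order is to reset
the backup's uptime, which matches Joel's "reboot the now-active unit".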

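Rob's idea of a script that steps in only "after the better link has been
up and stable" amounts to a hold-down timer. A minimal sketch of just the
decision logic follows, assuming a caller that probes the primary and then
performs the actual switch-over. FailbackGate, hold_down, and observe are
hypothetical names, and the 300-second value is an arbitrary example, not
a recommendation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailbackGate:
    """Allow fail-back only after the primary stays healthy for hold_down seconds."""
    hold_down: float                        # required seconds of stability (example value)
    healthy_since: Optional[float] = None   # start of the current healthy streak

    def observe(self, now: float, primary_healthy: bool) -> bool:
        """Record one health probe; return True once fail-back looks safe."""
        if not primary_healthy:
            self.healthy_since = None       # any failed probe resets the streak
            return False
        if self.healthy_since is None:
            self.healthy_since = now        # first good probe of a new streak
        return now - self.healthy_since >= self.hold_down


if __name__ == "__main__":
    gate = FailbackGate(hold_down=300)
    # (timestamp in seconds, probe result); the flap at t=200 restarts the clock
    for t, ok in [(0, True), (120, True), (200, False), (260, True), (560, True)]:
        print(t, ok, gate.observe(t, ok))
```

Keeping the timer separate from whatever command moves traffic back means
the same gate could sit in front of a route change, a VRRP priority bump,
or a planned reboot of the backup.
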

-- 
Best,

Jeff Hinrichs
402.320.0821

