[olug] Fails Over, but does Not Fail Back

Tue Apr 27 20:03:50 CDT 2021

On Tue, Apr 27, 2021 at 8:08 AM Joel B <joel at kansaslinuxfest.org> wrote:

> Hi Rob,
> I've seen this both ways and it seems to be dependent upon the equipment.
> Examples:
> The firewall units we have at my work run in a cluster (active-passive).
> They do not "fail back", but the vendor explains this as one data point
> used in deciding which is "active" is the uptime of the device (longer
> uptime weights the device more likely to be the "active" unit). I just
> reboot the now-active unit to restore the original order. (often this
> happens during a maintenance window, so it's a quick check and no
> problem rebooting).

Joel,

You magically knew firewalls and spanning tree were sore points right
now. Using device uptime in the decision process makes sense.
Contrasting the role of uptime between firewalls and spanning tree helps
elucidate the problem domain.
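
To make that contrast concrete, here is a minimal sketch of the two
election styles. It assumes a two-node cluster; the names (Node,
elect_by_uptime, elect_by_priority) and the numbers are made up for
illustration and are not any vendor's actual algorithm.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    priority: int        # lower value = more preferred (hypothetical convention)
    uptime_s: int        # seconds since last boot
    healthy: bool = True


def elect_by_uptime(nodes):
    """Sticky election: among healthy nodes, the longest uptime wins."""
    candidates = [n for n in nodes if n.healthy]
    return max(candidates, key=lambda n: n.uptime_s).name


def elect_by_priority(nodes):
    """Preemptive election: among healthy nodes, the lowest priority value wins."""
    candidates = [n for n in nodes if n.healthy]
    return min(candidates, key=lambda n: n.priority).name


# The primary has just rebooted after a failure; the backup has been up a day.
primary = Node("primary", priority=10, uptime_s=120)
backup = Node("backup", priority=20, uptime_s=86_400)

print(elect_by_uptime([primary, backup]))    # backup (no fail back)
print(elect_by_priority([primary, backup]))  # primary (fails back)
```

Under the uptime rule, rebooting the now-active unit (Joel's workaround)
resets its uptime, so the original unit wins the next election.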

Our Untangle firewalls are not “equal” because one has many more licenses
than the other. I have not tested failback to the primary with recent
versions, but I suppose VRRP does not bring licensing into its decision
process and needs to be told which one to favor. Maybe the bias setting
does not make it down from the upper-level configuration in the database to
the virtual router redundancy process running underneath it. Maybe a bash
script could intervene here after the better link has been up and stable.
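
The "up and stable" part of such a script could look roughly like the
sketch below. It is written in Python rather than bash, and FailbackGate
and required_ok are hypothetical names. The health check itself (ping,
link state) and the action taken once the gate opens are left out
because they depend on the platform.

```python
class FailbackGate:
    """Allow failback only after N consecutive good health checks."""

    def __init__(self, required_ok: int = 5):
        self.required_ok = required_ok
        self.consecutive_ok = 0

    def observe(self, primary_ok: bool) -> bool:
        """Record one health-check result; return True when failback looks safe."""
        if primary_ok:
            self.consecutive_ok += 1
        else:
            self.consecutive_ok = 0  # any flap restarts the hold-down
        return self.consecutive_ok >= self.required_ok


gate = FailbackGate(required_ok=3)
checks = [True, True, False, True, True, True]
results = [gate.observe(ok) for ok in checks]
print(results)  # [False, False, False, False, False, True]
```

Resetting the counter on any failed check keeps a flapping primary link
from triggering repeated failbacks.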

Our Ubiquiti EdgeRouter (want to say kernel 4.11) does not fail back to the
better internet connection even when the backup is downed in software, but
that is most likely user error at this point. I also tend to stay away
from dynamic routing for security reasons.

Our Linux-based switches are great in every way, especially at MSTP, but
they are dealing with a totally different class of switches that have very
limited MSTP support. I kinda wish STP took uptime into account in its
decision process. The switches navigate the spaghetti with ease whereas I
get lost. Grepping the Linux-based switch log files helps tremendously
when trying to find out what is going on.

>
> In our networking switches (different vendor than the firewall units) we
> use MSTP (Spanning-Tree). The links & switches have priority settings
> that are not dependent upon device uptime, so if a "spanning-tree
> event" occurs (link/switch/etc failure) when things recover they restore
> to the desired setup based on those priorities. No extra intervention
> required.
>
> So I see it happen both ways.
> -Joel
>
>
> On 4/27/2021 3:45 AM, Rob Townley wrote:
> > tldr; Systems that reliably fail over to the redundant system, but
> > absolutely refuse to revert to the primary system.
> >
> > Looking for general guidelines on systems (primarily networking) to
> > troubleshoot the fail back to primary pathway.
> >
> > The failover happens reliably. The problem is actually reverting, aka
> > “Failing Back”, to the primary path when the primary comes back up.
> >
> > Have experienced this failure to fail back too many times across a
> > variety of equipment and systems.  Looking for general guidelines.
> > What do noobs usually miss?
> >
> > Also, is it a common problem or just me?
> > _______________________________________________
> > OLUG mailing list
> > OLUG at olug.org
> > https://www.olug.org/mailman/listinfo/olug
>