[olug] Fails Over, but does Not Fail Back

Wed Apr 28 08:34:01 CDT 2021

Hi Rob,
It sounds like you are using completely different vendors than we are, 
but we have similar experiences. Fun :-)

Our firewall cluster loosely uses routing rules & route priorities to 
fail over between our internet connections.  Sometimes when the primary 
returns, we still see traffic on the backup, but in our case that is 
because the firewall lets already-established sessions stay on the 
secondary; over time (usually an hour or so), as those sessions expire, 
traffic returns to the primary. Primary -> Secondary is a hard change, 
but Secondary -> Primary is more gradual.
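
To make the hard-vs-gradual difference concrete, here is a minimal toy 
model in Python. It is only a sketch under stated assumptions, not how any 
particular firewall works: the Link and Firewall classes, the 60-minute 
session_ttl, the "lower priority number wins" convention, and the fixed 
(never refreshed) session expiry are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    name: str
    priority: int  # lower number = more preferred (assumed convention)
    up: bool = True


@dataclass
class Firewall:
    links: list
    session_ttl: int = 60  # minutes; hypothetical session lifetime
    sessions: dict = field(default_factory=dict)  # id -> (link name, expiry)

    def best_link(self):
        """Return the most preferred link that is up, or None."""
        candidates = [l for l in self.links if l.up]
        return min(candidates, key=lambda l: l.priority) if candidates else None

    def set_link(self, name, up, now):
        for link in self.links:
            if link.name == name:
                link.up = up
        if not up:
            # Hard change: sessions on the dead link cannot stay there.
            # Modelled as an immediate move (in practice, clients reconnect).
            best = self.best_link()
            for sid, (link_name, _) in list(self.sessions.items()):
                if link_name == name:
                    if best is None:
                        del self.sessions[sid]
                    else:
                        self.sessions[sid] = (best.name, now + self.session_ttl)
        # When a link comes back up, existing sessions are left untouched.

    def open_session(self, sid, now):
        """New sessions always take the best link available right now."""
        best = self.best_link()
        if best is not None:
            self.sessions[sid] = (best.name, now + self.session_ttl)

    def expire(self, now):
        """Drop sessions whose lifetime has run out."""
        self.sessions = {s: v for s, v in self.sessions.items() if v[1] > now}

    def count(self, name):
        return sum(1 for link_name, _ in self.sessions.values() if link_name == name)


fw = Firewall([Link("primary", 1), Link("secondary", 2)])
for sid in range(4):
    fw.open_session(sid, now=0)
fw.set_link("primary", up=False, now=10)  # everything moves at once
fw.set_link("primary", up=True, now=15)   # nothing moves back yet
fw.open_session(4, now=20)                # only new sessions use primary
print(fw.count("primary"), fw.count("secondary"))  # 1 4
fw.expire(now=70)                         # old sessions have aged out
print(fw.count("primary"), fw.count("secondary"))  # 1 0
```

The point of the toy is just that the backup link keeps carrying traffic 
until the sessions pinned to it age out, even though the primary has been 
preferred for new sessions the whole time.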

Troubleshooting MSTP issues in the moment is quite fun, fun like "take 
my phone off the hook and close my door - I know it's down" :-). We rely 
a bit on our Cacti instance to help show which links between switches 
are actually passing traffic at a given time.  In a prolonged/stable 
event, Cacti usually gives an idea of where to look to see what has 
changed.  In something major, or an outage where links are continuously 
changing, Cacti isn't always useful and we do the same log searching you 
mentioned.

A war story: A few years ago we had a switch flake out and start 
flapping its links, causing STP to keep blocking/unblocking and 
"renegotiating" the STP root.  After a few minutes of not getting 
anywhere, I cut the network in half. One half was good and services were 
restored there; then I could plug into the orphaned half, where the volume 
of "noise" was low enough that I could see what was happening, and quickly 
identified the misbehaving switch. It was older, not really supported 
anymore, and a good candidate to be replaced anyway, so we put in a 
trusted spare, pulled the misbehaving one, and ordered a replacement 
(installed later in a proper maintenance window). The switch PROBABLY 
would have been fine after a reboot or reflash, but the event was 
damaging & stressful enough that we no longer trusted it and it was 
retired. It's funny how outages can turn into budget item approvals :)

-Joel

On 4/27/2021 8:03 PM, Rob Townley wrote:
> On Tue, Apr 27, 2021 at 8:08 AM Joel B <joel at kansaslinuxfest.org> wrote:
>
>> Hi Rob,
>> I've seen this both ways and it seems to be dependent upon the equipment.
>> Examples:
>> The firewall units we have at my work run in a cluster (active-passive).
>> They do not "fail back", but the vendor explains that one of the data
>> points used in deciding which is "active" is the uptime of the device
>> (longer uptime weights the device more likely to be the "active" unit).
>> I just reboot the now-active unit to restore the original order. (Often
>> this happens during a maintenance window, so it's a quick check and no
>> problem rebooting.)
>
> Joel,
>
> You magically knew firewalls and spanning tree were some sore points right
> now.  Uptime of the device used in the decision process makes sense.
> Contrasting the role of uptime between firewalls and spanning tree helps
> elucidate the problem domain.
>
> Our Untangle firewalls are not “equal” because one has many more licenses
> than the other.  Have not tested fallback to the primary with recent
> versions, but I suppose VRRP does not bring licensing into its decision
> process and needs to be told which one to favor.  Maybe the bias setting
> does not make it down from the upper level configuration in the database to
> virtual router redundancy running in the datalink layer.  Maybe a bash
> script could intervene here after the better link has been up and stable.
>
> Our Ubiquiti EdgeRouter (want to say kernel 4.11) does not fail back to the
> better internet connection even when the backup is software downed, but
> that is most likely user error at this point.  I also tend to stay away
> from dynamic routing for security reasons.
>
> Our Linux based switches are great in every way and especially in MSTP, but
> they are dealing with a totally different class of switches that have very
> limited MSTP support.  I kinda wish STP took into account uptime in its
> decision process.  The switches navigate the spaghetti with ease whereas I
> get lost.  Grepping the Linux based switch log files helps tremendously
> when trying to find what is going on.
>
>
>> In our networking switches (different vendor than the firewall units) we
>> use MSTP (Spanning-Tree). The links & switches have priority settings
>> that are not dependent upon device uptime, so if a "spanning-tree
>> event" occurs (link/switch/etc failure) when things recover they restore
>> to the desired setup based on those priorities. No extra intervention
>> required.
>>
>> So I see it happen both ways.
>> -Joel
>>
>>
>> On 4/27/2021 3:45 AM, Rob Townley wrote:
>>> tl;dr: Systems that reliably fail over to a redundant system, but
>>> absolutely refuse to revert back to the primary system.
>>>
>>> Looking for general guidelines on systems (primarily networking) to
>>> troubleshoot the fail back to primary pathway.
>>>
>>> The failover happens reliably.  The problem is, when the primary comes
>>> back up, actually reverting back, aka “Failing Back”, to the primary path.
>>>
>>> Have experienced this failure to fail back too many times across a
>>> variety of equipment and systems.  Looking for general guidelines.
>>> What do noobs usually miss?
>>>
>>> Also, is it a common problem or just me?
>>> _______________________________________________
>>> OLUG mailing list
>>> OLUG at olug.org
>>> https://www.olug.org/mailman/listinfo/olug