And so we get to another issue I had. It happened at 11 am around the second week of term: some of my core switches started dropping packets. While I had run into other issues before, it had been around three weeks since those and around five weeks since the core had been set up, and in those earlier issues the core switches had not really been affected.
Figure 1 – Pinging XG16 switches and seeing packet loss.
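For anyone wanting to reproduce the kind of check shown in Figure 1, a minimal sketch is below. It just wraps the system ping command from a workstation; the switch hostnames are placeholders, not my real management addresses.

```python
# Minimal sketch: ping each core switch and report the packet loss percentage.
# Hostnames below are placeholders for the real switch management addresses.
import re
import subprocess

SWITCHES = ["pdc-xg16-1", "pdc-xg16-2", "pdc-xg16-3", "pdc-xg16-4",
            "pdc-xg16-5", "pdc-xg16-6", "pdc-xg16-7"]

def packet_loss(host: str, count: int = 20) -> float:
    """Ping a host and return the packet loss percentage ping reports."""
    result = subprocess.run(["ping", "-c", str(count), host],
                            capture_output=True, text=True)
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", result.stdout)
    # If ping produced no summary at all, treat the host as fully unreachable.
    return float(match.group(1)) if match else 100.0

for switch in SWITCHES:
    print(f"{switch}: {packet_loss(switch):.0f}% loss")
```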
If you can remember back to one of my earlier posts about how I set up the core, I had seven XG-16s handling all my incoming connections from buildings and servers. The EdgeRouter Infinity had been in for almost 3 weeks.
Figure 2 – The Core, with messy cabling.
However, four of the seven XG-16s were just not happy. I connected to each of the seven switches to check for differences in their configs; they all had similar settings, the only real difference being which ports were configured for what, and no changes had been made in days. I checked the logs of each switch and nothing stood out that would cause the issue I was seeing. One plus was that users weren't really aware of the problem; the only impact they saw was slow network performance. So I turned my attention to how I had configured the uplinks.
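Side note on that config check: rather than eyeballing seven switches, the comparison can be scripted. A minimal sketch, assuming each switch's config has already been exported to a text file; the directory and filenames are placeholders.

```python
# Minimal sketch: diff each exported switch config against a baseline switch.
# The configs/ directory and .cfg filenames are assumptions for illustration.
import difflib
from pathlib import Path

BASELINE = Path("configs/pdc-xg16-1.cfg")   # switch to compare the others against
OTHERS = [Path(f"configs/pdc-xg16-{n}.cfg") for n in range(2, 8)]

baseline_lines = BASELINE.read_text().splitlines(keepends=True)

for other in OTHERS:
    diff = list(difflib.unified_diff(
        baseline_lines,
        other.read_text().splitlines(keepends=True),
        fromfile=BASELINE.name,
        tofile=other.name,
    ))
    if diff:
        print("".join(diff))
    else:
        print(f"{other.name}: identical to {BASELINE.name}")
```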
Figure 3 – Core Design v1.
Each switch had 2x 10GbE CAT6A Ethernet cables connecting it to the switch below it, and the final switch had two that ran back up to the first switch, closing the loop. This design came from the MSP I had ordered my Ubiquiti gear through, and for five weeks it had run without any cabling-related issues. I decided to simplify things and unplugged one of the Ethernet cables from each switch, reducing it to one uplink per switch. This made no difference, though.
Next, I decided to make a bigger change; maybe the ring-style design was causing the issue. I used PDC-XG16-7 as the test switch, as it only had one building connected to it, a small building with very few users, which made it good for this type of testing. I connected a fibre cable from PDC-XG16-7 to PDC-XG16-1 and then removed the Ethernet cable from that switch.
Within seconds PDC-XG16-7 stopped dropping packets and was back to a stable connection. I couldn't connect the other three switches to PDC-XG16-1 the same way, as it didn't have enough spare fibre ports, so I decided to connect them via PDC-XG16-7, at least as a quick fix.
So, using the same method, I connected a new fibre and then removed the Ethernet in PDC-XG16-4, PDC-XG16-5 and PDC-XG16-6, repeating the process each time. It was quick and dirty, but within a few minutes my network was stable again and the slow performance issues were gone.
Figure 4 – Packets back to normal.
By 11:20 am the network issue was resolved. But not yet fixed.
Figure 5 – Temp cable solution
As you can see in Figure 5, it was far from a perfect solution, but it had fixed the issue and all buildings were back to normal. I wasn't finished, though. My first thought towards a permanent solution was my powerful and under-used EdgeRouter Infinity; as you can see in Figure 6, it barely ever uses any of its actual grunt. I think the highest I have ever seen its utilisation hit was 3%.
Figure 6 – EdgeRouter Infinity is bored.
NOTE: The screenshot is not from the time of the incident, but was taken at a later date.
Not wanting to just jump into a solution, I ran my plan past some Ubiquiti engineers who had helped me before, and they quickly advised me that it was a terrible idea. The EdgeRouter Infinity is great at routing traffic, but it would suffer as a switch. So I went back to the drawing board. I decided to re-purpose PDC-XG16-7: I moved its building uplink to another switch, PDC-XG16-6, then I drew up the following design.
Figure 7 – Cable Design Final, adding a new core switch
The new design saw PDC-XG16-7 renamed SWCORE-XG16. It was given two fibre uplinks to the EdgeRouter Infinity, and the remaining fibre ports were used to connect each of the other XG16s to the new switching core.
I ran this design past those Ubiquiti engineers; they advised me that the solution should work and that the only limitation was that SWCORE-XG16 became a single point of failure. I do plan on addressing that down the track, but I was happy with it for now.
NOTE: As mentioned, I am aware of the single point of failure, and that the same problem exists because each building is connected to only one XG16. The plan down the line is to bring in another XG16 as a secondary switching core, just as I plan to have each building connect to two XG16s so that a single switch going down doesn't cause an outage. But that is a future plan and I haven't had time to work on it yet. Nor has it been an issue yet, and I have two XG16s sitting in the server room, fully configured and ready to be 'hot-swapped' in the event something goes wrong.
The re-patching was actually done during lunchtime that day, and I was pinging all the switches from my MacBook as I undertook the work. During the whole process I only saw four pings drop. The actual swap-over was pretty easy: I connected a fibre cable from SWCORE-XG16 to one of the other switches, made sure the port came up, and then unplugged the old Ethernet cable. I did it slowly, making sure each switch was fine for a few minutes before moving on to the next one; in total it took me about 20 minutes to change all the switches over.
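If you want something a bit more structured than watching seven ping windows during a cutover like this, here is a rough sketch of a drop counter: one ping per switch per second, logging anything that doesn't come back. Again, the hostnames are placeholders.

```python
# Rough sketch: continuously ping each switch during a cutover and count drops.
# Hostnames are placeholders for the real switch management addresses.
import subprocess
import time

SWITCHES = ["swcore-xg16", "pdc-xg16-1", "pdc-xg16-2", "pdc-xg16-3",
            "pdc-xg16-4", "pdc-xg16-5", "pdc-xg16-6"]
drops = {switch: 0 for switch in SWITCHES}

def ping_once(host: str) -> bool:
    """Send a single ping; treat a non-zero exit code or a hang as a drop."""
    try:
        return subprocess.run(["ping", "-c", "1", host],
                              capture_output=True, timeout=2).returncode == 0
    except subprocess.TimeoutExpired:
        return False

try:
    while True:
        for switch in SWITCHES:
            if not ping_once(switch):
                drops[switch] += 1
                print(f"{time.strftime('%H:%M:%S')} drop on {switch} "
                      f"(total {drops[switch]})")
        time.sleep(1)
except KeyboardInterrupt:
    # Ctrl-C when the re-patching is done to see the totals.
    print("Drop totals:", drops)
```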
Once done, nothing changed, which was good since technically I had already resolved the problem. It is now late May as I write this up, which is the final post around my Ubiquiti switch rollout, and my switching environment hasn't had any other Ubiquiti-related issues. I have had a couple of damaged fibre connections that needed work, though none of that was related to Ubiquiti's gear. All in all, we are happy with Ubiquiti's switching gear. The network is stable, and granted we are only five months in and we have had some issues, but everything seems to be running well now.
I have admitted before that I am not a networking guru, just a jack of all trades. If someone can tell me what was happening on those switches, I would love to hear it. What puzzled me was that everything had been fine for some time before the issue arose; I hadn't made any changes or firmware upgrades that could have brought it on.
I will hopefully soon post my last Ubiquiti Infrastructure Rollout post; it is about an issue we had with our wireless, and the joys of DFS channels!
May 31, 2018 at 11:38 pm
Did you ever find out what the issue ultimately was? Was STP causing problems?
June 1, 2018 at 7:03 am
Honestly… no.
When I re-planned the network I actually found 2 benefits that I hadn’t thought about.
First, it gave me a central point to do MAC address table lookups, instead of going to each of my XG16s in the core to check them. And since the network is working, I haven't gone back to see if I could reproduce the issue. Looking after all parts of the IT infrastructure, I don't really have time to go back and poke at something I have already fixed; my workload is too high for that.
July 4, 2018 at 8:04 am
At first glance, looking at your initial design, STP is your main culprit. You basically created one big network loop; STP can handle it up to a point, but as you have seen it will start bogging down. The dropped packets are from paths changing from one port to another and reconvergence times. Keep up the good blog, I enjoy reading your discoveries.
July 5, 2018 at 8:03 am
Thanks, mate.
Been pretty slack lately on posting. Will work on getting some more stuff up in the coming weeks.
Cheers.
September 22, 2019 at 9:35 pm
I am willing to bet heavily that you did not have spanning tree enabled, which caused a network loop (spanning tree is designed to stop this) of learned MAC addresses that filled up your MAC table. This is why, if you need multiple switches like this, you should use stacking switches or a chassis switch. LACP might also help here, as it is designed to bond two ports together so they act as one, rather than (you didn't state one way or the other) two single links plugged into each switch, which will cause a loop.
February 11, 2020 at 12:43 am
Any updates to this?
did you add another xg16? what about link agg to the other xg16s?
also, where have you gone!
February 14, 2020 at 1:47 pm
Yes, I have done that.
However, it is a manual process. The spare XG16 is sitting there with one uplink so I can ensure it has the same updated firmware and config as the other switches, but it doesn't automatically fail over.
I am still around but haven't had much time lately for posts. Currently thinking about a few posts (mainly what devices I use etc., and a test of a Surface Pro 7). Also happy to take suggestions for what you want me to test.