Monitoring… eduitguy stack
Following my earlier posts about monitoring, several people asked me for more information about my monitoring solution, or stack. To answer that, I should explain how I came to use it. It is one thing to rattle off your current stack of software; I would rather show you how I arrived at that setup.
NOTE: If you just want to see what I am using now, go to the bottom.
My very first attempt involved VNC with Windows Activity Monitor. It was that simple, and horrible. I can’t even remember the name of the application I used, but it allowed me to set up a grid that would show me nine VNC views at once and then rotate through the VNC connections. Each connection showed that machine’s activity monitor, and that was it. I am not going to lie, it was horrible, and my SQL servers yelled in protest at having to keep it up all the time.
My next foray into monitoring was at my next school. The budget was a big concern at this school, so I had next to no money to purchase software for the classrooms, let alone for monitoring. I first started working with Nagios Core and found that while it worked, I couldn’t get any type of display screen… something that I always wanted. I had seen pictures of NASA-style NOC screens and thought, I want that!
Figure 1. NASA Control Centre
I then moved across to Spiceworks (https://www.spiceworks.com/free-network-monitoring-management-software/) because it gave me something like a monitoring screen, but it still wasn’t great. This led me to find Nagvis (http://www.nagvis.org/). It is actually a pretty neat program: you can add any image you want as your background and then assign hosts to icons that you put on that image. It pulls its data from Nagios, hence the name Nagvis.
Below in Figure 2 is the first custom NOC screen that I made. I found a walkthrough on the Spiceworks forums and built a very similar screen. I made some changes and used Nagvis for some time; I even made a site map where every admin point had an icon showing whether the area was up, down or acknowledged.
I don’t know what happened to the setup at that school, I can’t remember if the person who replaced me kept it. I do know that the current awesome Admin there is using a variation of my current setup with Zabbix and Grafana, so I can only assume Nagvis has gone from there.
The best way I found to get Nagios Core and Nagvis at the time was with Fully Automated Nagios (http://www.fullyautomatednagios.org/) or FAN as it was known. If you wanted Nagios and some great plugins it worked well… however, it appears to have stopped being updated as of 2013.
After I moved to another school, one of my first tasks was to build a monitoring solution; they lacked one, and the Director, who was also new there, wanted one. I looked into FAN again, but this time I chose to use Check_MK (https://mathias-kettner.de/checkmk.html).
Check_MK was built on Nagios Core, but with a GUI front end, so you don’t need to use text files to add hosts and services. And since it points to Nagios you can link it to Nagvis. I used it for about 2 years before I decided that I wanted something different.
It was at this point that I moved to Nagios XI. It is the only time I have ever paid for an entire monitoring solution, and at the time I thought it would give me a more stable platform. Check_MK had been giving me weird bugs, like alerts about downed switches or servers when they weren’t down. To its credit, I eventually discovered the cause was a different issue. It was hard to track down: Check_MK would alert that Server 1 was down… I would ping and RDP to Server 1 within seconds of getting the alert and it would be fine. Thirty seconds later the alert would clear.
Eventually, I discovered it was a VMWare issue… but I had moved on by then.
Nagios XI was a good product and still is. But for me, it lacked any great GUI screens. It came with some nice screens like their OpsScreen, Figure 3.
Figure 4. Nagios Birdseye (https://exchange.nagios.org/directory/Addons/Frontends-%28GUIs-and-CLIs%29/Web-Interfaces/Birdseye/details)
All in all, I was pretty happy with the solution. It worked, though it wasn’t very flexible in terms of letting me change areas I didn’t like. For example, one thing I hate is white NOC screens. Some love them, but I like a dark NOC screen. A black version had come along by 2014, but by this time I had stopped paying for Nagios XI and could not get any more updates.
I first found Grafana, and from the moment I saw what it could do I loved it. The dashboards it could create were amazing. However, I couldn’t find a good way to get Grafana and Nagios to talk to each other. That led me to test other solutions that work with Grafana until I found and fell in love with Zabbix.
This became my go-to monitoring stack. They are both flexible and just generally easy to use, with amazing support forums for both products. This leads me to my current monitoring stack…
NOTE: I am not claiming that anything I have done here hasn’t been done by others before… I give it a stack name only because I have brought together different products in a combination that I don’t currently see anywhere else online. If I am wrong, let me know and I will use that stack name.
My stack truly came into its own when I moved jobs again. My new Director wanted his monitoring solution completely redone, so in my new role as Senior Infrastructure Engineer I took on the task. Their monitoring worked, it just wasn’t very good. They were using EventSentry. I am sure EventSentry works well, but I didn’t find anything about it that amazed me, anything that made me want to keep using it.
In my second week on the job it was uninstalled, and Check_MK was quickly installed, set up and configured… WAIT! WHAT?!
Yes, I set up Check_MK first. The reason is that I was still figuring out exactly what I wanted this time around and knew that I could get Check_MK up and running in a day. I never intended to keep it, but I wanted to do more research on monitoring this time. It was the first time in a while that I was able to completely focus on monitoring. I dived into a rabbit hole of monitoring products, making sure nothing had really changed in the years since I had last looked at them. Here are just some of the applications I installed and played with.
- Nagios XI
- Check_MK (was using)
And there were more I did research on but didn’t install.
NetCrunch was something that I was really interested in, I liked the UI and the built-in Grafana option. Though as I worked on it I found myself looking back to Zabbix… the main differences between them were that in Zabbix I needed to add things myself or find templates others had created. In NetCrunch they were constantly pushing out new ways to connect to different hosts.
For example, ESX host monitoring only needed a username and password and IP address for NetCrunch. While in Zabbix I needed to use a template, then a discovery task to find the ESX hosts, and if I did it wrong I got IP Addresses as the hostname instead of their DNS name or other weird bugs.
But try as I might, I just couldn’t let go of Zabbix. And I went back.
For me, Zabbix is the best solution I could find for one glorious reason. It is free. Spending even a few thousand on something when I can get a similar outcome for free just isn’t something I was willing to do.
I bet a lot of you are screaming at your screens… or maybe saying it in your head. “But free stuff that requires more work from you means it isn’t free. It takes up your time.”
That is true, but I had the benefit that I was still learning the environment and setting up monitoring on a system you have just come to is a great way to learn all the ins and outs. Since Zabbix has discovery tasks, I would point it at a subnet and find everything there. Then discover what everything did.
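As a rough illustration of "pointing Zabbix at a subnet", here is a minimal Python sketch that builds the JSON-RPC request body for the Zabbix API method drule.create. It is not the exact rule I used: the rule name, the subnet, the one-hour interval and the single ICMP ping check are all assumptions made up for the example, and sending the request (URL, authentication) is deliberately left out because that differs between Zabbix versions.

```python
import ipaddress
import json


def build_discovery_rule(name, subnet, delay="1h"):
    """Build a JSON-RPC body for Zabbix's drule.create (sketch only)."""
    # Validate the subnet locally first; a range with host bits set
    # (e.g. 192.168.10.5/24) raises ValueError before Zabbix ever sees it.
    network = ipaddress.ip_network(subnet, strict=True)
    return {
        "jsonrpc": "2.0",
        "method": "drule.create",
        "params": {
            "name": name,              # hypothetical rule name
            "iprange": str(network),   # CIDR range to scan
            "delay": delay,            # assumed scan interval
            # Assumption: one ICMP ping check (type 12 in the API docs).
            "dchecks": [{"type": "12"}],
        },
        "id": 1,
    }


if __name__ == "__main__":
    body = build_discovery_rule("Office LAN", "192.168.10.0/24")
    print(json.dumps(body, indent=2))
```

Anything the rule finds still needs a discovery action in Zabbix to turn it into a host, which is where the "then discover what everything did" part comes in.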
And now we come to what I love about my monitoring system. Grafana.
Just google Grafana (Figure 5) and you get an amazing view of the different dashboards people have made.
Figure 5. Grafana Dashboards (https://tinyurl.com/y8b4m7a5)
And with the recent release of Grafana 5.0 the dashboards have gotten even better with fully adjustable layouts. (http://docs.grafana.org/guides/whats-new-in-v5/)
I use Grafana in two ways. In my office, I have 2x 32″ LCD screens mounted above my whiteboard (Figure 6). These screens are set up for two distinct functions. The right screen runs a Grafana playlist (see Figure 7): every 60 seconds it moves to the next of the many dashboards in the playlist. The left screen shows a static page that refreshes the items within it every 60 seconds.
Figure 6. NOC Screens in my office.
Note: These photos were taken on a day when building maintenance was being conducted, hence the amount of ‘issues’ that are listed. My BDC was down.
Figure 7. Example of creating Grafana Playlists
These screens are constantly running when I am at work and make sure I am across any potential issues before they can happen.
These screens run on 2x Raspberry Pi 3s that are connected via Ethernet and have VNC enabled for when I need to work on them. Each Raspberry Pi is running Raspbian, and I use Firefox in kiosk mode so that none of the address or side bars can be seen.
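For anyone wanting to script a similar kiosk screen, here is a small Python sketch that builds the dashboard URL and the browser command line. It is not my exact setup: the Grafana address and dashboard path are hypothetical, it assumes a Grafana version that understands the kiosk and refresh URL parameters, and it assumes a Firefox build that accepts the --kiosk flag (older builds need an add-on instead).

```python
from typing import List
from urllib.parse import urlencode


def kiosk_url(base_url: str, dashboard_path: str, refresh: str = "1m") -> str:
    """Return a Grafana URL that auto-refreshes and hides the menus."""
    query = urlencode({"refresh": refresh})
    # A bare `kiosk` parameter asks Grafana to hide its own navigation.
    return f"{base_url.rstrip('/')}/{dashboard_path.lstrip('/')}?{query}&kiosk"


def firefox_command(url: str) -> List[str]:
    """Command line to open the URL full screen (assumes --kiosk support)."""
    return ["firefox", "--kiosk", url]


if __name__ == "__main__":
    # Hypothetical host name and dashboard path, for illustration only.
    url = kiosk_url("http://grafana.local:3000", "/d/abc123/noc-overview")
    print(" ".join(firefox_command(url)))
```

The printed command can go into whatever autostart mechanism the Pi's desktop uses, so the screen comes back on its own after a power cut.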
I run 3 more of these screens around the ICT Department for different people. The Director, Admin Manager and Helpdesk each have a screen showing different content. While my screens are focused on infrastructure information theirs are focused on our Helpdesk system.
While 90% of my network monitoring is handled by Grafana and Zabbix, the remaining 10% is handled by InfluxDB and PRTG. Both of these can be linked to Grafana as data sources, which allows me to add their data to my NOC screens.
These last ones are there to pull certain data from my Ubiquiti systems, at least until Ubiquiti have their own SNMP integration, which I hope is coming. You can read about how I got Ubiquiti data at the following post:
And now I get to most likely one of the most important parts of any monitoring system. The Alerts.
I have looked at a range of alerts, and despite my lack of love for their phones, I will admit that Android gives you some of the best monitoring notification and overview. However, I am someone who prefers to work on Apple devices… I will write another post about why that is. My main point is that Zabbix lacks any real mobile interface.
Before I got to the solution I currently use with Zabbix, I had tried the two most direct solutions. The first was email; however, I found it hard to tell an alert email from a normal one. Email alerts were hidden in plain sight, as the saying goes.
The second one that I reviewed was push notification programs like Pushover and Pushbullet. Both of these work, and I found that you can instantly see the difference between emails and notifications from your monitoring system.
Then I ran into a problem. Despite it doing what I needed… and despite my system being set up correctly, you still get alerts that you don’t actually have to action. A power brownout that causes an alert but instantly fixes itself. A server restarting due to an update, or another senior member of my team restarting a server. I don’t need to action any of these.
With push notification alerts like Pushover or Pushbullet I have to bring up Zabbix and see what has come back up and so on. I found it frustrating that I had to do this checking each morning. So I went looking for a better solution, and along came three options: VictorOps, OpsGenie and even PagerDuty. All of these options could be linked to Zabbix via scripting that would give me 2 great steps.
- Trigger an alert and send me a notification.
- If alert goes away (problem fixed) it sends a Resolved alert to me.
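The two steps above can be sketched in Python as the core of a Zabbix alert script. This is an illustration, not the script I actually run: it assumes the script is handed the value of Zabbix's {TRIGGER.STATUS} macro ("PROBLEM" or "OK"), and the payload field names are modelled on a VictorOps-style generic REST integration, so treat them as assumptions and check your provider's documentation. Posting the payload is left out.

```python
import json

# Zabbix's {TRIGGER.STATUS} macro expands to "PROBLEM" or "OK".
STATUS_TO_MESSAGE_TYPE = {"PROBLEM": "CRITICAL", "OK": "RECOVERY"}


def build_alert(trigger_status, trigger_id, host, message):
    """Map one Zabbix trigger event to an alert payload (sketch only)."""
    key = trigger_status.strip().upper()
    if key not in STATUS_TO_MESSAGE_TYPE:
        raise ValueError(f"unexpected trigger status: {trigger_status!r}")
    return {
        "message_type": STATUS_TO_MESSAGE_TYPE[key],
        # The same id is used for the problem and its recovery, so the
        # alerting service can close the open incident automatically.
        "entity_id": f"zabbix/{host}/{trigger_id}",
        "entity_display_name": f"{host}: trigger {trigger_id}",
        "state_message": message,
    }


if __name__ == "__main__":
    # Hypothetical host and trigger id, for illustration only.
    down = build_alert("PROBLEM", "13502", "Server 1", "Server 1 is unreachable")
    up = build_alert("OK", "13502", "Server 1", "Server 1 is back")
    print(json.dumps([down, up], indent=2))
```

The important design choice is the shared entity_id: because the recovery carries the same id as the original problem, a brownout that fixes itself shows up as already resolved instead of as something to chase.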
I tried all three and chose to go with VictorOps; to me its price was best and it gave the best options. All I have to do now is look at my phone each morning and see if I have any VictorOps alerts. If I do, I open their app and it tells me whether I have any ongoing issues. And, thankfully, it also shows those alerts on my watch when I am awake, letting me know everything that is happening on my network before the users know it is happening.
And this brings me to the one part of my monitoring that I am paying for. Alerting. To date I have yet to find a good method of alerting in the free space; if someone has a better way, please let me know. But, as I currently see it, paying $9 a month so that I can have peace of mind each morning that everything is fine is something I am more than happy to do.
If you have any further questions please feel free to ask and I will do my best to help. Or, if you have a better way to get my monitoring done please let me know… I never consider a project finished. Just in its current stage.