LDAP-Configured Nagios
From Nagios Wiki
Situation
In complex, paranoid-secure environments, we end up with failover/redundant Nagios servers submitting data to clustered databases for history. Often, firewalls inside the datacenter prevent us from testing all services from a single Nagios server (or pair). Because we have a number of sysadmins who are overworked and sometimes less than flawless at communicating precisely, we find that the config doesn't always match the existing hosts, nor the DNS and DHCP (static DHCP: IPs never change, but handing them out via DHCP has side benefits). In six months of actively recording root causes of errors in LDAP, we had two human-error issues: one of our own hacks locked the Primary (yeah, never run anything on the Primary but the Primary Replication Manager itself), and an openldap-2.2.13 was throttled by a GlassFish. So our situation is:
- Many Sysadmins (and hiring so many more)
- Configs spread across multiple services
- New servers would get IP, DNS config, services, but partial/no Nagios
- Nagios configs were often incorrect (Arunas, you'll someday cooperate)
Referring to the diagram above, we have a (green) LDAP Primary in a datacenter, a number of (green) LDAP secondaries (typically a pair in each datacenter), and a (blue) pair of Nagios servers at each location looking internally at the services (black) provided by the datacenter. Going forward, we also want to use some of the Nagios servers in one datacenter to monitor the externally accessible services of another datacenter.
Mission
We want to make it as easy as possible to configure a new server in DNS, DHCP (if used), and Nagios:
- at the same time
- with the same data values
- in a single LDAP object, atomically updated (for ACID properties, since the configs are inter-related and dependent)
We want a pair of Nagios servers that is easily configured identically
We want to federate our config so that a lost Nagios server is quick to replace
Other Options
We looked at keeping our config in SVN/CVS, but that still leaves us with sysadmins forgetting to configure Nagios. Plus, when there are people for whom "human error" reaches levels rivaled by the criminally stupid, relying on consistent procedure is no longer possible.
Besides, I would prefer to simply make the easier way the right way, so that marching to a different drum takes more effort than asking and communicating. If the data's there, the config is there; even anarchists get lazy and cooperate at times.
To keep the config in SVN, I had a variant of the commit hook that checked the config files and rsynced them out, but even when human error doesn't enter the situation, you still have ignorance and a huge security hole in rsyncing your monitoring configs around. Plus, the configs aren't exactly the same on every server, and there's no way to "select the config for this host" (without getting complex).
When Nagios has a "host" entry that doesn't match a DNS entry, then resolving an error becomes (often) a case of finding the errant machine first, confirming the IP address, checking ownership for change-control, then finally connecting to see the error.
We looked at configuring from MySQL, either generating the config with a cheesy script or writing a converter, but MySQL is more difficult to cluster than LDAP and has fewer options for doing so (we use both syncrepl and slurpd pushes, depending on security).
Current Solution
The current schema allows me to express a Nagios host using the DNS config from nis.schema (ipHost, ipHostNumber, name), and services similarly. There are some issues I'm trying to work out, and I can detail them for others' input if that's the best way forward. This current work is based on a 4-day obsession started 2008-04-12, so it will show some gaps. As well, I'd like to reduce the repetition of the xdata/xoddefault.c and base/config.c files if that's acceptable.
Let me paste a graphviz to show the datamodel:
http://www.chickenandporn.com/r/nagios-ldap-dm1000.png
That's a bit complex; graphviz does an OK job of rendering it, and I've added some colouring to ease the tracing of references. In short, the object data that an xdata file read in by a "cfg_file=filename" directive would normally supply comes from an LDAP zone instead. The regular files are still read in, but the "ldap_server" directives trigger the LDAP client:
    ...
    cfg_file=hostgroups.cfg
    resource_file=resource.cfg
    ldap_server=ldap://ldap1.west.example.com/dc=example,dc=com
    ldap_server=ldap://ldap2.west.example.com:3890/dc=example,dc=com
    status_file=status.dat
    ...
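To illustrate what each "ldap_server" directive carries, here is a minimal Python sketch (the patch itself is C; this is not its code). It assumes the value is an ordinary LDAP URL whose path is the search base and that a missing port means 389; the function name parse_ldap_servers is made up for this example.

```python
from urllib.parse import urlsplit

def parse_ldap_servers(config_text):
    """Return (host, port, base_dn) for each ldap_server= line, in file order."""
    servers = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() != "ldap_server":
            continue  # every other directive is left to the normal parser
        url = urlsplit(value.strip())
        # Assumption: no explicit port means the standard LDAP port, 389.
        servers.append((url.hostname, url.port or 389, url.path.lstrip("/")))
    return servers

EXAMPLE = """\
cfg_file=hostgroups.cfg
resource_file=resource.cfg
ldap_server=ldap://ldap1.west.example.com/dc=example,dc=com
ldap_server=ldap://ldap2.west.example.com:3890/dc=example,dc=com
status_file=status.dat
"""

servers = parse_ldap_servers(EXAMPLE)
for host, port, base_dn in servers:
    print(host, port, base_dn)
```

File order is preserved on purpose, since the servers are later tried in the order they are listed.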
When the main config files are finished, the LDAP servers are tried in order until one gives a response (this assumes that the servers are replicas holding identical data, so the first hit should work rather than hitting all of them and aggregating the results). This allows base/config.c to read in the LDAP-based common data. The LDAP object searched for is a "monitorGroup" whose "monitorFQDN" matches the result of gethostname():
    dn: cn=Private,ou=West,dc=example,dc=com
    cn: Private
    objectClass: monitorGroup
    monitorFQDN: nagios1.west.example.com.
    monitorFQDN: nagios2.west.example.com.
    nagios-user: nagios
    nagios-group: nagios
    check-external-commands: FALSE
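The "first server that responds" lookup could be sketched roughly as follows. This is an illustration in Python, not the C code of the patch: the search callable, fake_search, and both function names are hypothetical stand-ins for a real LDAP client, and the trailing dot appended to the hostname is an assumption taken from the example entry above. The hostname is not escaped for filter metacharacters here.

```python
import socket

def monitor_group_filter(fqdn):
    """Build the search filter for this server's monitorGroup objects."""
    # Assumption: stored monitorFQDN values end with a dot, as in the example.
    if not fqdn.endswith("."):
        fqdn += "."
    return "(&(objectClass=monitorGroup)(monitorFQDN=%s))" % fqdn

def first_responding(servers, search, ldap_filter):
    """Try each server in order and return (server, entries) from the first
    one that answers. `search` is a hypothetical callable that raises OSError
    when a server cannot be reached."""
    last_error = None
    for server in servers:
        try:
            return server, search(server, ldap_filter)
        except OSError as exc:
            last_error = exc  # remember why, then fall through to the next replica
    raise ConnectionError("no LDAP server responded") from last_error

def fake_search(server, ldap_filter):
    # Stand-in for a real LDAP query: pretend the first server is down.
    if server == "ldap1.west.example.com":
        raise OSError("connection refused")
    return [{"dn": "cn=Private,ou=West,dc=example,dc=com"}]

used, entries = first_responding(
    ["ldap1.west.example.com", "ldap2.west.example.com"],
    fake_search,
    monitor_group_filter(socket.gethostname()),
)
```

An empty result from a reachable server still counts as a response in this sketch, which matches the assumption that all replicas hold identical data.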
Don't worry about the "cn=Private": that's just a text string here, with no special meaning. In our implementation, we have public and private interfaces (i.e. a server on 10.1.1.7 might service requests outside the firewall at 89.146.11.7 via NetScreens). We check both public and private services for failures, which also confirms the NetScreens' config. A remote datacenter checks its own private services, and another datacenter's public services. Because these two *.west.example.com servers would also check the other DC's public side, we have a matching:
    dn: cn=Public,ou=East,dc=example,dc=com
    cn: Public
    objectClass: monitorGroup
    monitorFQDN: nagios1.west.example.com.
    monitorFQDN: nagios2.west.example.com.
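As a rough illustration of how one Nagios server ends up with several groups, the sketch below works on plain Python dicts rather than a live directory. It assumes hosts and services point at their group by DN through "monitorGroup" or "monitorMemberOf" (as in the sample objects on this page) and compares DNs by simple case-insensitive string match, which is cruder than real DN matching. The function names are invented for the example.

```python
def groups_for(entries, fqdn):
    """DNs of every monitorGroup entry that lists fqdn in monitorFQDN."""
    return [e["dn"] for e in entries
            if "monitorGroup" in e.get("objectClass", [])
            and fqdn in e.get("monitorFQDN", [])]

def members_of(entries, group_dns):
    """Entries whose monitorGroup/monitorMemberOf names one of group_dns."""
    wanted = {dn.lower() for dn in group_dns}
    picked = []
    for e in entries:
        refs = e.get("monitorGroup", []) + e.get("monitorMemberOf", [])
        if any(ref.lower() in wanted for ref in refs):
            picked.append(e)
    return picked

WEST = ["nagios1.west.example.com.", "nagios2.west.example.com."]
ENTRIES = [
    {"dn": "cn=Private,ou=West,dc=example,dc=com",
     "objectClass": ["monitorGroup"], "monitorFQDN": WEST},
    {"dn": "cn=Public,ou=East,dc=example,dc=com",
     "objectClass": ["monitorGroup"], "monitorFQDN": WEST},
    {"dn": "cn=dns1,ou=West,dc=example,dc=com",
     "objectClass": ["monitoredHost", "ipHost"],
     "monitorGroup": ["cn=Private,ou=West,dc=example,dc=com"]},
]

my_groups = groups_for(ENTRIES, "nagios1.west.example.com.")
my_objects = members_of(ENTRIES, my_groups)
```

With this data, nagios1.west matches both its own Private group and the East datacenter's Public group, and picks up every object that references either one.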
Host objects look like the following. Note the shared artifacts with the other schemas, to allow bind-sdb and ldap-dhcp to read their configs from the same objects:
    dn: cn=dns1,ou=West,dc=example,dc=com
    cn: dns1
    nagios-use: basic-server
    objectClass: monitoredHost
    objectClass: dNSZone
    objectClass: ipHost
    objectClass: top
    monitorGroup: cn=Private,ou=West,dc=example,dc=com
    ipHostNumber: 192.168.12.100
    macAddress: 00:30:48:23:c1:fe
    zoneName: west.example.com
    relativeDomainName: dns1
Note that the macAddress and ipHostNumber will need a bit of (to be done) magic to convert to a "dhcpAddress: 192.168.12.100" and "dhcpHWAddress: ethernet 00:30:48:23:c1:fe", but placing everything in a single LDAP record leaves the sysadmin editing it very few opportunities for human error. Our SAs are usually working under the gun, pressured for results by project managers, and juggling a bazillion things at once towards a solution, so (based on combat stress) I can see the smartest dude making the silliest errors. I hope to reduce that chance.
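The conversion itself is still to be done, but it could be as small as the following Python sketch. It only derives the two values quoted above ("dhcpAddress" and "dhcpHWAddress", with the "ethernet" prefix) from the first ipHostNumber and macAddress of a host entry held as a dict; the function name and the validation rules are assumptions made for this example.

```python
import ipaddress
import re

MAC_RE = re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}")

def dhcp_attributes(host):
    """Derive the DHCP-side values from a host entry's shared attributes."""
    ip = host["ipHostNumber"][0]
    mac = host["macAddress"][0]
    ipaddress.ip_address(ip)  # raises ValueError on a malformed address
    if not MAC_RE.fullmatch(mac):
        raise ValueError("not a colon-separated MAC address: %r" % mac)
    return {
        "dhcpAddress": [ip],
        "dhcpHWAddress": ["ethernet " + mac.lower()],
    }

dns1 = {
    "cn": ["dns1"],
    "ipHostNumber": ["192.168.12.100"],
    "macAddress": ["00:30:48:23:c1:fe"],
}
derived = dhcp_attributes(dns1)
```

Rejecting malformed values at this step keeps a typo in the one shared record from silently propagating into the DHCP config.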
Services are similarly-registered:
    dn: cn=ping-dns1,ou=West,dc=example,dc=com
    cn: ping-dns1
    host-name: dns1
    nagios-use: local-service
    objectClass: monitoredService
    objectClass: top
    description: PING
    check-command: check-host-alive
    monitorMemberOf: cn=Public,ou=West,dc=example,dc=com
This is also copied directly from my AutoTest self-test; "monitorMemberOf" is a synonym of the "monitorGroup" attribute. The "host-name" matches the host above, and registers the service as we'd expect.
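To make the correspondence with ordinary object definitions concrete, the Python sketch below prints the flat-file equivalent of the host and service entries above. The patch loads the objects into Nagios directly rather than generating files, so this is only an illustration; the attribute-to-directive mapping (for instance "host-name" to host_name, "description" to service_description) is my reading of the samples, not something taken from the schema.

```python
# Assumed mapping from LDAP attribute to Nagios directive, inferred from the samples.
HOST_MAP = [("nagios-use", "use"), ("cn", "host_name"), ("ipHostNumber", "address")]
SERVICE_MAP = [("nagios-use", "use"), ("host-name", "host_name"),
               ("description", "service_description"),
               ("check-command", "check_command")]

def render(kind, entry, mapping):
    """Render one entry as a Nagios 'define <kind> { ... }' block."""
    lines = ["define %s {" % kind]
    for ldap_attr, directive in mapping:
        if ldap_attr in entry:
            # Only the first value is used; multi-valued attributes are ignored here.
            lines.append("    %-24s%s" % (directive, entry[ldap_attr][0]))
    lines.append("}")
    return "\n".join(lines)

host = {"cn": ["dns1"], "nagios-use": ["basic-server"],
        "ipHostNumber": ["192.168.12.100"]}
service = {"cn": ["ping-dns1"], "host-name": ["dns1"],
           "nagios-use": ["local-service"], "description": ["PING"],
           "check-command": ["check-host-alive"]}

host_cfg = render("host", host, HOST_MAP)
service_cfg = render("service", service, SERVICE_MAP)
print(host_cfg)
print(service_cfg)
```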
Another benefit is that when we bring another datacenter online, we can "slapcat" the current database, copy out the Nagios stuff, edit it as a text file, and "ldapadd" the whole chunk at once. LDAP's self-defense schema checking will reject objects that violate the schema, and we can apply referential-integrity-checking overlays and uniqueness-checking overlays to avoid making errors during this process.
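The "copy out the Nagios stuff" step could look something like this Python sketch, which keeps only entries carrying one of the three monitoring object classes and drops the operational attributes that slapcat includes and ldapadd would refuse. It is deliberately naive: folded (continuation) lines are not handled, and the function name is invented for the example.

```python
import re

NAGIOS_CLASSES = {"monitorgroup", "monitoredhost", "monitoredservice"}
# Operational attributes written by slapcat that cannot be re-added as user data.
OPERATIONAL = {"structuralobjectclass", "entryuuid", "entrycsn", "creatorsname",
               "createtimestamp", "modifiersname", "modifytimestamp"}

def nagios_entries(ldif_text):
    """Return LDIF containing only the monitoring entries, ready for editing."""
    kept = []
    for block in re.split(r"\n\s*\n", ldif_text.strip()):
        pairs = [line.split(": ", 1) for line in block.splitlines()
                 if ": " in line and not line.startswith("#")]
        classes = {v.lower() for a, v in pairs if a.lower() == "objectclass"}
        if classes & NAGIOS_CLASSES:
            kept.append("\n".join("%s: %s" % (a, v) for a, v in pairs
                                  if a.lower() not in OPERATIONAL))
    return "\n\n".join(kept) + "\n"

DUMP = """\
dn: ou=People,dc=example,dc=com
objectClass: organizationalUnit
ou: People

dn: cn=dns1,ou=West,dc=example,dc=com
cn: dns1
objectClass: monitoredHost
objectClass: ipHost
ipHostNumber: 192.168.12.100
entryUUID: 00000000-0000-0000-0000-000000000000
"""

subset = nagios_entries(DUMP)
print(subset)
```

The result can then be edited for the new datacenter (new ou, addresses, and so on) before it is loaded with ldapadd.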
Next Steps
I'm no longer with this company; I wish them well, there are a few very smart people there who balance the occasional Speedwobble in the datacenters. This company was doing a lot of cool number-crunching, and would have benefitted too from Apache config in the LDAP and similar. I still think along the same lines, and would love to see this implemented.
- Currently, it handles Services, Hosts, and Monitoring Nagios Servers, but no hostgroups or servicegroups.
- Wildcards work in really funky ways.
- Benoit -- who looked at this before -- has offered some test data based on his work in GoSA.
- Sebastien at ZenSolucion has shown interest; I may have to build some Ubuntu. Sebastien's got the largest set of interoperating schemas that I've seen.
I'm looking for a contact to consider this work for merge. It has been produced in off-work time, so it bears no IP encumbrance, and submission back to Nagios is the best way to share and improve it. If instead I need to keep a running patch, well, I'll produce the RPMs :)
My work tends to be on http://tech.chickenandporn.com/tags/nagios (reloaded 2009-05-25)





