FreeNAS collects metrics on itself using collectd, a nice program which does nothing but gather metrics, and gathers them well. FreeNAS gathers basic metrics (CPU, disk performance, disk space, network interfaces, memory, processes, swap, uptime, and ZFS stats) and logs them to RRD databases which can be accessed via the Reporting tab. However, as nice as that is, I much prefer the Graphite TSDB (time-series database) for storing and displaying metrics.
Previously, I was editing collectd.conf directly, but since that file is dynamically generated, I had to add the same block of configuration every time it was regenerated. So I moved my additions into files stored on my zpool, and I use an Include directive at the end of the native collectd.conf to pull those files in. At this point, all I add to the native collectd.conf is this line:
Include "/mnt/sto/config/collectd/*.conf"
This makes my edits really easy, and allows me to create a script to check for it and fix it if necessary - more on that later.
In the /mnt/sto/config/collectd/ directory, I have several files - graphite.conf, hostname.conf, ntpd.conf, and ping.conf.
The graphite.conf loads and defines the write_graphite plugin:
LoadPlugin write_graphite
<Plugin "write_graphite">
<Node "graphite">
Host "graphite.example.net"
Port "2003"
Protocol "tcp"
LogSendErrors true
Prefix "servers."
Postfix ""
StoreRates true
AlwaysAppendDS false
EscapeCharacter "_"
</Node>
</Plugin>
It's worth mentioning that some of the other TSDBs out there accept Graphite's native plain-text format, so this could be used with them just as well. Or, if you had another collectd host, you could use collectd's "network" plugin to send the metrics to it instead.
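For reference, the plain-text format is one line per data point: metric path, value, and Unix timestamp. Here is a minimal sketch of building such a line by hand, which can be handy for checking that the Graphite host is reachable before pointing collectd at it. The helper name graphite_line and the metric path servers.nas.test.manual are made up for illustration; the host and port are the ones from graphite.conf.

```shell
#!/bin/sh
# Build one line in Graphite's plain-text format: "<path> <value> <timestamp>".
# graphite_line is a hypothetical helper name, not part of collectd or Graphite.
graphite_line() {
    path=$1
    value=$2
    # Default to the current time if no timestamp is given
    ts=${3:-$(date +%s)}
    printf '%s %s %s\n' "$path" "$value" "$ts"
}

# Example (assumes nc is installed and the Graphite host accepts TCP on 2003):
#   graphite_line servers.nas.test.manual 1 | nc -w 2 graphite.example.net 2003
```

If the test metric shows up in Graphite under servers.nas.test, the network path and the listener are fine, and any remaining problem is likely on the collectd side.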
The hostname.conf redefines the hostname. The native collectd.conf uses "localhost", and that does no good when logging to a graphite server which is receiving metrics from many hosts, so I force it to the hostname of my FreeNAS system:
Hostname "nas"
In order for this to not break the Reporting tab in FreeNAS (not that I use it anymore, now that the metrics are in Graphite), I first need to move the local RRD databases to my zpool by checking "Reporting Database" under "System Dataset" in the "System" tab.
I then go to the RRD directory, rename "localhost" to "nas", and create a symlink named "localhost" that points to "nas":
lrwxr-xr-x 1 root wheel 3 May 19 2015 localhost -> nas
drwxr-xr-x 83 root wheel 83 Dec 20 10:23 nas
This way, redefining the hostname in collectd causes the RRD data to be written to the "nas" directory, but when the GUI looks for the "localhost" directory, it still finds what it's looking for and displays the metrics properly.
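The rename-and-symlink step can be sketched as a small shell function. This is only an illustration under stated assumptions: relink_rrd_dir is a hypothetical helper name, and the RRD base directory is something you have to supply yourself, since its location depends on where the system dataset lives. It is probably wise to stop collectd while the directory is being moved.

```shell
#!/bin/sh
# Rename collectd's "localhost" RRD directory and leave a symlink behind.
# relink_rrd_dir is a hypothetical helper; both arguments are assumptions
# you must supply: the RRD base directory and the new hostname.
relink_rrd_dir() {
    base=$1
    name=$2
    # Nothing to do if localhost is already a symlink (already converted)
    if [ -L "$base/localhost" ]; then
        return 0
    fi
    # Refuse to clobber an existing target directory
    if [ -e "$base/$name" ]; then
        echo "ERROR: $base/$name already exists" >&2
        return 1
    fi
    mv "$base/localhost" "$base/$name" || return 1
    # Relative link, so it still resolves if the dataset is mounted elsewhere
    ln -s "$name" "$base/localhost"
}

# Example (the path is a placeholder, not the real FreeNAS location):
#   relink_rrd_dir /path/to/rrd nas
```

Running it a second time is harmless, because the function returns early once localhost is a symlink.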
The ntpd.conf enables the ntpd plugin, which collects time offsets from the local NTP daemon. I use these on my Icinga2 monitoring host to keep an eye on clock drift on my FreeNAS box:
LoadPlugin ntpd
<Plugin "ntpd">
Host "localhost"
Port 123
ReverseLookups false
</Plugin>
Finally, ping.conf calls the Exec plugin to echo a value of "1" all the time:
LoadPlugin "exec"
<Plugin "exec">
Exec "nobody:nobody" "/bin/echo" "PUTVAL nas/collectd/ping N:1"
</Plugin>
I use this on my Icinga2 server to check the health of the collectd data, and have a dependency on this check for all the other Graphite-based checks. This way, if collectd breaks, I get alerted on collectd being broken - the actual problem. This prevents a flurry of alerts on all the things I'm checking from Graphite, which makes deciphering the actual problem more difficult.
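As a rough sketch of what such a health check could look like (this is not the actual check from my Icinga2 host), the function below reads one line of Graphite render-API output in "raw" format and exits with the usual Nagios/Icinga plugin codes. The name check_ping_series is hypothetical, and the metric path servers.nas.collectd.ping and the web URL in the usage comment are assumptions based on the Prefix and Hostname settings above.

```shell
#!/bin/sh
# Decide whether a single Graphite series has any recent data.
# Reads render-API "raw" output for one series on stdin:
#   <target>,<start>,<end>,<step>|<v1>,<v2>,...
# Missing data points appear as "None" in this format.
# Exit codes follow the Nagios/Icinga convention: 0 = OK, 2 = CRITICAL.
# check_ping_series is a hypothetical name used for illustration.
check_ping_series() {
    line=$(cat)
    # Keep only the value list after the "|" separator
    case "$line" in
        *"|"*) values=${line#*|} ;;
        *)     values="" ;;
    esac
    # Any digit in the value list means at least one real data point
    case "$values" in
        *[0-9]*)
            echo "OK: collectd is reporting"
            return 0
            ;;
        *)
            echo "CRITICAL: no recent collectd ping data"
            return 2
            ;;
    esac
}

# Example (URL and metric path are assumptions; adjust to your Graphite setup):
#   curl -s 'http://graphite.example.net/render?target=servers.nas.collectd.ping&from=-5min&format=raw' \
#       | check_ping_series
```

Treating an empty response the same as an all-None series keeps the check simple: either way, nothing has arrived from collectd recently.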
So, I define the Graphite writer, I change the hostname so the metrics show up on the Graphite host with the proper servers.nas.* path, and I add two more groups of metrics to the default configuration. These configuration files are stored on my zpool, so even if my FreeNAS boot drive craps out (which actually happened last week) and I have to reload the OS from scratch, I don't lose these files.
Since I'm only adding one line to the bottom of the collectd.conf file, it becomes very easy to check for my addition and, if necessary, add it back. I have a short script which I run via cron (the "Tasks" tab in the FreeNAS GUI):
#!/bin/bash
# Set the file path and the line I want to add
conf=/etc/local/collectd.conf
inc='Include "/mnt/sto/config/collectd/*.conf"'
# Fail if I'm not running as root
if (( EUID ))
then
    echo "ERROR: Must be run as root. Exiting." >&2
    exit 1
fi
# Check to see if my exact Include line is in the config file
# (-F matches it as a fixed string, so the * and . are not treated as a regex)
if grep -qF -- "$inc" "$conf"
then
    : All good, exit quietly.
else
    : Missing the include line! Add it!
    echo "$inc" >> "$conf"
    service collectd restart
    logger -p user.warn -t "collectd" \
        "Added Include line to collectd.conf and restarted."
    echo "Added include to collectd.conf" | \
        mail -s "Collectd fixed on NAS" mymyselfandi@example.com
fi
If I reboot my FreeNAS system, the collectd.conf gets reverted. That's not a huge problem if I can tolerate waiting up to 30 minutes for my cron job to run, but in 9.3 I can do even better: I can call the script at boot time as a postinit script from the Init/Shutdown Scripts section of "Tasks".
This way, when I boot the system, it runs the check script, which sees the missing Include line, adds it automatically, and restarts collectd so it resumes logging to my Graphite server.
This setup has proven to be wonderfully reliable, and unless/until native Graphite support is added to FreeNAS, should keep on working.