Guidelines for writing checks for Check_MK Cheat Sheet

Naming

Check type names should be short and unique. They must consist only of lowercase letters, digits and underscores, and must begin with a lowercase letter.
Checks where one item represents one thing (e.g. a fan or a power supply) should be named in the singular, e.g. casa_fan, if, oracle_tablespace. Checks where each item counts a quantity, e.g. the number of logins, should be named in the plural (e.g. user_logins, printer_pages). Note: for historical reasons many existing check types violate this rule. That does not mean that new checks should be named inconsistently as well!
Vendor-specific checks must be prefixed with a unique vendor abbreviation (which you choose yourself). Example: fsc_ for Fujitsu Siemens Computers.
Product-specific checks must be prefixed with a product abbreviation, for example steelhead_status for a Riverbed Steelhead appliance.
SNMP-based checks: if the check makes use of a standardized MIB which is or might be implemented by more than one vendor, then the check should be named after the MIB, not after the vendor. The hr_* checks are an example.
Service descriptions of different check types that fundamentally do the same thing must be identical (e.g. if / if64 / ifoperstatus). Reason: this makes rules in main.mk simpler for the user!
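
The character rule for check type names can be verified mechanically. The following is a minimal sketch; the helper name is_valid_check_name is hypothetical and not part of Check_MK, and it is meant as a standalone script, not as code inside a check file (where re must not be used directly).

```python
import re

# A lowercase letter first, then lowercase letters, digits or underscores.
_CHECK_NAME = re.compile(r"[a-z][a-z0-9_]*")


def is_valid_check_name(name):
    """Return True if name follows the character rule for check types."""
    return _CHECK_NAME.fullmatch(name) is not None


for candidate in ("oracle_tablespace", "if64", "Casa_Fan", "3com_status"):
    print(candidate, is_valid_check_name(candidate))
```

The pattern only covers the allowed characters; whether a name is unique or correctly singular or plural still needs a human reviewer.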

Configuration variables

Configuration variables for main.mk should be named after the check if they are used only by that check. This does not apply to variables that are used by several checks (e.g. filesystem_default_levels is used by df, hr_fs, df_netapp, ...).
The variable that holds the check's default parameters and is entered in the inventory function must be named CHECKTYPE_default_levels (unless it is used by more than one check, see above). Example: the check foo_bar has the configuration variable foo_bar_default_levels.
If a check does not use check parameters, the inventory function must return None as the parameter and the check function must name the parameter argument _no_params.
The names of the inventory and check functions must be named after the check type, for example inventory_h3c_lanswitch_cpu for the check h3c_lanswitch.
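
A minimal sketch of these naming rules follows. The checks foo_bar and foo_bar_status, their agent data and the output texts are made up for illustration. Passing the name of the default-levels variable as a string from the inventory function is an assumption about how Check_MK resolves the variable later.

```python
# Default levels (warn, crit), named after the hypothetical check foo_bar.
foo_bar_default_levels = (80.0, 90.0)


def inventory_foo_bar(info):
    # One service per agent line. The parameter is the NAME of the
    # default-levels variable (assumed convention, see the lead-in).
    return [(line[0], "foo_bar_default_levels") for line in info]


def check_foo_bar(item, params, info):
    warn, crit = params
    for name, value in info:
        if name == item:
            text = "Usage: %.1f%% (levels at %.1f%%/%.1f%%)"
            return (0, text % (float(value), warn, crit))


# A check without parameters: the inventory function returns None as the
# parameter and the check function names the argument _no_params.
def inventory_foo_bar_status(info):
    return [(line[0], None) for line in info]


def check_foo_bar_status(item, _no_params, info):
    for name, status in info:
        if name == item:
            return (0, "Status: %s" % status)
```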

Plugin output

Each check returns one line of text: the plugin output (sometimes called check output). To keep things uniform, the output must be formatted according to the following rules:
When returning measurement values, place exactly one space between the value and the unit (e.g. 17.3 V). The only exception: put no space before a percent sign (e.g. 89.4%).
When returning measurement values, write the name of each quantity with a capital first letter, then add the value after a colon. Examples: Voltage: 24.5 V, Phase: negative, Flux-Capacitor: operational
Do not directly use return codes or cryptic return strings internal to the device. Instead, translate them into human-readable messages. Example: instead of routeMonitorFail use route monitor has failed
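
The formatting rules above can be sketched as a small helper. Both format_value and the STATUS_TEXT mapping are hypothetical names used only for this illustration; Check_MK does not provide them.

```python
def format_value(name, value, unit=""):
    """Build 'Name: value unit', with no space before a percent sign."""
    if unit == "%":
        return "%s: %s%%" % (name, value)
    if unit:
        return "%s: %s %s" % (name, value, unit)
    return "%s: %s" % (name, value)


# Hypothetical mapping from device-internal strings to readable text.
STATUS_TEXT = {"routeMonitorFail": "route monitor has failed"}

print(format_value("Voltage", 24.5, "V"))
print(format_value("Usage", 89.4, "%"))
print(format_value("Phase", "negative"))
print(STATUS_TEXT.get("routeMonitorFail", "unknown status"))
```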

Performance data

Format of Performance data
Always send int or float data as performance data. Do not attach a unit. Write temp instead of "%0.2fC" % temp!
If you need to omit fields in the middle of the data list (e.g. warn or crit), put None in their place, for example [("usage", usage, None, None, 0, size)]
If you need to omit fields at the end, simply leave them out. Do not add trailing Nones.
Naming of performance data variables: names consist only of lowercase letters and (rarely) underscores. Trailing digits are also allowed (e.g. phase3).
Naming of performance data variables: name the variable after the thing being measured, not after the unit. Example: use current instead of ampere. Use size instead of bytes.
Always use the canonical unit: send bytes, not KB, MB or GB. Send Celsius, not Fahrenheit. Send bits/sec, not MBits/sec. Scaling the values usefully is the task of the graphing tool.
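
As an illustration of the format rules above, here is a minimal sketch. The two helper functions are hypothetical, and the field order (name, value, warn, crit, min, max) is inferred from the example in the text.

```python
def build_usage_perfdata(usage, size, warn=None, crit=None):
    """Perfdata for one variable named after the thing ('usage').

    Values are plain numbers in the canonical unit (bytes), without a
    unit suffix. Missing warn/crit stay as None because min and max
    follow them.
    """
    return [("usage", usage, warn, crit, 0, size)]


def build_temp_perfdata(temp):
    # Nothing follows the value, so the trailing fields are simply
    # omitted instead of being padded with None.
    return [("temp", temp)]


print(build_usage_perfdata(512 * 1024, 1024 * 1024))
print(build_temp_perfdata(41.5))
```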
Performance data flag
Only set "has_perfdata" to True in check_info if the check really produces performance data output.
PNP Graph definition
Each check returning performance data must have a dedicated PNP graph definition in pnp-templates. If the check has warning and critical levels, the graph must display these levels as yellow and red lines.
PNP graphs should always use the consolidation function MAX (there are some rare exceptions where only MIN makes sense).
However, the average value printed in the labelling of the graph must use the consolidation function AVERAGE. Using MAX there would compute the average of the maximum values, which is useless.
RRA definition
Each check returning performance data must also have an RRA definition specifying which of MAX, MIN and AVERAGE are needed to display the graph in its current (and maybe future) forms. These definitions are in pnp-rraconf. Use a symlink here.
Perf-O-Meter
Each check returning performance data should have a Perf-O-Meter. For checks which are part of Check_MK the Perf-O-Meter must be defined in web/plugins/perfometer/check_mk.py. For third-party checks it should be defined in a separate file in web/plugins/perfometer.
SNMP based checks
Only use numeric OIDs in your checks. Name-based OIDs rely on MIB files, and the check won't work when the MIB files are not in place. Always have your OIDs start with a root, for example: .1.3.6.1.4.1

Simple memory checks

Many devices report memory usage in a simple way: used and total memory in absolute terms or, equivalently, used and free memory in absolute terms.
To ensure uniform behaviour, all these checks should use the check_memory function defined in memory.include.
The check group should be memory_simple. Note that this requires the check to have an item. For devices with no modules (i.e. only one memory value) the item should be the empty string.
The service description should be "Memory", or "Memory %s" for checks with nonempty items.
 

Check Layout

All checks must follow the layout specified below:
file header with GPL notice
name and email address of the author, if the check was contributed
example output as sent by the agent
default settings of configuration variables
helper functions and variables, if any are needed
the inventory function
the check function
the check_info declaration

Coding style: Add an author

If the check is contributed by a third party (i.e. not by the developers of Check_MK), the name and email address of the contributor should be added as a comment, right after the header.

Coding style: Readability, looks and indents

Avoid long lines. Ideally, your lines shouldn't exceed 100 chars.
Use four spaces to indent your code. Don't use tab chars! And if you really can't live without tabs, set the tab width to 8 spaces.

Coding style: File Header

For checks which are supposed to be part of the official Check_MK project, the file header with the copyright information must be present. It is created automatically if you call 'make headers' in the main source directory.

Coding style: Example agent output

Including example output of the agent is very helpful for understanding how the check parser works.
TCP agent based checks must include an example of the agent output. If the agent output can have different formats or output styles, include an example for each style the check supports (e.g. the output of multipath -l changed its layout between SLES 10 and SLES 11).
For SNMP based checks, include examples at least where the output is remarkable in some respect.

Coding style: Use of lambda functions

For parse_function, inventory_function and check_function, lambda functions are allowed only in order to reuse an existing function while providing some additional argument. Example:
"inventory_function" : lambda info: inventory_foobar_generic(info, "temperature")
It is not allowed to implement the function itself as a lambda expression. Example:
# This is bad, ugly and unreadable code!!
'check_function' : lambda _no_item, _no_params, info:
    (0, "Memory used: %s" % get_bytes_human_readable(int(info[0][0]))),
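
A named function is the readable alternative to the lambda in the bad example. This is only a sketch: check_foobar_mem and the check name foobar_mem are hypothetical, and get_bytes_human_readable is replaced by a simplified stand-in so that the snippet runs on its own.

```python
def get_bytes_human_readable(value):
    # Simplified stand-in for the Check_MK helper of the same name.
    return "%.2f MB" % (value / (1024.0 * 1024.0))


def check_foobar_mem(_no_item, _no_params, info):
    used = int(info[0][0])
    return (0, "Memory used: %s" % get_bytes_human_readable(used))


check_info = {}  # provided by Check_MK inside a real check file
check_info["foobar_mem"] = {
    "check_function": check_foobar_mem,
    "service_description": "Memory used",
}

print(check_foobar_mem(None, None, [["1048576"]]))
```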

Manpages

Each check must have a check man page. This should be:
complete
precise
terse
helpful!
Information that must be contained in the check description:
What exactly does the check do?
Under which circumstances does the check status change to WARN/CRIT?
Which devices are supported by the check?
Does the check require some configuration of the agent or a separate agent plugin? (Example: the logwatch check requires the agent plugin mk_logwatch to be installed.)

Service Descriptions

Checks doing the same thing should always have the same (consistent) service description. Examples:
CPU utilization services must be named CPU utilization.
Temperature services must begin with the word Temperature.
Services for main RAM usage should be named Memory used.
Services for fans should be named Fan or Fan %s.
Services for power supplies should be named Power Supply or Power Supply %s.
Service descriptions should be capitalized like English titles, e.g. "Source of Output"

Forbidden Things

Never use a global import statement in a check file
Do not use datetime for date/time parsing. Use time. It can do all you need, really!
Do not use any other modules except sys, os, time and socket.
If you need regular expressions, use the function regex(). Do not use re directly.
Neither the check function nor the inventory function may use the print command, or otherwise output any data to stdout or stderr, or communicate with the outside world in any other way. A rare exception to this are checks which need dedicated data storage (such as logwatch, which keeps unread log messages in files).
Never fetch SNMP data that is not actually used in the check or inventory function.

Temperature checks

The item name should reflect the kind of temperature being monitored. Please refer to the following table to make sure that the same kinds of temperatures get the same item.
Ambient: Built-in sensor measuring ambient air temperature
External: An external, freely placeable sensor connected to the device
System: System mainboard temperature
CPU: CPU temperature
To ensure that all temperature checks work in the same way, use the check_temperature function in temperature.include.
The check group should be temperature.
check_temperature can handle device levels and status in various ways, configurable in the temperature WATO rule. Do not pass both device status and device levels to check_temperature: if a device provides levels, pass those and not the status.
Some devices can output temperature in various units and specify which unit it is. In those cases, pass the temperature in the unit the device states, along with the unit as the dev_unit parameter to check_temperature.
Some devices have a very large number of similar temperature sensors, where one item per sensor would be unreasonable. (Dozens of ambient temperature sensors in a small device do not really provide more information than a single one.) In those cases, use the check_temperature_list function defined in temperature.include. Use the temperature check group just as you would for regular temperature checks.
 

Setting default values for configuration variables

Default values for check parameters (e.g. switch_cpu_default_levels) must be chosen so that they make sense for everybody, not just for your special case. In case you are unsure, choose levels that are too loose rather than too tight. This helps avoid false alarms.
If you set default values, add a short comment about how you came to choose them. If a value is merely a rough estimate, document that it is; if you got it from a specific source, document where you got it.

Reuse of configuration variables

If the same configuration variable is used in multiple checks, it must be set to a default value in all checks and the values must be identical!

Error handling

Your check should assume that the agent is always producing valid data. It should not try to handle cases where the agent output is broken. Reason: broken agent output is already handled by Check_MK via Python exceptions. Intercepting these exceptions in your check code makes debugging of broken outputs much more difficult.
Do not handle cases in the agent output for which you have no indication that they can actually happen.

int() vs. saveint() and float() vs. savefloat()

int() will throw an exception if the argument is not a valid number string (or if it is empty). Check_MK will catch the exception and set the check result to "UNKNOWN" with an appropriate error message.
saveint(), however, will assume 0 if the argument cannot be converted to a valid integer.
Use saveint() in all cases where you know or suspect that your device may supply invalid data, but the check should work with the rest of the data and produce useful results. Disadvantage: you may never find out that the device has supplied invalid data, because the check won't tell you!
Use int() in all other cases, e.g. if you want to be notified with an exception when the check has received invalid data from your device. In most cases this is what you want!
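
The difference can be seen in a small standalone sketch. The saveint defined here is a stand-in that mirrors the behaviour described above; it is not the Check_MK implementation itself.

```python
def saveint(value):
    # Stand-in: anything that is not a valid integer becomes 0.
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


print(saveint("42"))
print(saveint(""))
print(saveint("n/a"))

try:
    int("n/a")
except ValueError as error:
    # A real check would NOT catch this; Check_MK handles it and
    # turns the result into UNKNOWN.
    print("int() raised:", error)
```

Note how saveint("n/a") and a genuine reading of 0 become indistinguishable, which is exactly the disadvantage mentioned above.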

Interpretation of levels

Many checks have parameters defining warning and critical levels which are compared to an actual value. Please observe the following important rules and conventions if you are writing such checks.
Warning and critical levels should always be checked with >= and <=. Example: a check monitors the length of a mail queue. The critical upper level is at 100. This means that if the length is exactly 100, the check should already be critical. There might be a few exceptions where this wouldn't make sense.
If there are both upper and lower levels, the labelling should be: Warning at or above ___, Critical at or above ___, Warning at or below ___ and Critical at or below ___.
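
The inclusive comparison can be sketched as follows. The function state_from_levels is a hypothetical helper written for this illustration, not a Check_MK function.

```python
def state_from_levels(value, upper=None, lower=None):
    """Return 0 (OK), 1 (WARN) or 2 (CRIT).

    upper and lower are optional (warn, crit) pairs. A value that lies
    exactly on a level already triggers that level.
    """
    state = 0
    if upper is not None:
        warn, crit = upper
        if value >= crit:
            state = max(state, 2)
        elif value >= warn:
            state = max(state, 1)
    if lower is not None:
        warn, crit = lower
        if value <= crit:
            state = max(state, 2)
        elif value <= warn:
            state = max(state, 1)
    return state


# Mail queue example from the text: critical at or above 100.
print(state_from_levels(100, upper=(50, 100)))
```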

return versus yield

A check function producing several subresults (e.g. current usage and growth) must use the yield statement to return these results. A check generating exactly one result, on the other hand, must use return.
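
A minimal sketch of both styles follows. The check names, the growth_warn parameter and the agent data layout are made up for illustration.

```python
def check_foo_single(item, _no_params, info):
    # Exactly one result: use return.
    return (0, "Status: operational")


def check_foo_multi(item, params, info):
    # Several subresults: yield each one as (state, text).
    usage, growth = info[0]
    yield (0, "Usage: %s%%" % usage)
    growth_state = 1 if float(growth) >= params["growth_warn"] else 0
    yield (growth_state, "Growth: %s MB/day" % growth)


print(check_foo_single(None, None, []))
print(list(check_foo_multi(None, {"growth_warn": 10}, [["55", "12"]])))
```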

check_info[...] keys

Do not add keys here which are not used. The only mandatory keys are "service_description" and "check_function". Add "has_perfdata" and other keys with a boolean value only if the value is True.
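
As a sketch of this rule, compare two hypothetical checks: foo_status produces no performance data and therefore omits the flag, while foo_usage returns performance data and sets it. All names are illustrative.

```python
def check_foo_status(item, _no_params, info):
    return (0, "Status: operational")


def check_foo_usage(item, _no_params, info):
    usage = float(info[0][0])
    return (0, "Usage: %.1f%%" % usage, [("usage", usage)])


check_info = {}  # provided by Check_MK inside a real check file

# No performance data: only the mandatory keys, no "has_perfdata": False.
check_info["foo_status"] = {
    "check_function": check_foo_status,
    "service_description": "Foo Status",
}

# Performance data is produced, so the flag is present and True.
check_info["foo_usage"] = {
    "check_function": check_foo_usage,
    "service_description": "Foo Usage",
    "has_perfdata": True,
}
```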

Various

Here are some frequent errors and further mixed guidelines:
If your check is accompanied by an agent plugin, you should observe the following rules:
Put it into share/check_mk/agents for UNIX-like systems and make it executable (mode 755).
Put it into share/check_mk/agents/windows for Windows.
Do not add a file extension like .sh or .py.
For shell scripts, add #!/bin/sh in the first line. Use #!/bin/bash only if bash is really required.
Add the standard Check_MK file header with the GPL notice.
Make sure that the plugin does not do any harm even if installed on a system where the check in question is not relevant or does not work.
Make sure that the check manpage tells the user that the plugin is needed and which additional software needs to be installed in order to make it work.
The plugin must not output a section header if the tool or technology to be monitored does not exist on the system.
If the plugin needs a configuration file, expect it in $MK_CONFDIR and give it the same name as the plugin, but with the extension .cfg and with any mk_ prefix removed.
A check which does not get the information needed to decide whether or not it is OK must simply return None. This can be the case when a check with an item cannot find the data matching this item in the agent output or SNMP data. Another possible situation is when the data provided by the agent or SNMP is completely empty.
When a check returns None, Check_MK will produce an UNKNOWN state with an output which tells the user that this thing could not be found.
The state markers (!) and (!!) must only be used in checks which can go warning or critical for several different reasons, like sub-checks.
Your check must also work with Nagios as the core. If you use functions or variables from *.include files, then you must declare them in check_info in the key "includes" and you must then test your check with Nagios as the core.
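
The rule about returning None when the required information is missing can be sketched as follows. The check foo_fan and its agent data layout are hypothetical.

```python
def check_foo_fan(item, _no_params, info):
    for name, rpm in info:
        if name == item:
            return (0, "Speed: %d RPM" % int(rpm))
    # Item not found, or info completely empty: return None and let
    # Check_MK report UNKNOWN for this service.
    return None


print(check_foo_fan("Fan 1", None, [["Fan 1", "1200"]]))
print(check_foo_fan("Fan 2", None, [["Fan 1", "1200"]]))
```

The check does not invent a state or an error text of its own for the missing item; producing that message is left to Check_MK.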
 
