Packaging

We ended up with PyInstaller. It works fine and creates a fully standalone EXE file, which is just perfect.

The only trick is that it scans ".py" files for imports, so it did not pick up our probe implementations, which are not explicitly imported but dynamically loaded. Butcher mode on: we added a registry class which just references all probes (and does nothing else). With that, PyInstaller is happy.
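
As a minimal sketch of the registry trick (not our actual code): PyInstaller follows plain import statements, so a module that imports every probe is enough to get them bundled, while the daemon keeps loading them by name. The standard library modules json and csv stand in for probe modules here, and ProbeRegistry and load_probe are hypothetical names.

```python
import importlib

# Explicit imports: PyInstaller's static analysis follows these, so the
# modules end up in the bundle. json and csv are stand-ins for real probes.
import csv
import json


class ProbeRegistry(object):
    """References every probe module and does nothing else."""

    PROBES = (json, csv)


def load_probe(module_name):
    """Load a probe by name at runtime, as configuration-driven code would."""
    return importlib.import_module(module_name)


if __name__ == "__main__":
    probe = load_probe("json")
    # The dynamically loaded module is the one the registry references.
    print(probe is ProbeRegistry.PROBES[0])
```

PyInstaller also has a --hidden-import option for the same purpose; the registry simply keeps the list in code instead of in the build command.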

Checked!

Installer

OK, I don't like Windows installers.

We tested several tools, which took two days (long...), and we selected the WiX Toolset.

We discovered at the end that the MSI file needs to be signed with a code signing certificate (my god, why so much pain?) to avoid some warnings during installation. You have to buy the certificate, and to buy it you have to provide official documents. Well, this is ongoing and will come later, with low priority.

Almost checked!

WMI and Win32

Before going ahead with our code base, we still have two blockers.

We need WMI access for the metrics and (maybe) Win32 API access.

The Win32 APIs should be handled by pypiwin32.

A Python package, WMI, seems to offer WMI access. We need to ensure it works fine and that we have the required metrics for our operating-system-related probes.

The WMI package worked fine, but fetching full WMI objects was sometimes slow (we ended up using almost only WQL queries). We will go async (as usual in these cases), with some tricks for fast refreshes. Good.
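
To illustrate the difference, a WQL query can name just the properties a probe needs instead of pulling whole objects. This is a rough sketch, assuming the WMI package's WMI().query() call; build_wql and fetch_rows are hypothetical helpers, and the class and property names are only examples.

```python
def build_wql(wmi_class, fields, where=None):
    """Build a WQL SELECT limited to the listed properties."""
    query = "SELECT {0} FROM {1}".format(", ".join(fields), wmi_class)
    if where:
        query += " WHERE " + where
    return query


def fetch_rows(wmi_class, fields, where=None):
    """Run the query and return one dict per result (Windows only)."""
    import wmi  # third-party package, imported lazily: Windows only

    connection = wmi.WMI()
    rows = connection.query(build_wql(wmi_class, fields, where))
    return [dict((name, getattr(row, name)) for name in fields) for row in rows]


if __name__ == "__main__":
    # Only builds the query string, so this part runs on any OS.
    print(build_wql(
        "Win32_PerfFormattedData_PerfOS_Processor",
        ["Name", "PercentProcessorTime"],
        "Name = '_Total'",
    ))
```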

Then we checked what WMI counters have in stock for our metrics. Pretty straightforward, with some minor issues.

We adapted the low-level probes a bit to support a Linux and/or Windows execution pipeline, moved from a UDP domain socket to a standard UDP socket, integrated all that in the unit tests, and everything was super green in a day.
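
The socket move boils down to swapping the address family: Unix domain datagram sockets are not available on Windows, while a plain UDP socket on the loopback interface works on both systems. A minimal sketch, where open_receiver and send_metric are hypothetical names and the payload format is made up:

```python
import socket


def open_receiver(host="127.0.0.1", port=0):
    """Bind a standard UDP socket; port 0 lets the OS pick a free port."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind((host, port))
    receiver.settimeout(2.0)  # do not block forever if nothing arrives
    return receiver


def send_metric(address, payload):
    """Send one datagram to the daemon listening on address."""
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sender.sendto(payload, address)
    finally:
        sender.close()


if __name__ == "__main__":
    receiver = open_receiver()
    send_metric(receiver.getsockname(), b"probe.ok 1")
    data, _ = receiver.recvfrom(4096)
    print(data)
    receiver.close()
```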

The next day we set up the Knock daemon NT service, with unit tests.

The event log :(

Then we lost two days. The winner was the Event log.

Instead of storing plain text buffers, they (who??) had the big idea to handle logs with formatted string buffers (kind of templates), with the magic goal of internationalizing the logs (just go with English, man) and reducing the event log size (I will go die a little and be back after).

I had already run into that, but I forgot (maybe too messy for my poor brain).

To ease all this mess, these formatted string buffers have to be in a DLL (lol).

We ended up with win32evtlog.pyd and event ID 1 (which is an undocumented "%s", gg guys).

We made some minor updates to our low-level logger to handle log file rotation on Windows (which does not have logrotate), using TimedRotatingFileHandler.
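
For reference, a minimal sketch of such a handler setup with the standard library. The logger name, rotation time, retention and format are assumptions, not our production values.

```python
import logging
import logging.handlers


def build_rotating_logger(log_path, backup_count=7):
    """Return a logger whose file rolls over at midnight, keeping N old files."""
    handler = logging.handlers.TimedRotatingFileHandler(
        log_path, when="midnight", interval=1, backupCount=backup_count
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger = logging.getLogger("daemon.sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

Calling build_rotating_logger with a file path and then logger.info(...) is all a caller needs; the rotation happens inside the process, so no external tool is involved.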

Finalization

We finished with some tricks about file paths (which are pretty dirty but avoid storing stuff inside the Windows registry).

The daemon service was up, and the packaging and MSI were tested.

We integrated this in our build system and tested the MSI again.

Checked!

Windows probes

With solid foundations, implementing the Windows probes was easy.

As expected, we remapped Windows metrics to existing ones with (almost) no issues.
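
The remapping itself can be as simple as a lookup table from WMI property names to the metric names already in use. A sketch with invented metric names (the cpu.* names below do not come from our code base, and remap is a hypothetical helper):

```python
# Hypothetical mapping: WMI property name -> existing metric name.
CPU_METRIC_MAP = {
    "PercentUserTime": "cpu.user",
    "PercentPrivilegedTime": "cpu.system",
    "PercentIdleTime": "cpu.idle",
}


def remap(wmi_values, metric_map):
    """Translate a dict of WMI values into metric name/value pairs.

    Properties without a mapping are dropped; values are cast to float
    because some of them may arrive as strings.
    """
    metrics = {}
    for wmi_name, value in wmi_values.items():
        metric_name = metric_map.get(wmi_name)
        if metric_name is not None:
            metrics[metric_name] = float(value)
    return metrics
```

With this shape, the backend only ever sees the existing metric names, whatever system the probe ran on.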

95% of the stuff is WMI-based; the exception is socket states, which are based on Win32 API calls.

Then we encountered some shifting in probe scheduling on some Windows boxes. Not sure what the real root cause is at this stage (gevent, Windows, both...). We rewrote this part a bit and boosted the scheduling intervals a bit on Windows to avoid gaps in graphs.
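
One common way to keep a periodic job from shifting is to compute each deadline from a fixed start time instead of sleeping a full interval after every run. The sketch below shows that general technique only; it is not the scheduler we rewrote, and next_deadline and run_periodically are hypothetical names.

```python
import math
import time


def next_deadline(start, interval, now):
    """Return the first tick of the grid start + k * interval that is after now."""
    elapsed = max(0.0, now - start)
    ticks_done = math.floor(elapsed / interval)
    return start + (ticks_done + 1) * interval


def run_periodically(probe, interval, runs, clock=time.time, sleep=time.sleep):
    """Run probe `runs` times, aligned on a fixed grid so delays do not pile up."""
    start = clock()
    for _ in range(runs):
        probe()
        # A slow run shortens the next sleep instead of pushing every later run.
        delay = next_deadline(start, interval, clock()) - clock()
        if delay > 0:
            sleep(delay)
```

If a run overshoots one or more ticks, the missed ticks are skipped rather than replayed, which leaves a gap but no cumulative shift.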

Delayed delay

The last issue was the slow start of the NT service when the OS is under heavy pressure at boot (slow disks, huge packs of services starting up, standalone EXE).

Step one was to delay the service start. Not enough.

We moved the start signal a bit earlier and enabled a restart failure action with a 2-minute delay, set through the WiX Toolset at service install.

This should be acceptable; moreover, we have no easy way to configure the delayed-start delay or the (stupid) 30-second timeout enforced by the Windows service manager (it's in the registry, requires a reboot, is system-wide, blablabla).

In the end, the Knock Daemon service uses a bit more CPU than the Linux version (due to WMI fetches), but this remains low; the memory footprint stays under 40 MB, and network and I/O usage are similar.

Final thoughts

We got Windows support fast, in a couple of weeks, with a unified code base, unified metrics, and unified alerts.

We reused our development environments and build systems.

We did not modify a single line of code in our monitoring backend to handle Windows.