Add check for devices with Status: Offline, that have row/rack assigned

Add check for devices with Status not in (Offline, Planned) but with no row/rack assigned (Update: there are 43 devices that are Status: Inventory but clearly in storage; not sure how we want to handle this right now)

Make asset tag and task checking a bit smarter (e.g. catch cases like ticket: T2). Should probably check for WMF\d{4,} and T\d{5,} or something.
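Something like the following sketch is roughly what the stricter validation could look like; the function name and field handling are hypothetical, not the report's actual code:

```python
import re

# Hypothetical patterns; the exact minimum digit counts are still up for debate.
ASSET_TAG_RE = re.compile(r"^WMF\d{4,}$")
TASK_RE = re.compile(r"^T\d{5,}$")

def check_identifiers(asset_tag, task):
    """Return a list of error strings for malformed identifiers."""
    errors = []
    if asset_tag and not ASSET_TAG_RE.match(asset_tag):
        errors.append(f"malformed asset tag: {asset_tag!r}")
    if task and not TASK_RE.match(task):
        errors.append(f"malformed task: {task!r}")
    return errors
```

Anchoring the patterns with ^/$ is what catches the "T2" case that currently slips through.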

Change log_info and log lines to log_success if they are in fact successes, so we can have some nice green lines even if they are summaries.

Add a PuppetDB sub-report that checks the device type vs. the productname fact. Same for manufacturer, but that has extra gotchas (Dell self-reports as "Dell Inc."; HP was renamed to "HPE" at some point).

Filter out ADMIN_down Ganeti VMs from the PuppetDB report.

The plan is to enable an alert for these reports once we get them to only error on actual errors.

OK, so, after the efforts in the past few days, we're in much better shape! The PuppetDB report seems to be (almost?) entirely indicative of real issues and is actionable now - I will involve DC Ops to start fixing the cases that are known to be real errors, and we'll see if there are any false positives (I know of at least one, which is tough to handle!).

The Coherence report in its current state is not super useful. Issues to fix:

Blacklist sites esams and knams for now; these are known to be wildly inconsistent (or should I say incoherent? :) and they are unfortunately not actionable right now. We can re-enable once we do a big cleanup project there, hopefully in a few months.

Blacklist certain roles, which are known to be missing data, and we accept that's OK. I can think of at least: Cable management, Storage bin, Optical device. Possibly others.

We should also probably exclude Status: Offline. If the asset is offline and unracked, we probably don't care at that point.

We should probably do a s/^wmf/WMF/ in asset tags, and s/^t/T/ in tasks for now, until that report becomes actionable and DC Ops starts monitoring this report.
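A minimal sketch of that normalization (function names are hypothetical):

```python
import re

def normalize_asset_tag(tag):
    # Lowercase "wmf" prefixes are common data-entry slips; upcase them.
    return re.sub(r"^wmf", "WMF", tag)

def normalize_task(task):
    # Same for lowercase "t" task prefixes.
    return re.sub(r"^t", "T", task)
```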

Asset tag and task checking needs to be a bit smarter. There is a device with task set to "T2" that doesn't alert right now. Should probably check for WMF\d{4,} and T\d{6} or something.

(from another report) "only unconnected console ports are present" - what does that mean?

We need a new method to check for devices with Status: Offline that have a row/rack assigned. I'm sure there are plenty of those now.

Finally, food for thought: with these coherence issues fixed up, we can be reasonably sure that we have data that looks to be formatted properly, but we can't be sure whether it's bogus (e.g. an asset tag, Phabricator task or purchase date typo). Ideas on how to detect these:

Assets with the same procurement task should probably have the same purchase date. I wonder if that's the case in the wild though; I know that in some cases we had multiple shipments for the same order and may have entered different invoicing dates. Needs to be checked before it gets into the report.
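If it does turn out to hold in the wild, a check along these lines could work; the tuple shape is an assumption for illustration, not the report's actual data model:

```python
from collections import defaultdict

def inconsistent_purchase_dates(devices):
    """Group devices by procurement task and flag tasks whose devices
    carry more than one purchase date.

    devices: iterable of (name, task, purchase_date) tuples.
    Returns {task: {date, ...}} for tasks with conflicting dates.
    """
    dates_by_task = defaultdict(set)
    for name, task, purchase_date in devices:
        if task and purchase_date:
            dates_by_task[task].add(purchase_date)
    return {task: dates
            for task, dates in dates_by_task.items()
            if len(dates) > 1}
```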

<insert crazy idea about checking whether the Phabricator task is marked with #procurement>

OK, so, after the efforts in the past few days, we're in much better shape! The PuppetDB report seems to be (almost?) entirely indicative of real issues and is actionable now - I will involve DC Ops to start fixing the cases that are known to be real errors, and we'll see if there are any false positives (I know of at least one, which is tough to handle!).
The Coherence report in its current state is not super useful. Issues to fix:

Blacklist sites esams and knams for now; these are known to be wildly inconsistent (or should I say incoherent? :) and they are unfortunately not actionable right now. We can re-enable once we do a big cleanup project there, hopefully in a few months.

Blacklist certain roles, which are known to be missing data, and we accept that's OK. I can think of at least: Cable management, Storage bin, Optical device. Possibly others.

ack

We should also probably exclude Status: Offline. If the asset is offline and unracked, we probably don't care at that point.

ack

We should probably do a s/^wmf/WMF/ in asset tags, and s/^t/T/ in tasks for now, until that report becomes actionable and DC Ops starts monitoring this report.

Reasonable.

Asset tag and task checking needs to be a bit smarter. There is a device with task set to "T2" that doesn't alert right now. Should probably check for WMF\d{4,} and T\d{6} or something.

(from another report) "only unconnected console ports are present" - what does that mean?

The situation where no console ports are defined is distinct from the one where console ports exist but have no connections on them. To be honest this is unnecessary complexity, since both situations are equivalent for our purposes.

We need a new method to check for devices with Status: Offline that have a row/rack assigned. I'm sure there are plenty of those now.

Reasonable and straightforward.

Finally, food for thought: with these coherence issues fixed up, we can be reasonably sure that we have data that looks to be formatted properly, but we can't be sure whether it's bogus (e.g. an asset tag, Phabricator task or purchase date typo). Ideas on how to detect these:

Assets with the same procurement task should probably have the same purchase date. I wonder if that's the case in the wild though; I know that in some cases we had multiple shipments for the same order and may have entered different invoicing dates. Needs to be checked before it gets into the report.

Interesting, I'll think about that one.

<insert crazy idea about checking whether the Phabricator task is marked with #procurement>

For the Coherence report, I did a big audit and all devices (offline and online) should have purchase date + ticket + asset tag now. Exceptions are esams/knams, plus real errors that should be fixed. So:

the by-role blacklist (Cable management/Storage bin/Optical device) should not apply to those three fields; in other words, it should apply only to the serial number, since these assets do not have one.

As a direct consequence of the above, the previously mentioned blacklisting of Status: Offline should not be applied anymore.

For the PuppetDB report:

I wonder if we should exclude VMs that are ADMIN_down from the Ganeti<->Netbox sync (not just the report). The PuppetDB report has only 2 VMs outstanding right now across all checks (yay!), and one of them is d-i-test, which is by design. I'm on the fence myself.

We should add another check that compares the device type vs. facter's productname. It should match in all cases :) We should probably also do the same for Netbox's manufacturer vs. PuppetDB's manufacturer fact, although note that a) Dell is self-reported by facter as "Dell Inc.", so we'd need to mangle that, b) HP was renamed to HPE at some point in their products, which is not represented in Netbox.
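A rough sketch of what such a check might look like; the alias table and function names are assumptions, not anything that exists yet:

```python
# Hypothetical normalization table for facter's self-reported manufacturer
# strings; "Dell Inc." and the HP -> HPE rename are the known gotchas.
MANUFACTURER_ALIASES = {
    "Dell Inc.": "Dell",
    "HP": "HPE",
    "Hewlett-Packard": "HPE",
}

def check_model(netbox_device_type, netbox_manufacturer,
                productname_fact, manufacturer_fact):
    """Compare Netbox device type/manufacturer against PuppetDB facts.
    Returns a list of mismatch descriptions (empty if all good)."""
    errors = []
    if netbox_device_type != productname_fact:
        errors.append(f"model mismatch: Netbox {netbox_device_type!r} "
                      f"vs fact {productname_fact!r}")
    normalized = MANUFACTURER_ALIASES.get(manufacturer_fact, manufacturer_fact)
    if netbox_manufacturer != normalized:
        errors.append(f"manufacturer mismatch: Netbox {netbox_manufacturer!r} "
                      f"vs fact {manufacturer_fact!r}")
    return errors
```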

I wonder if we should exclude VMs that are ADMIN_down from the Ganeti<->Netbox sync (not just the report). The PuppetDB report has only 2 VMs outstanding right now across all checks (yay!), and one of them is d-i-test, which is by design. I'm on the fence myself.

It's true that excluding them from the sync (as missing machines) would prevent them from showing up in the report. Perhaps it would be better to use that to set the machine status instead, so the machine would still be present, just with a status that we could exclude.

We should add another check that compares the device type vs. facter's productname. It should match in all cases :) We should probably also do the same for Netbox's manufacturer vs. PuppetDB's manufacturer fact, although note that a) Dell is self-reported by facter as "Dell Inc.", so we'd need to mangle that, b) HP was renamed to HPE at some point in their products, which is not represented in Netbox.

After a little bit of digging, is this an additional field to ship from the PuppetDB microapi? Facter doesn't provide an additional API, does it?

I wonder if we should exclude VMs that are ADMIN_down from the Ganeti<->Netbox sync (not just the report). The PuppetDB report has only 2 VMs outstanding right now across all checks (yay!), and one of them is d-i-test, which is by design. I'm on the fence myself.

It's true that excluding them from the sync (as missing machines) would prevent them from showing up in the report. Perhaps it would be better to use that to set the machine status instead, so the machine would still be present, just with a status that we could exclude.

Unfortunately there is no Status field for VMs. We could always add it as a custom field, with our own options for it, to be maintained separately... I'm legitimately unsure whether that's something we should do or not.

We should add another check that compares the device type vs. facter's productname. It should match in all cases :) We should probably also do the same for Netbox's manufacturer vs. PuppetDB's manufacturer fact, although note that a) Dell is self-reported by facter as "Dell Inc.", so we'd need to mangle that, b) HP was renamed to HPE at some point in their products, which is not represented in Netbox.

After a little bit of digging, is this an additional field to ship from the PuppetDB microapi? Facter doesn't provide an additional API, does it?

The oldhardware report isn't super useful right now. It does what it was designed to do, but due to various factors (the refresh rate varies per type of device; sometimes we go over 5 years; sometimes the replacement is already there and about to be executed; etc.) it's not super useful as a tool for DC Ops to be alerted on and triage with. We typically do this audit once a year during budgeting anyway, but at that point we're looking one year ahead. We also track this regularly as part of budget execution, so having this as a report isn't of particular help to me as budget owner, and I think not to DC Ops either.

(Oh, and the "null purchase date" portion of it is now redundant -- the Coherence check already alerts us of this.)

One thing that /could/ be useful for this report though, would be to report on hardware > 4 or 4.5 years old that is Status: Planned. Basically: once a spare device is sufficiently old that it would not make sense to provision anything new on it, alert us so that we can transition it to Status: Inventory and eventually Offline. Does that make sense?
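If we go that route, a sketch could look like the following; the 4.5-year threshold and the data shape are placeholders, not a decision:

```python
from datetime import date, timedelta

# Hypothetical threshold: spares older than ~4.5 years shouldn't receive
# new provisioning; flag them so they can move to Inventory/Offline.
SPARE_AGE_LIMIT = timedelta(days=int(4.5 * 365))

def stale_planned_devices(devices, today=None):
    """devices: iterable of (name, status, purchase_date) tuples.
    Returns names of Status: Planned devices past the age limit."""
    today = today or date.today()
    return [
        name
        for name, status, purchase_date in devices
        if status == "Planned"
        and purchase_date is not None
        and today - purchase_date > SPARE_AGE_LIMIT
    ]
```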

We should add another check that checks the device type vs. facter's productname. It should match in all cases :) We should probably also do the same for Netbox's manufacturer vs. PuppetDB's manufacturer fact, although note that a) Dell is self-reported by facter as "Dell Inc.", so we'd need to mangle that, b) HP was renamed to HPE at some point in their products, which is not represented by Netbox.

Given how unique models are across vendors, I'm leaning towards not adding a manufacturer check -- more trouble than it's worth.

As for model mismatches: cloudvirt1026-cloudvirt1030 are good examples of an error there (they're R640, but documented as R630).

Just a note: admin_down does not seem to indicate anything particular about the machines that would be useful to denote in Netbox, as far as I can tell. It seems to reflect the *desired* state. To clarify: is there any situation where it would not match the op_state within a short period of time? AFAICT it is used to tell Ganeti to bring the machine down or up, but I may be incorrect here. I have implemented mirroring the op_state, but if we truly do need an extra field for admin_state, that'd be useful to know.

Just a note: admin_down does not seem to indicate anything particular about the machines that would be useful to denote in Netbox, as far as I can tell. It seems to reflect the *desired* state. To clarify: is there any situation where it would not match the op_state within a short period of time? AFAICT it is used to tell Ganeti to bring the machine down or up, but I may be incorrect here. I have implemented mirroring the op_state, but if we truly do need an extra field for admin_state, that'd be useful to know.

You are correct that admin_state is the desired state, while oper_state is the operating one; basically, if your desired state is "this VM should be up" and for some reason (e.g. the host had an unscheduled reboot, or QEMU crashed etc.) the VM is down, then admin != oper, and a cronjob that runs the ganeti "watcher" will execute and fix things up (= start the VM). Until that happens, gnt-instance list will list the VM with a status of "ERROR_down", meaning "it's down, but it shouldn't be".

For the purposes of Netbox, I think what we need is the admin state, not the operating one, i.e. what we've configured Ganeti to do, rather than what has actually happened due to an error. The equivalent for a physical host is that we expect a Status: Offline host to be powered down, and a Status: Active host to be powered up, but we don't really track whether we've shut down or powercycled a host manually.
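Purely as illustration of that distinction (the Netbox-side status values are hypothetical, since we'd need somewhere to store them), a sync keyed on the admin state could map Ganeti statuses like this:

```python
# Map gnt-instance statuses to a hypothetical Netbox-side VM status,
# keyed on the *admin* (desired) state rather than the operational one:
# an ERROR_down VM is still meant to be up, so it stays "active".
GANETI_TO_NETBOX_STATUS = {
    "running": "active",
    "ERROR_down": "active",   # down, but shouldn't be: the watcher restarts it
    "ADMIN_down": "offline",  # intentionally stopped
}

def netbox_status(gnt_instance_status):
    return GANETI_TO_NETBOX_STATUS.get(gnt_instance_status, "unknown")
```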

Does that make sense and do you agree?