March 22, 2016

Agenda:

IMPORTANT NOTE: As we have weekly Doctor meetings fixed at 06:00 Pacific Time, today's meeting will start at 14:00 (CEST) instead of 15:00 because Pacific went to Daylight Saving Time already last week. Daylight Saving Time begins on March 27 in Europe. The change to 14:00 only applies to today's meeting!

Action: Ryota to open and clean up JIRA tickets according to the results of the discussion at the hackfest

Maintenance changes to Nova (Tomi)

Propose a spec as normal (Tomi: can present the main idea in this meeting).

Had a chat with John Garbutt (Nova PTL); he is also very interested in the topic and looking forward to seeing the spec, which would be a natural continuation of the get-valid-server-state work. He also proposed that I could discuss this with the API team.

The initial proposal would be to make the maintenance start and end time visible: an API to set them, with visibility to the VM owner as well. Currently there is only disable/enable to stop scheduling, but this would cover the real maintenance period.

Any actions on the VMs themselves would be harder and are out of scope. Also talked a bit about this with Sean Dague (Nova core). The general opinion has been to keep this external, but gaps surely need to be implemented in Nova if the use case requires it. There is some discussion about auto recovery, but that could perhaps be addressed later.
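For reference, a minimal sketch of how today's disable/enable mechanism can already be driven via python-novaclient. Credentials, endpoint and host name below are placeholders, and the maintenance window in the reason string only illustrates the proposed start/end visibility; Nova does not store such fields yet.

```python
# Minimal sketch, assuming python-novaclient: drive today's disable/enable mechanism
# around a maintenance window. Credentials, endpoint and host name are placeholders.
from novaclient import client

nova = client.Client("2", "admin", "secret", "admin",
                     "http://controller:5000/v2.0")

host = "compute-01"

# Stop scheduling new VMs onto the host and record a human-readable reason.
# The window in the reason string is only illustrative of the proposed fields.
nova.services.disable_log_reason(
    host, "nova-compute",
    "maintenance 2016-04-01T08:00Z - 2016-04-01T12:00Z")

# After the maintenance period, re-enable the host for scheduling.
nova.services.enable(host, "nova-compute")
```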

AoB

OpenStack Summit

Reminder: multiple Doctor members got presentations accepted -- thank you all! This is not news, but good to share once again :)

Carlos: originally proposed by NEC; integrated with OpenStack; NEC found some "bugs" and "gaps" (e.g. delay is significantly more than 1s); meet them in Vancouver; it is a candidate, but no other platform seems to be integrated in OpenStack

Gerald: meeting with Fujitsu on Monasca two weeks ago

Carlos: pluggable architecture, could support Nagios or Zabbix

Gerald: in Monasca there is currently no requirement to do reporting within 1s

Last meeting with REL

Does Tommy plan to meet with ETSI NFV REL? If time allows.

ETSI NFV IFA

IFA documents not yet open to public

April 28, 2015

Joint meeting with ETSI NFV REL team. Agenda:

Identify Purpose of the call

Collaboration kick-off

NFV REL:

Project Overview

NFV upgrade

Active monitoring and failure detection

OPNFV Doctor:

Project Overview

Use cases

Collaboration methodology discussion

Wrap-up

Minutes:

Purpose

Ryota: know each other; see how to work together; further technology discussion needed at later stage

Markus Schoeller (NEC): no IPR declarations today, today only exchange of public information

Policies on how to work together w.r.t. IPR etc. should be defined for later work

Gurpreet: high-level overview of the Doctor project; fault detection and management; what are the use cases of Doctor?

NFV REL introduction (Markus Schoeller)

Project overview: see ETSI NFVREL(14)000200

dedicated reliability project

Ryota: target size / number of applications?

Tommy: which work items focus on the VIM part? Indirectly addressed in monitoring and failure detection; scalability per se has some impact on the VIM

Tommy: this means "monitoring and failure detection" would be the main crossing point with Doctor? So far yes, but new WIs may be created in the next meeting

NFV software upgrade mechanism (Stefan Arntzen - Huawei)

different from traditional upgrades: "old traffic" can still go to the "old software version", whereas new traffic/connections can go to the new s/w version in parallel (this is enabled by virtualization); no hard switchover needed; the old system/version is still running and can be switched back to in case of issues with the new version

assumption is that this can be done stateless (otherwise it would be more complex)

Active monitoring for NFV (Gurpreet)

Alistair Scott: interested in passive monitoring; where are the attachment points for passive monitoring? REL has not looked into passive monitoring for NFV

April 7, 2015

Revisit the scope of the NIC to make sure that we can collect VF stats.

TBD

Can the NIC report VF/PF stats capabilities? Investigate: Maryam

I've been looking into this for the Intel® 82599 10 GbE Controller, and this might be possible through a level of indirection by checking what VFs are enabled. It's not exactly what's being asked, but if you knew a VF was enabled then you'd know what stats are also available.

BTW: Stats can then be retrieved per VF for Niantic:

VF Good Packets Received Count

VF Good Packets Transmitted Count

VF Good Octets Received Count Low

VF Good Octets Received Count High

VF Good Octets Transmitted Count

VF Multicast Packets Received Count

But then error stats are still shared.

Open question Maryam is looking into: if we knew the queues that were assigned to a VF, could we use the Queue Packets Received Drop Count (QPRDC) to retrieve the dropped packets for that VF?

Maryam is in the process of writing a DPDK app that runs as a secondary process on the host and is capable of reading the stats, which can then be parsed by a script (see the sketch after the counter list below).

Illegal Byte Error Count: Counts the number of receive packets with illegal byte errors (such as an illegal symbol in the packet).

Error Byte Count: Counts the number of receive packets with error bytes (such as an error symbol in the packet).

Rx Missed Packets Count

MAC Local Fault Count: Number of faults in the local MAC.

MAC Remote Fault Count: Number of faults in the remote MAC.

Receive Length Error Count: Number of packets with receive length errors. A length error occurs if an incoming packet length field in the MAC header doesn't match the packet length.

Receive Undersize Count: This register counts the number of received frames that are shorter than minimum size (64 bytes from <Destination Address> through <CRC>, inclusively) and had a valid CRC.

Receive Fragment Count: Number of receive fragment errors (frame shorter than 64 bytes from <Destination Address> through <CRC>, inclusively) that have bad CRC (this is slightly different from the Receive Undersize Count register).

Receive Oversize Count: This register counts the number of received frames that are longer than maximum size as defined by MAXFRS.MFS (from <Destination Address> through <CRC>, inclusively) and have valid CRC.

Receive Jabber Count: Number of receive jabber errors. This register counts the number of received packets that are greater than maximum size and have bad CRC (this is slightly different from the Receive Oversize Count register). The packet length is counted from <Destination Address> through <CRC>, inclusively.

Management Packets Dropped Count: Number of management packets dropped. This register counts the total number of packets received that pass the management filters and then are dropped because the management receive FIFO is full. Management packets include any packet directed to the manageability console (such as RMCP and ARP packets).

MAC Short Packet Discard Count: Number of MAC short packet discard packets received.
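For reference, a minimal sketch of the parsing script mentioned above. The stats app, its path and its output format are assumptions (e.g. one counter per line such as "vf0 rx_good_packets 123456"), not an existing tool.

```python
# Minimal sketch of a parser for the (hypothetical) DPDK secondary-process stats app.
# Assumed output format, one counter per line:
#   vf0 rx_good_packets 123456
#   vf0 tx_good_packets 120001
import subprocess
from collections import defaultdict

def read_vf_stats(cmd=("/usr/local/bin/vf_stats_dump",)):  # hypothetical helper binary
    """Run the stats app and return {vf_id: {counter_name: value}}."""
    out = subprocess.check_output(cmd, universal_newlines=True)
    stats = defaultdict(dict)
    for line in out.splitlines():
        parts = line.split()
        if len(parts) != 3 or not parts[0].startswith("vf"):
            continue  # skip anything that is not a per-VF counter line
        vf, counter, value = parts
        stats[int(vf[2:])][counter] = int(value)
    return dict(stats)

if __name__ == "__main__":
    for vf_id, counters in sorted(read_vf_stats().items()):
        print("VF %d: %s" % (vf_id, counters))
```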

People from the HA project really work on HA today and have a lot of knowledge on it.

Identify Overlap:

NB I/F

Doctor also requires fast reaction; the objective of HA is similar.

HA has more use cases and may send more information on the northbound I/F. VNFM should be informed about changes.

Doctor's objective is to design an NB I/F.

Does HA already have flows available?

HA is focusing on application level. Reaction should be as fast as possible. Including the VNFM may slow down the progress.

In Doctor we will follow the path through VNFM.

In ETSI we have lifecycle mgmt, where the VNFM is responsible for the lifecycle.

There is certain information the VNFM doesn't know about. In Doctor we call it "consumer".

Proposal to do use case analysis for HA. Which use cases may require the VNFM to be involved? "Doctors" will have a look at HA use cases.

How is the entity to resolve race conditions? Some entity in the "north".

What about a shared fault collection/detection entity instead of collecting the same information 3 times?

Predictor could also notify immediate failures to Doctor.

Security issues are not addressed in Doctor. Currently assuming a single operator, where policies ensure who can talk to who.

In Doctor we do not look at application faults, only NFVI faults.

Huawei: we use Heat to do HA. If one VM dies and Heat finds the scaling group has fewer than 2 members, it will start a new VM. This may take more than 60s; we need to find something faster for HA. Heat doesn't find errors in the applications.

Failure detection time is an issue across all projects.

Which metrics of fastpath would Doctor be interested in? need to check in detail. Action Item to send metrics to Doctor.

Hypervisor may detect failure of VM and take action.

Other failures: the VM is using a heartbeat; it will e.g. reboot after not receiving a heartbeat for 7s.

Doctor: if the VIM takes action on its own it may conflict with the ACT-SBY configuration on the consumer side. This is why the consumer should be involved.

Which project would address ping-pong issue that may arise?

We need a subscription mechanism including a filter (which alarms to be notified about). The VM-PM-VNFM mapping can be recorded during instantiation.
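For illustration, a hypothetical sketch of what such a filtered subscription could look like on the NB I/F; none of these field names exist in Doctor or OpenStack today, they only illustrate the severity/resource filter idea.

```python
# Hypothetical sketch of a filtered alarm subscription for the NB I/F.
# All names are illustrative assumptions, not an existing API.
SUBSCRIPTION = {
    "consumer": "vnfm-01",                          # who wants to be notified
    "callback_url": "http://vnfm-01:8080/alarms",
    "filter": {
        "severity": ["critical"],                   # only urgent faults
        "resource_ids": ["vm-uuid-1", "vm-uuid-2"], # VMs mapped at instantiation time
        "fault_types": ["compute.host.down", "nic.link.down"],
    },
}

def matches(subscription, alarm):
    """Return True if an alarm passes the subscription filter."""
    flt = subscription["filter"]
    return (alarm["severity"] in flt["severity"]
            and alarm["resource_id"] in flt["resource_ids"]
            and alarm["fault_type"] in flt["fault_types"])
```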

Relationship between Doctor and Copper:

policy defines e.g. when VIM can expose its interface

When to report a fault, whom to inform, etc. is all a kind of policy.

Copper has both pro-active and reactive deployment of policies. In the reactive case, there may be a conflict when both Copper and Doctor receive the policies.

Wrap-up:

Overlap in fault management

FastPath: monitor traffic metrics; Doctor will need some of the metrics in the VIM. Plan to do regular meetings.

HA: large project with wider scope than Doctor, different use cases. Direct flow (to be faster). Task to check each other's NB I/F in order not to block each other.

Evacuation will also move the network, regardless of it being OpenDaylight or Neutron.

We are trying to tackle step-by-step, first focusing on Nova.

The Ceilometer approach seems to be better than using metadata on Nova

What is the relation to Nova metadata? Ceilometer is terrible for FM: it uses polling, which suits PM. It would be an extra step causing delay, it generates a lot of network traffic, and its database consumes a lot of memory.
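For comparison, a minimal sketch of consuming Nova notifications directly from the message bus instead of polling Ceilometer, assuming oslo.messaging and access to the controller's RabbitMQ; the broker URL and endpoint behaviour are placeholders.

```python
# Minimal sketch, assuming oslo.messaging and access to the OpenStack message bus:
# consume Nova notifications directly instead of polling Ceilometer for fault handling.
from oslo_config import cfg
import oslo_messaging

class FaultEndpoint(object):
    """Illustrative endpoint; reacts to error-level notifications (e.g. compute faults)."""
    def error(self, ctxt, publisher_id, event_type, payload, metadata):
        print("fault notification: %s from %s: %s" % (event_type, publisher_id, payload))

transport = oslo_messaging.get_notification_transport(
    cfg.CONF, url="rabbit://guest:guest@controller:5672/")  # placeholder broker URL
targets = [oslo_messaging.Target(topic="notifications")]
listener = oslo_messaging.get_notification_listener(
    transport, targets, [FaultEndpoint()], executor="threading")
listener.start()
listener.wait()
```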

Should we trigger power-off of the host from Doctor?

One needs to fence the host by powering it off, either by shutdown through the OS (or Nova) or, if the host is reachable only via IPMI, then through that. In some cases the host can be rebooted as recovery, but in most cases it is faulty and needs to be moved to a disabled aggregate or marked for maintenance. If the host cannot be reached at all, evacuation through Nova will anyhow isolate it, as everything (network, disk) will be moved to another host.
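A minimal sketch of the IPMI path, assuming ipmitool is installed and the BMC address and credentials are known; all values below are placeholders.

```python
# Minimal sketch of fencing an unreachable host via its BMC, assuming ipmitool is
# installed on the machine running this script.
import subprocess

def fence_host(bmc_addr, user, password):
    """Power the host off over IPMI so evacuated VMs cannot end up running twice."""
    subprocess.check_call([
        "ipmitool", "-I", "lanplus",
        "-H", bmc_addr, "-U", user, "-P", password,
        "chassis", "power", "off",
    ])

fence_host("10.0.0.42", "admin", "secret")  # placeholder BMC address and credentials
```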

General agreement that there should be some level of aggregation, but need to figure out what events need to be aggregated.

Some suggested that VNFs should be notified only if the faults are urgent.

Notifying data center operations folks about hardware faults is something that seems to be out of scope for this project. Tomi: I think they need the information, and there should not be a duplicate mechanism to detect faults in order to be able to carry out HW maintenance operations. Surely they will not need the notification that we would send to the VNFM, but rather the actual alarm information we are gathering to make those notifications. Anyhow I agree that this is not in our scope, and tools like Zabbix that we could use here can easily be configured for this as well in case the HW owner is interested.

Why should warnings be sent to VNFs (such as CPU temperature rising but not critical yet)? VNFs might want to take action such as setting up/syncing a hot standby, and this could take some time.

Are there open source projects already to detect hypervisor or host OS faults?

OpenStack Nova devs said it should be kept simple; providers need to monitor processes on their own.

But there appear to be some open source tools (SNMP polling or SNMP agents on hosts). Need to pull things together.
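As one example of the SNMP-polling approach, a minimal sketch using the pysnmp library, assuming an SNMP agent (e.g. net-snmp) runs on the compute host; a missing or erroneous reply is treated as a potential fault.

```python
# Minimal sketch of host liveness polling over SNMP, assuming the pysnmp library
# and an SNMP agent (e.g. net-snmp) running on the compute host.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

def poll_uptime(host, community="public"):
    """Return the agent's sysUpTime, or None if the host/agent does not answer."""
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData(community),
        UdpTransportTarget((host, 161), timeout=1, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity("1.3.6.1.2.1.1.3.0")),  # SNMPv2-MIB::sysUpTime.0
    ))
    if error_indication or error_status:
        return None  # treat as a potential host/agent fault
    return int(var_binds[0][1])

print(poll_uptime("compute-01"))  # placeholder host name
```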

Next call will be on January 12th.

Dec 15, 2014

Agenda:

Minutes:

wiki/doc structure

Agreed to have three sections

UseCase (High-level description)

Requirement (Detailed description, Gap Analysis)

Implementation (includes monitoring tools and alternatives)

Faults table

will create a table that explains the story for each fault

columns would be physical fault, how to detect, affected virtual resource and actions to recover

in three categories: Compute, Network and Storage; will start with Compute first

also try to keep separate tables/categories for critical and warning

TODO(Palani): provide fault table example

TODO(Gerald): create first version of fault table after getting table example

framework

how we handle combinations of faults and future H/W faults is still an open question

suggestion to have a fault management "framework" that should be configurable so that faults can be defined by developers or operators

Gap analysis

We should have a list of items so that we can avoid duplicated work

TODO(Ryota): Post a first item to show an example of how we describe it, which could be the template for the gap analysis

Monitoring

We should check monitoring tools as well: Nagios, Ganglia, Zabbix

Check TODOs from the last meeting

seems almost all items have been done or started (but we could not check the 'fault management scenario based on ETSI NFV Architecture', although there is a slide on the wiki)

Next meetings

Dec 22, 2014

Jan 12, 2015 # skip Jan 5th

Dec 8, 2014

Agenda:

How we shape requirements

Day of the week and time of weekly meeting

Tools: etherpad, ML, IRC?

Project schedule, visualization of deliverables

Minutes:

How we shape requirements

Use case study first

Gap analysis should include existing monitoring tools like Nagios etc.

How do we format fault messages and VNFD elements for alarms?

Fault detection should be designed in a common/standard manner

Those could be implemented in existing monitoring tools, separate from OpenStack

What are "common" monitoring tools? There are different tools and configurations

Focus on H/W faults

Do we really need that kind of notification mechanism? Can we use errors from API polling, just have errors detected by the application, or rely on auto-healing by the VIM?

A real vEPC needs to know about faults that cannot be found by the application, like abnormal temperature.

The VIM should not run auto-healing for some VNFs.

There are two cases/sequences defined in ETSI NFV MANO in which fault notifications are sent from the VIM to the VNFM and to the Orchestrator.

An alarming mechanism is good to reduce the number of requests from users polling virtual resource status.

We shall categorize requirements and create a new table on the wiki page. (layer?)

-> The general view of the participants is to have the 'HW monitoring module' outside of OpenStack