This is a bit delayed due to the release rush; I'm finally getting back
to writing up my experiences at the Ops Meetup.

Nova Feedback Session
=====================

We had a double session for feedback on Nova from operators, raw
etherpad here - https://etherpad.openstack.org/p/NYC-ops-Nova.

The median release in the room was Kilo. Some were upgrading to
Liberty, and many had clouds older than Kilo. Remember that these are
the larger ops environments, engaged enough with the community to send
people to the Ops Meetup.

Performance Bottlenecks
-----------------------

* scheduling issues with Ironic (this is a bug we worked through during
  the week after the session)
* live snapshots actually end up being a performance issue for people

The workarounds config group was not well known, and everyone in the
room wished we advertised it a bit more. The solution for the snapshot
performance issue is in there.

There were also general questions about at what scale cells should be
considered.

ACTION: we should make sure workarounds are advertised better
ACTION: we should have some document about "when cells"?

Networking
----------

A number of folks in the room were still on Nova Net, and were a bit
nervous about it going away. As they are on Kilo / Liberty, it's still
a few upgrades before they get there, but that nervousness and concern
was definitely there.

Policy
------

How are you customizing policy? People were largely making policy
changes to protect users who didn't really understand cloud semantics,
turning off features that they thought would confuse them (like
pause). The large number of VM states is confusing and not clearly
useful for end users, and they would like simplification.
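
(To make that concrete, a purely illustrative sketch of the kind of
policy.json override folks described, locking pause down to admins.
Treat the rule names as an assumption - they vary by release, which is
exactly the policy-name adjustment mentioned under API below:)

    {
        "os_compute_api:os-pause-server:pause": "rule:admin_api",
        "os_compute_api:os-pause-server:unpause": "rule:admin_api"
    }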

Ideally policy could be set per project by a project admin, because
they would like to delegate that responsibility down.

No one was using the user_id based custom policy (yay!).

There was a desire for flavors to be RBAC locked down, which is
actually being done via policy hacks right now. Providers want to
expose some flavors (especially those with aggregate affinity) to only
some projects.

People were excited about the policy-in-code effort; the only concern
was that the de facto documentation of what you could change would no
longer be in the sample config.

ACTION: ensure there is a policy config reference now that the sample
file is empty
ACTION: flavor RBAC is a thing most of the room wanted, is there a
taker on a spec / implementation?

Upgrade
-------

Everyone waits to do any optional thing until they absolutely have
to.

The Cells API db caught a bunch of people off guard because it was
optional in Kilo (with a release note), status quo in Liberty with no
release note about it existing, then forced in Mitaka. When an
optional component is out there, make sure it continues to be talked
about in releases even when its status does not change, or people
forget.

People were on Kilo, so there wasn't really any data yet on the
out-of-tree EC2 support. About 25% of folks said their users have some
existing AWS tooling, and it's good to be able to just let them use it
to onboard them.

The current DB online data upgrade model feels *very opaque* to
ops. They didn't realize the model Nova is currently using, and didn't
feel like it was documented anywhere.
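
(For reference, the flow in question is roughly the following - a
sketch assuming Mitaka-era nova-manage commands, and exactly the kind
of thing that wants a proper operator doc:)

    # schema changes for the main and API databases
    nova-manage db sync
    nova-manage api_db sync

    # then the online data migrations, re-run until the command
    # reports nothing left to migrate
    nova-manage db online_data_migrations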

ACTION: document the DB data lifecycle better for operators
ACTION: make sure we are cautious in re-warning people about changes
they have to make (like the Cells API db)

API
---

API upgrade seemed fine for folks. The only question was the new
policy names, which were taking folks a bit of time to adjust to.

No one in the room was using custom API extensions (or at least
admitted to it when I asked).

Tracking Feedback
-----------------

We talked a bit about tracking feedback. The silence on the ops list
mostly comes from people not using a particular feature, so they don't
really have an opinion.

Most ops do not have time to look at our specs. That is an unlikely
place to get feedback.

Additional Questions
--------------------

There was an ask about VM HA. I stated that was beyond scope for Nova;
plus, Nova's view of the world is non-authoritative enough that you
wouldn't want it to do that anyway. I told folks that the NFV efforts
were working on this kind of thing beyond Nova, and people should team
up there.

There was an ask about the status of Cinder Multi Attach. We gave them
a bit of an update on where things were at.

ACTION: Cinder Multi Attach should maybe be a priority effort in the
next cycle.

Upgrade Pain Points
===================

Raw etherpad -
https://etherpad.openstack.org/p/NYC-ops-Upgrades-Pain-points

Most people are a couple of releases back (Kilo / Liberty or even
older). The only team CDing in the room was RAX; they are now 2 to 3
months behind master.

Everyone agrees upgrades are getting better with every release.

Most are taking change windows and downtime for upgrades.

Why are upgrades taking so long?
--------------------------------

About halfway through this session I threw this powder keg into the
room, and it generated a lot of feedback.

People are holding a lot of out-of-tree patches, which adds
latency. These are for:

* bug fixes made against old versions of OpenStack, so upstream
  won't take them (a chicken / egg problem of being on an old release)
* a custom identity driver for keystone
* some new feature that a customer wants, not taken upstream
* lack of time to invest in working patches upstream

(Thinking out loud, I do wonder if there is a way we could close this gap)

Defcore is actually forcing upgrades, because people lose their
OpenStack trademark if they don't stay in the Defcore supported window.

Other Sessions
==============

There were a ton of other sessions there. These are the interesting
things that I remember from them.

In the session on deploying OpenStack in containers, there was a split
between idempotent (docker) vs. system (lxc/lxd) containers. Both were
getting used in different ways, and there was debate between the camps
as to which was most effective.

Ceilometer deployments are highly coupled to Heat, and seem to be only
used when users want Heat auto scaling.

There are noticeable failures in our CLI / API paths when using UTF-8
names for projects / resources. This was noticed by many .eu folks. It
would be good to increase testing in areas like this.
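
(As an illustrative sketch of the kind of case that trips things up -
nothing exotic, just a non-ASCII resource name; the image and flavor
names here are placeholders:)

    openstack project create "café"
    openstack server create --image cirros --flavor m1.tiny "café-01"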

On 16-09-20 09:20 AM, Sean Dague wrote:
> <snip>

Thanks Sean, these are great notes and very consumable. I really
appreciate you taking the time to convey this information so well.

I'm sorry I couldn't attend myself, but your summary really helps to
communicate the highlights as you saw them.

Excellent writeup, thanks. Some comments inline.

On Tue, Sep 20, 2016, at 09:20 AM, Sean Dague wrote:
> <snip>
>
> Performance Bottlenecks
> -----------------------
>
> * scheduling issues with Ironic - (this is a bug we got through during
> the week after the session)
> * live snapshots actually end up performance issue for people
>
> The workarounds config group was not well known, and everyone in the
> room wished we advertised that a bit more. The solution for snapshot
> performance is in there.
>
> There were also general questions about what scale cells should be
> considered at.
>
> ACTION: we should make sure workarounds are advertised better
> ACTION: we should have some document about "when cells"?

This is a difficult question to answer because "it depends." It's akin
to asking "how many nova-api/nova-conductor processes should I run?"
Well, what hardware is being used, how much traffic do you get, is it
bursty or sustained, are instances created and left alone or are they
torn down regularly, do you prune your database, what version of rabbit
are you using, etc...

I would expect the best answer(s) to this question are going to come
from the operators themselves. What I've seen with cellsv1 is that
someone will decide for themselves that they should put no more than X
computes in a cell and that information filters out to other operators.
That provides a starting point for a new deployment to tune from.

> <snip>
>
> Policy
> ------
>
> How are you customizing policy? People were largely making policy
> changes to protect their users that didn't really understand cloud
> semantics. Turning off features that they thought would confuse them
> (like pause). The large number of VM states is confusing, and not
> clearly useful for end users, and they would like simplification.
>
> Ideally policy could be set on a project by project admin, because
> they would like to delegate that responsibility down.
>
> No one was using the user_id based custom policy (yay!).
>
> There was desire that flavors could be RBAC locked down, which was
> actually being done via policy hacks right now. Providers want to
> expose some flavors (especially those with aggregate affinity) to only
> some projects.
>
> People were excited about the policy in code effort, only concern was
> that the defacto documentation of what you could change wouldn't be in
> the sample config.
>
> ACTION: ensure there is policy config reference now that the sample
> file is empty

We have the "genpolicy" tox target which mimics the "genconfig" target.
It's similar to the old sample except guaranteed to be up to date, and
can include comments. Is that sufficient?
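
(Usage is just the tox target - a quick sketch; the exact name and
location of the generated file may differ by branch:)

    tox -e genpolicy
    # writes a fully commented sample policy file, analogous to the
    # config sample produced by 'tox -e genconfig'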

> ACTION: flavor RBAC is a thing most of the room wanted, is there a
> taker on spec / implementation?


On Tue, Sep 20, 2016 at 09:20:15AM -0400, Sean Dague wrote:
> This is a bit delayed due to the release rush, finally getting back to
> writing up my experiences at the Ops Meetup.
>
> Nova Feedback Session
> =====================
>
> We had a double session for Feedback for Nova from Operators, raw
> etherpad here - https://etherpad.openstack.org/p/NYC-ops-Nova.
>
> The median release people were on in the room was Kilo. Some were
> upgrading to Liberty, many had older than Kilo clouds. Remembering
> these are the larger ops environments that are engaged enough with the
> community to send people to the Ops Meetup.
>
>
> Performance Bottlenecks
> -----------------------
>
> * scheduling issues with Ironic - (this is a bug we got through during
> the week after the session)
> * live snapshots actually end up performance issue for people
>
> The workarounds config group was not well known, and everyone in the
> room wished we advertised that a bit more. The solution for snapshot
> performance is in there
>
> There were also general questions about what scale cells should be
> considered at.
>
> ACTION: we should make sure workarounds are advertised better

Workarounds ought to be something that admins are rarely, if
ever, having to deal with.

If the lack of live snapshot is such a major performance problem
for ops, this tends to suggest that our default behaviour is wrong,
rather than a need to publicise that operators should set this
workaround.

e.g., instead of optimizing by default for the case of broken live
snapshot support, we should optimize by default for the case of working
live snapshots. The broken live snapshot issue was so rare that no one
has ever reproduced it outside of the gate AFAIK.

IOW, rather than hardcoding disable_live_snapshot=True in nova,
we should just set it in the gate CI configs, and leave it set
to False in Nova, so operators get good performance out of the
box.

Also it has been a while since we added the workaround, and IIRC,
we've got newer Ubuntu available on at least some of the gate
hosts now, so we have the ability to test to see if it still
hits newer Ubuntu.


On 09/20/2016 10:22 AM, Andrew Laski wrote:
> Excellent writeup, thanks. Some comments inline.
>
>
> On Tue, Sep 20, 2016, at 09:20 AM, Sean Dague wrote:
>> <snip>
>>
>> Performance Bottlenecks
>> -----------------------
>>
>> * scheduling issues with Ironic - (this is a bug we got through during
>> the week after the session)
>> * live snapshots actually end up performance issue for people
>>
>> The workarounds config group was not well known, and everyone in the
>> room wished we advertised that a bit more. The solution for snapshot
>> performance is in there.
>>
>> There were also general questions about what scale cells should be
>> considered at.
>>
>> ACTION: we should make sure workarounds are advertised better
>> ACTION: we should have some document about "when cells"?
>
> This is a difficult question to answer because "it depends." It's akin
> to asking "how many nova-api/nova-conductor processes should I run?"
> Well, what hardware is being used, how much traffic do you get, is it
> bursty or sustained, are instances created and left alone or are they
> torn down regularly, do you prune your database, what version of rabbit
> are you using, etc...
>
> I would expect the best answer(s) to this question are going to come
> from the operators themselves. What I've seen with cellsv1 is that
> someone will decide for themselves that they should put no more than X
> computes in a cell and that information filters out to other operators.
> That provides a starting point for a new deployment to tune from.

I don't think we need "don't go larger than N nodes" kind of advice. But
we should probably know what kinds of things we expect to be hot spots.
Like mysql load, possibly indicated by system load or high level of db
conflicts. Or rabbit mq load. Or something along those lines.

Basically the things to look out for that indicate your are approaching
a scale point where cells is going to help. That also helps in defining
what kind of scaling issues cells won't help on, which need to be
addressed in other ways (such as optimizations).


On 09/20/2016 10:38 AM, Daniel P. Berrange wrote:
> On Tue, Sep 20, 2016 at 09:20:15AM -0400, Sean Dague wrote:
>> <snip>
>
> Workarounds ought to be something that admins are rarely, if
> ever, having to deal with.
>
> If the lack of live snapshot is such a major performance problem
> for ops, this tends to suggest that our default behaviour is wrong,
> rather than a need to publicise that operators should set this
> workaround.
>
> eg, instead of optimizing for the case of a broken live snapshot
> support by default, we should optimize for the case of working
> live snapshot by default. The broken live snapshot stuff was so
> rare that no one has ever reproduced it outside of the gate
> AFAIK.
>
> IOW, rather than hardcoding disable_live_snapshot=True in nova,
> we should just set it in the gate CI configs, and leave it set
> to False in Nova, so operators get good performance out of the
> box.
>
> Also it has been a while since we added the workaround, and IIRC,
> we've got newer Ubuntu available on at least some of the gate
> hosts now, so we have the ability to test to see if it still
> hits newer Ubuntu.

Here is my reconstruction of the snapshot issue from what I can remember
of the conversation.

Nova defaults to live snapshots. This uses the libvirt facility which
dumps both memory and disk. And then we throw away the memory. For large
memory guests (especially volume backed ones that might have a fast path
for the disk) this leads to a lot of overhead for no gain. The
workaround got them past it.

Maybe there is another bug we should be addressing here, but it was an
issue out there people were seeing on the performance side.

On 20 Sep 2016, at 16:38, Sean Dague <sean@dague.net> wrote:

> <snip>
>
> Basically, the things to look out for that indicate you are approaching
> a scale point where cells is going to help. That also helps in defining
> what kind of scaling issues cells won't help on, which need to be
> addressed in other ways (such as optimizations).
>
> -Sean

We had an 'interesting' experience splitting a cell which I would not
recommend for others.

We started off letting our cells grow to about 1000 hypervisors but,
following discussions in the large deployment team, ended up aiming for
200 or so per cell. This also allowed us to make the hardware
homogeneous in a cell.

We then split the original 1000-hypervisor cell into smaller ones,
which was hard work to plan.

Thus, I think people who think they may need cells are better off
adding new cells than letting their first one grow until they are
forced to do cells at a later stage and then do a split.



On Tue, Sep 20, 2016 at 11:01:23AM -0400, Sean Dague wrote:
> On 09/20/2016 10:38 AM, Daniel P. Berrange wrote:
> > <snip>
>
> Here is my reconstruction of the snapshot issue from what I can remember
> of the conversation.
>
> Nova defaults to live snapshots. This uses the libvirt facility which
> dumps both memory and disk. And then we throw away the memory. For large
> memory guests (especially volume backed ones that might have a fast path
> for the disk) this leads to a lot of overhead for no gain. The
> workaround got them past it.

I think you've got it backwards there.

Nova defaults to *not* using live snapshots:

    cfg.BoolOpt(
        'disable_libvirt_livesnapshot',
        default=True,
        help="""
    Disable live snapshots when using the libvirt driver.
    ...""")

When live snapshot is disabled like this, the snapshot code is unable
to guarantee a consistent disk state. So the libvirt nova driver will
stop the guest by doing a managed save (this saves all memory to
disk), then does the disk snapshot, then restores the managed save
(which loads all memory back from disk).

This is terrible for multiple reasons:

1. the guest workload stops running while the snapshot is taken
2. we churn disk I/O saving & loading VM memory
3. you can't do it at all if host PCI devices are attached to
   the VM

Enabling live snapshots by default fixes all these problems, at the
risk of hitting the live snapshot bug we saw in the gate CI but never
anywhere else.
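
(In nova.conf terms, the operator-side toggle being debated is a
one-line override in the workarounds group - assuming the option keeps
the name shown above:)

    [workarounds]
    # default is True, i.e. live snapshots disabled; False opts back
    # in to libvirt live snapshots
    disable_libvirt_livesnapshot = False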
