I have a couple of sizing questions for the users and developers. I hope you don't mind answering them.

What is the guideline for the maximum reasonable size of a DataTree that a single ZK server can manage? If a ZK server writes out a snapshot of about 1 GB in size, is it pushed beyond its limits or is it still manageable? If so, where is the critical threshold at which ZK is really being abused?

Similarly, how can I estimate the propagation delay of a change across an ensemble of three ZK servers?

Thank you,
/Sergey

Here is my take on these issues; others feel free to add or correct.

1. It depends on how much RAM your machine has. The snapshot should be smaller than the available RAM, since everything is loaded into memory.
2. It depends on the availability guarantee that the client needs. If there is a leader election, every machine needs to reload the data from disk, so the quorum will be down for at least as long as the snapshot loading time. The session timeout on the client side should be longer than the expected downtime during leader election.

-- Thawan Kooburat
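A rough way to reason about both points above is sketched below in Python. This is not from the thread: the function name, the `heap_overhead` expansion factor, and the example numbers are assumptions for illustration only. The load time is a lower bound, since deserialization and transaction-log replay are ignored.

```python
def estimate_snapshot_load(snapshot_bytes, heap_bytes,
                           disk_read_bytes_per_s, heap_overhead=2.0):
    """Return (fits_in_heap, min_load_seconds) for one server.

    heap_overhead is a hypothetical factor for how much larger the
    in-memory tree is than the serialized snapshot; measure it on a
    real deployment rather than trusting this default.
    """
    in_memory_bytes = snapshot_bytes * heap_overhead
    fits_in_heap = in_memory_bytes < heap_bytes
    # Reading the file sequentially is the floor on reload time, and
    # therefore on quorum downtime after a leader election.
    min_load_seconds = snapshot_bytes / disk_read_bytes_per_s
    return fits_in_heap, min_load_seconds


# Example: 1 GiB snapshot, 4 GiB heap, disk reading at 100 MB/s (all assumed).
fits, seconds = estimate_snapshot_load(
    snapshot_bytes=1 * 1024**3,
    heap_bytes=4 * 1024**3,
    disk_read_bytes_per_s=100e6,
)
print(f"fits in heap: {fits}, minimum reload time: {seconds:.1f} s")
```

With these assumed numbers the snapshot fits, and the reload floor is on the order of ten seconds per server.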

On 7/15/13 8:46 PM, "Sergey Maslyakov" <[EMAIL PROTECTED]> wrote:

> I have a couple of sizing questions to the users and developers. Hope, you
> don't mind answering those.
>
> What is the guideline for the maximum reasonable size of a DataTree that a
> single ZK server can manage? If ZK server writes out a snapshot of about
> 1GB in size, is it pushed beyond the limits or is it still manageable? If
> so, where is the critical threshold when ZK is really being abused?
>
> Similarly, how can I estimate the propagation delay of a change across an
> ensemble of three ZK servers?
>
> Thank you,
> /Sergey


And another extension on top of Kishore's question: do re-elections happen if the previously elected leader remains in the cluster? In other words, what events can trigger a re-election and the corresponding temporary degradation of the service provided by ZooKeeper?

Thank you,
/Sergey

On Tue, Jul 16, 2013 at 2:21 AM, kishore g <[EMAIL PROTECTED]> wrote:

> Regarding #2. Is that really true that during leader election every machine
> reloads snapshot data from disk? Any reason why this is needed unless it
> really needs to truncate or undo conflicting transactions already applied?

The disk state should be the authoritative state of a server, so, if I remember correctly, we load the database as a way of validating the disk state. I don't claim that this is strictly necessary, but if we are to change it, I would need to think it through.

About leader election: if a leader loses support from a quorum of followers, it will drop leadership. Any event that causes a follower to stop receiving messages from the leader, or to disconnect from the leader, will make it stop supporting the current leader.

-Flavio
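The quorum condition described above can be illustrated with a small Python sketch. It assumes simple majority quorums and counts the leader as supporting itself; the function names are hypothetical and this is not ZooKeeper's actual code.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def leader_keeps_leadership(ensemble_size, supporting_followers):
    """True if the leader plus its supporting followers still form a quorum.

    supporting_followers counts followers that are still connected to,
    and receiving messages from, the current leader.
    """
    return supporting_followers + 1 >= quorum_size(ensemble_size)


# Three-server ensemble: the leader survives losing one follower, not two.
print(leader_keeps_leadership(3, 1))
print(leader_keeps_leadership(3, 0))
```

In a three-server ensemble the leader therefore tolerates one disconnected follower; losing both triggers a new election.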


The synchronization phase is part of the protocol and we use it to guarantee that we expose a consistent view of the state. During the synchronization phase, servers do not accept requests.

Which behavior are you proposing we change, Kishore?

-Flavio

On Jul 16, 2013, at 7:04 PM, kishore g <[EMAIL PROTECTED]> wrote:

> Thanks for the clarification, Flavio. Does this mean that during leader
> election both reads and writes are not supported? Do we start a separate
> thread/JIRA on changing this behavior?
>
> thanks,
> Kishore G

All servers in the quorum reading the snapshot from disk as part of the synchronization phase. From Thawan's email, it looks like whenever there is a leader election, all ZK servers read the snapshot from disk. I am not sure why all servers should reload the snapshot from disk, as this increases unavailability time.

On Tue, Jul 16, 2013 at 12:35 PM, Flavio Junqueira <[EMAIL PROTECTED]> wrote:

> The synchronization phase is part of the protocol and we use it to
> guarantee that we expose a consistent view of the state. During the
> synchronization phase, servers do not accept requests.
>
> Which behavior are you proposing we change, Kishore?
>
> -Flavio


Thanks, Thawan. Another follow-up question: let's say client c1 is connected to the leader and the leader fails. Now c1 is trying to connect to another ZK server, but all servers are busy loading the snapshot, which can take a minute or two. According to Flavio, ZK servers don't accept any requests during synchronization, but most clients don't use that high a connection timeout. So does this mean clients will time out on connection? Is my understanding correct, or will ZK servers accept connection requests but reject read/write requests?

A client will get a session expired event only when a server explicitly tells the client. So any established sessions will remain in a disconnected state during this period.

So my comment about the need for a longer session timeout might be incorrect. While the quorum is down during leader election, sessions won't expire. When the quorum comes back, the client has to reconnect within the session timeout in order to resume the session. However, the client won't be able to issue any read/write requests or create a new session while the quorum is down.

However, some applications may need a stronger consistency guarantee. They will have special logic to abort the client if it has been disconnected for an extended period. This is because the client can't tell whether the quorum is down or there is a network partition between the client and the quorum.

-- Thawan Kooburat
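The "abort after an extended disconnect" logic mentioned above might look roughly like the following Python sketch. It is deliberately independent of any ZooKeeper client library: the class name, method names, and the 30-second threshold are all hypothetical, and an application would call `on_connected` / `on_disconnected` from whatever connection-state callback its client provides.

```python
import time


class DisconnectWatchdog:
    """Tracks how long a client has been disconnected from the ensemble."""

    def __init__(self, max_disconnected_s, clock=time.monotonic):
        self.max_disconnected_s = max_disconnected_s
        self._clock = clock  # injectable so the logic can be tested
        self._disconnected_since = None

    def on_connected(self):
        # Session resumed: forget the previous outage.
        self._disconnected_since = None

    def on_disconnected(self):
        # Record only the start of the outage, not repeated notifications.
        if self._disconnected_since is None:
            self._disconnected_since = self._clock()

    def should_abort(self):
        # The client cannot distinguish "quorum down" from "partitioned",
        # so past the threshold it gives up rather than act on stale state.
        if self._disconnected_since is None:
            return False
        elapsed = self._clock() - self._disconnected_since
        return elapsed > self.max_disconnected_s


watchdog = DisconnectWatchdog(max_disconnected_s=30.0)
watchdog.on_disconnected()
print(watchdog.should_abort())  # False immediately after the disconnect
```

The threshold is an application-level choice and is separate from the ZooKeeper session timeout.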

On 7/16/13 6:46 PM, "kishore g" <[EMAIL PROTECTED]> wrote:

> Thanks Thawan. Another question to follow up, so lets say client c1 is
> connected to leader and leader fails. Now c1 is trying to connect to
> another zk server but all servers are busy loading snapshot and can take a
> minute or two. According to Flavio zk servers dont accept any request while
> synchronization, but most clients dont keep that high connection timeout.
> So does this mean clients will timeout on connection?. Is my understanding
> correct or zk servers will accept connection requests but reject read/write
> requests.
>
> thanks,
> Kishore G
>
> On Tue, Jul 16, 2013 at 3:45 PM, Thawan Kooburat <[EMAIL PROTECTED]> wrote:
>
>> There is a plan to work on this optimization ZOOKEEPER-1674.
>>
>> --
>> Thawan Kooburat
