Bitcoin can't copy every innovation, especially if that innovation turns all the existing mining farms into useless door stops. The vested interests in Bitcoin could never agree to copy radical block chain overhaul. Also that new design would already have the first mover advantage with millions more users than Bitcoin. Bitcoin can't scale so it won't have those millions of users to start with.

Cuckoo Cycle doesn't rely on cryptographic security of its underlying hash function,and Dan Bernstein himself attested to its use in Cuckoo Cycle being perfectly sound.

Quote

I also expect that given Cuckoo Cycle is parallelizable

No need to expect. This is clearly stated in the white paper.

Quote

necessary to make Cuckoo Cycle's proving time reasonable for instant transactions

The white paper also explains that Cuckoo Cycle is not suitable for short block intervals,and works best for intervals even longer than bitcoin's. Instant transactions are best handled with off-chain payment channels,reserving the main chain for settlement.

Quote

tromp thinks that Cuckoo Cycle has very low computation relative to memory latency. But this depends on only one 32-bit bucket being read in each memory page/row. With massive number of h/w threads, can get them all synced up (or sorted) so that we read 1024 of 32-bit buckets in one memory latency cycle

Cuckoo Cycle randomly accesses 2-bit counters, not 32-bit buckets.In one round of edge trimming, Cuckoo Cycle typically updates 2^29 countersspread over 2^8 memory banks. Each memory bank thus holds 2^21 counters,only 2^14 of which typically fit in a single row. This is where the latency comes in.To avoid latency, you could increase the number of memory banks by a factor of 2^7,but this similarly increases die area, hardware costs, and energy usage.

Alternatively, the algorithm parameter PART_BITS allows for a reduction in the numberof counters in use at the same time, which is what your proposal essentially amounts to.Setting this to 21 will require only 2^8 counters,one per memory bank. But now your hash computations have increased by a factor 2^21,over 2 million.

Both of these approaches are rather cost prohibitive. You might as well avoid DRAM latencyby using SRAM instead. But that is also two orders of magnitude more expensive, while theperformance increase is closer to one order of magnitude.

Cuckoo Cycle may not be ASIC-proof, but it certainly seems to resist...

tromp thinks that Cuckoo Cycle has very low computation relative to memory latency. But this depends on only one 32-bit bucket being read in each memory page/row. With massive number of h/w threads, can get them all synced up (or sorted) so that we read 1024 of 32-bit buckets in one memory latency cycle

Cuckoo Cycle randomly accesses 2-bit counters, not 32-bit buckets.

Are you referring to the latest version of your code, because I am referring to the code that was in the Appendix of your December 31. 2014February 1, 2014 white paper which references int buckets.

In one round of edge trimming, Cuckoo Cycle typically updates 2^29 countersspread over 2^8 memory banks. Each memory bank thus holds 2^21 counters,only 2^14 of which typically fit in a single row. This is where the latency comes in.To avoid latency, you could increase the number of memory banks by a factor of 2^7,but this similarly increases die area, hardware costs, and energy usage.

Or increase the number of h/w threads to 2^15 (2^8 multiplied by 2^21 bits for 2-bit counters divided by 2^14 bits per row) which have some efficient h/w mechanism for waiting to be synchronized on a memory page/row.

Alternatively, the algorithm parameter PART_BITS allows for a reduction in the numberof counters in use at the same time, which is what your proposal essentially amounts to.Setting this to 21 will require only 2^8 counters,one per memory bank. But now your hash computations have increased by a factor 2^21,over 2 million.

No that is not equivalent to increasing the number of h/w threads and syncing them to pause until 2^13 of them are queued up to read a shared memory page/row.

Both of these approaches are rather cost prohibitive. You might as well avoid DRAM latencyby using SRAM instead. But that is also two orders of magnitude more expensive, while theperformance increase is closer to one order of magnitude.

Btw, electrical power cost of SRAM is only one order-of-magnitude increase:

Cuckoo Cycle may not be ASIC-proof, but it certainly seems to resist...

So far, I disagree. I appreciate your discussion, because I want to make sure my analysis is correct. I also have to strike a balance between full publishing and secrecy, because unlike you, I have objectives around releasing a Bitcoin killer altcoin.

In one round of edge trimming, Cuckoo Cycle typically updates 2^29 countersspread over 2^8 memory banks. Each memory bank thus holds 2^21 counters,only 2^14 of which typically fit in a single row. This is where the latency comes in.To avoid latency, you could increase the number of memory banks by a factor of 2^7,but this similarly increases die area, hardware costs, and energy usage.

Or increase the number of h/w threads to 2^15 (2^8 multiplied by 2^21 bits for 2-bit counters divided by 2^14 bits per row) which have some efficient h/w mechanism for waiting to be synchronized on a memory page/row.

I don't see why you multiply with the number of rows that Cuckoo uses per memory bank.Your 2^15 threads will each hash some nonce and then most will want to access a unique row.Then what? There's little to be synchronized.

Alternatively, the algorithm parameter PART_BITS allows for a reduction in the numberof counters in use at the same time, which is what your proposal essentially amounts to.Setting this to 21 will require only 2^8 counters,one per memory bank. But now your hash computations have increased by a factor 2^21,over 2 million.

No that is not equivalent to increasing the number of h/w threads and syncing them to pause until 2^13 of them are queued up to read a shared memory page/row.

In my code a thread is something that processes a sequence of nonces, hashing each one, andthen atomically updating a corresponding counter, without any coordination.

Are you proposing a thread for each nonce? Half a billion threads?I really don't understand what you are proposing and how this syncing is supposed to be implemented.

Quote

Btw, electrical power cost of SRAM is only one order-of-magnitude increase:

Cuckoo Cycle doesn't rely on cryptographic security of its underlying hash function,and Dan Bernstein himself attested to its use in Cuckoo Cycle being perfectly sound.

Please provide me a copy of that discussion, because I believe I have already shown it is potentially unsound in some logic I have written down and not yet revealed.

I spoke to him in person at BITCOIN'15 where I presented Cuckoo Cycle.

Did he consider (especially parallelized) multi-collision attacks and their impact on the potential to find an algorithm that can find cycles faster? Can you paraphrase what he said or the gist of his reasoning?

The basic issue is that can we guarantee that patterns will not leak through Siphash because the key is controllable by the attacker? By not using a general cryptographic hash function, there is not the setup overhead of randomizing the impact (or lack thereof) of a key, but I think this is very risky. If not necessary to take this risk, then why do it. I understand you think you need to minimize computation for ASIC resistance but I don't think that is the wise place to remove computation from the assurances on the Cuckoo Cycle PoW.

For example, look at the potential cryptanalysis break found just recently on break 256-bit random numbers by reducing the search space by eliminating certain patterns (although this is probably a hoax):

I am intentionally referring to the earlier version of your algorithm, to avoid all the source code noise implementing multi-threading. Also apparently you added some TMTO tradeoffs in the latter version of your code which I presume adds more complexity to the code.

Once we establish what we are referring to, then I can better respond to your other points. I will await first your reply.

Cuckoo Cycle doesn't rely on cryptographic security of its underlying hash function,and Dan Bernstein himself attested to its use in Cuckoo Cycle being perfectly sound.

Please provide me a copy of that discussion, because I believe I have already shown it is potentially unsound in some logic I have written down and not yet revealed.

I spoke to him in person at BITCOIN'15 where I presented Cuckoo Cycle.

Did he consider (especially parallelized) multi-collision attacks and their impact on the potential to find an algorithm that can find cycles faster? Can you paraphrase what he said or the gist of his reasoning?

Attacks on hash functions are mostly about doing tons of computation (but less than try all inputs)to try invert its outputs. Siphash may not stand up to that. But to think that a weakness in inversionresistance can make finding cycles of length 42 in a graph defined by millions or billions of siphashoutputs easier is just... utterly inconceivable. That's the gist of it.

Quote

I think you are referring to something entirely unrelated to what I was referring to. I see int here in the code I am referring to:

Cuckoo Cycle doesn't rely on cryptographic security of its underlying hash function,and Dan Bernstein himself attested to its use in Cuckoo Cycle being perfectly sound.

Please provide me a copy of that discussion, because I believe I have already shown it is potentially unsound in some logic I have written down and not yet revealed.

I spoke to him in person at BITCOIN'15 where I presented Cuckoo Cycle.

Did he consider (especially parallelized) multi-collision attacks and their impact on the potential to find an algorithm that can find cycles faster? Can you paraphrase what he said or the gist of his reasoning?

Attacks on hash functions are mostly about doing tons of computation (but less than try all inputs)to try invert its outputs. Siphash may not stand up to that. But to think that a weakness in inversionresistance can make finding cycles of length 42 in a graph defined by millions or billions of siphashoutputs easier is just... utterly inconceivable. That's the gist of it.

Please distinguish preimage attacks from multi-collision attacks. If we can find certain characteristics in the hash function which are not perfectly random and are repeatable relationship with certain changes of bits in the key from hash-to-hash, then we can perhaps impact the cycling in the Cuckoo. Cycle forming or the way we optimize detecting them, may be impacting by certain patterning away from a perfect Random Oracle.

Think of a differential cryptanalysis where we look for the way a hash changes over differentials from one hash to the next. And also think of the parallelized rho attack. The weakness in Cuckcoo w.r.t. to hashing it is isn't doing one hash, but millions of them and so the dynamic randomness of correlation over hash differentials also has to be considered.

I am not comfortable with cheating margins of security on the hash function. Sorry. I think Zooko @ Zcash was correct to be concerned even though he didn't express what sort of attack might concern him. And remember he worked on the Blake2s cryptographic hash function but I understand he isn't a cryptographer per se (and neither am I). Crypto-currency is also about being able to trust the cryptography is strong and sound. I don't see the advantage of cheating on the hash function used in the Cuckoo Cycle, especially given I don't think you attain ASIC resistance any way, which was the entire motivation for replacing SHA256 with Siphash in order to lower the computation to memory consumption ratio.

You are apparently thinking of inversion and not about what matters. I doubt Dan really thought deeply about what he was commenting on off-the-cuff.

And Btw, you write about in our past discussions and also in your white paper (or blog) about 1/3 computation and 2/3 memory accesses, but this is execution time, not energy cost. The power cost of loading a row buffer on DRAM is very cheap because the circuits are much more dense and the page is localized. The cost of computation on a CPU is very costly because the circuits are spread out all over the chip and not as dense as 1 transistor per bit of DRAM, thus the parasitic capacitance and resistance of the interconnects is an order-of-magnitude or more greater. I think it is a losing strategy to try to say 1 hash per 1 memory access (meaning you expect 1 memory access to load a new memory row) is going to only result in a 1/3 speedup for the ASIC. And besides below I will explain how I think the assumption of 1 hash per memory row buffer load can I think be avoided by the ASIC running many h/w threads.

I was avoiding studying the edge trimming code and was focusing on the "basic algorithm".

Any way, if the edge trimming is done after running enumerating all possible nonces, then it is not a heuristic and is an optimization to reduce memory consumption by eliminating all the edges which can't be in a cycle and then I presume you re-enumerate the nonces on the "basic algorithm" but use a hash table to represent a sparse array so you don't have to store all the buckets from the more naive version of the basic algorithm.

So it is reducing memory consumption and I presume eliminating the wasted code paths for walking leaf edges in the final basic algorithm.

Btw, you may not remember that AnonyMint or TPTB_need_war wrote to you a long time ago (and I think perhaps before Dave Andersen did) that you could compress and count with bit flags. So you should credit Shelby too, lol (but Dave wrote a much more detailed exposition). Do I need to go find that post on BCT where AnonyMint or TPTB_need_war told you that? I remember reading it again this week when I was doing my deeper research on Cuckoo Cycle.

In any case, it seems it doesn't change my theory about the lack of ASIC resistance. Rather it just reduces the size of the data being randomly read/write from 32-bits to 2-bits for the majority of running time.

Also you do note that the edge trimming is applicable for the case where many/most of the edges will leaf nodes, which is the case where M/N < 1/2. You may think that is acceptable, because your intended usage is apparently to run Cuckoo at that load. I note it for reasons I will not reveal now.

Okay so now we are on the same page, so I can proceed to reply to what you had written before...

In one round of edge trimming, Cuckoo Cycle typically updates 2^29 countersspread over 2^8 memory banks. Each memory bank thus holds 2^21 counters,only 2^14 of which typically fit in a single row. This is where the latency comes in.To avoid latency, you could increase the number of memory banks by a factor of 2^7,but this similarly increases die area, hardware costs, and energy usage.

Or increase the number of h/w threads to 2^15 (2^8 multiplied by 2^21 bits for 2-bit counters divided by 2^14 bits per row) which have some efficient h/w mechanism for waiting to be synchronized on a memory page/row.

I don't see why you multiply with the number of rows that Cuckoo uses per memory bank.Your 2^15 threads will each hash some nonce and then most will want to access a unique row.Then what? There's little to be synchronized.

The point is that if you have enough threads running, the likelihood is there are 2^12 threads wanting to read from the same row, thus your algorithm is no longer bounded on memory latency rather on memory bandwidth and moreover that there isn't 1 hash per memory row buffer load but instead 2^12 hashes, and thus actually your algorithm is computation bound and will be 1000s times more power efficient on the ASIC.

I suspect if the GPU is running at memory bandwidth, then it may be getting multiple hashes per row buffer load and thus more power efficient than the CPU with only 8 threads/cores. CPU h/w threads may not be coalesced to sync on row buffer loads either (and I am not sure if they are on the GPU).

Alternatively, the algorithm parameter PART_BITS allows for a reduction in the numberof counters in use at the same time, which is what your proposal essentially amounts to.Setting this to 21 will require only 2^8 counters,one per memory bank. But now your hash computations have increased by a factor 2^21,over 2 million.

No that is not equivalent to increasing the number of h/w threads and syncing them to pause until 2^13 of them are queued up to read a shared memory page/row.

In my code a thread is something that processes a sequence of nonces, hashing each one, andthen atomically updating a corresponding counter, without any coordination.

Are you proposing a thread for each nonce? Half a billion threads?I really don't understand what you are proposing and how this syncing is supposed to be implemented.

It doesn't matter that the thread proceeds to do something else after reading and updating the 2-bit count corresponding to a nonce. It only matters that enough h/w threads are running to enable syncing many reads on the same row buffer load without stalling all the threads and gridlock.

The ASIC will need some custom timing and syncing logic. I presume stalled threads (waiting for the other threads to coalesce on each row buffer load) don't cost much in terms of power consumption when implemented with custom circuitry on the ASIC.

I think you really didn't understand what Bill Cox was trying to explain.

Cuckoo Cycle doesn't rely on cryptographic security of its underlying hash function,and Dan Bernstein himself attested to its use in Cuckoo Cycle being perfectly sound.

Please provide me a copy of that discussion, because I believe I have already shown it is potentially unsound in some logic I have written down and not yet revealed.

I spoke to him in person at BITCOIN'15 where I presented Cuckoo Cycle.

Did he consider (especially parallelized) multi-collision attacks and their impact on the potential to find an algorithm that can find cycles faster? Can you paraphrase what he said or the gist of his reasoning?

Attacks on hash functions are mostly about doing tons of computation (but less than try all inputs)to try invert its outputs. Siphash may not stand up to that. But to think that a weakness in inversionresistance can make finding cycles of length 42 in a graph defined by millions or billions of siphashoutputs easier is just... utterly inconceivable. That's the gist of it.

Please distinguish preimage attacks from multi-collision attacks. If we can find certain characteristics in the hash function which are not perfectly random and are repeatable relationship with certain changes of bits in the key from hash-to-hash, then we can perhaps impact the cycling in the Cuckoo. Cycle forming or the way we optimize detecting them, may be impacting by certain patterning away from a perfect Random Oracle.

Think of a differential cryptanalysis where we look for the way a hash changes over differentials from one hash to the next. And also think of the parallelized rho attack. The weakness in Cuckcoo w.r.t. to hashing it is isn't doing one hash, but millions of them and so the dynamic randomness of correlation over hash differentials also has to be considered.

I am not comfortable with cheating margins of security on the hash function. Sorry. I think Zooko @ Zcash was correct to be concerned even though he didn't express what sort of attack might concern him. And remember he worked on the Blake2s cryptographic hash function but I understand he isn't a cryptographer per se (and neither am I). Crypto-currency is also about being able to trust the cryptography is strong and sound. I don't see the advantage of cheating on the hash function used in the Cuckoo Cycle, especially given I don't think you attain ASIC resistance any way, which was the entire motivation for replacing SHA256 with Siphash in order to lower the computation to memory consumption ratio.

You are apparently thinking of inversion and not about what matters. I doubt Dan really thought deeply about what he was commenting on off-the-cuff.

And Btw, you write about in our past discussions and also in your white paper (or blog) about 1/3 computation and 2/3 memory accesses, but this is execution time, not energy cost. The power cost of loading a row buffer on DRAM is very cheap because the circuits are much more dense and the page is localized. The cost of computation on a CPU is very costly because the circuits are spread out all over the chip and not as dense as 1 transistor per bit of DRAM. I think it is a losing strategy to try to say 1 hash per 1 memory access (meaning you expect 1 memory access to load a new memory row) is going to only result in a 1/3 speedup for the ASIC. And besides below I will explain how I think the assumption of 1 hash per memory row buffer load can I think be avoided by the ASIC running many h/w threads.

I was avoiding studying the edge trimming code and was focusing on the "basic algorithm".

Any way, if the edge trimming is done after running enumerating all possible nonces, then it is not a heuristic and is an optimization to reduce memory consumption by eliminating all the edges which can't be in a cycle and then I presume you re-enumerate the nonces on the "basic algorithm" but use a hash table to represent a sparse array so you don't have to store all the buckets from the more naive version of the basic algorithm.

So it is reducing memory consumption and I presume eliminating the wasted code paths for walking leaf edges in the final basic algorithm.

Btw, you may not remember that AnonyMint or TPTB_need_war wrote to you a long time ago (and I think perhaps before Dave Andersen did) that you could compress and count with bit flags. So you should credit Shelby too, lol (but Dave wrote a much more detailed exposition). Do I need to go find that post on BCT where AnonyMint or TPTB_need_war told you that? I remember reading it again this week when I was doing my deeper research on Cuckoo Cycle.

In any case, it seems it doesn't change my theory about the lack of ASIC resistance. Rather it just reduces the size of the data being randomly read/write from 32-bits to 2-bits for the majority of running time.

Also you do note that the edge trimming is applicable for the case where many/most of the edges will leaf nodes, which is the case where M/N < 1/2. You may think that is acceptable, because your intended usage is apparently to run Cuckoo at that load. I note it for reasons I will not reveal now.

Okay so now we are on the same page, so I can proceed to reply to what you had written before...

In one round of edge trimming, Cuckoo Cycle typically updates 2^29 countersspread over 2^8 memory banks. Each memory bank thus holds 2^21 counters,only 2^14 of which typically fit in a single row. This is where the latency comes in.To avoid latency, you could increase the number of memory banks by a factor of 2^7,but this similarly increases die area, hardware costs, and energy usage.

Or increase the number of h/w threads to 2^15 (2^8 multiplied by 2^21 bits for 2-bit counters divided by 2^14 bits per row) which have some efficient h/w mechanism for waiting to be synchronized on a memory page/row.

The point is that if you have enough threads running, the likelihood is there are 2^12 threads wanting to read from the same row, thus your algorithm is no longer bounded on memory latency rather on memory bandwidth and moreover that there isn't 1 hash per memory row buffer load but instead 2^12 hashes, and thus actually your algorithm is computation bound and will be 1000s times more power efficient on the ASIC.

I don't see why you multiply with the number of rows that Cuckoo uses per memory bank.Your 2^15 threads will each hash some nonce and then most will want to access a unique row.Then what? There's little to be synchronized.

Alternatively, the algorithm parameter PART_BITS allows for a reduction in the numberof counters in use at the same time, which is what your proposal essentially amounts to.Setting this to 21 will require only 2^8 counters,one per memory bank. But now your hash computations have increased by a factor 2^21,over 2 million.

No that is not equivalent to increasing the number of h/w threads and syncing them to pause until 2^13 of them are queued up to read a shared memory page/row.

In my code a thread is something that processes a sequence of nonces, hashing each one, andthen atomically updating a corresponding counter, without any coordination.

Are you proposing a thread for each nonce? Half a billion threads?I really don't understand what you are proposing and how this syncing is supposed to be implemented.

It doesn't matter that the thread proceeds to do something else after read a nonce. It only matters that enough h/w threads are running to enable syncing many reads on the same row buffer load without stalling all the threads and gridlock.

The ASIC will need some custom timing and syncing logic.

I think you really didn't understand what Bill Cox was trying to explain.

Attacks on hash functions are mostly about doing tons of computation (but less than try all inputs)to try invert its outputs. Siphash may not stand up to that. But to think that a weakness in inversionresistance can make finding cycles of length 42 in a graph defined by millions or billions of siphashoutputs easier is just... utterly inconceivable. That's the gist of it.

Please distinguish preimage attacks from multi-collision attacks. If we can find certain characteristics in the hash function which are not perfectly random and are repeatable relationship with certain changes of bits in the key from hash-to-hash, then we can perhaps impact the cycling in the Cuckoo. Cycle forming or the way we optimize detecting them, may be impacting by certain patterning away from a perfect Random Oracle.

My comment applies similarly to multi-collision attacks. I believe detecting deviationsfrom perfect randomness to be not only way too expensive but more importantly to die out exponentiallyfast as the cycle grows.

Quote

I am not comfortable with cheating margins of security on the hash function. Sorry.

I know you and Zooko disagree with me and Dan on this.You feel such attacks are conceivable. I feel such attacks are quite inconceivable.Clearly we're not going to convince each other. So let's just agree to disagree.People are more than welcome to run Cuckoo Cycle with blake2b replacing siphashif they feel that trade off is worthwhile. It's just not something I recommend.

Quote

Any way, if the edge trimming is done after running enumerating all possible nonces, then it is not a heuristic and is an optimization to reduce memory consumption by eliminating all the edges which can't be in a cycle and then I presume you re-enumerate the nonces on the "basic algorithm" but use a hash table to represent a sparse array so you don't have to store all the buckets from the more naive version of the basic algorithm.

The basic algorithm doesn't use buckets. Just a cuckoo hash table.In the edge trimming case, only on the order of 1% of edges remains, andthat table is a little more complicated to translate the sparseness into memory savings.But still no buckets.

Quote

Btw, you may not remember that AnonyMint or TPTB_need_war wrote to you a long time ago (and I think perhaps before Dave Andersen did) that you could compress and count with bit flags. So you should credit Shelby too, lol (but Dave wrote a much more detailed exposition). Do I need to go find that post on BCT where AnonyMint or TPTB_need_war told you that? I remember reading it again this week when I was doing my deeper research on Cuckoo Cycle.

Sure, a URL would help.

Quote

Also you do note that the edge trimming is applicable for the case where many/most of the edges will leaf nodes, which is the case where M/N < 1/2. You may think that is acceptable, because your intended usage is apparently to run Cuckoo at that load. I note it for reasons I will not reveal now.

A much earlier version used edge fractions lower than 1/2 as a difficulty control, butI abandoned it since there a relationship between edge fraction and cycle probability is stronglynon-linear, making dynamic difficulty control practically impossible.

Quote

Okay so now we are on the same page, so I can proceed to reply to what you had written before...

Gonna visit the local chess club now. Will read and respond to that later...

Attacks on hash functions are mostly about doing tons of computation (but less than try all inputs)to try invert its outputs. Siphash may not stand up to that. But to think that a weakness in inversionresistance can make finding cycles of length 42 in a graph defined by millions or billions of siphashoutputs easier is just... utterly inconceivable. That's the gist of it.

Please distinguish preimage attacks from multi-collision attacks. If we can find certain characteristics in the hash function which are not perfectly random and are repeatable relationship with certain changes of bits in the key from hash-to-hash, then we can perhaps impact the cycling in the Cuckoo. Cycle forming or the way we optimize detecting them, may be impacting by certain patterning away from a perfect Random Oracle.

My comment applies similarly to multi-collision attacks. I believe detecting deviations from perfect randomness to be not only way too expensive but more importantly to die out exponentially fast as the cycle grows.

Except if the cycle is growing in a patterned way and that is part of the algorithm of the attack, i.e. deviates from the algorithm you use but is still able to provide the cycles within the maximum nonce count (and does so with some advantage in performance obviously else there is no point to it).

I think it is naive for you to assume that a random walk becomes more convoluted as cycle length grows when the entire risk is about whether the uniform randomness could be subverted.

I have a very creative imagination, which is why I am often able to come up with things that others don't think of.

There are so many permutations of possibilities that you sweep away with a generalized notion of random walking which may not hold when the random oracle assumption is subverted.

It may be that there is some differential analysis or distinguisher (even just bias in even one of the bits between relative outputs to relative inputs between multiple hashes) that can be precomputed and amortized over all the hashes, because it is a table of differentials of the way the hash behaves over huge number of invocations where the spread between the enumeration of the space is smaller due to the large number of hashes computed. There are many different angles to analyse because we are not just considering one hash, but millions of them for one Cuckoo Cycle solution.

Of course I respect Dan Berstein as a much smarter cryptographer than I am, but unless I am reading that he has written, I am hesitant to be convinced he has considered all of what I am getting at just from your second hand summary of what he said verbally at a conference.

I am not comfortable with cheating margins of security on the hash function. Sorry.

I know you and Zooko disagree with me and Dan on this. You feel such attacks are conceivable. I feel such attacks are quite inconceivable. Clearly we're not going to convince each other. So let's just agree to disagree.People are more than welcome to run Cuckoo Cycle with blake2b replacing siphash if they feel that trade off is worthwhile. It's just not something I recommend.

I am just arguing why not be safer than sorry when so much is at stake? Why would you recommend adding unnecessary risk even where you think you are omniscient and there is no risk?

Any way, if the edge trimming is done after running enumerating all possible nonces, then it is not a heuristic and is an optimization to reduce memory consumption by eliminating all the edges which can't be in a cycle and then I presume you re-enumerate the nonces on the "basic algorithm" but use a hash table to represent a sparse array so you don't have to store all the buckets from the more naive version of the basic algorithm.

The basic algorithm doesn't use buckets. Just a cuckoo hash table. In the edge trimming case, only on the order of 1% of edges remains, and that table is a little more complicated to translate the sparseness into memory savings. But still no buckets.

I don't see what you wrote that is different than what I wrote. You focus too literally on the use of the work 'buckets'. What I am saying is that the edge trimming eliminates most of the nodes from consideration and then you have a hash table representing a sparse array for the remainder of the nodes which aren't the pruned/trimmed leaf edges. Those remaining nodes were "buckets" in a regular array in the former untrimmed basic algorithm.

Btw, you may not remember that AnonyMint or TPTB_need_war wrote to you a long time ago (and I think perhaps before Dave Andersen did) that you could compress and count with bit flags. So you should credit Shelby too, lol (but Dave wrote a much more detailed exposition). Do I need to go find that post on BCT where AnonyMint or TPTB_need_war told you that? I remember reading it again this week when I was doing my deeper research on Cuckoo Cycle.

Note that was quoting a document that AnonyMint wrote I believe in late 2013 or early 2014. But he didn't publish all of that document until January 2016. Note he had published excerpts of that document on BCT when Monero was first released and AnonyMint was having an argument with the guy who optimized Monero's hash function. So we could go correlate that if we wanted to prove it. Any way, I am fine with David getting the credit, as he developed the concept more than AnonyMint did.

Also you do note that the edge trimming is applicable for the case where many/most of the edges will leaf nodes, which is the case where M/N < 1/2. You may think that is acceptable, because your intended usage is apparently to run Cuckoo at that load. I note it for reasons I will not reveal now.

A much earlier version used edge fractions lower than 1/2 as a difficulty control, but I abandoned it since there a relationship between edge fraction and cycle probability is strongly non-linear, making dynamic difficulty control practically impossible.

Yeah I can see that dynamic difficulty control gets messed up if deviate too far from 1/2. But difficulty can also be controlled externally by hashing the result with SHA256, which you also noted in your paper.

The point is that if you have enough threads running, the likelihood is there are 2^12 threads wanting to read from the same row, thus your algorithm is no longer bounded on memory latency rather on memory bandwidth and moreover that there isn't 1 hash per memory row buffer load but instead 2^12 hashes, and thus actually your algorithm is computation bound and will be 1000s times more power efficient on the ASIC.

I stated earlier

Quote

To avoid latency, you could increase the number of memory banks by a factor of 2^7,but this similarly increases die area, hardware costs, and energy usage.

Your above proposal of simultaneously running millions of hardware threads and needing to sync themincurs much worse increases in die area (to the point of not fitting on any single die),hardware costs, and energy usage.

The point is that if you have enough threads running, the likelihood is there are 2^12 threads wanting to read from the same row, thus your algorithm is no longer bounded on memory latency rather on memory bandwidth and moreover that there isn't 1 hash per memory row buffer load but instead 2^12 hashes, and thus actually your algorithm is computation bound and will be 1000s times more power efficient on the ASIC.

I stated earlier

Quote

To avoid latency, you could increase the number of memory banks by a factor of 2^7,but this similarly increases die area, hardware costs, and energy usage.

Your above proposal of simultaneously running millions of hardware threads and needing to sync themincurs much worse increases in die area (to the point of not fitting on any single die),hardware costs, and energy usage.

My calculation was 2^15, that isn't a million.

Did you really design these threads already. I bet they can use a minimal number of transistors each by reusing some components for threads that are stalled.

Multiple CPUs can access the same memory. Perhaps the same could be done if multiple dies are needed.

Afaik, the hardware cost is rather insignificant relative to the relative power consumption efficiency cost when it comes to mining economics for ASIC farms.

I am just arguing why not be safer than sorry when so much is at stake? Why would you recommend adding unnecessary risk even where you think you are omniscient and there is no risk?

I only need good statistical properties of the hash function, not cryptographic security.In fact siphash-2-4 is already overkill for my purposes, with siphash-4-8 beingclose to cryptographically secure.Your attack is about as conceivable to me as P being equal to NP,in which case there is no cryptographically secure hash function anyway.

Quote

What I am saying is that the edge trimming eliminates most of the nodes from consideration and then you have a hash table representing a sparse array for the remainder of the nodes which aren't the pruned/trimmed leaf edges.

Yes, totally correct. But in my view the term "bucket" denotes a fixed size unordered container,and my current algorithms use no such thing.

My algorithms break down in that case, since the expected number of cycles becomes non-constant,causing the basic algorithm to overlook an increasing fraction of them,and edge-trimming fails altogether to remove a large fraction of edges.

I am just arguing why not be safer than sorry when so much is at stake? Why would you recommend adding unnecessary risk even where you think you are omniscient and there is no risk?

I only need good statistical properties of the hash function, not cryptographic security.

Pseudorandomness ("good statistical properties") is tied into the cryptographic security:

Quote from: Blake2 paper

BLAKE2 aims to provide the highest security level, be it in terms of classical notions as (second) preimage or collision resistance, or of theoretical notions as pseudorandomness (a.k.a. indistinguishability) or indifferentiability.

Relevant terms are 'indistinguishability' and 'indifferentiability'.

Siphash is sufficient for the use case it was designed for because the attacker doesn't know the input to the hash function. It is not the case that Siphash has been cryptanalysed for the case of your application of Siphash in Cuckoo Cycle wherein the attacker knows both the input and output of the hash function.

Note that every submission that Dan Berstein made for cryptographic hash, has always been rejected having breakage. He appears to not excel in that one area. His ARX rounds (from ChaCha symmetric encryption) have been redeployed in Blake2 by those who know how to make it cryptographically secure for the different application of a hash function. I admire Dan a lot. He knows a hell of a lot more than me about everything involving higher maths and cryptography.

In fact siphash-2-4 is already overkill for my purposes, with siphash-4-8 being close to cryptographically secure. Your attack is about as conceivable to me as P being equal to NP, in which case there is no cryptographically secure hash function anyway.

What I am saying is that the edge trimming eliminates most of the nodes from consideration and then you have a hash table representing a sparse array for the remainder of the nodes which aren't the pruned/trimmed leaf edges.

Yes, totally correct. But in my view the term "bucket" denotes a fixed size unordered container, and my current algorithms use no such thing.

Agreed 'bucket' would not be an ordered array element. That is a more precise use of the term 'bucket'. We programmers who started hacking since the 1970s and 1980s may be a little bit more loose (employing the general English definition) with our terminology (using 'bucket' to refer to an array element as an array bucket) as compared to someone who came up predominantly in an academic setting and especially someone employing technical English not as their native language wherein they would likely be more strict about definitions. I stand corrected on the more precise terminology.

My algorithms break down in that case, since the expected number of cycles becomes non-constant, causing the basic algorithm to overlook an increasing fraction of them, and edge-trimming fails altogether to remove a large fraction of edges.

Exactly. In a few months, you'll know what I was referring to when it is published.

You need millions to find 2^12 of them wanting to access a single row.

With 2^15 threads you expect 1 access per row.

Correct on 2^15 but not on the conclusion. We both agree the number of memory banks is irrelevant for this concern. Memory banks only determine how many row buffers are active simultaneously, which impacts maximum speed and whether we are memory bandwidth or memory latency bound on speed of execution.

So for the power consumption issue, if we have 2^29 counters (2-bits each) and 2^14 counters per memory page/row, then 2^16 (65536) h/w threads means we can stall threads (perhaps not permanently so as to account for variance outliers) and be roughly assured statistically of roughly 2 accesses per row. That already doubles the hash computation per memory page row.

Thus we don't need millions of h/w threads to convert the algorithm to computation bound on power consumption, and thus making it not ASIC resistant.

At 2^18 (262144) h/w threads we have 8 times more computation per memory page row than the CPU (and the computation will be orders-of-magnitude more power efficient on the ASIC).

A CUDA gpu can apparently do 671 million threads, although this are probably not synced on memory row buffers as we would need here, although the statistical spread might be sufficient without syncing. I think if you had the Kill-A-Watt meter what you would have observed is that by increasing threads, the speed remained topped out at memory bandwidth, but the power consumption of the GPU decreases as threads are increased beyond a certain threshold (assuming GPU threads are very efficient, but that might not be the case).

I don't think it is safe to assume that ASICs can't be designed to support millions of very efficient threads for this very customized computation and again sharing the computation transistors amongst only the threads that aren't stalled, thus not needing millions of instances of the compute units. The GPU may already be doing this, although maybe not since it is probably optimized for performance and not power consumption where there is a choice, although once it has hit memory bandwidth bound one might hope the GPU would be optimizing for minimizing power consumption by maximizing row buffer coalescing, but this is perhaps too costly to implement in the general case of GPU (but I am arguing not in the case of an ASIC with a specific custom computation circuit). Apparently GPU memory coalescing is not powerful enough to do what we require.

Also remember I need for instant transactions to have a very fast proving time, which thus means at 2^20 counters it could be trivially parallelized losing ASIC resistance with only thousands of threads.

Actually how ironic that the 2-bit counters instead of the original basic algorithm, makes the ASIC resistance worse, because more counters fit in a row. Remember I wrote this about "bit array" in my 2013 rough draft paper on memory hard proof-of-work.