The usual method (per RFC 1951) is to use a 3-byte hash to get started, then iterate through a chain of earlier 3-byte matches. However this can be slow if the chains become long. ZLib has heuristics (which I don't fully understand!) to shorten the search, RFC 1951 suggests "To avoid a worst-case situation, very long hash chains are arbitrarily truncated at a certain length, determined by a run-time parameter."

That all seems a little messy, so I decided to see if a "re-hashing" approach would work : suppose the string ABC is already recorded, and we encounter a new ABC, the new position of ABC will in future be used in preference to the old (position) for coding ABC (as it's "closer") so we re-hash the old position as a 4-byte hash instead. If there is already a matching 4-byte entry, then the lesser position becomes a 5-byte hash, etc.

Example: given input string ABCABCDABC, the hash dictionary updates
as follows ( each section delimited by ';' corresponds to the input position advancing by one byte ):

ZLib (on default settings) seems to be faster ( although I have not tried to fully optimise my code yet), but at the expense of less compression ( since truncating the search means it doesn't always find the longest LZ77 match ).

The logic is that since the limit on a match is MaxMatch (=258), there is no point hashing a string longer than that. This means it is now possible, albeit rather slow, to compress long repeat sequences ( which are quite common when compressing PDF images ).

I'm not very satisfied with this solution yet though, I think the algorithm could be modified to process long repeats efficiently and elegantly, but I haven't figured out how. Even with this fix, long repeats cause the processing to go much slower than normal - the outer loop of Rehash is executed nearly 258 times for each byte of input processed, when processing a long repeat.

Edit: I think skipping to the last MaxMatch bytes of a repeat sequence might work, as earlier potential matches will never be used ( again due to the MaxMatch limit ). I have not yet tried to code this.

Further Edit: After more investigation, my conclusion is that the re-hashing is not an effective approach. The cost appears to exceed any benefit, at least in the context of RFC 1951.

What seems to be more interesting is trying different block sizes, which seems to produce improved compression for relatively mimimal cost ( for example, for each block, starting with a small block size, and testing whether doubling the block size results in smaller output ).

@Svek 2019-01-11 11:43:38

I apologize in advance that this answer does not contribute much to the actual functions of your code; However, I firmly believe that writing code that is easy for others to read/understand is an important element of feedback from any code review.

(Aside) I admit it is a bit of a discipline thing (think of setting a table for dinner... The fork and knife shouldn't be upside down with the handles faced away from the diner -- it's not functionally "wrong" but you'll get a weird face from people when they find out you were responsible -- it doesn't conform to conventions).

Note: Don't do this with your code "as-is", since your Hash object is currently a private one -- you will have an accessibility problem. I just added this as an example of a property declaration since I noticed you had none.

Validating arguments.
This is something that you may or may not require in every method as sometimes this can get a little out of hand. However, for the case of a constructor and key (usually every publicly exposed) methods, I will state that you MUST do it.

Hmmmm... smelly code.

I am not familiar with RFC 1951, so I hesitate to dive into the efficiencies or accuracy of your code. But asides from some of the coding convention related issues I noted earlier, there are a few sections that look like bad / smelly code (again, I cannot be certain about the results due to my lack of knowledge with RFC 1951).

I'm sure others will tend to agree that this doesn't sit right. It's quite unusual to see a while(true) at all. To be blunt, it feels like a lazy design attempt, but the reason it stands out to me is because it is inside a method which name suggests a singular task / operation (Rehash). If the method name was something like RehashForeverAndEver then I would probably have glossed over this and just assumed the infinite loop was intended.

Use of public fields.
I'm looking at this private nested class Entry, that you have...

I generally hesitate when I see the use of public fields. Due to the -- let's call it -- stigma behind the use of them. It throws off some heavy code smell. I would carefully reconsider the design of this class.

I was about to note areas of possible optimizations with how you are handling the byte[] operations, but I'm going to hold off, as I think that this post is plenty long now. I encourage you to break down your code into small concise methods and re-post just the methods that you feel need code review (it would very likely yield you a more focused review and profitable feedback).

@Heslacher 2019-01-11 11:57:36

@Svek 2019-01-11 12:07:30

@Heslacher -- You're right. I am so used to seeing private fields being prefixed with an underscore in C# that I incorrectly made it appear like it was part of the official naming guideline. I hope these comments will serve as a footnote to my post.

@Rick Davin 2019-01-11 13:44:38

I would suggest Hash still be readonly with public Entry[] Hash { get; }

@AnotherGuy 2019-01-11 19:46:41

I would advice against omitting curly brackets in conditional statements. I guess it's personal preferences in the end, but I have seen subtle bugs being introduced (fortunately caught in code review) due to a developer adding line before e.g. an exception. As said, it's not a rule, but it does help keep consistency.

@Svek 2019-01-11 19:54:12

@RickDavin - Technically, given his current code the Entry type is a private class nested within, so creating a property would not work. I would also argue that exposing the Entry class (take a look at it) has a bit of code smell and I wouldn't advise putting that out there in the wild. --- I did attempt to note this in my answer, but maybe I should have just not put anything about property declarations in my answer at all?