P.S. My reason for posting is in case someone reading this knows a less insane way of removing images than the method above. If so, please let me know; I'm rather ignorant about ebooks.

You are aware that ePubs (and most, if not all, other eBook formats) are actually compressed already? So there is a strong probability that compressing them again will actually cause the files to grow in size.

Most efficient format, as in the one that takes up the least space? Listen, don't worry about space. 500 GB/1 TB drives are not that expensive, and e-books don't take up that much space anyway. Maybe if you have several thousand of them... but even then it's still manageable (probably under 10 GB).

And why would you want to remove images? That just isn't right... For ePubs, I guess you could delete the content of the Images folder, or maybe the folder itself. But will it still be a valid ePub then? I don't know. The images were put there for a reason. Removing them is like making soup without any oil or salt. Have you ever tried it? It tastes awful.

I've run across a few books that were bloated to 3-5x the normal epub file size because of uncompressed images. They included images for every book by an author (which I really don't care to see when I'm reading), and multiple cover images. There's sometimes a background image for certain pages... I don't need the faded publishing house name on the publishing page, for instance, but if the file is small enough, I'll leave it. Once I delete all those junk images, I compress the images that are left, which usually reduces them to as little as 12% of their original size (easily done without visible quality loss), and I'm back to an epub that's under 500KB, as it should be.

I never delete images essential to the text, dividers between sections, publishing logos, etc., that are really part of the book. But I see no reason to use up space on my reader with images of 30 other books.

I just open those types of books in Sigil and delete the images from there, drag the ones that stay into my graphic program, compress and save, and delete the pages that have all the images on them.

strong probability that compressing them again will actually cause the files to grow in size

Negative. All compression algorithms have a "if compression expands then store" feature.
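Whether that fallback kicks in depends on the implementation, but deflate (the algorithm behind zlib/gzip/zip) does have stored blocks it can fall back on. A quick sketch with Python's `zlib` on random (incompressible) data shows the worst-case growth is only framing overhead, not a real blow-up:

```python
import os
import zlib

# Incompressible input: 100 kB of random bytes.
data = os.urandom(100_000)

# Deflate can emit "stored" blocks when compression would expand the data,
# so the worst case is a few bytes of framing per block, not real growth.
compressed = zlib.compress(data, 9)
overhead = len(compressed) - len(data)
print(f"input: {len(data)}  output: {len(compressed)}  overhead: {overhead} bytes")
```

The output is slightly larger than the input, but only by the zlib header, checksum, and per-block framing.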

Yes, that some formats are already compressed was the whole point. I was just curious how efficient they are at storing text. To know the answer I *had* to remove the images, because otherwise I was also measuring a separate variable: image compression, which can be adjusted all over the place and wasn't what I wanted to measure. BTW the command I gave was wrong. PM me if you want to do this test yourself. (One needs */*.jpg etc also.)
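For anyone wanting to run the same kind of test, here's a sketch of a scripted way to strip images: an ePub is just a zip archive, so you can copy every non-image entry into a new file. The `strip_images` helper name and the extension list are my own choices; note that a strictly valid ePub also needs its manifest updated, and `mimetype` must stay the first, uncompressed entry (preserved here by reusing each entry's metadata in order):

```python
import zipfile

# Hypothetical helper: copy an ePub, skipping image entries.
# An ePub is a zip archive; reusing each ZipInfo preserves entry order and
# compression type, which keeps `mimetype` first and stored as required.
IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif", ".svg")

def strip_images(src_path, dst_path):
    with zipfile.ZipFile(src_path) as zin, \
         zipfile.ZipFile(dst_path, "w") as zout:
        for item in zin.infolist():
            if item.filename.lower().endswith(IMAGE_EXTS):
                continue  # drop the image entry entirely
            zout.writestr(item, zin.read(item.filename))
```

The OPF manifest will still reference the deleted images afterward, so strict validators will complain; for a size measurement that doesn't matter.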

I completely agree it is morally wrong to remove images. In my case I wanted to measure something. I should probably be burned at the stake for doing that. But, OTOH, I once read about an SF author who was very upset about the picture the publisher had drawn for his book. It clashed with the theme of the book. You know how they try to sex up stuff these days. The author lost out. Also, my reader creates a library in some mystery place on the HDD, and was going super slow. And the HDD was out of disk space. My HDD is always low on space. I blamed a recipe book with something like 100 HD images. That can clog up your computer and your reader.

Anyway, my reader doesn't have an SD slot (I don't really need or want one; I only load about 100 books at a time anyway, since I prefer to keep all my books in one location in my Calibre library on my PC). But even so, that doesn't mean I want epubs over 2000KB when they're usually around 300-400KB. I never delete images that are part of the storyline or any illustrations, even fancy section separators. I will compress them, though; there's no reason they need to be 300KB when they'll look the same compressed properly down to 60KB. Yes, you can do that even for color images on tablets and see no difference.

What I will always continue to remove is advertising for every other book the author ever wrote (sometimes several authors, if they're from the same publishing house), where they insist on including a cover image for each. Then there's the cover page on page 1... a few pages later there's another cover image, the same image, just a slightly different size, so they can't even reuse the same file, more senseless bulk. Then further on at the back, where there are often "About the Author" blurbs, etc., they include yet another cover image... yes, the same image, but again a different size. That's just senseless bloat, and I'll continue to prune it out. If they have 5 different sizes of the same image for section separators, I'll eliminate 4 of them and use one in all locations. Easy to do with Sigil.

Btw, this thread was never deleted. If I remember correctly, you originally posted it (incorrectly) in General Discussion and a mod nicely moved it over to the Workshop, where it belonged. Perhaps you just didn't find it after the move. But I don't think I've ever seen a thread deleted here.

Hehe. It's almost gotten to the point where if you rip a page out of a book, you get a knock on the door from the publisher. And 1000 years ago books were somewhat illegal, so it could be worse.

I respectfully disagree. My stats page showed I had zero posts instead of the one I had made. Google had indexed the original page. So my post was in a deleted state for some time. So, IMO, either you are wrong or the software running this board is not working correctly, and the number of posts you have showing is inaccurate. I'm sure. If you search another post of mine you might get more info, but this discussion is off limits according to rule 12, so we will have to leave it at that, I guess.

I am reasonably sure that if your post had been actually deleted you would have received a private message about why the deletion had occurred.

Quote:

Originally Posted by jasontaylor7

I respectfully disagree. My stats page showed I had zero posts instead of one, like I had. Google had indexed the original page. So my post was in a deleted mode for some time.

In thinking over what you experienced, I may have an explanation. I am not privy to exactly how the board mechanics work, but I have observed the following: when a first-time poster creates a post with an HTML link, the board software will relatively quickly (I don't know how often this process runs... once an hour... once every 5 minutes?) place that post in moderation (this is a spam-prevention feature). Once the post is awaiting moderation, it will not be visible to the public until a board-level moderator reviews it and takes it out of moderation.

The first response came approximately 17 hours after you posted. For all or most of those 17 hours your post may have been in an auto-moderation state awaiting moderator action. This would explain everything, including why no one responded to your post for so long.

This "auto moderation" state is only for folks with very limited posts (less then 10 posts is my SWAG) with a html link in the post.

As a forum-level moderator in the calibre forum I can see (but can't act on) posts in a moderation state, but most folks would never see these posts until they're cleared by a board-level moderator or deleted as spam.

I am reasonably sure that if your post had been actually deleted you would have received a private message about why the deletion had occurred.

The fact that the post was cached by Google implies my post was deleted as far as everyone but the mods could tell. I've posted elsewhere; the software used here isn't that unique. The anti-spam feature you describe automatically prevents any initial post with hyperlinks from ever being displayed until it is reviewed. Mine was up. So I disagree with your theory that software automatically placed my post into a holding pattern after it had been up for some time. In fact, I find it very disturbing that you would posit that a mod didn't put it there on purpose.

Also, while in the "limbo" state you describe, from my perspective the post was deleted, since no PM was sent to me. That the person who deleted it *might* have intended for it to be reposted later would be better, but I see no testimony here from any such mod, and it doesn't change the perception for everyone except the mods, which is most of the community here. Also, the notion that it was merely moved but not deleted and later restored violates the way the word "moved" is commonly used in the computer industry (in which a copy is first made in step 1, and then in step 2 the original is deleted, so that two versions temporarily exist, not zero, as was the case for me).

Lastly, your apparently boundless desire to make it seem like this board is God's gift to mankind, or that my post was never effectively temporarily deleted, is at least as suspicious as your need, e.g., to largely deny the existence of deleted posts (something this board's rules clearly state do exist), or to deny the various issues and imperfections of the free program, calibre. I mean, I didn't invent the verb "Calibre-ized." The software is good, like this board, but, as you yourself admitted, it has issues, and like other things, including the efficiency of the PDF format, is not perfect.

Jason... I usually hate jumping into the middle of an argument... but it seems you are painting people with a pretty broad brush. DoctorOhh only replied once and posited a rational explanation for what "might" have happened. You seem to jump all over him and then start attacking others who are only trying to help.

So I disagree with your theory that software automatically placed my post into a holding pattern after it had been up for some time.

As I stated, I don't know the mechanics of how the system works, but I did see another first-time poster yesterday whose post with a link was placed in moderation, and there was no post count beside their name. So to this person their number of posts is 0 until the post is moved out of moderation.

Since being in moderation would also explain both the zero post count and why you didn't see your post, I would guess that Ripplinger is correct.

Quote:

Originally Posted by Ripplinger

Btw, this thread was never deleted. If I remember correctly, you originally posted it (incorrectly) in General Discussion and a mod nicely moved it over to the Workshop, where it belonged. Perhaps you just didn't find it after the move. But I don't think I've ever seen a thread deleted here.

If you had placed this initial post in the General Discussion forum it may indeed have been placed in moderation until the moderators had a chance to confer and it was decided what to do with the link and where to put the post.

Quote:

Originally Posted by jasontaylor7

The software is good, like this board, but, as you yourself admitted, it has issues, and like other things, including the efficiency of the pdf format, is not perfect.

Absolutely correct. PDF conversions have so many problems that there is a permanent post in the conversion forum warning users of that exact thing.

Negative. All compression algorithms have a "if compression expands then store" feature.

First, this is 100% not true. TCR (your best compression) has no such feature.

You are not taking into account that many "compression" formats have headers or other framing information they add to the file. Zip (while not technically a compression algorithm) and TCR are such formats. Every time you compress, a zip header and a list of file entries are created. Zip a zip file 100 times and at a certain point you will start seeing the file size increase.
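This is easy to demonstrate with Python's `zipfile`: the first pass shrinks a compressible payload, but each later pass finds nothing left to compress and just adds another layer of zip framing (a quick sketch, with payload and entry names invented for illustration):

```python
import io
import zipfile

# Compressible payload: once deflated, later passes cannot shrink it,
# but each pass still adds zip framing (local header, central directory).
original = b"the quick brown fox jumps over the lazy dog " * 500
payload = original
sizes = []
for i in range(5):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f"layer{i}", payload)
    payload = buf.getvalue()
    sizes.append(len(payload))
print(sizes)  # shrinks dramatically once, then creeps back upward
```

The first size is far below the original; every size after that is larger than the one before it.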

FYI. I wrote the TCR compression implementation used by calibre.

Another issue I see with your test is the formats you're using. Let's take HTMLZ, PMLZ, TXT, and TCR. HTMLZ and PMLZ both contain formatting information, while TXT and TCR are text only (no formatting). So your test is not taking formatting into account; it's really, "Most efficient ereader format for storing only text without formatting."

I would argue that formatting is part of the book and that losing formatting (I wouldn't argue images) is detrimental. For example, removing newlines so you have a stream of characters on a single line will produce an even smaller file than your test did. However, is a single line of text acceptable?

Some formats lend themselves to compression more than others. A binary format like RB or MOBI is going to be harder to compress than a TXT file. A TXT file (especially a written work like a book) is going to have a lot of repetition.
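The difference is stark in a quick sketch, using random bytes as a stand-in for opaque binary data (the sample text is just an illustration):

```python
import os
import zlib

# Book-like text is full of repetition; random bytes stand in for an
# already-compressed or binary payload of the same length.
text = b"It was the best of times, it was the worst of times. " * 200
binary = os.urandom(len(text))

text_ratio = len(zlib.compress(text)) / len(text)
binary_ratio = len(zlib.compress(binary)) / len(binary)
print(f"text: {text_ratio:.3f}  binary-ish: {binary_ratio:.3f}")
```

The text shrinks to a small fraction of its size, while the random payload barely budges (it actually grows slightly from framing).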

That said, I'm not saying that figuring out which compression is best for ebooks isn't worthwhile. I'm just saying your testing methodology needs some work.

When I refer to the loss of formatting, I'm looking at the starting file size. A file with more information will typically be larger than a file with less information. Think of comparing a Blu-ray to a VHS tape: the content may be the same, but there is a huge difference in quality and amount of information. Less data will compress better than more data. So the comparison between the given formats is not a good apples-to-apples comparison in this regard.

This test really should look at the overall compression ratio, that is, the percentage shrunk from the original size. Any other comparison isn't really valid.

My binary-format comment has a few facets. You also need to keep in mind that some formats are already compressed. This can lead to reduced compression when they are compressed again, versus compressing the uncompressed data itself. Compression (typically) looks for repeated patterns. Compressing once will remove many patterns, making subsequent compressions less effective, until no patterns can be found and the size cannot be reduced any further.

Which leads to the issue of binary formats and testing only with the gzip (gz) format. This is only one compression format. It works great for text and is an all-around good compression format. However, there are other compression formats that work better than gzip for binary data (and others that work better in general, but that's beside the point). You're only looking at one compression format, and just because one ebook format, due to its nature, compresses very well with gzip, you can hardly say that ebook format has the best compression. Another compression format that works better with binary data could compress some of the other formats better than gzip can. I don't mean by producing a smaller file, but by producing a better compression ratio for the given files.

Finally, there is a difference between a compression format and a compression algorithm. gzip is a compression format, not a compression algorithm; gzip uses the deflate compression algorithm, which happens to be the main algorithm (and the one required by the ePub standard) used by the zip format. This means that gzip- and zip-compressed files, even when using the same algorithm, will end up with different sizes because they have different header/structural components. To truly compare, you need to take this into account. And it becomes complicated with formats like TCR, which is at once a compression format, an ebook format, and an algorithm.
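You can see the container overhead in isolation by running the same deflate payload through three wrappers; this is a sketch, with the sample data and entry name invented (stripping the 2-byte zlib header and 4-byte checksum to approximate a bare deflate stream):

```python
import gzip
import io
import zipfile
import zlib

data = b"A sentence that repeats itself quite a lot. " * 1000

# Same deflate algorithm (level 6), three different containers.
raw = zlib.compress(data, 6)[2:-4]   # bare deflate stream (zlib framing stripped)
gz = gzip.compress(data, 6)          # deflate + gzip header/trailer (~18 bytes)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED, compresslevel=6) as zf:
    zf.writestr("book.txt", data)    # deflate + local header + central directory
zp = buf.getvalue()

print(len(raw), len(gz), len(zp))  # identical payload, growing container overhead
```

Same algorithm, same payload, three different file sizes, purely from the structural framing each format adds.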

So really, all that's been shown in this test is that the smallest file format, which happens to be a format that compresses well with deflate, ends up giving the smallest compressed file size. Larger files, with more data, in a format that does not compress as well with deflate, yield a larger file size.

From what I have seen, ePub is generally smaller (unless it has embedded fonts or better-quality images) than Mobi. ePub converted to KF8 (AZW3) is always smaller. So overall, ePub is the format of choice if you want a smaller eBook.