Description

Looking at lib/util.c, I see that virtually all SI prefixes are plain wrong. Additionally, man page talks about SI units, there aren't any. Only SI prefixes are incorrectly used. Bases arent 1024 and 1000, they are 2 and 10.

Could you please clarify what user facing changes this patch introduces, what user facing bugs it fixes? Also, how come that the "%s byte" and "%s bytes" strings disappear?

What will happen to the "Size" column, which is 7 characters by default? Is it the plan to use pedantic Ki, Mi whatever prefixes there, "wasting" 2 out of the 7 characters, leaving only 5 for the actual number (and hence reduce by a factor of 10 the boundaries before switching to a larger unit)?

Being sloppy rather than pedantic by not using the proper "Ki", "Mi" etc. is not nice, but perhaps justifiable by limited screen real estate. Not being consistent within mc is IMO much worse.

I'm not against modifying mc and making it more technically correct, I'd just like to make sure we're going to do it consistently, and understand and accept the downsides of the changes if any.

Could you please clarify what user facing changes this patch introduces, what user facing bugs it fixes? Also, how come that the "%s byte" and "%s bytes" strings disappear?

Currently, this patch fixes only the total size calculation for files and directories as well as the directory popup when it is being recursively traversed for size calculation (.e.g, /usr which is usally large).

"%s byte" and so forth has been removed because it is redundant now, properly scaled units are done in size_trunc() already, thus this msg id is not necessary. Additionally, a few translations even had duplicate words in it (German and others). (See the applied diff)

What will happen to the "Size" column, which is 7 characters by default? Is it the plan to use pedantic Ki, Mi whatever prefixes there, "wasting" 2 out of the 7 characters, leaving only 5 for the actual number (and hence reduce by a factor of 10 the boundaries before switching to a larger unit)?

Currently, nothing will happen to "size", "bsize" and alike. This is going to be addressed in a separate ticket. I did not want to introduce a huge change with a single ticket which cannot be easily reviewed. My plan is to fix size calculation which can properly scale between 0 and 9999 <scaled unit>, consuming at most 8 chars.

Being sloppy rather than pedantic by not using the proper "Ki", "Mi" etc. is not nice, but perhaps justifiable by limited screen real estate. Not being consistent within mc is IMO much worse.

I am aware of the limited size but accoring to the explanation above, size and bsize will only increase by 1 char if scaling will work as expected. Consisitently is key for me. As I already layed out. This is a separte ticket.

I'm not against modifying mc and making it more technically correct, I'd just like to make sure we're going to do it consistently, and understand and accept the downsides of the changes if any.

I fully agree and do not have an opposite opinion.

The first reason why I did not patch everything is because I do not waste time for a contribution where people might completely ignore or refuse to merge it. This has happened too often. As long as devs are at least willing to review and comment, I am happy to provide patches.

If you are uncertain, you can branch off master as 'size-display' (umbrella ticket necessary?), apply all of my patches there, see the result and merge it as a whole as soon as work has been completed.

Here is another patch for missing bits in Andrey's fixups. I have obstained to translate Bulgarian and Mongolian because I don't know wether they retain the Latin script or have Cyrillic terms. Other Cyrillic-based langs have been fixed, as well as octet (o) users.

Let's take a note about a UI change here: what used to be "12,345 bytes in 67 files" on the UI will from now on be "12,345 B in 67 files". Correct?

Especially if we save a few characters this way, could we write out the exact number if it fits? E.g. instead of "12,345,678 KiB" why not say e.g. "12,641,974,273 B"?

[individual files] Currently, nothing will happen to "size", "bsize" and alike. This is going to be addressed in a separate ticket. I did not want to introduce a huge change with a single ticket which cannot be easily reviewed. My plan is to fix size calculation which can properly scale between 0 and 9999 <scaled unit>, consuming at most 8 chars.

Since you said 9999 is the upper limit, I assume you're planning to properly continue it with a space and then the entire unit. E.g. what used to be "1234567" until now will from now on be "1234 kB" or "1205 KiB" (depending on the preferences), am I following you correctly?

How about small files, do you plan to add the " B" suffix there?

Until now, the first suffix appeared at the file size of around 10 megs (that is: "9999999" was displayed as-is, the next value became "9766K"). Except for this first change, all subsequent changes occurred around multiples of 1000 (or 1024), e.g. roughly at 1 gigabyte the display format changed from "K" to "M", and presumably roughly at 1 terabyte it changes from "M" to "G".

If you allow numbers up to 9999, you modify these boundaries to be roughly at 10 megs, 10 gigs etc. E.g. 9 megs would be something like "9000 KiB" whereas 11 megs would be "11 MiB". This I believe makes it much harder to subconsciously "feel" the file size magnitude by looking at it.

I'm wondering if maybe the right approach would be to allow numbers only up to 999 or 1023, showing a bit of decimal fraction digits. This way the suffix would directly give you the feeling about the file size's magnitude. E.g. instead of showing "1234567" or "1205 KiB", we could show "1.17 MiB". What do you think? (I'm not sure this would be the best, just wondering...)

I am aware of the limited size but accoring to the explanation above, size and bsize will only increase by 1 char if scaling will work as expected. Consisitently is key for me. As I already layed out. This is a separte ticket.

Increase by 1 char, resulting in the filename shrinking from 16 to 15 chars at the default terminal width of 80 columns; and you still lose two digits of information. In turn, a lot of screen real estate would be wasted for those spaces as well as "i" and "B" letters.

Yup, overall, if I understand your proposal correctly, we'll lose 3 character columns per panel. One for an aestethical space, one for a pedantic "i", and one for that "B" carrying no information whatsoever. I'm not happy, even though I'm not sure what would be a better solution. At least I'd like to have some discussion/brainstorming about it.

For starter, how about leaving everything as-is now, except for fixing the upper/lowercase k/m/g letters (if they're broken at all), and adding an "i" to the header if IEC is selected? This one also obviously has downsides (around the translation of that "Size(i)" label) but let's see if this idea can be continued...

By the way, note how much my proposal in ticket #3087 got frowned upon...

Let's take a note about a UI change here: what used to be "12,345 bytes in 67 files" on the UI will from now on be "12,345 B in 67 files". Correct?

Correct. I do also plan to create a ticket which drops the comma two simple reasons: it is not locale aware and completely confuses people which do not use en_* locales like I do here in German or Russian (which I am a native).

Especially if we save a few characters this way, could we write out the exact number if it fits? E.g. instead of "12,345,678 KiB" why not say e.g. "12,641,974,273 B"?

Of course we could, the people do not want to see huge numbers, since humans are bad in reading a lot of digits. Its like reading an IBAN with several consecutive zeroes.

[individual files] Currently, nothing will happen to "size", "bsize" and alike. This is going to be addressed in a separate ticket. I did not want to introduce a huge change with a single ticket which cannot be easily reviewed. My plan is to fix size calculation which can properly scale between 0 and 9999 <scaled unit>, consuming at most 8 chars.

Since you said 9999 is the upper limit, I assume you're planning to properly continue it with a space and then the entire unit. E.g. what used to be "1234567" until now will from now on be "1234 kB" or "1205 KiB" (depending on the preferences), am I following you correctly?

Almost, yes. In the first place, I do not want to fiddle with the scaling logic as in size_trunc_len(). This is too much change in one patch. The logic mostly relies on the len input limited by the buffer which produces for the same size different outputs in different spots, like size column or info panel. This confuses me too.

Until now, the first suffix appeared at the file size of around 10 megs (that is: "9999999" was displayed as-is, the next value became "9766K"). Except for this first change, all subsequent changes occurred around multiples of 1000 (or 1024), e.g. roughly at 1 gigabyte the display format changed from "K" to "M", and presumably roughly at 1 terabyte it changes from "M" to "G".

If you allow numbers up to 9999, you modify these boundaries to be roughly at 10 megs, 10 gigs etc. E.g. 9 megs would be something like "9000 KiB" whereas 11 megs would be "11 MiB". This I believe makes it much harder to subconsciously "feel" the file size magnitude by looking at it.

I'm wondering if maybe the right approach would be to allow numbers only up to 999 or 1023, showing a bit of decimal fraction digits. This way the suffix would directly give you the feeling about the file size's magnitude. E.g. instead of showing "1234567" or "1205 KiB", we could show "1.17 MiB". What do you think? (I'm not sure this would be the best, just wondering...)

Personally, I completely agree with you. I do dislike the scaling logic which is implemented right now because it scales way too late and shows huge numbers like 120345 kB. This is impossible to read. But as I have said before, I want to fix the units first and then take care of the scaling logic. Though, I won't change it without your consent.

I have applied a similar approach a couple of months ago to a formatter written for Apache Maven, displaying file sizes and download speed. See nested class ​FileSizeFormat along with ​test cases.

The basic idea is to have one decimal position (with seperator from LC_NUMERIC) for sizes between 0 and 10 and whole numbers above 10. Automatically scale between 1 and 1000. It is space efficient, consize and easy to read. You never exceed 8 chars.

I am aware of the limited size but accoring to the explanation above, size and bsize will only increase by 1 char if scaling will work as expected. Consisitently is key for me. As I already layed out. This is a separte ticket.

Increase by 1 char, resulting in the filename shrinking from 16 to 15 chars at the default terminal width of 80 columns; and you still lose two digits of information. In turn, a lot of screen real estate would be wasted for those spaces as well as "i" and "B" letters.

That is correct. A sacrifice for technical correctness. Consider that proper GUIs/operating systems do use only SI prefixes for storage sizes, Gnome, OS X, etc. Microsoft is too stupid for that. Ultimately, people do not really understand why 1 GB is not 1 000 000 000 B.

Yup, overall, if I understand your proposal correctly, we'll lose 3 character columns per panel. One for an aestethical space, one for a pedantic "i", and one for that "B" carrying no information whatsoever. I'm not happy, even though I'm not sure what would be a better solution. At least I'd like to have some discussion/brainstorming about it.

In the first place, you will lose 3 chars but add proper, unambigious information. There is no 'aestethical space' because this is always mandatory between number and unit. "B" IS the unit. k, M, G aren't units. In fact, K is Kelvin. This is even more confusing.

You would lose at most 1 char, if you would apply a better scaling logic which I have proposed a few paragraphs above.

I am perfectly fine with a discussion where we can retain space efficiency and proper readability.

For starter, how about leaving everything as-is now, except for fixing the upper/lowercase k/m/g letters (if they're broken at all), and adding an "i" to the header if IEC is selected? This one also obviously has downsides (around the translation of that "Size(i)" label) but let's see if this idea can be continued...

From my point of view, size_trunc() and size_trunc_len() can be collapsed into one method with less logic and code. Especially size_trunc_len() performs operations for two different concerns which it shouldn't. During my review, I found several other spots where no/fixed scale is applied and only base 2 is used, regardless of the settings.

By the way, note how much my proposal in ticket #3087 got frowned upon...

I read the entire conversation and it was quite daunting. I follow your point and if I check the shortcuts with my German keyboard, I have absolute no idea what M-h means and how I can even enter some of them because the key combination is impossible without a n-rollover keyboard.

I do also plan to create a ticket which drops the comma two simple reasons: it is not locale aware and completely confuses people which do not use en_* locales like I do here in German or Russian (which I am a native).

Of course we could, the people do not want to see huge numbers, since humans are bad in reading a lot of digits. Its like reading an IBAN with several consecutive zeroes.

These two seem to contradict each other. I agree that reading a large unstructured number is very hard. So why make it even harder by dropping the separator? I guess it should be fixed instead: make the thousands-separator locale-dependent.

An unstructured number has IMO two advantages over the one with thousands separators: easy copy-pasting (let's say to "bc"), as well as iTerm 3 beta and gnome-terminal 3.20 showing useful information about the number on right click (gnome-terminal ticket with screenshot: ​https://bugzilla.gnome.org/show_bug.cgi?id=741728). Despite these two, I'd rather go for mc structuring these numbers properly though.

Another idea to explore might be to use different color/boldness/underlying for every 3 characters, e.g. "1234567890 bytes" or some other visual differentiation.

Or, similarly to the gnome-terminal feature, mc could always show the entire number as well as a Ki / Mi / Gi converted value, e.g. "1234567890 (~1.14 GiB)". (Is there enough room for this? Probably not always.)

[short format]

That is correct. A sacrifice for technical correctness. Consider that proper GUIs/operating systems do use only SI prefixes for storage sizes, Gnome, OS X, etc. Microsoft is too stupid for that. Ultimately, people do not really understand why 1 GB is not 1 000 000 000 B.

I have a strong guts feeling that changing the current format to having those " B", " KiB" etc. suffixes and in turn much shorter room for the actual number will be something that many people will strongly dislike, and would prefer to revert to the old display format (which is technically less correct, but shows more information on the available limited space).

At the very least I strongly recommend that whatever you'll be doing with the Size field should have a config option (or a new field type in User-defined Listing mode) to revert to the old behavior. I'd also be happy to hear others' opinion here :)

An unstructured number has IMO two advantages over the one with thousands separators: easy copy-pasting (let's say to "bc"), as well as iTerm 3 beta and gnome-terminal 3.20 showing useful information about the number on right click (gnome-terminal ticket with screenshot: ​https://bugzilla.gnome.org/show_bug.cgi?id=741728). Despite these two, I'd rather go for mc structuring these numbers properly though.

I agree with copy and paste. This is indeed an issue. Regarding your Gnome issue, GNU ls supports 'ls --si' and you are done. Your screenshot has prefixes only and misses the unit.

Another idea to explore might be to use different color/boldness/underlying for every 3 characters, e.g. "1234567890 bytes" or some other visual differentiation.

Or, similarly to the gnome-terminal feature, mc could always show the entire number as well as a Ki / Mi / Gi converted value, e.g. "1234567890 (~1.14 GiB)". (Is there enough room for this? Probably not always.)

I would show the full byte size as well as the scaled one in the info/detail panel only.

[short format]

That is correct. A sacrifice for technical correctness. Consider that proper GUIs/operating systems do use only SI prefixes for storage sizes, Gnome, OS X, etc. Microsoft is too stupid for that. Ultimately, people do not really understand why 1 GB is not 1 000 000 000 B.

I have a strong guts feeling that changing the current format to having those " B", " KiB" etc. suffixes and in turn much shorter room for the actual number will be something that many people will strongly dislike, and would prefer to revert to the old display format (which is technically less correct, but shows more information on the available limited space).

At the very least I strongly recommend that whatever you'll be doing with the Size field should have a config option (or a new field type in User-defined Listing mode) to revert to the old behavior. I'd also be happy to hear others' opinion here :)

An option is always possible and I am not rejecting it. We should though by default use the proper format and have a fallback option.

I have attached a patch (scaling.patch) for scaling for size_trunc(). Apply locally and look how it works. It makes magnitudes instantly visible. This is actually what I would ultimately propose. I haven't yet touched size_trunc_len() but will if you want to see how it might look like at the end.

Regarding your Gnome issue, GNU ls supports 'ls --si' and you are done. Your screenshot has prefixes only and misses the unit.

That screenshot is just one example, the point was to make it work with any command, not just ls. Unit is missing because gnome-terminal cannot possibly know what the unit of the number is, it only recognizes the number itself (it would be incorrect to assume bytes).

Note, however, how 'ls --si' also happily omits the space and the letters "i" and "B" to save space, even though (unlike mc) it has as much space as it wants to (it can wrap a file's entry into the next terminal row, no problem). I guess I'd rather see mc taking this direction too, rather than being fully pedantic.

Egmont, I have created two more patches (0001-size_trunc-autoscale-size-with-proper-prefixes.patch and 0002-size_trunc_len-drastically-simplify-function-by-appl.patch) which autoscale the size up to gigabytes/gibibytes (TB and TiB can be added later). All static buffer which are too small have been increased. You only need at most 9 chars including NULL. Note that for non-Latin languges more than 9 bytes are necessary (UTF-8 multibyte). This has been added to the small buffers as well as for the size and bsize columns. I cannot tell about the big ones whether they might overflow or not. From my testing, it does not. This is before the patch:

0001-size_trunc-autoscale-size-with-proper-prefixes.patch​: I think we shouldn't lose the high precision in display_total_marked_size(). This is single place where such precision is shown unlike other places where scaling is applied.

0001-size_trunc-autoscale-size-with-proper-prefixes.patch​: I think we shouldn't lose the high precision in display_total_marked_size(). This is single place where such precision is shown unlike other places where scaling is applied.

Interesting point, though I do not understand why. The accumulated byte value can get really large with 9, 10 or more places, barely readable. I would expect to see the raw value *and* the scaled value in the info panel: 102 MB (102124456999 B, 199461830 blocks).

Can you explain why the precision is necessary in the total marked size?

I second andrew_b's opinion here. There might be use cases we don't even think of. Doing arithmetics with that number outside of mc, using as a basic "checksum" to check whether the size is what's expected, etc... There's room to show that information, so let's do it. The info panel is IMO cumbersome to get to.

I second andrew_b's opinion here. There might be use cases we don't even think of. Doing arithmetics with that number outside of mc, using as a basic "checksum" to check whether the size is what's expected, etc... There's room to show that information, so let's do it. The info panel is IMO cumbersome to get to.

Whenever it becomes too long (wouldn't fit the available width) we could fallback to a more concise notation, e.g. "123456789012 B (107.3 GiB) / 1234 files" (omitting the "in" would probably save more in other languages), or even simply "123456789012 B (107.3 GiB) / 1234", or stop showing either the exact overall size or the approximation.

How does it look to you?

A bit of my worry is that with IEC it looks okay, with SI it's unnecessarily repeating "1234567 B (1.23 MB)". Not sure what to do here. Maybe here we could omit the apprimated value if we have thousands separators and just simply show "1,234,567 B in 1234 files" (wait, this is what we do currently).

Yes, this is just swapped. The question is which people like to see first? Most UIs display scaled unit first and then the raw value. If the raw value is getting too large this won't be helpful.

Whenever it becomes too long (wouldn't fit the available width) we could fallback to a more concise notation, e.g. "123456789012 B (107.3 GiB) / 1234 files" (omitting the "in" would probably save more in other languages), or even simply "123456789012 B (107.3 GiB) / 1234", or stop showing either the exact overall size or the approximation.

How does it look to you?

I am not really fond of this. This gets too complex, and unreadable by leaving more and more information out. Additionally, you have to constantly calculate whether your localized strings fits into the panel and reapply and print again. I would leave it fully written because it will always behave the same way. there are probably 30 chars available, I consider that terabytes will fit it.

A bit of my worry is that with IEC it looks okay, with SI it's unnecessarily repeating "1234567 B (1.23 MB)". Not sure what to do here. Maybe here we could omit the apprimated value if we have thousands separators and just simply show "1,234,567 B in 1234 files" (wait, this is what we do currently).

I wouldn't neither worry about repetition because the UI is consistent after all. People see exact information. Leave as-is with bytes only yields to huge unreadble numbers, not copiable as you have written earlier.

We can do the following: provide a new option "[] Scale marked total size". If selected, people see scaled size, if not they see the raw byte value as with %ju. Easy and understandable. If you want to make it even better, provide a selection option: "Marked total size style: [raw, scaled, both]" which maps to an enum. Default would be deselected or raw.

This should satisfy all needs. If you prefer to see raw values, you get them. I prefer to see scaled ones.

Yes, this is just swapped. The question is which people like to see first? Most UIs display scaled unit first and then the raw value. If the raw value is getting too large this won't be helpful.

Could you please provide references (e.g. screenshots) to that "most UIs"?

I believe that parentheses (in this case) "explain" the aforementioned value, in which sense it makes sense to convert from bytes to MB, GB, but does not make sense the other way around, as there's no way to derive the exact number from the MB, GB approximation.

But I don't have references to defend this, so if you prove that your suggestion is a common UI approach then I'm fine.

We can do the following: provide a new option "[] Scale marked total size". If selected, people see scaled size, if not they see the raw byte value as with %ju. Easy and understandable. If you want to make it even better, provide a selection option: "Marked total size style: [raw, scaled, both]" which maps to an enum. Default would be deselected or raw.

This should satisfy all needs. If you prefer to see raw values, you get them. I prefer to see scaled ones.

You assume that one's need is to either always see the exact value, or to always see the approximated one. At least for me this is not the case.

I often wish to see the exact value, and often the approximated one that suggests the magnitude. And I'd prefer to "switch" just by looking at the screen (that is, seeing both there) rather than by going to whichever menu entry or remembering a long hotkey sequence to toggle.

With the current behavior I get both (albeit not a copy-pasteable number). With the new behavior I'd only get one at a time. I'd either need to toggle mc's setting, or use iterm's/gnome-terminal's right-click menu that shows the detailed information. The latter one is probably more convenient out of these two, but not available to most of our users. A definite advantage is copy-pasteability.

I guess I could easily get used to the new behavior, but I'm heavily in doubt whether our users will see this as an improvement.

Yes, this is just swapped. The question is which people like to see first? Most UIs display scaled unit first and then the raw value. If the raw value is getting too large this won't be helpful.

Could you please provide references (e.g. screenshots) to that "most UIs"?

I believe that parentheses (in this case) "explain" the aforementioned value, in which sense it makes sense to convert from bytes to MB, GB, but does not make sense the other way around, as there's no way to derive the exact number from the MB, GB approximation.

Your reasoning is correct here. You can't derive but people won't do it anyway as long as you provide the raw value.

But I don't have references to defend this, so if you prove that your suggestion is a common UI approach then I'm fine.

I did some digging with Google and my VMs (this applies to file details, I assume it applies to marked files in a detailed view too):

We can do the following: provide a new option "[] Scale marked total size". If selected, people see scaled size, if not they see the raw byte value as with %ju. Easy and understandable. If you want to make it even better, provide a selection option: "Marked total size style: [raw, scaled, both]" which maps to an enum. Default would be deselected or raw.

This should satisfy all needs. If you prefer to see raw values, you get them. I prefer to see scaled ones.

You assume that one's need is to either always see the exact value, or to always see the approximated one. At least for me this is not the case.

I often wish to see the exact value, and often the approximated one that suggests the magnitude. And I'd prefer to "switch" just by looking at the screen (that is, seeing both there) rather than by going to whichever menu entry or remembering a long hotkey sequence to toggle.

With the current behavior I get both (albeit not a copy-pasteable number). With the new behavior I'd only get one at a time. I'd either need to toggle mc's setting, or use iterm's/gnome-terminal's right-click menu that shows the detailed information. The latter one is probably more convenient out of these two, but not available to most of our users. A definite advantage is copy-pasteability.

I have offered three options actually. You can have both in one view. It is simply not true that you can have both currently, you get a partially scaled value, neither fully nor raw. I do not consider it usable/readable at all. Looking at my Mediatomb library, I see tens of millions of kilobytes. Not really helpful.

I guess I could easily get used to the new behavior, but I'm heavily in doubt whether our users will see this as an improvement.

Great. Anything you dislike leaving people the choice to have three options with having seeing both as default?

I have offered three options actually. You can have both in one view. It is simply not true that you can have both currently, you get a partially scaled value, neither fully nor raw. I do not consider it usable/readable at all. Looking at my Mediatomb library, I see tens of millions of kilobytes. Not really helpful.

Up to the value of "999,999,999" you get it spelled out exactly, and the grouping of 3 lets you immediately see that this is approx. 999 megabytes. Continuing this pattern for even larger numbers would IMO provide a quite good solution (despite its known downsides e.g. unability to copy-paste), it would show the exact number as well as the magnitude at the same time.

What I personally never really cared about is the difference between IEC and SI - exactly what you're trying to pedantically address in this bug. Is it 1 GiB or 1 GB or 1.07 GB? It hardly ever makes a difference in real life.

I'm sorry but I'm totally lost: could you please provide examples of how the different modes you'd offer would look like for different magnitudes?

I have offered three options actually. You can have both in one view. It is simply not true that you can have both currently, you get a partially scaled value, neither fully nor raw. I do not consider it usable/readable at all. Looking at my Mediatomb library, I see tens of millions of kilobytes. Not really helpful.

Up to the value of "999,999,999" you get it spelled out exactly, and the grouping of 3 lets you immediately see that this is approx. 999 megabytes. Continuing this pattern for even larger numbers would IMO provide a quite good solution (despite its known downsides e.g. unability to copy-paste), it would show the exact number as well as the magnitude at the same time.

Exactly, why do I need to scale myself if the machine can do that for me? I am not really interested wether is 100 kB larger or not.

What I personally never really cared about is the difference between IEC and SI - exactly what you're trying to pedantically address in this bug. Is it 1 GiB or 1 GB or 1.07 GB? It hardly ever makes a difference in real life.

The difference is how the values are displayed and the unit does not really show what base is applied. With a proper patch, you immediately see the base 2 or 10. Not exactly the difference itself.

I'm sorry but I'm totally lost: could you please provide examples of how the different modes you'd offer would look like for different magnitudes?

I like it pretty much, my only concern is that the "both" mode can easily overflow (that is, not fit in the given room) at 80 columns, let alone in other languages where "in" and "files" are longer words. This is something I'd like to see getting addressed.

Here are some ideas, I'm curious about your preference:

Stop displaying either the raw or the scaled number if the entire string doesn't fit.

Let it overflow to the other panel (whichever has the focus should "win"). (Ugly and not sure how easily implementable, but useful.)

Shorten by omitting the words "in" and "files", just use e.g. a "/" as separator, and/or omitting spaces before the units.

Continue two lines below in the left (to the left of the disk usage information).

(Any other idea?)

The latter one suggests we also have to figure out what to display if the mini status is disabled for the panel. (This is something that just occurred to me, I've never used mc in this mode.) Currently the number of files is omitted.

I like it pretty much, my only concern is that the "both" mode can easily overflow (that is, not fit in the given room) at 80 columns, let alone in other languages where "in" and "files" are longer words. This is something I'd like to see getting addressed.

I have thought of this too but there is no silver bullet. We have at most 38 chars in default mode/size because we have two panes for the marked total size.

Here are some ideas, I'm curious about your preference:

Stop displaying either the raw or the scaled number if the entire string doesn't fit.

This would mean to constantly check and recalculate. Adding additional complexity. Not really favorable.

Let it overflow to the other panel (whichever has the focus should "win"). (Ugly and not sure how easily implementable, but useful.)

Have you actually tried this? In my opinion, this won't actually happen. I have tried this and resized my PuTTY window to the smallest possible size. mc is smart enough to detect that. It has automatically shortened the string in the center with a tilde (~). No overflow has occured.

Shorten by omitting the words "in" and "files", just use e.g. a "/" as separator, and/or omitting spaces before the units.

I wouldn't do this for two reasons: (1) changes wouldn't look harmonic anymore and people might consider it as a bug when words start to disappear. Moreover, if you mark large directories/files in the first place, you won't see the word "files" at all and will ask yourself: What does this output mean? "100 GB / 1000".

Continue two lines below in the left (to the left of the disk usage information).

Is this actually possible? My knowledge of ncurses in non-existent.

(Any other idea?)

The latter one suggests we also have to figure out what to display if the mini status is disabled for the panel. (This is something that just occurred to me, I've never used mc in this mode.) Currently the number of files is omitted.

From my findings, an overflow won't happen but a shorting like "├─ 5,3 MB ~ateien ─┤".

I have patched locally to "%s (%ju B) in %d file", "%s (%ju B) in %d files" and I see this in a small SSH sesssion:

If you really want to make it look nice (just as in your screenshots as well as the current implementation) the string would be surrounded by a space, then a horizontal line, then the vertical one. That would mean 34 chars only in a terminal of 80 columns.

This would mean to constantly check and recalculate. Adding additional complexity. Not really favorable.

In my opinion, source code should generally try to be as simple or as complex as required by the desired feature. That is, the desired feature should drive things, and not internal simplicity. There are of course places where you sacrifice a feature for the sake of clean and maintainable code as well as developer resources, or find a reasonable compromise. I don't think an if-else branch (what we'd need here) is such a case. That is, I'd be totally fine with a code that goes like "format this way, check its length, if it's too long then format that way".

I wouldn't do this for two reasons: (1) changes wouldn't look harmonic anymore and people might consider it as a bug when words start to disappear. Moreover, if you mark large directories/files in the first place, you won't see the word "files" at all and will ask yourself: What does this output mean? "100 GB / 1000".

I understand your worries here.

Continue two lines below in the left (to the left of the disk usage information).

Is this actually possible? My knowledge of ncurses in non-existent.

I really doubt ncurses can do this. I meant to do it "manually", that is, construct two separate strings and make them show up independently from each other at two different parts of the UI.

From my findings, an overflow won't happen but a shorting like "├─ 5,3 MB ~ateien ─┤".

Currently the string is also shortened. The presence of the "~" char clearly informs the user that some data has been stripped and that they need a wider window to fit it all. In this aspect it's okay I guess.

On the other hand, in your examples it happened to strip the last couple of digits of the size (which is kinda usable), but it might actually strip some digits from the middle, or the closing parantheses resulting in an ugly unbalanced pair, or whatever else.

As a compromise between us, how about going for this right now, and opening another bugreport where we ponder about possible better approaches? E.g. I'd prefer the entire exact size disappearing rather than omitting some random digits from a giant number and replacing by tilde.

... or, when the string gets too long then the scaled and raw numbers could remain in the given row, and the "in 12345 files" could jump two lines below (to the line of disk usage summary), aligned to the left.

If you really want to make it look nice (just as in your screenshots as well as the current implementation) the string would be surrounded by a space, then a horizontal line, then the vertical one. That would mean 34 chars only in a terminal of 80 columns.

This would mean to constantly check and recalculate. Adding additional complexity. Not really favorable.

In my opinion, source code should generally try to be as simple or as complex as required by the desired feature. That is, the desired feature should drive things, and not internal simplicity. There are of course places where you sacrifice a feature for the sake of clean and maintainable code as well as developer resources, or find a reasonable compromise. I don't think an if-else branch (what we'd need here) is such a case. That is, I'd be totally fine with a code that goes like "format this way, check its length, if it's too long then format that way".

Agreed, I have to perform some if-else-magic for the three depicted options.

From my findings, an overflow won't happen but a shorting like "├─ 5,3 MB ~ateien ─┤".

Currently the string is also shortened. The presence of the "~" char clearly informs the user that some data has been stripped and that they need a wider window to fit it all. In this aspect it's okay I guess.

On the other hand, in your examples it happened to strip the last couple of digits of the size (which is kinda usable), but it might actually strip some digits from the middle, or the closing parantheses resulting in an ugly unbalanced pair, or whatever else.

Correct, though one has to say that I hardly believe that people will use a terminal smaller than 80x24 in general. Probably even larger.

As a compromise between us, how about going for this right now, and opening another bugreport where we ponder about possible better approaches? E.g. I'd prefer the entire exact size disappearing rather than omitting some random digits from a giant number and replacing by tilde.

Agreed here too. Rather omitting information than have broken one. Ticket is pending.

... or, when the string gets too long then the scaled and raw numbers could remain in the given row, and the "in 12345 files" could jump two lines below (to the line of disk usage summary), aligned to the left.

I am not really sure this is a wise change. The mount point size function has to receive this information though, completely unrelated (bad design). Additionally, the "in x files" would look completely dangling.

I'd like to summarize our current discussion with Egmont to avoid misunderstandings:

The pending changes for size_trunc() and size_trunc_len(), which are used throughout the code, will be applied to the feature branch. (necessary for further changes)

size_trunc_sep() will be deprecated and not used anymore (don't know how to mark a function as deprecated in your documentation style)

display_total_marked_size() will receive an update in such that it will apply the three modes depicted in comment 29. A new panel option "Total marked size style: [raw, scaled, both]" will make this possible. (need to investigate how to do this)

Outstanding merges from comment:37, they are necessary to complete the change codewise. The current patch applied by you to the branch is one of several.

As I wrote in comment:10, patches for *.po are superfluous. All translation updates are made by native language translation teams via www.transifex.com.

I am aware of that but if you take a closer look at commit ​bfcf1509295704e5a4ea260032f199027cb18604 you see that it removes redundant words not present in the English po file. You can drop the second one if you want to wait for the translation team.

The replacement of size_trunc_sep() by size_trunc() with autoscaling is out-of-topic here. Probably, #3674 is better place for that.

size_trunc_sep() does mere add grouping, it does not scale anything. I do now understand what you are after now, you want the stuff completely seperated. This is fine. Morever, #3674 isn't the right place since it already requries the scaler implementation. I will spawn an new ticket for that.

Though, it is still not complete, free/total size is still wrong and all other places where size_trunc_len() is used. This needs an additional commit.

There are no gettext ids in the po files for them. If they will use the key by default, I will readd them. Though, I highly doubt that anything above PB will be displayed as volume size in the near future. Do you want me to readd them? No issue.

Regarding the suffixes, ~100 PB libraries are not uncommon these days (my personal experience), so, I'm afraid, EBs are not that far away. I'm not sure what's the advantage of leaving them and the remaining prefixes out. Are there any? If not, maybe we could keep them? Otherwise, no objections.

Though, there are still po errors/lefteovers which need to be addressed.

Hmmm, I'm completely confused. What errors/leftovers are you talking about? Anything that is not addressed in new tickets and prevents this branch from being merged?

P.S. I can delete the attachments from this ticket if this request still stands.

Regarding the suffixes, ~100 PB libraries are not uncommon these days (my personal experience), so, I'm afraid, EBs are not that far away. I'm not sure what's the advantage of leaving them and the remaining prefixes out. Are there any? If not, maybe we could keep them? Otherwise, no objections.

We can, of course, reintroduce them. That would require some po changes as well too. If you really see a need for them, I will attach another patch, though I cannot test such numbers reasonably.

Though, there are still po errors/lefteovers which need to be addressed.

Hmmm, I'm completely confused. What errors/leftovers are you talking about? Anything that is not addressed in new tickets and prevents this branch from being merged?

This patch is missing, though I have written it several times. Andrey refused to accept it:

Thanks for merging. I will check this on several operating systems next week.

It's clean that "4096" is 4096 bytes. Now it's displayed as "4096 B". I'd personally prefer do not display the "B" suffix, but I'm not insisting.

To be honest, it is not clean or clear what is implied. I think there is no need to imply anything but simply state it, avoiding any interpretation. Additionally, on a subsequent ticket, I will try to address that making it autoscale.

Strange thing, I tried current master on RHEL 6 and FreeBSD 9.3-STABLE and had no segfault on neither small nor large files. Anyway, thank you for testing. No idea why I have missed that static buffer. Patch attached.

The unfortunately no way to tell the caller that the buffer is too small.

You could (or rather: you should :)) add an assert() call. This way if someone invokes the method with a too small buffer, we at least get a kinda informative error message rather than a potential data corruption or segfault somewhere later on (depending on other conditions such as architecture, OS, compiler flags etc.).

I have properly documented by it has to be at least seven bytes long.

I'm sorry but I cannot find this being documented (I'm looking above the declaration and definition of size_trunc_len() in lib/util.[hc], I guess if anywhere then it should be documented here).

The unfortunately no way to tell the caller that the buffer is too small.

You could (or rather: you should :)) add an assert() call. This way if someone invokes the method with a too small buffer, we at least get a kinda informative error message rather than a potential data corruption or segfault somewhere later on (depending on other conditions such as architecture, OS, compiler flags etc.).

Stupid me, yes. It give the size of the pointer only. Since we do not know whether the buffer is statically or dynamically, is there a better way to check the size? I am not a day-to-day C hacker. I think one has to rely on length and the precondition described in the docs.

The size of array is the 'len + 1', as described in function description.

assert() itself is not a segfault protection. I think that size_trunc_len() mustn't segfault with small buffer. If buffer is too small, it can be filled with some non-numerical value (say, "????", or "----", or something else) in order to show that requested size cannot be printed correctly.

Stupid me, yes. It give the size of the pointer only. Since we do not know whether the buffer is statically or dynamically, is there a better way to check the size?

Nope, at least not with bare C strings (char[] vs. char*). You could go with higher level solutions such as GString, in which case you wouldn't need to pass the length separately. Not sure how much mc uses this or not.

As long as the pointer and the length are passed separately, it's reasonable to do the assertion on the `length' parameter only.

Why? The entire program aborts on a failed assertion, doesn't it? (If assertions are enabled, of course.) Assertions are meant to catch situations that are clearly internal programming errors, such as calling a method with arguments that do not comply with the method's documentation.

I think that size_trunc_len() mustn't segfault with small buffer.

It's not a black and white story. You can choose to go with a method documentation that requires a buffer of at least 7 bytes and then an assertion is okay. Or you can choose to allow a smaller buffer and fill up with whatever placeholder characters (but still mention in the documentation that the caller shouldn't expect a "usable" result in that case). In this particular case I'm fine with either approach.

Unexpected abort is better then segfault but isn't a nice thing itself. When you have open and not saved file(s), no matter for you what happened, assert or segfault. (Yes we have many asserts in mc code base, but it is another story.)

I think that size_trunc_len() mustn't segfault with small buffer.

It's not a black and white story. You can choose to go with a method documentation that requires a buffer of at least 7 bytes and then an assertion is okay.

panel.c: string_file_size() is called with a len parameter that's less than 4, which in turn is increased by 3 (why? that step is totally unclear to me) and then passed to size_trunc_len().

4? Is t supposed to be 8+3. 3 is added here intentionally because of multibyte chars. Otherwise they won't fit. I cannot add 3 upfront because the column would be wider than necessary. This is a typical bytesize vs. charsize problem in C.

You add 3 because... why 3? Sounds like you're expecting every character of "MiB" and friends to be transcribed to let's say Russian. How about CJK (3 bytes per character), Vietnamese (they use a lot of combining character as far as I know) etc. scripts? How do you know that they don't need more?

The code, without a comment right there about this "+ 3" IMO cannot be understood.

The two most important concepts needed in mc are the length in bytes when encoded in UTF-8 (usually called 'len' or 'length') and the width it occupies on the screen aka. the number of cells (usually called 'width'). They are quite loosely coupled: a Unicode codepoint can take anywhere from 1 to 4 bytes in UTF-8, and can occupy anywhere from 0 to 2 character cells.

In the spirit of the aforementioned article, if you add 3 because of the UTF-8 encoding as well as varying display width of characters (and if this is indeed the right way of doing the maths which I really doubt) then the variable should be called 'width' before incrementing, and 'len' afterwards.

Now, I think I would probably get rid of passing 'len' (the buffer length in bytes) across these methods. It usually makes sense if the data to be stored there can be arbitrarily large (or quite large) and you don't always want to entirely store it. This is not the case here. The "tiny" bufsize (64) is a safe upper limit, a file size suffixed by units will always fit in all locales (or if not because of non-latin digits for some locale, we can just increase the tiny bufsize). I think it's okay to say in all these methods that the caller is expected to give a large enough buffer that can store the output, and make the code simpler by not communicating the actual buffer size. There's absolutely no point in trying to save a couple of bytes by allocating smaller buffers.

As for formatting for a given available width on the screen, there are two possible approaches I can imagine, and I'd need to check the code to see which one it does.

One possibility is that the method that formats the number doesn't care about the width, and it's the caller that somehow truncates it afterwards if it turns out to be too wide. I guess this approach is okay, although not as flexible as the other.

The other possibility is that the method that formats the number knows the available width. As such, in this case this method needs to know the 'width' as a parameter, rather than 'len', which is something that being incremented by 3 because of UTF-8 just when calling this method is obviously totally wrong.

I hope this comment helps coming up with a cleaner design and implementation. Thanks guys for working on this! :)

Nope, sorry, this is not a "Linux feature", this is a bug in the code.

Rule of thumb: a segfault (especially if reproducible by multiple developers) has a perhaps about 99.99% chance of being a bug in the code (or in a library underneath that it calls), and about 0.01% of being some OS/compiler/kernel/etc. issue.

static char buffer[BUF_TINY];

You allocate a buffer of 64 bytes, but it's irrelevant since there's a separate 'len' parameter on which the assertion happens, and this 'len' parameter contains a much smaller number.

You add 3 because... why 3? Sounds like you're expecting every character of "MiB" and friends to be transcribed to let's say Russian. How about CJK (3 bytes per character), Vietnamese (they use a lot of combining character as far as I know) etc. scripts? How do you know that they don't need more?

I don't. It was a base assumption that two bytes per char would suffice, but this is far from complete.

The code, without a comment right there about this "+ 3" IMO cannot be understood.

The two most important concepts needed in mc are the length in bytes when encoded in UTF-8 (usually called 'len' or 'length') and the width it occupies on the screen aka. the number of cells (usually called 'width'). They are quite loosely coupled: a Unicode codepoint can take anywhere from 1 to 4 bytes in UTF-8, and can occupy anywhere from 0 to 2 character cells.

Exactly! This was my primary problem with the columns. panel_fields[] in panel.c passes byte length which made my addition necessary. Throughout the code only length semantic is used where widths would be necessary. That would require to change several other spots of the code when some multibyte locale is used.

In the spirit of the aforementioned article, if you add 3 because of the UTF-8 encoding as well as varying display width of characters (and if this is indeed the right way of doing the maths which I really doubt) then the variable should be called 'width' before incrementing, and 'len' afterwards.

Don't forget that this 'width' directly influences column sizes and if you pass lengh=width*4 to accomondate all possible sizes, it will directly influence the number of digits displayed for the input.

Now, I think I would probably get rid of passing 'len' (the buffer length in bytes) across these methods. It usually makes sense if the data to be stored there can be arbitrarily large (or quite large) and you don't always want to entirely store it. This is not the case here. The "tiny" bufsize (64) is a safe upper limit, a file size suffixed by units will always fit in all locales (or if not because of non-latin digits for some locale, we can just increase the tiny bufsize). I think it's okay to say in all these methods that the caller is expected to give a large enough buffer that can store the output, and make the code simpler by not communicating the actual buffer size. There's absolutely no point in trying to save a couple of bytes by allocating smaller buffers.

In some spots, the buffer is not only used for the size, but for other strings as well. Size cannot be indefinitely large.

As for formatting for a given available width on the screen, there are two possible approaches I can imagine, and I'd need to check the code to see which one it does.

One possibility is that the method that formats the number doesn't care about the width, and it's the caller that somehow truncates it afterwards if it turns out to be too wide. I guess this approach is okay, although not as flexible as the other.

The other possibility is that the method that formats the number knows the available width. As such, in this case this method needs to know the 'width' as a parameter, rather than 'len', which is something that being incremented by 3 because of UTF-8 just when calling this method is obviously totally wrong.

I hope this comment helps coming up with a cleaner design and implementation. Thanks guys for working on this! :)

It is likely that panel_field_t needs to extended to accomondate all possible Unicode points in the manner you had in mind.

Anyway, thank you very much for the valueable input. I am open to any improvement/new approach in my code.

You allocate a buffer of 64 bytes, but it's irrelevant since there's a separate 'len' parameter on which the assertion happens, and this 'len' parameter contains a much smaller number.

Just checked the code. Without knowing a lot of the internal mechanics of mc, I assumed that the panel_field_t.min_size is really a hard min size and never less and this is what I used throughout. It obviously isn't. Unless someone know why, I would accept Andrew's offer from comment:91.

Guys, I'm really not happy to say this, but I'm getting seriously concerned. Two segfaults, the author apparently not being experienced enough in C string handling, crazily mixing the semantics, try-and-error approach (randomly changing the code until it doesn't seem to crash, rather than properly designing and implementing it), blaming it on Linux... this just doesn't look good to me at all.

Having a version control log is obviously a must-have in software development but it shouldn't be the primary source of documentation. The code itself (e.g. in the tarball) should be understandable, or wherever it gets trickier, it should contain at least a pointer to the explanation.

Exactly! This was my primary problem with the columns. panel_fields[] in panel.c passes byte length which made my addition necessary. Throughout the code only length semantic is used where widths would be necessary. That would require to change several other spots of the code when some multibyte locale is used.

mc is legacy code (22 years old, with UTF-8 support being pretty new compared to that). I don't know who implemented UTF-8 support (truly sorry for that, great kudos!), it was a really tough work, and done greatly. It's sure not picture perfect and sure doesn't look like as if mc's been implemented with UTF-8 straight from the beginning. I vaguely recall that I myself have filed a ticket (probably still open) about clarifying the terminology and renaming some variables (e.g. 'len' vs 'width'). That being said, in the existing code it's still quite easy to figure out the meaning wherever a variable is incorrectly named, and I really don't think there are codes along "len += 3" to convert from length to width. Also, new code should preferably be written with the proper terminology.

Don't forget that this 'width' directly influences column sizes and if you pass lengh=width*4 to accomondate all possible sizes, it will directly influence the number of digits displayed for the input. [...snip...]

At this point I totally don't understand what you say (and probably you don't understand either what I said in my previous comment). And it feels to me that you try to justify the behavior by bringing up the current (buggy) implementation as how the behavior should be.

I have no clue what you mean by "this 'width' directly influences column sizes". In mc the width of a formatted string doesn't influence anything. It's the other way around: the width needs to obey the available column count.

I'm also absolutely lost at "length=width*4" which is just as bad as the "+3" was.

In your thinking, the caller goes like: "I've allocated a buffer of 64 bytes. I need to have the size formatted in (let's say) 8 character cells and placed here. I believe that the UTF-8 overhead will be no more than 3 bytes, even despite that noone will revise my code when the formatter method changes, or when a translation is updated. So let's pass 8+3 = 11 as the length/width (whatever) to the formatter method, even if the current locale is English". Terribly wrong! And for that matter, passing 4*8 = 32 or anything similar would be just as wrong too.

Do you want to tell the formatter method how many bytes it can safely write without risking a segfault? If so, tell it 64 (and call that parameter 'length').

Do you want to tell it how many visual cells are available for the formatted string? If so, tell it 8 (and call that parameter 'width').

There's absolutely no way that could justify passing it 11 or 32 or anything different.

So, again, there are two absolutely totally utterly independent stories:

How does the formatter method know how much storage space is available? Some possible answers:

Require that the caller allocates an obviously large enough buffer, maybe 64 bytes or so, and don't care about this any more.

Require that the caller tells us the available storage space, and if and only if we would write beyond that, the formatter method bails out with an error and then the caller allocates a larger buffer and tries again. Very complex to implement for hardly any benefit. Mostly used in system calls (see e.g. readline(2)) because the kernel cannot allocate new memory for the user. I firmly vote against this approach in mc.

The formatter method dynamically allocates a large enough new string.

Some higher level object (e.g. GStrin) is used.

And, totally independently from this:

How does the formatter know the required width?

It does not. The caller checks the width and if it's too wide, it truncates. The truncation probably automatically happens with the (~) symbol in the middle. It's consistent across the UI, although not quite flexible and the data shown there is not as informative as it could be.

It's passed on as an argument. The formatting method is allowed to take this into account in various ways, e.g. omitting some decimals or omitting the space in front of the unit if width is too small.

Independently from each other, pick one possible answer for both questions, and then implement the formatting method according to this interface. Let 'len' always mean the length in bytes, and 'width' always be the available visual width, and never ever ever have a variable that's a bit of this and a bit of that.

It is likely that panel_field_t needs to extended to accomondate all possible Unicode points in the manner you had in mind.

I've no clue about panel_field_t, however, apparently (see the mcview segfault earlier on) the formatter is used at other places as well, not just in the panel.

After a failed attempt, I became firm on the opinion that the formatter shouldn't make any assumptions on where/how the formatted string would be used, what'll be the minimum width etc.

Exactly! This was my primary problem with the columns. panel_fields[] in panel.c passes byte length which made my addition necessary. Throughout the code only length semantic is used where widths would be necessary. That would require to change several other spots of the code when some multibyte locale is used.

Don't forget that this 'width' directly influences column sizes and if you pass lengh=width*4 to accomondate all possible sizes, it will directly influence the number of digits displayed for the input. [...snip...]

At this point I totally don't understand what you say (and probably you don't understand either what I said in my previous comment). And it feels to me that you try to justify the behavior by bringing up the current (buggy) implementation as how the behavior should be.

I have no clue what you mean by "this 'width' directly influences column sizes". In mc the width of a formatted string doesn't influence anything. It's the other way around: the width needs to obey the available column count.

I'm also absolutely lost at "length=width*4" which is just as bad as the "+3" was.

In your thinking, the caller goes like: "I've allocated a buffer of 64 bytes. I need to have the size formatted in (let's say) 8 character cells and placed here. I believe that the UTF-8 overhead will be no more than 3 bytes, even despite that noone will revise my code when the formatter method changes, or when a translation is updated. So let's pass 8+3 = 11 as the length/width (whatever) to the formatter method, even if the current locale is English". Terribly wrong! And for that matter, passing 4*8 = 32 or anything similar would be just as wrong too.

Do you want to tell the formatter method how many bytes it can safely write without risking a segfault? If so, tell it 64 (and call that parameter 'length').

Do you want to tell it how many visual cells are available for the formatted string? If so, tell it 8 (and call that parameter 'width').

The whole point about length=width*4 was the semantic mismatch between byte and character orientation I was trying to tell you about. If you want to fill n width in a column, you are talking about characters but length in the context of mc operates on bytes.

In the very special case of size_trunc_len() and alike is that the passed len is given in bytes but the upper-level in panel_fields[] contains length in bytes but expects it to be columns. More than that, the scale logic in again byte/single byte char oriented and limited by length. If you pass a buffer with 64 byte and say length=64, it will never scale numbers down to KiB, MiB, etc. because there is enough space (given by length) to accomodate that. You'd always need calculate wether your Unicode points fit in the given length based on the UTF-8 encoding scheme.
It may be a limitation in mc existed ever since, as you mention it is 22 years old, but now revealed simply by my changes. The caller does not know upfront how many bytes the characters will consume. That's the crux -- at least for me.

The whole point about length=width*4 was the semantic mismatch between byte and character orientation I was trying to tell you about. If you want to fill n width in a column, you are talking about characters but length in the context of mc operates on bytes.

Even as an upper limit this would be incorrect because of combining characters. It might take even more than 'width*4' UTF-8 bytes to produce a string of width 'width'.

In the very special case of size_trunc_len() and alike is that the passed len is given in bytes

What's the semantics of len here? Is it the number of available bytes in the output buffer, or is it the number of desired bytes to fill it up with? You're mixing the two! When thinking in bytes, the number of available bytes is 64 in the current code, whereas the number of desired (or maximum) bytes to which to format the string does not make sense. Why would you want to ask the formatter "hey format me this number, I give you a quite large buffer, but make sure not to write more than 11 bytes"? No sense whatsoever. It's the desired UI look you care about, in some locales it'll take 8 bytes, in some it'll take 11, in some it'll take even more. The formatter will know this, why should the caller care? If you specify a limit in bytes, it should be because of technical limitation (the buffer being finite long).

Whereas, on the other hand, you might want to ask the same questions for width. Here, the concept of available width in the buffer does not make any sense, whereas the desired width, to which to format the string (which is 8 in my recent example), is an important one.

You need to absolutely cleary separate all these conceps in your mind. Currently you're thiking about a washed-up mess of these. As long as you don't manage to separate these concepts, you won't be able to come up with proper code.

My fridge is able to store maybe 40-60 cans of beer. Dunno. Depends on the sizes of the cans. But still, I can ask my friend to bring only 6 cans. Why would I have to lie that my fridge is only able to store 6 cans?

but the upper-level in panel_fields[] contains length in bytes but expects it to be columns. More than that, the scale logic in again byte/single byte char oriented and limited by length.

Again, you're bringing up the current buggy code as the rationale for the code being buggy.

If you pass a buffer with 64 byte and say length=64, it will never scale numbers down to KiB, MiB, etc. because there is enough space (given by length) to accomodate that.

It will (if implemented properly) because the semantics of the length parameter is not the desired length of the output, but the technical maximum allowed.

The caller does not know upfront how many bytes the characters will be consume. That's the crux -- at least for me.

This is something anyone ever doing UTF-8 coding needs to overcome. In this case it's damn simple (if going for the "a" answer for the 1st question in my previous comment): just pass it an obviously big enough buffer and don't care about it anymore.

The whole point about length=width*4 was the semantic mismatch between byte and character orientation I was trying to tell you about. If you want to fill n width in a column, you are talking about characters but length in the context of mc operates on bytes.

Even as an upper limit this would be incorrect because of combining characters. It might take even more than 'width*4' UTF-8 bytes to produce a string of width 'width'.

In the very special case of size_trunc_len() and alike is that the passed len is given in bytes

What's the semantics of len here? Is it the number of available bytes in the output buffer, or is it the number of desired bytes to fill it up with? You're mixing the two! When thinking in bytes, the number of available bytes is 64 in the current code, whereas the number of desired (or maximum) bytes to which to format the string does not make sense. Why would you want to ask the formatter "hey format me this number, I give use a quite large buffer, but make sure not to write more than 11 bytes"? No sense whatsoever. It's the desired UI look you care about, in some locales it'll take 8 bytes, in some it'll take 11, in some it'll take even more. The formatter will know this, why should the caller care? If you specify a limit in bytes, it should be because of technical limitation (the buffer being finite long).

This is the desired amount of bytes, upper bound. I am not mixing it. I was aware of this issue upfront and this existed way before I started to change anything. The reason for limit the output length is that the column is bound to 8 chars, everything is truncated with a tidle. Not my contribution. That's the current state.

You need to absolutely cleary separate all these conceps in your mind. Currently you're thiking about a washed-up mess of these. As long as you don't manage to separate these concepts, you won't be able to come up with proper code.

This is a general problem in the entire codebase I cannot solve.

My fridge is able to store maybe 40-60 cans of beer. Dunno. Depends on the sizes of the cans. But still, I can ask my friend to bring only 6 cans. Why would I have to lie that my fridge is only able to store 6 cans?

Without knowning the volume of your fridge and the cans, this question is impossible to answer. Therefore, I prefer wine, standard 750 mL by volume.

but the upper-level in panel_fields[] contains length in bytes but expects it to be columns. More than that, the scale logic in again byte/single byte char oriented and limited by length.

Again, you're bringing up the current buggy code as the rationale for the code being buggy.

If you pass a buffer with 64 byte and say length=64, it will never scale numbers down to KiB, MiB, etc. because there is enough space (given by length) to accomodate that.

It will (if implemented properly) because the semantics of the length parameter is not the desired length of the output, but the technical maximum allowed.

The caller does not know upfront how many bytes the characters will be consume. That's the crux -- at least for me.

This is something anyone ever doing UTF-8 coding needs to overcome. In this case it's damn simple (if going for the "a" answer for the 1st question in my previous comment): just pass it an obviously big enough buffer and don't care about it anymore.

With my little knowledge of the codebase, I am not in the position to properly evaluate this option.

I have just explained to you why this concept does not make any sense whatsoever. (And by the way, I have also just clarified that I used 'len' with the other semantics, the one that does make sense.)

The reason for limit the output length is that the column is bound to 8 chars, everything is truncated with a tidle.

And as such the output should be limited to a certain number of character width, which has not much to do with the length in bytes.

Not my contribution. That's the current state.

I'm afraid you're far from understanding how mc works. The problem of displaying data in a certain width is solved throughout mc without segfaults and such.

I feel like this discussion is going nowhere, and as much as I'd love to spend much time on mc (including reviewing the patches here, fixing the code etc.), I hardly have any. As such, the top priority for me remains speaking up when I'm afraid things are going wrong... and that's why unfortunately I'm here.

Seeing the history of this ticket, the problems encountered on the way, the two segfaults, your uncertainity in C string handling, the antipatterns and unclear semantics of variables, the terrible "+3" hack for UTF-8, you yourself being in serious doubts why the code fails, and so on, I believe that this patch is noticably below the average quality of mc. Even worse than that: you seem to be stubborn that you're doing things okay, you're not listening to me how it should be done, and blame the rest of the codebase.

If any of the developers here are willing to give this patchset a really thorough review/cleanup and heavy stress-testing (including (fake) translations to Russian, some CJK, and a language with lots of combining chars), let it be. Otherwise my vote for goes for reverting it entirely. In its current form it doesn't bring the quality that I could be happy about, could be rest calm without worrying about a third segfault being discovered by someone shortly after the release of the next tarball... :(

I'm sorry but I don't have time (nor motivation) to continue the discussion we had in the last couple of comments.

I have just explained to you why this concept does not make any sense whatsoever. (And by the way, I have also just clarified that I used 'len' with the other semantics, the one that does make sense.)

Either I do not understand you or you don't understand me. You are either implying that I have introduced this concept or have abused it. The former is not true, the latter arguable. I have simply reused what was given w/o reinventing the wheel.

The reason for limit the output length is that the column is bound to 8 chars, everything is truncated with a tidle.

And as such the output should be limited to a certain number of character width, which has not much to do with the length in bytes.

Why don't you show a simple patch where you depict your idea as you imply that I don't get it?

Not my contribution. That's the current state.

I'm afraid you're far from understanding how mc works. The problem of displaying data in a certain width is solved throughtout mc without segfaults and such.

This is maybe true, but I have to start somewhere...

Seeing the history of this ticket, the problems encountered on the way, the two segfaults, your uncertainity in C string handling, the antipatterns and unclear semantics of variables, the terrible "+3" hack for UTF-8, you yourself being in serious doubts why the code fails, and so on, I believe that this patch is noticably below the average quality of mc. Even worse than that: you seem to be stubborn that you're doing things okay, you're not listening to me how it should be done, and blame the rest of the codebase.

If I would be stubborn, I wouldn't take any discussion trying to sort things out and understand what you are really after. I never said that the things I do are best, I even admitted that I am *not* a day-to-day C hacker several times. Again, it would be very benificial showing a snippet of code to portray your idea.

You had plenty of time to evaluate my patches and reject them, so that I can learn and improve.

Why don't you show a simple patch where you depict your idea as you imply that I don't get it?

If you do not understand my written description how I'd go with the code, I'm sorry but I don't trust you that you'd be able to properly finish a started proof of concept patch. And I don't have time (let alone motivation since I didn't agree with the goal of this patch from the beginning) to come up with a complete polished and well-tested patch.

You had plenty of time to evaluate my patches and reject them, so that I can learn and improve.

I did not have free time to work on this. I trusted that it went in the right way; it turned out just recently that it didn't and it's not even close to that.

I understand it sucks from your point of view that someone comes along and criticizes your work without contributing a single line of code to it. But let's be honest: I'm here because it crashes, and after realizing that, I found a no-go in the code and then found other no-gos during our discussions. I have time to speak up that something's horribly wrong here, whereas I don't have time to fix it, sorry.

andrew_b, mooffie, egmont: I suggest the following - I will make an rc this weekend, without this branch, and we will see if we merge it after the release. The skins are now committed, quite some stuff has accumulated in the NEWS file and last release was in fall. I think we should really make one now. What do you think?

andrew_b, mooffie, egmont: I suggest the following - I will make an rc this weekend, without this branch, and we will see if we merge it after the release. The skins are now committed, quite some stuff has accumulated in the NEWS file and last release was in fall. I think we should really make one now. What do you think?

Please go ahead with your RC and don't wait for me! Other annoying issues at work held me off doing a verification of the branch.

I agree. This one's a controversial change, and hasn't quite been tested in git master. Let's not commit it last minute before a release, let's stay safe instead (and commit soon after a release, if it's believed to be production ready).