Topic: Table File Standard Discussion (Read 21741 times)

I've moved from working on paper to actually writing code. Table switching did create a few headaches, but I'm using recursion as my stack now, so it's all good (well, except for post-processing of overlapping strings with bytes interpreted by different tables, which is still a mess, but hopefully no game was insane enough to do that; same goes for pointers into the middle of multibyte tokens). I can locate, translate, and output text now, and it is fun. However, there are a couple of things in the table standard that I'd like to get a sanity check on.

Raw Hex Inserts

Given the following table:

00=<$01>
01==
02=this is a '<$01>' string
03=this is a '
04=' string
05=<
06=>
07=$
08=\
10=0
11=1

the hex sequence "00 01 02" will be dumped as the text "<$01>=this is a '<$01>' string" but must be inserted as "01 01 03 01 04" as per 2.2.1. This seems wrong, but the problem could be resolved by replacing the entries for 00 and 02 with e.g.

00=<$$01>
02=this is a '<$$01>' string

so perhaps text sequences containing <$[0-9A-Fa-f][0-9A-Fa-f]> should be forbidden. Also, it might be more appropriate to include the section on inserting raw hex in 2.2.4 instead of in 2.2.1. Also also, it might be worth mentioning that hex characters are equally valid in upper or lower case (e.g. "AB" == "ab" == "Ab" == "aB").
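To make the round-trip failure concrete, here is a minimal sketch (my own code and naming, not part of the standard) of a dumper and an inserter that gives <$XX> raw hex precedence as 2.2.1 describes. Note that any entry whose text contains the raw hex pattern can then never match on insertion:

```python
import re

# Trimmed version of the example table above (entries 05-11 omitted)
TABLE = {
    "00": "<$01>",
    "01": "=",
    "02": "this is a '<$01>' string",
    "03": "this is a '",
    "04": "' string",
}
RAW_HEX = re.compile(r"<\$([0-9A-Fa-f]{2})>")

# Entries containing <$XX> are unusable for insertion under the precedence rule
INSERTABLE = {h: t for h, t in TABLE.items() if not RAW_HEX.search(t)}

def dump(hex_bytes):
    """Hex -> text using the full table."""
    return "".join(TABLE[b] for b in hex_bytes)

def insert(text):
    """Text -> hex, with raw hex taking precedence over table entries."""
    out = []
    while text:
        m = RAW_HEX.match(text)
        if m:                                   # raw hex wins (2.2.1)
            out.append(m.group(1).upper())
            text = text[m.end():]
            continue
        # otherwise, take the longest matching insertable text sequence
        best = max((t for t in INSERTABLE.values() if text.startswith(t)),
                   key=len, default=None)
        if best is None:
            raise ValueError("untranslatable text: " + text[:20])
        out.append(next(h for h, t in INSERTABLE.items() if t == best))
        text = text[len(best):]
    return out

print(insert(dump(["00", "01", "02"])))  # ['01', '01', '03', '01', '04']
```

The round trip produces "01 01 03 01 04" rather than the original "00 01 02", which is exactly the complaint above.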

Control Codes in End Tokens Requiring Hex Representation

Given the following table:

00=text
01=more text
/FF=<end>\n\n//

the hex sequence "00 FF" will be dumped as the text "text<end>

//". When inserting, control codes are ignored for end tokens requiring hex representation, so any of "text<end>//", text<end>//", "text<end>

//", etc. will be mapped to "00 FF", but "text<end>" will be mapped to "00" since "<end>" is ignored as per 2.2.2.

//", which is probably not what was intended (maybe you could try to interpolate the pointer output into the end token's newlines, but that sounds like an extremely bad idea). Output commenting should probably be controlled at the utility level rather than the table level.

When inserting, should control codes be ignored for all tokens, or just end tokens requiring hex representation?

Uniqueness of End Token Names

Quote

Note: End Tokens, regardless of type, must be uniquely named.

The standard makes no definition of what constitutes a "name". Given that duplicate hex sequences are forbidden by 2.2.5, I assume name refers to the text sequence. Presumably an error should be generated when encountering a duplicate text sequence and at least one of the tokens involved is an end token... maybe? Is this dependent on the type of token? How about dumping vs. inserting? Given the following table:

what errors (if any) should be generated when dumping? when inserting?

While on the topic of uniqueness, it might be worth including a note in the standard that restricts the definition of a "unique" entry to the current logical table. Otherwise an error (duplicate hex sequence) should be generated by the following table:

@table1
00=foo

@table2
00=bar

Conversely, TableIDString in 2.6 should be considered unique across all tables (including across multiple table files) provided to the utility.

Linked Entries

Attempting to insert a linked entry which lacks its corresponding <$XX> in the insert script should generate an error under 4.2, right?

Under the "wishing for the moon" category: It would be nice if we could define linked entries in such a way that we could obtain output with both pre- and post-fix, like "A(Color:<$A4><$34>)a". Theoretically, there's no reason a table file has to be restricted to dealing with in-game script. I can imagine, for instance, somebody writing a table like:

...
A9=LDA #,1
AA=TAX
AC=LDY $,2
...

and wanting "B1" to map to "LDA ($<$XX>),Y". You could do that with e.g. "B1=LDA ($,1,),Y", but since we lack a general escape character, you couldn't determine which commas were field delimiters and which were data . It might also be nice to be able to specify how raw bytes were output in case you don't want the <$XX> format.

Table Switching

NumberOfTableMatches in 2.6 refers to tokens (each of variable length) rather than bytes, yes? How many tokens do the bytes of a linked entry count as?

Given the following table:

@table1
00=1
!01=table2,2

@table2
00=A
01=B
0100=C
$02=<Color>,2

I think the process for translating "01 02 25 25 01 00 00" starting with table1 would be this:
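For what it's worth, the walkthrough I have in mind can be sketched like this (structures and naming are my own; match counting per my reading of 2.6, with a linked entry counting as one match):

```python
# Hypothetical representation of the two tables above
TABLES = {
    "table1": {
        "normal": {"00": "1"},
        "linked": {},
        "switch": {"01": ("table2", 2)},    # !01=table2,2
    },
    "table2": {
        "normal": {"00": "A", "01": "B", "0100": "C"},
        "linked": {"02": ("<Color>", 2)},   # $02=<Color>,2
        "switch": {},
    },
}

def dump(data, table_id):
    out, i = [], 0
    stack, matches_left = [], None          # for table fallback (2.6)
    while i < len(data):
        t = TABLES[table_id]
        for n in (2, 1):                    # longest hex sequence first
            if i + n > len(data):
                continue
            key = "".join(f"{b:02X}" for b in data[i:i + n])
            if key in t["normal"]:
                out.append(t["normal"][key]); i += n; break
            if key in t["linked"]:
                text, params = t["linked"][key]
                out.append(text)            # linked entry: one match total
                out.extend(f"<${b:02X}>" for b in data[i + n:i + n + params])
                i += n + params; break
            if key in t["switch"]:
                new_id, count = t["switch"][key]
                stack.append((table_id, matches_left))
                table_id, matches_left = new_id, count
                i += n; break
        else:
            raise ValueError(f"no match at offset {i}")
        if matches_left is not None and key not in t["switch"]:
            matches_left -= 1
            if matches_left == 0:           # fall back to the previous table
                table_id, matches_left = stack.pop()
    return "".join(out)

data = bytes([0x01, 0x02, 0x25, 0x25, 0x01, 0x00, 0x00])
print(dump(data, "table1"))                 # <Color><$25><$25>C1
```

That is: 01 switches to table2 for two matches; 02 25 25 is the linked <Color> entry (match one); 01 00 matches the multibyte entry 0100=C (match two, triggering fallback); and 00 is "1" back in table1.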

Various errata

2.2.1: "sequenes"
2.2.1: "hex byte insertion takes precedent" should read "hex byte insertion takes precedence"
2.3: "Control Codes" is a somewhat ambiguous term, since it can refer to the only defined table entry format control code ("\n") or to game-specific control codes as referenced in 2.5
3.1: "in and automated fashion" should read "in an automated fashion"
3.4, Example 1: should "E060" read "E030"?
4.2: "paramter"

Quote

I've moved from working on paper to actually writing code. Table switching did create a few headaches

I handled table switching with an array of tables, an index to the active table, and a stack of table indexes to handle the table jumps and returns. It was only a few lines of code. It was more difficult to handle notification of the several conditions that would require the table to fall back or jump to another.
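That approach might look roughly like this (a sketch with my own naming, not the poster's actual code):

```python
class TableSwitcher:
    """An array of tables, an index to the active table, and a stack of
    indexes to handle table jumps and returns."""
    def __init__(self, tables):
        self.tables = tables        # list of table objects/dicts
        self.active = 0             # index of the currently active table
        self.stack = []             # saved indexes for returning

    def jump(self, index):
        self.stack.append(self.active)
        self.active = index

    def fall_back(self):
        self.active = self.stack.pop()

sw = TableSwitcher(["table1", "table2"])
sw.jump(1)                          # switch token read: jump to table2
print(sw.active)                    # 1
sw.fall_back()                      # fallback condition met: return
print(sw.active)                    # 0
```

The hard part, as noted, is deciding *when* to call fall_back, not the mechanism itself.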

Quote

Raw Hex Inserts

the hex sequence "00 01 02" will be dumped as the text "<$01>=this is a '<$01>' string" but must be inserted as "01 01 03 01 04" as per 2.2.1.

Yes, that would be the way it would currently operate. The possibility of normal text sequences containing the "<$XX>" pattern was deemed to be very low and thus the notation that raw hex insertion takes precedence was chosen. It would make sense to disallow the pattern of usage in normal text sequences altogether. I will run it by a few others and see if any other thoughts come from it.

Quote

Control Codes in End Tokens Requiring Hex Representation

//", which is probably not what was intended (maybe you could try to interpolate the pointer output into the end token's newlines, but that sounds like an extremely bad idea). Output commenting should probably be controlled at the utility level rather than the table level.

It is controlled at the utility level. The table file has no knowledge of what commenting characters are. The example simply illustrates a possibility of being able to output like that, if you desire, for situations it's appropriate for. Even then, you still would not be able to insert without the insertion utility being aware of your commenting characters, which the table file certainly does not do. You would obviously not choose to do this at all with Atlas output like you showed in your example. I will think about making that a bit more clear, or change the example to not have anything to do with comments to eliminate the confusion. I just thought that would be a common and useful application for many simple cases.

Quote

When inserting, should control codes be ignored for all tokens, or just end tokens requiring hex representation?

They should be ignored for all tokens. Linebreak control exists for dumping output/readability only. It serves no use for insertion. Cases have been made for eliminating it and adding line breaks later via regex or search and replace, which would remove control codes from the table file entirely.

Quote

Uniqueness of End Token Names

The standard makes no definition of what constitutes a "name". Given that duplicate hex sequences are forbidden by 2.2.5, I assume name refers to the text sequence. Presumably an error should be generated when encountering a duplicate text sequence and at least one of the tokens involved is an end token... maybe? Is this dependent on the type of token? How about dumping vs. inserting?

Here, we're talking specifically about end tokens. The end token's name (or more precisely text sequence) should be unique regardless of end token type. End Tokens are the only token type that must have unique text sequences.

Quote

what errors (if any) should be generated when dumping? when inserting?

The only error would be with the duplication of end token names. Everything else follows the rules in 2.2.6. You do make a good point about end token text sequences differing only by '\n'. At present that would result in a duplicate for insertion purposes, but would pass for dumping. That feeds into the case for elimination of control sequences altogether, but then there's no way to have any line breaks anywhere other than what your dumper or other third party app may be able to do. It is a possibility we considered, but the majority of people who would use the applications based on this format are going to want to have some line breaks without having to jump through hoops or additional programs/steps.

Quote

While on the topic of uniqueness, it might be worth including a note in the standard that restricts the definition of a "unique" entry to the current logical table. Conversely, TableIDString in 2.6 should be considered unique across all tables (including across multiple table files) provided to the utility.

The standard provides only for a single table per file. There can be only one Table ID line that will uniquely identify that table.

Quote

Linked Entries

Attempting to insert a linked entry which lacks its corresponding <$XX> in the insert script should generate an error under 4.2, right?

Not sure what you mean here. Linked entries aren't really inserted. For insertion, they pretty much become normal entries for all practical purposes. The parameter bytes are all raw hex at that point.

Quote

Under the "wishing for the moon" category: It would be nice if we could define linked entries in such a way that we could obtain output with both pre- and post-fix, like "A(Color:<$A4><$34>)a". Theoretically, there's no reason a table file has to be restricted to dealing with in-game script. I can imagine, for instance, somebody writing a table like: It might also be nice to be able to specify how raw bytes were output in case you don't want the <$XX> format.

That would make it much more difficult to insert. At present, a linked entry is only used for dumping. After that, you just have a normal entry and some raw hex bytes. Insertion can then be done by token stream. Having a prefix and postfix and keeping the notion of a linked entry would be more complex. Usage outside of ROM Hacking is beyond the scope of the document. Lastly, we explicitly defined the raw hex format and do not allow for other formats for simplicity and unification of upcoming utilities based on this standard. It was a real black eye for incompatibility in the past. If you don't define a raw hex format, it leads to many more overlapping and ambiguous possibilities, which is counterproductive to what we're doing.

Quote

Table Switching

NumberOfTableMatches in 2.6 refers to tokens (each of variable length) rather than bytes, yes? How many tokens do the bytes of a linked entry count as?

Your example looks correct. We use the term 'matches' for a reason. There are no characters or bytes involved. Thus a linked entry hit counts as a single match, as you have it.

Quote

Various errata

2.2.1: "sequenes"
2.2.1: "hex byte insertion takes precedent" should read "hex byte insertion takes precedence"
2.3: "Control Codes" is a somewhat ambiguous term, since it can refer to the only defined table entry format control code ("\n") or to game-specific control codes as referenced in 2.5
3.1: "in and automated fashion" should read "in an automated fashion"
3.4, Example 1: should "E060" read "E030"?
4.2: "paramter"

Quote

It would make sense to disallow the pattern of usage in normal text sequences altogether.

As added support for this, I'll point out that the raw hex precedence rule makes affected table entries entirely useless for insertion. This behaviour also feels counterintuitive at first glance, unlike the case of duplicate text sequences.

The current placement strongly suggests that the "ignore newlines when inserting" rule applies only to end tokens requiring hex representation, so 2.3 should probably say so instead. Better yet, I think, would be to strike the rule altogether. If somebody went to the trouble of putting a newline in their table in the first place, they must have had a reason to do so. Assuming that reason was for formatting, the odds are very good those newlines will still be exactly where they should be when it comes time to insert, in which case ignoring them or not leads to the same result.

In any case, the decision whether to ignore newlines or not should be left to the utility rather than imposed by the table standard.

Quote

The only error would be with the duplication of end token names. Everything else follows the rules in 2.2.6.

So the only errors are

/00=<end>,2
/<end>,2

and

/01=<end>,2\n
/<end>,2\n

for dumping and

/00=<end>,2
/01=<end>,2\n
/<end>,2
/<end>,2\n

for inserting. Ok, let's clean those up and add a (very) little variety:

/00=<end>,2
$02=<end>,2
!03=<end>,2
04=<end>,2
05=<end>,2\n
06= blah

Following 2.2.6, then, when an inserter reads " blah<end>,2 blah" it should insert "06 05 06".
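Sketching the pick I'm describing (my own code; every entry's normalized text sequence treated as "<end>,2" after ignoring \n, per the reading above):

```python
ENTRIES = [                      # (hex sequence, text sequence) in file order
    ("00", "<end>,2"),           # /00=<end>,2
    ("02", "<end>,2"),           # $02=<end>,2
    ("03", "<end>,2"),           # !03=<end>,2
    ("04", "<end>,2"),           # 04=<end>,2
    ("05", r"<end>,2\n"),        # 05=<end>,2\n
    ("06", " blah"),             # 06= blah
]

def pick(text):
    """2.2.6: last occurring, shortest hex sequence; \n ignored for matching."""
    cands = [h for h, t in ENTRIES if t.replace(r"\n", "") == text]
    shortest = min(len(h) for h in cands)
    return [h for h in cands if len(h) == shortest][-1]

print([pick(" blah"), pick("<end>,2"), pick(" blah")])   # ['06', '05', '06']
```

All five "<end>,2" entries tie on hex length, so the last occurring one (05) wins.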

What's the reasoning behind disallowing duplicate text sequences for end tokens only? The "last occurring shortest hex sequence" rule from 2.2.6 seems like it should also apply here, with artificial end tokens counting as hex sequences of length 0. If somebody decides to give their artificial end tokens the same text as some other end token and then things break, how is that different from giving any two game-specific control codes (which are likely represented as normal entries) the same text? Is it just that with end tokens we're guaranteed to be dealing with game-specific control codes?

I'm not even sure this is something the table standard should try to enforce - like commenting, artificial end tokens would appear to be something more properly dealt with at the utility level than at the table level.

Eh? Oh. Hmm. Somehow I was under the impression that the standard supported multiple logical tables within the same physical file. At the very least, it isn't explicitly forbidden, and I see no difficulty in supporting it (having already done so myself) if we impose two more constraints:

1) a table's ID line must precede any entries for that table
2) if a file contains multiple tables, every table in that file after the first table must have an ID line

(in fact I require an ID for every table used, but I don't feel that condition should be imposed as part of the standard)
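Under those two constraints, parsing stays simple. A sketch (function and default ID are my own invention, defaulting the first table's ID as described later):

```python
def split_tables(lines, default_id="my_table_file.tbl"):
    """Split one physical table file into logical tables on @ID lines."""
    tables, current_id, entries = {}, default_id, []
    for line in lines:
        if line.startswith("@"):            # constraint 1: ID precedes entries
            if entries:
                tables[current_id] = entries
            current_id, entries = line[1:], []
        elif line:
            entries.append(line)
    tables[current_id] = entries
    return tables

print(split_tables(["@table1", "00=foo", "@table2", "00=bar"]))
# {'table1': ['00=foo'], 'table2': ['00=bar']}
```

A single-table file with no @ line still works: everything lands under the default ID.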

Quote

For insertion, they pretty much become normal entries for all practical purposes.

So if instead of "A<Color><$A4><$34>a" we only have "A<Color><$A4>a", you'd want to insert "A0 E0 A4 21"? How does this interact with table switching? Continuing my table switching example, what does the insertion process look like when instead of "<Color><$25><$25>C1", we're trying to insert "<Color><$25>C1" starting with table1?

Quote

As added support for this, I'll point out that the raw hex precedence rule makes affected table entries entirely useless for insertion. This behaviour also feels counterintuitive at first glance, unlike the case of duplicate text sequences.

Counterintuitive as opposed to what? You need to have the ability for raw hex handling. You either need to have a proper escape mechanism, which we (those that discussed it) agreed early on to omit for simplicity, or you need to define a rule to cover the overlap case. The rule can only really lean one way or the other: raw hex or table entry. As you exemplified, the rule isn't even needed if you simply disallow using the raw hex pattern in your table entry. So, I'm not sure what other behavior would be more intuitive here.

Quote

The current placement strongly suggests that the "ignore newlines when inserting" rule applies only to end tokens requiring hex representation, so 2.3 should probably say so instead. Better yet, I think, would be to strike the rule altogether. If somebody went to the trouble of putting a newline in their table in the first place, they must have had a reason to do so. Assuming that reason was for formatting, the odds are very good those newlines will still be exactly where they should be when it comes time to insert, in which case ignoring them or not leads to the same result. In any case, the decision whether to ignore newlines or not should be left to the utility rather than imposed by the table standard. I think rather that it feeds into the case for eliminating the "ignore newlines when inserting" rule :p A table should be valid, period, not valid for dumping or valid for inserting.

Processing newlines for insertion as part of your tokens is a bad idea. No thank you to that. No insertion utilities currently do that to my knowledge, and for good reason. I once made a private utility that did do it, and it was bad enough to get working perfectly even for my private, controlled use. The newlines are there only to aid in the output spacing of the dump. You're guaranteed to change it. One extra line break, a carriage return entered silently by your word processor, a rogue space inserted in your game script, a translator hitting enter one too many times, etc. breaks it right away. Let me tell you from direct experience, you will not be able to maintain XXK of game script and have all your line breaks be exactly as they were in the dump without many hard to find/fix mistakes!

In my opinion, the alternative, if the newlines cause such a problem, is simply to eliminate them altogether. The negative to that is that it makes the dump utility responsible for all line breaks. To get it as flexible as it is in the table standard, allowing for newline to be used in any table entry, would be a bit tough and burdensome. I'd imagine most utilities would go no further than line breaks after game line breaks and line breaks after end tokens. That's why I hang on to it, even though a good case is made for its elimination. It's also inheriting what Cartographer, Romjuice, and Atlas already do.

Quote

What's the reasoning behind disallowing duplicate text sequences for end tokens only? The "last occurring shortest hex sequence" rule from 2.2.6 seems like it should also apply here, with artificial end tokens counting as hex sequences of length 0. If somebody decides to give their artificial end tokens the same text as some other end token and then things break, how is that different from giving any two game-specific control codes (which are likely represented as normal entries) the same text? Is it just that with end tokens we're guaranteed to be dealing with game-specific control codes?

I didn't find any backing for a reason when I looked back at past conversation. It may have been a first-come solution to the ambiguity. I agree it could be allowed and follow the 2.2.6 rule. I'm not seeing the reason for the exception right now.

Quote

I'm not even sure this is something the table standard should try to enforce - like commenting, artificial end tokens would appear to be something more properly dealt with at the utility level than at the table level.

This is another situation where it is and it isn't. It should be handled on the utility level. It has to be for dumping because it takes logic to define when an artificial control code is to be output. However, after it's dumped, it's no longer artificial. It becomes a real end token and it should be defined in your table then to be used for insertion as normal.

Quote

Eh? Oh. Hmm. Somehow I was under the impression that the standard supported multiple logical tables within the same physical file. At the very least, it isn't explicitly forbidden, and I see no difficulty in supporting it (having already done so myself) if we impose two more constraints:

I don't really want to do this. It takes away some of the simplicity of table file parsing and loading. It is a table file, singular, after all. It also increases the complexity of defining the starting table in the utility. Instead of simply picking the starting file with no processing required, you need to process all table files and generate a list of tables by ID. Just having table switching to begin with probably puts this out of reach for many to implement (even you indicated difficulty). Many in our community are amateur programmers at best (not that we don't also have pros) and the more burdensome it is, the less it will be used. It adds a layer of complication I'm not too interested in. Even if it is a good idea, at the end of the day, I can't advocate a standard I don't want to program a utility for, so this is probably something I won't go with.

Quote

So if instead of "A<Color><$A4><$34>a" we only have "A<Color><$A4>a", you'd want to insert "A0 E0 A4 21"? How does this interact with table switching? Continuing my table switching example, what does the insertion process look like when instead of "<Color><$25><$25>C1", we're trying to insert "<Color><$25>C1" starting with table1?

Yes, that's correct (insert A0 E0 A4 21). For table switching, when dumping a linked entry, it would count as a single match. On the insertion side, you don't really have this nested table switching setup anymore. You're either going to have the table switching raw hex bytes there or you're going to have a command. When last I spoke to Klarth (author of Atlas), there were still some implementation details to work out, but the idea would be that the dumping utility would output a table command the inserter would use. This is a utility issue. The scope of the table is only to properly decode the character with the right table. How to indicate to the inserter what table to use is a different story. In my own utility I will probably just end up outputting the table switch bytes in the dump and/or provide an option to omit them.

Ideally, if I dump a bunch of hex to text and then reinsert that text unaltered, my basic expectation is to get the original hex again. In the case of duplicate text sequences or when compression is involved, I would expect to get some optimally compressed variant of the text. Giving precedence to raw hex bytes breaks that expectation. Except in the case of linked entries, raw hex bytes represent a failure of the hex-to-text translation process. I understand why that failure needs to be addressed with higher priority. I just don't like it. Disallowing situations in which it can arise seems like the simplest solution, and besides, who uses <$XX> in their table entries anyway?

The only thing differentiating "/text" from the disallowed "=text" is the end token flag, and the end token flag is primarily a utility hint rather than hex <-> text translation information. As such, it just feels like including artificial end tokens in the table standard wanders a little further than necessary into defining the content of script files and utility interfaces, leaning towards the Atlas/Cartographer style status quo in the interests of backwards compatibility. I'm not saying any of this is necessarily bad, but it does discourage growth in other directions. After all, there are other ways for a utility to keep track of where strings end without using artificial end tokens.

Fair enough. Table switching complicates hex <-> text translation, but doesn't have much impact on table file parsing. I agree it gets a bit silly if the user provides you with large quantities of unused table data (like if somebody stored all the tables for every game they'd ever worked on in the same file [wait, that sounds kind of awesome]), but assuming the utility only receives tables it needs, you still have to go through all the same parsing/selection steps anyway, so the overhead is pretty low. As an added convenience, I'm defaulting table ID to the table's file name, so there's very little work required for the end user, and single table files with no ID still work. In any case, being burdensome to implement is not the same as being burdensome to use, and this seems like something people might use. I kind of want to keep it, but at the same time, people going off and doing their own thing is what led to all the incompatibilities this standard tries to correct, and I definitely agree in principle to having inter-operable tools. Thoughts?

Right, no safeguarding the user from their own mistakes, just like the policy on newlines. Oh wait (no, I'm not seriously suggesting that ignoring newlines for insertion is a panacea; there are still lots of ways to get into trouble). I guess I don't much care either way, but I think maybe I'll print a warning, just in case.

Table switching does appear to make insertion more complex. I don't think I can just tell perl to do it for me anymore (or maybe I just need to ask nicer). Outputting table switch tokens/inserter commands might work well enough if all you want to do is re-insert the original text, but if that text has been modified, I think you might need to keep the nested setup in order to deal with table switch tokens that come with a non-zero token count, under the assumption that the game is going to expect to find that many tokens in the new table. That's why I'm concerned about linked tokens - it sounds as though a linked token with its parameter bytes counts as one match for dumping, but as multiple matches for inserting?

Quote

I handled table switching with an array of tables, an index to the active table, and a stack of table indexes to handle the table jumps and returns. It was only a few lines of code. It was more difficult to handle notification of the several conditions that would require the table to fall back or jump to another.

I've got a table object parsing tokens and then calling the parsing method on another table when it reads a switch token. End conditions were also interesting - I've got whichever comes first of "read X bytes", "read Y tokens", and "read an end token", with "unable to read a token" for sub tables only, with each condition propagating to parent table(s) as appropriate. I am unconvinced this is sufficient in general, but have no real-world example to support that belief. It would be nice if we had a repository of known unsupported formats and things in general that still require a custom dumper/inserter. I'm trying to be as flexible as I can, but specific goals are nice sometimes too!

As a point of possible interest, I found it more useful (especially for post-processing) to return a list of tokens rather than the translated text.

New business:

I disagree with the "longest prefix" insertion rule from 2.2.4. With a slightly altered example:

12=Five
13=SixSeven
00=FiveSix

the "longest prefix" rule makes it impossible to insert the string "FiveSixSeven", despite that being valid output from a dumper. A less greedy algorithm would be able to insert "12 13" instead, which seems superior to inserting "00" but losing the "Seven".

It's not clear from the wording (2.6), but presumably NumberOfTableMatches here refers to the match count in the old table rather than the new table, i.e. "!hex=tableID,1" is not a no-op.

Quote

NumberOfTableMatches is the non-negative number of table matches to match before falling back to the previous table.

It's not clear from the wording (2.6), but when counting matches, I think this should be restricted to matches in the new table itself, rather than including matches from any other table(s) the new table might subsequently switch to. Here's an example to illustrate:

Quote

The only thing differentiating "/text" from the disallowed "=text" is the end token flag, and the end token flag is primarily a utility hint rather than hex <-> text translation information. As such, it just feels like including artificial end tokens in the table standard wanders a little further than necessary into defining the content of script files and utility interfaces, leaning towards the Atlas/Cartographer style status quo in the interests of backwards compatibility. I'm not saying any of this is necessarily bad, but it does discourage growth in other directions. After all, there are other ways for a utility to keep track of where strings end without using artificial end tokens.

Absolutely, but that's exactly what we're going for. We're not trying to make something radically new. We're refining and defining that status quo. The second paragraph of the overview sums it up. The end result should be able to be adapted into Atlas and Cartographer without all that much modification. The idea here is to get everybody on the same page for the first time ever, and hold us over. Baby steps. Otherwise, I can guarantee you it will go nowhere and nobody else will ever use it. Just look at the whole patching format fiasco and why many people still use IPS. I can't guarantee this will fare any better, but what I can guarantee is that if, at the very least, it is supported in Atlas, Cartographer, and TextAngel, it will be used by 85% of everybody out there and provide a dumping and inserting standard for the near future. This will hold until we eventually move on to XML or whatever the next evolution will be, where radical new and improved ideas can run crazy and new tools can be developed. In the ROM Hacking community, such a move can often take a decade or more. We need something in between.

Quote

Fair enough. Table switching complicates hex <-> text translation, but doesn't have much impact on table file parsing. I agree it gets a bit silly if the user provides you with large quantities of unused table data (like if somebody stored all the tables for every game they'd ever worked on in the same file [wait, that sounds kind of awesome]), but assuming the utility only receives tables it needs, you still have to go through all the same parsing/selection steps anyway, so the overhead is pretty low. As an added convenience, I'm defaulting table ID to the table's file name, so there's very little work required for the end user, and single table files with no ID still work. In any case, being burdensome to implement is not the same as being burdensome to use, and this seems like something people might use. I kind of want to keep it, but at the same time, people going off and doing their own thing is what led to all the incompatibilities this standard tries to correct, and I definitely agree in principle to having inter-operable tools. Thoughts?

It makes the difference between having to process the tables just to pick a starting table, and not having to process them at all until operation time. You're requiring that I process and parse all the tables just to be able to provide the options necessary for the user to select the starting table. I don't need to do any of that now. You just pick the file. Table parsing and processing is only needed when the operation commences. That's my objection. It's a fine idea, it just requires utility changes I don't want to make.

It's every bit as much about implementation as it is about end user use. That's especially true for me. I'm setting forth a standard. I'm the only one developing a dumper that will use it (Cartographer will hopefully pick it up when/if Klarth updates TableLib). If I were to not develop TextAngel because I find the standard to be too much of a pain in the ass, then why am I involved in pushing this standard to begin with? And if then there is no dumper that uses the standard, the standard really has no point in existing. You see where I'm going with this? It's got to be something I'm comfortable and motivated to develop a utility for, or it's pointless for me to invest any time on it. I'm trying to make something for everybody, but since I'm the only one doing development of a dumper that will use it, I have to be a little selfishly biased in the standard in order for it to ever see the light of day.

Quote

Right, no safeguarding the user from their own mistakes, just like the policy on newlines. Oh wait (no, I'm not seriously suggesting that ignoring newlines for insertion is a panacea; there are still lots of ways to get into trouble). I guess I don't much care either way, but I think maybe I'll print a warning, just in case.

There's no reason to prevent the user from inserting more or less raw hex data. I wouldn't call that a user mistake; I've done it intentionally several times. As for newlines, they are nothing but whitespace in your script file for insertion; they have no effect on it. Look, I used to think the same thing. I told you I had done a project with newlines being processed as part of the tokens for insertion. It's certainly possible, but totally undesirable. It's just prone to too many problems. When the script was ready for insertion, many files were broken. Remember, the script passes through the hands of several people. As mentioned, different text processors silently made changes (0x0D and 0x0A mangling), extra line breaks were put in by human mistake, extra spaces nobody sees crept in through copy-paste or human error. All sorts of things in real world practice. I will never actually insert newlines again. Those are probably similar reasons Atlas doesn't do it either. I don't know of any utility that does, come to think of it.

Quote

Table switching does appear to make insertion more complex. I don't think I can just tell perl to do it for me anymore (or maybe I just need to ask nicer) . Outputting table switch tokens/inserter commands might work well enough if all you want to do is re-insert the original text, but if that text has been modified, I think you might need to keep the nested setup in order to deal with table switch tokens that come with a non-zero token count, under the assumption that the game is going to expect to find that many tokens in the new table. That's why I'm concerned about linked tokens - it sounds as though a linked token with its parameter bytes counts as one match for dumping, but as multiple matches for inserting?

Klarth previously said "As far as the switch after X characters, I haven't seen (nor can formulate) a case where it would've helped with an English translation. I'd probably have the user create a table entry specifically for switching the table."

I think there's going to be a small number of situations where on a theoretical level, there may be a concern or issue, but real world dictates the case never occurs. We've got to appease Klarth because Atlas support holds major weight in any success this standard may have. Atlas IS probably the closest thing to a standard we have currently.

Quote

New business:

I disagree with the "longest prefix" insertion rule from 2.2.4. With a slightly altered example:

12=Five
13=SixSeven
00=FiveSix

the "longest prefix" rule makes it impossible to insert the string "FiveSixSeven", despite that being valid output from a dumper. A less greedy algorithm would be able to insert "12 13" instead, which seems superior to inserting "00" but losing the "Seven".

You do? I'd love to hear your algorithm, especially if we happen to add a few more table values to the mix:

02=F
03=i
04=v
05=e
06=Fi
07=ve
08=ive

Now, what are you going to insert and how are you going to determine it?

Quote

It's not clear from the wording (2.6), but presumably NumberOfTableMatches here refers to the match count in the old table rather than the new table, i.e. "!hex=tableID,1" is not a no-op.

Correct.

Quote

It's not clear from the wording (2.6), but when counting matches, I think this should be restricted to matches in the new table itself, rather than including matches from any other table(s) the new table might subsequently switch to. Here's an example to illustrate:

I disagree with the "longest prefix" insertion rule from 2.2.4. With a slightly altered example:

12=Five
13=SixSeven
00=FiveSix

the "longest prefix" rule makes it impossible to insert the string "FiveSixSeven", despite that being valid output from a dumper. A less greedy algorithm would be able to insert "12 13" instead, which seems superior to inserting "00" but losing the "Seven".

You do? I'd love to hear your algorithm, especially if we happen to add a few more table values to the mix:

02=F
03=i
04=v
05=e
06=Fi
07=ve
08=ive

Now, what are you going to insert and how are you going to determine it?

The rest I'll have to think about, but this I can answer off the top of my head:

In the absence of table switching, I would model my insertion algorithm (which is currently hypothetical, unlike my dumping algorithm) on perl's regular expression engine. Ignoring all of its optimizations, the engine is generally longest prefix, but adds the concept of backtracking - if it's not possible to finish tokenizing the string based on your current tokenization, go back a little bit and try a different tokenization. So tokenizing "FiveSixSeven" would start by finding 4 possible initial tokens ("00=FiveSix", "02=F", "06=Fi", and "12=Five"). It would then tentatively accept the longest of those, "FiveSix", and try to continue matching the remainder of the string ("Seven"). Since that can't be done, it would backtrack, discarding the previous token ("FiveSix") and trying again with the next longest, "Five". The remainder of the string ("SixSeven") can then be matched ("13=SixSeven"), so hurray, we're done. If it hadn't worked out so nicely, "Fi" and then "F" would have been tried in turn, until eventually it became obvious that the string just can't be matched. In that case, 2.2.2 kicks in and it would start all over again with "iveSixSeven". It's not as fast as a straight-up "longest prefix", but it is more accurate. For our purposes you might want to modify the token preference to be based on hex length rather than text length, but in order to guarantee an optimal encoding, you'll probably have to run every possible tokenization, remember the ones that worked, and then compute each one's hex length.
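A minimal sketch of the backtracking idea just described (illustrative Python, not anyone's actual implementation; the table here is a hypothetical text-to-hex mapping built from the thread's example entries):

```python
# Longest-prefix tokenization with backtracking, modeled on how a regex
# engine retries shorter alternatives when the current choice dead-ends.
# Worst-case runtime is exponential in the input length, as noted later
# in the thread.
def tokenize(text, table):
    """Return a list of table keys covering text, or None if impossible."""
    if not text:
        return []
    # Candidate prefixes, longest first (the order the engine would try them).
    prefixes = sorted((k for k in table if text.startswith(k)),
                      key=len, reverse=True)
    for p in prefixes:
        rest = tokenize(text[len(p):], table)
        if rest is not None:
            return [p] + rest
    return None  # no tokenization exists from this position

table = {"FiveSix": "00", "Five": "12", "SixSeven": "13", "F": "02"}
print(tokenize("FiveSixSeven", table))  # ['Five', 'SixSeven']
```

"FiveSix" is tried first, fails on the leftover "Seven", and the search falls back to "Five" + "SixSeven".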

You've illustrated my point. Now take a poll and see how many people a.) understand how to implement that, b.) have the ability to implement it, and c.) have the desire to implement it. You've just increased complexity by 10x for an otherwise trivial task. Has anybody ever actually written an inserter that behaves like this? Of all the ones I've seen source code for, none have. In all the scripts I've ever inserted with my own utilities, and all the scripts that have ever been inserted with Atlas, it has never been an issue.

If you were writing this standard, I'm pretty sure I'd want no part of it. You like to take the mountain climbing approach for the molehill.

Anybody else, don't be afraid to comment here. If I'm off my rocker, I'd like to know, but I can't see anybody really doing this as evidenced by our past history. I think it'll be out of reach for most and necessary/undesirable for those it isn't out of reach for.

Sure, as long as your table contains separate entries for all the individual characters of your string, longest prefix will work. I ran this example through Atlas, and it does indeed insert three times as many bytes as required - "FiveSix" "S" "e" "v" "e" "n" instead of "Five" "SixSeven". My point here is that the longest prefix insertion algorithm is provably non-optimal, and I do object to the standard imposing a non-optimal algorithm on any utility author wishing to implement the standard, regardless of whether any such author is ready, willing, or able to step forward with a better implementation.

Speaking of which... you said you were the only one developing for this standard. Is it worth mentioning at this point that I believe I've already written a 100% standard compliant dumper? (I've yet to write/run a full test suite, and I'm assuming the standard will be updated based on recent discussion, so no promises.) In addition to almost everything Cartographer can do (it's difficult to determine the full range of Cartographer's capabilities without source code), I also support multi-table files (my own happy misinterpretation of the standard, subject to the additional constraints listed earlier), discontinuous pointer bytes (as suggested by Geiger in the previous thread), arbitrary pointer tree structures (your own pointer table to pointer table to pointer table to string example), optional overlapping string output, and optional string fragment output (my own feature requests from a couple of years ago).

Aye, perhaps I've seen enough molehills turn into mountains that I've gotten used to exploring unlikely possibilities first. There's nothing wrong with being prepared .

I'd say we're in agreement on 99% of the material. Most of the issues I've raised have been more about presentation ("X would make more sense here instead of there", "Y could be stated more clearly", etc.) than content, but where content is at issue, yes, we do appear to have some philosophical differences. You argue for strictness in places I would prefer freedom, I argue for strictness in places you would prefer freedom. It's almost like we're writing programs in two different styles .

Anybody else, don't be afraid to comment here. If I'm off my rocker, I'd like to know

Comments from others would definitely be appreciated. I for one don't think you're off your rocker - your position is based on years of hard-won experience, and I respect it even if I don't always agree with it. I do think the table file standard oversteps its bounds in a few places - if all it does is codify and enforce the behaviour of the currently popular utilities, I don't see it providing much room for improvement. Cartographer and Atlas are extremely useful, but it's not hard to imagine utilities that would be even more useful than either or both of them, and I don't think the table file standard should hinder the emergence of such a utility in any way not directly related to table files.

Sure, as long as your table contains separate entries for all the individual characters of your string, longest prefix will work. I ran this example through Atlas, and it does indeed insert three times as many bytes as required - "FiveSix" "S" "e" "v" "e" "n" instead of "Five" "SixSeven". My point here is that the longest prefix insertion algorithm is provably non-optimal, and I do object to the standard imposing a non-optimal algorithm on any utility author wishing to implement the standard, regardless of whether any such author is ready, willing, or able to step forward with a better implementation.

Ran this one by Klarth. "Bad token selection can occur sometimes, but I'd estimate it very rare for it to be detrimental...unless it's a "gotcha table". The optimal algorithm is simply out of reach for most, and non desirable for the rest of us. Just because it may be more optimal doesn't mean it's desirable or the best choice for the job.

Quote

Speaking of which... you said you were the only one developing for this standard. Is it worth mentioning at this point that I believe I've already written a 100% standard compliant dumper?

It's only worth mentioning if a.) it will end up being standard compliant, b.) it's released to the public, and c.) it's not in Perl. I'm joking about that last part, as Perl syntax makes me ill. However, I do want to raise the point that hardly any Windows users (the statistical majority of end users) have Perl installed.

Quote

I'd say we're in agreement on 99% of the material. Most of the issues I've raised have been more about presentation ("X would make more sense here instead of there", "Y could be stated more clearly", etc.) than content, but where content is at issue, yes, we do appear to have some philosophical differences. You argue for strictness in places I would prefer freedom, I argue for strictness in places you would prefer freedom. It's almost like we're writing programs in two different styles .

Too much freedom for utilities caused the problem to begin with. It's the freedom of utilities that led to not being able to use the same table between various utilities. Not only do we want to be able to use the same table amongst all compliant utilities, we also ideally want to be able to dump and insert interchangeably. My utility and your utility shouldn't give something different (as far as the basic text goes) when dumping or inserting the same script. And whether you insert with my utility or Atlas, you'll get the same hex inserted. They will differ in features, formatting, pointers, and abilities beyond the basic text/hex.

With that said, I'm not saying that can't necessarily still be accomplished with some of your proposed changes, but it is the reason why the standard appears to overstep boundaries in those areas. To steal the words of a colleague, "While the aim is unifying table files, what is being unified is the textual representation of certain dumping and insertion processes and their matching behavior between textual representation and hexadecimal representation of the game script. As such, these definitions are naturally a part of the spec."

Quote

Comments from others would definitely be appreciated. I for one don't think you're off your rocker - your position is based on years of hard-won experience, and I respect it even if I don't always agree with it.

Thanks. I appreciate that. I hope I do not come off as condescending as a result. I do not discredit any of your ideas, I merely argue my position on them. I would like to try and incorporate some more of what's been discussed here. I am in process of making several of the changes mentioned and running some of the hot button items by some of the other guys. It seemed they were a bit scared off by the walls of text here. I will likely get someone to stop by and comment yet.

I will summarize later with changes I made and remaining open business to decide on. I think we're going to try to wrap this whole thing up. It would be nice to reach something you also can agree with, but it looks like there will be a few items of business such as optimal algorithm item above that I (with backing from others) may be steadfast on.

The longest prefix algorithm can fail to be optimal in a variety of ways. It can fail by being too greedy, when choosing a long text sequence at one point forces it to choose many short text sequences at a later point. This can only happen when your table compresses at least 3 characters into one token (i.e. games that use at most DTE are immune to this weakness). It can also fail by assuming that all tokens have equal hex length, an assumption which the table standard explicitly invalidates. What would happen in the "FiveSix" example of 2.2.4 if the "FiveSix" entry were 3 bytes long instead of 1?
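The first failure mode is easy to reproduce. Here's an illustrative Python sketch of the longest-prefix rule running on the thread's example table (the extra single-character entries S, x, and n are assumptions; the Atlas test described earlier evidently had them, since it could fall back to single characters):

```python
# Greedy longest-prefix insertion, as 2.2.4 specifies it.
TOKENS = ["Five", "SixSeven", "FiveSix",
          "F", "i", "v", "e", "Fi", "ve", "ive",
          "S", "x", "n"]  # S, x, n are assumed single-character entries

def greedy(text):
    out, pos = [], 0
    while pos < len(text):
        matches = [t for t in TOKENS if text.startswith(t, pos)]
        if not matches:
            raise ValueError("no token matches at position %d" % pos)
        out.append(max(matches, key=len))  # always take the longest prefix
        pos += len(out[-1])
    return out

# Greedy commits to "FiveSix" and is then forced into short tokens,
# spending 5 tokens (5 bytes at one byte each) where "Five" + "SixSeven"
# would have spent 2.
print(greedy("FiveSixSeven"))  # ['FiveSix', 'S', 'e', 've', 'n']
```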

Sometimes all a project really wants is just 4 more bytes in order to fit its masterpiece script in without having to make ASM changes. In cases like those an optimal insertion algorithm might make a significant difference. I say "an" optimal algorithm, since there are many different algorithms that produce optimally encoded output. Just to be clear, I'm not arguing in favour of any particular insertion algorithm. I'm arguing for the freedom for utility authors to choose their own. The longest prefix + backtracking algorithm I rattled off earlier produces optimal output, but its runtime and memory requirements tend to grow exponentially with the input length, making it prohibitively expensive in practice. After getting bored waiting for a single medium-length string to encode, I ended up abandoning the longest prefix idea altogether and created a different optimal insertion algorithm that runs in roughly linear time and memory instead. It took a couple hours to get everything working right, but the end result is only about 50 lines of code and it chews through a 200KB script in under a second.
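For the curious, one way to get guaranteed-optimal output in roughly linear time is dynamic programming over string positions. This is a generic sketch in that spirit, not the actual 50 lines mentioned above; the table is the thread's example, with hex strings as an assumed representation:

```python
# Optimal-cost tokenization: cost[i] = fewest hex bytes needed to encode
# text[:i]; each position is extended by every table entry that matches there.
def encode(text, table):
    """table maps text sequence -> hex string; returns the cheapest token
    list measured in total hex bytes, or None if no tokenization exists."""
    n = len(text)
    INF = float("inf")
    cost = [INF] * (n + 1)
    back = [None] * (n + 1)  # token ending at position i on a best path
    cost[0] = 0
    for i in range(n):
        if cost[i] == INF:
            continue  # position i is unreachable
        for tok, hx in table.items():
            if text.startswith(tok, i):
                c = cost[i] + len(hx) // 2  # two hex digits = one byte
                if c < cost[i + len(tok)]:
                    cost[i + len(tok)] = c
                    back[i + len(tok)] = tok
    if cost[n] == INF:
        return None
    tokens, i = [], n
    while i > 0:
        tokens.append(back[i])
        i -= len(back[i])
    return list(reversed(tokens))

table = {"FiveSix": "00", "Five": "12", "SixSeven": "13",
         "F": "02", "i": "03", "v": "04", "e": "05",
         "Fi": "06", "ve": "07", "ive": "08"}
print(encode("FiveSixSeven", table))  # ['Five', 'SixSeven']
```

Unlike greedy longest prefix, this never commits early: it considers every entry at every reachable position and keeps only the cheapest path, in O(text length × table size) time.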

While on the topic of insertion algorithms... how are you handling table switching? I've been thinking more about this, and in the general case it presents an even larger can of worms than I was anticipating. I have yet to come up with a reliable method that doesn't involve breaking out old CS textbooks .

Oh, I am wounded . Maybe it's different with .NET, but grubbing through the Win32 API always left me feeling dirty, so I can understand your reaction to perl. I also agree that Windows and perl are not a frequent combination. In theory, installing perl on Windows is analogous to installing the .NET framework, so if people are willing to do one, they might be willing to do the other. I do see threads every now and then where people ask about utilities for Linux or Mac, and many flavours of those OSes come with perl already installed. It is entirely possible that my utilities may end up being about as popular as my translations . I do plan on releasing it publicly (this includes source code, of course), and will at least note any intentional deviations from the standard, assuming I make any. Dumping will likely be 100% compliant, but insertion will likely not be, since I feel guaranteeing encoding optimality is superior to just hoping for it.

Not usually, no. I was tempted to suggest spending a few weeks working on approximation algorithms for NP-complete problems before cracking wise about mountains and molehills, but it's all good fun . One of the nice things about this place (due in large part to your own influence, I think) is that people with divergent viewpoints can have discussions like these in a reasonably mature manner. I think it's been mutually beneficial even if the television audience did fall asleep .

First of all, this post will be pretty long and I would like to apologize to abw if this seems like I jump on his posts only. I originally discussed the table file standard with NC back on his own board and we pretty much figured out a way to do it. I'm also quite late to the party, which is why I cover almost every second point you make here.

While developing the standard, we opted for more simple text checking algorithms, so we decided to give the user the power to do this at the cost of possible mix-ups. NC tried to tone complexity down as much as possible. Even regular expressions are considered a hindrance here, which I will elaborate on further down below. However, I would support and have already offered to design regular expressions for identifying entry types, so we might as well disallow it.

Also, it might be more appropriate to include the section on inserting raw hex in 2.2.4 instead of in 2.2.1.

It seemed logical to include it there, because that way, the dumping and insertion process for hexadecimal literals is completely defined, instead of breaking these two up. 2.2.4 doesn't deal with literals at all right now.

I admit having these commenting operations and dumped and to-be-inserted text in the same file always irked me, and I personally handle it differently, i.e. no line breaks and no content mixing. I felt like this was an Atlas-specific hack, easily remedied by the user taking action after dumping. However, I also felt that, done properly, one could easily address this with regular expression grouping and get user-customizable behavior. However, regular expressions are considered "one step too far" right now.

When inserting, should control codes be ignored for all tokens, or just end tokens requiring hex representation?

The only control codes that are currently implemented ― again in a fashion that, for simplicity, makes it impossible to have a literal "\n" as opposed to the control code '\n' ― are line breaks, which as per 2.3 are to be ignored by insertion tools:

These codes are used by dumpers only and will be ignored by inserters.

An additional burden here is the different line-end control codes used by different OSes: we might have 0x0D 0x0A, or 0x0A, or 0x0D alone. This also favors completely ignoring line ends, because it cannot be assured that some text editing program doesn't silently convert from the dumping standard to the OS standard, in which case the insertion tool would not find the proper bytes in the text file.

On the other hand, "OS-independent" ReadLine functions do exist and will, at worst, read two lines instead of one for 0x0D 0x0A. Therefore, by ignoring the number of line breaks and empty lines, we actually gain a little bit of stability here.
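An inserter following this advice might normalize line endings on read, along these lines (an illustrative sketch, not part of the standard):

```python
# Treat 0x0D 0x0A, lone 0x0A, and lone 0x0D identically, and drop blank
# lines entirely, so that editor round-trips on any OS cannot change what
# the inserter sees.
def script_lines(raw_bytes):
    text = raw_bytes.decode("utf-8")
    # Normalize all three line-ending conventions to '\n' ...
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # ... then ignore empty lines, since inserters skip them anyway.
    return [line for line in text.split("\n") if line != ""]

dos  = b"text<end>\r\n\r\n//"
unix = b"text<end>\n\n//"
mac  = b"text<end>\r\r//"
assert script_lines(dos) == script_lines(unix) == script_lines(mac) \
    == ["text<end>", "//"]
```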

This should currently match [^,]*, i.e. any string that does not contain a comma. I would be willing to settle for [0-9A-Za-z]* in the light of not wanting to deal with Unicode confusables or different canonical decompositions of accented letters etc.

As for uniqueness of labels, I have to admit I was silently going for uniqueness in each type, but this might have to be discussed again.

Eh? Oh. Hmm. Somehow I was under the impression that the standard supported multiple logical tables within the same physical file. At the very least, it isn't explicitly forbidden, and I see no difficulty in supporting it (having already done so myself) if we impose two more constraints:

1) a table's ID line must precede any entries for that table
2) if a file contains multiple tables, every table in that file after the first table must have an ID line

(In fact I require an ID for every table used, but I don't feel that condition should be imposed as part of the standard.)

I think this did not occur to anybody, simply because one file per table and one table per file is the way it has always been. I feel we should leave it that way and be specific about it.

Under the "wishing for the moon" category: It would be nice if we could define linked entries in such a way that we could obtain output with both pre- and post-fix[...]

[W]e lack a general escape character, you couldn't determine which commas were field delimiters and which were data :(. It might also be nice to be able to specify how raw bytes were output in case you don't want the <$XX> format.

So if instead of "A<Color><$A4><$34>a" we only have "A<Color><$A4>a", you'd want to insert "A0 E0 A4 21"? How does this interact with table switching? Continuing my table switching example, what does the insertion process look like when instead of "<Color><$25><$25>C1", we're trying to insert "<Color><$25>C1" starting with table1?

I couldn't find your Color example. Nevertheless, parsing of this would become a search for a normal entry, because a linked entry cannot be found. If a normal entry exists, the whole thing is obviously not ambiguous and insertion will progress as one would expect. On the other hand, when no normal entry can be found for the label that was inadvertently misplaced, the insertion tool will not know what to do and should produce an error.

Ideally, if I dump a bunch of hex to text and then reinsert that text unaltered, my basic expectation is to get the original hex again.

This expectation is a false premise, really. You open an image file saved with one program and save a copy in another program. You will likely find that the compression changed, or that the picture was silently up-converted from 16-bit RGB to 32-bit RGB, etc. What matters in the end is that the final result is equivalent. Now in the case of text, this means the following:

You dump text and insert it back. It should display as the same thing when the game is running. When it does not ― barring edge cases we explicitly forbade, such as <$xx> ― it just means your table file matches two entries to the same text which is really not the same text. A common example would be having two different text engines that happen to share parts of their encoding (or were even built from one encoding!). In such cases, entry 0x00 might be displayed as "string" in engine A and entry 0x80 might be displayed as "test" in engine B. However, as long as one engine fails to render the other's code, the strings must not be considered the same to begin with, because they are not. In such cases, dumping text per engine might be an option, or marking entries with custom tags, etc.

I disagree with the "longest prefix" insertion rule from 2.2.4. With a slightly altered example:

12=Five
13=SixSeven
00=FiveSix

the "longest prefix" rule makes it impossible to insert the string "FiveSixSeven", despite that being valid output from a dumper. A less greedy algorithm would be able to insert "12 13" instead, which seems superior to inserting "00" but losing the "Seven".

You do? I'd love to hear your algorithm, especially if we happen to add a few more table values to the mix:

02=F
03=i
04=v
05=e
06=Fi
07=ve
08=ive

Now, what are you going to insert and how are you going to determine it? ;)

That is a basic depth-first tree search, so it's not overly complicated to implement. However, as we all know, complexity is O(b^n) where n is the maximum number of levels and b is the average branching factor. Another way to think of this is a transducer, which will naturally only finish on valid paths. The only criterion is that we need to find a shortest path, not enumerate all shortest paths.

Basically, an A* search with cost in bytes and a basic heuristic counting bytes per letter will do. This could also be expanded for the target language and occurrence inside the script, to accommodate the simple fact that lots of table entries doesn't mean all of them get used with the same probability. However, basic A* with cost in bytes and even a heuristic of zero will work out the shortest path directly. Since the heuristic must not overestimate the byte count, and already one table entry mapping 1 byte to more than 1 letter means that bytes per letter < 1 for that entry, we are really dealing with a (0,1) range of possible bytes-per-letter values here, so even an ideal heuristic for the source file will have little impact on finding the right way. The only nitpick is that normal entries like 3 bytes = 1 letter may exist, in which case bytes per letter would be > 1 on average (when entry probability is not calculated). However, since a heuristic giving 0.1 bytes per letter will still be admissible, because it doesn't overestimate, the (first) shortest path will still be found.

Cost is 1.2 at the start with h(x) = 0.1 * x, where x is the number of letters left in the buffer. A simple mean for bytes per letter would be 11/30 (30 letters per 11 bytes), so we're somewhat far from optimal, and we will see some jumping around towards the end because of this.

We expand the first node (red) and find that "F" and "Fi" can be used. "F"'s path cost is d(x) = 1 byte, and we estimate another h(x) = 1.1 bytes left to go => 2.1 total. "Fi" only costs f(x) = 2.0 by this rationale [10 letters to go at 0.1, plus 1 byte already used].

(Assume we haven't stopped, black.) Expand any other node and find that its cost (because d(x) >= 3 at this point) is greater than 3.0.

This goes on until all 2.x nodes and one 3.0 node (depending on implementation) have been expanded once.

Expand "ixSeven" (green) and notice it's our goal.

Before you ask, NC, this can be programmed with a simple stack that reorders node IDs as they are pushed back, or a list that is re-sorted after every insertion of nodes ― a poor man's priority queue.

I hope to have demonstrated that this is neither a laughable nor an impossible claim or problem. However, having an admissible heuristic here is key, and a simple (unweighted) mean will most likely not do because of outliers; one would need the median or mode, or some analysis of the input first. The worst case is h(x) = 0 for all x, which degenerates into uniform-cost (breadth-first-style) search; an inflated heuristic like h(x) = x * [length of the longest hex sequence in the table] would instead behave like a greedy depth-first search, but is no longer admissible.
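For concreteness, the search traced above might be sketched like this (toy Python, not a reference implementation; the table and the 0.1 bytes-per-letter heuristic are the ones from the example, and h must never overestimate the bytes still needed for the result to be optimal):

```python
import heapq

table = {"Five": 1, "SixSeven": 1, "FiveSix": 1,   # token -> hex length in bytes
         "F": 1, "i": 1, "v": 1, "e": 1, "Fi": 1, "ve": 1, "ive": 1}

def astar(text, h=lambda letters_left: 0.1 * letters_left):
    # Priority queue of (f = g + h, g = bytes spent, position, tokens so far).
    frontier = [(h(len(text)), 0, 0, [])]
    while frontier:
        f, g, pos, toks = heapq.heappop(frontier)
        if pos == len(text):
            return toks  # with an admissible h, the first goal popped is cheapest
        for tok, size in table.items():
            if text.startswith(tok, pos):
                g2 = g + size
                nxt = pos + len(tok)
                heapq.heappush(frontier,
                               (g2 + h(len(text) - nxt), g2, nxt, toks + [tok]))
    return None

print(astar("FiveSixSeven"))  # ['Five', 'SixSeven']
```

The "FiveSix" branch looks cheap (f = 1.5) and gets expanded early, but it dead-ends on "Seven", so the goal via "Five" + "SixSeven" (g = 2 bytes) is the first goal state popped.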

In the absence of table switching, I would model my insertion algorithm [...] on perl's regular expression engine. [...] the engine is generally longest prefix, but adds the concept of backtracking - if it's not possible to complete tokenizing the string based on your current tokenization, go back a little bit and try a different tokenization.

This is basically A* over tokens with just text length, like all regex engines do for a greedy star. However, I'm currently not aware of how this would work with the added impetus of having multiple bytes for some tokens, thus changing the cost from the same cost for all tokenizations to a cost per tokenized token. Having said that, a way to implement this via general-purpose regex engines would probably be more accessible in more programming languages.

You've just increased complexity by 10x for an otherwise trivial task.

Neither tokenization nor optimal search is a trivial task. Indeed, even defining the search problem itself mathematically is not trivial. That's why so much brain power (and money) went into things like SQL queries and the like.

I'm not sure what you're thinking of as an inserter in this case, but pretty much every compression that does not use sliding window techniques will have to use a backtracking algorithm to be optimal alongside a good heuristic. So yes, people have written inserters that insert input data into an output file while contemplating size and combination of mappings. It might just have been for binary data or the like.

Ran this one by Klarth. "Bad token selection can occur sometimes, but I'd estimate it very rare for it to be detrimental...unless it's a "gotcha table". The optimal algorithm is simply out of reach for most, and non desirable for the rest of us. Just because it may be more optimal doesn't mean it's desirable or the best choice for the job.

Not sure where the quote ends, so I assume it's all Klarth's. If so, I'd like to see him defend the POV that an optimal algorithm is undesirable, because I cannot think of a single argument except burden of implementation, which he ruled out. Once a provably optimal algorithm is used, why is it not desirable? Speed-wise we can brute-force some 100 kB in a few seconds, so speed doesn't seem to be the issue here, does it?

After getting bored waiting for a single medium-length string to encode, I ended up abandoning the longest prefix idea altogether and created a different optimal insertion algorithm that runs in roughly linear time and memory instead. It took a couple hours to get everything working right, but the end result is only about 50 lines of code and it chews through a 200KB script in under a second.

While on the topic of insertion algorithms... how are you handling table switching? I've been thinking more about this, and in the general case it presents an even larger can of worms than I was anticipating.

Since I'm the guy that introduced this, I can only say that ― yes ― it does pose a problem. However, my naïve solution was to use ordered data sets and do a longest-prefix over all tables. You can still do this with A*, where path cost is adapted to table switching as well, i.e. the path to an entry with switched table from another table is the cost in bytes in the new table plus the cost in bytes from the switching code in the old table. This of course, needs just some bookkeeping to know for each explored node in which table it belongs and how many matches the table can have before switching back.
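That bookkeeping can be sketched as a uniform-cost (Dijkstra) search whose states carry the active table, the remaining match count, and the table to fall back to. The tables, switch token, and costs below are invented purely for illustration:

```python
import heapq

tables = {"t1": {"A": 1, "B": 1},          # token -> hex length in bytes
          "t2": {"ABC": 1, "C": 1}}
# (from_table, to_table) -> (switch token's byte cost, forced match count)
switches = {("t1", "t2"): (1, 1)}          # e.g. a "!t2,1" entry in t1

def cheapest(text, start="t1"):
    # State: (bytes spent, position, active table, matches left, fallback, tokens).
    pq = [(0, 0, start, 0, start, [])]
    best = {}
    while pq:
        g, pos, tab, left, fb, toks = heapq.heappop(pq)
        if pos == len(text):
            return g, toks
        key = (pos, tab, left, fb)
        if best.get(key, float("inf")) <= g:
            continue
        best[key] = g
        for tok, size in tables[tab].items():
            if text.startswith(tok, pos):
                if left == 1:                  # counted switch exhausted: fall back
                    state = (pos + len(tok), fb, 0, fb)
                elif left > 1:
                    state = (pos + len(tok), tab, left - 1, fb)
                else:                          # 0 = stay until explicitly switched
                    state = (pos + len(tok), tab, 0, fb)
                heapq.heappush(pq, (g + size, *state, toks + [(tab, tok)]))
        for (frm, to), (size, count) in switches.items():
            if frm == tab:                     # spend bytes to switch tables
                heapq.heappush(pq, (g + size, pos, to, count, tab,
                                    toks + [(tab, "!" + to)]))
    return None

print(cheapest("ABC"))  # (2, [('t1', '!t2'), ('t2', 'ABC')])
```

Here spelling "ABC" out in t1 and then switching for the final "C" would cost 4 bytes, but paying 1 byte to switch up front and using t2's "ABC" entry costs only 2, and the search finds that automatically.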

Since I've thought about it for some time now, I would actually like to introduce the notion of leveled compliance. Basically, one could incorporate into the standard more than one notion of insertion, whereby tools can then specify which version(s) they comply with. This way, tool x can comply with a) longest prefix, b) the proposed A* method (or maybe just a depth-first search, etc.), and c) its own method. IMHO, this would also preclude utility authors inventing their own insertion technique but not indicating it anywhere. Table files themselves would be required to stay the way they are for compatibility with other tools once the final draft is done. Versioning will likely happen anyway once some <glaring oversight that really should have been caught and everybody feels miserable about once the standard is out> has been identified and rectified in a subsequent version of the standard.

When it comes to longest match vs hex-saving, there are a few things to consider. The speed decrease for hex-saving is negligible. The space saved is equally negligible. The implementation time (and ease) is strongly in favor of longest match. The ease of understanding to the end user is probably in favor of longest match.

You can formulate scenarios where erroneous or non-optimal insertions could happen with both algorithms. This is where best practices/experience can come in. Such as representing a no-default-name renameable character's name with [Hero] rather than just plain Hero. So in case a townsperson talks about a "Legendary Hero", the game won't print the character's name.

On the space savings part. The FiveSix example is rooted in substrings and how to insert them. There isn't an optimal substring analysis tool that I'm aware of geared towards romhacking. ScriptCrunch computes a fresh longest substring analysis for each entry (which is also pretty slow thanks to my poor implementation). So the savings will be pretty negligible in a large script unless you guys write both "optimal" inserter and "optimal" substring analysis tools. If you implement those, you might bump a greedy 30% substring compression up to 31%-33%. Maybe.

Longest match has been a pretty solid algorithm over the years and it'll take a bigger argument than the small potential space saving (think reducing real world erroneous insertion) for me to implement it instead of features that can help insert scripts that were previously impossible. But I'd be fine with the standard saying that either algorithm is acceptable.

On a more ideological side: it's real world vs text book. If you're writing your thesis, you can spend days designing and implementing an insertion algorithm if you want. In the real world, at some point it becomes time to move on and implement real features so a script can actually get inserted. Having that kind of practicality is central to Atlas: wider variety of scripts inserted, lowering the learning curve, and ways to limit Atlas command clutter. As Patton said, "A good plan violently executed now is better than a perfect plan executed next week."

First of all, this post will be pretty long and I would like to apologize to abw if this seems like I jump on his posts only. I originally discussed the table file standard with NC back on his own board and we pretty much figured out a way to do it. I'm also quite late to the party, which is why I cover almost every second point you make here.

No problem. This thread must be in the running for the prestigious "Most Words Per Post" award anyway, so why stop now? Anyway, you end up agreeing with me on most of the points I care about.

@Nightcrawler: As an aside, I didn't receive any email notifications about Tauwasser's posts. Does this have something to do with the thread title changing? For reference, I did get email when this topic was split off from Program Design and Data Structure (TextAngel), and I also received notification of Klarth's post under the original thread title.

except for post-processing [for table switching] of overlapping strings with bytes interpreted by different tables, which is still a mess

Can you elaborate on this one?

Certainly. The desired post-processing I referred to is described in more detail here. As for the bit about different tables: given a byte range x, it is conceivable that a game could read x once using table A and then read x (or a subset of x) again using table B, producing different text in each case. I'm not sure whether that would qualify as genius or insanity. In either case, adding a check to ensure the tokenizations of overlapping strings agree solves this issue and the multibyte token overlap alignment issue, both of which I believe to be unlikely to occur in a commercially released product.

Also, it might be more appropriate to include the section on inserting raw hex in 2.2.4 instead of in 2.2.1.

It seemed logical to include it there, because that way, the dumping and insertion process for hexadecimal literals is completely defined, instead of breaking these two up. 2.2.4 doesn't deal with literals at all right now.

On the other hand, 2.2.1 specifically refers to dumping ("No-Entry-Found Behavior for Dumping"). The treatment of hexadecimal literals applies to multiple sections, so it might be cleaner to split it off into a separate section.

This should currently match [^,]*, i.e. any string that does not contain a comma. I would be willing to settle for [0-9A-Za-z]* in light of not wanting to deal with Unicode confusables or different canonical decompositions of accented letters etc.

I'm pretty sure we're not talking about the same thing here. That would group end tokens' "endtoken" in with linked entry's "label", table id's "TableIDString", and table switch entry's "TableID", all of which may contain any character except comma. I understand the comma restriction for entries using comma as their internal delimiter (i.e. linked and table switch entries) and how that restriction follows through to other entities referenced by those entries (i.e. @TableIDString lines), but I'm not sure what we gain by applying the same restriction to end tokens. If that actually is the intent, then at the very least

Any combination of formatting control codes and text representation may be used. This allows for nearly all variation of string ends.

in 2.4 should be amended. Restricting "endtoken" to [0-9A-Za-z]* would disallow end token text sequences like "<end>", which seems undesirable. That does bring up another point I keep forgetting to mention: linked entry labels and table ID strings (whether in @ or ! form) should probably not be allowed to be the empty string.
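For illustration, a quick check (using Python's re module; the two patterns are the ones under discussion) of what each proposed restriction would admit:

```python
import re

# [^,]* admits any comma-free string; [0-9A-Za-z]* only bare alphanumerics.
loose = re.compile(r"[^,]*\Z")
strict = re.compile(r"[0-9A-Za-z]*\Z")

assert loose.match("<end>")       # "<end>" is a legal end token under [^,]*
assert not strict.match("<end>")  # ...but not under [0-9A-Za-z]*
assert strict.match("END")        # bare alphanumerics pass both
assert not loose.match("a,b")     # commas are rejected by both
assert loose.match("") and strict.match("")  # both admit the empty string,
                                             # hence the non-empty rule above
```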

This is where best practices/experience can come in. Such as representing a no-default-name renameable character's name with [Hero] rather than just plain Hero. So in case a townsperson talks about a "Legendary Hero", the game won't print the character's name.

A section in the standard about best practices would probably be useful. I remember making that very mistake when I first started.

I think the intent of the uniqueness condition is to prevent the creation of tables which encourage insertion errors. What do people think about taking this a step further by providing some mechanism for denoting table entries as generic in-game control codes and enforcing uniqueness of text sequences based on that?

As one possible approach, we could introduce another entry prefix character such as # (or even hijack the currently pointless $hex=label,0 construct) to identify in-game control codes, require that the first and last characters of the text sequence (or label) do not occur in the text sequences of any normal entry or as anything other than the first or last character of any non-normal entry, then check for uniqueness of all non-normal entries (probably across table boundaries?). This would draw a clear distinction between text which is a candidate for table compression and text which is not. Under such a scheme, an entry like #hex=[Hero] would guarantee that the string [Hero] would never be parsed as any other token during insertion while still allowing personal preference for other styles ("<Hero>", "~Hero~", etc.) and flexibility in cases where [ or ] appear as normal entries.
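A rough sketch of that validation (the # prefix, the bracket rule, and the sample entries here are all hypothetical, being only the proposal above):

```python
# Hypothetical table split into normal entries and #-prefixed control codes.
normal = {"00": "A", "01": "B", "02": "ero"}
control = {"10": "[Hero]", "11": "[Line]"}

def validate(normal, control):
    # control code text sequences must be unique
    assert len(set(control.values())) == len(control), "duplicate control code text"
    for text in control.values():
        first, last = text[0], text[-1]
        for seq in normal.values():
            # delimiter characters may not occur in any normal entry's text
            assert first not in seq and last not in seq, "delimiter collision"
        for other in control.values():
            # ...nor anywhere but the edges of another control code
            assert first not in other[1:-1] and last not in other[1:-1]

validate(normal, control)  # passes: [ and ] appear only as delimiters
```

Under this check, [Hero] can never be produced by any combination of normal entries during insertion.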

I think this did not occur to anybody, simply because one file per table and one table per file is the way it has always been. I feel we should leave it that way and be specific about it.

In the interests of cross-utility compatibility, I definitely agree we should be specific about it. Allowing multiple tables per file has some organizational advantages for the end user, but if it's going to cause problems for other utilities, the costs may outweigh the benefits.

Linked Entries

Attempting to insert a linked entry which lacks its corresponding <$XX> in the insert script should generate an error under 4.2, right?

Exactly. It would basically be a sequence that cannot be inserted according to general insertion rules. Like for instance "福" cannot be inserted when it is not defined in any table. [...]

I couldn't find your Color example. [...]

The <Color> example comes from 2.5, and your description of the parsing process is also how I thought it should occur. The alternative seems like an excellent recipe for mangling the rest of the insertion script, particularly so in the presence of multibyte entries. Nightcrawler disagrees with us, alas.

Ideally, if I dump a bunch of hex to text and then reinsert that text unaltered, my basic expectation is to get the original hex again.

This expectation is a false premise, really.

As a hard and fast rule, yes, it is. I did make allowances for differences in compression and equivalent text sequences. However, the example as I gave it results in a loss of textual integrity - the original text "<$01>=this is a '<$01>' string" becomes "==this is a '=' string" after being dumped and re-inserted without modification, which is why I then argued for disallowing <$XX> sequences in table entries. Once that change makes its way into the standard, my example will become invalid and order will be restored to the universe.

I'll quickly address why linked entries are necessary. Say you have an entry, $80, which has one parameter byte and represents how many options there are in a dialogue user choice box. You can do $80=<selectoption>,1 or you can do the following:

8000=<selectoption><$00>
8001=<selectoption><$01>
...
80FF=<selectoption><$FF>

Manual entry allows you to clean up the format, but linked entries are a way to expedite this tedious work when meaning is not necessary. Speaker portraits are sometimes necessary to define.

The next point is on verification. I don't verify linked entry parameter bytes during insertion in my TableLib for simplicity and that was a minor oversight. Most linked entries (usually outside of text) will be transitioned well. But there's a large margin for error when it comes to linked entries inside of dialogue that will be translated/edited and that needs to be mitigated.

So there are two ways to do so: one is to keep the current $XX=<token>,Y and read Y hex-formatted bytes afterwards. The second, a bit more complex, is to implement a format string like $XX=<token %2X>, which can clean things up a bit (a single control code and its parameters will be clumped together inside angle brackets in this case). I won't advocate for either way yet, and that string format is just an example. You could do $XX=<token>,%2X,%2X to print <token $01 $E0> for example.
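A sketch of how the second option might work when dumping (the %2X syntax is just the example format above; the substitution logic and entry are invented):

```python
# Hypothetical linked entry: $80=<selectoption %2X>
FORMAT = "<selectoption %2X>"

def dump_linked(data, pos, fmt=FORMAT):
    """Consume the control byte plus one parameter byte per %2X in fmt."""
    nparams = fmt.count("%2X")
    params = data[pos + 1 : pos + 1 + nparams]
    text = fmt
    for p in params:                   # substitute parameters left to right
        text = text.replace("%2X", "$%02X" % p, 1)
    return text, pos + 1 + nparams     # new position after control code + params

text, pos = dump_linked(bytes([0x80, 0x03]), 0)
assert text == "<selectoption $03>" and pos == 2
```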

The last point is that linked entry is a terrible term. I should've made up something sensible instead of continuing the name from the Thingy readme.

The actual calculation is specific to the input table and string, so I said "roughly" linear time/memory to gloss over the details. The idea of the algorithm is to find every valid token for each position in the string (starting from the end of the string and working backwards), and for each valid token, determine the minimum forward cost that must be incurred by any complete tokenization with a token starting at the current position. After we've done all that, starting at the beginning of the string, we simply take the minimum cost token for the current position, move to the string position following the chosen token, and repeat until the end of the string is reached.

For each position, then, we'll need to remember the minimum cost and the token which produces the minimum cost. Memory requirements are thus limited to the size of the table and 3 * n, where n is the (character) length of the string, plus a small, constant amount of storage space (for e.g. loop index variables).

Run time is dominated by the sum (for x = 0 to n) of v(x), where v(x) is the cost of finding all valid tokens at position (x). In the worst case (e.g. every token matches at every position [except maybe near the end of the string]), v(x) is bounded by m, where m is the length of the longest text sequence in the table, giving us a runtime of O(m*n) (I'm using a pre-constructed trie for O(m) valid token lookup). In practice, however, large values for v(x) are rare (in the particular case of DTE, m and thus v(x) are at most 2), so the cost of finding valid tokens becomes negligible in comparison to the string length.

Running your example under my algorithm (elided steps are left as an exercise for the reader):

Then comes the fun part: starting with (0), we select 01=F, move to (1), select 08=iveS, move to (5), select 09=ixSeven, move to (12), and we're done! Since we disqualify impossible tokenizations while calculating each position's token/cost, we know the string cannot be tokenized if we move to a position with no remembered token.
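For reference, a minimal sketch of the two passes just described (assumptions: every token costs one byte, the table contains only the three tokens named in the walkthrough plus invented single-character filler entries, and a plain dict scan stands in for the trie):

```python
def tokenize(s, table):
    """Backward pass: minimum forward cost per position; forward pass: follow it."""
    n = len(s)
    INF = float("inf")
    cost = [INF] * (n + 1)   # minimum hex bytes needed from position x to the end
    best = [None] * (n + 1)  # token achieving that minimum at position x
    cost[n] = 0
    for x in range(n - 1, -1, -1):
        for tok, nbytes in table.items():  # a trie would make this lookup O(m)
            if s.startswith(tok, x) and nbytes + cost[x + len(tok)] < cost[x]:
                cost[x] = nbytes + cost[x + len(tok)]
                best[x] = tok
    if best[0] is None:
        return None  # no complete tokenization exists
    out, x = [], 0
    while x < n:  # forward pass: take the minimum-cost token and jump past it
        out.append(best[x])
        x += len(best[x])
    return out

# Hypothetical table: 01=F, 08=iveS, 09=ixSeven from the walkthrough,
# plus invented one-byte entries for the remaining letters.
table = {"iveS": 1, "ixSeven": 1}
table.update({ch: 1 for ch in "FiveSxn"})
assert tokenize("FiveSixSeven", table) == ["F", "iveS", "ixSeven"]
```

Since impossible positions keep an infinite cost, a token whose continuation is untokenizable is never selected, which is the disqualification step mentioned above.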

Other benefits:

This algorithm can be easily adapted to give precedence to raw byte sequences <$XX> by undefining the token/cost for the positions covered by <$XX>.

The trie can easily include tokens from all tables accessible from the starting table, which in a multiple table context gives us access to every possible tokenization of a string (across table boundaries) without having to list them all (listing is of course O(2^n)).

The algorithm can be split into multiple passes (finding tokens on the first pass, calculating costs on the second pass), which in a multiple table context gives us a chance to do any extra validity checks that may be required before cost analysis is begun, using full knowledge of all possible tokenizations.

While on the topic of insertion algorithms... how are you handling table switching? I've been thinking more about this, and in the general case it presents an even larger can of worms than I was anticipating.

Since I'm the guy that introduced this, I can only say that ― yes ― it does pose a problem. However, my naïve solution was to use ordered data sets and do a longest-prefix over all tables. You can still do this with A*, where path cost is adapted to table switching as well, i.e. the path to an entry with switched table from another table is the cost in bytes in the new table plus the cost in bytes from the switching code in the old table. This of course, needs just some bookkeeping to know for each explored node in which table it belongs and how many matches the table can have before switching back.

Since I thought about it some time now, I would actually like to introduce the notion of leveled compliance. Basically, one could incorporate into the standard more than one notion of insertion, and tools can then specify which version(s) they comply with. This way, tool x can comply with a) longest prefix, b) the proposed A* method (or maybe just a depth first etc.) and c) its own method. IMHO, this would also preclude utility authors inventing their own insertion technique but not indicating it anywhere. Table files themselves would be required to stay the way they are for compatibility with other tools once the final draft is done. Versioning will likely happen anyway once some <glaring oversight that really should have been caught and everybody feels miserable about once the standard is out> has been identified and rectified in a subsequent version of the standard.

cYa,

Tauwasser

I have more to say about this (and Klarth's post), but it is now past my bedtime.

I hope to have demonstrated that this is neither a laughable nor impossible claim or problem.

I cannot think of a single argument except burden of implementation, which he ruled out. Once a provably optimal algorithm is used, why is it not desirable? Speed-wise we can do brute force of some 100 kB in a few seconds, so speed doesn't seem to be the issue here, does it?

I think you got the wrong inference. I wasn't implying it was laughable or impossible. I was implying the resulting solution is magnitudes more complicated than a longest match algorithm. Burden of implementation is a very big part of this standard and is argument enough. This is especially true now with the detailed algorithms described in the past few posts. Before, I was arguing on behalf of others that needed simplicity, but now I can simply state that *I* would have difficulty successfully implementing these algorithms. You guys are computer science guys; many of the rest of us are not. I can't support a standard I can't implement, thus it's undesirable. I had to read several sources on A* pathfinding algorithms just to understand what was posted, let alone implement it in my program.

With that said, I do entertain the idea of freedom of implementation as an alternative. Just because I may be inadequate, I see the merit in not explicitly disallowing you to do it better, especially for a more optimal output. I have two hesitations.

1.) Ideally, I wanted Utility X and Utility Y to translate the basic text to the same hex output. I understand that we're less concerned with that and more concerned with it resulting in the same text output in the video game. This is in line with the examples you gave with data compression or images. It would just make life easier for testing, comparisons, and interchangeability if they did do it the same. It would keep things simpler, and I like simple.

2.) If freedom of insertion algorithm is given, I'm not sure how to give a satisfying answer to '2.2.4 Text Collisions'. It seems desirable and logical that a simple, straightforward answer should be given on what to do with text collision situations. We do that now, but if it then becomes an algorithm free-for-all, it's not clearly defined. It seems counter-intuitive to the standard not to standardize what to do in that case.

Quote

I'm not sure what you're thinking of as an inserter in this case, but pretty much every compression that does not use sliding window techniques will have to use a backtracking algorithm to be optimal alongside a good heuristic. So yes, people have written inserters that insert input data into an output file while contemplating size and combination of mappings. It might just have been for binary data or the like.

I'm talking about ROM Hacking utilities that can be used for script insertion. To my knowledge there are no public utilities available that use such an algorithm. I'd take an educated guess and say that's due to the unnecessary complexity. That illustrates my point of putting the standard in the realm of nobody using it because it's just too burdensome and/or complicated to implement.

Quote

As for uniqueness of labels, I have to admit I was silently going for uniqueness in each type, but this might have to be discussed again.

Is there any reason why all types can't simply follow the rules of 2.2.6?

The next point is on verification. I don't verify linked entry parameter bytes during insertion in my TableLib for simplicity and that was a minor oversight. Most linked entries (usually outside of text) will be transitioned well. But there's a large margin for error when it comes to linked entries inside of dialogue that will be translated/edited and that needs to be mitigated.

You just exemplified that verifying is not necessary for insertion. In text form, it's already in the format of a normal token followed by raw hex bytes. Doing any extra verifying is a utility issue in my opinion, and is not required for translating text-to-hex. It *IS* required for proper hex-to-text conversion, which is why it is included there for dumping.

Additionally, as I mentioned previously, there have been times where I edited the number of raw hex parameters or the linked entry itself in game, and wouldn't want the original behavior any more. Although I suppose, I could always edit my table to reflect this, but it was previously unnecessary for insertion. We also seem to blur the line here in our discussions being able to dump and insert with the same table and expecting changes would be made for the insertion table. Originally, I had always intended a different table to be used for insertion and many issues would be alleviated, simplified, or not have to be dealt with in the standard. Historically, I have always used a different table for insertion. Others pushed to ensure you can dump and insert with the exact same table, which is fine, but adds a number of nuances to deal with that we otherwise probably wouldn't have.

Additionally, as I mentioned previously, there have been times where I edited the number of raw hex parameters or the linked entry itself in game, and wouldn't want the original behavior any more. Although I suppose, I could always edit my table to reflect this, but it was previously unnecessary for insertion. We also seem to blur the line here in our discussions being able to dump and insert with the same table and expecting changes would be made for the insertion table. Originally, I had always intended a different table to be used for insertion and many issues would be alleviated, simplified, or not have to be dealt with in the standard. Historically, I have always used a different table for insertion. Others pushed to ensure you can dump and insert with the exact same table, which is fine, but adds a number of nuances to deal with that we otherwise probably wouldn't have.

Well, there are two cases that I can think of when modifying linked values in a script. If you can edit the number of raw hex parameters, it's either a null-terminated parameter list or you've modified the game code to account for this. In the case of control codes that use null-terminated lists of parameters (I don't know if these exist, but it's plausible), linked values can't tackle this problem because of their defined bytes-after value. In the second case of modifying game code, you've changed the underlying representation of the text engine, which means a new table for insertion, just as you would make when transitioning a game through DTE changes, overwriting the Japanese font, etc. In a global control code modification like this, it might be nice to have validation to ensure a correct transition. And if you want to add real validation, the hex values need to be within the linked value tag, i.e. <color $FF>. Otherwise you lose the context for validation when such tags are in clumps of poorly defined hex output.

You're correct in saying that this validation is not necessary for insertion. But it still has some value to consider.

Lastly, on A*: I read about the algorithm almost 10 years ago for pathfinding in video games. I honestly have no desire to implement an A* insertion algorithm, for reasons I discussed two posts ago in longest match vs hex-saving. I don't yet see a clearly superior algorithm in A* that makes me want to transition.

When it comes to longest match vs hex-saving, there are a few things to consider....

I agree with almost everything here. In particular, if a utility author wants to spend more time making the previously impossible possible and less time squeezing a little bit extra out of things that already have reasonable solutions, then that's what they should do and I wish them well.

I had assumed ScriptCrunch produced optimal results (the Algorithm Overview section makes it sound like it should), but upon closer examination, something odd is definitely going on - Dict with DictEntrySize=1, DictMinString=2, DictMaxString=2 produces different results than DTE, and some of its numbers look off (e.g. I feed it files totalling 82K of script [including comments and Atlas commands, which my .ini file excludes] and it says there are 211533 ScriptBytes). Interestingly, the resulting Dict table produced an encoding 4% smaller than that of the DTE table, according to Atlas.

With that said, I do entertain the idea of freedom of implementation as an alternative. Just because I may be inadequate, I see the merit in not explicitly disallowing you to do it better, especially for a more optimal output. I have two hesitations.

1) This would have to be sacrificed, and the resulting complications for testing are lamentable. All other things being equal, I agree that simplicity is to be preferred, but in this case all other things are not equal.

2) The longest text sequence rule for insertion nicely mirrors the longest hex sequence rule for dumping, so you also lose some parallelism. In the interests of accessibility, I'd definitely keep longest prefix as a suggested algorithm, since it is easy to explain and understand, but make a note that other algorithms are possible, and that the same text should be output by the video game no matter which algorithm is used... except that that doesn't cover cases where longest prefix fails to insert otherwise insertable text. On that note, the "ignore and continue" rule of 2.2.2 further complicates the description of 2.2.4 when different algorithms are allowed. What's the reasoning behind "ignore and continue"? I think I prefer Atlas' "give up and die" approach.

I have a preference for validation, but it's not a strong enough preference for me to be willing to say that validation should be enforced. At least, not yet. Until somebody comes up with a workable algorithm for multi-table insertion (more on that later), I'm not entirely comfortable with finalizing some of these issues, since it might turn out that a different decision is necessary or expedient.

And if you want to add real validation, the hex values need to be within the linked value tag, i.e. <color $FF>. Otherwise you lose the context for validation when such tags are in clumps of poorly defined hex output.

We're already using < and > for representing raw hex bytes, and even if we disallow <$XX> in text sequences, there could still be issues with combining entries like <$X and X>. Allowing < and > only as the first and last character respectively of non-normal entries would almost solve that edge case (and requiring non-normal entries to use < and > as the first and last character respectively, coupled with checking for uniqueness across <$XX> and all non-normal entries, would put that issue [and a few more] to rest for good). I was thinking about just pre- and post-fix strings for linked entries, but a format string is definitely more useful. You could then say things like $hex=<window x=%2X, y=%2X>. What kind of format strings would you allow? The full range of printf?
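As a sketch of how a format-string entry might be matched during insertion (the <window x=%2X, y=%2X> entry is just the example above; the regex translation is invented):

```python
import re

def compile_linked(fmt):
    """Turn a format string like '<window x=%2X, y=%2X>' into a regex that
    captures each parameter byte written as two hex digits (optionally $-prefixed)."""
    pattern = re.escape(fmt).replace(re.escape("%2X"), r"\$?([0-9A-Fa-f]{2})")
    return re.compile(pattern)

entry = compile_linked("<window x=%2X, y=%2X>")
m = entry.match("<window x=$0A, y=$50>")
assert m and [int(g, 16) for g in m.groups()] == [0x0A, 0x50]
```

Anything that doesn't match the full pattern (wrong parameter count, non-hex digits) simply fails to match, which gives an inserter a natural hook for raising a validation error.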

Putting all of that together would make $ the general in-game control code identifier I was shooting for earlier with #. Salient changes would be: