I spent Christmas day coding. That was fun. As part of my efforts to move to Linux, I decided to port some of my code. One of the first things I’ll want in the world of Linux is the ability to read .ini files.

I really like .ini files. You can put any program settings in them in any order. You can edit them with a text editor. You can read and write to them from within your program. This is much better than (say) storing all your settings in binary files. Some people are moving to XML these days, but XML files are massive overkill for a job like this, and end up being incredibly verbose and annoying for humans to read. For context, here is the .ini file for Project Frontier:

The drawback of .ini files is that they’re basically a Windows thing. If you’re writing code targeted at Windows, then you can change one of the above settings like so:


//Specify section, then entry, then the new value, then the file being changed.
//I have no idea why the inputs are in this order. Wouldn't it make more sense
//to list the FILE first, so they're in ascending levels of specificity?
//But whatever....
WritePrivateProfileString ("Shaders","ShaderNormal","cellshade.cg","frontier.ini");

The problem is that WritePrivateProfileString () is not available on other platforms. If you want to use ini files elsewhere, then you need to write your own version of these. This means writing a text parser. Text parsers can sometimes be kind of fiddly.

The big problem is that C++ is not the best language for juggling text. In fact, it might be one of the worst. Yes, you can use std::string. That’s not so bad. But sooner or later you’ll want to pass around char* strings. (If this never happens to you, then you’re probably a student or working in a really cutting-edge environment where you never have to interface with code that’s more than a few years old. Most of us, sooner or later, need to use a char*.) When that happens you’ll need to make a char*, allocate some memory, copy the std::string, and then forget to free () the memory later because screw this language, man. And even when you’re free to use std::string, you still wind up with situations where you can’t do simple things like this:

You either need to clutter up the code with a bunch of casting, or you have to break the operation up onto multiple lines. That’s fine. You’ll still get the job done and it’ll still work, but it can be done cleaner and faster in other languages, and with fewer memory-management pitfalls.

But the real horror show begins when you have to maintain a parser written in vanilla C. No std::string to help you. No new and delete to make allocating memory easier and less dangerous. When you build a parser in C, you are cutting down a tree with a utility knife.

In my Activeworlds days, I had to maintain such a parser. It had been written in 1994 or so using nothing but base C. Also, the material being parsed was particularly troublesome. It was a scripting language that…

…was often written by end users. The parser needed to be VERY forgiving of errors or it would drive people crazy and be too large of a challenge for the average person to learn. If possible, one bad command shouldn’t prevent subsequent commands from executing.

…was used in bulk. The users went around the world tagging objects with scripts. For example, the picture frame on the wall might change images when clicked, or bumping into an object would teleport the user to a new place. Scenes were frequently made of thousands of objects, and all of this was taking place in the late 90’s, before the days of ubiquitous graphics acceleration. Every CPU cycle cut directly into the framerate, so the parser needed to be as fast as possible.

…was designed to be as terse as possible. All of this data was flowing down the user’s 28.8k modem connection. Users, being users, will naturally expand to use ALL available space. Without limits end users will never stop packing in data. If you let them write a megabyte-long script, they will. And then they will copy & paste that script onto every object in the vicinity. So each object was limited to 256 bytes. With limits like this in place, every single byte matters. Certain fields need to be optional. There need to be abbreviations.

rotate [x] y [z][sync OR nosync][time=time][loop OR noloop][reset OR noreset][wait=wait][name=name][smooth]

You could use this to make an object spin on all axes, like so:

rotate 10 20 30.

If I remember correctly, those numbers were expressed in RPM. That command would spin the object 10 times a minute on the X axis, 20 times a minute on the Y, and 30 times a minute on the Z. However, the vast majority of cases where the command is used are just to make an object spin on the Y axis. (Like a merry-go-round.) So instead of wasting FOUR WHOLE BYTES specifying zeroes for X and Z, you’re optionally allowed to specify only Y.
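To make the optional-field idea concrete, here is a minimal sketch of how the numeric part of a rotate command might be resolved. It is not the Activeworlds parser. The names RotateArgs and parse_rotate are made up for illustration, and rejecting a two-number form is an assumption, since the original behavior for that case is not described here.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical result type; the field names are illustrative only.
struct RotateArgs {
    double x = 0.0;
    double y = 0.0;
    double z = 0.0;
    bool valid = false;
};

// Accepts "rotate y" or "rotate x y z". Any other count of numbers is
// rejected, which is an assumption made for this sketch.
RotateArgs parse_rotate(const std::string& command) {
    RotateArgs result;
    std::istringstream in(command);
    std::string name;
    if (!(in >> name) || name != "rotate") return result;

    // Collect leading numbers; extraction stops at the first non-number
    // (for example a directive such as "nosync").
    std::vector<double> numbers;
    double value = 0.0;
    while (in >> value) numbers.push_back(value);

    if (numbers.size() == 1) {
        result.y = numbers[0];  // lone number means "spin on Y only"
        result.valid = true;
    } else if (numbers.size() == 3) {
        result.x = numbers[0];
        result.y = numbers[1];
        result.z = numbers[2];
        result.valid = true;
    }
    return result;
}
```

Calling parse_rotate("rotate 20") fills in only y and leaves x and z at zero, which is the byte-saving shorthand described above.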

After the number(s), you have a list of directives that may or may not be used. Using the rotate command, you could make a door that swung open and closed. You could put hands on an analog clock that would always show the correct time. You could make a blade that swings back and forth, Skyrim dungeon-trap style. You could make a spinning helicopter blade.

That’s a lot of power in a very small command, with the downside being that it’s a pain in the ass to parse.

Parsers usually work by taking a block of text and breaking it up by whitespace. (Whitespace is any non-printing character: spaces, tabs, line feeds.) Extra whitespace is ignored, so “rotate 5” is identical to “rotate      5”. This is how pages are parsed by your web browser. It’s how my C++ compiler reads my code. It’s how ini files, css files, and MS-DOS batch files are read.
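As a small illustration of the whitespace rule, here is a sketch that splits text into words using a string stream. The function name split_whitespace is made up for this example; real parsers usually do this by hand for speed.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split text on runs of whitespace. Leading, trailing, and repeated
// whitespace never produce empty tokens.
std::vector<std::string> split_whitespace(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) words.push_back(word);
    return words;
}
```

With this, "rotate 5" and a version padded with extra spaces, tabs, or line breaks both come out as the same two tokens.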

Parsing code remains the only place I have ever seen the forbidden goto used in production code. In many parsing situations, once you’re past a word you can discard it, but in this parser there were situations where you needed to save values for later. So the parser would allocate a whole bunch of stuff, saving values while reading a command that could end in error at any time. I no longer have access to the codebase, and I’m pretty sure the whole thing was re-written at some point in the last few years, but back in the day I remember seeing something like:

Keep in mind that those if () statements are underselling the complexity at work here. For example, if thing2 ends in .jpg or .png then it is a texture name for sure. But if not, then it still MIGHT be a texture name, pending the contents of the other things. But those other tests shouldn’t be performed unless thing2 doesn’t have an extension. Later, if thing2 doesn’t have an extension and we figure out it must be a texture name, then we append .jpg. And so on. You get the idea. We’re talking about complex branching logic, done in stages, each of which allocates memory that needs to be released before we move on.
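For readers who have never seen the pattern, here is a rough sketch of goto-based cleanup. It is a reconstruction of the general shape only, not the Activeworlds code. The helper copy_word and the trivial "is it empty?" tests are placeholders standing in for the real word-fetching and validation logic.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for "give me a copy of the next word".
// Returns a malloc'd copy the caller must free, or nullptr on failure.
static char* copy_word(const char* word) {
    if (word == nullptr) return nullptr;
    char* copy = static_cast<char*>(std::malloc(std::strlen(word) + 1));
    if (copy != nullptr) std::strcpy(copy, word);
    return copy;
}

// Returns true if all three words pass the (placeholder) tests.
bool parse_thing(const char* w1, const char* w2, const char* w3) {
    // Declared up front so that no goto jumps over an initialization.
    char* thing1 = nullptr;
    char* thing2 = nullptr;
    char* thing3 = nullptr;
    bool ok = false;

    thing1 = copy_word(w1);
    if (thing1 == nullptr || thing1[0] == '\0') goto end;
    thing2 = copy_word(w2);
    if (thing2 == nullptr || thing2[0] == '\0') goto end;
    thing3 = copy_word(w3);
    if (thing3 == nullptr || thing3[0] == '\0') goto end;

    ok = true;  // a real parser would act on the command here

end:
    // Single exit point: release whatever was allocated so far.
    // Calling free on a null pointer is a harmless no-op.
    std::free(thing1);
    std::free(thing2);
    std::free(thing3);
    return ok;
}
```

Every failure path lands on the same label, so the cleanup is written once instead of being repeated at each stage.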

Now, the C/C++ orthodoxy is that goto is forbidden. Never shall ye use it, lest ye be subject to ridicule and possibly stoning. That’s mostly true, and even if you do happen to encounter a situation where goto looks like a good solution, most programmers will avoid using it because of social pressure. Using goto in production code is the equivalent of an electrician turning screws with a butterknife. It might get the job done, but it looks unprofessional.

But I could never see any good way to avoid the use of goto in the parse_thing () code. Sure, you COULD get rid of it, but just about anything else would require more redundant lines of code, much deeper nesting, and more convoluted logic. Despite what they teach you in school, readability trumps orthodoxy. Most likely the presence of goto here signals that the back end of the parser itself (where it pulls text in) is perhaps built… oddly. I won’t diagnose it further to avoid getting into nitty-gritty details and publicly critiquing code written by other people while I was still making tacos for a living. But the point is, this goto was probably a symptom, not a problem.

Hang on. What were we talking about again?

Oh! ini files, right. Totally forgot. So yesterday I wrote some code to parse Windows-style ini files. I usually hate parser work. It’s very fussy and has lots of little pitfalls and hassles and headaches. What if this whitespace is part of the data? Is this file using Windows or Linux style line breaks? What if the user has odd spaces where they shouldn’t, like inside the [Section] header? What if some of the data has markup in it, like so:

If done wrong, then Bob would break the settings file when he names his character “[[[masta killa187]]]”. He arguably deserves it, so maybe you can pass this off as a feature?

Parsers always seem simple at first, but even something as rudimentary as an ini file can have a lot of possible routes for chaos once you allow for the fact that they must contain the most dangerous form of information in computer science: User entered data.
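To show how a few of those pitfalls can be handled, here is a minimal sketch of an ini reader. It is not the parser I wrote. The names IniData, trim, and parse_ini are made up, and treating ';' as the comment character is an assumption. It trims stray whitespace (including the '\r' from Windows line endings), tolerates spaces inside a [Section] header, and splits on the first '=' only, so brackets in a value are left alone.

```cpp
#include <map>
#include <sstream>
#include <string>

using IniData = std::map<std::string, std::map<std::string, std::string>>;

// Strip leading and trailing whitespace, including a stray '\r'.
static std::string trim(const std::string& s) {
    const char* whitespace = " \t\r\n";
    const std::size_t first = s.find_first_not_of(whitespace);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(whitespace);
    return s.substr(first, last - first + 1);
}

IniData parse_ini(const std::string& text) {
    IniData data;
    std::string section;  // entries before any header land in section ""
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        line = trim(line);
        if (line.empty() || line[0] == ';') continue;  // blank or comment
        // Only a line that both starts with '[' and ends with ']' is a header.
        if (line.front() == '[' && line.back() == ']') {
            section = trim(line.substr(1, line.size() - 2));
            continue;
        }
        const std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;  // forgiving: skip junk lines
        data[section][trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return data;
}
```

Because a header has to start with '[', a line like Name=[[[masta killa187]]] is read as an ordinary entry and Bob gets to keep his name.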

Like I said, I usually hate writing parsers. They’re boring drudge work. But for whatever reason I was in the mood for that kind of work yesterday. So… that’s what I did. Seemed to turn out okay.

Did I really just spend 2,000 words rambling on through digressions instead of getting to the point, which wasn’t interesting anyway? I might have. Anyway. How was your Christmas?


“So the parser would allocate a whole bunch of stuff, saving values while reading a command that could end in error at any time. I no longer have access to the codebase, and I’m pretty sure the whole thing was re-written at some point in the last few tears, but back in the day I remember seeing something like:”

Probably a typo, but it made me chuckle given the tone of the article.

No, no, no. You are supposed to constantly ridicule C# or otherwise make it seem unattractive. That or tell him that it is impossible to create his pet projects using C#. Then he’ll begin using it to prove you wrong and satisfy his contradictory nature like a good little programmer.

I would say arbitrary labels and short jumps are discouraged in code rather than the other flow control tools but are far from verboten in C. A switch/case is just a block of scope which is all about jumping to labels and many a sane C coder has used labels for sensible releasing of resources in code that can fail (similar to your example but with three labels and no if null checks as the different jump points define which pointers need to be freed). And once the compiler gets hold of your work then it all boils down to program flow by labels so the restrictions on where to use them in your own code is obviously only to avoid people blowing their leg off (and even then we provide bigger guns like longjumps for those interested in potentially removing limbs).

That ini file format seems to not require a full grammared parser but rather just a simple “[*]\n” and “*=*\n” (eager consumption of = on the second pattern) while iterating over each line of the file (maybe also with “//*\n” to allow simple comment lines), and so is something a RegEx lib could provide. When it comes time to consume a complex language then I highly recommend using someone else’s hand-written parser rather than building your own (manually can only be reasonably done for very simple stuff and via a tool like Yacc or ANTLR isn’t ideal for anything as complex as C or beyond due to quirks and so anything generated needs manual tweaking to work in the real world).
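As a sketch of the line-pattern idea, the three patterns can be written with std::regex roughly like this. The type and function names (LineKind, ParsedLine, classify_line) are made up for illustration, and the exact patterns are one possible reading of the shorthand above, not a tested ini grammar.

```cpp
#include <regex>
#include <string>

enum class LineKind { Section, Entry, Comment, Other };

struct ParsedLine {
    LineKind kind;
    std::string name;   // section name or key
    std::string value;  // only used for entries
};

// Classify one line (without its trailing newline).
ParsedLine classify_line(const std::string& line) {
    static const std::regex comment(R"(^\s*//.*$)");
    static const std::regex section(R"(^\s*\[([^\]]*)\]\s*$)");
    // The key cannot contain '=', so the split is always at the first '='.
    static const std::regex entry(R"(^\s*([^=]+?)\s*=\s*(.*?)\s*$)");

    std::smatch m;
    if (std::regex_match(line, m, comment)) return {LineKind::Comment, "", ""};
    if (std::regex_match(line, m, section)) return {LineKind::Section, m[1].str(), ""};
    if (std::regex_match(line, m, entry)) return {LineKind::Entry, m[1].str(), m[2].str()};
    return {LineKind::Other, "", ""};
}
```

Anything after the first '=' stays in the value, so a line such as Name = a=b comes back with key "Name" and value "a=b".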

Edit: it is probably worth pointing out that the above label jumping comments are written about C. C++ has RAII and using smart pointers guarantees your objects clean up after themselves when falling out of scope.

Actually, ANTLR has a rather nice (and dangerous) feature where it lets you write arbitrary code into the parser spec and change the AST it generates. This pretty much lets you get around the sort of quirks you see in more complicated languages since you can override the theoretical limitations of the parser (like doing extra lookahead in certain cases).

Sorry, yes, that was my point and I was unclear in my language. Whether it be post-generation edits or pre-, in the grammar decorated with manual code, the fully generated parser is likely to not be as simple as making a grammar and any of the major players (from Clang to GCC to small stuff built on Lex-Yacc like PyCParser) who get to C and above complexity are doing manual work to get their parser working, with or without a base code that is automatically generated from a tool. The sane option is normally to find someone else who has already got an industry grade parser working and integrate it into your project. If you’re making your own language with your own rules then writing an unambiguous grammar that is easy to parse and generating the parser is a plan but then you might just use something simple that can parse as a linear character traversal.

Of course, this being linux, there are already many different ways of doing this, with “do it yourself” always being a sensible suggestion :P

As for fixing the parse_things function, I think the only really nice way to write the function involves having some sort of automatic memory management for thing1/2/3 so that they get freed when they go out of scope.
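A minimal sketch of what that could look like in C++, assuming the strings still come from malloc-style code: wrap each one in a std::unique_ptr with a deleter that calls free. The names FreeDeleter, CString, copy_word_owned, and parse_thing_raii are made up here, and the emptiness checks are placeholders for the real tests.

```cpp
#include <cstdlib>
#include <cstring>
#include <memory>

// Deleter so that unique_ptr releases malloc'd memory with free().
struct FreeDeleter {
    void operator()(char* p) const { std::free(p); }
};
using CString = std::unique_ptr<char, FreeDeleter>;

// Hypothetical stand-in for "give me a copy of the next word".
static CString copy_word_owned(const char* word) {
    if (word == nullptr) return CString();
    char* copy = static_cast<char*>(std::malloc(std::strlen(word) + 1));
    if (copy != nullptr) std::strcpy(copy, word);
    return CString(copy);
}

bool parse_thing_raii(const char* w1, const char* w2, const char* w3) {
    CString thing1 = copy_word_owned(w1);
    if (!thing1 || thing1.get()[0] == '\0') return false;
    CString thing2 = copy_word_owned(w2);
    if (!thing2 || thing2.get()[0] == '\0') return false;  // thing1 freed automatically
    CString thing3 = copy_word_owned(w3);
    if (!thing3 || thing3.get()[0] == '\0') return false;  // thing1 and thing2 freed
    return true;  // a real parser would act on the command before returning
}
```

Each early return releases whatever has been allocated so far, so no goto or cleanup label is needed.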

I long-ago migrated to a preponderance of try/throw/catch/finally syntax for parsing. C++ supports it, however working in vanilla C requires some serious finessing in other ways, such as the setjmp/longjmp method found here. (The minimal set of #define commands which that site gives as an example to replicate try..catch block syntax is very nice.)

This is always the problem with C stuff. There are probably many. Which ones are good? Is the interface good? Will it compile cleanly? Is it FULLY portable? (No sense in moving to something new if it’s just going to re-create the problem I’m trying to solve: That my code isn’t portable.) Is it properly documented? Does it work the way I want?

In the time it would take you to search, download, integrate, compile, and review the candidates, you could have just done it yourself.

Obviously there’s a threshold in there somewhere. Some stuff is large enough to justify the cost of code-shopping, but for me an ini parser seems like a good DIY job.

I think the best is to DIY, and then find a really good one that someone else has written and compare. What trade offs did I make? Is my code more or less flexible or robust? Was there a clever technique that I missed?

Looking for a solution before solving it is more difficult. Once I’ve gone through the exercise of making it work myself I have a much better idea of the problems that need addressing.

Of course, Shamus has solved this problem himself several times already, so it sounds like the pure fun of it.

Hmm maybe there is an opportunity there…
An online service that creates and evaluates code. Kind of like sourceforge but with a focus on completed/orphaned code. Its goal would be to answer “Is this code good for my project?” in an organized and accurate way.

With modern cloud / virtual machine / continuous integration stuff, a site like that could automatically test every piece of submitted code on at least windows / linux / mac, probably with several different flavors of each (XP, 7, 8, Android, etc).

It’s Christmas today (we get two Christmas days, ’cause we’re lazy like that), and all I’m doing is pointing out spelling errors on the Internet.
“[..]a door that swung open an closed. You could put hands an an analog clock[..]”

I had to comment on something. About the char * when working with strings: calling std::string::data() will return a pointer to the string’s memory (a const char * before C++17), and calling std::string::c_str() will return a C-style string of the data contained in the string (a null-terminated char array). You should be able to use std::string when working with C functions with no issue, so long as you’re not planning to give ownership of the memory to the function you called.
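A small sketch of that, with legacy_length as a made-up stand-in for any old C-style function that only reads the text it is given:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical legacy function: reads a C string, does not keep or free it.
std::size_t legacy_length(const char* text) {
    return std::strlen(text);
}

std::size_t measure(const std::string& s) {
    // c_str() gives a null-terminated pointer owned by the string itself,
    // so there is nothing to allocate, copy, or free here.
    return legacy_length(s.c_str());
}
```

The pointer is only valid while the string is alive and unmodified, which is why this works for read-only calls and not for functions that want to take ownership.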

“You either need to clutter up the code with a bunch of casting, or you have to break the operation up onto multiple lines. That’s fine. You’ll still get the job done and it’ll still work, but it can be done cleaner and faster in other languages, and with less memory-management pitfalls.”

The point isn’t that C/C++ is a bad language or anything. It’s just something you have to deal with in C and not other languages.

I should keep these. I've always been annoyed about how I can't use that with std::strings.

Still, I agree that any language that requires you to write your own code to get the syntax to behave the way you want it to is not easy to use. And this is a non-standard syntax that might confuse people.

//Arrays of all types are primitive data types in D.
//D also has a built-in alias of “const char[]” to “string”

string testString = "oo";
writeln(charToString(testString));
//D strings are NOT null-terminated!
//String.toStringz() returns a proper C-format string with the null terminator added to the end.
writeln(charToString(testString.toStringz()));
//I really don’t advise that last one, you’ve got a null right in the middle of your D string. I don’t know how D handles that.
//If you pass it to a C library, everything after the null will get dropped.

return 0;
}

Output is

Foo bar
Foo bar
Foo bar

The above block of code should be compile-ready. There used to be an online compiler on the D website, but it got taken down…

Yeah, I’ve never been a fan of stringstream. It’s like they saw string was missing some utility, so they created a whole other class to provide the functionality string was missing and make it compatible with streams.

I just want a string class that does what a string class is supposed to do.

Yeah, there’s no way around that. You can’t write to a string unless you pass the string along.

And this is just basic string operations like adding letters to a string. There’s no standard support for switching to upper or lower case, the encoding and locale stuff is an arcane mess that is inscrutable to even experienced coders, there’s no easy way to turn a number into a string, etc. You can code it, but it all has to be hand crafted or you have to hunt for libraries that do it the way you want it to work.

1: “You can’t write to a string unless you pass the string along.”
False. It is guaranteed safe to write to the char* you get from doing e.g. &str[0], so long as you stay within bounds (as with any buffer).

For example:
std::string buffer(32, 0); // 32 byte buffer
// store the actual written length for later; snprintf will not overrun the buffer
int newlen = snprintf(&buffer[0], buffer.size(), "%ld", (long)time(NULL));
buffer.resize(newlen); // set string.size() to match the number of used bytes

And you can similarly pass a suitably sized std::string or std::vector to any C API that writes to a given char*. Note that string.reserve() is not what you want here – you need the bytes initialized, not merely reserved.

2: “There’s no standard support for switching to upper or lower case”
C++ inherited C’s tolower() and toupper() so those exist, but I agree they are not exactly great. To change a whole string you need to use std::transform() – and that page even uses toupper() in the example, so won’t duplicate here.
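For completeness, a minimal sketch of the std::transform() approach. The function name to_upper_ascii is made up, and as the name suggests this is only reasonable for plain ASCII text.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Uppercase a copy of the string one byte at a time.
std::string to_upper_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        // toupper() expects a value representable as unsigned char,
        // hence the parameter type.
        return static_cast<char>(std::toupper(c));
    });
    return s;
}
```

Multi-byte encodings need a real library, since a single byte there is not necessarily a whole character.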

3: “the encoding and locale stuff is an arcane mess that is inscrutable to even experienced coders”
Agreed. This is where Boost Locale and ICU enter the picture. ICU is the de facto Unicode handling library – basically everyone but Microsoft uses and ships ICU, including Boost.Locale. If ICU can’t handle your locale or encoding need in a cross-platform manner, your need is not of this world. What ICU lacks is a polished interface, which is what Boost.Locale aims to provide.

4: “there’s no easy way to turn a number into a string”
See #1 for the quite simple sprintf() way, which can be turned into 2 lines. There’s also the excellent Boost.Lexical_cast. Or if your compiler is new enough, use std::to_string().
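A short sketch of two of those options side by side. The function names are made up; std::snprintf and std::to_string are standard, the latter since C++11.

```cpp
#include <cstdio>
#include <string>

// The C way: format into a local buffer, then build a string from it.
std::string int_to_string_snprintf(int value) {
    char buffer[32];  // comfortably larger than any int rendered as text
    std::snprintf(buffer, sizeof buffer, "%d", value);
    return std::string(buffer);
}

// The C++11 way.
std::string int_to_string_std(int value) {
    return std::to_string(value);
}
```

Both give the same text for integers; std::to_string is the shorter choice when the compiler supports it.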

Btw, if you aren’t already using Boost, today is the day to start. It’s a collection of high quality cross-platform libraries that play very nice with the C++ Standard Library. Most of it is header-only, meaning zero runtime dependencies.

Also, if you’re serious about learning the inner workings of C++, hang out in ##C++ and ##C++-general on Freenode IRC for a few months (yes, two #s).

On #2, you also start running into a whole *boatload* of character issues almost instantly. Is a byte in a string with value 0xC2 something you can lowercase? Sometimes, sometimes not. You need character encoding awareness to tell, and once you’ve got awareness of character sets then you’ve got to decide which ones you’re going to support or which multi-byte string manipulation library you’re going to use because that IS a project that’s too big to roll your own in any sane amount of time.

About the first point, even if you are working with the buffer you need to know it belongs to a std::string. If you’re working with legacy code that didn’t know about it and tried to realloc the pointer you passed, you’d get in trouble. And if it didn’t return the length because modifying the string was a side effect you’d also have issues.

So you can avoid using the string interface and work directly on the buffer, but your code needs to know you’re dealing with a string. By that point, you might as well be passing the string along instead of a pointer to the buffer.

That was what I meant. Of course, this is a problem with c++ having to remain c-compatible, and using c code.

We’re pretty much in agreement over the rest. I didn’t mean to imply any of these operations weren’t possible, they just require more knowledge than you can expect from a complete newbie. String manipulation was complicated in c, and c++ did little to improve on it; reflected by the fact that you need to return to c functions for many of those operations.

The fact that at this stage of the development of the language they’re still adding string functions for operations such as to_string shows the string interface was very poor to begin with.

Still, thanks for the links, wasn’t aware of some of those. I don’t generally have much need for string manipulation myself.

And how much bloat does Lua add, just to use that?
Even PHP has a similar ability to write/read its data structures. Why not JSON instead then, same thing right?

In a program of mine I use XML config files, but there is also a skinning feature, and that one does use .ini so people can easily make (and tweak) skin features/settings.
The XML settings (program prefs) are not meant for hand editing (though due to it being XML you can easily look at it in a browser or edit it if needed).

The XML code is used elsewhere in the program for other features, so the XML prefs file is a bonus in that regard.

If you are using Lua serialization then I assume that your program also uses Lua scripts for program scripting?

I also LOVE ini files – SO SO SO SO SO much easier than a lot of other ways to do it.

But I do have to say, even in that case, I wouldn’t have used GOTO, but I have an extremely high nesting tolerance.

Also, don’t most newer languages have a command that basically replaces “goto end” without using a goto? Break, exit sub, stuff like that. So yeah, a lot of people agree with you on it (enough to make commands just to replace it).

The problem is, you can’t just return. (Which is what C uses to jump out of the current function.) You have to release the memory you allocated, otherwise you’ll leak a bunch of it every time you see something invalid — this is not likely to work out all that well when being run continuously on a server somewhere. :-)

C++ *sometimes* allows you to try/finally, but not always (there are lots of projects that compile with exceptions disabled, and lots of companies that forbid them in code that’s used there, mostly because exceptions tend to break cleanup in weird ways: http://blogs.msdn.com/b/oldnewthing/archive/2005/01/14/352949.aspx), so that may not be usable either.

This type of code is actually used all over the place in the Linux kernel, as well; there are a surprising number of places where you have to pass up a (perhaps-translated) error code to your caller, while still cleaning up allocations or other state changes you’ve made.

//allocate thing 1
thing1 = get_next_string ();//makes a copy
if (whole bunch of complex tests prove that thing1 makes no sense)
return;
thing2 = get_next_string ();//makes a copy
if (more tests to see if thing1 and thing 2 make no sense together)
return;
thing3 = get_next_string ();//makes a copy
if (more tests to see if all three things fail to add up to a proper command)
return;
//Yay! The user managed to enter something coherent!
do_thing (thing1, thing2, thing3);
}

But functional purists tend to live in a fantasy world where function calls are free instead of time consuming.

In C++, the unique_ptr (and its associated move semantics to handle ownership) solve this pretty well *assuming* the only thing you care about is destroying objects. See the linked article, where he’s talking about adding a reference to the created object into something else (in that case, it’s a list of notification icons; in the earlier article he links to, it’s a reference to a player object stored in the team).

Sometimes you have to clean up some of the object’s post-creation steps, as well as just calling its destructor, since its destructor doesn’t always know all the lists that the object itself had been added to. It would be possible to wrap those post-creation steps in another wrapper object I suppose, but that starts to get really really complicated…

Yes you can write bad code in every language.
The most important thing is to think about who owns a resource and is therefore responsible for cleaning it up. C++ makes it really easy to deterministically (is this a word?) clean up resources by providing value semantics (and therefore defined scope) through wrapper classes, which by the way have little to no overhead (depending on how good your compiler is).
And you get exception safety for free if you stick to RAII.

“Cleanup” is not exclusively “deallocating the memory for the object”. It also includes “removing pointers to the object from any random other lists of pointers that it may belong to”, which is impossible to do from the object’s destructor.

And I don’t think you can write that off as “bad code” — see the two linked oldnewthing posts for two perfectly legitimate cases where this type of cleanup is needed. The OS maintains a list of notification icons (so that it can, you know, display them :-P), and the Team class maintains a list of its Player objects (so that it knows who’s on which team). Both of these lists need to be cleaned up, otherwise you’re going to crash when calling a method on the class in the list, and passing a “this” pointer whose memory has been deallocated.

(I suppose the notification-icon class could clean up the OS’s list. It’ll still leak the HICON, but that’s because .net is silly. The Team/Player pair still need manual cleanup outside the destructor though, especially in the case of multiple threads…)

Personally, I try very hard not to use shared pointers. I like to know which class owns an object and with shared pointers this can get very unclear. But like you said, if you work with an external API sometimes you have no choice :)

Apart from the whole “spending time with family” thing (No, I’m not particularly social, what clued you in?), my Christmas was pretty good.
At least the family part wasn’t an all-day affair. We went to see the Hobbit (a 3rd time for me) and then I was able to go home and just play games the rest of the day.

Not sure if that would work (it does seem a bit fragile, especially at handling user input), but once the {Get,Set}ConfFileEntry template functions are written, it’s actually a few lines fewer to use than the {Write,Read}PrivateProfileString functions.

Although… hmm. It looks like this doesn’t support sections either. I thought it did?

Aha, the Frontier version did; this earlier version did not. This one instead:

If you need comments in a config file then it’s no longer just a config file (though comments in the default.ini are fine, and many games do this). A separate documentation file might be in order then.

As to sections, they are both a blessing and a curse. How should the parser handle duplicate sections (but with different variables)? Are they added (and how to handle numbers then, can they be comma separated, etc.)? And how to handle the lack of a section (as a “” maybe)?

At this point looking at XML or some other standard is suddenly no longer so silly.
A program I’m making skins for has .ini files for parameters, no sections at all, and no comments.
This is how it looks:

For simple configs like this the .ini is unbeatable. For something larger and standardized, which can be interchangeable among multiple platforms and different software (like in my case), XML is worth it, especially if the XML code is already used in the software.

Keep the .ini as basic as possible, that is partly the reason why it’s been so popular, start adding bloat and you get issues later.

For those not seeing the issue with .ini file parsing, take a peek at http://en.wikipedia.org/wiki/INI_file
That’s as close as you get to an “official” standard (please note the “” as there is no actual .ini standard)
And then there are variations not shown on that wikipage at all.

INI and XML only structures the data, it does not describe it, so regardless if .ini or .xml is the file format the content is always program/application specific.

That inilib seems nice but has a few issues. It’s GPL (and not LGPL, or BSD or MIT/zlib/PNG license) so it may or may not be possible to use depending on the project and the way it’s distributed.

Also the source for that inilib is over 1.5MB which is insane (I’m a stickler for really small efficient and logical code with tight minimalistic loops) and yeah I know, the build/configure environment eats up a lot of the space, but I consider that part of the source (as you usually need it to build it).

That lib is incomplete (just look in the TODO), it even says that comments can be lost and that full section/key name support isn’t there. At a glance it looks like feature creep (there is mention of replicating certain Windows Registry features).

With such code size and complexity, one might just as well go for TinyXML2 instead http://sourceforge.net/projects/tinyxml/
Which is fully standards compliant and any XML viewer/web browser or editor that support XML can read/write the .xml file generated.
It is also way smaller (1.2MB and 80% of that is the documentation) than that inilib. The license is also the very liberal zlib (MIT/libPNG) license.

I’d been watching the livestream XCOM game from a while ago, and the site is saying that a lot of the videos on Rutskarn’s channel don’t exist anymore. I was wondering if the site was pulling a Viddler or something.

Christmas Eve is the day of celebration here and since the nearest bit of (other) family lives more than 800 kilometers away, it was just the three of us (me, my girlfriend and my daughter) having a cosy evening.

On your Christmas I spent about one fruitless hour trying to figure out why my program (visual programming here: “patch”, actually) runs perfectly fine unless I fullscreen the output, which seems to introduce a mysterious something-like-a-buffer out of nowhere. And I’m talking logic defying timetravel mystery here. sigh.

My Christmas was really good! We’re visiting family (my parents and my wife’s parents live about a block apart) and it was good to see my brothers again. Didn’t get much stuff, but it’s ceasing to be about that anymore, which is an interesting transition to observe.
On the other hand, hanging out with the in-laws is always a bit… trying. I have trouble having a good time, which makes it hard for my wife to have a good time as well. That, and I feel like my kids pick up bad habits from them, which are only going to result in hours of re-training. Oh well.
So, mixed bag. Learning to be tolerant (the hard way). Enjoying old friends.

I’ve enjoyed writing parsers as well, but mostly in Python, which has none of these frustrations. I wonder, could you write the parser in another language (PHP or Perl or Python or something) and then build a link to the module from your C library?

Somewhat OT, but I just had to post because just today I encountered two of your comment-counter messages that were truly fascinating to me (40, and 77 — and yes, by posting this, 40’s message is gone now).

Have you ever posted a list of all the messages you’ve built into your comment counter? Thinking about the coding behind it, it must be a huge switch statement… actually maybe not a switch, as some messages connect to a single number (simple EQUALS test) while others connect to a range of numbers (perhaps GREATER THAN OR EQUAL test?). Hmmm, this alone could be an interesting post someday… :-)
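For what it’s worth, the mix of “exactly this number” and “this range of numbers” doesn’t need a switch at all. Here is a rough C++ sketch of a table-driven lookup. This is only a guess at the idea, not the site’s actual code; the struct, the function name, and every message string are made-up placeholders.

```cpp
#include <cstddef>
#include <string>

// One entry covers an inclusive range of comment counts.
// A message tied to a single number simply has first == last.
struct CounterMessage {
    int first;
    int last;
    const char* text;
};

// Placeholder entries: none of these are real messages from the site.
static const CounterMessage kMessages[] = {
    { 1,   1,   "placeholder for exactly one comment" },
    { 40,  40,  "placeholder for exactly 40 comments" },
    { 100, 199, "placeholder for 100 to 199 comments" },
};

// Return the first message whose range contains 'count',
// or an empty string when no entry matches.
std::string message_for(int count)
{
    const std::size_t n = sizeof(kMessages) / sizeof(kMessages[0]);
    for (std::size_t i = 0; i < n; ++i) {
        if (count >= kMessages[i].first && count <= kMessages[i].last)
            return kMessages[i].text;
    }
    return "";
}
```

Adding a message then means adding a row to the table rather than another case to a switch.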

Oh, and I had a lovely Christmas! Managed to make it all the way to 8am before being woken up… :-)

I had an excellent Christmas because I received The Witch Watch as a present! I spent the rest of Christmas reading it (and waiting for Steam games to download on a 100kbs connection). Great read, and I was glad to be able to support my favorite blogger. Here’s hoping you don’t lose the motivation to finish your current book; if you keep it up then maybe my future Christmases will be as nice.

It’s kinda nice in Windows that most (but not all) things use one kind of file (.ini) for config. Linux is based on a much more… uh, varied background. So if you want to administer it, you need to know how a billion different config files look. A maze of twisty little passages, all slightly different.

Similar to MSXML? Yeah, there’s either SAX (if your schema is simple enough that a single callback per tag (IIRC anyway) and a serial walk of the text file can work) or libxml (which generates an entire DOM tree and gives you the ability to process it — this is much more expensive in terms of time and memory than SAX’s setup, but is also a lot more flexible since you’re not limited to a serial tag walk).

At least libxml is installed on almost every system; I believe SAX is about as widespread.

If there are multiple competing .ini parsers, all the better: choose whichever one makes the most sense for you. It’s still a single include that gets you functions that you call that (allegedly) do what you want them to.

Plus, if you are WRITING the program in question, the .ini files shall conform to whichever standard you make them.

My Christmas did not have enough coding in it. But about those goto statements. A parser is a state machine. All parsers are built from state machines underneath. And if you’re coding in C, the standard way to implement a state machine is using goto. It is actually more readable than the alternative. And I tell you, as someone who teaches compilers and programming languages at university level, it is even orthodox. (Although if you happen to be parsing an LL(1) EBNF grammar, it is even more orthodox to use classic recursive descent with if statements and while loops.)
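To make the state-machine point concrete, here is a minimal sketch of a goto-driven classifier for a single .ini line, where each label is one state. The enum, the function name, and the choice of ';' and '#' as comment characters are assumptions made for illustration; this is not taken from Shamus’s parser.

```cpp
#include <string>

// Hypothetical line categories for this sketch.
enum LineKind { LINE_BLANK, LINE_COMMENT, LINE_SECTION, LINE_ENTRY, LINE_ERROR };

// Classify one line of an .ini file. Each label below is one state.
LineKind classify_line(const std::string& line)
{
    std::string::size_type i = 0;
    const std::string::size_type n = line.size();

start:                          // skipping leading whitespace
    if (i == n) return LINE_BLANK;
    if (line[i] == ' ' || line[i] == '\t') { ++i; goto start; }
    if (line[i] == ';' || line[i] == '#') return LINE_COMMENT;
    if (line[i] == '[') { ++i; goto in_section; }
    if (line[i] == '=') return LINE_ERROR;   // entry with an empty key
    goto in_key;

in_section:                     // after '[', waiting for the closing ']'
    if (i == n) return LINE_ERROR;
    if (line[i] == ']') return LINE_SECTION;
    ++i;
    goto in_section;

in_key:                         // inside a key, waiting for '='
    if (i == n) return LINE_ERROR;
    if (line[i] == '=') return LINE_ENTRY;
    ++i;
    goto in_key;
}
```

Every transition is a visible jump to a named state, which is the readability argument in a nutshell; the same machine written as a loop around a switch needs an extra state variable.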

goto breaks the flow of the program, by instructing the program to go to any arbitrary point in the code.

Imagine reading a book whose pages were out of order, and at the end of each you were told “continue on page x”. Kind of like a choose your own adventure book, only without multiple endings — so a choose your own adventure book written by Bioware *zing!*.

That’s what making sense of a program that uses goto is like.

As to why it’s used, it’s because sometimes it’s easier on the person writing the code to force a jump rather than plan the program ahead and design a clear flow that does what he needs. Or they are dealing with feature creep and can’t afford a rewrite.

Essentially, it’s a necessary evil. The best you can hope for is that the goto is reduced to a “continue on the next page” at the end of every page.

Also, I challenge anyone to find a car analogy for goto. I couldn’t :D

“with GOTO if you mess up you can easily crash shit, you need to be very careful about registers in use and the stack etc.”
No, goto is guaranteed safe to use in C and C++ – it will unwind the stack properly and everything. You’re maybe thinking of longjmp()?

GOTO cannot jump outside the current function, so i don’t know how the stack would be relevant in any way.
I don’t know what the fuss is about. If your functions are so big that you have to search for where a GOTO goes, they are too big anyway :)
that said, i never use it myself .. it reminds me too much of the old BASIC style :)

What about object-oriented programs, then? The ones that rely on conditional states and properties and interactions of entities, and not a single fixed ‘program flow’. Why is that not taboo? Because I sure as hell can see OO code breaking “program flow” just as easily.

The main problem with goto isn’t really with the jump, but rather with the fact that in an imperative “state mutating” program, it’s generally easier to track the state with structured code. To reason about correctness of the code at a given label, you have to find all the goto statements that jump to that label.

This isn’t much of a problem when goto is used for something like non-local exit in a few well-defined places, but it becomes a huge pain if you have lots of labels jumping back and forth in apparently random order. This is what structured programming was designed to eliminate. In many large C code-bases you can still find a few goto-statements here and there where they improve the clarity, but these are almost invariably non-local exits or certain types of error-handling code (there are some other cases where it’s hard to avoid code-duplication without goto, but usually you can split the code in question to another function as a cleaner solution).
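The error-handling case usually looks something like the sketch below: one cleanup label at the bottom, and every failure path jumps to it. The function name and buffer size are invented for the example, and it uses C-style resources on purpose, since that is where the idiom shows up.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Read the first line of a file into 'out'. Returns true on success.
bool read_first_line(const char* path, std::string& out)
{
    // Everything is declared up front: C++ does not allow a goto to
    // jump over an initialized declaration.
    bool ok = false;
    std::FILE* file = NULL;
    char* buffer = NULL;
    const std::size_t buffer_size = 256;

    file = std::fopen(path, "r");
    if (!file) goto cleanup;

    buffer = static_cast<char*>(std::malloc(buffer_size));
    if (!buffer) goto cleanup;

    if (!std::fgets(buffer, static_cast<int>(buffer_size), file)) goto cleanup;

    out = buffer;               // copies the text, newline included
    ok = true;

cleanup:                        // single exit: release whatever was acquired
    std::free(buffer);          // free(NULL) is a no-op
    if (file) std::fclose(file);
    return ok;
}
```

To reason about the label you only have to look upward within one short function, which is why this use tends to survive in otherwise structured code.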

Curiously, tail-calls in functional languages are more or less “goto” statements with arguments, but since programs in such languages are generally structured to use function calls and “binding” rather than sequential execution and “mutation” the whole problem largely goes away. You no longer need non-local information to reason about the code (the lexical environment is sufficient), so there isn’t similar problem with jumping around. You can still take it too far, but it’s significantly less of a problem.

For config files there’s a nice small library called libconfig (http://www.hyperrealm.com/libconfig) with a C and C++ interface. The files look a bit like JSON. It’s available on almost any Linux distribution. I already used this in production code.

Personally I prefer Java for the exact reason of being OS independent. It does use ini files occasionally (e.g. eclipse) and being OO it allows for quite a bit of flexibility.
Any code that’s re-used becomes its own method so no gotos are needed, and sanitization falls on the object that is being set instead of on the parser. So for example in your system the section up to the first = char is what matters; it just reads to the end of line (which should read any EOL character), shoves that into a string, and passes it to the object to be set, and it’s a black box how “[[[masta killa187]]]” is dealt with. The same can be done with C++ if you start from a heavy OO perspective.
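In C++ that split could look roughly like this. It is a sketch under assumptions: the Settings struct, its set() method, and parse_entry() are hypothetical names, and only the ShaderNormal key from the post is recognized.

```cpp
#include <string>

// Hypothetical settings object: it alone decides what a value means
// and whether it is acceptable.
struct Settings {
    std::string shader_normal;

    // Returns false for keys this object does not recognize.
    bool set(const std::string& key, const std::string& value)
    {
        if (key == "ShaderNormal") { shader_normal = value; return true; }
        return false;
    }
};

// The parser only splits at the first '=' and hands both halves over.
// Everything after that '=' is passed through untouched.
bool parse_entry(const std::string& line, Settings& target)
{
    const std::string::size_type eq = line.find('=');
    if (eq == std::string::npos) return false;   // not a key=value line
    return target.set(line.substr(0, eq), line.substr(eq + 1));
}
```

Because the split happens at the first '=', a value such as "a=b" reaches the object intact, and validating it stays the object’s problem.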

*looks around* Is it just me or are there a lot of coders around at this time of year? I certainly can’t recall this many coders in previous posts. Either that or the non-coders passed out (or have a life) *laughs*.

It’s not so much the time of year as it is the subject. When Shamus is talking about his own projects, there’s not much to add. But nothing gets coders riled up as much as giving opinions on a language. Just mention a preference for an IDE, or language, or compiler, and step back.

” (If this never happens you you, then you’re probably a student or working in a really cutting-edge environment where you never have to interface with code that’s more than a few years old. Most of us, sooner or later, need to use a char*.) ”

Is wrong.
It is not hard to write c++ programs where you communicate with old software even with the ‘new’ std::strings

std::string in c++ are basically wrappers around character pointers, and they interface just fine with old char * code. First of all standard string CAN be initialized from char pointers, and they CAN just as easily be converted to a char * pointer.
That is the whole point with c++ strings, they are compatible with old c code.

Look up the string constructor and string.c_str().

std::strings are often preferred to character pointers as they are self-contained and handle memory themselves. Compilers can also do some very aggressive optimization on objects, which can’t be done on character pointers.
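A small sketch of both directions, for anyone who hasn’t seen it. The two functions are stand-ins invented for the example; legacy_length() plays the part of an old C API that only accepts a char pointer.

```cpp
#include <cstring>
#include <string>

// Stand-in for old C code that only understands char pointers.
std::size_t legacy_length(const char* text)
{
    return std::strlen(text);
}

// char* -> std::string: the constructor copies the characters,
// after which the string manages its own memory.
std::string make_backup_name(const char* original)
{
    std::string name(original);
    name += ".bak";
    return name;
}
```

Going the other way is `legacy_length(name.c_str())`. The pointer from c_str() is read-only and only stays valid until the string is modified or destroyed, so it suits calls that read the text right away rather than ones that keep the pointer around.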

Bad, bad, bad things, I’d expect — this is exactly the kind of thing that seems like it’ll only work if the std::string class’s operator[], and its iterators, incidentally, are both implemented naively. The SGI STL reference doesn’t say anything about this either way. :-/

If I had a copy of the C++ standard handy I’d go look for this in there (though I wouldn’t be surprised if it weren’t there either, since I don’t know if STL semantics are specified there or not).

But I’m pretty sure it’s perfectly valid to actually store the data in a string in a bunch of disjoint buffers; this would be useful when trying to do zero-copy I/O, for instance. (Methods like append() and insert() would find this layout extremely efficient as well.)

The offset passed to operator[] would determine which of the internal buffers to use based on the size of each buffer, and then the “rest” of the offset would determine which charT instance to return a reference to. But taking the address of that reference would trivially break if the code you’re passing the pointer to assumes that the layout is sequential.

With this string implementation, .data() and .c_str() would both allocate a new buffer (…though destruction of that buffer would potentially be complicated; it’d have to be tied to the string instance somehow, but .c_str() already has this problem since it has to include a zero byte at the end, while .data() and the string instance itself do not), copy from each of the internal buffers into it, and return a pointer to its first byte. So those don’t actually require any given internal representation…

It is guaranteed safe as per the C++ Standard, both 1998 and 2011 editions.
– It is explicitly stated in C++ 2011 that std::string must be stored contiguously and with an extra null termination byte.
– In C++ 1998, this is not explicitly stated, but it is implicitly the case due to complexity requirements and other parts.
– All implementations do it the safe way, anyway, which formed the basis for explicitly requiring it in C++ 2011.

As for modifying a std::string via a C function, my post way above showed how to:

std::string buffer(32, 0); // 32 byte buffer
size_t newlen = sprintf(&buffer[0], "%d", (int)time(NULL)); // time() needs an argument; store the actual written length for later
buffer.resize(newlen); // set string.size() to match the number of used bytes

You don’t have to .resize() after if you’re using some other way to track used length (even relying on null termination), but I prefer to let std::string handle it.

Thanks for validating my views regarding goto. And thinking back, where did I use goto extensively? Parser code! I wrote a C++ program called JACOsub that was popular in the Amiga anime subtitling community in the 1990s. It included a parser that interpreted a rather complex subtitle script format that included multiple timing formats as well as codes for font selections and positioning and formatting of subtitles.

It always bothered me a bit that I had to rely heavily on goto while interpreting a script. Not only that, but I often used goto to jump to different sections inside a switch() statement to avoid duplication and keep my code size small (the whole package had to fit on an 880K floppy diskette and occupy less than a megabyte of memory while running).

Never having seen anyone else’s parser code before, I wondered if I was doing the right thing. I mean, my source code made perfect logical sense even with all those goto statements, and my code was tight and efficient, so I figured it was OK. I’ve had this nagging feeling about what other programmers would think if they saw my source, but then again, I couldn’t see how I’d do it any other way that made as much sense. A dozen or so years later, you have lifted that small weight of uncertainty from my mind. Thanks!

I’m not sure I know why, but this is one of my favorite posts in a long time. I think I just enjoy reading your explanations of things? Or maybe it’s because I’m going to have to write a scripting parser soon? Whatever it was, it was entertaining.

I have to agree with the people above who recommend the Boost library.

I feel like it’s everything that should have been included in C++ to make it a competitive modern language. (Compared to Java or C# for example.) And it’s an extremely well-behaved library, it makes no assumptions and doesn’t force the programmer’s hands.

Their one-line parsers for XML, INI, INFO, and JSON are a great example.

Check out this “Five minute tutorial” for loading config files with the Boost libraries. The functions to load and save the data are 7 lines each!