How does a C compiler store its symbol table? This link https://www.tutorialspoint.com/compiler_design/compiler_design_symbol_table.htm
explains most of it, but I'm still not sure how they store type data. Types can be quite complex: some are combinations of other types (array of struct, pointer to pointer to integer), and some reference user-defined types (declaring a struct creates a new type, which can then be referenced by another type - an array of those structs, again). How do compilers and debuggers store that?

Also, why did you move this thread from Programming Language Design to High Level Languages? People here ask how to use C; I'm asking how to make C. I'm not complaining, maybe this subforum is read more often.

I'd probably store a variable's type as a single dword, with one bit of it deciding whether it's a built-in type or a user-defined type.

The problem with that is combined types, like *******int. I know you rarely see more than two pointer stars, but the compiler probably still has to support arbitrarily many levels of pointers, which means this would require unbounded space...

Probably that's a good place to use linked lists, I dunno.

Yeah, I think I'll do something like that. Type information is two dwords:
the first dword is the type,
the second dword is the child type,
and when I need to "say" that a symbol is a *********int (this looks like swearing, lol), I'll create a bunch of intermediate types that aren't used by any symbol directly.
I'll create a type with the first dword saying pointer and the second dword saying int.
Then I'll create a type with the first dword saying pointer and the second dword saying the previous type.
And continue until I've created what I wanted.

The next problem is languages that have not one but two child types. Python and Go have the "dict" and "map" data types, and both of them map something to something. That's two child types, which means at least three dwords for every type. And C++ allows creation of combined types (they are called templates, I think) that may use even more child types, which again means an unbounded number of dwords. I'm again not sure how to implement that.

Since I moved to assembly, I started to panic whenever I see anything variable-length. I guess that's my problem right now.

vivik wrote:

Since I moved to assembly, I started to panic whenever I see a variable length anything. I guess that's my problem right now.

Once you have dealt with the problem of memory allocation (and in any modern OS you have a reliable "malloc" API), creating variable-length structures in assembly is really not that hard, and it is often very rewarding.

For example, to make a linked list traversable in both directions you need to add just two 32-bit (or 64-bit) fields to every element - call them "next" and "previous" - and store pointers to the adjacent elements there. To access the list, just keep a pointer to an element of your choice (the first one, the last one, or even one in the middle if that makes sense; it all depends on the application). Insertion of a new element at any point may look like:

I think that implementing growing data structures can be a lot of fun and I highly recommend experimenting with them. The more you do it, the less intimidating and more enjoyable it becomes.

And of course, to implement any kind of compiler - even a relatively simple one like an assembler - you need to be able to process growing data structures en masse. The symbol tree in fasmg contains a thick pattern of interconnected structures; some of the connections create linked lists, some make trees, hash tables, hash trees (a hash-based binary trie is my favorite, and I use it in fasm perhaps partly out of sentiment). And it does not have any complex types at all!

@Tomasz Grysztar
Interesting how in fasm you only seem to use memory allocation once, at the very beginning, and that's it. If I understood and remember correctly. Was it just more convenient that way, or did you want fasm to be faster?

Also, can you point out where the binary trie is used in fasm? I'm curious what it looks like.

Yeah, I should start using malloc already; I'm avoiding it much more than necessary.

vivik wrote:

Interesting how in fasm you only seem to use memory allocation like once, at the very beginning, and that's it. If I understood and remember correctly. Was it just more convenient that way, or you wanted fasm to be faster?

fasm was originally written for DOS, where a good malloc-style API was not universally available (especially not for extended memory), and so the only real option was to take all the available memory (not a problem in a system with no multitasking) and implement my own allocation scheme inside it. But instead of writing some kind of internal "malloc" for fasm, I used allocation methods very specific to the consecutive stages of preprocessing and assembling, which were impossible to untangle from fasm's internals when porting it to more modern systems. And thus the pre-allocation of a fixed amount of memory had to stay.

vivik wrote:

Also, can you point out at where the binary trie is used in fasm? I'm curious how it looks like.

In fasm source you can find it in the "get_label_id" routine in PARSER.INC, start looking from "find_label", the tree traversal starts right after the FNV-1a hash is computed. In fasmg source this is the "scan_namespace" function in SYMBOLS.INC.

I wonder what reasons the golang developers had for supporting maps (aka dicts, hash tables) at the type system level. Couldn't it just be a class with a bunch of methods? Having maps in the type system only makes sense if there is some kind of function that may accept only arguments of this exact type. I can't imagine a function that needs a map of strings to integers, and nothing else.

Hash tables are probably in golang because they were in python, and python is pretty lax about its type system in general.

If I ever need to make sure that a function always receives a hash table of the correct type at compile time, there are more ways to ensure that than just the type system. I'd use something that operates on the program source, I guess? Dunno, don't want to think about it now.

Even if you stick to the OS-provided allocator in the beginning, you always have the option of later replacing it with your own, perhaps better tuned to your application. This is also the approach I chose for fasmg (if you scroll to the end of that thread you may find a very good article on memory allocation strategies), though it turned out that "malloc" does not have that much impact on the overall assembly time. The strongest slow-downs came from re-allocations of overflowing buffers, and I managed to keep their number small by growing buffers exponentially (but still not too aggressively; see the "grow_stack" routine in the fasmg source).

I planned to use HeapSize to keep track of all variable-length data structures, but if it suddenly allocates 12 bytes when I ask for only 8, it will break things pretty badly. But if it only allocates more memory when the requested size is not divisible by 4, or when asked for something weird like zero-size memory, and behaves correctly in all other cases, then I will use HeapSize for that. Otherwise, I'll have to keep the sizes somewhere in my own data, manually. Probably as the first dword inside the data chunk itself.

There are no guarantees about what is allocated other than that you can use what you asked for; maybe more, if the OS uses some form of minimum allocation or alignment strategy. Also, do not rely on all versions of any particular OS doing the same things.

Note that my "malloc" function is a generic API; it always returns the size allocated, but the code that called it may or may not use this information. If you discard it, you simply do not use the excess memory when any is provided, and that should not be a big problem - any "malloc" implementation wastes some memory anyway. But in cases where I allocate buffers that may need to grow later, I store the allocated size, so when more memory was allocated than requested it may still get used before calling "realloc" to grow the buffer.

One change that I consider for inclusion in the fasmg source is to differentiate between a few "malloc" variants: "malloc_fixed" for blocks that are not going to be resized or freed before the end of assembly (the basic symbol tree structure buffers are like that), "malloc_growable" for blocks that are likely to be resized with "realloc" in the future, and plain "malloc" for blocks that may get freed but not resized. I could make them all labels of the same function in the basic implementation, but with a custom memory allocator these hints could be taken into account to make better choices about allocation strategy.

I like how fasm doesn't even use the stack to pass function arguments, and only uses registers and globals. I wonder how much of a speed and size advantage this way of coding has, and whether any high-level language compilers can produce code like that.

vivik wrote:

I like how fasm doesn't even use stack to pass function arguments, and only uses registers and globals. I wonder how much of advantage in speed and size this way of coding has, and if any high level language compilers can produce a code like that.

One should be careful when using globals like fasm does, because it is not thread-safe. For this reason I chose not to use the EBP register in the fasmg core (even though it made register allocation a bit harder), so that when a thread-safe version of the assembler is needed (like a DLL where you might want to run multiple assemblies in parallel) it should be possible to put all the globals into a "virtual at ebp+x" block and have a separate local copy for every call of the assembler function.
