Tachyon V4 "DAWN" - exploring new worlds

Taking a fresh look at Tachyon since I first implemented it in 2012, I got to thinking: what would happen if I used 16 bits instead of 8 bits for the opcode? I know it takes up twice as much memory for the simpler functions, and as soon as I encode a literal it takes up 4 bytes. Then again, originally even high-level calls were only 2 bytes long since they used a vector table, but V3 cut back on the preset table so many calls are now 3 bytes anyway.

Nonetheless I started playing with some code so that a 16-bit opcode could still efficiently call PASM code while wasting only one extra cycle. Addresses above the usable cog range of 0..$1C0(ish) would automatically call a high-level function, which means the old IP (instruction pointer) is pushed onto the return stack and replaced. This is good as I don't need to waste a 16-bit opcode plus an address to tell it to call a function, and it works a bit like a conventional 16-bit address Forth.

But it's an opcode interpreter, not a dumb address interpreter, so any address above hub RAM is interpreted as a 15-bit literal and pushed onto the stack. That's even more compact and efficient than bytecode, and now there is no need to use up cog code for fast constants either (in addition to the elimination of the absolute and vectored CALL code).

There is also bit 0, which is not required for high-level hub calls as all hub code is word aligned, but I may use it to tell the interpreter not to bother pushing the IP onto the return stack, in effect turning the call into a jump.
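
Putting the dispatch rules above together, here's a rough Python model of how a 16-bit wordcode might be classified. The boundary values ($1C0 cog limit, $8000 literal threshold) and all names are illustrative assumptions, since the real interpreter is cog PASM:

```python
# Hypothetical model of the DAWN 16-bit opcode dispatch described above.
# Boundary values and names are assumptions for illustration only.

COG_LIMIT = 0x1C0    # opcodes below this jump straight into cog PASM
LIT_FLAG  = 0x8000   # bit 15 set -> embedded 15-bit literal (above hub RAM)
JUMP_FLAG = 0x0001   # bit 0 set on a hub address -> jump, don't push IP

def dispatch(opcode):
    """Classify a 16-bit wordcode the way the interpreter might."""
    if opcode & LIT_FLAG:
        return ("literal", opcode & 0x7FFF)   # push 15-bit value
    if opcode < COG_LIMIT:
        return ("cog", opcode)                # direct cog PASM routine
    if opcode & JUMP_FLAG:
        return ("jump", opcode & ~JUMP_FLAG)  # replace IP, no push
    return ("call", opcode)                   # push IP, then branch
```

Note how a plain hub address acts as a call with no separate CALL opcode, which is what makes the encoding so compact.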

Branch runtime optimization is already implemented in Tachyon as a branch stack for DO/LOOP and FOR/NEXT; BEGIN was tried that way too but left out due to cog memory constraints. So I could reimplement this at least for BEGIN, making it a compiled codeword that pushes the IP onto the branch stack, which is then used by WHILE/REPEAT/UNTIL/AGAIN.

One thing that bothers me is that an IF/ELSE branch at present requires one bytecode plus one displacement byte, and that could double in memory usage with DAWN.
!Idea - since the top of hub RAM is needed for buffers we wouldn't be calling code there, so how about I encode $7Fxx as an IF plus forward displacement and do similar for ELSE as $7Exx? Maybe I could do the same for BEGIN loops, which have a negative displacement, so that UNTIL is $7Dxx and REPEAT/AGAIN is $7Cxx. If I make the displacement signed and word aligned I just need two "opcodes". Wow! Way better now.
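
The proposed branch encodings can be sketched as a toy encoder/decoder. The opcode bytes follow the post, while treating the low byte as a signed word-count displacement is my assumption for illustration:

```python
# Toy encode/decode for the proposed $7Cxx-$7Fxx branch opcodes.
# The low byte is taken as a signed displacement counting words,
# giving a reach of +/-127 words (+/-254 bytes) - an assumption here.

BRANCH_OPS = {0x7F: "IF", 0x7E: "ELSE", 0x7D: "UNTIL", 0x7C: "AGAIN"}

def encode_branch(op, words):
    assert -128 <= words <= 127          # must fit a signed byte
    return (op << 8) | (words & 0xFF)

def decode_branch(opcode):
    op, disp = opcode >> 8, opcode & 0xFF
    if disp >= 0x80:
        disp -= 0x100                    # sign-extend the byte
    return BRANCH_OPS[op], disp * 2      # word aligned -> bytes
```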

Also any literal > $7FFF needs to be encoded as 32 bits plus the opcode word = 6 bytes total (vs 3-5 in bytecode). I may just have to give -1 a unique opcode. The compiler can still make use of double-opcode macros too (i.e. NIP = SWAP,DROP).
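
A hypothetical cost model for literal encoding under this scheme (sizes in bytes; the dedicated -1 opcode is the one suggested above):

```python
def literal_size(n):
    """Bytes needed to encode literal n under the DAWN scheme sketched
    in the post (vs 3-5 bytes for a 32-bit literal in bytecode)."""
    if n == -1:
        return 2      # dedicated -1 opcode, as suggested
    if 0 <= n <= 0x7FFF:
        return 2      # embedded 15-bit literal, one wordcode
    return 6          # long-literal opcode + 32-bit operand
```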

Variables and constants will no longer sit in code space, requiring an aligned opcode and a call as they did, but will simply be compiled as an embedded or long literal using only the information found in the header, which is a record in the dictionary. Variables are mostly read, and before use need an appropriate fetch (C@, W@, @), although with the lexing of source code we could take a variable such as myvar and simply write myvar@, or something similar, for correct code to be compiled. The same goes for myvar! too.

Here's just a very quick and incomplete rundown of changes and features -

Features
Faster than bytecode - most operations do not require reading any additional hub code
Direct access to all cog code rather than having to page into the upper 256 longs.
More cog space to include critical functions (elimination of fast constants and call types etc)
Embedded 15-bit literals - handles all hub addresses in one operation
Elimination of call vector table
Fast internal data stack - still with top 4 registers directly addressable.

High-level features
Separate dataspace (code/data/dictionary)
Symbols such as variables and constants are "code-less" and simply compile a literal inline without the need to call code.
All calls (or jumps) only ever consume a 16-bit opcode (vs bytecode + 16-bit address)

Compile-time features
Text input is "executed" character by character via a 128-word lookup table, rather than by text word as in earlier Tachyon (compared to line by line in most Forths)
Numbers are built digit by digit rather than processed as a string (expect floating point to be encoded this way too)
Syntactical preprocessing and optimization (embedded symbols/parenthesis/braces)
Length of variable encoded into header for correct width access and indexing.
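
Digit-by-digit number building amounts to a multiply-and-accumulate per character instead of converting a completed string; a minimal sketch of the idea, not the actual Tachyon code:

```python
def build_number(text, base=10):
    """Accumulate a number digit by digit, as a character-driven
    outer interpreter might, rather than converting a whole string."""
    acc = 0
    for ch in text:
        acc = acc * base + int(ch, base)   # fold in one digit at a time
    return acc
```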

I've also been playing with directly handling C-style syntax, it's quite interesting what can be accomplished while being able to freely mix both styles (and interactively).

This is a work in progress - I am not really starting from scratch, since I have the whole Tachyon kernel to overhaul. I will update this thread with my incremental results for those who are interested.

All is looking good so far; I have separated the PASM kernel from the to-be-completed wordcode kernel, drivers, etc., so I can test this with a little bit of wordcode. Not much needed to be changed in most of the PASM opcodes - I just removed a lot of "junk" and enhanced the doNEXT VM interpreter.

I think the Proptool maxed out because it couldn't handle all the DAT symbols, which have been greatly reduced now since I don't need an indirect vector etc., so I guess ol' Proptool should work again.

I'm really chuffed to have come up with this simple word encoding that still allows direct code addressing, as it makes all the difference, plus cog code does not suffer any real penalty. Even just having the cog space to implement a 32-level hybrid data stack makes a lot of difference too. This should really fly and still be compact. There won't be many changes that I should need to make to all the extensions, so once I sort out the compiled kernel I should be ripping.

Some questions:
how long will it take to bring this to life?
could it be the preferred version for the P][?
can we create a git repository?

At this rate, within the week, maybe just days, assuming I get to spend some time on it, which I haven't really been able to do yet. When it's all done and baked then maybe we can look at a git repository, although unnecessary complexities just get in my way; anyone is welcome to implement it if they so desire.

The P2 already uses 16-bit addresses, although they are not encoded, and I could mix hubexec into it too. Even though the TF2 opcode is only 16 bits to make it more compact, it did have the full 64k available just for code, with the dictionary and data in other areas.

BTW, one of the reasons I came up with this new version is because I need a morale booster, I just see the Propeller chip itself fading further and further into the background of forum chatter and dreams.

I'm loving this new word encoding format, it really does work well, even at the DAT level of the Spin tool.

You can see the memory it ends up saving even if it looks like it uses twice as much memory as bytecode. For instance, in bytecode if I wanted a value of 1,000, this would take 3 code bytes plus the time to read them. With wordcode this is just one word and one read, and it's faster. Another saving is not having to have the call vector table, as all I need to do to call, for example, BLINK (with the Spin tool) is:

word @BLINK+s

where s is simply the 16-byte Spin header offset that we have to factor in. But if I didn't want to return I could just as easily say:

word @BLINK+t

where t is s+1, since bit 0 is redundant for word addressing and is used instead to indicate that the IP should not be pushed onto the return stack; the call effectively becomes a jump and thus removes an otherwise required EXIT instruction.

Wordcode compiles nicely; have a look at a startup test demo which dumps some RAM and then sits in a loop incrementing and printing a number while blinking two LEDs.

Notice how DEMO is terminated with @BLINK+t vs @BLINK+s,EXIT to save two bytes and some time.

The doNEXT wordcode loop is a bit more complicated than it is in bytecode Tachyon, but this same loop handles literals up to $7FFF as well as calls to wordcode. BTW, cog PASM codes, which directly jump to that address in the cog, only require four doNEXT instructions, which is only one more than bytecode Tachyon.

the new model sounds really promising for some speedup & code shrinkage then ...

You haven't looked closely enough - in the testing section perhaps? Can you sync the Tachyon folder to your PC so it is always up-to-date?

V4 is not yet interactive as I'm starting afresh with how input is processed, and I may allow for line input as well so corrections can be made while typing.

BTW, there are some sections of code, especially early on, which definitely consume almost twice as much memory. But the more code is added, the more the savings show up. There is no real penalty for factoring a similar snippet out and calling it from other routines, which in the past always required at least another vector. The other saving is being able to jump instead of call, which saves on the enter/exit overhead.

For instance, I noticed that I had the sequence SWAP DROP EXIT in a lot of places, as well as DROP EXIT. So I just define:

I share your sentiments on the Prop fading. Once upon a time you couldn't leave the forum for a day without heaps of posts. Now I can go away for a week and not miss much. Rarely are more than a couple of people logged in.
I find other things to do now, rather than work on the Prop, which is a shame.

Wot! It's 200ns slower. Well, this test is mainly to check that the extra doNEXT overheads don't impact code execution much, which they don't. The speed savings I expect to see in larger sections of code, whereas fibo is basically doing a tight loop with FOR BOUNDS forNEXT, where BOUNDS is the equivalent of OVER + SWAP.

I will do some further tests with less optimizable (real-world) code to measure the gains which I expect to see.

To be fair I replaced the "2 +" with the 1+ 1+ that V4 had, and the result was Primes = 182.542ms.
However, 121.58ms vs 182.542ms shows that V4 is 50% faster: V3 = 547.8 primes/100sec vs V4 = 822.5 primes/100sec.
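
The quoted figures are consistent with each other; as a quick sanity check of the arithmetic:

```python
# Sanity-checking the quoted benchmark figures: primes per 100 seconds
# is simply 100,000 ms divided by the time for one pass.
v3_ms = 182.542                   # V3 primes pass
v4_ms = 121.58                    # V4 primes pass

v3_rate = 100_000 / v3_ms         # ~547.8 primes/100 sec
v4_rate = 100_000 / v4_ms         # ~822.5 primes/100 sec
speedup_pct = (v3_ms / v4_ms - 1) * 100   # ~50% faster
```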
It seems that two Tachyons bound together can travel faster than one Tachyon alone!

V4 embeds constants as wordcode literals so this also makes it faster.

btw, the earlier benchmark times were skewed by 200ns, so V4 was not in fact 200ns slower.

I share your sentiments on the Prop fading. Once upon a time you couldn't leave the forum for a day without heaps of posts. Now I can go away for a week and not miss much. Rarely are more than a couple of people logged in.

The latter is a function of what I call the "log-in latency." I'm pretty much logged in all the time, but it only shows up if I've clicked on something in the last minute or so. In the old forum, log-in latency was much longer, so more people showed up as being logged in.

What you call "prop fading," I believe, has been a function of three factors:

1. The never-ending agony/ecstasy cycle of P2 development has simply worn most people out, and they've moved on.

2. The introduction and emphasis of C language programming vs. Spin has split Prop enthusiasts into two camps. This has hindered cross-fertilization between the two, since they seem mutually incompatible.

3. The new (now not-so-new) forum software has alienated a lot of forumistas due to issues that remain unfixed and features that did not survive from the older version.

Phil,
I think the richness of the microcomputer / microcontroller market is also a factor. When I go into a MicroCenter store or look on-line, there are so many different Arduino-compatible devices plus ESPxxxx variations with WiFi built-in plus RaspberryPi's and C.H.I.P.s also with WiFi and Bluetooth. The latter are getting much better at doing the sorts of things the Basic Stamp does with development on-chip with a Bluetooth keyboard and either a composite video or HDMI/VGA display.

For sure that little C.H.I.P. is sweet and neat and the smaller Pro version certainly appeals to me. I also like the tag line "Powered by a chip you can actually buy" which for us forumistas vs RPippl has more than one meaning ....

Still, all these chips don't have what the Prop has, which for actual embedded control work is far more important than playing pico-8 games. But I'm about to squeeze a whole lot more out of what we've got, just when I thought I couldn't squeeze any more, because it's all we've really got, so we can either dream or do.

I'm on a R&D project I can't share. It involves control of a pile of inductive loads.

We used the P1 for some early tests; it was simple. Complexity went up, and people thought that exceeded the bounds of the Propeller. "Hobby toy"

There is a complex controller in development, GUI, C, etc... so far that has been stalled as a lot of stuff needs to be debugged, written and so forth. I got told all the amazing features, interrupts, timers, peripherals, integrated development tools, libraries...

"Real pro grade rapid development system."

Okie Dokie, well how come it takes so long? I got a bunch of answers all centered on features, tests and complexity.

Well, one evening a few days ago, I decided to connect my proto board to the controllers. It took an hour to solder up the connections, another 30 minutes for a quick systems check, and about an hour to code up motion in SPIN. This board was used to help characterize the inductive loads. So it was just sitting there along with a little POS netbook.

"Interpreted is too slow"

"Concurrency is hard"

LOL, while they fight with a hierarchy of interrupts, missing brackets, API and library oddities, I wrote the few methods I needed and rolled that all up into a nice repeat loop that demonstrated the proof of viability nicely.

Sent a video out and the answer I got back was hilarious!

Basically, it was a bunch of, "when we get done..." Yeah, tons of setup, so the real work is easy.

Well, on the P1, just doing the real work was easy, almost no real setup needed.

That evening moved the project forward about a month. A lot of basic science and testing to characterize the device and its physics needs to be done no matter how it's controlled at a high level in the end.

They keep building in features anticipating tests, while I just wrote them as needed quickly and easily. I've got the tech doing the same thing.

Now, the tech wants to get onto the science, so I set him up. Ran through my setup, SPIN basics, and after a couple hiccups, he's off and running with the odd programming question I can answer easily.

Over the years here, I have learned a ton! Thanks.

The best part was the need for some debug and status output. It took me 10 minutes to merge the test code with one of the serial demos.

"Do you want keyboard, mouse and a video display or serial?"

"On that thing?"

"Yup, will take me an hour, maybe two..."

"How?"

"Concurrency is kind of easy" Damn right it is.

"I'll take serial."

"OK, that's 10 minutes."

So far, SPIN is quick enough for this application. I may need a PASM helper COG, depending on where the science takes things. I've got it done, and showed it off.

"Assembly is scary"

Well, after showing them how to just drop it in and do things the way we do...

"That assembly is stupid easy" Again, damn right it is.

"Won't scale"

Wait until I show them just how much one of these little chips can do with basically zero support libraries, fancy multitasking, or interrupt-based kernels... my limit is pins, not any real speed bottleneck.

"Can't do anything in 32K RAM"

Already blew this one out of the water. "And you get video, keyboard and mouse too? SD card?"

We need that P2 done. It's got similar potential, and a lot more basic room to work in.

But, I just wanted to share a positive. This hobby-level guy just kicked a lot of arse, and did so because of how you all have helped me to think, and because Parallax made a chip that is lean, mean, and effective.

I can write a few lines, hit F10 and see it all happen. They have to know a ton more and do much more, all of which gets one far away from the problem or task.

And this is just the bog standard stuff. Tachyon can do so much! Wish I got along with Forth better, but I don't.

What I do know is Peter is sharing pure gold here for those who can run with it.

And I also know the P1, and how we think here, and why we do that is very seriously potent stuff.

It's hard to get others to believe or understand. Nothing beats, "oh, you did that in an evening?"

And it's not even my area of expertise. I do this for fun and my own enlightenment too.

Last chat, "I might have to get one of these, it's simple and effective and fast."

... Addresses above the usable cog range of 0..$1C0(ish) would automatically call a high-level function, which means the old IP (instruction pointer) is pushed onto the return stack and replaced. This is good as I don't need to waste a 16-bit opcode plus an address to tell it to call a function, and it works a bit like a conventional 16-bit address Forth.

I like the sound of that too..

When this is tuned and running, what about looking at a P1V V4 core?
A few choices there:
** Do a COG that uses only the V4 Forth opcodes, and see how much smaller that is
** Do a COG with helper opcodes, to see how much faster that can be

This is what turns me off that big world out there: the unnecessary complexity of the tools and even the peripherals, which would be acceptable if they did everything properly, which they don't. So you have to get your head into all that stuff and then use that knowledge to create workarounds, which increases the complexity even more.

Had a laugh on the CHIP forum as they were trying to interface a DHT11-type sensor, for which they eventually worked out a solution using SPI, a big buffer, and several I/O lines. Wow! On the Prop it's only a couple of dozen lines of Tachyon code (about 150 bytes) to make it work using a single I/O.

EDIT: As for Spin, even if I don't love the speed, I really love the simplicity and ease of use of the language. Along with PASM it just works, or is easy enough to get working. It got me started with the Prop.

Nonetheless I started playing with some code so that a 16-bit opcode could still efficiently call PASM code while wasting only one extra cycle. Addresses above the usable cog range of 0..$1C0(ish) would automatically call a high-level function, which means the old IP (instruction pointer) is pushed onto the return stack and replaced. This is good as I don't need to waste a 16-bit opcode plus an address to tell it to call a function, and it works a bit like a conventional 16-bit address Forth.

... another idea: what would this look like coded to read HyperRAM as XIP? That allows 2Mx8 memory with a low pin count.
The 16-bit opcode could fit well there, as that's just 2 clocks for a sequential read, leaving a lot of idle time for the hidden refresh.

What about a P1 module, with two HyperRam ? - one for XIP Code, and the other as a video buffer.

Although V4 is not at the interactive stage yet, since I am making design decisions along the way, I decided to code my DHT22 routine into the kernel in DAT fashion, which btw takes 150 code bytes in V3.

It turns out the V4 wordcode version, at 130 code bytes, takes less code space, which is what I anticipated once I started coding in higher-level functions.

@jmg - not sure what you mean but I try to keep within the hardware limits and I'd rather marry an ARM to a Prop than try to mutilate it with memory expansion schemes

@jmg - not sure what you mean but I try to keep within the hardware limits and I'd rather marry an ARM to a Prop than try to mutilate it with memory expansion schemes

Yes, it can make sense to pair a P1 with a (larger) ARM, but there are also cases where users may want to use the P1 RAM for data and design with external code.

A P1 also pairs well with smaller MCUs like the EFM8LB1, where higher-performance ADCs/DACs are needed.
Parts like the EFM8LB1 are now comparable or lower in price than the equivalent ADCs/DACs, so you get the MCU for free.

HyperRAM has a good trade-off in pins vs bandwidth for memory expansion.
Yes, it adds another chip, but the jump is from 32kB to 2MB, which is quite a gain - plus it means 32k is less of a hard ceiling.
Otherwise, many would avoid a P1 in order to avoid hitting that limit.

It's such a pity that the Prop doesn't have more I/O pins (or maybe an address/data bus), which is why I object to memory expansion schemes: they try to turn a microcontroller into a microcomputer but end up cannibalizing and crippling the Prop. The P2 was supposed to be a reality many, many moons ago, but when it eventually is, all this that we discuss will be moot.

I've added another special opcode for task variables so that each cog may have its own yet share common routines that use these task variables. This means that all wordcodes are only one word long, except for a LONG literal, which of course requires a 32-bit operand plus the 16-bit wordcode.

So far so good, says the eternal optimist.

Now the dilemma I face is what to do with the dictionary, as I need to mix word-aligned code with byte-aligned characters. Currently Tachyon stores each record in the dictionary in this format:

That works well enough, and the count also helps to search faster and to skip to the next record in ascending memory, since the dictionary builds down towards code memory, which builds up.

Now, coding a dictionary entry for V4 wordcode in the Prop tool requires an entry like this:

byte 0,"WORDS", hd, (@WORDS+s)>>8,@WORDS+s

Since I'm lazy I don't bother working out the count byte, as Tachyon will fix up the counts on a cold start.

But that looks messy, so I could just enter @WORDS+s as a word like this:

byte 0,"DUMP", hd
word @DUMP+s

So that is a lot cleaner, but depending upon the byte alignment it might add one extra byte between the atr and the wordcode, which has to be factored in when skipping to this field or the previous name.

Now the other way to format this is to bite the bullet and make each dictionary record a fixed length, given that it is rare for Forth names to exceed 12 characters, since_we_don't_use_long_names as they are a pain to type interactively, plus they take up memory. One advantage of a fixed format is that it lends itself to being accessed more easily in slow memory such as I2C EEPROM. At present the dictionary can take up a lot of precious hub RAM, but I have a scheme which drops these names into 1 of 64 hashed index blocks in EEPROM/SD. However, if I use fixed-length records I could just store them as-is in EEPROM without any special tricks or hashed index blocks, except perhaps sorting them. Using a binary search it then becomes quick and easy to locate a name without having to read in a whole 384-byte block.

So the main reason for the fixed-length record is that the dictionary, or most of it, really belongs somewhere other than hub RAM, but normally that somewhere else is slow. The hashed index blocks use around 24kB, as some blocks are only half full, yet the 16-byte fixed-record approach would use less memory overall.
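
As a sketch of the fixed-record binary search idea - the 16-byte layout here (13-byte zero-padded name, attribute byte, 16-bit little-endian code address) is an assumption for illustration, with the EEPROM modeled as a bytes buffer:

```python
# Binary search over sorted fixed-length dictionary records, as might
# be done against EEPROM/SD. Layout is hypothetical: 13-byte name,
# 1 attribute byte, 2-byte code address = 16 bytes per record.
import struct

REC = 16

def find(dictionary, name):
    """Return the code address for name, or None. Records must be
    sorted by their zero-padded names; each probe is one small read."""
    key = name.encode().ljust(13, b"\0")[:13]
    lo, hi = 0, len(dictionary) // REC - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        rec = dictionary[mid * REC:(mid + 1) * REC]  # one EEPROM read
        if rec[:13] == key:
            return struct.unpack_from("<H", rec, 14)[0]
        if rec[:13] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None
```

For a few hundred names this touches only eight or nine 16-byte records per lookup, instead of reading in a whole 384-byte hashed block.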

Decisions decisions decisions......

Thoughts?

Hi Peter,

- I would completely reject a fixed-length name scheme, although I do try to keep names short. I think newcomers would never use Tachyon again.
- I remember that in Commodore 64 BASIC there was a way to shorten longer keywords, for example "POKE" as P shift-O ("pO"); this shortname trick worked for many commands. I can't remember the details - maybe you had such a box too, so you know what I'm speaking of.
- I would even think about something like a namespace, as in C++; this would free up some names for application usage. That would be cool.

Some Forths I've used save just the first few characters and a character count. One long = 3 characters + 1 count. Or you could pack 4 non-extended ASCII characters into 28 bits, leaving 4 bits for the count. It's lossy, resulting in collisions if you're not careful, but compact and very fast for lookups.
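
The 28-bit packing idea might look like this in a quick Python sketch; the exact bit layout (4-bit count in the low bits, then four 7-bit characters) is my assumption:

```python
def pack_name(name):
    """Pack the first four 7-bit ASCII characters of a name plus a
    4-bit length into one 32-bit cell. Lossy for longer names, so
    distinct names can collide - the trade-off mentioned above."""
    cell = min(len(name), 15)                # 4-bit count, saturating
    for i in range(4):
        ch = ord(name[i]) & 0x7F if i < len(name) else 0
        cell |= ch << (4 + 7 * i)            # 7 bits per character
    return cell
```

Comparing names then costs a single 32-bit compare per dictionary entry, which is what makes lookups so fast.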

Personally, I think the interactive aspect of Forth is overrated. I would much rather use a good off-line editor/pre-compiler and just upload all the resulting threaded code at once when I want to try something. With such a system, dictionary entries can be as long as you want them, since the target doesn't have to include the dictionary.