Update 12/31/2019:
All done! See the GitHub page for burnable CDIs and the latest source: https://github.com/Moopthehedgehog/dcload-ip
CDIs are in the "Releases" section of the GitHub repository. They are lightly padded (about 64MB of padding).

Old post (attachment explanations are in here, though as of 12/10/2019 all attachments to this post are old. The perf counter timing thing is still not figured out completely, but I have a way to work around it until the true meaning behind it is understood):

Alrighty, so here is what I have so far since it was requested that I post the current status. No, it is not quite ready yet for actual release, so consider it very alpha. I think it's more like beta now with the 12/2/2019 source.

Some notes:

Everything from 1.0.5 should be there and still work as expected. (Ideally)

The seconds countdown for the DHCP lease is not exactly 1 second. It's veeeerrrrrryyyy close, but I don't know what the actual frequency of the performance counters in elapsed time mode is. My "stopwatch" calculation shows somewhere around 2.4GHz, but as I mentioned here I have a hard time believing that's correct. Update: No, it's off by like 11.96988, see later in this thread. Still trying to figure this out; it's the last thing.

Resetting dcload with -r from DC-TOOL's command line interface causes the DHCP countdown display to stop working. I don't know why this is. It also happens after executing a command. I would like the DHCP countdown to persist. Update: Fixed.

There's just over 1.4kB free memory space left, shared with the stack. Try to keep it over 700 bytes, ideally over 1024 bytes, if adding anything new. I'm trying to make this 2kB by converting the command buffer into using the packet transmit buffer, but it's not working for some reason. Well, that was actually pretty silly. I fixed it. BSS has 2.4kB freed up now. Unfortunately it looks like there's a more significant constraint on the file size that I didn't realize was there. Update: There's like 3.6kB left free now as of 12/2/2019. Space is no longer an issue.

There appears to be an upper limit to the absolute filesize of dcload.bin. I don't know what it is; somewhere in the mid-20kBs? Needs more looking into. Update: Linkerscript problems, as far as I can tell. I believe this is fixed now.

The makefile and build options for dcload had to be modified, so there are some new options I've added there to help with future debugging and to make things just work better in general. All the .asm files are GCC's produced assembly for the .c files, and output.map is, well, a map of the dcload elf file (dcload.elf is just the extensionless dcload file that appears after running make).

draw_progress() now outputs decimal digits instead of hexadecimal ones (lol), but nothing uses it in 1.0.5. It's there, nonetheless, and would take up about 144 bytes if it were used, so keep that in mind.

I just discovered right before posting this that I broke the DHCP countdown timer. The seconds are wrong. It was working earlier today... Update: Oh no, they're not wrong. Not wrong at all. This is hilarious: my router decided to put my Dreamcast on probation because it was asking for the same IP too much, so it says my DC's lease expires "never" now, but everything else has an expiry time.

Important Changes:

Static IPs in Makefile.cfg should be written with leading zeros. I had to do this so that the second-most-common IP address range I've seen in the US (192.168.0.x) can work, in addition to many other IP address ranges that have zeros in them.

Specifying 0.0.0.0 in Makefile.cfg activates DHCP mode!! (Well, it's really 0.x.x.x, since the whole 0.0.0.0/8 range is reserved for that purpose anyways). The ARP trick will still work, but you should really be using 169.254.x.x IP addresses for linking two machines together and doing that. 0.0.0.0/8 is mine now.

If DHCP fails, the IP address will be shown as 0.0.0.0, and in this situation the "arp with IP of 0.0.0.0" trick will be necessary like "old times."

Requires KOS build environment set up. I'm using Windows Subsystem for Linux v2 (Windows Insider program), but it might work with WSLv1. Don't use Ubuntu if you go this route, as KOS won't build. Use Debian instead.

EDIT: Updated source attachment.

Another edit: So I'm wondering some more about these performance counters, and I found this: http://www2.lauterbach.com/pdf/debugger_user.pdf
So at least there's something that exists that uses them. However it very explicitly states to use the CPU clock frequency, which we know to be 200MHz, not the absurd 2.4GHz number I've had to use. I'm wondering if this isn't one of those things where like the bottom 16 bits are reserved and the "elapsed time" is only in the top 32 bits, and you divide that value by the clock frequency to get time in... milliseconds? seconds? Really not sure. Maybe an Lxdream dev knows? Might need to go track down that debugger software... -- found it, it's in the demo downloads for the SH supported processor cores: https://www.lauterbach.com/frames.html?home.html, but sadly that one screenshot in the pdf showing PMCTR1 and PMCTR2 is more helpful than the software itself...

Also, I believe I've fixed DHCP renewal. Updated the source attachment. Will be sure in about 20-25 minutes. -- Yep, it's fixed. I'll even put up a CDI image so you can see what it looks like so far!

Just a note--you might be able to use that CDI as it is in its current state. The main purpose of DCLOAD has been working for me thus far for everything I've tried except the exception-test, and I would really appreciate someone explaining the exception binary so I can understand why this is happening and fix it.

There are really only 3 things left that need to be done before it's finished:

1. Understand the file size constraint. How does the exception handler binary work? What are the size limitations concerning dcload-ip because of it? How come the exception-test is the only example program that doesn't work? The CDI I posted above actually has the remote-PMCR functionality disabled/commented out and is using a stripped down "disable command only" version because I was trying to find some kind of filesize limit. With the full remote-PMCR functionality implemented, dcload.bin is about 21,920 bytes. With reduced PMCR it's 21,644 bytes, and with all remote-PMCR functions removed it's 21,184 bytes. Stock dcload.bin is 21,000 bytes, and exception.bin is 2048 bytes. There's 1024 bytes of space between the end of exception.bin and 0x8c010000, and I don't know what the purpose of that space is or what the importance of 0x8c00f400 might be.

2. Why doesn't the DHCP lease display persist when sending a cmd_reboot via dc-tool's -r flag or after executing some binary? It's very odd. It's like an entire if() block is just getting completely skipped on re-entry to main() for no rhyme or reason.

3. [Not strictly necessary, but would be nice to know for accuracy] How does the elapsed time measurement mode of the SH4 performance counters actually work? The way I'm using them for timing is just a best guess.

If I knew the answers to these it probably would've been done by now... I might be able to find the answer to 1 in some hardware documentation (maybe), 2 might require some sleuthing in the output ASM if it's a GCC thing, and 3 I just have no idea.

They've all been updated since WSLv2 first came out, though, so maybe you have a slightly older version? The latest version of Ubuntu (as of last week) didn't work for me without some extra finagling, unlike Debian which "just worked(tm)". If you run uname -r it's possible you might see you're using an older version... unless you did some extra work to get it up and running? The Windows Store downloads get updated with no fanfare, and the only way to really "update" is to uninstall and reinstall from the store. The Debian kernel I'm on is 4.19.79, for reference. I mean, I don't doubt Ubuntu can be made to work; Debian was just a seamless install and go.

Using a tool like Wireshark to view the response packets, the 64-bit data is stored in the last 8 bytes of the response when a read command is sent. It's in little endian, as I didn't byteorder-flip it for transmission across the network, and my reasoning is so it could be directly fed into a program on x86 without any extra conversion steps.
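If you'd rather script this than eyeball it in Wireshark, here is a minimal Python sketch of pulling the value out of a read response. It only relies on what's described above (the value sits in the last 8 bytes, little endian). The function name and the fake packet used in the example are made up for illustration and are not taken from the dcload-ip source.

```python
import struct

def extract_counter(response: bytes) -> int:
    """Return the 64-bit value stored in the last 8 bytes of a read response."""
    if len(response) < 8:
        raise ValueError("response too short to contain a 64-bit value")
    # '<Q' means little-endian unsigned 64-bit, matching the host byte order
    # the Dreamcast sends the data in.
    (value,) = struct.unpack("<Q", response[-8:])
    return value

# Hypothetical response: some leading bytes (contents assumed, not the real
# header layout) followed by the value 0x123456789ABC in little endian.
fake = b"PMCR" + b"\x00" * 8 + bytes.fromhex("bc9a785634120000")
print(hex(extract_counter(fake)))
```

Only the trailing 8 bytes are interpreted here, so the sketch doesn't need to know anything about the rest of the packet.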

So I took some measurements of performance counter 1 and found that it appears to run at about 11.96988 x 200MHz. It would be great if someone double-checked my calculations here. I'm not sure how there's even any circuitry capable of running at 12x per clock on the Dreamcast, but here it is.
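For anyone who wants to double-check, the calculation boils down to two counter reads and the wall-clock time between them. Here's a rough Python sketch of it. The function names are mine, and the numbers in the example are illustrative (chosen to reproduce the ratio above), not an actual capture.

```python
CPU_HZ = 200_000_000  # SH4 CPU clock on the Dreamcast

def counter_rate(first_read: int, second_read: int, elapsed_seconds: float) -> float:
    """Counter ticks per second between two reads taken elapsed_seconds apart."""
    return (second_read - first_read) / elapsed_seconds

def ratio_to_cpu_clock(rate_hz: float) -> float:
    """How many times faster than the CPU clock the counter appears to run."""
    return rate_hz / CPU_HZ

# Illustrative numbers only: a counter that advanced by 2,393,976,245 ticks
# per second over a 60 second stopwatch interval.
rate = counter_rate(0, 2_393_976_245 * 60, 60.0)
print(rate, ratio_to_cpu_clock(rate))
```

The longer the interval between the two reads, the less the network and stopwatch jitter matters.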

...And then I came across this, in the Renesas SH7750 hardware manual, page 67:

So now I'm off to check whether result_high and result_low need to be flipped. -- Edit: Nope, doesn't look like I got it wrong. If I unmask the upper bits they're just 0, and checking the byte order in the packet for small numbers (like <100 secs) has zeros in the expected places. The elapsed time counters are just incrementing way faster than the clock speed, so the max lease time that can be supported is actually 117576 seconds, or about 1 day, 8 hours, and 40 minutes. I think that's fine, especially since the most I've ever seen in practice is one day, anyways.
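The max lease figure is just the counter width divided by the tick rate. A quick Python sketch of the arithmetic, assuming the 48-bit counter width and the measured rate of 2393976245 ticks per second mentioned elsewhere in this thread (the function names are made up for the example):

```python
COUNTER_BITS = 48                 # perf counter width
TICKS_PER_SECOND = 2_393_976_245  # measured rate, not a documented value

def max_lease_seconds(ticks_per_second: int = TICKS_PER_SECOND,
                      bits: int = COUNTER_BITS) -> int:
    """Whole seconds the counter can run before it wraps around."""
    return (1 << bits) // ticks_per_second

def split_duration(seconds: int):
    """Break a number of seconds into (days, hours, minutes, seconds)."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return days, hours, minutes, secs

limit = max_lease_seconds()
print(limit, split_duration(limit))
```

If the counters turn out to tick at the plain 200MHz CPU clock after all, passing 200_000_000 as ticks_per_second gives the corresponding (much longer) limit.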

I am still curious to know how the designers pulled that ~12x multiplier off, though. 2393976245 is just a weird number. I didn't see any obvious patterns in the bits comparing it to 200000000, either.

EDIT: Regarding remaining thing #1, well, wow, what a wild goose chase. The exception-test example program doesn't even work with stock dcload, so I'm just not going to bother with it. And I think #2 is caused by go.s clearing all the registers and resetting the stack. GCC probably doesn't like that!

12/3/2019: Removed CDI image. It's way old now. The other attachments are still applicable.

I'm loving this topic, please keep posting as you find more information, this is pretty great!

Your perf results -- is that the perf for your Windows program trying to communicate with the DC, or are you somehow running a perf test on the DC itself?

Perf test on the DC itself. I added the ability to query the DC's counters, disable them, restart them, and even change their modes (there are like 30) from over the network. That's what I meant by remote-controlled PMCR.

It just took me a while to figure out how the heck to send arbitrary data over the network since dc-tool doesn't let you do what I was trying to do. If you send a specially made command packet as per my above post, you can access the functions in commands.c in dcload-ip's source. Ncat/netcat and a hex editor to make the payload data was the answer I was looking for.

You can apparently also use this syntax:
ncat -4 -u [Dreamcast's IP] < [path to payload] > [path to some file to store the response in]
And you won't need to use Wireshark to read the Dreamcast's response packet, although it won't be timestamped (there's definitely a way to append timestamping via Windows command line, though I don't know it off the top of my head). Also worth noting: this only works on the Windows command line, not PowerShell. I don't know how to do < and > on PowerShell. On Linux I assume these might be | pipes, unless < and > do the same thing there. I know > is the same.
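If the redirection differences between shells get annoying, the same thing can be done with a few lines of Python, which behaves the same on Windows and Linux. This is only a sketch: the function name is made up, the port is left as a parameter (use whatever port your dcload-ip build listens on), and the 2048-byte receive size is an assumption rather than anything taken from the source.

```python
import socket

def send_payload(host: str, port: int, payload: bytes, timeout: float = 2.0) -> bytes:
    """Send one UDP datagram and return the first datagram that comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)           # don't hang forever if nothing answers
        sock.sendto(payload, (host, port))
        response, _addr = sock.recvfrom(2048)  # receive size is an assumption
        return response

# Example use (file names and address are placeholders):
#   with open("payload.bin", "rb") as f:
#       reply = send_payload("192.168.0.50", port, f.read())
#   with open("response.bin", "wb") as f:
#       f.write(reply)
```

A socket.timeout (TimeoutError on newer Python versions) here just means no response arrived within the timeout.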

In other news:
I also think I have a better lead on #2, and it's not go.s like I thought. I think GCC might be very unhappy with triple-nested if() statements... Working on this now. There are also MUCH better ways to implement the whole exception handler, and long story short it's do it like x86. There's no need to use a separate binary that has to rely on dcload C functions when the vector table can just be set up inside dcload's binary itself. That way loader.s only needs to copy one binary over, and dcload.c can then set up the exception region as it pleases making full use of all available internal functionality. This way GCC optimization won't destroy things. It would be particularly easy in this case to do it this way because the hardware is fixed and the addresses are always the same for everybody. Basically it's incorporating exception.s into dcload.bin instead of having it separate.

This is outstanding! I've actually been struggling to get some perf tools on the DC itself! I'd resorted to writing simple benchmarking tools that could be placed around code, but this sounds much more like what I'm looking for. So these perf tools are rolled directly into your copy of DCTool? I'm sorry if that's an ignorant question, I'm just trying to wrap my head around how you specifically generated that perf readout. Do you happen to go to that discord server posted above? I'd love to pick your brain in real time.

Excellent work so far, you've got me all excited lol

EDIT: Ahh, I see the perfctr.c/h files and changes to command.c in your source now, I'm setting everything up on my end to test this out. I'm unreasonably excited over this lol

Last edited by ThePerfectK on Tue Nov 26, 2019 7:22 pm, edited 1 time in total.

Well, oh boy do I have some interesting news.
I was trying to figure out what I could do to solve #2 today, and I discovered something very interesting.

Using the -fno-strict-aliasing flag SLOWS DOWN THE PERF COUNTERS BY 11-12x!!!!!!!!!!
Evidently GCC turns on strict-aliasing by default or something, which does this insane thing. So I'm using no-strict-aliasing from now on because not having it produces downright wrong behavior. Good, so I'm not going crazy. When I typecast I don't mess around, thank you very much. Now I gotta go and fix everything.

So these perf tools are rolled directly into your copy of DCTool? I'm sorry if that's an ignorant question, I'm just trying to wrap my head around how you specifically generated that perf readout. Do you happen to go to that discord server posted above?

1. Yes, but your terms are mixed up: dcload-ip (the part that runs on the Dreamcast), not dc-tool (the part that runs on the PC). I'm able to query dcload-ip without using dc-tool by using ncat/netcat. I don't know if you're familiar with x86 performance counters, but the DC's SH4 has essentially the same thing. I was initially looking for an invariant TSC, which the elapsed time mode of the perf counters basically is. And there are TWO of them.
You can find all the modes they support listed here: http://git.lpclinux.com/cgit/linux-2.6. ... l_sh7750.c
2. Nope! I do get e-mail notifications of PM's, though. If there were a Dreamcast community Slack or something I'd probably be on that. Slack actually saves conversations and has notifications, which are like the big downfalls of standard IRC.

Sorry for the shorthand confusion with dctool/dctool-ip. I'm getting this all set up on my end to test. I'm familiar with x86 perf counters, so I'm pretty excited to try this out.

Only bummer is that it would seem this would be impossible to use with dctool-serial. Or... hm...

EDIT: wait, I'm dumb lol, I could just cat instead of netcat. Oh man, I gotta try this out, I'm going to try rolling your perf header and command.c changes into the dctool-serial source and see if it works there, too!

Here, use the latest versions. I thought I'd have to change more, but it was just a couple comments.
Might as well attach the latest source, too.

Unfortunately, I went and changed everything to account for the GCC 12x speed up bug, and then... IT CAME BACK! I have no idea what the heck GCC is doing, but this basically makes using 4.7.3 impossible. So I'm going to have to upgrade to GCC 8 or 9, I think. There's an SH4-ELF-GCC 8.3.0 and 9.2.1 (Debian Sid only) available it seems. Really annoying. :/

--source removed, still available in first post, but there's newer source there also--

Last edited by Moopthehedgehog on Mon Dec 02, 2019 10:27 pm, edited 1 time in total.

Could you please post an example of a raw payload file with a few example commands? I'd like to make sure I'm generating a payload file correctly by comparing to one you used to talk to dcload, as I'm having a bit of trouble sending commands through netcat.

A perf counter number of "3" means apply to both 1 and 2, and is supported in all functions except "read," which only takes 1 or 2 as valid perf counter options.

Note: In making this post I found a couple typos, which I've just fixed in posting this. I also recommend just using a hex editor like HxD to edit that payload data. ncat only sends ASCII, which is why a payload file is needed in the first place.

Edit: Just a note, if you send a bad payload, you should get an error message. If you do it right, ncat will probably respond with "PMCR" followed by what looks like some garbage. That garbage is actually the data, which is stored in raw format. You'll need to open the response in a hex editor or look at the raw packet in Wireshark--I designed remote PMCR stuff so that if someone wants to make a program around it, they can. Doing a remote read, the command->data in the response is transmitted in little endian "host" format, in contrast to the rest of the packet which is in big endian "network" format. Ncat just displays everything in ASCII and doesn't have a "raw" mode...
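If you don't have a hex editor handy, a tiny Python hexdump does the job for looking at a saved response. This is a generic sketch, not anything from dcload-ip; the function name and the response.bin file name in the comment are placeholders.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex bytes, and printable ASCII, one row per line."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes show up as '.', which is the "garbage" ncat prints.
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {ascii_part}")
    return "\n".join(lines)

# Example use with a response saved by ncat (placeholder file name):
#   with open("response.bin", "rb") as f:
#       print(hexdump(f.read()))
print(hexdump(b"PMCR\x00\x01"))
```

The "PMCR" at the front stays readable in the ASCII column while the raw data bytes after it are shown in hex.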

Last edited by Moopthehedgehog on Mon Dec 02, 2019 10:58 pm, edited 2 times in total.

Just a quick update:
I've managed to build GCC-9.2.0 SH4 and it works!
Let me tell you, it's soooo much nicer than GCC 4.7.3, since you can use modern compiler options and directives, plus the warning display is much better. Attaching a rather minimal and hastily written text file of the command line options I used to build it. It should be enough for anyone who's compiled GCC before to do it. Note: my paths are hardcoded to /mnt/c/DreamcastKOS, as I'm using WSLv2 and my dev folder is C:\DreamcastKOS.

I was able to fix some more bugs, this time in the linker script, and between that, adding some flags to the makefile, and changing some global variables, I got the lease time countdown to display across resets! The exception handler actually works, despite example-src's exception-test not working, as that's how I was able to track down alignment problems in the linkerscript. So that's cool. Turns out it wasn't a case of "triple-nested if()s" like I mused about earlier--certain variables just need to be in .data since .bss gets wiped by dcload-crt0.s and the registers and stack get totally reset by go.s (and the stack is reset on top of that by dcload-crt0.s).

The only problem left to tackle, since I'm not really intending on overhauling the exception handler binary as part of dcload.bin, is the darn perfcounters still going too quickly. I really don't know what is happening here, or why I once saw it go at what I think should be the "correct" speed. I've not been able to replicate that. I even overhauled the read_pmctr function to use assembly to read the perf counters directly because I caught GCC doing some nasty things (an unconditional branch smack in between reads of the two halves of a uint64?????? REALLLLLYY??????), so I'm wondering if there's some undocumented configuration thing. Hopefully it's not a case of the "it only works right in big endian mode" or something stupid like that.

EDIT: removed source, it's still available in the first post, but there's newer source available now.

I've recently been using gcc 7.1.3 and it's excellent.

Is C++ working with yours? I wasn't able to get that working sadly, so C only at the moment.

They've actually upgraded the warning system in GCC 9. I wasn't expecting it since I've been using GCC 7 & 8 for other things, but it's a really nice change. A cool unrelated thing about GCC is that on x86-64 it can even regularly beat Intel's C compiler, although it takes longer to compile GCC from source than it does the Linux kernel.

No arguing there. Although because I explicitly compiled with enable-languages=c, I don't have g++ to test, let alone any Dreamcast-related projects that make use of C++ to any real capacity. I don't see offhand why it wouldn't work, but in this situation I can't really test it...

In other news, I came across something: the PMCR_CLKF definition (0x100), which I've never seen used, is decimal value 256. I don't know what it's actually for, but I found some interesting relations:
256/11.96988 = 21.38.., which is extremely close to the number of seconds it's [supposed to] take for the high longword of the performance counters to increment by 1 (in other words, the number of sets of 200,000,000 needed to overflow the low longword). The actual number is 21.47..., and 256/21.47 = 11.920928, which is an interesting number. It's about 0.4% off from 11.96988, it's apparently a clock frequency for an RTC in ARM according to Google, and also equals 100,000,000 / (8 * 1024^2). That's 200,000,000/2^24 = 11.920928..., where 2^24 is 16,777,216, a very common number and also the square root of 2^48 (perf counter is 48-bit).

There's also 2^28/200,000,000 = 1.34217728, where 2^28 is the smallest power of 2 over 200MHz, and it's 256*1024^2. ((1.34217728)^-1)*16 = 11.920928.
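For anyone who wants to re-run these numbers, here's a short Python check of the relations above. It's just the arithmetic from this post restated; the variable names are mine, and 11.96988 is the measured multiplier from earlier in the thread.

```python
import math

CPU_HZ = 200_000_000
MEASURED_MULTIPLIER = 11.96988  # from the counter measurements earlier in the thread

# Seconds for the low longword to overflow at 200MHz (2^32 / 200,000,000).
secs_per_high_increment = 2**32 / CPU_HZ

from_clkf = 256 / secs_per_high_increment      # 256 / 21.47...
from_2_24 = CPU_HZ / 2**24                     # 200,000,000 / 2^24
from_100mhz = 100_000_000 / (8 * 1024**2)      # 100,000,000 / (8 * 1024^2)
from_2_28 = 16 / (2**28 / CPU_HZ)              # (1.34217728^-1) * 16

# How far the measured multiplier is from these, in percent.
percent_off = (MEASURED_MULTIPLIER - from_2_24) / MEASURED_MULTIPLIER * 100

print(secs_per_high_increment, from_clkf, from_2_24, from_100mhz, from_2_28)
print(round(percent_off, 3))
```

All four expressions are the same quantity written different ways (200,000,000/2^24), which is why they keep landing on the same 11.920928... value.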

No idea currently if any of this is relevant or a red herring, but that's a lot of overlap, or rather, going around in circles!