This is something I've been banging my head against for a long while. My idea was to implement a poor man's profiler -- if you aren't familiar with the concept, it's a sampling profiler that works by having the user manually raise breakpoints at random times, then using those breakpoints as samples to see where the program is bottlenecking. The idea is that, with random sampling, you will naturally stop execution on the parts of your program where the most time is being spent.

Unfortunately, I can't find any way to manually raise a breakpoint. Moop has pointed me to a method of raising a breakpoint via the SH4 with his great work, but I haven't set aside any time to look at it. What I can say, however, is that you can roughly simulate a poor man's profiler without raising a breakpoint. The way I do it is to define a pair of macros that can be placed at the top and bottom of various functions you write. The entry macro takes the current real time and checks it against a time threshold you define in memory. If enough time has passed since the last sample, the two macros work in concert to record how long the function takes. When the function ends, the exit macro records the time and also prints the call stack so you can do a backtrace for more information. This lets you build a simulated model of the execution time and path of your program. On a PC, you can then format this information into something resembling a standard profiler's output, e.g. building a flame graph with SDL and ImGui.

Just a thought. If you come up with a better solution, please share in as much detail as possible, as I'm all ears!

I wrote a sampling profiler years ago. It uses a second thread to periodically record the PC of the main thread to an array. When you exit, the samples are saved to a text file on the PC, and a script passes the addresses to addr2line to determine which functions are being run and how much time is being spent in them. I'll upload it later.

This sounds pretty useful. Would there be any way to modify your profiler to also trace the stack, so it'd be possible to know what called the current function?

I still have my profiling code, but it needs a bit of cleaning up. Judging from the time stamps on the files, it looks like the oldest parts of it are at least 8 years old, and some of the more recent code is about a year or two old. My coding style has changed over time, so the source looks really inconsistent at the moment. There are also some other issues that need fixing (e.g., hard-coded file names and timing values), and the only documentation for it is in my head. I'd like to include a few examples of how to use it. I think I should have it ready this weekend.

Sorry, but the profiler isn't ready yet. When I was making an example of how to add the profiler to an existing program, I noticed that the profiler was generating incorrect results. After some tweaking, it seems to be working correctly now, but it took longer than I expected. I should have it ready soon.

Take your time and no rush. You're doing everyone a favor and it's not like this is paid work. A big thanks for your effort, but don't feel like you are obligated or owe anybody anything. Real life gets in the way often, and I completely understand.

My initial tests seem to show that it's working, but please let me know how you get on with it if you try it. When run under dcload, this code will generate a gmon.out file which is compatible with gprof.

The code is released under the MIT license, feel free to submit patches/improvements!

Can't like this enough! I'd still like to see TapamN's solution if he is still working on his, but this is extremely exciting. With a proper gmon.out that means one can use all sorts of profiling visualization tools. Dude I'm so excited now!! Thanks so much!!

No announcement, but I too wrote something. It requires a recompile to profile, it's very different from kazade's, and it might give a different set of data to look at.
If there's interest, I'll toss up a mini example.

More implementations can only ever be a good thing. If nothing else, it could help verify results. Please do post your example if you don't mind!

Sorry for the LOOONG delay. There were things distracting me from this. (Mainly that my 71-year-old mom fell, fractured her hip, and needed a partial hip replacement.) I'm not surprised that someone else made a profiler in the meantime, since it's actually pretty simple.

The ZIP contains my profiling and watchdog routines. There are four main components, each namespaced with a prefix:

The headers have ok documentation in them of what the functions do and how to use them. There are also two examples.

One example (main.c) shows how to profile a specific function to find out its performance characteristics by wrapping the call with profiling code. You can build it with just "make", which creates "proftest.elf".

The second example (example.c) shows how to add profiling to an existing game or program. This can be built with "make -f Makefile.game", resulting in "game.elf". It's not a real game, but the code is structured the way a game might be, with a main loop and some do-nothing fake collision, logic, and rendering functions that waste time for the profiler to measure. The code added to the "game" to support the watchdog and profiler has been marked with comments.

It's worth noting that the examples dump the profiling results directly into /pc, the root directory of the PC.

I first wrote the pf* set of timing functions, based off of KOS's timer, which I used to cycle count loops. Later, I wrote a watchdog thread (wd* functions), to abort to dc-load if the main thread hangs up, so I could avoid having to power cycle the console. It usually doesn't recover from hangs caused by memory corruption, but can save you from accidental endless loops. Then I wrote the sampling profiler (pfSamp*) to profile the Bullet physics library. Rather than creating a second thread to gather samples, it's worked into the watchdog thread. A few years ago, I discovered the SH4 performance counter by accidentally running across Linux's driver for it when searching for something related to debugging on the SH4, and wrote my own version (pfc*).

Here's what most people are probably interested in, the sampling profiler:

To add the sampling profiler to existing code, you first need to start the watchdog thread (wdInit();), since that's what gathers the samples. Then you'll need to either add proper watchdog support (which takes a bit of work; you need to properly pet the dog each frame, and pause it before long operations like reading megabytes from disc) or just pause the watchdog immediately after you start it (wdPause();). The sampling profiler can still collect samples even when the watchdog is paused; pausing the watchdog just stops the watchdog from updating, it doesn't actually stop the thread. After the watchdog is set up, you can initialize the sampling profiler (pfSampInit(size);). The size parameter is the number of samples to allocate space for.

More samples allow the profiler to run for longer, and longer runs mean more detailed results and less noise; however, a bigger buffer also takes up more RAM. By default, one sample is collected about every 10 milliseconds (the PROFILER_SAMPLE_RATE_MS constant in watchdog.c), so a 5000-sample buffer lasts for 50 seconds. Each sample is a 3-byte pointer to what the main thread was executing at that moment. Since RAM is 16 megabytes, 24 bits/3 bytes are sufficient to store any position in RAM. Not storing the top 8 bits reduces the size of the sample buffer by a quarter. I wanted the sample buffer to be as small as possible, since RAM spent on the profiler is RAM that can't be used anywhere else.

At this point the profiler is ready to go. Call pfSampGo() to start collecting samples. If you want to NOT collect samples for a certain section of code (like a load screen), you can call pfSampStop() before it and resume afterwards with another call to pfSampGo(). When you're ready to write the results, make sure the profiler is stopped, then call pfSampSave(fname, append). The fname parameter is the filename the samples will be saved to, something like "/pc/blahblahblah/prof_samples.txt". The append parameter controls whether the new samples replace any existing ones (false) or are appended to them (true). Samples are written to a text file, with each line containing the program counter value at that moment. It's a text file so that a shell script can process the results without any specialized programs.

If the sample buffer runs out of space, the profiler will stop automatically and print a message to the console. (If you want to disable this message, you'll have to edit the code.) You can check whether the profiler has stopped with pfSampIsRunning(). It will not automatically write the results, since that could take a while with a large buffer over a slow DC-PC connection. If you want to do very long profiling runs with a small buffer, start the profiler, then poll once each frame to see if the profiler has stopped (pfSampIsRunning()). If it has stopped, save the results to the PC (in replace mode the first time, and append mode afterwards), then reset the profiler (pfSampReset()) and start it again with pfSampGo().

To view the results of the sampling profiler, use the prof.sh script. You call it like this: "./prof.sh program.elf prof_samples.txt | less". Obviously, the first parameter is the program the samples belong to, and the second parameter is the samples themselves. The results will look something like this:

The first line is the number of samples collected. Each line afterwards is the number of samples per function called. You can get the percent time used by dividing the function samples by the total samples. I didn't do this in the script since I'm not good at writing scripts (this is the most advanced data processing script I've ever written...), and this works well enough for me.

This only measures the time spent in a specific function, not the time spent in any called functions. For example, "render" shows up as 1 sample, but actually takes a lot of time, since it calls "render_model" (967 samples) and "frustum_check" (177 samples). In the "game" example, major operations are timed with the pf* timing functions to get an idea of how long an entire subsystem takes.

Sometimes the script does weird things, like in the example, there's "985 genwait_wait at genwait.c" and "1 genwait_wait at genwait.c:105" lines when it should probably be "986 genwait_wait at genwait.c". I haven't bothered to debug this. It's probably sh-elf-addr2line outputting something weird.

You'll probably see "genwait_wait" taking a noticeable amount of time. This means that time is being spent waiting on a semaphore or something similar. Usually this is waiting for vblank in pvr_wait_ready, so you can more or less treat it as an "idle time" value.

I mentioned before that my profiler was generating bad results. I fixed this by giving the watchdog thread a lower nice value (i.e., higher priority); it was originally running with the same priority as the main thread. It seems to generate correct-looking results now, but the episode reduced my confidence in it. Looking at the code for KOS's scheduler, I think the algorithm is just "run the highest priority level that isn't waiting; if that level has multiple threads, run them round robin". I think it still should have worked, and this change shouldn't have been needed... so... *shrug* I'm less confident in this.

I remember seeing that someone else wrote a library for the SH4 performance counter. They seemed to be getting odd results, with it looking like the results were being multiplied by the peripheral clock or something (which does not make sense). The library I made does not seem to do this.

ThePerfectK: It would definitely be possible to have each sample be a stack trace, but then it wouldn't be easy to process the results with just a script; you'd need a real program to interpret them. It would also require a larger sample buffer.

I remember seeing that someone else wrote a library for the SH4 performance counter. They seemed to be getting odd results, with it looking like the results were being multiplied by the peripheral clock or something (which does not make sense). The library I made does not seem to do this.

Oh, this was me.

I did find out what was happening (it took a lot of digging and reverse engineering to figure out what was going on there!), and I solved all of it and documented it extensively in the performance counter module in DreamHAL. I'll just say the answer was weird and kind of obscure, and leave it at that so as to not derail this topic.