28:54 handmade_render_group.cpp: Initialise some __m128 registers and use some SIMD intrinsics to operate on them 4-wide
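
As a rough illustration of what operating 4-wide means (not the code written on stream), the sketch below fills two `__m128` registers and then multiplies and adds all four lanes with one intrinsic each. `MulAdd4` and `LaneOf` are hypothetical names introduced here; the intrinsics are the standard SSE ones from `<xmmintrin.h>`, so this assumes an x86/x64 target.

```cpp
#include <xmmintrin.h> // SSE intrinsics (x86/x64 only)

// Hypothetical helper: returns lane Index (0..3) of a 4-wide register.
inline float LaneOf(__m128 Value, int Index)
{
    float Lanes[4];
    _mm_storeu_ps(Lanes, Value); // unaligned store, lowest lane first
    return Lanes[Index];
}

// Hypothetical example operation: Result[i] = A[i]*B[i] + A[i] for all four lanes.
inline __m128 MulAdd4(__m128 A, __m128 B)
{
    return _mm_add_ps(_mm_mul_ps(A, B), A);
}
```

For example, `MulAdd4(_mm_set1_ps(2.0f), _mm_set1_ps(3.0f))` leaves 8.0f in every lane.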

31:07 Debugger: Go to disassembly and look at the SIMD registers

36:02 handmade_render_group.cpp: Set four different values in the registers

36:36 Debugger: See those different values in the registers and note the order in which they are loaded
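
One likely source of surprise about ordering, assuming the values are set with `_mm_set_ps`, is that this intrinsic takes its arguments highest lane first, while `_mm_setr_ps` takes them lowest lane first. The sketch below makes the difference visible; `StoreLanes`, `MakeSet` and `MakeSetR` are hypothetical names used only for illustration.

```cpp
#include <xmmintrin.h> // SSE intrinsics (x86/x64 only)

// Hypothetical helper: writes the four lanes of Value to Out, lowest lane first.
inline void StoreLanes(__m128 Value, float Out[4])
{
    _mm_storeu_ps(Out, Value);
}

// _mm_set_ps: the LAST argument lands in lane 0.
inline __m128 MakeSet()
{
    return _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f); // lanes 0..3 = 4, 3, 2, 1
}

// _mm_setr_ps ("reversed"): the FIRST argument lands in lane 0.
inline __m128 MakeSetR()
{
    return _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f); // lanes 0..3 = 1, 2, 3, 4
}
```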

39:47 handmade_render_group.cpp: Turn Square functions into multiplies
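
A minimal sketch of the idea, assuming `Square` is a plain scalar helper: once the call is written out as an explicit multiply, it maps directly onto a single 4-wide `_mm_mul_ps`. `Square4` is a hypothetical name, not one taken from the source.

```cpp
#include <xmmintrin.h> // SSE intrinsics (x86/x64 only)

// Scalar form: one value at a time.
inline float Square(float A)
{
    return A * A;
}

// 4-wide form (hypothetical name): the same multiply applied to four lanes at once.
inline __m128 Square4(__m128 A)
{
    return _mm_mul_ps(A, A);
}
```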

41:04 Fix the loop to work on pixels in batches of 4

46:20 Run the game and note that we are overwriting our boundary

47:15 handmade_render_group.cpp: Temporarily clip the buffers
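
The boundary overwrite follows from stepping in batches of 4: the last batch can extend up to three pixels past the intended end. The sketch below shows one way a temporary clip could avoid that, by shrinking the right edge so a whole batch always fits. `FillRowBatched` and its parameters are hypothetical, and the exact clip values used on stream may differ.

```cpp
#include <cstdint>

// Hypothetical fill routine: writes Color into Row over [XMin, XMax) in batches of 4.
// Returns the number of pixels written.
inline int FillRowBatched(uint32_t *Row, int Width, int XMin, int XMax, uint32_t Color)
{
    // Temporary clip (an assumption for this sketch): keep X + 3 inside the row,
    // so a 4-wide batch can never write past Width - 1.
    if (XMin < 0) { XMin = 0; }
    if (XMax > Width - 3) { XMax = Width - 3; }

    int Written = 0;
    for (int X = XMin; X < XMax; X += 4)
    {
        // One batch: four adjacent pixels.
        for (int I = 0; I < 4; ++I)
        {
            Row[X + I] = Color;
            ++Written;
        }
    }
    return Written;
}
```

The cost of this clip is that some pixels near the right edge are left unfilled, which is why it only makes sense as a temporary measure.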

48:19 Separate the memory loading stuff from the computations

50:58 Declare the arrays before the loop

57:20 Debugger: Run and investigate the error

58:18 handmade_render_group.cpp: Correctly test ShouldFill[I]
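
The restructuring from 48:19 through 58:18 can be sketched as three passes over a batch: load everything into per-lane arrays, compute on the arrays, then write back only the lanes whose `ShouldFill[I]` flag is set. Apart from `ShouldFill`, every name below is hypothetical, and the halving step is only a stand-in for the real per-pixel maths.

```cpp
#include <cstdint>

// Hypothetical batch routine operating on four adjacent pixels.
inline void ProcessBatch(uint32_t *Pixel, const bool ShouldFill[4])
{
    // Per-lane array declared up front, so each pass below sees all four lanes.
    float Value[4];

    // Pass 1: memory loads only.
    for (int I = 0; I < 4; ++I)
    {
        Value[I] = (float)(Pixel[I] & 0xFF);
    }

    // Pass 2: computation only (a stand-in operation: halve each value).
    for (int I = 0; I < 4; ++I)
    {
        Value[I] = Value[I] * 0.5f;
    }

    // Pass 3: stores, testing the flag for THIS lane rather than one shared flag.
    for (int I = 0; I < 4; ++I)
    {
        if (ShouldFill[I])
        {
            Pixel[I] = (uint32_t)(Value[I] + 0.5f);
        }
    }
}
```

Keeping the computation pass free of loads, stores and branches is what makes it straightforward to replace later with 4-wide intrinsics.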

58:50 Run and note that we're (almost) back to where we started

59:42 handmade_render_group.cpp: Walk through the routine

1:01:53 Load in the Pixels from the right place

1:02:33 Run, note that we're back to some semblance of good, and glimpse into the future

1:04:06 Q&A

1:04:30 thesizik: Would it be faster to unpack pixels using a union of an int32 with a struct of 4 int8's, instead of doing 4 shifts and masks per pixel?
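
For reference, the two approaches being compared look roughly like the sketch below; it makes no claim about which is faster. The 0xAARRGGBB channel layout and all names are assumptions. The byte view uses `memcpy` rather than a union because reading an inactive union member is undefined behaviour in standard C++, although many compilers accept it. The byte view also depends on the machine's endianness, whereas the shifts do not.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical unpacked form of one pixel.
struct unpacked_pixel
{
    uint8_t A, R, G, B;
};

// Shift-and-mask unpack, assuming a 0xAARRGGBB layout.
inline unpacked_pixel UnpackShifts(uint32_t Pixel)
{
    unpacked_pixel Result;
    Result.A = (uint8_t)((Pixel >> 24) & 0xFF);
    Result.R = (uint8_t)((Pixel >> 16) & 0xFF);
    Result.G = (uint8_t)((Pixel >> 8) & 0xFF);
    Result.B = (uint8_t)((Pixel >> 0) & 0xFF);
    return Result;
}

// Byte-view unpack: copies the four bytes in memory order.
// On a little-endian machine Bytes[0] is the LOWEST byte of Pixel.
inline void UnpackBytes(uint32_t Pixel, uint8_t Bytes[4])
{
    std::memcpy(Bytes, &Pixel, 4);
}
```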

1:05:15 houb_: Why don't we go: Y < 2 and X < 2 and go through in blocks, instead of a line?

1:07:44 culver_fly: Is it better if we calculate if the pixel should be filled and queue it up and only do the calculations once we hit 4 of them?

1:10:45 hmh_bot: Casey was using a Das Keyboard 4, but it broke, so he is currently using an unknown keyboard he had lying around

1:11:30 hguleryuz: Sorry, maybe this is off-topic: Would it be correct to say anyone coding in Java, by default, is not making use of any of the SIMD stuff, or do you think the JIT compiler is smart enough to make use of it in certain circumstances, maybe with some analysis of the bytecode?

1:12:29 guit4rfreak: How often do you optimize for cache misses vs optimizing with SIMD? I got the impression that cache misses are by far the most important things to look out for

1:14:40 culver_fly: Please send my best regards to Jeff

1:14:52 sharlock93: Schedule-wise, how many more weeks until you are done with optimization of the renderer?

1:15:01 ray_caster: Will you be covering Morton order texture swizzling?
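
For readers unfamiliar with the term, Morton (Z-order) indexing interleaves the bits of X and Y so that texels close together in 2D tend to be close together in memory. A minimal sketch of the index calculation for 16-bit coordinates follows; `Part1By1` and `MortonIndex` are hypothetical names, and nothing here is taken from the stream.

```cpp
#include <cstdint>

// Spreads the low 16 bits of V apart so a zero bit sits between each original bit.
inline uint32_t Part1By1(uint32_t V)
{
    V &= 0x0000FFFF;
    V = (V | (V << 8)) & 0x00FF00FF;
    V = (V | (V << 4)) & 0x0F0F0F0F;
    V = (V | (V << 2)) & 0x33333333;
    V = (V | (V << 1)) & 0x55555555;
    return V;
}

// Interleaves X into the even bits and Y into the odd bits.
inline uint32_t MortonIndex(uint32_t X, uint32_t Y)
{
    return Part1By1(X) | (Part1By1(Y) << 1);
}
```

With this mapping the four texels of the 2x2 block at the origin get indices 0, 1, 2 and 3, so they sit next to each other in memory.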

1:16:54 dr_fubar: Possibly a noob Q: Have you ever run into problems with floating point arithmetic, and what are some good approaches to avoiding those problems?
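
One commonly cited example of such a problem, offered here as general background rather than as the answer given on stream, is that values like 0.1 are not exactly representable, so accumulated sums should be compared with a tolerance instead of `==`. `NearlyEqual` and `SumTenths` are hypothetical names, and a suitable tolerance depends on the magnitudes involved.

```cpp
#include <cmath>

// Hypothetical helper: true if A and B differ by no more than Tolerance.
inline bool NearlyEqual(float A, float B, float Tolerance)
{
    return std::fabs(A - B) <= Tolerance;
}

// Adds 0.1f ten times; the result is close to, but not guaranteed to equal, 1.0f.
inline float SumTenths()
{
    float Sum = 0.0f;
    for (int I = 0; I < 10; ++I)
    {
        Sum += 0.1f;
    }
    return Sum;
}
```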

1:24:06 houb_: Is there a way to track how memory gets stored to cache?

1:28:01 hguleryuz: Off-topic: Do you know if JAI will have extensions / a method for using SIMD?

1:28:50 xaitra: How much do you need to think about the intrinsic instructions while programming, or does the compiler usually take care of that? Is this the big difference between using GNU and Intel compiler, for example?

1:33:54 rooctag: Do you have to take the instruction cache into account? Or is it large enough?

1:34:39 goodoldmalk: How do intrinsics and parallel processing work together? Does each CPU have registers to do intrinsics? If so, could we increase X-fold the number of pixels rendered in our code if we computed in parallel?
