
hidden costs, stolen CPU cycles – latency engineering

Latency /engineering/ and optimization is all about implicit operations, hidden costs and “stolen” CPU cycles. The goal: incur the minimum CPU cost for a given task.

eg: boxing/unboxing — extra work for the CPU. Also creates garbage for the GC.

eg: function A calling B, which calls C — one more stack frame than A calling C directly.

eg: one thread switching between multiple sockets — more CPU workload than one thread dedicated to one (exchange) socket.

eg: un-contended lock acquisition — more work than no lock at all.

eg: garbage collection — competes for CPU.

eg: page swap as part of the virtual memory system — competes for CPU.

eg: vtbl lookup — adds a few clock cycles per function call. To be avoided inside the most critical apps in an exchange. Therefore C++ developers favor templates over virtuals.

eg: RTTI — latency-sensitive apps generally disable RTTI early on, at compilation.

eg: null terminator at the end of C-strings — adds network traffic by x%.

eg: inline — trims CPU cycles.

eg: one kernel thread mapping to multiple user threads — the fastest system should create no more user threads than the maximum CPU (or kernel) threads, so the thread scheduler doesn’t need to run at all. I feel this is possible only on a dedicated machine, but such a machine must then communicate with peripheral machines, involving serialization and network latency.

For a dedicated kernel thread servicing a busy stream of tasks, we need to consider what happens if the tasks come in bursts, leaving the thread temporarily idle. One solution is to suspend the thread in wait(), but a more radical approach is to simply let the thread busy-wait in a loop. The assumption is that one CPU core is exclusively dedicated to this thread, so that core can’t do anything else even if we suspend the thread.