http://cbcg.net/ (Octopress feed, updated 2013-07-10T11:41:06-04:00)

2012-09-01T11:10:00-04:00 http://cbcg.net/blog/2012/09/01/slides-from-database-talk-at-drexel

I gave a lecture at Drexel this week on non-relational databases and “big data”. The slides are up. They are all new since last time; the world of NoSQL and Big Data has changed a whole lot in 2 years :)

2012-04-13T00:00:00-04:00 http://cbcg.net/blog/2012/04/13/tuning-jvm-for-a-vm-lessons-learned-directly

The number of Java workloads running on virtualized infrastructure has been increasing exponentially over the last few years. Advancements in processors and hypervisor technology now make virtualizing Java a compelling proposition. However, there are still best practice provisos and considerations, particularly in the area of JVM memory management.

This talk will present innovations, practical insights, and lessons learned over the last year by a senior engineer from VMware who recently developed a Java “ballooning” solution called Elastic Memory for Java (EM4J).

I really enjoy reverse engineering stuff. I also really like playing video
games. Sometimes, I get bored and start wondering how the video game I’m
playing works internally. Last year, this led me to analyze Tales of Symphonia
2, a Wii RPG. This game uses a custom virtual machine with some really
interesting features (including cooperative multithreading) in order to
describe cutscenes, maps, etc. I started to be very interested in how this
virtual machine worked, and wrote a (mostly) complete implementation of this
virtual machine in C++.

However, I recently discovered that some other games are also using this same
virtual machine for their own scripts. I was quite intrigued by that fact and
started analyzing scripts for these games and trying to find all the
improvements between versions of the virtual machine. Three days ago, I started
working on Tales of Vesperia (PS3) scripts, which seem to be compiled in the
same format as I analyzed before. Unfortunately, every single file in the
scripts directory seemed to be compressed using an unknown compression format,
using the magic number “TLZC”.

Continuing the Chrome extension hacking (see parts 1 and 2), this time I’d like to draw your attention to the oh-so-popular AdBlock extension. It has over a million users, is being actively maintained and is a great piece of software (heck, even I use it!). However, due to how Chrome extensions work in general, it is still relatively easy to bypass it and display some ads. Let me describe two distinct vulnerabilities I’ve discovered. They are both exploitable in the newest version, 2.5.22.

In Maryland, job seekers applying to the state’s Department of Corrections have been asked during interviews to log into their accounts and let an interviewer watch while the potential employee clicks through wall posts, friends, photos and anything else that might be found behind the privacy wall.

Here’s an experiment anyone can do: Go get your Apple IR
remote. The LED emits at 980nm, or about 306THz, in the
near-IR spectrum. Relatively speaking, this is just outside
of the visible range. Take the remote into the basement, or
the darkest room in your house, in the middle of the night,
with the lights off. Let your eyes adjust to the
blackness.

Above: Apple IR remote photographed using a digital
camera. Though the emitter is quite bright and the
frequency emitted is not far past the red portion of
the visible spectrum, it’s completely invisible to the
eye.

Can you see the LED flash when you press a button
[4]? No? Not even the tiniest amount?
Try a few other IR remotes; most use an IR wavelength even
closer to the visible band, around 310-320THz. You won’t be
able to see them either, even though they would be
blindingly, painfully bright if they were in the visible
spectrum.
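
The wavelength and frequency figures quoted above are related by f = c / λ. Here is a small sketch of that conversion; the class and method names are made up for illustration, and the only physical input is the speed of light in vacuum.

```java
// Illustrative helper (names are assumptions, not from the original article):
// converts between wavelength in nanometres and frequency in terahertz
// using f = c / lambda.
final class IrFrequency {
    // Speed of light in vacuum, metres per second.
    static final double SPEED_OF_LIGHT_M_PER_S = 299_792_458.0;

    private IrFrequency() {}

    // Wavelength in nm -> frequency in THz.
    static double wavelengthNmToThz(double wavelengthNm) {
        double wavelengthMetres = wavelengthNm * 1e-9;
        double frequencyHz = SPEED_OF_LIGHT_M_PER_S / wavelengthMetres;
        return frequencyHz / 1e12;
    }

    // Frequency in THz -> wavelength in nm.
    static double thzToWavelengthNm(double frequencyThz) {
        double frequencyHz = frequencyThz * 1e12;
        double wavelengthMetres = SPEED_OF_LIGHT_M_PER_S / frequencyHz;
        return wavelengthMetres * 1e9;
    }

    public static void main(String[] args) {
        System.out.printf("980 nm  = %.1f THz%n", wavelengthNmToThz(980.0));
        System.out.printf("310 THz = %.0f nm%n", thzToWavelengthNm(310.0));
        System.out.printf("320 THz = %.0f nm%n", thzToWavelengthNm(320.0));
    }
}
```

Run this way, 980 nm comes out at roughly 306 THz, and the 310-320 THz range corresponds to wavelengths of roughly 937-967 nm.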

Above top: Frequency of an Apple IR remote emitter relative to the full visible spectrum.

These near-IR LEDs emit at about 20% beyond the visible
frequency limit. 192kHz audio extends to 400% of the
audible limit. Lest I be accused of comparing apples and
oranges, auditory and visual perception drop off similarly
toward the edges.

01192012 - The graphical models tab has links to video lectures on
tutorials on the subject (this is mainly for students who didn’t
get to attend the class by Mike Jordan and Martin Wainwright).

01182012 - The systems slides are available now (follow the systems link)

01182012 - Updated project guidelines

Overview

Scalable Machine Learning occurs when Statistics, Systems, Machine
Learning and Data Mining are combined into flexible, often
nonparametric, and scalable techniques for analyzing large amounts of
data at internet scale. This class aims to teach methods which
are going to power the next generation of internet applications.

NoSQL databases are often compared by various non-functional criteria, such as scalability, performance, and consistency. This aspect of NoSQL is well studied in both practice and theory, because specific non-functional properties are often the main justification for using NoSQL, and fundamental results on distributed systems such as the CAP theorem apply well to NoSQL systems. At the same time, NoSQL data modeling is not so well studied and lacks the kind of systematic theory found in relational databases. In this article I provide a short comparison of NoSQL system families from the data modeling point of view and digest several common modeling techniques.

To explore data modeling techniques, we have to start with a more or less systematic view of NoSQL data models that preferably reveals trends and interconnections. The following figure depicts an imaginary “evolution” of the major NoSQL system families, namely, Key-Value stores, BigTable-style databases, Document databases, Full Text Search Engines, and Graph databases:

http://cbcg.net/blog/2012/02/17/nodejs-is-bad-ass-rock-star-tech/

2012-02-14T00:00:00-05:00 http://cbcg.net/blog/2012/02/14/using-a-real-logger-with-netty-will-cause-bos

I came across a really interesting bug that was plaguing me for about a day and a half and I finally tracked it down today, with JIT assistance from the superhuman @brianm.

I have a fairly simple app that uses Netty for its underlying event handling. Just a Web service, no great shakes. But there was this weird bug that I found during load testing whereby it would just stop all network activity after a few minutes for no discernible reason. No more connections could be established, existing connections did not finish, etc. At first, I thought it was maybe a bug in Tsung, my load testing tool, but that wasn’t the case.

I upped the logging and found this strange error that was happening right around the time of the network shutdown:

I googled around for this error and found only a few references to it, mostly having to do with a partially-uploaded JAR. My app is packaged in a fat JAR, so I found this unlikely, but I reuploaded it and verified the upload with the MD5 hash and still the error occurred and the network shut down.

I was pretty much out of other options at this point, since there were no other anomalies and YourKit was unable to find any deadlocks or such. I took a stab and sent the gist above to @brianm to see if he had seen it before. He immediately noted that the call to java.security.AccessController.doPrivileged was bad news. He then suggested I try the “bazooka-squirrel approach”: if you can’t hit the squirrel with your .22 because the tree’s a mess, take out the whole tree with a bazooka ;-) In this case, the “tree” was the logging attempt in the backtrace.

However, I wasn’t using java.util.logging in the app. I was using logback. I checked out the source code of NioServerSocketPipelineSink and found that it was Netty using java.util.logging. Oh, really? At this point, java.util.logging was not configured. I configured it with a simple ConsoleHandler and all of a sudden *BLAM*… problem solved.

Netty uses java.util.logging which, when left unconfigured, reached out into the fat JAR for a logging configuration that wasn’t there. When it did so, it caused an InternalError due to some kind of JVM security issue, which blew up the server socket. Netty catches that and shuts down the boss thread. *sigh*

tl;dr You must configure java.util.logging to go somewhere or if there’s an exception in your boss thread it will decapitate Netty rather effectively.
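
For the record, here is a minimal sketch of what “configure java.util.logging to go somewhere” can look like when done programmatically at startup. It is not the code from my app: the class name `JulBootstrap` and the logger name `org.jboss.netty` are made up for illustration, and it assumes you are happy to send everything to the console. The java.util.logging calls themselves are standard JDK API.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

// Hypothetical bootstrap helper: call it once, before the server starts,
// so java.util.logging has an explicit destination.
final class JulBootstrap {
    private JulBootstrap() {}

    // Replaces the root logger's handlers with a single console handler.
    static Logger configureConsoleLogging(Level level) {
        // Remove whatever handlers are currently installed.
        LogManager.getLogManager().reset();

        // The root logger is the one with the empty name.
        Logger root = Logger.getLogger("");

        ConsoleHandler handler = new ConsoleHandler(); // writes to System.err
        handler.setLevel(level);
        handler.setFormatter(new SimpleFormatter());

        root.addHandler(handler);
        root.setLevel(level);
        return root;
    }

    public static void main(String[] args) {
        configureConsoleLogging(Level.INFO);
        // Logger name chosen for illustration only.
        Logger.getLogger("org.jboss.netty").info("java.util.logging is configured");
    }
}
```

Doing it in code rather than through a properties file means nothing has to be looked up inside the JAR at logging time, which is the part that went wrong here.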

http://cbcg.net/blog/2012/02/09/tackling-the-folklore-surrounding-high-perfor/

2012-02-03T00:00:00-05:00 http://cbcg.net/blog/2012/02/03/if-youre-using-nodejs-youre-doing-life-wrong

This morning, on a conference mailing list, I made some disparaging remarks about Node.js (the title of this post, in fact). A couple people asked me why I felt that way. Rather than respond individually, I’ll just list my reasons here:

V8 is not server-class

At my current place, we have a Web crawler where some portions are leveraging Chrome/V8. Take a guess which component absolutely dominates the bug count and issue list? Not to mention the fact that it’s balls-ass slower than some straight un-optimized Scala. We’re looking to get rid of it completely ASAFP.

On a larger note, using JavaScript on the server-side seems kind of ridiculous. If Linden Labs came out with a server-side framework in LindenScript, would you use that? How about if Apple came out with a framework based on AppleScript?

Callback spaghetti is about the last pattern with which you’d ever want to write anything

I’ve written servers in just about every kind of server framework pattern there is. Node’s is the absolute worst. It provides the least amount of aid and comfort to the programmer and it’s nigh-on-impossible to follow the code 6 months later. The idea that the broken-ass concurrency model forced on JavaScript by old browser implementations is actually a *good thing* gives me a fucking aneurysm.

Non-blocking != fast != scalable

This is probably the most annoying point for me. First of all, scalability has very little to do with raw speed. Just because you’re fast does not mean you’re scalable. You know what’s fast? MySQL. You know what’s not scalable by itself? MySQL. The hype around Node.js on this issue makes me want to punch faces. Furthermore, Node.js isn’t even that fast. You can do much better with Scala and it’s a much nicer language, to boot. Oh, and never mind about those extra CPUs you bought: you won’t be needing those. Events and non-blocking are instantaneous, so you can do everything on a single core, right?

JavaScript

The Ruby and Python communities are just now, many years after the hype has faded, learning that stuff like dependency injection and proper modularization are actually good things that help you maintain code over time. JavaScript has very little support for any of those nice things: it doesn’t even have namespaces, for chrissakes. Why would I want to repeat these same mistakes over and over again in a new language? Knowing a language is not the same as being able to maintain services written in that language over the long haul.

It seems to me that people who are really crazy about Node.js are people who only knew JavaScript to begin with and to whom none of the above would ever even occur. Perhaps, this is a case of Worse Is Better and I’ll eventually have to eat my words on this one. But the kind of misunderstanding going on in this video clip seems to pervade the Node.js hype and gives me rageface.

HeapAudit is not a monitoring tool, but rather an engineering tool that collects actionable data – information sufficient for directly making code change improvements. It is created for the real world, applicable to live running production servers.

HeapAudit is a foursquare open source project designed for understanding JVM heap allocations. It is implemented as a Java agent built on top of ASM.

Which means that the correct lesson the boy’s parents could have taught him was what it is the boy does to make Superman think he can manipulate him, or even what it is about Superman that makes him act that way; but the one they went with, the one that will make him neurotic for the rest of his life, is that he’s a winner.

At that point I got busy with other things (most notably final preparations
for the FreeBSD
9.0-RELEASE announcement) but on Sunday evening I sat down and wrote a
much-needed shell script:

# ssh-knownhost hostname [fingerprint ...]

The ssh-knownhost script uses ssh-keyscan to download all the host
keys for the specified hostname; uses ssh-keygen to compute their fingerprints;
compares them to the list of fingerprints provided on the command-line; and
adds any new host keys to ~/.ssh/known_hosts. Short, simple, and
effective.
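
The fingerprint-comparison step can be sketched as follows. This is not the ssh-knownhost script itself (which is a shell script built on ssh-keyscan and ssh-keygen); it is an illustrative Java version of one step, assuming the legacy colon-separated MD5 fingerprint format and a key line of the form `hostname key-type base64-blob`. Newer OpenSSH releases print SHA256 fingerprints by default. The class and method names are hypothetical, and the key blob in `main` is a toy value, not a real host key.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Base64;
import java.util.Collection;

// Hypothetical helper illustrating the "compute fingerprint, compare to
// the expected list" step. Fetching keys and editing known_hosts are not shown.
final class HostKeyFingerprint {
    private HostKeyFingerprint() {}

    // keyLine is assumed to look like: "hostname key-type base64-blob [comment]".
    // Returns the MD5 digest of the decoded key blob as colon-separated hex.
    static String md5Fingerprint(String keyLine) {
        String[] fields = keyLine.trim().split("\\s+");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected: host key-type base64-key");
        }
        byte[] blob = Base64.getDecoder().decode(fields[2]);
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("MD5").digest(blob);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < digest.length; i++) {
            if (i > 0) {
                hex.append(':');
            }
            hex.append(String.format("%02x", digest[i] & 0xff));
        }
        return hex.toString();
    }

    // True if the key's fingerprint is one of the fingerprints the user supplied.
    static boolean matchesAny(String keyLine, Collection<String> expectedFingerprints) {
        return expectedFingerprints.contains(md5Fingerprint(keyLine));
    }

    public static void main(String[] args) {
        // Toy blob ("YWJj" is base64 for the bytes "abc"), not a real host key.
        String keyLine = "example.org ssh-rsa YWJj";
        System.out.println(md5Fingerprint(keyLine));
        System.out.println(matchesAny(keyLine,
                Arrays.asList("90:01:50:98:3c:d2:4f:b0:d6:96:3f:7d:28:e1:7f:72")));
    }
}
```

Only keys for which `matchesAny` returns true would be appended to ~/.ssh/known_hosts; anything else is left out.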

Calling this “Orwellian”: Buffett wrote that “private equity” is a “name that turns facts upside-down: A purchase of a business by these firms almost invariably results in dramatic reductions in the equity portion of the acquiree’s capital structure compared to that previously existing.”

There are different approaches for getting to the user’s credentials. I will present the easiest one here, and I will concentrate on iOS:

We register a custom NSURLProtocol for ‘keylogger://’ URLs. It is a dummy implementation which just makes sure that those URLs aren’t processed further by the framework.

In the webViewDidFinishLoad: method, inject some JavaScript into the loaded page. The JavaScript will attach a listener to every input element on the page and that listener will call a ‘keylogger://’ URL crafted by us which contains the character the user entered.

In the shouldStartLoadWithRequest: method, we capture all of the ‘keylogger://’ requests and log the characters. Then we stop loading, because those URLs are just used to communicate between JS and Objective-C.