It's about time for another demystification post, I think :P This time, I'm going to talk about Compilers, Virtual Machines (VMs), and Just-In-Time Compilation (JIT) - and the way that they are both related and yet different.

Compilers

To start with, a compiler is a program that converts another program written in 1 language into another language (usually of a lower level). For example, gcc compiles C++ into native machine code.

A compiler usually has both a backend and a frontend. The frontend is responsible for the lexing and parsing of the source programming language. It's the part of the compiler that generates compiler warnings and errors. The frontend outputs an abstract syntax tree (or parse tree) that represents the source input program.

The backend then takes this abstract syntax tree, walks it, and generates code in the target output language. Such code generators are generally recursive in nature. Depending on the compiler toolchain, this may or may not be the last step in the build process.

gcc, for example, generates code to object files - which are then strung together by a linking program later in the build process.

Additional detail on the structure of a compiler (and how to build one yourself!) is beyond the scope of this article, but if you're interested I recommend reading my earlier Compilers 101 post.

Virtual Machines

A virtual machine comes in several flavours. Common to all types is the ability to execute instructions - through software - as if they were being executed on a 'real' hardware implementation of the programming language in question.

Examples here include:

KVM - The Linux virtual machine engine. Leverages hardware extensions to support various assembly languages that are implemented in real hardware - everything from Intel x86 Assembly to various ARM instruction sets. Virtual Machine Manager is a good GUI for it.

VirtualBox - Oracle's offering. Somewhat easier to use than the above - and cross-platform too!

.NET / Mono - The .NET runtime is actually a virtual machine in it's own right. It executes Common Intermediate Language in a managed environment.

It's also important to note the difference between a virtual machine and an emulator. An emulator is very similar to a virtual machine - except that it doesn't actually implement a language or instruction set itself. Instead, it 'polyfills' hardware (or environment) features that don't exist in the target runtime environment. A great example here is WINE or Parallels, which allow programs written for Microsoft Windows to be run on other platforms without modification.

Just-In-Time Compilation

JIT is sort of a combination of the above. The .NET runtime (officially knows as the Common Language Runtime, or CLR) is an example of this too - in that it compiles the CIL in the source assembly to native code just before execution.

Utilising such a mechanism does result in an additional delay during startup, but usually pays for itself in the vast performance improvements that can be made over an interpreter - as a JITting VM can automatically optimise code as it's executing to increase performance in real-time.

This is different to an interpreter, which reads a source program line-by-line - parsing and executing it as it goes. If it needs to go back to another part of the program, it will usually have to re-parse the code before it can execute it.

Conclusion

In this post, we've taken a brief look at compilers, VMs, and JIT. We've looked at the differences between them - and how they are different from their siblings (emulators and interpreters). In many ways, the line between the 3 can become somewhat blurred (hello, .NET!) - so learning the characteristics of each is helpful for disentangling different components of a program.

If there's anything I've missed - please let me know! If you're still confused on 1 or more points, I'm happy to expand on them if you comment below :-)

I've been doing some more experimenting with LoRa recently, as I've got 1 of my 2 RFM95 working (yay)! While the other is still giving me trouble (meaning that I can't have 1 transmit and the other receive yet :-/), I've still been able to experiment with other people's implementations.

To that end, I've been learning about a bunch of different words and concepts - and thought that I'd document them all here.

LoRa

The radio protocol itself is called LoRa, which stands for Long Range. It provides a chirp-based system (more on that later under Bandwidth) to allow 2 devices to communicate over great distances.

LoRaWAN

LoRaWAN builds on LoRa to provide a complete end-to-end protocol stack to allow Internet of Things (IoT) devices to communicate with an application server and each other. It provides:

Standard device classes (A, B, and C) with defined behaviours

Class A devices can only receive for a short time after transmitting

Class B devices receive on a regular, timed, basis - regardless of when they transmit

Class C devices send and receive whenever they like

The concept of a Gateway for picking up packets and forwarding them across the rest of the network (The Things Network is the largest open implementation to date - you should definitely check it out if you're thinking of using LoRa in a project)

Secure multiple-layered encryption of messages via AES

...amongst many other things.

The Things Network

The largest open implementation of LoRaWAN that I know of. If you hook into The Things Network's LoRaWAN network, then your messages will get delivered to and from your application server and LoRaWAN-enabled IoT device, wherever you are in the world (so long as you've got a connection to a gateway). It's often abbreviated to TTN.

(Above: A coverage map for The Things Network. The original can be found here)

Data Rate

The data rate is the speed at which a message is transmitted. This is measured in bits-per-second, as LoRa itself is an 'unreliable' protocol (it doesn't guarantee that anyone will pick anything up at the other end). There are a number of preset data rates:

These values are a little different in different places - the above are for Europe on 868MHz.

Maximum Payload Size

Going hand-in-hand with the Data Rate, the Maximum Payload Size is the maximum number of bytes that can be transmitted in a single packet. If more than the maximum number of bytes needs to be transmitted, then it will be split across multiple packets - much like TCP's Maximum Transmission Unit (MTU), when it comes to that.

With LoRa, the maximum payload size varies with the Data Rate - from 230 bytes at DR7 to just 59 at DF2 and below.

Spreading Factor

Often abbreviated to just simply SF, the spreading factor is also related to the Data Rate. In LoRa, the Spreading Factor refers to the duration of a single chirp. There are 6 defined Spreading Factors: ranging from SF7 (the fastest transmission speed) to SF12 (the slowest transmission speed).

Which one you use is up to you - and may be automatically determined by the driver library you use (it's always best to check). At first glance, it may seem optimal to choose SF7, but it's worth noting that the slower speeds achieved by the higher spreading factors can net you a longer range.

Duty Cycle

A Duty Cycle is the amount of time something is active as a percentage of a total time. In the case of LoRa(/WAN?), there is an imposed 1% Duty Cycle, which means that you aren't allowed to be transmitting for more than 1% of the time.

Bandwidth

Often understood, the Bandwidth is the range of frequencies across which LoRa transmits. The LoRa protocol itself uses a system of 'chirps', which are spread form one end of the Bandwidth to the other going either up (an up-chirp), or down (a down-chirp). LoRahas 2 bandwidths it uses: 125kHz, 250kHz, and 500kHz.

Frequency

Frequency is something that most of us are familiar with. Different wireless protocols utilise different frequencies - allowing them to go about their business in peace without interfering with each other. For example, 2.4GHz and 5GHz are used by WiFi, and 800MHz is one of the frequencies used by 4G.

In the case of LoRa, different frequencies are in use in different parts of the world. ~868MHz is used in Europe (443MHz can also be used, but I haven't heard of many people doing so), 915MHz is used in the US, and ~780MHz is used in China.

After writing the post that will be released after this one, I realised that I made a critical assumption that everyone knew what an inverted index was. Upon looking for an appropriate tutorial online, I couldn't find one that was close enough to what I did in Pepperminty Wiki, so I decided to write my own.

First, some context. What's Pepperminty Wiki? Well, it's a complete wiki engine in a single file of PHP. The source files are obviously not a single file, but it builds into a single file - making it easy to drop into any PHP-enabled web server.

One of its features is a full-text search engine. A personal wiki of mine has ~75k words spread across ~550 pages, and it manages to search them all in just ~450ms! It does this with the aid of an inverted index - which I'll be explaining in this post.

First though, we need some data to index. How about the descriptions of some video games?

Kerbal Space Program

In KSP, you must build a space-worthy craft, capable of flying its crew out into space, without killing them. At your disposal is a collection of parts, which must be assembled to create a functional ship. Each part has its own function and will affect the way a ship flies (or doesn't). So strap yourself in, and get ready to try some Rocket Science!

Cross Code

Meet Lea as she logs into an MMO of the distant future.
Follow her steps as she discovers a vast world, meets other players and overcomes all the challenges of the game.

Fort Meow

Fort Meow is a physics game by Upper Class Walrus about a girl, an old book and a house full of cats! Meow.

Factory Balls

Factory balls is the brand new bonte game I already announced yesterday. Factory balls takes part in the game design competition over at jayisgames. The goal of the design competition was to create a 'ball physics'-themed game. I hope you enjoy it!

Very cool, this should provide us with plenty of data to experiment with. Firstly, let's consider indexing. Take the Factory Balls description. We can split it up into tokens like this:

T

o

k

e

n

s

V

V

factory

balls

is

the

brand

new

bonte

game

i

already

announced

yesterday

factory

balls

takes

part

in

the

game

design

competition

over

at

jayisgames

the

goal

of

the

design

competition

was

to

create

a

ball

physics

themed

game

i

hope

you

enjoy

it

Notice how we've removed punctuation here, and made everything lowercase. This is important for the next step, as we want to make sure we consider Factory and factory to be the same word - otherwise when querying the index we'd have to remember to get the casing correct.

With our tokens sorted, we can now count them to create our index. It's like a sort of tally chart I guess, except we'll be including the offset in the text of every token in the list. We'll also be removing some of the most common words in the list that aren't very interesting - these are known as stop words. Here's an index generated from that Factory Balls text above:

Token

Frequency

Offsets

factory

2

0, 12

balls

2

1, 13

brand

1

4

new

1

5

bonte

1

6

game

3

7, 18, 37

i

2

8, 38

announced

1

10

yesterday

1

11

takes

1

14

design

2

19, 28

competition

2

20, 29

jayisgames

1

23

goal

1

25

create

1

32

ball

1

34

physics

1

35

themed

1

36

hope

1

39

enjoy

1

41

Very cool. Now we can generate an index for each page's content. The next step is to turn this into an inverted index. Basically, the difference between the normal index and a inverted index is that an entry in an inverted index contains not just the offsets for a single page, but all the pages that contain that token. For example, the Cross-Code example above also contains the token game, so the inverted index entry for game would contain a list of offsets for both the Factory Balls and Cross-Code pages.

Including the names of every page under every different token in the inverted index would be both inefficient computationally and cause the index to grow rather large, so we should assign each page a unique numerical id. Let's do that now:

Id

Page Name

1

Kerbal Space Program

2

Cross Code

3

Fort Meow

4

Factory Balls

There - much better. In Pepperminty Wiki, this is handled by the ids class, which has a pair of public methods: getid($pagename) and getpagename($id). If an id can't be found for a page name, then a new id is created and added to the list (Pepperminty Wiki calls this the id index) transparently. Similarly, if a page name can't be found for an id, then null should be returned.

Now that we've got ids for our pages, let's look at generating that inverted index entry for game we talked about above. Here it is:

Term:game

Id

Frequency

Offsets

2

1

31

3

1

5

4

3

5, 12, 23

Note how there isn't an entry for page id 1, as the Kerbal Space Program page doesn't contain the token game.

This, in essence, is the basics of inverted indexes. A full inverted index will contain an entry for every token that's found in at least 1 source document - though the approach used here is far from the only way of doing it (I'm sure there are much more advanced ways of doing it for larger datasets, but this came to mind from reading a few web articles and is fairly straight-forward and easy to understand).

Can you write a program that generates a full inverted index like I did in the example above? Try testing it on the test game descriptions at the start of this post.

You may also have noticed that the offsets used here are of the tokens in the list. If you wanted to generate contexts (like Duck Duck Go or Google do just below the title of a result), you'd need to use the character offsets from the source document instead. Can you extend your program to support querying the inverted index, generating contexts based on the inverted index too?

Liked this post? Got your own thoughts on the subject? Having trouble with the challenges at the end? Comment below!

Welcome back to another demystification post! This time, it's about microphones. I had a question recently about microphones and phantom power, and after doing some rather extensive research on the subject (unintentionally of course :P), I thought it a waste not to summarise it here.

Basically, phantom power is a +48V direct current that's transmitted through a microphone cable (not the kind you plug into your laptop I don't think - the big chunky ones). It's required by condenser microphones (though some use a battery instead), which have a pair of films (called diaphragms) which vibrate. When a current is passed through from one plate to the other, the physical sound gets converted into an electrical signal we can use.

Condenser microphones are much more sensitive than their dynamic microphone counterparts. They are able to better represent a wider range of frequencies - but as a result of this heightened sensitivity, you normally need a pop filter if you're recording vocals. In addition, they don't tend to perform too well in loud environments, such as concerts. Finally, they tend to be more expensive than dynamic microphones, too.

Dynamic microphones, on the other hand, don't require phantom power. They are basically a loudspeaker in reverse and generate the current themselves - they have a single diaphragm that's attached to a metal core - which in turn has a coil of wire around it. When the diaphragm vibrates, so does the metal core - and as you can probably guess, a current is induced in the coil, as metal cores tend to do when inside coils of conveniently placed wires.

While they are better in loud environments (like concerts and drum kits), dynamic microphones aren't so good at representing a wide ranges of frequencies - and as such they are usually tailored to be pick up a specific frequency range better than others. Furthermore, they aren't as sensitive in general as your average condenser microphone, so they don't get on particularly well with very quiet sounds either.

Which you use generally depends on what you want to do. If you've got an overly enthusiastic drummer in a rock concert, you probably want a dynamic microphone. On the other hand, if you're trying to record the song of a cricket on a still summer's evening, you probably want to keep a condenser microphone handy.

I'm not an audio expert, so I might have made a few mistakes here and there! If you spot one, please do let me know in the comments below :-)

A little while ago someone I know seemed a little bit confused as to how a traceroute interacts with a firewall, so I decided to properly look into and write this post.

Traceroute is the practice of sending particular packets across a network in order to discover all the different hops between the source computer and the destination. For example, here's a traceroute between starbeamrainbowlabs.com and bbc.co.uk:

As you can see, there are quite a number of hops between us and the BBC, not all of which responded to attempts to probe them. Before we can speculate as to why, it's important to understand how a traceroute is performed.

There are actually a number of different methods to perform a traceroute, but they all have a few things in common. The basic idea exploits something called time to live (TTL). This is a special value that all IP packets have (located 16 bytes into an ipv4 header, and 7 bytes into an ipv6 header for those who are curious) that determines the maximum number of hops that a packet is allowed to go through before it is dropped. Every hop along a packet's route decreases this value by 1. When it reaches 0, an ICMP TTL Exceeded message is returned to the source of the packet. This message can be used to discover the hops between a given source and destination.

With that out of the way, we can move on to the different methods of generating this response from every hop along a given route. Linux comes with a traceroute utility built-in, and this is the tool that I'm going to be investigating. If you're on Windows, you can use tracert, but it doesn't have as many options as the Linux version.

Linux's traceroute utility defaults to using UDP packets on an uncommon port. It defaults to this because it's the best method that unprivileged users can use if they have a kernel older than 3.0 (check your kernel version with uname -r). It isn't ideal though, because many hosts don't expect incoming UDP packets and silently drop them.

Adding the -I flag causes traceroute to use ICMP ping requests instead. Thankfully most hosts will respond to ICMP pings, making it a much better probing tool. Some networks, however, don't allow ping requests to pass through their gateways (usually large institutions and schools), rendering this method useless in certain situations.

To combat the above, a new method was developed that uses TCP SYN packets instead of UDP or ICMP ping. If you send a TCP SYN packet (manipulating the TTL as above), practically all hosts will return some kind of message. This is commonly referred to as the TCP half-open technique, and defaults to port 80 - this allows the traceroute to bypass nearly all firewalls. If you're behind a proxy though I suspect it'll snag on it - theoretically speaking using port 443 instead should rectify this problem in most cases (i.e. traceroute -T -p 443 hostname.tld).

Traceroute has a bunch of other less reliable methods, which I'll explain quickly below.

-U causes traceroute to use UDP on port 53. This method usually only elicits responses from DNS servers along the route.

-UL makes traceroute use udplite in a similar fashion to UDP in the bullet point above. This is only available to administrators.

DCCP can also be used with the -D. It works similar to the TCP method described earlier.

A raw IP packet can also be used, but I can't think of any reasons you'd use this.

That just about covers all the different traceroute methods. If you have any questions, please leave a comment below.

Yesterday I was taking a look at [UDP Multicast], and attempting to try it out in C#. Unfortunately, I got a little bit confused as to how it worked, and ended up sending a couple of hours wondering what I did wrong. I'm writing this post to hopefully save you the trouble of fiddling around trying to get it to work yourself.

UDP stands for User Datagram Protocol (or Unreliable Datagram Protocol). It offers no guarantee that message sent will be received at the other end, but is usually faster than its counterpart, TCP. Each UDP message has a source and a destination address, a source port, and a destination port.

When you send a message to a multicast address (like the 239.0.0.0/8 range or the FF00::/8 range for ipv6, but that's a little bit more complicated), your router will send a copy of the message to all the other interested hosts on your network, leaving out hosts that have not registered their interest. Note here that an exact copy of the original message is sent to all interested parties. The original source and destination addresses are NOT changed by your router.

In the above, I set up a few variables or things like the multicast address that we are going to join, the port number, and so on. I pass the port number to the new UdpClient I create, letting it know that we are interested in messages sent to that port. I also create a variable called channel, which we will be using later.

Next up, we need to figure out a way to send a message. Unfortunately, the UdpClient class only supports sends arrays of bytes, so we will be have to convert anything we want to send to and from a byte array. Thankfully though this isn't too tough:

The above converts a simple string to and from a byte[] array. If you're interested, you can also serialise and deserialise C&sharp; objects to and from a byte[] array by using Binary Serialisation. Anyway, we can now write a method to send a message across the network. Here's what I came up with:

You might not have seen (or heard of) asynchronous C# before, but basically it's a ways of doing another thing whilst you are waiting for one thing to complete. Dot net perls have a good tutorial on the subject if you want to read up on it.

For now though, here's how you call an asynchronous method from a synchronous one (like the Main() method since that once can't be async apparently):

Task.Run(() => Listen).Wait();

If you run the above in one program while sending a message in another, you should see something appear in the console of the listener. If not, your computer may not be configured to receive multicast messages that were sent from itself. In this case try running the listener on a different machine to the sender. In theory you should be able to run the listener on as many hosts on your local network as you want and they should all receive the same message.