Techualization

Sunday, November 8, 2015

A few days ago, some colleagues at work brought up a discussion about several closely related concepts in Python: Iterable, Iterator, and Generator. I thought it would make a good blog post to explain them all, down to the wording in the PEPs, rather than hand-waving over them.

Duck typing

Before we go deep, let's quickly remember that Python is a dynamically typed language. Many people use the term duck typing to refer to Python's type system. Colloquially speaking, if it walks like a duck and quacks like a duck, it must be a duck. That is to say, an object is considered a particular type if it behaves in accordance with the defined behavior of that type.

In other words, in Python we are usually talking about interface rather than implementation. The topics mentioned above are in fact concepts, not concrete implementations, as we shall see later.

Iterable

Iterable is the easiest one to explain. If an object has either an __iter__ or a __getitem__ method, it is an iterable.

The main purpose of an iterable object is that it can be used in a for loop. Let's take a look at how a for loop works.
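A quick way to see this is to disassemble a trivial loop (the exact bytecode varies across Python versions, but the two key opcodes are stable):

```python
import dis

def loop(seq):
    for item in seq:
        print(item)

# Show the bytecode; look for GET_ITER followed by FOR_ITER.
dis.dis(loop)

ops = [ins.opname for ins in dis.get_instructions(loop)]
print('GET_ITER' in ops, 'FOR_ITER' in ops)  # True True
```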

As you can see, the for loop itself is compiled into two main instructions: a GET_ITER followed by a FOR_ITER. The first instruction obtains an iterator from an iterable object. The second grabs the next item from that iterator.

You can also tell that I have simplified things a little bit up there, so as not to get into another term, which we are getting to next.

Iterator

An iterator object has a next method (or __next__ in Python 3). This method returns the next item in a stream, or raises StopIteration when there are no more items.

Now, if there were only a next method, the iterator would not be usable in a for loop. In order for "code receiving an iterator [to] use a for-loop over the iterator", "iterators are currently required to support both protocols". The other "protocol" referred to in that direct quote is the iterable concept. That is, iterators are required to have an __iter__ method that returns the iterator itself. The two concepts, however, are distinct.
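A minimal hand-written iterator satisfying both protocols might look like this (the class name is illustrative):

```python
class CountDown:
    """A minimal iterator; counts down from start to 1."""

    def __init__(self, start):
        self.current = start

    def __next__(self):              # 'next' in Python 2
        if self.current <= 0:
            raise StopIteration      # no more items
        self.current -= 1
        return self.current + 1

    def __iter__(self):              # returns itself, as required
        return self

print(list(CountDown(3)))            # [3, 2, 1]
```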

So, by now, I hope that it is clear that iterators are iterable, and iterable objects produce iterators.
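Here is a short sketch of such a session, using a plain list as the iterable:

```python
seq = [1, 2, 3, 4, 5]

iter1 = iter(seq)        # seq is iterable: iter() calls its __iter__
for value in iter1:
    if value == 3:
        break            # iter1 has now consumed 1, 2, 3

iter2 = iter(seq)        # a second, independent iterator over seq
remaining = list(iter1)  # continues where it left off
restarted = list(iter2)  # starts from the beginning
print(remaining)         # [4, 5]
print(restarted)         # [1, 2, 3, 4, 5]
print(iter1 is iter2)    # False
```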

We have a seq sequence. This sequence is iterable because it has an __iter__ method. We obtain the first iterator, iter1. We then run a for loop over iter1, breaking out at value 3. The fact that we can run a for loop over iter1 confirms that iter1 is iterable. But unlike seq, iterating over iter1 again continues from where it left off, not from the beginning. We also see that iter2 (which is created at the same time we restart iteration over iter1) is totally different from iter1.

I hope the example makes sense. When iterating over a sequence, you want to iterate over all its items. When iterating over an iterator, you want to iterate over the items left in that iterator, not the items that have already been consumed. Put differently, the __iter__ method defines how the items are iterated over, while the next method simply returns the next item in that order. Therefore, an iterator is almost always assumed to be usable only once, while an iterable is often iterated over many times.

Now that we have seen the relationship between iterator and iterable, let's get to the last topic.

Generator

Quoting PEP 255: Python generator is a kind of Python iterator, but of an especially powerful kind.

A generator is more powerful because it has these other methods on top of next: send, throw, and close. Therefore, if an object does not accept any of these methods, it is not a generator. Conversely, an object that provides all of these methods can be treated as a generator.

Due to duck typing, it is wrong to use isinstance(x, types.GeneratorType) to check whether x exhibits generator behavior. That type is only defined for objects returned from a generator function or generator expression.

Speaking of generator functions and generator expressions, they should not be confused with generators. A generator function is any function that has the yield keyword somewhere in its body. Invoking this function does not run its body; instead, it returns a generator object, which PEP 255 calls a generator-iterator. Calling next on the returned generator executes the function body. A generator expression is similar in concept but uses the more familiar list comprehension syntax.
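Both forms can be sketched like this (gen_func and the yielded values are illustrative):

```python
import types

def gen_func():
    received = yield 1    # 'yield' makes this a generator function
    yield received

g = gen_func()            # the body has not run yet
print(type(g) is types.GeneratorType)  # True
print(next(g))            # 1 -- runs the body up to the first yield
print(g.send('hello'))    # 'hello' -- resumes, sending a value in

gen_expr = (n * n for n in range(3))   # generator expression
print(list(gen_expr))     # [0, 1, 4]
```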

The above code shows that the return value from a generator expression or a generator function is, as expected, a proper generator. The code also shows that MyGenerator is indeed a generator: because it fulfills the generator concept, it can be used as a subgenerator in a yield from statement. Lastly, it shows that a regular iterator is not a generator.

On a side note, the example makes it clear that iterable and iterator/generator are two distinct concepts. MyIterable does not produce the next item, and MyGenerator is not iterable, but when combined they can be used in a for loop.
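That combination can be sketched as follows; the class names follow the text, but the bodies are my own illustrative versions:

```python
class MyGenerator:
    """Duck-typed generator: __next__/send/throw/close, but no __iter__."""

    def __init__(self, items):
        self._items = list(items)
        self._index = 0

    def __next__(self):
        if self._index >= len(self._items):
            raise StopIteration
        item = self._items[self._index]
        self._index += 1
        return item

    def send(self, value):
        return self.__next__()       # this sketch ignores the sent value

    def throw(self, exc):
        raise exc

    def close(self):
        self._index = len(self._items)

class MyIterable:
    """Iterable only: no __next__, just __iter__ producing a MyGenerator."""

    def __iter__(self):
        return MyGenerator([1, 2, 3])

# MyGenerator alone cannot drive a for loop (it is not iterable),
# but combined with MyIterable it can:
print([x for x in MyIterable()])     # [1, 2, 3]
```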

Summary

To sum it all up: an iterable object produces iterators and therefore controls how the items are iterated over. An iterator produces the next item in a defined order and is usually consumed only once. For an iterator to be used in a for loop (no pun intended), it is required to define an __iter__ method that returns itself. A generator is an enhanced iterator with additional semantics such as send, throw, and close.

Practically speaking, if an object can be fully iterated over with a for loop more than once, it is an iterable. If an object has next (or __next__ in Python 3), it is an iterator. And if a value can be sent to an iterator, it is a generator.

Saturday, July 26, 2014

The Economist published an article, Divided we stand, about the work that Professor Michael Franz did at the University of California, Irvine to further secure software applications.

The gist of the technique is to compile the same source code into many variant binaries that perform the same function, but differ structurally, at the machine instruction level. For example: the sequence MOV EAX, EBX; XOR ECX, EDX can be rearranged into XOR ECX, EDX; MOV EAX, EBX, or made into MOV EAX, EBX; NOP; XOR ECX, EDX without affecting any functionality of the sequence. The team modified compilers (both LLVM clang and GCC) to automatically (and deterministically) introduce diversity (randomness) in the instruction scheduler. As such, exploit writers will have a harder time targeting all variants.

This immediately reminds me of a simple trick I used many years ago to achieve a similar effect: randomizing the linking order of object files. Consider this: if you have object files main.o, strcpy.o, and puts.o, you can create 6 (3 factorial) variants by linking them in different permutation orders:
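The idea can be sketched in a few lines; the compiler driver cc and the printed commands are placeholders, and the script only prints what one would run:

```python
import itertools

objects = ['main.o', 'strcpy.o', 'puts.o']

# 3! = 6 link orders; each produces a functionally identical binary
# with a different internal layout.
commands = ['cc -o variant%d %s' % (i, ' '.join(order))
            for i, order in enumerate(itertools.permutations(objects))]
for cmd in commands:
    print(cmd)
```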

In the first variant, f1 is at ~EC0 and main is at ~ED0. In the second variant, f1 is at ~EF0 and main is at ~EC0. There is a clear difference in the structure of the binaries but no functionality is affected.

This trick is performed at the final stage (linking) of the whole build process. Therefore, intermediate object files can be reused without recompilation. Furthermore, no source code is required for this "diversification" process to happen.

The clear tradeoff is in the granularity of the diversification. In the context of Prof. Franz's work, which is mainly a defense against ROP exploits, I'll happily ignore such granularity.

Oh, by the way, I did not use this trick to "secure" the application. It seems like the wrong tool for that purpose due to the distribution and debugging problems it creates.

At other optimization levels, the code can be compiled and executed just fine.

If we uncomment the second functor, the code always fails, regardless of optimization levels.

The reason is that our template declares a static constant variable functor_ (at the line marked with THIS LINE!!!). At high optimization levels, the compiler finds out that we only use the functor object to execute a function, so it inlines the function and optimizes away the functor object. Without optimization, the compiler requires a definition of functor_ and fails to find one.

When we use f2, its functor_'s operator() refers back to itself via this. That requires the functor object to actually be allocated. But because we have not defined any such functor object, the compiler fails to build our code.

I find this piece of code interesting because usually it is higher (not lower) optimization levels that make code fail. For example, Prof. John Regehr has blogged about undefined behavior under optimization, and the STACK team at MIT published a paper about optimization-safe code.

Wednesday, December 11, 2013

This is an easier way to set BrowsePass up on the Google Chrome browser (as well as its open source cousin, Chromium). The Chrome app alleviates the most cumbersome step in setting up BrowsePass: finding a web host for the source code. Everything else remains the same, including the convenience of opening your password database with a browser.

Though there are not yet extensions/add-ons/apps for other browsers (such as Firefox, Safari, and Opera), they can still run BrowsePass normally.

Wednesday, November 27, 2013

Just a quick note to myself. VnTeX recently moved its vietnam.ldf file to babel's contrib as vietnamese.dtx. The move happened on April 14, 2013. The babel package in MiKTeX is currently dated March 23, 2013. The vntex package in MiKTeX is currently dated May 21, 2013. That is to say, the vntex package no longer provides vietnam.ldf, yet the babel package is still not up-to-date enough to have vietnamese.dtx.

The fix is to maintain your own vietnam.ldf file. The code (taken from before the move) is pasted below.