taowen

2/27/2012

VBA Stacktrace

Logging in VBA is hard, believe it or not. There is no obvious way to find out what went wrong even you decided to do some manual logging. Unlike C# or other "industrial" programming environment, there is no easy way to tell the stacktrace of current execution point. The article describes a way that I invented/found to do it "elegantly" (relatively speaking).It looks like

HEG is a global boolean const, stands for "Handle Error Globally". If we wrap all our sub/function with blocks like above, we can ensure there will be logs printed out when some errors happens. And the stacktrace will be available in the log file, along with the invocation parameters in the invocation chain.

You might wonder how it is implemented. Actually quite simple, it is NOT maintaining a stacktrace some where. If we do it that way, we need to push/pop frame into/from the stacktrace. My way is a little bit simpler, it does force you to maintain a separate stacktrace. When a exception is raised, the On Error GoTo statement will catch it, and GetLogger().Error function call will write out the error to log file. One thing is not that obvious is instead of resume next, the GetLogger().ReraiseError function call will also raise another exception, which can be caught in outter level. Again, the reraised exception will be caught and log to file. This way, a stack trace can be recorded in the file, with the root cause in the top, and the most outside calling place in the bottom.

1/04/2011

Retrospection: the mistakes I have made these years

Someone told me, it took more than 10000 hours repeated practices to make a professional mature. I am still far away from the standard, but after 5 years of programming as my profession, I realized I already made so many mistakes, that worth some conscious retrospection.

One presentation I did not watch but really liked their slides: http://www.infoq.com/presentations/LMAX. In the slides, they said:

On a single thread you have ~3 billion instructions per second to play with: to get 10K+ TPS if you don't do anything too stupid.

I have to say, I did do many things smart at first and turned out to be stupid, which made the ~3 billion instructions per second hardware helpless to the project. Not just about performance, there are many mistakes leads to other symptoms as well.

Sometimes, the "I" here can be substitute with "We". I am sure, and I have seen other people made same mistakes as I did. As poor software developer, we do not have control of many things, but code at our hands. It is not surprising people spent a lot of time to make their code "smart". A lot lessons can be learned from those smartness.

I do not have a full list yet, but as a start I will list some here. As this blog is primarily technical, I will keep the items mostly relevant. If I find time I will complete them one by one:

11/27/2010

Package, the missing language feature - Part II

Problems

In previous post, we have talked about how package works in Python language. Essentially, the problem is, the package is a good box, but color is not black. We want the package to expose all its API at the package level, and seal up any internal details. from A import * should give you all the things you need, you do not need to import A.B, or import A.C.

Unimportable

So, how to make a module unimportable? There are two things you need to do. First, remove the B attribute from package A object. By doing this, import A.B will fail. Because import A.B will first import A, and then import A.B, and then get B from A. By deleting B from A the import will fail. Second you need to remove A.B from sys.modules, by sys.modules['A.B'] = None. This will make from A.B import * fail.

delattr(package_A, 'B')
sys.modules['A.B'] = None

This way, we completely hide the existence of A.B. Which is the behavior we want when other package want to import this private internal. The drawback of this mechanism is that, the error message user get is not friendly. They will be told the module does not exist, but it actually exist if you look it up in the file browser.

When?

By making internal packages modules unimportable we can make the parent package a blackbox. But when we do this, deleting all the internal packages and modules?

The best place is in the __init__.py of parent package. But after we delete the internal packages and modules, they are gone. What if A.B reference A.C in the code? The thing we need to do is to make sure A.B are initialized(imported) before sealing up A. In A.B it might use import A.C or from A.C import xxx, both ways copy the referenced name to local namespace. So even A.C no longer exists, in A.B they can still be referenced.

Do I need to write those kind of ungly delattr in every __init__.py file? Isn't that a cross-cutting concern that should not be repeated in every place?

Yes, let's find some way to magically inject those code in every __init__.py file. The code actually has three parts. Part 1, expose API. Part 2, eager load sub modules. Part 3, delete sub modules. API stil need to be manually defined in __init__.py. But part 2 and 3, they can be put into "post-import-hook".

What is post import hook? They are the code executed after module being imported. After A being imported, we can eager load all its sub modules by scanning folder and then delete them. Post import hook is not directly supported in Python, but can be done by more powerful meta import hook.

By eager loading sub modules and delete them in post import hook. We can seal up package and force people to define the package API in __init__.py, because that is the only way to let outsider to use the internal.

Another interesting side effect is that the circular dependency between packages are no longer possible. In python, circular dependency between modules are not possible, but because module in package is lazy loaded, so circular dependency between packages were possible. But after eager loading sub modules, we now disabled the circular dependency between packages. It is good thing, but could be very strict.

Finally, we have the box. And automatically seal it up in post import hook. If the box writer want to make the box external usable, they need to define its API in __init__.py. Plus, no circular dependency ever.

11/25/2010

Package, the missing language feature - Part I

Introduction

We have spent a way too much time on functions and classes. We put a lot of energy to maintain a clean and concise interface of classes. We care about the dependency between classes, by encouraging dependence injection and wire up objects via interface.

Besides functions and classes, we do have higher level construct. We have hacked the class loading of java to the hell, and then Satan gives us back his OSGi. We have invented a dedicated job to maintain the manifest of EJB. When dependency injection is not enough, people do find concept called Module emerging in modern things like Guice and Autofac.

What is Package? Itself is merely a name. It is just some annoying leading dots before the thing you actually want. It is not even a thing, it is just a being ignored prefix. People might say, oh yes, package is not doing anything, why should I care? Function is doing something, class is also doing somethings, package is just some dummy folder that I can put those valuable stuff inside it.

True, very true... so does the language designers. I can not say all of them does, but at least some of them does. Stroustrup ignored package. Gosling ignored package. Even Hejlsberg ignored package (But, assembly is better than nothing). What a huge mistake!

The problem we normally need to solve when writing business software is not some scientific work. In my mind, the only problem we need to solve is managing complexity. As we learned long time ago, the only way to control complexity is to break it down, and break it down further. One thing containing many other things. We need blackbox to encapsulate the internal complexity monster and give outside a clean and simple illusion. But constantly, when using Java or C# I find I need to reinvent all kinds of blackboxes to meet my needs. And none of them seems naturally to new comers, simply because they are not part of the original language, not known to most people, and not supported by many tools. There are many design patterns, people say they exist because the language itself is flawed. There are also many component platform/framework, I say it is because the language itself is flawed. It is because the language does not give us the blackbox, so we need to invent one ourself.

Package, it is a missing language feature for a long time. But luckily, Java or C# is not the only choice we have. In another open wonderland, without money but happiness, we have our lovely Python. In there, we finally see what is called package.

Package in Python

Package in python is simple. If you have a folder called some_package, and you have a __init__.py file in that folder, then it becomes a package called some_package. If you happen to have another folder inside some_package folder called another_package, and itself also has a __init__.py file inside the folder, then it becomes some_package.another_package.

The key difference between package in Java and package in Python is, in Java, the package is just a literal symbol, it does not exist in the runtime. In Python, the package is a living object and you set and get attribute on it anytime. some_package.another_package.abc = 'def' is a valid statement in Python language.

This gives us the box we want. We can use this box to define our interface and hide our internal complexities. A package structure like A.B.C, A should hide B, C. B should hide C. In the A level, you might say start the car. In the B level, you might say start the engine, and then start radio and air conditioner. The hierarchical structure of package naming is the best fit for natural encapsulation.

The box can also initialize itself. It has a __init__.py file which can be used to execute any bootstrapping code. Sometimes we need to sort out some internal stuff before ready for outside service. Sometimes, we need to register ourself as subscriber for event published somewhere else. Having simple __init__ solves a lot of problem. It is powerful enough? Not really, it does not support thousand other features, like full lifecycle management, standard remote control interface, etc. As a user facing public component, the package construct exposed by the language is very limited. But we can build on top of what language provides us.

The real problem of python package is not it does not support things like JMX. The real problem is the box is not really a black box. Actually, everything in Python is sort of made by Glass. You can see right through nearly everything, that is public in Java term. Although we can use _ as convention, and __ as hard compiler constraint in some place. But here, the ugly underscore is not helping. You can always reference A.B.C.xxx anytime you want, and that is dangerous. It breaks encapsulation, introducing tangling dependency without being noticed. No one likes that, we want to make sure Y.X.Z only reference A.B.C.xxx through A.yyy. It should not know the internals like A.B.C.xxx.

Classic "pythonic" response would be, that is just a convention. When convention is there, people should follow. The problem is, this is not a easy convention, and can be broken in any minute. There is no easy rule people can follow. The real difficulty is, when you reference A.B.C.xxx you can not always reference it as A.yyy. If your code lives inside A.B, then you should not reference A.yyy, because the inside package should not reference the outside, as it is in the lower place in the dependency pyramid. In this case, you do need to reference A.B.C.xxx as A.B.C.xxx as it is some thing you have to deal with. It is no longer a hidden internal, you are living inside the internal. In other words, A.B.C.xxx is not always public or private. It is accessible or not depending on where you are. And that is exactly what encapsulation is about.

How can we make the box really a black box? Let's continue in the Part II.

Update 11.27

Alan (https://alanfranz.pip.verisignlabs.com/) commented:
I'm not 100% sure about what you'd like to say in the next part, however:

beware about "bootstrapping code". Many times such "static initializer" is known to provoke unpredictable problems, and will prevent package breakup via pkg_resources namespace_package , if ever needed.
Most of the times initialization should be performed by the client code or should be performed at first request; import-time initialization is absolutely abused in python coding.

convention is fine. If anybody imports a module or package that starts with underscore, it's their business - after all, if they've got the source code, they can modify all the names and make them public, can't they? Would you prefer java-like things where you can set methods and attributes private, and then you can access them via other means through some common.util.lang tool?

nesting too much might just be unneeded, and if you want a leaf (a.b.c) not to depend on its parent you can use relative imports. But remember that two modules importing one the other trigger an error in Python, you simply can't do that.

Thanks Alan! I am surely aware of "bootstrapping code". The major problem of import time initialization is it is implicit. And it will be even worse if the bootstrapping is I/O intensive or causing other side effects. But if there are too many clients, like a lot of unit tests, pushing responsibility to them is also inconvenient. I use __import__('x.y.z') in the main function to implicit stating that I want to use those packages and initialize them now.

Starting with _ means private, that is fine. But a package is not always private, it is public to its siblings, but private to other packages. Traditional visibility only allow you to specify one thing is public or not, I think that is not enough. How many details you are allowed to know, or should depending on, might be contextual.

Python does check the circular dependency on module level. But it does not check circular dependency on package level. For example, A.B can not circular depend on C.D, but if A.E depend on C.D and C.D depend on A.B, that is allowed. But actually, that means package A depend on C, package C depend on A as well.

2/22/2010

Data Migration (3)

The final question about data migration. How to write it? We already know it is just a function to transform a dictionary into another dictionary. We also know there will be question around dependencies between entities. So, what we need are several functions, with each one upgrade one version. The migration function is per version delta, not per entity. The function need to do several things:

Find out what are the entities need to be migrated.

Load the entity state as dictionary.

Apply the migration logic on the dictionary.

Save the entity state back.

For finding entities to be migrated is easy. We already know SQLServer as XQuery support. We can write customized xquery to find out what are target entities. Most of time, it will be based on CLR_TYPE. Then the only thing that being a problem is how to write a function to transform a dictionary in memory to another dictionary.

It might seems easy, we just need to write a function in C#, which takes dictionary as input and return a new dictionary. Yes, this would work. But the code would be very details, and looks unintentional. It would need to a lot of casting to cast a entry to a list or a string or another dictionary, based on your knowledge of the object graph. It also need to do a lot of detailed operation, like copy a field to another and delete the older field to do a renaming. The issue of plumbing code might be solved by introducing non-static typed language, like ruby as a migratioin scripting language. But the more essential problem is how to raise the abstraction level, so that the migration script can looks like more intentional, and reveals the original requirements.

One naive change is we write several function for the well known refactorings. Like we can have Rename, MoveType, ExtractEntities. And that was exactly what we have tried before. The problem of these small functions are they are not really that reusable. Say, we have a rename function, which a change the direct field from one name to another. But what if we renamed a field but it is inside the object graph not directly on the root. Then the rename function can no longer help us. We might think we can abstract the "locating" part of the function. Instead of passing in two string to identify the fields by name, we pass in two locators.

The locator is not easy to implement. Say, we are renaming field x.y.z to x.y.k, x is branch of the root, and y is branch of x, and z was the field on y, and k is the new name of z. The rename function need to take "x.y.z" and "x.y.k" as input, and know how to apply them. For "x.y.z" we need to "get" the value, and then use "x.y.k" to "set" the value, then use "x.y.z" again to delete the field. The logic of getting value is very different from the logic of setting value.

In general, this apporach was called Functional Programming. By decomposing the big function to smaller one, and then compose them back to cope with different situations, we can maimize the reusbility.

2/21/2010

Flattening & Rebuilding

Using SQLServer as your nosql database to persist objects state has issue with data migration. I mentioned three in the previous post. There are two more, today we are going to talk one of them. You can not deserialize object state back to class whoes fields have been changed. Think about a class used to have a field called name, now the field changed to firstName. When deserialize, where should the value of name assigned back to? If we can not get the raw value, how can we apply the data migration logic? We already talked about the data we stored in SQLServer is XML, are we going to parse the XML and manipulate the xml element directly? Yeah, I think we have to.

So the data migration logic is not applied on the same object model which your application logic dealing with. It has to be at a lower level. We could data migrate the xml data elements, but that is just too tightly coupled with the data format. Before we use the xml, we actually tried JSON for a month, until we found the XQuery is really a killer. Also, xml element has many things we do not care about. So, what we need is a model which can capture the states of the objects, but simple enough. This is model is also directly related to how serialization/de-serialization is working. It works like this:

It will be deserialized to a dictionary containing 5 entries. The CLR_TYPE will be used in the rebuilding process to rebuild the dictionary back to a object. Except dictionary, string is also valid. string is used to store the field name as well as the simple field value. The persistence layer need to define how to translate a date time into a string, and etc. Collection is also valid. Although in theory, collection is just a special case of dictionary.

The XML is just state storage for a single entity. Entities can inter-relate with each other. We are not going to store the state for other entity in same XML. They will be referenced by ID, and stored separated in different rows in the table EntityState.

Because we separated the serialization into two phases, that is why we can do data migration. The data migration is just a function, who take a dictionary as input, and produce another dictionary. Now the only problem is, how can we write such kind of function? Yes, it might be trivial for just one version, but if you are going to change it very frequently doing agile software development, then it is a big issue. We are going to talk about the "reusbility" of data migration rules in the next post.

2/20/2010

Data Migration (2)

There are three difficulties about nosql data migration, particularly if the data is serialized object state:

Relationship between entities causing the dependencies between data migration rules.

Lack of ad-hoc query support.

Hard to migrate in batch.

The two approaches we talked about: Migrate on load, Migrate in one time. Both of them pros and cons, and they are very much related to the three difficulties mentioned above.

The pros of Migrate on load: It do not need to shutdown your database or application. So, in theory, live migration is possible this way. Another related big benefit is you spread the cost of data migration over the period. So, if the data set is huge, it is very economical to do it this way. Especially a lot of the data are not frequently being used.

The cons of Migrate on load: very very difficult to deal with the dependencies between data migration rules. Not able to fail fast, if there is flaws in the data migration code itself. Also the design more sophisticated so more likely to run into problem.

The pros/cons of Migrate in one time is exactly the opposite of the above. Both of them have problems to deal with lack of ad-hoc query support. For example, if you want to change a reference from id to a business id, then it is very likely you need to translate from one particular id to another business id. This kind of query is very unlikely to have designed index tables. So if we do not have ad-hoc query support, then the data migration code is very hard to write. You might need to build special index table just for data migration purpose. Luckily, if we use SQLServer as nosql database, then we can leverage the xquery capability.

For batching, if you migrate on load, it is not a problem. But if you migrate in one time, it might be very time-consuming. It is now the NO.1 concern in my team around data migration. I have no good solution to this one yet. Previously, we write data migration using SQL, it is batch processing in it's nature. But now we do not have schema, so SQL is not applicable anymore, which means more RPC round-trip involved in the data migration. We need to literally load the whole database out. The long term mitigation is to introduce Map-reduce.

The not so well known problem is the problem of dependencies. One simple question. If we move a field from class A to B. When we load the object of class A, should we do the migration of class A and referenced object of class B? When we load the object of class B, should we do the migration of class B and referenced object of class A? Then, are we running into the circular reference problem here? This is just a obvious example, there are much more not so obvious examples. For example, if we delete a class. That means all data migration referencing that class must be all executed against the whole database, otherwise we have possibility to not able to load object of that class anymore. How can we avoid that? Let's talk about it in next post.