Source Code Analysis: Translation Coverage

22 May 2010

How to Make a Translation Term Coverage Report

Let’s assume our application has a UI translation module. We created a dictionary and use its terms in the code of the application components. Everything is clear so far. But one day we begin to suspect that not all of the terms in the dictionary are actually used in the application. Besides, some of the terms used in components are probably missing from the dictionary. So we need a script that traverses all the source code files looking for uses of the translation module. With the list of all encountered terms in hand, the script can compare it against the terms of the dictionary.

PHP has a simple but powerful tool for this job, token_get_all. Let’s try it and see what it returns.
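As a minimal sketch of such a first try, the snippet below tokenizes a one-line sample and dumps the raw result. The sample source string is a made-up example, not code from the original application.

```php
<?php
// Hypothetical sample source; any PHP code string works here.
$source = '<?php echo gettext("Hello");';

$tokens = token_get_all($source);

foreach ($tokens as $token) {
    if (is_array($token)) {
        // Array tokens: [0] token code, [1] matched text, [2] line number.
        printf("%d\t%s\t%d\n", $token[0], var_export($token[1], true), $token[2]);
    } else {
        // Single-character tokens such as ( ) ; come back as plain strings.
        printf("-\t%s\n", var_export($token, true));
    }
}
```

The numeric codes in the first column differ between PHP versions, so it is safer to compare against the T_* constants than against literal numbers.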

At first glance it is not obvious what the output array means. Each array element holds the parser token code, the matched text, and the source line where the token was found, in that order. Single-character tokens such as ; or ( are returned as plain strings instead. Using the token_name function to map the token codes to names, we can easily turn the array into something more readable.
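One possible way to do that mapping is sketched below. The helper name describe_tokens and the row layout are assumptions made for illustration only.

```php
<?php
// Hypothetical helper: turn raw tokenizer output into readable rows
// of [token name, matched text, line number].
function describe_tokens(string $source): array
{
    $rows = [];
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            $rows[] = [token_name($token[0]), $token[1], $token[2]];
        } else {
            // Plain-string tokens have no code, so the character itself serves as the name.
            $rows[] = [$token, $token, null];
        }
    }
    return $rows;
}

$rows = describe_tokens('<?php echo gettext("Hello");');
foreach ($rows as [$name, $text, $line]) {
    echo str_pad($name, 28), $text, "\n";
}
```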

Now you can see that we can find anything we want in the source code. If you use the gettext() translation technique, you just search for sequences of T_STRING/gettext followed by T_CONSTANT_ENCAPSED_STRING. Right? It is not that easy. We should find the scopes defined by textdomain and analyze each of them. Moreover, I guess you may use some other translation approach, e.g. Zend_Translate. In that case we also need to find the translator object declaration and all the calls to its translate method.
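To make the naive gettext() search concrete, here is a rough sketch. It deliberately ignores textdomain scopes, drops whitespace tokens before matching, and only picks up plain string literals; the function name find_gettext_terms is hypothetical.

```php
<?php
// Naive sketch: collect string literals passed directly to gettext().
// Assumptions: textdomain() scopes are ignored, and a method call such as
// $obj->gettext("x") would be picked up as well.
function find_gettext_terms(string $source): array
{
    // Drop whitespace tokens so spacing does not affect the match.
    $tokens = array_values(array_filter(
        token_get_all($source),
        fn($t) => !(is_array($t) && $t[0] === T_WHITESPACE)
    ));

    $terms = [];
    for ($i = 0, $n = count($tokens); $i + 2 < $n; $i++) {
        $name = $tokens[$i];
        $arg  = $tokens[$i + 2];
        if (is_array($name) && $name[0] === T_STRING && strtolower($name[1]) === 'gettext'
            && $tokens[$i + 1] === '('
            && is_array($arg) && $arg[0] === T_CONSTANT_ENCAPSED_STRING
        ) {
            // Strip the surrounding quotes; escape sequences are left untouched.
            $terms[] = substr($arg[1], 1, -1);
        }
    }
    return $terms;
}

$terms = find_gettext_terms("<?php echo gettext(\"Hello\"), gettext( 'Bye' ); echo strlen(\"x\");");
print_r($terms);
```

Terms built from variables or concatenation are skipped by this sketch, so a real report would need to list such calls separately for manual review.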

My solution is to create an iterator class that extends ArrayIterator and to give it a match method, which receives an array of tokens as a parameter. While iterating over the token array, match checks whether the given sequence of tokens has been encountered.
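The original class is not shown here, so the following is only a sketch of how such an iterator might look. The pattern format (an int token code, a one-character string, or a [code, text] pair) and the choice to match at the current position are assumptions.

```php
<?php
// Hypothetical sketch of the iterator; expects a list-style array (keys 0..n-1).
class Tokenizer_Iterator extends ArrayIterator
{
    /**
     * Check whether the tokens starting at the current position match $pattern.
     * Pattern items: an int token code, a one-character string such as '=',
     * or a [code, text] pair for an exact match.
     */
    public function match(array $pattern): bool
    {
        $start = $this->key();
        if ($start === null) {
            return false;
        }
        foreach (array_values($pattern) as $offset => $expected) {
            $index = $start + $offset;
            if (!$this->offsetExists($index)) {
                return false;
            }
            $token = $this->offsetGet($index);
            if (is_int($expected)) {
                $ok = is_array($token) && $token[0] === $expected;
            } elseif (is_array($expected)) {
                $ok = is_array($token) && $token[0] === $expected[0] && $token[1] === $expected[1];
            } else {
                $ok = $token === $expected;
            }
            if (!$ok) {
                return false;
            }
        }
        return true;
    }
}

// Usage: find variables that receive a new Zend_Translate instance.
$source = '<?php $t = new Zend_Translate("gettext", "de.mo"); echo $t->translate("Hello");';
$it = new Tokenizer_Iterator(token_get_all($source));
$translators = [];
foreach ($it as $token) {
    $pattern = [T_VARIABLE, T_WHITESPACE, '=', T_WHITESPACE, T_NEW, T_WHITESPACE, [T_STRING, 'Zend_Translate']];
    if ($it->match($pattern)) {
        $translators[] = $token[1]; // the variable name, e.g. '$t'
    }
}
print_r($translators);
```

Note how the pattern has to spell out every T_WHITESPACE token, and would miss `$t=new Zend_Translate(...)` written without spaces.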

I prefer, though, to remove all the T_WHITESPACE tokens from the array before passing it to Tokenizer_Iterator, so that the amount and placement of whitespace does not matter when matching. Once the object name is found, we can iterate over the array once again, this time looking for $obj->translate() occurrences.
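Both steps can be sketched with plain functions, as below. The helper names are hypothetical, the object name is assumed to be known from the previous pass, and only string literals passed as the first argument are collected.

```php
<?php
// Remove T_WHITESPACE tokens and reindex the array.
function strip_whitespace(array $tokens): array
{
    return array_values(array_filter(
        $tokens,
        fn($t) => !(is_array($t) && $t[0] === T_WHITESPACE)
    ));
}

// Collect terms from $objectName->translate('...') calls.
// Assumption: $objectName (e.g. '$t') was found in an earlier pass.
function find_translate_calls(array $tokens, string $objectName): array
{
    $terms = [];
    $n = count($tokens);
    for ($i = 0; $i + 4 < $n; $i++) {
        [$var, $arrow, $method, $paren, $arg] = array_slice($tokens, $i, 5);
        if (is_array($var) && $var[0] === T_VARIABLE && $var[1] === $objectName
            && is_array($arrow) && $arrow[0] === T_OBJECT_OPERATOR
            && is_array($method) && $method[0] === T_STRING && strtolower($method[1]) === 'translate'
            && $paren === '('
            && is_array($arg) && $arg[0] === T_CONSTANT_ENCAPSED_STRING
        ) {
            // Keep the line number so the report can point back at the source.
            $terms[] = ['term' => substr($arg[1], 1, -1), 'line' => $arg[2]];
        }
    }
    return $terms;
}

$source = <<<'PHP'
<?php
$t = new Zend_Translate('gettext', 'de.mo');
echo $t -> translate( 'Hello' );
echo $t->translate("Bye");
PHP;

$calls = find_translate_calls(strip_whitespace(token_get_all($source)), '$t');
print_r($calls);
```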

If you need to search within class or function scopes, traverse the array of tokens before creating the Tokenizer_Iterator and, on encountering an opening brace, keep adding its index to each array element as the id of the scope opener until the matching closing brace is found.
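A sketch of that scope tagging follows, assuming scopes are delimited by curly braces. The function name annotate_scopes and the 'token'/'scope' row layout are made up for the example.

```php
<?php
// Sketch: tag every token with the index of the innermost opening brace around it.
// A scope of null means the token sits at the top level.
function annotate_scopes(array $tokens): array
{
    $stack = [];
    $annotated = [];
    foreach ($tokens as $index => $token) {
        if ($token === '}') {
            // The closing brace itself belongs to the outer scope.
            array_pop($stack);
        }
        $annotated[$index] = [
            'token' => $token,
            'scope' => $stack ? end($stack) : null,
        ];
        // "{$var}" and "${var}" inside strings open with array tokens but close with '}'.
        $opensBrace = $token === '{'
            || (is_array($token) && in_array($token[0], [T_CURLY_OPEN, T_DOLLAR_OPEN_CURLY_BRACES], true));
        if ($opensBrace) {
            $stack[] = $index;
        }
    }
    return $annotated;
}

$source = '<?php function a() { $t->translate("in a"); } $t->translate("top");';
$annotated = annotate_scopes(token_get_all($source));

// Map each string literal to the scope it was found in.
$scopes = [];
foreach ($annotated as $row) {
    if (is_array($row['token']) && $row['token'][0] === T_CONSTANT_ENCAPSED_STRING) {
        $scopes[$row['token'][1]] = $row['scope'];
    }
}
var_dump($scopes);
```

Because each token carries only the innermost opener, nested scopes can be resolved by following the 'scope' value of the opening brace itself.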

When you have an array of all the terms encountered in the source code, you can match it against the terms of the dictionary and produce all the reports you want.
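The comparison itself can be as simple as two array_diff calls. In this sketch the function name coverage_report and the sample term lists are hypothetical.

```php
<?php
// Hypothetical inputs: the dictionary terms and the terms collected from the source scan.
function coverage_report(array $dictionaryTerms, array $usedTerms): array
{
    $dictionaryTerms = array_unique($dictionaryTerms);
    $usedTerms = array_unique($usedTerms);
    return [
        // In the dictionary, but never referenced in code.
        'unused'  => array_values(array_diff($dictionaryTerms, $usedTerms)),
        // Referenced in code, but absent from the dictionary.
        'missing' => array_values(array_diff($usedTerms, $dictionaryTerms)),
    ];
}

$report = coverage_report(
    ['Hello', 'Bye', 'Cancel'],
    ['Hello', 'Bye', 'Save', 'Hello']
);
print_r($report);
```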

What about non-PHP source code?

You can use the same technique to parse JS, CSS, Java or other source code files. Just trick the tokenizer by prepending a PHP opening tag to the code:

$tokens = token_get_all('<?php ' . file_get_contents('default.css'));

Who's the dude?

Dmitry Sheiko is a web-developer living and working in Frankfurt am Main, DE