Saturday, October 23, 2010

I just released pack 1.51 (located on SourceForge: http://sourceforge.net/projects/texlexan/files/ ); this new release corrects a bug (wrong integer size) that appeared when texlexan was compiled for 64-bit (I'm currently using Ubuntu 10.10, 64-bit).

I have started to work on the batch training programs (pack 2, programs: buildkeybase.c, globalkeybase, analysekeybase.c); these programs have not worked correctly since I modified the structure of the dictionaries (Feb 2010). I'm very busy this semester, so I'm afraid it will take time to release a robust revision of these programs.

Saturday, August 21, 2010

The objective is to speed up the classifier by using a GPU for the string matching and the computation of probabilities. But there are two big issues that I listed in my previous post:

- The number of concurrent threads is limited to 32.
- The memory bandwidth is about 30 times lower than the computation capability.

The straightforward solution is to use the GPU for what it is essentially designed: computing a large array of independent floating point values with exactly the same code, the same instruction executed at the same time, on many simplified processors. In this way, we will be able to use all 512 cores.

The simplest way is to transform the terms (strings of alphanumeric characters) into values. First step: a hash routine using a parallel algorithm [1] performs this transformation. Next step: the hash values of the terms in the dictionary are compared with the hash value of the searched term. Final step: all the products wi*xi are computed. Both comparisons and multiplications are performed in parallel.
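A minimal sequential sketch of these three steps (on the GPU, each comparison and each multiplication would be one parallel thread); hash_term and score_class are illustrative names, and the simple rolling hash merely stands in for the parallel hash algorithm of [1]:

```python
def hash_term(term):
    # Polynomial rolling hash: a stand-in for the parallel hash of [1].
    h = 0
    for c in term:
        h = (h * 31 + ord(c)) & 0xFFFFFFFF
    return h

def score_class(dictionary, document_terms):
    # dictionary: list of (term, weight); document_terms: {term: frequency}
    hashed_dict = [(hash_term(t), w) for t, w in dictionary]           # step 1
    hashed_doc = {hash_term(t): x for t, x in document_terms.items()}
    score = 0.0
    for h, w in hashed_dict:                                           # step 2: compare hashes
        x = hashed_doc.get(h)
        if x is not None:
            score += w * x                                             # step 3: wi*xi
    return score

print(score_class([("cat", 2.0), ("dog", 1.0)], {"cat": 3.0}))         # 6.0
```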

Tuesday, August 17, 2010

The idea is to use the most advanced GPU of nVIDIA to increase the speed of the classifier.

GPUs are the most widely available and most powerful multiprocessors for microcomputers. They are specifically designed for graphics applications, but nVIDIA's FERMI architecture is more flexible than the previous G80 / GT200 architecture. In fact, this new architecture has been designed to perform more efficiently and easily the massively parallel computations needed in imaging, flow simulation and financial prediction. This new architecture looks very interesting for the linear classifier used in TexLexAn, and should (theoretically) speed up the classifier by several orders of magnitude.

Before starting this project, there are several points to check: the specialized languages for parallel computing, how to redesign the classifier routine to take advantage of a GPU, the debugging tools, and the price of a graphics card based on the GTX 400 series.

1 - The specialized languages:

CUDA [1] and OpenCL [2] are both available for Linux, with the difference that CUDA is specific to nVIDIA while OpenCL is cross-platform and able to work with any GPU, multi-core CPU or DSP.

It seems that CUDA is better documented than OpenCL. But both languages look pretty close, so it should not be too painful to switch from one to the other.

In the current version of TexLexAn, terms are searched inside each line of the dictionaries with the simple linear search function strstr(). It is of course a very inefficient solution, but it offers 3 advantages:

A small memory requirement, because only one line of the dictionary (one class) is loaded in memory.

The possibility to view and edit the dictionaries with a simple text editor.

A simple and robust algorithm, because only one line of the dictionary (one class) is loaded in memory at once.
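For illustration, this strstr()-style lookup can be sketched as follows (in Python for brevity, with a simplified line format where the weight is written in decimal between the '/' and the '\' that delimit each term; the real on-disk format is described in the February posts):

```python
def find_weight(line, term):
    # Linear scan of one dictionary line (one class), like strstr() in C.
    needle = "\\" + term + "/"
    i = line.find(needle)
    if i < 0:
        return None                  # term not in this class
    j = line.rfind("/", 0, i)        # the weight sits between '/' and '\'
    return int(line[j + 1:i])

line = "1 sport:3/12\\ball/7\\goal/"
print(find_weight(line, "goal"))     # 7
```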

A more efficient structure that tries to preserve the advantages listed above has been discussed in my posts of February 2010 ( http://texlexan.blogspot.com/2010/02/speedup-strings-matching.html ).

2.1.4 - New routine:

The idea is to keep the current file structure of the dictionaries for its ease of editing and its relative robustness. The structure of the data will be modified during loading into memory: the whole dictionary will be loaded and converted into a new structure, more efficient for a parallel search.

2.1.4.1 - Rethinking the structure of the data.

2.1.4.1.1 - The ideal processor architecture:

The ideal architecture of the processor reflects the model to compute [1st paragraph]. Each cell stores one term and its weight, the code to compare the stored term with the searched term, and the code to compute bji*xji. Each row of cells contains the terms and weights of one class. The results of the cells in a row are added.

Each cell executes the same code; the differences between cells are limited to the stored terms and weights. Cells are strictly independent. These particularities will help to design a simplified "many-core" processor architecture. The "raw" specification could be:

Cells do not have to communicate with each other.

Cells work on the same data ( searched term and its frequency ).

Cells run their program independently.

All cells must finish before the next term can be processed.

Result is valid only when all cells have finished.

Only terms and weights change when a new dictionary is loaded.

Cells run the same program (string comparison and 64-bit multiplication).

The results of each row of cells are compared; the highest is accumulated with the previous results.

Cells will not run the instructions of the code strictly in the same order (different paths, depending on the terms compared), so the processor should have a MIMD architecture.
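A tiny software simulation may make this cell array concrete (pure illustration, not a hardware design; run_array is a hypothetical name). Every cell holds one (term, weight) pair, all cells receive the same searched term and its frequency, each row is one class, and the highest row result is kept, as specified above:

```python
def run_array(rows, term, freq):
    # rows: one list of (term, weight) cells per class
    row_results = []
    for cells in rows:
        # each cell runs the same code: compare its term, then multiply
        row_results.append(sum(w * freq for t, w in cells if t == term))
    return max(row_results)          # the highest row result is kept

rows = [[("cat", 2.0), ("dog", 1.0)],   # class 0
        [("cat", 5.0)]]                 # class 1
print(run_array(rows, "cat", 2.0))      # 10.0
```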

2.1.4.1.2 - The nVIDIA FERMI architecture:

Unfortunately, the ideal processor described above does not exist; its closest equivalent is the latest generation of GPU. It is important to understand the architecture of the GPU well and to take care of several limitations, such as the shared memory and the memory bandwidth.

- 16 SMs allow 16 different codes to run ( = 16 different kernels ), and 2 warps (groups of parallel threads) can run the same code concurrently but along 2 different paths. So only 32 threads with divergent paths can run concurrently; this is a severe limitation, since each string matching function has to run its code independently (same code following different paths).

- VRAM bandwidth: 177.4 GB/s => 44.35 Gwords of 32 bits or 22.175 Gwords of 64 bits, so the bandwidth is about 30 times (672/22.175) lower than the floating point capacity. At this point we know that the VRAM bandwidth is the second severe limitation.

- PCI bandwidth: 500 MB/s is so low compared to the VRAM bandwidth that data transfers between the central RAM or CPU and the graphics card must be strictly limited.

We know that we will have to solve two major problems: the relatively small number of possible parallel threads, and the memory bandwidth.

In the next post, we will discuss solutions to speed up the classifier with the GPU.

Saturday, August 14, 2010

Searching for information inside a database is a very common task in computer science. Very efficient algorithms have been developed for fast searching, such as those based on tree structures, hash tables or binary search. Many programs are designed to manage large databases very efficiently (dBase, FoxPro, Access, MySql...); of course, they use the fast search algorithms mentioned above.

Unfortunately, and very logically [1], fast search algorithms are based on strict matching, so they are not applicable to finding the closest match. Approximate search methods require comparing each element in the database with the searched element and retaining the element with the smallest difference. The comparison is performed by a function that measures the similarity or the difference between the compared elements [2]. This process is expensive [3]. Current processors have one to a hundred cores [4], so the comparison is repeated as many times as there are elements to compare in the database, divided by the number of cores.

To speed up the fuzzy matching, a simple idea could be a giant memory-coupled processor array (MCPA) structured like a neural network. Each element of the database is stored in one cell of the MCPA. The element to search is distributed over the whole MCPA. The processor in each cell compares its stored element with the element to search and returns a value indicating the difference. The element can be a string (corresponding to one word, one sequence of words or one sentence), a matrix of dots (a small portion of an image), or a short time series (a small portion of sound). The MCPA will contain the whole text, image, speech or song. The advantage is that only a few cycles will be required to search for the element and find its closest match.

An example:

If we are able to design an MCPA of 1 million cells of 64 bits with a clock at 500 MHz, then it will be large enough to store the whole Bible ( about 780,000 words ), where each cell contains one word represented by a 64-bit value [5]. It will be possible to find any misspelled word in just a few nanoseconds, say 10 ns.

Unfortunately, there are several problems. The first one is: how to load 1 million cells quickly? The memory bandwidth is the main bottleneck; if the data bus has 64 lines, it will take 1 million cycles to load the whole chip. 1 million cycles at 500 MHz represent 2 ms, so the loading is 200,000 times longer than the search! If we use both edges of the clock signal and a data bus of 128 lines, it will still take 50,000 times longer to load the data than to compare them. The Library of Congress has 29,000,000 volumes. How long would it take to load these books? Perhaps something between 4 and 8 hours! The solution is to use thousands of MCPA chips and to fill up each chip with several books.
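The loading-time arithmetic above, spelled out as a quick check:

```python
cells = 1_000_000
clock_hz = 500e6
load_time = cells / clock_hz            # 64-bit bus, one word per cycle -> 2 ms
search_time = 10e-9                     # the assumed 10 ns search
print(round(load_time / search_time))   # 200000: loading dominates
```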

The second problem is the complexity of the chip in terms of number of transistors. If each cell is a tiny 64-bit RISC processor requiring 70,000 transistors (estimated), plus 10,000 transistors for the tiny EEPROM and SRAM memories, the whole chip will contain at least 80 billion transistors. That is much more than current processors (the Itanium 9300 contains 2 billion transistors). 80 billion transistors in a chip is considerable, but should be possible in the near future, probably around 2020.

The third problem is the price. This kind of chip will cost thousands of dollars in its first years. Is it acceptable to spend so much just for fast fuzzy pattern or string matching? In fact, there is probably no other possibility; with current silicon technology, it is difficult to push the speed of a processor far above 3 GHz, so we have only one solution: create a giant array of processors in one chip. Because the transfer speed of data between the memory and a large multi-core processor is a limitation, the solution is to integrate the memory and the processor into a cell: a memory-coupled processor.

The complexity of each cell memory-coupled processor is due to the separation of the storage of the information (the memory) and the processing of the information (the CPU). This structure requires many communication lines between the CPU and the memory, and some extra logic to manage the flow. The ideal solution would be to integrate the memory and the logic in a single component. The crossbar latch [6], constituted of memristors [7], seems an interesting way to combine the storage and the processing of the information.

[1]: A fast search algorithm tries to minimize the number of data compared, ideally to none: just find the exact match on the first shot, which is what a perfect hash table is expected to allow. In fact, it is exactly the opposite of a fuzzy search algorithm.
[2]: Edit distance of 2 strings, such as the Levenshtein distance; Hausdorff distance (smallest euclidean distance between point sets).
[3]: Expensive in number of cycles, in energy and in time. The algorithms that compute the distances often require O(m.n) cycles.
[4]: The Tile Gx processor ( TILEPro64 ) contains up to 100 identical 64-bit cores connected in a square matrix. The graphics processor GeForce GT200 can run 30,720 threads. The Fermi architecture http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf is far more flexible than the G80 / GT200 architecture. An application to the classifier TexLexAn will be discussed in a next post.
[5]: 64 bits (~ 1.84x10^19) is large enough to prevent the risk of collision (2 different words that a hash function converts to the same value).
[6]:http://www.hpl.hp.com/news/2005/jan-mar/crossbar.html
[7]:http://www.hp.com/hpinfo/newsroom/press/2010/100408xa.html

Thursday, June 24, 2010

This is not the most complex task. The main goal is to remove the meaningless information.

For instance: articles, very common words and formatting characters of a text document are removed. High or low frequency sounds of low intensity, masked by a middle frequency sound of an audio recording, are ignored. Edges of a still image are detected and extracted. Motion in video is detected and quantified.

In the case of text, we use a small dictionary containing the words to suppress, and a simple algorithm to suppress paragraph numbers, tabulations, indentation marks... The words can be simplified: plurals can be converted to singular, or words stemmed or lemmatized...

In the case of audio, we use a simultaneous and temporal masking codec to suppress sounds that the ear cannot discriminate.

An additional step in the data formatting is to compute the relative values of the parameters in order to create a kind of invariant pattern. For instance, we compute the relative frequency of words (1), the relative frequency and amplitude of sounds (2), the relative size and position of shapes (3). This operation makes the information to analyse independent of the size of the text (1), of the pitch and volume of the speaker (2), and of the distance of the objects in the image (3).
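As a small illustration of case (1), relative word frequencies can be computed with a few lines of standard-library Python (the function name is illustrative):

```python
from collections import Counter

def relative_frequencies(words):
    # Raw counts divided by the text length: the result no longer depends
    # on the size of the text, only on the proportions of the words.
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

print(relative_frequencies(["a", "b", "a", "a"]))   # {'a': 0.75, 'b': 0.25}
```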

Sunday, June 20, 2010

In my previous post, I tried to explain that a large knowledge base with the appropriate algorithms could mimic our brain reasoning. But what could be these algorithms?

In a naïve approach, the simplest form of reasoning program will have to:
1 - format the input facts.
2 - retrieve similar or related facts in the knowledge base.
3 - retrieve rules linked with the facts (retrieved) in the knowledge base.
4 - search for synonyms and repeat steps 2 and 3.
5 - check for inconsistencies in the retrieved facts and rules and mark them "check it!"
6 - make inferences between facts and rules
7 - recycle the conclusions into step 1 (the conclusion becomes an input fact) until the conclusion matches our criteria.

This is a very rough description of the different steps and algorithms required to mimic our brain's reasoning. I will detail each step in future posts.
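As a toy sketch of this loop, here is a minimal forward-chaining skeleton (illustrative only: facts are plain strings, rules are (premises, conclusion) pairs, and steps 4 and 5, synonyms and coherence checking, are left out):

```python
def reason(facts, rules, goal, max_cycles=10):
    facts = set(facts)                       # step 1: formatted input facts
    for _ in range(max_cycles):              # step 7: recycle the conclusions
        new = set()
        for premises, conclusion in rules:   # steps 2-3: facts and linked rules
            if set(premises) <= facts and conclusion not in facts:
                new.add(conclusion)          # step 6: inference
        if not new:
            break
        facts |= new
        if goal in facts:
            return True
    return goal in facts

rules = [(["a pear is a fruit", "fruits fall"], "a pear falls")]
print(reason(["a pear is a fruit", "fruits fall"], rules, "a pear falls"))  # True
```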

Friday, June 18, 2010

We are able to estimate the risk of receiving the fruit on our head because of the observations accumulated since our childhood. As young kids, without any knowledge about gravity, we know that any object between our fingers will fall when we release it. Later, we learn that fruits fall from the tree when it is windy or when they are ripe. If a program has this knowledge too, logically it will be able to estimate the risk of receiving the fruit. The reasoning will be a succession of inferences. For example, the database contains these facts and rules (the probability that the fact is true is indicated in percent):

1 - 100% - A pear is a fruit.
2 - 100% - A pear grows in the top of a tree.
3 - 100% - The top of a tree is above the ground surface.
4 - 100% - A thing above the ground surface falls when released.
5 - 100% - A pear is a thing.
6 - 80% - A fruit is released when it is ripe.
7 - 60% - A pear is ripe in October.
8 - 100% - We are in October.

The inferences are:

(1,6 => 9) A pear is a fruit + A fruit is released when it is ripe = A pear is released when it is ripe.

(9,7,8 => 10) A pear is released when it is ripe + A pear is ripe in October + We are in October = A pear is released (0.8 x 0.6 x 1 = 0.48).

(4,5 => 11) A thing above the ground surface falls when released + A pear is a thing = A pear above the ground surface falls when released.

(11,10 => 12) A pear above the ground surface falls when released + A pear is released = A pear above the ground surface falls (1 x 0.48 = 0.48).

(2,3 => 13) A pear is in the top of a tree + The top of a tree is above the ground surface = A pear is above the ground surface.

So the set of facts and rules above leads to the conclusion that a pear could fall, with a probability of 48%.
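The chain of inferences above can be replayed numerically; in this sketch, the probability of an inferred fact is simply the product of the probabilities of its premises, which is the simplification used in the example:

```python
# Facts 1-8 with their probabilities (1.0 = 100%).
facts = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 0.8, 7: 0.6, 8: 1.0}

def infer(new_id, *premises):
    p = 1.0
    for q in premises:
        p *= facts[q]
    facts[new_id] = p

infer(9, 1, 6)        # a pear is released when it is ripe (0.8)
infer(10, 9, 7, 8)    # a pear is released (0.8 x 0.6 x 1 = 0.48)
infer(11, 4, 5)       # a pear above the ground falls when released (1.0)
infer(13, 2, 3)       # a pear is above the ground surface (1.0)
infer(12, 11, 10)     # a pear above the ground surface falls
print(round(facts[12], 2))   # 0.48
```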

This short example shows that software is capable of reasoning, but it requires an extensive and precise factual and procedural knowledge base.

But what happens when the knowledge base is incomplete, a fact is wrong or uncertain, or different words are used for the same thing?

For example:

a) What if fact 6, "A fruit is released when it is ripe", is missing?

b) What if fact 7 is wrong: "A pear is ripe in June"?

c) What if facts 3 and 4 use two different terms, soil and ground surface, for the same thing: "The top of a tree is above the soil" and "A thing above the ground surface falls when released"?

As for a human, it will depend on the quantity of knowledge stored in the database.

For problem (a), the software can proceed by analogy; for instance, the 2 facts below lead to the hypothesis that a pear could fall like an apple:
- An apple falls from the tree when it is ripe.
- Apples and pears grow on tree.

For (b), the software can detect an error in the knowledge base if, for instance, the knowledge base contains these facts:
- Anjou, Bartlett, Bosc, Comice are pear varieties.
- Comice pears are harvested September through February.
- Bosc pears are harvested September through April.
- Bartlett pears are harvested July through October.
So fact (b) is illogical with regard to the facts listed above; the program will warn that there is an inconsistency in its knowledge base and ask for fact (b) to be checked.

For (c), the software can search for synonyms of soil and ground, and then decide that both words designate the same thing.

Like a human, a program should be able to tolerate some inconsistency in its knowledge base and ask for verification of some weird facts.

But what kind of algorithms are required for this pseudo-human reasoning ?

Thursday, June 17, 2010

My work to improve the dictionaries of TexLexAn (in fact, its knowledge base) brings me to a very basic question: what is knowledge?

We can distinguish two kinds of knowledge:

The procedural knowledge or the rules to do things.
The declarative knowledge or the facts.

In the computer world, procedural knowledge is represented by rule-based expert systems, such as online diagnosis programs, and declarative knowledge is the foundation of databases, the phone directory being the simplest form.

Today, both forms of knowledge are managed by two completely different kinds of programs, but our brain does not work like that! It seems evident that facts and rules are intimately mixed. There is a good reason for it: we construct our rules from our observations (the facts). These rules can be very dependent on the observations, but we try to generalize them. We call this inductive reasoning.

We can imagine this funny scenario: in the middle of October, while we are reading under an apple tree, we receive an apple on our head, so we decide to move under another tree, in fact a pear tree. Pears are not apples, but we are cautious because we infer that if one apple fell, there is a chance that a pear can fall too. Can a program make the same hypothesis?

Saturday, May 1, 2010

I am currently thinking about a new kind of wiki especially designed for education. The students will be the main contributors and the main readers too. A tool able to extract some statistics about a text, to classify it and to automatically provide a short summary would be useful for the instructor and the reader alike.
I think that TexLexAn, after some modifications (to make it able to understand the markup language of the wiki), will be able to do the job.

Wednesday, April 7, 2010

I could call it "BAMFAQ"; the idea is to create a database of the most frequent answers to the most frequent questions. It is common to see the same questions repeated again and again in different forums.

- The first step is to crawl the different forums and collect all the questions and their answers.
- The second step will be a classification of the questions in order to group similar questions together. The classification will be unsupervised (clustering). So now we have groups of similar questions with their answers.
- The third step will be the extraction of the 2 or 3 most frequent answers for each group of questions. Our database is almost ready: for each group of similar questions, we have the 2-3 most frequent answers, which we can suppose to be the best answers.

Wednesday, March 3, 2010

The robustness of a data structure is an important point when a large amount of data has to be stored on disk for a long period of time. Current hard disk reliability is high ( 600,000 hrs < MTBF < 1,200,000 hrs ), and some array configurations of hard disks increase the overall reliability further (RAID 6 for instance). Anyway, the risk is non-zero, and it is interesting to verify whether it would be possible to recover the data in case of a hard disk failure.

There are 3 dictionaries with the same structure, for the word unigrams, digrams and trigrams.

Hypothesis: one or a few consecutive bytes are corrupted.

- The sequence "label_index label_name:" is repeated exactly in each dictionary. It is easy to rebuild this sequence from another dictionary by searching for the closest match. If the probability of having a problem in this sequence is P, and the probability is the same for each dictionary, then the probability of an unrecoverable sequence "label_index label_name:" is P^3. P^3 is a tiny probability, and furthermore it will always be possible to correct the label_name manually.

- The value k can easily be computed again from the number of terms in the dictionary divided by the number of terms in the class. So the risk is not here.

Finally, the biggest risk comes from the list of terms of each class, because it represents the largest part of the data.

/S1/w1L1\TERM1/w2L2\TERM2...
- The sequence /Si/ takes 6 bytes; the value Si takes 4 bytes and can be computed again by searching for the next leading digram and counting the number of characters.
- The sequence /wjLj\ takes 6 bytes. wj takes 2 bytes and can be roughly estimated from the previous weight wj-1 and the next weight wj+1, if the terms with the same leading digram are sorted on the weight w. Lj takes 2 bytes and is the length of TERMj; it can be computed again.
So the sequence /Si/wjLj\ does not represent a big risk either.

- The chain \TERMj/ has a variable length, given by Lj + 2. In many cases, this term belongs to other classes inside the dictionary, so it will be possible to retrieve it by searching for its closest match (lowest edit distance) inside the dictionary.
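A sketch of this recovery by closest match, using the classic Levenshtein edit distance (the function names are illustrative):

```python
def levenshtein(a, b):
    # Dynamic programming over two rows of the edit-distance matrix.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def closest_term(damaged, terms):
    # Propose the dictionary term nearest to the damaged one.
    return min(terms, key=lambda t: levenshtein(damaged, t))

print(closest_term("hert", ["heavy", "hertz", "heat"]))   # hertz
```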

In conclusion:
This structure seems intrinsically robust, with enough redundancy to recover from any error affecting one byte or a few consecutive bytes.

Note: I am starting to modify the dictionaries and the programs. They will be in the new branch of the repository /branch/new_dico_struct/... Be patient, it will take time to modify and validate the code.

Sunday, February 28, 2010

The computation of the jump value of each leading digram has a cost (my post of Feb 17), so when the number of terms with the same leading digram is too small, it will be more expensive to compute the jump to the next leading digram than to compare each term. This conclusion brings two ideas:

- The dictionary should have two structures: terms sorted and grouped on their leading digram when the number of terms with the same digram is large, and terms unsorted and ungrouped when this number is small.
- A calibration routine will compute the threshold value used to decide when to switch between the two structures. This calibration will only run the first time the program is executed.

where S1 is the length of the substring "/w1L1\TERM1/w2L2\TERM2.../wiLi\TERMi/",
S2 is the length of the substring "/wi+1Li+1\TERMi+1..../wi+jLi+j\TERMi+j/",
etc. for S3, S4... Sn. The sequence /0/ marks the end of the grouped leading digrams.

The difference from the structure presented in my post of Feb 17 is the /0/ sequence of characters. The value 0 indicates the end of the grouped terms; past it, a routine will have to compare each term one by one rather than comparing just the leading digram of the first term of each group.

2 - New routines:

To keep the program simple, the grouping of terms with the same leading digram will be done by a third program, independent of the classifier and the learner.

Simplified flowchart:

The new program, named 'smartgroup', will reorganize the dictionaries. It will be triggered after several dictionary updates. It runs independently of Texlexan and Lazylearner.

New search algorithm:
(pseudo code, Python style with goto):

GROUP: while not EOL do:
           get length S of the group of digrams
           if length S is 0: goto BULK
           compute the index of the next group
           get the digram of the first term
           if digram of the first term is the digram searched: goto TERM
       else: goto NO_FOUND

TERM:  while not EOL do:
           get length L of the term
           compute the index of the next term
           get the term
           if term is the term searched: goto FOUND
           if index of the next term is at the end of the group: goto GROUP
       else: goto NO_FOUND

BULK:  while not EOL do:
           get the length L of the term
           compute the index of the next term
           get the digram
           if digram is the digram searched:
               get the term
               if term is the term searched: goto FOUND
       else: goto NO_FOUND

FOUND: get the weight; return the weight ( exit of this routine )

NO_FOUND: return 0 ( exit of this routine )

This algorithm is divided into three parts:
- GROUP: jump from digram group to digram group while the first digram read does not match the searched digram.
- TERM: jump from term to term inside the group of the same leading digram while the term does not match the searched term.
- BULK: jump from term to term as soon as the value of S is zero.
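A runnable sketch of the same idea (Python; the exact /S/wL\TERM/ byte encoding is replaced by an in-memory list of groups, which keeps the skip logic visible without the index arithmetic):

```python
def build_groups(entries):
    # entries: list of (term, weight) -> list of (leading digram, members)
    groups = {}
    for term, w in entries:
        groups.setdefault(term[:2], []).append((term, w))
    return list(groups.items())

def search(groups, term):
    digram = term[:2]
    for group_digram, members in groups:   # GROUP: one comparison per group
        if group_digram != digram:
            continue                        # skip the whole group at once
        for t, w in members:                # TERM: scan inside the group
            if t == term:
                return w                    # FOUND
    return 0                                # NO_FOUND

groups = build_groups([("theory", 3), ("thesis", 5), ("hertz", 7)])
print(search(groups, "thesis"))   # 5
```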

Speed optimization (multi-core processor):
The computation of the index of each group, the computation of the index of each term, and the comparison of digrams or terms can be done in parallel. The code should eliminate race conditions ( in case the comparison is faster than the computation of the indexes ).

Table 1 gives the 60 most frequent digrams, which represent about 55% of the words.

Now,
if n is the computation cost of comparing 2 digrams, N is the number of terms in a class with the same leading digram, and m is the computation cost of the new index pointing to the next digram group,
then the new structure is interesting if m < N * n, i.e. m/n < N

What are the values of N, m and n?
The values of m and n depend on the processor, the programming language and the code optimization, but we can estimate N for each digram pretty easily.
One class can have about 10,000 terms.
If we make the naïve assumption that all combinations are possible, the 26 letters give 676 digrams; with the second naïve assumption that the words are evenly distributed, the number of words for each digram is 10,000/676 = 14.79.
We conclude (naïvely) that the index computation must not be more than 14.79 times slower than the digram comparison for this dictionary structure to be interesting.
In fact, the words are not evenly distributed: table 1 shows that the leading digram 'TH' represents 3.15% of English words, so a class of 10,000 terms will contain about 315 terms beginning with 'th' (this is the best case). The worst case in our table, 'LY', will represent 47 terms. The problem is that our table covers only 55% of the words. Of course, it would be better to find a complete table, but we can continue with a rough approximation:
If we assume there are 676 possible digrams and table 1 gives the 60 first, then the remaining 45% of the words are distributed over the other 616 digrams. The average frequency is 0.45/616 = 0.00073 (0.073%), so for a class of 10,000 terms, the average is N = 7.3 words per digram.
This result suggests that the new structure will be efficient for the most frequent leading digrams (roughly the first 60~100 digrams) and will be penalizing for the large group of rare digrams.
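The estimates above, spelled out (the digram frequencies are the assumed values from table 1):

```python
terms_per_class = 10000
print(terms_per_class / 676)             # ~14.79: even-distribution estimate
print(terms_per_class * 0.0315)          # ~315 terms for 'th' (best case)
print(terms_per_class * 0.45 / 616)      # ~7.3 terms for an average rare digram
```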

The new idea is to switch from the new structure (and algorithm) to the old structure (and old algorithm) past the most frequent digrams.

Monday, February 15, 2010

In the previous post, I presented a new structure for the dictionary that should improve the classification speed by 40%. In spite of this improvement, the classification will remain too slow when the dictionaries grow. In this post, I will present another solution, theoretically more performant.

1 - The objectives

- To find any alphanumeric term in the dictionary and get its weight and its class. The term can belong to several different classes and have different weights.
- The dictionary must be easily viewable and editable with a basic text editor.
- The dictionary must be robust enough to be repaired in case of data corruption.

2 - Solution: Dictionary structure

I proposed this structure in my previous post:

label_index label_name:K/W1L1\TERM1/W2L2\TERM2..../WiLi\TERMi/EOL

where Wi is the weight and Li the length of TERMi.
Li is used to compute the position of the next TERM to compare with the searched term, so the comparison is limited to TERMi and the searched term.
The proposed algorithm sequentially gets the length Li, compares TERMi with the searched term, computes the index of TERMi+1, and jumps to TERMi+1.

3 - Optimization

The comparison operation and the computation of the new index can be treated in parallel pretty easily.

let index = position of the first slash

do
{
    thread
    {
        let length = val(s1[index+1], s1[index+2]);
        index2 = index + length + 5;
    }

    if (integer)(s1[index+3,index+4]) equals (integer)(s2[0,1])
    {
        wait for end of thread;

        if s1[index+3,index+3+length] equals s2
        {
            let weight = val('0', s1[index-1]);
            return weight;
        }
    }

    wait for end of thread;
    index = index2;
}
while s1[index] does not equal EOL;

Note: val() converts 2 chars (0...F) into an integer (0...255)

In the algorithm above, the first thread computes the index of the next substring TERMi+1 to compare, while the second thread compares the current substring TERMi with the searched term in s2.
The instruction (integer)(s1[index+3,index+4]) equals (integer)(s2[0,1]) compares the first 2 bytes of TERMi with the first 2 bytes of the string s2. It is a very fast operation between two integers; it is interesting because the probability (*) that this comparison returns true is low, so the slower string comparison "if s1[index+3,index+3+length] equals s2" will rarely run.

(*) Low probability that the first two letters of a word match the first two letters of another word. Explanation:
English words have leading digrams with these frequencies:

th 3.15% he 2.51% an 1.72% in 1.69% er 1.54% re 1.48% es 1.45% ...

As a consequence, 3.15% of the terms in the dictionary will start with "th", and there is a 3.15% chance that the searched term s2 starts with "th" too.
Hence, the probability that s2 and a given dictionary term both start with the most frequent digram "th" is only 0.0315 * 0.0315 ~ 0.001, or 0.1%.

Finally, there is a very low probability that we will have to compare the other characters following the digrams. Because there is a 99.9% chance that the digrams will not match, and because we compare the digrams simultaneously, we can consider that we have an O(m+n) string searching algorithm. The limitations are to code each letter on one byte (plain ASCII) and to exclude terms of length < 2.

4 - Sorting

Terms in each class of the dictionary have different probabilities of occurring. Intuitively, we understand that if the terms with the highest probabilities are at the beginning of the string s1 and the terms with the lowest probabilities are at the end, and if, of course, we scan for the term from left to right, then we have a better chance of finding the term faster. The weight of each term represents the frequency of the term in its class, so we just need to sort the terms of each class by their weights. Intuitively, however, we can imagine that the gain cannot be important, because a significant term cannot strongly belong to all the classes of the dictionary.

Other solution:

Sorting the terms of the dictionary by the inverse of the frequency of their leading digrams is probably more interesting. A term in s2 with a frequent digram will be found faster than a term with a rare digram. We can significantly increase the speed if we skip all the terms with the same leading digram as soon as the first test shows that it does not match. To do that, we just need to know the length of the substring of s1 containing the terms with the same leading digram.

If we search for the weight of the word "hertz", the first digram comparison ("th" of theobromine with "he" of hertz) will not match, so we jump directly to the next group of digrams (heavy, heat, hertz). The advantage is that we do only one comparison per digram group until both digrams match.

Now we just have to imagine a smart algorithm to do this job efficiently.
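One possible shape for that algorithm, sketched in C under the assumption that the dictionary line has been pre-split into groups of terms sharing a leading digram (the structure below is hypothetical, only to illustrate the one-comparison-per-group skip):

```c
#include <string.h>

/* Hypothetical group index: each entry covers the substring of the
   dictionary line holding all terms with one leading digram, so a
   failed digram test skips the whole group in a single comparison. */
struct digram_group {
    char digram[2];       /* leading digram of every term in the group */
    const char *terms;    /* e.g. "heavy/heat/hertz/"                  */
};

/* Return the substring of terms starting like s2, or NULL. Only
   inside the returned group do we compare full terms. */
static const char *find_group(const struct digram_group *g, int n,
                              const char *s2)
{
    for (int i = 0; i < n; i++)
        if (g[i].digram[0] == s2[0] && g[i].digram[1] == s2[1])
            return g[i].terms;
    return NULL;
}
```
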

Friday, February 12, 2010

The structure of the dictionaries is very simple. Because TexLexAn is essentially experimental, the dictionary structure was designed to be easily viewable/editable with any basic text editor and to work without complication with the C function strstr(). The consequence of this choice is a poor search speed. The classification becomes particularly slow when the dictionaries grow in number of classes and number of words.

The idea is to improve the structure of the dictionary without losing the ability to view and edit it easily.

The current structure of the dictionaries 'keyworder.lan.dicN' is a set of classes, where each class is defined by a single line of any length:

j class_label:kj/w1\n-gram1/....../wi\n-grami/
where j is the class index, kj is the constant of the class j, and wi is the weight of the n-gram i.

The function strstr(s1,s2) is used to search a term (n-gram) inside each line of the dictionary.
The searched term is delimited with one backslash at the beginning and one slash at the end: \searched n-gram/ . So the string s1 contains the line "j class_label:kj/w1\n-gram1/....../wi\n-grami/" and s2 contains the searched term: "\searched n-gram/"

Of course, it is a very simple solution, but it allows a simple and robust algorithm and makes it very easy to search for the root of a word. For example: the term 'power' delimited as "\power" will be found in the dictionary line "/w1\powerful/.....". This basic solution requires only a fast stemming operation on the searched term.
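For illustration, the current lookup can be sketched in a few lines of C (the weight parsing is simplified to the one-digit weights used in the examples of this post):

```c
#include <string.h>

/* Look up a term in a dictionary line such as
   "/8\powerful/5\system/9\thermal/" and return its one-digit weight,
   or -1 when the term is absent. A searched term is passed delimited
   "\term/", so passing "\power" also matches roots like "powerful". */
static int lookup_weight(const char *line, const char *delim_term)
{
    const char *p = strstr(line, delim_term);
    if (p == NULL || p == line)
        return -1;
    return p[-1] - '0';    /* the weight digit sits just before the '\' */
}
```
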

The biggest inconvenience of strstr() in our case is that the backslash '\' of s2 is searched inside the n-grams and weights of s1. Example: for the string s2 "\thermal/" and the string s1 "/8\powerful/5\system/9\thermal/", the backslash '\' will be searched all along "powerful/5" and "system/9", which is a waste of time because we know it cannot be present in the n-grams and weights.

One solution is to store the length of each term/n-gram and to use this length to compute the position of the next term/n-gram.

Example of structure: j class_label:kj/w1l1\n-gram1/....../wili\n-grami/ , where the length li of each term is coded on 2 digits.
Our previous string s1 becomes:/808\powerful/506\system/907\thermal/

Another example of structure:j class_label:kj/w1\l1n-gram1/....../wi\lin-grami/ where the length li of each term is coded on 2 digits too.

Our previous string s1 becomes:

/8\08powerful/5\06system/9\07thermal/

For both examples, the term "powerful" has a length of 8 characters, so we know the next term is at position i = i + 8 + 5 (the value 5 accounts for the sequence /808\ of the first structure or /8\08 of the second structure).

For the second example, the search algorithm (simplified) could be:

    let index = 0;
    while s1[index] does not equal '\'
        let index = index + 1;

    do
    {
        let length = val(s1[index+1], s1[index+2]);
        if s1[index+3, index+3+length] equals s2
        {
            let weight = val(s1[index-1]);
            return weight;
        }
        else
            index = index + length + 5;
    }
    while s1[index] does not equal EOL;

Note 1: The first loop is not required if we take care to store the length of the class label.

Note 2: If we want to keep the easy root or stem search, it is better to choose the first example as the structure of the dictionary: "/808\powerful/506\system/907\thermal/". The algorithm described above stays almost the same, but we keep the possibility to search for the root of a word, for example "\power".
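A C sketch of this search over the second structure, with two small additions compared to the pseudocode: the loop is bounds-checked so the index cannot run past the line, and it stops at the terminating '\0' as well (the function name is illustrative):

```c
#include <string.h>

/* Search s2 in a line shaped "...kj/w\llterm/w\llterm/..." where w is
   a one-digit weight and ll the two-digit length of the term.
   Returns the weight, or -1 when the term is not in the line. */
static int search_weight(const char *s1, const char *s2)
{
    size_t n = strlen(s1), m = strlen(s2);
    size_t i = 0;

    /* skip the "j class_label:kj" part up to the first '\' */
    while (i < n && s1[i] != '\\')
        i++;
    if (i == 0 || i >= n)
        return -1;

    while (i + 2 < n) {
        size_t len = (size_t)(s1[i + 1] - '0') * 10
                   + (size_t)(s1[i + 2] - '0');
        if (len == m && i + 3 + len <= n
            && memcmp(&s1[i + 3], s2, len) == 0)
            return s1[i - 1] - '0';   /* weight digit before the '\' */
        i += len + 5;                 /* jump to the next '\' */
    }
    return -1;
}
```

This variant matches exact terms only; keeping the root search would mean comparing s2 as a prefix of the stored term, as in Note 2.
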

Search speed and Gain:

The simple search solution strstr(s1,s2) costs, in the worst case, L1 + N1*L2 byte-by-byte comparisons, where L1 is the length of s1, N1 is the number of '\' (equivalent to the number of terms in s1) and L2 is the length of s2.

The algorithm given above decreases the search cost to N1*L2 comparisons, but requires computing an index based on the length of each term in the dictionary.
We can estimate that the computation of the index costs the equivalent of 5 comparisons, so the total cost of the algorithm is 5*N1 + N1*L2, or N1*(5 + L2), always in the worst case.

The gain for the dictionary of single words (or unigrams) is not very important. If we consider that the average length of a word is 8 characters and that 3 characters are used to delimit the weight and the word ( .../w\term... ), then L1 = (8 + 3) * N1.
In consequence, the gain is just (8 + 3) * N1 + N1 * L2 - (5 * N1 + N1 * L2),
simplified: Gain = 6 * N1.

Practically, if we have a text of 1,000 filtered words and a small dictionary of 1,000,000 terms (200 classes of 5,000 terms), then the maximum gains are 6,000,000,000; 15,000,000,000 and 24,000,000,000 comparisons for the unigram, digram and trigram dictionaries. If the comparison routine of 2 unsigned bytes takes 1 ns, then the gains are 6 s, 15 s and 24 s.

Of course, these results are for the worst case, where no term of the document is found in the dictionaries and, furthermore, the terms differ only on their last character. This is practically impossible, but it gives an idea of the search efficiency of the algorithm.

We can see that the new algorithm improves the search speed by about 40%.

Because it will be exceptional that the searched term and the term in the dictionary differ only on the last character, we can say that on average only the first half of the term will be compared, so the search duration can be divided by 2.
The search duration of 1,000 terms in dictionaries of 1,000,000 terms can then be estimated at:
Unigrams: 6.5s
Digrams: 11s
Trigrams: 15.5s

Note: The classification algorithm of TexLexAn is based on the unigram, digram and trigram searches, so we can estimate that the classification of a text containing 1,000 words with our small dictionaries of 1,000,000 terms will take at least 33 seconds, which is still pretty long!

Conclusion: The new structures of the dictionaries will improve the classification speed significantly; a speedup of 40% can be expected. But the classification will stay too slow, particularly when the dictionaries grow in number of classes and terms per class. Finally, a more sophisticated structure and algorithm will have to be developed.

Sunday, February 7, 2010

I explained in the last post that the constant Bc depends on the size of the training set; more precisely: Bc = Log(P(C)/P'(C)).

Because P(C) = Nc/Nt and P'(C) = 1 - Nc/Nt,
we have Bc = Log(Nc/(Nt-Nc)), where Nc is the number of documents of class C and Nt the total number of documents, all classes included.

The sizes of the documents are very different, so it seems more correct to use the number of words rather than the number of documents. We then have this relation:

Bc = Log(Wc/(Wt-Wc)) , where Wc is the number of words in documents of class C and Wt is the number of words all classes included.

This second relation is better but still not perfect, because it does not take into consideration the updating frequency of each class. A better solution is a combination of the two relations above:

Bc = Log(Nc/(Nt-Nc) * Wc/(Wt-Wc) / 2)

Wc and Wt are the numbers of words in the unfiltered documents, but not all words of the documents participate in the classification. Furthermore, the classes of the dictionary do not contain the same number of words. Depending on the lexical richness of the class and on the training set used, the number of n-grams of a dictionary class may vary a lot. Intuitively, we can say that the probability to classify a document in the class C increases when the number of n-grams present in the dictionary class C is high.

A better estimation of Bc should probably include the size of each class of the dictionary. Below is an average of the three ratios:

Bc = Log(Nc/(Nt-Nc) * Wc/(Wt-Wc) * Gc/(Gt-Gc) / 3), where Gc is the number of n-grams in the class C of the dictionary and Gt is the total number of n-grams in the dictionary.

Saturday, February 6, 2010

We saw in the previous post that the Naive Bayes model is a linear classifier in the log space: Score= ∑ Wci*Xi + Bc

The weight Wci of each term i in the class C is estimated during the training of the classifier. It is the job of the program Lazylearner.

The term Bc is logarithmically proportional to the number of documents (*) of class C used to train the classifier: Bc ~ Log(Nc/(Nt-Nc)) (Nc is the number of documents of class C, Nt the total number of documents).

(*) The documents are supposed to have the same size.

Because we just look for the highest score to assign the class label to the document, we can forget Bc if we take care to train the classifier with almost the same number of documents for each class. Unfortunately, it is generally impossible to train the classifier evenly. The consequence of training the classifier with more documents for one class than for the others is an emphasis on the classes with the largest number of training documents.

The new version of texlexan (pack 1.47) tries to compensate for the size inequalities in the training set. The model used to classify the documents includes the constant Bc; in consequence, the dictionaries are modified and completed with these constants (one constant for each class). These constants are computed by Lazylearner from the sizes (numbers of words) of the documents used to train the classifier.

The number of words has been chosen rather than the number of documents because the sizes of the documents are often very unequal.

Note: The dictionaries keyworder.'lan'.dic'N' have changed but stay compatible with the previous versions.

Sunday, January 24, 2010

The classifier and the summarizer are based on a purely statistical approach. For the classifier, the text is viewed as a bag of words and short sequences of words. For the summarizer, each sentence is viewed as a bag of words.

The order of the words does not have any importance in the case of word unigrams. In the case of bigrams and trigrams (sequences of 2 and 3 words), the order of the words plays a role within the limited window of 2 or 3 consecutive words.

TexLexAn is configured by default to classify a text based on its unigrams, bigrams and trigrams. TexLexAn can use 4-grams, 5-grams and more (just add the option -n, for instance -n6 for 1-grams to 6-grams), but the computation time and the memory required will increase dramatically.

Bigrams and trigrams are sufficient for the majority of texts. Many concepts are expressed with a sequence of two or three words, for example:

Random access memory, local water treatment, free online dictionary, high speed internet, very high frequency are trigrams.

Single words generally belong to many classes, but bigrams and trigrams are often more specific. Using bigrams and trigrams to classify a text helps resolve many ambiguities and increases the precision of the classification.

Some statistics:

In our particular case, Bayes' theorem will help us to estimate the probability that a text belongs to a class when a set of n-grams (words, bigrams, trigrams...) is present in the text. If we apply Bayes' theorem to our particular case of text classification, we can write:

P(C|W) = P(W|C) * P(C) / P(W)

In the expression above:
C is a topic or text class.
P(C) is the probability that C occurs.
W is one word (unigram) or one sequence of words (bigram, trigram...).
P(W) is the probability that W occurs.
P(W|C) is the probability of observing W given the class C.
P(C|W) is the probability of the class C given W.

P(C) is normally well known because it depends on the number of classes we have.
P(W) is independent of the number of classes and is constant.
P(W|C) can be estimated by computing the frequency of W when the text has the class C; this estimation is the training of the classifier.

We can extend the simple equation above to more than one n-gram (W). If we suppose that the n-grams are independent (we know that this is false: words in a sentence depend on each other, and sentences in a text depend on each other too, but it greatly simplifies the equation and works pretty well), then the resulting probability is the product of the probabilities P(C|W).

We assume the probability Pc follows a multinomial distribution:
Pc = Kc * ∏ θci^xi , where θci is the probability of the term i in the class C, xi is the number of occurrences of the term i in the document, and Kc is a constant depending on the class C.
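Taking the logarithm of this multinomial form recovers the linear score quoted at the top of this post, with Wci = Log θci and Bc absorbing Log Kc:

```latex
P_c = K_c \prod_i \theta_{ci}^{x_i}
\;\Longrightarrow\;
\log P_c = \log K_c + \sum_i x_i \log \theta_{ci}
         = B_c + \sum_i W_{ci}\, x_i
```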

The weight Wci of each term i in the class C will be estimated during the training of the classifier. The term Bc is proportional to the logarithm of the number of documents of class C used to train the classifier: Bc ~ Log(Nc/(Nt-Nc)), where Nc is the number of documents of class C and Nt is the total number of documents.
Because we look for the highest score, we can forget Bc if we take care to train the classifier with almost the same number of documents for each class.
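Once the weights are in memory, the scoring step is then a plain dot product per class; a minimal C sketch (the array names are illustrative, not from the TexLexAn sources):

```c
/* Score of one class: sum of Wci * xi over the terms of the document,
   plus the class constant Bc. The class with the highest score wins. */
static double class_score(const double *w, const int *x, int n, double bc)
{
    double score = bc;
    for (int i = 0; i < n; i++)
        score += w[i] * x[i];
    return score;
}
```
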

Saturday, January 16, 2010

I am continuing my previous post with a description of the results. The window is split in two parts. The left panel is the text returned by the engine texlexan; the most useful information there is the classification results and the list of the most relevant sentences extracted from the summaries.

The right panel displays the classification results in the form of bar graphs and allows the most significant results to be seen quickly.

We know the result comes from 42 summaries that have been extracted and analysed. It means 42 documents were analyzed, summarized and archived over the period considered.
The classification list shows that the majority of the documents were about computers, text mining, processors, machine learning, operating systems and international agreements.
It is important to be careful when the grade (pseudo-probability) of a classification is low: there is a high probability that the classification is erroneous and simply due to noise, for instance the result "1% Class: en.text-health-drug".

The next interesting part of the results is the sentences extracted from the summaries:
Additionally, some use these terms to refer only to multi-core
microprocessors that are manufactured on the same integrated circuit
die .These people generally refer to separate microprocessor dies in
the same package by another name, such as multi-chip module .This
article uses both the terms "multi-core" and "dual-core" to
reference microelectronic CPUs manufactured on the same integrated
circuit, unless otherwise noted.

There are only multi-threaded managed runtimes means when it loads
an single threaded managed app the runtime itself creates multi
threads for its own purpose, right ? A: The multi-threading managed
runtime takes care of creating multiple threads as needed by the
application.

Others, generally seeking more compact and stable methods for
indexing highly diverse sources for which full, word-based indexes
are often unavailable, have explored higher-level indexing methods
including free-text and controlled-vocabulary metadata schemes,
semantic representations, and query-based indexing with training
sets.

Hierarchical Indexing Hierarchical indexing is a method of indexing
large documents at several levels of structure, so that a retrieval
system can pinpoint the most relevant sections within each document.

The tag "business" indicates that the cue words belong to the business domain; only 431 cue words were found in the summaries analysed.

The sentences extracted above represent "theoretically" the main information expressed in the 42 documents analysed. Unfortunately, because the method is purely statistical, a few non-relevant sentences can sometimes be extracted. I will explain the reason in more detail in a next post.

Friday, January 15, 2010

In a previous post, I discussed the idea of extracting the most interesting information from the mass of electronic text circulating in the enterprise. After two weeks of work, the first step is done: TexLexAn is able to extract the most relevant sentences from a set of documents.

The main difficulty is to decide whether a sentence is relevant or not. The solution chosen is to weight each sentence with the keywords extracted from the summaries, and to use a list of cue words to increase the weight. Finally, only the sentences with a weight above a threshold are kept.
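A minimal C sketch of such a weighting step (the keyword list, the cue-word bonus of +2 and the threshold are made-up values for illustration; TexLexAn's actual scoring may differ):

```c
#include <string.h>

/* Weight of a sentence: +1 per keyword found, +2 per cue word found.
   The caller keeps only the sentences whose weight reaches a threshold. */
static int sentence_weight(const char *sentence,
                           const char **keywords, int nk,
                           const char **cues, int nc)
{
    int w = 0;
    for (int i = 0; i < nk; i++)
        if (strstr(sentence, keywords[i]))
            w += 1;
    for (int i = 0; i < nc; i++)
        if (strstr(sentence, cues[i]))
            w += 2;
    return w;
}
```
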

A trial has been done by classifying the set of summaries and then using the keywords belonging to the classification dictionary. This solution could seem better because it increases the number of keywords, but when the classification is wrong, the keywords are inappropriate.

The interface is very basic: there are two fields to enter the starting date and the ending date (a calendar can be called), and a large text window to enter some extra options. The most interesting options are -v1 for verbose and -K for the keyword list.

The results are too long to comment on and cannot fit here; they will be the object of a next post.