Abstract

In this Master's project, the use of semantic relationships in email classification with Support Vector Machines is examined. The corpus consists of emails in German. Semantically related words are mapped to structures. The approach is based on the theory of semantic fields. A graph is built with the help of the semantic relations between words looked up in a thesaurus. Two search algorithms, breadth-first search and the Tarjan algorithm, are applied to identify graph components. The size of the structures is limited to a suitable maximal size. Disambiguation is done in three different ways with a graph-based approach. Experiments evaluate the results, comparing the values of the different variables (disambiguation, path length, information gain, search algorithm). The results show that the outcomes for the variables are mostly homogeneous; the quality of a category depends on its character and its correlation to the thesaurus, i.e. the rate of words that are out-of-vocabulary but highly frequent in the category. An optimal configuration for each category can be found by an exhaustive search over all variable configurations.

Referat (Swedish abstract, translated)

Email classification with the help of semantic relationships

This Master's project examines email classification with the help of semantic relationships. The corpus consists of emails in German. Semantically related words are collected into structures. The approach is based on the theory of semantic fields. Words are looked up in a thesaurus and a graph is built from the given relations. Two search algorithms, breadth-first search and Tarjan's algorithm, are used to find graph components. The structures are limited to a suitable size. Disambiguation is done with a graph-based method in three different ways. The method is evaluated and different variables are examined (disambiguation, the maximal path length, information gain, the search algorithm). The results show that the variables are largely equivalent. The quality of a category is connected to its character and its correlation with the thesaurus, i.e. how many words are highly frequent in the category but missing from the thesaurus. An optimal configuration can be found for each category through an exhaustive search over all possible configurations.

Contents

1 Introduction
2 Theoretical Background
  2.1 Text Classification
    2.1.1 General Overview about Text Categorization
    2.1.2 Machine Learning
    2.1.3 Support Vector Machines
    2.1.4 Quality Measurements
  2.2 Linguistic Background
    2.2.1 Ambiguousness of language
    2.2.2 Semantic relationships
    2.2.3 The theory of semantic fields
    2.2.4 The German language
  2.3 Search Algorithms
    2.3.1 Breadth-first search
    2.3.2 Tarjan's algorithm
3 Related Work
4 Method
  4.1 Tools and Corpus
    4.1.1 Minor Third
    4.1.2 Tree Tagger
    4.1.3 Thesaurus
    4.1.4 The Corpus
  4.2 Description of Test Framework
  4.3 Theoretical Reasoning about a Graph
  4.4 Algorithms and Implementation for Building Semantic Structures
    4.4.1 Details about implementation
    4.4.2 Construction of a graph
    4.4.3 Component Search
    4.4.4 Treatment of polysemous words
    4.4.5 Trimming graphs - path length
5 Evaluation - Experiments
  5.1 Corpus Analysis
    5.1.1 Decisions about the corpus
    5.1.2 Category System
    5.1.3 Preprocessing
    5.1.4 Category sizes and other statistical data
    5.1.5 Correlation to the thesaurus
  5.2 Baseline Tests and Test Setup
    5.2.1 Test setup
    5.2.2 Variance
    5.2.3 Results in Baseline
  5.3 Experiments with Variables in the Graph
    5.3.1 Appearance
    5.3.2 Algorithms
    5.3.3 Disambiguation
    5.3.4 Path Length
    5.3.5 Characteristics of the Categories
    5.3.6 Summary of the Experiments
6 Conclusion and Future Work
Bibliography
Appendices
A Statistics about the Corpus
B Experiments
C Graph Gallery
  C.1 Good Graph Components
  C.2 Problematic Graph Components
  C.3 Large Graph Components
List of Figures
List of Tables

Chapter 1

Introduction

Text-based communication over digital media has become a very important means of communication; the number of emails sent every day is growing continuously. In large email handling systems, such as a customer support department, emails have to be sorted and categorized according to their content.

The topic of this Master's thesis is automated email classification. The main idea is to apply semantic relationships for feature reduction. The hyperonym relationship and the synonym relationship are applied. Emails are classified according to their semantic content with respect to the given category system. The classification algorithm is Support Vector Machines (SVM).

The approach used to model words and their semantic relationships is based on graph theory and the theory of semantic fields [16]. Keywords are looked up in a thesaurus and their relationships among each other are mapped in a graph. Two search algorithms, breadth-first search and the Tarjan algorithm, are applied to identify graph components, i.e. structures of semantically related words. A very central problem is to decide how to set the limits of semantic structures. The question is under which conditions synonyms are still "semantically related enough" to each other to be pooled together.

The email corpus used for this study consists of emails sent to the customer support of an Austrian telephone provider. The thesis is written at Artificial Solutions AB in Stockholm. The corpus is based on data provided by Artificial Solutions AB. The emails are in German.

The results of this study are evaluated by experiments. Results of classification using semantic relationships are compared to test results on the same corpus with a basic SVM.

Structure of the thesis  This thesis report is structured as follows: The second chapter treats the theoretical background of this Master's thesis. Foundations of text classification and linguistics are described. In the next chapter a short overview of related work is given. The fourth chapter explains the method and experiment setup. The strategy and implementation of feature regrouping is explained and the tools used for this Master's project are presented. The last chapter deals with evaluation. It is divided into two parts. In the first part, the corpus is analyzed. The second part reports on classification experiments. Finally a conclusion is given.
Chapter 2

Theoretical Background

This chapter presents the theoretical background for this Master's project. The first part gives a short overview of text classification and machine learning and explains the algorithm Support Vector Machines (SVM), which is used in this Master's project. The second part introduces linguistic terms, while the third part treats the necessary graph theory and explains two search algorithms used for graph component search.

2.1 Text Classification

2.1.1 General Overview about Text Categorization

The discipline of text categorization deals with content-oriented search. Text documents are assigned to categories according to their semantic content. The set of categories or the hierarchy of categories (category system) is already defined before categorization. This is the essential difference to the sister discipline text clustering, where categories have to be found for a number of given documents.

A classifier is a function that maps a text document to a category with a probability. Classifiers (considered in the scope of this Master's project) are binary, i.e. for every single category, one classifier has to be trained. All text documents belonging to a category are called positive examples, all other documents are called negative examples.

There are different machine learning approaches to text classification: in general, decision trees, neural networks and statistical methods can be distinguished. Statistical methods, especially Support Vector Machines, have been shown to be very appropriate for the problem of text classification [39].

A very central problem is the task of modeling natural language, as information is stored in complex semantic and grammatical structures. Usually, text documents are represented with the vector space model. A text document is transformed into a so-called bag-of-words. Words (or terms) are features; their occurrence is counted in a vector. A feature vector describes a text document in the following way: each dimension dt in the vector corresponds to a separate term t occurring in the corpus¹. The value of the dimension dt weights the number of occurrences of the term t in the text in relation to the number of occurrences of the term t in the whole corpus². Important characteristics for text categorization are the high dimensionality of the vector space and very sparse feature vectors.

This model is the simplest approach [41]. Even in a small corpus, the number of different words rises to extremes (feature explosion). Feature selection is the task of deciding which terms are taken into account when building feature vectors. A strict and brutal feature selection is a way to face the problem of feature explosion. Joachims [20] uses an information gain criterion; a term has to appear in at least three different documents.

Definition  A term t has an appearance value of k if t appears in at least k different documents.

¹The term corpus describes the collection of texts used for a classification project.
²Several different ways of computing these (term) weights have been developed. One of the best known schemes is tf-idf weighting [35]. The SVM implementation Minor Third [8] provides a representation of the vector space model.
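To illustrate the vector space model, the following minimal Java sketch builds a bag-of-words and weights it with the tf-idf scheme mentioned in the footnote above. It is not the representation used by Minor Third; all names are illustrative, and tokenization is reduced to whitespace splitting.

import java.util.*;

public class BagOfWords {
    /** Counts in how many documents of the corpus each term occurs. */
    static Map<String, Integer> documentFrequencies(List<String[]> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (String[] doc : corpus)
            for (String term : new HashSet<>(Arrays.asList(doc)))
                df.merge(term, 1, Integer::sum);
        return df;
    }

    /** Maps a document to a sparse tf-idf feature vector (term -> weight). */
    static Map<String, Double> tfIdf(String[] doc, Map<String, Integer> df, int corpusSize) {
        Map<String, Double> vector = new HashMap<>();
        for (String term : doc) vector.merge(term, 1.0, Double::sum); // raw term frequency
        for (Map.Entry<String, Double> e : vector.entrySet()) {
            double idf = Math.log((double) corpusSize / df.get(e.getKey()));
            e.setValue(e.getValue() * idf); // tf * idf
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String[]> corpus = Arrays.asList(
                "rechnung zu hoch".split(" "),
                "rechnung nicht erhalten".split(" "));
        Map<String, Integer> df = documentFrequencies(corpus);
        System.out.println(tfIdf(corpus.get(0), df, corpus.size()));
    }
}

Note how a term that occurs in every document (here rechnung) receives the weight zero: it cannot distinguish the categories, which is exactly the intuition behind the weighting.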
2.1.2 Machine Learning

Machine learning is a very large discipline in artificial intelligence. The major focus of machine learning research is to extract information from data automatically, by computational and statistical methods. The basic principle is to generate knowledge automatically from experience. An algorithm learns by example, trying to extract rules and principles. In that way, unknown examples can be classified.

Several forms can be distinguished; the best known are supervised, unsupervised and reinforcement learning. In the first form, labeled data are available. A larger part is used to train a classifier, a smaller part is kept aside to test it afterward. In unsupervised learning, no labeled data are available and the agent has to model a set of inputs. In the case of reinforcement learning, the algorithm learns a policy of how to act by being rewarded or punished depending on its performance.

In our case, supervised learning is applied. Machine learning has two main phases: the given set of annotated data is separated into two subsets. In the training step, the classifier is "taught" by a set of given examples. A smaller part is left out for testing; those examples are unknown to the classifier. By comparing the results achieved by the classifier to the annotations, the quality of the classifier is evaluated.

Dealing with computational learning, we cannot expect to gain correctness and completeness of the learning method. According to the theory of PAC learning (probably approximately correct learning) proposed by L. Valiant [23] and Vapnik and Chervonenkis [43], the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that the selected function will have low generalization error with high probability. The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples.

However, there are two further requirements which are very crucial for the definition of correctness of a learner: time complexity and feasibility of learning.

• A function must be learnable in polynomial time.
• The result of learning is to be a concept definition or a function that recognizes a concept.

The following definitions are quite theoretical, therefore some terms need to be introduced. For a given set X of all possible examples, a concept c is a subset of examples (those expressing the concept c). A concept class C is a subset of the power set of X. A concept c is often defined the same way as the classification of the concept: c can be considered as a function mapping X to 1 (for positive examples) or to 0 (for negative examples). In the field of application of this Master's project, a concept c can be considered as a (well defined) email category. The actual learning result is called hypothesis h. A hypothesis h and the hypothesis space H are defined in a similar way. A hypothesis h corresponds to the set of emails a trained classifier assigns to a category. If the classifier assigns all given emails (for all x in a learning set E) to the actual category (h(x) = c(x)), it is called consistent for the given examples.

Definition: PAC (probably approximately correct)-learnable  A concept class C with examples x of size |x| is PAC-learnable by a hypothesis space H if there is an algorithm L(δ, ε) that

• for all arbitrary but constant probability distributions D over x
• for all concepts c ∈ C
• in time polynomial in 1/ε, 1/δ, |C| and |x|
• with a probability of at least 1 − δ

returns a hypothesis h ∈ H whose error is not higher than ε. We can also say that L is a PAC learning algorithm for C.

The PAC theorem can be proved with the Vapnik-Chervonenkis dimension (cf. [31, 22ff]). This concept helps to illustrate the expressive strength of a hypothesis class. It is a measure of the capacity of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter.

Definition: Shattering a set of examples  Given a hypothesis space H over X and a subset S of X containing m elements, S is shattered by H if for all S' ⊆ S there is a hypothesis hS' ∈ H that covers S', that is S ∩ hS' = S'. All subsets of S are recognized by hypotheses in H.
Definition: Vapnik-Chervonenkis dimension (VC-dim)  The Vapnik-Chervonenkis dimension of H, VCdim(H), is defined as the number of elements of the largest set S such that S is shattered by H.

VCdim(H) = max{m : ∃S ⊆ X, |S| = m, H shatters S}   (2.1)

This dimension indicates how many differences H can express. If the maximum of S is undefined, VCdim is infinite. To shatter a set of size m, at least 2^m hypotheses are needed. For example, linear classifiers in the plane can shatter three points in general position but no set of four points, so their VC dimension is 3. To calculate VCdim exactly is often very difficult.

2.1.3 Support Vector Machines

Support Vector Machines (SVM) are used in many different areas where many features are involved in the classification task, such as image classification and handwriting recognition. Statistical learning theory describes how exactly conclusions drawn from a set of seen examples carry over to unseen examples. SVMs learn by example. The overall goal is to generalize the training data. This is done by creating a hyperplane in the vector space and maximizing the margin between the two classes. The hyperplane divides the vector space into positive and negative areas according to the membership in the category, as Figure 2.1 illustrates.

Figure 2.1. A binary classification problem with positive (+) and negative (-) examples. The picture on the left side illustrates that all hyperplanes h1 to h4 divide positive from negative examples. On the right side, the hyperplane h* maximized by SVM is shown (cf. [33, p. 27]).

The easiest version of an SVM uses a linear classification rule. Given is a set of training data E with l examples: E = (x1, y1), (x2, y2), ..., (xl, yl). Every example consists of a feature vector x ∈ X and a classification of this example y ∈ {+1, −1}. The classification rule is defined as (cf. Eq. 2.2):

h(x) = sign(b + Σi=1..n wi xi) = sign(w ∙ x + b)   (2.2)

where w and b are the two variables adapted by the SVM. w is the weight vector, assigning a weight to every feature. The variable b is a threshold value. If w ∙ x + b > 0, the example will be classified as positive.

The task the SVM has to accomplish is to fulfill the following inequalities (cf. Eq. 2.3). It can be considered as an optimization problem, as the SVM maximizes the margin around the hyperplane h* (cf. Fig. 2.1):

y1 (1/‖w‖)[w ∙ x1 + b] ≥ δ
...
yl (1/‖w‖)[w ∙ xl + b] ≥ δ   (2.3)

δ describes the distance to the hyperplane of the examples with the vectors closest to the hyperplane, the so-called support vectors. They give the name to the algorithm³. Vapnik shows that the maximal hyperplane really is maximal by formulating the expected error, which is limited by the number of support vectors⁴.

To allow training mistakes (mislabeled examples), a slight modification was suggested by Vapnik and Cortes in 1995 [9]: the Soft Margin method (cf. Eq. 2.4, 2.5). If there exists no hyperplane that can split the "yes" and "no" examples, the Soft Margin method will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The introduced slack variables ζi measure the degree of misclassification of the datum xi:

∀i: ci(w ∙ xi − b) ≥ 1 − ζi   (2.4)

The objective function is then increased by a function which penalizes non-zero ζi, and the optimization becomes a trade-off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem transforms to:

min ½‖w‖² + C Σi=1..l ζi   such that   ci(w ∙ xi − b) ≥ 1 − ζi ∀i   (2.5)
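To make Equations 2.2 and 2.4 concrete, the following small Java sketch evaluates the linear decision rule and the slack a fixed hyperplane would need for one training example. It is a toy illustration with assumed weights, not an SVM trainer; solving the optimization problem itself is left to a library such as Minor Third. Note that, following the sources cited above, Eq. 2.2 adds b while Eq. 2.4 subtracts it; the sketch follows each equation literally.

public class LinearRule {
    // Decision rule of Eq. 2.2: h(x) = sign(w . x + b)
    static int classify(double[] w, double b, double[] x) {
        double s = b;
        for (int i = 0; i < w.length; i++) s += w[i] * x[i];
        return s > 0 ? +1 : -1;
    }

    // Slack of Eq. 2.4 for one example: zeta_i = max(0, 1 - c_i (w . x_i - b))
    static double slack(double[] w, double b, double[] x, int c) {
        double s = -b;
        for (int i = 0; i < w.length; i++) s += w[i] * x[i];
        return Math.max(0.0, 1.0 - c * s);
    }

    public static void main(String[] args) {
        double[] w = {1.0, -0.5};   // assumed weights, e.g. produced by training
        double b = -0.2;
        double[] x = {0.8, 0.3};    // a feature vector in the sense of Sec. 2.1.1
        System.out.println("class = " + classify(w, b, x));
        System.out.println("slack = " + slack(w, b, x, +1));
    }
}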

Joachims first applied SVM to text categorization and achieved outstanding results [20]. He argues that SVMs are well usable for text categorization because they are robust in cases of high-dimensional vector spaces with sparse feature vectors. One more outstanding characteristic of the SVM algorithm is that the kernel function can be extended very easily to a non-linear function⁵. For the application area of text categorization, usually linear kernels are taken [20].

³Only the support vectors determine the maximum-margin hyperplane. This is an interesting feature of the SVM algorithm: leaving out all other examples, the result of the SVM would be the same [33].
⁴The detailed presentation of the mathematical proof can be found in [43] and in more detailed articles about SVM such as [31], [33].
⁵This is known as the kernel trick [39]. It is done by projecting the data into a vector space of a higher dimension. This function can also be calculated in an efficient way, as Boser et al. showed, by calculating the scalar product of this function [7].

2.1.4 Quality Measurements

To be able to compare and understand the quality of the results of a classification experiment, a quality measure is needed. A classification experiment can be considered as a search problem: the classifier searches for emails belonging to the given category in a number of uncategorized emails. In general, two basic questions are important to evaluate a classifier:

• Do all documents that the classifier assigns to the target category really belong to that category?
• Does the classifier find all documents belonging to the given category?

The first question focuses on the classifier's accuracy (precision), the second on the classifier's breadth (recall). To calculate these measures, we introduce the confusion matrix.

Definition: confusion matrix  A confusion matrix maps the result of a classifier against given classification information. A relevant document is a document that belongs to the target category according to the given information. True positive documents are relevant documents that the classifier has found; true negative documents are non-relevant documents the classifier has correctly left out.

             relevant              not relevant
found        true positive (tp)    false positive (fp)
not found    false negative (fn)   true negative (tn)

Definition: precision  The precision of a classifier is the proportion of true positive documents among all documents the classifier has returned:

p = tp / (tp + fp)   (2.6)

Definition: recall  Recall is calculated by dividing the number of true positive documents by the number of all relevant documents:

r = tp / (tp + fn)   (2.7)

To illustrate the definitions of precision and recall, let us consider the extreme cases: if the classifier returns all documents, recall is maximal, while precision is minimal, as it then equals the proportion of relevant among all documents. Conversely, returning only one relevant document maximizes precision, but minimizes recall. Precision and recall are normally correlated negatively: increasing precision results in decreasing recall and vice versa.
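A minimal sketch of these measures, assuming gold labels and classifier decisions are given as boolean arrays; the variable names follow the confusion matrix above.

public class PrecisionRecall {
    /** Computes precision and recall from gold labels and classifier decisions. */
    static double[] evaluate(boolean[] relevant, boolean[] found) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < relevant.length; i++) {
            if (found[i] && relevant[i]) tp++;        // true positive
            else if (found[i] && !relevant[i]) fp++;  // false positive
            else if (!found[i] && relevant[i]) fn++;  // false negative
        }
        double precision = tp == 0 ? 0.0 : (double) tp / (tp + fp); // Eq. 2.6
        double recall    = tp == 0 ? 0.0 : (double) tp / (tp + fn); // Eq. 2.7
        return new double[]{precision, recall};
    }

    public static void main(String[] args) {
        boolean[] relevant = {true, true, false, false};
        boolean[] found    = {true, false, true, false};
        double[] pr = evaluate(relevant, found);
        System.out.println("precision = " + pr[0] + ", recall = " + pr[1]);
    }
}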
2.2 Linguistic Background

This section gives a short overview of the theory of semantic meaning, relationships and fields. It introduces terms and concepts essential for the linguistic background of this project and also describes the problem of the structure of language.

"Structure is the most general and deepest feature of language", states Wilhelm von Humboldt [40]. Words can be grouped into families and fields of semantically related terms. The school of European Structuralism tries to define and to explain the structures in language. A very essential way to model a word's meaning is the two-component model of de Saussure: a word is described by its phonetic component (signifiant) and by its semantic component (signifié). The semantic component can consist of a concrete or an abstract concept. The meaning of words, especially of those with an abstract concept, is often explained by using the referential model: to explain one term, other terms are needed. Language is reflexive. But language is also blurred: a concept can hardly be defined exactly and it is not possible to enumerate all the characteristic features that define one concept. Blank [4] hit the bull's-eye of this problem with his well-known example question: "Considering the concept of a dog and trying to define the concept 'dog-alike' as an animal with the feature [has four legs], is a dog with only three legs still a dog?"

2.2.1 Ambiguousness of language

Language is ambiguous. Ambiguousness can occur on several levels: two words can be connected to the same phonetic sound chain; homograph words have the same spelling. But even for groups of words and whole sentences, ambiguousness is a very common phenomenon⁶. For email classification, ambiguousness on the word level is most important, as words usually are used as the base to determine feature vectors.

Ambiguousness on a lexical level is called polysemy. A polyseme is a word or phrase with multiple, related meanings (called lexemes or senses of meaning). The different lexemes must have a common semantic core⁷.

Word sense disambiguation describes the method of identifying the appropriate meaning of a polysemous word in a given context. This is a very prominent problem in computational linguistics, as it appears in different disciplines of natural language processing such as machine translation, information retrieval, text classification and question answering. But considering the last 25 years of research in artificial intelligence, it has to be confessed that there is no global disambiguation algorithm [44]. Depending on the type of application, disambiguation of words can be achieved by combining several knowledge sources using selection criteria and encyclopedic knowledge. Dealing with written language, information about part-of-speech is essential to distinguish polysemous words and expressions⁸.

A very efficient method for the disambiguation of different words that coincide in a homograph form is part-of-speech tagging (POS-tagging) and lemmatisation, where the grammatical word class is identified and the word is brought back to its basic form⁹. The POS-tagger used in this Master's project is described in Section 4.1.2.

Ambiguous polysemous words¹⁰ cannot be resolved with a POS-tagger. Most approaches to word sense disambiguation use the context the word occurs in to identify the right lexeme. The well-known Yarowsky algorithm [21, p. 638 ff.]¹¹ is based on two basic assumptions borrowed from the linguistic discipline of discourse analysis [4]:

1. One sense per collocation: the collocations (words in the neighborhood of the concerned word) are very useful to define the specific sense of a word.
2. One sense per discourse: the sense of a polysemous word that appears several times in a document is often the same within that document.

⁶For example, the sentence John told Robert's son that he must help him ([28]) changes its meaning according to the assignment of the personal pronouns.
⁷This term has to be delimited against homonymy. Homonymous words do not have any related meaning; their forms have coincidentally converged to the same form because of sound shifting and other linguistic developments. Properly speaking, a homonym is two signs with the same phonetic form. A standard example [4] for homonymy is the French verb louer, which has two different meanings: 'praise' and 'hire, rent'. These two lemmas derive from two different Latin origins: the verb laudare and the verb locare.
⁸The dialog system SMARTKOM, recognizing spoken language, is able to distinguish irony and sarcasm by taking the speaker's facial expressions into account [45].
⁹The word Lachen can be translated with laughter or puddle (Lache in its basic form). In the sentence Die Lachen waren gross. (The puddles were big) a POS-tagger identifies the word Lachen as a plural form of Lache - puddle, because of the verb form, the word's function in the sentence etc.
¹⁰Fig. C.7 (Appendix) illustrates an example of the polysemous word Leiter (conductor (elec.) and head, leader, chief) and two lexemes united in one node.
¹¹The implementation of the algorithm [29] achieves 96% accuracy, but it requires a database that maps a lexeme to a context or a set of possible contexts, such as [48].

2.2.2 Semantic relationships

There are different types of semantic relationships; the most well-known are synonymy, antonymy and hyperonymy. To define synonymy is one of the central problems in semantics; the dispute about this topic is as old as the first reflections on language [51]. Gauger proposes a wide definition, defining synonymy not as identity of meaning but as an "alikeness in meaning" [15]. There are different types or degrees of synonymy: partial synonymy describes a simple alikeness in meaning. "Proper" synonymy implies that two terms can be exchanged without changing the semantic content of the context. Total synonymy implies the complete identity of two terms, that means their commutability in every context. The problem is that the "degree of alikeness" is hard to measure and to compute. An empirical approach is to define synonym pairs by votes of native speakers¹².

Hyperonymy is often described as the "lexicographical relationship" and can also be classified as a special case of synonymy. In computational linguistics, this relationship is also known as an IS-A relationship¹³. Antonyms express a contrary relationship. There are different types of antonymy as well, but those can be omitted here¹⁴.

¹²Online thesauri such as [22] are based on this approach: users vote for proposed synonym pairs.
¹³In opposition to synonymy, hyperonymy has beneficial qualities thanks to its stricter definition: hyperonymy is transitive (cf. [4]): if a dog is a mammal and a dachshund is a dog, then a dachshund is a mammal, too. This fact greatly facilitates the use of this relationship in a computational context.
¹⁴The thesaurus used in this Master's project does not represent antonym relations [32].

2.2.3 The theory of semantic fields

The theory of semantic fields picks up Humboldt's thought, describing the semantics of language in groups of semantically related words, so-called semantic fields, synfields or subsynsystems (Lyons). The relationships between different words construct a larger structure, a field. Depending on how tightly the field definition is drawn, and where the limits are set, the field describes one single concept in its different aspects; the meanings of the single field elements become equivalent to a certain degree. Coseriu and Geckeler [16] define a semantic field as a continuum of semantic content describing one concept (archilexeme)¹⁵. The different field elements are in opposition to each other for one semantic feature¹⁶.

The notion of semantic fields in its most strict and idealistic definition can be compared to the mathematical concept of equivalence classes¹⁷:

• Language is organized in structures or fields. The congruence representative is the archilexeme.
• In a field, synonyms are equivalent to each other.
• Among the elements of a semantic field, the synonym relation is transitive.

A main problem is to set up limits for semantic fields, as borders between semantic fields become blurred because of polysemous words and unclear borders between the fields of meaning of different words [25]. To define those borders properly, a very detailed analysis of collocations is necessary¹⁸. Creating a semantic field manually, with a proper analysis of the opposition of every single pair of field elements, is a lot of work¹⁹.

A semantic field can be modelled as a graph: words represent nodes, their relationships are edges. The center of a semantic field, called the archilexeme [16], represents the complete meaning core of the semantic field. Using the graph-based approach, evidence for the archilexeme is a maximal number of outgoing edges and a position in the very center of the field [47].

¹⁵This superordinate concept or umbrella term is called archilexeme. The archilexeme in a semantic field unites all semantic features of the field elements.
¹⁶A semantic feature (seme) is the smallest unit of meaning recognized. It is a binary variable, i.e. the feature is there or not. In the semantic field of cooking terms in English, for example, fry and roast share the feature [+method of cooking meat] but only roast has the feature [+oven] (cf. [6, p. 11]).
¹⁷This comparison is not to be taken at face value; language does not obey mathematical rules. The purpose of this comparison is to enhance understanding of the linguistic concept.
¹⁸A collocation is a group of words that often appears in the context of the concerned word and helps to define its meaning in this concrete part of speech [5].
¹⁹In this project we are not going to find semantic fields; we only look for structures of semantically related words, as the strict definitions cannot currently be fulfilled by the computational approach.
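In the graph terms used later in Chapter 4, the archilexeme heuristic just described can be stated very compactly: pick the node with the maximal out-degree. A sketch under assumed adjacency lists; the data in main is illustrative only.

import java.util.*;

public class Archilexeme {
    /** Heuristic from Sec. 2.2.3: the archilexeme candidate is the node with maximal out-degree. */
    static String candidate(Map<String, List<String>> field) {
        String best = null;
        int bestDegree = -1;
        for (Map.Entry<String, List<String>> e : field.entrySet()) {
            if (e.getValue().size() > bestDegree) {
                bestDegree = e.getValue().size();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, List<String>> field = Map.of(
                "freundlich", List.of("nett", "höflich", "verbindlich"),
                "nett", List.of("lieb"),
                "lieb", List.of());
        System.out.println(candidate(field)); // -> freundlich
    }
}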
2.2.4 The German language

The German language is a West Germanic language. German is usually cited as an outstanding example of a highly inflected language. Words are inflected according to the four-level case system; for verbs, several conjugation schemata exist. With four cases and three genders plus plural, there are 16 distinct possible combinations of case, gender and number [14]. In German orthography, nouns and most words with the syntactical function of nouns are capitalized, which is supposed to make it easier for readers to find out what function a word has within the sentence (Das Auto war kaputt - the car was broken).

In German, noun compounds are formed where the first noun modifies the category given by the second, for example Telefonrechnung (telephone bill). Unlike English, where newer compounds or combinations of longer nouns are often written in open form with separating spaces, German (like the other Germanic languages) nearly always uses the closed form without spaces²⁰.

²⁰The longest German word verified to be actually in (albeit very limited) use is Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz ("beef labeling supervision duty assignment law") [19].
2.3 Search Algorithms

Before the two search algorithms, breadth-first search and Tarjan's algorithm, are presented, some basic terms have to be defined (cf. [11]).

In an undirected graph G, two vertices u and v are called connected if there is a path in G from u to v. If every pair of distinct vertices in a graph G can be reached through some path in the graph, the graph is called connected. If G is a directed graph, it is called weakly connected if the graph obtained by replacing all directed edges with undirected ones is connected. It is strongly connected (or strong) if it contains a directed path from u to v and a directed path from v to u for every pair of vertices u, v.

A connected component is a maximal connected subgraph of G. The strongly connected components are the maximal strongly connected subgraphs. Each vertex belongs to exactly one connected component, as does each edge. A disconnected undirected graph G is composed of a set of at least two connected components.

2.3.1 Breadth-first search

Breadth-first search (bfs) is a uniform search algorithm that aims to visit every node of a graph exhaustively. Starting from the start node and expanding the different graph levels, the graph is searched in its breadth.

given: starting node u
       empty queue Q
       empty component C

unmark all vertices
choose the starting vertex u
mark u
enqueue u
add u to C
while Q not empty
    dequeue a vertex u from Q
    visit u
    for each unmarked neighbor v
        mark v
        add v to the end of the queue Q
        add edge (u,v) to the component C
        add v to C
    end for
end while

Figure 2.2. Pseudocode of the algorithm breadth-first search (bfs)

Starting from a given start node u (cf. Fig. 2.2), it is tested for each neighbor v of u whether v has already been visited. If node v has not yet been discovered, it is added to a waiting queue and expanded in a later step. After all neighbors of u have been considered, the same procedure is done with the next node in the waiting queue.

The algorithm has a complexity of O(|V| + |E|), where |V| is the number of vertices and |E| the number of edges.

The bfs-algorithm can be used with a slight modification to identify all connected components in a graph. The original bfs-algorithm is enclosed in a further while-loop, and a further queue administrates the vertices, marking nodes that have already been found in a component and providing new starting vertices to the bfs-algorithm. The complexity of the modified algorithm is still O(|V| + |E|). A runnable version of this modified component search is sketched below.
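The following Java sketch implements the modified component search just described. It is a simplified stand-in, not the Synfield implementation presented in Chapter 4; the adjacency-list representation and integer vertex names are assumed for illustration.

import java.util.*;

public class BfsComponents {
    /** Finds all connected components of an undirected graph given as adjacency lists. */
    static List<Set<Integer>> components(Map<Integer, List<Integer>> adj) {
        Set<Integer> marked = new HashSet<>();
        List<Set<Integer>> components = new ArrayList<>();
        for (Integer start : adj.keySet()) {          // outer loop provides new start vertices
            if (marked.contains(start)) continue;
            Set<Integer> component = new HashSet<>();
            Deque<Integer> queue = new ArrayDeque<>();
            marked.add(start);
            queue.add(start);
            while (!queue.isEmpty()) {                // plain bfs inside one component
                int u = queue.poll();
                component.add(u);
                for (int v : adj.getOrDefault(u, List.of())) {
                    if (marked.add(v)) queue.add(v);  // add() returns false if already marked
                }
            }
            components.add(component);
        }
        return components;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> adj = Map.of(
                1, List.of(2), 2, List.of(1, 3), 3, List.of(2),
                4, List.of(5), 5, List.of(4));
        System.out.println(components(adj)); // two components: {1,2,3} and {4,5}
    }
}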

2.3.2 Tarjan's algorithm

The Tarjan algorithm (tarj), named after its inventor Robert Tarjan, finds strongly connected components in a directed graph (cf. [42]). The basic idea of the algorithm is to execute depth-first search from a given starting vertex (cf. Fig. 2.3). The identified components are subgraphs of the depth-first-search tree. The root of such a subgraph is the root of the strong component.

Visited vertices are put on a stack in the order of visit. When depth-first search returns from a subgraph, the stack is emptied step by step, deciding whether the current start vertex v for this subgraph is the root vertex of a strong component. Performing depth-first search, the visited vertices v' in the supposed subgraph are indexed in this order (for a vertex v' this index is called v'.dfs). Additionally, each node v' is assigned a value v'.lowlink that takes the minimum of v'.dfs and the indices reachable along the path between the currently visited node v' and the current start vertex v: v.lowlink := min {v'.dfs ; v' is reachable from v}. This value is calculated during runtime. A node v is identified as root vertex of a strong component if and only if v.lowlink = v.dfs.

The Tarjan algorithm has a linear time complexity. The tarjan procedure is called once for each node; the forall statement considers each edge at most twice. The algorithm's running time is therefore linear in the number of edges in G, O(|V| + |E|). Applying the same modification as for the bfs-algorithm, all components in a disconnected graph can be found. A Java sketch of the procedure follows the pseudocode below.

Input: Graph G = (V,E)

index = 0
empty stack S
empty component C

forall v in V do
    if (v.dfs is undefined)
        tarjan(v)
end forall

procedure tarjan(v)
    v.dfs = index
    v.lowlink = index
    index = index + 1
    S.push(v)
    forall (v,v') in E do
        if (v'.dfs is undefined)
            tarjan(v')
            v.lowlink = min(v.lowlink, v'.lowlink)
        else if (v' is in S)
            v.lowlink = min(v.lowlink, v'.lowlink)
    end forall
    if (v.lowlink == v.dfs)
        repeat
            v' = S.pop()
            C.add(v')
        until (v' == v)
end procedure

Figure 2.3. Pseudocode of the Tarjan algorithm (tarj)
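As with bfs, a compact Java version of Figure 2.3 may make the bookkeeping easier to follow. This is an illustrative recursive sketch over an adjacency-list graph, not the implementation used in Synfield; deep recursion would need an iterative rewrite in practice.

import java.util.*;

public class TarjanScc {
    private final Map<Integer, List<Integer>> adj;
    private final Map<Integer, Integer> dfs = new HashMap<>();     // v.dfs
    private final Map<Integer, Integer> lowlink = new HashMap<>(); // v.lowlink
    private final Deque<Integer> stack = new ArrayDeque<>();
    private final Set<Integer> onStack = new HashSet<>();
    private final List<List<Integer>> components = new ArrayList<>();
    private int index = 0;

    TarjanScc(Map<Integer, List<Integer>> adj) { this.adj = adj; }

    List<List<Integer>> run() {
        for (int v : adj.keySet())
            if (!dfs.containsKey(v)) tarjan(v);
        return components;
    }

    private void tarjan(int v) {
        dfs.put(v, index);
        lowlink.put(v, index);
        index++;
        stack.push(v);
        onStack.add(v);
        for (int w : adj.getOrDefault(v, List.of())) {
            if (!dfs.containsKey(w)) {
                tarjan(w);                                       // tree edge: recurse
                lowlink.put(v, Math.min(lowlink.get(v), lowlink.get(w)));
            } else if (onStack.contains(w)) {                    // edge back into the stack
                lowlink.put(v, Math.min(lowlink.get(v), dfs.get(w)));
            }
        }
        if (lowlink.get(v).equals(dfs.get(v))) {                 // v is the root of a component
            List<Integer> component = new ArrayList<>();
            int w;
            do {
                w = stack.pop();
                onStack.remove(w);
                component.add(w);
            } while (w != v);
            components.add(component);
        }
    }

    public static void main(String[] args) {
        // 1 -> 2 -> 3 -> 1 forms one strong component, 4 a trivial one
        Map<Integer, List<Integer>> adj = Map.of(
                1, List.of(2), 2, List.of(3), 3, List.of(1), 4, List.of(1));
        System.out.println(new TarjanScc(adj).run()); // e.g. [[3, 2, 1], [4]]
    }
}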
Chapter 3

Related Work

A lot of research has been done comparing algorithms in text categorization. Joachims was the first to apply Support Vector Machines (SVM) to text classification [20]. A number of studies confirmed these results, comparing different algorithms and settings against each other [54, 41]. In different studies different techniques have been applied, such as co-training¹ [24] and testing different kernels [30]. The general goal is to increase the performance of email classifiers.

Features are entities used as basic information for classification. In text classification, words are usually used as features in the bag-of-words model (BOW), i.e. words are represented as an unordered set, disregarding word order and grammar. A very central problem is the high variety of words in natural language, which leads to a very high number of features. Because of this, a smart way to reduce or select features can have a deep impact on the quality of classification.

A basic idea is to identify features that are characteristic for a category and to cut off unimportant features. Aggressive feature selection usually implies a reduction of 90-95%². This idea seems to work quite well in different studies and for different corpora [10]. However, there are pros and cons to aggressive feature selection: it may result in a loss of information, and SVMs can handle a high-dimensional input space [34] anyway.

Linguistically coined strategies for feature selection touch the bag-of-words model and aim to improve the representation of information. Crawford et al. [10] found improvements for classification with SVM for their corpora using phrase selection based on statistically selected 1- and 2-grams³. Youn and McLeod built a spam filter using an ontology based on keywords [53].

Instead of single words, semantic concepts can be treated as features. From a linguistic point of view, words similar or alike in meaning can be pooled into one semantic concept. Yang and Callan built an ontology from a large email corpus. In order to identify similar concepts, a graph-based approach involving WordNet is used. Hyperonym relationships with 2 levels in between and 2-grams are used to regroup concepts [52]. Wermter and Hung find hyperonym relationships useful to build self-organizing maps⁴ for text classification [46]. In all three studies, WordNet⁵ [13] is the knowledge base used to map semantic relationships.

¹Instead of training one classifier, two classifiers are used. The view on the data of one classifier is used to train the second one. In case of only small amounts of labeled data and large amounts of unlabeled data, co-training has been shown to be very efficient. The feature sets for the classifiers have to be independent.
²Statistical methods such as information gain or chi-square are applied.
³An n-gram is a subsequence of n items in a given sequence. In the sequence abcab the 2-gram ab occurs two times.
⁴Self-organizing maps are a kind of artificial neural network that is trained using unsupervised learning to produce a low-dimensional representation of the input space of the training samples, called a map. The difference to other neural networks is that they use a neighborhood function to preserve the topological properties of the input space.
⁵WordNet is a lexical database or a kind of machine-readable thesaurus for the English language. It groups English words into sets of synonyms and records the various semantic relations between these synonym sets.

Chapter 4

Method

Classifying emails involves several working steps, but feature selection and feature reduction are very central problems. The main thought behind this Master's project is the use of semantic relationships between keywords to regroup them into one feature. This is done by a graph-based approach. The words are vertices, connected to each other. Edges are the relationships between words, given by a thesaurus. To identify components in the graph, two search algorithms (tarjan and bfs) are applied. A real challenge is to identify graph components and to limit their size in a way that the words involved are still semantically related enough to share a semantic core¹.

The effects of feature selection methods can hardly be calculated in advance, as categories and their characteristic word frequencies may differ a lot. Therefore the way of research is mostly empirical: each approach has to be evaluated experimentally. There is no baseline for this corpus. Describing the method is strictly separated from results.

In order to conduct classification experiments, a number of tools are needed. Support Vector Machines are used for classification in this Master's project. As German is a highly inflected language, a lemmatiser is used to bring words back to their basic form.

This chapter describes method and implementation. The first section introduces the tools needed. Then the test framework and its different components are explained. The third section goes more into detail about constructing a graph. The last section explicates the implementation.

¹A selection of examples of semantic structures is found in Sec. C (Appendix). The examples illustrate different aspects of the approach applied.

4.1 Tools and Corpus

In this section, the tools used in this Master's project are presented: Minor Third (the implementation of Support Vector Machines), the POS-tagger and lemmatiser Tree Tagger, and the thesaurus. The corpus is presented as well.

4.1.1 Minor Third

Minor Third [8] is an open source Java implementation library for machine learning in NLP (natural language processing) applications. It provides a large number of classification algorithms and a very easy-to-use interface². Tools for annotating text and visualising are provided as well, but not used for this Master's project. A very big advantage of the architecture of Minor Third is its approach to handling labeling information: the corpus is stored in a so-called text base. All labeling information is stored in an external file that is completely independent from the text source (in this report called label file). Several label files containing different labeling information can be used for the same text base. The format for a label file is intuitive and can be created manually or automatically.

²Exchanging the classification algorithm from Support Vector Machines to Naive Bayes just demands changing one command line option.
4.1.2 Tree Tagger

German being a highly inflected language, POS-tagging of German is not trivial. As POS-tagger and lemmatizer, the Tree Tagger [1] is used. It is based on a probabilistic tagging method estimating transition probabilities by constructing a binary decision tree out of trigrams [38]. The accuracy of the Tree Tagger was measured at 96% [38], so mistakes of the tagger do occur. The Tree Tagger also recognizes common proper nouns such as Peter and Maria. Unknown words are guessed. Once installed, it is ready to use and does not need to be trained.

4.1.3 Thesaurus

The Open Thesaurus [32] is an open source thesaurus project for German³. Its structure is similar to WordNet [13]: groups of synonyms are mapped in hyperonym relations. The thesaurus does not contain any grammatical information, i.e. about word classes. As most work is done by laymen, the structure is kept simple: only synonym and IS-A relationships are implemented. Meta information about e.g. language level is given for the lemmas, but no POS meta information, that means information about the word class of the lemma.

The thesaurus provides a web interface and a version to be integrated into text processing software, or it can be downloaded as a MySQL database dump. The vocabulary is drawn from everyday speech. It contains about 60000 entries [32].

³Access and collaboration are free to everybody if registered. To avert drawbacks on quality, the community pays attention to quality assurance, following the principle of self-control with the administrator as the last control instance.

4.1.4 The Corpus

The corpus in its original form consists of about 45000 emails sent to the customer support service of an Austrian telephone and Internet provider. The corpus consists only of the first emails in a communication chain, that means emails that were sent from outside to the customer service. The emails are saved in a large database. The emails are assigned to the categories by CSO EMS™⁴ and the employees in the customer service, i.e. labeling is reliable and does not need to be reviewed manually. The corpus is not standardized and was never used for such a research project before, so there are no baseline results or any gold standard to compare with⁵.

⁴CSO EMS™ is an Email Processing Application provided by Artificial Solutions that handles, analyzes and classifies incoming emails to assist support staff in handling emails. The approach for classification is rule-based.
⁵The well-known Reuters corpus, for example, is considered an unofficial standard corpus for NLP research projects in the English language [36].

4.2 Description of Test Framework

The main steps to prepare a test set are extracting the emails from the database, preprocessing, and feature regrouping using semantic relationships.

Figure 4.1 illustrates the steps. The corpus is created by extracting emails from the CSO EMS™ database, in which the EMS system stores emails and logs various pieces of meta information. Only the first mails sent to the customer service are taken into account. Labeling information for training and testing the classifiers is extracted from the database as well.

Figure 4.1. Architecture of the test framework

The emails are extracted only once and saved as separate text files. The preprocessing steps are done only once as well. Preprocessing implies four different treatments of the data:

• Some emails are extracted in HTML encoding. All HTML tags have to be removed to get raw text.
• Analysing the content showed that the original corpus contains a lot of spam. The spam is taken away.
• Stop word⁶ treatment: special characters and objectionable pieces of information such as email addresses and customer IDs are omitted.
• Lemmatisation: using the Tree Tagger [1], the emails are POS-tagged and words are brought back to their basic form.

Removing HTML tags and stop-word-like information is done via Perl scripts programmed for this Master's project. Taking away the spam is realized by eliminating two categories (cf. 5.1.1). A sketch of the tag and stop word removal step is given below.

⁶Stop words are words without semantic meaning such as or, and, the. Removing them is a very common preprocessing step in NLP applications.
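The actual cleanup scripts are written in Perl and are not reproduced in this report; the following Java sketch only illustrates the kind of transformation involved. The regular expressions and the stop word list are simplified stand-ins (real HTML and identifier patterns are messier), not the patterns used in the project.

import java.util.*;

public class Preprocess {
    // Simplified stand-ins for the real stop word list and patterns.
    static final Set<String> STOP_WORDS = Set.of("und", "oder", "der", "die", "das");
    static final String HTML_TAG = "<[^>]+>";
    static final String EMAIL_ADDRESS = "\\S+@\\S+";

    static String clean(String rawEmail) {
        String text = rawEmail
                .replaceAll(HTML_TAG, " ")        // strip HTML tags
                .replaceAll(EMAIL_ADDRESS, " ")   // drop addresses and similar identifiers
                .replaceAll("[^\\p{L}\\s]", " "); // drop special characters, keep letters
        StringBuilder kept = new StringBuilder();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token))
                kept.append(token).append(' ');   // keep content words only
        }
        return kept.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("<p>Die Rechnung und das Konto: kunde@example.at</p>"));
        // -> "rechnung konto"
    }
}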
In the last step, illustrated by a double rectangle in Fig. 4.1, the real experiments are done. The classification tool Minor Third manages all tasks in the classification process. Input data are the corpus as separate text files and label files containing categorization information.

The main part of this project work is the step illustrated in Figure 4.1 by a flash: regrouping features using semantic structures. This working step is very complex. A tool programmed in Java looks up words in a thesaurus and calculates semantic structures of the resulting graph. It is described in the next sections. Practically speaking, this could be considered a preprocessing step as well, as groups of words in the raw text are replaced.

4.3 Theoretical Reasoning about a Graph

In [47] it was shown that semantic fields can be modelled with a graph-based approach. In this Master's project, a similar model is used to compute semantic structures automatically. The vocabulary is mapped in a graph. Words are vertices and their relationships are edges connecting them. The thesaurus serves as the knowledge base for semantic relationships; the structure and the completeness (i.e. whether all words are represented in the graph) depend a lot on the thesaurus.

The thesaurus models only two types of semantic relationships: synonyms and IS-A relations. In order to create a "real hierarchy" ("eine echte Hierarchie" [32]), a synonym group can be related to only one hyperonym. This fact limits the variety of the thesaurus; on the other hand, it facilitates the construction of semantic structures. Furthermore, antonym relations are omitted completely. The IS-A or hyperonym relation can be considered as a special case of a partial synonym relation [15]. Therefore, in the graph model we can pool the two into one type of relation⁷:

Definition: neighbor  A word a is related to another word b if it is mentioned in the thesaurus as a synonym or hyperonym. We say that a is a neighbor of b. We distinguish between thesaurus neighbors (relations mentioned in the thesaurus) and field or structure neighbors (related words in the graphs).

One graph is built for the whole corpus and for all categories. The groups must be disjunct subsets of the set of words, the vocabulary of the corpus. The groups shall not be too large, as the members are to be semantically related. For each group, an identifier such as e.g. _Syngroup209_ is created. The words in the mails are replaced with the corresponding identifier.

Central questions are the disambiguation of polysemous words and the size and characteristics of the components. The solutions applied to these questions have to fit in limits of time and effort. Disambiguation is done according to the graph-based model, counting the neighbors of the different lexemes of a word. The context is not taken into account.

Such a word graph consists of a number of unconnected components. Two search algorithms, breadth-first search and the Tarjan algorithm, are applied to identify them. Components can become very large, about 1000 nodes. Those components have to be trimmed. The idea is to select a central node and to follow paths up to a maximal length before cutting off. The central node is selected as the one with the maximal number of active neighbors, as high connectivity is evidence for the central position of a word in a semantic structure (cf. Sec. 2.2.3). A very large graph component can be seen as a cluster of different overlapping semantic structures. The next section goes more into detail about the implementation and construction of a graph. Three of the main questions described there (search algorithm, disambiguation and path length) are variables examined in the evaluation part (cf. Chap. 5.3).

⁷It may be considered to introduce a weight in the graph-based model in order to differentiate between hyperonym and synonym relations. But this is left to an eventual later phase of fine tuning.

4.4 Algorithms and Implementation for Building Semantic Structures

In this section the implementation of the tool to analyze and regroup semantic structures is described. Synfield, the heart of this project, is a tool to compute a graph of semantic structures with several features. The graph or graph components can be visualized. A number of smaller tools perform tasks for preprocessing and preparing experiments.

Figure 4.2. The functionality of the tool Synfield

Figure 4.2 illustrates the functionality of the tool. The program itself is shown as a rectangle. It has access to two data sources, the corpus and the thesaurus. The output of the program is a graph mapping words in semantic structures, or a lexicon file containing regrouped features. Such a graph is constructed in the following way:

1. Extract a list of all words occurring in the corpus (text elements).
2. Look up all text elements in the thesaurus and save them in nodes, each with all given neighbors according to the chosen disambiguation type.
3. Search in the graph for relationships between nodes.
4. Analyze the graph and search for components using bfs-search or tarjan.

Possible actions with the graphical user interface:

1. Select and visualize graph components.
2. Trim graph components with maximal path length and prepare a version of the corpus for classification experiments.

Looking up a word is done by iterating over the list of text elements and querying the thesaurus. How a graph is built is explained in detail in Section 4.4.2. Information about the ways of disambiguation is given in Section 4.4.4. The algorithms for component search are explained in the theoretical chapter 2.3; in Section 4.4.3 the results are reviewed. The path-length algorithm to trim graphs and its effects are described in Section 4.4.5. The next section goes more into detail on aspects of the implementation of the tool.

4.4.1 Details about implementation

Synfield is programmed in Java. A graph consists of a list of nodes. A node is represented by the word itself and lists referring to its neighbors (graph synonyms, graph hyperonyms, thesaurus synonyms and thesaurus hyperonyms). The data structure is inherited from the Java TreeMap; the name of the node represents the key. The classes Nodelist and Node provide a number of functionalities, such as different kinds of list unification. A data structure in this spirit is sketched below.
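The thesis does not list the Nodelist and Node classes themselves; the following fragment is a guess at their shape based on the description above (the word as key in a TreeMap, four neighbor lists per node). All member names are illustrative.

import java.util.*;

public class NodeList extends TreeMap<String, NodeList.Node> {

    /** One vertex of the word graph, keyed by the word itself. */
    public static class Node {
        final String word;
        final List<String> thesaurusSynonyms = new ArrayList<>();
        final List<String> thesaurusHyperonyms = new ArrayList<>();
        final List<String> graphSynonyms = new ArrayList<>();   // field neighbors
        final List<String> graphHyperonyms = new ArrayList<>();

        Node(String word) { this.word = word; }
    }

    /** Returns the node for a word, creating it on first sight. */
    public Node getOrCreate(String word) {
        return computeIfAbsent(word, Node::new);
    }

    public static void main(String[] args) {
        NodeList graph = new NodeList();
        Node n = graph.getOrCreate("freundlich");
        n.thesaurusSynonyms.add("nett");
        System.out.println(graph.firstKey() + " -> " + n.thesaurusSynonyms);
    }
}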
The thesaurus is available as a MySQL database [32]. Looking up a word is done by querying the word entry from the database.

Visualizing the graph or graph components can be accomplished in two different ways. The tool provides an interface to Graphviz (cf. [12]); running this application creates a picture of a graph. It is useful for large graphs, to gain a visual overview of the graph structure. For a smaller graph component and manual analysis, the second visualization technique, using the library JGraph [17], is more advisable. Building on Java Swing, nodes and edges are movable and resizable. The user has to adjust the graph structure on his or her own; however, the outcome is much more flexible⁸. The visualization tool was used to analyze graph components and to determine semantically motivated constraints on size and path length as described in this chapter.

⁸The figures in the appendix (Sec. C) are created with Graphviz, the figures in this section with JGraph.

Once graph components (or characteristics to compute graph components) are identified, the tool can be used to replace and regroup words according to the graph components. Using one identifier for each component, a lexicon is created. Again a TreeMap is used. Iterating over the words in all mails, lemmas given in the lexicon are replaced in the corpus.

4.4.2 Construction of a graph

A graph is built after the following principle (cf. pseudocode in Fig. 4.3). The vocabulary of the corpus is given in a list of words. The words extracted from the corpus are called text elements. Each text element is looked up in the thesaurus. A node is created for each text element and for each word mentioned in the thesaurus. Found thesaurus neighbors⁹ are registered for each text element.

Given: the corpus vocabulary in a list of words.
global node list N empty

forall text elements t
    if (t is not in N)
        create a node n for t
        add t to N
    else
        get node n for t
    look up t in the thesaurus and get thesaurus neighbors t'
    forall thesaurus neighbors t' of t
        if (t' is not in N)
            create a node n' for t'
            add t' to N
        else
            get node n' for t'
        register n' as a thesaurus neighbor of n
    end forall
end forall

forall polysemous text elements t
    disambiguate t
end forall

forall text elements t
    find field neighbors for t
end forall

Figure 4.3. Pseudocode of the algorithm to construct a graph

The next step is the disambiguation of text elements with more than one lexeme. It is treated in Section 4.4.4. Then the graph is searched for neighbor relationships. This is done by an exhaustive search over all text elements:

find_neighbors: ∀t ∈ C: ∀m ∈ C: m ≠ t, m ∈ Tt ⇒ Nt := Nt ∪ {m}   (4.1)

The procedure find_neighbors (cf. 4.1) iterates in a double loop over all text elements t and m in the given vocabulary list C. If the text element m is a thesaurus neighbor of the word t (i.e. an element of the set of all thesaurus neighbors of t, Tt), m is added as a field neighbor to the node of t, i.e. to the set of all field neighbors of t, Nt.

⁹The term thesaurus neighbors refers to both synonyms and hyperonyms (cf. 4.3).
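Continuing the sketched data structures from Section 4.4.1, find_neighbors of Eq. 4.1 might look as follows. Again the names are assumptions; the quadratic double loop follows the equation literally, although a set-intersection formulation would be faster.

import java.util.*;

public class FindNeighbors {
    /**
     * Eq. 4.1: for every pair (t, m) of text elements with m != t,
     * m becomes a field neighbor of t if the thesaurus lists it as a neighbor of t.
     */
    static Map<String, Set<String>> findNeighbors(
            List<String> textElements,                     // C
            Map<String, Set<String>> thesaurusNeighbors) { // t -> T_t
        Map<String, Set<String>> fieldNeighbors = new TreeMap<>(); // t -> N_t
        for (String t : textElements) {
            Set<String> nt = fieldNeighbors.computeIfAbsent(t, k -> new TreeSet<>());
            for (String m : textElements) {
                if (!m.equals(t) && thesaurusNeighbors.getOrDefault(t, Set.of()).contains(m))
                    nt.add(m); // N_t := N_t ∪ {m}
            }
        }
        return fieldNeighbors;
    }

    public static void main(String[] args) {
        List<String> c = List.of("erfolglos", "fruchtlos", "vergeblich");
        Map<String, Set<String>> thesaurus = Map.of(
                "erfolglos", Set.of("fruchtlos", "vergeblich"));
        System.out.println(findNeighbors(c, thesaurus));
        // -> {erfolglos=[fruchtlos, vergeblich], fruchtlos=[], vergeblich=[]}
    }
}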
Let us consider an example. The following list of words (cf. Fig. 4.4) is a component from the corpus plus some supplementary words. The words are looked up in the thesaurus and their neighbors are registered. Relationships between the words are found.

• Nett, nett (amiable, likable)
• lieb (good, nice, dear)
• höflich (courteous, polite)
• schön (nice, beautiful)
• verbindlich (authoritative, binding)
• geehrt (honored)
• geschätzt (appreciated, expected)
• teuer (beloved, dear, expensive)
• verehrt (adored, venerated)
• angesehen (esteemed)
• freundlich (friendly)
• erfolglos (unsuccessful)
• fruchtlos (fruitless, effectless)
• vergeblich (in vain)
• unnötig (unnecessary)
• zwecklos (useless)
• schwer (heavy, difficult)
• schwierig (difficult)
• kompliziert (complicated)
• Interessent (prospective customer)
• Bestandskunde (existing customer)
• Kunde (customer)

Figure 4.4. Word list for the running example

Figure 4.5 shows a graph derived from the vocabulary list given in Fig. 4.4, with translations. The graph consists of 23 nodes in total. Three components consist of several vertices, three only of one single node. The single-node components are not processed further. This graph is the result of a bfs-search. The difference to a tarjan-search is discussed in Section 4.4.3. Some nodes are not connected to any other nodes, such as Interessent (prospective customer) and Bestandskunde (existing customer). As the thesaurus does not contain those words, no information about their semantic relationships is available¹⁰.

Figure 4.5. A whole graph of the running example drawn with JGraph

Talking about feature reduction and regrouping of semantically related features, the size of the graph is important. The question may arise why to cause so much trouble involving graph theory, search algorithms and a lot of other instruments just because of regrouping words. Considering the second largest graph component gives a first answer: only looking up erfolglos (unsuccessful) in the thesaurus returns the synonyms fruchtlos (fruitless) and vergeblich (in vain). The graph component, by contrast, describes the meaning of the whole synset; not only direct neighbors of the word are considered. Applying graph theory extends one word to an enclosed structure where the sense of meaning is kept. And there is no need for manual labor, as graphs can be calculated automatically and on an acceptable time scale.

¹⁰Those words are called out-of-vocabulary words (OOV) in speech recognition [21, p. 95]. Both terms are very frequent in the corpus; they are specific terms in the area of application. The use of a thesaurus modified for the specific field of application and containing all specific terms may improve classification results.

4.4.3 Component Search

In order to identify the components in a disconnected graph, the two algorithms breadth-first search and Tarjan's algorithm are applied (cf. Sec. 2.3).

The algorithms return different results. The Tarjan algorithm finds 186 components for 4990 text elements, bfs returns 194. Overall, the tarjan-algorithm binds 1572 nodes and the bfs-algorithm 1661. Figure 4.6 maps the size of components to their number. Components containing only one node are left out. The node sizes are summed up in classes (on the x-axis). The Tarjan algorithm produces slightly more middle-sized components (5-20 nodes), whereas the bfs-search returns a larger number of small components (2-4 nodes).

Figure 4.6. The number of components in a certain size

The largest component found in the corpus is in both cases bigger than 1000 nodes (cf. figures in Sec. C.3). Such a big cluster is not usable for semantic structures, but with the help of the path length algorithm (cf. Section 4.4.5) the components can be trimmed to a suitable size.

But apart from statistical data, the algorithms also differ in detail. As is clear from the definitions, bfs-search components are larger and wider, as the following examples illustrate. Searching the running example (cf. Fig. 4.4) for graph components with bfs-search, the largest component bfs finds is illustrated in Fig. 4.7.

Figure 4.7. The largest component that bfs finds in the running example

Applying the tarjan-algorithm instead of the bfs-algorithm returns the following result: in the Tarjan component, the two nodes Nett¹¹ (nice, likable) and schön (fine, beautiful) are missing (cf. Fig. 4.8).

Figure 4.8. The largest component in the running example with Tarjan

¹¹Attention: there is a distinction between Nett and nett.
4.4.4 Treatment of polysemous words

Disambiguation (cf. Sec. 2.2.1) is the process of identifying which sense of a polysemous word is used in a given context. The usual approach is to apply context information from where the word occurs in order to identify the accurate sense of meaning.

Let us consider an example. The word freundlich (friendly) is given with the following annotations in the thesaurus:

1. leutselig (archaic)
2. charmant, gefällig, nett
3. galant, höflich, verbindlich, zuvorkommend
4. heiter, schön, sonnig (weather)

(Translations: leutselig: accostable; charmant: charming; gefällig: accommodating; nett: nice, kind, friendly; galant: chivalric, gallant; höflich: polite; verbindlich: obliging, authoritative; zuvorkommend: courteous; heiter: bright; schön: beautiful; sonnig: sunny.)

The second and the third lexeme are used most frequently, describing the character (2) or the behavior (3) of a person. In the sentence Er ist ein freundlicher Mensch (He is a kind person), the second or the third alternative would be suitable; a semantic structure replacing freundlich in this context should consist of these words to catch the correct sense of meaning. The fourth lexeme is used in the context of weather. (Another question is whether the lexeme freundlich describing weather occurs at all in customer mails sent to a telephone provider.) To apply word-sense disambiguation means to choose the most correct definition, i.e. the most suitable set of thesaurus neighbors.

In the following, three approaches to the treatment of polysemous words are described and compared; these three approaches are the values of the variable disambiguation tested later in the evaluation part (cf. Sec. 5.3.3). Polysemous words are identified by several groups of neighbors, one for each lemma. For each lemma a node is created, and neighbors are stored as for usual nodes; a polysemous word is represented as the group of its lexemes.

Naive disambiguation

The naive approach does not do any disambiguation at all. Instead of drawing distinctions between the different lemmas, the lists of synonyms are merged into one list, as are the lists of hyperonyms. So schön, nett and verbindlich, i.e. three words that differ a lot in their sense of meaning, are treated as equivalent neighbors of the word freundlich. A very illustrative example (cf. Fig. C.7 (Appendix)) is the word Leiter (conductor (elec.) and head, leader, chief): if only naive disambiguation is done, synonyms for conductor and for the second lexeme, head, leader, chief, are treated as one single feature. However, as this Master's project examines the effect of feature regrouping and of its different aspects, the different ways of disambiguation have to be taken into account.

The resulting graphs have a higher connectivity. The risk is to falsify the semantic regrouping if neighbors with different meanings get mixed.

Graph model based approaches to disambiguation

Standard approaches to disambiguation based on context recognition demand context information relating a given lexeme to a certain context class. An implementation of the Yarowsky algorithm [21, p. 638 ff.], adequately trained, could be applied; alternatively, a database mapping a sense of meaning to frequent collocators (i.e. words that usually appear in the context of the word and define its sense of meaning), such as [48, 50], would allow context information to be applied. Such information is not available in this Master's project, or lies beyond the limits of time and effort for a subproblem.
For this reason, an approach to disambiguation using the graph-based model was developed and applied. As one global graph is built for the whole corpus, Yarowsky's second assumption, "one sense per discourse", is extended: the whole corpus is considered as one single discourse. It is assumed that e.g. the word freundlich occurs only in one sense of meaning, describing a person's behavior. This very generalizing assumption facilitates implementation but has a big drawback on the precision of this approach.

After looking up all text elements in the thesaurus, the appropriate lexeme for a polysemous word is chosen. This is done by identifying the lexeme that has the most neighbors among the text elements. Given a polysemous word w with its lexemes l_1 ... l_n (with neighbor sets N_1 ... N_n) and the set of all words W, the lexeme representing the word is chosen by calculating the maximum intersection of the neighbor sets of the lexemes with the global word set:

w := l_j where j = argmax_i |W ∩ N_i|   (4.2)

Referring to the example of freundlich: in Fig. 4.5, the text elements connected to freundlich are nett, schön, höflich and verbindlich. Nett is a synonym in the second lexeme, describing a person's character; höflich and verbindlich are found in the third one, schön in the fourth one. So the third lexeme is chosen, because most of its list items are found as text elements.

There is an ambiguous case, and the two following approaches resolve it in different ways.

Disambiguation unite If lexemes have the same maximal number of neighbors in the graph, those lexemes are treated with naive disambiguation: they are united and their neighbor lists are merged. This is the case for the word nett. It has two neighbors in the field, lieb and freundlich, which belong to different lexemes. (The complete thesaurus entry of nett is: 1. fein, hübsch, niedlich, süß; 2. ansprechend, lieb, liebenswürdig, reizend, sympathisch, umgänglich; 3. charmant, freundlich, gefällig. Translations: fein: fine; hübsch: pretty; niedlich: cute; süß: sweet; ansprechend: pleasant; lieb: beloved; liebenswürdig: amiable; reizend: attractive, charming; sympathisch: sympathetic; umgänglich: companionable; charmant: charming; freundlich: friendly; gefällig: complaisant.) As no explicit maximum can be found, those two lexemes are united.

In Figure 4.9, the node nett therefore still has two neighbors. The node freundlich, however, is only connected to the neighbors of the chosen lexeme, höflich and verbindlich; other edges outgoing from this node are omitted. The graph has changed a lot: the node schön is now unconnected, and in some regions, e.g. around the node lieb, the graph is thinned out.

Figure 4.9. The graph of the running example built with disambiguation Unite
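The following Java sketch summarizes the lexeme choice of (4.2) together with the two tie-breaking rules: Unite, as just described, and the stricter Alpha, introduced next. Method and type names are assumptions made for illustration; only the maximal-intersection logic follows (4.2).

    import java.util.*;

    // Sketch of the graph-based lexeme choice (Eq. 4.2). Each lexeme is given
    // as its neighbor set N_i in thesaurus order; words is the global set W.
    class Disambiguator {
        enum Mode { UNITE, ALPHA }

        static Set<String> chooseNeighbors(List<Set<String>> lexemes,
                                           Set<String> words, Mode mode) {
            int best = -1;
            for (Set<String> n : lexemes) {              // find max |W ∩ N_i|
                best = Math.max(best, overlap(n, words));
            }
            Set<String> chosen = new HashSet<>();
            for (Set<String> n : lexemes) {
                if (overlap(n, words) == best) {
                    if (mode == Mode.ALPHA) chosen.clear(); // Alpha keeps one maximal lexeme
                    chosen.addAll(n);                       // Unite merges all maximal lexemes
                }
            }
            return chosen;
        }

        private static int overlap(Set<String> a, Set<String> b) {
            int count = 0;
            for (String s : a) if (b.contains(s)) count++;
            return count;
        }
    }

With a unique maximum, both modes return the same single neighbor set; they differ only in the tie case, as for nett.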
Disambiguation alpha This approach to disambiguation is more severe. If lexemes have the same maximal number of neighbors in the graph, the lexeme that comes last in the list is chosen and all others are ignored. Figure 4.10 illustrates the output of the running example, built with bfs and disambiguation Alpha. The word nett is again the ambiguous case where the special rule of disambiguation Alpha is applied: as no explicit maximum can be found, the second lexeme, containing the node lieb, wins.

In Figure 4.10, the node nett therefore has only one neighbor. As with disambiguation Unite, the node freundlich is only connected to the neighbors of the chosen lexeme, höflich and verbindlich, the node schön is unconnected, and the graph is thinned out around the node lieb.

Figure 4.10. The graph of the running example built with disambiguation Alpha

4.4.5 Trimming graphs - path length

A graph component containing 20 nodes or more is not useful for regrouping words into related word groups, as the semantic content differs too much. (The graph in C.5 (Appendix) illustrates how much words can differ after a short path.) In order to limit the component size, an algorithm was developed that trims components according to a given maximal path length. The algorithm resembles breadth-first search, differing in the path length property: from a chosen central node, the algorithm takes at most as many steps as the maximal path length allows. When the upper bound is reached, the connections to further neighbor nodes are cut off.

The procedure cutComponent (cf. Fig. 4.11) is called in a similar way as the adapted search algorithms: all nodes in a big component are stored in a priority queue, sorted by their number of neighbors. (If a large graph consists of several overlapping semantic structures, the nodes with the highest number of neighbors are supposed to be the semantic cores of local semantic structures, cf. Sec. 2.2.3, 4.3. This approach could be improved by adding another dimension that involves the word frequency in the corpus; in the current approach, only the position in the graph component is taken into account. An idea would be, e.g., to correlate the number of neighbors of a node with the tf-idf measure of the respective word.) The procedure is run with the node with maximal connectivity. All nodes contained in the component the procedure returns are removed from the priority queue.

procedure cutComponent
given: node c with highest connectivity (maximal number of neighbors)
all nodes in the graph are unmarked, distance = 0 for all nodes
empty queue Q, empty component C

enqueue c in Q, add c to C
while Q is not empty
    dequeue current node d from Q
    forall neighbors n of d
        if n was already visited, continue with the next neighbor
        mark n as visited
        n.distance := d.distance + 1
        if n.distance < maxDistance
            enqueue n in Q
            add n to C
        if n.distance = maxDistance
            add n to C
            cut off all neighbors of n that are not in C
    end forall
end while
return C

Figure 4.11. Pseudocode for the procedure cutComponent

The complexity of the algorithm is quadratic: assuming the worst case, a maximally connected graph component in which all vertices are adjacent to each other, each node is visited only once, but all neighbors of each node are examined.
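As a runnable counterpart to the pseudocode in Fig. 4.11, the following Java sketch performs the bounded breadth-first expansion; the adjacency map and all names are assumptions. Cutting off outside neighbors is realized implicitly here: only the nodes collected in the component are returned.

    import java.util.*;

    // Sketch of cutComponent: bounded bfs from the most connected node. Nodes
    // at the maximal distance are kept as border nodes but are not expanded
    // further, which cuts their remaining neighbors off the component.
    class ComponentCutter {
        static Set<String> cutComponent(Map<String, Set<String>> adjacency,
                                        String center, int maxDistance) {
            Map<String, Integer> distance = new HashMap<>();
            Set<String> component = new HashSet<>();
            Deque<String> queue = new ArrayDeque<>();
            distance.put(center, 0);
            component.add(center);
            queue.add(center);
            while (!queue.isEmpty()) {
                String d = queue.poll();
                for (String n : adjacency.getOrDefault(d, Set.of())) {
                    if (distance.containsKey(n)) continue;  // already visited
                    int dist = distance.get(d) + 1;
                    distance.put(n, dist);
                    component.add(n);
                    if (dist < maxDistance) queue.add(n);   // expand inner nodes only
                }
            }
            return component;
        }
    }

Run repeatedly from the head of the priority queue, with the returned nodes removed each time, this reproduces the iteration described above.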
In the following, the algorithm is illustrated by an example, the biggest component in the first graph picture of this section (Figure 4.5). The node with the highest connectivity (the most outgoing edges) is usually the semantic center of the field (cf. Sec. 2.2.3).

The algorithm starts with the node geschätzt, marked with a red circle in Figure 4.12. (The node lieb also has five outgoing edges, δ(lieb) = 5 = δ(geschätzt); for further distinction the number of thesaurus neighbors is taken into account, and lieb has fewer thesaurus neighbors than geschätzt.) The maximal path length is 2. In the first step, all direct neighbors of the node geschätzt are discovered (angesehen, geehrt, verehrt, teuer, lieb). The second step involves only one node, the node nett; for this border node, a cut operation is needed. The first iteration returns the component illustrated in Figure 4.13.

The second iteration starts at the node freundlich, marked with a red, dotted circle. The first step reaches the neighbors schön, verbindlich and höflich. This component is shown in Figure 4.14. The node Nett drops out as a single-node component.

Search with the cutting algorithm is exhaustive, as all nodes are visited. The queue, or rather the basic idea of bfs search, ensures that nodes are visited in the convenient order, by their distance to the central node.

In the following we will have a closer look at the newly created components. The larger component in Figure 4.13 resembles a K5 when the node angesehen is taken away. (The K5 graph is maximally connected and unplanar, i.e. there is no way to arrange the nodes in the plane such that the edges do not cross [11].) Apart from the word angesehen, all words in this component are closely related in their lexical meaning. (Some more examples are found in the Graph Gallery in the Appendix, cf. Sec. C.)

In the experiments, graphs with a maximal path length from 1 to 3 are tested.

Figure 4.12. A component cut with the path length algorithm

Figure 4.13. Component 1 cut from the example graph with path length 2

Figure 4.14. Component 2 cut from the example graph with path length 2

Chapter 5

Evaluation - Experiments

Experiments are done in two steps. First the corpus is analyzed; preprocessing steps and baseline tests are discussed. The second part reports classification experiments done with SVM, involving semantic feature regrouping. The effect of the different variables, such as the search algorithm and the path length, is examined. These results lead to a characterization of an ideal category for which semantic feature regrouping has a positive effect on text classification with SVM.

5.1 Corpus Analysis

This section treats the analysis of the corpus and the different preprocessing steps.

5.1.1 Decisions about the corpus

Bringing the data into a usable form was one of the most difficult challenges in this Master's project. Keep in mind that this is not a certified corpus such as [36]. The data in the whole original corpus is very unclean: analysing randomly picked emails, about 2/3 turned out to be spam or otherwise unusable (i.e. mail returned to sender, out-of-office replies etc.). In order to purify the data, the two Unbekannt categories were removed completely, which reduced the size of the corpus by two thirds. The data was still very unclean, and experiments using such a big corpus take a lot of time. (The different steps of preprocessing, i.e. calculating a graph and replacing words in the mails, are all done offline and demand time and assistance. Running an experiment series with SVM for all categories, with 5-fold cross-validation and two test setups, takes about 48 hours under the given hardware circumstances; a test setup with the smaller corpus of about 2000 emails is about 100 times faster.)
Moreover, the emails are not always the first mails of a customer contacting the customer service; e.g. colleagues forward information about a customer calling, etc. For this reason, the mails were filtered again for mails coming from the mail contact interface on the website of the telephone provider. Only the proper text of the email was kept; all other information (e.g. variables about the type of the subscription) was omitted. About 2000 mails remain. The average text length was thereby halved: it was 260 words in the spam-reduced corpus, whereas a mail in the mail form corpus counts 126 words on average (cf. Tab. A.2 (Appendix)). When talking about the corpus in the following, this last form containing 2000 mails is meant.

5.1.2 Category System

The structure of the category system is complex. There are two main categories: Produkt and Typ. Every email is assigned to both main categories, i.e. it has two classification variables.

The main category Produkt describes different product types, i.e. a telephone flat rate or an Internet service. It consists of 12 subcategories. The category Typ could be translated as service; it covers technical questions, contract cancellations etc. and consists of 6 subcategories. Every message is classified with exactly one pair of categories. Considering pairs of category assignments (Produkt, Typ), the number of categories rises to 72 (of which 67 occur).

To train and test a classifier in an accurate way demands a large enough set of data, at least more emails than folds in the category. (Folds is here the number of parts the data set is divided into for cross-validation, cf. 5.2.1.) Omitting the two main categories Produkt and Typ, only 5 categories out of 18 contain fewer than 5 emails and cannot be used. In the complex double-featured category system, more than half of the categories (43 of 67) are concerned.

Usually in text categorization, a complex category hierarchy is broken down into a combination of binary classification tasks [33]; a binary classifier returns a probability for a document to belong to a category. The scope of this Master's thesis, however, is not to build a classification system but to examine the effect of semantic feature selection. For these reasons the category system had to be simplified. It was decided to omit the two main categories and to consider just the subcategories. So in an experiment, the whole corpus is treated twice and each email is classified twice.

Table 5.1 gives an overview of all categories and a short description of their content.

5.1.3 Preprocessing

Several steps of preprocessing were done. The emails were taken out of the enterprise's database and stored in separate text files. The whole email, subject and body, was taken for classification; further information, such as sender, sending date etc., was omitted. To be able to administrate the large number of emails in the 18 categories, a database was created. It was used to extract lists of filenames identifying all emails in one category and to provide classification information to the classification algorithm.
Category    Description
Product 1   ADSL high speed internet
Product 2   Different kinds of offers for business customers
Product 3   A combination of internet and telephone flat rate
Product 4   A DSL flat rate
Product 5   Different kinds of offers for the fixed line network
Product 6   Cell phone issues
Product 7   A low price telephone subscription
Product 8   A telephone flat rate
Type 1      Contract cancellation
Type 2      Information about a product
Type 3      Payment issues
Type 4      Technical issues
Type 5      Changes

Table 5.1. Overview of categories and their content (5 categories have been omitted)

The steps of preprocessing concerned text cleanliness and tagging. About half of the emails were stored in HTML format, so HTML tags such as "<br><br>" had to be removed.

In Section 2.1, feature reduction was identified as a central problem in text categorization. The whole corpus is stored in a POS-tagged and in a lemmatized version; lemmatization and POS-tagging are done by the Tree Tagger [37], [38]. This cut the number of features down by about 40% (241973 token types reduced to 140630, cf. Tab. A.1). Furthermore, knowledge about a word's basic form (lemma) and its word class is essential for the use of semantic relationships. The tagger has an error rate of less than 5%, but mistakes or unsolvable ambiguities are possible. Tagging or lemmatization errors were not corrected, as this manual labour is too time consuming. In any ambiguous case, the tagger returns all possible forms, and the first form in the list was chosen. (The personal pronoun sie (she, they) can also be used in the polite form of address, spelled with a capital letter. If a sentence starts with sie, it is not clear which form is meant; the tagger returns sie, Sie, sie (3rd person singular, polite form of address, 3rd person plural). In such a case, the first word in this list was taken.)

A very common step in the preprocessing of natural language data is the elimination of stop words. Stop words are frequent words without a direct semantic meaning, such as or, and, the. Stop words were removed from the corpus using a stop word list [2].
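A minimal sketch of these two cleaning steps (tag removal and stop word filtering) could look as follows in Java. The tiny stop word list is a stand-in for the list [2], and lemmatization by the Tree Tagger is not reproduced here.

    import java.util.*;

    // Sketch of two preprocessing steps: stripping HTML tags such as <br> and
    // removing stop words. The stop word list is a small illustrative stand-in.
    class Preprocessing {
        private static final Set<String> STOP_WORDS =
                Set.of("und", "oder", "aber", "der", "die", "das");

        static List<String> preprocess(String rawEmail) {
            String text = rawEmail.replaceAll("<[^>]+>", " "); // strip HTML tags
            List<String> tokens = new ArrayList<>();
            // split on non-letter runs; \p{L} keeps umlauts and ß intact
            for (String token : text.toLowerCase().split("[^\\p{L}]+")) {
                if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                    tokens.add(token);
                }
            }
            return tokens;
        }
    }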
5.1.4 Category sizes and other statistical data

In the category system Typ, the categories are quite balanced in size, as Figure 5.1 illustrates. The four categories Type 5, Type 4, Type 2 and Type 1 consist of 100 - 500 mails each. The largest category, Type 3, contains about a third of the whole corpus, roughly 700 mails.

Figure 5.1. Histogram of the category system Typ

The category system Produkt (illustrated in Figure 5.2) shows a more heterogeneous distribution of category sizes than the other category system. Three categories consist of fewer than 5 emails and had to be omitted; they are not shown in the diagram. The category Product 7 encloses little more than one third of all mails (740). The rest lie between about 50 (Product 4) and about 500 emails (Product 3).

Figure 5.2. Histogram of the category system Produkt

The tables in the appendix (Tab. A.1) give an overview of the categories, their size and some more statistical data.

5.1.5 Correlation to the thesaurus

The thesaurus contains about 60000 entries, but most of these words belong to everyday speech, whereas a large percentage of the corpus vocabulary consists of specific terms that are not registered in the thesaurus. Like out-of-vocabulary (OOV) words in speech recognition (cf. [21]), those words are not included in further processing, i.e. feature regrouping is not possible, as no information about their semantic relations is available. Of 13010 words in the corpus, only 4008 are found in the thesaurus, i.e. less than one third (about 31%). Many of the OOV terms are highly frequent and important for classification. For instance, the word Anschluß (access, connection), which appears 288 times, is OOV. Another example is the word triple from the example graph in Section 4.4.2: Interessent (prospective customer), Bestandskunde (existing customer), Kunde (customer).

Table A.3 (Appendix) maps the categories to the percentage of words contained in the thesaurus. The categories can be separated into four groups:

1. very low coverage (< 60%): Product 3, Type 3 and Product 7
2. low coverage (60 - 70%): Type 1 and Type 2
3. medium coverage (70 - 80%): Type 4, Product 8 and Type 5
4. high coverage (> 80%): Product 1, Product 2, Product 4, Product 5 and Product 6

When working with semantic feature regrouping, this correlation is an important property of a category.

In [18], concepts are united by identifying 2-gram words: air pollution, water pollution and pollution are hierarchically regrouped to the concept pollution. Such an approach would be thinkable for this Master's project, as many OOV words are compounds of several single words that are in the thesaurus. The word Telefonrechnung (telephone bill) consists of Telefon and Rechnung. Both composites are covered by the