
Predicting Word Fixations in Text with a CRF Model for Capturing General Reading Strategies among Readers

Tadayoshi Hara (1), Daichi Mochihashi (2), Yoshinobu Kano (1,3), Akiko Aizawa (1)
(1) National Institute of Informatics, Japan
(2) The Institute of Statistical Mathematics, Japan
(3) PRESTO, Japan Science and Technology Agency

Abstract

Human gaze behavior while reading text reflects a variety of strategies for precise and efficient reading. Nevertheless, the possibility of extracting and importing these strategies from gaze data into natural language processing technologies has not been explored to any extent. In this research, as a first step in this investigation, we examine the possibility of extracting reading strategies through the observation of word-based fixation behavior. Using existing gaze data, we train conditional random field models to predict whether each word is fixated by subjects. The experimental results show that, using both lexical and screen position cues, the model has a prediction accuracy of between 73% and 84% for each subject. Moreover, when focusing on the distribution of fixation/skip behavior of subjects on each word, the total similarity between the predicted and observed distributions is [value missing], which strongly supports the possibility of capturing general reading strategies from gaze data.

Keywords: eye-tracking, gaze data, reading behavior, conditional random field (CRF).

Proceedings of the First Workshop on Eye-tracking and Natural Language Processing, pages 55-70, COLING 2012, Mumbai, December

1 Introduction

Natural language processing (NLP) technologies have long been explored, and some have approached satisfactory performance. Nevertheless, even for such sophisticated technologies, various issues still await improvement. For example, parsing technologies have achieved over 90% accuracy, yet some coordination structures and modifier dependencies are still analyzed incorrectly. Humans, on the other hand, can deal with such issues relatively effectively. We expect that if we could clarify the mechanisms used by humans, the performance of NLP technologies could be improved by incorporating such mechanisms into their systems.

To clarify these mechanisms, analyzing human reading behavior is essential, and gaze data should strongly reflect this behavior. When a human reads a piece of text, especially for the first time, it is important that his/her eye movements are optimized for rapid understanding of the text. Humans typically perform this optimization unconsciously, which is reflected in the gaze data.

Eye movements while reading text have long been explored in the field of psycholinguistics (Rayner, 1998), and the accumulated knowledge of human eye movements has been reflected in various eye movement models (Reichle et al., 1998, 2003, 2006). Reinterpretation of this knowledge from an NLP perspective, however, has not been thoroughly investigated (Nilsson and Nivre, 2009, 2010; Martínez-Gómez et al., 2012). One possible reason is that eye movements inevitably contain individual differences among readers as well as unstable movements caused by various external or internal factors, which makes it difficult to extract general reading strategies from gaze data obtained from different readers or even from a single reader. In this research, we explore whether this difficulty can be overcome.
We aim to predict whether each word in the text is fixated by training conditional random field (CRF) models on existing gaze data (Kennedy, 2003), and then to examine whether such fixation behavior can be sufficiently explained from the viewpoint of NLP-based linguistic features. In the experiments, the trained CRF models predicted word fixations with 73% to 84% accuracy for each subject. While this accuracy does not seem high enough to explain human gaze behavior, a CRF model trained on the merged gaze data of all the subjects can predict the fixation distribution across subjects for each word with a similarity of [value missing] to the observed distribution, which should be high enough to extract a general distribution regardless of individual differences or unstable movements in the gaze data. The experimental results also show that, to capture human reading behavior correctly, both lexical and screen position features are essential. This suggests that we need to adequately distinguish the effects of these two kinds of features on gaze data when incorporating strategies from gaze data into NLP technologies.

In Section 2, we discuss related work on analyzing gaze data obtained while reading text. In Section 3, we briefly explain the fundamental concepts of gaze data by introducing existing gaze data in the form of the Dundee Corpus, and also introduce the CRF model, which is trained to predict word-based fixations. In Section 4, we discuss preprocessing and observation of the Dundee Corpus in designing our model. Finally, in Sections 5 and 6, we explain how to predict word-based fixations in the corpus and analyze the performance of our model, respectively.

2 Related work

In the field of psycholinguistics, the study of eye movements while reading text is well established (Rayner, 1998), and the accumulated knowledge has resulted in various models of eye movements. E-Z Reader (Reichle et al., 1998, 2003, 2006) is one such model. E-Z Reader was developed to explain how eye movements are generated for the target gaze data, and not to predict eye movements when reading text for the first time. These models are optimized for the target gaze data by adjusting certain parameters, without any machine learning approach.

On the other hand, the work presented in (Nilsson and Nivre, 2009) was, as the authors stated, the first to incorporate a machine learning approach to model human eye movements. The authors predicted word-based fixations for unseen text using a transition-based model. In (Nilsson and Nivre, 2010), temporal features were also considered in order to predict the duration of fixations.

There are important differences between the two approaches mentioned above, other than the way in which the parameters are adjusted and the purpose of the modeling. The former approach modeled the average eye movement of the subjects, while the latter trained the model for each subject. The key point is that the former approach attempts to generalize human eye-movement strategies, while the latter attempts to capture individual characteristics. Our final goal is not merely to explain or predict human eye movements, but to extract from gaze data reading strategies that can be imported into NLP technologies. Since it is not clear whether extracting individual or averaged strategies is better for this purpose, we train our models to predict both word-based fixations for each subject and the total distribution of the behavior across the subjects.

An image-based approach was proposed in (Martínez-Gómez et al., 2012) to clarify the positions in the text that should be fixated in order to understand the text more quickly.
The authors represented words in the text as bounding boxes, and visualized each of the linguistic features of words as an image by setting the pixel values of the word-bounding boxes according to the magnitude of the feature values of the words. They then attempted to explain the target gaze data represented in the image using a linear sum of the weighted feature images. This work also incorporated screen position features of words by representing each linguistic feature in a text image, which meant that the screen position and linguistic features were considered to be strongly connected. In our models, on the other hand, these two kinds of features are described separately and then paired, since we need to exclude the contribution of screen position features when incorporating captured reading strategies into NLP technologies, where screen positions are rarely considered.

3 The target gaze data and the model used to analyze them

3.1 The Dundee Corpus

The Dundee Corpus (Kennedy, 2003) is a corpus of eye movement data obtained while reading English and French text. For each language, 20 texts from newspaper editorials (each of which contained around 2,800 words) were selected, and each of the texts was divided into 40 five-line screens containing 80 characters per line. While 10 native speakers read the texts displayed on the screen, an eye tracker was used to record the gaze points on the text every millisecond. Through their screen settings, patient calibration of the eye tracker, and post-adjustment of gaze data, the authors successfully controlled the error of each gaze point to be within a character. The gaze data included in the corpus, therefore, consist of character-based fixations. Consecutive gaze points on a single character were reduced to a single fixation point with the combined duration (Figure 1).

Generally, an eye movement from one fixation point to another is called a saccade, and backward saccades are called regressions.
In a saccade action, the human gaze usually moves several characters forward in the text, which means that some characters are not fixated. The reason for this is that humans can see and process the areas around fixated points, referred to as peripheral fields.
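The reduction of per-millisecond gaze points to character-based fixations described in Section 3.1 can be illustrated with a minimal sketch. It assumes that each gaze sample is already mapped to a flat character index in reading order and that one sample corresponds to one millisecond; the function names `merge_gaze_points` and `classify_moves` are hypothetical and are not part of the Dundee Corpus tooling.

```python
from itertools import groupby
from typing import List, Tuple


def merge_gaze_points(samples: List[int], ms_per_sample: int = 1) -> List[Tuple[int, int]]:
    """Collapse runs of consecutive samples on the same character
    into (char_index, duration_ms) fixations."""
    return [(char, sum(1 for _ in run) * ms_per_sample)
            for char, run in groupby(samples)]


def classify_moves(fixations: List[Tuple[int, int]]) -> List[str]:
    """Label each movement between successive fixations.
    Assumption: a move to a smaller character index is a regression."""
    labels = []
    for (prev_char, _), (next_char, _) in zip(fixations, fixations[1:]):
        labels.append("regression" if next_char < prev_char else "saccade")
    return labels


if __name__ == "__main__":
    # Hypothetical samples: 96 ms on character 3, 232 ms on 14, 168 ms on 9.
    samples = [3] * 96 + [14] * 232 + [9] * 168
    fixations = merge_gaze_points(samples)
    print(fixations)
    print(classify_moves(fixations))
```

In this toy input the second movement goes backward in the text and is therefore labeled a regression, while the characters between indices 3 and 14 are never fixated.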

[Figure 1: Character-based gaze data in the Dundee Corpus. The example shows the text "threatening their very existence?" with fixation durations of 96, 232, 168, 335, 173, 188, and 88 ms; the legend distinguishes fixations, saccades, and regressions.]

3.2 Conditional random fields

CRFs (Lafferty et al., 2001) are a type of discriminative undirected probabilistic graphical model. Theoretically, CRFs can deal with various types of graph structures, although we use CRFs only for sequential labeling of whether each word is fixated. We therefore explain CRFs with respect to sequences only, borrowing the explanation from (Sha and Pereira, 2003).

CRFs define the conditional probability distributions $p(Y \mid X)$ of label sequences $Y$ given input sequences $X$. We assume that the random variable sequences $X$ and $Y$ have the same length, and that the generic input and label sequences are $x = x_1 \cdots x_n$ and $y = y_1 \cdots y_n$, respectively. A CRF on $(X, Y)$ is specified by a vector $f$ of local features and a corresponding weight vector $\lambda$. Each local feature is either a state feature $s(y, x, i)$ or a transition feature $t(y, y', x, i)$, where $y, y'$ are labels, $x$ is an input sequence, and $i$ is an input position. Typically, features depend on the inputs around the given position, although they may also depend on global properties of the input. The CRF's global feature vector for input sequence $x$ and label sequence $y$ is given by

$F(y, x) = \sum_{i} f(y, x, i)$,

where $i$ ranges over the input positions. The conditional probability distribution defined by the CRF is then

$p_\lambda(Y \mid X) = \frac{1}{Z_\lambda(X)} \exp\left(\lambda \cdot F(Y, X)\right)$,

where $Z_\lambda(x) = \sum_{y} \exp\left(\lambda \cdot F(y, x)\right)$. The most likely label sequence for $x$ is then given by

$\hat{y} = \arg\max_{y} p_\lambda(y \mid x) = \arg\max_{y} \lambda \cdot F(y, x)$.

In our case, $x$ represents the words in the text and $y$ denotes whether each word is fixated.

4 Pre-processing and observation of the Dundee Corpus

In this section, we extract first-pass word-based fixations from the Dundee Corpus as the first step in our investigation.
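As an aside on the decoding rule of Section 3.2, the most likely label sequence of a linear-chain CRF can be found with the Viterbi algorithm, because the score of a label sequence decomposes into state and transition terms. The following is a minimal sketch under stated assumptions: the scoring functions `state_score` and `trans_score` and their weights are invented for illustration and are not the features or weights learned in this paper.

```python
from typing import Callable, List, Sequence

LABELS = ("fixated", "skipped")


def state_score(y: str, x: Sequence[str], i: int) -> float:
    """Toy state term (assumption): words of up to three characters
    prefer 'skipped', longer words prefer 'fixated'."""
    short = len(x[i]) <= 3
    return 1.0 if (y == "skipped") == short else 0.0


def trans_score(y_prev: str, y: str, x: Sequence[str], i: int) -> float:
    """Toy transition term (assumption): mild penalty for consecutive skips."""
    return -0.5 if y_prev == y == "skipped" else 0.0


def viterbi(x: Sequence[str],
            state: Callable[[str, Sequence[str], int], float],
            trans: Callable[[str, str, Sequence[str], int], float],
            labels: Sequence[str] = LABELS) -> List[str]:
    """Return the label sequence maximizing the summed state and transition scores."""
    if not x:
        return []
    best = [{y: state(y, x, 0) for y in labels}]
    back = []
    for i in range(1, len(x)):
        scores, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda yp: best[-1][yp] + trans(yp, y, x, i))
            scores[y] = best[-1][prev] + trans(prev, y, x, i) + state(y, x, i)
            ptr[y] = prev
        best.append(scores)
        back.append(ptr)
    last = max(labels, key=lambda y: best[-1][y])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    print(viterbi(["in", "the", "same", "way"], state_score, trans_score))
```

With these toy weights the decoder skips "in", "the", and "way" and fixates "same"; the transition penalty lowers the score of the two consecutive skips but does not change the decision.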
We then observe what types of information seem to determine word fixations/skips, which will help us to design feature sets for our CRF model in Section 5.

4.1 Extraction of first-pass word-based fixations from the Dundee Corpus

As a first step toward extracting reading strategies, we focus on word-based fixations and ignore their duration information, as examined in (Nilsson and Nivre, 2009). By merging consecutive fixations within a word into a single fixation, the resolution of the gaze data is reduced from a per-character to a per-word basis. Even after the merging, however, considering all types of observable behavior at once seems too complicated for a first step. We therefore further narrow our target by excluding regressions and saccades crossing lines from the gaze data as follows.

[Step 1] Each word-fixation is dealt with according to (i) and (ii).
(i) Omit the fixation from the gaze data and move to the next fixation if a fixated word (a) is labeled "visited" or (b) is in a different line from a previously-fixated word.
(ii) Else, allocate "visited" labels to the fixated word and all the preceding words in the text.
[Step 2] A sequence of gaze data is reconstructed using the remaining fixations.

For the gaze data in Figure 1, for example, character-based fixations are first merged into word-based fixations, the fixation after the regression from "very" to "their" is then ignored, and thereafter the gaze data are reconstructed as shown in Figure 2. With the data obtained from the above operation, we can focus only on word-fixations involved in first-pass forward saccades within single lines.

[Figure 2: First-pass word-based fixations in the Dundee Corpus, shown for the example text "threatening their very existence?"; the legend distinguishes word-based fixations and saccades.]

Table 1: Frequency of number of words in skipped sequence per subject

Subject   Total no. of saccades   0 words skipped   1 word skipped
A         31,431                  17,683            8,831
B         36,248                  24,669            8,900
C         37,657                  26,348            9,369
D         36,570                  24,560            10,044
E         32,442                  18,896            9,023
F         38,982                  28,561            8,859
G         38,910                  28,640            8,324
H         33,910                  20,540            10,068
I         36,717                  24,957            9,117
J         37,738                  26,479            9,297

(The counts for longer skipped sequences and the average row are not recoverable. The ratios of the average counts against the total, from the second column onwards, are 100.00%, 66.91%, 25.46%, 6.44%, 0.92%, 0.16%, 0.05%, 0.02%, and 0.01%.)

4.2 Observation of skipped words in the Dundee Corpus

When observing the gaze data obtained in the previous section, we can see that for each subject many words were skipped by saccades, that is, not fixated at all. We consider that such skips reduce the time spent on word-fixations and therefore lead to more effective reading, that is, faster reading without sacrificing understanding. Here we explore this word-skip behavior in the gaze data in order to utilize its characteristics to model word-fixations in the experiments.

Table 1 shows the number of saccades per subject for the 20 texts of the Dundee Corpus (second column), and classifies these saccades according to how many consecutive words the subject skipped (third column onwards). The numbers in parentheses at the bottom of the table show the ratios of the number of saccades skipping a particular number of words against the total number of saccades. According to this table, the saccades skipping up to three words constitute 99.73% of the total number of saccades. Even if we omit the saccades that move to the next word (shown in the third column) from our calculations, the saccades skipping one to three words constitute 99.18% of the remainder.
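The first-pass filter of Section 4.1 and the skip-length counts reported in Table 1 can be sketched together as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: word-level fixations are given as hypothetical `(word_index, line_index)` pairs in temporal order, and condition (b) of Step 1 is interpreted as comparing each fixation with the immediately preceding fixation in the raw sequence.

```python
from collections import Counter
from typing import List, Tuple

Fixation = Tuple[int, int]  # (word_index in the text, line_index on the screen)


def first_pass(fixations: List[Fixation]) -> List[Fixation]:
    """Keep only first-pass, forward, within-line word fixations."""
    kept = []
    visited_upto = -1   # all words with index <= visited_upto are "visited"
    prev_line = None    # line of the previous raw fixation (assumption)
    for word, line in fixations:
        crosses_line = prev_line is not None and line != prev_line
        prev_line = line
        if word <= visited_upto or crosses_line:
            continue                # Step 1 (i): omit this fixation
        kept.append((word, line))   # Step 1 (ii): keep it ...
        visited_upto = word         # ... and mark it and all earlier words
    return kept                     # Step 2: the reconstructed sequence


def skip_lengths(kept: List[Fixation]) -> Counter:
    """Count how many words each within-line forward saccade skipped."""
    counts = Counter()
    for (w1, l1), (w2, l2) in zip(kept, kept[1:]):
        if l1 == l2:
            counts[w2 - w1 - 1] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sequence: a regression to word 1, then a line change.
    raw = [(0, 0), (1, 0), (2, 0), (1, 0), (3, 0), (5, 1), (6, 1), (9, 1)]
    kept = first_pass(raw)
    print(kept)
    print(skip_lengths(kept))
```

In the toy sequence the regression to word 1 and the first fixation after the line change are dropped, and the final saccade from word 6 to word 9 is counted as skipping two words.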
Based on this observation, the assumption that each saccade action skips at most three consecutive words appears to be realistic. If there is a common regularity within the skipped sequences that can determine whether a target sequence is skipped, predicting whether a target word is skipped would require lexical information on the two words preceding or following the target word.

Table 2: Word sequences skipped by saccades in the Dundee Corpus. Each panel has the columns word sequence, # skips / # appearances, and ratio (%). Most numeric cells are not recoverable; surviving average skip counts are given in parentheses, and some entries appear to have lost attached punctuation.
(a) Frequently observed skips: the, of, to, and, a, in, that, is, for, The, on, as, of the, are (99.5), be (92.8), with (92.4), was (87.2), it (84.5), I (79.5), by (76.7), [one entry lost], have (71.4), or (70.5), in the (68.6), at (67.4), has (64.8), from (63.1), he (59.7), but (56.7), an (51.8).
(b) Sequences skipped with high rate (which appeared at least 5 times): His (4.8), Its (4.6), How (3.3), Of (6.7), From (3.9), A (21.7), or a (4.6), No (4.1), I'd (4.1), Ms (3.1), We (14.4), led (2.6), in (3.0), Most (3.4), The, de (3.8), & (3.8), or (70.5), of a (30.7), Is (2.1), is (2.5), It's (6.1), as a (20.9), We (2.4), Those (2.4), he's (2.4), a (3.6), He (19.6), [one entry lost], and.
(c) Skipped 2 or 3 word sequences (which appeared at least 5 times): or a (4.6), in (3.0), of a (30.7), is (2.5), as a (20.9), a (3.6), to a (13.4), and so (1.9), in a (22.9), the (4.5), of us (3.1), In a (2.4), up a (1.7), than a (4.4), and to (2.0); to be a (2.8), many of the (0.4), to do with (0.4), is not a (0.4), would be a (0.6), it is a (0.5), is that the (0.4), to make a (0.3), have been a (0.3), it is the (0.4), that it is (0.3), as much as (0.2), in order to (0.2), because of the (0.2), in the same (0.2).

Table 2(a) shows the top 30 word sequences skipped by saccades in order of the number of skips, averaged over the 10 subjects (leftmost values in the middle column). From this table, it seems that closed-class words such as determiners, prepositions, conjunctions, and auxiliary verbs are often skipped by saccades. When considering the ratio of skips against the total number of appearances of the target sequence (shown in the rightmost column), however, the frequently skipped sequences were not skipped at a high rate. For example, "the" was skipped most often, although its skip rate was only 26.56%.

Table 2(b) shows the top 30 sequences in order of skip rate against number of appearances, restricted to sequences that appeared at least 5 times in the corpus. As observed in Table 2(a), we can see that closed-class words are once again in the majority, while first (capitalized) words in sentences were also frequently skipped, although their skip rates were, as before, not that high. Even "His" at the top of the table was skipped with a rate of only 60.00%.

Table 2(c) shows the top 15 sequences based on the same criteria used in Table 2(b), but only for two- and three-word sequences. The table suggests that word sequences connecting something like NP chunks tended to be skipped, although their skip rates were not that high.
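The distinction drawn above between how often a sequence is skipped and its skip rate can be made concrete with a small sketch. It assumes a hypothetical input of tokens with one skipped/fixated flag per token for a single reader, treats each maximal run of skipped words as one skipped sequence, and ignores line boundaries; none of these choices is confirmed by the paper.

```python
from collections import Counter
from typing import List, Tuple


def skipped_runs(tokens: List[str], skipped: List[bool]) -> Counter:
    """Count each maximal run of consecutive skipped words as one skipped sequence."""
    runs, start = [], None
    for i, flag in enumerate(skipped):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append(" ".join(tokens[start:i]))
            start = None
    if start is not None:
        runs.append(" ".join(tokens[start:]))
    return Counter(runs)


def appearances(tokens: List[str], max_n: int = 3) -> Counter:
    """Count every n-gram of up to max_n words in the text."""
    return Counter(" ".join(tokens[i:i + n])
                   for n in range(1, max_n + 1)
                   for i in range(len(tokens) - n + 1))


def skip_rates(tokens: List[str], skipped: List[bool],
               min_appearances: int = 1) -> List[Tuple[str, int, int, float]]:
    """Rows of (sequence, # skips, # appearances, rate), highest rate first.
    min_appearances must be at least 1; longer runs than max_n are dropped."""
    skips, seen = skipped_runs(tokens, skipped), appearances(tokens)
    rows = [(seq, k, seen[seq], k / seen[seq])
            for seq, k in skips.items() if seen[seq] >= max(1, min_appearances)]
    return sorted(rows, key=lambda r: (-r[3], r[0]))


if __name__ == "__main__":
    tokens = "the cat sat on the mat by the door".split()
    skipped = [True, False, False, True, True, False, False, True, False]
    for row in skip_rates(tokens, skipped):
        print(row)
```

In this toy text "the" is skipped most often (twice in three appearances), yet "on the" has the higher skip rate, which mirrors the gap between Table 2(a) and Table 2(b).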
These observations suggest that the target word sequences themselves are related to whether they are skipped, while other factors, such as relations with surrounding words, should also be considered in skip decisions.

Based on the above, we aim to capture factors for word-skip behavior using features in the CRF models. Using CRF models trained on the gaze data, we examine how well the factors implemented as features can explain gaze behavior.

The main purpose of this research is to capture some generality in human reading strategies from an NLP perspective. From this point of view, it is desirable to be able to explain gaze behavior mainly using combinations of lexical information, in the usual way for NLP. For example, the width of peripheral fields and the range of saccades, which are given by human eye mechanisms, have long been shown in psycholinguistics to control gaze behavior, whereas we aim to interpret them in terms of window size, word length, and so on. Earlier in this section we assumed that the length of each skipped sequence is at most three words. We therefore attempt to predict fixation or skip behavior for each word using lexical information on the word and on the preceding and following two words, which implies a window size of five words.
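The five-word window can be illustrated with a short sketch that enumerates every uni-, bi-, and trigram of word surfaces inside the window around a target word. The feature-name format and the padding token are assumptions made for illustration; the paper does not specify how features were named.

```python
from typing import List


def window_ngrams(words: List[str], i: int, size: int = 2, pad: str = "<pad>") -> List[str]:
    """Return all uni-, bi-, and trigrams inside the window [i-size, i+size],
    each tagged with its offsets relative to the target word i."""
    padded = [pad] * size + list(words) + [pad] * size
    window = padded[i:i + 2 * size + 1]   # offsets -size .. +size around word i
    feats = []
    for n in (1, 2, 3):
        for start in range(len(window) - n + 1):
            first = start - size
            last = first + n - 1
            feats.append(f"WORD[{first}..{last}]=" + "|".join(window[start:start + n]))
    return feats


if __name__ == "__main__":
    sentence = ["threatening", "their", "very", "existence?"]
    for feat in window_ngrams(sentence, 1):
        print(feat)
```

A window of five words yields five unigrams, four bigrams, and three trigrams, so each target word receives twelve surface features of this kind.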

The level of lexical information can vary (surface form, POS, length, probability, and so on), and various combinations of these can also be considered. On the other hand, since text is displayed on a screen, optical factors must also be considered. In this research, we consider one of the most likely factors, that is, the screen position of each word. In the experiments in Sections 5 and 6, we examine the contribution of these factors by representing them as features in the CRF models.

4.3 Observation of commonality in gaze behaviors among subjects

This section investigates a method for capturing generality in gaze behavior among subjects. Using the gaze data obtained in Section 4.1, Table 3 gives the number of words that were skipped by each subject. From this table, we can roughly see some variability in gaze behavior among subjects.

Table 4 shows the degree of agreement among the subjects on whether each word is fixated or skipped. For each row, the table shows the number of words for which at least the given number of subjects displayed the same behavior. For example, words for which all the subjects displayed the same behavior comprised only 31.68% of the texts.

Table 3: Rate of skipped words

Subject   No. of skipped / all words   Rate
A         20,048 / 51,501              38.93%
B         15,224 / 51,501              29.56%
C         13,817 / 51,501              26.83%
D         14,890 / 51,501              28.91%
E         19,039 / 51,501              36.97%
F         12,490 / 51,501              24.25%
G         12,570 / 51,501              24.41%
H         17,563 / 51,501              34.10%
I         14,763 / 51,501              28.67%
J         13,736 / 51,501              26.67%

Table 4: Agreement on gaze behavior for each word (no. of words)

Condition for agreement                             Total (rate)
6 or more subjects displaying same behavior         47,320 (91.88%)
7 or more subjects displaying same behavior         39,439 (76.58%)
8 or more subjects displaying same behavior         31,855 (61.85%)
9 or more subjects displaying same behavior         24,219 (47.03%)
10 (all) subjects displaying same behavior          16,313 (31.68%)
Total words in all texts                            51,501

(The split of each total into skipped and fixated words is not recoverable.)
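The agreement counts of Table 4 follow from a simple tally over the per-subject labels. The sketch below assumes a hypothetical input layout, one list of booleans per subject with `True` meaning fixated; the function name `agreement_counts` is illustrative only.

```python
from typing import Dict, List


def agreement_counts(labels_per_subject: List[List[bool]]) -> Dict[int, int]:
    """For each majority size k, count the words on which at least k subjects
    displayed the same behavior (all fixated or all skipped)."""
    n_subjects = len(labels_per_subject)
    counts = {k: 0 for k in range(n_subjects // 2 + 1, n_subjects + 1)}
    for word_labels in zip(*labels_per_subject):
        fixated = sum(word_labels)
        majority = max(fixated, n_subjects - fixated)
        for k in counts:
            if majority >= k:
                counts[k] += 1
    return counts


if __name__ == "__main__":
    # Four hypothetical subjects reading three words.
    subjects = [
        [True, True, False],
        [True, False, False],
        [True, True, False],
        [True, False, True],
    ]
    print(agreement_counts(subjects))
```

With ten subjects the keys run from 6 to 10, matching the rows of Table 4; words on which the subjects split evenly fall into none of the rows.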
The low agreement shown in the table suggests that it is not a good idea to specify a single common behavior for each word. Based on this observation, we attempted instead to capture the distribution of how many subjects fixated or skipped each target word. We trained a CRF model on the merged gaze data for all 10 subjects, using the same feature set as in the model for each subject, and then used the obtained model to predict the distribution for each word in a target text.

5 Experimental settings

Based on the observations in the previous section, we examine whether word-fixations can be predicted using CRF models trained on the gaze data. In this section, we explain the experimental settings, mainly the features that are utilized to train the CRF models.

5.1 General settings

For the experiments, we trained a CRF model on the gaze data for each subject to predict the fixation/skip behavior of the subject for each word. In addition, we also trained a CRF model on the merged data for all subjects, to predict the fixation/skip distributions of each word across the subjects. The evaluation metrics for the models are given in Section 5.3.

For gaze data, we utilized the Dundee Corpus. As introduced in Section 3.1, the Dundee Corpus consists of gaze data for 20 texts, each of which was read by 10 subjects. We divided the data into training data, consisting of the data for 18 texts, and test data, comprising the data for the remaining two texts. All the gaze data were converted into first-pass saccade data according to Section 4.1, where each word was labeled "skipped" or "fixated" for each of the subjects. In the Dundee Corpus, symbols such as quotation marks, periods, and commas are concatenated with the nearest words. Considering the effect of this on gaze behavior, words in other tools were treated in the same manner. For the same reason, we left the capitalization of words unchanged.

To train the CRF models, we utilized CRFsuite (Okazaki, 2007). We used a sentence as an input/output unit, since many of the existing NLP technologies are based on sentence-level processing, and we intend to associate outputs of the CRF models with NLP technologies in our future work. To obtain input sentences, the five 80-character lines in each screen were split into sentences using the sentence splitter implemented in the Enju parser (Ninomiya et al., 2007). In training the CRF models, we selected the option of maximizing the log-likelihood of the training data with an L1 regularization term, since this would effectively eliminate useless features, thereby highlighting those features that really contributed to capturing the gaze data. The coefficient for L1 regularization in each model was adjusted on the test data to examine to what extent we could explain the given data using our features. We next explain the features utilized for training our CRF models.

[Figure 3: Word length features, each plotted against the number of characters in the word sequence for unigrams, bigrams, and trigrams. Panels: (a) L-POIS, normalized log(p_Poisson); (b) L-PROB, normalized log(p); (c) L-RECI, reciprocal.]

5.2 Features utilized for training CRF models

Based on the observation in Section 4.2, we set up features to capture the reading strategies. The examined features can be classified into two types: lexical features and screen position features. For each target word, we considered the features on the target word, the preceding two words, and the following two words, which implies a window size of five words.
Within the window size, we considered all possible uni-, bi-, and trigrams for each feature, except for 3G-F and 3G-B.

[Lexical features]
WORD: word surface(s).
POS: part(s) of speech obtained by applying the POS tagger (Tsuruoka et al., 2005) to each sentence.
L-POIS, L-PROB, L-RECI: information on surprisal of word length (real-value features). L-POIS assumes that the word length probability follows a Poisson distribution, and takes the logarithm of the probability of the target word length. The logarithmic values are normalized over the words in the texts (Figure 3(a)). L-PROB calculates the actual word length probability in the training data, takes the logarithm of the obtained probability, and then normalizes the logarithm (Figure 3(b)). L-RECI merely takes the reciprocal of the word length (Figure 3(c)). For all three of these features, when obtaining bi- and trigrams, we summed the lengths of the words and of the single space characters inserted between them.
3G-F, 3G-B: surprisal of a forward or backward word trigram (real-value features). We first obtained the probability distribution of forward or backward trigrams by training a trigram language model using SRILM (Stolcke, 2002) on the section of Agence France-Presse, English Service in the fourth edition of English Gigaword (Parker et al., 2009), which contains 466,718,000 words. The obtained probabilities for target trigrams were then converted into logarithmic values, and thereafter normalized over the trigrams in the texts.

[Screen position features]
LF: line- or screen-feed. This examines whether the target word is at the beginning or end of a line (L_start / L_end) or of the screen (S_start / S_end) (see Figure 4(a)).
SC: screen coordinates. This divides each screen into a 5 × 5 grid and examines in which cell the beginning of the word falls. Each screen in the Dundee Corpus consists of five 80-character lines, and therefore one cell has the capacity to hold 1 × 16 characters (see Figure 4(b)).

[Figure 4: Screen position features. Panel (a) LF, line- or screen-feed features, labels each word position as S_start (screen start), S_end (screen end), L_start (line start), L_end (line end), or L_mid (in the middle of a line). Panel (b) SC, screen coordinate features, labels the cells of the grid from (1,1) to (5,5), with the first coordinate running along a line.]

Table 5: Baseline rates for fixated words in the test data

Subject   # fixated words   Rate (%)
A         3,076             62.67
B         3,366             68.58
C         3,716             75.71
D         3,761             76.63
E         3,225             65.71
F         3,906             79.58
G         3,878             79.01
H         3,389             69.05
I         3,443             70.15
J         3,679             74.96

# words in test data: 4,908 (100.00%)

5.3 Evaluation metrics and baselines

To evaluate the model trained on the gaze data for each subject, we counted the number of words in the test data for which the model correctly predicted the subject's behavior.
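The two screen position features of Section 5.2 can be sketched as follows. The sketch makes several assumptions that the paper does not state: lines and character offsets are 0-based on input, grid cells are reported 1-based as (column, row) to match the labels of Figure 4(b), each word receives exactly one LF label, and screen labels take precedence over line labels.

```python
from typing import List, Tuple


def screen_coordinate(line_index: int, char_offset: int,
                      n_cols: int = 5, line_width: int = 80) -> Tuple[int, int]:
    """SC feature: 1-based (column, row) of the cell in which a word begins."""
    cell_width = line_width // n_cols            # 16 characters per cell
    col = min(char_offset // cell_width, n_cols - 1) + 1
    return (col, line_index + 1)


def line_feed_labels(lines: List[List[str]]) -> List[str]:
    """LF feature: one label per word of a screen, in reading order."""
    labels = []
    last_row = len(lines) - 1
    for row, line in enumerate(lines):
        last_col = len(line) - 1
        for col, _ in enumerate(line):
            if row == 0 and col == 0:
                labels.append("S_start")
            elif row == last_row and col == last_col:
                labels.append("S_end")
            elif col == 0:
                labels.append("L_start")
            elif col == last_col:
                labels.append("L_end")
            else:
                labels.append("L_mid")
    return labels


if __name__ == "__main__":
    print(screen_coordinate(0, 0), screen_coordinate(2, 79))
    print(line_feed_labels([["a", "b", "c"], ["d", "e"]]))
```

Keeping these two functions separate from the lexical feature extraction reflects the design choice stated in Section 2, where screen position and lexical features are described separately and then paired.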
Based on the observation that words were more often fixated than skipped for all subjects (see Table 3), we regarded the rate of fixated words in the gaze data for each subject as the baseline accuracy (see Table 5).

For the model trained on the merged data of all subjects, we first predicted the fixation/skip distributions of each word across the subjects for the test set. For each predicted distribution, the similarity based on Kullback-Leibler divergence was calculated against the distribution observed in the gaze data. Then, we took the average of the similarities over all words in the test set. More precisely, we calculated

$\exp\left\{ -\frac{1}{|T|} \sum_{t \in T} \sum_{i} p_{i,t} \log_e \frac{p_{i,t}}{q_{i,t}} \right\}$,

where the set $T$ represents a target text in which each word $t \in T$ is identified with its position in the text. $|T|$ is accordingly the number of words in text $T$, $i \in \{\text{fixated}, \text{skipped}\}$ is the label given to each $t \in T$, and $p_{i,t}$ and $q_{i,t}$ are the fixated/skipped distributions of target word $t$ across the subjects, predicted by the CRF model and observed in the gaze data, respectively. This similarity measure returns values in (0, 1]; it returns 1 if the two distributions are the same. Using this similarity, we examined how well our model could capture generality in the reading strategies of all subjects.
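The similarity measure can be computed directly from per-word fixation probabilities, as in the following minimal sketch. The clipping constant `eps` is an assumption added here: an observed distribution can contain a zero (for example, when all subjects fixated a word), in which case the divergence would otherwise be infinite, and the paper does not say how such cases were handled.

```python
import math
from typing import List


def distribution_similarity(predicted: List[float], observed: List[float],
                            eps: float = 1e-6) -> float:
    """exp(-mean over words of KL(p || q)) for two-outcome distributions.

    predicted[t] and observed[t] are P(fixated) for word t; P(skipped) is
    taken as the complement."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for p_fix, q_fix in zip(predicted, observed):
        for p, q in ((p_fix, q_fix), (1.0 - p_fix, 1.0 - q_fix)):
            q = min(max(q, eps), 1.0 - eps)   # assumption: avoid log(p / 0)
            if p > 0.0:                       # the term 0 * log(0 / q) is 0
                total += p * math.log(p / q)
    return math.exp(-total / len(predicted))


if __name__ == "__main__":
    print(distribution_similarity([0.7, 0.4], [0.7, 0.4]))
    print(distribution_similarity([0.5], [0.9]))
```

Identical distributions give a similarity of 1, and the value decreases toward 0 as the predicted distribution moves away from the observed one.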

For the baseline of this similarity measure, we averaged over the training data the fixation/skip distributions of each word across the subjects, giving [value missing].

6 Prediction of word-based fixation or skip behavior using CRF models

In the experiments, we first examine whether word fixation/skip behaviors in the test set can be explained using the trained CRF models. We then explore the individual contribution of each of the types of lexical and screen position features, and of combinations of these features, to prediction accuracy. We further observe which features are heavily weighted in the trained CRF model.

6.1 Individual contribution of each type of feature

[Table 6: Prediction accuracy of word fixation/skip behavior (using individual features). Rows: (Baseline), WORD, POS, 3G-F, 3G-B, L-POIS, L-PROB, L-RECI, LF, SC, (Using all of the above). Columns: Merged and Subjects A to J. Merged denotes the similarity of the distribution to the test data; Subjects gives the accuracy (%) of predicting word fixations/skips. Cell values are not recoverable.]

[Table 7: Contribution of individual features to prediction accuracy. Rows: (All individual types), followed by all feature types except WORD, POS, 3G-F, 3G-B, L-POIS, L-PROB, L-RECI, LF, and SC in turn. Columns: Merged and Subjects A to J, as in Table 6. Cell values are not recoverable.]

Table 6 gives the prediction accuracy of the CRF models using each feature individually on the test data, as well as that of the CRF model using all of the given features. Each of the columns A to J gives the prediction accuracy for the target subject, given by the CRF models trained on training data for the target subject, while the Merged column gives a similarity-based evaluation of the CRF models trained on the merged gaze data of all subjects (see Section 5.3). Using all the features, the trained CRF model gives between 0.90% and 12.57% higher accuracy than the baselines for each subject, and higher accuracy than using only individual features.
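The per-subject baselines against which these improvements are measured can be recomputed from Table 5, since the baseline is the rate of the more frequent label (fixated) in the test data. The sketch below uses only the counts reported in Table 5; the function names and the example predictions are hypothetical.

```python
from typing import Dict, List

TEST_WORDS = 4908  # number of words in the test data (Table 5)

# Number of fixated words per subject in the test data (Table 5).
FIXATED: Dict[str, int] = {
    "A": 3076, "B": 3366, "C": 3716, "D": 3761, "E": 3225,
    "F": 3906, "G": 3878, "H": 3389, "I": 3443, "J": 3679,
}


def baseline_accuracy(n_fixated: int, n_words: int) -> float:
    """Accuracy of always predicting the more frequent label."""
    return max(n_fixated, n_words - n_fixated) / n_words


def accuracy(predicted: List[str], observed: List[str]) -> float:
    """Fraction of words whose fixated/skipped label is predicted correctly."""
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


if __name__ == "__main__":
    for subject, n_fixated in FIXATED.items():
        print(subject, f"{100 * baseline_accuracy(n_fixated, TEST_WORDS):.2f}%")
```

Because subjects such as F and G fixate nearly 80% of the words, their baselines are already high, which leaves less room for improvement than for subjects A and E.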
The degree of contribution of each individual feature, however, seems to vary among subjects. For subjects A and E, the accuracy improvement over the baselines when using individual features is relatively higher than for other subjects. For subjects B, D, I, and J, an improvement is also observed, but this is less than for subjects A and E. For subjects F and G, on the other hand, barely any improvement is observed for any of the individual features. From these observations, although there are individual differences in the degree of improvement among subjects, it seems that some of the characteristics of word-fixation behavior can be captured using our features. However, the 72% to 84% prediction accuracy obtained using all individual features is not high enough to adequately explain each subject's behavior. This is discussed further in Section 6.5.

For the CRF models trained on the merged gaze data of all subjects (Merged column), on the other hand, each of the individual features drastically improves the distribution similarity to the test data, and when using all features, the distribution similarity is [value missing], which is an improvement of [value missing] over the baseline similarity. This similarity bodes well for our expectation that this CRF model can explain some generality in word-fixation behavior across all subjects.

Returning to the prediction for each subject, each of WORD, POS, L-PROB, and L-RECI individually seems to be able to capture some characteristics of the gaze data, while L-POIS and the screen position features LF and SC do not improve the prediction accuracy that much.

Table 7 examines the contribution of each individual feature to prediction accuracy, by training CRF models using all feature types except the target feature type. The table seems to show that removing any single feature type does not lead to a noticeable decrease in accuracy. This suggests that each individual feature is complemented by the remaining features.

[Table 8: Contribution of lexical (upper part) and screen position (lower part) features to prediction. Rows: (All individual types), followed by the results of removing WORD, POS, and 3G-F/-B; L-POIS/-PROB/-RECI; all lexical features; LF; SC; and LF, SC. Columns: Merged and Subjects A to J. Merged denotes the similarity of the distribution to the test data; Subjects gives the accuracy (%) of predicting word fixations/skips. Cell values are not recoverable.]
6.2 Contribution of lexical and screen position features

In order to explore the complementary characteristics of feature types, we start by focusing on the feature classification given by our definition: lexical and screen position features. Table 8 examines the contribution of lexical and screen position features to prediction accuracy. By removing all lexical features, that is, using only screen position features LF and SC (see the "all lexical features" row), the distribution similarity drops drastically by , and prediction accuracy for each subject also decreases by between 0.88% and 10.63%. We observe similar characteristics by removing all screen position features; distribution similarity drops by (see the "LF, SC" row), while prediction accuracy for each subject also decreases by between 0.94% and 6.31%. These observations suggest that both the lexical features and the screen position features capture certain information that can only be captured by those features.

In addition, the prediction accuracy obtained by removing all lexical features is similar to the baseline accuracy, regardless of the remaining screen position features. This would suggest that screen position features work well only in conjunction with lexical features. In other words, humans do not seem to be able to decide whether to fixate a word based solely on the word position.

The "WORD, POS, 3G-F/-B" and "L-POIS/-PROB/-RECI" rows in the table show that removing either the features on word length surprisal or all lexical features other than these does not bring about a serious decline in prediction accuracy. Considering that lexical features other than the word length features, such as WORD, can implicitly capture a great deal of information on word length, most of the lexical information affecting word fixations/skips seems to be word length surprisal. The "LF" and "SC" rows in the table, on the other hand, show that removing either screen coordinate features or line-/screen-feed features does not bring about a serious decline in prediction accuracy. Considering that most of the line-/screen-feed information is implicitly contained in the screen coordinate information, most of the screen position information affecting word fixations/skips seems to be whether a target word is at the beginning or end of a line/screen.

6.3 Contribution of combined features

We also considered combinations of two feature types. Table 9 shows the contribution of each combination of features to prediction accuracy.

Table 9: Prediction accuracy of word fixation/skip behavior (using combined features)
Columns: Merged; Subjects A, B, C, D, E, F, G, H, I, J
Rows (utilized feature types): Baseline; All individual types (AIT); WORD, POS; WORD POS, WORD, POS; AIT, WORD POS; LF, SC; LF SC, LF, SC; AIT, LF SC; WORD, LF; WORD LF, WORD, LF; AIT, WORD LF; WORD, SC; WORD SC, WORD, SC; AIT, WORD SC; POS, LF; POS LF, POS, LF; AIT, POS LF; POS, SC; POS SC, POS, SC; AIT, POS SC; AIT, all combination
"Merged" denotes the similarity of the distribution to the test data; "Subjects" gives the accuracy (%) of predicting word fixations/skips.

In the table, A B represents the combination of feature types A and B, which means that this combined feature is fired only when both A and B are fired. Some feature types are real-value features and cannot easily be combined with other feature types. We therefore omitted the real-value features as candidates for combination. When using each combined feature, we also added the respective individual features for smoothing.
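The combination scheme described above (a combined feature fires only when both of its parts fire, and the individual features are kept alongside for smoothing) might look as follows. This is a minimal sketch, not the feature extractor used in the experiments: the function names, the `&` joiner, and the feature strings are illustrative assumptions.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def combine(features_a: Sequence[str], features_b: Sequence[str]) -> List[str]:
    """Conjunction features: one per (a, b) pair in which both parts fired.

    If either side fired nothing, no combined feature is produced.
    """
    return [f"{a}&{b}" for a, b in product(features_a, features_b)]


def token_features(
    fired: Dict[str, List[str]],
    pairs: Sequence[Tuple[str, str]],
) -> List[str]:
    """Build the feature list for one target word.

    fired maps a feature type (e.g. "WORD") to the binary features of that
    type that fired for the word. pairs lists the feature types to combine.
    """
    out: List[str] = []
    # Individual features are always kept, for smoothing.
    for feats in fired.values():
        out.extend(feats)
    # Combined features are added on top of the individual ones.
    for type_a, type_b in pairs:
        out.extend(combine(fired.get(type_a, []), fired.get(type_b, [])))
    return out


if __name__ == "__main__":
    fired = {
        "WORD": ["WORD[0]=than"],
        "LF": ["LF[0]=L_mid"],
        "POS": [],  # nothing fired for this type in the toy example
    }
    print(token_features(fired, [("WORD", "LF"), ("WORD", "POS")]))
```

Because a conjunction needs discrete parts, real-valued feature types have no natural place in `combine`, which matches the decision above to leave them out of the combinations.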
From the table, we can see that adding each of the combined features contributes barely any accuracy improvement. Even when using all the individual and combined features (see the bottom row of the table), the improvement over using only the individual features is barely noticeable. These observations seem to imply that combining the features does not capture any information beyond what the features capture separately. Owing to the limited amount of gaze data, these results may be misleading, and further investigation would be required in order to continue this discussion.

6.4 Observation of heavily weighted features

From the heavily weighted features in the CRF model, we observed which features were regarded as important for explaining the gaze data. Table 10 shows the heavily weighted features in the CRF model that was trained using all individual features on the merged training data of all subjects. The left and right tables show the features weighted for fixations and skips, respectively. A number in square brackets [ ] represents a word whose feature was captured, identified by an offset from a target word. A sequence of two or three numbers in [ ] represents bi- or trigram features.

Table 10: Features that were heavily weighted in the Merged model using all individual features
Features (for fixations): L-PROB[0]; L-RECI[-1]; LF[0]=L end; SC[-2,-1]=(5, 4),(5, 4); LF[0]=L start; LF[-1,0]=L mid, L end; LF[0]=S end; SC[+2]=(1, 5); L-POIS[-1,0]; SC[+1,+2]=(1, 3),(1, 3); L-PROB[-1]; SC[0,+1,+2]=(5, 3),(5, 3),(1, 4); L-RECI[-2,-1]; WORD[-1]=But; SC[+1]=(1, 5); SC[-2,-1]=(5, 1),(5, 1); LF[+1]=L start; LF[-1]=L end; LF[0,+1]=L end, L start; LF[-1,0]=L end, L start; LF[0,+1]=L start, L mid; LF[0,+1]=S end, S start; SC[+1]=(1, 3); LF[+1]=S start; SC[+1]=(1, 4); SC[+2]=(1, 2); L-PROB[-2,-1,0]; SC[+2]=(1, 3); G-F[-2,-1,0]; LF[-2]=L mid; SC[0]=(5, 5); SC[0,+1]=(5, 5),(1, 1); SC[+1,+2]=(1, 1),(1, 1); POS[0]=CD; SC[-1]=(5, 5); SC[-1]=(5, 4); SC[+1,+2]=(1, 2),(1, 2); POS[0,+1]=NN, NNS; SC[+1]=(1, 2); SC[-2,-1]=(5, 5),(5, 5); SC[+1,+2]=(1, 4),(1, 4); SC[0,+1]=(1, 4),(1, 4)
Features (for skips): L-RECI[0]; L-POIS[+1]; Beginning of sentence; End of sentence; POS[-1]=_COLON_; WORD[0]=it's; WORD[-1]=; WORD[-1]=I; LF[-2,-1,0]=L mid, L mid, L mid; L-PROB[-1,0]; WORD[0]=than; LF[0,+1]=L mid, L mid; WORD[0]=that; LF[0,+1,+2]=L mid, L mid, L mid; WORD[0]=and; WORD[-1]=of; WORD[-1,0]=as, a; WORD[0]=from; WORD[0]=which; SC[-1,0,+1]=(1, 1),(1, 1),(1, 1); LF[0]=L mid

The tables suggest that surprisal based on word length probability and the reciprocal word length of a target word (L-PROB[0] and L-RECI[0], respectively) have a large influence on whether subjects fixate or skip the word, respectively. For L-PROB[0], according to Figure 3(b), longer words tend to give greater surprisal.
This may be because the longer length possibly suggests that the word is a content word and sometimes even an unknown word. In addition, a longer word may not be easy to skip with a single saccade. The heavy weight for fixations thus seems reasonable. For L-RECI[0], a large value for the reciprocal word length means that the word is short, and a shorter length possibly suggests that the word is a function word or is easily skipped by a single saccade. The weight for skips thus seems reasonable. From the viewpoint of the human eye mechanism, these features would have been fired without a fixation on a target word, using information on the word obtained through peripheral vision or guessed from the surrounding context.

For WORD features, most of the heavily weighted features are for skips and on target words (WORD[0]) that belong to a closed class, such as than, from, and which. These words are not content words and tend to be short, and therefore were likely weighted heavily for skips. On the other hand, WORD[-1]=But was heavily weighted for fixations. The reason for this may be that when a sentence starts with But, it draws the reader's attention to the next word.

For SC features, almost all of the heavily weighted features were located in the leftmost (1,*) or rightmost (5,*) coordinates, which is consistent with our analysis in Section 6.2. Many of these features were weighted for fixations for the simple reason that the next word was in the leftmost coordinate (SC[+1]=(1,*)), which would mean that subjects tended to fixate the last word in a line before their linefeed eye movements. Features of the form SC[0]=(5,*), which describe conditions similar to SC[+1]=(1,*), were not weighted as highly, probably because the first character of the last word in a line does not always appear in position (5,*).
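The word-length features discussed above, together with the bracketed offset notation of Table 10, can be illustrated with a small sketch. It assumes that L-PROB is the negative log probability of a word's length under an empirical length distribution and that L-RECI is one over the length; the exact estimators are defined earlier in the paper and may differ. All function names and the toy sentence are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def length_probabilities(words: Sequence[str]) -> Dict[int, float]:
    """Empirical probability of each word length in the given corpus."""
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: c / total for length, c in counts.items()}


def l_prob(word: str, probs: Dict[int, float]) -> float:
    """Surprisal (in bits) of the word's length.

    Assumes the length was seen when probs was estimated; an unseen
    length raises KeyError in this sketch.
    """
    return -math.log2(probs[len(word)])


def l_reci(word: str) -> float:
    """Reciprocal word length: large for short words."""
    return 1.0 / len(word)


def offset_features(words: List[str], i: int, offsets: Sequence[int],
                    probs: Dict[int, float]) -> Dict[str, float]:
    """Real-valued length features for target word i at the given offsets.

    Offsets that fall outside the sentence are skipped.
    """
    feats: Dict[str, float] = {}
    for off in offsets:
        j = i + off
        if 0 <= j < len(words):
            label = f"{off:+d}" if off else "0"
            feats[f"L-PROB[{label}]"] = l_prob(words[j], probs)
            feats[f"L-RECI[{label}]"] = l_reci(words[j])
    return feats


if __name__ == "__main__":
    words = ["But", "the", "unexpected", "result", "of", "it"]
    probs = length_probabilities(words)
    for name, value in offset_features(words, 2, [-1, 0, 1], probs).items():
        print(f"{name} = {value:.3f}")
```

In this toy sentence the ten-letter word gets a higher L-PROB value than the three-letter words only because its length is rarer in the sample; the relationship between length and surprisal in the actual data is the one shown in Figure 3(b).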

6.5 Discussion on the experimental results

The experimental results in Section 6 show that the CRF model trained for each subject does not have high prediction accuracy. When we analyzed the prediction errors, we found many long spans in the gaze data where all words were fixated. The subjects seem to have read these spans very carefully, which differed from the behavior displayed in other areas. It is natural that subjects do not maintain the same level of concentration or understanding throughout a text, yet our model was not able to capture this. We believe that this is the main reason why the CRF model for each subject does not exhibit high prediction accuracy. This issue will be addressed in our future work.

On the other hand, the experimental results also suggest that we can predict the distribution of fixation/skip behavior on each word across subjects with very high similarity to the gaze data, regardless of individual differences among subjects (see Table 4) and the unstable movements in the gaze data described above. This would imply the possibility of capturing and explaining generality in human reading strategies from an NLP perspective.

It should also be noted that our results depend on the preprocessing of the gaze data in Section 4.1. Nilsson and Nivre (2009) also used the Dundee Corpus, and trained and examined their model to predict word-based fixation behavior for each subject. Similar to our method, they applied some preprocessing to the gaze data to remove irregular eye movements, whereas, unlike our case, they also took regressions and revisits as well as first-pass forward saccades into consideration. Since the experimental settings differed, we cannot directly compare the prediction accuracy of our results with those in (Nilsson and Nivre, 2009).
However, considering that our baselines seem to be higher than those in (Nilsson and Nivre, 2009), we could say that our additional preprocessing simplified the problem and made the gaze behavior easier to capture.

We found that both lexical features and screen position features contributed to explaining the gaze data. Our final goal is to obtain some reading strategies from the gaze data, which can then be imported into NLP technologies. Considering this goal, we need to remove the screen position factors from the gaze data, since most NLP technologies perform sentence-based processing without any position information. The experimental results suggest that combined features of screen position and lexical information do not capture any extra characteristics. If this is true, we may be able to separate the two factor types without considering their mutual interaction.

Conclusion

In this research, we examined the possibility of extracting reading strategies by observing word-based fixation behavior. We trained CRF models on gaze data to predict the gaze behavior of each subject and the distribution of gaze behavior across all subjects. Using lexical and screen position features, the CRF models could predict word fixation/skip behaviors for each subject with 73% to 84% accuracy as well as the distribution of word fixation/skip behaviors across the subjects with a similarity to the original gaze data.

In our future work, we would like to collect gaze data on specific linguistic phenomena, such as coordination and prepositional attachment, and then attempt to extract some general reading strategies from this gaze data. Having achieved this, we aim to import the obtained strategies into NLP technologies such as parsing, to realize further progress in these fields.

Acknowledgments

This research was partially supported by Transdisciplinary Research Integration Center, Japan, Kakenhi, MEXT Japan [ ] and JST PRESTO.
