The way in which whole slide images (WSI) are viewed can be tracked and analyzed. In particular, it can be useful to learn how medical students view WSIs during exams and how their viewing behavior is correlated with the correctness of the answers they give. We used a software-based view path tracking method that enabled gathering data about the viewing behavior of multiple simultaneous WSI users. This approach was implemented and applied during two practical exams in oral pathology in 2012 (88 students) and 2013 (91 students), which were based on questions with attached WSIs. The gathered data were visualized and analyzed in multiple ways. As part of an extended analysis, we used machine learning approaches to predict the correctness of students' answers based on how they viewed WSIs. We compared the results of analyses for the years 2012 and 2013 - done for a single question, for student groups, and for a set of questions. The overall patterns were generally consistent across the two years. Moreover, viewing behavior data appeared to have certain potential for predicting answer correctness, and some outcomes of the machine learning approaches pointed in the right direction. However, the general prediction results were not satisfactory in terms of precision and recall. Our work confirmed that the view path tracking method is useful for discovering the viewing behavior of students analyzing WSIs. It provided multiple useful insights in this area, and the general results of our analyses were consistent across the two exams. On the other hand, predicting answer correctness appeared to be a difficult task - students' answers often seem to be unpredictable.

Whole slide images (WSI) are a technology that enables many new possibilities. One area in which WSIs are particularly useful is education. Digital representations of histological slides can be utilized not only in teaching but also in examination. If students view WSIs during an exam to answer the questions, we can track the way they actually navigate through the slides and the areas they look at. Gathering such data over multiple exams and analyzing it with advanced methods, such as machine learning, can provide interesting insights into students' viewing behavior and how it is correlated with the correctness of the answers they give. We can also try to predict whether a student gives a correct or incorrect answer based solely on the way he or she viewed a slide.

Tracking viewing behavior while WSIs are viewed can be accomplished in a variety of ways. Some methods involve eye movement tracking, [1] but they require specialized equipment, which does not scale well to tracking many students taking an exam at the same time. We use the software-based view path tracking method, which has already been presented. [2] The current paper extends the use of this method to collect more data, draw more general conclusions from an extended analysis, and use more advanced methods, such as machine learning, to attempt to predict answer correctness based on data describing viewing patterns.

Methods

Practical exams in oral pathology at Poznan University of Medical Sciences in Poznan, Poland have been conducted with the use of WSIs since 2005. In the first few years, the possibilities of tracking students' WSI viewing behavior during the exams were limited due to the lack of a reasonably scalable tracking method. In 2012, we introduced the view path tracking method. [2] It is integrated with the WSI system (WebMicroscope, Fimmic Ltd, Helsinki, Finland), does not require any specialized equipment, and is based on records sent to a central database while the slides are viewed. Each record contains information about the WSI area (view field) displayed on the student's monitor while the student was navigating through the WSI. Additionally, each record contains data identifying the student, the question, and the time when the view field was displayed.
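A single tracking record of this kind can be sketched as a small data structure. The field names below are illustrative, not the actual WebMicroscope database schema:

```python
from dataclasses import dataclass

@dataclass
class ViewFieldRecord:
    """One WSI view field displayed on a student's monitor.

    Field names and units are illustrative; the real schema may differ.
    """
    student_id: str       # identifies the student
    question_id: str      # identifies the exam question (and its WSI)
    x: float              # view field position in slide coordinates
    y: float
    width: float          # view field size at the current magnification
    height: float
    magnification: float  # zoom level at which the field was shown
    timestamp: float      # seconds since the question was opened

# A view path is simply an ordered list of such records:
path = [
    ViewFieldRecord("s001", "q17", 1200, 800, 2048, 1536, 5.0, 12.4),
    ViewFieldRecord("s001", "q17", 1500, 900, 1024, 768, 10.0, 15.9),
]
```

Everything downstream (visualizations, metrics, machine learning features) can be derived from streams of such records.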

We collected data from two exams: from the years 2012 (88 students) and 2013 (91 students). Each year, each student answered 50 exam questions. The view path tracking method was applied to all students participating in the exams. This resulted in a total of about 130,000 view field records gathered during these two exams. More detailed numbers, split by year, can be found in the summary in [Table 1].

Table 1: Summary of the tracking data collected during two practical exams in oral pathology

Most of the WSIs that appeared in the exam questions in 2012 were also present in the 2013 exam, which made certain comparative analyses possible. As in the earlier paper, [2] the general analysis methods include generating visualizations (both static images and animations) and calculating measures. In terms of visualizations, drawing all students' view paths from one year on a single image and confronting it with the analogous drawing for the other year is a good way of comparing viewing patterns occurring year to year [Figure 1]. Similarly, calculated metrics can be aggregated for each year and compared side by side.

Figure 1: Comparing aggregated students' viewing patterns occurring year to year for a whole slide image with neurofibroma

In each year, students took classes in oral pathology in six groups. These groups were supervised by different teaching assistants, and the impact of these teachers on students' exam scores has been analyzed. [3] To see whether there are any differences across metrics calculated for each group, we included the group number as a dimension in one of the analyses in the current work.

Finally, we went beyond the general analysis of viewing behavior among students answering correctly and incorrectly. Our goal was to predict answer correctness based on the calculated metrics. To approach this task, we treated the prepared data as training and testing datasets in a typical binary classification problem, where the computed statistics are the features (attributes) and the correctness of an answer is the label. Since incorrect answers were much rarer than correct answers, it was convenient for us to focus on predicting incorrect answers. After looking at the correlation between metric values and answer incorrectness, we trained machine learning models and explored their prediction potential. We tried multiple types of models and eventually focused on two: decision trees and random forests. We used two software environments that offer implementations of these (and many other) models: R [4] and Weka. [5]
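The framing, including the handling of the class imbalance, can be sketched as follows. The rows, feature names, and the upsampling helper are illustrative (the actual work used R and Weka), but the idea - label 1 for the rarer incorrect answers and duplicate minority rows to balance the training set - is the same:

```python
import random

# Hypothetical per-answer rows: (feature dict, label), where label 1
# marks an incorrect answer -- the rarer class we want to predict.
dataset = [
    ({"view_steps": 12, "viewing_time": 45.0}, 0),
    ({"view_steps": 80, "viewing_time": 210.0}, 1),
    ({"view_steps": 15, "viewing_time": 50.0}, 0),
    ({"view_steps": 9, "viewing_time": 30.0}, 0),
]

def upsample_minority(rows, seed=0):
    """Randomly duplicate minority-class rows until classes are balanced,
    mirroring the upsampling used for an unbalanced training set."""
    rng = random.Random(seed)
    zeros = [r for r in rows if r[1] == 0]
    ones = [r for r in rows if r[1] == 1]
    minority, majority = sorted((zeros, ones), key=len)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

balanced = upsample_minority(dataset)  # 3 correct + 3 incorrect rows
```

A balanced set like this is then what a decision tree or random forest is trained on.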

Results

Having data from two exams available, we wanted to check whether the conclusions from the 2012 exam [2] hold true for the 2013 exam. One potential level of analysis is comparing how all students who answered a given question viewed the WSI attached to that question. An example of a WSI for which viewing patterns are consistent from year to year is the case of well-differentiated papillary squamous cell carcinoma [Figure 2]. We can see that the relations between the values of six metrics calculated for students answering correctly and incorrectly are consistent across the two exam years. On the other hand, it can be noticed that the magnitude of the differences has changed for some metrics (for example, the difference in the number of view steps is larger in 2013).

Numbers aggregated within student groups, supervised by different teachers, were also compared. If we look at the statistics calculated for students answering correctly and incorrectly in each group, we can notice that the relations between average values of measures such as the number of view steps and viewing speed (expressed as the number of view steps divided by viewing time) are mostly consistent across multiple student groups and exam years [Figure 3].
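For a single view path, these two measures reduce to simple arithmetic. A minimal sketch, representing a path only by its view-field timestamps (the real records carry more data):

```python
# Illustrative metric computation for one student's path through one WSI.
# A path is represented here as a list of view-field timestamps in
# seconds since the question was opened.

def number_of_view_steps(timestamps):
    """How many view fields the student stepped through."""
    return len(timestamps)

def viewing_time(timestamps):
    """Total time from the first to the last view field, in seconds."""
    return timestamps[-1] - timestamps[0]

def viewing_speed(timestamps):
    """View steps per second: number of view steps divided by viewing time."""
    t = viewing_time(timestamps)
    return number_of_view_steps(timestamps) / t if t > 0 else 0.0

path = [0.0, 2.5, 4.0, 9.5, 20.0]
speed = viewing_speed(path)  # 5 steps over 20 s -> 0.25 steps/s
```

Averaging such per-path values within each student group gives the numbers compared in Figure 3.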

Figure 3: Average values of metrics aggregated for all questions within student groups, which were supervised by different teachers during the classes

We also confronted metrics aggregated across all questions and students, but separately for the years 2012 and 2013. In this general comparison, we first limited the set of questions to those with at least three correct and three incorrect answers (in the analyzed year). Then, we compared the number of questions for which the average metric value was higher for correct answers with the number of questions for which the average value was higher for incorrect answers. The results of the side-by-side comparison for the two exam years are presented in [Figure 4]. Although the magnitudes differ, it can be seen that the relations between the counts for 2012 are preserved in the results for 2013, which confirms the general patterns observed. Students answering correctly tended to spend less time viewing the slide, go through fewer view fields but faster, focus more on the diagnostic area (region of interest), use lower magnification levels, and view fragments that were less dispersed.
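This counting procedure, including the minimum-of-three filter, can be sketched for one metric as follows (the data and question ids are invented for illustration):

```python
from statistics import mean

def compare_metric_by_question(answers, min_per_class=3):
    """For one metric, count questions whose average value is higher for
    correct answers vs. higher for incorrect answers. `answers` maps a
    question id to a list of (metric_value, is_correct) pairs; questions
    with fewer than `min_per_class` answers in either class are skipped."""
    higher_for_correct = higher_for_incorrect = 0
    for rows in answers.values():
        correct = [v for v, ok in rows if ok]
        incorrect = [v for v, ok in rows if not ok]
        if len(correct) < min_per_class or len(incorrect) < min_per_class:
            continue
        if mean(correct) > mean(incorrect):
            higher_for_correct += 1
        elif mean(incorrect) > mean(correct):
            higher_for_incorrect += 1
    return higher_for_correct, higher_for_incorrect

# Toy data: viewing time per answer for three questions; q3 is skipped
# because it has fewer than three incorrect answers.
answers = {
    "q1": [(40, True), (45, True), (50, True),
           (90, False), (80, False), (95, False)],
    "q2": [(60, True), (55, True), (70, True),
           (30, False), (25, False), (35, False)],
    "q3": [(40, True), (42, True), (41, True), (100, False)],
}
counts = compare_metric_by_question(answers)  # -> (1, 1)
```

Repeating this for every metric and every year yields the paired green/red counts shown in Figure 4.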

Figure 4: General year-to-year comparison of metrics for multiple questions. Questions with an average measure value higher for students answering correctly/incorrectly are counted separately in green/red bars, respectively

The attempt to predict the correctness of students' answers based on data about viewing behavior was the most challenging task in this work. Based on the results of the above analysis, we expected the calculated measures to have certain prediction potential. In [Figure 5], we put the values of the 'number of view steps' and 'viewing speed' measures into buckets to show the total volume in each bucket (bars) together with the percentage of incorrect answers (line). These charts confirm some correlation, consistent with the general analysis from [Figure 4] - when the number of view steps is high or the viewing speed is low, a larger fraction of the answers is incorrect. However, this increased percentage occurs in buckets with relatively low volume (a small number of answers).
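The bucketing behind such a chart is straightforward; a minimal sketch with invented data:

```python
def incorrect_ratio_by_bucket(values_and_labels, bucket_width):
    """Group answers into equal-width buckets of a metric and return, per
    bucket index, the answer count (bar height) and the fraction of
    incorrect answers (line value)."""
    buckets = {}
    for value, is_incorrect in values_and_labels:
        b = int(value // bucket_width)
        n, bad = buckets.get(b, (0, 0))
        buckets[b] = (n + 1, bad + (1 if is_incorrect else 0))
    return {b: (n, bad / n) for b, (n, bad) in sorted(buckets.items())}

# Toy data: (number of view steps, answer was incorrect).
data = [(5, False), (7, False), (12, False), (15, True), (42, True)]
stats = incorrect_ratio_by_bucket(data, bucket_width=10)
# e.g. bucket 4 (40-49 view steps) holds 1 answer, 100% incorrect
```

Note how the high-incorrect-ratio buckets are exactly the low-volume ones, which is the caveat raised above.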

Figure 5: Analyzing correlation of values of selected metrics with the ratios of incorrect answers

One approach in the prediction experiment was to build a separate decision tree for each selected question, trained on 2012 exam data and tested on 2013 exam data, using six selected features. If the performance of these models were good, it would show that answer correctness within a question can be predicted based solely on viewing patterns registered during a previous exam. This was not the case, and most decision trees trained this way could not correctly predict how students would answer the given question in 2013. However, [Figure 6] shows a tree that detected incorrect answers in 2013 reasonably well, as can be seen in the confusion matrix. Given the difficulty of the task, a precision of 50% at a recall of 100% is a good result.
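Precision and recall follow directly from the confusion matrix counts. A sketch with illustrative counts (not the actual Figure 6 matrix) that reproduce the 50%/100% combination:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive ('incorrect answer') class.

    tp: incorrect answers correctly flagged; fp: correct answers wrongly
    flagged; fn: incorrect answers the model missed.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative counts: the tree flags 8 answers as incorrect, 4 of them
# truly are, and it misses no incorrect answer.
p, r = precision_recall(tp=4, fp=4, fn=0)  # -> (0.5, 1.0)
```

That is, every incorrect answer is caught, at the cost of flagging one correct answer for each truly incorrect one.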

Figure 6: A decision tree trained on data from the 2012 exam and tested on data from the 2013 exam, plus the performance of this tree. Algorithm used: rpart from R; upsampling applied for the unbalanced set

Finally, we used the combined data (9214 instances) for all analyzed students and all questions, from both years, to prepare a general model that would predict the correctness of any answer to any question. In this dataset, we extended the feature set to all 26 implemented measures. We also added two standardized versions of each measure for better generalization, resulting in 78 features in total. Then, we ran a 10-fold cross-validation experiment, in which we trained and evaluated random forest models (model settings: 200 trees, maximum depth of 5). Each prediction resulted in a value representing the probability of the given answer being incorrect. [Figure 7] shows the distribution of these values (bars - volume, blue line - predicted probability), together with a red line representing the actual percentages of incorrect answers, which ideally should be equal to the predicted probabilities. It can be noticed that answers scored with a high probability of being incorrect are indeed more likely to be actually incorrect. However, if we want to detect most of the incorrect answers (i.e., increase recall by lowering the probability threshold), precision drops significantly, as presented in the confusion matrix in [Figure 7], generated for a probability threshold of 50%.
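The comparison of predicted probabilities against actual incorrect rates is a calibration check. A minimal sketch of the binning step, with toy predictions from a hypothetical model:

```python
def calibration_bins(predictions, n_bins=10):
    """Bin (predicted_probability, actually_incorrect) pairs and return,
    per bin: answer count, mean predicted probability, and the actual
    incorrect rate (the bars, blue line, and red line, respectively)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in predictions:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    summary = []
    for rows in bins:
        if not rows:
            summary.append((0, None, None))
            continue
        n = len(rows)
        summary.append((n,
                        sum(p for p, _ in rows) / n,
                        sum(y for _, y in rows) / n))
    return summary

# Toy predictions: (probability of being incorrect, truly incorrect?).
preds = [(0.05, 0), (0.10, 0), (0.15, 1), (0.90, 1), (0.95, 1), (0.92, 0)]
bins = calibration_bins(preds)
```

For a well-calibrated model the second and third values agree in every populated bin; the mismatch at low thresholds is what makes precision drop when recall is pushed up.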

There is an extra outcome from training a random forest model - a list of feature importance values, which estimate the prediction potential of each measure. To generate such a list, we trained a random forest model fit to the combined exam data from the two years. It showed that total viewing time (one raw and two standardized versions) and the number of viewed fragments were among the top five most important features for predicting answer correctness. This is consistent with the general observation that students who view WSIs for a long time and go through many view fields tend to answer questions incorrectly.
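The idea behind feature importance can be illustrated with permutation importance: shuffle one feature column and measure how much accuracy drops. This is a simplified, model-agnostic stand-in for the impurity-based importances a random forest reports, shown here with a toy model and invented data:

```python
import random

def permutation_importance(model, X, y, feature_idx, seed=0):
    """Accuracy drop after shuffling one feature column: the larger the
    drop, the more the model relied on that feature."""
    def accuracy(rows):
        return sum(model(r) == label for r, label in zip(rows, y)) / len(y)
    rng = random.Random(seed)
    column = [row[feature_idx] for row in X]
    rng.shuffle(column)
    shuffled = [row[:feature_idx] + (v,) + row[feature_idx + 1:]
                for row, v in zip(X, column)]
    return accuracy(X) - accuracy(shuffled)

# Toy model: predict 'incorrect' (1) when total viewing time (feature 0)
# exceeds 100 s; feature 1 is irrelevant noise.
def model(row):
    return 1 if row[0] > 100 else 0

X = [(40, 3), (55, 7), (160, 2), (200, 9), (30, 5), (180, 1)]
y = [0, 0, 1, 1, 0, 1]
```

Here shuffling the noise feature changes nothing (importance 0), while shuffling viewing time can only hurt the toy model, which classifies the original data perfectly.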

Conclusions

We confirmed the usefulness and scalability of the software-based view path tracking method for WSIs. It was enabled during two practical exams in oral pathology, and data about students' viewing behavior was successfully collected and processed. The presented method is implemented in a WSI viewing system and works in a way that is transparent to the users. As described previously, [2] this approach could also be applied to scenarios other than an exam in oral pathology.

The results demonstrate the variety of analyses that can be done using the collected data. Viewing patterns were discovered for students answering correctly and incorrectly. The overall metrics comparison shows consistency in the outcomes from the two exam years and suggests that the general viewing patterns are stable. However, the attempt to predict students' answers based on data about WSI viewing behavior appeared to be a difficult task. While some prediction results were in the expected direction, the outcome was not satisfactory in most cases, suggesting that students' answers are often unpredictable.