The present study aimed at investigating DIF sources on an EFL reading comprehension test. Accordingly, 2 DIF detection methods, logistic regression (LR) and item response theory (IRT), were used to flag emergent DIF of 203 (110 females & 93 males) Iranian EFL examinees’ performance on a reading comprehension test. Seven hypothetical DIF sources were examined in this regard: text familiarity, gender, topic/text interest, guessing, and the social variables of location, income, and educational status. Only LR, for gender and text familiarity, could preempt DIF with gender supporting the gendered-text effect while text familiarity benefiting, inversely, the participants with low level of text familiarity. For interest in topic, LR found a single item favoring the group with higher levels of interest and the IRT model detected DIF in either extreme. Regarding guessing and income, the LR indicated DIF supporting the low guessers and high-income group whereas IRT, conversely, showed DIF favoring the high guessers and low-income group. For location and education both methods, correspondingly, demonstrated DIF for the expensive location and educated groups. Finally, the differential test functioning result made it clear that only three sources of DIF (gender, income, & interests) were transferred to the test level. The findings could support a proportional effect of DIF sources.

IntroductionDifferential item functioning (DIF) occurs when equally knowledgeable individuals from different subgroups are of different likelihood of correctly answering (or endorsing) an item (Shepard, Camilli, & Averill, 1981). DIF is a necessary but not sufficient condition for bias (Zumbo, 1999). In other words, bias only exists when the groups illegitimately differ in their performance on an item. Because items indicating DIF can function as a threat to the validity of a test, DIF analysis has become an essential step in the validation of a test (Camilli & Shepard, 1994; Zumbo, 2007; Zumbo, & Rupp, 2004) especially in high-stakes testing situations (Pae & Park, 2006). The present study was an attempt to focus on several hypothetical factors that may cause DIF on a reading comprehension test. As pinpointed by McNamara and Roever, (2006), the literature lacks research on social or contextual sources of DIF. As such, several hypothetical social sources of DIF were also brought into focus in this study. Background Many studies have been conducted on DIF. The majority of these studies have focused on gender or ethnicity and only some have shed light on other sources of DIF. Among the studies conducted, mention could be made of those focusing on gender (e.g., Li & Suen, 2012; Pae, 2012; Ryan & Bachman, 1992; Takala & Kaftandjieva, 2000), ethnicity (e.g., Hammer, Pennock-Roman, Rzasa, & Tomblin, 2002; Stoneberg, 2004), language background (e.g. Harding, 2011; Li & Suen, 2012), topic familiarity or content knowledge (Pae, 2004), and field of study (Barati, Ketabi, & Ahmadi, 2006; Näsström (2004)). Such studies have indicated that for example east Asian test takers outperform test takers from European-language backgrounds on Michigan English Language Assessment Battery (Li & Suen, 2012); that females tend to outperform males in tests of verbal and written abilities, especially if constructed response items are included (Willingham & Cole, 1997); they also outperform males on aesthetics, human affairs, and contextualized reading items (Carlton & Harris, 1992) on the mood, impression, or tone items of a reading passage (Pae, 2012), on items related to human relations, human rights, aesthetic and on items referring to stereotypical female activities (O'Neill & McPeek, 1993; Sadker & Sadker, 1994), whereas males outperform females on antonyms, and analogies (Carlton & Harris, 1992), on reading items of inferencing type (Carlton & Harris, 1992; Lawrence & Curley, 1989), on items involving visualization and items calling upon knowledge and experiences acquired outside of school (Hamilton & Snow, 1998), and on items related to science and on items referring to stereotypical male activities (O'Neill & McPeek, 1993; Sadker & Sadker, 1994). The results of studies on DIF have sometimes been contradictory and far from being conclusive. Furthermore, the studies have been limited to few sources of DIF and lack of research is felt on social/contextual sources of DIF (McNamara & Roever, 2006). Very few studies have focused on how social factors may create DIF on different tests. The present study tried to fill this research gap by focusing on a number of social factors that may explain DIF on a reading comprehension test. It also tried to shed more light on a number of non-social factors mentioned as sources of DIF in the literature. As such, the following questions were put forward in the current study: 1. Can text familiarity, gender, interest, guessing, and contextual variables (location, income, and educational status) function as sources of DIF on EFL reading comprehension items? 2. To what extent do the results of DIF detection methods (IRT and LR) converge? 3. Can item-level differential performance manifest itself at the scale-level analysis? MethodDesign of the study Unlike the DIF studies in the literature which select an exploratory approach toward the analysis of DIF sources, the present study benefited from a mixed exploratory-confirmatory approach in this regard; that is, in dealing with social sources of DIF, an exploratory approach was adopted in which the test was administered to the test takers who were divided into groups based on the social variables (location, income, & educational level) and then their performance on the test was analyzed for instances of DIF. This approach is the one adopted in DIF studies in the literature. The present study, however, adopted a confirmatory approach in studying sources of DIF mentioned in the literature; that is, instead of running the test and then searching for the probable sources of DIF, the study, following McNamara and Roever's (2006) suggestion, selected the most cited sources of DIF in the literature and intentionally included them in the reading test to see whether they really manifest themselves in the DIF study. The assumption was that if such claimed variables are really sources of DIF, then an item which embraces such variables should necessarily indicate DIF. A counterargument for this will hold that such sources may be context and test specific and hence do not necessarily manifest themselves in DIF analysis of any test in any context. ParticipantsThe participants of this study included 203 English as a Foreign Language (EFL) learners (93 males and 110 females) selected from among around 300 intermediate-level students at Shiraz University Language Center (SULC) in the spring semester in 2011. Only those participants who based on their final scores in previous semesters were roughly identified as intermediate were selected for the study. Due to the administrative limitations, only the participants’ scores on the regular tests administered by the institute at the end of each term were taken into account and no specific proficiency test was employed. However, it should be mentioned that this was quite enough for the purpose of the present study because the DIF software (BILOG MG) automatically estimates the participants’ proficiency based on the test used for DIF analysis and does not need a separate test to be used for proficiency. The reason why we tried to roughly homogenize the participants based on their proficiency was only to make sure that the test used for DIF analysis was not too far from their level of proficiency and hence more realistic answers would be given by the participants; otherwise, DIF studies rely on a single test. InstrumentationReading comprehension test A reading comprehension test was developed for the purpose of the study. The passages of the reading test were originated from Anderson’s (2007, 2008) books entitled “ACTIVE Skills for Reading: Book 2” and “ACTIVE Skills for Reading: Book 3”. The books were selected from among a series of four books to fit the intermediate proficiency level. Furthermore, as these books are not usually taught in the Iranian language institutes or universities, there was a very low chance for the students to have seen the texts before. Attempt was made to include the hypothetical sources of DIF. To this end, first of all, on the basis of gender-based familiarity with text topic, 6 short passages were chosen. Following the literature (Bugel & Buunk, 1996; Newman, Groom, Handelman, & Pennebaker, 2008; O’Neill & McPeek, 1993) some topics were hypothesized to be gender-stereotypical; that is, such topics as practical affairs, money, mechanical tools, soccer, occupation were considered male favorable, whereas topics related to humanities, social sciences, aesthetics and human relations, wedding, and household chores were considered female favorable. Therefore, two passages were taken as male-friendly (What does a Million Dollars Buy? & Meet Freddy Adu, Soccer Sensation), two passages as female-friendly (Wedding Customs & The Right Job for Your Personality), and two others as gender-neutral (Are Human Beings Getting Smarter? & A Different Kind of Spring Break). To develop the test, each of the 6 passages was followed by five commonly used questions in reading comprehension tests (Broukal, 2007). They were items asking for details (facts), reference to a word or phrase, vocabulary knowledge, main idea (theme), and inference (logical conclusion). Therefore, the final test was a 30-item reading comprehension test. Two faculty members who were experts in L2 language testing and had worked on DIF checked the reading texts and questions in line with the suggestions made in the literature to make sure that the final test served the purpose of the study. The reliability of the test was also estimated through KR21which turned out to be .52. The moderate reliability observed was not far from logic as due to the confirmatory nature of the study the sources of error variance (DIF sources) were intentionally included in the test which could noticeably decrease the reliable variance proportion in the measurement equation. DIF questionnaire The second instrument was a very short questionnaire the first part of which was a set of bio-data questions asking the test takers to provide information about their gender, residency neighborhood, families’ monthly income, and parents’ educational level. The second part, however, was attached to the end of each passage and comprised these questions: a) How familiar was the text topic or content to you? b) How much were you interested in the reading topic? c) Which item did you answer by guessing? The purpose of this section was to collect information on the participants’ topic interest, topic familiarity and guessing. They were expected to indicate their level of familiarity with and interest in the text by selecting one of the five options very much, much, to some extent, little, and very little/ none. Data collection procedure The factors included in DIF analysis were literature-based variables hypothesized to contribute to test takers’ differential performance on EFL reading tests. They were gender, text familiarity, text interest, guessing, location in two levels (down=living in less expensive neighborhoods with a housing price of $700-$1500 per squared meter; up=living in the expensive neighborhoods with a housing price of $2200-$2800 per squared meter), income in two levels (low=less than $1000 per month, high= more than $1000 per month), and educational level in two levels (academically educated, academically uneducated). Data analysis It is recommended that more than one method of DIF analysis be employed in DIF studies to come to more dependable results (Camilli, 2006; Camilli & Shepard, 1994; Pae, 2012, Uiterwijk & Vallen, 2005). In line with this suggestion, the present study employed two methods: (a) a classical method: logistic regression (LR), (b) 1-p item response theory (IRT) model. This could add to the dependability of the results and made it possible to compare the degree of correspondence between the results of the two methods. Results and discussionDIF analysis based on Logistic Regression Overall, about 47% (14 items) of the whole test displayed DIF through the use of LR. Only three of these items indicated large DIF and the majority indicated moderate DIF based on the criteria recommended by Hidalgo and Lopez-Pina (2004); that is, negligible DIF: ΔR2 < 0.13, moderate DIF: 0.13 ΔR2 0.26), and large DIF: ΔR2> 0.26. In what follows the results of DIF analysis based on LR are presented in detail for each of the hypothetical sources. DIF based on gender As previously mentioned, the reading texts were selected with an eye toward the gender differences in topic familiarity of a reading test (e.g., Brantmeier, 2003; Pae, 2012). Overall, five items (17% of the whole test) were flagged for DIF based on gender. The results of LR indicated that in the female-friendly passages 13.33% of the items; that is, four items (two inference items, one vocabulary item, & one reference item, ) favored females. In the male-friendly passages only a single item (3.33% of the items) which was a vocabulary item favored males over females; and in the neutral passages no item indicated DIF in favor of a gender. The results of DIF overall, functioned in line with the literature on gender-based DIF that there exists DIF or differential performance on gendered texts (e.g., Carlton & Harris, 1992; Lawrence & Curley, 1989). In other words, the females got advantage of the female-friendly texts and the males were favored by male-friendly texts. Neither group was reported to gain benefit from the gender-neutral texts. More specifically, the results partially echoed those of the study by Newman et al. (2008) that most of the differences between men and women are related to the application of function words (e.g., reference-type item in the present study) and lexical words (e.g., vocabulary-type items in the present study). DIF based on the familiarity with text topic or content One part of familiarity was discussed under the rubric of gender-based DIF. However, mention was made that that particular analysis did not find any DIF for gender-neutral texts. Thus, familiarity was also considered separately with the intention of its effect on the items dispensed with the gender influence. The idea was that although two passages were claimed to be female-friendly, two male-friendly and two neutral based on the literature, still individual differences were possible in terms of their familiarity with the texts regardless of their gender. For example, saying that a text is more female-friendly based on the topic does not eliminate the chance that some males could be familiar with such texts. As such, the study focused on familiarity with the text topic as a factor for differential performance regardless of the gender. Therefore, the participants were divided into three groups based on their familiarity with each text (highly-familiar, moderately-familiar, and slightly-familiar). This was done based on the participants’ answers to the questionnaire items indicating their level of familiarity with each text. Therefore, those selecting very much and much options were considered as the highly-familiar group, those selecting to some extent as the moderately-familiar and those selecting little and very little as the slightly-familiar group. Then DIF analysis was performed. Unlike the results of DIF based on gender, the results of DIF based on familiarity found some traces of DIF in the gender-neutral texts. Overall, topic familiarity was found to be the source of DIF for three items, two of which were related to the gender-neutral texts. Therefore, the results were not in line with the results of gender-based DIF. It is therefore manifest that, with three items (10% of the whole test) indicating DIF, the familiarity was not supported to have a leading role in DIF occurrence since only one inference-type item (3.33% of the whole test) favored the moderately-familiar group and two items of details- and vocabulary-type (6.66% of the whole test) benefited the slightly-familiar group. This finding is against the literature (e.g., Sadker & Sadker, 1994) and does not support the assumption that higher familiarity may lead to higher chances of endorsing an item correctly. Therefore, taken with the results of the gender-based DIF together, it seems that gender (focusing on gender-friendly or gender-stereotypical texts) will provide us with a better explanation of DIF. In other words, familiarity is better to be considered together with the gender effect to explain the DIF on a test. DIF based on the interest in text topic To avoid the possible bias, test developers may want to select unfamiliar text topics, but by doing so the texts may not be so interesting and relevant to the test taker (Bachman, 1990). The challenging task of a test writer is to avoid either extreme, i.e. to develop topics that are very general and innocuous and at the same time interesting and relevant. Familiarity and interest pose difficulty in the design, development, and analysis of reading tests. Thus, we can hypothesize that the familiar texts would be interesting to a particular group often to the detriment of the other. To study the effect of interest as a source of DIF, like what we did for the familiarity, at first we divided the participants into three groups based on their interest in each text; that is, highly-interested, moderately-interested, and slightly-interested. This was done for each text separately and then DIF was checked for each item. Contrary to our expectations, the notion of interest in text topic did not turn out to be an influential source of DIF because only one item (3.33% of the whole items) which was a vocabulary item favored the highly-interested group over the others. DIF based on guessing It goes without saying that guessing is a hypothetical source of DIF in an MC test. In this study, the test takers were asked to inform the researcher whether for each passage with five items, they answered a particular item by chance. That is, they were asked to say if they guessed at a particular item or not based on which they were divided into three groups of low, mid, and high guessers. The results of DIF analysis flagged five items (16.66% of the whole test) based on guessing which favored low-guessers. In other words, the results indicated that being a member of the non-guesser (or low-guesser) group increased the probability of endorsing these items correctly. It seems that those who had guessed in this study were mostly wild guessers and thus their guessing did not help them. Guessing, therefore, was not supported to play a significant role in DIF results as low/non-guessers were more successful. DIF based on location Only one item was found to support the idea that the test takers' neighborhood (location) can function as a source of DIF. This item which was a vocabulary item functioned to the favor of those living in rich neighborhoods. Thus, living in different neighborhoods (and by generalization in different cultural settings) was corroborated by a single item (3.33 % of the whole test) to function significantly in the test takers’ differential performance on the vocabulary-type reading item. This finding, though based on a single item, supports Zumbo & Gelin's (2005) idea that by ignoring socio-geographic differences one may lose the whole picture of DIF. DIF based on income Some social groups, as a result of high income, may have access to more high-quality educational opportunities which, in turn, can lead to their gradually better performance on language tests. Income, in this study, emerged as a source of DIF in two items. One of the items was an inference-type item and the other one a main-idea item. Both items indicated much higher probability of endorsing the items correctly for those from higher-income families. Thus, income was supported only by two items (6.66% of the whole test) to distinguish between test takers in performing on the main idea- and inference-type reading items. DIF based on the educational level Unlike the location and income variables that reverberate the community-level contextual factors, the family (or parental) educational level is less susceptible to the community at large and compares test takers at the individual level (Zumbo & Gelin, 2005). To take note of this contextual factor, the test takers were asked to report whether their families or parents were academically educated or uneducated. The results of DIF analysis in this regard indicated that only one item (inference-type item) presented this contextual factor as a significant predictor of DIF in favor of the test takers coming from educated families. Thus, a small portion of the test (3.33%) supported the advantage of the educated group. IRT analysis Overall, 33.33% (10 items) of the whole test displayed DIF through the use of IRT. The results of IRT analysis for gender and familiarity variables did not show any meaningful and statistically significant DIF. As such, in what follows the remaining DIF sources (interest, guessing, location, income, & educational level) are taken into consideration. In each section the difficulty differences between the contrasting groups, called group threshold differences, and the standard error of measurement are provided for each item. Items for which the threshold difference is roughly twice (1.96) or more the size of the standard error display DIF at the p = .05 level (Thissen, Steinberg, & Wainer, 1993). The only parameter to be attended to in this program was the difficulty value (b) and therefore the lower threshold value for a particular group means that the item was easier for them. That is, the negative or positive direction of the threshold differences indicates which particular subgroup was favored. DIF based on the interest in text topic The IRT analysis revealed the threshold differences output for the interest in texts. As Table 1 indicates (see Appendix), items 1, 3, 5, 20, and 27 displayed DIF. The table indicates that in both mid-low and high-low comparisons, Item 1 functioned as a DIF item (Thissen et al., 1993). The negative threshold values in both cases indicate that the item (details type) was more difficult for the low group than the mid and high groups. Thus, those with high and moderate interest in Text 1 were favored by item 1. For item 3 the results demonstrate that in both cases (high-low, high-mid) the item was more difficult for the high group. Thus, it is concluded that the low and mid groups were favored by Item 3. By the same token, the results indicate that for Item 5, the low and mid groups benefited from the item. In item 20, the negative threshold difference (-0.792) indicates that for the low group the item was more difficult; and hence the high-interest group was favored by this inference-type item. Finally, for item 27 it is clear that the high group was favored because their threshold was lower than that of the mid group. The detection of 16.66 % of the whole items (five out of 30) as including DIF demonstrates a greater proportion of interest effect in the IRT analysis in comparison with the other factors. DIF based on guessing The negative threshold differences for items 11 and 14 and the positive difference for item 15 indicate that all the high guessers had more likelihood to get the details- (11), main idea- (14), and inference-type (15) items right. Thus, guessing as a speculative source of DIF indeed made the MC items function in favor of the high guessers. DIF based on location Regarding the residential neighborhoods of the respondents (up vs. down) only item 4 (3.33% of the whole test) revealed significant differential performance between the groups (1.417). Thus, the up group was reported to take advantage of this main idea-type item. DIF based on income Out of the 30 items, only one item (3.33% of the whole test) displayed DIF and supported the hypothetical function of income in the IRT analysis. Only item 19 was flagged for DIF. The negative threshold difference for this main idea-type item makes it clear that the low-income group found the item easier to endorse. In fact, those test takers coming from low-income families were favored by this particular item. As you will see in the DTF section, income was one of the sources transferred from the item level to the test level and had a proportional effect on the test bias. DIF based on educational level The comparison between the difficulty parameters of the educated- and uneducated-family test takers indicated that only item 5 (inference type) displayed significant DIF. The positive threshold difference reveals that upon answering item 5, the educated group was favored because for them the item was easier to endorse. Therefore, only 3.33% of the whole items supported the speculative effect of the parental educational level on differential performance of the test takers on reading comprehension. Comparing the LR and IRT results The overall results of DIF analyses are summarized in Table 2. The first column represents the hypothetical DIF sources; the second and the third illustrate DIF identified through LR and IRT respectively. The last column shows the items identified by both methods and therefore indicates the agreement between the two methods.

As the above table indicates, only items 5and 14 were identified as DIF by both DIF detection methods. This indicates a low level of correspondence between the results of the two methods. The IRT framework found no DIF related to the gender and familiarity variables. With respect to the interest variable both techniques could flag some DIF items. Through a small portion of correspondence between the models, interest as a source of DIF, which acts in a balanced correspondence with familiarity in reading comprehension (Bachman, 1990), was reported to function in a mixed direction, i.e. leading some items to favor interested test takers and some others to benefit the uninterested ones. Guessing, as a source of DIF in MC reading items was identified by both methods but with different functions. That is, the LR method indicated that low guessers were more successful, but the IRT model reported conversely that the DIF items were in favor of those test takers who guessed highly at the answers, hence a total divergence between the two methods. The results of both methods for location, in a complete correspondence, indicated that those living in the up neighborhoods were favored. In a similar vein, both frameworks identified a single item (item 5), in a complete convergence, as displaying education-oriented DIF in favor of those coming from educated families. However, for income-based DIF, the methods stood in a sharp contrast. In fact, unlike LR, IRT found that those coming from the low-income families were favored.

Let us interpret what we have come up with so far in two parts. First, as the above table depicts the two methods overlap in the preemption of only two DIF items. Why is there so much variance between them? Why are their results not supportive of each other? This finding could be explained in terms of the sample size. It is mentioned that for binary items at least 200 people per group is required for LR to function well and smaller sample

sizes could deteriorate the results (Zumbo, 1999). Further explanation of this finding, may come from the significance level used for DIF analysis. Usually IRT is more accurate and does much better with a smaller sample size than LR in flagging items for DIF. So a Bonferroni correction test could be useful to apply to the LR analysis results for more careful analysis (Alavi & Karami, 2010; Runnels, 2013; Thompson, 2006). This can lead to a lower number of items to be identified as DIF. However, this procedure was not of much help in the present study to resolve the differences in the results of IRT and LR DIF analyses because the differences did not lie in the number of items rather in the type of items.

However, overall the results of the two methods should be considered together, as there would be no method which is foolproof with all samples and for all contexts. Both methods, likewise, indicated DIF related to the contextual factors despite the low reputation of the factors in the literature. Furthermore, the presence of more DIF related to guessing rendered it as a very important source of DIF.

Second, of all the sources in the study guessing caused the most DIF followed by interest, gender, familiarity, and contextual factors. The study centered around reading comprehension tests and that is why the emergence of guessing DIF for MC items, interest in topics, and gender DIF for gendered passages was not unconceivable. Those sources that could be manipulatively included in the research yielded the results much more in keeping with the literature than those factors over which, in an ex post facto manner, the researchers had no control, i.e. the contextual (social) factors.

Analyzing differential test functioning (DTF) The presence of DIF items is considered to be a threat to the test validity to the extent that it manifests itself in the scale-level differential performance (Pae & Park, 2006; Zumbo, 2003). The results of a multiple-regression analysis indicated that although the effect of the variables on the total score was significant, only three variables (gender, income, and interest) were transferred from DIF to DTF level and added to the complexity of the DTF analysis (Pae and Park, 2006; Takala & Kaftandjieva, 2000). The remaining DIF sources, like the results of Zumbo’s (2003) study, did not emerge in the scale-level results.

ConclusionThe primary purpose of this study was to move beyond the hypothetical interpretation of DIF by following McNamara and Roever's (2006) suggestion, conducting an empirical confirmatory study. Its outlook was toward taking advantage of the falsification philosophy in a way that rival hypothetical DIF sources were proposed and none was presumed to predict the why of DIF occurrence. Two DIF sources were included in the reading test and the others were checked through a questionnaire attached to the end of each subtest (passage). The study could not support the notion of familiarity with text topic as a major source of DIF. For gender only the LR model could detect some male- and female-friendly DIF items. With the small number of DIF items for the interest, guessing, and income-related DIF, the results of the two DIF detection methods stood in a noticeable contrast to each other. However, the methods were found to be in a complete convergence in regard to the location and family educational status. Two other findings were related to the item type and bias in the test. The results indicated that gender could have an impact on reference- and vocabulary-type

reading items. Furthermore, the DTF result made it clear that of all the sources, only three factors (gender, income, & interests) were transferred to the test level. The findings of the study overall may warn language teachers and test developers about failing to pay due attention to DIF on reading comprehension tests. This would be double problematic in criterion-referenced tests where the number and size of DIF items can greatly affect the mastery and nonmastery decisions made based on a single cutoff score. Even more problematic would be ignoring DIF on high stakes tests which are usually used for gate-keeping purposes. The stigma of failing in such tests may distort the educational life (or even the entire personal life) of the test takers.