Stats for Researching Collocation

Hi. I am researching formulaic language among English learners. As a baseline, I'm using the mutual information (MI) and frequency scores I get when searching for given collocations in the Corpus of Contemporary American English (COCA). I hope someone out there can help me with a couple of questions about frequency counts and MI scores.

Question 1: Previous research has mostly examined 2-word collocations, not formulas with variable slots and variable sequencing. Those studies are therefore more likely to find higher MI, aren’t they? I’m finding that adjacent collocations have much higher MI than those with variable form. Is this to be expected? If so, why?
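
To make Question 1 concrete, here is a minimal sketch of how MI is commonly computed: log2 of the observed co-occurrence count over the count expected by chance, with the expected count scaled by the size of the search span. The function name, the span scaling, and all counts are assumptions for illustration; they are not real COCA figures and may not match exactly how COCA calculates its scores.

```python
import math

def mi_score(pair_freq, node_freq, collocate_freq, corpus_size, span=1):
    """Pointwise MI: log2(observed / expected co-occurrence count).

    Assumption: the expected count is scaled by the span, i.e. the number
    of positions in which the collocate is allowed to appear.
    """
    expected = node_freq * collocate_freq * span / corpus_size
    return math.log2(pair_freq / expected)

# Made-up counts, for illustration only.
counts = dict(pair_freq=500, node_freq=20_000,
              collocate_freq=1_000_000, corpus_size=1_000_000_000)

adjacent = mi_score(span=1, **counts)  # collocate must be the next word
wide = mi_score(span=4, **counts)      # collocate anywhere in 4 positions

print(f"span 1: {adjacent:.2f}, span 4: {wide:.2f}")
```

Under this formula, widening the span raises the expected count, so unless the observed count grows in proportion, the MI for the same pair drops.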

Question 2: When I allow slots between two co-occurring words, some of the retrieved results are actually different constructions from the one in my samples. For example, students used “encourage [NP] to”, but when I search for “encourage” followed by “to”, allowing up to 4 intervening words, the concordances include things like “encourage innovative approaches to”. How do I deal with these cases? I don’t want to count them in the total frequency, but I can’t manually check every concordance line.
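
One possible way to frame the filtering in Question 2, sketched under assumptions: if the concordance lines can be exported with part-of-speech tags, the lines could be kept only when the “to” inside the span is tagged as an infinitive marker rather than a preposition. The function name, the tag labels ("TO" for the infinitive marker, "II" for a preposition) and the sample lines are all hypothetical; the tags in a real export may differ.

```python
INFINITIVE_TAG = "TO"  # assumed tag for infinitival "to"; adjust to the tagset used

def is_encourage_np_to(tagged, max_gap=4):
    """Return True if a form of 'encourage' is followed, within max_gap
    intervening words, by a 'to' tagged as an infinitive marker.

    tagged: list of (word, tag) pairs for one concordance line.
    """
    for i, (word, _) in enumerate(tagged):
        if not word.lower().startswith("encourage"):
            continue
        # Positions i+1 .. i+max_gap+1 allow 0 to max_gap intervening words.
        window = tagged[i + 1 : i + max_gap + 2]
        if any(w.lower() == "to" and tag == INFINITIVE_TAG for w, tag in window):
            return True
    return False

# Hypothetical tagged lines, for illustration only.
wanted = [("We", "PP"), ("encourage", "VV"), ("students", "NN"),
          ("to", "TO"), ("read", "VV")]
unwanted = [("Schools", "NN"), ("encourage", "VV"), ("innovative", "JJ"),
            ("approaches", "NN"), ("to", "II"), ("teaching", "NN")]

print(is_encourage_np_to(wanted), is_encourage_np_to(unwanted))
```

This depends on the tagger separating the two uses of “to” reliably, so a manual check of a random sample of lines would still be needed to estimate the error rate.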