Big Data in Education: The 5 Types That Matter

Big data in education is a hot topic, and getting hotter. Proponents tout its potential for reform. Detractors raise privacy concerns. Skeptics don’t see the point of it all.

Few people seem to have a clear understanding of what big data in education means, what its scope is, what will inevitably result from it, or even how the fundamental types of data differ. The responsibility for clarifying and communicating this understanding starts with the organizations building data platforms or applications.

Take a recent example. The Gates-funded initiative inBloom drew scathing criticism over claims that it would share confidential information without parental permission, along with other security concerns. InBloom’s mistake, in my opinion, is that it holds personally identifiable information (PII) but hasn’t communicated a transparent payoff to users. For an education company to get big data right, it needs to be on the opposite side of both of those issues: avoid holding unnecessary PII, and communicate clearly how its service makes good, transparent use of users’ data.

(For the record: Knewton doesn’t hold any PII unless a user is able to consent and wants us to use the information for a specific reason: to create a private learning profile that can be carried by that user from app to app.)

Education has always had the capacity to produce a tremendous amount of data, perhaps more than any other industry. First, academic study requires many hours of schoolwork and homework, 5+ days per week, for years. These extended interactions with materials produce a huge quantity of information. Second, education content is tailor-made for big data, generating cascade effects of insights thanks to the high correlation between concepts.

Only recently have advances in technology and data science made it possible to unlock these vast data sets. The benefits range from more effective self-paced learning to tools that enable instructors to pinpoint interventions, create productive peer groups, and free up class time for creativity and problem solving.

At Knewton, we divide educational data into five types: one pertaining to student identity and onboarding, and four student activity-based data sets that have the potential to improve learning outcomes. They’re listed below in order of increasing difficulty to obtain:

1) Identity Data: Who are you? Are you allowed to use this application? What admin rights do you have? What district are you in? How about demographic info?

2) User Interaction Data: User interaction data includes engagement metrics, click rate, page views, bounce rate, etc. These metrics have long been the cornerstone of internet optimization for consumer web companies, which use them to improve user experience and retention.

This is the easiest to collect of the data sets that affect student outcomes. Everyone who creates an online app can and should get this for themselves.
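As a rough illustration of the metrics named above, here is a minimal Python sketch that computes page views, click rate, and bounce rate from a log of events. The event format (`session_id`, `event_type`), the function name `engagement_metrics`, and the exact metric definitions are assumptions made for this example; real analytics tools define these metrics in various ways.

```python
from collections import defaultdict


def engagement_metrics(events):
    """Summarize a list of (session_id, event_type) tuples.

    Assumed (hypothetical) event types: "page_view" and "click".
    Assumed definitions:
      click_rate  = clicks / page views
      bounce_rate = sessions with exactly one page view / sessions
                    with at least one page view
    """
    page_views = 0
    clicks = 0
    views_per_session = defaultdict(int)

    for session_id, event_type in events:
        if event_type == "page_view":
            page_views += 1
            views_per_session[session_id] += 1
        elif event_type == "click":
            clicks += 1

    # Only sessions that viewed at least one page are counted here.
    sessions = len(views_per_session)
    bounces = sum(1 for n in views_per_session.values() if n == 1)

    return {
        "page_views": page_views,
        "click_rate": clicks / page_views if page_views else 0.0,
        "bounce_rate": bounces / sessions if sessions else 0.0,
    }


sample_events = [
    ("s1", "page_view"), ("s1", "click"), ("s1", "page_view"),
    ("s2", "page_view"),
    ("s3", "page_view"), ("s3", "page_view"), ("s3", "click"),
]
print(engagement_metrics(sample_events))
```

Nothing here is specific to education, which is the point: any app that logs events can compute numbers like these.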

3) Inferred Content Data: How well does a piece of content “perform” across a group, or for any one subgroup, of students? What measurable student proficiency gains result when a certain type of student interacts with a certain piece of content? How well does a question actually assess what it intends to?

Efficacy data on instructional materials isn’t easy to generate: it requires algorithmically normed assessment items. However, it’s now possible for even small companies to “norm” small quantities of items. (Years ago, before we developed more sophisticated methods of norming items at scale, Knewton did so using Amazon’s “Mechanical Turk” service.) Then, by splitting instructional content into pools and measuring (via the normed items) the proficiency gains of the students using each pool, it’s possible to tease out differences in content efficacy.
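The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of Knewton’s method: the record format (`pool`, `pre_score`, `post_score`), the function name `mean_gain_by_pool`, and the sample numbers are all hypothetical, and the scores are assumed to come from normed items given before and after students use a content pool.

```python
from collections import defaultdict
from statistics import mean


def mean_gain_by_pool(records):
    """Average proficiency gain (post - pre) for each content pool.

    records: iterable of (pool, pre_score, post_score) tuples, where
    both scores are assumed to be measured on the same normed items.
    """
    gains = defaultdict(list)
    for pool, pre_score, post_score in records:
        gains[pool].append(post_score - pre_score)
    return {pool: mean(values) for pool, values in gains.items()}


# Made-up scores for two hypothetical content pools, "A" and "B".
sample_records = [
    ("A", 0.40, 0.70),
    ("A", 0.50, 0.60),
    ("B", 0.40, 0.50),
    ("B", 0.60, 0.60),
]
print(mean_gain_by_pool(sample_records))
```

A difference in mean gain is only suggestive on its own. A real efficacy study would also need comparable student groups (ideally random assignment), enough students per pool, and a significance test before drawing conclusions.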

4) System-Wide Data: Rosters, grades, disciplinary records, and attendance information are all examples of system-wide data. Assuming you have permission (e.g. you’re a teacher or principal), this information is easy to acquire locally for a class or school. But it isn’t very helpful at small scale because there is so little of it on a per-student basis.

At very large scale it becomes more useful, and inferences that may help inform system-wide recommendations can be teased out. But even a lot of these inferences are tautological (e.g. “if we improve system-wide student attendance rates we boost learning outcomes”); unreliable (because they hopelessly muddle correlation and causation); or unactionable (because they point to known, societal problems that no one knows how to solve). So these data sets — which are extremely wide but also extremely shallow on a per-student basis — should only be used with many grains of salt.

5) Inferred Student Data: Exactly what concepts does a student know, at exactly what percentile of proficiency? Was an incorrect answer due to a lack of proficiency, or forgetfulness, or distraction, or a poorly worded question, or something else altogether? What is the probability that a student will pass next week’s quiz, and what can she do right this moment to increase it?

Inferred student data are the most difficult type of data to generate, and the kind Knewton is focused on producing at scale. Doing so requires low-cost algorithmic assessment norming at scale. Without normed items, you don’t have inferred student data; you have crude guesswork at best. You also need sophisticated database architecture and tagging infrastructure, complex taxonomic systems, and groundbreaking machine learning algorithms. To build all of this, you need teams of teachers, course designers, technologists, and data scientists. Then you need a lot of content and an even larger number of engaged students and instructors interacting with that content. No one would build this system to get inferred student data for just one application; it would be much too expensive. Knewton can only accomplish it by amortizing the cost of creating these capabilities over every app our platform supports. To our knowledge, we’re the only ones out there doing it.
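To make the idea of inferring proficiency from normed items a little more concrete, here is a sketch of the one-parameter logistic (Rasch) model, a textbook model from item response theory. It is a generic illustration, not a description of how Knewton’s system works. The function names `p_correct` and `expected_score` and the sample values are assumptions for this example: `theta` stands for a student’s proficiency and `b` for an item’s difficulty, both on the same scale, with the difficulty assumed to be known from norming.

```python
import math


def p_correct(theta, b):
    """Rasch model: probability that a student with proficiency
    `theta` answers an item of difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def expected_score(theta, difficulties):
    """Expected fraction of items answered correctly on a quiz whose
    items have the given (assumed known) difficulties."""
    return sum(p_correct(theta, b) for b in difficulties) / len(difficulties)


# Hypothetical quiz with three normed items.
quiz = [-1.0, 0.0, 1.5]
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(expected_score(theta, quiz), 3))
```

The dependence on norming shows up directly: without trustworthy values for `b`, an observed right or wrong answer says little about `theta`.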

Educators are sometimes skeptical of adaptive apps because almost all of them go straight from gathering user interaction data to making recommendations, using simple rules engines with no inferred content data or inferred student data. (It is precisely because we envisioned a world in which everyone would try to build these apps that we created Knewton, so that app makers could all build them on top of low-cost yet highly accurate inferred content data and inferred student data.)

Big data is going to impact education in a big way. It is inevitable. It has already begun. If you’re part of an education organization, you need to have a vision for how you will take advantage of big data. Wait too long and you’ll wake up to find that your competitors (and the instructors who use them) have left you behind with new capabilities and insights that seem almost magical.

No one will build functionality to acquire all five of the above data sets. Most institutions will build none. Yet every institution must have an answer for all five. The answer will come from assembling an overall platform that uses the best solution for each major data set.

It is incumbent upon the organizations building these solutions to make them as easy to integrate as possible, so that institutions can get the most value from them. Even more importantly, we must all commit to the principle that the data ultimately belong to the students and the schools. We are merely custodians, and we must do our utmost to safeguard the data while providing maximum openness for those to whom they belong.