Issues of Ethics and Equity

In this video, we're going to discuss some questions about ethics and equity that are useful to consider when doing educational data mining. Educational data can include very sensitive information by its very nature, and some of the decisions we make from the analysis of this data can have long-term impact on students. For this reason, it's really important to consider ethics and equity when we're working with educational data. In this video, I'm not going to give answers; I'm mostly going to raise questions that I think are important to consider if you're working with educational data.

In terms of ethics: the usage
of computers in education keeps increasing, and with it the amount of digital data we have about educational contexts. So it's important to ask ourselves ethical questions when we're going to use this data.

One of the first questions we should ask ourselves is about privacy. Depending on the type of data you're using and how you're using it, privacy might be more or less of an issue, but usually we want to make the data as anonymous as possible. To what extent can we actually anonymize the data we have? So, if we're working with log
files of students' interactions with a digital learning environment, that might not be too much of an issue. Log files of interactions are relatively low-stakes data: they're just traces of how a student is using the learning environment, and they're relatively easy to anonymize because we don't need the name of the student; we don't have any really identifying information.
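To make that idea concrete, here is a minimal sketch of one common way to anonymize log data: replacing the student identifier with a keyed hash (a pseudonym) before analysis. The field names, log records, and salt value are all hypothetical, and this is one possible approach rather than a complete anonymization scheme.

```python
import hashlib
import hmac

# Hypothetical secret salt; it must be stored separately from the dataset,
# because anyone holding it can recompute the pseudonym mapping.
SALT = b"a-secret-kept-out-of-the-dataset"

def pseudonymize(student_id: str) -> str:
    """Map a student ID to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    return hmac.new(SALT, student_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

# Hypothetical log records: keep the interaction traces, drop the identity.
raw_log = [
    {"student_id": "jane.doe", "action": "hint_request", "problem": 12},
    {"student_id": "jane.doe", "action": "attempt", "problem": 12},
    {"student_id": "john.roe", "action": "attempt", "problem": 12},
]
anon_log = [{**row, "student_id": pseudonymize(row["student_id"])} for row in raw_log]
```

The same student always maps to the same pseudonym, so we can still link all of one student's actions together, but the name itself never appears in the analyzed data.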
But sometimes we want to study that data in the context of additional information, such as the socioeconomic status of the student, their gender, their race, or their learning outcomes, and that information can be sensitive. The more information we add, the less anonymous the data becomes. If we have information about gender, race, age, and where the data was collected, then even without the name of the student we might still be able to narrow the data down to a very small subset of possible students. So we need to be careful, especially because that information can be very useful for research purposes. We need to ask ourselves: do we really need that data? If we don't collect it, how will that affect our analysis? We should try to collect only the data that we actually need, and be careful about how we handle it.
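One way to check how identifying a combination of such fields is, is a simple k-anonymity-style audit: count how many records share each combination of quasi-identifier values; if some combination matches only one or two students, the data is effectively identifying even without names. A minimal sketch, with made-up records and field names:

```python
from collections import Counter

def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same combination
    of quasi-identifier values (the k in k-anonymity)."""
    counts = Counter(tuple(r[f] for f in quasi_identifiers) for r in records)
    return min(counts.values())

# Hypothetical dataset: no names, but several demographic fields.
records = [
    {"gender": "F", "age": 14, "school": "A"},
    {"gender": "F", "age": 14, "school": "A"},
    {"gender": "M", "age": 15, "school": "A"},
]

smallest_group(records, ["school"])                   # 3: everyone blends in
smallest_group(records, ["gender", "age", "school"])  # 1: one student stands out
```

If the smallest group falls below the threshold we're comfortable with, we can drop a field or coarsen it (age bands instead of exact ages, a region instead of a specific school) before analyzing or sharing the data.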
We can also think about institutional data, for example detailed grade reports or disciplinary intervention reports. Who should have access to that data? What are the implications of sharing it, especially if it's possible to de-anonymize it? There are also
some types of data that we simply cannot anonymize. For example, if we're studying webcam feeds because we want to look at students' facial expressions as they use a digital learning environment, making that data anonymous would require making sure the faces are not visible; but if we remove the faces, we can no longer look at the facial expressions, and the data becomes far less useful. So we need to be careful about how we store that data and make sure that only the people who should have access to it actually do. A second question we
might be asking ourselves is whether we should inform students about what data we're collecting and how that data is going to be used. This is as true in education as it is in other aspects of our lives: if we think about online social networks like Facebook, or about all the data collected by Google, we can ask ourselves whether we should be informed of what data is being collected and how it will be used. But if we do inform students of what data we're collecting, a follow-up question is: will that affect how students interact with the learning environment? Maybe if students know that we're monitoring whether they're disengaged from the learning environment, they'll find ways to trick the environment into thinking that they're not disengaged. So, when we're
conducting research, there is usually some form of consent that must be given for the collection of data. But what about the usage of data outside of the research context, in the regular classroom? Should students consent to the collection of data in that context? Should we give students the option to opt out of data collection? And if we do give the option to opt out, will that affect the learning experience those students have? If we can't collect data, we can't apply our models of student behavior and can't drive automatic interventions using those models, so a student who opts out might not get as much out of the learning environment as someone who did not. Another thing to consider
is data ownership. Who owns the data that we're collecting? In the case of institutional data, does the institution own the data, and what does that mean? Can they share it with other people, or do they have to keep it to themselves? In the case of a learning environment, does the creator of the learning environment own the data, and if they do, what can they do with it? There are also people advocating for the possibility of having students own their own data: if students own their data, they have control over who has access to it and what can be done with it.

Now, in terms of equity, there are also some very important considerations here. One of them is:
how do we ensure that the models we develop, and the results we obtain from our analyses, are valid for every student and not just a subset of the students? And the second question is: how do we use those models, and the results from those models, in a sensible way?

In terms of ensuring that the models are valid for every student: when we use educational data mining to build models, because data collection can be very difficult and very expensive, we tend to validate our models on specific datasets. Even though we take steps to make those models as general as possible, it's still very difficult to really assess how general a model is going to be when it's applied to a new dataset. For example,
Ocumpaugh and colleagues, in 2014, studied the accuracy of models that they built to detect affect in an intelligent tutoring system: for example, detecting whether a student is confused, bored, or frustrated. What they found is that the models they built were not equally good across datasets collected from three different schools. A model built using data from one school might not be as effective when applied to data from a second school. So how do we make sure that the models we build are actually going to work well when we apply them to a new dataset? And even if we're not applying
the models to a new dataset, we should still ask ourselves: is the model equally good across all the students within one dataset? Are there biases against specific students, and what kinds of biases? For example, with facial recognition, if we train our model using data from only female students, it's likely not going to generalize well when we apply it to male students. For different types of models there will be different biases that we need to account for when we build them, and this is a very important thing to consider when building models.
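A concrete first step for checking this kind of bias is to break the model's accuracy down per subgroup (per school, gender, or any other grouping) instead of reporting a single overall number. A minimal sketch with made-up labels and predictions:

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy of the predictions, computed separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}

# Hypothetical affect labels: overall accuracy is 50%, which hides the fact
# that the model is perfect on school A and useless on school B.
y_true = ["bored", "ok", "ok", "bored", "ok", "bored"]
y_pred = ["bored", "ok", "ok", "ok", "bored", "ok"]
school = ["A", "A", "A", "B", "B", "B"]

accuracy_by_group(y_true, y_pred, school)  # {'A': 1.0, 'B': 0.0}
```

The same breakdown works within a single dataset (by gender, age band, and so on) or across datasets from different schools, as in the study above.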
Now, once we have our model, we need to figure out a sensible way to use it. When we build a model, the model we get is usually not 100% accurate, so we know there's some room for error. How can we use the models, knowing that there are going to be errors, to improve the learning experience of every student? Let's say we have a model that's able to detect students who game the system, a form of disengaged behavior in which the student tries to solve problems without actually having to learn anything, by abusing the support offered by the learning environment. Let's say this model works at about 50% above chance for detecting gaming behavior, which is actually pretty good for a model of gaming behavior. Can we use this model
to automatically provide interventions when gaming behavior is detected? One way we might approach this problem is to wait for multiple gaming behaviors before we start providing interventions. That would reduce the chance of giving an intervention on a false positive, where we detect that a student is gaming when they are not. We can also have the opposite case: false negatives, where we don't detect gaming when the student actually is gaming. We might miss some instances of gaming, and maybe that's not too bad if we're able to detect other instances of gaming for that specific student, because we're still going to give an intervention. Another approach is to change the strength of the intervention based on different factors, like how often we detected the student as gaming, or how confident the model is that the student is gaming.
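These heuristics can be combined into a simple intervention policy, for example: only intervene after several recent high-confidence detections, and scale the strength of the intervention with the detector's confidence. Everything here (the class name, the thresholds, the two intervention levels) is an illustrative assumption rather than a recommendation:

```python
class GamingInterventionPolicy:
    """Trigger an intervention only after `min_hits` detections with confidence
    at least `min_confidence` within the last `window` detector outputs."""

    def __init__(self, window=5, min_hits=3, min_confidence=0.7):
        self.window = window
        self.min_hits = min_hits
        self.min_confidence = min_confidence
        self.recent = []  # confidences of the most recent detector outputs

    def observe(self, gaming_confidence):
        """Record one detector output; return an intervention level or None."""
        self.recent = (self.recent + [gaming_confidence])[-self.window:]
        hits = [c for c in self.recent if c >= self.min_confidence]
        if len(hits) < self.min_hits:
            return None  # not enough evidence yet: tolerate a likely false positive
        # Scale the strength with the mean confidence of the recent detections.
        return "strong" if sum(hits) / len(hits) > 0.9 else "gentle"

policy = GamingInterventionPolicy()
for confidence in [0.95, 0.2, 0.8, 0.1, 0.85]:
    level = policy.observe(confidence)
# Only after the third high-confidence detection does `level` become "gentle".
```

Waiting for multiple hits trades some false negatives for fewer false positives, which is often the safer direction for a low-stakes automated intervention.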
But then, all of those considerations assume that the model detects gaming behavior equally well for every student. What would happen if our model actually has a bias in its detection of gaming behavior? Some students might always be identified as gaming even though they aren't, and for those students the learning environment will intervene more often, which takes time away from engaging in a meaningful way with the learning environment, and might also lead those students to develop negative affect towards the learning environment and disengage. We can also have the opposite situation, where some students are never detected as gaming the system even though they are. Those students will never receive the additional interventions that they would benefit from; the teachers won't be notified that the student is gaming, so they won't be able to intervene; and those students might fall behind their classmates and have difficulty catching up later on. So this is in the context
of gaming behavior, which is actually a relatively low-stakes construct, meaning it might not be that much of an issue if the model is somewhat biased. But what happens if we apply this to a higher-stakes construct? For example, consider a model that predicts college enrollment from students' interactions with an intelligent tutoring system. This model can provide us with some really useful information that allows us to study what leads students to enroll in college or not, but how actionable is it? What would the
interventions that we would provide based on this model look like, and how would we determine which students receive those interventions? Is the model used in addition to the current interventions, or does it replace them? And if we replace the current interventions with the model, what is the impact of intervening for a student who doesn't need it, and what is the impact of not intervening for a student who needs it? We might be penalizing students based on a model that's not accurate for them. If the model has strong biases against a subpopulation of students, then we might have a very strong impact in the future on the college enrollment of that subpopulation of students. And so, just
to quickly conclude: we've looked at some questions of ethics and equity with educational data. Educational data can be very sensitive depending on its content, so it's very important to ask ourselves these questions. Unfortunately, there's not always an obvious right or wrong answer; it's contextual, depending on what you're doing, what kind of data you're working with, and how you're going to use the results of your analyses. But hopefully this video provides you with some information so that in the future, if you're working with educational data, you can think back, ask yourself these questions, and make the right decision for your work.