Thursday, November 07, 2013

When Big Data goes bad: 6 epic fails

Data in the wrong hands, whether malicious, manipulative or naïve, can be downright dangerous. Indeed, when big data goes bad it can be lethal. Unfortunately, the learning game is no stranger to the abuse of data. Here are six examples of ‘bad data’ in action.

1. Data subtraction: Ken
Robinson

Don’t let the selective graphical representation of data destroy the integrity of the data. A good example of blatant data editing is the memorable ‘Ritalin’ image used by Sir Ken Robinson in his TED talk at 3:47. This image is taken from his RSA animation.

His version has no legend, and he’s recalibrated states to look as if there are zero prescriptions. To understand this data you have to go back to its source, where the white areas represent states that did NOT participate in the study or did not have reported prescription data. It’s a distortion, an exaggeration to help make a point that the data doesn’t really support.

In fact, much of what passes for fact in Sir Ken Robinson’s TED talks is not supported by any research or data whatsoever.

2. Data addition: Bogus learning
theory

Ever seen this graph,
or one like it? It used to be a staple in education, training and teacher
training courses. Only one problem - it’s bogus.

A quick glance is enough to raise suspicion. Any study that produces a series of results landing exactly on multiples of ten would look dubious to anyone with the most basic knowledge of statistics.
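To make that intuition concrete, here is a minimal back-of-the-envelope sketch. It assumes, purely for illustration, that a genuinely measured retention percentage would be equally likely to land on any whole number from 0 to 100 - a simplification, not a claim about any real study:

```python
from fractions import Fraction

# Under the illustrative assumption that a measured percentage is equally
# likely to be any whole number from 0 to 100, the chance of a single result
# landing exactly on a multiple of ten (0, 10, ..., 100) is 11/101.
p_single = Fraction(11, 101)

# The "learning pyramid" quotes six such figures (10, 20, 30, 50, 70, 90).
# Assuming the six results are independent, the chance that ALL of them
# come out as round multiples of ten is vanishingly small:
p_all_six = p_single ** 6
print(float(p_all_six))  # roughly 1.7e-06
```

Even this crude calculation shows why a column of perfectly round percentages should set off alarm bells.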

But it’s worse than nonsense. The lead author of the cited study, Dr. Chi of the University of Pittsburgh, a leading expert on ‘expertise’, when contacted by Will Thalheimer, who uncovered the deception, said, "I don't recognize this graph at all. So the citation is definitely wrong, since it's not my graph." What’s worse is that this image and variations of the data have been circulating in thousands of PowerPoints, articles and books since the 60s.

Further investigations of these graphs by Kinnamon (personal communication, October 25, 2002) found dozens of references to these numbers in reports and promotional material. Michael Molenda (personal communications, February and March 2003) did a similar job. Their investigations found that the percentages have even been modified to suit the presenter’s needs.

The one here is from Bersin (recently bought by Deloitte). Categories have even been added to make a point (e.g. that teaching is the most effective method of learning).

The root of the problem is Edgar Dale’s depiction of types of learning from the abstract to the concrete. He put no numbers on his ‘cone of experience’ and regarded it as a visual metaphor implying no hierarchy at all.

Serious-looking histograms can look scientific, especially when supported by bogus academic research. They create the illusion of good data. This is one of the most famous examples not of ‘Big’ but of ‘Bad’ data in the history of learning.

3. Claims beyond the data –
University League Tables

University league tables are used by politicians, universities, parents and students. But they contain a dark, dirty data secret. They claim to rank universities but, astonishingly, tell you absolutely nothing about ‘teaching’. They often claim to have ‘measures’ of teaching, but they actually draw their data from proxies, such as employment and research activity, and rely on nothing but indirect measures.

The Times rankings are a case in point. They claim that their ranking scores include teaching. In fact, only 30% of the score is based on teaching, and even that uses NO direct metrics. The proxies include student/staff ratios (which are skewed by how much research is done) and, even more absurdly, the ratio of PhDs to BAs. It is therefore a self-fulfilling table, where the elite universities are bound to rise to the top. There is no direct measurement of face-to-face time, lecture attendance or student satisfaction.
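A toy sketch shows how such a ‘teaching’ score can be assembled entirely from proxies. The weights and figures below are invented for illustration; they are not The Times’ actual methodology:

```python
# Hypothetical proxy weights - NOT the real league-table formula.
weights = {
    "staff_student_ratio": 0.5,  # skewed by how many research staff are employed
    "phd_ba_ratio": 0.5,         # rewards doctorate-heavy, research-led institutions
}

def teaching_score(metrics):
    """Combine proxy metrics (each pre-normalised to a 0-100 scale)
    into a single 'teaching' score - with no direct teaching data at all."""
    return sum(weights[name] * metrics[name] for name in weights)

# A research-heavy elite university scores highly on 'teaching' without any
# measurement of contact time, lecture attendance or student satisfaction.
print(teaching_score({"staff_student_ratio": 90, "phd_ba_ratio": 95}))  # 92.5
```

Whatever the real weights are, the structural point stands: a score built only from research-linked proxies will always favour research-led institutions.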

4. Skewed data - PISA

Like the real Leaning Tower of Pisa, the OECD PISA results are built on flimsy foundations and are seriously skewed. Nevertheless, they have become a major international attraction for educators, and regularly spark annual international ‘arms races’ in education.

Both left and right now use the ‘sputnik’ myth, translated into the ‘Chinese competitiveness’ myth, to chase their own agendas – more state funding or more privatisation. This is a shame, as the last thing we need is yet another dysfunctional, deficit debate in education. Nations have different approaches to education, different demographic and social mixes and different agendas.

The problems in the data are extreme, as PISA compares apples and oranges. PISA is seriously flawed because of the huge differences in demographics, socio-economic ranges and linguistic diversity within the tested nations. There are many skews in the data, including the selection of one flagship city (Shanghai) to compare against entire nations. Immigration skews include the number of immigrants, the effect of selective immigration, migration towards English-speaking nations and first-generation language issues. There’s also the issue of irregular languages taking longer to read, and selectivity in the curriculum.

Svend Kreiner, a Danish statistician, says PISA is not reliable at all. In the reading tests, 28 questions were supposed to be equally difficult in every country. PISA has failed here, as differential item functioning (DIF) - items with different degrees of difficulty in different countries - is common. In fact, he couldn't find a single item that worked without bias: items are simply more difficult in some countries than in others. He used his analysis to show that the UK could move up to 8th or down to 36th. PISA assumes that DIF has been eliminated, but not one single item behaves the same across the 56 countries.
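Differential item functioning can be illustrated with a toy screen. The data below is invented, and the simple spread-based check is a crude stand-in for the Rasch-model analysis Kreiner actually used:

```python
# Invented success rates on the same reading items, by country.
item_scores = {
    "Item 1": {"UK": 0.72, "Denmark": 0.71, "Japan": 0.70},  # behaves consistently
    "Item 2": {"UK": 0.55, "Denmark": 0.81, "Japan": 0.40},  # suspiciously uneven
}

def flag_dif(items, threshold=0.15):
    """Flag items whose success rates vary across countries by more than
    the threshold - a crude screen for differential item functioning."""
    flagged = []
    for name, by_country in items.items():
        spread = max(by_country.values()) - min(by_country.values())
        if spread > threshold:
            flagged.append(name)
    return flagged

print(flag_dif(item_scores))  # ['Item 2']
```

An item flagged this way is harder in some countries than in others, so treating it as equally difficult everywhere - as the ranking model assumes - biases the league table.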

Politicians and activists distort PISA to meet their own ends. They cherry-pick results and recommendations, ignoring the detail. Finland is famously quoted by the right as a high-performing PISA country. Yet it is a small, homogeneous country with no streaming, high levels of vocational education, no substantial class divisions and no private schools - facts curiously ignored by PISA supporters.

5. Faked data

Eysenck worked at the University of London with Cyril Burt, the man responsible for the introduction of the standardised 11+ exam in the UK, enshrined in the 1944 Butler Education Act, an examination that, incredibly, still exists in parts of the UK. Burt was subsequently discredited for publishing largely in a journal that he himself edited, and for falsifying not only the data upon which he based his work but also the co-workers on the research. To be precise, Burt's correlation coefficients on IQs in his twin studies were the same to three decimal places across articles, despite the fact that new data had been added twice to the sample of twins. Leslie Hearnshaw, Burt’s friend and official biographer, claimed that most of Burt's data from after World War II were fraudulent or unreliable.
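How implausible is agreement to three decimal places? A quick simulation (entirely synthetic data, nothing to do with Burt's actual figures) shows how rarely a correlation survives unchanged when new cases are added to a sample:

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(42)
matches = 0
trials = 2000
for _ in range(trials):
    # Synthetic correlated IQ-like pairs: y is x plus noise.
    xs = [random.gauss(100, 15) for _ in range(30)]
    ys = [x + random.gauss(0, 10) for x in xs]
    r_before = pearson(xs, ys)
    # Add 20 new pairs, as a sample might grow between articles.
    new_xs = [random.gauss(100, 15) for _ in range(20)]
    new_ys = [x + random.gauss(0, 10) for x in new_xs]
    r_after = pearson(xs + new_xs, ys + new_ys)
    if round(r_before, 3) == round(r_after, 3):
        matches += 1

# Agreement to three decimal places is rare even once, let alone repeatedly.
print(matches / trials)
```

If an identical coefficient to three decimal places is a rare accident once, getting it repeatedly across growing samples is the statistical fingerprint of fabrication.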

The 11+ is just one of many standardised tests that have become common in education, but many believe that tests of this type serve little useful purpose and are unnecessary, even socially divisive. Many argue that standardised tests have led to a culture of constant summative testing, which has become a destructive force in education, demotivating students and acting as an end-point and filter rather than a useful mark of success. Narrow academic assessment has become almost an obsession in some countries, fuelled by international pressure from PISA.

6. Dirty data deeds

One example of data gathering in education stands out as truly evil. In 1939, the CEO of IBM, Thomas Watson, flew across the Atlantic to meet Hitler. The meeting resulted in the Nazis leasing the mechanical equivalent of a Learning Management System (LMS). Data was stored as holes in punch cards to record details of people, including their skills, race and sexual inclination, and was used daily throughout the 12-year Reich. It was a vital piece of apparatus in the Final Solution, used to execute the very categories stored on the apparently innocent cards - Jews, Gypsies, the disabled and homosexuals - as documented in the book IBM and the Holocaust by Edwin Black. The cards were also used to organise slave labour and trains to the concentration camps.

This is not the first time a state has recorded educational details to keep tabs on potential dissent. It was common in Stasi-infused East Germany. I shared a room at university with someone who became a Stasi spy in the UK, and have taken some interest in their methods. Theirs was perhaps the most meticulous storing of data ever undertaken by a state, right down to smells in jars, collected from clothing and towels placed under interviewees during interrogation. The idea was that dissenters could then be found by dogs, when necessary.

Conclusion

I am a fan of Big Data in education, even though it’s really closer to ‘large or little’ sets of data. However, we must be wary of data when it is used to make exaggerated claims through addition or subtraction, or to spearhead prescriptive programmes and extreme testing. I am appalled at the way politicians and educators take up PISA, PIAAC and OECD data with little or no detailed examination of their assumptions or relative values, and use it to shape prescriptive policies that do more harm than good. Big Data in the hands of little brains is downright dangerous.

4 Comments:

Good article, Donald. The misuse of data is of course endemic. It is most obvious in politics, advertising and the Daily Mail, but more worryingly pervades pharmaceutical and food research - ref Ben Goldacre's excellent books. I would also recommend "Fooled by Randomness" by Nassim Taleb of Black Swan fame. Even e-learning is not immune, as anyone who has perused a recent flotation document will attest.