There's rarely time to write about every cool science-y story that comes our way. So this year, we're running a special Twelve Days of Christmas series of posts, highlighting one story that fell through the cracks each day, from December 25 through January 5. Today: combining machine learning with the humanities can yield some surprising insights.

Truly revolutionary political transformations are naturally of great interest to historians, and the French Revolution at the end of the 18th century is widely regarded as one of the most influential, serving as a model for building other European democracies. A paper published last summer in the Proceedings of the National Academy of Sciences offers new insight into how the members of the first National Constituent Assembly hammered out the details of this new type of governance.

Specifically, rhetorical innovations by key influential figures (like Robespierre) played a critical role in persuading others to accept what were, at the time, audacious principles of governance, according to co-author Simon DeDeo, a former physicist who now applies mathematical techniques to the study of historical and current cultural phenomena. And the cutting-edge machine learning methods he developed to reach that conclusion are now being employed by other scholars of history and literature.

It's part of the rise of so-called "digital humanities." As more and more archives are digitized, scholars are applying various analytical tools to those rich datasets, such as Google N-gram, Bookworm, and WordNet. Tagged and searchable archives mean connecting the dots between different records is much easier. Close reading of selected sources—the traditional method of historians—gives a deep but narrow view. Quantitative computational analysis has the potential to combine that kind of close reading with a broader, more generalized bird's-eye view that can reveal patterns or trends that would otherwise escape notice.

“One thing this so-called ‘distant reading’ can do is help us identify new questions.”

"It's like any other tool and can be used for good or bad; it depends on how you use it," said co-author Rebecca Spang, a historian at Indiana University Bloomington. "Crucially, one thing this so-called 'distant reading' can do is help us identify new questions and things we could not have recognized as questions reading in the slow, close way that human individuals read." Small wonder that an increasing number of historians is applying these kinds of digital tools to the growing number of digitized archives. Stanford University historian Caroline Winterer, for instance, has used the digitized letters of Benjamin Franklin to map his "social network," revealing a picture of his rise to global prominence that was previously hidden.

The French Revolution study builds on one of DeDeo's earlier collaborations in 2014 with historian Tim Hitchcock of the University of Sussex, analyzing the digitized archives of London's Old Bailey courthouse over a period of about 200 years. The goal was to pinpoint how the language used to talk about different crimes at trial changed over time. They split all the trials into two categories—violent crimes like murder or assault, and non-violent crimes like pickpocketing or fraud—and looked at the words used in the transcripts for each trial.

A word picked at random from the Old Bailey archive receives a score based on how useful it is in predicting whether it comes from an account of a violent or a non-violent trial. In this way, DeDeo and Hitchcock's analysis showed the gradual criminalization of violence over those two centuries. This was not necessarily evidence that human nature has become less violent; rather, society changed its definition of what counted as a violent criminal offense.
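
To get a feel for how that kind of word scoring can work, here is a minimal sketch; it is not DeDeo and Hitchcock's actual pipeline, and the four tiny "transcripts" are invented stand-ins. A word's learned weight serves as its score for predicting the violent category.

```python
# A minimal sketch of scoring words by how well they separate two trial
# categories. This is NOT the authors' actual pipeline; the tiny corpus
# and labels below are invented stand-ins.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in transcripts: 1 = violent offense, 0 = non-violent offense.
docs = [
    "the prisoner struck the victim with a club and beat him",
    "he drew a knife and stabbed the deceased in the breast",
    "the prisoner picked the pocket of the prosecutor and took a watch",
    "he forged a bill of exchange and obtained goods by false pretences",
]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# A word's coefficient measures how strongly it predicts the "violent"
# label; sorting surfaces the most diagnostic vocabulary.
ranked = sorted(zip(clf.coef_[0], vec.get_feature_names_out()), reverse=True)
for weight, word in ranked[:5]:
    print(f"{word:>12s}  {weight:+.2f}")
```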

Depiction of a trial in London's Old Bailey Courthouse (1809). Credit: Public Domain

For another study, DeDeo trawled the digital archives of US congressional debates from the 1960s to the present to identify buzzwords that might peg the political leanings of the various speakers. He was able to track the development of political parties (and the origins of their current polarization) via subtle shifts in rhetoric. In the 1960s data, it's not possible to determine a speaker's political affiliation from vocabulary alone. That has changed dramatically; each party now has very distinct vocabulary terms that serve as political indicators.

For his analysis of the French National Constituent Assembly dataset with Spang, DeDeo worked with Jenny Huang, an undergraduate in mathematics, and Alexander Barron, a graduate student in computer science and the paper's lead author. Together, they developed a similar machine-learning technique to comb through transcripts of some 40,000 speeches made during that body's deliberations, as legislators hashed out what the new laws and institutions would be for post-Revolutionary France. The researchers determined how "novel" the speech patterns were, in terms of using new turns of phrase to communicate new ideas, as well as noting whether the speech was given in a public forum or behind closed doors in committee.

DeDeo et al. discovered that assembly members who used innovative language to propose their ideas (say, liberty, equality, fraternity) were much more successful at swaying the other members to adopt them. Their ideas "persisted," as it were, which wasn't true of every new idea that was proposed. That new revolutionary vocabulary developed over time rather than springing into being fully formed in the summer of 1789.
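
A rough sketch of how such novelty scores can be computed: treat each speech as a probability distribution over topics and measure its Kullback-Leibler divergence from its neighbors in time. The paper's figures also plot "transience" (divergence from the future) and "resonance" (the gap between the two). The distributions below are random stand-ins; the real study derived them from topic models of the transcripts.

```python
# Hedged sketch of novelty/transience scoring: a speech is "novel" if it
# diverges from the recent past, "transient" if the near future diverges
# from it. Topic distributions here are random stand-ins.
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def novelty(speech, past, window=3):
    """Mean KL divergence of this speech from the preceding `window` speeches."""
    return float(np.mean([entropy(speech, p) for p in past[-window:]]))

def transience(speech, future, window=3):
    """Mean KL divergence of this speech from the following `window` speeches."""
    return float(np.mean([entropy(speech, f) for f in future[:window]]))

rng = np.random.default_rng(0)
speeches = rng.dirichlet(np.ones(10), size=20)  # 20 toy topic distributions

i = 10  # score the 11th speech in the sequence
n = novelty(speeches[i], speeches[:i])
t = transience(speeches[i], speeches[i + 1:])
print(f"novelty={n:.3f}  transience={t:.3f}  resonance={n - t:.3f}")
```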

The idea is fairly simple at its core, and that's what makes DeDeo's latest analytic tool so broadly applicable to other areas of the humanities. "It's a very useful model for thinking about how culture works, because we're very interested in the influencers—who the movers and shakers are," said Andrew Piper, a professor at McGill University. He is also founder of the Journal of Cultural Analytics and heads up an interdisciplinary initiative, NovelTM: Text Mining the Novel, with the goal of producing "the first large-scale cross-cultural study of the novel according to quantitative methods."

“People have made very grandiose claims about the novel with a very, very small dataset.”

"People have made very grandiose claims about the novel with a very, very small dataset," Piper said. "It has real repercussions for the credibility of our field. By taking into account large sets of documents, you can have more confidence when you're [making such claims] that this is something that is accurate, reliable, and reproducible." He is adapting DeDeo's approach to conduct more of a meta-analysis of literary studies. Piper has compiled some 60,000 articles in the field dating back to 1950 with an aim toward identifying large-scale ideological shifts.

For instance, many new ideas and related jargon entered the field in the 1970s, when gender studies became a hot academic ticket—one of the most significant shifts in the last 50 years, according to Piper. A similar shift occurred in the late 1980s with race and post-colonialism. "People have talked anecdotally about these big shifts in the field, but [quantitative analysis] gives you very precise ways of measuring how severe they are," said Piper. Over the last decade, however, the field has experienced a period of stagnation, as past upheavals have become fully incorporated and normalized.

Ted Underwood, a literature professor at the University of Illinois, is using DeDeo's tools to analyze the text of 40,000 novels spanning two centuries. Underwood originally specialized in British Romantic literature, focusing on individual authors and books. But he now focuses on longer time scales, "because that's the scale where I think we know the least," he said.

DeDeo's method is particularly suited to that kind of analysis. Underwood met DeDeo at one of Piper's McGill workshops, where DeDeo spoke on using text mining to study the novel. "I'm on record as saying the talk made me want to run immediately out of the room and try and apply it to lit history to see what we can learn," said Underwood.

Graphs showing novelty, transience, and resonance in the French Revolution. Credit: S. DeDeo et al.

Underwood's approach involves topic modeling to identify key organizing topics in his digitized dataset; a "topic" here describes the ways in which people were writing, such as greater use of profanity or vulgar language. By looking at the distribution of topics represented in each novel and comparing it to novels published 20 or 40 years later, it's possible to identify influential works that were just a bit ahead of their time, like Uncle Tom's Cabin by Harriet Beecher Stowe.

"We're doing a longer timeline, but it's basically the same idea [as DeDeo's]," he said. "Can we think about literary change by looking at books, how much they're like the past, how much they're like the future, and looking at the ratio between those to learn something new."

There is naturally a certain amount of pushback against the notion that the quantitative methods of science could yield insight into the humanities, where the emphasis has long been on individual close reading of texts by people with narrow expertise in their chosen field. There's a sense that machine learning is meant to replace that kind of in-depth scholarship. But the best such studies (DeDeo's included) always involve a so-called "domain expert" to ensure there is no misinterpretation of the data. Quantitative analysis can identify a pattern; it takes a domain expert to fully understand what that means contextually.

"If you work with data, you know this," said Piper. "I think people who haven't really accepted the interpretive power you get when you work at a larger scale, only work at that traditional, close, analytical level."

"It seems like it might be a peanut butter and pizza kind of combination," said Ben Orlin, math teacher and author of Math with Bad Drawings, who is not involved in any of the aforementioned studies. While history or literature provide a rich dataset, "maybe it's not such a good idea to shred it up and treat it as this very disconnected set of words and frequencies."

But he agrees that involving domain experts can preserve the respective strengths of each discipline in a mutually beneficial way. "Digital humanities gives us this wonderful set of techniques that people can use to ask questions of literature," said Orlin. "But you definitely need people who are experts to frame which questions are gonna be interesting."

"...a former physicist who now applies mathematical techniques to the study of historical and current cultural phenomena."

So, we're basically a short few years from psychohistory?

FORWARD THE FOUNDATION!

Since psychohistory concerned itself with identifying predictive laws of history, definitely not. Data analysis is a useful tool, but just one of many. Like the article says, it helps suggest new questions and problems that were maybe harder to spot before. Scholars will still need to contextualize the data by doing the hard work of actually reading the texts and understanding what they mean and how they speak to each other.

I'm skeptical of, yet curious about, this type of methodology for the humanities. Not that it should be prohibited, but in my personal view this sort of over-quantification of certain disciplines can be damaging. A healthy mix of both is always encouraged, and this machine learning tool sounds promising, to be sure!

"Can we think about literary change by looking at books, how much they're like the past, how much they're like the future, and looking at the ratio between those to learn something new."

Uhh, I have very mixed feelings about looking for and then inferring things from correlations in data, even though big data is very sexy. I imagine there is a place for domain experts in the humanities to provide hypotheses, which can then be tested against the datasets.

Man, I don't see why people are so down on this. Yes, humanities folks are going to feel threatened at first, but it's just a tool. You can find patterns in thousands of works that nobody would have noticed (or that would have taken the entire life's work of one person). It's going to tell you useless things too, which is why you need the domain expert to see what makes sense.

'People who used novel language had better success selling their ideas' is one of those things that intuitively makes sense - but those are often wrong, so it's great to have some data to back it up.

And of course, no deep neural network is going to be able to tell you that 'Horton Hears a Who' is a deep psychosexual allegory built on a latent incestuous colonial narrative - that'll be the sole domain of the Doctor of Humanities for a while.

In the 1960s data, it's not possible to determine a speaker's political affiliation from vocabulary alone. That has changed dramatically; each party now has very distinct vocabulary terms that serve as political indicators.

Fake news! There's no way to distinguish real Americans from libtard MSM socialist leftie elites using just vocabulary.

I think traditional humanities scholars should welcome this. The reason these techniques are useful is that they allow you to ask and answer new questions that would have been impractical previously, and to focus your other efforts on new and interesting topics. It does not threaten the old way of doing things - it’s not as though tenured English professors have to compete for research grants. I suppose it might threaten folks seeking tenure if their scholarship is considered less valuable, but in junior researchers’ quest for novelty, this just opened up an entire new set of cross-disciplinary opportunities, so it seems to be good for them too.

In my own NLP projects, I’ve learned that these techniques are both extremely impressive and extremely limited. They require you to frame questions very carefully. For example, if you’re doing a binary classification (is this a positive review or a negative one), then they are amazing. They’re considerably less effective at answering the types of specific questions you might ask a human reader to consider, or at drawing fine distinctions (which of 20 overlapping categories is the best fit). This is largely because these algorithms discard huge quantities of data (e.g., punctuation) to draw statistical inferences. Accordingly, there will always be work left over for humans as long as the data discarded by the algorithm had any value to begin with. But these tools can substantially advance the starting point.
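
To illustrate the point about discarded data, here is a toy example with invented sentences: a standard bag-of-words representation strips punctuation, casing, and word order, so a sarcastic review and a sincere one can collapse into identical feature vectors.

```python
# Toy demonstration of information discarded by bag-of-words features:
# after tokenization, these two very different sentences are identical.
from sklearn.feature_extraction.text import CountVectorizer

sarcastic = "Great -- just great. Another 'masterpiece'?!"
sincere = "Great, just great: another masterpiece."

vec = CountVectorizer()  # drops punctuation, casing, and word order
X = vec.fit_transform([sarcastic, sincere])

print(vec.get_feature_names_out())   # the punctuation is gone entirely
print((X[0] - X[1]).nnz == 0)        # True: the two vectors are identical
```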

Whether or not deep academic studies into obscure topics, driven by the “publish or perish” novelty mandate, have any value is a separate question.

I grew up hearing about the kind of people who counted every occurrence of the word ‘the’ in the Bible and things like that. (This was in the pre-digital age, and was usually the province of hermits who had nothing better to do in winter.)

Hopefully these kinds of people can apply their talents and desires more fruitfully in this area of data-driven literary analysis.

The paper analyzes 40,000 speeches during the revolution to determine how many novel speech patterns emerge and persist. I think you must analyze data from before and after the revolutionary period to determine whether this is important or unusual. Or at least look at a different nation during the same period. The conclusion reached seems to be that a new secondary vector apart from speeches by leaders was created, but I am not sure how they discount this vector in other times and types of governments.

One of the things about this kind of analysis is that it implicitly assumes we have a complete -- or at least representative -- record of the categories we're looking at. Otherwise we're analyzing history as written by the victors/critics/digitizers.

Right now the field is still rather new, and so you see work of widely varying quality (the ones cited in the article do show off nicely what can be done in applying big data, but that's not in and of itself the whole field).

Although some commenters say that folks in the humanities feel threatened, that's not my experience at all. The article shows how certain techniques can be applied to let us see what's going on at scales not possible for human readers. Used well, that can be great. But it doesn't render unnecessary the work done on a close scale: think of it like anthropologists and archeologists who can now do land surveys through dense jungle cover. That's important work. So is going to visit the site, excavating, analyzing artifacts, etc. The one helps the other.

There are other ways technology can be applied to the humanities. In my current project, for instance, I rely heavily on diff algorithms that I've had to tailor to the text I'm working with. But once I find differences, some may be analyzed programmatically (finding shifts in pronunciation over time) while others will be done the old-fashioned way (trying to explain why a few words suddenly changed, or were deleted).
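
For a sense of what the programmatic side can look like, here is a minimal word-level diff using Python's built-in difflib; the spelling variants are invented, and a real project would tailor tokenization and matching to its own texts, as noted above.

```python
# Minimal word-level diff of two versions of a text, in the spirit of the
# tailored diff algorithms described above. The archaic spellings are
# invented examples.
from difflib import SequenceMatcher

old = "the kynge rode forth vnto the castel".split()
new = "the king rode forth unto the castle".split()

sm = SequenceMatcher(a=old, b=new)
for op, i1, i2, j1, j2 in sm.get_opcodes():
    if op != "equal":
        print(f"{op}: {old[i1:i2]} -> {new[j1:j2]}")
```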

Where DH can get a bad rep is where I see someone claim to be a digital humanist just because they did a digital version of a dead-tree book and claim novelty, or seriously misuse tools (I once saw a PowerPoint slideshow that was used as if it were a website, clicking on images to move from slide to slide), or use statistics obtained from analyses without understanding what they mean or providing adequate context (my only consolation, perhaps, is that a bad understanding of statistics seems to be prevalent in quite a few disciplines).

One of the things about this kind of analysis is that it implicitly assumes we have a complete -- or at least representative -- record of the categories we're looking at. Otherwise we're analyzing history as written by the victors/critics/digitizers.

So one of the problems that we have is that to apply some of the techniques mentioned, you need a lot of data. Books from the 1600s onwards generally have pretty decent OCR, and by the time we reach the 1800s spelling has mostly standardized. Older books with newer editions can be easier to work with, but that still leaves out a lot of manuscripts, incunables, and other early books that can't be read easily by computers at the moment (and in the case of manuscripts, might not ever be: by the time you transcribe enough of them to train an algorithm for even semi-okay performance, you have probably transcribed them all. I jest, but only slightly). Sadly, it's hard to convince people to take the time and transcribe these older works and documents — though there is a standard (TEI) for doing so.

These tools are definitely interesting. When you consider how much NLP features in the giant techs (Google, Amazon, etc.), you can see how much value there is in quantifying text. The real challenge is in doing it in some meaningful way.

I started it as a curiosity, training an ML model on children's books and having it create texts. Funny. I actually got the idea from the articles about algorithms writing movie scripts and from my desire to get my kids exposed to coding.

Then I saw the potential at work. We prepare a lot of text but have no systematic way to understand how the entire corpus fits together, or what topics it has covered historically, patterns, etc. Bosses always wanted to know this, but we were limited to reading and summarizing books to "classify" them.

So I started to play around with these tools to analyze the corpus of text at work, and it has been fun. Topic models using mallet, then gephi to make them pretty. The learning curve has been a bit steep for me, but I finally feel like I get it. Now I'm finishing a semi-supervised classification of our texts that answers a meaningful question: how well have our texts covered the issues that define our mission?
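
For readers curious what such a semi-supervised classification might look like, here is a hedged sketch using scikit-learn's self-training wrapper; the mini-corpus, labels, and "mission" framing are invented stand-ins rather than the commenter's actual mallet-based setup.

```python
# Hedged sketch of semi-supervised text classification: label a few
# documents by hand, mark the rest as unlabeled (-1), and let
# self-training propagate labels. All data here is invented.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

docs = [
    "inflation outlook and monetary policy stance",      # labeled 1 (on-mission)
    "quarterly staffing updates and office logistics",   # labeled 0 (off-mission)
    "exchange rate pass-through to consumer prices",     # unlabeled
    "parking assignments for the new building",          # unlabeled
]
y = np.array([1, 0, -1, -1])  # -1 marks unlabeled documents

X = TfidfVectorizer().fit_transform(docs)
clf = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
print(clf.predict(X[2:]))  # inferred labels for the unlabeled texts
```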

As soon as I can get mallet's Java tools working without error (are you listening, developers?), I can run the classification properly without my kludges.

For now, it is exciting for a bunch of economists to be able to analyze not just our inputs, but our text output.

Let's say we had a computer system that spit out a binary answer to the question "Do you get this loan?" and we poured our personal data into it. The computer system comes up with a number between 0 and 10. Then a human steps in and says, "Well, that's an 8. No loan for you." "Wait, what?" you say. "An 8 always means no loan?" "Oh, no," they reply, "an 8 means no loan for you."

We wouldn't accept such a system. In fact, we might call it beyond useless: it is a recipe for arbitrary abuse of power.

But a human controlling such a classification system is fine here? Why? Sure, the stakes are lower, but the mechanism is identical.

My experience is that the system you describe is exactly the system I've been subject to all my life, and one that has indeed arbitrarily abused its power on many occasions.

"It seems like it might be a peanut butter and pizza kind of combination," said Ben Orlin, math teacher and author of Math with Bad Drawings, who is not involved in any of the aforementioned studies. While history or literature provide a rich dataset, "maybe it's not such a good idea to shred it up and treat it as this very disconnected set of words and frequencies."

I don't know about you, but now I kind of want to try peanut butter and pizza

Let's say we had a computer system that spit out a binary answer to the question "Do you get this loan?" and we poured our personal data into it. The computer system comes up with a number between 0 and 10. Then a human steps in and says, "Well, that's an 8. No loan for you." "Wait, what?" you say. "An 8 always means no loan?" "Oh, no," they reply, "an 8 means no loan for you."

We wouldn't accept such a system. In fact, we might call it beyond useless: it is a recipe for arbitrary abuse of power.

But a human controlling such a classification system is fine here? Why? Sure, the stakes are lower, but the mechanism is identical.

Because it's research.

At the level of an individual scientist, research is about rigorously testing hypotheses. At the societal level, it's a whole bunch of smart people trying a whole bunch of stuff in hopes that one of them turns out to be right. Sometimes somebody has a lot of successes and becomes Einstein. Everyone else? Not Einstein.

What is likely to happen here is that many people will try many ideas for using big data to analyze historical texts; most of those ideas will not work out, but something might.

And the whole point of research is to find that one thing that does work out.

So a huge amount of money and time has been spent to find out that the reported speeches of Danton and Robespierre aped the language of Diderot, Voltaire, and Montesquieu. In other news, water is wet. In addition, in the era before sound recording or even newspapers, it was commonplace to polish what was actually said before it was published.

Because it's research.

At the level of an individual scientist, research is about rigorously testing hypotheses. At the societal level, it's a whole bunch of smart people trying a whole bunch of stuff in hopes that one of them turns out to be right. Sometimes somebody has a lot of successes and becomes Einstein. Everyone else? Not Einstein.

What is likely to happen here is that many people will try many ideas for using big data to analyze historical texts; most of those ideas will not work out, but something might.

And the whole point of research is to find that one thing that does work out.

It's not rigorous. How much weight you give to a source's interpretation is an arbitrary decision. It is extremely rare that any report is entirely objective, so big data is no more rigorous than a human doing the same job.

"How much weight you give to a sources interpretation is an arbitrary decision."

Well, maybe for you. But I carefully consider many, many variables when I weigh a source's interpretation. I've seen you pop up in numerous discussions, and I consider your interpretation in the context of your past posts, which means you, as a source of interpretation, carry virtually no weight.

This may come as a shock to you, but you are not the arbiter of the universe, and your view is just yours and isn't any more important than mine. However, no amount of insults will change the fact that a source might not be 100% accurate and any choice is personal and arbitrary.

Man, I don't see why people are so down on this. Yes, humanities folks are going to feel threatened at first, but it's just a tool. You can find patterns in thousands of works that nobody would have noticed (or that would have taken the entire life's work of one person). It's going to tell you useless things too, which is why you need the domain expert to see what makes sense.

'People who used novel language had better success selling their ideas' is one of those things that intuitively makes sense - but those are often wrong, so it's great to have some data to back it up.

And of course, no deep neural network is going to be able to tell you that 'Horton Hears a Who' is a deep psychosexual allegory built on a latent incestuous colonial narrative - that'll be the sole domain of the Doctor of Humanities for a while.

Why are they threatened? Because right now the humanities are the ultimate domain of p-hacking. You can cherry-pick whatever data you want, then retrofit whatever argument you want to it. Want to claim that "the novel repressed the individuality of women in the 19th C"? You can find 50 novels that appear to support the claim. Want to claim the opposite, that "the novel encouraged the individuality of women in the 19th C"? Here's a different 50 novels that "prove" the point. Want to show how terrible the patriarchy is? Well, look for the number of women represented somewhere (let's say minutes of TV shows). If the number is lower than for men, well, obviously women aren't respected as actors. If the number is higher than for men, well, obviously women are treated as visual objects for the pleasure of the male gaze.

Every pathology you can think of in current research (say, medical science or social psychology), these guys don't even understand that these ARE pathologies --- this is just the way you do business: choose the data you want, then construct the narrative you want on top of that. This sort of free-for-all is a delight if your primary goal is to make political claims (on the left or the right; this is equal-opportunity nonsense).

Of course there are definitely some in the humanities and cultural studies arena whose primary goal is knowledge, not politics, and I enjoy reading some of them. But overall, these are fields corrupted by "knowing the answer you want before you even start looking", and they're not going to appreciate anything that shows just how empty such an approach is.

This may come as a shock to you, but you are not the arbiter of the universe, and your view is just yours and isn't any more important than mine. However, no amount of insults will change the fact that a source might not be 100% accurate and any choice is personal and arbitrary.

I'm not shocked that I'm not the arbiter of the universe. Nor am I shocked that my view is mine. My view will sometimes be more important than yours, while at other times your view will be more important than mine. I agree that a source will be less than 100-percent accurate. I agree that most choices are "personal." I completely disagree that the vast majority of choices are arbitrary.

If you have two different versions of the same speech, how do you tell which one is correct? When sources conflict, you cannot definitively say which one is correct, so that choice comes down to the belief of the person reading the source. Rigorous mathematical proofs are not open to interpretation; sources are. Any algorithm depends on someone making choices about the relative value of the source and will just reflect the values of the person making the choice. This makes big data results no more valid than a human reading the same sources.

"...so that choice comes down the belief of the person reading the source."

Here's the thing: The "belief" of the person reading the source can be totally uninformed, or it can be extremely well informed. Let's take Isaac Newton's Principia Mathematica as our source. Now, if we place this original source in front of Neil deGrasse Tyson and he reads it, and then we place it in front of Donald Trump and he reads it, which of these two interpreters would you trust to more accurately interpret the source? In answering that question, don't you take into consideration that Tyson has a PhD in physics? Neither you nor Tyson is "arbitrary" in judging. You weigh his interpretation (against Trump's) knowing that Tyson is an expert in physics and math; and that's not arbitrary. Also, Tyson brings a wealth of intelligence and education to the topic, and all that education is not arbitrary.

All "belief" isn't equal. Some belief is poorly justified, and some belief is well justified. More bluntly, some people's beliefs are full of shit, and some people's beliefs are spot-on.

"...so that choice comes down the belief of the person reading the source."

Here's the thing: The "belief" of the person reading the source can be totally uninformed, or it can be extremely well informed. Let's take Isaac Newton's Principia Mathematica as our source. Now, if we place this original source in front of Neil deGrasse Tyson and he reads it, and then we place it in front of Donald Trump and he reads it, which of these two interpreters will you trust in more accurately interpret the source? In answering that question, don't you take into consideration that Tyson has a PhD in physics? Neither you nor Tyson are "arbitrary" in judging. You weigh his interpretation (against Trump's) knowing that Tyson is an expert in physics and math; and that's not arbitrary. Also, Tyson brings a wealth of intelligence and education to the topic, and all his education is not arbitrary.

All "belief" isn't equal. Some belief is poorly justified, and some belief is well justified. More bluntly, some people's beliefs are full of shit, and some people's beliefs are spot-on.

Also, I apologize for the insult.

For it to be scientifically rigorous, it has to be mathematically quantifiable. In history, there are thousands of books all based on the same sources that come to different conclusions. You cannot mathematically assign a value to the degree of rightness between two different Oxford Classics professors' opinions on the writings of Cicero. If you have two differing accounts of the same event 1,600 years ago, how do you mathematically prove which one is right? If you showed the same sources to differing historians, you would get different answers. If you showed the same mathematical proof to differing physicists, you would get the same answer.

It's not rigorous. How much weight you give to a source's interpretation is an arbitrary decision. It is extremely rare that any report is entirely objective, so big data is no more rigorous than a human doing the same job.

That's a bit of a strawman, given that nobody on this thread has claimed big data is more rigorous.

What I am claiming is that there's research that isn't possible for one person to do, but that Big Data can do. For example, no English PhD, no matter how smart, can read every novel printed in the 1870s in London. Which means that if you're trying to answer any question that requires knowledge of every novel printed in London in the 1870s, you simply can't do that without Big Data.

Big Data also allows for fascinating collaborations with other sciences. The Íslendingabók allows geneticists to collaborate with historians in ways that are absolutely amazing. Collaborations between historians/archeologists/related fields and the people who do satellite mapping have already produced all kinds of cool information.

I don't need big data to tell me about Victorian sentimentality and romanticism in novels.

So a huge amount of money and time has been spent to find out that the reported speeches of Danton and Robespierre aped the language of Diderot, Voltaire, and Montesquieu. In other news, water is wet. In addition, in the era before sound recording or even newspapers, it was commonplace to polish what was actually said before it was published.

Yes, I like to say written English and spoken English are two different languages. Technically, written English is a notation system but it has its own rules and heritage that often diverge from spoken English.

There are many people who are highly fluent in one and illiterate in the other. (For example, many deaf children from a fluent signing background are able to develop high fluency in written English despite having the same lack of access to or inability to speak English as deaf children from other backgrounds.)

I can’t find the link, but somewhere there is a verbatim transcript of MLK’s famous “I have a dream” speech that includes all his errors, stumbles, interruptions, pauses, inadvertent repetitions etc. It’s a fascinating read and just as thrilling as the polished version we usually see.

I don't need big data to tell me about Victorian sentimentality and romanticism in novels.

First off, I doubt you can actually read ALL novels, so your statement is more like "I don't need big data to tell me about Victorian sentimentality and romanticism in the most popular Victorian novels, plus 20 other random ones that happened to be handy while researching this paper." Not to mention that "sentimentality and romanticism" won't let you track change in language usage across all novels.

Because you are physically incapable of measuring that yourself.

Publishers seek what sells. In the same way there are 100s of Harry Potter clones, there were 100s of Little Dorrit clones. Even the Prime Minister had a profitable sideline in writing novels. I don't need big data to tell me that literary style copies the fashionable style of the moment. Everyone wrote in the same style as Dickens because he was the most successful author of the period. Big data can only tell you the obvious.