Posted by samzenpus on Sunday June 12, 2011 @10:15PM from the reading-between-a-million-lines dept.

meckdevil writes "Associated Press developer-journalist extraordinaire Jonathan Stray gives a brilliant explanation of the use of data-mining strategies to winnow and wring journalistic sense out of massive numbers of documents, using the Iraq and Afghanistan war logs released by Wikileaks as a case in point. The concepts for focusing on certain groups of documents and ignoring others are hardly new; they underlie the algorithms used by the major Web search engines. Their use in a journalistic context is on a cutting edge, though, and it raises a fascinating quandary: By choosing the parameters under which documents will be considered similar enough to pay attention to, journalist-programmers actually choose the frame in which a story will be told. This type of data mining holds great potential for investigative revelation — and great potential for journalistic abuse."

Worked miracles after I've gotten around the ugly HTML format they use to release all those INFORMATIONS. Still, there was very little new or worthwhile in the heap of those news clips and rumour aggregations. Frankly, the more I grep it, the less it looks like the "largest leak in history", and the more it seems like "the largest controlled release of information" in history.

/ takes off conspiracy theory hat// flame on

When you use grep you have to know what you grep for. You cannot stumble upon a search keyword with grep. Clustering allows that, if you let it build the clusters itself. Perhaps you are missing out on the interesting bits.

If you don't know what you're looking for in a pile of documents assembled by less than gifted embassy staff by retelling newspaper clips and general gossip, you have no business being a journalist in the first place. There was nothing else but local newspaper clips and gossip in the Wikileaks embassy leak.

If you know what's in the documents, then life gets easy of course. The trouble is that usually you do not know what's in the documents without reading them. And if there's nothing new, that's a pity. But anyway the fact that one could say "there is nothing but local newspaper clips and gossip" in a set of documents indicates that they actually went through them all.

And for sure there's a lot of noise in the WikiLeaks documents. The same will be true of the Palin e-mail trove. Finding the interesting bits in that enormous noise is what journalists are for, and no journalist will know beforehand what those interesting bits are, which is exactly why they are interesting.

Isn't that one of the major reasons we have journalism? To synthesize and contextualize information? If the contextualized (or perhaps editorialized, depending on your point of view) information was the only kind available, then yes that is an issue. But with Wikileaks, the data is there for anyone who wants to parse it.

This strikes me as being similar to when Anderson Cooper was criticized for calling Mubarak a liar. Or the behavior that Colbert mocked the White House press corps for at the correspondents' dinner. Pretending that journalists are free of bias doesn't make it so, and saying that they should just regurgitate facts and talking points verbatim is counter-productive. Reasoned analysis should be encouraged.

If the contextualized (or perhaps editorialized, depending on your point of view) information was the only kind available, then yes that is an issue. But with Wikileaks, the data is there for anyone who wants to parse it.

If memory serves, and I'm not missing something in my quick re-read of the Wikipedia page, the leaked cables were not all made available to everyone. They were distributed to five major news organizations so more than one editorial staff could reasonably decide which material was newsworthy

That's part of the point of the video, using data mining techniques to broaden analytical tools beyond a simple keyword search and the preconceptions it can reinforce (the reporter mentions seeing a cluster of tanker truck incidents that was bigger than his organization was previously aware). He ends by noting that the way one writes the algorithm can determine what trends pop out and thus how the story is framed, which seems like a perfectly reasonable statement. Then someone (either the submitter or the Slashdot editors) transforms that into a "great potential for journalistic abuse."

I don't have an issue with the methodology portrayed in the video. But to then take the presenter's words and twist them to support a "just the facts, ma'am" style of journalism seems dishonest and unproductive.

From the video (I only watched 5 minutes before getting bored, though) I am not sure he even used his own algorithm/s. It does appear that he used Gephi and its built-in or 3rd party algorithms (plugins) to display the data in a way that made associations not immediately apparent... apparent. The tanker truck incident cluster is an example of this, and about when I stopped watching.

He used a standard text mining approach (TF-IDF), followed by clustering of documents on pairwise distance. We did something similar here http://journal.imbio.de/article.php?aid=121 [imbio.de] to text mine the biological literature although we went further in terms of figuring out which metrics work best. He eventually ran up against the same thresholding problem we did - at some point you have to decide what you are going to call 'not related' and what 'related' and there doesn't seem an obvious principled way to d
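To make the thresholding problem concrete, here is a minimal sketch in Python. It is not the code from the video or from the linked paper: the similarity numbers are made up, `cluster_by_threshold` is a hypothetical helper, and "cluster" is assumed to mean a connected group of documents linked by pairs whose similarity clears the cutoff.

```python
def cluster_by_threshold(similarity, threshold):
    """Group document indices into connected components, linking two
    documents whenever their similarity is at least `threshold`."""
    n = len(similarity)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        # Walk up to the root of i's tree, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up pairwise similarities for four documents (symmetric matrix).
similarity = [
    [1.0, 0.8, 0.3, 0.1],
    [0.8, 1.0, 0.4, 0.1],
    [0.3, 0.4, 1.0, 0.2],
    [0.1, 0.1, 0.2, 1.0],
]

loose = cluster_by_threshold(similarity, 0.25)   # [[0, 1, 2], [3]]
strict = cluster_by_threshold(similarity, 0.5)   # [[0, 1], [2], [3]]
```

Same data, two cutoffs, two different sets of "related" documents. Document 2 is part of the story at 0.25 and an unrelated outlier at 0.5, and nothing in the data says which cutoff is the right one.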

Actually, if you watch the video, that's not what Stray is talking about. Rather than doing targeted searches, he's talking about processing the whole dataset and using algorithms to establish connections. The narrative that makes sense of those clusters is what would (hopefully) be the reasoned analysis.

The problem you're missing is that, by the nature of broadcasting and traditional media, not all journalists (and not all newspapers) have the same weight.
This is a problem and the reason why bias should not be encouraged.

In a world where one source of information has 10 times the weight of others,
the impact of that source's bias is 10 times the impact of the others.
In other words, for every NY Times article which claims we have to invade Iraq, you would have to read 10 local Springfield Shopper article

The Iraq war example doesn't exactly fit here, since the video is about parsing and analyzing large datasets. But since you brought it up (and I referenced it in my post), I'm going to quote Colbert [about.com]:

Over the last five years you people were so good, over tax cuts, WMD intelligence, the effect of global warming. We Americans didn't want to know, and you had the courtesy not to try to find out. Those were good times, as far as we knew.

But, listen, let's review the rules. Here's how it works. The President makes decisions. He's the decider. The press secretary announces those decisions, and you people of the press type those decisions down. Make, announce, type. Just put 'em through a spell check and go home. Get to know your family again. Make love to your wife. Write that novel you got kicking around in your head. You know, the one about the intrepid Washington reporter with the courage to stand up to the administration? You know, fiction!

I think there is far more danger to be had in a news media that passively accepts presented facts (Iraqi WMDs, Saddam's ties to al Qaeda, etc.) and narratives (invasion of Iraq is necessary) than one that editorializes a bit but doesn't simply act as a mouthpiece for whoever is currently in power. And I disagree

You are both right. I was just talking about bias (= the narrative or
interpretation that is shrouded over the bare facts). But I wasn't
suggesting that the top news sources should present controversies or
talking points. I was suggesting that they should stay with the facts
and reduce the interpretation the more important (=higher readership)
they are. That's quite different from presenting a bag of
contradictory biases hoping that would give a complete interpretation
somewhere in the middle.

I think you miss the point - that it was used in a journalistic context most certainly *is* newsworthy: the AP guy was going to great lengths to stress evidence-based reporting, and uncovering associations, vice pre-supposing those things and backfitting the data.

Data mining - like stats - allows bias to creep in quite readily, and once a study, a number, a story is out there, it's very difficult to pull it back, even when it's demonstrably wrong, biased or fabricated.

Terrorists and foreign intelligence services will also be doing this to use against the United States and its allies, not just journalists. Wikileaks has provided the raw material for data mining to find things the US doesn't even realize about itself, or its allies. There is no surprise that Bradley Manning has been charged with aiding the enemy [dailymail.co.uk].

Life's a bitch. Perhaps the US shouldn't be doing things that it has to keep so secret. That's just a consequence of empire-building. Preach one value to the masses, do something else in practice.

Is it more important to prop up the current system to keep a few agents of the empire safe from harm or is it more important to try to bring some sanity to the whole entire thing and do some longer-term good by shedding light on things people are afraid of showing to even our own public?

we're doing some rather unsettling things that I don't want to be associated with

And that's why you're not some sort of government agent doing those things. This attitude bothers me for the same reason the "No blood for oil" types bother me. You don't get how important that sort of thing is. No blood for oil? Then what will you shed blood for? Losing oil supplies will so vastly change your way of life that you would argue it impossible if someone accurately showed you. If you think shady goings-on are an endeavor unique to America, you need to wake up. Every country (EVERY countr

So you do actually believe it's fine to kill foreigners so you can keep your higher standard of living? That sounds an absurd question to ask anyone, but I can't see how your post doesn't imply that you do believe it's fine. It's not ethical, though.

Another thing that bothers "foreigners" (i.e. the approximately 6.7 billion, or over 95% of the human race, who are not citizens of the USA) is the inconsistency of a government that loudly insists that all people are equal, while working as hard as it can to ma

I'm not saying it's fine to kill foreigners to keep a higher standard of living, I'm just saying that it's silly and hypocritical for people to make the kind of arguments I mentioned. The response after yours puts it more accurately; it's very Faustian, but it is in fact the way the world works. See my response to that for more.

This is a Faustian bargain though. It seems that you are saying, "If you want to have nice things we have to kill people to get them." You sound a lot like Col. Jessep telling us we can't handle the truth.

I don't deny that you are accurately describing the current dynamic. It is the way it is, and yet people wonder why the world is such a violent and chaotic place and why we can't have "peace". Well this dynamic is partly why the world is the way it is. We can accept the way it is, but not agree that i

The word dynamic here is perfect to describe this; the equilibrium of the world is such that if any entity (country, state, politician, military force, etc) won't take unfair or aggressive advantage of something, there will be another equal entity to fill that void. That's why it's silly to point out any one country for doing this kind of thing, because even if a country isn't, given the chance or risk:reward ratio, it would.

We can accept the way it is, but not agree that it should be that way.

But that attitude means it will never change. Sure, no blood for oil! I'm agains

The word dynamic here is perfect to describe this; the equilibrium of the world is such that if any entity (country, state, politician, military force, etc) won't take unfair or aggressive advantage of something, there will be another equal entity to fill that void. That's why it's silly to point out any one country for doing this kind of thing, because even if a country isn't, given the chance or risk:reward ratio, it would.

That still sounds to me like an excuse to take unfair advantage. If I don't do it someone else will. I would like to see my country's foreign and domestic policy be a bit more equitable than that. A pipe dream, I know.

But that attitude means it will never change. Sure, no blood for oil! I'm against foreign wars! Oh, but I'll also blame the government or big business or whomever if gas prices rise any higher. If you're against something, be against it. I'm in the military, so if someone gives me an order I don't like, I can either deal with it and follow the order, or I can decide I can't follow the order legally or in good conscience and refuse, but like everyone else I'd have to pay the consequences.

Yeah, but some things I just have to accept. I can't fight all the battles, and I can't make everything the way I think it should be (the rest of you should probably be glad about that). I can accept what I can't (or won't) change, but still not like that it goes on. I admit I am not w

Perhaps the US shouldn't be doing things that it has to keep so secret. That's just a consequence of empire-building. Preach one value to the masses, do something else in practice.

Is it the duty of the United States government to serve the interests of the United States, as opposed to say, Iran? Is it the duty of the United States government to care for and protect its people, as opposed to say, the people of Venezuela? If so, then it must differentiate between different sets of interests, American, and those of others.

If American citizens have been taken prisoner unlawfully by pirates, the United States government could try to negotiate with the pirates. If the pirates want $1,000,000, but the US is willing to pay $20,000,000, should the government go in and up front announce the maximum amount they are willing to pay instead of trying to pay the least amount? Wouldn't that be a fundamentally stupid bargaining tactic? But to avoid it, they would need to keep secrets from the pirates. Well, not just pirates, they would need to keep it secret from the media, since there are many media outlets that would gladly publish it, and force the US to pay $20,000,000 instead of $1,000,000. So, do you think the US should keep the maximum bid a secret and serve American interests, or announce it and serve pirate interests by undermining the government's own negotiating position?

Let's say negotiations with the pirates are going badly, they heard in media that the government is willing to pay $20,000,000 but they got greedy and now think they can get $50,000,000. The US Government isn't willing to pay that much, decides to use a commando raid to rescue the hostages while stalling in negotiations. Military actions are generally at least twice as effective over short periods of time when the attacking force attains surprise. Even if the pirates think it is possible, they don't really know if, when, how, who, or where they will come from. Should the US Government announce to the pirates that it has given up negotiations, and that it is going to use military force to free its citizens? If not, that would mean keeping a secret from the pirates - do you oppose that? Of course, it will also have to keep the rescue plan secret from the media as well or it will be published, the pirates will see it, and will be prepared to defeat it. Should the government tell the next of kin that it is going to try a military rescue? They might tell the media, or their kin being held by the pirates, and either the media or the prisoners might tell the pirates. So, it looks like we can't tell the pirates, the media, or the next of kin. What about other people in the United States? Same problem.

As part of the planning for the rescue mission, it appears that it would be really helpful to refuel some aircraft in a country near where the pirates are holding the American captives. This third country has a government that is friendly to the United States, but much of the population is hostile as they are being influenced by religious extremists from outside their country. The government of this third country agrees to the refueling operation at one of their island military bases, but demands that it be kept secret to avoid agitating their citizens. Since it helps the mission of recovering Americans held hostage, shouldn't the US make use of the island for refueling? What about the request to keep it secret? Should the US stir up problems in the country by making it known, despite the request of the government? If the use of the island is revealed, it could hurt diplomatic relations, and perhaps even generate civil unrest, getting people killed. Shouldn't this be kept a secret? From the pirates? From the media?

During the flight to the pirate locations, and on the ground, US forces will be using radios for command and control, and various flight operations. Should the US inform the pirates about the radio frequencies it uses? What about the media, who might listen in? Suppose a

So are you claiming the US should have publicised that they were pretty sure where Osama bin Laden was instead of keeping it secret until they acted on it? Or that they shouldn't have been looking?

Are you claiming the US should call the Chinese Government right now and tell them the names and locations of every Chinese dissident they know about? Or that they shouldn't know in the first place or provide any assistance?

Are you claiming the US should just publicise and, maybe run some training seminars on, ho

The fact that there's a media narrative is hardly news. The purpose is to provide ratings. Anything that will lead to scandal, corruption, or supporting national politics is the name of the game. Fox does this to support Republicans, all the others support the Democrats. I suppose this is news to those that don't already know this however. And this "taking sides" of the national media is nothing new at all. Very old hat in American history.

Ask any budding journalist why they want to be in this industry. Sometimes, you will hear a common theme of "To change the world for a better place". Generally that implies a motive with bias. No, their job is to REPORT the news in its purest form. I'll tell ya, that can both end wars and create them. But oh no, we can't have that now can we? They should report the good, the bad, and the ugly with impartiality. The BBC is the closest anyone comes to doing that. Perhaps I'm giving them too much credit however.

The director general of the BBC admitted Thursday that his organisation had been guilty of a "massive bias to the left" but said "a completely different generation" of journalists now works at the broadcaster. Mark Thompson told the right-of-centre Spectator magazine that there was an institutional bias when he joined the organisation, reinforcing the findings of a 2007 int

The BBC certainly is massively biased (in an institutional way). It's less a matter of overt censorship than of a pervasive worldview that makes censorship unnecessary. Presumably they don't even hire people who might be members of the "awkward squad", or who don't appear to share the standard politically correct establishment values.

As a result, I think it's very inaccurate to describe the BBC's bias as "left wing". I would call it "establishment", which of course does beg the question of whether the Briti

The video has a reasonable explanation - they look at every word in each document and give it a relevance score - TF-IDF (term frequency/inverse document frequency) - i.e. how often the word comes up in the document compared to how often it comes up in the whole document set. This gives you a rating for every word on how 'document-specific' its use is. Then for every pairwise comparison between documents you can calculate the distance between the pair of documents by looking at the overlap between the term
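For anyone who wants to see the scoring spelled out, here is a rough sketch in plain Python. Assumptions on my part: whitespace tokenizing, the textbook tf * log(N/df) weighting, and cosine similarity as the "overlap" measure. The video may use a different variant, and the three sample documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # A term found in every document gets log(1) = 0, i.e. no weight.
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Overlap between two sparse term-weight vectors (0.0 if none)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented examples, loosely in the spirit of incident reports.
docs = [
    "tanker truck attacked on highway",
    "fuel tanker truck hit by ied on highway",
    "meeting with village elders about school",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # shares tanker, truck, on, highway
sim_02 = cosine_similarity(vecs[0], vecs[2])  # no shared terms
```

The first two documents come out similar and the third shares nothing with the first, so its similarity is zero. A distance is then just 1 minus the similarity, and the clustering step runs on those pairwise numbers.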

Mark Twain summed up the central problem of journalism with his epigram, "Get your facts first... then you can distort 'em as much as you please". But, amusing as it is, this completely misses the point! In the very process of "getting your facts" you have the opportunity - indeed, the obligation - of selecting them from among the infinite number of facts that you could choose. Having selected the facts that you think are most important, there is no longer the slightest need to distort them. The work is already done.

Suppose you are the New York Times, and you are reporting on events in Afghanistan. You have a certain amount of space, so do you write up the IED explosion which killed a couple of NATO soldiers and put a few more in hospital - or do you describe the NATO helicopter raid that killed a dozen villagers and wounded another few dozen? Well, your readers are far more interested in the fate of NATO people (especially if they are from the USA); moreover, they don't particularly want to read about how their glorious forces have accidentally (or otherwise) killed a lot of civilians. So it's a no-brainer - you write up the IED event. After a few years of such a policy, consistently followed, readers get the idea that all that happens in Afghanistan is that NATO soldiers occasionally get blown up. Yes the NYT has accurately reported the facts. It hasn't reported all of them, but its editors could argue that such an attempt would be physically impossible. The only practical way of giving a more balanced impression would be to read, as well as the NYT, a newspaper that takes an anti-NATO, pro-Afghan point of view. But no such newspaper can survive commercially in the US market, because it wouldn't sell enough copies (even if it were allowed to go on operating for long).

Indeed, the Wikileaks documents currently under discussion are subject to such a filtering effect too. Remember, all those documents were written by American officials, for US government consumption. You won't find many mentions in there of atrocities by our forces - even if the US authorities in Afghanistan or Washington were aware of such atrocities, they wouldn't put them into messages with such a low level of security. What you can expect to find is a fairly high level of unguarded opinions - either honest or carefully angled to make a particular desired impression.

Is this a link to (presumably) the submitter's blog, rather than the actual presentation available here?

Given that the submitter meckdevil's associated email address is john.mecklin@sbcglobal.net and the link to TFA is on johnmecklin.wordpress.com, I would say yes. The linked page contains no content and readers should just use the link in the parent post. The submitter is nothing more than a link whore and if the editors were doing their jobs this wouldn't happen.