Audrey Watters – O'Reilly Radar

This week’s visualization comes from BitTorrent, the San Francisco-based company responsible for the peer-to-peer BitTorrent protocol.

BitTorrent’s visualization is a time-lapse video of some 60 million clients worldwide checking in over a 24-hour period, with each frame representing six minutes of real time. “Each time a pixel lights up,” writes BitTorrent’s Kara Murphy, “it’s a client (either BitTorrent or µTorrent) in that square of the world checking in with our servers.”

The video (embedded below) was inspired by NASA’s Earth at Night, where electric lights at nighttime highlight the highly populated and developed regions of the world.

The data for the visualization comes from GeoIP lookups on the company’s access logs.
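The pipeline behind a map like this is straightforward to sketch: resolve each logged IP address to coordinates, then bin the coordinates into pixel cells. Here is a minimal version in Python using MaxMind’s geoip2 library, a common choice for GeoIP lookups (BitTorrent hasn’t said what tooling it used; the log file name and grid resolution below are stand-ins):

```python
import geoip2.database
import geoip2.errors
from collections import Counter

reader = geoip2.database.Reader("GeoLite2-City.mmdb")  # MaxMind's free city database

def to_pixel(lat, lon, width=360, height=180):
    """Map a lat/lon pair onto a coarse equirectangular pixel grid."""
    x = int((lon + 180.0) / 360.0 * (width - 1))
    y = int((90.0 - lat) / 180.0 * (height - 1))
    return x, y

counts = Counter()
with open("access_log_ips.txt") as f:   # hypothetical: one client IP per line
    for line in f:
        try:
            loc = reader.city(line.strip()).location
        except geoip2.errors.AddressNotFoundError:
            continue                    # not every address resolves to a location
        if loc.latitude is not None:
            counts[to_pixel(loc.latitude, loc.longitude)] += 1

# counts now maps (x, y) pixel cells to client check-in totals for the window
```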

Found a great visualization? Tell us about it

This post is part of an ongoing series exploring visualizations. We’re always looking for leads, so please drop a line if there’s a visualization you think we should know about.


Here are a few of the data stories that caught my attention this week:

Prospecting for data

The data science competition site Kaggle is extending its features with a new service called Prospect. Prospect allows companies to submit a data sample to the site without having a pre-ordained plan for a contest. In turn, the data scientists using Kaggle can suggest ways in which machine learning could best uncover new insights and answer less-obvious questions — and what sorts of data competitions could be based on the data.

As GigaOm’s Derrick Harris describes it: “It’s part of a natural evolution of Kaggle from a plucky startup to an IT company with legs, but it’s actually more like a prequel to Kaggle’s flagship predictive modeling competitions than it is a sequel.” It’s certainly a good way for companies to get their feet wet with predictive modeling.

HP’s big data plans

Last year, Hewlett Packard made a move away from the personal computing business and toward enterprise software and information management. It’s a move that was marked in part by the $10 billion it paid to acquire Autonomy. Now we know a bit more about HP’s big data plans for its Information Optimization Portfolio, which has been built around Autonomy’s Intelligent Data Operating Layer (IDOL).

In related news, the social data platform DataSift announced this week that it is powering its Hadoop clusters with Cloudera’s Hadoop distribution (CDH) to perform the “Big Data heavy lifting to help deliver DataSift’s Historics, a cloud-computing platform that enables entrepreneurs and enterprises to extract business insights from historical public Tweets.”


This week’s visualization comes from John Nelson of IDV Solutions, who has taken data from the National Oceanic and Atmospheric Administration (NOAA) to map tornado paths and F-Scale frequencies.

Nelson writes: “It tracks 56 years of tornado paths along with a host of attribute information. Here, the tracks are categorized by their F-Scale (which isn’t the latest and greatest means, but good enough for a hack like me), where brighter strokes represent more violent storms.”


Here are a few of the big data stories that caught my attention this week.

MIT makes a big data push

MIT unveiled its big data research plans this week with a new initiative: bigdata@csail. CSAIL is the university’s Computer Science and Artificial Intelligence Laboratory. According to the initiative’s website, the project will “identify and develop the technologies needed to solve the next generation data challenges which require the ability to scale well beyond what today’s computing platforms, algorithms, and methods can provide.”

The research will be funded in part by Intel, which will contribute $2.5 million per year for up to five years. As part of the announcement, Massachusetts Governor Deval Patrick added that his state was forming a Massachusetts Big Data initiative that would provide matching grants for big data research, something he hopes will make the state “well-known for big data research.”

Cisco’s predictions for the Internet

Cisco released its annual forecast for Internet networking. Not surprisingly, Cisco projects massive growth in networking, with annual global IP traffic reaching 1.3 zettabytes by 2016. “The projected increase of global IP traffic between 2015 and 2016 alone is more than 330 exabytes,” according to the company’s press release, “which is almost equal to the total amount of global IP traffic generated in 2011 (369 exabytes).”
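A quick back-of-the-envelope check shows how the press release’s figures fit together (decimal units, with 1 zettabyte = 1,000 exabytes):

```python
# Sanity check on Cisco's projections; all figures are from the press release.
traffic_2016 = 1.3 * 1000     # projected annual IP traffic in 2016, in exabytes
growth_15_to_16 = 330         # projected increase from 2015 to 2016, in exabytes
traffic_2011 = 369            # total global IP traffic in 2011, in exabytes

print(traffic_2016 - growth_15_to_16)            # implied 2015 traffic: ~970 EB
print(round(growth_15_to_16 / traffic_2011, 2))  # one year's growth is ~0.89 of all 2011 traffic
```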

Cisco points to a number of factors contributing to the explosion, including more Internet-connected devices, more users, faster Internet speeds, and more video.

Open data startup Junar raises funding

The Chilean data startup Junar announced this week that it had raised a seed round of funding. The startup is an open data platform with the goal of making it easy for anyone to collect, analyze, and publish data. GigaOm’s Barb Darrow writes:

“Junar’s Open Data Platform promises to make it easier for users to find the right data (regardless of its underlying format); enhance it with analytics; publish it; enable interaction with comments and annotation; and generate reports. Throughout the process it also lets user manage the workflow and track who has accessed and downloaded what, determine which data sets are getting the most traction etc.”

Junar joins a number of open data startups and marketplaces that offer similar or related services, including Socrata and DataMarket.

This week’s visualization comes from The New York Times, which tries to shed a little light on Facebook’s initial public offering by showing how it compares to the 2,400 technology IPOs that have occurred since 1980.

The visualization begins with a timeline of tech IPOs that runs up until last week. Until then, Google’s debut had been the largest, with a market capitalization of $28 billion at launch. The next image in the series adds Facebook to the mix, and the animation makes the other IPOs literally shrink in comparison: Facebook’s value at launch was $104 billion.

Here are a few of the data stories that caught my attention this week:

Visualizing a better life

How do you compare the quality of life in different countries? As The Guardian’s Simon Rogers points out, GDP has commonly been the indicator used to show a country’s economic strength, but it’s insufficient for comparing the quality of life and happiness of people.

To help build a better picture of what quality of life means to people, the Organization for Economic Cooperation and Development (OECD) built the Your Better Life Index. The index lets people select the things that matter to them: housing, income, jobs, community, education, environment, governance, health, life satisfaction, safety, and work-life balance. The OECD launched the tool last year and offered an update this week, adding data on gender and inequality.

“It’s counted as a major success by the OECD,” writes Rogers, “particularly as users consistently rank quality of life indicators such as education, environment, governance, health, life satisfaction, safety and work-life balance above more traditional ones. Designed by Moritz Stefaner and Raureif, it’s also rather beautiful.”

The countries that come out on top most often based on users’ rankings: “Denmark (life satisfaction and work-life balance), Switzerland (health and jobs), Finland (education), Japan (safety), Sweden (environment), and the USA (income).”
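Under the hood, a user-weighted index like this is essentially a weighted average of normalized indicators. A minimal sketch, with made-up numbers (the OECD’s actual normalization and weighting methodology is more involved):

```python
# A sketch of a user-weighted index: each indicator is normalized to a 0-1
# scale, then combined as a weighted average using the user's chosen weights.
# The indicator values and weights below are purely illustrative.
def score(indicators, weights):
    """Weighted average of normalized indicators, keyed by topic name."""
    total = sum(weights.values())
    return sum(indicators[topic] * w for topic, w in weights.items()) / total

# Hypothetical normalized values for one country:
country = {"life satisfaction": 0.97, "work-life balance": 0.90, "income": 0.55}
# A user who cares most about work-life balance:
weights = {"life satisfaction": 3, "work-life balance": 5, "income": 1}

print(round(score(country, weights), 2))   # -> 0.88
```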

Researchers’ access to data

The New York Times’ John Markoff examines social science research and the growing problem of datasets that are not made available to other scholars. Opening data helps make sure that research results can be verified. But Markoff suggests that in many cases, data is being kept private and proprietary.

Much of the data he’s talking about here is:

“… gathered by researchers at companies like Facebook, Google and Microsoft from patterns of cellphone calls, text messages and Internet clicks by millions of users around the world. Companies often refuse to make such information public, sometimes for competitive reasons and sometimes to protect customers’ privacy. But to many scientists, the practice is an invitation to bad science, secrecy and even potential fraud.”

“The debate will only intensify as large companies with deep pockets do more research about their users,” Markoff predicts.

Updates to Hadoop

Apache has released the alpha version of Hadoop 2.0.0. We should stress “alpha” here, and as Hortonworks’ Arun Murthy notes, it’s “not ready to run in production.” However, he adds, the update “is still an important step forward, as it represents the very first release that delivers new and important capabilities,” including HDFS high availability (with manual failover) and next-generation MapReduce (YARN).
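For readers who haven’t written MapReduce code, the programming model the release carries forward is easiest to see in the classic Hadoop Streaming word count, where the mapper and reducer are plain scripts that read stdin and emit tab-separated key/value pairs; the framework splits the input and sorts mapper output by key between the two phases. This is the standard textbook example, nothing specific to 2.0.0:

```python
#!/usr/bin/env python
# mapper.py: emit "<word>\t1" for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t1" % word)
```

```python
#!/usr/bin/env python
# reducer.py: sum the counts for each word. Hadoop sorts mapper output by
# key before the reduce phase, so identical words arrive consecutively.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))
```

These run under the streaming jar along the lines of hadoop jar hadoop-streaming.jar -input <dir> -output <dir> -mapper mapper.py -reducer reducer.py, though exact jar paths vary by distribution.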

In other Hadoop news, MapR has unveiled a series of new features and initiatives for its Hadoop distribution, including the release of a fully compliant ODBC 3.52 driver, support for Linux Pluggable Authentication Modules (PAM), and the availability of the source code for several of its components.

This week’s visualization comes from PhD candidates David Quinn and Daniel Wiesmann, who’ve built an interactive web-mapping tool that lets you explore the “urban metabolism” of major U.S. cities. The map includes data on cities’ and neighborhoods’ energy usage (kilowatt-hours per person) and material intensity (kilograms per person). You can also view population density.

Quinn writes that “one of the objectives of this work is to share the results of our analysis. We would like to help provide better urban data to researchers.” The map lets users analyze information on screen, draw an area to examine, compare multiple areas, and generate a report (downloadable as a PDF) with more details, including the specific data sources.

Quinn is a graduate student at MIT; Wiesmann is a PhD candidate at the Instituto Superior Técnico in Lisbon, Portugal.


Here’s what caught my attention in the data space this week.

Google’s Knowledge Graph

“Google does the semantic Web,” says O’Reilly’s Edd Dumbill, “except they call it the Knowledge Graph.” That Knowledge Graph is part of an update to search that Google unveiled this week.

“We’ve always believed that the perfect search engine should understand exactly what you mean and give you back exactly what you want,” writes Amit Singhal, Senior VP of Engineering, in the company’s official blog post.

“Most of Google users’ queries are ambiguous. In the old Google, when you searched for “kings,” Google didn’t know whether you meant actual monarchs, the hockey team, the basketball team or the TV series, so it did its best to show you web results for all of them.

“In the new Google, with the Knowledge Graph online, a new box will come up. You’ll still get the Google results you’re used to, including the box scores for the team Google thinks you’re looking for, but on the right side, a box called “See results about” will show brief descriptions for the Los Angeles Kings, the Sacramento Kings, and the TV series, Kings. If you need to clarify, click the one you’re looking for, and Google will refine your search query for you.”
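Strip away Google’s scale and the underlying idea is a mapping from one ambiguous query string to several distinct entities, with the engine surfacing all candidates rather than guessing. A toy illustration, with the entity data hand-written (the real Knowledge Graph is assembled from sources like Freebase and Wikipedia at vastly larger scale):

```python
# Toy entity disambiguation: one query string maps to several candidate
# entities, each with a short description. All data here is hand-written
# for illustration only.
KNOWLEDGE = {
    "kings": [
        ("Los Angeles Kings", "NHL hockey team"),
        ("Sacramento Kings", "NBA basketball team"),
        ("Kings", "TV series"),
    ],
}

def see_results_about(query):
    """Return candidate entities for an ambiguous query, if any."""
    return KNOWLEDGE.get(query.lower(), [])

for name, description in see_results_about("Kings"):
    print("%s -- %s" % (name, description))
```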

Yahoo’s fumbles

The news from Yahoo hasn’t been good for a long time now, with the most recent troubles involving the departure of newly appointed CEO Scott Thompson over the weekend and a scathing blog post this week by Gizmodo’s Mathew Honan titled “How Yahoo Killed Flickr and Lost the Internet.” Ouch.

Over on GigaOm, Derrick Harris wonders if Yahoo “sowed the seeds of its own demise with Hadoop.” While Hadoop has long been pointed to as a shining innovation from Yahoo, Harris argues that:

“The big problem for Yahoo is that, increasingly, users and advertisers want to be everywhere on the web but at Yahoo. Maybe that’s because everyone else that’s benefiting from Hadoop, either directly or indirectly, is able to provide a better experience for consumers and advertisers alike.”

De-funding data gathering

The appropriations bill that recently passed the U.S. House of Representatives axes funding for the Economic Census and the American Community Survey. The former gathers data about 25 million businesses and 1,100 industries in the U.S., while the latter collects data from three million American households every year.

Census Bureau director Robert Groves writes that the bill “devastates the nation’s statistical information about the status of the economy and the larger society.” BusinessWeek chimes in that the end to these surveys “blinds business,” noting that businesses rely “heavily on it to do such things as decide where to build new stores, hire new employees, and get valuable insights on consumer spending habits.”

Marvel’s “The Avengers” opened in U.S. theaters last weekend, claiming the largest weekend opening so far this year and setting a new three-day domestic box office record. The film features a superhero team composed of Iron Man, Thor, The Hulk, Captain America, Black Widow, and Hawkeye. But as fans of the comics know, this isn’t the original or the only lineup of “The Avengers.” When the team first appeared in 1963, it was made up of Iron Man, Ant-Man, Wasp, Thor, and the Hulk; Captain America wasn’t discovered frozen in ice until Issue #4. The only real constant of “The Avengers” over the years is its rotating roster of superheroes.

That changing makeup of “The Avengers” is the theme of this week’s visualization, created by The New York Times’ data artist in residence Jer Thorp. On his blog, Thorp has posted a series of visualizations of these superheroes, built with data from comicvine.com’s API.

“My first thought was to use images of the characters in my visualizations, but while the Comic Vine API provides images in all kinds of sizes, the styles of drawing are so varied that it ended up not holding together. Instead, then, I built a small tool that let me go through those characters and pick three colours that I thought represented them the best (everybody gets a shield!).”
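For anyone who wants to try something similar, here’s a sketch of pulling character data from the Comic Vine API with Python’s requests library. Treat the endpoint, parameters, and field names as assumptions based on the public API docs; the API also requires a free key, and today it lives at comicvine.gamespot.com:

```python
# Sketch of querying the Comic Vine API for a character. The endpoint,
# parameters, and field names below are assumptions from the public docs.
import requests

API_KEY = "your-comicvine-api-key"   # placeholder; free registration required
BASE = "https://comicvine.gamespot.com/api"

resp = requests.get(
    BASE + "/characters/",
    params={"api_key": API_KEY, "format": "json",
            "filter": "name:Captain America"},
    headers={"User-Agent": "avengers-viz-sketch"},  # a custom User-Agent is commonly required
)
resp.raise_for_status()
for character in resp.json()["results"]:
    print(character["name"], character.get("count_of_issue_appearances"))
```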

Below is a depiction of all “The Avengers” characters, ordered by the frequency with which they appear in the series.

And, sorted by issue (characters appearing in an issue together form a radial line), here is every appearance of every Avengers team member in every issue.

Read the full post to see many more visualizations based on “The Avengers,” including the gender ratio of the superhero team and the types of villains the team most often assembled to battle.


Here are a few of the data stories that caught my attention this week.

Big data booming

The call for speakers for Strata New York has closed, but as Edd Dumbill notes, the number of proposals is a solid indication of the booming interest in big data. The first Strata conference, held in California in 2011, elicited 255 proposals. The following event in New York elicited 230. The most recent Strata, held in California again, drew 415. And the number received for Strata’s fall event in New York? That came in at 635.

Edd writes:

“That’s some pretty amazing growth. I can thus expect two things from Strata New York. My job in putting the schedule together is going to be hard. And we’re going to have the very best content around.”

The increased popularity of the Strata conference is just one data point from the week that highlights a big data boom. Here’s another: According to a recent report by IDC, the “worldwide ecosystem for Hadoop-MapReduce software is expected to grow at a compound annual rate of 60.2 percent, from $77 million in revenue in 2011 to $812.8 million in 2016.”
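Those IDC numbers are internally consistent, as a quick compound-growth check confirms:

```python
# IDC's figures hang together: $77M growing at 60.2% annually for the five
# years from 2011 to 2016 lands almost exactly on the forecast $812.8M.
start, end, years = 77.0, 812.8, 5
cagr = (end / start) ** (1.0 / years) - 1
print(round(cagr * 100, 1))               # -> 60.2
print(round(start * 1.602 ** years, 1))   # -> ~812.5, matching the forecast
```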

A big data gap?

Another report released this week reins in some of the exuberance about big data. This report comes from the government IT network MeriTalk, and it points to a “big data gap” in government: a gap between the promise of big data and the federal government’s capability to make use of it. That gap is notable in light of the Obama administration’s recent $200 million commitment to a federal big data initiative.

Among the MeriTalk report’s findings: 60% of government IT professionals say their agency is analyzing the data it collects and less than half (40%) are using data to make strategic decisions. Those responding to the survey said they felt as though it would take, on average, three years before their agencies were ready to fully take advantage of big data.

Prismatic and data-mining the news

The largest healthcare fraud scheme ever was uncovered this past week. Arrests were made in seven cities; some 107 doctors, nurses, and social workers were charged, with fraudulent Medicare claims totaling about $452 million. The fraud was discovered thanks in part to data mining: looking for anomalies in the Medicare filings of various health care providers.
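At its simplest, that kind of anomaly hunting is outlier detection over aggregate claims. A toy sketch (the numbers and the z-score threshold are purely illustrative; real fraud detection is far more sophisticated):

```python
# Flag providers whose billing totals sit far outside the norm for their
# peer group. All figures here are hypothetical.
import statistics

claims = {   # hypothetical annual Medicare claim totals, in dollars
    "provider_a": 310000, "provider_b": 295000, "provider_c": 305000,
    "provider_d": 2400000, "provider_e": 320000, "provider_f": 290000,
}

mean = statistics.mean(claims.values())
stdev = statistics.stdev(claims.values())
for provider, total in sorted(claims.items()):
    z = (total - mean) / stdev
    if z > 2:   # more than two standard deviations above the peer mean
        print(provider, total, round(z, 2))
```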

Prismatic penned a post in which it makes the case for more open data so that there’s “less friction” in accessing the sort of information that led to this sting operation.

“Both the recent sting and the Prime case show that you need real journalists and investigators working with technology and data to achieve good results. The challenge now is to scale this recipe and force transparency on a larger scale.

“We need to get more technically sophisticated and start analysing the data sets up front to discover the right questions to ask, not just the answer the questions we already know to ask based on up-front human investigation. If we have to discover each fraud ring or singleton abuse as a one-off case, we’ll never be able to wipe out fraud on a large enough scale to matter.”

Indeed, despite this being the largest bust ever, it’s really just a fraction of the estimated $20 billion to $100 billion a year in Medicare fraud.