10 Bits: The Data News Hot List

This week’s list of data news highlights covers February 22-28 and includes articles on a TV network’s plan to leverage data mining to find news stories and an Italian initiative that uses data to track down dog owners who fail to clean up after their pets.

The National Oceanic and Atmospheric Administration (NOAA) collects over 20 terabytes of data per day, but only a small percentage of this is easily accessible. This week, U.S. Secretary of Commerce Penny Pritzker called on private companies to partner with NOAA to unlock and make use of this data, with an eye toward creating new products and driving economic growth. The Department of Commerce has released a Request for Information, asking companies to weigh in on the feasibility of the project.

Officials in Naples, Italy, are piloting a data-driven program to identify dog owners who fail to pick up after their pets. The program, which would collect blood samples from every dog in the city to create a database of DNA profiles, would allow the city to identify dogs by their waste. Upon identifying the offending canine, the city could send a fine to the registered owner.

MSNBC is partnering with news technology startup Vocativ to mine the Internet for stories it will use in the new Ronan Farrow Daily program. Vocativ, whose founder also launched a global security firm that helps governments and corporations manage and analyze information, was originally developed to help companies identify business threats. The software that underlies Vocativ searches social media, chat rooms and other online data to predict stories’ growth potential.

The Office of the National Coordinator for Health Information Technology issued a report this week on the present and future of the technology used to match patients’ electronic health records across different organizations. The report recommends that organizations making any changes to patient data attributes coordinate with other groups working on related topics and that organizations should not be required to use a specific type of patient matching algorithm.

This week saw the launch of the Qualitative Data Repository at Syracuse University’s Center for Qualitative and Multi-Method Inquiry. The repository, which was designed in response to a perceived trend in the social sciences for collecting qualitative data once and never reusing it, will allow researchers to reuse qualitative data more efficiently. The repository ensures that the data has persistent, unique identifiers and provides a library of guidance and resources to help scholars manage the data.

Cinemetrics, the statistical study of films, has been around for decades, but it has recently been applied to a new topic: the on-screen gender gap. Looking at screen time, analysts found that this year’s Oscar nominees for best actor occupied an average of 85 minutes on screen, while nominees for best actress averaged only 59 minutes. However, in several cases women had greater screen time per shot than men, lending credence to the idea that women in films are sometimes put on display for male audiences.

Investigative journalism site ProPublica released its “Data Store” platform this week, featuring data the site has used in its reporting. The site will offer free downloads of information obtained through Freedom of Information Act (FOIA) requests and charge one-time fees for data that the organization has cleaned or devoted significant effort to modifying. So far, the site charges approximately $200 for journalists and $2,000 for academics. The datasets are grouped into Health, Business and Transportation categories.

The UK’s Environment Agency will soon release a large quantity of flood data, including real-time river levels and flood maps. The agency, which has long contended that it needed to keep the data closed in order to collect licensing fees, recently acquiesced in the wake of accusations that it prevented people from finding out how harsh winter weather would affect their homes. The data could enable businesses to develop local flood warning systems and other products.

The U.S. Department of Defense’s Defense Advanced Research Projects Agency (DARPA) announced this week that it is launching the data-driven Distributed Battle Management program. The program would address the increasing complexity of tech-enabled battlefield conditions with decision aid software and control algorithms. The tools will be integrated into airborne systems used by pilots and battle managers.

Researchers at the University of Chicago published a paper in the journal Bioinformatics this week, detailing how they used the university’s Beagle supercomputer to accelerate genomic data analysis. The researchers were able to analyze 240 whole genomes in 50 hours, a high throughput for the complex operations genomic analysis involves. The researchers hope their approach will save money and bring the field closer to the widely cited milestone of $1,000 to sequence a genome.