Battling Big Data’s Tribal Knowledge Problem

One of the practical data-related problems that people struggle with today is just finding the right piece of data hidden among various sources. If you’ve ever asked for a specific data set, only to be told “Ask Suzy in accounting,” you’ve experienced the tribal knowledge problem. The thing is, “Ask Suzy in accounting” doesn’t scale with big data.

The tribal knowledge problem affects all sorts of organizations using all sorts of techniques. No matter whether you use Hadoop, Teradata, or relational database management systems, there’s a level of obscurity surrounding data sources that confounds your ability to get the right piece of data.

Traditional master data management (MDM) and data governance techniques propose to solve the problem with a top-down approach and rigid categorization. But the speed at which data types and data sources are changing confounds this approach, says Satyen Sangani, who worked in Oracle’s data warehouse division before co-founding Alation, which today came out of stealth with a new product aimed at closing the tribal knowledge gap.

“While everybody is focused on the problem of how to visualize the data and how to make the compute go faster and how do we store the information more efficiently, we’ve seen little about the fact that there’s just so much more data out there,” Sangani tells Datanami. “There’s a fundamental information relevance problem. How do you get the data when you need it, how do you sort through it, how do you filter down the data to get what you’re actually looking for? That’s fundamentally a problem that we see our customers deal with.”

Sorry Suzy, but you just don’t scale

Alation’s solution to the problem is to leverage the power of machine learning to automatically map the access paths that people naturally take to access data. Its software installs an agent in various data sources that reads access logs and certain metadata components to ascertain which pieces of data are most popular; it also takes a sample of the actual data for indexing purposes. It’s sort of like Google’s PageRank algorithm, but applied within a customer’s own data environment.
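Alation hasn’t published the details of its ranking, but the PageRank analogy can be made concrete with a toy sketch: treat each table as a node, treat co-occurrence in the query logs (joins, lineage, downstream reports) as weighted edges, and iterate a damped rank until it settles. All table names and weights below are invented for illustration.

```python
# Hypothetical sketch of PageRank-style ranking over a table-usage graph.
# Edge weights count how often the query logs show data flowing from one
# table to another (joins, CTAS lineage, reports). Names are made up.

edges = {
    "raw.orders":       {"mart.daily_sales": 40, "mart.churn": 5},
    "raw.customers":    {"mart.daily_sales": 25, "mart.churn": 30},
    "mart.daily_sales": {"dash.exec_kpis": 60},
    "mart.churn":       {"dash.exec_kpis": 10},
    "dash.exec_kpis":   {},
}

def rank_tables(edges, damping=0.85, iterations=50):
    """Weighted PageRank over the table-usage graph."""
    nodes = set(edges) | {t for targets in edges.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, targets in edges.items():
            total = sum(targets.values())
            if total == 0:
                # Dangling table: spread its rank evenly across all tables.
                for n in nodes:
                    new_rank[n] += damping * rank[src] / len(nodes)
            else:
                for dst, weight in targets.items():
                    new_rank[dst] += damping * rank[src] * weight / total
        rank = new_rank
    return sorted(rank.items(), key=lambda kv: -kv[1])

for table, score in rank_tables(edges):
    print(f"{table:18s} {score:.3f}")
```

Tables that heavily used queries feed into bubble to the top of the list, which is the behavior a usage-driven data catalog relies on.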

This is a novel approach that yields powerful insights for search and discovery of data, Sangani says. “When you go into a database system like Oracle or Teradata, it doesn’t tell you any of the information about who’s using it, when it’s used, what’s touching it, and what that data goes to,” he says. “All that information has to be effectively machine learned by either accessing the query logs or by more fundamentally asking people.” In other words, go see Suzy in accounting.

By understanding the larger context surrounding the data, Alation can help business analysts, IT administrators, or business managers figure out which data sets are the most pertinent for their needs. In that manner, it’s sort of like Yelp for business data, Sangani says.

“Yelp gives you a response and says ‘Here’s not only the result, but the reason the result is right is because 15 other people used this data, it was last refreshed on this date, there are 400 other people who are subscribing to this data, and there are 25 reports hanging off it,'” he says. “We give you a ton of context and it’s really that context that makes the search hum.”

In addition to building the core search and discovery foundation, Alation’s Postgres-based platform includes a collaboration layer that allows people to work together to better understand and annotate data sources as they explore them. In this manner, Alation hopes to engender more strategic data governance and management initiatives organically at customer sites, as an offshoot (or continuation of) the tactical application of search and discovery.

Alation is coming out of stealth today, but it already has a fairly impressive list of early adopters, including eBay, MarketShare, iHeart Media, Square, and Inflection. At eBay, the product is used by business analysts charged with analyzing petabytes of data sitting across multiple systems.

“What they [eBay] ended up doing was using Alation in order to present not just the data and the context around the data, but also as a knowledge management tool to say, ‘Hey for calculating churn, here are the ways you can do it,'” Sangani says. “So it really helps the analysts accelerate their work process and work product by helping them find the information and getting them to the data sets and the answers they need.”

Currently Alation supports Hadoop (including HDFS and Hive); data warehouse platforms such as Teradata, Oracle, and IBM’s Netezza; and most relational database systems. The company has NoSQL data stores on its roadmap, as well as possibly object-based file systems.

Alation builds a sort of graph that maps the use of data in an enterprise

Sangani doesn’t believe that machines can take the place of humans for data curation. MDM initiatives will continue. But machines can augment humans and make them and the MDM initiatives more powerful and successful.

“The problem with MDM is it’s effectively all top down all human curation,” Sangani says. “Our observation has been that’s just not scalable. It never was scalable previously, which is why I founded this company. But it works even less now.

“If you think about the amount of data that’s being generated, that’s one factor. But the data is also becoming more complicated. On top of that, there’s just very little in resources to manage the data. There has to be more people managing the data, more people accessing and touching the data, to talk about it and describe it. But there also have to be automated techniques in order to enable people to do that management work faster.”

Whether Alation ultimately solves the data tribalism problem remains to be seen. What isn’t up for debate is the bona fides of the Redwood City company’s co-founders, including Sangani, who worked at Oracle for many years; Aaron Kalb, who helped build Siri at Apple; machine learning expert Feng Niu; and Venky Ganti, who developed artificial intelligence at Microsoft.

What Apple’s Purchase of FoundationDB Would Mean for NoSQL

Apple raised some eyebrows in the big data space last week when it was reported that it’s buying NoSQL database developer FoundationDB. For some, the move signaled the emergence of a new group of highly scalable NoSQL database vendors that could challenge the “Big 3” of NoSQL, while for others it was a warning sign for turbulence ahead.

FoundationDB is a key value store-based NoSQL database designed to provide extreme scalability without giving up core ACID properties. Modeled after the Google Spanner Paper, the database recently was clocked at 14.4 million writes per second, or nearly 14 times more than other NoSQL databases were able to do on an Amazon cluster, according to the vendor and published benchmark reports.

What exactly Apple plans to do with such a speedy database is unclear. The company has not formally announced the acquisition, which was initially reported by TechCrunch. “Apple buys smaller technology companies from time to time, and we generally do not discuss our purpose or plans,” an Apple spokesperson said. FoundationDB did not respond to Datanami‘s request for comment.

While FoundationDB isn’t talking publicly yet, it did post a notice on its blog that it “made the decision to evolve our company mission” and would thereby cease offering downloads of its software, which was a mix of open and closed source. Other articles have stated that FoundationDB has stopped offering support. If true, this would leave organizations that have adopted FoundationDB no official source of support.

There are unanswered questions about this acquisition. But if true, it makes perfect sense to Adam Wray, CEO of Basho Technologies, the company behind another key-value NoSQL database called Riak. “NoSQL databases are an increasingly critical part of enterprises’ ability to derive real business value from the massive amounts of data that users, devices and online systems generate,” Wray says via email.

Wray says NoSQL databases will play a major role in the development of applications for the Internet of Things, which Apple will definitely be involved in. “Apple is acutely aware of the importance of being able to reliably scale to meet the real-time data needs of today’s global applications,” Wray says. “The news of Apple’s intent to acquire FoundationDB greatly amplifies these points to a growing number of IT and engineering leaders.”

It may be unusual for megavendors like Apple to snap up startups like FoundationDB, which has raised about $23 million to date, but it’s not completely unheard of, especially when big data talent is such a scarce commodity. For Apple, the acquisition may not be so much about obtaining NoSQL technology as a way to hire the FoundationDB team, including CEO David Rosenthal, who was previously vice president of engineering at Omniture.

Other NoSQL startups see the Apple acquisition (if it ends up being true, which seems likely) as writing on the wall for the likes of MongoDB, Couchbase, and Datastax, which have emerged as the “Big 3” of the NoSQL business.

“Companies like FoundationDB and Aerospike have been purpose-built from the ground up to handle massive workloads in a cost-effective way unlike the initial wave of NoSQL players who are more general purpose database players,” Peter Goldmacher, the vice president of strategy and market development at NoSQL vendor Aerospike, says via email.

“Apple is the most well-known and trusted name in technology today and their acquisition of FoundationDB is a sign that newer players like Cassandra aren’t going to be able to meet increasingly staggering workloads,” he adds. “Before we go ahead and anoint the new kings, remember that success is ephemeral and in my opinion it’s not game-set-match after all.”

Peek Inside a Hybrid Cloud for Big Genomics Data

Genomics, or the study of human genes, holds enormous promise to improve the health of people. However, at 80GB to 250GB per sequenced genome, the sheer size of genomics data poses a considerable storage and processing challenge. To keep costs in check, Inova Translational Medical Institute (ITMI) adopted a hybrid cloud architecture for its ongoing genomics studies.

ITMI is a branch of the non-profit Inova hospital network in the Washington D.C. area that received $150 million in funding from Inova to pursue the field of genomics. The Falls Church, Virginia organization has launched six separate studies in the field of genomics, including a premature birth study, a Type 2 diabetes study, and a congenital disorders study, among others.

When ITMI received its grant in 2011, the institute was not sure where the research would take it, so it opted for a practical and simple approach: It chose to store genomics data on the Amazon Simple Storage Service (S3) public cloud rather than keep it on premises. To get gene sequence data into S3, it would ship hundreds of hard disks back and forth across the country to and from its gene sequencing partner, Illumina, based 3,000 miles away in San Diego.

After receiving gene sequence data from Illumina, the ITMI analysts would check the data for errors, and then upload it to S3, which itself was an error-prone process that could take up to a week to complete. This workflow obviously wasn’t ideal, but it was relatively affordable. Besides, when ITMI started, it had the gene sequences of several hundred people to study–representing perhaps 50 terabytes or so—so the size of the data wasn’t a major issue.

The Genomic Growth Curve

But over the years, as ITMI launched more studies, the size and variety of the data have exploded and become a stumbling block. Today, ITMI maintains close to 7,200 human genomes, representing 1.7PB of genomic data stored on Amazon’s S3 cloud, according to Aaron Black, director of informatics at ITMI. And with plans to expand the genomic studies to 20,000 people by 2017—or eight to 10 genomes uploaded per day—there’s no end in sight for the big data crunch.
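A quick back-of-the-envelope check of those storage figures (decimal units, purely for illustration) shows why Black calls it a crunch:

```python
# Rough arithmetic on the figures quoted above (decimal units for simplicity).
genomes_today = 7_200
storage_pb = 1.7
gb_per_genome = storage_pb * 1_000_000 / genomes_today
print(f"~{gb_per_genome:.0f} GB per genome")       # ~236 GB, near the top of the 80GB-250GB range

daily_uploads = 9                                   # midpoint of "eight to 10 genomes per day"
daily_ingest_tb = daily_uploads * gb_per_genome / 1_000
print(f"~{daily_ingest_tb:.1f} TB of new genome data per day")   # ~2.1 TB/day
```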

However, Amazon’s cloud-based object storage system is just one leg of a three-legged data stool that ITMI relies on to do its research. In addition to S3, the institute accesses clinical data and diagnoses stored in an Epic electronic medical record (EMR) system that Inova spent hundreds of millions of dollars to build. The third leg, which went online just a year ago, is a 1,024-core UV2000 supercomputer from SGI that’s used to crunch genomics data.

At about the time the SGI high performance computing (HPC) system went live, ITMI decided to reevaluate its data architecture—in particular whether it was time to bring the Amazon S3 data back in house.

“We did a cost analysis of what it would look like in the next three years,” Black says. “Based on what we were paying at Amazon and what they thought they would have to build or expand in their current data center, it was tens of millions of dollars difference. Amazon was so much cheaper.”

While Amazon S3 is saving ITMI tens of millions, the institute still lacked the “glue” that could bring these three elements together while maintaining the necessary security and privacy controls. It ended up selecting a NAS storage appliance from Avere Systems that could blend the three data sources together into a seamless whole.

Splicing Hybrid Cloudiness

“What Avere can do is make sure the Amazon buckets…look like they’re locally mounted so researchers can see the data,” Black says. ITMI works with researchers around the world and Avere provides fine-grained control over which data sets the researchers can access. The institute also uses the encryption key management software provided by the Avere edge filers.

Being located just a couple of miles from the Amazon data center is a nice benefit for ITMI, and so is the big 10Gbps pipe linking them. But even with all that bandwidth and 1.4 PB of storage attached to the SGI super, ITMI must be careful about which data sets it chooses to load onto its HPC system. That’s where another Avere feature comes in handy.

ITMI’s hybrid cloud architecture. Courtesy: Avere Systems

“Avere has a special AMI [Amazon Machine Image] where they can synch certain kinds of data that we know are going to be used, so we can save it down on the ground,” Black says. “As we use it, Avere has a custom algorithm that can tell us what’s hot and not and move it into the fastest caching area.”

This helps researchers narrow their search considerably before launching jobs on the SGI cluster. “If there are 7,000 individuals, they might only want to look at kids with asthma and that might only be 70 genomes,” Black says. “They have to find a way to broker, or find those individual objects on Amazon, and then they might want to know just a subset of genes.”

Battling Data Bottlenecks

But bottlenecks still emerge, even with Avere serving as the traffic cop for data moving among the SGI cluster, Amazon S3, and the Epic EMR system. “We work with Amazon almost weekly to understand how can we get the best performance out of this,” Black says. “And what we were seeing was, it was really slow to move up. The bottleneck wasn’t the pipe, it wasn’t Avere–it was writing to disk at Amazon.”

ITMI has done a couple of things to address that issue, including adopting hashing algorithms on Amazon to make the disk IOPS more parallel and get closer to filling up the pipe. “Not that it’s horrible, but we’re spending a lot of money on a high bandwidth connection and we expect the best performance,” Black says.
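The article doesn’t spell out the hashing scheme, but a common technique at the time was to prepend a short hash to object key names so that uploads don’t pile up on a single S3 key-prefix partition. A minimal sketch, with an invented key layout and sample IDs:

```python
# Hedged sketch of the key-prefix trick: objects whose keys share a long
# common prefix can land on the same S3 partition, so a short hash prefix
# fans parallel uploads out across partitions. Key layout is invented.
import hashlib

def hashed_key(sample_id: str, filename: str, prefix_len: int = 4) -> str:
    """Derive a pseudo-random key prefix from the sample ID."""
    digest = hashlib.md5(sample_id.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{sample_id}/{filename}"

print(hashed_key("GENOME-000123", "reads.bam"))
# e.g. '3f2a/GENOME-000123/reads.bam' (prefix depends on the hash) --
# adjacent sample IDs no longer share a key prefix, so many PUTs can
# proceed in parallel without hitting the same partition.
```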

The wait times are not a big issue today for ITMI’s researchers; they’re used to waiting for answers during long periods of unsupervised learning. But as ITMI ramps up its clinical practice, it will keep much “hotter” data storage and processing capacity in place, such as when it’s called on to determine what genetic condition may be impacting a newborn in the neonatal intensive care unit (NICU).

“In a clinical setting, we’d do all the computing and analysis on the HPC, on the SGI,” Black says. “Once we did the analysis, we’d move [the raw data] up to Amazon, and have a vault on the East Coast and one on West Coast, because we’d want to span zones.”

ITMI’s journey into genomics is just getting started, but it’s already showing promise. The congenital anomalies study, for example, boasts a 60 percent success rate. There’s no question genomics holds great promise to help human health. The question for the system and data architects, then, becomes how best to build a system that can scale while delivering the necessary performance.

“This is the first time we’ve been around a hospital system that’s done this at scale,” Black says. “If you look at every single component of it–from storage to compute to how big your pipes have to be to what’s your long term storage strategy–every one of those components has to be analyzed. Cloud scales very well for us, from a storage perspective. From a compute perspective, we’re not there yet [with cloud].”

Leverage Big Data Cross-Industry Panel: Video Now Available

Big data means different things to different people. To some, massive data is a challenge to be overcome, while for others it’s an opportunity to seize. At Tabor Communications’ recent Leverage Big Data event, experts from different industries came together to compare their big data notes.

As a professor of astrophysics and computational science at George Mason University, Kirk Borne tracks things moving through the universe over space and time. But that broad description could refer to just about anything, he points out, including the activities of cyber criminals attempting to hack a system over the Internet.

Trevor Mason, the vice president of technology research at IRI Worldwide, also faces big data challenges. As a provider of product data to consumer packaged goods (CPG) firms, IRI must manage a large number of variables in data describing CPG goods. As you combine those variables, the number of possibilities quickly skyrockets.

Big data represents a challenge to Kerry Hughes, the advanced computing leader at Dow Chemical, who was also on the panel. For Hughes, connecting big data and high performance computing (HPC) technology with the person with the requisite domain expertise is the tough part to crack.

Helping clients to act on fast-moving data is important for panelist Asif Alam, the head of enterprise capabilities at Thomson Reuters. The advent of machine readable financial data generated by more than 400 different exchanges, in combination with outside data such as weather and news, allows Thomson Reuters to help its clients make decisions quickly in our fast-changing world.

What Lies Beneath the Data Lake

Hadoop and the data lake represent a potential business breakthrough for enterprise big data goals, yet beneath the surface is the murky reality of data chaos.

In big data circles, the “data lake” is one of the top buzzwords today. The premise: companies can collect and store massive volumes of data from the Web, sensors, devices and traditional systems, and easily ingest it in one place for analysis.

The data lake is a strategy from which business-changing big data projects can begin, revealing potential for new types of real-time analyses which have long been a mere fantasy. From connecting more meaningfully with customers while they’re on your site to optimizing pricing and inventory mix on-the-fly to designing smart products, executives are tapping their feet waiting for IT to deliver on the promise.

Until recently, though, even large companies couldn’t afford to continue investing in traditional data warehouse technologies to keep pace with the growing surge of data from across the Web. Maintaining a massive repository for cost-effectively holding terabytes of raw data from machines and websites as well as traditional structured data was technologically and economically impossible until Hadoop came along.

Hadoop, in its many iterations, has become a way to at last manage and merge these unlimited data types, unhindered by the rigid confines of relational database technology. The feasibility of an enterprise data lake has swiftly improved, thanks to Hadoop’s massive community of developers and vendor partners that are working valiantly to make it more enterprise friendly and secure.

Yet with the relative affordability and flexibility of this data lake comes a host of other problems: an environment where data is not organized or easily manageable, rife with quality problems and unable to quickly deliver business value. The worst-case scenario is that all that comes from the big data movement is data hoarding – companies will have stored petabytes of data, never to be used, eventually forgotten and someday deleted. This outcome is doubtful, given the growing investment in data discovery, visualization, predictive analytics and data scientists.

For now, there are several issues to be resolved to make the data lake clear and beautiful—rather than a polluted place where no one wants to swim.

Poor Data Quality

This one’s been debated for a while, and of course, it’s not a big data problem alone. Yet it’s one glaring reason why many enterprises are still buying and maintaining Oracle and Teradata systems, even alongside their Hadoop deployments. Relational databases are superb for maintaining data in structures that allow for rapid reporting, protection, and auditing. DBAs can ensure data is in good shape before it gets into the system. And, since such systems typically deal only with structured data in the first place, the challenge for data quality is not as vast.

In Hadoop, however, it’s a free-for-all: typically no one’s monitoring anything in a standard way and data is being ingested raw and ad hoc from log files, devices, sensors and social media feeds, among other unconventional sources. Duplicate and conflicting data sets are not uncommon in Hadoop. There’s been some effort by new vendors to develop tools that incorporate machine learning for improved filtering and data preparation. Yet companies also need a foundation of people—skilled Hadoop technicians—and process to attack the data quality challenge.
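What a “standard way” of monitoring might look like is necessarily speculative, but even a minimal first pass over raw, ad hoc data can catch the duplicate and conflicting records described above. The file path, column name, and engine choice (PySpark) in this sketch are assumptions for illustration:

```python
# Minimal, hypothetical PySpark data-quality pass over raw landed data:
# drop exact duplicates, then flag records that conflict on what should
# be a unique key. Paths and column names are invented.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-quality-pass").getOrCreate()

raw = spark.read.json("hdfs:///landing/clickstream/2015/03/*.json")

# 1. Exact duplicates: the same event ingested twice.
deduped = raw.dropDuplicates()

# 2. Conflicting records: the same event_id still appearing more than once
#    after exact duplicates are removed (i.e., differing payloads).
conflicts = (deduped
             .groupBy("event_id")     # hypothetical unique key
             .count()
             .filter(F.col("count") > 1))

print(f"rows kept: {deduped.count()}, conflicting keys: {conflicts.count()}")
```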

Lack of Governance

Closely related to the quality issue is data governance. Hadoop’s flexible file system is also its downside. You can import endless data types into it, but making sense of the data later on isn’t easy. There have also been plenty of concerns about securing data (specifically access) within Hadoop. Another challenge is that there are no standard toolsets yet for importing data into Hadoop and extracting it later. This is a Wild West environment, which can lead to compliance problems as well as slow business impact.

To address the problem, industry initiatives have appeared, including the Hortonworks-sponsored Data Governance Initiative. The goal of DGI is to create a centralized approach to data governance by offering “fast, flexible and powerful metadata services, deep audit store and an advanced policy rules engine.” These efforts among others will help bring maturity to big data platforms and enable companies to experiment with new analytics programs.

Skills Gaps

In a recent survey of enterprise IT leaders conducted by TechValidate and SnapLogic, the top barrier to big data ROI indicated by participants was a lack of skills and resources. Still today, there are a relatively small number of specialists skilled in Hadoop. This means that while the data lake can be a treasure chest, it’s one that is still somewhat under lock and key. Companies will need to invest in training and hiring of individuals who can serve as so-called “data lake administrators.” These data management experts have experience managing and working with Hadoop files and possess in-depth knowledge of the business and its various systems and data sources that will interact with Hadoop.

Transforming the data lake into a business strategy that benefits customers, revenue growth and innovation is going to be a long journey. Aside from adding process and management tools, as discussed above, companies will need to determine how to integrate old and new technologies. More than half of the IT leaders surveyed by TechValidate indicated that they weren’t sure how they were going to integrate big data investments with their existing data management infrastructure in the next few years. Participants also noted that the top big data investments they would be making in the near term are analytics and integration tools.

We’re confident that innovation will continue rapidly for new big data-friendly integration and management platforms, but there’s also a need to apply a different lens to the data lake. It’s time to think about how to apply processes, controls and management tools to this new environment, yet without weakening what makes the data lake such a powerful and flexible tool for exploration and delivering novel business insights.

About the author: Craig Stewart is senior director of product management at SnapLogic. His background is in pre-sales and technical management for fast-growing technology companies. Previously he’s been a contributor to the European development of Cognos, Powersoft, Sybase, iMediation, and Sunopsis.

Three Ways Open Data Could Make California Golden

While California is widely regarded as the high-technology hub of the world, the state lacks a cohesive open data policy. Last week, a non-profit think tank in the Golden State released a report detailing why the state government should adopt an open data policy.

According to the Milken Institute‘s “Open Data in California,” the state has the potential to unleash powerful economic forces if it follows the lead set by the Federal Government and 10 other states that have implemented open data policies over the past four years, including Texas, Oklahoma, New York, and Maryland.

“California is hardly a pioneer on the open-data frontier,” write the report’s authors, Jason Barrett and Kevin Klowden. “More than any other state, California needs a unified open-data strategy for at least three major reasons.”

Economic Development:

Despite being the home of Web giants like Google and Yahoo, enterprise tech giants like Hewlett-Packard and Cisco, software megavendors like Oracle and Adobe, and dozens upon dozens of big data startups, there is no unifying strategy in Sacramento to effectively harvest today’s digital commodity: data.

Barrett and Klowden point out that California has more companies on the Open Data 500, which is a list of companies that make use of government data, than any other state; it has 132. Open data flag bearers include companies like Esri, the GIS software giant based in Redlands that utilizes data gathered by the United States Geological Survey (among other federal agencies); Crimespotting, a San Francisco organization that develops an app that lets people know about police activity in their neighborhoods; and OpenGov, a Mountain View company that develops data visualization software that transforms budgetary data from government clients into more readable charts and graphs.

Standardization and Transparency:

While Sacramento has not yet set a standard for how all of the state’s agencies should collect, store, and share data, that hasn’t stopped some of them, such as the Controller’s office, from adopting their own open data strategies in an ad hoc manner. The cities of San Francisco, San Jose, and San Diego have all launched their own open data portals that ranked highly in Milken Institute rankings.

Setting open data access standards would make it easier for journalists to track the actions of the government agencies they’re tasked with monitoring. Currently, the Fourth Estate labors under various local Sunshine Acts, which often results in “vague or outdated information about how their tax dollars are spent,” the Milken report says.

Streamlining Regulation:

California has a poor reputation among business backers as being an overly regulated state that’s unfriendly to private enterprise. By providing a single reference point for agency cooperation and applicant communication, state officials can take a big step in battling that image, the authors write.

But the possibilities of open data extend far beyond permitting, the authors write. “Imagine if the Department of Transportation wanted to build a road through a wooded area with dense wildlife populations,” they say. “Architects could incorporate conservation efforts into their design by cross-referencing their plans with migratory patterns collected by the Department of Wildlife without having to submit official requests.”

If California did adopt an open data initiative, what should it look like? According to the authors, it should look a lot like New York’s. In fact, officials in that state created the New York State Open Data Handbook, which is a veritable “one stop shop” for how to set up an open data initiative.

The New York handbook “details best practices for executing an open-data policy—website development recommendations, data standardization, and guidelines for participating agencies, among others—and also serves as a resource to help policymakers ask the right questions when crafting their own policies,” the authors write.

However, California’s open data initiative would look different, the authors state. For starters, there would be high demand for data from the California Environmental Quality Act (CEQA), seismological data, and oil and gas data. A well-crafted open data policy would adequately anticipate public demand for this data, along with more traditional transparency-related data such as revenues and expenditure data, the authors write.

Besides New York, there are other guides that policymakers in California could use to usher the Golden State into the open data promised land. The Milken Institute points to the California Economic Summit Open Data SOAR (Streamline Our Agency Regulations) Team which described what an open data policy might look like in the state:

Quality — For starters, the data should be high quality and vetted for accuracy whenever possible. (The New York data guidebook also provides numerous pointers for how data should be cleaned, the authors point out.)

Security — The data should also be respectful of privacy and security concerns. That would mean there’s no personally identifiable information (PII) contained in the data.

Well-Documented — Metadata should accompany the raw data whenever possible, providing a trail and a lineage of where that data came from.

Up to Date — The data should be refreshed continually and on a regular basis. The authors cite Hawaii as a good model to follow here.

Permanent — Public data would never die, but instead go into an ever-growing archive documenting the historical record.

Searchable — All data should be searchable. That means no PDFs or image files, the authors say. CSV and JSON files would be good, though, they say.

Sounds great, right? But how much would this cost? Not as much as you think. Based on open-data labor statistics from New York, it would cost California just $4 million to $5 million to pay for the staffing required to develop, implement, and manage a fully functioning open-data policy, the authors state. And of course, there should be a chief data officer for the state, too.

Why Graph Databases Are Becoming Part of Everyday Life

Zephyr Health, a San Francisco-based software company that offers a data analytics platform for pharmaceutical, biotech and medical device companies, wanted its customers to unlock more value from their data relationships. Doing so would enable pharmaceutical companies, for example, to find the right doctors for a clinical trial by understanding relationships among a complex mix of public and private data such as specialty, geography, and clinical trial history.

Old-school SQL databases were not up to the task. Traditional SQL databases don’t handle data relationships well, and most NoSQL databases don’t handle data relationships at all. Nor are they well-equipped to handle data that’s always changing – such as streams of new information coming in from doctor’s surveys.

Zephyr found the solution in a graph database, for its capability and scale. Graph databases are key to discovering, capturing, and making sense of complex interdependences and relationships, both for running an IT organization more effectively and for building next-generation functionality for businesses. They are designed to easily model and navigate networks of data, with extremely high performance. To fully appreciate the value of the graph, consider that early adopters of graph databases, such as Facebook and LinkedIn, became household names and unrivaled leaders in their sectors.

While SQL databases have been a mainstay in enterprise IT departments for decades, they have increasingly given way to NoSQL solutions as data volumes and connections boom. It’s important to keep a discerning eye on so-called NoSQL databases; the term can be annoyingly vague and is applied to wildly differing database types. Several NoSQL database categories have emerged, each tackling a distinct business problem: document, column array, key-value, and graphs. While the entire NoSQL sector is attracting increasing attention, graph databases are generating real and lasting excitement, with interest in the sector having grown 500% in the last two years alone! Forrester Research has reported that graph databases — the fastest-growing category in database management systems — will reach more than 25 percent of enterprises by 2017.

Graph databases are effective for every industry — from telecommunications to financial services, logistics, hospitality, and healthcare. Despite their market momentum, however, some people still consider graphs to be mysterious. In actuality, graph databases use natural and intuitive principles that bear much more similarity to tasks we perform on a daily basis, than do relational database management systems, which by comparison have a fairly steep learning curve. If you’ve ever worked out a route via a mass transit map or followed a family tree, you have manually run your own graph-based query.

In fact, you’ve likely come across a product or service powered by a graph database within the last few hours. Many everyday businesses have created new products and services and re-imagined existing ones by bringing data relationships to the fore. That’s because graph databases are the best way to model, store, and query both data and its relationships, which is crucial for next-generation applications that feature use cases such as real-time recommendations, graph-based search, and identity & access management.

For example, Walmart – which deals with almost 250 million customers weekly through its 11,000 stores across 27 countries and through its retail websites in 10 countries – wanted to understand the behavior and preferences of online buyers with enough speed and in enough depth to make real-time, personalized, ‘you may also like’ recommendations. By using a graph database, Walmart is able to connect masses of complex buyer and product data to gain insight into customer needs and product trends, very quickly.

Here’s how it works: The graph database stores and processes any kind of data by bringing relationships to the fore. A “graph” can be thought of like a whiteboard sketch: when you draw on a whiteboard with circles and lines, sketching out data, what you are drawing is a graph. Graph databases store and process data within the structure you’ve drawn, providing significant performance and ease-of-use advantages, plus unparalleled ease in evolving the data model. No other type of database does this. Because they are designed to do so, graph databases are becoming an essential tool in discovering, capturing, and making sense of intricate relationships and interdependencies.
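As a minimal illustration of that whiteboard model (not Walmart’s actual implementation, and with invented products and purchases), the “customers who bought this also bought” idea is just a two-hop walk over a graph of customers and products; a graph database expresses the same traversal declaratively, with indexes, at far larger scale:

```python
# Toy in-memory graph: customers connected to the products they bought.
# A "you may also like" recommendation is a two-hop traversal:
# product -> customers who bought it -> other products they bought.
from collections import Counter

bought = {
    "alice": {"bike", "helmet"},
    "bob":   {"bike", "lock", "lights"},
    "carol": {"bike", "lights"},
}

def also_bought(product: str, top_n: int = 3):
    """Rank other products by how many buyers of `product` also bought them."""
    counts = Counter()
    for customer, products in bought.items():
        if product in products:
            counts.update(products - {product})
    return counts.most_common(top_n)

print(also_bought("bike"))
# -> [('lights', 2), ('helmet', 1), ('lock', 1)]  (order of ties may vary)
```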

The Seven Bridges Puzzle

Graph theory, far from being a recent data-handling development, is actually nearly 300 years old and can be traced to Leonhard Euler, a Swiss mathematician. Euler was looking to solve an old riddle known as the “Seven Bridges of Königsberg.” Set on the Pregel River, the city of Königsberg included two large islands connected to each other and the mainland by seven bridges. The challenge was to map a route through the city that would cross each bridge only once while ending at the starting point. Euler realized that by reducing the problem to its basics, eliminating all features except landmasses and the bridges connecting them, he could develop a mathematical structure that proved the riddle impossible.

Today’s graphs are based entirely on Euler’s design, with landmasses now referred to as “nodes” (or “vertices”), while the bridges are the “links” (also known as “relationships” and “edges”). One thing that’s great about graph databases, however, is that their end users don’t need to know anything about graph theory in order to experience immediate practical benefits.

Everyday Use

Graphs are a vital part of our online lives, powering everything from social media sites – including Twitter and Facebook – to the retail recommendations on eBay. Online dating also owes much of its success to the way graphs can analyze even the most complex relationships, looking not only at location and personal details but also passions, hobbies, and attitudes, and relationships between all of those things, to identify potential matches.

Interest in the graph will continue to grow. The real-time nature of a graph database makes it an excellent platform for unlocking business value from data relationships, which simply can’t be carried out on traditional SQL or most NoSQL databases. The uses and applications for graph databases seem endless, and it’s exciting to consider what innovations they will continue to power as the world unlocks the value of data relationships.

About the author: Emil Eifrem is CEO of Neo Technology and co-founder of the Neo4j project. Before founding Neo, he was the CTO of Windh AB, where he headed the development of highly complex information architecture for enterprise content management systems. Committed to sustainable open source, he guides Neo along a balanced path between free availability and commercial reliability. Emil is a frequent conference speaker and author on NoSQL databases. His twitter handle is @emileifrem.

Neo Tech Cranks Up Speed on Upgraded Neo4j

Graph database specialist Neo Technology has rolled out an updated version of its Neo4j tool that includes faster read and write performance capabilities aimed at critical graph database applications.

Neo4j 2.2 released on Wednesday (March 25) comes with souped-up write and read scalability, the company said. Expanded write capacity targets highly concurrent applications while leveraging available hardware via faster buffering of updates. A new unified transaction log is designed to serve both graphs and their indexes.

Graph databases are an advanced type of NoSQL database used for a variety of analytical and transactional tasks.

Neo Technology, San Mateo, Calif., also said the database engine update includes a new bulk import utility that can grab data from external sources at a sustained rate of 1 million documents per second to support graphs with tens of billions of nodes and relationships.

Meanwhile, upgraded read scalability includes a new in-memory graph cache capability to increase read throughput by up to ten times for highly concurrent transactional read applications, the company claimed.

Neo Technology said it has also added a new statistics-gathering capability to Neo4j along with a cost-based query optimizer for the company’s Cypher query language. The optimizer is touted as selecting the best query execution plan using built-in statistics containing data on graph size and shape. The upgrade runs up to 100 times faster “in certain cases” than earlier versions of Neo4j, Neo Technology claims.

The graph database upgrade also includes a batch of visualization and other tooling improvements aimed at boosting developer productivity. These include new graph visualization features, query plan visualizations and integrated tutorials and other training materials.

Neo4j 2.2’s fast write buffering architecture is designed to significantly improve write scaling, “both for initial loading of the graph and for highly concurrent transactional applications,” Neo Technology CTO Johan Svennson said in a statement announcing the latest Neo4j release.

Graph database adoption continues to grow. Neo Technology cited market forecasts predicting that graph databases could be adopted by more than 25 percent of all enterprises by 2017. Graph analysis is increasingly being used in data-driven operations and for making strategic decisions after a data-capture design is in place.

Given the need for speed, Neo Technology is pitching the Neo4j upgrade as offering read and write performance as much as 100 times faster than previous versions.

Neo Technology announced a $20 million Series C funding round in January, which CEO Emil Eifrem said validates the graph database vendor’s earlier effort. While Eifrem is in a great position to boast about Neo4j’s success, the Swedish-born technology executive is more interested in promoting the values of graph databases as a whole. Graphs are about to break out, in a major way, he asserted.

According to recent rankings by DB-Engines.com, Neo Technology holds a comfortable lead in the graph database market.

Mapping the Shape of Complex Data with Ayasdi

Machine learning has emerged as the most useful technology for analyzing big and complex data sets. But all too often, it takes a highly skilled data scientist to effectively wield machine learning tools. A company called Ayasdi is positioning a technique it calls Topological Data Analysis as a way to shortcut that machine learning skills gap. And if today’s news is any indication, it’s having tremendous success with this technique.

Ayasdi was created in 2008 when a Stanford University mathematics PhD student named Gurjeet Singh joined his adviser Gunnar Carlsson to productize the work they’d done around Topological Data Analysis (TDA). Carlsson had been pursuing TDA since the 1970s, and in 2005 had received a $10 million grant from DARPA and the NSF to accelerate the project, upon which Ayasdi co-founder Harlan Sexton also worked.

With TDA, the researchers devised a method to use topology, or the study of shape, to extract insights from data. As an outgrowth of machine learning, TDA enables researchers to reduce big and complex data with a large number of dimensions and variables into a smaller and less complex data set with fewer dimensions and variables, but without sacrificing the key topological properties.

In this regard, the proprietary TDA technology essentially gives big data practitioners a head start when it comes to extracting insight from unknown data. “You throw data into this machine and it automatically executes a large number of machine learning algorithms against the data and combines them together such that, the first time you look at the picture, you already have something to begin with,” Singh, the Ayasdi CEO, tells Datanami. “It discovers these insights in data automatically without any human intervention.”

If that sounds too good to be true, you’re not alone. The world is full of vendors selling all sorts of technology that purport to solve big data problems with a wave of a magic wand. Skepticism is the order of the day when venturing into unknown waters, which this most certainly is.

But here’s the thing: Ayasdi appears to actually do what it claims. Kleiner Perkins Caufield & Byers, the renowned venture capital firm, isn’t in the habit of wasting money on half-baked big data schemes, but it was impressed enough to lead a Series C round to the tune of $55 million, giving the company a total of $100 million in financing over the course of its lifetime.

Topological Data Analysis

The core tenet behind TDA is that every set of data (except, presumably, those generated by random character generators) has a shape. Once you figure out the shape, it’s much easier to select the appropriate algorithms to pinpoint the pattern behind the shape. “If you understand the shape of the underlying data, then you don’t have to ask all the queries,” Singh says.

For example, if data behaves linearly, a basic regression algorithm will adequately describe what’s going on. When data appears more scattered, clustering algorithms can help find the best division of the different groups of data. It’s also common to find Cheerio-shaped loops appearing in data sets; the U.S. GDP growth rate over time is an example of looping data, Singh says. Finally, some data sets, such as the tracking of lift and drag during an airplane flight, may express themselves two dimensionally as flares.
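Ayasdi’s software is proprietary, but the published Mapper construction that TDA builds on can be sketched in a few dozen lines: slice the data by a filter function into overlapping intervals, cluster each slice, turn every cluster into a node, and connect nodes whose clusters share points. The parameters and the noisy-circle test data below are illustrative only, not Ayasdi’s implementation:

```python
# Simplified sketch of the Mapper construction that underlies TDA.
import numpy as np
from sklearn.cluster import DBSCAN

def mapper(points, filter_values, n_intervals=8, overlap=0.3, eps=0.3):
    """Return (nodes, edges): nodes are clusters of points from overlapping
    slices of the filter range; edges connect clusters that share points."""
    lo, hi = filter_values.min(), filter_values.max()
    width = (hi - lo) / n_intervals
    nodes, edges = [], set()
    for i in range(n_intervals):
        a = lo + i * width - overlap * width
        b = lo + (i + 1) * width + overlap * width
        idx = np.where((filter_values >= a) & (filter_values <= b))[0]
        if len(idx) == 0:
            continue
        labels = DBSCAN(eps=eps, min_samples=3).fit_predict(points[idx])
        for lab in set(labels) - {-1}:          # -1 is DBSCAN noise
            nodes.append(set(idx[labels == lab]))
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if nodes[i] & nodes[j]:             # overlapping slices link up
                edges.add((i, j))
    return nodes, edges

# A noisy circle (a "Cheerio-shaped loop") should come back as a ring of nodes.
theta = np.random.uniform(0, 2 * np.pi, 500)
circle = np.c_[np.cos(theta), np.sin(theta)] + np.random.normal(0, 0.05, (500, 2))
nodes, edges = mapper(circle, filter_values=circle[:, 0])
print(len(nodes), "nodes,", len(edges), "edges")
```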

The insight gleaned from TDA helps Ayasdi to narrow the world of possibilities over what is causing the data to disperse in certain ways. Suddenly, the problem of “double exponentially worsening queries”–which Singh uses to refer to the twin problems of exponentially growing data and the exponentially growing number of possible queries–doesn’t hurt so bad. “You understand everything there is to know about your data, because it’s manifested in this shape,” he says.

Once TDA has given you a peek at the shape of the data, the Ayasdi Core software automatically picks the best machine learning algorithm to explain how the data was made. “The current set of machine learning algorithms that are commercially available are only able to explore a very small subset of these shapes,” Singh says. “What we have developed at Ayasdi is the ability to quickly access a large number of algorithms, and select the most insightful ones for extracting statistically significant sub-groups, values and anomalies in your data.”

What’s more, you don’t have to be a machine learning expert to use Ayasdi, Singh says. By comparison, other machine learning software companies may present a library of algorithms, but it’s up to the customers to select the appropriate one. “What that means is the customer has to be at least as smart as the company producing those algorithms to be able to use them,” Singh says. “What we’ve developed is wholly automated. They’re able to just throw their data into the system, and they don’t need to know any of this stuff. It just works.”

Real World Impact

TDA sounds great theoretically, but does it work in the real world? According to some of Ayasdi’s customers, the answer is an emphatic yes.

In addition to announcing the $55 million funding round and 400 percent bookings growth in 2014, the company also went public today with four new customers: the Mercy health system, Citigroup, Lockheed Martin, and Siemens.

Ayasdi provided this roadmap to help Mercy identify best practices in its hospital setting.

Mercy expects to save $100 million over the next three years as a result of the standard care practices that it developed in part with Ayasdi Care, the version of Ayasdi Core that’s tailored to healthcare organizations. When it comes to knee replacement surgeries, for example, Ayasdi was able to isolate the key variables that determine whether the patient will have a strong recovery or stay in the hospital for months. That will save Mercy $1 million right off the bat, and the savings will add up as it creates more standard care practices. “We’re able to plug Ayasdi Care into Mercy’s EMR [electronic medical records system] and automatically discover these very complex clinical pathways from the data,” Singh says.

Lockheed Martin also expects to save more than $100 million as a result of its Ayasdi implementation, which is helping management identify projects that are threatening to “go off the rails.” While Citigroup didn’t put a number on its anticipated savings with Ayasdi, you can bet that it’s of a similar order of magnitude.

“Ayasdi’s big data technology simplifies and accelerates the analysis of thousands of discrete variables and delivers insights that enable Citi to tailor services to specific client needs, operate more efficiently and mitigate risk,” Deborah Hopkins, Chief Innovation Officer of Citi and CEO of Citi Ventures, said in a press release.

Ted Schlein, a general partner at KPCB, says this type of “machine intelligence” technology will be one of the breakthrough innovations that drive productivity over the next decade. “By combining many machine learning algorithms together with topological mathematics and artificial intelligence, Ayasdi developed an entirely new approach that simplifies complex data analysis for large organizations,” Schlein says.

Holding A Machine Learning Edge

Ayasdi Core includes both the traditional supervised algorithms that are commonly used to train and score predictive models, as well as unsupervised algorithms that are more widely used in data discovery. The company also maintains close ties to the math department at Stanford, “So we keep bringing algorithms into the fold that are hot off the press and not available anywhere else today commercially,” Singh says.

Most of Ayasdi’s customers run the in-memory software on-premise. Ayasdi Core is designed to use HDFS to store data, but it doesn’t run as a Hadoop application, Singh says. Hadoop, apparently, just isn’t fast enough. “The issue is that MapReduce and even Spark end up being just too slow to be able to process this data,” Singh says. “The main issue is the human is the bottleneck. It’s not the processors. Processors are cheap. But if you’re going to employ a data scientist, it’s going to cost you $200,000.”

In that regard, Ayasdi’s competitors are not machine learning software companies, but data scientists who would use machine learning technology. “There’s a gap in the market for analytics, and enterprise customers are trying to hire their way out of this problem and there just aren’t enough people,” Singh says.

Instead of trying to find a data scientist who’s a machine learning expert, Ayasdi has done the hard work of hammering unstructured data into a rough shape, and then hitting that data with a set of highly targeted machine learning algorithms to pick out the signal. This frees up the data analysts or business analysts to explore their data more quickly and more efficiently. At about $1 million per year, Ayasdi Core is not cheap. But considering the benefits some companies are getting, it’s generating a good return.

“Our software will tell you everything that’s statistically relevant about your data and then it’s up to you to tell if that’s actionable or not,” Singh says. “It sure as hell beats trying to ask a question and hoping to find something useful.”

Watch DDN’s Molly Rector’s Leverage Big Data 2015 Keynote

When it comes to finding value in big data, there are almost as many paths as there are data sets. In her introductory keynote at last week’s Leverage Big Data event, Molly Rector, the chief marketing officer at Data Direct Networks, provided guidance on how to find your own path.

Like many of the world’s top data scientists, Rector took a roundabout path to working with big data that included two degrees in biology and chemistry and years of work in ophthalmology research. While she may hold a marketing title now, the opening keynote she delivered last week at the Ponte Vedra Resort near Jacksonville, Florida was anything but a marketing presentation.

No matter how you measure value–whether it’s being more competitive, reducing risk, improving margins, or driving efficiencies into a manufacturing process–it all starts with the data. And once you have a bit of data, get a lot more of it.

“When you have a big enough data set and you look at what is statistically relevant, even one percent, or two percent as you see the trend, is enough to make really good decisions–better decisions than any human could,” Rector said.
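A rough illustration of Rector’s point (the conversion rates and sample sizes here are invented): with a two-proportion z-test, a one-percentage-point difference that is statistical noise in a small sample becomes overwhelming evidence at big data scale.

```python
# Two-proportion z-test: the same 1-point lift is noise at n=1,000 per group
# but unmistakable at n=1,000,000 per group. Rates and sizes are invented.
from math import sqrt
from scipy.stats import norm

def z_test(successes_a, n_a, successes_b, n_b):
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 2 * norm.sf(abs(z))        # two-sided p-value

for n in (1_000, 1_000_000):
    z, p = z_test(int(0.05 * n), n, int(0.06 * n), n)   # 5.0% vs 6.0%
    print(f"n={n:>9,} per group  z={z:5.2f}  p={p:.2g}")
```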