Monthly Archives: February 2016

When an open-source database written in Java that runs primarily in production on Linux becomes THE solution for the cloud platform from Microsoft (i.e. Azure) in the fully distributed, highly secure and “always on” transactional database space, then we should take special note of that. This is the case with DataStax:

July 15, 2015: “Building the intelligent cloud”, Scott Guthrie’s keynote at the Microsoft Worldwide Partner Conference 2015, the DataStax-related segment in 7 minutes only

SCOTT GUTHRIE, EVP of Microsoft Cloud and Enterprise: What I’d like to do is invite three different partners now on stage, one an ISV, one an SI, and one a managed service provider to talk about how they’re taking advantage of our cloud offerings to accelerate their businesses and make their customers even more successful.

First, and I think, you know, being able to take advantage of all of these different capabilities that we now offer.

Now, the first partner I want to bring on stage is DataStax. DataStax delivers an enterprise-grade NoSQL offering based on Apache Cassandra. And they enable customers to build solutions that can scale across literally thousands of servers, which is perfect for a hyper-scale cloud environment.

And one of the customers that they’re working with is First American, who are deploying a solution on Microsoft Azure to provide richer insurance and settlement services to their customers.

What I’d like to do is invite Billy Bosworth, the CEO of DataStax, on stage to join me to talk about the partnership that we’ve had and how some of the great solutions that we’re building together. Here’s Billy. (Applause.)

SCOTT GUTHRIE: So tell us a little bit about DataStax and the technology you guys build.

BILLY BOSWORTH: Sure. At DataStax, we deliver Apache Cassandra in a database platform that is really purpose-built for the new performance and availability demands that are being generated by today’s Web, mobile and IOT applications.

Now, that probably sounds like a lot of other database vendors out there as well. But, Scott, we have something that’s really different and really important to us and our customers, and that’s the notion of being always on. And when you talk about “always on” and transactional databases, things can get pretty complicated pretty fast, as you well know.

The reason for that is in an always-on world, the datacenter itself becomes a single point of failure. And that means you have to build an architecture that is going to be comprehensive and include multiple datacenters. That’s tough enough with almost any other piece of the software stack. But for transactional databases, that is really problematic.

Fortunately, we have a masterless architecture in Apache Cassandra that allows us to have DataStax enterprise scale in a single datacenter or across multiple datacenters, and yet at the same time remain operationally simple. So that’s really the core of what we do.

SCOTT GUTHRIE: Is the always-on angle the key differentiator in terms of the customer fit with Azure?

BILLY BOSWORTH: So if you think about deployment to multiple datacenters, especially and including Azure, it creates an immediate benefit. Going back to your hybrid clouds comment, we see a lot of our customers that begin their journey on premises. So they take their local datacenter, they install DataStax Enterprise, it’s an active database up and running. And then they extend that database into Azure.

Now, when I say that, I don’t mean they do so for disaster recovery or failover, it is active everywhere. So it is taking full read-write requests on premises and in Azure at the same time.

So if you lose connectivity to your physical datacenter, then the Azure active nodes simply take over. And that’s great, and that solves the always-on problem.

But that’s not the only thing that Azure helps to solve. Our applications, because of their nature, tend to drive incredibly high throughput. So for us, hundreds of millions or even tens and hundreds of billions of transactions a day is actually quite common.

You guys are pretty good, Scott, but I don’t think you’ve changed the laws of physics yet. And so the way that you get that kind of throughput with unbelievable performance demands, because our customers demand millisecond and microsecond response times, is you push the data closer to the end points. You geographically distribute it.

Now, what our customers are realizing is they can try and build 19 datacenters across the world, which I’m sure was really cheap and easy to do, or they can just look at what you’ve already done and turn to a partnership like ours to say, “Help us understand how we do this with Azure.”

So not only do you get the always-on benefit, which is critical, but there’s also a very important performance element to this type of architecture as well.

SCOTT GUTHRIE: Can you tell us a little bit about the work you did with First American on Azure?

BILLY BOSWORTH: Yeah. First American is a leading name in the title insurance and settlement services businesses. In fact, they manage more titles on more properties than anybody in the world.

Every title comes with an associated set of metadata. And that metadata becomes very important in the new way that they want to do business because each element of that needs to be transacted, searched, and done in real-time analysis to provide better information back to the customer in real time.

And so for that on the database side, because of the type of data and because of the scale, they needed something like DataStax Enterprise, which we’ve delivered. But they didn’t want to fight all those battles of the architecture that we discussed on their own, and that’s where they turned to our partnership to incorporate Microsoft Azure as the infrastructure with DataStax Enterprise running on top.

And this is one of many engagements that you know we have going on in the field that are really, really exciting and indicative of the way customers are thinking about transforming their business.

SCOTT GUTHRIE: So what’s it like working with Microsoft as a partner?

BILLY BOSWORTH: I tell you, it’s unbelievable. Or, maybe put differently, highly improbable that you and I are on stage together. I want you guys to think about this. Here’s the type of company we are. We’re an open-source database written in Java that runs primarily in production on Linux.

Now, Scott, Microsoft has a couple of pretty good databases, of which I’m very familiar from my past, and open source and Java and Linux haven’t always been synonymous with Microsoft, right?

So I would say the odds of us being on stage were almost none. But over the past year or two, the way that you guys have opened up your aperture to include technologies like ours — and I don’t just say “include.” His team has embraced us in a way that is truly incredible. For a company the size of Microsoft to make us feel the way we do is just remarkable given the fact that none of our technologies have been something that Microsoft has traditionally said is part of their family.

So I want to thank you and your team for all the work you’ve done. It’s been a great experience, but we are architecting systems that are going to drive businesses for the coming decades. And that is super exciting to have a partner like you engaged with us.

SCOTT GUTHRIE: Fantastic. Well, thank you so much for joining us on stage.

BILLY BOSWORTH: Thanks, Scott. (Applause.)

The typical data framework capabilities of DataStax are best understood via the following webinar, which presents Apache Spark as part of the complete data platform solution:
– Apache Cassandra is the leading distributed database in use at thousands of sites with the world’s most demanding scalability and availability requirements.
– Apache Spark is a distributed data analytics computing framework that has gained a lot of traction in processing large amounts of data in an efficient and user-friendly manner.
– The joining of both provides a powerful combination of real-time data collection with analytics.
After a brief overview of Cassandra and Spark, (Cassandra till 16:39, Spark till 19:25) this class will dive into various aspects of the integration (from 19:26).
August 19, 2015: Big Data Analytics with Cassandra and Spark by Brian Hess, Senior Product Manager of Analytics, DataStax

SANTA CLARA, CA – September 23, 2015 – (Cassandra Summit 2015) DataStax, the company that delivers Apache Cassandra™ to the enterprise, today announced a strategic collaboration with Microsoft to deliver Internet of Things (IoT), Web and mobile applications in public, private or hybrid cloud environments. With DataStax Enterprise (DSE), a leading fully-distributed database platform, available on Azure, Microsoft’s cloud computing platform, enterprises can quickly build high-performance applications that can massively scale and remain operationally simple across public and private clouds, with ease and at lightning speed.

PERSPECTIVES ON THE NEWS

“At Microsoft we’re focused on enabling customers to run their businesses more productively and successfully,” said Scott Guthrie, Executive Vice President, Cloud and Enterprise, Microsoft. “As more organizations build their critical business applications in the cloud, DataStax has proved to be a natural Azure partner through their ability to enable enterprises to build solutions that can scale across thousands of servers which is necessary in today’s hyper-scale cloud environment.”

“We are witnessing an increased adoption of DataStax Enterprise deployments in hybrid cloud environments, so closely aligning with Microsoft benefits any organization looking to quickly and easily build high-performance IoT, Web and mobile apps,” said Billy Bosworth, CEO, DataStax. “Working with a world-class organization like Microsoft has been an incredible experience and we look forward to continuing to work together to meet the needs of enterprises looking to successfully transition their business to the cloud.”

“As a leader in providing information and insight in critical areas that shape today’s business landscape, we knew it was critical to transform our back-end business processes to address scale and flexibility” said Graham Lammers, Director, IHS. “With DataStax Enterprise on Azure we are now able to create a next generation big data application to support the decision-making process of our customers across the globe.”

BUILD SIMPLE, SCALABLE AND ALWAYS-ON APPS IN RECORD SPEED

To address the ever-increasing demands of modern businesses transitioning from on-premise to hybrid cloud environments, the DataStax Enterprise on Azure on-demand cloud database solution provides enterprises with both development and production ready Bring Your Own License (BYOL) DSE clusters that can be launched in minutes on the Microsoft Azure Marketplace using Azure Resource Manager (ARM) Templates. This enables the building of high-performance IoT, Web and mobile applications that can predictably scale across global Azure data centers with ease and at remarkable speed. Additional benefits include:

Hybrid Deployment: Easily move DSE workloads between data centers, service providers and Azure, and build hybrid applications that leverage resources across all three.

Continuous Availability: DSE’s peer-to-peer architecture offers no single point of failure. DSE also provides maximum flexibility to distribute data where it’s needed most by replicating data across multiple data centers, the cloud and mixed cloud/on-premise environments.
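The peer-to-peer idea described above can be illustrated with a toy token ring. This is only a minimal sketch under stated assumptions: the node names, the key, and the use of MD5 as the hash are hypothetical choices made for illustration, and production Cassandra uses a different partitioner, virtual nodes, and datacenter-aware replica placement that this example leaves out. What it does show is that any node can compute where a key lives, and that losing one replica does not make the data unavailable.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is an illustrative choice only)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy peer-to-peer ring: every node is equal and there is no master."""

    def __init__(self, nodes):
        # Sort nodes by token so the ring can be walked clockwise.
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key: str, replication_factor: int = 3):
        """Return the owners of `key`: the first node at or after the key's
        token, followed by the next nodes clockwise."""
        if replication_factor > len(self._ring):
            raise ValueError("replication factor exceeds node count")
        start = bisect.bisect_left(self._tokens, token(key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1]
                for i in range(replication_factor)]

    def without(self, node: str) -> "Ring":
        """Return a new ring with one node removed, simulating a failure."""
        return Ring([n for _, n in self._ring if n != node])


# Hypothetical five-node cluster and key.
ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
owners = ring.replicas("title:12345", replication_factor=3)

# Simulate losing the first owner: the remaining owners still hold the key.
survivors = ring.without(owners[0]).replicas("title:12345", replication_factor=3)
print("owners:   ", owners)
print("survivors:", survivors)
```

Because placement is derived from the key alone, no coordinator has to be consulted or re-elected when a node disappears; the surviving replicas simply keep serving.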

MICROSOFT ENTERPRISE CLOUD ALLIANCE & FAST START PROGRAM

DataStax also announced it has joined Microsoft’s Enterprise Cloud Alliance, a collaboration that reinforces DataStax’s commitment to provide the best set of on-premise, hosted and public cloud database solutions in the industry. The goal of Microsoft’s Enterprise Cloud Alliance partner program is to create, nurture and grow a strong partner ecosystem across a broad set of Enterprise Cloud Products delivering the best on-premise, hosted and Public Cloud solutions in the industry. Through this alliance, DataStax and Microsoft are working together to create enhanced enterprise-grade offerings for the Azure Marketplace that reduce the complexities of deployment and provisioning through automated ARM scripting capabilities.

Additionally, as a member of Microsoft Azure’s Fast Start program, created to help users quickly deploy new cloud workloads, DataStax users receive immediate access to the DataStax Enterprise Sandbox on Azure for a hands-on experience testing out DSE on Azure capabilities. DataStax Enterprise Sandbox on Azure can be found here.

Cassandra Summit 2015, the world’s largest gathering of Cassandra users, is taking place this week and Microsoft Cloud and Enterprise Executive Vice President Scott Guthrie, DataStax CEO Billy Bosworth, and Apache Cassandra Project Chair and DataStax Co-founder and CTO Jonathan Ellis, will deliver the conference keynote at 10 a.m. PT on Wednesday, September 23. The keynote can be viewed at DataStax.com.

ABOUT DATASTAX

DataStax delivers Apache Cassandra™ in a database platform purpose-built for the performance and availability demands for IoT, Web and mobile applications. This gives enterprises a secure, always-on database technology that remains operationally simple when scaling in a single datacenter or across multiple datacenters and clouds.

With more than 500 customers in over 50 countries, DataStax is the database technology of choice for the world’s most innovative companies, such as Netflix, Safeway, ING, Adobe, Intuit and eBay. Based in Santa Clara, Calif., DataStax is backed by industry-leading investors including Comcast Ventures, Crosslink Capital, Lightspeed Venture Partners, Kleiner Perkins Caufield & Byers, Meritech Capital, Premji Invest and Scale Venture Partners. For more information, visit DataStax.com or follow us @DataStax.

DataStax is a California-based database management company. It offers an enterprise-grade NoSQL database that seamlessly and securely integrates real-time data with Apache Cassandra. Databases built on Apache Cassandra offer more flexibility than traditional databases. Even in case of calamities and uncertainties, like floods and earthquakes, data is available due to its replication at other data centers. Cassandra, like many NoSQL databases, is open-source software.

The Cassandra database was developed at Facebook (FB) to handle its enormous volumes of data. Its design draws on technology developed earlier by others: Amazon’s Dynamo (AMZN) and Google’s Bigtable (GOOGL). Oracle’s MySQL (ORCL), Microsoft’s SQL Server (MSFT), and IBM’s DB2 (IBM) are among the traditional databases on the market.

DataStax raised $106 million in September 2014 to expand its database operations. MongoDB Inc. and Couchbase Inc., both open-source NoSQL database developers, raised $231 million and $115 million, respectively, in 2014. According to Market Research Media, a consultancy firm, spending on NoSQL technology in 2013 was less than $1 billion. It’s expected to reach $3.4 billion by 2020. This explains why this segment is attracting such huge investments.

Oracle’s dominance in the database market is uncertain

Oracle claims it’s a market leader in the relational database market, with a revenue share of 48.3%. In 2013, it launched Oracle Database 12c. According to Oracle, “Oracle Database 12c introduces a new multitenant architecture that simplifies the process of consolidating databases onto the cloud; enabling customers to manage many databases as one – without changing their applications.”

In July 2013, DataStax announced that dozens of companies had migrated from Oracle databases to DataStax databases. Customers cited scalability, disaster avoidance, and cost savings as the reasons for shifting databases. The rising popularity of DataStax databases jeopardizes Oracle’s dominant position in the database market.

Cassandra Summit is in high gear this week in Santa Clara, CA, representing the largest NoSQL event of its kind! This is the largest Cassandra Summit to date. With more than 7,000 attendees (both onsite and virtual), this is the first time the Summit is a three-day event with over 135 speaking sessions. This is also the first time DataStax will debut a formalized Apache Cassandra™ training and certification program in conjunction with O’Reilly Media. All incredibly exciting milestones!

We are excited to share another milestone. Yesterday, we announced our formal strategic collaboration with Microsoft. Dedicated DataStax and Microsoft teams have been collaborating closely behind the scenes for more than a year on product integration, QA testing, platform optimization, automated provisioning, and characterization of DataStax Enterprise (DSE) on Azure, and more to ensure product validation and a great customer experience for users of DataStax Enterprise on the Azure cloud. There is strong coordination across the two organizations – very close executive, field, and technical alignment – all critical components for a strong partnership.

This partnership is driven and shaped by our joint customers. Our customers oftentimes begin their journey with on-premise deployments of our database technology and then have a requirement to move to the cloud – Microsoft is a fantastic partner to help provide the flexibility of a true hybrid environment along with the ability to migrate to and scale applications in the cloud. Additionally, Microsoft has significant breadth regarding their data centers – customers can deploy in numerous Azure data centers around the globe, in order to be ‘closer’ to their end users. This is highly complementary to DataStax Enterprise software as we are a peer-to-peer distributed database and our customers need to be close to their end users with their always-on, always available enterprise applications.
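The point about being ‘closer’ to end users can be made concrete with a back-of-the-envelope calculation. The sketch below is an illustration under stated assumptions, not a measurement: the refractive index of 1.47 is an assumed typical value for optical fiber, the distances are hypothetical, and real round trips are slower still because of routing, queuing and processing.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in vacuum
FIBER_REFRACTIVE_INDEX = 1.47      # assumed typical value for optical fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber for a given one-way distance.

    Ignores routing detours, queuing, and server processing time.
    """
    speed_in_fiber_km_s = SPEED_OF_LIGHT_KM_S / FIBER_REFRACTIVE_INDEX
    return 2 * distance_km / speed_in_fiber_km_s * 1000


# Hypothetical distances between an end user and the nearest datacenter.
for label, km in [("same metro", 50),
                  ("same continent", 4_000),
                  ("intercontinental", 10_000)]:
    print(f"{label:>16}: at least {min_round_trip_ms(km):.1f} ms round trip")
```

Under these assumptions a request that has to cross 10,000 km cannot complete in much less than a tenth of a second, no matter how fast the database is, which is why millisecond response times require replicas in a nearby region.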

To highlight a couple of joint customers and use cases we have First American Title and IHS, Inc. First American is a leading provider of title insurance and settlement services with revenue over $5B. They ingest and store the largest number (billions) of real estate property records in the industry. Accessing, searching and analyzing large data-sets to get relevant details quickly is the new way they want to do business – to provide better information back to their customers in real-time and allow end users to easily search through the property records on-line. They chose DSE and Azure because of the large data requirements and because of the need to continue to scale the application.

A second great customer and use case is IHS, Inc., a $2B revenue-company that provides information and analysis to support the decision-making process of businesses and governments. This is a transformational project for IHS as they are building out an ‘internet age’ parts catalog – it’s a next generation big data application, using NoSQL, non-relational technology and they want to deploy in the cloud to bring the application to market faster.

As you can see, we are enabling enterprises to engage their customer like never before with their always on, highly available and distributed applications. Stay tuned for more as we move forward together in the coming months!

When Microsoft says that it is embracing Linux as a peer to Windows, it is not kidding. The company has created its own Linux distribution for switches used to build the Azure cloud, and it has embraced Spark in-memory processing and Cassandra as its data store for its first major open source big data project – in this case to help improve the quality of its Office365 user experience. And now, Microsoft is embracing Cassandra, the NoSQL data store originally created by Facebook when it could no longer scale the MySQL relational database to suit its needs, on the Azure public cloud.

Billy Bosworth, CEO at DataStax, the entity that took over steering development of and providing commercial support for Cassandra, tells The Next Platform that the deal with Microsoft has a number of facets, all of which should help boost the adoption of the enterprise-grade version of Cassandra. But the key one is that the Global 2000 customers that DataStax wants to sell support and services to are already quite familiar with Windows Server in their datacenters, and they are looking to burst out to the Azure cloud on a global scale.

“We are seeing a rapidly increasing number of our customers who need hybrid cloud, keeping pieces of our DataStax Enterprise on premise in their own datacenters and they also want to take pieces of that same live transactional data – not replication, but live data – and in the Azure cloud as well,” says Bosworth. “They have some unique capabilities, and one of the major requirements of customers is that even if they use cloud infrastructure, it still has to be distributed by the cloud provider. They can’t just run Cassandra in one availability zone in one region. They have to span data across the globe, and Microsoft has done a tremendous job of investing in its datacenters.”

With the Microsoft agreement, DataStax is now running its wares on the three big clouds, with Amazon Web Services and Google Compute Engine already certified able to run the production-grade Cassandra. And interestingly enough, Microsoft is supporting the DataStax implementation of Cassandra on top of Linux, not Windows. Bosworth says that while Cassandra can be run on Windows servers, DataStax does not recommend putting DataStax Enterprise (DSE), the commercial release, on Windows. (It does have a few customers who do, nonetheless, and it supports them.) Bosworth adds that DataStax and the Cassandra community have been “working diligently” for the past year to get a Windows port of DSE completed and that there has been “zero pressure” for the Microsoft Azure team to run DSE on anything other than Linux.

It is important to make the distinction between running Cassandra and other elements of DSE on Windows and having optimized drivers for Cassandra for the .NET programming environment for Windows.

“All we are really talking about is the ability to run the back-end Cassandra on Linux or Windows, and to the developer, it is irrelevant on what that back end is running,” explains Bosworth. “This takes away some of that friction, and what we find is that on the back end, we just don’t find religious conviction about whether it should run on Windows or Linux, and this is different from five years ago. We sell mostly to enterprises, and we have not had one customer raise their hand and say they can’t use DSE because it does not run on Windows.”

What is more important is the ability to seamlessly put Cassandra on public clouds and spread transactional data around for performance and resiliency reasons – the same reasons that Facebook created Cassandra for in the first place.

What Is In The Stack, Who Uses It, And How

The DataStax Enterprise distribution does not just include the Apache Cassandra data store, but has an integrated search engine that is API compatible with the open source Solr search engine and in-memory extensions that can speed up data accesses by anywhere from 30X to 100X compared to server clusters using flash SSDs or disk drives. The Cassandra data store can be used to underpin Hadoop, allowing it to be queried by MapReduce, Hive, Pig, and Mahout, and it can also underpin Spark and Spark Streaming as their data stores if customers decide to not go with the Hadoop Distributed File System that is commonly packaged with a Hadoop distribution.

It is hard to say for sure how many organizations are running Cassandra today, but Bosworth reckons that it is on the order of tens of thousands worldwide, based on a number of factors. DataStax does not do any tracking of its DataStax Community edition because it wants a “frictionless download” like many open source projects have. (Developers don’t want software companies to see what tools they are playing with, even though they might love open source code.) DataStax provides free training for Cassandra, however, where it does keep track, and developers are consuming over 10,000 units of this training per month, so that probably indicates that the Cassandra installed base (including tests, prototypes, and production) is in the five figures.

DataStax itself has over 500 paying customers – now including Microsoft after its partner tried to build its own Spark-Cassandra cluster using open source code and decided that the supported versions were better thanks to the extra goodies that DataStax puts into its distro. DataStax has 30 of the Fortune 100 using its distribution of Cassandra in one form or another, and it is always for transactional, rather than batch analytic, jobs and in most cases also for distributed data stores that make use of the “eventual consistency” features of Cassandra to replicate data across multiple clusters. The company has another 600 firms participating in its startup program, which gives young companies freebie support on the DSE distro until they hit a certain size and can afford to start kicking some cash into the kitty.

The largest installation of Cassandra is running at Apple, which as we previously reported has over 75,000 nodes, with clusters ranging in size from hundreds to over 1,000 nodes and with a total capacity in the petabytes range. Netflix, which used to employ the open source Cassandra, switched to DSE last May and had over 80 clusters with more than 2,500 nodes supporting various aspects of its video distribution business. In both cases, Cassandra is very likely housing user session state data as well as feeding product or play lists and recommendations or doing faceted search for their online customers.

We are always intrigued to learn how customers are actually deploying tools such as Cassandra in production and how they scale it. Bosworth says that it is not uncommon to run a prototype project on as few as ten nodes, and when the project goes into production, to see it grow to dozens to hundreds of nodes. The midrange DSE clusters range from maybe 500 to 1,000 nodes and there are some that get well over 1,000 nodes for large-scale workloads like those running at Apple.

In general, Cassandra does not, like Hadoop, run on disk-heavy nodes. Remember, the system was designed to support hot transactional data, not to become a lake with a mix of warm and cold data that would be sifted in batch mode as is still done with MapReduce running atop Hadoop.

The typical node configuration has changed as Cassandra has evolved and improved, says Robin Schumacher, vice president of products at DataStax. But before getting into feeds and speeds, Schumacher offered this advice. “There are two golden rules for Cassandra. First, get your data model right, and second, get your storage system right. If you get those two things right, you can do a lot wrong with your configuration or your hardware and Cassandra will still treat you right. Whenever we have to dive in and help someone out, it is because they have just moved over a relational data model or they have hooked their servers up to a NAS or a SAN or something like that, which is absolutely not recommended.”

Only four years ago, because of the limitations in Cassandra (which like Hadoop and many other analytics tools is coded in Java), the rule of thumb was to put no more than 512 GB of disk capacity onto a single node. (It is hard to imagine such small disk capacities these days, with 8 TB and 10 TB disks.) The typical Cassandra node has two processors, with somewhere between 12 and 24 cores, and has between 64 GB and 128 GB of main memory. Customers who want the best performance tend to go with flash SSDs, although you can do all-disk setups, too.

Fast forward to today, and Cassandra can make use of a server node with maybe 5 TB of capacity for a mix of reads and writes, and if you have a write intensive application, then you can push that up to 20 TB. (DataStax has done this in its labs, says Schumacher, without any performance degradation.) Pushing the capacity up is important because it helps reduce server node count for a given amount of storage, which cuts hardware and software licensing and support costs. Incidentally, only a quarter of DSE customers surveyed said they were using spinning disks, but disk drives are fine for certain kinds of log data. SSDs are used for most transactional data, but the bits that are most latency sensitive should use DSE to store data on PCI-Express flash cards, which have lower latency.
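The effect of per-node capacity on cluster size is simple arithmetic, sketched below. The per-node figures (512 GB, 5 TB, 20 TB) come from the text above; the 100 TB dataset and the replication factor of 3 are hypothetical values chosen for illustration, and the calculation ignores the free-space headroom a real cluster needs.

```python
import math


def nodes_needed(dataset_tb: float, replication_factor: int,
                 capacity_per_node_tb: float) -> int:
    """Minimum node count to hold every replica of the dataset.

    Ignores operational headroom, so real clusters would need more nodes.
    """
    total_stored_tb = dataset_tb * replication_factor
    return math.ceil(total_stored_tb / capacity_per_node_tb)


DATASET_TB = 100          # hypothetical dataset size
REPLICATION_FACTOR = 3    # hypothetical; three copies of each row

for label, capacity_tb in [("512 GB per node (four years ago)", 0.512),
                           ("5 TB per node (mixed reads/writes)", 5),
                           ("20 TB per node (write intensive)", 20)]:
    count = nodes_needed(DATASET_TB, REPLICATION_FACTOR, capacity_tb)
    print(f"{label}: {count} nodes")
```

With these assumed inputs the same data needs 586 nodes at 512 GB each, 60 nodes at 5 TB, and 15 nodes at 20 TB, which is the node-count reduction the paragraph above ties to lower hardware, licensing and support costs.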

Schumacher says that in most cases, the commercial-grade DSE Cassandra is used for a Web or mobile application, and a DSE cluster is not set up for hosting multiple applications, but rather companies have a different cluster for each use case. (As you can see, this is the case with Apple and Netflix.) Most of the DSE shops make use of the eventual consistency replication features of Cassandra to span multiple datacenters with their data stores, and span anywhere from eight to twelve datacenters with their transactional data.
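As a rough illustration of what spanning datacenters looks like in practice, the sketch below builds a CQL keyspace definition that keeps replicas in two datacenters, and shows the usual quorum arithmetic behind tunable consistency. The keyspace name `titles` and the datacenter names `onprem` and `azure` are hypothetical; in a real cluster the datacenter names must match what the cluster itself reports. The helper functions are plain Python written for this example and are not part of any driver.

```python
def create_keyspace_cql(name: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement using NetworkTopologyStrategy,
    with one replica count per datacenter."""
    dc_options = ", ".join(f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items())
    return (f"CREATE KEYSPACE IF NOT EXISTS {name} WITH replication = "
            f"{{'class': 'NetworkTopologyStrategy', {dc_options}}};")


def quorum(replication_factor: int) -> int:
    """A quorum is a majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """Reads overlap the latest write when R + W > RF; otherwise the
    cluster is only eventually consistent for that operation pair."""
    return read_replicas + write_replicas > replication_factor


# Hypothetical keyspace replicated three ways on premises and in Azure.
statement = create_keyspace_cql("titles", {"onprem": 3, "azure": 3})
print(statement)

rf = 3
print("quorum of", rf, "replicas:", quorum(rf))
print("quorum reads + quorum writes overlap:",
      reads_see_latest_write(quorum(rf), quorum(rf), rf))
print("single-replica reads + writes overlap:",
      reads_see_latest_write(1, 1, rf))
```

Reading and writing at a single replica is the fast, eventually consistent end of the trade-off; requiring a majority on both sides costs latency but guarantees that a read touches at least one replica holding the latest write.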

Here’s where it gets interesting, and why Microsoft is relevant to DataStax. Only about 30 percent of the DSE installations are running on premises. The remaining 70 percent are running on public clouds. About half of DSE customers are running on Amazon Web Services, with the remaining 20 percent split more or less evenly between Google Compute Engine and Microsoft Azure. If DataStax wants to grow its business, the easiest way to do that is to grow along with AWS, Compute Engine, and Azure.

So Microsoft and DataStax are sharing their roadmaps and coordinating development of their respective wares, and will be doing product validation, benchmarking, and optimization. The two will be working on demand generation and marketing together, too, and aligning their compensation to sell DSE on top of Azure and, eventually, on top of Windows Server for those who want to run it on premises.

In addition to announcing the Microsoft partnership at the Cassandra Summit this week, DataStax is also releasing its DSE 4.8 stack, which includes certification for Cassandra to be used as the back end for the new Spark 1.4 in-memory analytics tool. DSE Search has performance boosts for live indexing, and running DSE instances inside of Docker containers has been improved. The stack also includes Titan 1.0, the graph database overlay for Cassandra, HBase, and BerkeleyDB that DataStax got through its acquisition of Aurelius back in February. DataStax is also previewing Cassandra 3.0, which will include support for JSON documents, role-based access control, and a lot of little tweaks that will make the storage more efficient, DataStax says. It is expected to ship later this year.