The latest news from Google on open source releases, major projects, events, and student outreach programs.

Google Summer of Code wrap up: Ceph

Friday, November 7, 2014

We continue our Google Summer of Code 2014 wrap up series with Ceph, a distributed object store and file system. Patrick McGarry joins us for a review of two summer projects.

With our first year participating in Google Summer of Code (GSoC) in the rearview mirror, we are already looking forward to GSoC 2015. Over the summer, we learned useful lessons on how to engage with a broader audience and gained valuable code and developer insights. We had two GSoC students who were guided by three mentors, and we certainly hope to grow these numbers in the future. Read on for summaries of our developer projects.

Ceph Wireshark Dissector

Student: Kevin Cox

Mentor: Sage Weil

Wireshark, originally known as Ethereal, is a cross-platform, free and open source packet analyzer used for network troubleshooting, protocol development and general analysis. Although past efforts had been made to integrate Ceph protocols into Wireshark, the resulting code was outdated and would no longer compile against a modern version of Wireshark.

This summer Kevin Cox was tasked with creating a new dissector that could be maintained easily and extended as both Ceph and Wireshark changed over time. The main points of the proposal were:

Create a strong framework from which the dissector can be built so that new message types can be added in the future

Develop code that allows the dissector to be accepted into upstream Wireshark

Work with the Wireshark team to get the dissector into Wireshark natively

One of the compelling features behind Ceph is its built-in data reliability. The default scheme to ensure that your data will still be around in the event of hardware failures is replication. Ceph takes data, splits it into chunks, and replicates those chunks (3x by default) across physical servers. There are many things that could affect the reliability of your data (writes, disk failures, network interruptions, etc.), and Ceph's reliability model has been based on this concept of 3x replication.

“Erasure Coding” has recently been introduced as an alternative approach to data durability for Ceph. Unlike the default replication scheme, erasure coding stores data in several chunks, each on a different physical server, with error correction information added. When one of those chunks becomes unavailable due to hardware failure, it can be reconstructed from the remaining chunks. The error correction information is much more efficient to store than literal copies of data; Ceph can often store data with erasure coding using only 1.4x the size of the data instead of 3x as with plain replication.
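To make the idea concrete, here is a minimal Python sketch. It is not Ceph's implementation: it uses a single XOR parity chunk, the simplest possible erasure code, and the function names (storage_overhead, encode, reconstruct) are made up for illustration. The 10 data + 4 coding chunk layout in the usage note is an assumed example that happens to give 1.4x; the actual parameters depend on how a pool is configured.

```python
from __future__ import annotations

from functools import reduce


def storage_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data with k data chunks and m coding chunks."""
    return (k + m) / k


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal chunks (zero-padded) and append one XOR parity chunk."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]


def reconstruct(chunks: list[bytes | None]) -> list[bytes]:
    """Rebuild at most one missing chunk (marked None) by XOR-ing the survivors."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("a single parity chunk can repair only one lost chunk")
    repaired = list(chunks)
    if missing:
        survivors = [c for c in chunks if c is not None]
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    return repaired
```

For example, storage_overhead(10, 4) gives 1.4, while plain 3x replication can be viewed as storage_overhead(1, 2), which gives 3.0. Calling encode(b"hello ceph!", k=4), replacing any one of the five returned chunks with None, and passing the list to reconstruct returns the original chunks. Codes that survive several simultaneous failures need more than XOR, but the trade-off between overhead and recoverability is the same.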

Veronica Estrada Galiñanes’ GSoC project was to model the impact that different replication schemes have on data durability in distributed systems. Veronica provided a thorough analysis starting with the existing system reliability. She then modeled several different approaches, including erasure coding and locally repairable codes. Although the analysis was focused on Ceph, it could be applied to any other object-based storage system.
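As a rough illustration of what such a comparison involves, the toy Python sketch below estimates the chance of losing an object when each chunk fails independently with the same probability. This is not the model from Veronica's report; the function name loss_probability, the independence assumption, and the absence of any repair process are simplifications made purely for illustration.

```python
from math import comb


def loss_probability(n: int, tolerated: int, p: float) -> float:
    """Probability that more than `tolerated` of n chunks fail.

    Assumes each chunk fails independently with probability p and that
    nothing is repaired in the meantime (a deliberately naive model).
    """
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(tolerated + 1, n + 1)
    )


# 3x replication: 3 copies, data survives up to 2 failures.
replication_loss = loss_probability(n=3, tolerated=2, p=0.01)

# Assumed 10 data + 4 coding chunks: data survives up to 4 failures.
erasure_loss = loss_probability(n=14, tolerated=4, p=0.01)
```

Under these assumptions, replication_loss is simply p cubed, and the two schemes can be compared for any chunk failure probability. A realistic analysis would also have to account for factors such as recovery time and correlated failures, which this sketch ignores.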

This work provided a great methodology to analyze and confirm the redundancy and overall reliability of data within a Ceph system across a wide array of replication schemes. The full details are available in Veronica's final report.

As we worked with the students, mentors, and other organization admins within the Google Summer of Code, we realized that students often have a different perspective on the world. This fresh perspective can offer insights and renewed enthusiasm that would otherwise be difficult to achieve. We found our experience with the program to be rewarding both in a tangible sense (code and developers) and, perhaps more importantly, in an intangible one (new ideas and experience that will help broaden our ecosystem in the future). We hope to participate again next year!