Just published on the Bright blog is my follow-up post regarding my poster presentation (poster, video) at the 2015 Rice Oil and Gas HPC Workshop. In addition to summarizing the disruptive potential for Apache Spark in energy exploration and other industries, this new post also captures my shift in emphasis from Apache Hadoop to Spark. Because the scientific details of my investigation are more-than-a-little OT for the Bright blog, I thought I’d share them here.

About RTM

RTM has a storied history of being performance-challenged. Although the method was originally conceived by geophysicists in the 1980s, it was almost two decades before it became computationally tractable. Considered table stakes in terms of seismic processing by today’s standards, algorithms research for RTM remains highly topical – not just at Rice, York and other universities, but also at the multinational corporations whose very livelihood depends upon the effective and efficient processing of seismic-reflection data. And of particular note are the consistent gains being made since the introduction of GPU programmability via CUDA, as innovative algorithms for RTM can exploit this platform for double-digit speedups.

Why does RTM remain performance-challenged? Dr. G. Liu and colleagues in the School of Geophysics and Information Technology at China University of Geosciences in Beijing identify the two key challenges:

RTM modelling is inherently compute intensive. In RTM, propagating seismic waves are modeled using the three-dimensional wave equation. This classic equation of mathematical physics needs to be applied twice. First in the forward problem, assumptions are made about the characteristics of the seismic source as well as variations in subsurface velocity, so that seismic waves can be propagated forward in time from their point of origin into the subsurface (i.e., an area of geological interest from a petroleum exploration perspective); this results in the forward or source wavefields in the upper-branch of the diagram below. Using seismic traces recorded at arrays of geophones (receivers sensitive to various types of seismic waves) as well as an assumed subsurface-velocity model, these observations are reversed-in-time (hence the name RTM), and then backwards propagated using the same 3D wave equation; this results in the receivers’ wavefields in the lower-branch of the diagram below. It is standard practice to make use of the Finite Difference Method (FDM) to numerically propagate all wavefields in space and time. In order to ensure meaningful results (stable and non-dispersive from the perspective of numerical analysis) from application of FDM to the 3D wave equation, however, both time and 3D space need to be discretized into small steps and grid intervals, respectively. Because the wave equation is a Partial Differential Equation in time and space, the FDM estimates future values using approximations for all derivatives. And in practice, it has been determined that RTM requires high-order approximations for all spatial derivatives if reliable results are to be optimally obtained. In short, there are valid reasons why the RTM modeling kernel is inherently and unavoidably compute intensive.

RTM data exceeds memory capacity. From the earliest days of computational tractability around the late 1990s, standard practice was to write the forward/source wavefields to disk. Then, in a subsequent step, cross-correlate this stored data of forward wavefields with the receivers’ wavefields. Using cross-correlation as the basis for an imaging condition, coherence (in the time-series analysis sense) between the two wavefields is interpreted as being of geological interest – i.e., the identification in space and time of geological reflectors like (steeply dipping) interfaces between different sedimentary lithologies, folds, faults, salt domes as well as reservoirs of even more complex geometrical structure. Although the method consistently delivered the ‘truest images’ of the subsurface, it was literally being crushed by its own success, as multiple-TB data volumes are typical for the forward wavefields. The need to write the forward wavefields to disk, and then re-read them piecemeal from disk during cross-correlation with the receivers’ wavefields, results in disk I/O emerging as the significant bottleneck.
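
To make the two challenges concrete, here is a minimal sketch in Python, reduced to one spatial dimension with a second-order stencil rather than the 3D, high-order kernel used in practice. The function names (ricker, step, rtm_1d), the Ricker source, the fixed zero boundaries and the synthetic "recorded" trace are all assumptions made for illustration; none of this is taken from Liu et al. The forward pass keeps every snapshot in a list (the data that production RTM writes to disk), and the reverse-time pass applies the zero-lag cross-correlation imaging condition, I(x) = sum over t of S(x, t) * R(x, t).

```python
import math


def ricker(t, f0=15.0, t0=0.1):
    """Ricker wavelet: an assumed, commonly used seismic source signature."""
    a = (math.pi * f0 * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)


def step(prev, curr, c2):
    """One second-order finite-difference time step of the 1D wave equation.

    c2[i] is (v[i] * dt / dx) ** 2; the two end points are held at zero.
    """
    nxt = [0.0] * len(curr)
    for i in range(1, len(curr) - 1):
        laplacian = curr[i - 1] - 2.0 * curr[i] + curr[i + 1]
        nxt[i] = 2.0 * curr[i] - prev[i] + c2[i] * laplacian
    return nxt


def rtm_1d(velocity, dx, dt, nt, src_ix, rec_ix):
    """Toy 1D reverse-time migration; returns (image, forward snapshots)."""
    nx = len(velocity)
    c2 = [(v * dt / dx) ** 2 for v in velocity]
    if max(c2) > 1.0:
        # Stability limit of this 1D second-order scheme: v * dt / dx <= 1.
        raise ValueError("time step too large for a stable scheme")

    # Forward problem: propagate the source and keep every snapshot in memory.
    snapshots, trace = [], []
    prev, curr = [0.0] * nx, [0.0] * nx
    for n in range(nt):
        nxt = step(prev, curr, c2)
        nxt[src_ix] += ricker(n * dt)
        prev, curr = curr, nxt
        snapshots.append(curr)
        trace.append(curr[rec_ix])  # synthetic stand-in for a recorded trace

    # Reverse-time pass: inject the trace backwards in time at the receiver
    # and cross-correlate with the stored forward snapshot for the same step.
    image = [0.0] * nx
    prev, curr = [0.0] * nx, [0.0] * nx
    for n in reversed(range(nt)):
        nxt = step(prev, curr, c2)
        nxt[rec_ix] += trace[n]
        prev, curr = curr, nxt
        forward = snapshots[n]
        for i in range(nx):
            image[i] += forward[i] * curr[i]
    return image, snapshots
```

For nt time steps on nx grid points, the forward pass holds nt × nx values. Scaling that count to 3D grids and thousands of time steps is what leads to the multiple-TB volumes described above.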

GPU-Enabled RTM

Not surprisingly then, researchers like Liu et al. have programmed GPUs using CUDA for maximum performance impact when it comes to implementing RTM’s modeling kernel. However, they, along with a number of other researchers, have introduced novel algorithms to address the challenge of disk I/O. As you might anticipate, the novel aspect of their algorithms is in how they make use of the memory hierarchy presented by hybrid-architecture systems based on CPUs and accelerators. (Although CUDA 6 introduced a kernel module to allow for shared memory between CPUs and GPUs in the first quarter of 2014, I am unaware of the resulting contiguous memory being exploited in the case of RTM.) Programming GPUs via CUDA is not for the faint of coding heart. However, the double-digit performance gains achieved using this platform have served only to validate an ongoing investment.

Spark’ing Possibilities for RTM

Inspired by the in-memory applications of GPUs, and informed about the meteoric rise of interest in Apache Spark, the inevitable (and refactored) question for the Rice workshop became: “RTM using Spark? Is there a case for migration?” In other words, rather than work with HDFS and YARN in a Hadoop context, might Spark have more to offer to RTM?

With the caveat that my investigation is at its earliest stages, and that details need to be fleshed out by me (and, hopefully, many others), Spark appears to present the following possibilities for RTM:

Replace/reduce disk I/O with RDDs. The key innovation implemented in Spark is RDDs – Resilient Distributed Datasets. This in-memory abstraction (please see the 6th reason here for more) has the potential to replace disks in RTM workflows. More specifically, in making use of RDDs via Spark:

Forward wavefields could reside in memory and be rendered available without the need for disk I/O during the application of the imaging condition – i.e., as forward and receivers’ wavefields are cross-correlated. This is illustrated in a modified version of RTM’s computational workflow above. You should be skeptical about the multiple-TBs of data involved here – as you’re unlikely to have a single system with such memory capacity in isolation. This is where the Distributed aspect of RDDs factors in. In a fashion that mimics Hadoop’s use of distributed, yet distinct disks to provide the abstraction of a contiguous file system, RDDs do the same only with memory. Because RDDs are inherently Resilient, they are intended for clustered environments where various types of failures (e.g., a kernel panic followed by a system crash) are inevitable and can be tolerated. Even more enticing in this use case involving RTM wavefields, the ability to functionally transform datasets using Spark’s built-in capability for partitioning RDDs means that more sophisticated algorithms for imaging RTM’s two wavefields can be crafted – i.e., algorithms that exploit topological awareness of the wavefields’ locality in memory. In confronting the second challenge identified above by Liu et al., an early win for in-memory RTM via RDDs would certainly demonstrate the value of the approach.

Gathers of seismic data could reside in memory, and be optimally partitioned using Spark for wavefield calculations. Once acquired, reflection-seismic data is written to an industry-standard format (SEG Y rev 1) established by the Society of Exploration Geophysicists (SEG). Gathers are collections of data for pairs of sources and receivers that have depth (typically) in common. (This is referred to as a Common Depth Point or CDP gather by the industry.) RTM is systematically applied to each gather. Although this has not been a point of focus from an algorithms-research perspective, even in the innovative cases involving GPUs, the in-memory possibilities afforded by Spark may be cause for reconsideration. In fact Professor Huang and his students, in the Department of Computer Science at Prairie View A&M University (PVAMU) in the Houston area, have already applied Spark to SEG Y rev 1 format seismic data. In a poster presented at the Rice workshop, not only did Prof. Huang demonstrate the feasibility of introducing RDDs via Spark, he indicated how this use is crucial to a cloud-based platform for seismic analytics currently under development at PVAMU.

Apply alternate imaging conditions. For each (CDP) gather, coherence between RTM’s two wavefields comprises the basis for establishing the presence of subsurface reflectors of geological origin. Using cross-correlation, artifacts introduced by complex reflector geometries, for example, are de-emphasized as the gather is migrated as-a-whole. Whereas it represents the canonical imaging condition envisaged by the originators of RTM in the 1980s, cross-correlation is by no means the only mechanism for establishing coherence between wavefields. Because Spark includes support for machine learning (MLlib), graph analytics (GraphX) and even statistics (SparkR), alternate possibilities for rapidly establishing imaging conditions have never been more accessible to the petroleum industry. Spark’s analytics upside for imaging conditions is much more about introducing new capabilities than computational performance. For example, parameter studies based upon varying gathers and/or velocity models might serve to reduce the levels of uncertainty inherently present in inverse problems that seek to image the subsurface in areas of potential interest for the exploitation of petroleum resources. Using Spark’ified Genetic Algorithms (e.g., derivative of Spark-complementary ones already written in Scala), for example, criteria could be established for evaluating the imaging conditions resulting from parameter studies – i.e., naturally selecting the most-appropriate velocity model.

Alternate implementation of the modeling kernel. Is it possible to Spark’ify the RTM modeling kernel? In other words, make programmatic use of Spark to propagate wavefields via the FDM implementation of the 3D wave equation. And even if this is possible, does it make sense? Clearly, this is the most speculative of the suggestions here. Though most speculative, in asking more questions than it presently answers, also the most intriguing. How so? At its core, speculation of this kind speaks to the generality of RDDs as a paradigm for parallel computing that reaches well beyond just RTM using FDM, and consequently of Spark as a representative implementation. Without speculating further at this time, I’ll take the 5th, and close conservatively here with: Further research is required.

Real-time streaming. Spark includes support for streamed data. Whereas streaming seismic data upon acquisition in real time through an RTM workflow appears problematical even to blue-skying me at this point, the notion might find application in related contexts. For example, perhaps a stream-based implementation involving Spark might aid in ensuring the quality of seismic data in near real time as it is acquired, or be used to assess the resolution adequacy in an area of heightened interest within a broader campaign.
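
The first possibility above, forward wavefields held in distributed memory and partitioned for locality, can be sketched without Spark itself. The following plain-Python stand-in is not Spark code: the helper names are hypothetical, and a list of dictionaries plays the role of the partitions. Under those assumptions, it mimics what hash-partitioning two keyed datasets with the same partitioner, matching them by time step, and summing the partial images would accomplish.

```python
def partition_by_key(records, num_partitions):
    """Place (time_step, snapshot) records into partitions by key.

    Hypothetical stand-in for hash-partitioning a keyed dataset; time steps
    are assumed to be non-negative integers.
    """
    partitions = [dict() for _ in range(num_partitions)]
    for time_step, snapshot in records:
        partitions[time_step % num_partitions][time_step] = snapshot
    return partitions


def partial_image(forward_part, receiver_part, nx):
    """Cross-correlate the snapshots that share a time step in one partition."""
    image = [0.0] * nx
    for time_step, forward in forward_part.items():
        receiver = receiver_part.get(time_step)
        if receiver is None:
            continue  # no matching receiver snapshot for this time step
        for i in range(nx):
            image[i] += forward[i] * receiver[i]
    return image


def image_from_partitions(forward, receiver, nx, num_partitions=4):
    """Partition both wavefields identically, image locally, then sum."""
    forward_parts = partition_by_key(forward, num_partitions)
    receiver_parts = partition_by_key(receiver, num_partitions)
    image = [0.0] * nx
    for f_part, r_part in zip(forward_parts, receiver_parts):
        local = partial_image(f_part, r_part, nx)
        image = [a + b for a, b in zip(image, local)]
    return image
```

Because both wavefields go through the same partitioner, the snapshots for a given time step land in the same partition, so the cross-correlation needs no data movement until the final sum. In Spark proper, the corresponding operations on pair RDDs would be along the lines of partitionBy, join and reduce.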

Incorporating Spark into Your IT Environment

Whether you’re a boutique outfit, a multinational corporation, or something in between, you have an incumbent legacy to consider in upstream-processing workflows for petroleum exploration. Therefore, introducing technologies from Big Data Analytics into your existing HPC environments is likely to be deemed unwelcome at the very least. However, based on a number of discussions at the Rice workshop, and elsewhere in the Houston oil patch, there are a number of reasons why Spark presents as more appealing than Hadoop in complementing existing IT infrastructures:

Spark can likely make use of your existing file systems;

Spark will integrate with your HPC workload manager;

Spark can be deployed alongside your HPC cluster;

You can likely use your existing code with Spark;

You could run Spark in a public or private cloud, or even a (Docker) container;

Spark is not a transient phenomenon – despite the name; and finally

Spark continues to improve.

Conclusions

Briefly, in conclusion:

RTM has a past, present and future of being inherently performance-challenged. This means that algorithms research remains topical. Noteworthy gains are being made through the use of GPU programmability involving CUDA.

Using some ‘novel exploitation’ of HDFS and YARN, Hadoop might afford some performance-related benefits – especially if diskless HDFS is employed. Performance aside, the analytics upside for Hadoop is arguably comparable to that of Spark, even though there would be a need to make use of a number of separate and distinct applications in the Hadoop case.

Spark is much easier to integrate with an existing HPC IT infrastructure – mostly because Spark is quite flexible when it comes to file systems. Anecdotal evidence suggests that this is a key consideration for organizations involved in petroleum exploration, as they have incumbent storage solutions in which they have made significant and repeated investments. Spark has eclipsed Hadoop in many respects, and the risk of adoption can be mitigated on many fronts.

From in-memory data distributed in a fault-tolerant fashion across a cluster, to analytics-based imaging conditions, to refactored modeling kernels, to possibilities involving data streaming, Spark introduces a number of possibilities that are already demanding the attention of those involved in processing seismic data.

In making use of Spark in the RTM context, there is the potential for significant depth and breadth. Of course, the application of Spark beyond RTM serves only to deepen and broaden the possibilities. Spark is based on sound research in computer science. It has developed into what it is today on the heels of collaboration. That same spirit of collaboration is now required to determine how and when Spark will be applied in the exploration for petroleum, in other areas of the geosciences, as well as in other industries – possibilities for which have been enumerated elsewhere.

Shameless plug: Interested in taking Spark for a test drive? With Bright Cluster Manager for Apache Hadoop all you need is a minimal amount of hardware on the ground or in the cloud. Starting with bare metal, Bright provides you with the entire system stack from Linux through HDFS (or alternative) all the way up to Spark. In other words, you can have your test environment for Spark in minutes, and get cracking on possibilities for Spark-enabling RTM or almost any other application.

Last week, I attended Bio-IT World 2013 in Boston. Bright had an excellent show – lots of great conversations, and even an award!

During numerous conversations, the notion of extending on-site IT infrastructure into the cloud was raised. Bright has an excellent solution for this.

What also emerged during the conversations were two uses for this extension of local IT resources via the cloud. I thought this was worth capturing and sharing. You can read about the use cases I identified over at “On the Bright side …”

Monday (October 1, 2012), I intend to use a pencast during my lecture – to introduce aspects of the stability of Earth’s atmosphere. I’ll try to share here how it went. For this intended use of the pencast, I will use a landscape mode for presentation – as I expect that’ll work well in the large lecture hall I teach in. I am, however, a little concerned that the lines I’ll be drawing will be a little too thin/faint for the students at the back of the lecture theatre to see …

I followed through as advertised (above) earlier today.

My preliminary findings are as follows:

The visual aspects of the pencast are quite acceptable – This is true even in large lecture halls such as the 500-seat Price Family Cinema at York University (pictured above) in Toronto, Canada where I am currently teaching. I used landscape mode for today’s pencast, and zoomed it in a little. A slightly thicker pen option would be wonderful for such situations … as would different pen colours (the default is green).

The audio quality of the pencasts is very good to excellent – Although my Livescribe pen came with a headset/microphone, I don’t use it. I simply use the built-in microphone on the pen, and speak normally when I am developing pencasts. Of course, the audio capabilities of the lecture hall I teach in are most excellent for playback!

One-to-many live streaming of pencasts works well – I streamed live directly from myLivescribe today. I believe the application infrastructure is based largely on Adobe Flash and various Web services delivered by Web Objects. Regardless of the technical underpinnings, live streaming worked well. Of course, I could’ve developed a completely self-contained PDF file, downloaded this, and run the pencast locally using Adobe Reader.

Personal pencasting works well – I noticed that a number of students were streaming the pencast live for themselves during the lecture. In so doing, they could control interaction with the pencast.

Anecdotally, a few students mentioned that they appreciated the pencast during the break period – my class meets once per week for a three-hour session.

Although I’ve yet to hear this feedback directly from the students, I believe I need to:

Previous approach – Project an illustration taken directly from the course’s text. This is a professionally produced, visually appealing, detailed, end-result, static diagram that I embedded in my presentation software. (I use Google Docs for a number of reasons.) Using a laser pointer, my pedagogy called for a systematic deconstruction of this diagram – hoping that the students would be engaged enough to actually follow me. Of course, in the captured versions of my lectures, the students don’t actually see where I’m directing the laser pointer. The students have access to the course text and my lecture slides. I have no idea if/how they attempt to ingest and learn from this approach.

Pencasting – As discussed elsewhere, the starting point is a blank slate. Using the pencasting technology, I sketch my own rendition of the illustration from the text. As I build up the details, I explain the concept of stability analyses. Because the sketch appears as I speak, the students have the potential to follow me quite closely – and if they miss anything, they can review the pencast after class at their own pace. The end result of a pencast is a sketch that doesn’t hold a candle to the professionally produced illustration provided in the text and my lecture notes. However, to evaluate the pencast as merely a final product, I believe, misses the point completely. Why? I believe the pencast is a far superior way to teach and to learn in situations such as this one. Why? I believe the pencast allows the teacher to focus on communication – communication that the learner can also choose to be highly receptive to, and engaged by.

I still regard myself as very much a neophyte in this arena. However, as the above final paragraphs indicate, pencasting is a disruptive innovation whose value in teaching/learning merits further investigation.

I first heard about it a few years ago, and thought it sounded interesting … and then, this past Summer, I did a little more research and decided to purchase a Livescribe 8 GB Echo(TM) Pro Pack. Over the Summer, I took notes with the pen from time-to-time and found it to be somewhat useful/interesting.

Decent-quality pencasts can be produced with minimal effort – I figured out the basics (e.g., how to record my voice) in a few minutes, and started on my first pencast. Transferring the pencast from the pen to the desktop software to the Web (where it can be shared with my students) also requires minimal effort. “Decent quality” here refers to both the visual and audio elements. The fact that this is both a very natural (writing with a pen while speaking!) and speedy (efficient/effective) undertaking means that I am predisposed towards actually using the technology whenever it makes sense – more on that below. Net-net: This solution is teacher-friendly.

Pencasts complement other instructional media – This is my current perspective … Pencasts complement the textbook readings I assign, the lecture slides plus video/audio captures I provide, the Web sites we all share, the Moodle discussion forums we engage in, the Tweets I issue, etc. In the spirit of blended learning it is my hope that pencasts, in concert with these other instructional media, will allow my TAs and me to ‘reach’ most of the students in the course.

Pencasts allow the teacher to address both content and skills-oriented objectives – Up to this point, my pencasts have started from a blank page. This forces me to be focused, and systematically develop towards some desired content (e.g., conceptually introducing the phase diagram for H2O) and/or skills (e.g., how to calculate the slope of a line on a graph) oriented outcome. Because students can follow along, they have the opportunity to be fully engaged as the pencast progresses. Of course, what this also means is that this technology can be effective not only in the first-year university level course I’m currently teaching, but also at the academic levels that precede (e.g., grade school, high school, etc.) and follow (senior undergraduate and graduate) this level.

Pencasts are learner-centric – In addition to being teacher-friendly, pencasts are learner-centric. Although a student could passively watch and listen to a pencast as it plays out in a linear, sequential fashion, the technology almost begs you to interact with it. As noted previously, this means a student can easily replay some aspect of the pencast that they missed. Even more interestingly, however, students can interact with pencasts in a random-access mode – a mode that would almost certainly be useful when they are attempting to apply the content/skills conveyed through the pencast to a tutorial or assignment they are working on, or a quiz or exam they are actively studying for. It is important to note that both the visual and audio elements of the pencast can be manipulated with impressive responsiveness to random-access input from the student.

I’m striving for authentic, not perfect pencasts – With a little more practice and some planning/scripting, I’d be willing to bet that I could produce an extremely polished pencast. Based on past experience teaching today’s first-year university students, I’m fairly convinced that this is something they couldn’t care less about. Let’s face it, my in-person lectures aren’t perfectly polished, and neither are my pencasts. Because I can easily go back to existing pencasts and add to them, I don’t need to fret too much about being perfect the first time. Too much time spent fussing here would diminish the natural and speedy aspects of the technology.

Findings aside, on to samples:

Calculating the lapse rate for Earth’s troposphere – This is largely a skills-oriented example. It was my first pencast. I returned twice to the original pencast to make changes – once to correct a spelling mistake, and the second time to add in a bracket (“Run”) that I forgot. I communicated these changes to the students in the course via an updated link shared through a Moodle forum dedicated to pencasts. If you were to experience the updates, you’d almost be unaware of the lapse of time between the original pencast and the updates, as all of this is presented seamlessly as a single pencast to the students.

Introducing the pressure-temperature phase diagram for H2O – This is largely a content-oriented example. I got a little carried away in this one, and ended up packing in a little too much – the pencast is fairly long, and by the time I’m finished, the visual element is … a tad on the busy side. Experience gained.

Anecdotally, initial reaction from the students has been positive. Time will tell.

Next steps:

Monday (October 1, 2012), I intend to use a pencast during my lecture – to introduce aspects of the stability of Earth’s atmosphere. I’ll try to share here how it went. For this intended use of the pencast, I will use a landscape mode for presentation – as I expect that’ll work well in the large lecture hall I teach in. I am, however, a little concerned that the lines I’ll be drawing will be a little too thin/faint for the students at the back of the lecture theatre to see …

I have two sections of the NATS 1780 Weather and Climate course to teach this year. One section is taught the traditional way – almost 350 students in a large lecture theatre, 25-student tutorial groups, supported by Moodle, etc. In striking contrast to the approach taken in the meatspace section, is the second section where almost everything takes place online via Moodle. Although I have yet to support this hypothesis with any data, it is my belief that these pencasts are an excellent way to reach out to the students in the Internet-only section of the course. More on this over the fullness of time (i.e., the current academic session.)

Feel free to comment on this post or share your own experiences with pencasts.

[With apologies for the situational monsoonal imagery …] As I awash myself in Aakash, I am particularly taken by:

The order of magnitude reduction in price point. With a stated cost of about $50, marked-up prices are still close to an order of magnitude more affordable than the incumbent offerings (e.g., the iPad, Android-based tablets, etc.). Even Amazon’s Kindle Fire is 2-3 times more expensive.

The adoption of Android as the innovation platform. I take this as yet another data point (YADP) in firmly establishing Android as the leading future-proofed platform for innovation in the mobile-computing space. As Aakash solidly demonstrates, it’s about the all-inclusive collaboration that can occur when organizational boundaries are made redundant through use of an open platform for innovation. These dynamics just aren’t the same as those that would be achieved by embracing proprietary platforms (e.g., Apple’s iOS, RIM’s QNX-based O/S, etc.).

The Indian origin. It took MIT Being Digital, in the meatspace personage of Nicholas Negroponte, to hatch the One Laptop Per Child initiative. In the case of Aakash, this is grass-roots innovation that has Grameen Bank-like possibilities.

“An innovation that is disruptive allows a whole new population of consumers access to a product or service that was historically only accessible to consumers with a lot of money or a lot of skill. Characteristics of disruptive businesses, at least in their initial stages, can include: lower gross margins, smaller target markets, and simpler products and services that may not appear as attractive as existing solutions when compared against traditional performance metrics.”

I am certainly looking forward to seeing this evolve!

Disclaimers:

Like Aakash, I am of Indian origin. My Indian origin, however, is somewhat diluted by some English origin – making me an Anglo-Indian. Regardless, my own origin may play some role in my gushing exuberance for Aakash – and hence the need for this disclaimer.

I am the owner of a Motorola Xoom, but not an iPad. This may mean I am somewhat predisposed towards the Android platform.

Feel free to chime in with your thoughts on Aakash by commenting on this post.