A few years ago, O’Reilly became interested in health topics, running the Strata Rx conference, writing the report How Data Science is Transforming Health Care: Solving the Wanamaker Dilemma, and publishing Hacking Healthcare. Our social network grew to include people in the health care space, informing our nascent thoughts about data in the age of the Affordable Care Act and the problems and opportunities facing the health care industry. We had the notion that aggregating data from traditional and new device-based sources could change much of what we understand about medicine — thoughts now captured by the concept of “precision medicine.”

From that early thinking, we developed the framework for a grant with the Robert Wood Johnson Foundation (RWJF) to explore the technical, organizational, legal, privacy, and other issues around aggregating health-related data for research — and to provide empirical lessons for other organizations pushing data initiatives in health care. Our new free report, Navigating the Health Data Ecosystem, begins the process of sharing what we’ve learned.

After decades of maturing in more aggressive industries, data-driven technologies are being adopted, developed, funded, and deployed throughout the health care market at an unprecedented scale. February 2015 marked the inaugural working group meeting of the newly announced NIH Precision Medicine Initiative, designed to aggregate a million-person cohort with dense longitudinal genotype/phenotype data, in which donors provide researchers with the raw epidemiological evidence to develop better decision-making, treatments, and potential cures for diseases like cancer. In the past several years, many established companies and new startups have also started to apply collective intelligence and “big data” platforms to health and health care problems. All these efforts encounter a set of unique challenges that experts coming from other disciplines do not always fully appreciate.

In 2014, the Robert Wood Johnson Foundation funded the subject of this report, a research effort called “Operationalizing Health Data,” as a deep dive into the health care ecosystem, focused on understanding and advancing the integration of personalized health data in both clinical and research organizations. RWJF encouraged the small group of data scientists, innovators, and health researchers working on the grant to find and prototype concrete solutions for several partner organizations trying to leverage the value of health data. As a result, our research aims to empirically inform innovation teams, often coming from industries outside health care, about the messy details of using and making sense of data in the heavily regulated hospital IT environment.

The first in our series of four reports covering findings from the grant describes key lessons the project identified across six major facets of the health data ecosystem: Complexity, Computing, Context, Culture, Contracts, and Commerce. In future reports, we will focus on specific tactical challenges the project team addressed.

What the data world can learn from the fashion industry
http://radar.oreilly.com/2014/10/what-the-data-world-can-learn-from-the-fashion-industry.html
October 8, 2014

At O’Reilly Research, we focus our attention on trends in technology adoption — which tools are adopted and in which industries. In doing so, we uncover interesting cross-disciplinary opportunities and discover what we can learn from innovations in other fields.

We’ve recently learned about the increasing role of data in the fashion industry, so we set out to uncover some of the players who are making disruptive changes using technology and analytics.

We were struck by a key concept from this research: fashion innovators spend time listening to their customers and turning what they hear into data. The result is data that is critical to understanding what fashion consumers are interested in, how they engage, and what motivates them.

For example, some fashion startups are offering consumers value in exchange for providing fine-grained and personal data (such as sizing and preference information) — and in turn, consumers might receive free products or personalized style recommendations.

“Most companies — Google, Yahoo!, Netflix — use what they call inferred attributes: they guess. We don’t guess — we ask,” says Eric Colson, who spent six years working at Netflix before becoming the chief algorithms and analytics officer at Stitch Fix, a personalized online shopping and styling service for women. In order for customers to willingly answer personal questions, there needs to be both a value proposition and a certain amount of trust.

This study into the business of fashion has revealed a major takeaway relevant across industries: conversations generate data that unlock insights into human behavior — and those insights might not be available through any other source.

Join us at Strata + Hadoop World in New York, October 15-17, 2014, and save 20% on registration with the code FASHIONDATA. Learn how data can help you tell better brand stories, explore the importance of data in wearable technology, and see how startups are making use of both online and offline data collection. You can check out all the fashion-related sessions on the Strata + Hadoop World website. If you’re coming to the event, grab a drink at the Fashioning Data Mixer on Thursday, October 16, from 6-7 p.m., at the O’Reilly booth.

Four data themes to watch from Strata + Hadoop World 2012
http://radar.oreilly.com/2012/11/four-data-themes-to-watch-from-strata-hadoop-world-2012.html
November 8, 2012

At our successful Strata + Hadoop World conference (which also managed to dodge Hurricane Sandy), a few themes emerged that resonated with my interests and experience as a hands-on data analyst and as a researcher who tracks technology adoption trends. Keep in mind that these themes reflect my personal biases. Others will have a different take on their own key takeaways from the conference.

1. In-memory data storage for faster queries and visualization

Interactive or real-time query over large datasets is seen as a key to analyst productivity (real-time meaning query times fast enough to keep the user in the flow of analysis, from sub-second to a few minutes). Existing large-scale data management schemes aren’t fast enough, and they reduce analytical effectiveness when users can’t explore the data by quickly iterating through various queries. We see companies with large data stores building out their own in-memory tools (e.g., Dremel at Google, Druid at Metamarkets, and Sting at Netflix), and new tools arriving, like Cloudera’s Impala (announced at the conference), UC Berkeley AMPLab’s Spark, SAP HANA, and Platfora.

We saw this coming a few years ago, when analysts we pay attention to started building their own in-memory data store sandboxes, often in key/value data management tools like Redis, to make sense of new, large-scale data stores. I know from my own work that there’s no better way to explore a new or unstructured data set than to quickly run a series of iterative queries, each informed by the last.
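To make that sandbox idea concrete, here is a minimal sketch of the pattern, assuming a local Redis instance and the redis-py client; the event records and key names are hypothetical stand-ins for a real data dump.

```python
import redis  # redis-py client; assumes a local Redis server on the default port

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Load a (hypothetical) raw event extract once, keeping a few aggregates
# in memory so follow-up questions come back in milliseconds.
events = [
    {"user": "u1", "action": "view", "sku": "a100"},
    {"user": "u2", "action": "buy",  "sku": "a100"},
    {"user": "u1", "action": "buy",  "sku": "b200"},
]
for e in events:
    r.hincrby("actions_by_sku:" + e["sku"], e["action"], 1)  # per-SKU action counts
    r.sadd("users_seen", e["user"])                          # distinct users

# Iterative exploration: each quick query suggests the next one.
print(r.hgetall("actions_by_sku:a100"))  # {'view': '1', 'buy': '1'}
print(r.scard("users_seen"))             # 2
```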

2. SQL and SQL-like tools matter

We see Strata attendees maturing their large-scale analysis infrastructures toward democratizing access to data via high-level SQL and SQL-like tools. As with in-memory data storage, we see both high-functioning data companies and tool vendors working to build more SQL and SQL-like access to large-scale data stores. A common architecture mentioned at the conference coupled Hadoop for data ingestion and prep with a relational or SQL-like interface (e.g., Hive) to provide widespread access to the data. From the vendor side, we see Cloudera’s Impala, a distributed parallel SQL query engine; Hadapt integrating Hadoop and SQL; and AMPLab’s Shark, a Hive-compatible interface to Spark.
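As a rough illustration of that Hadoop-plus-SQL-interface pattern, here is a hedged sketch of querying a Hive table from Python. The host, database, table, and columns are hypothetical, and PyHive is just one of several client options.

```python
from pyhive import hive  # PyHive DBAPI client, one way to reach HiveServer2

# Hypothetical endpoint and table; assume Hadoop jobs have already ingested
# and prepped the raw logs into this table.
conn = hive.connect(host="hive.example.com", port=10000, database="weblogs")
cur = conn.cursor()

# Analysts who know SQL can query the distributed store directly;
# no MapReduce code is required for routine questions.
cur.execute("""
    SELECT referrer_domain, COUNT(*) AS hits
    FROM page_views
    WHERE view_date >= '2012-10-01'
    GROUP BY referrer_domain
    ORDER BY hits DESC
    LIMIT 10
""")
for domain, hits in cur.fetchall():
    print(domain, hits)
```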

There’s still a need for constructs like MapReduce and Scala to support parallel programming algorithms. And HDFS seems the likely foundation for all manner of distributed data processing and tools for the next few years. However, there is too much existing investment in staff who know SQL and in SQL-oriented tools for the trend toward SQL-like access to large-scale, distributed data stores to be blunted.

3. The 80% rule for data preparation

Echoing a theme DJ Patil highlights in “Data Jujitsu,” many of the data analysts at Strata emphasized that 80% of analysis is data preparation — a ratio we see in our own data work. By data preparation, we mean acquiring, cleaning, transforming (including standardizing and normalizing values), and organizing data for analysis and for training machine learning and other algorithms. It’s hard work, and it isn’t the sexy part of the data science ecosystem, but these efforts are necessary to get reliable and effective results. In his “Of Rocket Ships and Washing Machines” conference keynote, Joe Hellerstein used the analogy of washing machines, and their contribution to productivity compared with rocket ships, to nicely illustrate the importance of data prep to the data space.
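For a small, concrete taste of what that 80% looks like in practice, here is a sketch using pandas; the columns, label mappings, and cleaning rules are hypothetical stand-ins for the messier real thing.

```python
import pandas as pd

# Hypothetical raw extract: inconsistent labels, stray whitespace, missing values.
raw = pd.DataFrame({
    "city":   ["SF", "San Francisco", " new york", None, "NYC"],
    "salary": ["90,000", "85000", None, "70000", "72,500"],
})

clean = raw.copy()

# Standardize labels so the same entity always gets the same value.
city_map = {"sf": "San Francisco", "san francisco": "San Francisco",
            "new york": "New York", "nyc": "New York"}
clean["city"] = clean["city"].str.strip().str.lower().map(city_map)

# Coerce numerics, then normalize to zero mean / unit variance for modeling.
clean["salary"] = pd.to_numeric(clean["salary"].str.replace(",", ""), errors="coerce")
clean["salary_z"] = (clean["salary"] - clean["salary"].mean()) / clean["salary"].std()

# Drop rows still missing required fields before they reach any algorithm.
clean = clean.dropna(subset=["city", "salary"])
print(clean)
```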

The more completely an organization understands the role and requirements of data prep as a key component of the analysis workflow, the more realistic it can be about what to expect from a data group and why it should invest in data prep productivity. We expect to hear more about better tools and techniques to support improved data prep productivity over the next few years.

4. Asking the right question

Effective analysis depends more on asking the right question or designing a good experiment than on tools and techniques. Large datasets provide the opportunity to take advantage of the “Unreasonable Effectiveness of Data” (Halevy, Norvig, Pereira), i.e., effective results from coupling large datasets with relatively simple algorithms. We think organizations that want to improve their ability to deploy data as an asset are best served by emphasizing the “art” of asking good questions and designing experiments. Effective analysis is difficult to “buy”; it typically requires adapting the culture toward learning, experimentation, and quantitative understanding.

A conference sub-theme of asking the right question raised another issue: how do we train and enable data talent? While no easy answers were offered, there seemed to be some agreement that:

Curious folks can make great use of, and build on, simple tools and techniques to become more effective data analysts.

Storytelling is a key capability for making good analysis useful to an organization.

Weaving the themes together

Looking at all four themes and other topics from Strata, we see the data ecosystem maturing and coalescing around analytic productivity as a prime driver of change — with more focus on better, faster access to data; more effective analysis; and improvements in how results are communicated and shared.

Let us know via the comments or email what themes you noticed and what you found most intriguing about the conference.

Strata + Hadoop World sessions of note

While there were many outstanding keynotes and sessions at Strata, here’s a list of a few that best informed the themes described above (session videos are available in the Strata + Hadoop World complete video compilation):

Using the analogy of washing machines as having a bigger productivity and cultural impact than rocket ships, Hellerstein explained the importance of increasing data prep productivity for data scientists.

By showing how Netflix pragmatically builds, learns and adapts its analytic infrastructure, Brown encapsulated how the data space has matured over the last few years — toward faster queries and more widespread access to data.

A mother/daughter presentation focused on how Nokia built its internal data team, with curiosity as the primary qualification. I was inspired by how Danielle Dean (the daughter) earnestly described her self-taught immersion in tools and techniques to pragmatically improve her ability to make sense of data.


Strata Conference Santa Clara, being held Feb. 26-28, 2013, in California, gives you the skills, tools, and technologies you need to make data work today.

Open source won
http://radar.oreilly.com/2012/07/open-source-won.html
July 30, 2012

I heard the comment a few times at the 14th OSCON: the conference has lost its edge. It resonated with my own experience — a shift in demeanor; a more purposeful, optimistic attitude; less itching for a fight. Yes, the conference has lost its edge; it doesn’t need one anymore.

Open source won. It’s not that an enemy has been vanquished or that proprietary software is dead; it’s that there’s not much left to argue about when it comes to adopting open source. After more than a decade of the low-cost, lean startup culture successfully developing on open source tools, open source is clearly a legitimate, mainstream option for technology tools and innovation.

And open source is not just for hackers and startups. A new class of innovative, widely adopted technologies has emerged from the open source culture of collaboration and sharing — turning the old model of replicating proprietary software as open source projects on its head. Think Git, D3, Storm, Node.js, Rails, Mongo, Mesos or Spark.

We see more enterprise and government folks intermingling with the stalwart open source crowd who have been attending OSCON for years. And, these large organizations are actively adopting many of the open source technologies we track, e.g., web development frameworks, programming languages, content management, data management and analysis tools.

MySQL appears as popular as ever and remains open source after three years of Oracle control. Microsoft is pushing open source JavaScript as a key part of its web development environment, along with more explicit support for other open source languages. Oracle and Microsoft are not likely to radically change their business models, but their recent efforts show that open source can work in many business contexts.

Even more telling: with so much of the consumer web undergirded by open source infrastructure, open source permeates most interactions on the web.

What does winning look like? Open source is mainstream and a new norm — for startups, small business, the enterprise, and government. Innovative open source technologies are creating new business sectors and ecosystems (e.g., the distributions, tools, and services companies building around Hadoop). And what’s most exciting is the prospect that the collaborative, sharing culture that permeates the open source community will spread to the enterprise and government with the same impact on innovation and productivity.

So, thanks to all of you who made the open source community a sustainable movement, the ones who were there when … and all the new folks embracing the culture. I can’t wait to see the new technologies, business sectors and opportunities you create.

7 emergent themes from Webstock reveal a framework
http://radar.oreilly.com/2011/03/7-webstock-themes.html
March 17, 2011

At this year’s Webstock in Wellington, New Zealand, there were no reports of bad acid or rainstorms to muck things up, just a few sore tummies from newcomers (like me) eating too many Pineapple Lumps.

Webstock wasn’t a rock concert, but a gathering of the geek tribes that lived up to its reputation with accomplished speakers, an interesting mix of topics, and a scenic venue on the harbor.

The conference organizers seated the speakers together, and they developed a nice camaraderie — unusual for a tech conference — that had them referencing and building on each other’s presentations. While the presenters came from a variety of backgrounds — engineering, design, and the arts — the talks were anchored in how their topics related to web and mobile business processes.

Deliver value early and build fast with a purpose

The best results come from agile-like processes and cultures that start small and iterate fast. Not surprisingly, this common theme was echoed most often by those with startup backgrounds. By limiting features, releasing fast, and learning, products and services can quickly become aligned with user needs. Resisting feature bloat is important to staying fast, especially after initial success exposes a product to what may be a large number of users.

The agile approach also lets organizations stay tuned into serendipity — uses and features that users discover that were not part of planned use scenarios (Twitter becoming a communications tool for activism is one example). Corollaries to building fast include failing fast, i.e., quickly acknowledging what doesn’t work, and using the scientific method to test and learn.

Scientific method

Elements of the scientific method — hypothesis setting, testing, learning, and adjusting — are just as valid for web performance tuning as for user interfaces and user comprehension. Embracing the scientific method can help an organization establish a learning culture. It’s not trivial: to gain the most from testing, you need familiarity with data and quantitative analysis, including visualization (charts), statistics, and machine learning.
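As a minimal illustration of that hypothesis-test-learn loop, here is a sketch of evaluating a two-variant test with SciPy’s chi-square test; the variants and conversion counts are hypothetical.

```python
from scipy.stats import chi2_contingency

# Hypothesis: the new signup page (variant B) converts better than the current one (A).
# Hypothetical A/B test counts: [converted, did_not_convert].
observed = [[120, 2880],   # variant A: 4.0% conversion
            [156, 2844]]   # variant B: 5.2% conversion

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"p-value: {p_value:.4f}")

# Learn and adjust: act on the change only if the evidence is strong enough.
if p_value < 0.05:
    print("Unlikely to be chance; ship variant B and form the next hypothesis.")
else:
    print("Inconclusive; refine the hypothesis or gather more data.")
```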

Web 2.0 Expo San Francisco 2011, being held March 28-31, will examine key pieces of the digital economy and the ways you can use important ideas for your own success.

Keep everything human-scaled

Acknowledge that the builders and users of products and services are people whose needs should be understood and accommodated. For user interfaces, human scale means keeping operations and processes simple, functionally consistent, cognitively easy, and intuitive. For engineers and managers, it means developing strategies that help align efforts, listening to users, and making fast build cycles possible. Communicating on a human scale is best done through narrative processes and storytelling, and by establishing a communications strategy. Apple products were often offered as examples of simple-to-use, human-scaled designs that delight users.

Communicating with stories

To effectively communicate with users — to share, to guide and convince, to engage — use storytelling techniques like narration and cause-and-effect structure. Stories are a way to cut through and make sense of the increasing volume and ubiquity of data and noise we experience via the web and mobile devices. Humans are wired to understand and remember stories — they make content and interfaces visceral, memorable, and viral (i.e., worth repeating). Different speakers evoked storytelling as a way to develop content strategies, as a way to stay connected with others, as an interface metaphor (e.g., using scrolling to represent the passage of time), and as a way to put data to use.

My background in data and analysis kept me keenly interested in David McCandless’ “Information is Beautiful” session, with its mix of analytics, visually rendered data, and storytelling. David does beautiful, playful, and insightful infographics that focus on expressing the relative magnitude of large numbers — especially numbers so large they are abstractions to most people — and on telling the stories in the data. His Debtris visualizations use a falling-block metaphor from the game Tetris to convey the size of the US and UK debts. David’s work is always designed to tell a story, keeping users compelled to drill further and stay engaged. It’s worth noting that David shared some failures and what he learned from them — the scientific method in action. David, who has a journalism background, calls his mix of charts and narration a “charticle” — a concept worth remembering when teasing a story out of data.

Users expect multi-touch and gestural interfaces

Multi-touch and gestural functions should be treated as the primary user interface, not functionality tacked onto conventional interfaces. If touch and gestures can do the job, there’s no reason to keep a keyboard or mouse. Users see multi-touch and gestures as intuitive and the “right” interface for many tasks, expecting them on many devices, not just mobile phones and tablets. Vendors who fail to fully embrace multi-touch and gestures as the primary interface to devices and services do so at their own peril.

Simple is hard

It’s hard work building the right amount of functionality, providing the optimal amount and frequency of content, creating devices that feel right, and designing user interfaces that are natural and “disappear.” Making a simple experience for users, one that takes away the cognitive load of how to use a device so they can focus on why they are using it, takes effort: sweating the details, thinking, testing, and creativity. To succeed at simple, build fast and use scientific learning processes to refine and improve interfaces. Spend as much time taking out what does not work as building on what does.

Find inspiration everywhere

The presenters commonly described creative flashes inspired by their own needs and from their curiosity about topics outside their primary expertise. Creativity often gets sparked by making non-obvious connections.

Build fast, learn, simplify, tell stories, stay curious — taken together, these themes provide a useful framework for quickly building the next generation of services, applications and devices that become the warp and woof of our increasingly digital and mobile lives.

Need faster machine learning? Take a set-oriented approach
http://radar.oreilly.com/2011/01/faster-machine-learning.html
January 28, 2011

We recently faced the type of big data challenge we expect to become increasingly common: scaling up the performance of a machine learning classifier for a large set of unstructured data.

Machine learning algorithms can help make sense of data by classifying, clustering, and summarizing items in a data set. In general, performance has limited the opportunities to apply machine learning to big or messy data sets: analysts must budget time for speeding up off-the-shelf algorithms, or even question whether a machine learning pass will complete in a timely manner. While using smaller random samples can help mitigate performance issues, some analyses yield better results when run against more data.

Here we share our experience implementing a set-oriented approach to machine learning that led to huge performance increases (more detail is available in a related post at O’Reilly Answers). Applying a set-oriented approach can help you expand the opportunities to gain the benefits of machine learning on larger, unstructured and complex data sets.

We are working with the US Department of Health and Human Services (HHS) on a project to look for trends in demand for jobs related to Electronic Medical Records (EMR) and Health Information Technology (HIT). The twist, and the reason we decided to build a classifier, is that we wanted to separate jobs for those using EMR systems from those building, implementing, running and selling EMR systems. While many jobs easily fit in one of the two buckets, plenty of job descriptions had duties and company descriptions that made classifying the jobs difficult even for humans with domain expertise.

Identifying the approximately 400,000 jobs with EMR and related references was achieved with high accuracy using a regular-expression rule base. All the job description data is stored on a multi-node Greenplum Massively Parallel Processing (MPP) database cluster, running a Postgres engine. Having an MPP database has been critical for analyzing the large, 1.4-billion-record data set we work with — we can generally run investigative queries against the full data set in minutes.
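As a hypothetical sketch of what a slice of such a rule base might look like (the real rules were far larger and carefully tuned), consider:

```python
import re

# A tiny, made-up slice of a regular-expression rule base for flagging
# EMR/HIT-related postings.
PHRASES = re.compile(
    r"\belectronic (medical|health) records?\b|\bhealth information technology\b",
    re.IGNORECASE,
)
ACRONYMS = re.compile(r"\b(EMR|EHR|HIT)\b")  # case-sensitive, so 'hit' won't match

def mentions_emr(job_description: str) -> bool:
    """True if the posting references EMR/HIT systems."""
    return bool(PHRASES.search(job_description) or ACRONYMS.search(job_description))

print(mentions_emr("Seeking RN familiar with electronic medical record workflows"))  # True
print(mentions_emr("Line cook needed for busy downtown bistro"))                     # False
```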

After some discussion, we decided a Naive Bayes classifier seemed appropriate for the task. While there are open source Python Naive Bayes classifiers available, such as those in NLTK and Orange, I decided to use the algorithm in Toby Segaran’s “Programming Collective Intelligence” so I could tweak the code and play with different feature arrangements. Toby does a great job of tying the code to the principles behind the Naive Bayes algorithm, and I thought that would help with modding and tuning the classifier for our purposes.

We had a tricky data set with categories that could be only subtly different. We wanted the classifier to be fast enough to iterate through the data many times so we could spend enough time training and tuning the algorithm to optimize classifier accuracy. Starting with a training set of 1,800 categorized jobs (phew…) and a random sample of 1,850 jobs, we set to work trying and reviewing different sets of feature combinations.

We ran into a Python-related problem early on that I think is worth sharing. Due to the large number of words in a job description, the probabilities used by the Naive Bayes algorithm get exceedingly small, so small that Python turned them into zero (floating-point underflow), making for suspiciously strange results. Luckily, I complained about this problem to a friend with a doctorate in math, who suggested taking the log of the probabilities, since the logs of very small numbers are not so small. Worked like a charm.
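Here is a toy illustration of the underflow and the log-space fix; the per-word probabilities are made up, but the floating-point behavior is real.

```python
import math

# Toy per-word probabilities for one class; a real job description contributes
# hundreds or thousands of factors, so the raw product underflows to 0.0.
word_probs = [1e-5] * 200

product = 1.0
for p in word_probs:
    product *= p
print(product)    # 0.0; the true value, 1e-1000, is far below float range

# Work in log space instead: the sum of logs preserves the class ranking.
log_score = sum(math.log(p) for p in word_probs)
print(log_score)  # about -2302.6, a perfectly representable number
```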

That’s when it hit me: job descriptions have lots of words, and those words are often not carefully entered. That creates a large set of words and probabilities to work with, slowing down the algorithm. And the algorithm was written to explain how Naive Bayes works, not for maximum efficiency. With training and classifying the sample data taking more than six hours, we needed to do something to speed up the process to handle all 400,000 records we wanted to classify.

I contacted Daisy Zhe Wang, an EECS doctoral student at UC Berkeley and a consultant at Bayes Informatics, because of her focus on scaling in-database natural language and machine learning algorithms.

Daisy, together with Bayes Informatics founder Milenko Petrovic, developed a set-oriented implementation of the Naive Bayes algorithm: it treats the data derived from the training set (features, i.e., words, and their counts) as a single entity, and it converts the classifier into Python user-defined functions (UDFs). Since Greenplum is a distributed database platform, the UDFs let us parallelize the classification process.

The result: the training set was processed and the sample data set classified in six seconds, and we classified the entire 400,000-record data set in under six minutes — a more than four-orders-of-magnitude (26,000-fold) improvement in records processed per minute. A process that would have run for days in its initial implementation now ran in minutes! The performance boost let us try out different feature options and thresholds to optimize the classifier. On the latest run, a random sample showed the classifier working with 92% accuracy.

My simple understanding of their approach is that the training set results are treated like a model and stored as a single row in the database. The model is parsed into a permanent Python data structure once, while each job description is parsed into a temporary data structure. The Python UDFs compare the words in the temporary structure to the words in the model. The result is one database read per job description and a single write once the probabilities are compared and the classification is assigned. That’s quite a contrast from reading and writing each word of the training set for every unassigned job.
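Based on that understanding, here is a rough, hypothetical sketch of the per-row scoring such a UDF might perform. The model layout, class labels, smoothing value, and example text are illustrative guesses, not the actual Bayes Informatics code.

```python
import math
from collections import Counter

# A simplified stand-in for the trained model: stored once as a single row
# in the database and parsed into this in-memory structure one time.
MODEL = {
    "use":   {"log_prior": math.log(0.5), "log_probs": {"nurse": -2.0, "chart": -2.3}},
    "build": {"log_prior": math.log(0.5), "log_probs": {"java": -1.9, "implement": -2.1}},
}
UNSEEN = -10.0  # smoothed log-probability for words absent from the training set

def classify(description: str) -> str:
    """The per-row work a UDF would do: tokenize once, score against the
    in-memory model, and return a single label to write back."""
    words = Counter(description.lower().split())
    scores = {
        label: m["log_prior"]
        + sum(n * m["log_probs"].get(w, UNSEEN) for w, n in words.items())
        for label, m in MODEL.items()
    }
    return max(scores, key=scores.get)

print(classify("Registered nurse to chart patient care in our EMR system"))  # use
print(classify("Java engineer to implement EMR integrations"))               # build
```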

Why does the set-oriented approach to machine learning matter? Performance and scale have long been obstacles to fully applying machine learning to large or unruly unstructured data sets. A set-oriented approach provides a straightforward way past those roadblocks, making machine learning a viable option for categorizing, clustering, or summarizing large data sets, or data sets with big chunks of data (e.g., long descriptions or items with large numbers of features).

With any data set, speeding up machine learning processes allows quicker iteration through the data. That creates room to run experiments that improve accuracy, and more time to focus on and interpret results to gain insights about the data. Quicker processing also lowers the risk of applying machine learning to new topics by reducing the time investment needed to determine whether the results are worthwhile.

Strata: Making Data Work, being held Feb. 1-3, 2011 in Santa Clara, Calif., will cover many similar topics related to big data, machine learning, analytics and visualization.

Ebooks and the threat from "internal constituencies"
http://radar.oreilly.com/2010/11/ebooks-and-the-threat-from-int.html
November 2, 2010

A recent New Yorker article by James Surowiecki on the problem of “internal constituencies” and how organizations respond to technology and market changes seems relevant to the ebook conundrum publishers are facing.

Surowiecki highlights how Blockbuster was unable to correctly value the assets that created the company’s initial success, especially when faced with insurgents like Netflix and Redbox. He ends with a warning for Netflix, and a look at the uncertain world of digital distribution to come.

Surowiecki’s summary of how Blockbuster overvalued its “clicks and mortar” strategy may provide a catalyst for publishers to examine the opportunities and threats from ebooks. Will internal constituencies bias how publishers value and compare print book and ebook assets and business models?

Subjective rambling on print and ebooks

Lessons from music and movies need to be tempered by the nature of the media. Vinyl, VCR tapes, CDs, and DVDs do nothing but carry information, and they require a player. While some may wax nostalgic for the media of their youth (yes, the pun is intentional; note the resurgence in vinyl record sales), the media of yore are awkward and limiting, particularly when compared to playing devices with enough built-in storage and connectivity to access everything users may ever want to hear or watch.

Print books and magazines — black print on white paper — create a uniquely effective reading platform that integrates both storage and a player in a convenient package. Reading material is also consumed differently from music and movies: books generally take longer to read and aren’t continually or frequently re-read, making the low storage density of books less of an issue than it is for music and movies. Ebooks and ebook readers are still maturing and may not yet be “good enough” to effectively replace the print book experience.

In books we may see a complementary relationship between print and electronic forms based on context, content, distribution and consumer usage. For example, students needing portable access to multiple textbooks may find the storage density of print a significant issue that pushes the adoption of ebooks (there’s a funny New Yorker cover showing a young girl with a backpack leading a mule, laden with books and school supplies).

Likewise, technical folks who want random access to a broad range of reference material may find the storage density, search capability and instant distribution of ebooks an unassailable benefit. For many others, the comfort and familiarity with print, the ereader experience, slow consumption rates, etc., may all continue to create demand for print books.

The audio book market may provide guidance as an alternate media channel that coexists, complementarily, with print books. Audio books have the same distribution characteristics as ebooks, as they are almost entirely distributed online these days. There’s also a market maker in Audible, whose prices likely provide reliable information about demand and price elasticity (Audible also offers flat-rate subscription pricing). Audible charges more for new books, best sellers, and evergreen sellers. Audio books are generally priced higher than Amazon’s print and ebook prices, anecdotally 10-40 percent higher for bestsellers. (A digression: researching Amazon pricing for best sellers and evergreen books shows about a 10 percent (+/- 7 percent) discount for ebooks compared to print books.)

Smart publishers can work to keep the price differential between print books and ebooks close, and learn how to segment price-insensitive consumers via temporal distribution strategies (e.g., early access), distribution channels (e.g., print on demand), content (e.g., colors and diagrams that don’t render as clearly electronically), and form factors, all of which can provide extra margins beyond the commodity, low-margin mass market.