Global High-Tech Innovation

July 08, 2013

In previous posts I explained how EMC leverages an analytic process to manage university research globally.
Truth be told, the analytic process actually measures global innovation;
university research is a subset of corporate innovation activities. The context of university research is worth
exploring on its own, however, and I took the opportunity to share our approach
with partner faculty and students at EMC’s 2nd Annual University
Day.

Analyzing university research side-by-side with other
corporate innovation activities has its advantages. In my last post, I shared
specific data about a list of Chinese researchers who are actively involved in
local university research. The pie chart
below highlights the set of researchers that collaborated with Chinese
universities for a specific time period.

In addition to participating in local university research,
the engineers at EMC Labs China are also actively involved in the yearly EMC
idea contest known as the Innovation Roadmap. Historical idea submissions from
global employees (8,000+ ideas and growing) are also stored in this innovation
database.

Employees are encouraged to submit their ideas as “teams”.
In fact, diversity of team members, whether it be geographic or by function, is
highly encouraged. This diversity is a leading cause of
increased idea quality, as I’ve discussed in a recent post. As a result of storing many years’ worth of
historical idea submissions, we can begin to visualize team submissions using
social network analysis. For example,
the social graph below focuses on Chinese employees that continually surface as
“strong collaborators” in the area of idea submission.

In the graph above it is apparent that Jidong Chen, for example, has
a network of collaborators with which he submits ideas. Chances are good that
Jidong, during discussions with his peers, is sharing the work of his
university counterparts either directly or indirectly.

A more important question, in my mind, is whether Jidong is
sharing knowledge across boundaries. These boundaries could be
geographic, they could be technology related (e.g. security researchers, compression researchers,
etc), or they could be by function (collaborating with marketing, HR, finance, etc).

In order to validate that any given EMC employee is indeed
crossing boundaries in the transfer of university research knowledge, the
analysis was run again with color-coded values representing country of origin.
The graph below is a zoomed-out picture of the same chart, with color coding of
each individual by geography.

The yellow dots represent Chinese employees. The yellow
circles with red boundaries correspond to the red circles in the previous
chart. One can readily see that Chinese idea submitters not only cluster
together, but they bridge to other geographies as well. The colors of
these circles represent employees in France, Israel, Australia, and the United
States. As a result, certain Chinese
employees have extremely high betweenness ranks, which means that they are
strong candidates to transfer their knowledge to other countries.

Whether or not they actually DO transfer that knowledge is a
different thing altogether. Is it enough to know that the potential is
there? The answer is no. However, the knowledge of betweenness allows a
centralized innovation program to guide good behaviors when it comes to global
knowledge transfer. The analytic results give us all the pieces to put a plan
into action.
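To make the idea of a betweenness rank concrete, here is a minimal sketch in Python. It is not the analysis EMC ran: the graph, the node names (cn1, us1, and so on), and the choice of Brandes' algorithm on an unweighted, undirected graph are all assumptions made for illustration. Each edge stands for "these two people submitted an idea together".

```python
from collections import deque

def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph.

    graph maps each node to the set of its neighbours.
    Returns the unnormalised betweenness score of every node.
    """
    score = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search from source, counting shortest paths.
        order = []
        preds = {v: [] for v in graph}
        paths = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        dep = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                dep[v] += paths[v] / paths[w] * (1.0 + dep[w])
            if w != source:
                score[w] += dep[w]
    # Every undirected pair was counted once from each endpoint.
    return {v: s / 2.0 for v, s in score.items()}

# Hypothetical co-submission edges: three employees in China, two in the US.
edges = [("cn1", "cn2"), ("cn1", "cn3"), ("cn2", "cn3"),
         ("cn3", "us1"), ("us1", "us2")]
graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

scores = betweenness(graph)
for person in sorted(scores, key=scores.get, reverse=True):
    print(person, scores[person])
```

In this toy graph cn3 is the only person connecting the Chinese cluster to the US cluster, so cn3 receives the highest score. A high score only says that the shortest paths between other people run through this person; as noted above, it says nothing about whether knowledge actually flows along those paths.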

This post has highlighted the ability to explore
connectedness and collaboration emanating from an EMC employee
conducting local university research. Knowledge expands locally, and then it has the potential to be transferred globally.

What about the reverse of this process? Can any global
researcher correlate their research efforts to a remote employee like Jidong?

In a future post I will examine analytic techniques to
enable this behavior.

July 01, 2013

I’ve been writing several blog posts recently that describe
EMC’s use of analytics to accelerate the innovation coming
from our university research partners worldwide.

The geographic locations of many (but not all) of our
university research partners are depicted below:

In addition, in my last post I described how EMC tries to
keep track of “what” research activities our global partners are working on
through the use of Stanford’s Topic Modeling Toolbox:

In this post I’d like to shed a little more light on how to take these research themes and map them to specific geographic regions. In
order to do this, it helps to describe the inner workings of EMC's innovation analytics framework in general.

The framework is founded upon activities that are closely tied to innovation, including:

University Research

Publications

Conferences

Customers/Partner engagements

Knowledge Transfer/Brown Bag sessions

Employee ideas

Intellectual property

Etc.

Visually these activities can be depicted as follows:

The analytic framework is essentially a data gathering
process that can record these activities via a variety of methods, including (a) manually, (b) via email, (c) via crawling of a file
system, or (d) as part of Outlook calendar invites. No matter what the source of
ingest, all of the innovation activities (including university research), are funneled into an analytic sandbox, which stores both structured and
unstructured content (the graphic below describes this approach, and was previously described in a series of posts on
the data analytic lifecycle).

The beauty of this approach is that the geographic location
of the structured and unstructured data is preserved during the ingest phase,
and thus available for analytic queries. For example, the diagram below
highlights an answer to the question: “What types of research has EMC funded
recently in Russia?”

The resulting map and word cloud show that
compression research is occurring in Saint Petersburg, Russia. This is due in large part to the strong
mathematical skills of the EMC employees and universities in that region.
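The kind of query described above can be sketched in a few lines of Python. This is an illustration only: the record fields (kind, country, city, text), the sample records, and the stopword list are invented, and plain Python lists stand in for the analytic sandbox. The point is simply that once location is stored with each activity, a regional word count (the raw material for a word cloud) becomes a filter followed by a tally.

```python
from collections import Counter

# Hypothetical innovation activities; field names and values are made up.
activities = [
    {"kind": "university_research", "country": "Russia",
     "city": "Saint Petersburg",
     "text": "lossless compression algorithms for backup streams"},
    {"kind": "university_research", "country": "Russia",
     "city": "Saint Petersburg",
     "text": "compression rates and entropy coding"},
    {"kind": "publication", "country": "China", "city": "Shanghai",
     "text": "big data analytics for cloud storage"},
]

# A tiny, assumed stopword list so filler words do not dominate the counts.
STOPWORDS = {"and", "for", "the", "of"}

def word_counts(records, country):
    """Tally the words used in activities recorded for one country."""
    counts = Counter()
    for record in records:
        if record["country"] != country:
            continue
        for word in record["text"].lower().split():
            if word not in STOPWORDS:
                counts[word] += 1
    return counts

russia = word_counts(activities, "Russia")
print(russia.most_common(3))
```

With these sample records, "compression" is the most frequent term for Russia, which mirrors the word cloud described above.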

I recently contributed an article that described my own
personal empirical data about innovation at our global R&D locations. The
approach described above is an alternative, data-driven approach to classifying a company's global innovation activities.

Who are the EMC employees conducting the research? Can
the analytic framework drill down to the employee level and discover which EMC mathematicians
are involved with this compression collaboration, and/or which Russian
employees participate in university research in general?

The answer is yes, and more detail will be described in an
upcoming post.

June 27, 2013

During a lecture at EMC’s 2nd Annual
University Day on Monday, I held a dialogue with faculty and students gathered at
our Santa Clara campus. I described how EMC uses an analytics framework (Pivotal/Greenplum) to accelerate the innovation that emerges from our global academic research partners. In
particular, I highlighted the following capabilities of our innovation
analytics platform:

A visualization of the “types” of research
currently active in our portfolio (e.g. solid state storage, analytics, etc).

A visualization of the “types” of research by
region (e.g. where in the world do we research compression technology?)

Who are EMC’s key researchers in any given
region?

Which researchers are the best at transferring
knowledge out of their region?

For any given EMC researcher, what type(s) of
research do they conduct?

What is the complete list of EMC employees, per
region, that are involved in any form of university research?

How can global EMC employees advance their own
ideas by locating relevant university research?

How do we augment university research with other
external employee connections (e.g. programmatically leverage their Twitter
connections)?

In this post I’d like to focus on the first bullet. EMC has
dozens of university research partnerships worldwide. How do we dynamically
visualize the current areas of exploration that are occurring across the globe
at any given point in time? How can we determine which strategic research areas
have strong coverage and which areas may have no coverage at all?

These questions are currently answered through our use of
the functionality provided within the Stanford Topic Modeling Toolbox. The
diagram below helps explain our use of this tool:

The Topic Modeling Toolbox analyzes the analytic repository
containing university research activity. Data
Scientists within EMC (working as part of our EMC Labs China team) categorize
these research activities by providing the toolbox with the number of topics to find (e.g. N = 25).
The toolbox then classifies each research activity into one of
those N buckets.

Once the toolbox has taken a pass at every document, the bar
graph above shows the level of activity for each “class” of research
initiative. I asked our Data Science team to run a simple word cloud
algorithm over each category, and it is fairly easy to see at a high level
that Topic 01 has a cloud focus, while Topic 12 has a Big Data focus.
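The step that follows topic assignment is easy to sketch. The Python below does not perform topic modeling and does not use the Stanford toolbox; it assumes the assignments already exist and shows how the bar graph (documents per topic), the word clouds (frequent words per topic), and the coverage gaps could be derived from them. The sample documents and their topic numbers are invented.

```python
from collections import Counter, defaultdict

NUM_TOPICS = 25  # the N given to the topic model

# Hypothetical output of a topic model: (topic id, document text) pairs.
assigned = [
    (1, "private cloud provisioning"),
    (1, "hybrid cloud migration"),
    (12, "big data analytics pipeline"),
    (12, "big data workload scheduling"),
    (12, "data placement for analytics"),
    (2, "tape archive format"),
]

# Bar graph data: number of research activities per topic.
activity = Counter(topic for topic, _ in assigned)

# Word cloud data: word frequencies within each topic.
words = defaultdict(Counter)
for topic, text in assigned:
    words[topic].update(text.split())

most_active = activity.most_common(1)[0][0]
least_active = min(activity, key=activity.get)
# Topics that received no documents at all, i.e. areas with no coverage.
uncovered = [t for t in range(1, NUM_TOPICS + 1) if activity[t] == 0]

print("most active topic:", most_active)
print("least active topic with any activity:", least_active)
print("topics with no coverage:", len(uncovered))
```

Restricting the input to a date range (for example, the first half of a year) before counting would give the per-period view of most and least active topics.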

Furthermore, if the data above was analyzed in a given time
frame (e.g. the first half of 2013), Topic 22 would be viewed as “most active”,
while Topics 02 and 23 would qualify as “least active”. This may or may not be cause for concern
depending on the nature of the work. In
fact, given topics can be broken down into the “nature of the engagement”,
which is highlighted below:

While this type of data gives EMC a great idea about “what”
we are researching, it doesn’t provide any data at all about “where”.

June 25, 2013

Today I attended EMC’s 2nd Annual University Day
in Santa Clara, California. A large
number of schools were represented from all over the United States, including:

UC San Diego

UC Irvine

UC Santa Cruz

Northeastern University

Minnesota

Carnegie Mellon University

University of Wisconsin

Case Western

Florida International University

University of Utah

Harvard

University of Rochester

Stony Brook University

Princeton University

The agenda for the day included discussions on challenging high-tech issues in
next generation data centers, including new developments in solid state storage. EMC Distinguished Engineer Jeroen
VanRotterdam led an interesting dialogue examining the current state of relationships
between Industry and Academia.

Greg Ganger, CMU Professor and Director of the Parallel Data Lab, gave the Academic
Keynote during the afternoon session. His keynote was followed by the annual
poster session, in which nine students competed for first prize.

For this post, however, I’d like to summarize a discussion I led just before lunch, in
which I asked the students the following question:

“How would you manage EMC’s global university research portfolio?”

Their answer was loud and clear: "We don't know!". I responded that the answer was a fair one; it's a hard problem to solve. I then shared our company's approach of using EMC’s
own analytic products (e.g. Pivotal/Greenplum) to perform global analytics
across all academic research partners.
In order to highlight the global span and scope of our research
initiatives, I shared the following map:

This map is dynamically generated. While it doesn’t represent every university
research partnership EMC has across the globe, it’s pretty close. The map is the result of nearly two years of
collaboration across all of the countries that register their research engagements. The larger the circle, the more activity is being reported from
the region.

What types of analysis can be run against a database containing research activities? During my talk I described the current reports enabled by our analytics framework:

A visualization of the “types” of research
currently active in our portfolio (e.g. solid state storage, analytics, etc).

A visualization of the “types” of research by
region (e.g. where in the world do we research compression technology?)

Who are EMC’s key researchers in any given
region?

Which researchers are the best at transferring
knowledge out of their region?

For any given EMC researcher, what type(s) of
research do they conduct?

What is the complete list of EMC employees, per
region, that are involved in any form of university research?

How can global EMC employees advance their own
ideas by locating relevant university research?

How do we augment university research with other
external employee connections (e.g. programmatically leverage their Twitter
connections)?

The talk was well-received. The faculty and students that
attended got a good feel for the framework that EMC uses to impact our own
business by expanding our knowledge with local university partners.

In future posts I will dive into many of the items above in
more detail to specifically describe how analytics are leveraged to improve
EMC’s university research results.

June 17, 2013

Several weeks ago I published a blog post (Different Minds
Think Greatly) that explored the topic of cognitive diversity and innovation.
At the time, I had read a Techonomy article by John Hagel and John Seely Brown,
who basically asserted that “too much like-mindedness hurts companies”, and I
quoted the following:

Organizations that
host a diverse and broad range of members have a resilience that results from
cross pollination.

As part of the article I echoed my agreement with this
assertion and referred to some Social Network Analysis from my own company
(EMC). The data that we modeled highlighted that a good degree of
geographic diversity can result in higher quality ideas. “Higher quality ideas” are typically defined as
ideas that receive a high score from judges in our yearly Innovation
Roadmap program (especially ideas that reach finalist or funded status).

Our data shows that when diverse minds from different
cultures collaborate on new approaches, good ideas result. The conclusions we drew from this analysis
have resulted in behavioral change at EMC. Most notably we’ve formed a global
“Innovation Best Practices Community” in order to intentionally stimulate this
behavior.

After publishing some of our findings, we were approached by
two universities about an interesting joint research project. They asked if we would mind sharing a filtered view of our employee idea activity in order to correlate it against the public Twitter
profiles of these same people. Professors Eoin Whelan (NUI Galway) and
Salvatore Parise (Babson) invited us to focus on an offshoot of cognitive
diversity known as “structural holes”.
Eoin explains structural holes in the following manner.

Ron Burt’s theory
of structural holes has proven to be influential in explaining how innovation
transpires. Burt proposes that gaps in a
social network, structural holes, create brokerage opportunities. A structural hole indicates that the people
on either side of the hole circulate in different flows of information and
advantages accrue to those individuals whose relationships span the structural
hole. In his best selling book The
Tipping Point, Malcolm Gladwell argues that the success of Paul Revere’s
midnight ride was due to his quite diverse social networks – ranging from
hunting and card playing to theatre and business. Therefore, he knew which
doors to knock on when arriving in a town.
Network brokers like Revere not only disseminate information more
broadly, they also benefit by receiving a greater novelty of information from
their diverse social contacts. Indeed,
studies within organizations have shown that employees, teams, and even entire
companies with more diverse network connections tend to be more innovative.

As a result of our conversation, we polled our global
Innovation Best Practices community and asked idea submitters to voluntarily
share their Twitter handles with Eoin and Sal. We packaged up the Twitter
handles with the employees’ corresponding level of innovation activity over a
period of several years.

This research has been ongoing for several months, and in an
upcoming post I plan to share some of the results and what it might mean for
our organization. Before I do, however,
I’d like to discuss this Data Science project in the context of Phase 1 of the
Data Analytics Life Cycle: Hypothesis Generation.

By “diverse” we mean “disconnected” or “fragmented”. In
other words, employees that follow people that are not “like-minded” tend to
submit better ideas due to the diversity of their network.
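One way to put a number on "fragmented" is sketched below in Python. This is an assumption on my part about how such a measure could look, not the method used in the joint research: it treats the accounts a person follows as that person's contacts, treats a tie between two contacts as undirected, and computes two quantities. The first is the share of possible ties among the contacts that are missing. The second is the simplified form of Burt's effective size for unweighted networks, n - 2t/n, where n is the number of contacts and t is the number of ties among them. All account names are hypothetical.

```python
from itertools import combinations

def alter_ties(alters, ties):
    """Count the ties that exist among a person's contacts (the person excluded)."""
    return sum(1 for a, b in combinations(sorted(alters), 2)
               if frozenset((a, b)) in ties)

def effective_size(alters, ties):
    """Simplified effective size for an unweighted ego network: n - 2t/n."""
    n = len(alters)
    if n == 0:
        return 0.0
    return n - 2.0 * alter_ties(alters, ties) / n

def fragmentation(alters, ties):
    """Share of possible ties among the contacts that are absent."""
    n = len(alters)
    possible = n * (n - 1) / 2
    if possible == 0:
        return 0.0
    return 1.0 - alter_ties(alters, ties) / possible

# Hypothetical example: one employee follows four accounts.
alters = {"a", "b", "c", "d"}

# Like-minded network: every followed account is tied to every other.
clique = {frozenset(pair) for pair in combinations(sorted(alters), 2)}
# Fragmented network: only a and b are tied to each other.
sparse = {frozenset(("a", "b"))}

print(effective_size(alters, clique), fragmentation(alters, clique))
print(effective_size(alters, sparse), fragmentation(alters, sparse))
```

A fully connected set of contacts yields an effective size of 1 (the contacts are redundant with one another), while the sparse set yields a value close to the number of contacts. Comparing a score like this against idea quality is the sort of test the hypothesis calls for.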

If this hypothesis proves to be true, we can brainstorm ways
of stimulating additional behavioral change by encouraging our employees to
fragment their Twitter networks. Proving the hypothesis involves iterating through the additional phases of the Life Cycle (e.g. Data Prep, Data Modeling, etc).

In future posts I will share the results of the modeling exercise conducted by Eoin and Sal.

I advise taking the train to ETH Zurich if you want to meet with the ZISC team, or perhaps take a taxi. I came to this conclusion after listening to ZISC researcher Srdjan Marinovic, who told me that his team has found a way to open a keyless car system by standing in the same room with the person who has the wireless key:

The details of their approach (Relay attacks on Passive Keyless Entry and Start Systems in Modern Cars) can be found in their paper on the subject. This project falls into the category of Access Control for Next Generation Systems, which is a project focus at the ZISC center.

While this type of research may freak out owners of new cars, the main focus of research at the ZISC center will actually freak out owners of digital information. I shared with the ZISC team some use cases regarding "digital passports", as well as multi-stage big data workflows. We had a discussion on several possible research directions, such as:

How can a file travel to one country (and only that country)?

How can globally distributed data scientists collaborate on remote data without the data leaving the country?

How can multiple stages within a big data workflow provide overlapping access control for various data scientists at various stages?

In addition to ZISC center meetings, we had a long discussion about FutureICT. For those interested in Big Data research, I highly recommend learning about this proposal (which is currently nearing the grant stage).

This was my first trip to ETH Zurich and I'm looking forward to seeing what comes of it.

July 20, 2012

Last month EMC held a "University Day" just before the USENIX ATC 2012 event in Boston on June 12-15. This event was the first in a new series of meetings where key university faculty and students gather together to collaborate for a day with EMC technologists.

The event included many prominent university research partners, great keynote speakers, and panels.

Dr. Ian Foster of the University of Chicago delivered a keynote, “Big Process for Big Data”, that centered on process automation and data-driven science. He tied in the famous Kasparov human-versus-computer chess matches as an example, explaining that there needs to be intellectual strength in both the machine and the human operating the machine.

Dr. Kai Li of Princeton University (also the co-founder of Data Domain) talked about the “Exploration of Feature-Rich Data.” I enjoyed Kai's presentation and thoughts on searching video and photo images.

The two panel discussions were led by EMC’s own Rhonda Baldwin and Erik Riedel. These discussions covered the consolidation of both structured and unstructured data in order to run analytics, technical roadblocks in terms of the consumption of information, the challenge of the lifecycle of data, and many other topics.

Jeff Nick, EMC’s CTO, offered his insight into the “Golden Triangle of Innovation,” being the combination of organic innovation, acquisitions and investments, and university collaborations. This was a nice transition to the poster session that displayed the work of 10 Ph.D. students from partnered universities.

The day concluded with a poster session by the students in attendance, with the following students taking home the prizes:

In fact, Multi-Tenant Big Data Workloads (MTBDWs) are gaining quite a bit of attention in academia. How can shared, massive data sets be most effectively (and securely) analyzed by multiple tenants in a cloud environment?

Consider the Sloan Digital Sky Survey (billed on its website as "the largest map in the history of the world"). Images from the heavens are input into a data processing pipeline, resulting in a massive amount of raw data, processed images, and meta-data. A variety of scientists can collaborate on the processing of this data, given the publicly accessible interfaces to a directory tree and forms for querying coordinates.

Providing this type of functionality as a service causes me to ask a few questions:

How would this type of collaboration work in a commercialized service provider environment?

How can tenant isolation be enforced, or in the case of collaborating tenants, be disabled?

Are the tenants, and/or the data services, all located within VMs?

Can virtualized tenants and/or data services adhere to specific SLAs?

The research community is beginning to explore these questions. Early results were presented at last year's SIGMOD conference in Athens. MIT researchers studied physical consolidation of database workloads onto fewer machines. In their paper "Workload Aware Database Monitoring and Consolidation", the authors stated the following about virtualizing databases (section 9, Conclusion):

Additionally, we show that existing virtual machine technologies are not nearly as effective as our techniques at consolidating database workloads.

The paper goes on to list some of the problems (section 7.4) encountered when virtualizing databases (as opposed to physical consolidation):

Redundant operations (log writes, log reclamations)

Over-allocation of RAM

Excessive context switching

Less code-sharing between workloads

How then can these issues be addressed to help realize performance gains while allowing for either tenant isolation and/or tenant collaboration for analytic workloads (otherwise known as multi-tenant workload management)?

With a new architecture, of course!

As part of its University R&D program, EMC is not only collaborating with CSAIL but also with the University of Washington (which has data sets of its own). In partnership with University of Washington Professors Magda Balazinska and Bill Howe, the research will attempt to define an architecture that can address the following questions:

Should tenants always be placed in their own virtual machines?

Are there benefits to tenant sharing of parallel data processing engines?

For overlapping data sets between tenants, how should resources be allocated between different tenants and engines?

What service level agreements make sense for these new scenarios?

How is elasticity implemented in these new scenarios?

It's exciting to be at the forefront of these discussions, and I'm looking forward to sharing results moving forward.

June 26, 2012

I spoke at a panel on "Russian Startups in the Global Market" at the Saint Petersburg Economic Forum 2012 last week. My contributions to the discussion during the two-hour dialogue represent a great summary of the week I just spent in Russia (I visited both Moscow and St. Petersburg). Before the economic forum I participated in the 5-year anniversary of EMC's R&D presence in Saint Petersburg (SPb). In the picture below I am sitting in the middle of three individuals representing Russian startups (on the right) and four individuals representing venture capitalists (on the left).

The main question being addressed by the panel concerned the ability (or inability) of Russian startups to engage globally. My main contribution to this panel was shaped largely by the celebration of five years of R&D contributions from EMC's SPb team. In 2007, several dozen new EMC employees began working in SPb, and within 5 years they had already delivered significant (and globally-delivered) product contributions. During the panel discussion I attempted to draw some analogies between (a) the innovative behavior that I see in my Russian co-workers and (b) the potential behavior of the Russian startup market.

Firstly, any product offering developed in Russia needs to be customizable for global markets. The VNXe and Unisphere offerings developed in SPb were designed by the Russian team in such a fashion. In fact, several Russian employees display (in their office) the "President's Award" given directly to them by Joe Tucci after the VNXe was voted by the industry as the most innovative product of the year.

The VNXe is a great offering in Russia. It fills an information storage need in the small-to-medium business market that enterprise products cannot deliver. The Unisphere user interface infrastructure developed in Russia was built in such a way that the product is customizable for other markets (Europe, Asia, U.S., etc).

More important than customization, however, is product quality. As I sat on the stage of the SPb Economic Forum I recalled a statement made by EMC SVP Joel Schwartz (shown here on the right) that goes something like this:

"I'd like to say we had a master plan in mind when we launched EMC R&D in SPb. The truth is we had a hunch that the talent level in Russia could deliver world-class products".

I have always paid a great deal of attention to the quality designed into EMC's products (see previous posts). I spent four years developing the VNXe and Unisphere alongside my Russian co-workers, and during that time I observed that their approach to product quality was certainly world-class.

Observation #1 that I made during the panel: think about quality and international customization from the very beginning.

During the week I also visited with local universities in both Moscow and SPb. I made a visit to SUAI (the State University of Aerospace Instrumentation). I visited SUAI with several of my co-workers (shown here on the left) working on the Viper project. It is well-known in the industry that the Viper team creates re-usable compression componentry that is subsequently embedded across EMC's product lines.

Russians are certainly world-renowned for their mathematical prowess. I was impressed by the mathematical skills of the university researchers and even more impressed to find that some of the new techniques for improving compression rates were already integrated into our product line. Joel's statement about the talent level is true not only for our EMC employees but for the universities that we partner with. Innovation accelerates when our employees reach out to local universities to find out what is new and relevant.

This led me to observation #2 during the panel: Find out what is relevant and new in global markets by studying the research of universities in those markets. Knowledge expansion and transfer translate into deliverable ideas (a concept that we are attempting to formally quantify as part of EMC's innovation analytics research).

After my visit I had my picture taken with my co-workers, Professor Eugeneii Krouk and his researchers.

My statement stressing the importance of university engagements in global markets was challenged by the moderator: Shouldn't the focus be more about money (as opposed to the knowledge available at universities)? One of the issues identified with Russian startups entering global markets was the lack of access to the capital that was required to do so. This challenge to my perspective caused me to think about the first two days of my trip in Moscow. EMC's innovation strategy in the region is actually two-pronged. In addition to a strong university research presence, we also have growing visibility to the startup, partner, and customer ecosystem (after joining the recently-announced Skolkovo initiative).

For EMC, Skolkovo represents a window into the most pressing needs of customers in Russia. During my two days in Moscow I met with startup companies working on smart-grid metering technologies, large customers that run Russian power utilities, and potential partner companies that are already performing Russia-related research (e.g. in areas such as smart grid).

Observation #3: Every geography has government initiatives that enable networking of peers and potential partners. Find ways to engage with these initiatives.

This was my fourth visit to Russia. I finally got smart and visited in June (my previous visits were in October, November, and January, with snow on every trip). I highly recommend SPb in June! Below is a picture I took at midnight on the longest day of the year: June 21st. The White Nights of this northern city are a sight worth seeing.

It turns out that two of the authors of the original RAID paper (Patterson and Katz) sit in the AMP Lab surrounded by students and other faculty. The AMP Lab is a five-year initiative that logically follows from the successful conclusion of the five-year RADLab effort.

I met with Michael Franklin (the Director of AMP Lab) and he said that the RADLab experience was squarely aligned to the advent of cloud computing. In the same way, AMP Lab has squarely positioned the faculty and the students for the advent of Big Data.

Algorithms, Machines, People

Michael shared the main problem that AMP Lab is trying to solve:

The normal application of current technology doesn’t enable users to obtain timely and cost-effective answers of sufficient quality to data driven questions.

This problem statement has caused the team to research new ways of combining the vectors of timeliness, cost, and quality. The focus is on three areas:

2. Machines: use cloud computing to get value from Big Data and enhance data center infrastructure to cut costs of Big Data Management

3. People: Leverage human activity and intelligence.

It's the last item (people) that I found most interesting. The team at the AMP Lab is beginning to explore software stacks that call out to people. I've embedded a picture of one of these stacks below (taken from the CrowdDB project page).

For Big Data, some queries may best be satisfied by the crowd. This includes large data sets with incomplete data, queries over visual pictures that may not necessarily be tagged, or queries using "synonyms" (e.g. "IBM" and "Big Blue"). One of the more popular platforms for assigning work tasks to the crowd is Amazon's Mechanical Turk, and the AMP Lab has already started research with that platform.
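The "call out to people" idea can be illustrated with a small Python sketch. It is not CrowdDB's interface and it does not use the Mechanical Turk API: the function names, the alias table, and the fake_crowd stand-in are all hypothetical. The sketch shows one pattern only: answer from stored data when possible, hand the question to the crowd when the data is incomplete, and remember the answer so the same question is not asked twice.

```python
def resolve_company(name, known_aliases, ask_crowd):
    """Return a canonical company name, asking the crowd when data is missing.

    known_aliases maps lower-cased names to canonical names.
    ask_crowd is any callable that takes a question and returns an answer.
    """
    key = name.strip().lower()
    if key in known_aliases:
        return known_aliases[key]
    answer = ask_crowd(f"Which company is meant by '{name}'?")
    known_aliases[key] = answer  # cache so the crowd is asked only once
    return answer

asked = []  # records every question sent to the (fake) crowd

def fake_crowd(question):
    # Stand-in for posting a task to a crowdsourcing platform and
    # waiting for a worker's reply; it always answers "IBM".
    asked.append(question)
    return "IBM"

aliases = {"ibm": "IBM"}
first = resolve_company("IBM", aliases, fake_crowd)        # answered from data
second = resolve_company("Big Blue", aliases, fake_crowd)  # answered by crowd
third = resolve_company("big blue", aliases, fake_crowd)   # answered from cache

print(first, second, third, len(asked))
```

A real system would also have to deal with cost, latency, and conflicting answers from different workers, none of which this sketch attempts.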

I'd be interested in experimenting with this type of layered stack (the PowerPath stack comes to mind).

The portfolio of Big Data research going on at AMP Lab was cool. I walked out of there feeling pretty stoked.

Disclaimer

The opinions expressed here are my personal opinions. Content published here is not read or approved in advance by DELL Technologies and does not necessarily reflect the views and opinions of DELL Technologies nor does it constitute any official communication of DELL Technologies.