WEBVTT
00:00:00.650 --> 00:00:02.700
Welcome back to
Tech Days Online,
00:00:02.700 --> 00:00:05.140
Turning Data into Intelligence.
00:00:05.140 --> 00:00:07.080
I'm joined by Richard Conway.
00:00:07.080 --> 00:00:10.590
And Richard, you're an R.D.
What is that?
00:00:10.590 --> 00:00:11.470
>> I am indeed.
So
00:00:11.470 --> 00:00:14.160
I'm Microsoft's
Regional Director.
00:00:14.160 --> 00:00:17.930
It's a very special title, and
it's a very sort of community-
00:00:17.930 --> 00:00:22.530
and technically- and business-focused
role, and all of these
00:00:22.530 --> 00:00:27.510
sorts of things where I help
set up community events.
00:00:27.510 --> 00:00:33.809
And I really keep my finger
on the pulse of technology.
00:00:33.809 --> 00:00:37.650
It's a great thing, because I
get to talk to some amazing
00:00:37.650 --> 00:00:41.400
people in Microsoft about all
the technologies that I love. Marvelous.
00:00:41.400 --> 00:00:42.070
Well, thanks for
00:00:42.070 --> 00:00:43.780
taking time out of
your busy schedule.
00:00:43.780 --> 00:00:46.190
We're pretty lucky to get
this guy on the show, and
00:00:46.190 --> 00:00:49.740
we've given you the tough topic
of the day, big data, which is-
00:00:49.740 --> 00:00:51.120
>> Big data.
00:00:51.120 --> 00:00:52.610
So much to say about this.
00:00:52.610 --> 00:00:55.560
We're probably not gonna
fit all of this material
00:00:55.560 --> 00:00:56.680
into 45 minutes.
00:00:56.680 --> 00:00:58.150
But let's start.
00:00:58.150 --> 00:00:59.570
Before we actually start,
00:00:59.570 --> 00:01:03.660
Andrew, I did want to
make a few announcements,
00:01:03.660 --> 00:01:06.870
and just let people know
about some of the great stuff
00:01:06.870 --> 00:01:07.650
that we're doing-
>> Yes.
00:01:07.650 --> 00:01:09.850
>> And how they can learn
more about this topic.
00:01:09.850 --> 00:01:14.049
So there's a little bit of
information about me, but
00:01:14.049 --> 00:01:17.688
in really taking this
appreciation of Azure,
00:01:17.688 --> 00:01:20.317
we founded the UK Azure User Group.
00:01:20.317 --> 00:01:22.853
So, obviously it's
great to be here and
00:01:22.853 --> 00:01:26.517
tell people about some of the
great meetups that we put on.
00:01:26.517 --> 00:01:28.433
>> You've got Scott Guthrie?
00:01:28.433 --> 00:01:29.545
>> We have indeed.
00:01:29.545 --> 00:01:35.330
We've got Scott Guthrie here,
who's the EVP at Microsoft,
00:01:35.330 --> 00:01:39.670
and he owns several clouds, so
he's entirely responsible for
00:01:39.670 --> 00:01:43.890
Azure and has made it the great
platform that it is currently.
00:01:45.150 --> 00:01:50.370
So we have Scott for
a day in London, so if you're
00:01:50.370 --> 00:01:54.680
in London on June third, and you
want to have a whistle-stop tour
00:01:54.680 --> 00:01:57.982
of Azure, and you want to hear
Scott do a morning keynote, you
00:01:57.982 --> 00:02:01.000
want to hear some great talks
in the afternoon about some of
00:02:01.000 --> 00:02:05.510
the amazing new technologies
that are surfacing in Azure like
00:02:05.510 --> 00:02:08.840
Service Fabric, and Internet
of Things, and big data.
00:02:08.840 --> 00:02:11.930
Then absolutely come
to this conference,
00:02:11.930 --> 00:02:12.980
we'd love to see you.
00:02:12.980 --> 00:02:14.410
It's free for the day.
00:02:14.410 --> 00:02:19.150
We'll ply you with coffee all
day, [LAUGH] we'll feed you, and
00:02:19.150 --> 00:02:21.520
we're gonna have
a hack-a-thon as well, so
00:02:21.520 --> 00:02:25.250
you can actually get hands
on all of these things.
00:02:25.250 --> 00:02:30.500
But if that wasn't enough, we
are also gonna do a second day.
00:02:30.500 --> 00:02:33.287
And, yes,
this is Thames Valley Park.
00:02:33.287 --> 00:02:34.168
>> I thought it looked familiar.
00:02:34.168 --> 00:02:35.648
[LAUGH]
>> This is Microsoft HQ,
00:02:35.648 --> 00:02:36.990
absolutely.
00:02:36.990 --> 00:02:41.730
So the community's been working
really hard with Microsoft
00:02:41.730 --> 00:02:44.790
to do an event unlike any
we've done before.
00:02:44.790 --> 00:02:48.360
So normally we
run hackathons, and
00:02:48.360 --> 00:02:52.334
because of our love for data,
we tend to do a lot of events,
00:02:52.334 --> 00:02:54.240
which are data-centric.
00:02:54.240 --> 00:02:55.950
But we thought, you know what?
00:02:55.950 --> 00:02:59.480
And this is sort of given
away by the title Azurecraft.
00:02:59.480 --> 00:03:03.390
Why don't we couple something
that developers love, Azure,
00:03:03.390 --> 00:03:07.140
with something that kids
love, Minecraft?
00:03:07.140 --> 00:03:09.602
We decided to put
together Azurecraft and
00:03:09.602 --> 00:03:12.848
on the second day we're
running some amazing things.
00:03:12.848 --> 00:03:18.215
We're teaching Minecraft through
Microweb where we've got this
00:03:18.215 --> 00:03:22.813
big focus on nanosatellites,
an incredible project and
00:03:22.813 --> 00:03:26.932
initiative that's being
run out of Microsoft UK,
00:03:26.932 --> 00:03:30.092
and really taking
shape at the moment,
00:03:30.092 --> 00:03:33.857
and so we're gonna have
a great time here.
00:03:33.857 --> 00:03:36.421
But we really want to
empower families and
00:03:36.421 --> 00:03:39.357
children to really use and
relate to the cloud.
00:03:39.357 --> 00:03:40.880
>> Yeah you've got
to try this stuff.
00:03:40.880 --> 00:03:43.150
There's only so much
you can learn by watching videos.
00:03:43.150 --> 00:03:45.850
It's a bit like learning
to drive I guess.
00:03:45.850 --> 00:03:46.840
>> Yeah absolutely.
00:03:46.840 --> 00:03:48.200
>> Right, so thanks for
00:03:48.200 --> 00:03:49.560
the tutorial, let's
get on with the show.
00:03:49.560 --> 00:03:50.869
>> Absolutely, let's do it.
00:03:52.410 --> 00:03:57.760
I want to start, I know we had
an impromptu definition before,
00:03:58.800 --> 00:04:02.720
and we'll just recap with
this idea of the three V's.
00:04:02.720 --> 00:04:07.190
This talk is about big data,
and this is really how people
00:04:07.190 --> 00:04:12.100
are defining this through
volume because we've amassed so
00:04:12.100 --> 00:04:14.760
much data over the long term.
00:04:14.760 --> 00:04:19.070
And as Matthew points out,
he brought this really home at
00:04:19.070 --> 00:04:22.140
a very sort of conceptual level
about what these terms mean.
00:04:22.140 --> 00:04:24.720
But what this means for
a business,
00:04:24.720 --> 00:04:28.848
this volume is ten years' worth
of web logs, for example.
00:04:28.848 --> 00:04:29.848
What is this?
00:04:29.848 --> 00:04:31.563
This is 50 terabytes,
00:04:31.563 --> 00:04:35.237
100 terabytes of data; to
literally be able to trawl
00:04:35.237 --> 00:04:39.007
through this in a relational
database is impossible.
00:04:39.007 --> 00:04:39.707
>> Yeah.
00:04:39.707 --> 00:04:44.340
>> So, big data is all about
appreciation of these things.
00:04:44.340 --> 00:04:46.750
Then you've got this idea
of variety, you know?
00:04:46.750 --> 00:04:49.820
As enterprises emerge,
what tends to happen is,
00:04:49.820 --> 00:04:53.860
we get different projects
at different times.
00:04:53.860 --> 00:04:56.720
Different areas relating
to different technologies
00:04:56.720 --> 00:04:58.200
with different file formats,
00:04:58.200 --> 00:05:02.210
and we end up in a position
where we've got all of these
00:05:02.210 --> 00:05:05.750
different schemas, and we don't
really understand what they are.
00:05:05.750 --> 00:05:07.760
>> Yeah.
Running SQL over a collection of
00:05:07.760 --> 00:05:09.460
photographs isn't
gonna work is it?
00:05:09.460 --> 00:05:10.230
>> Right, exactly.
00:05:10.230 --> 00:05:15.110
>> Or sound files or JSON
objects, even JSON if we loaded
00:05:15.110 --> 00:05:17.960
it in, I mean, I know we've-
>> It can be super helpful
00:05:17.960 --> 00:05:18.730
understanding that stuff.
00:05:18.730 --> 00:05:20.350
And of course,
it's coming at us fast and
00:05:20.350 --> 00:05:21.720
no network can cope
with it sometimes.
00:05:21.720 --> 00:05:23.900
>> Yeah, yeah, absolutely.
00:05:23.900 --> 00:05:25.450
And the speed of this.
00:05:25.450 --> 00:05:29.440
If we use a lot of the
technologies that we use today,
00:05:29.440 --> 00:05:32.190
we can't do this fast enough.
00:05:32.190 --> 00:05:35.780
So applying these three things
is a really simple way of
00:05:35.780 --> 00:05:36.850
breaking down this problem.
00:05:37.870 --> 00:05:41.520
A set of desired outcomes
that we want from a solution,
00:05:41.520 --> 00:05:45.550
which is where this idea
of big data takes over.
00:05:45.550 --> 00:05:48.910
But if we think about
what has changed in data
00:05:50.650 --> 00:05:52.545
take the Apollo mission, right?
00:05:52.545 --> 00:05:53.200
>> Mm-hm.
00:05:53.200 --> 00:05:57.800
>> We were talking about 64
Kb payloads and 64k buffers.
00:05:57.800 --> 00:05:59.400
So computing was
very different then.
00:05:59.400 --> 00:06:00.920
At that time,
00:06:00.920 --> 00:06:03.480
we weren't talking about
very large storage devices.
00:06:03.480 --> 00:06:05.960
We weren't talking about
large computing devices.
00:06:05.960 --> 00:06:08.310
And if we fast-forward, right?
00:06:08.310 --> 00:06:10.420
Just the massive Halo
00:06:10.420 --> 00:06:12.880
health messages with
the Xbox game,
00:06:12.880 --> 00:06:15.810
you know, we're talking about
gigabytes of data per second.
00:06:15.810 --> 00:06:21.290
So, that change in time,
that whole change, you know,
00:06:21.290 --> 00:06:25.537
in this information superhighway
has necessitated
00:06:25.537 --> 00:06:29.976
a different set of technologies
over 50 or so years,
00:06:29.976 --> 00:06:31.818
less than 50 years.
00:06:31.818 --> 00:06:35.686
So we do live in
crazy times I think.
00:06:35.686 --> 00:06:41.553
So I think where we start this
is when you speak to companies
00:06:41.553 --> 00:06:47.301
you really get that idea of
boundaries, complexities,
00:06:47.301 --> 00:06:52.333
incumbent systems with
their relational data and
00:06:52.333 --> 00:06:57.266
where you start
the conversation with big data.
00:06:57.266 --> 00:07:02.397
It's about unlocking some of
the potential and moving away
00:07:02.397 --> 00:07:07.835
from very schematized systems
which are heavily normalized,
00:07:07.835 --> 00:07:13.068
and so these are the sorts of
conversations that you'll find
00:07:13.068 --> 00:07:18.712
in the office with DBAs solving
problems around the storage and
00:07:18.712 --> 00:07:21.197
the access times with data.
00:07:21.197 --> 00:07:24.754
So, you know there are some
simple things that we can define
00:07:24.754 --> 00:07:26.959
and we want to
achieve through this,
00:07:26.959 --> 00:07:30.318
we want to discover what we
don't know from the data.
00:07:30.318 --> 00:07:33.275
Sometimes with
the systems that we have,
00:07:33.275 --> 00:07:37.077
it's very easy to ask questions
that we already know, with
00:07:37.077 --> 00:07:40.720
expectations and answers that-
>> And we should be getting
00:07:40.720 --> 00:07:43.120
an answer back in a two- or at
least a three-dimensional way,
00:07:43.120 --> 00:07:44.990
if visualizations
agree with numbers,
00:07:44.990 --> 00:07:47.910
if you've got
>> We've just been talking
00:07:47.910 --> 00:07:48.720
about meningitis.
00:07:48.720 --> 00:07:51.770
If you've got 20 characteristics
that relate to meningitis, it's
00:07:51.770 --> 00:07:54.010
very difficult to conceptualize
that in the human mind.
00:07:54.010 --> 00:07:56.660
We can't think in 20 dimensions.
00:07:56.660 --> 00:08:00.440
So you want to move away from
eyeballing the numbers and
00:08:00.440 --> 00:08:03.600
get as Matt was saying
the algorithms to kind of
00:08:03.600 --> 00:08:04.760
play with this I guess.
00:08:04.760 --> 00:08:07.730
>> Yeah, absolutely, So
>> I mean,
00:08:07.730 --> 00:08:11.850
this is where I see the Cloud as
Microsoft's biggest value, and
00:08:11.850 --> 00:08:13.620
we'll talk about
that as we go along.
00:08:13.620 --> 00:08:16.200
But that enablement cycle, so
00:08:16.200 --> 00:08:20.300
that you can just go to the root
of the problem, and this idea
00:08:20.300 --> 00:08:23.730
about actionable insights that
we hear about all the time.
00:08:23.730 --> 00:08:29.150
You can skip the management of
the infrastructure and be
00:08:29.150 --> 00:08:31.880
able to actually understand
what these systems do.
00:08:31.880 --> 00:08:34.810
Because Microsoft has done so
much work to just give you
00:08:34.810 --> 00:08:38.050
these, and give you a set of
very simple interfaces that
00:08:38.050 --> 00:08:41.190
will drive your uses and
your use cases.
00:08:41.190 --> 00:08:43.640
And as we move
through the cycle,
00:08:43.640 --> 00:08:47.090
we begin to understand patterns
and practices around big data.
00:08:47.090 --> 00:08:50.110
And a lot of
companies think that
00:08:50.110 --> 00:08:52.300
their approach to
data is unique.
00:08:52.300 --> 00:08:54.570
They have a unique user base.
00:08:54.570 --> 00:08:57.930
And many do, but what you'll find
is that there are some well-
00:08:57.930 --> 00:09:01.000
pronounced patterns that
Microsoft has discovered over
00:09:01.000 --> 00:09:02.980
the years, and they can
tell you how to do this.
00:09:02.980 --> 00:09:04.960
>> And I think in one
case you're working on,
00:09:04.960 --> 00:09:07.080
you were taking something that
we would normally associate
00:09:07.080 --> 00:09:09.250
with customer retention or resale.
00:09:09.250 --> 00:09:12.220
But what if we used the same
algorithms against education,
00:09:12.220 --> 00:09:14.150
and looked at student dropout
rates, for example?
00:09:14.150 --> 00:09:14.930
>> Absolutely.
00:09:14.930 --> 00:09:18.890
>> And I think that everybody
thinks they've got, yes,
00:09:18.890 --> 00:09:19.800
a unique perspective.
00:09:19.800 --> 00:09:21.890
And I guess one of the other
things that's unlocked
00:09:21.890 --> 00:09:25.460
this problem now, is that people
are much better at sharing
00:09:25.460 --> 00:09:27.520
the ideas and
the techniques they've used.
00:09:27.520 --> 00:09:28.400
It's really code on
00:09:28.400 --> 00:09:29.720
GitHub-
>> Yeah.
00:09:29.720 --> 00:09:30.750
>> Algorithms on CRAN,
00:09:30.750 --> 00:09:33.550
on R, and so the
>> Hopefully then
00:09:33.550 --> 00:09:36.390
this needs to percolate
up to the business,
00:09:36.390 --> 00:09:38.750
that someone may have already
done some work in this space.
00:09:38.750 --> 00:09:41.390
>> Yeah, yeah, exactly,
and I think that you're
00:09:41.390 --> 00:09:44.450
getting a lot of communities
forming around this because
00:09:44.450 --> 00:09:46.750
techies are really attracted
to these technologies,
00:09:46.750 --> 00:09:49.230
because they've got
amazing potential.
00:09:49.230 --> 00:09:50.490
>> Super fun stuff to play with,
00:09:50.490 --> 00:09:54.470
yeah, and
>> So I just wanted to break
00:09:54.470 --> 00:09:58.882
down who the users are of these
technologies, so this is-
00:09:58.882 --> 00:10:01.863
>> That's a great definition from
Josh Wills of Cloudera,
00:10:01.863 --> 00:10:04.276
who gives a definition
of a Data Scientist.
00:10:04.276 --> 00:10:06.853
A person who is better at
statistics than any software
00:10:06.853 --> 00:10:07.483
engineer and
00:10:07.483 --> 00:10:10.680
better at software engineering
than any statistician.
00:10:10.680 --> 00:10:15.310
So I think as we sort of move
through this big data landscape,
00:10:15.310 --> 00:10:18.620
we're getting the development
of careers and people and
00:10:18.620 --> 00:10:22.000
understandings, which involve
small bits of engineering,
00:10:22.000 --> 00:10:25.110
small bits of mathematics,
small bits of design.
00:10:25.110 --> 00:10:27.270
So they're not experts
in any one thing, but
00:10:27.270 --> 00:10:28.350
you need these things for
00:10:28.350 --> 00:10:32.649
the big data technologies to
work properly, so we like that.
00:10:32.649 --> 00:10:35.300
[LAUGH] And
00:10:35.300 --> 00:10:38.830
just to show you the breadth of
what Big Data actually means,
00:10:38.830 --> 00:10:42.455
this is a quote from
Jeff Hammerbacher from Facebook.
00:10:42.455 --> 00:10:44.130
>> Mm-hm.
>> On any given day,
00:10:44.130 --> 00:10:47.210
a team member could author a
multi-stage processing pipeline
00:10:47.210 --> 00:10:51.050
in Python, design a hypothesis
test, perform a regression
00:10:51.050 --> 00:10:53.830
analysis over data samples
with R, design and
00:10:53.830 --> 00:10:57.170
implement an algorithm for
some data intensive product or service-
00:10:57.170 --> 00:10:57.290
>> Yeah.
00:10:57.290 --> 00:11:00.450
>> in Hadoop, or communicate
the results of our analysis
00:11:00.450 --> 00:11:01.890
to other members of
the organization.
00:11:01.890 --> 00:11:03.200
>> Yeah.
>> Possibly
00:11:03.200 --> 00:11:05.120
through a visualization.
00:11:05.120 --> 00:11:06.240
>> And
the story I like to tell
00:11:06.240 --> 00:11:07.760
here is the one
inside Microsoft.
00:11:07.760 --> 00:11:11.475
We had this
engine called Cosmos.
00:11:11.475 --> 00:11:12.820
>> Mm-hm.
>> And that's
00:11:12.820 --> 00:11:16.090
our internal engine that
sits behind Bing and MSN.
00:11:16.090 --> 00:11:19.030
And it wasn't working as well
as it should, to be brutally honest.
00:11:19.030 --> 00:11:19.580
So what do we do?
00:11:19.580 --> 00:11:20.970
Well, let's go and
hire some people in.
00:11:20.970 --> 00:11:24.160
Well, that didn't really work
because those data scientist
00:11:24.160 --> 00:11:27.010
people were spending too much
time fiddling around with
00:11:27.010 --> 00:11:30.140
the data, not actually coming
up with the tests and so on and
00:11:30.140 --> 00:11:31.920
the insights that
we were expecting.
00:11:31.920 --> 00:11:33.470
So we had to redesign
the whole system.
00:11:33.470 --> 00:11:34.800
And actually that's
what bore out,
00:11:34.800 --> 00:11:36.430
that's where the whole
system came from.
00:11:36.430 --> 00:11:38.500
And then we thought
it's such a good idea,
00:11:38.500 --> 00:11:41.318
well, we then need to put that
into production. So how do we
00:11:41.318 --> 00:11:44.309
put it into production? We did
what any modern business should
00:11:44.309 --> 00:11:46.380
do: we put it in the Cloud.
00:11:46.380 --> 00:11:47.180
Put that in Azure.
00:11:47.180 --> 00:11:49.920
If we can put Cosmos on Azure,
fantastic.
00:11:49.920 --> 00:11:53.160
And then somebody in
marketing has a bright idea.
00:11:53.160 --> 00:11:55.385
Well if it's on Azure then
maybe we could market it and
00:11:55.385 --> 00:11:56.430
sell it to other people.
00:11:57.930 --> 00:12:00.560
So we have Data Lake now, and
that's pretty much what it is.
00:12:00.560 --> 00:12:01.990
And SCOPE was the language,
and
00:12:01.990 --> 00:12:04.420
now we have U-SQL
as the language.
00:12:04.420 --> 00:12:05.560
But the point is
00:12:05.560 --> 00:12:08.775
that we want to basically make
the tuning a little easier.
00:12:08.775 --> 00:12:09.320
>> Mm-hm.
00:12:09.320 --> 00:12:11.830
>> As well as having that scale
and performance that we need.
00:12:11.830 --> 00:12:14.150
>> Yeah, absolutely, and I
00:12:14.150 --> 00:12:16.690
think this is part of the appeal
of Microsoft technologies.
00:12:18.050 --> 00:12:21.470
Several years ago,
we started doing hackathons for
00:12:21.470 --> 00:12:22.850
data scientists
with our community.
00:12:22.850 --> 00:12:23.870
>> Yeah.
>> And
00:12:23.870 --> 00:12:27.320
we didn't have all these
amazing tools like Azure ML and
00:12:27.320 --> 00:12:29.690
Microsoft Research was
providing toolkits.
00:12:29.690 --> 00:12:32.190
It was a very,
very academic workplace.
00:12:32.190 --> 00:12:35.090
And then all of that
technology found its way
00:12:35.090 --> 00:12:38.300
into the mainstream and
people began to use it.
00:12:38.300 --> 00:12:40.190
>> Yeah.
>> So it makes life much easier
00:12:40.190 --> 00:12:41.900
because you can focus
on the problem domain.
00:12:43.860 --> 00:12:46.590
>> Right, so it's important
stuff and I think, yeah,
00:12:46.590 --> 00:12:48.440
this is super important as well.
00:12:48.440 --> 00:12:52.030
This distinction between
batch and speed.
00:12:52.030 --> 00:12:52.920
>> Yeah, exactly.
00:12:52.920 --> 00:12:56.790
So one of things that you'll
find is that Azure has an answer
00:12:56.790 --> 00:13:00.330
for everything and there's many
different ways to skin a cat.
00:13:00.330 --> 00:13:03.100
But I wanted to show you
this because there's
00:13:03.100 --> 00:13:04.040
a pattern of reuse.
00:13:04.040 --> 00:13:05.700
We've used this all
over the place.
00:13:05.700 --> 00:13:07.520
We use this for
community projects.
00:13:07.520 --> 00:13:12.008
We use this for some of the
projects we do for customers,
00:13:12.008 --> 00:13:17.645
and it's essentially being able
to take telemetry or data
00:13:17.645 --> 00:13:22.612
from some system and republish
it out. There's a messaging
00:13:22.612 --> 00:13:27.508
mechanism in Azure called the
Service Bus, and the Event Hub.
00:13:27.508 --> 00:13:29.199
We can either persist these so
00:13:29.199 --> 00:13:31.720
we can have a long-term
view of storage.
00:13:31.720 --> 00:13:34.480
And we can push these
into Azure storage.
00:13:34.480 --> 00:13:37.070
We can schedule Hadoop jobs.
00:13:37.070 --> 00:13:40.420
And we'll talk about
Hadoop in a bit.
00:13:40.420 --> 00:13:42.810
Or we can process singular
messages in real time.
00:13:42.810 --> 00:13:45.500
So, we've got a couple of
different options as to how we
00:13:45.500 --> 00:13:46.110
view this data.
00:13:46.110 --> 00:13:48.090
We can view it as a time series.
00:13:48.090 --> 00:13:49.700
So that we can see
this data coming in,
00:13:49.700 --> 00:13:54.080
we can spot trends in real time,
we can react to events, the so-
00:13:54.080 --> 00:13:56.380
called complex event processing.
00:13:56.380 --> 00:13:59.640
And then we can use our familiar
SQL to deliver this to people.
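The pipeline just described — telemetry republished through Event Hub, persisted to storage for batch jobs, or viewed as a time series in real time — can be sketched in miniature. This is an illustrative Python toy of a tumbling-window aggregation, not Azure Stream Analytics or Event Hub code; the function and event shape are invented for the example:

```python
from collections import defaultdict

def tumbling_window_averages(events, window_seconds):
    """Group (timestamp_seconds, device_id, value) telemetry events
    into fixed, non-overlapping time windows, and average each
    device's readings per window - the same shape of computation a
    streaming engine applies to an incoming event stream."""
    windows = defaultdict(list)
    for ts, device, value in events:
        # Each event lands in the window containing its timestamp.
        window_start = int(ts // window_seconds) * window_seconds
        windows[(window_start, device)].append(value)
    # Emit one averaged reading per (window, device) pair.
    return {key: sum(vals) / len(vals) for key, vals in sorted(windows.items())}

# Simulated telemetry: three readings from one device over 70 seconds.
events = [(0, "dev1", 10.0), (30, "dev1", 20.0), (70, "dev1", 40.0)]
print(tumbling_window_averages(events, 60))
# The first window (0-60s) averages 10 and 20 to 15.0; the second holds 40.0.
```

A real streaming engine does the same grouping continuously over an unbounded stream, emitting each window's result as its time boundary passes.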
00:13:59.640 --> 00:14:02.410
>> Yes, SQL now, for me,
really becomes a query language
00:14:02.410 --> 00:14:05.480
rather than being locked into
this world of relational
00:14:05.480 --> 00:14:07.860
databases and
transactions, and so on.
00:14:07.860 --> 00:14:08.490
>> Absolutely.
00:14:10.000 --> 00:14:13.410
And so
let's talk a little about Hadoop,
00:14:13.410 --> 00:14:16.230
and specifically about HDInsight,
00:14:16.230 --> 00:14:20.280
which is Microsoft's Hadoop
implementation.
00:14:20.280 --> 00:14:24.940
So you can see,
this is a very, very large
00:14:24.940 --> 00:14:29.090
diagram which sort of defines
how Microsoft view Hadoop.
00:14:29.090 --> 00:14:30.850
Sorry for
the incredibly garish colors.
00:14:30.850 --> 00:14:31.480
>> That's okay.
00:14:31.480 --> 00:14:33.255
It's very readable.
00:14:33.255 --> 00:14:34.918
>> [LAUGH] Okay.
00:14:34.918 --> 00:14:37.690
So you can see there's a lot
going on in this system.
00:14:37.690 --> 00:14:38.880
>> Hm.
>> You've got
00:14:38.880 --> 00:14:40.030
multiple languages.
00:14:40.030 --> 00:14:41.630
You've got multiple frameworks.
00:14:41.630 --> 00:14:43.900
You've got this
idea of MapReduce,
00:14:43.900 --> 00:14:44.780
which we're gonna look at.
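The MapReduce idea mentioned here is simple enough to sketch in-process. Below is a toy Python word count showing the map, shuffle, and reduce phases conceptually; it is not actual Hadoop code, and every name in it is invented for illustration:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input record.
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(word, counts):
    # Reduce: sum all counts that were shuffled to the same key.
    return word, sum(counts)

def word_count(documents):
    # Shuffle: gather every mapper's pairs and group them by key,
    # which is what the framework does between the two phases.
    pairs = sorted(p for doc in documents for p in map_phase(doc))
    return dict(
        reduce_phase(word, [c for _, c in group])
        for word, group in groupby(pairs, key=itemgetter(0))
    )

print(word_count(["big data", "big clusters"]))
# {'big': 2, 'clusters': 1, 'data': 1}
```

Hadoop runs the same three phases, but with the map and reduce functions distributed across a cluster and the shuffle performed over the network.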
00:14:44.780 --> 00:14:47.120
You've got HDFS,
which we'll discuss.
00:14:47.120 --> 00:14:49.510
And all of these different
things going on.
00:14:49.510 --> 00:14:53.440
And they sort of marry
themselves very well
00:14:53.440 --> 00:14:56.800
to the converse picture, which
is the Apache ecosystem.
00:14:56.800 --> 00:15:01.270
And so
one of the things that I noticed
00:15:01.270 --> 00:15:04.450
in the front here
is Tux the penguin.
00:15:04.450 --> 00:15:05.870
>> Yeah, the penguin guy here.
00:15:05.870 --> 00:15:06.810
>> Yeah, let's hold Tux up.
00:15:06.810 --> 00:15:07.550
Fantastic.
00:15:07.550 --> 00:15:08.620
>> Get our penguin guy out.
00:15:08.620 --> 00:15:10.120
It's our penguin guy.
00:15:10.120 --> 00:15:12.050
I think we've got it.
00:15:12.050 --> 00:15:15.710
>> So, for me, this is the
embodiment of the new Microsoft.
00:15:15.710 --> 00:15:16.368
>> Yes.
00:15:16.368 --> 00:15:20.860
>> Because, with Hadoop, what
Microsoft has done is they've
00:15:20.860 --> 00:15:25.080
taken a collection of what
started out as community or
00:15:25.080 --> 00:15:26.450
Apache projects.
00:15:26.450 --> 00:15:29.666
These have now turned into things
that many companies are using,
00:15:29.666 --> 00:15:32.432
and they've turned them
into something that can be
00:15:32.432 --> 00:15:34.570
reused with the minimum effort.
00:15:34.570 --> 00:15:36.350
So you don't need to worry
about the deployment.
00:15:36.350 --> 00:15:39.100
You don't need to worry about
the underlying infrastructure.
00:15:39.100 --> 00:15:41.120
Microsoft will do it for you.
00:15:41.120 --> 00:15:43.770
>> Yeah, it is bewildering
to see this list.
00:15:43.770 --> 00:15:45.530
How do you fit all of
this stuff together, and
00:15:45.530 --> 00:15:48.650
I think if you come from this
traditional BI analytics world
00:15:48.650 --> 00:15:50.150
and I see a lot of you
going on that journey.
00:15:50.150 --> 00:15:52.710
Then over here you've got these
people who have been playing
00:15:52.710 --> 00:15:53.230
in this world.
00:15:53.230 --> 00:15:55.540
And there's this sort of
like buffer here now.
00:15:55.540 --> 00:15:58.223
And these guys are having to
spend quite a lot of cycles
00:15:58.223 --> 00:16:01.269
spinning up this infrastructure
before they actually start
00:16:01.269 --> 00:16:02.536
working on the problem.
00:16:02.536 --> 00:16:05.121
That's why we have people
like Cloudera and
00:16:05.121 --> 00:16:08.670
Hortonworks layering on top of
this to make that easier to do.
00:16:08.670 --> 00:16:10.886
But we've done
that too on Azure.
00:16:10.886 --> 00:16:12.350
And that's essentially HDInsight.
00:16:12.350 --> 00:16:15.390
So you can have
flavors of Hadoop,
00:16:15.390 --> 00:16:17.980
which is the earlier diagram we
saw, if I've got that right.
00:16:17.980 --> 00:16:18.830
>> Yeah, absolutely.
00:16:18.830 --> 00:16:22.940
And it's really interesting
because in the early days when
00:16:22.940 --> 00:16:26.460
we just had virtual machines
we tried to do this ourselves.
00:16:26.460 --> 00:16:28.020
It takes a very long time.
00:16:28.020 --> 00:16:31.569
And part of the problem is
that this software changes so
00:16:31.569 --> 00:16:35.273
regularly that keeping updated
with the latest version of
00:16:35.273 --> 00:16:38.668
the software is hard, especially in
the case of Apache Spark,
00:16:38.668 --> 00:16:40.616
which is one of my favorites.
00:16:40.616 --> 00:16:43.796
In the diagram, it's the
one with the orange star.
00:16:43.796 --> 00:16:45.156
>> Mm-hm.
00:16:45.156 --> 00:16:48.296
>> So, to keep up with
the updates which can sometimes
00:16:48.296 --> 00:16:51.143
happen like every week or
every two weeks, and
00:16:51.143 --> 00:16:54.282
there are pretty intense
changes in the software,
00:16:54.282 --> 00:16:56.550
you don't wanna
do this yourself.
00:16:56.550 --> 00:16:59.655
>> And it can break, obviously,
what you already have in place,
00:16:59.655 --> 00:17:02.291
because this is gonna be
orchestration maybe going on
00:17:02.291 --> 00:17:02.828
in Mahout.
00:17:02.828 --> 00:17:06.427
There's gonna be the tooling that
you're using, the scripts you
00:17:06.427 --> 00:17:09.437
have, could break because
their functionality could
00:17:09.437 --> 00:17:11.870
be deprecated or
there's a new name for it.
00:17:11.870 --> 00:17:14.780
And then something
that was really hard just gets
00:17:14.780 --> 00:17:16.720
super easy and
you didn't know about that.
00:17:16.720 --> 00:17:18.091
>> Exactly.
>> So you're not able to
00:17:18.091 --> 00:17:20.210
leverage as much as you want to.
00:17:20.210 --> 00:17:22.750
>> And it's a real
boon to just have this
00:17:22.750 --> 00:17:24.500
capability as
a platform service,
00:17:24.500 --> 00:17:29.020
where you can just leave it to
Microsoft to spin up a cluster,
00:17:29.020 --> 00:17:32.170
a Hadoop cluster, and
just execute some software and
00:17:32.170 --> 00:17:34.370
get a result which is really
what you want to do anyway.
00:17:34.370 --> 00:17:37.200
>> Yeah, and that was one of
the challenges we faced actually
00:17:37.200 --> 00:17:39.840
when we first started trying to
keep this up, of course, was that
00:17:39.840 --> 00:17:41.690
Hadoop was running on Windows.
00:17:41.690 --> 00:17:43.790
And so it was always going
to be slightly lagging what
00:17:43.790 --> 00:17:45.110
was happening on Linux and
00:17:45.110 --> 00:17:48.320
now we've got HDInsight, which
actually runs on Linux as well.
00:17:48.320 --> 00:17:51.100
If you want to, you can choose
when you spin it up and
00:17:51.100 --> 00:17:54.070
it's far easier
to port what you may have
00:17:54.070 --> 00:17:57.170
already originally done
with that platform.
00:17:57.170 --> 00:17:58.030
>> Yeah, absolutely.
00:17:58.030 --> 00:18:00.990
And I think that it also
resonates with a lot
00:18:00.990 --> 00:18:03.380
of existing users
of Hadoop because
00:18:03.380 --> 00:18:07.380
they started out life with
using Hadoop on Linux.
00:18:07.380 --> 00:18:10.050
It's a set of open
source tooling, so for
00:18:10.050 --> 00:18:14.750
Java developers, Java's a very,
very good skill for this because
00:18:14.750 --> 00:18:18.380
natively most of the Apache
frameworks are written in Java.
00:18:18.380 --> 00:18:19.170
>> Yes.
00:18:19.170 --> 00:18:22.010
>> You know I think the way
that Microsoft has embraced
00:18:22.010 --> 00:18:23.880
these languages has
been fantastic.
00:18:23.880 --> 00:18:27.190
All the way down
to being able to
00:18:27.190 --> 00:18:29.910
deploy things directly
onto Hadoop clusters from
00:18:29.910 --> 00:18:31.640
environments that
are not Visual Studio.
00:18:31.640 --> 00:18:32.930
So from IntelliJ,
00:18:32.930 --> 00:18:35.570
from Eclipse, you have all
the appropriate plugins.
00:18:35.570 --> 00:18:39.220
So just say I'm gonna write a
piece of software for Hadoop or
00:18:39.220 --> 00:18:40.900
Apache Spark or Storm.
00:18:40.900 --> 00:18:44.150
I'm gonna right-click on
this piece of software and
00:18:44.150 --> 00:18:46.060
then I'm just gonna deploy
this straight to Azure.
00:18:46.060 --> 00:18:48.753
>> And for people who
are the Microsoft fans and
00:18:48.753 --> 00:18:50.760
[INAUDIBLE] Channel 9, so
they might be.
00:18:50.760 --> 00:18:51.814
And we've got good
tooling as well for
00:18:51.814 --> 00:18:53.044
the traditional
Microsoft audience too.
00:18:53.044 --> 00:18:53.591
>> Yeah,
00:18:53.591 --> 00:18:58.160
exactly, and you can do this
straight from Visual Studio.
00:18:58.160 --> 00:19:01.160
Not to mention the fact that
Microsoft put a lot of energy
00:19:01.160 --> 00:19:01.860
and effort and
00:19:01.860 --> 00:19:07.910
guidance around how to use C#
as a language with Hadoop.
00:19:07.910 --> 00:19:11.415
And in Storm there's a, we'll
touch on Storm later on, but
00:19:11.415 --> 00:19:14.992
there's a framework that you can
use with Apache Storm called
00:19:14.992 --> 00:19:17.146
SCP which allows
you to write in C#.
00:19:17.146 --> 00:19:19.410
>> Wow, okay.
00:19:19.410 --> 00:19:21.466
>> Lot's of amazing innovations.
00:19:21.466 --> 00:19:23.380
>> So for a given problem, you don't
have to learn a new language,
00:19:23.380 --> 00:19:26.040
but you do need to
know some languages.
00:19:26.040 --> 00:19:28.010
To your point earlier
about the data scientists.
00:19:28.010 --> 00:19:29.660
It's just you don't need
to learn a new one.
00:19:29.660 --> 00:19:30.483
>> Yeah, that's right, exactly.
00:19:30.483 --> 00:19:32.364
>> Fantastic.
00:19:32.364 --> 00:19:36.280
>> So this is now HDInsight.
00:19:36.280 --> 00:19:39.091
>> Yeah, my favorite thing on
there as well, that R-Server.
00:19:39.091 --> 00:19:42.420
>> R-Server, I put that on there
specially for you, Andrew.
00:19:42.420 --> 00:19:44.460
>> Thank you. [LAUGH] >> [LAUGH]
>> I mean,
00:19:44.460 --> 00:19:46.880
it's worth talking about
the story of this.
00:19:46.880 --> 00:19:50.440
So Hadoop we've touched on,
we'll touch on how that works.
00:19:51.530 --> 00:19:57.077
There's a piece of software that
runs on HDInsight called HBase.
00:19:57.077 --> 00:20:00.057
Which is a petabyte-scale
column store.
00:20:00.057 --> 00:20:05.860
So we can use this almost
like a NoSQL store.
00:20:05.860 --> 00:20:08.990
Directly from Hadoop
to store our state.
00:20:08.990 --> 00:20:13.390
We've got Storm which processes
messages in real time.
00:20:13.390 --> 00:20:16.230
And HDInsight Storm syncs
up really well with
00:20:16.230 --> 00:20:18.200
other services
like the EventHub.
00:20:18.200 --> 00:20:20.870
So you can feed in thousands
of messages a second.
00:20:20.870 --> 00:20:23.320
>> So to characterize this,
if I've got this right,
00:20:23.320 --> 00:20:26.520
HBase makes Hadoop look a little
bit like a relational database.
00:20:26.520 --> 00:20:27.990
Would that be fair?
00:20:27.990 --> 00:20:32.510
Storm is where we wanna use
the real time kind of thing.
00:20:32.510 --> 00:20:36.230
And we'll come back and do each
of these in a bit more detail.
00:20:37.480 --> 00:20:41.450
You are familiar with Apache
Spark, what is that all about?
00:20:41.450 --> 00:20:44.210
>> One of the things that people
call Apache Spark is the Swiss
00:20:44.210 --> 00:20:46.302
army knife of big data.
00:20:46.302 --> 00:20:48.670
It will give you a flavor
of what that is.
00:20:48.670 --> 00:20:52.396
With Spark what you can do is
you can do many of the things
00:20:52.396 --> 00:20:55.300
that you can do with Hadoop.
00:20:55.300 --> 00:20:58.120
You can also run
interactive queries for it.
00:20:58.120 --> 00:20:59.426
>> Right.
>> So it's very difficult
00:20:59.426 --> 00:21:00.130
with Hadoop.
00:21:00.130 --> 00:21:02.150
>> Yeah, so
normally you're running a big job.
00:21:02.150 --> 00:21:04.020
So, that looks a bit like SQL.
00:21:04.020 --> 00:21:05.743
HQL looks a bit like SQL,
where you might have to go and
00:21:05.743 --> 00:21:07.097
have coffee while
the answer comes back-
00:21:07.097 --> 00:21:07.635
>> [LAUGH] That's right.
00:21:07.635 --> 00:21:10.420
>> Cuz you've orchestrated
this 500-node,
00:21:10.420 --> 00:21:13.872
I know you've worked
on a 500-node cluster.
00:21:13.872 --> 00:21:14.555
>> That's right.
00:21:14.555 --> 00:21:16.860
>> And that is going
to take a while, right.
00:21:16.860 --> 00:21:18.130
>> Yeah.
>> And I also, for me,
00:21:18.130 --> 00:21:21.780
as a rank amateur, when I do a
demo, I'm going to calculate Pi,
00:21:21.780 --> 00:21:23.120
or some simple calculation.
00:21:23.120 --> 00:21:24.930
I think I've spun up all this
infrastructure and it's taken
00:21:24.930 --> 00:21:27.660
me way longer than it would
have done on my calculator.
00:21:27.660 --> 00:21:29.050
But, of course,
it's designed to scale and
00:21:29.050 --> 00:21:31.090
it really takes off
when it scales.
00:21:31.090 --> 00:21:33.130
So, Spark enables us to
do this real time thing.
00:21:33.130 --> 00:21:34.130
That's going to
be super popular.
00:21:34.130 --> 00:21:34.690
>> Yes, it does.
00:21:34.690 --> 00:21:37.160
It has an interface
called Spark Streaming.
00:21:37.160 --> 00:21:39.981
And it also has a machine
learning library so you can use
00:21:39.981 --> 00:21:42.998
Spark now to actually distribute
machine learning code.
00:21:42.998 --> 00:21:44.598
>> Is there a special name for
that Spark?
00:21:44.598 --> 00:21:46.450
Is it just called Spark?
00:21:46.450 --> 00:21:49.894
>> So SparkR came out and
SparkR is available, so
00:21:49.894 --> 00:21:51.868
you can write this code in R.
00:21:51.868 --> 00:21:52.827
>> [CROSSTALK] Yeah, but
00:21:52.827 --> 00:21:54.856
we've also got this
thing on the right here.
00:21:54.856 --> 00:21:56.016
So I'm thoroughly confused now.
00:21:56.016 --> 00:21:57.916
>> This is slightly becoming
my second favorite.
00:21:57.916 --> 00:22:02.688
[LAUGH] So one of the things
that Microsoft did was to buy
00:22:02.688 --> 00:22:07.180
out a company called
Revolution Analytics.
00:22:07.180 --> 00:22:09.590
And what Revolution Analytics
00:22:09.590 --> 00:22:12.240
did was effectively create
a speedy version of R.
00:22:12.240 --> 00:22:12.813
>> A scalable version of R.
00:22:12.813 --> 00:22:16.770
>> A scalable version
of R exactly.
00:22:16.770 --> 00:22:21.060
R has a lot of problems
in its threading models.
00:22:21.060 --> 00:22:23.968
So you can't run this in
a multi-threaded,
00:22:23.968 --> 00:22:25.758
multi-process environment.
00:22:25.758 --> 00:22:26.630
>> Right.
>> And
00:22:26.630 --> 00:22:28.270
you can't run it in
a distributed environment.
00:22:28.270 --> 00:22:34.350
So Microsoft has been solving
these problems with Revolution,
00:22:34.350 --> 00:22:37.400
and also making R
a first-class citizen of Azure.
00:22:38.870 --> 00:22:40.930
>> Yeah, and there are
investments all over the Azure
00:22:40.930 --> 00:22:43.870
ecosystem, we'll be touching
on that as a continuing theme.
00:22:43.870 --> 00:22:46.220
I think just for
a brief diversion now.
00:22:46.220 --> 00:22:48.060
See one of the other questions
that I get asked a lot is,
00:22:48.060 --> 00:22:48.850
Python versus R.
00:22:48.850 --> 00:22:51.320
And of course, that's not really
a Microsoft fight because
00:22:51.320 --> 00:22:53.210
neither of those
are Microsoft languages.
00:22:53.210 --> 00:22:56.780
But I guess Python already had
some of that scalability and
00:22:57.840 --> 00:22:59.290
parallelism built into it.
00:22:59.290 --> 00:23:02.560
And so the algorithms that were
traditionally associated with R
00:23:02.560 --> 00:23:03.430
are being put into Python.
00:23:03.430 --> 00:23:06.976
But R's moved up and already had
all this really good ML stuff
00:23:06.976 --> 00:23:10.108
built into it, and Revolution
makes it behave like Python.
00:23:10.108 --> 00:23:14.269
So I think it, ultimately for me
and I hope you agree is use what
00:23:14.269 --> 00:23:18.039
you know, and
we've got really good support for both.
00:23:18.039 --> 00:23:21.813
>> Exactly, so
I think one of the things that's
00:23:21.813 --> 00:23:25.388
emerged with Python in
the big data space is
00:23:25.388 --> 00:23:30.375
there's a tool that you can
use called Jupyter Notebooks.
00:23:30.375 --> 00:23:30.875
>> Yes.
00:23:30.875 --> 00:23:34.385
>> And Jupyter Notebooks
promotes this idea of reuse.
00:23:34.385 --> 00:23:36.210
So you can create a notebook,
and
00:23:36.210 --> 00:23:37.690
then you can distribute
this to other people.
00:23:37.690 --> 00:23:39.440
So it's a very good
learning aid for Python.
00:23:39.440 --> 00:23:43.048
So I think a lot of people
have been learning from this.
00:23:43.048 --> 00:23:47.973
And the great thing about
it is that you can use
00:23:47.973 --> 00:23:52.646
libraries like Pandas
to use data frames.
00:23:52.646 --> 00:23:55.728
To basically take in memory
data sets, to pivot them, to
00:23:55.728 --> 00:23:59.132
manipulate them, to add columns
to them, to remove columns,
00:23:59.132 --> 00:24:01.788
to transform them between
different data sets.
00:24:01.788 --> 00:24:06.980
And with a little bit of effort,
you can distribute this as well.
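As a flavour of the kind of data-frame manipulation being described, here is a minimal Pandas sketch; the data and column names are invented purely for illustration:

```python
import pandas as pd

# A tiny in-memory data set, made up for this example.
df = pd.DataFrame({
    "county": ["Kent", "Kent", "Essex"],
    "age": [34, 51, 29],
})

# Add a derived column (10-year age bands)...
df["age_band"] = (df["age"] // 10) * 10

# ...pivot to count people per county and age band...
counts = df.pivot_table(index="county", columns="age_band",
                        values="age", aggfunc="count", fill_value=0)

# ...and remove a column we no longer need.
df = df.drop(columns=["age"])
```

In a Jupyter notebook each of these steps would typically sit in its own cell, which is part of what makes a notebook easy to share and re-run.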
00:24:06.980 --> 00:24:10.890
Now, Jupyter Notebooks are
supported through HDInsight.
00:24:10.890 --> 00:24:14.140
So you can run this
on a Spark Cluster.
00:24:14.140 --> 00:24:16.370
But it's also supported
through Azure ML.
00:24:16.370 --> 00:24:19.885
So, there's lots of different ways
to use Python in the Azure
00:24:19.885 --> 00:24:20.668
ecosystem.
00:24:20.668 --> 00:24:23.818
I think that as
you rightly said,
00:24:23.818 --> 00:24:26.638
R has really had a head start.
00:24:26.638 --> 00:24:29.520
The entire academic community
has been behind this.
00:24:29.520 --> 00:24:32.100
It's got some of the most
phenomenal machine learning
00:24:32.100 --> 00:24:32.970
libraries.
00:24:32.970 --> 00:24:35.120
Libraries for
every line of business as well.
00:24:35.120 --> 00:24:39.230
For economics, statistics,
for energy statistics, for
00:24:39.230 --> 00:24:40.990
time series analysis.
00:24:40.990 --> 00:24:44.430
And so you're not starting
from scratch with R.
00:24:44.430 --> 00:24:45.340
>> Yeah, okay.
00:24:45.340 --> 00:24:46.970
So we've got all
of this goodness
00:24:46.970 --> 00:24:48.400
just when you turn
this thing on.
00:24:48.400 --> 00:24:50.220
>> It's there.
And this is click a button, and
00:24:50.220 --> 00:24:50.830
it's there-
>> Yeah.
00:24:50.830 --> 00:24:51.860
>> Automate it, and it's there.
00:24:51.860 --> 00:24:55.058
So you've got full command and
control over this framework.
00:24:55.058 --> 00:24:59.436
And so, what is this?
00:24:59.436 --> 00:25:04.176
You've got a set of different
tools that you can use.
00:25:04.176 --> 00:25:07.516
You can use Mahout, you mentioned
that before, with Hadoop.
00:25:07.516 --> 00:25:08.276
>> Yeah.
00:25:08.276 --> 00:25:10.416
>> There's two ways
of saying this.
00:25:10.416 --> 00:25:13.931
>> I'm probably
going to upset somebody now.
00:25:13.931 --> 00:25:17.657
>> Probably, which will allow
us to do machine learning over
00:25:17.657 --> 00:25:19.115
a Hadoop Cluster.
00:25:19.115 --> 00:25:23.681
Giraph, which is a graph
engine, so it will allow us to
00:25:23.681 --> 00:25:28.456
take data in Hadoop, and turn
this into a graph basically.
00:25:28.456 --> 00:25:29.296
>> All right, okay.
00:25:29.296 --> 00:25:31.420
>> So
we can understand the edges and-
00:25:31.420 --> 00:25:31.940
>> Yeah, yeah.
00:25:31.940 --> 00:25:33.900
>> Graph databases, yeah.
00:25:33.900 --> 00:25:38.410
We've got several languages that
we can use: Pig, Hive, and .NET.
00:25:38.410 --> 00:25:42.270
And one of the innovations
that Microsoft has brought to
00:25:42.270 --> 00:25:44.040
Hadoop through HD Insight-
>> Mm-hm.
00:25:44.040 --> 00:25:47.580
Is to externalize the storage,
because
00:25:47.580 --> 00:25:51.530
Hadoop uses a distributed
file system called HDFS,
00:25:51.530 --> 00:25:55.380
and HDFS is reliant
on the disks.
00:25:55.380 --> 00:25:57.115
>> Yep.
>> So it's one of the things
00:25:57.115 --> 00:26:00.511
that you find with the Cloud
is that you don't need things
00:26:00.511 --> 00:26:01.723
running 24/7.
00:26:01.723 --> 00:26:03.230
>> No
>> So
00:26:03.230 --> 00:26:05.680
Hadoop is effectively
a managed service.
00:26:05.680 --> 00:26:08.230
HDInsight is a managed
service of Hadoop, right?
00:26:08.230 --> 00:26:09.960
We only want this for
00:26:09.960 --> 00:26:12.130
a few hours a day or-
>> Right.
00:26:12.130 --> 00:26:13.350
>> When we wanna do work.
00:26:13.350 --> 00:26:16.160
>> But if you turn it off
we're persisting the data when
00:26:16.160 --> 00:26:16.970
the Cluster's turned off.
00:26:16.970 --> 00:26:17.580
>> Exactly.
00:26:17.580 --> 00:26:19.840
And that's really where
the innovations come in.
00:26:19.840 --> 00:26:21.568
>> Yeah.
>> To externalize the storage
00:26:21.568 --> 00:26:22.830
into Azure storage.
00:26:22.830 --> 00:26:24.260
>> I think one of the things
you've got on here is
00:26:24.260 --> 00:26:25.020
Azure Blob Storage.
00:26:25.020 --> 00:26:27.070
And of course what we have
now is actually this data.
00:26:27.070 --> 00:26:28.250
I've talked about
Cosmos earlier.
00:26:28.250 --> 00:26:30.590
We have this Data Lake
storage now.
00:26:30.590 --> 00:26:34.630
And that should be the way if
you were just starting out today
00:26:34.630 --> 00:26:36.240
to play with this.
00:26:36.240 --> 00:26:40.990
That should be the landing place
for our data because it doesn't
00:26:40.990 --> 00:26:43.730
suffer some of the limitations
that exist for the Azure Blob.
00:26:43.730 --> 00:26:47.000
So just scale, how many
objects we can put in a store?
00:26:47.000 --> 00:26:49.610
And then I don't know, you've
hit this barrier with some of
00:26:49.610 --> 00:26:50.940
the projects you've
been working on.
00:26:50.940 --> 00:26:51.650
>> Yeah, and I mean,
00:26:51.650 --> 00:26:54.150
it's really interesting that
you should say that because
00:26:54.150 --> 00:26:56.540
I think with traditional
Azure Blob Storage,
00:26:56.540 --> 00:26:59.930
there are limitations on
how you can scale this.
00:26:59.930 --> 00:27:02.372
It's represented in
a metric called IOPS,
00:27:02.372 --> 00:27:04.618
Input/Output Operations
Per Second.
00:27:04.618 --> 00:27:07.814
So one of the things that you
have to do is you might have to
00:27:07.814 --> 00:27:09.456
go beyond the volume limit.
00:27:09.456 --> 00:27:13.416
I mean, 500 terabytes seems like
a huge amount of data to store
00:27:13.416 --> 00:27:15.366
but not in the world
of big data.
00:27:15.366 --> 00:27:19.628
[LAUGH] So you tend to have to
have patterns to allow you to
00:27:19.628 --> 00:27:23.050
have more throughput and
store more data.
00:27:23.050 --> 00:27:24.950
And that means you have to
scale the number of storage
00:27:24.950 --> 00:27:26.240
accounts out.
00:27:26.240 --> 00:27:29.500
So one of the things that Azure
Data Lake storage has brought us is
00:27:29.500 --> 00:27:32.690
the ability to have
a single point of access.
00:27:32.690 --> 00:27:33.310
>> Yeah.
00:27:33.310 --> 00:27:37.246
>> And the great thing about it
is that it's got an interface
00:27:37.246 --> 00:27:38.710
called WebHDFS.
00:27:38.710 --> 00:27:42.049
So as you can probably imagine
by the name that interface
00:27:42.049 --> 00:27:44.677
was actually written
with the idea of Hadoop
00:27:44.677 --> 00:27:47.035
technologies in mind
accessing this.
00:27:47.035 --> 00:27:48.010
>> Yeah.
00:27:48.010 --> 00:27:51.030
So we can just throw
our stuff in there,
00:27:51.030 --> 00:27:53.390
we can apply a job to it, turn
the Cluster off when we want to.
00:27:53.390 --> 00:27:55.860
We can still access
the refined data,
00:27:55.860 --> 00:27:58.900
the output of our work from
other tools, for example
00:27:58.900 --> 00:28:01.990
Power BI or whatever it happened
to be that we haven't used.
00:28:01.990 --> 00:28:02.690
>> Yeah, exactly.
00:28:03.720 --> 00:28:06.340
And the great thing about this
as well is that you can fully
00:28:06.340 --> 00:28:07.100
automate it.
00:28:07.100 --> 00:28:11.690
So, we end up in a situation
where we can use PowerShell.
00:28:11.690 --> 00:28:14.914
The new Azure Data
Factory, which is-
00:28:14.914 --> 00:28:15.460
>> Yes.
00:28:15.460 --> 00:28:18.360
>> A complete
orchestration software.
00:28:18.360 --> 00:28:20.435
>> Yeah and that's really,
really super cool for me.
00:28:20.435 --> 00:28:25.482
And I often get, again, people
get a little bit confused about
00:28:25.482 --> 00:28:29.010
this [INAUDIBLE] well, isn't
it like integration services?
00:28:29.010 --> 00:28:30.495
And the answer is, well, no.
00:28:30.495 --> 00:28:34.260
[LAUGH] I've come from
that background, and
00:28:34.260 --> 00:28:35.810
we've got Alan Mitchell
on later,
00:28:35.810 --> 00:28:39.570
and he used to be a legend
in integration services.
00:28:39.570 --> 00:28:41.610
>> He literally wrote
the book on it, didn't he?
00:28:41.610 --> 00:28:43.530
>> He did literally write
the book on it, yeah.
00:28:43.530 --> 00:28:46.480
But he's gone another way now,
he's on Stream Analytics and
00:28:46.480 --> 00:28:48.570
that's what you're going
to be talking about later.
00:28:48.570 --> 00:28:50.120
Now if we come back
to this a minute,
00:28:50.120 --> 00:28:54.140
this is an orchestration tool,
that is controlling our cluster.
00:28:54.140 --> 00:28:56.437
For HDInsight,
indeed it can actually spin
00:28:56.437 --> 00:28:56.963
one up
>> Sure
00:28:56.963 --> 00:28:57.836
>> Programmatically and
00:28:57.836 --> 00:28:59.730
then turn it off when
we're finished with it.
00:28:59.730 --> 00:29:02.690
Then it can run a job
inside that Cluster, but
00:29:02.690 --> 00:29:04.020
it will be written.
00:29:04.020 --> 00:29:05.770
If we're using
an HDInsight cluster, in
00:29:05.770 --> 00:29:06.680
the appropriate language.
00:29:06.680 --> 00:29:09.193
So it might be an HQL script or
if we're talking to a data
00:29:09.193 --> 00:29:11.759
warehouse we might just be
calling a traditional stored
00:29:11.759 --> 00:29:12.356
procedure.
00:29:12.356 --> 00:29:12.956
>> Correct.
00:29:12.956 --> 00:29:14.106
>> And if we're talking to Data Lake
00:29:14.106 --> 00:29:15.228
we'll be using
the U-SQL language.
00:29:15.228 --> 00:29:18.528
So it's a little bit different
but it gives you this one pane
00:29:18.528 --> 00:29:21.630
of glass where you can see
what's happening if you wanna
00:29:21.630 --> 00:29:24.668
run that job every night or
every hour or what have you.
00:29:24.668 --> 00:29:28.446
It's gonna give you that,
super useful, and of course,
00:29:28.446 --> 00:29:30.384
PowerShell as well, and
00:29:30.384 --> 00:29:33.808
I think, to your data
scientists point earlier.
00:29:33.808 --> 00:29:35.491
We need to learn
some JavaScript,
00:29:35.491 --> 00:29:37.174
we need to learn
some PowerShell,
00:29:37.174 --> 00:29:38.340
we need to learn some R.
00:29:38.340 --> 00:29:41.090
We might already know C#,
so we've now got this kind
00:29:41.090 --> 00:29:43.750
of multi-purpose kind
of scripting glue.
00:29:43.750 --> 00:29:45.740
And with some knowledge
in some of these tools.
00:29:45.740 --> 00:29:46.267
>> That's right.
>> To be able to go and
00:29:46.267 --> 00:29:49.350
do this stuff.
>> Yeah, exactly, and so,
00:29:49.350 --> 00:29:53.770
I mean,
When you break HDInsight down,
00:29:53.770 --> 00:29:58.345
there's different
components of this.
00:29:58.345 --> 00:29:59.422
>> Yeah.
>> You have logging
00:29:59.422 --> 00:30:01.260
stored in table storage.
00:30:01.260 --> 00:30:03.150
You have your data
stored in Blob storage.
00:30:03.150 --> 00:30:06.170
You have the state of the system
that you can attach a SQL
00:30:06.170 --> 00:30:09.900
database to effectively,
if you're gonna use Hive, which
00:30:09.900 --> 00:30:13.870
you mentioned in HQL, you're
gonna be creating entities.
00:30:13.870 --> 00:30:17.381
>> Right.
>> Things that look like tables
00:30:17.381 --> 00:30:20.921
in SQL Server, in a relational database.
00:30:20.921 --> 00:30:25.012
So when you do this, every time
you create a cluster you need to
00:30:25.012 --> 00:30:27.330
have that definition again.
00:30:27.330 --> 00:30:31.920
And that definition is stored
in an external database,
00:30:31.920 --> 00:30:34.170
so you can just attach that-
>> Back into it.
00:30:34.170 --> 00:30:35.970
>> And
your state just reappears.
00:30:35.970 --> 00:30:36.700
>> So you can.
00:30:36.700 --> 00:30:39.980
I guess this is a play for
maybe some organizations that
00:30:39.980 --> 00:30:42.590
aren't banks, or oil companies
or the big pharmaceutical boys,
00:30:42.590 --> 00:30:44.690
who have been
traditionally using this.
00:30:44.690 --> 00:30:47.370
It now means that say a smaller
retailer consortium who's got
00:30:47.370 --> 00:30:50.570
a lot of data can just flip this
on for whenever I need to and
00:30:50.570 --> 00:30:52.315
then turn it off again.
00:30:52.315 --> 00:30:53.862
>> Exactly.
And that really brings
00:30:53.862 --> 00:30:57.590
home the economics and
the power of the cloud.
00:30:57.590 --> 00:31:00.510
And there's a set way
of working with Hadoop.
00:31:00.510 --> 00:31:04.990
You can create a job, each job
has a task associated with it.
00:31:04.990 --> 00:31:08.260
So it has that familiar
distribution model.
00:31:08.260 --> 00:31:10.440
You apply a schema at
execution time, so
00:31:10.440 --> 00:31:14.990
if you were writing code to
define this in C# or Java.
00:31:14.990 --> 00:31:18.340
You could take a line
from a CSV and
00:31:18.340 --> 00:31:19.790
you could turn that
into an object,
00:31:19.790 --> 00:31:22.380
you can interrogate that,
you could type it, you could
00:31:22.380 --> 00:31:25.450
transform it, and then you could
output something different.
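A rough sketch of that schema-on-read idea, in Python rather than the C# or Java being discussed; the record fields are made up, and a real job would be reading lines out of HDFS or Blob storage rather than a literal string:

```python
from dataclasses import dataclass

@dataclass
class PageView:
    # Hypothetical record type: the schema lives only in this job's
    # code, not in the stored file -- that is schema-on-read.
    user: str
    url: str
    seconds: float

def parse_line(line: str) -> PageView:
    # Apply the schema at execution time: split the raw CSV text
    # and cast each field into a typed object.
    user, url, seconds = line.split(",")
    return PageView(user=user, url=url, seconds=float(seconds))

view = parse_line("alice,/home,3.5")
```

A different job tomorrow could parse the very same stored lines into a completely different shape.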
00:31:27.930 --> 00:31:31.170
So then you could
00:31:31.170 --> 00:31:34.140
focus on being able to put
all of these jobs together.
00:31:34.140 --> 00:31:35.100
So that you pipeline them and
00:31:35.100 --> 00:31:37.250
have what's called
intermediate outputs.
00:31:37.250 --> 00:31:40.170
You have a processing step.
00:31:40.170 --> 00:31:41.270
Intermediate output.
00:31:41.270 --> 00:31:44.720
You process the next step with
the data that you've enriched.
00:31:44.720 --> 00:31:49.019
And that way you can create
a veritable pipeline to lead you
00:31:49.019 --> 00:31:51.663
to that visualization
in PowerBI.
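The pipelining being described can be sketched as ordinary functions, where each step's output is materialised before the next step consumes it; the step names and data are invented for illustration:

```python
def parse(records):
    # Step 1: raw text -> (word, length) pairs.
    return [(r, len(r)) for r in records]

def enrich(pairs):
    # Step 2: consume the intermediate output, add a derived flag.
    return [(w, n, n > 4) for (w, n) in pairs]

def summarise(rows):
    # Step 3: final aggregate that a tool like Power BI could chart.
    return sum(1 for (_, _, is_long) in rows if is_long)

# Each stage's result is kept as an intermediate output,
# mirroring Hadoop's intermediate outputs between chained jobs.
stage1 = parse(["hadoop", "pig", "hive"])
stage2 = enrich(stage1)
result = summarise(stage2)
```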
00:31:54.588 --> 00:31:57.634
So, we talked about this
idea about splitting up and
00:31:57.634 --> 00:32:01.100
having many storage
accounts as a pattern. And-
00:32:01.100 --> 00:32:02.464
>> What's a WasB?
00:32:02.464 --> 00:32:04.698
>> I would say a WasB [LAUGH].
00:32:04.698 --> 00:32:06.567
>> I know what it is,
I'm just [INAUDIBLE].
00:32:06.567 --> 00:32:08.679
>> Yes.
>> [INAUDIBLE]
00:32:08.679 --> 00:32:10.615
>> Windows Azure Storage Blob,
00:32:10.615 --> 00:32:14.561
which actually conflicts with
the early version of the Service
00:32:14.561 --> 00:32:17.328
Bus, which was Windows
Azure Service Bus.
00:32:17.328 --> 00:32:21.808
So both teams' acronyms trod
on each other's toes, but
00:32:21.808 --> 00:32:24.320
this one stuck around longer.
00:32:24.320 --> 00:32:25.000
>> Yep.
00:32:25.000 --> 00:32:30.048
>> And it's still a protocol
that's used with storage.
00:32:30.048 --> 00:32:35.130
Now the newer protocol
that is used is ADL.
00:32:35.130 --> 00:32:35.640
>> Yeah.
00:32:35.640 --> 00:32:37.610
>> So we can replace
the WasB with ADL.
00:32:37.610 --> 00:32:38.250
>> With ADL, yep.
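Loosely, the two addressing schemes look like this; the account and container names here are invented purely for illustration:

```text
wasb://mycontainer@myaccount.blob.core.windows.net/data/input.csv
adl://myaccount.azuredatalakestore.net/data/input.csv
```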
00:32:38.250 --> 00:32:42.350
>> But it's great because you've
got two storages that you could
00:32:42.350 --> 00:32:45.320
potentially use, both of them
cover different economic
00:32:45.320 --> 00:32:47.940
implications and
different ways of using them.
00:32:47.940 --> 00:32:49.466
So you've got variety there.
00:32:49.466 --> 00:32:50.749
Yep, okay.
>> Secondly.
00:32:52.755 --> 00:32:56.475
And it's worth just
expressing that
00:32:56.475 --> 00:33:00.649
Microsoft has really invested
a huge amount into the,
00:33:00.649 --> 00:33:05.277
not only the adoption of Hadoop
through HDInsight, but also
00:33:05.277 --> 00:33:10.340
contributing to the core Hadoop
code base with Hortonworks.
00:33:10.340 --> 00:33:11.170
>> Right, so
00:33:11.170 --> 00:33:14.810
we're actually changing Hadoop
to make it more cloud aware.
00:33:14.810 --> 00:33:15.870
>> Yes.
>> Or putting
00:33:15.870 --> 00:33:16.890
our own hooks in it.
00:33:16.890 --> 00:33:18.580
And Hortonworks actually
wrote some of that,
00:33:18.580 --> 00:33:19.470
where we worked
with Hortonworks,
00:33:19.470 --> 00:33:21.000
to write some of
the code in there.
00:33:21.000 --> 00:33:21.610
>> Yeah.
>> Yeah.
00:33:21.610 --> 00:33:25.310
>> So the original Hadoop was
centered around this idea of
00:33:25.310 --> 00:33:26.980
MapReduce.
00:33:26.980 --> 00:33:29.840
And we'll talk
about that shortly,
00:33:29.840 --> 00:33:34.300
but what Microsoft did was they
helped create an abstraction.
00:33:34.300 --> 00:33:40.330
So they created a resource
manager, and then
00:33:40.330 --> 00:33:43.982
a very pluggable model so that
many of the things, like machine
00:33:43.982 --> 00:33:47.230
learning, MapReduce, some of
the languages, Pig and Hive.
00:33:47.230 --> 00:33:49.010
These will just
become components,
00:33:49.010 --> 00:33:49.920
that plug into Hadoop.
00:33:49.920 --> 00:33:50.930
>> Yeah,
>> So,
00:33:50.930 --> 00:33:52.870
it's a much more
flexible model now.
00:33:54.490 --> 00:33:56.240
This is sort of what
Map-Reduce looks like,
00:33:56.240 --> 00:34:00.940
it's a very easy way of
taking a set of data and
00:34:00.940 --> 00:34:04.180
then mapping that data,
into key value pairs.
00:34:04.180 --> 00:34:06.660
>> And it turns out that the key
value pairs are actually
00:34:06.660 --> 00:34:07.815
quite easy to distribute.
00:34:07.815 --> 00:34:08.370
>> Mm-hm, yeah.
00:34:08.370 --> 00:34:10.970
>> So you can create
clusters of machines.
00:34:10.970 --> 00:34:14.291
You can create processes
which can then take sets-
00:34:14.291 --> 00:34:15.685
>> Some sort of mechanism to
00:34:15.685 --> 00:34:18.272
work out against the key where
the particular set of data is on
00:34:18.272 --> 00:34:20.213
a particular node,
on a particular cluster,
00:34:20.213 --> 00:34:22.039
and then bring it back
together at the end.
00:34:22.039 --> 00:34:25.400
Have I got that right?
00:34:25.400 --> 00:34:27.080
>> 100%.
It's really a case of, for
00:34:27.080 --> 00:34:28.620
example, if you were counting.
00:34:28.620 --> 00:34:32.350
If you are counting a set
of ages of people, and
00:34:32.350 --> 00:34:34.640
you would map those ages, right?
00:34:34.640 --> 00:34:37.660
You might have a set of data
that describes each person, but
00:34:37.660 --> 00:34:39.820
you might wanna just
pull out the age.
00:34:39.820 --> 00:34:42.030
And it maybe even
associate a group.
00:34:42.030 --> 00:34:46.272
So the key in this case
could be by county or
00:34:46.272 --> 00:34:49.677
something, by county in the UK.
00:34:49.677 --> 00:34:54.129
And what you might wanna do is
just count all of the people
00:34:54.129 --> 00:34:56.890
within certain sets of ages.
00:34:56.890 --> 00:35:01.250
So, you could either take
a key which was a composite
00:35:01.250 --> 00:35:06.150
key of the county and
the age band, or you could
00:35:06.150 --> 00:35:10.860
take a set of ages against
a key with the county and
00:35:10.860 --> 00:35:12.900
then you could reduce that
down to something else,
00:35:12.900 --> 00:35:15.488
which would be a count or
some kind of aggregator.
00:35:15.488 --> 00:35:19.612
[CROSSTALK] Yeah, so these
two things work in tandem, so
00:35:19.612 --> 00:35:22.174
it's a very, very good pattern.
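That map/reduce over a composite (county, age band) key can be sketched in miniature, single-machine Python; on a real cluster the shuffle step would group the mapped pairs by key across nodes before the reducers run. The data here is made up for illustration:

```python
from collections import defaultdict

# Source records: one per person.
people = [
    {"name": "A", "county": "Kent", "age": 34},
    {"name": "B", "county": "Kent", "age": 37},
    {"name": "C", "county": "Essex", "age": 52},
]

def map_phase(records):
    # Map each record to a (composite key, value) pair:
    # key = (county, 10-year age band), value = 1 to be counted.
    for r in records:
        band = (r["age"] // 10) * 10
        yield (r["county"], band), 1

def reduce_phase(pairs):
    # Reduce: sum the values for each key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

counts = reduce_phase(map_phase(people))
```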
00:35:22.174 --> 00:35:26.207
And you can see here, from
a top-down view, that each of
00:35:26.207 --> 00:35:30.241
these is individual processes,
so you have a map process
00:35:30.241 --> 00:35:35.010
sitting on a Hadoop node and
then you have a reduce process.
00:35:35.010 --> 00:35:38.890
And so that's essentially
how you can distribute this.
00:35:38.890 --> 00:35:40.740
>> Yeah, and then we're building
out these big clusters.
00:35:40.740 --> 00:35:41.710
And then we have,
00:35:41.710 --> 00:35:44.570
is it head nodes that sort
of overall manage that?
00:35:44.570 --> 00:35:46.720
And we have a couple of those
just in case one dies, so
00:35:46.720 --> 00:35:48.320
that we can persist state.
00:35:48.320 --> 00:35:48.940
>> Exactly.
>> So
00:35:48.940 --> 00:35:50.810
it's a highly available solution.
00:35:50.810 --> 00:35:53.510
>> Yeah, so
you've got this other
00:35:53.510 --> 00:35:57.010
business continuity actually
baked into the cluster.
00:35:58.590 --> 00:36:02.900
And so if we just briefly touch
on programming on Hadoop.
00:36:02.900 --> 00:36:06.200
Some Java code there.
00:36:06.200 --> 00:36:08.380
But generally, this is what
a Map-Reduce will look like.
00:36:08.380 --> 00:36:10.850
>> Yeah, okay, so if you haven't
seen Java before, you don't
00:36:10.850 --> 00:36:13.220
still need to be a rocket
scientist to read this stuff.
00:36:13.220 --> 00:36:17.390
I can see what is happening out
there, we can run some sort of
00:36:17.390 --> 00:36:21.420
loop, be getting some words out
and putting some counts out as well.
00:36:22.440 --> 00:36:24.850
>> And so we're
pushing a set of words out.
00:36:24.850 --> 00:36:27.660
So, you know, if we take
this from the top with the map.
00:36:27.660 --> 00:36:29.160
We're mapping each
individual word.
00:36:29.160 --> 00:36:31.770
We're tokenizing
them in the loop.
00:36:31.770 --> 00:36:35.790
And when we reduce these we're
just summing these up so
00:36:35.790 --> 00:36:38.710
that we have-
>> Yeah, so just word count.
00:36:38.710 --> 00:36:39.670
>> Exactly.
00:36:39.670 --> 00:36:42.540
>> So we reduced the complete
works of Shakespeare to a list
00:36:42.540 --> 00:36:44.810
of all the words in
Shakespeare with a count, and
00:36:44.810 --> 00:36:46.223
then we can plot that.
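A rough single-machine Python analogue of the word count being described; a real Hadoop job would distribute the map and reduce phases across the cluster rather than loop on one box:

```python
from collections import Counter

def map_words(line):
    # Map: tokenize a line and emit (word, 1) pairs.
    for word in line.lower().split():
        yield word, 1

def word_count(lines):
    # Reduce: sum the counts per word across all mapped pairs.
    totals = Counter()
    for line in lines:
        for word, n in map_words(line):
            totals[word] += n
    return totals

counts = word_count(["to be or not to be"])
```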
00:36:46.223 --> 00:36:48.042
Much more [INAUDIBLE]
at the end of that.
00:36:48.042 --> 00:36:50.054
>> And that's just a small
piece of code that we can then
00:36:50.054 --> 00:36:51.582
distribute across
the Hadoop system.
00:36:51.582 --> 00:36:53.199
>> And
I guess the point of this,
00:36:53.199 --> 00:36:56.240
just pulling back a minute,
is that if we take a book like
00:36:56.240 --> 00:36:59.630
the one we're giving away here,
we might wanna do that on it.
00:36:59.630 --> 00:37:00.370
But then tomorrow,
00:37:00.370 --> 00:37:02.360
we might wanna do something
completely different.
00:37:02.360 --> 00:37:04.160
We might have a look
at sentence length or
00:37:04.160 --> 00:37:06.400
something like that, and
plot that out over the book.
00:37:06.400 --> 00:37:09.620
We could store the source data.
00:37:09.620 --> 00:37:11.680
We don't throw it away,
but we'll run a job for
00:37:11.680 --> 00:37:12.460
a different purpose.
00:37:12.460 --> 00:37:15.550
And rather than trying to guess
up front what we wanna do with
00:37:15.550 --> 00:37:17.580
this data and put it into
some relational structure.
00:37:17.580 --> 00:37:19.150
Each time we're running
one of these jobs,
00:37:19.150 --> 00:37:21.610
we're thinking about it from
a different perspective.
00:37:21.610 --> 00:37:22.690
So how many time,
00:37:22.690 --> 00:37:24.900
how many equations are there in
there whatever it happens to be.
00:37:24.900 --> 00:37:25.603
>> Yeah, absolutely.
00:37:25.603 --> 00:37:28.016
>> So I think that's quite
an interesting point here that
00:37:28.016 --> 00:37:30.063
we're doing this sort of
schema on demand when
00:37:30.063 --> 00:37:31.070
we're running this.
00:37:31.070 --> 00:37:32.490
>> Yeah, and this idea of reuse.
00:37:32.490 --> 00:37:34.880
All of a sudden, we can count any
number of words in any book.
00:37:34.880 --> 00:37:36.420
>> Any book, yeah,
we've written that recipe.
00:37:36.420 --> 00:37:38.030
We just throw whatever
we want at it.
00:37:38.030 --> 00:37:38.540
>> Yeah.
00:37:38.540 --> 00:37:39.350
Absolutely.
00:37:39.350 --> 00:37:45.800
>> So, Java is not the only
way to interface with Hadoop.
00:37:45.800 --> 00:37:49.970
We have HIVE, which you can see
is a SQL-esque type language.
00:37:49.970 --> 00:37:52.130
We can physically
partition this so
00:37:52.130 --> 00:37:53.680
that we can increase
performance.
00:37:53.680 --> 00:37:54.910
And we can delimit this.
00:37:54.910 --> 00:37:57.660
In this case we've
done this by country
00:37:57.660 --> 00:37:59.670
with our page_views tables.
00:37:59.670 --> 00:38:03.620
So we're capturing view times,
last referrer tags, the URL for
00:38:03.620 --> 00:38:04.550
the page.
00:38:04.550 --> 00:38:07.180
And then we're storing
these as a sequence file.
00:38:07.180 --> 00:38:08.014
So Hadoop has this concept of
00:38:08.014 --> 00:38:08.553
a sequence file-
>> Right.
00:38:08.553 --> 00:38:11.766
>> Where it goes from
zero to 10,000 and
00:38:11.766 --> 00:38:14.790
we'll just number
these in order.
00:38:14.790 --> 00:38:15.310
>> Right.
Okay.
00:38:15.310 --> 00:38:16.530
>> So
it's a different file type.
00:38:16.530 --> 00:38:19.830
And we can also extend
this with Java if we want.
00:38:19.830 --> 00:38:22.378
So you can see here that I've
created what's called a user
00:38:22.378 --> 00:38:23.210
defined function.
00:38:23.210 --> 00:38:24.280
>> Mm-hm.
>> And then I can just apply
00:38:24.280 --> 00:38:25.370
this function in SQL.
00:38:25.370 --> 00:38:25.960
>> Yeah.
00:38:25.960 --> 00:38:29.010
>> And then so, these are a
very, very familiar concept for
00:38:29.010 --> 00:38:29.920
SQL programmers.
00:38:29.920 --> 00:38:31.180
>> Mm-hm.
>> In doing these
00:38:31.180 --> 00:38:32.020
sorts of thing.
00:38:32.020 --> 00:38:32.940
>> Yeah.
>> They can just
00:38:32.940 --> 00:38:35.060
apply themselves
straight away to Hive.
00:38:35.060 --> 00:38:35.560
>> Yeah.
>> So it's-
00:38:35.560 --> 00:38:37.057
>> So we're describing syntax
00:38:37.057 --> 00:38:40.391
here in JavaScript terms.
And we're using that language
00:38:40.391 --> 00:38:43.569
because it's more appropriate to
deal with strings and character
00:38:43.569 --> 00:38:46.464
manipulation and looking for
words and splitting out spaces
00:38:46.464 --> 00:38:49.385
or whatever it happens to be.
And then we're able to have
00:38:49.385 --> 00:38:51.236
a relational structure.
And once we've
00:38:51.236 --> 00:38:53.850
declared that structure,
then we can code over it.
00:38:53.850 --> 00:38:54.380
>> Yeah, exactly.
00:38:54.380 --> 00:38:55.490
>> Got the idea.
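The pattern just described — use a scripting language for the string work, then expose the result as a relational structure you can code over — can be sketched like this (the input text is made up; the transcript's example uses JavaScript for the string step):

```python
# String manipulation first, relational structure second.

text = "big data big insight"

# Scripting step: split on spaces, normalise case
words = text.lower().split()

# Relational step: turn the flat word list into (word, count) rows
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1
rows = sorted(counts.items())
```

Once the rows exist, any set-based query logic can run over them, which is the point being made about declaring structure before coding over it.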
00:38:55.490 --> 00:38:59.990
>> And there's another way
of approaching Hadoop.
00:38:59.990 --> 00:39:06.720
So there's different
ways of looking at data.
00:39:08.010 --> 00:39:10.960
There'll be SQL programmers
that come to Hadoop.
00:39:10.960 --> 00:39:14.220
But there'll also be people that
think about things in steps and
00:39:14.220 --> 00:39:16.470
sequences, that are more
used to workflows.
00:39:16.470 --> 00:39:20.300
And Pig is one such way
of describing this.
00:39:20.300 --> 00:39:25.270
>> Yeah, so we've seen Pig, we've
seen Hive, and some JavaScript.
00:39:25.270 --> 00:39:27.350
Why have we got three things
that are doing essentially
00:39:27.350 --> 00:39:27.870
the same thing?
00:39:27.870 --> 00:39:30.170
Is this just historical,
different communities?
00:39:30.170 --> 00:39:34.020
>> Yes.
>> Is this a best use case?
00:39:34.020 --> 00:39:38.100
>> I think it's evolved this
way because many companies, for
00:39:38.100 --> 00:39:42.630
example Facebook, have put input
into Hive; they took that core
00:39:42.630 --> 00:39:46.380
skill set and then they
created extensions of Hadoop.
00:39:46.380 --> 00:39:47.180
>> To work their way.
00:39:47.180 --> 00:39:50.171
So they changed Hadoop to work
the way they want it to work,
00:39:50.171 --> 00:39:52.995
rather than learning how to
work the way Hadoop works.
00:39:52.995 --> 00:39:55.330
>> Exactly, because not
everybody's a Java programmer,
00:39:55.330 --> 00:39:57.050
it's not always convenient.
00:39:57.050 --> 00:39:59.717
You've got a good friend called
Andy who absolutely hates Java.
00:39:59.717 --> 00:40:01.824
>> Well, it depends, on different days,
00:40:01.824 --> 00:40:05.359
he's got different things to say
about different things.
00:40:05.359 --> 00:40:06.500
>> [LAUGH]
>> So
00:40:06.500 --> 00:40:09.570
yeah, you probably
caught him off Java day.
00:40:09.570 --> 00:40:10.460
>> Okay. [LAUGH] >> But
you can see,
00:40:10.460 --> 00:40:12.360
you've got a very
familiar syntax here.
00:40:12.360 --> 00:40:13.110
>> Yeah.
00:40:13.110 --> 00:40:16.020
>> We're dealing with
collections in [INAUDIBLE].
00:40:16.020 --> 00:40:18.750
We can manipulate things but
we can see this as a work flow.
00:40:18.750 --> 00:40:19.820
>> Yes.
>> So it's not set-
00:40:19.820 --> 00:40:20.680
based like Hive.
00:40:20.680 --> 00:40:22.620
>> Yes, yes.
00:40:22.620 --> 00:40:24.770
>> And we can also do
things like unions and
00:40:24.770 --> 00:40:28.530
we can iterate these collections
and we can move from variable to
00:40:28.530 --> 00:40:35.060
variable and just enact on each
of these different variables.
00:40:35.060 --> 00:40:40.020
So, it's a way of seeing
much of this visually.
00:40:40.020 --> 00:40:44.810
So, we've got a short time left
and I wanted to just draw this to a close.
00:40:44.810 --> 00:40:45.500
>> Yeah.
00:40:45.500 --> 00:40:49.530
>> Not saying too
much about the real-time
00:40:49.530 --> 00:40:50.530
stuff, but it's a.
00:40:50.530 --> 00:40:52.470
>> We have a [INAUDIBLE]
>> Yes, we do, but
00:40:52.470 --> 00:40:55.430
in terms of the open
source side of things.
00:40:55.430 --> 00:40:56.900
>> Yeah, sure.
00:40:56.900 --> 00:40:58.460
>> There's a couple
of frameworks.
00:40:58.460 --> 00:41:00.300
There's Storm,
which we talked about,
00:41:00.300 --> 00:41:01.710
which is part of HDInsight.
00:41:01.710 --> 00:41:02.270
>> Yep.
>> And
00:41:02.270 --> 00:41:07.260
this processes messages using a
concept called spouts and bolts.
00:41:07.260 --> 00:41:10.450
Each message that's processed
is processed in an atomic way.
00:41:10.450 --> 00:41:13.740
So if it gets to the end of its
pipeline of spouts and bolts,
00:41:13.740 --> 00:41:16.690
and you can think of this
as like a message router
00:41:16.690 --> 00:41:17.700
within the cluster.
00:41:17.700 --> 00:41:21.440
So we can define that route,
we can change, or
00:41:21.440 --> 00:41:23.770
transform the data
as we go along.
00:41:23.770 --> 00:41:26.730
The spouts themselves,
Microsoft has written one for
00:41:26.730 --> 00:41:31.160
the event hub, which is a hugely
high performance spout which can
00:41:31.160 --> 00:41:33.650
ingest say 4,000 messages
a second on each node.
00:41:33.650 --> 00:41:35.193
>> Wow, yeah?
00:41:35.193 --> 00:41:36.034
And on each node?
00:41:36.034 --> 00:41:36.954
So you can have multiple nodes?
00:41:36.954 --> 00:41:37.605
>> Yeah, on each node.
>> Well how many
00:41:37.605 --> 00:41:38.127
nodes can we have?
00:41:38.127 --> 00:41:39.810
>> Well, you can have as
many nodes as you want.
00:41:39.810 --> 00:41:41.632
>> So what's the biggest one
you've worked on recently?
00:41:41.632 --> 00:41:44.810
>> Well,
when my credit card ran out.
00:41:44.810 --> 00:41:45.467
[LAUGH].
00:41:45.467 --> 00:41:46.550
>> [LAUGH]
>> Yeah,
00:41:46.550 --> 00:41:48.390
we tend to take it to the limit.
00:41:48.390 --> 00:41:51.400
>> So jolly good, so warn
your wife before you fire up
00:41:51.400 --> 00:41:53.371
Storm, if you're watching this.
00:41:53.371 --> 00:41:55.216
[LAUGH]
>> Well, not in this case.
00:41:55.216 --> 00:41:56.422
I was in the doghouse.
00:41:56.422 --> 00:41:57.510
>> Okay.
>> [LAUGH]
00:41:57.510 --> 00:41:58.080
>> Okay.
00:41:58.080 --> 00:41:59.560
So it scales massively,
00:41:59.560 --> 00:42:02.290
obviously you've gotta have
your HD Insight cluster up all
00:42:02.290 --> 00:42:04.790
the time, if you're gonna kind
of do this real-time scenario,
00:42:04.790 --> 00:42:06.960
and we'll come back
to that later on.
00:42:06.960 --> 00:42:10.170
But again, you've kind of got
spouts going into bolts here.
00:42:10.170 --> 00:42:12.340
So we've got this
initial processing,
00:42:12.340 --> 00:42:14.640
refinement of processing,
working out way through.
00:42:14.640 --> 00:42:17.820
>> Exactly, because I mean,
this whole process is about
00:42:17.820 --> 00:42:21.190
taking data from multiple
sources, enriching data sets,
00:42:21.190 --> 00:42:22.170
joining them.
00:42:22.170 --> 00:42:25.130
And really getting something
better than you started with at
00:42:25.130 --> 00:42:27.000
the end of the day,
to give you business value.
00:42:27.000 --> 00:42:27.990
>> Okay.
>> And
00:42:27.990 --> 00:42:30.990
there's a couple of simple
approaches you can see there.
00:42:30.990 --> 00:42:32.980
And what we just looked at
on the screen was called
00:42:32.980 --> 00:42:36.240
a topology, which can define
things like parallelism,
00:42:36.240 --> 00:42:38.360
we can set the number
of workers.
00:42:38.360 --> 00:42:39.940
We can have a word-
count bolt here,
00:42:39.940 --> 00:42:41.660
we can have a split-sentence bolt.
00:42:41.660 --> 00:42:44.440
So we can use this idea of
a single responsibility
00:42:44.440 --> 00:42:45.180
pattern in this.
00:42:45.180 --> 00:42:45.768
>> Yeah, yeah.
00:42:45.768 --> 00:42:50.370
[INAUDIBLE] So this loosely
covers, I've just got these off-
00:42:50.370 --> 00:42:53.630
the-shelf things and
I'm gonna wire them together.
00:42:53.630 --> 00:42:54.160
>> Exactly,
00:42:54.160 --> 00:42:57.570
and I think there's a couple
of different approaches.
00:42:57.570 --> 00:43:00.260
You can see that this is some
code that actually takes in
00:43:00.260 --> 00:43:02.670
a bolt, and
everything in Storm is a tuple.
00:43:02.670 --> 00:43:05.370
So we take in a tuple,
we emit a tuple.
00:43:05.370 --> 00:43:08.408
Which is just a named value
as far as we're concerned.
00:43:08.408 --> 00:43:12.020
>> Yeah.
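The tuple-in, tuple-out pipeline of spouts and bolts just described can be sketched with plain Python generators. This is only the shape of the idea, not the Storm API; the sentences and stage names are invented, following the classic word-count topology:

```python
# Storm-style pipeline sketch: a spout emits tuples, each bolt
# consumes tuples and emits new ones downstream.

def sentence_spout():
    """Spout: the source of raw messages (here, a fixed list)."""
    for s in ["the cow jumped", "the moon"]:
        yield (s,)                      # emit a one-field tuple

def split_sentence_bolt(tuples):
    """Bolt: split each sentence tuple into word tuples."""
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word,)

def word_count_bolt(tuples):
    """Bolt: keep a running count per word."""
    counts = {}
    for (word,) in tuples:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the topology: spout -> split bolt -> count bolt
counts = word_count_bolt(split_sentence_bolt(sentence_spout()))
```

In real Storm each stage would run on many workers in parallel, with the topology defining how tuples are routed between them.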
00:43:12.020 --> 00:43:13.260
>> And just to, sort of again,
00:43:13.260 --> 00:43:15.750
bring this home,
we've got Apache Spark.
00:43:15.750 --> 00:43:20.030
And Spark is a part
of HDInsight now.
00:43:20.030 --> 00:43:23.150
>> It runs on this
clustered mechanism.
00:43:23.150 --> 00:43:24.530
It's got a whole lot
of things to it.
00:43:24.530 --> 00:43:25.970
It's got a machine
learning part.
00:43:25.970 --> 00:43:27.580
It's got a graph database.
00:43:27.580 --> 00:43:29.800
It can stream messages
in real time, so
00:43:29.800 --> 00:43:32.790
we can have micro-batching, we can
look at a window of data and
00:43:32.790 --> 00:43:34.088
we can do analysis over that.
00:43:34.088 --> 00:43:38.040
We can cache things, and there's
this idea of RDDs,
00:43:38.040 --> 00:43:40.405
so Resilient
Distributed Datasets.
00:43:40.405 --> 00:43:41.020
>> All right.
00:43:41.020 --> 00:43:43.520
>> So
one of the benefits here is that
00:43:43.520 --> 00:43:45.450
we can pull all of
our data in memory.
00:43:45.450 --> 00:43:48.380
So in this case you can see
that we read in a text file and
00:43:48.380 --> 00:43:49.140
then we filter.
00:43:49.140 --> 00:43:50.780
So this is a log, and
00:43:50.780 --> 00:43:53.640
we're filtering for
anything that starts with error.
00:43:53.640 --> 00:43:55.810
And then we can split
each sentence and
00:43:55.810 --> 00:43:58.540
take the second part of this.
00:43:58.540 --> 00:44:01.170
And second part of the array
in this case may be
00:44:01.170 --> 00:44:01.750
the actual error.
00:44:01.750 --> 00:44:02.960
>> The error message,
yeah, absolutely.
00:44:02.960 --> 00:44:03.580
>> Exactly.
So
00:44:03.580 --> 00:44:05.520
all of the sudden we've
got error messages, but
00:44:05.520 --> 00:44:07.290
they're in memory
across the cluster.
00:44:07.290 --> 00:44:10.540
So potentially we could get
much faster performance than
00:44:10.540 --> 00:44:14.570
we can with Hadoop, because we
don't have that IO bottleneck.
00:44:14.570 --> 00:44:18.590
And then as we go down
the line with Spark,
00:44:18.590 --> 00:44:22.590
we see that actually what
the Spark community has done
00:44:22.590 --> 00:44:25.350
is introduce the idea
of DataFrames.
00:44:25.350 --> 00:44:26.820
Which we know from R,
we know from Python.
00:44:26.820 --> 00:44:29.120
>> Yeah, yeah, yeah,
I was gonna say. Yeah.
00:44:29.120 --> 00:44:29.430
>> And so
00:44:29.430 --> 00:44:32.410
you can apply a lot of your
familiar programming paradigms
00:44:32.410 --> 00:44:34.040
if you come from these worlds,
00:44:34.040 --> 00:44:37.385
but you can turn them into
a more distributed framework.
00:44:37.385 --> 00:44:38.040
>> Yeah.
00:44:38.040 --> 00:44:42.010
>> So then if you understand
DataFrames, you can do joins
00:44:42.010 --> 00:44:46.250
between DataFrames, in memory,
across this cluster, at scale.
00:44:46.250 --> 00:44:48.430
Right, and this is something
that Microsoft has provided.
00:44:48.430 --> 00:44:50.990
Literally, it looks like
it's built in through HDInsight Spark.
00:44:50.990 --> 00:44:53.120
>> Fantastic, and are you
able to just as we close out,
00:44:53.120 --> 00:44:56.020
give us a couple of examples of
the kind of thing, I don't need
00:44:56.020 --> 00:44:59.440
to know names of who it's for,
but obviously I know you guys
00:44:59.440 --> 00:45:01.430
are out there in industry all
the time doing cool things.
00:45:01.430 --> 00:45:03.660
What are you using Spark for
at the moment?
00:45:03.660 --> 00:45:04.960
>> Yeah, of course.
00:45:04.960 --> 00:45:11.080
So, we've written an entire
recommendations engine on Spark.
00:45:11.080 --> 00:45:14.460
>> Right.
>> We can take in data for
00:45:14.460 --> 00:45:19.390
20 million users,
20 million customers,
00:45:19.390 --> 00:45:23.950
normally creating heavy cross
joins and these sorts of things.
00:45:23.950 --> 00:45:25.360
There's a lot of
computational work,
00:45:25.360 --> 00:45:28.170
but Spark does this in
a highly efficient way,
00:45:28.170 --> 00:45:32.110
so we'll end up with a set of
numbers where we can define
00:45:32.110 --> 00:45:35.122
the exact products that people
should be buying from this.
00:45:35.122 --> 00:45:38.290
>> [CROSSTALK] consumed inside
that web application, yeah.
00:45:38.290 --> 00:45:39.848
>> Exactly.
00:45:39.848 --> 00:45:42.160
>> Horace,
that's fried my brain, I think.
00:45:42.160 --> 00:45:43.977
And it's only 11
o'clock in the morning.
00:45:43.977 --> 00:45:45.990
>> [LAUGH] More to come on that.
00:45:45.990 --> 00:45:48.620
>> More to come on that cuz we've
got some new colleagues
00:45:48.620 --> 00:45:50.140
sending out the latest
documents about this.
00:45:50.140 --> 00:45:52.286
So do you have a question for
us?
00:45:52.286 --> 00:45:54.599
I think there's
supposed to be one.
00:45:54.599 --> 00:45:59.803
>> Yes, so my question is,
what are the five
00:45:59.803 --> 00:46:04.144
services provided by HD Insight?
00:46:04.144 --> 00:46:07.190
>> Okay, what are the five
services provided by HD Insight?
00:46:07.190 --> 00:46:11.320
Get that on Twitter to our team,
and the prize is in the post.
00:46:11.320 --> 00:46:13.970
Richard, thanks for
taking time out of
00:46:13.970 --> 00:46:15.230
your valuable day to come and
00:46:15.230 --> 00:46:18.540
share your experience with our
wonderful Channel 9 audience.
00:46:18.540 --> 00:46:19.780
>> It's been absolutely great.
00:46:19.780 --> 00:46:20.920
>> Let's do it again soon, sir.
00:46:20.920 --> 00:46:21.770
Thank you for your time.
00:46:21.770 --> 00:46:25.723
>> Take care.
>> Cheers now, bye.