1
00:00:00,000 --> 00:00:00,030
2
00:00:00,030 --> 00:00:02,420
The following content is
provided under a Creative
3
00:00:02,420 --> 00:00:03,860
Commons license.
4
00:00:03,860 --> 00:00:06,870
Your support will help MIT
OpenCourseWare continue to
5
00:00:06,870 --> 00:00:10,540
offer high quality educational
resources for free.
6
00:00:10,540 --> 00:00:13,410
To make a donation or view
additional materials from
7
00:00:13,410 --> 00:00:16,610
hundreds of MIT courses, visit
MIT OpenCourseWare at
8
00:00:16,610 --> 00:00:17,860
ocw.mit.edu.
9
00:00:17,860 --> 00:00:21,030
10
00:00:21,030 --> 00:00:26,076
PROFESSOR: I guess [OBSCURED]
11
00:00:26,076 --> 00:00:28,030
Let's get going.
12
00:00:28,030 --> 00:00:30,300
OK, should I introduce you?
13
00:00:30,300 --> 00:00:30,610
BRADLEY KUSZMAUL: If you want.
14
00:00:30,610 --> 00:00:31,810
I can introduce myself.
15
00:00:31,810 --> 00:00:34,735
PROFESSOR: We have Bradley
Kuszmaul who's been doing
16
00:00:34,735 --> 00:00:36,921
work on Cilk.
17
00:00:36,921 --> 00:00:42,935
He's done very interesting
parallel work, and also what you
18
00:00:42,935 --> 00:00:47,820
can say about the program. It's
a very interesting project
19
00:00:47,820 --> 00:00:52,873
that's been coming for a while, and
there's a lot of interesting
20
00:00:52,873 --> 00:00:55,816
things he's developed,
and multicore is
21
00:00:55,816 --> 00:00:59,740
becoming very important.
22
00:00:59,740 --> 00:01:00,860
BRADLEY KUSZMAUL: So how
many of you people have
23
00:01:00,860 --> 00:01:04,070
ever heard of Cilk?
24
00:01:04,070 --> 00:01:05,870
Have used it?
25
00:01:05,870 --> 00:01:09,370
So those of you who have
used it may find
26
00:01:09,370 --> 00:01:13,400
this talk old or whatever.
27
00:01:13,400 --> 00:01:16,840
So Cilk is a system that
runs on a shared-memory
28
00:01:16,840 --> 00:01:17,750
multiprocessor.
29
00:01:17,750 --> 00:01:21,690
So this is not like the system
you've been programming for
30
00:01:21,690 --> 00:01:22,720
this class.
31
00:01:22,720 --> 00:01:25,440
This kind of machine you have
processors, which each have
32
00:01:25,440 --> 00:01:28,840
cache and some sort of a network
and a bunch of memory
33
00:01:28,840 --> 00:01:33,720
and when the processors do
memory operations they are all
34
00:01:33,720 --> 00:01:36,780
on the same address space
and it's typically--
35
00:01:36,780 --> 00:01:39,000
the memory system provides some
sort of coherence like
36
00:01:39,000 --> 00:01:41,540
strong consistency or maybe
release consistency.
37
00:01:41,540 --> 00:01:44,690
38
00:01:44,690 --> 00:01:49,180
We're interested in the case
where the distance from
39
00:01:49,180 --> 00:01:52,420
processors to other processors
and from processors to memory
40
00:01:52,420 --> 00:01:55,980
may be nonuniform and so it's
important to use the cache
41
00:01:55,980 --> 00:02:02,500
well in this kind of machine
because you can't
42
00:02:02,500 --> 00:02:03,600
just ignore the cache.
43
00:02:03,600 --> 00:02:06,810
So sort of the technology that
I'm going to talk about for
44
00:02:06,810 --> 00:02:09,310
this kind of system
is called Cilk.
45
00:02:09,310 --> 00:02:13,240
Cilk is a C language and it does
dynamic multithreading
46
00:02:13,240 --> 00:02:15,390
and it has a provably
good runtime system.
47
00:02:15,390 --> 00:02:18,630
So I'll talk about what
those all mean.
48
00:02:18,630 --> 00:02:22,420
Cilk runs on shared-memory
machines like Suns and SGIs
49
00:02:22,420 --> 00:02:25,920
and well, you probably can't
find Alphaservers anymore.
50
00:02:25,920 --> 00:02:30,260
It runs on SMPs like those that are
in everybody's laptops now.
51
00:02:30,260 --> 00:02:33,290
There's been several interesting
applications
52
00:02:33,290 --> 00:02:37,930
written in Cilk including virus
shell assembly, graphics
53
00:02:37,930 --> 00:02:40,050
rendering, n-body simulation.
54
00:02:40,050 --> 00:02:43,180
We did a bunch of chess programs
because they were
55
00:02:43,180 --> 00:02:48,740
sort of the raison
d'être for Cilk.
56
00:02:48,740 --> 00:02:51,270
One of the features about Cilk
is that it automatically
57
00:02:51,270 --> 00:02:53,300
manages a lot of the
low-level issues.
58
00:02:53,300 --> 00:02:56,510
You don't have to do load
balancing, you don't have to
59
00:02:56,510 --> 00:02:59,210
write protocols.
60
00:02:59,210 --> 00:03:02,310
You basically write programs
that look a lot more like the
61
00:03:02,310 --> 00:03:06,250
ordinary C programs instead
of saying first I'm going to
62
00:03:06,250 --> 00:03:09,040
do this and then I'm going to
set this variable and then
63
00:03:09,040 --> 00:03:11,870
somebody else is going to read
that variable and that's a
64
00:03:11,870 --> 00:03:15,240
protocol and those are very
difficult to get right.
65
00:03:15,240 --> 00:03:17,720
AUDIENCE: [OBSCURED]
66
00:03:17,720 --> 00:03:19,460
BRADLEY KUSZMAUL: Yeah,
I'll mention that
67
00:03:19,460 --> 00:03:20,900
a little bit later.
68
00:03:20,900 --> 00:03:24,390
We had an award-winning
chess program.
69
00:03:24,390 --> 00:03:26,770
So to explain what Cilk's about
70
00:03:26,770 --> 00:03:28,390
I'll talk about Fibonacci.
71
00:03:28,390 --> 00:03:32,510
Now Fibonacci, this is just to
review in case you don't know
72
00:03:32,510 --> 00:03:34,400
C. You all know C right?
73
00:03:34,400 --> 00:03:39,080
So Fibonacci is the function
where each number is the sum of
74
00:03:39,080 --> 00:03:41,330
the previous two Fibonacci
numbers.
75
00:03:41,330 --> 00:03:46,340
And this is an implementation
that basically does that
76
00:03:46,340 --> 00:03:47,400
computation directly.
77
00:03:47,400 --> 00:03:50,410
The Fibonacci of n if n is
less than 2, it's just n.
78
00:03:50,410 --> 00:03:53,840
So Fibonacci of zero
is zero, 1 is 1.
79
00:03:53,840 --> 00:03:55,120
2, the Fibonacci's--
80
00:03:55,120 --> 00:03:58,380
well, then you have to do the
recursion, so you compute
81
00:03:58,380 --> 00:04:01,585
Fibonacci of n minus 1 and
Fibonacci of n minus 2 and sum
82
00:04:01,585 --> 00:04:04,500
them together and that's
Fibonacci of n.
83
00:04:04,500 --> 00:04:09,330
One observation about this
function is it's a really slow
84
00:04:09,330 --> 00:04:12,430
implementation of Fibonacci.
85
00:04:12,430 --> 00:04:13,610
You all know how to
do this faster?
86
00:04:13,610 --> 00:04:16,570
How fast can you do Fibonacci?
87
00:04:16,570 --> 00:04:20,703
You all know this. How
fast is this one?
88
00:04:20,703 --> 00:04:22,960
AUDIENCE: [OBSCURED].
89
00:04:22,960 --> 00:04:25,450
BRADLEY KUSZMAUL: So for those
of you who don't know--
90
00:04:25,450 --> 00:04:28,200
certainly know how to compute
Fibonacci in linear time just
91
00:04:28,200 --> 00:04:31,340
by keeping track of the
most recent two.
92
00:04:31,340 --> 00:04:34,230
1, 1, 2, 3, 5, you just do it.
93
00:04:34,230 --> 00:04:37,270
This is exponential time and
there's an algorithm that does
94
00:04:37,270 --> 00:04:38,350
it in logarithmic time.
95
00:04:38,350 --> 00:04:44,380
So this implementation is
doubly exponentially bad.
96
00:04:44,380 --> 00:04:46,760
But it's good as a didactic
example because it's easy to
97
00:04:46,760 --> 00:04:48,020
understand.
98
00:04:48,020 --> 00:04:50,540
So to turn this into Cilk we
just add some keywords and
99
00:04:50,540 --> 00:04:53,730
I'll talk about what the
keywords are in a minute, but the
100
00:04:53,730 --> 00:04:56,490
key thing to understand about
this is if you delete the
101
00:04:56,490 --> 00:05:00,860
keywords you have a C program and
Cilk programs have the
102
00:05:00,860 --> 00:05:06,070
property that one of the legal
semantics for the Cilk program
103
00:05:06,070 --> 00:05:10,210
is the C program that you get
by deleting the keywords.
104
00:05:10,210 --> 00:05:11,490
Now there are other possible
semantics
105
00:05:11,490 --> 00:05:13,020
you could get because--
106
00:05:13,020 --> 00:05:15,390
not for this function, this
function always produces the
107
00:05:15,390 --> 00:05:18,450
same answer because there are
no race conditions in it.
108
00:05:18,450 --> 00:05:20,660
But for programs that have
races you may have other
109
00:05:20,660 --> 00:05:24,150
semantics that the system
could provide.
110
00:05:24,150 --> 00:05:27,480
And so this kind of a language
extension where you can sort
111
00:05:27,480 --> 00:05:32,590
of delete the extensions and get
a correct implementation
112
00:05:32,590 --> 00:05:36,150
of the parallel program is
called a faithful extension.
113
00:05:36,150 --> 00:05:40,380
A lot of languages like OpenMP
have the property that if you
114
00:05:40,380 --> 00:05:42,360
add these directives and then
delete them, it will
115
00:05:42,360 --> 00:05:44,990
change the semantics of your
program and so you have to be
116
00:05:44,990 --> 00:05:46,160
very careful.
117
00:05:46,160 --> 00:05:48,820
Now if you're careful about
programming OpenMP you can
118
00:05:48,820 --> 00:05:51,990
make it so that it's faithful,
that it has this property.
119
00:05:51,990 --> 00:05:57,600
But that's not always
the case.
120
00:05:57,600 --> 00:05:58,896
Sure.
121
00:05:58,896 --> 00:06:01,166
AUDIENCE: Is it built
on the different--
122
00:06:01,166 --> 00:06:04,060
123
00:06:04,060 --> 00:06:06,310
BRADLEY KUSZMAUL: C 77.
124
00:06:06,310 --> 00:06:06,940
No, C 89.
125
00:06:06,940 --> 00:06:08,934
AUDIENCE: OK, so there's
no presumption
126
00:06:08,934 --> 00:06:12,346
about any aliasing involved?
127
00:06:12,346 --> 00:06:15,570
It's assumed that
the [OBSCURED].
128
00:06:15,570 --> 00:06:17,060
BRADLEY KUSZMAUL: So the
issue of restricted
129
00:06:17,060 --> 00:06:18,095
pointers, for example?
130
00:06:18,095 --> 00:06:20,820
AUDIENCE: Restricted pointers.
131
00:06:20,820 --> 00:06:21,860
BRADLEY KUSZMAUL: So Cilk
turns out to work
132
00:06:21,860 --> 00:06:23,535
with C 99 as well.
133
00:06:23,535 --> 00:06:26,862
AUDIENCE: But is the presumption
though for a
134
00:06:26,862 --> 00:06:29,450
pointer that it could alias?
135
00:06:29,450 --> 00:06:31,020
BRADLEY KUSZMAUL: The Cilk
compiler makes no assumptions
136
00:06:31,020 --> 00:06:32,680
about that.
137
00:06:32,680 --> 00:06:35,310
If you write a program
and the back end--
138
00:06:35,310 --> 00:06:40,040
Cilk works and I'll talk about
this in a couple minutes.
139
00:06:40,040 --> 00:06:42,080
Cilk works by transforming
this into a C
140
00:06:42,080 --> 00:06:44,940
program that has--
141
00:06:44,940 --> 00:06:48,060
when you run it on one processor
it's just the
142
00:06:48,060 --> 00:06:50,370
original C program in effect.
143
00:06:50,370 --> 00:06:53,420
And so if you have a dialect
of C that has restricted
144
00:06:53,420 --> 00:06:55,950
pointers and a compiler that--
145
00:06:55,950 --> 00:06:57,460
PROFESSOR: You're taking
the assumptions that
146
00:06:57,460 --> 00:06:59,110
if you make a mistake--
147
00:06:59,110 --> 00:07:02,610
BRADLEY KUSZMAUL: If you make a
mistake the language doesn't
148
00:07:02,610 --> 00:07:04,940
stop you from making
the mistake.
149
00:07:04,940 --> 00:07:08,230
AUDIENCE: Well, but in C 89
there's not a mistake.
150
00:07:08,230 --> 00:07:10,068
There's no assumption about
aliasing, right?
151
00:07:10,068 --> 00:07:11,060
It could alias.
152
00:07:11,060 --> 00:07:14,250
So if I said--
153
00:07:14,250 --> 00:07:16,232
BRADLEY KUSZMAUL: Because of
the aliasing you write a
154
00:07:16,232 --> 00:07:19,276
program that has a race
condition in it, which is
155
00:07:19,276 --> 00:07:19,450
erroneous--
156
00:07:19,450 --> 00:07:22,930
AUDIENCE: It would be valid?
157
00:07:22,930 --> 00:07:23,360
BRADLEY KUSZMAUL: No,
it'd still be valid.
158
00:07:23,360 --> 00:07:25,120
It would just have a race in
it and you would have a
159
00:07:25,120 --> 00:07:26,910
non-determinate result.
160
00:07:26,910 --> 00:07:27,900
PROFESSOR: It may not
do what you want.
161
00:07:27,900 --> 00:07:29,830
BRADLEY KUSZMAUL: It may not do
what you want, but one of
162
00:07:29,830 --> 00:07:32,310
the legal executions of that
parallel program is the
163
00:07:32,310 --> 00:07:34,060
original C program.
164
00:07:34,060 --> 00:07:37,300
AUDIENCE: So there's no extra.
165
00:07:37,300 --> 00:07:39,270
BRADLEY KUSZMAUL: At the sort
of level of doing analysis,
166
00:07:39,270 --> 00:07:40,840
Cilk doesn't do analysis.
167
00:07:40,840 --> 00:07:46,530
Cilk is a compiler that compiles
this language and the
168
00:07:46,530 --> 00:07:49,610
semantics are what they are,
which is, you know, the spawn--
169
00:07:49,610 --> 00:07:50,450
and I'll talk about
the semantics.
170
00:07:50,450 --> 00:07:52,910
The spawn means you can run the
function in parallel and
171
00:07:52,910 --> 00:07:56,310
if that doesn't give you the
same answer every time it's
172
00:07:56,310 --> 00:07:58,350
not the compiler's fault.
173
00:07:58,350 --> 00:08:00,890
AUDIENCE: [OBSCURED]
174
00:08:00,890 --> 00:08:01,752
BRADLEY KUSZMAUL: Pardon?
175
00:08:01,752 --> 00:08:04,585
AUDIENCE: There has to be some
guarantee [OBSCURED].
176
00:08:04,585 --> 00:08:07,990
[OBSCURED]
177
00:08:07,990 --> 00:08:09,890
PROFESSOR: How in a race
condition you get some
178
00:08:09,890 --> 00:08:12,650
[OBSCURED].
179
00:08:12,650 --> 00:08:14,240
BRADLEY KUSZMAUL: One of the
legal things the Cilk system
180
00:08:14,240 --> 00:08:17,480
could do is just run this,
run that program.
181
00:08:17,480 --> 00:08:20,410
Now if you're running it on
multiple processors that's not
182
00:08:20,410 --> 00:08:22,430
what happens because the other
thing is there's some
183
00:08:22,430 --> 00:08:23,730
performance guarantees we get.
184
00:08:23,730 --> 00:08:24,970
So there's actually
parallelism.
185
00:08:24,970 --> 00:08:28,130
But on one processor in fact,
that's exactly what the
186
00:08:28,130 --> 00:08:30,860
execution does.
187
00:08:30,860 --> 00:08:34,120
So Cilk does dynamic
multithreading and this is
188
00:08:34,120 --> 00:08:37,660
different from pthreads, for
example where you have this
189
00:08:37,660 --> 00:08:41,330
very heavyweight thread that
costs tens of thousands of
190
00:08:41,330 --> 00:08:43,010
instructions to create.
191
00:08:43,010 --> 00:08:46,720
Cilk threads are really small,
so in this program there's a
192
00:08:46,720 --> 00:08:49,350
Cilk thread that runs basically
from when the fib
193
00:08:49,350 --> 00:08:51,990
starts to here and then--
194
00:08:51,990 --> 00:08:57,480
195
00:08:57,480 --> 00:08:59,920
I feel like there's a missing
slide in here.
196
00:08:59,920 --> 00:09:01,170
I didn't tell you about spawn.
197
00:09:01,170 --> 00:09:04,400
198
00:09:04,400 --> 00:09:07,820
OK, well let me tell you about
spawn because what the spawn
199
00:09:07,820 --> 00:09:12,940
means is that this function
can run in parallel.
200
00:09:12,940 --> 00:09:13,920
That's very simple.
201
00:09:13,920 --> 00:09:18,430
What the sync means is that all
the functions that were
202
00:09:18,430 --> 00:09:23,540
spawned off in this function all
have to finish before this
203
00:09:23,540 --> 00:09:25,040
function can proceed.
204
00:09:25,040 --> 00:09:29,170
So in a normal execution of C,
when you call a function the
205
00:09:29,170 --> 00:09:31,060
parent stops.
206
00:09:31,060 --> 00:09:33,910
In Cilk the parent can keep
running, so while that's
207
00:09:33,910 --> 00:09:36,580
running the parent-- this can
spawn off this and then the
208
00:09:36,580 --> 00:09:39,410
sync happens and now the
parent has to stop.
209
00:09:39,410 --> 00:09:42,140
And this keyword basically just
says that this function
210
00:09:42,140 --> 00:09:44,590
can be spawned.
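The keyworded version described here, reconstructed as a sketch (it needs the MIT Cilk compiler; stripping `cilk`, `spawn`, and `sync` recovers the serial C program):

```cilk
cilk int fib(int n)
{
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib(n - 1);  /* child may run in parallel with parent */
        y = spawn fib(n - 2);
        sync;                  /* wait for both spawned children */
        return x + y;
    }
}
```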
211
00:09:44,590 --> 00:09:46,510
AUDIENCE: Is the sync in that
scope or the children scope?
212
00:09:46,510 --> 00:09:49,960
213
00:09:49,960 --> 00:09:52,710
BRADLEY KUSZMAUL: The sync is
scoped within the function.
214
00:09:52,710 --> 00:09:54,883
So you could have a for loop
that spawned off a
215
00:09:54,883 --> 00:09:55,640
whole bunch of stuff.
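A sketch of that pattern (MIT Cilk syntax; `work` is a made-up function name): one `sync` waits for every child spawned so far in this function.

```cilk
cilk void work(int i)
{
    /* ... some independent computation on element i ... */
}

cilk void spawn_many(int n)
{
    int i;
    for (i = 0; i < n; i++)
        spawn work(i);   /* each child may run in parallel */
    sync;                /* waits for all the children spawned above */
}
```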
216
00:09:55,640 --> 00:09:58,220
AUDIENCE: You could call the
function instead of moving
217
00:09:58,220 --> 00:10:00,180
some spawns, but
then [OBSCURED]
218
00:10:00,180 --> 00:10:00,620
in the sync.
219
00:10:00,620 --> 00:10:01,830
BRADLEY KUSZMAUL: There's an
explicit sync at the end of
220
00:10:01,830 --> 00:10:03,640
every function.
221
00:10:03,640 --> 00:10:06,660
So Cilk functions are strict.
222
00:10:06,660 --> 00:10:12,680
PROFESSOR: [NOISE]
223
00:10:12,680 --> 00:10:14,940
BRADLEY KUSZMAUL: You know,
there's children down inside
224
00:10:14,940 --> 00:10:17,420
here, but this function can't
return-- well, if I had
225
00:10:17,420 --> 00:10:22,000
omitted the sync down in
some leaf, the compiler puts
226
00:10:22,000 --> 00:10:25,970
one in before the function
returns.
227
00:10:25,970 --> 00:10:28,620
There's some languages that are
like this where somehow
228
00:10:28,620 --> 00:10:32,030
the intermediate function can go
away and then you can sync
229
00:10:32,030 --> 00:10:35,490
directly with your
grandparent.
230
00:10:35,490 --> 00:10:36,740
AUDIENCE: Otherwise
it would stop.
231
00:10:36,740 --> 00:10:38,710
232
00:10:38,710 --> 00:10:40,303
BRADLEY KUSZMAUL: So this gives
you this dag, so you
233
00:10:40,303 --> 00:10:42,100
have this part of the program
that runs up to the first
234
00:10:42,100 --> 00:10:44,900
spawn and then part of the
program that runs between the
235
00:10:44,900 --> 00:10:48,460
spawns and the part of the
program that runs after--
236
00:10:48,460 --> 00:10:51,760
well, after the last spawn to
the sync and then from there
237
00:10:51,760 --> 00:10:53,080
to the return.
238
00:10:53,080 --> 00:10:56,690
So I've got this drawing
that shows this
239
00:10:56,690 --> 00:10:58,130
function sort of running.
240
00:10:58,130 --> 00:11:00,650
So first the purple code runs
until it gets to the spawn, it
241
00:11:00,650 --> 00:11:03,850
spawns off this guy, but now the
second piece of code can
242
00:11:03,850 --> 00:11:05,140
start running.
243
00:11:05,140 --> 00:11:08,510
He does a spawn, so these two
are running in parallel.
244
00:11:08,510 --> 00:11:08,880
Meanwhile.
245
00:11:08,880 --> 00:11:10,200
This guy started that pff.
246
00:11:10,200 --> 00:11:13,460
247
00:11:13,460 --> 00:11:18,480
This is a base case, so he's
not going to do anything.
248
00:11:18,480 --> 00:11:19,920
Just feels like there's
something
249
00:11:19,920 --> 00:11:21,530
missing in this slide.
250
00:11:21,530 --> 00:11:23,670
Oh well.
251
00:11:23,670 --> 00:11:26,780
Essentially now this
guy couldn't run--
252
00:11:26,780 --> 00:11:28,260
going back to here.
253
00:11:28,260 --> 00:11:30,500
This part of the code couldn't
run until after the sync, so this
254
00:11:30,500 --> 00:11:32,410
thing's sitting here waiting.
255
00:11:32,410 --> 00:11:37,780
So when these guys finally
return then this can run.
256
00:11:37,780 --> 00:11:39,270
This guy's getting stuck here.
257
00:11:39,270 --> 00:11:41,320
He runs and he runs.
258
00:11:41,320 --> 00:11:43,970
These two return and the
value comes up here.
259
00:11:43,970 --> 00:11:47,050
And now basically the
function is done.
260
00:11:47,050 --> 00:11:49,820
261
00:11:49,820 --> 00:11:52,500
One observation here is that
there's no mention of the
262
00:11:52,500 --> 00:11:55,470
number of processors
in this code.
263
00:11:55,470 --> 00:11:58,760
You haven't specified how
to schedule or how many
264
00:11:58,760 --> 00:11:59,800
processors.
265
00:11:59,800 --> 00:12:02,580
All you've specified is this
directed acyclic graph that
266
00:12:02,580 --> 00:12:06,550
unfolds dynamically and it's
up to us to schedule those
267
00:12:06,550 --> 00:12:07,830
onto the processors.
268
00:12:07,830 --> 00:12:09,850
So this code is processor
oblivious.
269
00:12:09,850 --> 00:12:12,481
It's oblivious to the number
of processors.
270
00:12:12,481 --> 00:12:14,940
PROFESSOR: But because we're
using the language we
271
00:12:14,940 --> 00:12:17,890
probably have to create, write
as many spawns depending on--
272
00:12:17,890 --> 00:12:19,570
BRADLEY KUSZMAUL: No, what you
do is you write as many spawns
273
00:12:19,570 --> 00:12:20,866
as you can.
274
00:12:20,866 --> 00:12:23,800
You expose all the parallelism
in your code.
275
00:12:23,800 --> 00:12:27,980
So you want this dag to have
millions of threads in it
276
00:12:27,980 --> 00:12:29,520
concurrently.
277
00:12:29,520 --> 00:12:32,120
And then it's up to us to
schedule that efficiently.
278
00:12:32,120 --> 00:12:36,120
So it's a different mindset
then, I have 4 processors, let
279
00:12:36,120 --> 00:12:37,440
me create 4 things to do.
280
00:12:37,440 --> 00:12:40,250
I have 4 processors, let me
create a million things to do.
281
00:12:40,250 --> 00:12:43,350
And then the Cilk scheduler
guarantees to give you-- you
282
00:12:43,350 --> 00:12:45,480
have 4 processors, I'll give
you 4-fold speedup.
283
00:12:45,480 --> 00:12:48,555
PROFESSOR: I guess what you'd
like to avoid is the mindset
284
00:12:48,555 --> 00:12:51,430
where the programmer has to keep
changing and
285
00:12:51,430 --> 00:12:52,680
tuning the parameters
for the performance.
286
00:12:52,680 --> 00:12:55,260
287
00:12:55,260 --> 00:12:58,215
BRADLEY KUSZMAUL: There's some
tuning that you do in order to
288
00:12:58,215 --> 00:12:59,890
make the leaf code efficient.
289
00:12:59,890 --> 00:13:02,120
There's some overhead for
doing function calls.
290
00:13:02,120 --> 00:13:06,140
So it's small overhead.
291
00:13:06,140 --> 00:13:07,210
It turns out the cost
of the spawn is like
292
00:13:07,210 --> 00:13:10,670
three function calls.
293
00:13:10,670 --> 00:13:13,760
If you were actually trying to
make this code run faster you
294
00:13:13,760 --> 00:13:18,815
make the base case bigger and do
something, trying to speed
295
00:13:18,815 --> 00:13:21,970
things up a little bit with
the leaves of this call.
296
00:13:21,970 --> 00:13:24,140
So there's this call tree
and inside the call
297
00:13:24,140 --> 00:13:28,140
tree is this dag.
298
00:13:28,140 --> 00:13:31,740
So it supports C's rule
for pointers.
299
00:13:31,740 --> 00:13:37,550
For whatever dialect you have.
If you have a pointer to a
300
00:13:37,550 --> 00:13:41,250
stack and then you have a
pointer to the stack and then
301
00:13:41,250 --> 00:13:45,120
you call, you're allowed to use
that pointer in C. So in
302
00:13:45,120 --> 00:13:46,760
Cilk you are as well.
303
00:13:46,760 --> 00:13:49,820
If you have a parallel thing
going on where normally in C
304
00:13:49,820 --> 00:13:53,780
you would call A, then B
returns, then C and D. So C
305
00:13:53,780 --> 00:13:58,030
and D can refer to anything on
A, but C can't legally refer
306
00:13:58,030 --> 00:14:02,000
to something on B and the same
rule applies to Cilk.
307
00:14:02,000 --> 00:14:04,550
So we have a data structure that
implements this cactus
308
00:14:04,550 --> 00:14:10,062
stack is what it's called, after
the saguaro cactus--
309
00:14:10,062 --> 00:14:17,360
that's sort of the imagery
there-- and it lets you
310
00:14:17,360 --> 00:14:19,690
support that rule.
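A sketch of the rule in Cilk syntax (`child`, `a`, and `x` are made-up names): a spawned child may follow pointers into an ancestor's frame, but siblings cannot see each other's frames.

```cilk
cilk void child(int *p)
{
    *p = 42;           /* legal: p points into an ancestor's frame */
}

cilk void a(void)
{
    int x = 0;         /* allocated in a's cactus-stack frame */
    spawn child(&x);   /* the child may dereference &x ... */
    sync;              /* ... which stays live until a returns */
}
```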
311
00:14:19,690 --> 00:14:23,630
There's some advanced features
in Cilk that have to do with
312
00:14:23,630 --> 00:14:29,150
speculative execution and I'm
going to skip over those today
313
00:14:29,150 --> 00:14:32,000
because it turns out that sort
of 99% of the time you don't
314
00:14:32,000 --> 00:14:33,250
need this stuff.
315
00:14:33,250 --> 00:14:35,720
316
00:14:35,720 --> 00:14:39,740
We have some debugger support,
so if you've written code that
317
00:14:39,740 --> 00:14:45,300
relied on some semantics that
maybe you didn't like when you
318
00:14:45,300 --> 00:14:48,460
went to the parallel world,
you'd like to find out.
319
00:14:48,460 --> 00:14:51,560
This is a tool that basically
takes a Cilk program and an
320
00:14:51,560 --> 00:14:56,640
input data set and it runs and
it tells you is there any
321
00:14:56,640 --> 00:14:59,780
schedule that I could have
chosen-- so it's that directed
322
00:14:59,780 --> 00:15:00,470
acyclic graph.
323
00:15:00,470 --> 00:15:02,230
So there's a whole bunch of
possible schedules I could
324
00:15:02,230 --> 00:15:03,270
have chosen.
325
00:15:03,270 --> 00:15:07,030
Is there any schedule that
changes the order of two
326
00:15:07,030 --> 00:15:12,060
concurrent memory operations
where one of them is a write?
327
00:15:12,060 --> 00:15:14,000
So we call this the
Nondeterminator because it
328
00:15:14,000 --> 00:15:17,750
finds all the determinacy
races in your program.
329
00:15:17,750 --> 00:15:22,080
And Cilk guarantees-- the Cilk
race detector is guaranteed to
330
00:15:22,080 --> 00:15:23,150
find those.
331
00:15:23,150 --> 00:15:25,470
There's a lot of race detectors
where if the race
332
00:15:25,470 --> 00:15:27,300
doesn't actually occur you
have two things that are
333
00:15:27,300 --> 00:15:30,820
logically in parallel, but if
they don't actually run on
334
00:15:30,820 --> 00:15:34,910
different processors a lot of
race detectors out there in
335
00:15:34,910 --> 00:15:36,190
the world won't report
the race.
336
00:15:36,190 --> 00:15:38,560
So you get false negatives and
there's a bunch of false
337
00:15:38,560 --> 00:15:39,530
positives that show up.
338
00:15:39,530 --> 00:15:41,753
This basically only gives
you the real ones.
339
00:15:41,753 --> 00:15:46,800
AUDIENCE: That might be
indicators there might be
340
00:15:46,800 --> 00:15:49,220
still a data race.
341
00:15:49,220 --> 00:15:51,110
BRADLEY KUSZMAUL: So this
doesn't analyze the program.
342
00:15:51,110 --> 00:15:52,890
It analyzes the execution.
343
00:15:52,890 --> 00:15:56,040
So it's not trying to solve some
NP-complete problem or
344
00:15:56,040 --> 00:15:58,560
Turing complete problem.
345
00:15:58,560 --> 00:16:01,810
And so this reduces the problem
of finding data races
346
00:16:01,810 --> 00:16:05,770
to the situation that's just
like when you're trying to do
347
00:16:05,770 --> 00:16:08,540
code release and quality control
for serial programs.
348
00:16:08,540 --> 00:16:10,030
You write tests.
349
00:16:10,030 --> 00:16:12,720
If you don't test your program
you don't know what it does
350
00:16:12,720 --> 00:16:15,280
and that's the same
property here.
351
00:16:15,280 --> 00:16:18,430
If you do find some race someday
later then you can
352
00:16:18,430 --> 00:16:21,660
write a test for it and know
that you're testing to make
353
00:16:21,660 --> 00:16:23,860
sure that race didn't creep
back into your code.
354
00:16:23,860 --> 00:16:27,025
That's what you want out of a
software release strategy.
355
00:16:27,025 --> 00:16:36,900
AUDIENCE: [NOISE]
356
00:16:36,900 --> 00:16:40,170
BRADLEY KUSZMAUL: If you start
putting in syncs, then maybe the
357
00:16:40,170 --> 00:16:41,480
race goes away because
of that.
358
00:16:41,480 --> 00:16:44,950
But if you just put in
instrumentation to try to
359
00:16:44,950 --> 00:16:47,380
figure out what's going on,
it's still there.
360
00:16:47,380 --> 00:16:50,650
And the race detector sort of
says, this variable in this
361
00:16:50,650 --> 00:16:53,312
function, this variable in this
function, you look at it
362
00:16:53,312 --> 00:16:54,330
and say, how could
that happen?
363
00:16:54,330 --> 00:16:56,460
And finally you figured out and
you fix it and then you
364
00:16:56,460 --> 00:16:59,290
put it-- if you're trying to do
software release you build
365
00:16:59,290 --> 00:17:03,610
a regression test that will
verify it with that input.
366
00:17:03,610 --> 00:17:08,021
AUDIENCE: What if you have a
situation where the spawn
367
00:17:08,021 --> 00:17:09,964
graph falls into a terminal.
368
00:17:09,964 --> 00:17:14,670
So it's not a race, but
monitoring spawn is there but
369
00:17:14,670 --> 00:17:15,920
it spawns a graph a
little bit deeper.
370
00:17:15,920 --> 00:17:21,060
371
00:17:21,060 --> 00:17:23,580
BRADLEY KUSZMAUL: Yes.
372
00:17:23,580 --> 00:17:25,420
For example, our race detector
understands locks.
373
00:17:25,420 --> 00:17:29,390
So part of the rule is it
doesn't report a race if the
374
00:17:29,390 --> 00:17:30,910
two memory accesses--
375
00:17:30,910 --> 00:17:34,610
if there was a lock that they
both held in common.
376
00:17:34,610 --> 00:17:37,010
Now you can write buggy
programs because you can
377
00:17:37,010 --> 00:17:39,890
essentially do this: lock,
you know, read the
378
00:17:39,890 --> 00:17:42,390
memory, unlock, lock,
write the memory.
379
00:17:42,390 --> 00:17:44,330
Now the interleave happens
and there's a race.
380
00:17:44,330 --> 00:17:47,330
So the assumption of this race
detector is that if you put
381
00:17:47,330 --> 00:17:49,600
locks in there that you've
sort of thought about.
382
00:17:49,600 --> 00:17:53,560
This is finding races that you
forgot about rather than races
383
00:17:53,560 --> 00:17:55,150
that you presumably
thought about.
384
00:17:55,150 --> 00:17:57,530
There are some races that
are actually correct.
385
00:17:57,530 --> 00:17:59,880
For example, in the chess
programs there's this big
386
00:17:59,880 --> 00:18:01,910
table that remembers all
the chess positions
387
00:18:01,910 --> 00:18:03,290
that have been seen.
388
00:18:03,290 --> 00:18:05,360
And if you don't get the right
answer out of the table it
389
00:18:05,360 --> 00:18:07,740
doesn't matter because you
search it again anyway.
390
00:18:07,740 --> 00:18:08,880
Not getting the right
answer means you
391
00:18:08,880 --> 00:18:09,700
don't get any answer.
392
00:18:09,700 --> 00:18:12,420
You look something up and it's
not there so you search again.
393
00:18:12,420 --> 00:18:14,790
If you just waited a little
longer maybe somebody else
394
00:18:14,790 --> 00:18:15,940
would have put the value
in, you could have
395
00:18:15,940 --> 00:18:17,470
saved a little work.
396
00:18:17,470 --> 00:18:20,310
And so in that case, well it
turns out there's no
397
00:18:20,310 --> 00:18:21,570
parallel way to do that.
398
00:18:21,570 --> 00:18:25,290
So I'm willing to tolerate that
race because that gives
399
00:18:25,290 --> 00:18:28,940
me performance and so you have
what we call fake locks, which
400
00:18:28,940 --> 00:18:33,280
are basically things that look
like lock calls, but they
401
00:18:33,280 --> 00:18:36,200
don't do anything except tell
the race detector, pretend
402
00:18:36,200 --> 00:18:37,940
there was a lock
held in common.
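The idiom might look like this (a hedged sketch; the identifier names, `table`, and `TABLE_SIZE` are hypothetical, not necessarily the race detector's actual API):

```cilk
Cilk_lockvar fake_lock;       /* never contended; purely an annotation */

cilk int lookup(long key)
{
    int v;
    Cilk_fake_lock(&fake_lock);    /* tell the race detector these  */
    v = table[key % TABLE_SIZE];   /* accesses share a "lock", so   */
    Cilk_fake_unlock(&fake_lock);  /* the benign race isn't flagged */
    return v;                      /* table/TABLE_SIZE: hypothetical */
}
```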
403
00:18:37,940 --> 00:18:38,707
Yeah?
404
00:18:38,707 --> 00:18:39,957
AUDIENCE:
[UNINTELLIGIBLE PHRASE]
405
00:18:39,957 --> 00:18:52,080
406
00:18:52,080 --> 00:18:53,780
BRADLEY KUSZMAUL: If it says
there's no race it means that
407
00:18:53,780 --> 00:18:57,745
for every possible
scheduling that--
408
00:18:57,745 --> 00:18:59,290
AUDIENCE:
[UNINTELLIGIBLE PHRASE].
409
00:18:59,290 --> 00:19:00,880
BRADLEY KUSZMAUL: Well,
you have that dag.
410
00:19:00,880 --> 00:19:02,850
And imagine running it
on one processor.
411
00:19:02,850 --> 00:19:04,300
There's a lot of possible
orders in
412
00:19:04,300 --> 00:19:05,910
which to run the dag.
413
00:19:05,910 --> 00:19:10,350
And the rule is well, was there
a load and a store or a
414
00:19:10,350 --> 00:19:14,100
store and a store that switched
orders in some possible
415
00:19:14,100 --> 00:19:16,511
schedule and that's
the definition.
416
00:19:16,511 --> 00:19:30,090
AUDIENCE: So, in practice,
sorry, one of the [INAUDIBLE]
417
00:19:30,090 --> 00:19:31,573
techniques is locks.
418
00:19:31,573 --> 00:19:33,798
Assuming, dependent on the
processor, that you have
419
00:19:33,798 --> 00:19:37,012
atomic writes, we want to
deal with that data
420
00:19:37,012 --> 00:19:38,990
[UNINTELLIGIBLE] in
the background --
421
00:19:38,990 --> 00:19:40,140
BRADLEY KUSZMAUL: Those
protocols are really hard to
422
00:19:40,140 --> 00:19:42,755
get right, but yes, it's
an important trick.
423
00:19:42,755 --> 00:19:44,430
AUDIENCE: Certainly
[INAUDIBLE].
424
00:19:44,430 --> 00:19:45,440
BRADLEY KUSZMAUL: So to convince
the race detector not
425
00:19:45,440 --> 00:19:47,160
to complain you put fake
locks around it.
426
00:19:47,160 --> 00:19:51,410
427
00:19:51,410 --> 00:19:53,840
If you've programmed a
sophisticated algorithm, it's
428
00:19:53,840 --> 00:19:56,090
up to you to get the
details right.
429
00:19:56,090 --> 00:19:59,150
430
00:19:59,150 --> 00:20:01,190
The other property about this
race detector is that it's
431
00:20:01,190 --> 00:20:03,220
fast. It runs almost
in linear time.
432
00:20:03,220 --> 00:20:05,700
A lot of the race detectors that
you find out there run in
433
00:20:05,700 --> 00:20:07,210
quadratic time.
434
00:20:07,210 --> 00:20:09,550
So if you want to run a million
instructions it has to
435
00:20:09,550 --> 00:20:12,380
compare every instruction to
every other instruction.
436
00:20:12,380 --> 00:20:13,690
Turns out we don't
have to do that.
437
00:20:13,690 --> 00:20:17,220
We run in time, which is n times
alpha of n where alpha's
438
00:20:17,220 --> 00:20:18,440
the inverse Ackermann
function.
439
00:20:18,440 --> 00:20:23,580
Anybody remember that from
the union-find algorithm?
440
00:20:23,580 --> 00:20:27,210
It's got that in it, so it's
like almost linear time.
441
00:20:27,210 --> 00:20:32,980
We actually now have a
linear-time one that has performance
442
00:20:32,980 --> 00:20:34,990
advantages.
443
00:20:34,990 --> 00:20:40,130
So let me do a little
theory and practice.
444
00:20:40,130 --> 00:20:42,850
In Cilk we have some fundamental
complexity
445
00:20:42,850 --> 00:20:44,160
measures that we worry about.
446
00:20:44,160 --> 00:20:47,070
So we're interested in knowing
and being able to predict the
447
00:20:47,070 --> 00:20:50,440
runtime of a Cilk program
on P processors.
448
00:20:50,440 --> 00:20:53,580
So we want to know T sub p,
which is the execution time on
449
00:20:53,580 --> 00:20:54,350
P processors.
450
00:20:54,350 --> 00:20:55,660
That's the goal.
451
00:20:55,660 --> 00:20:59,430
What we've got to work with is
some directed acyclic graph
452
00:20:59,430 --> 00:21:02,260
that is, for a particular input
set--if the program is
453
00:21:02,260 --> 00:21:05,400
deterministic and everything
else, it's a well-defined graph
454
00:21:05,400 --> 00:21:08,470
and we can come up with some
basic measures of this graph.
455
00:21:08,470 --> 00:21:11,950
So T sub 1 is the work of the
graph, which is the total time
456
00:21:11,950 --> 00:21:15,080
it would take to run that
graph on one processor.
457
00:21:15,080 --> 00:21:17,570
Or if you assume that these
things are all cost unit
458
00:21:17,570 --> 00:21:19,510
times, just the number
of nodes.
459
00:21:19,510 --> 00:21:21,940
So for this graph
what's the work?
460
00:21:21,940 --> 00:21:31,780
461
00:21:31,780 --> 00:21:34,070
I heard -teen, but something--
462
00:21:34,070 --> 00:21:36,460
18?
463
00:21:36,460 --> 00:21:43,590
And the critical path
is the longest path.
464
00:21:43,590 --> 00:21:45,920
And if these nodes weren't unit
time you'd have to weight
465
00:21:45,920 --> 00:21:48,200
the things according
to actually how
466
00:21:48,200 --> 00:21:49,200
much time they run.
467
00:21:49,200 --> 00:21:53,280
So the critical path
here is what?
468
00:21:53,280 --> 00:21:54,770
9.
469
00:21:54,770 --> 00:21:57,700
So I think those are right.
470
00:21:57,700 --> 00:22:02,500
The lower bounds then that you
know is that you don't expect
471
00:22:02,500 --> 00:22:05,480
the runtime on P processors to be
faster than linear speedup.
472
00:22:05,480 --> 00:22:09,980
473
00:22:09,980 --> 00:22:13,460
In this model that
doesn't happen.
474
00:22:13,460 --> 00:22:17,430
475
00:22:17,430 --> 00:22:18,820
It turns out cache
does things.
476
00:22:18,820 --> 00:22:21,190
It's adding more than
just processors.
477
00:22:21,190 --> 00:22:23,060
You're adding more cache too.
478
00:22:23,060 --> 00:22:26,380
So all sorts of things or maybe
it means that there's a
479
00:22:26,380 --> 00:22:28,180
better algorithm you
should have used.
480
00:22:28,180 --> 00:22:30,120
So there's some funny things
that happen if you have bad
481
00:22:30,120 --> 00:22:31,180
algorithms and so forth.
482
00:22:31,180 --> 00:22:34,020
But in this model you can't have
more than linear speedup.
483
00:22:34,020 --> 00:22:36,880
You also can't get things done
faster than in linear time.
484
00:22:36,880 --> 00:22:39,660
This model assumes basically
that these costs of running
485
00:22:39,660 --> 00:22:45,420
these things are fixed and the
cache has the property that
486
00:22:45,420 --> 00:22:47,410
changing the order of execution
means that the
487
00:22:47,410 --> 00:22:52,600
actual costs of the nodes in
the graph change costs.
488
00:22:52,600 --> 00:22:56,530
So those are lower bounds and
the things that we want to
489
00:22:56,530 --> 00:23:00,410
know are speedups, so that's
T sub 1 over T sub p.
490
00:23:00,410 --> 00:23:02,600
And the parallelism of
the graph is T sub
491
00:23:02,600 --> 00:23:04,310
1 over T sub infinity.
492
00:23:04,310 --> 00:23:07,090
So the work over the critical
path and we've been calling
493
00:23:07,090 --> 00:23:09,740
this span sometimes lately.
494
00:23:09,740 --> 00:23:12,260
Some people call that depth.
495
00:23:12,260 --> 00:23:14,750
Span is easier to say than
critical path, depth has too
496
00:23:14,750 --> 00:23:17,390
many other meanings so
I kind of like span.
497
00:23:17,390 --> 00:23:19,570
So what's the parallelism
for this program?
498
00:23:19,570 --> 00:23:24,760
499
00:23:24,760 --> 00:23:29,880
18/9.
500
00:23:29,880 --> 00:23:33,920
We said that T sub 1 was what?
501
00:23:33,920 --> 00:23:35,120
18.
502
00:23:35,120 --> 00:23:37,770
The infinity is 9.
503
00:23:37,770 --> 00:23:40,290
So on average and if you had
an infinite number of
504
00:23:40,290 --> 00:23:44,730
processors and you scheduled
this as greedily as you could, it
505
00:23:44,730 --> 00:23:47,560
would take you 9 steps to run
and you would be doing 18
506
00:23:47,560 --> 00:23:48,350
things worth of work.
507
00:23:48,350 --> 00:23:51,140
So on average there's
two things to do.
508
00:23:51,140 --> 00:23:55,950
You know, 1 plus 1 plus 1 plus
3 plus 4 plus 4 plus 1 plus 1
509
00:23:55,950 --> 00:23:59,970
plus 1 divided by 9
turns out to be 2.
510
00:23:59,970 --> 00:24:02,950
So the average parallelism or
just the parallelism of the
511
00:24:02,950 --> 00:24:06,120
program is T sub 1 over
T sub infinity.
512
00:24:06,120 --> 00:24:08,580
And this property is something
that's not dependent on the
513
00:24:08,580 --> 00:24:12,390
scheduler, it's a property
of the program.
514
00:24:12,390 --> 00:24:15,490
Doesn't depend on how many
processors you have.
515
00:24:15,490 --> 00:24:16,438
AUDIENCE: [OBSCURED]
516
00:24:16,438 --> 00:24:21,740
You're saying, you're calling
that the span now?
517
00:24:21,740 --> 00:24:32,070
Is that the one for
us [OBSCURED]
518
00:24:32,070 --> 00:24:34,460
BRADLEY KUSZMAUL: That's
too long to say.
519
00:24:34,460 --> 00:24:37,240
I might as well say critical
path length.
520
00:24:37,240 --> 00:24:40,440
Critical path length, longest
trace-- span is a mathematical-
521
00:24:40,440 --> 00:24:41,690
sounding name.
522
00:24:41,690 --> 00:24:45,956
523
00:24:45,956 --> 00:24:48,020
AUDIENCE: We just like
to steal terminology.
524
00:24:48,020 --> 00:24:50,510
BRADLEY KUSZMAUL: Well, yeah.
525
00:24:50,510 --> 00:24:51,660
So there's a theorem due to--
526
00:24:51,660 --> 00:24:54,500
Graham and Brent said that
there's some schedule that can
527
00:24:54,500 --> 00:24:58,040
actually achieve the sum of
those two lower bounds.
528
00:24:58,040 --> 00:25:01,740
This linear speedup is one lower
bound of the runtime and
529
00:25:01,740 --> 00:25:04,470
the critical path
is the other.
530
00:25:04,470 --> 00:25:06,400
So there's some schedule that
basically achieves the sum of
531
00:25:06,400 --> 00:25:09,360
those and how does that
theorem work?
532
00:25:09,360 --> 00:25:12,350
Well, at each time
step either--
533
00:25:12,350 --> 00:25:14,400
suppose we had 3 processors.
534
00:25:14,400 --> 00:25:20,090
Either there's at least 3 things
ready to run and so
535
00:25:20,090 --> 00:25:22,460
what you do is you do
a greedy schedule.
536
00:25:22,460 --> 00:25:23,730
You grab any 3 of them.
537
00:25:23,730 --> 00:25:27,790
538
00:25:27,790 --> 00:25:30,120
If there's fewer than p things
to run, like here we have a
539
00:25:30,120 --> 00:25:34,680
situation where these
have all run.
540
00:25:34,680 --> 00:25:36,270
The green ones are
ready to go.
541
00:25:36,270 --> 00:25:38,600
Those are the only 2 that
are ready to go.
542
00:25:38,600 --> 00:25:40,640
So what do you do then
in a greedy schedule?
543
00:25:40,640 --> 00:25:42,170
You run them all.
544
00:25:42,170 --> 00:25:44,880
545
00:25:44,880 --> 00:25:50,730
And the argument goes, well, how
many time steps could you
546
00:25:50,730 --> 00:25:51,980
execute 3 things?
547
00:25:51,980 --> 00:25:55,800
548
00:25:55,800 --> 00:25:58,090
At most the work
divided by the number of
549
00:25:58,090 --> 00:26:00,625
processors times, because
then after that you've
550
00:26:00,625 --> 00:26:02,860
used up all the work.
551
00:26:02,860 --> 00:26:07,910
Well how many times could you
execute less than p things?
552
00:26:07,910 --> 00:26:09,940
Well, every time you execute
less than p things you're
553
00:26:09,940 --> 00:26:13,090
reducing the length of the
remaining critical path.
554
00:26:13,090 --> 00:26:17,130
You can't do that more
than the span times.
555
00:26:17,130 --> 00:26:21,040
And so a greedy scheduler will
achieve some runtime which is
556
00:26:21,040 --> 00:26:22,520
within the sum of these 2.
557
00:26:22,520 --> 00:26:25,990
It's actually the sum
of these 2 minus 1.
558
00:26:25,990 --> 00:26:28,940
It turns out that there has to
be at least one node that's on
559
00:26:28,940 --> 00:26:32,430
both work and critical path.
560
00:26:32,430 --> 00:26:34,400
And so that means that you're
guaranteed to be within a
561
00:26:34,400 --> 00:26:39,720
factor of 2 of optimal with
a greedy schedule.
562
00:26:39,720 --> 00:26:43,030
And it turns out that if you
have a lot of parallelism
563
00:26:43,030 --> 00:26:45,810
compared to the number of
processors, so if you have a
564
00:26:45,810 --> 00:26:49,330
graph that has a million fold
parallelism and a thousand
565
00:26:49,330 --> 00:26:54,890
processors-- well, if this
is really small compared to
566
00:26:54,890 --> 00:26:57,440
the work, if you have a graph
with a million fold
567
00:26:57,440 --> 00:27:00,410
parallelism that means the
critical path is small.
568
00:27:00,410 --> 00:27:02,190
If you only had 1000 processors
that means this
569
00:27:02,190 --> 00:27:05,220
term's big.
570
00:27:05,220 --> 00:27:08,420
And that means that this term is
very close to this term, so
571
00:27:08,420 --> 00:27:11,730
essentially the corollary to
this is that you get linear
572
00:27:11,730 --> 00:27:17,770
speedup, perfect linear speedup
asymptotically if you have
573
00:27:17,770 --> 00:27:22,060
fewer processors than you have
parallelism in your program.
574
00:27:22,060 --> 00:27:26,280
So the game here at this level
of understanding, I haven't
575
00:27:26,280 --> 00:27:28,500
told you how the scheduler
actually works-- is to write a
576
00:27:28,500 --> 00:27:30,720
program that's got a lot of
parallelism that you can get
577
00:27:30,720 --> 00:27:31,970
linear speedup.
578
00:27:31,970 --> 00:27:38,070
579
00:27:38,070 --> 00:27:40,390
Well, here's the work-stealing
scheduler we actually use.
580
00:27:40,390 --> 00:27:42,380
The problem is the greedy
schedulers can be hard to
581
00:27:42,380 --> 00:27:44,670
compute-- especially if you
imagine having a million
582
00:27:44,670 --> 00:27:48,090
processors in a program with
a billion fold parallelism.
583
00:27:48,090 --> 00:27:50,940
Finding on every clock cycle,
finding something for each of
584
00:27:50,940 --> 00:27:54,580
the million guys to do is
conceptually difficult, so
585
00:27:54,580 --> 00:27:57,520
instead we have a work-stealing
scheduler.
586
00:27:57,520 --> 00:27:59,140
I'll talk about that
in a second.
587
00:27:59,140 --> 00:28:02,970
It achieves bounds which are
not quite as good as those.
588
00:28:02,970 --> 00:28:04,140
This bound is the same.
589
00:28:04,140 --> 00:28:07,130
It's the sum of two terms. One
is the linear speedup term,
590
00:28:07,130 --> 00:28:09,740
but instead of it being T sub
infinity it's big O of T sub
591
00:28:09,740 --> 00:28:14,010
infinity because you actually
have to do communication
592
00:28:14,010 --> 00:28:17,530
sometimes if the critical
path length is long.
593
00:28:17,530 --> 00:28:18,860
Basically, you can
sort of imagine.
594
00:28:18,860 --> 00:28:21,930
If you have a lot of things to
do, a lot of tasks and people
595
00:28:21,930 --> 00:28:25,840
to do it, it's easy to do that
in parallel if there's no
596
00:28:25,840 --> 00:28:27,810
interdependencies
among the tasks.
597
00:28:27,810 --> 00:28:29,930
But as soon as there's
dependencies you end up having
598
00:28:29,930 --> 00:28:33,530
to coordinate a lot and that
communication costs--
599
00:28:33,530 --> 00:28:37,620
there's lots of lore about
adding programmers to a task
600
00:28:37,620 --> 00:28:41,590
and it slowing you down.
601
00:28:41,590 --> 00:28:45,380
Because basically communication
gets you.
602
00:28:45,380 --> 00:28:47,170
What we found empirically--
603
00:28:47,170 --> 00:28:48,680
there's a theorem for this--
604
00:28:48,680 --> 00:28:53,130
empirically the runtime is
actually still very close to
605
00:28:53,130 --> 00:28:56,960
the sum of those terms. Or maybe
it's those terms plus 2
606
00:28:56,960 --> 00:29:00,610
times T sub infinity or
something like that.
607
00:29:00,610 --> 00:29:03,190
And again, we basically get
near-perfect speedup as long
608
00:29:03,190 --> 00:29:05,450
as the number of processors
is a lot less than the
609
00:29:05,450 --> 00:29:06,250
parallelism.
610
00:29:06,250 --> 00:29:08,815
Should be sort of a less
than less than.
611
00:29:08,815 --> 00:29:12,320
612
00:29:12,320 --> 00:29:14,950
The compiler has a mode where
you can basically insert
613
00:29:14,950 --> 00:29:15,310
instrumentation.
614
00:29:15,310 --> 00:29:17,700
So you can run your program,
it'll tell you the critical
615
00:29:17,700 --> 00:29:18,200
path length.
616
00:29:18,200 --> 00:29:20,530
You can compute these numbers.
617
00:29:20,530 --> 00:29:23,220
Clear how to compute work, you
just sum up the runtime of all
618
00:29:23,220 --> 00:29:24,590
the threads.
619
00:29:24,590 --> 00:29:27,360
To compute the critical path
length, well you have to do
620
00:29:27,360 --> 00:29:31,540
some max's and stuff as you
go through the graph.
621
00:29:31,540 --> 00:29:36,000
And the average cost of a spawn
these days is about 3 on
622
00:29:36,000 --> 00:29:39,580
like a dual core pentium.
623
00:29:39,580 --> 00:29:42,270
Three times the cost
of a function call.
624
00:29:42,270 --> 00:29:45,080
And most of that cost actually
has to do with the memory
625
00:29:45,080 --> 00:29:48,700
barrier that we do at the spawn
because that machine
626
00:29:48,700 --> 00:29:50,080
doesn't have strong
consistency.
627
00:29:50,080 --> 00:29:52,360
So you have to put this memory
barrier in and that just
628
00:29:52,360 --> 00:29:53,610
empties all the pipelines.
629
00:29:53,610 --> 00:29:56,480
630
00:29:56,480 --> 00:30:00,100
It does better on like an SGI
machine, which has strong--
631
00:30:00,100 --> 00:30:01,480
well, traditional.
632
00:30:01,480 --> 00:30:04,520
A MIPS machine that has strong
consistency actually does
633
00:30:04,520 --> 00:30:08,440
better for the cost
of that overhead.
634
00:30:08,440 --> 00:30:10,490
Let me talk a little
bit about chess.
635
00:30:10,490 --> 00:30:16,410
And we had a bunch of chess
programs. I wrote one in 1994,
636
00:30:16,410 --> 00:30:19,650
which placed third at the
International Computer Chess
637
00:30:19,650 --> 00:30:21,940
Championship and that was
running on a big connection
638
00:30:21,940 --> 00:30:23,960
machine CM5.
639
00:30:23,960 --> 00:30:25,663
I was one of the architects
of that machine, so
640
00:30:25,663 --> 00:30:27,770
it was double fun.
641
00:30:27,770 --> 00:30:32,110
We wrote another program that
placed second in '95 and that
642
00:30:32,110 --> 00:30:34,860
was running on an 1800 node
Paragon and that was a big
643
00:30:34,860 --> 00:30:37,020
computer back then.
644
00:30:37,020 --> 00:30:40,520
We built another program called
Cilk chess, which
645
00:30:40,520 --> 00:30:46,440
placed first in '96 running on
a relatively smaller machine.
646
00:30:46,440 --> 00:30:52,810
And then on a larger SGI Origin
we ran some more and
647
00:30:52,810 --> 00:30:56,520
then at the World Computer Chess
Championship in 1999 we
648
00:30:56,520 --> 00:31:01,270
beat Deep Blue and
lost to a PC.
649
00:31:01,270 --> 00:31:04,420
650
00:31:04,420 --> 00:31:07,570
And people don't realize this,
but at the time that Deep Blue
651
00:31:07,570 --> 00:31:09,740
beat Kasparov it was not the
World Computer Chess
652
00:31:09,740 --> 00:31:13,290
Champion, a PC was.
653
00:31:13,290 --> 00:31:16,150
So what?
654
00:31:16,150 --> 00:31:17,400
It's running a program.
655
00:31:17,400 --> 00:31:20,930
656
00:31:20,930 --> 00:31:22,883
You know, there's this
head and a tape.
657
00:31:22,883 --> 00:31:26,830
658
00:31:26,830 --> 00:31:29,360
I don't know what it did.
659
00:31:29,360 --> 00:31:31,980
So this was a program called
Fritz, which is a commercially
660
00:31:31,980 --> 00:31:32,860
available program.
661
00:31:32,860 --> 00:31:38,450
And those guys, the PC guys,
were very
662
00:31:38,450 --> 00:31:42,290
good sort of on
the algorithm side.
663
00:31:42,290 --> 00:31:44,680
We got advantage
by brute force.
664
00:31:44,680 --> 00:31:47,630
And we also had some real chess
expertise on our team,
665
00:31:47,630 --> 00:31:51,060
but those guys were spending
full time on things like
666
00:31:51,060 --> 00:31:55,290
pruning away sub-searches that
they were convinced weren't
667
00:31:55,290 --> 00:31:55,930
going to pan out.
668
00:31:55,930 --> 00:31:58,340
Computer chess programs spend
most of their time looking at
669
00:31:58,340 --> 00:32:00,720
situations that any person would
look at and say, ah,
670
00:32:00,720 --> 00:32:01,420
black's won.
671
00:32:01,420 --> 00:32:02,560
Why are you even looking
at this?
672
00:32:02,560 --> 00:32:03,820
And it keeps searching.
673
00:32:03,820 --> 00:32:05,786
It's like, well maybe there's
a way to get the queen.
674
00:32:05,786 --> 00:32:10,480
675
00:32:10,480 --> 00:32:12,310
So computers are pretty
dumb at that.
676
00:32:12,310 --> 00:32:15,380
So basically these guys put a
lot more chess intelligence in
677
00:32:15,380 --> 00:32:19,270
and we also lost due to what--
in this particular game, we
678
00:32:19,270 --> 00:32:24,570
were tied for first place and we
decided to do a runoff game
679
00:32:24,570 --> 00:32:27,770
to find out who would win and
we lost due to a classic
680
00:32:27,770 --> 00:32:29,960
horizon effect.
681
00:32:29,960 --> 00:32:32,910
So it turns out that we were
searching to depth 12 in the
682
00:32:32,910 --> 00:32:35,280
tree and Fritz was searching
to depth 11.
683
00:32:35,280 --> 00:32:38,730
Even with all these heuristics
and stuff they had in it, they
684
00:32:38,730 --> 00:32:41,280
were still not searching
as deeply as we were.
685
00:32:41,280 --> 00:32:45,050
But there was a move that was a
good move that looked OK at
686
00:32:45,050 --> 00:32:49,290
depth 11 and looked bad at depth
12 and at depth 13 it
687
00:32:49,290 --> 00:32:50,540
looked really good again.
688
00:32:50,540 --> 00:32:53,000
689
00:32:53,000 --> 00:32:56,460
So they saw the move and made
it for the wrong reason, we
690
00:32:56,460 --> 00:32:59,130
saw the move and didn't make it
for the right reason, but
691
00:32:59,130 --> 00:33:02,120
it was wrong and the right
move-- if we'd been able to
692
00:33:02,120 --> 00:33:05,320
search a little deeper, we would
have seen that it was
693
00:33:05,320 --> 00:33:08,070
really the wrong thing to do.
694
00:33:08,070 --> 00:33:09,550
This happens all the
time in chess.
695
00:33:09,550 --> 00:33:10,760
There's a little randomness
in there.
696
00:33:10,760 --> 00:33:13,820
This horizon effect shows up
and again, it boils down to
697
00:33:13,820 --> 00:33:15,820
the programs are not
intelligent.
698
00:33:15,820 --> 00:33:18,550
A human would look at it and
say, eventually that knight's
699
00:33:18,550 --> 00:33:19,730
going to fall.
700
00:33:19,730 --> 00:33:24,070
But if the computer can't see
it with a search, you know?
701
00:33:24,070 --> 00:33:27,190
702
00:33:27,190 --> 00:33:31,280
We plotted the speedup of star
Socrates, which was the first
703
00:33:31,280 --> 00:33:33,330
one on this funny graph.
704
00:33:33,330 --> 00:33:35,980
So this looks sort of like a
typical linear speedup graph.
705
00:33:35,980 --> 00:33:38,200
Sort of when you're down here
with small numbers of processors
706
00:33:38,200 --> 00:33:40,660
you get good linear speedup
and eventually you stop
707
00:33:40,660 --> 00:33:41,730
getting linear speedup.
708
00:33:41,730 --> 00:33:43,410
That's sort of in broad
strokes what
709
00:33:43,410 --> 00:33:44,960
this graph looks like.
710
00:33:44,960 --> 00:33:46,510
But the axes are
kind of funny.
711
00:33:46,510 --> 00:33:50,210
The axes aren't the number of
processors and the speedup--
712
00:33:50,210 --> 00:33:54,250
it's the number processors
divided by the parallelism of
713
00:33:54,250 --> 00:33:55,440
the program.
714
00:33:55,440 --> 00:33:58,360
And here is the speedup divided
by the parallelism of
715
00:33:58,360 --> 00:33:59,400
the program.
716
00:33:59,400 --> 00:34:01,750
And the reason we did that is
that each of these data points
717
00:34:01,750 --> 00:34:05,540
is a different program with
different work and span.
718
00:34:05,540 --> 00:34:08,640
719
00:34:08,640 --> 00:34:10,720
If I'm trying to run a
particular problem on a bunch
720
00:34:10,720 --> 00:34:13,350
of different processors I can
just draw that curve and see
721
00:34:13,350 --> 00:34:15,950
what happens as get
more processors.
722
00:34:15,950 --> 00:34:19,250
723
00:34:19,250 --> 00:34:20,960
I'm not getting any advantage
because I've got too many
724
00:34:20,960 --> 00:34:21,600
processors.
725
00:34:21,600 --> 00:34:23,320
I've exceeded the parallelism
of the program.
726
00:34:23,320 --> 00:34:25,240
But if I'm running, trying
to compare two different
727
00:34:25,240 --> 00:34:27,170
programs, how do I do that?
728
00:34:27,170 --> 00:34:29,560
Well, you can do that by
normalizing by the
729
00:34:29,560 --> 00:34:30,760
parallelism.
730
00:34:30,760 --> 00:34:35,890
So down in this domain the
number of processors is small
731
00:34:35,890 --> 00:34:38,800
compared to the average
parallelism and we get good
732
00:34:38,800 --> 00:34:39,610
linear speedups.
733
00:34:39,610 --> 00:34:43,210
And up in this domain the
number of processors is large
734
00:34:43,210 --> 00:34:46,310
and it starts asymptoting to
the point where the speedup
735
00:34:46,310 --> 00:34:51,650
approaches the parallelism and
that's sort of what happened.
736
00:34:51,650 --> 00:34:53,830
You get some noise out here so
one of the things down here,
737
00:34:53,830 --> 00:34:56,520
it's nice and tight.
738
00:34:56,520 --> 00:34:58,660
And that's because we're in
that domain where the
739
00:34:58,660 --> 00:35:01,790
communication costs are
infrequently paid because
740
00:35:01,790 --> 00:35:03,200
there's lots of work to do.
741
00:35:03,200 --> 00:35:05,120
You don't have to communicate
very much.
742
00:35:05,120 --> 00:35:07,510
Up here there's a lot of
communication that happens and
743
00:35:07,510 --> 00:35:12,170
so the noise is showing
up more in the data.
744
00:35:12,170 --> 00:35:14,530
This curve here is the
T sub 1 over P plus T
745
00:35:14,530 --> 00:35:15,780
sub infinity curve.
746
00:35:15,780 --> 00:35:19,200
747
00:35:19,200 --> 00:35:22,700
The T sub P equals T sub
1 over P curve-- that's the
748
00:35:22,700 --> 00:35:25,430
linear speedup curve
on this graph.
749
00:35:25,430 --> 00:35:28,380
So I think there's an important
lesson in this graph
750
00:35:28,380 --> 00:35:32,120
besides the data itself, which
is if you're careful about
751
00:35:32,120 --> 00:35:37,920
choosing the axes, you can take
a whole bunch of data
752
00:35:37,920 --> 00:35:40,580
that you couldn't see how to
plot it together and you can
753
00:35:40,580 --> 00:35:42,690
plot it together and get
something meaningful.
754
00:35:42,690 --> 00:35:46,130
So in my Ph.D. thesis I had
hundreds of little plots for
755
00:35:46,130 --> 00:35:48,980
each chess position and I didn't
figure out how-- it's
756
00:35:48,980 --> 00:35:50,030
like they all look
the same, right?
757
00:35:50,030 --> 00:35:53,050
But I didn't sort of figure out
that if I was careful I
758
00:35:53,050 --> 00:35:55,110
could actually make
them be the same.
759
00:35:55,110 --> 00:35:57,290
That happened after I
published my thesis.
760
00:35:57,290 --> 00:35:59,030
Oh, we could just
overlay them.
761
00:35:59,030 --> 00:36:03,340
Well, what's the normalization
that makes that work?
762
00:36:03,340 --> 00:36:05,540
So there's a speedup paradox
that happened.
763
00:36:05,540 --> 00:36:09,220
764
00:36:09,220 --> 00:36:10,025
Pardon?
765
00:36:10,025 --> 00:36:11,480
AUDIENCE: [OBSCURED]
766
00:36:11,480 --> 00:36:12,940
BRADLEY KUSZMAUL: Yeah, OK.
767
00:36:12,940 --> 00:36:15,460
There was a speedup paradox that
happened while we were
768
00:36:15,460 --> 00:36:16,650
developing star Socrates.
769
00:36:16,650 --> 00:36:20,040
We were developing this for
512 processor connection
770
00:36:20,040 --> 00:36:24,520
machine that was at University
of Illinois, but we only had a
771
00:36:24,520 --> 00:36:26,790
smaller machine on which
to do our development.
772
00:36:26,790 --> 00:36:30,910
We had a 128 processor machine
at MIT and most days I could
773
00:36:30,910 --> 00:36:34,040
only get 32 processors
because the machine
774
00:36:34,040 --> 00:36:35,340
was in heavy demand.
775
00:36:35,340 --> 00:36:38,240
So we had this program
and it ran on 32
776
00:36:38,240 --> 00:36:41,000
processors in 65 seconds.
777
00:36:41,000 --> 00:36:44,260
And one of the developers said,
here's a variation on
778
00:36:44,260 --> 00:36:46,390
the algorithm, it
changes the dag.
779
00:36:46,390 --> 00:36:47,720
It's a heuristic.
780
00:36:47,720 --> 00:36:50,510
It makes the program
run more efficiently.
781
00:36:50,510 --> 00:36:53,910
Look, it runs in only 40 seconds
on 32 processors.
782
00:36:53,910 --> 00:36:55,850
And so is that a good idea?
783
00:36:55,850 --> 00:36:59,770
It sure seemed like a good idea,
but we were worried that
784
00:36:59,770 --> 00:37:01,890
we knew that the transformation
increased the
785
00:37:01,890 --> 00:37:04,260
critical path length of the
program, so we weren't sure it
786
00:37:04,260 --> 00:37:05,280
was a good idea.
787
00:37:05,280 --> 00:37:07,660
So we did some calculation.
788
00:37:07,660 --> 00:37:11,550
We measured the work
and the speedup.
789
00:37:11,550 --> 00:37:12,510
And so the work here--
790
00:37:12,510 --> 00:37:14,340
these numbers have been cooked
a little bit to make the math
791
00:37:14,340 --> 00:37:19,950
easy, but the numbers--
792
00:37:19,950 --> 00:37:24,980
this really did happen, but not
with these exact numbers.
793
00:37:24,980 --> 00:37:29,100
So we had a work which was 2048
seconds and only 1 second
794
00:37:29,100 --> 00:37:29,910
of critical path.
795
00:37:29,910 --> 00:37:33,430
And this new program had
only 1/2 as much work to do,
796
00:37:33,430 --> 00:37:35,120
but the critical path
length was longer.
797
00:37:35,120 --> 00:37:36,390
It was 8 seconds long.
798
00:37:36,390 --> 00:37:40,190
799
00:37:40,190 --> 00:37:43,140
If you predict on 32 processors
what the runtime's
800
00:37:43,140 --> 00:37:46,740
going to be that formula
says well, 65 seconds.
801
00:37:46,740 --> 00:37:50,050
If you predict it on 32
processors this-- well, it's
802
00:37:50,050 --> 00:37:53,540
40 seconds and that looks good,
but we were going to be
803
00:37:53,540 --> 00:37:57,990
running the tournament on 512
processors where this term
804
00:37:57,990 --> 00:38:02,030
would start being less important
than this term.
805
00:38:02,030 --> 00:38:04,570
So this really did happen and
we actually went back and
806
00:38:04,570 --> 00:38:07,200
validated that these numbers
were right after we did the
807
00:38:07,200 --> 00:38:11,120
calculation and it allowed us to
do the engineering to make
808
00:38:11,120 --> 00:38:14,310
the right decision and not be
misled by something that
809
00:38:14,310 --> 00:38:19,120
looked good in the
test environment.
810
00:38:19,120 --> 00:38:21,090
We were able to predict what was
going to happen on the big
811
00:38:21,090 --> 00:38:23,310
machine without actually having
access to the big
812
00:38:23,310 --> 00:38:24,730
machine and that was
very important.
813
00:38:24,730 --> 00:38:27,660
814
00:38:27,660 --> 00:38:31,230
Let me do some algorithms. You
guys probably have done some
815
00:38:31,230 --> 00:38:34,000
matrix multipliers over the
past 3 weeks, right?
816
00:38:34,000 --> 00:38:36,223
That's probably the only thing
you've been able to do would
817
00:38:36,223 --> 00:38:38,210
be my guess.
818
00:38:38,210 --> 00:38:40,680
So matrix multiplication
is this operation.
819
00:38:40,680 --> 00:38:42,780
I won't talk about it, but
you know what it is.
820
00:38:42,780 --> 00:38:47,240
In Cilk instead of doing the
standard triply nested loops
821
00:38:47,240 --> 00:38:49,740
you do divide and conquer.
822
00:38:49,740 --> 00:38:52,825
We don't parallelize loops we
parallelize function calls, so
823
00:38:52,825 --> 00:38:56,830
you want to express a
loops as recursion.
824
00:38:56,830 --> 00:39:00,460
So to multiply two big matrices
you do a whole bunch
825
00:39:00,460 --> 00:39:01,990
of little matrix multiplications
of the
826
00:39:01,990 --> 00:39:04,450
sub-blocks and then you express
those little matrix
827
00:39:04,450 --> 00:39:07,870
multiplications themselves and
go off and recursively do
828
00:39:07,870 --> 00:39:10,490
smaller matrix multiplications.
829
00:39:10,490 --> 00:39:12,980
So this requires 8
multiplications of matrices
830
00:39:12,980 --> 00:39:15,780
each of 1/2 the number of
rows and 1/2 the number
831
00:39:15,780 --> 00:39:19,280
columns and one addition at the
end where you add these two
832
00:39:19,280 --> 00:39:20,430
matrices together.
833
00:39:20,430 --> 00:39:25,600
That's the algorithm that we do,
it's the same total work
834
00:39:25,600 --> 00:39:28,700
as the standard one, but it's
just expressed recursively.
835
00:39:28,700 --> 00:39:32,850
So a matrix multiply is you
do these 8 multiplies.
836
00:39:32,850 --> 00:39:35,660
I had to create a temporary
variable, so the first four
837
00:39:35,660 --> 00:39:40,540
multiply the A's and B's
into C. The second four
838
00:39:40,540 --> 00:39:45,030
multiply the A's and B's into
T and then I have to add T
839
00:39:45,030 --> 00:39:48,880
into C. So I do all those
spawns, do all the multiplies.
840
00:39:48,880 --> 00:39:51,516
I do a sync because I better not
start using the results on
841
00:39:51,516 --> 00:39:55,030
the multiplies and adding them
until the multiplies are done.
842
00:39:55,030 --> 00:39:56,360
AUDIENCE: Which four
do you add?
843
00:39:56,360 --> 00:39:57,960
BRADLEY KUSZMAUL: What?
844
00:39:57,960 --> 00:39:59,210
There's parallelism in add.
845
00:39:59,210 --> 00:40:01,620
846
00:40:01,620 --> 00:40:02,930
Matrix addition.
847
00:40:02,930 --> 00:40:05,750
AUDIENCE: Yeah, but it doesn't
add spawn extent
848
00:40:05,750 --> 00:40:08,330
BRADLEY KUSZMAUL: Well,
we spawn off add.
849
00:40:08,330 --> 00:40:10,610
I don't understand--
850
00:40:10,610 --> 00:40:12,770
[INTERPOSING VOICES]
851
00:40:12,770 --> 00:40:15,550
BRADLEY KUSZMAUL: So you have
to spawn Cilk functions even
852
00:40:15,550 --> 00:40:17,890
if you're only executing
one of them at a time.
853
00:40:17,890 --> 00:40:20,760
854
00:40:20,760 --> 00:40:24,290
Cilk functions are spawned,
C functions are called.
855
00:40:24,290 --> 00:40:26,810
It's a decision that's built
into the language.
856
00:40:26,810 --> 00:40:28,760
It's not really a fundamental
decision.
857
00:40:28,760 --> 00:40:30,748
It's just that's the
way we did it.
858
00:40:30,748 --> 00:40:32,703
AUDIENCE: Why'd you choose to
have the keyword then?
859
00:40:32,703 --> 00:40:34,170
That's just documentation
from the caller side?
860
00:40:34,170 --> 00:40:37,600
BRADLEY KUSZMAUL: Yeah, we found
we were less likely to
861
00:40:37,600 --> 00:40:41,420
make a mistake if we sort of
built it into the type system
862
00:40:41,420 --> 00:40:42,370
in this way.
863
00:40:42,370 --> 00:40:45,110
But I'm not convinced that this
is the best way to do
864
00:40:45,110 --> 00:40:47,990
this type system.
865
00:40:47,990 --> 00:40:48,920
AUDIENCE: Can the C functions
spawn a Cilk function?
866
00:40:48,920 --> 00:40:49,240
BRADLEY KUSZMAUL: No.
867
00:40:49,240 --> 00:40:52,330
You can only call spawn, spawn,
spawn, spawn then you
868
00:40:52,330 --> 00:40:55,390
can call C functions
at the leaves.
869
00:40:55,390 --> 00:40:58,160
It turns out you can actually
spawn Cilk functions if you're
870
00:40:58,160 --> 00:41:01,310
a little clever about-- there's
a mechanism for a Cilk
871
00:41:01,310 --> 00:41:02,850
system running in the background
and if you're
872
00:41:02,850 --> 00:41:05,800
running C you can
say OK, do this
873
00:41:05,800 --> 00:41:07,570
Cilk function in parallel.
874
00:41:07,570 --> 00:41:10,290
So we have that, but that's
not didactic.
875
00:41:10,290 --> 00:41:13,805
876
00:41:13,805 --> 00:41:15,745
AUDIENCE: Sorry, I
have a question
877
00:41:15,745 --> 00:41:18,655
about the sync spawning.
878
00:41:18,655 --> 00:41:22,487
Is the sync actually waiting for a
whole wave or -- like, in the
879
00:41:22,487 --> 00:41:27,720
case of-- maybe not in the case
of the add here, but in
880
00:41:27,720 --> 00:41:32,155
plenty of other practical
functions you get inside the
881
00:41:32,155 --> 00:41:35,187
spawn function looking
at the dependencies of
882
00:41:35,187 --> 00:41:36,415
the parameters, right?
883
00:41:36,415 --> 00:41:39,136
Based on how those
were built from
884
00:41:39,136 --> 00:41:40,960
previously spawned functions.
885
00:41:40,960 --> 00:41:44,822
You can actually just start
processing so long as it's
886
00:41:44,822 --> 00:41:47,006
guaranteed that the results
are available before you
887
00:41:47,006 --> 00:41:47,080
actually read them.
888
00:41:47,080 --> 00:41:49,080
BRADLEY KUSZMAUL: So there's
this other style of expressing
889
00:41:49,080 --> 00:41:50,890
parallelism which you see
in some of the data flow
890
00:41:50,890 --> 00:41:54,480
languages where you say well,
I've computed this first
891
00:41:54,480 --> 00:41:57,160
multiply, why can't I get
started on the corresponding
892
00:41:57,160 --> 00:42:00,030
part of the addition.
893
00:42:00,030 --> 00:42:03,440
And it turns out that in those
models there's no performance
894
00:42:03,440 --> 00:42:04,690
guarantees.
895
00:42:04,690 --> 00:42:07,110
896
00:42:07,110 --> 00:42:08,480
The real issue is you
run out of memory.
897
00:42:08,480 --> 00:42:12,220
898
00:42:12,220 --> 00:42:15,060
It's a long topic, let's not
go into it, but there's a
899
00:42:15,060 --> 00:42:17,560
serious technical issue with
those programming models.
900
00:42:17,560 --> 00:42:20,610
901
00:42:20,610 --> 00:42:22,560
We have very tight memory
bounds as well, so we
902
00:42:22,560 --> 00:42:25,310
simultaneously get these good
scheduling bounds and good
903
00:42:25,310 --> 00:42:28,060
memory bounds and if you are
doing that you could have sort
904
00:42:28,060 --> 00:42:30,910
of a really large number of
temporaries required and run
905
00:42:30,910 --> 00:42:31,300
out of memory.
906
00:42:31,300 --> 00:42:35,030
The data flow machine used to
have this number-- there was a
907
00:42:35,030 --> 00:42:38,295
student, Ken Traub, who was
working on Monsoon when Greg
908
00:42:38,295 --> 00:42:42,780
Papadapolous was here and he
came up with this term which
909
00:42:42,780 --> 00:42:45,390
we called Traub's constant,
which was how long the machine
910
00:42:45,390 --> 00:42:47,740
could be guaranteed to run
before it crashed from being
911
00:42:47,740 --> 00:42:48,870
out of memory.
912
00:42:48,870 --> 00:42:51,910
And that was-- well, he took
the rate at which it consumed
913
00:42:51,910 --> 00:42:56,960
memory, divided that into the
amount of memory, and that was it.
914
00:42:56,960 --> 00:43:00,390
And many data flow programs had
that property that Monsoon
915
00:43:00,390 --> 00:43:02,460
could run for 40 seconds
and then after
916
00:43:02,460 --> 00:43:04,150
that you never knew.
917
00:43:04,150 --> 00:43:06,420
It might start crashing at any
moment, so everybody wrote
918
00:43:06,420 --> 00:43:11,770
short data flow programs.
919
00:43:11,770 --> 00:43:13,280
So one of the things you
actually do when you're
920
00:43:13,280 --> 00:43:14,900
implementing, when you're trying
to engineer this to go
921
00:43:14,900 --> 00:43:18,540
fast, is you coarsen the
base case, which I didn't
922
00:43:18,540 --> 00:43:19,350
describe up there.
923
00:43:19,350 --> 00:43:22,310
You don't just do a 1 by 1
matrix multiplied down there
924
00:43:22,310 --> 00:43:24,970
at the leaves of
this recursion.
925
00:43:24,970 --> 00:43:26,900
Because then you're not using
the processor pipeline
926
00:43:26,900 --> 00:43:28,580
efficiently.
927
00:43:28,580 --> 00:43:31,980
You call the Intel Math Kernel
Library or something on an 8
928
00:43:31,980 --> 00:43:34,290
by 8 matrix so that it really
gets the pipeline a
929
00:43:34,290 --> 00:43:35,750
chance to chug away.
930
00:43:35,750 --> 00:43:38,740
931
00:43:38,740 --> 00:43:39,250
So analysis.
932
00:43:39,250 --> 00:43:42,850
This matrix addition operation--
well, what's the
933
00:43:42,850 --> 00:43:45,390
work for matrix addition?
934
00:43:45,390 --> 00:43:50,820
Well, the work to do a matrix
addition on n rows is, well,
935
00:43:50,820 --> 00:43:55,120
you have to do 4 additions
of size n over 2.
936
00:43:55,120 --> 00:43:58,370
Plus there's order 1 work
here for the sync.
937
00:43:58,370 --> 00:44:04,080
And that recurrence has solution
order n squared.
938
00:44:04,080 --> 00:44:05,200
Well, that's not surprising.
939
00:44:05,200 --> 00:44:07,980
You have to add up 2 matrices
which are n by n.
940
00:44:07,980 --> 00:44:10,900
That's going to be n squared
so that's a good result.
941
00:44:10,900 --> 00:44:15,450
The critical path for this is
well, you have to do all of
942
00:44:15,450 --> 00:44:16,350
these in parallel.
943
00:44:16,350 --> 00:44:20,840
So whatever the critical path of
the longest one is, they're
944
00:44:20,840 --> 00:44:23,350
all the same so it's just the
critical path of the size n
945
00:44:23,350 --> 00:44:29,350
over 2 plus order 1, so the
critical path is order log n.
946
00:44:29,350 --> 00:44:34,270
For matrix multiplication,
sort of the reason I
947
00:44:34,270 --> 00:44:36,840
do this is I can.
948
00:44:36,840 --> 00:44:38,820
This is a model which I can
do this analysis, so
949
00:44:38,820 --> 00:44:40,010
I have to do it.
950
00:44:40,010 --> 00:44:43,740
But really, being able to do
this analysis is important
951
00:44:43,740 --> 00:44:46,550
when you're trying to make
things run faster.
952
00:44:46,550 --> 00:44:48,800
Matrix multiplication, well,
the work is I have to do 8
953
00:44:48,800 --> 00:44:52,840
little matrix multiplies plus
I have to do the matrix add.
954
00:44:52,840 --> 00:44:56,350
955
00:44:56,350 --> 00:44:59,110
The work has solution order n
cubed and everybody knows that
956
00:44:59,110 --> 00:45:01,990
there's order n cubed multiply-
adds in a matrix multiply,
957
00:45:01,990 --> 00:45:04,420
so that's not very surprising.
958
00:45:04,420 --> 00:45:09,700
The critical path is-- well, I
have to do an add so that takes
959
00:45:09,700 --> 00:45:12,910
log n, plus I have to do a
multiply on a matrix that's
960
00:45:12,910 --> 00:45:14,210
1/2 the size.
961
00:45:14,210 --> 00:45:16,990
So the critical path length of
the whole thing has solution
962
00:45:16,990 --> 00:45:19,190
order log squared n.
963
00:45:19,190 --> 00:45:25,670
So the total parallelism of
matrix multiplication is the
964
00:45:25,670 --> 00:45:30,900
work over the span, which is
n cubed over log squared n.
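Writing out the multiplication analysis the same way (again my notation): the work recurrence, the span recurrence with the log n add on the critical path, and the resulting parallelism are:

```latex
\begin{align*}
M_1(n)      &= 8\,M_1(n/2) + \Theta(n^2) = \Theta(n^3) \\
M_\infty(n) &= M_\infty(n/2) + \Theta(\lg n) = \Theta(\lg^2 n) \\
\frac{M_1(n)}{M_\infty(n)} &= \Theta\!\left(\frac{n^3}{\lg^2 n}\right)
\end{align*}
```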
965
00:45:30,900 --> 00:45:34,030
So if you have a 1000 by 1000
matrix that means your
966
00:45:34,030 --> 00:45:37,860
parallelism is close
to 10 million.
967
00:45:37,860 --> 00:45:40,930
There's a lot of parallelism
and in fact, we see perfect
968
00:45:40,930 --> 00:45:43,760
linear speedup on matrix
multiply because there's so
969
00:45:43,760 --> 00:45:45,010
much parallelism in it.
970
00:45:45,010 --> 00:45:47,710
971
00:45:47,710 --> 00:45:51,270
It turns out that this stack
temporary that I created so
972
00:45:51,270 --> 00:45:53,550
that I could do these multiplies
all in parallel is
973
00:45:53,550 --> 00:45:57,870
actually costing me work because
I'm on a machine that
974
00:45:57,870 --> 00:46:00,110
has cache and I want to use
the cache effectively.
975
00:46:00,110 --> 00:46:02,630
So I really don't want to create
a whole big temporary
976
00:46:02,630 --> 00:46:06,780
matrix and blow my cache
out if I can avoid it.
977
00:46:06,780 --> 00:46:10,860
So I proposed the following
matrix multiply, which is I
978
00:46:10,860 --> 00:46:14,950
first do 4 of the matrix
multiplies into C1 then I do a
979
00:46:14,950 --> 00:46:20,830
sync and then I do the other
4 into C1 and another sync.
980
00:46:20,830 --> 00:46:24,560
And I forgot to do the add-- oh,
no, those are multiply-adds
981
00:46:24,560 --> 00:46:26,850
so they're multiplying
and adding in.
982
00:46:26,850 --> 00:46:30,300
And this saves space because it
doesn't need a temporary,
983
00:46:30,300 --> 00:46:32,960
but it increases the
critical path.
984
00:46:32,960 --> 00:46:35,790
So is that a good idea
or a bad idea?
985
00:46:35,790 --> 00:46:39,410
Well, we can answer part of that
question with analysis.
986
00:46:39,410 --> 00:46:42,290
Saving space we know is going
to save something.
987
00:46:42,290 --> 00:46:44,080
What does it do to the work
and critical path?
988
00:46:44,080 --> 00:46:47,220
Well, the work is still the
same, it's n cubed because we
989
00:46:47,220 --> 00:46:50,370
didn't change the number of
flops that we're doing.
990
00:46:50,370 --> 00:46:51,900
But the critical
path has grown.
991
00:46:51,900 --> 00:46:56,530
Instead of doing 1 times a
matrix multiply, we have to do
992
00:46:56,530 --> 00:46:58,690
one and then sync and
then do another one.
993
00:46:58,690 --> 00:47:02,140
So it's 2 matrix multiplies of
1/2 the size plus the order 1
994
00:47:02,140 --> 00:47:06,700
and that recurrence has solution
order n instead of
995
00:47:06,700 --> 00:47:09,300
order log squared n.
996
00:47:09,300 --> 00:47:12,590
So that sounds bad, we've made
the critical path longer.
997
00:47:12,590 --> 00:47:13,610
AUDIENCE: [OBSCURED]
998
00:47:13,610 --> 00:47:13,870
BRADLEY KUSZMAUL: What?
999
00:47:13,870 --> 00:47:15,010
Yeah.
1000
00:47:15,010 --> 00:47:18,790
So parallelism is now order n
squared instead of n cubed
1001
00:47:18,790 --> 00:47:22,240
over log squared n and for a
1000 by 1000 matrix that means
1002
00:47:22,240 --> 00:47:24,740
you still have a million
fold parallelism.
1003
00:47:24,740 --> 00:47:27,900
So for relatively modest sized
matrices you still have plenty
1004
00:47:27,900 --> 00:47:29,260
of work to do this
optimization.
1005
00:47:29,260 --> 00:47:31,830
So this is a good transformation
to do it.
1006
00:47:31,830 --> 00:47:34,680
One of the advantages of Cilk
is that you can do this kind
1007
00:47:34,680 --> 00:47:37,770
of thing. You could say, let me
do an optimization.
1008
00:47:37,770 --> 00:47:40,340
I can do an optimization in my
C code and I get to take
1009
00:47:40,340 --> 00:47:42,460
advantage of it in
the Cilk code.
1010
00:47:42,460 --> 00:47:45,580
I could do this kind of
optimization of trading work
1011
00:47:45,580 --> 00:47:46,290
for parallelism.
1012
00:47:46,290 --> 00:47:49,730
If I have a lot of work that
sometimes is a good idea.
1013
00:47:49,730 --> 00:47:53,810
Ordinary matrix multiplication
just is really bad.
1014
00:47:53,810 --> 00:47:57,170
Basically you can imagine
spawning off the n squared
1015
00:47:57,170 --> 00:48:00,180
inner dot products here and
1016
00:48:00,180 --> 00:48:01,790
computing them all in parallel.
1017
00:48:01,790 --> 00:48:06,560
It has work n cubed
parallelism log n.
1018
00:48:06,560 --> 00:48:10,210
I mean, critical path log n so
the parallelism's even better.
1019
00:48:10,210 --> 00:48:13,480
It's n cubed over log n
instead of n squared.
1020
00:48:13,480 --> 00:48:16,000
That looks better theoretically,
but it's really
1021
00:48:16,000 --> 00:48:19,430
bad in practice because it has
such poor cache behavior.
1022
00:48:19,430 --> 00:48:23,390
So we don't do that.
1023
00:48:23,390 --> 00:48:25,360
I'll just briefly talk
about how it works.
1024
00:48:25,360 --> 00:48:27,000
So Cilk does work-stealing.
1025
00:48:27,000 --> 00:48:29,740
Each worker has a double-ended
queue-- a deque.
1026
00:48:29,740 --> 00:48:31,995
So at the bottom of the queue
is the stack where you push
1027
00:48:31,995 --> 00:48:34,680
and pop things and the top is
something where you can pop
1028
00:48:34,680 --> 00:48:36,500
things off if you want to.
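A toy, single-threaded model of that deque discipline (illustrative only; the real Cilk deque uses a non-blocking protocol and real frames, and these names are mine): the owner pushes and pops at the bottom like a stack, and a thief takes from the top.

```c
enum { CAP = 64 };

/* A worker's deque of spawned-frame ids. bot is the owner's end
 * (stack-like); top is where thieves steal the oldest frame. */
typedef struct {
    int frame[CAP];
    int top, bot;
} deque;

static void push_bottom(deque *d, int f) {   /* owner: spawn */
    d->frame[d->bot++] = f;
}

static int pop_bottom(deque *d) {            /* owner: return */
    return d->bot > d->top ? d->frame[--d->bot] : -1;
}

static int steal_top(deque *d) {             /* thief: steal */
    return d->bot > d->top ? d->frame[d->top++] : -1;
}
```

Because the owner works at the bottom and thieves at the top, the two rarely touch the same end, which is why conflicts are unlikely.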
1029
00:48:36,500 --> 00:48:38,790
And so what's running is all
these processors are running
1030
00:48:38,790 --> 00:48:40,170
each on their own stack.
1031
00:48:40,170 --> 00:48:42,690
They're all running the
ordinary serial code.
1032
00:48:42,690 --> 00:48:44,700
That's sort of the
basic situation.
1033
00:48:44,700 --> 00:48:46,190
They're pretty much
running the serial
1034
00:48:46,190 --> 00:48:48,470
code most of the time.
1035
00:48:48,470 --> 00:48:50,170
So some processor runs.
1036
00:48:50,170 --> 00:48:51,490
It spawns.
1037
00:48:51,490 --> 00:48:53,130
Well, what does a spawn do?
1038
00:48:53,130 --> 00:48:55,080
It pushes something onto its
stack because it's just a
1039
00:48:55,080 --> 00:48:57,080
function call.
1040
00:48:57,080 --> 00:49:01,760
And it does a couple more
spawns, so things get pushed on.
1041
00:49:01,760 --> 00:49:04,180
Somebody returns so
he pops his stack.
1042
00:49:04,180 --> 00:49:06,890
So far everything's going on,
they're not communicating,
1043
00:49:06,890 --> 00:49:10,640
they're completely independent
computations.
1044
00:49:10,640 --> 00:49:12,840
This guy spawns and now
he's out of work.
1045
00:49:12,840 --> 00:49:14,220
Now he has to do something.
1046
00:49:14,220 --> 00:49:17,090
What he does is he goes and
picks another processor at
1047
00:49:17,090 --> 00:49:22,270
random and he steals
the thing from the
1048
00:49:22,270 --> 00:49:23,920
other end of the stack.
1049
00:49:23,920 --> 00:49:26,260
So he's unlikely to conflict
because this guy's pushing and
1050
00:49:26,260 --> 00:49:28,900
popping down here, but there's
a lock in there, there's a
1051
00:49:28,900 --> 00:49:30,290
little algorithm.
1052
00:49:30,290 --> 00:49:34,680
A non-blocking algorithm
actually, it's not a lock.
1053
00:49:34,680 --> 00:49:38,870
And so he goes and he steals
something and come on, slide
1054
00:49:38,870 --> 00:49:39,690
over there.
1055
00:49:39,690 --> 00:49:40,460
Whoa.
1056
00:49:40,460 --> 00:49:43,900
Yes, that's animation, right?
1057
00:49:43,900 --> 00:49:47,340
That's the extent
of my animation.
1058
00:49:47,340 --> 00:49:49,600
And then he starts
working away.
1059
00:49:49,600 --> 00:49:52,800
And the theorem is that a
work-stealing scheduler like
1060
00:49:52,800 --> 00:49:56,330
this gives expected running
time with high probability
1061
00:49:56,330 --> 00:49:59,280
actually of T sub 1 over P
plus T sub infinity on P
1062
00:49:59,280 --> 00:50:00,690
processors.
1063
00:50:00,690 --> 00:50:04,190
And the pseudoproof is a little
bit like the proof for
1064
00:50:04,190 --> 00:50:05,270
Brent's Theorem,
which is either
1065
00:50:05,270 --> 00:50:07,020
you're working or stealing.
1066
00:50:07,020 --> 00:50:11,050
If you're working well, that
goes against T sub 1 over P.
1067
00:50:11,050 --> 00:50:14,410
You can't do that very much
or you run out of work.
1068
00:50:14,410 --> 00:50:19,210
If you're stealing well, each
steal has a chance that it
1069
00:50:19,210 --> 00:50:22,040
steals the thing that's
on the critical path.
1070
00:50:22,040 --> 00:50:23,960
You may actually steal the wrong
thing, but you actually
1071
00:50:23,960 --> 00:50:26,910
have a 1 in P chance that you're
the one who steals the
1072
00:50:26,910 --> 00:50:30,030
thing that's on the critical
path and then in
1073
00:50:30,030 --> 00:50:31,910
which case the expected
number--
1074
00:50:31,910 --> 00:50:34,060
so you had this chance of
1 over P of reducing the
1075
00:50:34,060 --> 00:50:38,210
critical path length by 1, so
after this many steals the
1076
00:50:38,210 --> 00:50:39,750
critical path is all gone.
1077
00:50:39,750 --> 00:50:44,260
So you can only do P times
T infinity steals.
1078
00:50:44,260 --> 00:50:46,750
With high probability
it comes out.
1079
00:50:46,750 --> 00:50:50,440
And that gives you
these bounds.
1080
00:50:50,440 --> 00:50:54,110
OK, I'm not going to give
you all this stuff.
1081
00:50:54,110 --> 00:50:58,040
Message passing sucks,
you know.
1082
00:50:58,040 --> 00:50:59,170
You guys know.
1083
00:50:59,170 --> 00:51:02,270
There's probably nothing
else in here.
1084
00:51:02,270 --> 00:51:05,270
1085
00:51:05,270 --> 00:51:09,790
So basically the pitch here is
that you get some high level
1086
00:51:09,790 --> 00:51:13,620
linguistic support for this
very fine-grained parallelism.
1087
00:51:13,620 --> 00:51:16,620
It's an algorithmic programming
model so that
1088
00:51:16,620 --> 00:51:19,640
means that you can do
engineering for performance.
1089
00:51:19,640 --> 00:51:23,310
There's fairly easy conversion
of existing code, especially
1090
00:51:23,310 --> 00:51:24,770
when you combine it with
the race detector.
1091
00:51:24,770 --> 00:51:27,335
You've got this factorization
of the debugging problem:
1092
00:51:27,335 --> 00:51:30,930
to debug your serial code,
you run it with all the
1093
00:51:30,930 --> 00:51:32,060
Cilk stuff turned off.
1094
00:51:32,060 --> 00:51:35,010
You debug the program and make
sure your program works.
1095
00:51:35,010 --> 00:51:36,700
Then you run it with the race
detector to make sure you get
1096
00:51:36,700 --> 00:51:41,290
the same answer in parallel
and then you're done.
1097
00:51:41,290 --> 00:51:44,310
Applications in Cilk don't just
scale to large numbers of
1098
00:51:44,310 --> 00:51:47,240
processors, they scale down
to small numbers, which is
1099
00:51:47,240 --> 00:51:49,890
important if you only have
two processors or one.
1100
00:51:49,890 --> 00:51:53,750
You don't suddenly want to pay
a factor of 10 to get off the
1101
00:51:53,750 --> 00:51:55,820
ground, which happens
sometimes on
1102
00:51:55,820 --> 00:51:57,110
clusters running MPI.
1103
00:51:57,110 --> 00:51:58,760
You have to pay a big
overhead before
1104
00:51:58,760 --> 00:52:01,700
you've made any progress.
1105
00:52:01,700 --> 00:52:04,320
And one of the advantages for
example is that the number of
1106
00:52:04,320 --> 00:52:06,420
processors might change
dynamically.
1107
00:52:06,420 --> 00:52:09,520
In this model that's
OK because it's
1108
00:52:09,520 --> 00:52:10,650
not part of the program.
1109
00:52:10,650 --> 00:52:14,050
So you may have the operating
system reduce the number of
1110
00:52:14,050 --> 00:52:18,230
actual worker threads that you
have doing that work-stealing
1111
00:52:18,230 --> 00:52:19,560
and that can work.
1112
00:52:19,560 --> 00:52:22,420
One of the bad things about
Cilk is that it doesn't
1113
00:52:22,420 --> 00:52:29,450
support sort of data-parallel
or SPMD kind of
1114
00:52:29,450 --> 00:52:30,420
parallelism.
1115
00:52:30,420 --> 00:52:34,010
You really have to think of
things as this divide and
1116
00:52:34,010 --> 00:52:35,930
conquer view of the world.
1117
00:52:35,930 --> 00:52:38,000
And if you have trouble
expressing that--
1118
00:52:38,000 --> 00:52:40,730
1119
00:52:40,730 --> 00:52:43,900
situations where you're doing
Jacobi update and you very
1120
00:52:43,900 --> 00:52:48,010
carefully put things on, had
each processor work on its
1121
00:52:48,010 --> 00:52:49,930
local memory and then they only
have to communicate at
1122
00:52:49,930 --> 00:52:51,250
the boundaries.
1123
00:52:51,250 --> 00:52:54,660
That's difficult to do right
in Cilk because essentially
1124
00:52:54,660 --> 00:52:56,960
every time you go around the
loop of I have all these
1125
00:52:56,960 --> 00:52:57,420
things to do.
1126
00:52:57,420 --> 00:52:59,700
All the work-stealing happens
randomly and it happens on a
1127
00:52:59,700 --> 00:53:00,520
different processor.
1128
00:53:00,520 --> 00:53:03,350
So it's not very good at that
sort of thing, although it
1129
00:53:03,350 --> 00:53:05,920
turns out Jacobi update's not
a very good example for that
1130
00:53:05,920 --> 00:53:08,790
because there are more
sophisticated algorithms that
1131
00:53:08,790 --> 00:53:12,230
use cache effectively that you
can express in Cilk and I
1132
00:53:12,230 --> 00:53:15,020
would have no idea how to
say those in some of these
1133
00:53:15,020 --> 00:53:16,870
sort of data parallel
languages.
1134
00:53:16,870 --> 00:53:21,010
Using the cache efficiently is
really important on modern
1135
00:53:21,010 --> 00:53:23,481
processors.
1136
00:53:23,481 --> 00:53:24,731
PROFESSOR: Thank you.
1137
00:53:24,731 --> 00:53:27,543
1138
00:53:27,543 --> 00:53:28,793
Questions?
1139
00:53:28,793 --> 00:53:33,130
1140
00:53:33,130 --> 00:53:34,700
BRADLEY KUSZMAUL: You can
download Cilk, there's a bunch
1141
00:53:34,700 --> 00:53:35,360
of contributors.
1142
00:53:35,360 --> 00:53:38,490
Those are the Cilk worms
and you can download
1143
00:53:38,490 --> 00:53:39,620
Cilk off our webpage.
1144
00:53:39,620 --> 00:53:41,730
Just Google for Cilk
and you'll find it.
1145
00:53:41,730 --> 00:53:44,540
It's a great language,
you'll love it.
1146
00:53:44,540 --> 00:53:47,115
You'll love it much more than
what you've been doing.
1147
00:53:47,115 --> 00:53:48,534
AUDIENCE: How does Cilk
play with processor
1148
00:53:48,534 --> 00:53:54,350
[OBSCURED]?
1149
00:53:54,350 --> 00:53:57,420
BRADLEY KUSZMAUL: Well, you
have to have a language, a
1150
00:53:57,420 --> 00:53:58,820
compiler that can
generate those.
1151
00:53:58,820 --> 00:54:02,482
If you have an assembly command
or you have some other
1152
00:54:02,482 --> 00:54:04,090
compiler that can
generate those.
1153
00:54:04,090 --> 00:54:10,130
So I just won the HPC challenge,
which is this
1154
00:54:10,130 --> 00:54:16,070
challenge where everybody tries
to run parallel programs
1155
00:54:16,070 --> 00:54:18,620
and argue that they
get productivity.
1156
00:54:18,620 --> 00:54:21,870
For that there were some codes
like matrix multiply and LU
1157
00:54:21,870 --> 00:54:24,020
decomposition with pivoting.
1158
00:54:24,020 --> 00:54:26,910
Basically at the leaves of the
computation I call the Intel
1159
00:54:26,910 --> 00:54:27,960
Math Kernel Library.
1160
00:54:27,960 --> 00:54:32,190
Which in turn uses the
SSE instructions.
1161
00:54:32,190 --> 00:54:35,860
You could do anything you can do
in C in the C parts of the
1162
00:54:35,860 --> 00:54:39,940
code because Cilk compiler just
passes those through.
1163
00:54:39,940 --> 00:54:43,140
So if you have some really
efficient pipeline code for
1164
00:54:43,140 --> 00:54:47,530
doing something, up to
some point it made
1165
00:54:47,530 --> 00:54:48,680
sense to use that.
1166
00:54:48,680 --> 00:54:52,620
AUDIENCE: [OBSCURED]
1167
00:54:52,620 --> 00:54:58,460
BRADLEY KUSZMAUL: So I ran
it on NASA's Columbia.
1168
00:54:58,460 --> 00:55:02,420
So the benchmark consists of--
well, there's 7 applications
1169
00:55:02,420 --> 00:55:04,560
they have. 6 of which are
actually well-defined.
1170
00:55:04,560 --> 00:55:07,065
One of them is this thing that
just measures network
1171
00:55:07,065 --> 00:55:07,540
performance or something,
so it doesn't
1172
00:55:07,540 --> 00:55:09,190
have any real semantics.
1173
00:55:09,190 --> 00:55:10,020
There's 6 benchmarks.
1174
00:55:10,020 --> 00:55:13,220
One of them is LU
decomposition, one of them is
1175
00:55:13,220 --> 00:55:18,110
DGEMM matrix multiplication and
this FFT and 3 others.
1176
00:55:18,110 --> 00:55:21,030
So I implemented all 6, nobody
else implemented all 6.
1177
00:55:21,030 --> 00:55:24,310
It turns out that you had to
implement 3 in order to enter.
1178
00:55:24,310 --> 00:55:27,990
Almost everybody implemented 3
or 4, but I did all 6 which is
1179
00:55:27,990 --> 00:55:29,540
part of why I won.
1180
00:55:29,540 --> 00:55:33,390
So I could argue that in
a week's work I just
1181
00:55:33,390 --> 00:55:33,820
implemented--
1182
00:55:33,820 --> 00:55:37,510
AUDIENCE: What is [OBSCURED]?
1183
00:55:37,510 --> 00:55:40,280
BRADLEY KUSZMAUL: So the prize
has two components.
1184
00:55:40,280 --> 00:55:43,860
Performance and productivity or
elegance or something and
1185
00:55:43,860 --> 00:55:47,370
it's completely whatever the
judges want that to be.
1186
00:55:47,370 --> 00:55:50,800
So it was up to me as a
presenter to make the case
1187
00:55:50,800 --> 00:55:51,780
that I was elegant.
1188
00:55:51,780 --> 00:55:54,550
Because I had my performance
numbers, which were pretty
1189
00:55:54,550 --> 00:55:58,245
good and it turned out that the
IBM entry for x10 did me
1190
00:55:58,245 --> 00:55:59,360
more good than I did, I think.
1191
00:55:59,360 --> 00:56:01,960
Because they got up there and
they compared the performance
1192
00:56:01,960 --> 00:56:05,330
of x10 to their Cilk
implementation and their x10
1193
00:56:05,330 --> 00:56:07,650
thing was almost as
good as Cilk.
1194
00:56:07,650 --> 00:56:10,380
So after that I think the judges
said they had to give
1195
00:56:10,380 --> 00:56:12,310
me the prize.
1196
00:56:12,310 --> 00:56:15,230
So basically, it went down to
Supercomputing and each of us
1197
00:56:15,230 --> 00:56:20,680
got 5 minutes to present and
there were 5 finalists.
1198
00:56:20,680 --> 00:56:22,840
We did our presentation and
then they gave out the --
1199
00:56:22,840 --> 00:56:28,970
So they divided the prize three
ways: the people who got
1200
00:56:28,970 --> 00:56:31,650
the absolute best performance,
which were some people running
1201
00:56:31,650 --> 00:56:36,950
UPC and the people who had the
most elegance based on minimal
1202
00:56:36,950 --> 00:56:40,630
number of lines of codes and
that was Cleve at --
1203
00:56:40,630 --> 00:56:41,140
what's his name?
1204
00:56:41,140 --> 00:56:43,100
The Mathworks guy, MATLAB guy.
1205
00:56:43,100 --> 00:56:45,880
Who said, look, matrix--
LU decomposition.
1206
00:56:45,880 --> 00:56:50,110
LU of P. It's very elegant,
but I don't think that it
1207
00:56:50,110 --> 00:56:53,720
really sort of explains what
you have to do to solve the
1208
00:56:53,720 --> 00:56:58,250
problems. So he won the prize
for most elegant and I got the
1209
00:56:58,250 --> 00:57:02,040
prize for best combination,
which they then changed--
1210
00:57:02,040 --> 00:57:06,560
in the final citation for the
prize they said, most
1211
00:57:06,560 --> 00:57:07,410
productivity.
1212
00:57:07,410 --> 00:57:08,390
That was the prize.
1213
00:57:08,390 --> 00:57:10,450
So I actually won the contest
because that was what the
1214
00:57:10,450 --> 00:57:13,040
contest was supposed to be
was most productivity.
1215
00:57:13,040 --> 00:57:14,880
But I only won 1/3 of the
prize money because they
1216
00:57:14,880 --> 00:57:16,130
divided it three ways.
1217
00:57:16,130 --> 00:57:19,236
1218
00:57:19,236 --> 00:57:22,682
PROFESSOR: Any other question?
1219
00:57:22,682 --> 00:57:24,651
Thank you.
1220
00:57:24,651 --> 00:57:26,620
BRADLEY KUSZMAUL: Thank you.
1221
00:57:26,620 --> 00:57:30,589
PROFESSOR: We'll take a 5 minute
break and since you had
1222
00:57:30,589 --> 00:57:34,867
guest lecturer I do
have [OBSCURED]
1223
00:57:34,867 --> 00:57:40,126