1
00:00:00,000 --> 00:00:00,120
2
00:00:00,120 --> 00:00:02,500
The following content is
provided under a Creative
3
00:00:02,500 --> 00:00:03,910
Commons license.
4
00:00:03,910 --> 00:00:06,950
Your support will help MIT
OpenCourseWare continue to
5
00:00:06,950 --> 00:00:10,600
offer high quality educational
resources for free.
6
00:00:10,600 --> 00:00:13,500
To make a donation or view
additional materials from
7
00:00:13,500 --> 00:00:17,430
hundreds of MIT courses, visit
MIT OpenCourseWare at
8
00:00:17,430 --> 00:00:18,680
ocw.mit.edu.
9
00:00:18,680 --> 00:00:23,660
10
00:00:23,660 --> 00:00:25,450
PROFESSOR: Let's get started.
11
00:00:25,450 --> 00:00:29,750
So today, I'm going to talk
a little bit more about
12
00:00:29,750 --> 00:00:33,200
performance issues in
parallelization.
13
00:00:33,200 --> 00:00:36,860
A little bit more out
of the [INAUDIBLE]
14
00:00:36,860 --> 00:00:39,170
to what people are
doing otherwise.
15
00:00:39,170 --> 00:00:44,030
So normally what we have done so
far, is we looked at Cilk.
16
00:00:44,030 --> 00:00:46,100
It provides a very robust
and environment for
17
00:00:46,100 --> 00:00:46,930
parallelization.
18
00:00:46,930 --> 00:00:50,550
It hides many issues and
eliminates many of the
19
00:00:50,550 --> 00:00:56,120
problems out there if you find
other areas of parallelization
20
00:00:56,120 --> 00:00:57,710
that you deal with.
21
00:00:57,710 --> 00:01:00,290
And in last lectures, we looked
at things like cache
22
00:01:00,290 --> 00:01:01,140
[UNINTELLIGIBLE]
23
00:01:01,140 --> 00:01:04,269
algorithims, algorithmic
issues, like
24
00:01:04,269 --> 00:01:06,710
looking at work and spend.
25
00:01:06,710 --> 00:01:08,790
And in fact, in your projects,
you're going to use all these
26
00:01:08,790 --> 00:01:13,660
nice concepts to get you a nice
parallel learning CorD.
27
00:01:13,660 --> 00:01:20,640
But if you look at a lot of
these CorD [UNINTELLIGIBLE]
28
00:01:20,640 --> 00:01:24,150
the very people normally for
parallelized CorD outside
29
00:01:24,150 --> 00:01:25,800
probably Cilk.
30
00:01:25,800 --> 00:01:30,270
And there are a lot of other
issues that arise, things like
31
00:01:30,270 --> 00:01:33,160
synchronization issues
and memory issues.
32
00:01:33,160 --> 00:01:36,380
Today, I think we are going to
focus mostly on memory issues.
33
00:01:36,380 --> 00:01:39,450
And we are going to use open
OpenMP instead of [INAUDIBLE].
34
00:01:39,450 --> 00:01:42,400
And most of these issues
will be affected on
35
00:01:42,400 --> 00:01:43,710
Cilk sometimes, too.
36
00:01:43,710 --> 00:01:45,990
But Cilk tries to hide
them from you.
37
00:01:45,990 --> 00:01:46,920
There's a layer of abstract.
38
00:01:46,920 --> 00:01:49,940
And it's hard to kind of get
to those issues in there.
39
00:01:49,940 --> 00:01:53,200
So we are going to look at
this thing called OpenMP.
40
00:01:53,200 --> 00:01:55,430
So today, we are going to
address things like
41
00:01:55,430 --> 00:01:58,820
granularity of parallelism.
42
00:01:58,820 --> 00:02:00,140
There are so many things
that just went out
43
00:02:00,140 --> 00:02:01,820
on the page, I guess.
44
00:02:01,820 --> 00:02:05,060
True sharing, false sharing,
load balancing issues, the
45
00:02:05,060 --> 00:02:06,170
[UNINTELLIGIBLE].
46
00:02:06,170 --> 00:02:10,570
So from the license and keep
talking about that we want to
47
00:02:10,570 --> 00:02:13,550
be out of all this not dealing
with Voodoo parameter.
48
00:02:13,550 --> 00:02:16,350
Today, we actually are dealing
mainly with Voodoo.
49
00:02:16,350 --> 00:02:20,460
So I guess this should
be basically
50
00:02:20,460 --> 00:02:21,670
the Halloween lecture.
51
00:02:21,670 --> 00:02:24,703
So we are all about Voodoo today
and see how we can deal
52
00:02:24,703 --> 00:02:27,760
with Voodoo issues.
53
00:02:27,760 --> 00:02:32,690
So if you look at a Cilk
program, here is a nice simple
54
00:02:32,690 --> 00:02:34,740
matrix multiply, seem
to be [INAUDIBLE]
55
00:02:34,740 --> 00:02:36,320
example these days.
56
00:02:36,320 --> 00:02:39,060
What you can do is you can put
a Cilk formula in these two
57
00:02:39,060 --> 00:02:42,070
loops and get a nice parallel
performance.
58
00:02:42,070 --> 00:02:44,870
However, [UNINTELLIGIBLE]
59
00:02:44,870 --> 00:02:47,270
from where how the memory
is arranged is
60
00:02:47,270 --> 00:02:48,820
up to the Cilk scheduler.
61
00:02:48,820 --> 00:02:50,980
Cilk scheduler is doing
some work stealing.
62
00:02:50,980 --> 00:02:54,420
Depending on how the work gets
distributed, the process will
63
00:02:54,420 --> 00:02:56,090
get worked, it will happen.
64
00:02:56,090 --> 00:02:58,370
Hopefully, everything
will go nicely.
65
00:02:58,370 --> 00:03:02,270
And so what that means is it
ties the distribution and load
66
00:03:02,270 --> 00:03:04,060
balancing issues.
67
00:03:04,060 --> 00:03:07,640
So it's nice if you have access
to Cilk, but many other
68
00:03:07,640 --> 00:03:09,050
[UNINTELLIGIBLE] you
might not be.
69
00:03:09,050 --> 00:03:12,320
And even within Cilk, some of
these issues might show up.
70
00:03:12,320 --> 00:03:15,390
So what we are going
to do is step one
71
00:03:15,390 --> 00:03:17,820
below the Cilk scheduler.
72
00:03:17,820 --> 00:03:22,470
So there's this system
called OpenMP.
73
00:03:22,470 --> 00:03:25,640
It's a more simplified
model of parallelism.
74
00:03:25,640 --> 00:03:29,665
So what it tries to do is
instead of giving this very
75
00:03:29,665 --> 00:03:30,460
[UNINTELLIGIBLE]
76
00:03:30,460 --> 00:03:33,930
system, it lets you basically
direct access to the
77
00:03:33,930 --> 00:03:35,110
processors.
78
00:03:35,110 --> 00:03:38,280
So what that means is there's
normally what we call a
79
00:03:38,280 --> 00:03:39,210
fork-join model.
80
00:03:39,210 --> 00:03:39,520
[UNINTELLIGIBLE]
81
00:03:39,520 --> 00:03:43,580
we have with Cilk, basically.
82
00:03:43,580 --> 00:03:46,490
We can do fork into different
workers and join.
83
00:03:46,490 --> 00:03:49,750
And more or less, you can
actually bind these workers to
84
00:03:49,750 --> 00:03:50,420
[UNINTELLIGIBLE]
85
00:03:50,420 --> 00:03:53,510
sometimes or make sure
that the number--
86
00:03:53,510 --> 00:03:56,800
I'll give you some techniques
how to do that as I go on.
87
00:03:56,800 --> 00:04:01,540
So for parallel loops, you can
do data parallelism, different
88
00:04:01,540 --> 00:04:02,340
[UNINTELLIGIBLE]
89
00:04:02,340 --> 00:04:04,580
parallelism you can do something
like fork-join.
90
00:04:04,580 --> 00:04:07,620
And you can see a bunch
of static or
91
00:04:07,620 --> 00:04:11,220
dynamic scheduling policies.
92
00:04:11,220 --> 00:04:16,790
So for example in OpenMP, you
can see for this loop that add
93
00:04:16,790 --> 00:04:17,959
a pragma to [UNINTELLIGIBLE]
94
00:04:17,959 --> 00:04:21,029
in front of this loop and
say this is OpenMP.
95
00:04:21,029 --> 00:04:22,450
Parallel loop in here.
96
00:04:22,450 --> 00:04:24,310
Parallel full loop
in this one.
97
00:04:24,310 --> 00:04:26,990
And schedule it using
static chunk.
98
00:04:26,990 --> 00:04:29,610
I will tell you what
exactly that means.
99
00:04:29,610 --> 00:04:34,680
And that gives you direct access
to how each of these
100
00:04:34,680 --> 00:04:36,730
parts will be run.
101
00:04:36,730 --> 00:04:39,320
So let me get a little
bit in detail.
102
00:04:39,320 --> 00:04:41,730
So assume you have
[UNINTELLIGIBLE] courses in
103
00:04:41,730 --> 00:04:42,950
there, [UNINTELLIGIBLE]
processors.
104
00:04:42,950 --> 00:04:46,270
So now in OpenMP, you are
basically opening the entire
105
00:04:46,270 --> 00:04:46,990
world underneath.
106
00:04:46,990 --> 00:04:49,840
And you have to kind of
see what's going on.
107
00:04:49,840 --> 00:04:55,330
And if you say, schedule a
static chunk of four, assume
108
00:04:55,330 --> 00:04:58,050
you have 16 iterations.
109
00:04:58,050 --> 00:05:00,200
Here are my 16 iterations.
110
00:05:00,200 --> 00:05:02,990
So each of these dots represent
a value for i.
111
00:05:02,990 --> 00:05:07,890
So what it says is you take
chunks of four and basically
112
00:05:07,890 --> 00:05:08,950
send it across it.
113
00:05:08,950 --> 00:05:11,190
So what happens is the first
four iterations will go to
114
00:05:11,190 --> 00:05:13,240
[UNINTELLIGIBLE] or core zero.
115
00:05:13,240 --> 00:05:15,650
Next four will go to core one,
core two and core three.
116
00:05:15,650 --> 00:05:18,870
So you know exactly which
iterations run where.
117
00:05:18,870 --> 00:05:20,660
It's a very static thing.
118
00:05:20,660 --> 00:05:23,100
You have full control
of what's going on.
119
00:05:23,100 --> 00:05:26,980
Whereas in Cilk, it's
up to the scheduler.
120
00:05:26,980 --> 00:05:29,090
So the nice thing here is you
can have full control.
121
00:05:29,090 --> 00:05:30,960
But you get enough room
to harm yourself if
122
00:05:30,960 --> 00:05:32,260
you do things wrong.
123
00:05:32,260 --> 00:05:36,510
So this is a double-edged
sword in that sense.
124
00:05:36,510 --> 00:05:39,990
So instead of doing static
five you do static two.
125
00:05:39,990 --> 00:05:43,010
You're assigning chunks
of size two.
126
00:05:43,010 --> 00:05:46,125
What it will do is it will
assign chunks of size two to
127
00:05:46,125 --> 00:05:48,450
the four cores.
128
00:05:48,450 --> 00:05:49,770
And then you're not done yet.
129
00:05:49,770 --> 00:05:52,620
And then you start with again
core zero and assign
130
00:05:52,620 --> 00:05:53,980
chunks of size two.
131
00:05:53,980 --> 00:05:57,880
This is called block
cyclic schedule.
132
00:05:57,880 --> 00:05:59,880
And if you do a chunk
of size one, it's
133
00:05:59,880 --> 00:06:00,890
called a cyclic schedule.
134
00:06:00,890 --> 00:06:08,190
[UNINTELLIGIBLE] cycles just
assigning iterations to cores.
135
00:06:08,190 --> 00:06:08,830
OK.
136
00:06:08,830 --> 00:06:11,700
So far so good?
137
00:06:11,700 --> 00:06:13,780
OK.
138
00:06:13,780 --> 00:06:15,780
So I want to do something.
139
00:06:15,780 --> 00:06:19,740
So I have this program.
140
00:06:19,740 --> 00:06:21,790
This is again your
141
00:06:21,790 --> 00:06:24,350
run-of-the-mill matrix multiply.
142
00:06:24,350 --> 00:06:30,070
And I ran a sequential single
machine, and I got this
143
00:06:30,070 --> 00:06:31,860
performance.
144
00:06:31,860 --> 00:06:34,750
Then, I said look, I want to
parallelize the outer loop.
145
00:06:34,750 --> 00:06:37,640
So I parallelize this loop.
146
00:06:37,640 --> 00:06:40,710
What should I get?
147
00:06:40,710 --> 00:06:41,790
[UNINTELLIGIBLE] fast or slow.
148
00:06:41,790 --> 00:06:46,970
I want to just check whether
you are awake or sleep.
149
00:06:46,970 --> 00:06:47,890
How do [UNINTELLIGIBLE PHRASE]
150
00:06:47,890 --> 00:06:49,140
to run slower.
151
00:06:49,140 --> 00:06:53,030
152
00:06:53,030 --> 00:06:54,370
It's not a trick question.
153
00:06:54,370 --> 00:06:57,460
This is just to make sure that
actually participate.
154
00:06:57,460 --> 00:06:59,230
How do people think
it's run faster?
155
00:06:59,230 --> 00:07:00,950
AUDIENCE: [INAUDIBLE]
156
00:07:00,950 --> 00:07:01,100
PROFESSOR: OK.
157
00:07:01,100 --> 00:07:02,390
Good.
158
00:07:02,390 --> 00:07:05,234
What do others think?
159
00:07:05,234 --> 00:07:06,070
OK.
160
00:07:06,070 --> 00:07:08,980
They're probably checking their
email or something.
161
00:07:08,980 --> 00:07:10,550
OK.
162
00:07:10,550 --> 00:07:14,340
So actually it ran faster.
163
00:07:14,340 --> 00:07:18,910
The source not run on the
common cloud machines.
164
00:07:18,910 --> 00:07:20,990
This was a previous generation
that I ran.
165
00:07:20,990 --> 00:07:23,390
So [UNINTELLIGIBLE] was
seven times faster.
166
00:07:23,390 --> 00:07:24,290
So this was great.
167
00:07:24,290 --> 00:07:25,670
I parallized outer loop.
168
00:07:25,670 --> 00:07:28,390
What happens if I parallize
inner loop?
169
00:07:28,390 --> 00:07:31,570
So this test, this i loop,
runs parallel.
170
00:07:31,570 --> 00:07:34,060
Here, I launch the
[UNINTELLIGIBLE] parallel.
171
00:07:34,060 --> 00:07:38,138
How much people thinks
this runs faster?
172
00:07:38,138 --> 00:07:39,970
AUDIENCE: [INAUDIBLE]
173
00:07:39,970 --> 00:07:42,800
PROFESSOR: Compared
to this one.
174
00:07:42,800 --> 00:07:45,290
How many people thinks
this runs slower?
175
00:07:45,290 --> 00:07:46,530
OK.
176
00:07:46,530 --> 00:07:50,530
There's some consistent
answers here.
177
00:07:50,530 --> 00:07:52,956
Why do you think it
would run slower?
178
00:07:52,956 --> 00:07:56,130
179
00:07:56,130 --> 00:07:57,720
So OK.
180
00:07:57,720 --> 00:08:00,180
It ran slower, so it
can improve that.
181
00:08:00,180 --> 00:08:02,270
And that's a little bit slow.
182
00:08:02,270 --> 00:08:05,774
Why is it slow?
183
00:08:05,774 --> 00:08:08,250
AUDIENCE: [INAUDIBLE]
184
00:08:08,250 --> 00:08:09,050
PROFESSOR: Exactly.
185
00:08:09,050 --> 00:08:12,520
So what it's doing here, it's
basically spawning many, many
186
00:08:12,520 --> 00:08:13,200
times in here.
187
00:08:13,200 --> 00:08:17,530
Because every time you have
parallelism, you chunkify into
188
00:08:17,530 --> 00:08:18,500
the processor.
189
00:08:18,500 --> 00:08:21,450
Here you are getting a lot more
smaller chunks inside.
190
00:08:21,450 --> 00:08:24,700
So let's look at how this
is basically run.
191
00:08:24,700 --> 00:08:31,075
So normally, you can think about
an OpenMP program as you
192
00:08:31,075 --> 00:08:32,549
have one sequential thread.
193
00:08:32,549 --> 00:08:34,000
You run the main program.
194
00:08:34,000 --> 00:08:37,760
And then assume you have, in
cores, you might have n minus
195
00:08:37,760 --> 00:08:42,100
1 other thread just
waiting for work.
196
00:08:42,100 --> 00:08:45,270
And then, when you finally
come to the parallel
197
00:08:45,270 --> 00:08:48,180
loop, it says, OK.
198
00:08:48,180 --> 00:08:48,730
Set up.
199
00:08:48,730 --> 00:08:52,080
What do you want to run
on other basic cores.
200
00:08:52,080 --> 00:08:53,510
And release it.
201
00:08:53,510 --> 00:08:55,360
Release these waiting people.
202
00:08:55,360 --> 00:08:57,020
And let them start working
on the parallel work.
203
00:08:57,020 --> 00:08:59,650
And also, I will start doing
on my own chunk.
204
00:08:59,650 --> 00:09:03,720
So suddenly, when you say
parallel four, it releases all
205
00:09:03,720 --> 00:09:07,310
other cores to go run that
part of the core.
206
00:09:07,310 --> 00:09:11,740
And once it's done, it's
will say, OK, I'm done.
207
00:09:11,740 --> 00:09:13,140
I have to wait until
everybody is done.
208
00:09:13,140 --> 00:09:16,570
So even if the main guy is done,
it has to wait until
209
00:09:16,570 --> 00:09:18,200
everybody is finished.
210
00:09:18,200 --> 00:09:22,610
And then, start executing the
sequence [UNINTELLIGIBLE].
211
00:09:22,610 --> 00:09:27,670
So this is the gist of how
OpenMP program is run.
212
00:09:27,670 --> 00:09:30,840
And if you realize that it all
heads here because you have
213
00:09:30,840 --> 00:09:33,150
basically make sure all these
cases are broken up.
214
00:09:33,150 --> 00:09:35,740
So there's some things that
has to be issued.
215
00:09:35,740 --> 00:09:38,590
And there's a delay between
these guys can start if
216
00:09:38,590 --> 00:09:39,990
everybody has equal work.
217
00:09:39,990 --> 00:09:42,200
Despite not finishing on time
because it may take some time
218
00:09:42,200 --> 00:09:43,430
for this to start.
219
00:09:43,430 --> 00:09:46,490
And then, it has to also tell
this back OK, I am done.
220
00:09:46,490 --> 00:09:48,320
So there's a lot of
synchronization going on.
221
00:09:48,320 --> 00:09:49,390
Locks and unlocks.
222
00:09:49,390 --> 00:09:51,020
Here it's called various
synchronization here.
223
00:09:51,020 --> 00:09:54,085
224
00:09:54,085 --> 00:09:58,400
And so if this work is small,
this synchronization starts
225
00:09:58,400 --> 00:10:00,070
dominating.
226
00:10:00,070 --> 00:10:03,530
So what happens is
[UNINTELLIGIBLE]
227
00:10:03,530 --> 00:10:05,490
fine grain parallelism.
228
00:10:05,490 --> 00:10:08,090
Do a little work in the
parallel region, and
229
00:10:08,090 --> 00:10:10,880
synchronization will basically
start dominating your time.
230
00:10:10,880 --> 00:10:13,310
So how do you take this?
231
00:10:13,310 --> 00:10:15,880
And sometimes when you run
something parallel, it might
232
00:10:15,880 --> 00:10:19,150
even run slow because the
amount of stuff in the
233
00:10:19,150 --> 00:10:22,420
parallel region is so small,
[UNINTELLIGIBLE] will start
234
00:10:22,420 --> 00:10:22,760
dominating.
235
00:10:22,760 --> 00:10:24,010
And that's not a good way.
236
00:10:24,010 --> 00:10:26,700
And also, sometimes
you assume.
237
00:10:26,700 --> 00:10:29,240
And you keep increasing
the number of cores.
238
00:10:29,240 --> 00:10:32,750
Hopefully, you want to see a
nice parallelism increase, but
239
00:10:32,750 --> 00:10:35,090
it doesn't, even though you
have enough information.
240
00:10:35,090 --> 00:10:37,880
But that means you're running
a lot of small chunks, even
241
00:10:37,880 --> 00:10:39,390
though you seem to have a lot
of parallelism available.
242
00:10:39,390 --> 00:10:42,730
243
00:10:42,730 --> 00:10:46,160
And also, you can make sure
the synchronization in the
244
00:10:46,160 --> 00:10:47,210
time in the parallel region.
245
00:10:47,210 --> 00:10:48,720
If the parallel regions
are on a very
246
00:10:48,720 --> 00:10:50,670
short time, this happens.
247
00:10:50,670 --> 00:10:54,780
We saw this effect when
we were doing Cilk.
248
00:10:54,780 --> 00:10:56,030
Remember?
249
00:10:56,030 --> 00:10:59,450
When did we see this granularity
affecting Cilk?
250
00:10:59,450 --> 00:11:00,750
And what did he do?
251
00:11:00,750 --> 00:11:03,900
252
00:11:03,900 --> 00:11:04,910
When you write Cilk programs.
253
00:11:04,910 --> 00:11:05,660
You write [UNINTELLIGIBLE]
254
00:11:05,660 --> 00:11:06,950
programs.
255
00:11:06,950 --> 00:11:12,500
Where did the granularity
start showing up on us?
256
00:11:12,500 --> 00:11:15,160
It may not be exactly this
because the scheduling is
257
00:11:15,160 --> 00:11:15,430
complicated.
258
00:11:15,430 --> 00:11:15,890
OK.
259
00:11:15,890 --> 00:11:16,380
Yes?
260
00:11:16,380 --> 00:11:19,280
AUDIENCE: The two by two
matrix [INAUDIBLE]
261
00:11:19,280 --> 00:11:19,720
PROFESSOR: Yeah.
262
00:11:19,720 --> 00:11:20,420
Something like two by--
263
00:11:20,420 --> 00:11:22,800
for example, that's the reason
we wanted to have
264
00:11:22,800 --> 00:11:24,580
a large base case.
265
00:11:24,580 --> 00:11:26,900
Because if you didn't put a
large base case, it keeps
266
00:11:26,900 --> 00:11:30,110
dividing into smaller and
smaller problems.
267
00:11:30,110 --> 00:11:32,580
And if the schedule is
smart, it won't be
268
00:11:32,580 --> 00:11:33,700
doing exactly this.
269
00:11:33,700 --> 00:11:37,350
But it's always good to have
these large granulated chunks
270
00:11:37,350 --> 00:11:38,620
at the bottom.
271
00:11:38,620 --> 00:11:41,870
272
00:11:41,870 --> 00:11:43,890
So how to get [UNINTELLIGIBLE]
273
00:11:43,890 --> 00:11:45,560
granulated parallelism.
274
00:11:45,560 --> 00:11:48,080
What we need to do is reduce the
number of [UNINTELLIGIBLE]
275
00:11:48,080 --> 00:11:48,770
equations.
276
00:11:48,770 --> 00:11:51,760
So you want to always try to
look for the outer most loop
277
00:11:51,760 --> 00:11:55,080
you can get at all the really
large independent regions.
278
00:11:55,080 --> 00:11:57,040
So you go look, and not
[UNINTELLIGIBLE]
279
00:11:57,040 --> 00:11:57,930
thing you want to parallelize.
280
00:11:57,930 --> 00:12:01,165
You go up, up, up, up until the
point you can parallelize.
281
00:12:01,165 --> 00:12:05,490
And that's the best way to
get good performance.
282
00:12:05,490 --> 00:12:06,740
OK?
283
00:12:06,740 --> 00:12:09,580
284
00:12:09,580 --> 00:12:13,850
So if you really compare these
three programs here, again,
285
00:12:13,850 --> 00:12:15,210
what you see--
286
00:12:15,210 --> 00:12:16,720
of course, this has no
synchronization.
287
00:12:16,720 --> 00:12:19,400
This has n amount of
synchronizations.
288
00:12:19,400 --> 00:12:20,870
Here in [UNINTELLIGIBLE]
synchronization, that's
289
00:12:20,870 --> 00:12:23,210
obviously a lot more
synchronization going on.
290
00:12:23,210 --> 00:12:27,190
And that is where this
[UNINTELLIGIBLE] comes from.
291
00:12:27,190 --> 00:12:27,620
OK.
292
00:12:27,620 --> 00:12:30,260
So now, I am switching
a little bit in here.
293
00:12:30,260 --> 00:12:33,450
I want you guys to look at this
program a little bit.
294
00:12:33,450 --> 00:12:35,000
So what am I doing here?
295
00:12:35,000 --> 00:12:37,580
I have two [UNINTELLIGIBLE].
296
00:12:37,580 --> 00:12:45,110
And I am just basically adding
matrix B to matrix A. OK?
297
00:12:45,110 --> 00:12:48,720
And then I have another loop
test here, adding matrix C to
298
00:12:48,720 --> 00:12:52,120
matrix A. And what am
I doing in here?
299
00:12:52,120 --> 00:12:55,740
I am basically going through
matrix A in another
300
00:12:55,740 --> 00:12:59,020
direction in here.
301
00:12:59,020 --> 00:13:00,270
AUDIENCE: [INAUDIBLE]
302
00:13:00,270 --> 00:13:02,310
303
00:13:02,310 --> 00:13:03,700
PROFESSOR: It's not really
a transpose.
304
00:13:03,700 --> 00:13:04,590
I'm not transposing.
305
00:13:04,590 --> 00:13:08,830
What I'm doing is I'm actually
doing a mirror because the C
306
00:13:08,830 --> 00:13:10,210
gets mirrored on ix.
307
00:13:10,210 --> 00:13:11,030
It's because [UNINTELLIGIBLE]
308
00:13:11,030 --> 00:13:11,590
ix [UNINTELLIGIBLE]
309
00:13:11,590 --> 00:13:12,425
the other direction.
310
00:13:12,425 --> 00:13:12,740
OK?
311
00:13:12,740 --> 00:13:14,400
So it's not really
a transpose.
312
00:13:14,400 --> 00:13:16,960
So I do a mirror addition.
313
00:13:16,960 --> 00:13:19,010
And then I'm asking
for the two outer
314
00:13:19,010 --> 00:13:22,130
most loops to be parallel.
315
00:13:22,130 --> 00:13:25,440
So if you run this
sequential--
316
00:13:25,440 --> 00:13:29,870
OK, you get about 30
milliseconds, I
317
00:13:29,870 --> 00:13:32,120
guess, to run in here.
318
00:13:32,120 --> 00:13:33,700
So that is in [UNINTELLIGIBLE].
319
00:13:33,700 --> 00:13:35,580
But if you're running parallel,
what do you get?
320
00:13:35,580 --> 00:13:36,840
Should you get faster
or slower?
321
00:13:36,840 --> 00:13:40,960
322
00:13:40,960 --> 00:13:42,310
OK.
323
00:13:42,310 --> 00:13:45,200
Anyone want to take a guess
[UNINTELLIGIBLE]
324
00:13:45,200 --> 00:13:46,995
Sometimes some of these
questions, you might not have
325
00:13:46,995 --> 00:13:48,040
enough information to answer.
326
00:13:48,040 --> 00:13:50,750
But it's still good to just take
a stand on one direction
327
00:13:50,750 --> 00:13:51,450
or another.
328
00:13:51,450 --> 00:13:54,120
How many people think
it runs faster?
329
00:13:54,120 --> 00:13:56,350
How many people think
it runs slower?
330
00:13:56,350 --> 00:13:57,750
OK.
331
00:13:57,750 --> 00:13:59,550
Some.
332
00:13:59,550 --> 00:14:01,940
Oops.
333
00:14:01,940 --> 00:14:03,190
What happened?
334
00:14:03,190 --> 00:14:07,310
335
00:14:07,310 --> 00:14:09,140
What happened in here?
336
00:14:09,140 --> 00:14:19,750
337
00:14:19,750 --> 00:14:25,150
Can anybody point out why it
might be running slower
338
00:14:25,150 --> 00:14:27,880
parallely than running
sequentially?
339
00:14:27,880 --> 00:14:31,275
340
00:14:31,275 --> 00:14:33,220
AUDIENCE: [INAUDIBLE]
341
00:14:33,220 --> 00:14:33,550
PROFESSOR: Yeah.
342
00:14:33,550 --> 00:14:34,760
There's a cache issue.
343
00:14:34,760 --> 00:14:38,611
Watch the possible cache
issue in here.
344
00:14:38,611 --> 00:14:41,600
AUDIENCE: [INAUDIBLE]
345
00:14:41,600 --> 00:14:42,850
PROFESSOR: Yeah.
346
00:14:42,850 --> 00:14:44,610
347
00:14:44,610 --> 00:14:50,610
If you think about, the first
equations of, I guess, the
348
00:14:50,610 --> 00:14:51,080
first core--
349
00:14:51,080 --> 00:14:52,230
I have some diagram.
350
00:14:52,230 --> 00:14:53,740
I'll show it to you in there.
351
00:14:53,740 --> 00:14:56,850
And here, only the last data
elements we'll get for the
352
00:14:56,850 --> 00:14:58,760
first iterations because
we are going
353
00:14:58,760 --> 00:14:59,970
in the other direction.
354
00:14:59,970 --> 00:15:03,500
So if you look at it a little
more deeply into
355
00:15:03,500 --> 00:15:04,980
what's going on.
356
00:15:04,980 --> 00:15:08,710
Number of instructions seem
to be a little higher.
357
00:15:08,710 --> 00:15:11,310
This one I couldn't actually
explain why this might be the
358
00:15:11,310 --> 00:15:13,270
case in here.
359
00:15:13,270 --> 00:15:15,530
If anybody has an idea,
you can say that.
360
00:15:15,530 --> 00:15:18,620
But this was kind of
[UNINTELLIGIBLE].
361
00:15:18,620 --> 00:15:21,500
This might be [UNINTELLIGIBLE]
the cycles.
362
00:15:21,500 --> 00:15:22,750
Huh.
363
00:15:22,750 --> 00:15:27,830
364
00:15:27,830 --> 00:15:28,600
OK.
365
00:15:28,600 --> 00:15:30,610
I can explain this.
366
00:15:30,610 --> 00:15:34,730
Because this is a sequential
run, this is a sum total of a
367
00:15:34,730 --> 00:15:36,380
parallel run.
368
00:15:36,380 --> 00:15:40,560
So because of all the overhead
that happens because this was
369
00:15:40,560 --> 00:15:43,200
running on, I think, an
eight core machine.
370
00:15:43,200 --> 00:15:45,630
So you're running eight times
of small companies.
371
00:15:45,630 --> 00:15:48,420
There's a lot of overhead that
goes around, synchronization,
372
00:15:48,420 --> 00:15:49,320
and stuff like that.
373
00:15:49,320 --> 00:15:52,050
So a number of instructions
just blows up.
374
00:15:52,050 --> 00:15:55,998
But for each core, you don't
have this blow up.
375
00:15:55,998 --> 00:15:57,248
AUDIENCE: [INAUDIBLE]
376
00:15:57,248 --> 00:15:59,484
377
00:15:59,484 --> 00:15:59,982
Cilk?
378
00:15:59,982 --> 00:16:03,468
Because does Cilk have different
processor affinity,
379
00:16:03,468 --> 00:16:06,456
things that open
[UNINTELLIGIBLE]?
380
00:16:06,456 --> 00:16:11,960
Because it seems like if the
program, the language--
381
00:16:11,960 --> 00:16:13,786
PROFESSOR: [INAUDIBLE].
382
00:16:13,786 --> 00:16:16,610
Let's see if we can process
the affinity
383
00:16:16,610 --> 00:16:18,670
information or if not.
384
00:16:18,670 --> 00:16:21,330
It's just pure [UNINTELLIGIBLE].
385
00:16:21,330 --> 00:16:23,780
AUDIENCE: [INAUDIBLE]
386
00:16:23,780 --> 00:16:24,600
PROFESSOR: Yeah.
387
00:16:24,600 --> 00:16:28,430
I mean if you like executed
locally if you have good cache
388
00:16:28,430 --> 00:16:28,900
[UNINTELLIGIBLE]
389
00:16:28,900 --> 00:16:29,210
with them.
390
00:16:29,210 --> 00:16:31,060
But if there's no cache
[UNINTELLIGIBLE]
391
00:16:31,060 --> 00:16:32,670
you might steal something
where data might
392
00:16:32,670 --> 00:16:33,887
be somewhere else.
393
00:16:33,887 --> 00:16:37,296
AUDIENCE: But you'll still
mimic the cache behavior,
394
00:16:37,296 --> 00:16:42,166
considerably, except
for when you steal.
395
00:16:42,166 --> 00:16:44,601
[INAUDIBLE]
396
00:16:44,601 --> 00:16:45,088
PROFESSOR: Yeah.
397
00:16:45,088 --> 00:16:46,070
So OK.
398
00:16:46,070 --> 00:16:49,450
We don't have a mic in here.
399
00:16:49,450 --> 00:16:49,640
OK.
400
00:16:49,640 --> 00:16:50,570
There's a mic.
401
00:16:50,570 --> 00:16:51,820
There we go.
402
00:16:51,820 --> 00:16:54,380
403
00:16:54,380 --> 00:16:57,110
But if you have two different of
these regions, the way the
404
00:16:57,110 --> 00:17:01,985
parallelization happens
can be different.
405
00:17:01,985 --> 00:17:06,440
AUDIENCE: In Cilk, what
happens is the code is
406
00:17:06,440 --> 00:17:10,569
mimicking, for the most part,
exactly what the C or C++ code
407
00:17:10,569 --> 00:17:11,930
would be doing.
408
00:17:11,930 --> 00:17:15,339
And so you get exactly the same
cache hits and misses.
409
00:17:15,339 --> 00:17:19,690
Except when you steal, it's
like starting over with an
410
00:17:19,690 --> 00:17:21,550
empty cache.
411
00:17:21,550 --> 00:17:21,770
OK?
412
00:17:21,770 --> 00:17:24,589
But as long as you have
sufficient parallelism, the
413
00:17:24,589 --> 00:17:29,080
steals don't occur very often.
414
00:17:29,080 --> 00:17:31,520
And so therefore, you end up
getting the same kind of
415
00:17:31,520 --> 00:17:33,740
behavior that you would get
out of the serial code.
416
00:17:33,740 --> 00:17:34,120
PROFESSOR: Yeah.
417
00:17:34,120 --> 00:17:37,250
But Charles, in this one,
because you had to steal
418
00:17:37,250 --> 00:17:41,240
everything here, the between
here and here, the parallelism
419
00:17:41,240 --> 00:17:41,830
would be different.
420
00:17:41,830 --> 00:17:43,220
AUDIENCE: There would be no
affinity between those two.
421
00:17:43,220 --> 00:17:44,380
PROFESSOR: No [? affinity ?]
will be there.
422
00:17:44,380 --> 00:17:45,300
Yeah, exactly.
423
00:17:45,300 --> 00:17:48,180
So in the sequential one,
everything that fits in the
424
00:17:48,180 --> 00:17:50,720
cache, so that would be affinity
because we are not
425
00:17:50,720 --> 00:17:51,790
doing parallelism.
426
00:17:51,790 --> 00:17:53,510
And that's what I think
happened here.
427
00:17:53,510 --> 00:17:54,990
Because you had no
[? affinity-- ?]
428
00:17:54,990 --> 00:17:57,960
AUDIENCE: No in a serial code,
there's no [? affinity. ?]
429
00:17:57,960 --> 00:17:58,070
PROFESSOR: No.
430
00:17:58,070 --> 00:18:00,920
Serial code-- if this fits in
the cache, it's then running
431
00:18:00,920 --> 00:18:03,090
on one core.
432
00:18:03,090 --> 00:18:07,900
So if it fits in the one core's
cache, you're happy.
433
00:18:07,900 --> 00:18:10,770
AUDIENCE: So the issue is--
right-- is if you only access
434
00:18:10,770 --> 00:18:14,860
it once, by the time you
fill up the cache--
435
00:18:14,860 --> 00:18:16,600
It takes some time to fill
up the cache to get them
436
00:18:16,600 --> 00:18:17,280
synchronized.
437
00:18:17,280 --> 00:18:19,050
PROFESSOR: So it fits in the
one core's cache, it's OK.
438
00:18:19,050 --> 00:18:21,210
Otherwise, it has no
affinity in here.
439
00:18:21,210 --> 00:18:26,290
So the key difference in here
is, of course, CPI is slow.
440
00:18:26,290 --> 00:18:27,530
We don't know exactly why.
441
00:18:27,530 --> 00:18:28,910
But in [UNINTELLIGIBLE].
442
00:18:28,910 --> 00:18:31,480
So what you find is that there's
a huge amount of cache
443
00:18:31,480 --> 00:18:32,150
in [UNINTELLIGIBLE]
444
00:18:32,150 --> 00:18:33,660
going on.
445
00:18:33,660 --> 00:18:35,660
So that should give you a
feeling of what's going on.
446
00:18:35,660 --> 00:18:37,370
So let's look at what
might happen.
447
00:18:37,370 --> 00:18:42,070
So I'm showing this matches
[UNINTELLIGIBLE]
448
00:18:42,070 --> 00:18:46,950
last year on-- what we had were
Cagnode machines that
449
00:18:46,950 --> 00:18:49,130
were basically code
to quad processor.
450
00:18:49,130 --> 00:18:50,620
So we had eight codes in here.
451
00:18:50,620 --> 00:18:52,956
And I put them-- you
don't have to now
452
00:18:52,956 --> 00:18:53,680
look at this table.
453
00:18:53,680 --> 00:18:57,210
I put them in the slides so
you can look at it later.
454
00:18:57,210 --> 00:18:58,920
And so this is the last
year's machine.
455
00:18:58,920 --> 00:19:01,360
And of course, this year's
machine is different.
456
00:19:01,360 --> 00:19:04,610
We have two six core
processors in here.
457
00:19:04,610 --> 00:19:06,330
So this is what we
[UNINTELLIGIBLE]
458
00:19:06,330 --> 00:19:07,650
this year.
459
00:19:07,650 --> 00:19:09,380
[UNINTELLIGIBLE]
460
00:19:09,380 --> 00:19:10,390
OK.
461
00:19:10,390 --> 00:19:13,670
And so right now, I'm showing
numbers for this one.
462
00:19:13,670 --> 00:19:15,830
And later, I will show
what happened in the
463
00:19:15,830 --> 00:19:16,770
[UNINTELLIGIBLE].
464
00:19:16,770 --> 00:19:19,360
So if you look at a cache--
465
00:19:19,360 --> 00:19:25,390
so what happened is each of the
data items in the cache
466
00:19:25,390 --> 00:19:27,340
can be in multiple states.
467
00:19:27,340 --> 00:19:29,600
This is called MSI
protocol here.
468
00:19:29,600 --> 00:19:33,120
What that means is the item
might be modified.
469
00:19:33,120 --> 00:19:36,490
If it is modified, it can
be only in one cache.
470
00:19:36,490 --> 00:19:40,220
If anybody else wants to touch
it, it has to get it out of
471
00:19:40,220 --> 00:19:42,340
the modified state.
472
00:19:42,340 --> 00:19:43,670
Or it can be sharing.
473
00:19:43,670 --> 00:19:46,190
Sharing means it's reading.
474
00:19:46,190 --> 00:19:49,780
So that means that item is
read by multiple people.
475
00:19:49,780 --> 00:19:51,160
And that can have
multiple covers.
476
00:19:51,160 --> 00:19:53,375
So sharing items can be
in multiple places.
477
00:19:53,375 --> 00:19:55,890
478
00:19:55,890 --> 00:19:59,220
However if you're modifying,
[UNINTELLIGIBLE]
479
00:19:59,220 --> 00:20:00,890
items in everybody else.
480
00:20:00,890 --> 00:20:03,130
So that means if I modify
something, it
481
00:20:03,130 --> 00:20:04,230
can only be in mine.
482
00:20:04,230 --> 00:20:06,540
If other people had that
data, I had to go in
483
00:20:06,540 --> 00:20:07,070
and validate this.
484
00:20:07,070 --> 00:20:09,470
So if you modify this, I had
to go in and validate it.
485
00:20:09,470 --> 00:20:10,740
So that's a sharing state.
486
00:20:10,740 --> 00:20:11,910
That means I'm [UNINTELLIGIBLE]
everybody
487
00:20:11,910 --> 00:20:13,130
[UNINTELLIGIBLE]
488
00:20:13,130 --> 00:20:13,880
read this.
489
00:20:13,880 --> 00:20:16,290
But if I ever want to change
that, I have to go in and
490
00:20:16,290 --> 00:20:17,780
validate this one.
491
00:20:17,780 --> 00:20:20,530
So what that means is when I
start writing, I am validating
492
00:20:20,530 --> 00:20:21,750
it from everybody.
493
00:20:21,750 --> 00:20:25,980
So even if everybody kept a copy
and they start modifying,
494
00:20:25,980 --> 00:20:28,540
I had to get my own copy.
495
00:20:28,540 --> 00:20:30,000
And everybody else
will invalidate.
496
00:20:30,000 --> 00:20:32,260
And then, if somebody else
wanted to read--
497
00:20:32,260 --> 00:20:34,410
for example, if this
guy wants to read--
498
00:20:34,410 --> 00:20:35,960
basically, this has
to make this a
499
00:20:35,960 --> 00:20:37,950
sharing and back to sharing.
500
00:20:37,950 --> 00:20:40,080
That means I have to get the
value 13, propogate it, and
501
00:20:40,080 --> 00:20:43,100
this becomes sharing again.
502
00:20:43,100 --> 00:20:43,560
OK.
503
00:20:43,560 --> 00:20:44,600
Did you get that?
504
00:20:44,600 --> 00:20:45,770
What's going on in here?
505
00:20:45,770 --> 00:20:47,520
In the cache?
506
00:20:47,520 --> 00:20:51,050
So reads, everybody can keep
a copy if they want.
507
00:20:51,050 --> 00:20:53,530
Write-- only one guy
can keep a copy.
508
00:20:53,530 --> 00:20:55,730
So what happens then
is true sharing.
509
00:20:55,730 --> 00:20:58,280
So you have these two
different cores.
510
00:20:58,280 --> 00:20:59,550
So I want to read something.
511
00:20:59,550 --> 00:21:02,500
So I get it from outside
probably on main memory, and I
512
00:21:02,500 --> 00:21:04,820
put it in my cache in here.
513
00:21:04,820 --> 00:21:09,790
And then, the next guy wants
to write the same thing.
514
00:21:09,790 --> 00:21:11,200
Assume I'm writing that.
515
00:21:11,200 --> 00:21:14,520
And once I want to write that,
I can keep this copy I had
516
00:21:14,520 --> 00:21:16,920
invalidated from here
and get a copy here.
517
00:21:16,920 --> 00:21:19,130
And then, if this way, I want
to write it again, I have to
518
00:21:19,130 --> 00:21:21,140
basically invalidate it from
here and get a copy.
519
00:21:21,140 --> 00:21:23,400
If I'm reading, both of us can
keep a copy and just kind of
520
00:21:23,400 --> 00:21:25,570
keep bouncing back and forth,
back and forth.
521
00:21:25,570 --> 00:21:30,350
And so if you bounce too many
times, you get all of these
522
00:21:30,350 --> 00:21:31,520
invalidations.
523
00:21:31,520 --> 00:21:34,770
So the fact I looked at that I
have invalidations basically
524
00:21:34,770 --> 00:21:37,770
tells me something like
this is going on.
525
00:21:37,770 --> 00:21:39,505
So what's happening
in this program?
526
00:21:39,505 --> 00:21:42,070
527
00:21:42,070 --> 00:21:46,420
When I parallelize this four
loop, my four cores--
528
00:21:46,420 --> 00:21:49,080
basically since I am doing
here [UNINTELLIGIBLE]--
529
00:21:49,080 --> 00:21:51,710
are going to get this
nice distribution of
530
00:21:51,710 --> 00:21:53,690
data into the caches.
531
00:21:53,690 --> 00:21:55,900
Assume it fits in cache.
532
00:21:55,900 --> 00:21:56,140
OK.
533
00:21:56,140 --> 00:21:59,050
So all this data nicely fits
into cache, and now I'm pretty
534
00:21:59,050 --> 00:22:00,760
happy when I run this one
because I got this
535
00:22:00,760 --> 00:22:01,490
data into the cache.
536
00:22:01,490 --> 00:22:02,910
And I write it.
537
00:22:02,910 --> 00:22:07,860
But the minute I go in here,
basically this data has to
538
00:22:07,860 --> 00:22:09,110
[UNINTELLIGIBLE].
539
00:22:09,110 --> 00:22:10,710
540
00:22:10,710 --> 00:22:14,690
OK, because I am going this is n
minus i in here so this data
541
00:22:14,690 --> 00:22:15,520
has a flip route.
542
00:22:15,520 --> 00:22:19,380
And by doing this, basically,
I incur this huge amount of
543
00:22:19,380 --> 00:22:22,090
[UNINTELLIGIBLE], and
it slows down.
544
00:22:22,090 --> 00:22:22,670
OK?
545
00:22:22,670 --> 00:22:25,250
So that's why it didn't
work well.
546
00:22:25,250 --> 00:22:28,990
So what can you do?
547
00:22:28,990 --> 00:22:33,580
548
00:22:33,580 --> 00:22:38,230
When you have these read,
write and write, write
549
00:22:38,230 --> 00:22:40,050
conflicts in here.
550
00:22:40,050 --> 00:22:44,420
And you have to actually move
the data across in here.
551
00:22:44,420 --> 00:22:48,450
And what you can do is look
for this true sharing.
552
00:22:48,450 --> 00:22:49,550
You can look at the
[UNINTELLIGIBLE]
553
00:22:49,550 --> 00:22:51,030
and see if we have excessive
554
00:22:51,030 --> 00:22:52,910
[UNINTELLIGIBLE], we have a problem.
555
00:22:52,910 --> 00:22:54,990
And how do we eliminate that?
556
00:22:54,990 --> 00:22:56,240
You want to make the
sharing minimal.
557
00:22:56,240 --> 00:22:59,580
558
00:22:59,580 --> 00:23:01,984
If you want to get some data
into a cache, you want to try
559
00:23:01,984 --> 00:23:04,320
to keep it there as
much as possible.
560
00:23:04,320 --> 00:23:06,050
And if you're using, you'd
want to try to align
561
00:23:06,050 --> 00:23:08,080
everything across.
562
00:23:08,080 --> 00:23:11,140
So even across different
regions, it'll use the same
563
00:23:11,140 --> 00:23:12,950
kind of things.
564
00:23:12,950 --> 00:23:15,120
And/or enforce some kind of
[UNINTELLIGIBLE] technique to
565
00:23:15,120 --> 00:23:15,900
keep the data alive.
566
00:23:15,900 --> 00:23:18,045
So there are a lot of techniques
in there, but true
567
00:23:18,045 --> 00:23:19,900
sharing can be an interesting
problem.
568
00:23:19,900 --> 00:23:23,090
So in here, simple
change, yes.
569
00:23:23,090 --> 00:23:28,380
You're, basically, instead of
changing A, you change C. So
570
00:23:28,380 --> 00:23:30,130
you write A the same way.
571
00:23:30,130 --> 00:23:33,580
But now what I have done is
I am doing the mirror by
572
00:23:33,580 --> 00:23:36,760
changing the axis to C, is to
[UNINTELLIGIBLE] is the same
573
00:23:36,760 --> 00:23:38,590
as this axis.
574
00:23:38,590 --> 00:23:40,070
So these two are
the same thing.
575
00:23:40,070 --> 00:23:42,620
And the minute I
do that, voila!
576
00:23:42,620 --> 00:23:44,210
I get good speed up.
577
00:23:44,210 --> 00:23:47,290
Because if you look at that, my
inundations has gone down.
578
00:23:47,290 --> 00:23:48,640
My L1 cache [UNINTELLIGIBLE]
has now
579
00:23:48,640 --> 00:23:49,610
really, really gone down.
580
00:23:49,610 --> 00:23:51,030
I'm not doing anything.
581
00:23:51,030 --> 00:23:54,820
And of course, I am doing more
instructions here than this
582
00:23:54,820 --> 00:23:57,220
one because--
583
00:23:57,220 --> 00:24:00,070
I think, the difference between
instruction here and
584
00:24:00,070 --> 00:24:03,220
here is because a lot of times
synchronization operations are
585
00:24:03,220 --> 00:24:05,530
dynamic because in the
[UNINTELLIGIBLE]
586
00:24:05,530 --> 00:24:07,760
miscounted as the instructions,
you are busy
587
00:24:07,760 --> 00:24:09,770
waiting in there.
588
00:24:09,770 --> 00:24:13,350
So this number is not really
a constant number.
589
00:24:13,350 --> 00:24:14,850
OK, question.
590
00:24:14,850 --> 00:24:15,440
AUDIENCE: Not a question.
591
00:24:15,440 --> 00:24:19,020
So another thing one could do
here is do loop fusion.
592
00:24:19,020 --> 00:24:19,530
PROFESSOR: Yes.
593
00:24:19,530 --> 00:24:20,410
Yes.
594
00:24:20,410 --> 00:24:24,100
Here is a nice way of putting
both of the loops into one and
595
00:24:24,100 --> 00:24:24,990
do loop fusion.
596
00:24:24,990 --> 00:24:26,300
And that works.
597
00:24:26,300 --> 00:24:28,230
In this case, you can do that.
598
00:24:28,230 --> 00:24:31,360
AUDIENCE: So loop fusion is
where you take two loops, and
599
00:24:31,360 --> 00:24:33,620
you convert it into one loop.
600
00:24:33,620 --> 00:24:37,050
So in this case, you could have
just written one nest,
601
00:24:37,050 --> 00:24:38,840
which has two things
going on inside.
602
00:24:38,840 --> 00:24:41,490
603
00:24:41,490 --> 00:24:44,580
And then you would save all
the loop overhead and the
604
00:24:44,580 --> 00:24:46,090
scheduling overhead.
605
00:24:46,090 --> 00:24:49,090
So rather than doing it twice,
you actually have reduced the
606
00:24:49,090 --> 00:24:52,890
overhead to just the parallelism
of the one loop.
607
00:24:52,890 --> 00:24:55,790
So if you look at that, you'll
realize that you could somehow
608
00:24:55,790 --> 00:24:59,670
make it just be a single nest
with two statements in there,
609
00:24:59,670 --> 00:25:01,130
rather than one.
610
00:25:01,130 --> 00:25:04,100
PROFESSOR: So basically, instead
of [UNINTELLIGIBLE]
611
00:25:04,100 --> 00:25:08,190
entire thing and move plus
C into here, basically.
612
00:25:08,190 --> 00:25:10,040
And I could have just done
it in one loop nest.
613
00:25:10,040 --> 00:25:11,950
That's what loop fusion
would do here.
614
00:25:11,950 --> 00:25:13,750
So we can actually
[UNINTELLIGIBLE]
615
00:25:13,750 --> 00:25:15,280
much nicer in here.
616
00:25:15,280 --> 00:25:19,190
But just for example purposes,
so now I really reduced this
617
00:25:19,190 --> 00:25:20,540
one and got that.
618
00:25:20,540 --> 00:25:21,740
So this is great.
619
00:25:21,740 --> 00:25:23,830
Cagnodes really showed
this classic
620
00:25:23,830 --> 00:25:25,820
problem in the computer.
621
00:25:25,820 --> 00:25:27,070
And so I'm like, OK.
622
00:25:27,070 --> 00:25:28,720
Now we have new machines.
623
00:25:28,720 --> 00:25:30,075
Let's try it and see
what happens.
624
00:25:30,075 --> 00:25:34,090
625
00:25:34,090 --> 00:25:35,670
What does this show?
626
00:25:35,670 --> 00:25:41,240
This is your nice cloud machines
we've got now.
627
00:25:41,240 --> 00:25:42,480
I have no slow down.
628
00:25:42,480 --> 00:25:47,570
I was really disappointed
because beforehand, I had this
629
00:25:47,570 --> 00:25:50,070
for sharing going on
in here and had a
630
00:25:50,070 --> 00:25:51,100
really big slow down.
631
00:25:51,100 --> 00:25:54,070
But this one, in fact, the
difference is very small.
632
00:25:54,070 --> 00:25:56,540
And when you look at any kind of
performance counters, they
633
00:25:56,540 --> 00:25:58,580
are pretty comparable.
634
00:25:58,580 --> 00:26:00,220
There's nothing much
going on here.
635
00:26:00,220 --> 00:26:03,580
So what do you think is
going on in this new
636
00:26:03,580 --> 00:26:06,130
architecture now?
637
00:26:06,130 --> 00:26:07,400
Why this might be?
638
00:26:07,400 --> 00:26:15,730
639
00:26:15,730 --> 00:26:16,980
AUDIENCE: [INAUDIBLE]
640
00:26:16,980 --> 00:26:26,030
641
00:26:26,030 --> 00:26:28,800
PROFESSOR: That's an interesting
observation, but
642
00:26:28,800 --> 00:26:32,050
also we have--
643
00:26:32,050 --> 00:26:36,250
yes, core seven basically is
two by two in the die.
644
00:26:36,250 --> 00:26:38,650
But we also have two different
processors.
645
00:26:38,650 --> 00:26:40,110
So that's there, too.
646
00:26:40,110 --> 00:26:43,640
So in some sense, when you
get our two-way process
647
00:26:43,640 --> 00:26:44,670
[UNINTELLIGIBLE]
648
00:26:44,670 --> 00:26:45,620
So that's there.
649
00:26:45,620 --> 00:26:46,370
That might help.
650
00:26:46,370 --> 00:26:48,650
That's an interesting
observation.
651
00:26:48,650 --> 00:26:50,340
What else might be
going on here?
652
00:26:50,340 --> 00:26:52,880
Why do you think they manage
to get this one?
653
00:26:52,880 --> 00:26:54,580
What might be another answer?
654
00:26:54,580 --> 00:26:58,790
655
00:26:58,790 --> 00:27:03,350
What can hide these kind of
delays that can happen?
656
00:27:03,350 --> 00:27:06,990
Load delays, and cache misses,
and stuff like that.
657
00:27:06,990 --> 00:27:09,110
What techniques and hardware
can hide those?
658
00:27:09,110 --> 00:27:15,300
659
00:27:15,300 --> 00:27:17,050
Just [UNINTELLIGIBLE]
a speculation.
660
00:27:17,050 --> 00:27:18,610
Prefetching.
661
00:27:18,610 --> 00:27:22,290
So most hardware has an
internal mechanism.
662
00:27:22,290 --> 00:27:24,890
When you start fetching
data, you say, aha!
663
00:27:24,890 --> 00:27:26,420
I see a pattern.
664
00:27:26,420 --> 00:27:27,650
I know you want to
get this thing.
665
00:27:27,650 --> 00:27:31,960
Let me go forward and bring more
data, thinking you are
666
00:27:31,960 --> 00:27:34,240
going to follow that pattern.
667
00:27:34,240 --> 00:27:34,590
OK.
668
00:27:34,590 --> 00:27:36,240
All or most of the
[UNINTELLIGIBLE]
669
00:27:36,240 --> 00:27:39,010
for, I think, have
[UNINTELLIGIBLE] even a
670
00:27:39,010 --> 00:27:42,280
Pentium had something for
prefetching going on.
671
00:27:42,280 --> 00:27:44,470
But most of the time, what
happens is the prefetching
672
00:27:44,470 --> 00:27:45,970
engine can't keep up.
673
00:27:45,970 --> 00:27:47,930
If you are getting there,
it's [UNINTELLIGIBLE]
674
00:27:47,930 --> 00:27:48,880
a little bit further.
675
00:27:48,880 --> 00:27:50,650
You are going to catch up, and
you're going to start because
676
00:27:50,650 --> 00:27:53,026
you have more [UNINTELLIGIBLE]
here.
677
00:27:53,026 --> 00:27:54,980
I think [UNINTELLIGIBLE]
678
00:27:54,980 --> 00:27:56,880
has a really, really
good prefetcher.
679
00:27:56,880 --> 00:28:02,170
And then, we saw it in our
architecture slides, too.
680
00:28:02,170 --> 00:28:04,940
That a lot of things that used
to happen before is gone.
681
00:28:04,940 --> 00:28:05,990
So this is really good.
682
00:28:05,990 --> 00:28:11,330
What that means is a lot of
weird stuff that's going on
683
00:28:11,330 --> 00:28:12,360
[UNINTELLIGIBLE]
684
00:28:12,360 --> 00:28:14,010
making them disappear.
685
00:28:14,010 --> 00:28:15,990
So these kind of problems
don't show up.
686
00:28:15,990 --> 00:28:17,260
So that's the nice story.
687
00:28:17,260 --> 00:28:18,920
The other part is, OK.
688
00:28:18,920 --> 00:28:22,630
Now if you start really tweaking
your programs to one
689
00:28:22,630 --> 00:28:25,120
architecture, you wait
a generation.
690
00:28:25,120 --> 00:28:28,300
And then now, we have done
either the tweaking--
691
00:28:28,300 --> 00:28:33,910
the best case, tweaking has
no impact, and it's
692
00:28:33,910 --> 00:28:35,170
not affecting anything.
693
00:28:35,170 --> 00:28:37,235
In most of the time, worst case,
tweaking actually slows
694
00:28:37,235 --> 00:28:39,160
down the program because you
are trying to do something
695
00:28:39,160 --> 00:28:40,030
complicated.
696
00:28:40,030 --> 00:28:42,790
That's just not needed
anymore.
697
00:28:42,790 --> 00:28:46,990
So even though these kind
of things showed up in
698
00:28:46,990 --> 00:28:47,500
[UNINTELLIGIBLE]
699
00:28:47,500 --> 00:28:48,750
architecture, it's
not an issue.
700
00:28:48,750 --> 00:28:52,250
But if you go to many of the
smaller architectures that
701
00:28:52,250 --> 00:28:55,830
have that don't have that
much of the very popular
702
00:28:55,830 --> 00:28:57,620
prefetchers, this kind of
issue you would see.
703
00:28:57,620 --> 00:29:00,610
So for example, if you go to a
cell phone [UNINTELLIGIBLE],
704
00:29:00,610 --> 00:29:03,770
you would probably see these
kind of issues happening.
705
00:29:03,770 --> 00:29:05,560
Any questions here so far?
706
00:29:05,560 --> 00:29:06,300
So that's the good news.
707
00:29:06,300 --> 00:29:08,280
You guys don't have to worry
about it too much.
708
00:29:08,280 --> 00:29:11,440
But at least it's good to know
the technique because you'll
709
00:29:11,440 --> 00:29:13,820
see it in other architectures.
710
00:29:13,820 --> 00:29:18,110
So now, I want to switch a
little bit into looking at
711
00:29:18,110 --> 00:29:21,630
programs that don't have what
we call data parallelism.
712
00:29:21,630 --> 00:29:24,320
That means you can start and
say, [UNINTELLIGIBLE]
713
00:29:24,320 --> 00:29:24,740
parallels.
714
00:29:24,740 --> 00:29:26,840
Everybody get the different
chunk and run.
715
00:29:26,840 --> 00:29:30,110
And we are going a little bit
more deeply into looking at
716
00:29:30,110 --> 00:29:32,230
programs that are a little
bit different.
717
00:29:32,230 --> 00:29:36,540
So I wanted to come up with this
little representation to
718
00:29:36,540 --> 00:29:37,580
represent the program.
719
00:29:37,580 --> 00:29:42,090
And so if you think about
iteration space--
720
00:29:42,090 --> 00:29:44,090
actually before you, I'll
go down to dependence.
721
00:29:44,090 --> 00:29:46,200
I'll also do a little
bit of load balance.
722
00:29:46,200 --> 00:29:49,890
So here's a loop that
in my iterations--
723
00:29:49,890 --> 00:29:53,690
the first one I transformed
zero to eight.
724
00:29:53,690 --> 00:29:55,670
But J only runs from
one to eight.
725
00:29:55,670 --> 00:29:58,630
So each I, I have less
and less amount of
726
00:29:58,630 --> 00:30:02,160
J iterations, basically.
727
00:30:02,160 --> 00:30:02,600
OK?
728
00:30:02,600 --> 00:30:05,100
It's a triangular loop.
729
00:30:05,100 --> 00:30:06,350
OK?
730
00:30:06,350 --> 00:30:10,050
731
00:30:10,050 --> 00:30:10,250
OK.
732
00:30:10,250 --> 00:30:12,420
So this is the way to represent
iteration space, so
733
00:30:12,420 --> 00:30:15,600
I will represent data and
get back to this again.
734
00:30:15,600 --> 00:30:18,830
So if you look at a data space,
you can assume data
735
00:30:18,830 --> 00:30:23,740
iteration space could be this
funky, triangular, hyperplane
736
00:30:23,740 --> 00:30:24,760
type of thing.
737
00:30:24,760 --> 00:30:28,990
Whereas data is mostly
[? rectangulineum ?],
738
00:30:28,990 --> 00:30:30,740
multi-dimensional rectangle.
739
00:30:30,740 --> 00:30:32,810
So for example, if I have
[UNINTELLIGIBLE]
740
00:30:32,810 --> 00:30:35,140
and it's a one-dimensional
one, this is basically a
741
00:30:35,140 --> 00:30:36,050
two-dimensional data.
742
00:30:36,050 --> 00:30:37,480
And you can have
three-dimensional cubes and
743
00:30:37,480 --> 00:30:38,020
stuff like that.
744
00:30:38,020 --> 00:30:39,500
You can represent
data like that.
745
00:30:39,500 --> 00:30:41,170
So this is a way to
nicely represent.
746
00:30:41,170 --> 00:30:45,000
And when you start thinking
about it, we can look at
747
00:30:45,000 --> 00:30:45,850
what's going on.
748
00:30:45,850 --> 00:30:46,780
OK?
749
00:30:46,780 --> 00:30:49,870
So now you have this
loop again.
750
00:30:49,870 --> 00:30:52,820
So here's the basic
[UNINTELLIGIBLE] iterations.
751
00:30:52,820 --> 00:30:54,140
And here's the data.
752
00:30:54,140 --> 00:30:56,850
Assume this is both A and B.
There will be another one for
753
00:30:56,850 --> 00:30:57,480
matrix [UNINTELLIGIBLE]
754
00:30:57,480 --> 00:30:59,870
B. One data into each iteration
is going to touch.
755
00:30:59,870 --> 00:31:01,700
So these are the data that
need to get touched, and
756
00:31:01,700 --> 00:31:04,270
here's the iterations you
are going to run.
757
00:31:04,270 --> 00:31:09,720
So we can say OpenMP
parallel four.
758
00:31:09,720 --> 00:31:12,200
So what happens when you
do parallel four?
759
00:31:12,200 --> 00:31:14,630
So I am going to get parallel.
760
00:31:14,630 --> 00:31:17,930
And so core one gets this one,
another core, another core,
761
00:31:17,930 --> 00:31:20,610
another core get these
iterations running.
762
00:31:20,610 --> 00:31:22,398
So what happens if
you do this one?
763
00:31:22,398 --> 00:31:26,520
764
00:31:26,520 --> 00:31:29,590
Do you get really good
performance?
765
00:31:29,590 --> 00:31:31,852
Why?
766
00:31:31,852 --> 00:31:34,570
AUDIENCE: [INAUDIBLE]
767
00:31:34,570 --> 00:31:35,080
PROFESSOR: It's not balanced.
768
00:31:35,080 --> 00:31:36,900
The load is not balanced
in here.
769
00:31:36,900 --> 00:31:40,150
So basically if you run
sequential and if you run
770
00:31:40,150 --> 00:31:45,990
block distribution, I get about
3x performance in here.
771
00:31:45,990 --> 00:31:49,470
So if I look at closely, here
is the number of iterations
772
00:31:49,470 --> 00:31:51,340
given to each core.
773
00:31:51,340 --> 00:31:53,790
The first core gets almost
nothing, and the last guy gets
774
00:31:53,790 --> 00:31:55,380
a lot of work.
775
00:31:55,380 --> 00:31:58,510
Here's where something like the
Cilk runtime can come into
776
00:31:58,510 --> 00:32:01,955
play because with Cilk runtime,
basically, this guy
777
00:32:01,955 --> 00:32:03,040
will finish the [UNINTELLIGIBLE]
778
00:32:03,040 --> 00:32:04,620
start stealing from
somebody else.
779
00:32:04,620 --> 00:32:07,620
And so it would be
done nicely.
780
00:32:07,620 --> 00:32:09,860
But whereas if you do a
static schedule, you
781
00:32:09,860 --> 00:32:11,190
are in this big bind.
782
00:32:11,190 --> 00:32:13,270
You don't have too many
things going on.
783
00:32:13,270 --> 00:32:17,990
784
00:32:17,990 --> 00:32:20,650
And basically, this is what
we call load imbalance.
785
00:32:20,650 --> 00:32:24,760
So what you can do is figure out
a complicated partitioning
786
00:32:24,760 --> 00:32:27,645
so you can statically
partition this out.
787
00:32:27,645 --> 00:32:31,810
Or you can do something like the
dynamic scheduler like the
788
00:32:31,810 --> 00:32:32,530
[UNINTELLIGIBLE]
789
00:32:32,530 --> 00:32:35,720
scheduler for a solution.
790
00:32:35,720 --> 00:32:36,970
So how to detect
load imbalance?
791
00:32:36,970 --> 00:32:39,290
792
00:32:39,290 --> 00:32:44,120
Basically, what you might want
to do is for each of the
793
00:32:44,120 --> 00:32:45,690
different sections you are
running, you want to look at
794
00:32:45,690 --> 00:32:47,520
the time mistakes.
795
00:32:47,520 --> 00:32:50,530
And in the [UNINTELLIGIBLE] axis
varying, huge varying,
796
00:32:50,530 --> 00:32:52,260
that means there's a load
imbalance going on.
797
00:32:52,260 --> 00:32:55,345
So you might want to check and
make sure each of the parallel
798
00:32:55,345 --> 00:32:58,330
regions time is taking.
799
00:32:58,330 --> 00:33:01,370
And that gives you this view.
800
00:33:01,370 --> 00:33:04,620
How to eliminate load imbalance
or the use of
801
00:33:04,620 --> 00:33:08,500
dynamic scheduler that
will deal with that.
802
00:33:08,500 --> 00:33:12,220
Or you can do a different
distribution statically.
803
00:33:12,220 --> 00:33:14,690
That will not partition
in this large block.
804
00:33:14,690 --> 00:33:16,900
So let me show you a static part
because we have already
805
00:33:16,900 --> 00:33:18,440
learned the dynamic
part before.
806
00:33:18,440 --> 00:33:21,165
So now instead of doing that,
we do a cyclic distribution.
807
00:33:21,165 --> 00:33:23,370
We use a static one.
808
00:33:23,370 --> 00:33:27,690
That means if you have a lot
more than and a little bit
809
00:33:27,690 --> 00:33:32,220
better distribution so what
happens to the processor?
810
00:33:32,220 --> 00:33:33,810
Zero gets this one
and this one.
811
00:33:33,810 --> 00:33:35,740
One gets this one
and this one.
812
00:33:35,740 --> 00:33:36,860
So on and so forth.
813
00:33:36,860 --> 00:33:39,380
So that would be a little bit
between balancing there.
814
00:33:39,380 --> 00:33:42,830
But if you have enough of
cyclic, the imbalance would be
815
00:33:42,830 --> 00:33:45,550
much lower.
816
00:33:45,550 --> 00:33:47,610
So should we run faster?
817
00:33:47,610 --> 00:33:51,070
818
00:33:51,070 --> 00:33:56,470
So here's the iterations
each guy gets in here.
819
00:33:56,470 --> 00:33:59,360
This looks very balanced because
I had a lot more
820
00:33:59,360 --> 00:34:01,810
iterations than this
eight one.
821
00:34:01,810 --> 00:34:03,420
This is not that balanced here
because this guy gets a lot
822
00:34:03,420 --> 00:34:04,800
more than the first one.
823
00:34:04,800 --> 00:34:06,820
The first one gets six.
824
00:34:06,820 --> 00:34:09,070
And the second and last
one gets a lot more.
825
00:34:09,070 --> 00:34:13,775
826
00:34:13,775 --> 00:34:16,560
Uh oh.
827
00:34:16,560 --> 00:34:18,060
What do you think is
happening here now?
828
00:34:18,060 --> 00:34:22,040
829
00:34:22,040 --> 00:34:23,290
I ran again slower.
830
00:34:23,290 --> 00:34:25,870
831
00:34:25,870 --> 00:34:28,429
See I guess the people in class
last year had things
832
00:34:28,429 --> 00:34:31,199
worse because they had this
old processor that did all
833
00:34:31,199 --> 00:34:33,010
these crazy things on them.
834
00:34:33,010 --> 00:34:35,830
and you guys get the fast one
that doesn't do that.
835
00:34:35,830 --> 00:34:41,310
So why do you think cyclic
distribution is
836
00:34:41,310 --> 00:34:43,639
running a lot slower?
837
00:34:43,639 --> 00:34:44,454
What might be going?
838
00:34:44,454 --> 00:34:45,704
AUDIENCE: [INAUDIBLE]
839
00:34:45,704 --> 00:34:47,420
840
00:34:47,420 --> 00:34:49,219
PROFESSOR: Spoiling
[UNINTELLIGIBLE] it's not that
841
00:34:49,219 --> 00:34:51,830
much because if you don't run
this and synchronize, what you
842
00:34:51,830 --> 00:34:54,929
do is you run the same amount of
tread and say, now, instead
843
00:34:54,929 --> 00:34:59,260
of running continuously, you
run jumping all iterations.
844
00:34:59,260 --> 00:35:01,810
You should run zero and nine
or whatever jump over
845
00:35:01,810 --> 00:35:03,060
iterations.
846
00:35:03,060 --> 00:35:12,500
847
00:35:12,500 --> 00:35:13,130
Why do you think?
848
00:35:13,130 --> 00:35:14,630
AUDIENCE: [INAUDIBLE]
849
00:35:14,630 --> 00:35:16,240
PROFESSOR: Yeah, there's
a cache issue.
850
00:35:16,240 --> 00:35:19,980
All this time and the question
is not sure, it's probably a
851
00:35:19,980 --> 00:35:20,470
cache issue.
852
00:35:20,470 --> 00:35:22,550
What type of cache issue do
you think is going on?
853
00:35:22,550 --> 00:35:23,800
AUDIENCE: [INAUDIBLE]
854
00:35:23,800 --> 00:35:28,920
855
00:35:28,920 --> 00:35:29,410
PROFESSOR: Yeah.
856
00:35:29,410 --> 00:35:31,010
[UNINTELLIGIBLE].
857
00:35:31,010 --> 00:35:34,280
But let me show you
what happens.
858
00:35:34,280 --> 00:35:35,580
So you get off then.
859
00:35:35,580 --> 00:35:37,780
OK, so if you look at--
860
00:35:37,780 --> 00:35:40,150
the data is here so let's
look at what happens.
861
00:35:40,150 --> 00:35:42,950
So this is running a
[UNINTELLIGIBLE] lower.
862
00:35:42,950 --> 00:35:45,460
It's showing a lot more
instructions, but instruction
863
00:35:45,460 --> 00:35:47,710
doesn't tell you too much
because a lot of them might be
864
00:35:47,710 --> 00:35:49,360
missing synchronization
costs in here.
865
00:35:49,360 --> 00:35:52,690
So instruction is not that
illuminating here.
866
00:35:52,690 --> 00:35:54,850
The big illumination here
is this one again.
867
00:35:54,850 --> 00:35:56,950
Invalidations.
868
00:35:56,950 --> 00:35:59,300
I have a huge amount of
invalidations going on.
869
00:35:59,300 --> 00:36:04,330
So here is a case of
false sharing.
870
00:36:04,330 --> 00:36:08,080
So what happens is now things
next to each other, you want
871
00:36:08,080 --> 00:36:09,550
to multiply different
processors.
872
00:36:09,550 --> 00:36:10,890
We're not touching
the same data.
873
00:36:10,890 --> 00:36:13,250
Everybody's looking at
somebody else's data.
874
00:36:13,250 --> 00:36:16,290
So what happens is assume I want
to write this data item.
875
00:36:16,290 --> 00:36:17,550
I like that data item.
876
00:36:17,550 --> 00:36:22,430
But I get the entire cache line
because when I ask for
877
00:36:22,430 --> 00:36:25,330
that, I get my synchronization
by the cache line.
878
00:36:25,330 --> 00:36:28,830
I get this entire cache line
coming in here into this one.
879
00:36:28,830 --> 00:36:30,550
And the next guys
[UNINTELLIGIBLE] at me.
880
00:36:30,550 --> 00:36:33,020
This core won't write this
data because instead of
881
00:36:33,020 --> 00:36:36,000
blocks, I basically
give each strips.
882
00:36:36,000 --> 00:36:37,840
There's a lot of overlap
between strips.
883
00:36:37,840 --> 00:36:41,000
So this guy says not to write
this one, I had to get the
884
00:36:41,000 --> 00:36:43,580
entire cache line
going back here.
885
00:36:43,580 --> 00:36:45,630
And so if you want to write that
again, I had to get the
886
00:36:45,630 --> 00:36:47,680
entire cache line going back
even though we are writing
887
00:36:47,680 --> 00:36:48,790
different data.
888
00:36:48,790 --> 00:36:52,010
Because we are sharing
cache lines in here.
889
00:36:52,010 --> 00:36:53,880
This thinking was in back
and forth, back and
890
00:36:53,880 --> 00:36:54,960
forth, back and forth.
891
00:36:54,960 --> 00:36:56,520
I have a lot of cache
[UNINTELLIGIBLE].
892
00:36:56,520 --> 00:36:59,200
Things are really shot.
893
00:36:59,200 --> 00:37:00,160
OK?
894
00:37:00,160 --> 00:37:03,740
And so what happens here is if
you look at the cache lines--
895
00:37:03,740 --> 00:37:05,140
there's my animation.
896
00:37:05,140 --> 00:37:07,280
So cache lines basically
mess this all up.
897
00:37:07,280 --> 00:37:08,690
You can see that really
carefully.
898
00:37:08,690 --> 00:37:12,500
What happens is between these
lines, there would be some
899
00:37:12,500 --> 00:37:13,750
overlap of cache lines.
900
00:37:13,750 --> 00:37:15,710
901
00:37:15,710 --> 00:37:17,890
And this overlap in cache lines
keeps bouncing back and
902
00:37:17,890 --> 00:37:19,990
forth, back and forth in here.
903
00:37:19,990 --> 00:37:23,630
And so what happens is basically
cache lines are
904
00:37:23,630 --> 00:37:28,850
bigger than the data size, or
there's overlap in here, and
905
00:37:28,850 --> 00:37:31,480
the cache line is shared when
the data is not shared.
906
00:37:31,480 --> 00:37:37,050
And so how to detect false
sharing in too many conflicts.
907
00:37:37,050 --> 00:37:41,340
You assume this is a nice
parallelism, but suddenly, you
908
00:37:41,340 --> 00:37:43,840
don't have a speed up, and you
have a lot of conflicts here,
909
00:37:43,840 --> 00:37:47,090
even though there isn't
something to be sharing.
910
00:37:47,090 --> 00:37:49,770
And how to eliminate
false sharing.
911
00:37:49,770 --> 00:37:54,250
Make data used by each
contiguous in memory.
912
00:37:54,250 --> 00:37:55,550
That's a good way
of doing that.
913
00:37:55,550 --> 00:37:57,360
Or pad at the end.
914
00:37:57,360 --> 00:38:00,330
So these kind of at the corners,
there's not going to
915
00:38:00,330 --> 00:38:02,830
be any overlapping.
916
00:38:02,830 --> 00:38:07,770
So in here, one thing you can
do is, you can measure each
917
00:38:07,770 --> 00:38:10,370
thing that each of
the cores get.
918
00:38:10,370 --> 00:38:11,350
We can make [UNINTELLIGIBLE].
919
00:38:11,350 --> 00:38:14,560
But before what happens was
a core used to get this
920
00:38:14,560 --> 00:38:15,970
line and this line.
921
00:38:15,970 --> 00:38:17,730
There are different
places in memory.
922
00:38:17,730 --> 00:38:20,220
But you can make these two
contiguous in memory by
923
00:38:20,220 --> 00:38:22,190
basically now, instead of
having a two-dimensional
924
00:38:22,190 --> 00:38:23,940
array, you made that a
925
00:38:23,940 --> 00:38:26,210
three-dimensional or disarrays.
926
00:38:26,210 --> 00:38:27,890
AUDIENCE: Can you
say that again?
927
00:38:27,890 --> 00:38:31,030
PROFESSOR: So before you what
just happened was each of them
928
00:38:31,030 --> 00:38:34,210
were going to get this line
and this line, each core.
929
00:38:34,210 --> 00:38:38,070
All these lines that were in
different parts of the memory.
930
00:38:38,070 --> 00:38:40,510
In here, each would get
only two lines.
931
00:38:40,510 --> 00:38:41,570
But they're in a different
place.
932
00:38:41,570 --> 00:38:43,845
So if you have more cyclic,
you'll get a lot more lines or
933
00:38:43,845 --> 00:38:44,820
lower memory.
934
00:38:44,820 --> 00:38:47,830
So what we can do is we
can arrange the cache.
935
00:38:47,830 --> 00:38:50,560
So if you think about this, you
can think the cache, now
936
00:38:50,560 --> 00:38:53,220
the data, is instead of
two-dimensions is
937
00:38:53,220 --> 00:38:55,390
three-dimensional data.
938
00:38:55,390 --> 00:38:58,360
One dimension is this
cyclic part in here.
939
00:38:58,360 --> 00:38:59,450
So we can do that.
940
00:38:59,450 --> 00:39:04,030
And then, you can change any way
that the cyclic part, the
941
00:39:04,030 --> 00:39:06,290
one that I got this line and
this line, now become
942
00:39:06,290 --> 00:39:07,660
contiguous.
943
00:39:07,660 --> 00:39:10,390
So you think about data
as a two-dimension.
944
00:39:10,390 --> 00:39:11,790
You think about it as a cube.
945
00:39:11,790 --> 00:39:14,690
And you kind of change the cube
for the inner dimension
946
00:39:14,690 --> 00:39:16,670
to be the one that's
contiguous.
947
00:39:16,670 --> 00:39:18,980
So you can do data
[UNINTELLIGIBLE]
948
00:39:18,980 --> 00:39:20,230
transformation and get there.
949
00:39:20,230 --> 00:39:23,290
950
00:39:23,290 --> 00:39:29,130
So now what happens is the role
of core zero just gets
951
00:39:29,130 --> 00:39:30,590
contiguous in memory.
952
00:39:30,590 --> 00:39:32,370
And core one gets contiguous
in memory.
953
00:39:32,370 --> 00:39:34,600
So if you're trying to make it
contiguous, that's great.
954
00:39:34,600 --> 00:39:37,930
So between padding and making
things contiguous, you can get
955
00:39:37,930 --> 00:39:38,840
good performance.
956
00:39:38,840 --> 00:39:41,520
And if you do data
transformation, voila!
957
00:39:41,520 --> 00:39:44,940
My invalidations just went
down drastically.
958
00:39:44,940 --> 00:39:47,900
I again have a nice load
balancing here.
959
00:39:47,900 --> 00:39:50,100
Invalidations went
down drastically.
960
00:39:50,100 --> 00:39:53,660
That means my [UNINTELLIGIBLE]
increased a little bit and I
961
00:39:53,660 --> 00:39:56,600
get really nice speed up.
962
00:39:56,600 --> 00:40:01,320
So here are the kind of crazy
things you are to do if you
963
00:40:01,320 --> 00:40:05,200
are doing things like
algorithms that
964
00:40:05,200 --> 00:40:07,100
are not cache obvious.
965
00:40:07,100 --> 00:40:08,820
And if you are doing directly
parallizing yourself without
966
00:40:08,820 --> 00:40:11,890
letting a nice [UNINTELLIGIBLE]
967
00:40:11,890 --> 00:40:12,660
time to help you.
968
00:40:12,660 --> 00:40:16,580
Something like a
[UNINTELLIGIBLE] assistant.
969
00:40:16,580 --> 00:40:19,710
So I'm just going to
summarize this
970
00:40:19,710 --> 00:40:20,990
because this is important.
971
00:40:20,990 --> 00:40:22,630
We looked at a bunch
of cache issues.
972
00:40:22,630 --> 00:40:25,420
We looked at cold missiles,
capacity missiles, and
973
00:40:25,420 --> 00:40:26,970
conflict missiles before.
974
00:40:26,970 --> 00:40:29,390
And today, here are
some examples of
975
00:40:29,390 --> 00:40:31,340
true sharing missiles.
976
00:40:31,340 --> 00:40:36,160
What happened was I am actually
really using data,
977
00:40:36,160 --> 00:40:41,880
but I set up my parallelism
in such a way that between
978
00:40:41,880 --> 00:40:46,000
different executions, my data
has to move across.
979
00:40:46,000 --> 00:40:47,030
[UNINTELLIGIBLE]
980
00:40:47,030 --> 00:40:51,230
So I am truly sharing data,
but the data has to go to
981
00:40:51,230 --> 00:40:52,250
somebody else's cache.
982
00:40:52,250 --> 00:40:53,110
So I've got a lot of
[UNINTELLIGIBLE]
983
00:40:53,110 --> 00:40:54,830
violations here.
984
00:40:54,830 --> 00:40:58,420
More into this one is more like
false sharing, where you
985
00:40:58,420 --> 00:41:01,240
assume there's no sharing, nice
parallelism, everything,
986
00:41:01,240 --> 00:41:03,120
except the program
runs very slow.
987
00:41:03,120 --> 00:41:05,610
And that can be because
of false sharing.
988
00:41:05,610 --> 00:41:09,550
So we just kind of touch
on these two topics.
989
00:41:09,550 --> 00:41:10,660
OK?
990
00:41:10,660 --> 00:41:14,340
So let me switch gears a little
bit about dependences.
991
00:41:14,340 --> 00:41:16,380
We touched on the dependences
a little bit.
992
00:41:16,380 --> 00:41:19,070
And these are two fine
programs that are not
993
00:41:19,070 --> 00:41:20,410
completely parallel.
994
00:41:20,410 --> 00:41:24,710
So normally, what happens is a
true dependence means that I'm
995
00:41:24,710 --> 00:41:26,570
writing and reading
[UNINTELLIGIBLE]
996
00:41:26,570 --> 00:41:27,660
other way out.
997
00:41:27,660 --> 00:41:30,400
And if two guys are both
fighting, then the order has
998
00:41:30,400 --> 00:41:32,780
to maintiain us out would
be dependence.
999
00:41:32,780 --> 00:41:38,650
And did our dependence
even loop, because
1000
00:41:38,650 --> 00:41:40,880
these are single items.
1001
00:41:40,880 --> 00:41:43,760
So if you have an error here,
this is becoming a lot more
1002
00:41:43,760 --> 00:41:44,600
complicated.
1003
00:41:44,600 --> 00:41:46,200
Because there's no simple
thing in here.
1004
00:41:46,200 --> 00:41:48,450
Because it's not just using
the same iteration.
1005
00:41:48,450 --> 00:41:51,910
You might be using data from
different iterations.
1006
00:41:51,910 --> 00:41:55,920
So what happens is there's
a dynamic instance of
1007
00:41:55,920 --> 00:41:56,840
iterations.
1008
00:41:56,840 --> 00:41:59,230
So one iteration writes the
data, and somebody else might
1009
00:41:59,230 --> 00:42:01,310
be reading the data.
1010
00:42:01,310 --> 00:42:03,490
And that is basically
the order we have to
1011
00:42:03,490 --> 00:42:03,990
[UNINTELLIGIBLE].
1012
00:42:03,990 --> 00:42:05,060
Let me give you an example.
1013
00:42:05,060 --> 00:42:07,580
This kind of demonstrates
what's going on.
1014
00:42:07,580 --> 00:42:08,170
OK?
1015
00:42:08,170 --> 00:42:10,500
And when you edit, you say
look, this is where you
1016
00:42:10,500 --> 00:42:10,910
[UNINTELLIGIBLE]
1017
00:42:10,910 --> 00:42:12,590
complicated.
1018
00:42:12,590 --> 00:42:14,380
So in order to give you
and example, let me
1019
00:42:14,380 --> 00:42:15,876
look at this program.
1020
00:42:15,876 --> 00:42:17,150
I have a simple program here.
1021
00:42:17,150 --> 00:42:19,210
Ai equals Ai plus one.
1022
00:42:19,210 --> 00:42:19,990
My iterations--
1023
00:42:19,990 --> 00:42:21,570
I'm running five iterations
in here.
1024
00:42:21,570 --> 00:42:23,420
So this is my iteration space.
1025
00:42:23,420 --> 00:42:25,590
I have a large array, so
this is my data space.
1026
00:42:25,590 --> 00:42:28,090
1027
00:42:28,090 --> 00:42:29,990
And now, I keep running
this program.
1028
00:42:29,990 --> 00:42:32,800
So what happens is this is
time going down in here.
1029
00:42:32,800 --> 00:42:35,930
So the first situation
basically, I first read and
1030
00:42:35,930 --> 00:42:36,560
then write.
1031
00:42:36,560 --> 00:42:39,350
Same in the second iteration,
I read and write.
1032
00:42:39,350 --> 00:42:40,700
Third iteration read
and write.
1033
00:42:40,700 --> 00:42:42,770
Fourth iteration,
read and write.
1034
00:42:42,770 --> 00:42:44,610
Do you see how this is going
on these four situations?
1035
00:42:44,610 --> 00:42:46,050
Second iteration, third
iteration, fourth iteration,
1036
00:42:46,050 --> 00:42:48,115
[UNINTELLIGIBLE].
1037
00:42:48,115 --> 00:42:48,580
OK.
1038
00:42:48,580 --> 00:42:50,710
So what happens is first
iteration read
1039
00:42:50,710 --> 00:42:53,810
this value is zero.
1040
00:42:53,810 --> 00:42:56,170
And write the value as zero
in the menu writing.
1041
00:42:56,170 --> 00:43:00,930
Second iteration A1,
A1, A2, A2, A3, A3.
1042
00:43:00,930 --> 00:43:07,290
So now, when this is writing,
that's a dependence
1043
00:43:07,290 --> 00:43:08,160
between these two.
1044
00:43:08,160 --> 00:43:10,150
You see the true and entire
output dependence
1045
00:43:10,150 --> 00:43:11,400
between these two.
1046
00:43:11,400 --> 00:43:15,545
1047
00:43:15,545 --> 00:43:18,270
What type of dependence
do we have?
1048
00:43:18,270 --> 00:43:19,520
[UNINTELLIGIBLE] dependence.
1049
00:43:19,520 --> 00:43:24,240
1050
00:43:24,240 --> 00:43:26,340
True dependence is what?
1051
00:43:26,340 --> 00:43:28,780
What to what?
1052
00:43:28,780 --> 00:43:29,940
What's the first thing
that occurs?
1053
00:43:29,940 --> 00:43:33,620
What's the next thing
that occurs?
1054
00:43:33,620 --> 00:43:34,875
Anybody want to answer?
1055
00:43:34,875 --> 00:43:40,820
1056
00:43:40,820 --> 00:43:43,140
AUDIENCE: [INAUDIBLE]
1057
00:43:43,140 --> 00:43:43,900
PROFESSOR: Write to read.
1058
00:43:43,900 --> 00:43:45,720
So you have the first thing
has to be write to read.
1059
00:43:45,720 --> 00:43:46,570
Watch this.
1060
00:43:46,570 --> 00:43:48,800
This is a read to write.
1061
00:43:48,800 --> 00:43:51,230
So what type of dependence
is this?
1062
00:43:51,230 --> 00:43:52,240
This is anti-dependent.
1063
00:43:52,240 --> 00:43:54,680
So here is ante-dependence
in here.
1064
00:43:54,680 --> 00:43:57,710
But the nice thing about that is
this dependent didn't cross
1065
00:43:57,710 --> 00:43:58,940
the iteration boundary.
1066
00:43:58,940 --> 00:44:01,050
So these black lines are my
iteration boundaries.
1067
00:44:01,050 --> 00:44:03,190
So these are for situations
that [UNINTELLIGIBLE].
1068
00:44:03,190 --> 00:44:06,780
So there's no iteration
crossing in here.
1069
00:44:06,780 --> 00:44:09,360
You can kind of [UNINTELLIGIBLE]
it using each
1070
00:44:09,360 --> 00:44:13,590
of these iterations and my
dependencies within the very
1071
00:44:13,590 --> 00:44:14,360
same iteration.
1072
00:44:14,360 --> 00:44:18,116
So the same iteration I have
dependency [UNINTELLIGIBLE].
1073
00:44:18,116 --> 00:44:18,590
OK?
1074
00:44:18,590 --> 00:44:19,910
This is a simpler case.
1075
00:44:19,910 --> 00:44:24,060
So let's look at something a
little bit more complicated.
1076
00:44:24,060 --> 00:44:28,560
So I have Ai plus 1
equals Ai plus 1.
1077
00:44:28,560 --> 00:44:32,430
So what happens is first
I am reading Ai.
1078
00:44:32,430 --> 00:44:39,270
And then, I am writing Ai plus
1 in the same iteration.
1079
00:44:39,270 --> 00:44:41,100
The next iteration, I
am reading now Ai
1080
00:44:41,100 --> 00:44:42,050
[UNINTELLIGIBLE]
1081
00:44:42,050 --> 00:44:44,250
this is A0 and 1.
1082
00:44:44,250 --> 00:44:45,890
This is A is 1.
1083
00:44:45,890 --> 00:44:46,410
[UNINTELLIGIBLE]
1084
00:44:46,410 --> 00:44:47,880
I am writing Ai plus 1.
1085
00:44:47,880 --> 00:44:49,130
I am writing 2.
1086
00:44:49,130 --> 00:44:51,610
1087
00:44:51,610 --> 00:44:54,070
So I have a dependence
like this now.
1088
00:44:54,070 --> 00:44:55,320
What type of dependence
is this?
1089
00:44:55,320 --> 00:44:58,190
1090
00:44:58,190 --> 00:45:00,560
This is a true dependence
because I am writing.
1091
00:45:00,560 --> 00:45:04,610
And this is actually reading
what what it is writing.
1092
00:45:04,610 --> 00:45:08,720
So does this look parallel?
1093
00:45:08,720 --> 00:45:09,590
No.
1094
00:45:09,590 --> 00:45:13,790
Because what happens is if you
look at each iteration depends
1095
00:45:13,790 --> 00:45:15,680
on the previous iteration.
1096
00:45:15,680 --> 00:45:18,040
So you have to actually have
this dependence going back and
1097
00:45:18,040 --> 00:45:20,890
forth, back and forth in here.
1098
00:45:20,890 --> 00:45:23,550
So let's look at a couple
more other things.
1099
00:45:23,550 --> 00:45:26,830
So here is Ai equals
Ai plus 2.
1100
00:45:26,830 --> 00:45:30,660
So I am basically reading
Ai plus 2.
1101
00:45:30,660 --> 00:45:31,670
So I am reading this one.
1102
00:45:31,670 --> 00:45:33,220
I am writing this one.
1103
00:45:33,220 --> 00:45:33,810
Reading this one.
1104
00:45:33,810 --> 00:45:36,060
Writing this one.
1105
00:45:36,060 --> 00:45:38,040
Here is my dependence
that's in here.
1106
00:45:38,040 --> 00:45:39,620
You see the two are
anti in here.
1107
00:45:39,620 --> 00:45:42,210
1108
00:45:42,210 --> 00:45:44,150
This is anti-dependence because
I am going from a
1109
00:45:44,150 --> 00:45:47,160
reading to a write in here.
1110
00:45:47,160 --> 00:45:48,510
Can this loop be parallel?
1111
00:45:48,510 --> 00:45:55,240
1112
00:45:55,240 --> 00:45:57,744
Can this loop run parallel?
1113
00:45:57,744 --> 00:46:00,440
AUDIENCE: [INAUDIBLE]
1114
00:46:00,440 --> 00:46:04,090
PROFESSOR: So can every
iteration run parallel?
1115
00:46:04,090 --> 00:46:05,030
There could be basically.
1116
00:46:05,030 --> 00:46:07,590
No because what happens is if
you look at that, there's a
1117
00:46:07,590 --> 00:46:09,690
dependence that goes
like this.
1118
00:46:09,690 --> 00:46:11,060
And of course, there
are two chains.
1119
00:46:11,060 --> 00:46:14,170
So if you are interested, you
can run at least two-way
1120
00:46:14,170 --> 00:46:15,210
parallelism.
1121
00:46:15,210 --> 00:46:18,200
You can run one chain parallel
to another chain when you do
1122
00:46:18,200 --> 00:46:21,130
get that much parallelism.
1123
00:46:21,130 --> 00:46:22,380
How about this one?
1124
00:46:22,380 --> 00:46:24,740
1125
00:46:24,740 --> 00:46:26,970
2i and 2i plus 1.
1126
00:46:26,970 --> 00:46:28,220
[UNINTELLIGIBLE]
1127
00:46:28,220 --> 00:46:29,890
1128
00:46:29,890 --> 00:46:31,140
Is there independence in here?
1129
00:46:31,140 --> 00:46:34,740
1130
00:46:34,740 --> 00:46:36,120
Nope because one is--
1131
00:46:36,120 --> 00:46:38,300
you are reading all the elements
and even writing
1132
00:46:38,300 --> 00:46:39,210
elements [UNINTELLIGIBLE]
1133
00:46:39,210 --> 00:46:39,730
dependence.
1134
00:46:39,730 --> 00:46:42,100
So you can have a missing
parallel.
1135
00:46:42,100 --> 00:46:43,230
OK?
1136
00:46:43,230 --> 00:46:46,210
So this is the kind of
interesting thing
1137
00:46:46,210 --> 00:46:46,890
that is going on.
1138
00:46:46,890 --> 00:46:51,150
So next, I want to look at
something a little bit more
1139
00:46:51,150 --> 00:46:52,130
complicated.
1140
00:46:52,130 --> 00:46:54,400
So let's look at this.
1141
00:46:54,400 --> 00:46:59,850
So here's a classic algorithm
called successive over
1142
00:46:59,850 --> 00:47:01,320
relaxation.
1143
00:47:01,320 --> 00:47:05,050
So it kind of simulates a lot of
times things like heat flow
1144
00:47:05,050 --> 00:47:06,460
through a plane.
1145
00:47:06,460 --> 00:47:08,830
So the idea there is--
1146
00:47:08,830 --> 00:47:10,500
let me illustrate
what he does.
1147
00:47:10,500 --> 00:47:16,750
So assume you have a
big metal sheet.
1148
00:47:16,750 --> 00:47:20,790
And you put some kind of heat
source in one place.
1149
00:47:20,790 --> 00:47:23,640
And after sometime, it all
reaches a steady state.
1150
00:47:23,640 --> 00:47:24,920
The other side might be cold.
1151
00:47:24,920 --> 00:47:28,190
And you want to know part of
the sheet's temperature.
1152
00:47:28,190 --> 00:47:31,070
1153
00:47:31,070 --> 00:47:34,700
Because temperature
can leak out.
1154
00:47:34,700 --> 00:47:40,515
And there are more things like
you have a heat source and
1155
00:47:40,515 --> 00:47:44,480
others that [UNINTELLIGIBLE]
to work a glass of water or
1156
00:47:44,480 --> 00:47:45,300
some kind of a sink.
1157
00:47:45,300 --> 00:47:47,650
So what is the heat
distribution?
1158
00:47:47,650 --> 00:47:51,120
So one way to compare that this
is basically the same
1159
00:47:51,120 --> 00:47:51,550
[UNINTELLIGIBLE].
1160
00:47:51,550 --> 00:47:55,230
The heat value here is basically
the average around
1161
00:47:55,230 --> 00:47:59,690
all these other values
right now.
1162
00:47:59,690 --> 00:48:01,700
Because if something is too
hot, the heat is going to
1163
00:48:01,700 --> 00:48:03,040
propagate something
that is too cold.
1164
00:48:03,040 --> 00:48:04,800
The heat is going to propagate
because it kind of has to be
1165
00:48:04,800 --> 00:48:06,820
average around that.
1166
00:48:06,820 --> 00:48:07,840
Then, you take the
average in here.
1167
00:48:07,840 --> 00:48:09,260
So what it's doing
is calculating
1168
00:48:09,260 --> 00:48:11,200
the average in here.
1169
00:48:11,200 --> 00:48:12,810
And then, you have to do
it many, many times.
1170
00:48:12,810 --> 00:48:14,480
So if you have a heat source,
at that point, it
1171
00:48:14,480 --> 00:48:15,300
would be very hard.
1172
00:48:15,300 --> 00:48:17,790
And then, it will start
propagating slowly and kind of
1173
00:48:17,790 --> 00:48:18,810
propagate down.
1174
00:48:18,810 --> 00:48:21,890
And the cold side in this way
or after running many times,
1175
00:48:21,890 --> 00:48:23,180
it basically stabilizes.
1176
00:48:23,180 --> 00:48:25,350
And at that point, you have the
kind of heat distribution
1177
00:48:25,350 --> 00:48:26,180
that we [UNINTELLIGIBLE] have.
1178
00:48:26,180 --> 00:48:28,220
This is the kind of calculation
you do.
1179
00:48:28,220 --> 00:48:30,070
So this is the calculation.
1180
00:48:30,070 --> 00:48:31,410
So what you're doing is
calculating this.
1181
00:48:31,410 --> 00:48:35,220
You are creating this,
this, this, and this
1182
00:48:35,220 --> 00:48:36,370
and updating that.
1183
00:48:36,370 --> 00:48:38,480
And then, you do it
for t time stamps.
1184
00:48:38,480 --> 00:48:41,380
So you just go around doing each
of these things first and
1185
00:48:41,380 --> 00:48:44,284
doing it for t time stamps.
1186
00:48:44,284 --> 00:48:44,740
OK?
1187
00:48:44,740 --> 00:48:46,608
So we would like to
run this parallel.
1188
00:48:46,608 --> 00:48:49,620
1189
00:48:49,620 --> 00:48:53,650
So here's my basically
data space.
1190
00:48:53,650 --> 00:48:55,270
There's my data items.
1191
00:48:55,270 --> 00:48:56,980
So here's my array,
two-dimensional array.
1192
00:48:56,980 --> 00:48:58,540
So this is how I'm
trying to update.
1193
00:48:58,540 --> 00:49:00,170
I'm reading all this file.
1194
00:49:00,170 --> 00:49:01,760
So here's my iteration space.
1195
00:49:01,760 --> 00:49:03,050
So what I have looked at this.
1196
00:49:03,050 --> 00:49:04,590
I don't want to--
1197
00:49:04,590 --> 00:49:06,400
it's hard to give you
a 3D diagram.
1198
00:49:06,400 --> 00:49:08,190
I don't have a 3D projector.
1199
00:49:08,190 --> 00:49:11,160
So what I'm showing here is
three-dimension here.
1200
00:49:11,160 --> 00:49:13,850
So this is the previous
iteration, first iteration.
1201
00:49:13,850 --> 00:49:15,990
So if I still go tij.
1202
00:49:15,990 --> 00:49:19,800
So you go through t, and then
you go through i in here, and
1203
00:49:19,800 --> 00:49:21,560
then, when you're done, you
go to the [UNINTELLIGIBLE]
1204
00:49:21,560 --> 00:49:23,440
iteration and you go this way.
1205
00:49:23,440 --> 00:49:24,820
So here's how you
would iterate.
1206
00:49:24,820 --> 00:49:27,300
So you run this one, this one,
this one, this one, this one,
1207
00:49:27,300 --> 00:49:28,530
this one, this one, this
one, this one.
1208
00:49:28,530 --> 00:49:31,500
And then increase t by
1, and go like this.
1209
00:49:31,500 --> 00:49:35,370
And right now, we are here.
1210
00:49:35,370 --> 00:49:37,140
We are trying to update
this one.
1211
00:49:37,140 --> 00:49:39,080
That's what we are
trying to do.
1212
00:49:39,080 --> 00:49:42,570
And that means we are already
finished up to this point.
1213
00:49:42,570 --> 00:49:45,760
All these points are
finished up.
1214
00:49:45,760 --> 00:49:50,160
Now, what we have to do is
figure out when I'm reading,
1215
00:49:50,160 --> 00:49:53,480
who actually wrote this value.
1216
00:49:53,480 --> 00:49:53,920
OK?
1217
00:49:53,920 --> 00:49:56,830
First of all, let's figure out
which iterations might be able
1218
00:49:56,830 --> 00:49:58,430
to write this value.
1219
00:49:58,430 --> 00:50:04,460
So if you look at
this value, this
1220
00:50:04,460 --> 00:50:06,110
relationship in between here.
1221
00:50:06,110 --> 00:50:09,060
This one, basically, is ij.
1222
00:50:09,060 --> 00:50:11,030
And this is ij, ij, ij.
1223
00:50:11,030 --> 00:50:13,770
These three iterations
can write this one.
1224
00:50:13,770 --> 00:50:17,650
So and these iterations
can write this one.
1225
00:50:17,650 --> 00:50:19,770
Let me go to this one.
1226
00:50:19,770 --> 00:50:22,480
This is a pretty darn
complicated [UNINTELLIGIBLE].
1227
00:50:22,480 --> 00:50:26,650
So what that means is in
this one, this one
1228
00:50:26,650 --> 00:50:28,870
already wrote something.
1229
00:50:28,870 --> 00:50:30,990
This is what I'm reading
in here.
1230
00:50:30,990 --> 00:50:32,500
This one already wrote
something.
1231
00:50:32,500 --> 00:50:33,355
This is what I'm reading here.
1232
00:50:33,355 --> 00:50:34,390
This iteration wrote
something.
1233
00:50:34,390 --> 00:50:35,930
I read it here.
1234
00:50:35,930 --> 00:50:36,200
OK.
1235
00:50:36,200 --> 00:50:38,850
Everybody following so far?
1236
00:50:38,850 --> 00:50:40,170
How about this, guys?
1237
00:50:40,170 --> 00:50:42,665
Who wrote the value I am reading
in these iterations?
1238
00:50:42,665 --> 00:50:47,530
1239
00:50:47,530 --> 00:50:50,300
In this one, I haven't
reached there yet.
1240
00:50:50,300 --> 00:50:53,130
So who has written that?
1241
00:50:53,130 --> 00:50:56,812
So I assume this
is t equals 1.
1242
00:50:56,812 --> 00:50:57,630
[UNINTELLIGIBLE]
1243
00:50:57,630 --> 00:50:59,260
somebody has to write
those things.
1244
00:50:59,260 --> 00:51:03,150
So what that means is this also
wrote all of those values
1245
00:51:03,150 --> 00:51:04,520
because I have done
those iterations.
1246
00:51:04,520 --> 00:51:07,190
But the interesting thing is
some of these values got
1247
00:51:07,190 --> 00:51:07,630
overwritten.
1248
00:51:07,630 --> 00:51:09,650
This value got overwritten ,
this value got overwritten.
1249
00:51:09,650 --> 00:51:13,310
So these two values disappear.
1250
00:51:13,310 --> 00:51:15,720
This value got overwritten
by this guy.
1251
00:51:15,720 --> 00:51:20,080
This value got overwritten
by this guy.
1252
00:51:20,080 --> 00:51:20,280
OK.
1253
00:51:20,280 --> 00:51:22,400
But we haven't overwritten this
value, this value, and
1254
00:51:22,400 --> 00:51:23,310
this value yet.
1255
00:51:23,310 --> 00:51:25,850
This one, basically,
I've just updated.
1256
00:51:25,850 --> 00:51:26,620
But I [UNINTELLIGIBLE]
1257
00:51:26,620 --> 00:51:28,894
this one.
1258
00:51:28,894 --> 00:51:30,676
Do you see this?
1259
00:51:30,676 --> 00:51:33,954
Is everybody following me?
1260
00:51:33,954 --> 00:51:34,906
AUDIENCE: Once again, sir.
1261
00:51:34,906 --> 00:51:36,810
I got lost.
1262
00:51:36,810 --> 00:51:40,630
So what are [INAUDIBLE]
1263
00:51:40,630 --> 00:51:42,640
PROFESSOR: So what happens is I
am trying to update in this
1264
00:51:42,640 --> 00:51:46,780
iteration because this array
get rid of multiple times.
1265
00:51:46,780 --> 00:51:50,200
But in each iteration, you are
only doing one update.
1266
00:51:50,200 --> 00:51:52,220
So I am trying to read
and write in here.
1267
00:51:52,220 --> 00:51:54,380
So I need to read all
of these five
1268
00:51:54,380 --> 00:51:56,010
elements in this iteration.
1269
00:51:56,010 --> 00:51:58,450
So I want to figure out
who wrote that.
1270
00:51:58,450 --> 00:51:59,550
OK?
1271
00:51:59,550 --> 00:52:03,180
This one can be written by this
guy and this iteration.
1272
00:52:03,180 --> 00:52:05,440
Could this iteration write
its value in here?
1273
00:52:05,440 --> 00:52:08,110
1274
00:52:08,110 --> 00:52:09,000
OK?
1275
00:52:09,000 --> 00:52:09,420
[UNINTELLIGIBLE]
1276
00:52:09,420 --> 00:52:11,990
This iteration write because
we see it's writing ij.
1277
00:52:11,990 --> 00:52:16,890
I mean my diagram is not that
great because I have three in
1278
00:52:16,890 --> 00:52:18,660
here and five in here.
1279
00:52:18,660 --> 00:52:19,890
So just bear with me on that.
1280
00:52:19,890 --> 00:52:21,395
So assume I am writing
ij in here.
1281
00:52:21,395 --> 00:52:24,760
1282
00:52:24,760 --> 00:52:28,150
So my iterations go from 1 to n,
but my data goes from 0 to
1283
00:52:28,150 --> 00:52:29,260
n plus 1, basically.
1284
00:52:29,260 --> 00:52:30,990
1 to n minus 1 iterations.
1285
00:52:30,990 --> 00:52:32,270
0 to n plus 1 data.
1286
00:52:32,270 --> 00:52:35,160
So data is bigger than iteration
space because of
1287
00:52:35,160 --> 00:52:36,290
[UNINTELLIGIBLE].
1288
00:52:36,290 --> 00:52:40,740
So what happens is when I'm in
this iteration, we'll say this
1289
00:52:40,740 --> 00:52:44,770
is 1 2 iteration.
1290
00:52:44,770 --> 00:52:47,590
I will write this value.
1291
00:52:47,590 --> 00:52:52,370
This iteration will also
write this value.
1292
00:52:52,370 --> 00:52:53,075
OK?
1293
00:52:53,075 --> 00:52:54,730
You see that?
1294
00:52:54,730 --> 00:52:56,400
All of these iterations
are the same.
1295
00:52:56,400 --> 00:52:58,170
This iteration we will also
write this value.
1296
00:52:58,170 --> 00:53:00,880
But right now, who is the
last guy who wrote it?
1297
00:53:00,880 --> 00:53:02,400
The last guy that wrote
it is this guy.
1298
00:53:02,400 --> 00:53:05,125
1299
00:53:05,125 --> 00:53:07,910
Because this iteration wrote
it, and after that, it got
1300
00:53:07,910 --> 00:53:09,060
ordered in this one.
1301
00:53:09,060 --> 00:53:11,330
But this one hadn't occurred
yet, so it hadn't been ordered
1302
00:53:11,330 --> 00:53:11,900
by this guy.
1303
00:53:11,900 --> 00:53:15,670
So the last guy who wrote
it was this one.
1304
00:53:15,670 --> 00:53:16,820
So that's why I had
to eliminate this.
1305
00:53:16,820 --> 00:53:19,630
But this data value--
1306
00:53:19,630 --> 00:53:21,400
I haven't executed this
iteration yet.
1307
00:53:21,400 --> 00:53:24,260
So nobody had written this
one in this time stamp.
1308
00:53:24,260 --> 00:53:26,310
So it has to be from the
previous time stamp.
1309
00:53:26,310 --> 00:53:32,300
So I read two values from the
current time stamp, three
1310
00:53:32,300 --> 00:53:33,870
values from the previous
time stamp.
1311
00:53:33,870 --> 00:53:35,120
These three values have
to come from the
1312
00:53:35,120 --> 00:53:35,980
previous time stamp.
1313
00:53:35,980 --> 00:53:38,670
These two values that come from
the current time stamp.
1314
00:53:38,670 --> 00:53:39,830
You see that?
1315
00:53:39,830 --> 00:53:40,105
OK.
1316
00:53:40,105 --> 00:53:41,350
Good.
1317
00:53:41,350 --> 00:53:45,250
So what that means is because
dependence means--
1318
00:53:45,250 --> 00:53:47,500
OK.
1319
00:53:47,500 --> 00:53:50,455
This line, this dark, red line.
1320
00:53:50,455 --> 00:53:51,160
See.
1321
00:53:51,160 --> 00:53:54,975
I am reading a value in a
current iteration that was
1322
00:53:54,975 --> 00:53:56,070
written by this iteration.
1323
00:53:56,070 --> 00:53:58,210
So that means I have no
dependence between these two
1324
00:53:58,210 --> 00:54:00,670
iterations.
1325
00:54:00,670 --> 00:54:00,960
OK.
1326
00:54:00,960 --> 00:54:03,930
This line, this dark,
red line.
1327
00:54:03,930 --> 00:54:06,020
I am reading a value written
by this iteration.
1328
00:54:06,020 --> 00:54:07,870
So I have a dependency
in here.
1329
00:54:07,870 --> 00:54:10,070
This line means I have a
dependence between this
1330
00:54:10,070 --> 00:54:11,750
iteration to the current one.
1331
00:54:11,750 --> 00:54:13,310
This line means I have
dependence between this
1332
00:54:13,310 --> 00:54:14,540
iteration and the current one.
1333
00:54:14,540 --> 00:54:15,980
This line means I have
dependence between this
1334
00:54:15,980 --> 00:54:18,980
iteration and the current one.
1335
00:54:18,980 --> 00:54:20,230
You see that?
1336
00:54:20,230 --> 00:54:22,480
1337
00:54:22,480 --> 00:54:23,320
OK.
1338
00:54:23,320 --> 00:54:28,560
So now, I want to see how we
can parallelize this group.
1339
00:54:28,560 --> 00:54:30,380
So what can I do?
1340
00:54:30,380 --> 00:54:32,200
So I look at all this
dependence.
1341
00:54:32,200 --> 00:54:34,720
At this point, I don't have to
think about all this where who
1342
00:54:34,720 --> 00:54:34,910
wrote what.
1343
00:54:34,910 --> 00:54:37,840
I can say this is dependence.
1344
00:54:37,840 --> 00:54:43,310
In order to do this equation,
all these iterations have to
1345
00:54:43,310 --> 00:54:47,440
be done because I am losing the
values produced by them.
1346
00:54:47,440 --> 00:54:49,740
So these have to be finished
before I can patch that.
1347
00:54:49,740 --> 00:54:51,760
So the parallelism
means I tried to
1348
00:54:51,760 --> 00:54:54,430
do things in parallel.
1349
00:54:54,430 --> 00:54:56,900
So can we parallelize
this loop?
1350
00:54:56,900 --> 00:55:00,120
1351
00:55:00,120 --> 00:55:01,630
Can we run each time
stamp separately?
1352
00:55:01,630 --> 00:55:05,110
1353
00:55:05,110 --> 00:55:08,940
No because I am using these
three values from the previous
1354
00:55:08,940 --> 00:55:10,100
time stamp.
1355
00:55:10,100 --> 00:55:14,030
So I can't run this time stamp,
B equals 1, until B
1356
00:55:14,030 --> 00:55:15,670
equals 0 is done.
1357
00:55:15,670 --> 00:55:16,460
Or B plus 2 [UNINTELLIGIBLE]
1358
00:55:16,460 --> 00:55:18,700
B plus 1 is done.
1359
00:55:18,700 --> 00:55:19,030
OK?
1360
00:55:19,030 --> 00:55:22,690
So I can't parallelize
this loop.
1361
00:55:22,690 --> 00:55:22,930
OK.
1362
00:55:22,930 --> 00:55:26,320
Can I parallelize this loop?
1363
00:55:26,320 --> 00:55:27,850
Why?
1364
00:55:27,850 --> 00:55:31,000
Will dependence stop me from
parallelizing this one?
1365
00:55:31,000 --> 00:55:36,070
1366
00:55:36,070 --> 00:55:37,710
So I'm looking at i.
1367
00:55:37,710 --> 00:55:40,180
This is my i dimension.
1368
00:55:40,180 --> 00:55:41,720
How many lines, at
least, tell me.
1369
00:55:41,720 --> 00:55:43,370
How many dependencies
are going to stop
1370
00:55:43,370 --> 00:55:46,850
me from doing that?
1371
00:55:46,850 --> 00:55:47,160
OK.
1372
00:55:47,160 --> 00:55:47,510
Good.
1373
00:55:47,510 --> 00:55:48,180
I have [UNINTELLIGIBLE].
1374
00:55:48,180 --> 00:55:49,290
Somebody says three.
1375
00:55:49,290 --> 00:55:50,750
Somebody says one.
1376
00:55:50,750 --> 00:55:51,030
OK.
1377
00:55:51,030 --> 00:55:51,930
Let's get a vote.
1378
00:55:51,930 --> 00:55:53,545
How many people think
it's three?
1379
00:55:53,545 --> 00:55:56,390
1380
00:55:56,390 --> 00:55:56,630
OK.
1381
00:55:56,630 --> 00:55:58,760
There's one vote for three.
1382
00:55:58,760 --> 00:56:00,060
How many people think
it's three?
1383
00:56:00,060 --> 00:56:01,310
How many people think
it's one?
1384
00:56:01,310 --> 00:56:03,770
1385
00:56:03,770 --> 00:56:04,530
Wait a minute.
1386
00:56:04,530 --> 00:56:08,100
One vote for three and
two votes for one?
1387
00:56:08,100 --> 00:56:08,350
OK.
1388
00:56:08,350 --> 00:56:10,100
Where's the rest?
1389
00:56:10,100 --> 00:56:10,930
For two?
1390
00:56:10,930 --> 00:56:12,110
For 0?
1391
00:56:12,110 --> 00:56:13,480
Can't be 0 if the
0 is parallel.
1392
00:56:13,480 --> 00:56:14,300
OK.
1393
00:56:14,300 --> 00:56:16,150
So we'll start parallelizing.
1394
00:56:16,150 --> 00:56:16,750
OK.
1395
00:56:16,750 --> 00:56:18,960
So what happens in here?
1396
00:56:18,960 --> 00:56:21,210
Right now, this is
actually one.
1397
00:56:21,210 --> 00:56:22,660
This one.
1398
00:56:22,660 --> 00:56:26,900
Because these things don't
participate because this has
1399
00:56:26,900 --> 00:56:28,350
already happened.
1400
00:56:28,350 --> 00:56:33,030
When you go to ij iterations,
these are already done.
1401
00:56:33,030 --> 00:56:34,170
So you're going from t.
1402
00:56:34,170 --> 00:56:36,160
So you're looking at the current
iterations because
1403
00:56:36,160 --> 00:56:38,680
you're ending in two loops.
1404
00:56:38,680 --> 00:56:39,740
So the t is done.
1405
00:56:39,740 --> 00:56:41,720
So these all are already
done when you go try
1406
00:56:41,720 --> 00:56:42,490
to parallelize sides.
1407
00:56:42,490 --> 00:56:44,740
So I don't have to worry
about these three.
1408
00:56:44,740 --> 00:56:48,150
In here because actually I'm
losing t of something here, I
1409
00:56:48,150 --> 00:56:51,480
am in trouble.
1410
00:56:51,480 --> 00:56:56,410
So when you go look at this
one, I have this one.
1411
00:56:56,410 --> 00:56:59,020
So every dimension has
a dependence in here.
1412
00:56:59,020 --> 00:57:00,750
So I can't run it in parallel.
1413
00:57:00,750 --> 00:57:02,740
So does this mean that there's
no parallelism?
1414
00:57:02,740 --> 00:57:08,160
1415
00:57:08,160 --> 00:57:09,410
Who think there's
no parallelism?
1416
00:57:09,410 --> 00:57:12,410
1417
00:57:12,410 --> 00:57:13,340
Who thinks there is?
1418
00:57:13,340 --> 00:57:14,370
Oh, somebody thinks there's
no parallelism.
1419
00:57:14,370 --> 00:57:16,680
Who thinks there's
parallelism?
1420
00:57:16,680 --> 00:57:17,170
OK.
1421
00:57:17,170 --> 00:57:18,080
More people think there's
parallelism.
1422
00:57:18,080 --> 00:57:20,090
Let's see what we can do.
1423
00:57:20,090 --> 00:57:21,889
Question?
1424
00:57:21,889 --> 00:57:24,354
AUDIENCE: Do you really
think [INAUDIBLE]
1425
00:57:24,354 --> 00:57:34,214
1426
00:57:34,214 --> 00:57:36,679
I'm trying to figure out
how to word this.
1427
00:57:36,679 --> 00:57:38,158
Do you really want to have
1428
00:57:38,158 --> 00:57:39,144
dependence on the same concept?
1429
00:57:39,144 --> 00:57:40,394
[INAUDIBLE]?
1430
00:57:40,394 --> 00:57:43,620
1431
00:57:43,620 --> 00:57:43,880
PROFESSOR: Yeah.
1432
00:57:43,880 --> 00:57:45,730
I mean you can do--
1433
00:57:45,730 --> 00:57:48,810
this is the way this SOR
is sitting so there's a
1434
00:57:48,810 --> 00:57:50,420
dependence between time stamp.
1435
00:57:50,420 --> 00:57:51,750
There's another SOR.
1436
00:57:51,750 --> 00:57:53,380
What they do is kind
of a red, black.
1437
00:57:53,380 --> 00:57:56,400
So when you calculate the next
time stamp, you calculate it
1438
00:57:56,400 --> 00:57:57,760
right and complete
the new array.
1439
00:57:57,760 --> 00:57:58,560
So there's no dependence.
1440
00:57:58,560 --> 00:58:00,900
So that's a different
algorithm.
1441
00:58:00,900 --> 00:58:03,800
This algorithm, basically,
uses sum value from
1442
00:58:03,800 --> 00:58:05,730
[UNINTELLIGIBLE] because the
value-- the algorithm you're
1443
00:58:05,730 --> 00:58:06,985
talking-- you already created
the other copies.
1444
00:58:06,985 --> 00:58:07,890
You had two copies.
1445
00:58:07,890 --> 00:58:09,170
You're bouncing back
and forth.
1446
00:58:09,170 --> 00:58:09,530
Nice.
1447
00:58:09,530 --> 00:58:11,100
No real problem in here.
1448
00:58:11,100 --> 00:58:12,760
But then you had to have twice
the amount of storage.
1449
00:58:12,760 --> 00:58:14,320
Here, you are updating in.
1450
00:58:14,320 --> 00:58:17,740
And since this is kind of
running enough iterations
1451
00:58:17,740 --> 00:58:23,190
until it converges, it doesn't
seem to matter that the
1452
00:58:23,190 --> 00:58:24,440
[UNINTELLIGIBLE PHRASE].
1453
00:58:24,440 --> 00:58:26,520
1454
00:58:26,520 --> 00:58:27,280
OK.
1455
00:58:27,280 --> 00:58:32,380
So we cannot find a loop,
what we call doall loop.
1456
00:58:32,380 --> 00:58:35,380
The doall loop means there's no
loop carried dependences.
1457
00:58:35,380 --> 00:58:36,930
It's fully parallel.
1458
00:58:36,930 --> 00:58:38,900
This is the best case.
1459
00:58:38,900 --> 00:58:41,470
So what happens is when
you get there,
1460
00:58:41,470 --> 00:58:42,620
everybody can run parallel.
1461
00:58:42,620 --> 00:58:44,770
And when you're done, you can
stop and then do that.
1462
00:58:44,770 --> 00:58:46,540
So this is the doall loop.
1463
00:58:46,540 --> 00:58:47,560
Of course, there's
no doall loop.
1464
00:58:47,560 --> 00:58:48,830
We can look at every
dimension.
1465
00:58:48,830 --> 00:58:50,290
We had some kind
of dependence.
1466
00:58:50,290 --> 00:58:53,430
So there's another choice, what
we call doacross loop.
1467
00:58:53,430 --> 00:58:57,200
What that means is we have some
loop carried dependence.
1468
00:58:57,200 --> 00:58:58,760
There's something I
have to use for
1469
00:58:58,760 --> 00:59:00,320
the previous iteration.
1470
00:59:00,320 --> 00:59:01,910
But it's only one thing.
1471
00:59:01,910 --> 00:59:05,150
I have a lot of other things I
can run around that only I
1472
00:59:05,150 --> 00:59:06,190
just have to wait one thing.
1473
00:59:06,190 --> 00:59:06,820
One is done.
1474
00:59:06,820 --> 00:59:08,190
I can just keep running.
1475
00:59:08,190 --> 00:59:11,020
And if I calculate and send this
one early, then I can do
1476
00:59:11,020 --> 00:59:13,250
my other calculations later.
1477
00:59:13,250 --> 00:59:14,160
This is not that great.
1478
00:59:14,160 --> 00:59:15,820
If you look at the
difference here.
1479
00:59:15,820 --> 00:59:19,350
This definitely has very little
overhead in here.
1480
00:59:19,350 --> 00:59:21,330
This can run slow.
1481
00:59:21,330 --> 00:59:23,050
And of course, this thing
gets produced very late.
1482
00:59:23,050 --> 00:59:24,790
It's [? almost ?] sequential.
1483
00:59:24,790 --> 00:59:27,890
So I hope you can just-- it the
other guy wants something,
1484
00:59:27,890 --> 00:59:29,810
I can immediately send
it very early.
1485
00:59:29,810 --> 00:59:31,560
And then I can run there.
1486
00:59:31,560 --> 00:59:35,600
So you can get some kind of
doacross patterns in here.
1487
00:59:35,600 --> 00:59:37,690
So if you want to
do this one--
1488
00:59:37,690 --> 00:59:39,710
this is a little bit
crazy in here.
1489
00:59:39,710 --> 00:59:41,580
But they'll do it in here.
1490
00:59:41,580 --> 00:59:44,260
And so what first we are to
do is you are to say, OK.
1491
00:59:44,260 --> 00:59:44,640
Look.
1492
00:59:44,640 --> 00:59:48,570
I'm running this loop, the
i loop in parallel.
1493
00:59:48,570 --> 00:59:52,410
But I have to exchange
some data.
1494
00:59:52,410 --> 00:59:55,730
Before I want to run this one,
I have to basically get the
1495
00:59:55,730 --> 00:59:58,060
previous i value produced.
1496
00:59:58,060 --> 01:00:00,620
And when it's done, I can say
the next guy can use it.
1497
01:00:00,620 --> 01:00:02,490
So this is a very
complicated one.
1498
01:00:02,490 --> 01:00:05,430
I don't want you to understand
it too well.
1499
01:00:05,430 --> 01:00:09,210
So the reason I put it is to
show that OK, if you want to
1500
01:00:09,210 --> 01:00:12,080
spend a week trying to really
call this up and understand
1501
01:00:12,080 --> 01:00:14,230
and make sure that
it works OK.
1502
01:00:14,230 --> 01:00:16,930
So you can do things
like that.
1503
01:00:16,930 --> 01:00:17,910
OK?
1504
01:00:17,910 --> 01:00:18,570
Aha.
1505
01:00:18,570 --> 01:00:21,410
So this is the true
voodooness.
1506
01:00:21,410 --> 01:00:23,170
OK.
1507
01:00:23,170 --> 01:00:28,150
AUDIENCE: So in Cilk, if you
do this with divide and
1508
01:00:28,150 --> 01:00:34,400
conquer, you can make it be what
I called in the Tableau
1509
01:00:34,400 --> 01:00:35,760
construction.
1510
01:00:35,760 --> 01:00:38,820
Each layer here is basically
constructing a Tableau.
1511
01:00:38,820 --> 01:00:41,078
And so if you do it with divide
and conquer, you can do
1512
01:00:41,078 --> 01:00:44,670
it with a very simple
recursive code.
1513
01:00:44,670 --> 01:00:49,160
But you can also do it with a
loop that goes diagonally.
1514
01:00:49,160 --> 01:00:49,260
AUDIENCE: [INTERPOSING VOICES]
1515
01:00:49,260 --> 01:00:49,390
PROFESSOR: Yes.
1516
01:00:49,390 --> 01:00:50,460
I'm going to get that.
1517
01:00:50,460 --> 01:00:52,690
That's next.
1518
01:00:52,690 --> 01:00:53,770
AUDIENCE: Sorry.
1519
01:00:53,770 --> 01:00:54,540
PROFESSOR: That's OK.
1520
01:00:54,540 --> 01:01:01,080
So the reason that I'm showing
that is because this class is
1521
01:01:01,080 --> 01:01:05,210
not just about how to make the
cores exactly run faster.
1522
01:01:05,210 --> 01:01:07,630
Think about algorithmic issues
and stuff like that.
1523
01:01:07,630 --> 01:01:10,670
So sometimes, when you look at
a problem, it looks crazy.
1524
01:01:10,670 --> 01:01:13,240
And there might be some changes
you can do that you
1525
01:01:13,240 --> 01:01:16,590
can get to run things
in parallel.
1526
01:01:16,590 --> 01:01:18,890
So I'm actually doing
not diagonal.
1527
01:01:18,890 --> 01:01:21,030
I'm actually doing something
very simple.
1528
01:01:21,030 --> 01:01:26,120
So what I have done here
is I have all these
1529
01:01:26,120 --> 01:01:27,120
dependences in here.
1530
01:01:27,120 --> 01:01:27,660
OK?
1531
01:01:27,660 --> 01:01:34,380
So the problem here is I can't
find a single [UNINTELLIGIBLE]
1532
01:01:34,380 --> 01:01:37,390
that basically has
no crossing.
1533
01:01:37,390 --> 01:01:39,500
But if you look at this
[UNINTELLIGIBLE]
1534
01:01:39,500 --> 01:01:41,140
diagonal here.
1535
01:01:41,140 --> 01:01:44,780
What you see is, in fact,
there's nothing that crosses
1536
01:01:44,780 --> 01:01:46,030
the diagonal.
1537
01:01:46,030 --> 01:01:48,370
1538
01:01:48,370 --> 01:01:48,880
OK?
1539
01:01:48,880 --> 01:01:51,600
So this one basically
doesn't depend on
1540
01:01:51,600 --> 01:01:52,790
this one or this one.
1541
01:01:52,790 --> 01:01:54,480
It only depends on
the previous one.
1542
01:01:54,480 --> 01:01:57,230
So I can run everything in
the diagonal parallel in
1543
01:01:57,230 --> 01:01:58,560
here in this one.
1544
01:01:58,560 --> 01:02:01,247
So of course, I can't write
anything [UNINTELLIGIBLE] in
1545
01:02:01,247 --> 01:02:03,260
here, but there's a cute
trick you can do.
1546
01:02:03,260 --> 01:02:06,400
What you can do is you
can take iteration
1547
01:02:06,400 --> 01:02:07,910
space and skew it.
1548
01:02:07,910 --> 01:02:10,620
1549
01:02:10,620 --> 01:02:15,300
So what I have done is now
instead off the same thing,
1550
01:02:15,300 --> 01:02:17,851
instead of this being a square,
now I skewed it a
1551
01:02:17,851 --> 01:02:20,250
little bit.
1552
01:02:20,250 --> 01:02:20,790
OK?
1553
01:02:20,790 --> 01:02:28,730
So what that means is when I'm
running first i, I basically
1554
01:02:28,730 --> 01:02:29,530
don't run any here.
1555
01:02:29,530 --> 01:02:32,380
Then, I run this one and
this iteration here.
1556
01:02:32,380 --> 01:02:34,220
So what I have done is I have
kind of moved my iteration
1557
01:02:34,220 --> 01:02:35,090
space around.
1558
01:02:35,090 --> 01:02:38,810
Do you see how this might be?
1559
01:02:38,810 --> 01:02:42,900
So now, the interesting thing
is when I skew, if I look at
1560
01:02:42,900 --> 01:02:56,635
this line, I can parallelize
in this one because all the
1561
01:02:56,635 --> 01:02:58,490
dependences come from the
previous iteration.
1562
01:02:58,490 --> 01:02:59,150
Am I right?
1563
01:02:59,150 --> 01:03:00,400
[UNINTELLIGIBLE]
1564
01:03:00,400 --> 01:03:03,532
1565
01:03:03,532 --> 01:03:04,000
Yeah.
1566
01:03:04,000 --> 01:03:05,250
I skewed it.
1567
01:03:05,250 --> 01:03:10,864
1568
01:03:10,864 --> 01:03:16,440
Yes, everything in here, these
ones are parallel.
1569
01:03:16,440 --> 01:03:16,990
OK?
1570
01:03:16,990 --> 01:03:21,070
And any dependence comes from
the previous iteration.
1571
01:03:21,070 --> 01:03:22,670
There's no current iteration
in here.
1572
01:03:22,670 --> 01:03:24,580
Everything in this
one is parallel.
1573
01:03:24,580 --> 01:03:26,960
So I can parallelize this.
1574
01:03:26,960 --> 01:03:29,020
So this one doesn't depend
on this one or this one.
1575
01:03:29,020 --> 01:03:32,000
So this is all parallel.
1576
01:03:32,000 --> 01:03:33,570
This is a little bit
more complicated.
1577
01:03:33,570 --> 01:03:36,990
So if you're interested
to go deep, just go
1578
01:03:36,990 --> 01:03:38,005
stare at the slides.
1579
01:03:38,005 --> 01:03:40,460
I have the slides out there to
understand how that happens.
1580
01:03:40,460 --> 01:03:43,170
So if you think about what I'm
running here in parallel is
1581
01:03:43,170 --> 01:03:47,660
the one basically this
diagonal in here.
1582
01:03:47,660 --> 01:03:50,870
So what happens is if you run
this, this, and this parallel,
1583
01:03:50,870 --> 01:03:51,690
there's no dependence.
1584
01:03:51,690 --> 01:03:54,600
I don't need this one or this
one to run this one.
1585
01:03:54,600 --> 01:03:56,550
So I can run this, this,
this, this, all
1586
01:03:56,550 --> 01:03:58,010
this diagonal in parallel.
1587
01:03:58,010 --> 01:03:59,680
But the trouble with just the
diagonal is I don't have a
1588
01:03:59,680 --> 01:04:01,850
place in here to say
[UNINTELLIGIBLE]
1589
01:04:01,850 --> 01:04:02,560
for a diagonal.
1590
01:04:02,560 --> 01:04:04,940
So I basically skewed
it and then made a
1591
01:04:04,940 --> 01:04:06,760
diagonal into one loop.
1592
01:04:06,760 --> 01:04:12,780
So then, now what happens
is basically j
1593
01:04:12,780 --> 01:04:15,850
loop I can run parallel.
1594
01:04:15,850 --> 01:04:17,800
This one.
1595
01:04:17,800 --> 01:04:20,450
So I can do it four
[UNINTELLIGIBLE] four.
1596
01:04:20,450 --> 01:04:21,070
OK?
1597
01:04:21,070 --> 01:04:27,910
So here's something you found
a problem that has no nice
1598
01:04:27,910 --> 01:04:28,200
parallelism.
1599
01:04:28,200 --> 01:04:31,370
But you realize there's kind of
a what you call a wavefront
1600
01:04:31,370 --> 01:04:32,550
going on here.
1601
01:04:32,550 --> 01:04:33,530
Wave going on here.
1602
01:04:33,530 --> 01:04:36,870
So not the given dimension, but
there's another dimension
1603
01:04:36,870 --> 01:04:37,690
that you can parallel.
1604
01:04:37,690 --> 01:04:41,090
So you kind of skewed your
space to get that nice
1605
01:04:41,090 --> 01:04:41,690
[UNINTELLIGIBLE] line.
1606
01:04:41,690 --> 01:04:44,210
And you run parallel.
1607
01:04:44,210 --> 01:04:47,100
So that's all I have for today.
1608
01:04:47,100 --> 01:04:48,350