Wouldn't the traffic of a lot of websites experience similar trends and therefore need more resources at roughly the same times? If that is the case, then Amazon now has the problem of holding all of these unused resources instead of the individual websites each having extra resources. It seems to me like shifting the problem to somebody else, but maybe it is better to have that problem centralized instead of everybody worrying about it individually.

Do ASICs also use relaxed memory consistency for use cases that don't need strict memory ordering? This would be a good way to improve efficiency for very specific applications at the hardware level.

If kernel calls are asynchronous, would it be possible to launch a large kernel and then start work on the CPU at the same time? This would let both the CPU and the GPU be utilized simultaneously, which seems like a great option.

Remembering the freeway example: throughput is the width of the freeway and latency is the speed limit. If our program has only one thing running (one car on the freeway), increasing the bandwidth (widening the freeway) won't really help; we would want to decrease the latency (raise the speed limit). However, if we have a lot of processes running in parallel (a ton of cars on the freeway), we are usually better off increasing the bandwidth. Making the freeway wider reduces congestion, while raising the speed limit helps but is far less effective if there simply isn't enough room on the freeway for all the cars.

1) Remembering Amdahl's law, we know that even a small amount of sequential computation can greatly limit the speedup of an otherwise parallel program.

2) From the 213 Cache Lab we know that having relevant data in the cache reduces the latency of loading from main memory.

3) Sometimes the overhead of avoiding synchronization or implementing fine-grained locking can actually be greater than the cost of having an atomic section in the code. Thus trying the simplest solution first is a good approach to these problems.

In this case, once you load one (row, column) address, you get a full 64-bit transfer across the chips, which is the desired result in most cases. The same effect can be reached by interleaving the bytes more coarsely (up to 8 consecutive bytes on each DRAM chip).

Generally, you check whether the loss is too high by computing the difference between the current solution and the desired solution. If this difference is above a predetermined threshold, the loss is considered too high.

@jedi I think that busy waiting would still drain the phone's battery since it keeps consuming power. However, since busy waiting is such a low-power activity, I doubt that it would drain the battery all that quickly.

@williamx I believe functional languages lean more toward the C/C++ side. The most common functional languages (such as SML/Haskell/OCaml) are extremely powerful and you can develop nearly anything with them, but their learning curves are extremely steep.

A non-CS example of the ABA problem: you're driving and stopped at a red light. You turn to talk to a friend, and when you later turn back you see that the light is still red. You think the light hasn't changed, but while you were turned away it actually turned green and then back to red.

Most current systems lie in the first circled region, so database systems are generally optimized for throughput in that region. For the sake of high-performance computing and supercomputers with many more cores, it is important to also consider database performance in the middle regions of the graph.

The deadlock in this situation could have been avoided by prioritizing some requests over others. In this case, the BusRdX request would be prioritized over the BusRd request, so the processor would either service the incoming BusRd request or override it with its own BusRdX request, after which the other processor can resend its BusRd request.

For assignment 2, it would have been useful to scale down, since that would have kept the circle-to-grid ratio the same. Since that ratio was where the main parallelism came from, scaling down wouldn't have changed the nature of the problem.

Even though we don't need locks in the message passing model, we need to be careful not to change a value after sending it if it's an asynchronous send, and not to read a value immediately after an asynchronous receive (before it has actually arrived).

This is a very inefficient use of the 64-bit bus, since at any point in time we are only getting 8 bits of information out of the 64 possible. If we were trying to fill a 64-byte cache line, it would take 64 cycles instead of the 8 cycles we would get by using all 8 DRAM chips.

Always try the simplest approach first: using an "atomic" statement for critical sections. It may very well be the case that fine-grained locking or lock-free data structures carry enough overhead to perform worse than a simpler locking algorithm.

This type of convolution is similar to convolving with a signal of length 9 in signal processing. The process can be made faster by taking an FFT of both signals, multiplying them pointwise (instead of convolving), and then taking the inverse FFT. This tends to be the better option when the signals are large enough that the savings overcome the overhead of the FFT and inverse FFT.

Because the capacitors have to be constantly charged and discharged each time the DRAM is read or written, they tend to degrade after repeated use, especially if the DRAM is cheaply fabricated.

Embedded DRAM can be optimized for low latency applications such as program, data, or cache memory in embedded microprocessor or DSP chips. With appropriate memory architecture and circuit design, GHz speeds are possible with on-chip DRAM.

There have historically been different mindsets between people working on computationally intensive programs and those working on data-intensive programs, so their approaches to scaling programs to large systems, and to parallelism and efficient computing, have centered on leveraging different aspects of the system. If these two types of programs could be dealt with as one, we could see more progress in each individual part of the system.

@dyzz, I want to be absolutely clear here that there is no substitute for good substance.

However, when you have done work with good substance, or have a very good idea you want to see your team implement at a future job, that is when it's probably most important to have the skills to communicate that substance well. The best ideas will benefit others, lead to better systems, and so on, and we don't want good ideas to lose out to ideas with less technical merit that are communicated well and thus trick others into thinking they are the best ones.

In other words, good computer architecture often involves good communication.

A test-and-test-and-set lock reduces traffic because each processor spins only on its locally cached copy of the lock value, and only generates bus traffic (the actual test-and-set) when it observes that the lock has been released.

Using locks guarantees that only one thread will modify the stack at a time. Lock-free code does not guarantee that only one thread will attempt to modify the stack at a time, but when several do, exactly one will succeed and the others will start over.

Intermediate images in deep neural networks are sometimes "recognizable" to the human eye, but more often than not it is actually very difficult to tell what the neural net is "looking" at, and yet they are able to make accurate predictions based on those images. We really have a long way to go in understanding how these networks, and by extension the brain, work.

Although this code produces tangible performance benefits, it is only acceptable in absolutely performance-critical applications. For an application where this kind of calculation is not a bottleneck or is not done very often, this code is overkill: it is much harder to maintain and much more prone to error.