ComputeCube

Saturday, 17 June 2017

The code below accompanies the 3D Gaussian Convolution post. Note that the code is not generic: it only calculates a 5x5x5 Gaussian on a set of 256x256 planes. It is very easy to modify to support other sizes and could even be templated. The Z convolution is not included in the code, but remember that in this step you would need to do the final division.

Friday, 16 June 2017

I mentioned in a previous post that I've pre-ordered a ReMarkable, mostly for scribbling down ideas and reading technical documentation.

It occurred to me that these pads could be ideal for contract negotiation. Many years ago we looked at the possibility of using iPads for this purpose but the supporting technology / cloud infrastructure didn't really exist to the extent it does now.

ReMarkable don't currently offer handwriting recognition, but they do have Wi-Fi connections and hopefully some form of SDK. In the insurance markets a lot of contracts are still negotiated on paper, face to face. For example, we see all sorts of documents where the risk has been apportioned by simply adding handwritten percentages to a document as the parties negotiated in person.

Picture one of these ReMarkables with the Exari technology stack: in a simple scenario, custom contracts could be created on the screen in real time, signed by the parties right on the ReMarkable pad with its pen, then sent off to the cloud for capture into our Universal Contract Model.

A more complex use case could see two parties, each with their own ReMarkable, negotiating their own version of a contract. Exari capture, match, and analysis technologies would operate in the background, highlighting changes, showing areas of risk, and populating new clauses until the eventual agreement is signed and its data captured into the Universal Contract Model.

I really can't wait to get one of these devices now, and hopefully an SDK to go with it!

Friday, 9 June 2017

After weeks of procrastination I pre-ordered a ReMarkable. With any luck it will live up to the hype and the pre-production reviews.

What eventually convinced me? Well, once again leaving some notes I was working on at home...

I'm a little concerned about the latency (55 ms), which seems a bit high, but the convenience of not having to manage stacks of notepads, while also getting a built-in PDF reader, will hopefully outweigh the potential latency issue.

Thursday, 8 June 2017

Adsensa, the company I joined when it was a startup over ten years ago, has been acquired by Exari Systems. As this is a personal blog I hardly, if ever, comment on my job. I'm breaking from tradition here as it is a very exciting and complementary deal between the two companies.

We are now part of a truly global company with a substantial increase in engineering resources and due to the geographic diversity we can offer improved support to our clients.

The mature Exari workflow should make an immediate improvement to the Adsensa products once integrated. Our industry leading capture, match and analysis tools will enable Exari clients to process legacy documents or documents they receive during the various phases of contract negotiation.

From a development perspective we are very excited about Exari's Universal Contract Model. This is something rather groundbreaking, and speaking from an Adsensa perspective we are looking forward to integrating our technology to populate the contract models.

It is rather pleasing to see something we have worked so hard on for so many years becoming part of something even bigger and making a real difference to the operations of our clients.

Wednesday, 7 June 2017

I'm currently working on some image manipulation that requires a Gaussian point spread function that isn't uniform in the x and y directions, so I thought it worth revisiting the derivation from an older blog post, along with some thoughts on optimization:

In one dimension the Gaussian function looks like:

\$f(x) = Ae^{- \frac{(x-b)^2} {2\sigma^2} } \$

where \$e\approx2.718281828 \$ is Euler's number, \$\sigma\$ is the standard deviation, which controls the spread of the curve, and b is the point over which the bell curve will be centred. You should recognize the \$(x-b)^2\$ as the first step in calculating the distance between two points. A is the amplitude of the function. The bigger A is, the higher the peak produced.

As can be seen from this equation, \$\sigma\$ controls the spread of the bell-shaped curve produced. If it's not immediately obvious, keep in mind that dividing by a bigger \$2\sigma^2\$ makes the exponent smaller in magnitude, and \$e^0 = 1 \$, so a larger \$\sigma\$ keeps more of the curve close to the peak value A.
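To make the effect of \$\sigma\$ concrete, here is a small sketch of the one-dimensional function (the helper name gauss1d is mine, not from the post):

```cpp
#include <cmath>

// The one-dimensional Gaussian from the post:
// f(x) = A * exp(-(x - b)^2 / (2 * sigma^2)),
// centred on b with amplitude A and spread sigma.
double gauss1d(double x, double A, double b, double sigma)
{
    const double d = x - b;
    return A * std::exp(-(d * d) / (2.0 * sigma * sigma));
}
```

At \$x=b\$ the exponent is zero, so \$f(b)=A\$. Two units from the centre, \$\sigma=1\$ leaves only \$e^{-2}\approx0.135\$ of the peak, while \$\sigma=2\$ still gives \$e^{-0.5}\approx0.61\$ of it - a wider curve.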

Now that we understand the function in one dimension, let's extend it to two dimensions; after all, that's what I am interested in for my image manipulation.

It is very simple to extend the function to two dimensions, as we are really looking at the distance of a point from a centre location. For now let's assume our \$\sigma\$ (curve spread) is the same in each direction and that our centre point is \$(b_x,b_y)\$:

\$f(x,y) = Ae^{- \frac{(x-b_x)^2+(y-b_y)^2} {2\sigma^2} } \$

Again you should see the \$(x-b_x)^2+(y-b_y)^2\$ as the squared distance of the point from the centre point - we are just missing the square root.

This is the equation we use to calculate our Gaussian PSF kernel. For this example we are going to use the following parameters:

\$A=15\$ and \$\sigma=1.4\$

We then take the integer points from -2 to 2 (5 values in each direction) and plug them into our equation as x,y values. The resultant value is rounded and stored in our matrix.

For Example: x=2, y=2

\$f(2,2) = Ae^{-(\frac{2^2+2^2}{2(1.4^2)})}\$

\$f(2,2) = Ae^{-(\frac{8}{3.92})}\$

\$f(2,2) = Ae^{-2.0408}\$

\$f(2,2) = 15 \times 0.1299\$

\$f(2,2) \approx 1.95\$, which can be rounded up to 2, and this is the value we store at 2,2 in our kernel.

Note that as we are squaring the differences from the position to the centre in each direction, the values at -2,-2; -2,2; and 2,-2 are all the same as the 2,2 value calculated above. We can use this symmetry as an optimization when calculating our kernel coefficients.
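As a sketch of how those coefficients and the four-way symmetry might be computed (the names buildKernel and R are mine, not from the original post):

```cpp
#include <cmath>

// Build the kernel described above (A = 15, sigma = 1.4, centre at 0,0).
// Symmetry means we only need to evaluate the exponential for the
// quadrant x >= 0, y >= 0 and mirror the result into the other three.
constexpr int R = 2;  // kernel radius, 5x5 overall

void buildKernel(int kernel[2 * R + 1][2 * R + 1], double A, double sigma)
{
    const double denom = 2.0 * sigma * sigma;
    for (int y = 0; y <= R; ++y)
        for (int x = 0; x <= R; ++x) {
            const int v = static_cast<int>(
                std::lround(A * std::exp(-(x * x + y * y) / denom)));
            // mirror into all four quadrants
            kernel[R + y][R + x] = v;
            kernel[R + y][R - x] = v;
            kernel[R - y][R + x] = v;
            kernel[R - y][R - x] = v;
        }
}
```

With A = 15 and sigma = 1.4 this reproduces the worked example: the centre is 15, and all four corners come out as 2.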

As mentioned above, the \$\sigma\$ value can vary between the x and y directions. This causes our 2D curve to be stretched/compressed in the x or y direction:

\$f(x,y) = Ae^{- \left( \frac{(x-b_x)^2} {2\sigma_x^2} + \frac{(y-b_y)^2} {2\sigma_y^2} \right) } \$

Knowing our one-dimensional Gaussian function, we can clearly see how this function works: the two component directions are calculated first, added together, then the result used as the power to which we are raising e.

Once we have calculated our kernel coefficients we can apply them to our image. Remember that this is a separable convolution, so don't implement it in the trivial manner by reading in a 5x5 block around your pixel of interest, multiplying, adding and finally dividing. Rather, apply the x and y convolutions separately: that needs 10 reads per pixel instead of 25, and those reads follow a more cache-friendly pattern.
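A minimal sketch of that two-pass approach on a single-channel float image (the name separableBlur and the clamped edge handling are my assumptions; the original post doesn't show this code):

```cpp
#include <algorithm>
#include <vector>

// Two-pass separable convolution, assuming a 1D five-tap kernel 'k' and a
// row-major single-channel float image.  Edge pixels are clamped; 'norm'
// is the sum of the full 2D kernel, so we divide once, after both passes.
constexpr int TAPS = 5, HALF2 = TAPS / 2;

void separableBlur(std::vector<float>& img, int w, int h,
                   const float (&k)[TAPS], float norm)
{
    std::vector<float> tmp(img.size());
    // horizontal pass: 5 cache-friendly reads along each row
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int t = 0; t < TAPS; ++t) {
                const int xx = std::max(0, std::min(x + t - HALF2, w - 1));
                acc += img[y * w + xx] * k[t];
            }
            tmp[y * w + x] = acc;
        }
    // vertical pass, then the single final division
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int t = 0; t < TAPS; ++t) {
                const int yy = std::max(0, std::min(y + t - HALF2, h - 1));
                acc += tmp[yy * w + x] * k[t];
            }
            img[y * w + x] = acc / norm;
        }
}
```

Because the horizontal and vertical passes use the same 1D kernel, norm is simply the square of the 1D kernel's sum; a quick sanity check is that a constant image comes out unchanged.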

Wednesday, 31 May 2017

NVidia announced CUDA 9 a few weeks ago. I've been using CUDA since v1.1 and compute capability 1, and things have matured significantly over the years.

The new CUDA adds support for the new Volta architecture, C++14, faster libraries, and Tensor Core matrix multiply, which is clearly targeting deep learning applications. But for me there is one stand-out feature: Cooperative Groups.

The release says that it is a new programming model for managing groups of communicating threads. What does that really mean?

Previously you could only synchronize threads across a thread block, with the __syncthreads() function. Cooperative groups allow you to define groups of threads at the sub-block and multi-block levels, and to synchronize across the entire grid.

The grid sync means you no longer have to have multiple kernels operating in successive launches in order to complete a complex task on a data set. A single kernel can now operate on the data, using something like:

grid_group grid = this_grid();

//do something here

grid.sync();

//do something else here

//etc

You also get a this_multi_grid() variant which will synchronize the kernel across all the GPUs it has been launched on!

You no longer have to wait for the sync at the end of a kernel launch and then launch another kernel from the CPU code. Presumably you will still be restricted by the driver timeout on your primary device.

This, coupled with pinned memory / zero copy, means you can have long-running kernels performing all sorts of operations on memory that is streamed into the device.