README.md

Compiled some OpenCV OpenCL kernels onto Arria 10 FPGA:
- Added several print to console statements to the OpenCV source for Gaussian Blur, which in turn calls SepFilter2D
- From running OpenCL version of OpenCV functions and seeing console output, realized Gaussian Blur cl kernels never getting ran, and just SepFilter2DRow and SepFilter2DCol kernels getting called
- Also added a print to console statement for the build_options being passed to SepFilter2DRow and SepFilter2DCol
- Found the ProgramSource objects for SepFilter2DRow and SepFilter2DCol kernels, and put these into a SepFilter2DRow/Col.cl file
- Added the parameters from the respective build_options strings to the two .cl kernel files, and appropriately modified the code such that all macros properly defined
- Followed the Altera instructions to compile using aoc compiler; first generated a partial compiled file (.aoco; quick and good for knowing most compile-time errors) and then generated the full compiled file (.aocx; takes over an hour to complete the full compile)

Added FPGA/Altera tools to OpenCV:
- Initially tried creating separate FPGA module to OpenCV that was a duplicate of the core module
- Motive was that we had two conflicting OpenCL's to deal with - Altera for the few FPGA converted cl kernels, and normal NVidia GPU for all the remaining kernels. Was worried that there would be a linking conflict from the conflicting OpenCL lib.so files
- This was fully unnecessary actually, as OpenCL actually has two parts to it - a higher level lib.so library, that is highly similar across Altera/Intel/NVidia/etc..., and a lower level part that actually implements the higher level library specific to the manufacturer's architecture
- Thus, by simply correctly detecting and choosing the proper OpenCL device/context, the proper type of OpenCL will be used. This enables using multiple different vendor's OpenCLs in the same code.
- Learning from above experience, actually did this by adding results of aocl link-config to CMakeLists of core module (look at this CMake file)
- Studied vectoradd example from Altera
- Added vectoradd AOCLUtils include files to the core module's src directory
- Changed createProgramWithSource in original code to createProgramFromBinary from Altera, making sure to give correct parameters here
- Had to make sure to preserve the "handle" cl program handling object
- Did the above by creating a code switch in ocl.cpp
- Depending on if ProgramSource parameter to Program constructor is a simple path to Altera binary file (.aocx), or a long string of cl kernel src, constructed the Program object accordingly
- Essentially two calls to pay attention to: createProgramWithSource/createProgramFromBinary, and clBuildProgram
- NOTE: to mold the code switch into the OpenCV callstream, I had to create my own custom ProgramSource objects (containing a simple binary file path instead of large cl kernel string) and pass these in as parameters. To do this, need to first build/make OpenCV without the modifications (as the ProgramSource objects are in the builds directory), and then add the extra ProgramSource objects inside the builds directory (see imgproc/src/filter.cpp for more info), and then remake OpenCV.

Fixed context callback/segfault issue:
- Dug deep into OpenCV source; found a 'getenv(OPENCV_OPENCL_DEVICE)' call
- From this, looked up how to set the env variable properly, giving this: OPENCV_OPENCL_DEVICE=":ACCELERATOR:"

Fixed DMA data transfer misalignment issue:
- in core/src/ocl.cpp, set the following #define CV_OPENCL_DATA_PTR_ALIGNMENT 64
- OpenCV previously had set it to 16, so hard coded it like this and that fixed the issue

Fixed issue of SepFilter2DCol not getting called:
- When calling Kernel::run from ocl.cpp for SepFilter2DRow, the clEnqueueNDRangeKernel call gave a retval of -56, which is the CL_INVALID_GLOBAL_OFFSET error code
- Current version of OpenCL doesn't support an offset parameter and needs this to be NULL; future versions will support it. OpenCV was creating an array containing offsets of 0; fixed issue by replacing with a hardcoded NULL for this parameter of the function call.

Issue: FPGA keeps locking up

Issue: How to just create program from binaries once instead of each time object constructed

ISSUE: FPGA version of code (partial FPGA, partial normal OpenCL) takes VERY LONG TIME; need to get to ~3 ms to compare with normal OpenCL
- Try to take the headers from example code from convolve.cl and add those to SepFilter2DRow and SepFilter2DCol kernels.

TODO: Learn VTune/perf, use this to understand which of the kernels takes largest time to execute; these could be converted to FPGA instead of the simple SepFilter2D kernels

Create two virtual environments, one for TF-CPU, one for TF-GPU; separately compile the Udacity steering workloads in both and get some performance benchmarks