I have long been interested in use cases where a MapReduce implementation using a NoSQL database would be preferable to a simple SQL database. A group of students and I decided to examine the effects of converting a MySQL/PHP based version of the Apriori algorithm to a Couchbase/JS based one. The Apriori algorithm is used to determine the most common sets of items, given a large set of “baskets”. The most common example of this is finding the sets of products most often sold together in a grocery store. In this example, a basket would simply be all of the products a customer purchased at one time. We are interested in discovering which products, or sets of products (say, “beer, milk, and cheese”), appear in a high percentage of baskets.
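To make the idea concrete, here is a minimal in-memory sketch of Apriori in Python. It is not the Brovine or Couchbase code; the function name `apriori` and the `min_support` parameter (a fraction of baskets) are assumptions made for illustration.

```python
from itertools import combinations

def apriori(baskets, min_support):
    """Return {itemset: support} for every itemset that appears in at
    least min_support (a fraction between 0 and 1) of the baskets."""
    baskets = [frozenset(b) for b in baskets]
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: frequent single items.
    current = {}
    for item in {item for b in baskets for item in b}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        prev = list(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent

baskets = [{"beer", "milk", "cheese"}, {"beer", "milk"},
           {"beer", "cheese"}, {"milk", "bread"}]
result = apriori(baskets, min_support=0.5)
```

The prune step is what makes Apriori cheaper than counting every possible set: a set can only be frequent if all of its smaller subsets are frequent too.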

The implementation of Apriori that we tested against was for the Brovine gene database. In this database, the customer stores various information about an animal’s genetic information, including specific genes found and transcription factors. Transcription factors are proteins that bind to specific DNA sequences and control how often nearby genes are transcribed, which regulates gene expression. The customers are interested in finding sets of transcription factors (products) that occur together in different genes (market baskets).

Our group decided to write separate implementations for BerkeleyDB, Couchbase, and MongoDB in an attempt to rule out any differences based on the specific database or programmatic choice. We decided to use Python for the BerkeleyDB and MongoDB implementations, and Node.js for the Couchbase implementation. BerkeleyDB is a key-value store that touts very fast retrieval times, and Couchbase and MongoDB are both document stores. We tried to do as much of the work as possible on the database using MapReduce for the MongoDB and Couchbase implementations, and used the BerkeleyDB implementation as a control: we wanted to see whether MapReduce is actually a good choice for the Apriori algorithm. In that implementation, BerkeleyDB is used only to store and retrieve the baskets.
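The counting step of Apriori maps naturally onto MapReduce. The sketch below emulates that shape in plain Python; it does not use the MongoDB or Couchbase APIs, and the helper names `map_basket`, `reduce_counts`, and `count_itemsets` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def map_basket(basket, k):
    """Map step: emit (itemset, 1) for every k-item subset of one basket."""
    for subset in combinations(sorted(basket), k):
        yield subset, 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each itemset key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

def count_itemsets(baskets, k):
    emitted = (pair for basket in baskets for pair in map_basket(basket, k))
    return reduce_counts(emitted)

pair_counts = count_itemsets([["a", "b", "c"], ["a", "b"], ["b", "c"]], k=2)
```

Sorting each basket before emitting keeps the keys canonical, so ("a", "b") and ("b", "a") are counted as the same itemset.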

For our experiments, we tested each implementation (the original Brovine implementation, BerkeleyDB, MongoDB, and Couchbase) on the original dataset, which consisted of about 50 baskets with 100 items each. To view the effects of a larger database on each implementation, we also duplicated the dataset 10 and 100 times, and reran the tests.

The results we received were very surprising. The BerkeleyDB implementation crushed all of the other new implementations because of its database retrieval time: they were more than seven times slower. However, every new implementation we wrote was faster than the original algorithm, which used MySQL/PHP; the BerkeleyDB implementation was almost 50 times faster.

Consistency in modern NoSQL-style databases has always been a hot topic, and it is indeed critical when determining the correct DBMS for a situation. I wrote a paper explaining some of the differences between the approaches of Dynamo, Amazon’s database solution, and the fault-tolerant database behind Chubby, Google’s lock service. Each database was examined in the context of Eric Brewer’s CAP theorem, which states that a DBMS cannot completely guarantee consistency (C), availability (A), and partition tolerance (P) at the same time. This result was proven by a research group at MIT several years later. Neither of the two DBMSs examined guarantees all three properties; instead, each provides a novel solution that sacrifices a little bit of each property to meet the specifications of the application for which the database was built.

Dynamo uses an unusual approach to consistency. Due to the workload of the applications that Amazon provides, such as the shopping cart, they needed a database that would provide up to 99.9% availability, while also being distributed across hundreds of thousands of servers around the world. However, the CAP theorem then states that we cannot guarantee consistency, given that we need availability and partition tolerance. Amazon found a way around the need for guaranteed consistency by allowing any node in a cluster of Dynamo servers to accept a write at any time without checking for conflicts. Conflicts are instead surfaced at read time: if multiple versions of a value are found, all of them are returned, and the application is responsible for resolving them.
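The read-time reconciliation described above can be sketched with a toy store. This is a simplification for illustration and not Dynamo's implementation: Dynamo tracks version history with vector clocks, whereas the hypothetical `SiblingStore` below just takes the list of versions the writer has already seen.

```python
class SiblingStore:
    """Toy key-value store that never rejects a write and keeps
    concurrent versions side by side until a reader reconciles them."""

    def __init__(self):
        self._versions = {}

    def put(self, key, value, context=()):
        # context lists the versions the writer has seen; those are
        # superseded, anything else survives as a conflicting sibling.
        kept = [v for v in self._versions.get(key, []) if v not in context]
        self._versions[key] = kept + [value]

    def get(self, key):
        # Return every surviving version; the caller must reconcile.
        return list(self._versions.get(key, []))

def merge_carts(versions):
    """Application-level conflict resolution: union of all cart versions."""
    merged = frozenset()
    for cart in versions:
        merged |= cart
    return merged

store = SiblingStore()
store.put("cart", frozenset({"book"}))
seen = store.get("cart")
# Two nodes accept writes based on the same older version.
store.put("cart", frozenset({"book", "pen"}), context=seen)
store.put("cart", frozenset({"book", "mug"}), context=seen)
siblings = store.get("cart")
merged = merge_carts(siblings)
store.put("cart", merged, context=siblings)
```

A union merge suits a shopping cart because an item added on either side is kept; the trade-off is that a deleted item can reappear after a merge.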

Chubby, on the other hand, must guarantee consistency and must be distributed, leaving availability to suffer. Consistency is required because the types of data that Chubby handles are not tolerant of conflicts: it is easy to merge a shopping cart, for example, but not so easy to merge two versions of a term paper. The Paxos algorithm was implemented to guarantee consistency in a partitioned environment, even if a subset of servers goes down.

Whenever I’m working on my website, I always find working with the error logs from Apache, MySQL and PHP a hassle. Splunk tries to solve this problem by keeping track of the log files you have, and letting you effectively search through each entry. It works fairly well. They also claim to use Artificial Intelligence to predict threats to your website. After some digging, I found that this functionality is only available as a plugin – a very expensive one.

I decided that this would be a perfect project to undertake: a free system like Splunk that attempts to find website users with suspicious activity. Working with three other students, we built a system in Java attempting to mimic the functionality claimed by Splunk.

The system first watched for changes to a set of files chosen by the user (essentially a simple implementation of the tail command). Each time a new log entry was recorded, the system notified each of the “researchers” that were active. A researcher is simply a class that accepts log entries as strings, parses them as it chooses, and keeps tabs on possible threats to the system. When it deems that there is a threat, it notifies a process about the nature, severity, and the IP address of the threat. The system is exemplified by this dependency graph:

The process (Commander) that receives threats from researchers aggregates all threat severities for each IP address and reports to the user once a user-defined level has been reached. The threat level of each IP address decays over time, so that steady, long-term use of the website counts as good activity rather than bad. Therefore, an IP address that accesses the website over a thousand times in a few minutes would have a much higher score than an IP address that accesses the website the same number of times in a month.
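One way to get this behavior is a score that decays exponentially between reports. The sketch below is in Python rather than the Java we used, and the exponential decay, the `half_life_seconds` parameter, and the method names are all assumptions for illustration, not the original design.

```python
import math

class Commander:
    """Aggregates threat severities per IP address with a decaying score."""

    def __init__(self, threshold, half_life_seconds):
        self.threshold = threshold
        self.decay_rate = math.log(2) / half_life_seconds
        self._scores = {}  # ip -> (score, time of last update)

    def score(self, ip, now):
        """Current score for ip, decayed since its last report."""
        value, last = self._scores.get(ip, (0.0, now))
        return value * math.exp(-self.decay_rate * (now - last))

    def report(self, ip, severity, now):
        """Add a researcher's report; return True once ip crosses the threshold."""
        current = self.score(ip, now) + severity
        self._scores[ip] = (current, now)
        return current >= self.threshold

commander = Commander(threshold=10, half_life_seconds=60)
burst_first = commander.report("10.0.0.1", 6, now=0)
burst_second = commander.report("10.0.0.1", 6, now=1)    # one second later
slow_first = commander.report("10.0.0.2", 6, now=0)
slow_second = commander.report("10.0.0.2", 6, now=600)   # ten minutes later
```

Two reports a second apart push the first address over the threshold, while the same two reports spread over ten minutes do not, because the earlier score has mostly decayed.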

The two researchers we implemented to show the system’s functionality are the FrequentAccessResearcher and the ErrorResearcher. The FrequentAccessResearcher is responsible for determining whether a user is attempting to hit the website with a large number of requests. If a user makes enough accesses within a short amount of time, this researcher reports the user’s IP address to the commander. The ErrorResearcher checks whether a user’s requests result in errors. Such a user will also be reported to the commander. With this researcher, we attempt to find users who may be attempting to exploit the website by fuzzing, which frequently results in errors due to the random nature of the requests sent.
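The frequent-access check can be sketched as a sliding window per IP address. This is an illustrative Python version, not the original Java class: it assumes the log entry has already been parsed into an IP address and a timestamp, and the `max_hits`, `window_seconds`, and `notify` parameters are hypothetical.

```python
from collections import defaultdict, deque

class FrequentAccessResearcher:
    """Reports an IP address that makes too many requests in a short window."""

    def __init__(self, max_hits, window_seconds, notify):
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        self.notify = notify  # callback(ip, severity), stands in for the commander
        self._hits = defaultdict(deque)

    def on_entry(self, ip, timestamp):
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop hits that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        if len(hits) > self.max_hits:
            self.notify(ip, 1)

reports = []
researcher = FrequentAccessResearcher(
    max_hits=3, window_seconds=10,
    notify=lambda ip, severity: reports.append((ip, severity)))
for t in (0, 1, 2, 3):        # four hits within three seconds
    researcher.on_entry("10.0.0.1", t)
for t in (0, 20, 40, 60):     # four hits spread over a minute
    researcher.on_entry("10.0.0.2", t)
```

Only the first address is reported, since the second never has more than one hit inside any ten-second window.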

Brovine is a website that allows users to manipulate gene data gathered through experimentation. Originally started as a project in my advanced databases class, it aims to eliminate the manipulation and haphazard storage of many Excel files that were once used by Animal Science researchers.

The website backend is a glorified LAMP stack, with the CodeIgniter framework used to speed up the development process. CodeIgniter is an MVC framework that provides many security and stability features without the complexity or massive configuration files that plague many Java web frameworks. Making SQL calls to MySQL and returning AJAX data was very simple using CodeIgniter.

On the frontend, many different tools were used. Bootstrap was used to give the website a very clean look and let us worry about the implementation features instead of design. DataTables let us easily create tables to display, sort, and filter the data we retrieved from the server on the fly. Uploadify was used to allow the customer to upload multiple files to the server at the same time. The customer had hundreds of data files that needed to be uploaded to the server, and anticipated many more in the future, so this was an essential feature of the website.

Each of the DataTables is dynamically populated using AJAX, which ensures a clean user interface. All of this requires a great deal of JavaScript to achieve, so jQuery was used to simplify DOM selection and manipulation. The customer required token-based filtering, and the jQuery TokenInput plugin was the perfect solution: an AJAX-lookup suggestion box as the user typed, with multiple token selection.

During the first iteration of Brovine, the JavaScript was haphazardly written due to the time constraints our team had (the project implementation was to be completed in 7 weeks). However, this has been partially changed. The CommonJS API is utilized to provide a much more encapsulated solution, while the BrowserBuild tool concatenates all of the required modules into a single JavaScript file which is included in the page. UglifyJS2 is used to minify this code for the production environment. The result of these tools is a codebase that is much easier to maintain and that virtually eliminates the code duplication and file inclusion headaches present in the previous version.

If you’d like to view the code, visit the GitHub repo for the project implementation.

Spline interpolation, a topic of Numerical Analysis, involves deriving the coefficients of an equation that models a curve that goes through two or more points. This technique is used heavily in data visualization and graphics to turn a discrete set of points into a continuous line. In the Advanced Computer Architectures class I took, my group created an implementation of a spline interpolation algorithm using an NVIDIA GPU and the CUDA framework, in an attempt to make an HDD data visualization tool easier to use.

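The type of splines we chose to use (and that look the most natural) are called “natural cubic splines.” This means that the curve is made up of cubic polynomials, one between each pair of neighboring points, joined so that the curve and its first and second derivatives are continuous; “natural” means the second derivative is zero at the two endpoints.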
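For reference, here is a small serial sketch of how natural cubic spline coefficients can be computed, using the standard tridiagonal solve. It is plain Python on the CPU, not our CUDA implementation, and the function names are chosen for illustration.

```python
from bisect import bisect_right

def natural_cubic_spline(xs, ys):
    """Return per-interval coefficients (a, b, c, d) such that
    S_i(x) = a + b*(x - x_i) + c*(x - x_i)**2 + d*(x - x_i)**3
    on [x_i, x_{i+1}]. xs must be strictly increasing."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Right-hand side of the tridiagonal system for the c coefficients.
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = (3 * (ys[i + 1] - ys[i]) / h[i]
                    - 3 * (ys[i] - ys[i - 1]) / h[i - 1])
    # Forward sweep of the tridiagonal solve.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2 * (xs[i + 1] - xs[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution; c[0] = c[n] = 0 is the "natural" boundary condition.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2 * c[j]) / 3
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])
    return [(ys[i], b[i], c[i], d[i]) for i in range(n)]

def evaluate(xs, coeffs, x):
    """Evaluate the spline at x, clamping to the first or last interval."""
    i = min(max(bisect_right(xs, x) - 1, 0), len(coeffs) - 1)
    a, b, c, d = coeffs[i]
    dx = x - xs[i]
    return a + dx * (b + dx * (c + dx * d))

xs = [0.0, 1.0, 2.0]
coeffs = natural_cubic_spline(xs, [0.0, 1.0, 0.0])
midpoint = evaluate(xs, coeffs, 0.5)
```

Once the coefficients are known, each evaluation only touches its own interval, which is why drawing many points along a spline parallelizes well.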

Here is a set of before and after screenshots of the tool:

As you can see, it is much easier to distinguish where a specific data set (one line) is going on the right, when using “natural cubic splines.” On the left, it simply looks as if there are sets of line segments that are somehow correlated.

Qt is an application framework for C++ that provides relatively easy cross-platform GUI creation. Unfortunately, it was not made with the CUDA application framework in mind – at all. To get Qt to work with CUDA, there are a few little tricks that you need to incorporate into your project’s .pro file. The approach I’m using has worked on Ubuntu 12 LTS, Mac OS X 10.8 (it’s buggy, though), and Windows.

First things first: I have not gotten Qt to compile my CUDA code. If anyone can suggest a way to do that I’d like to hear it. Next, make sure you have the GPU Computing Toolkit installed in addition to the GPU SDK.

Next I just included the folders containing the cudart library.

For Mac:
DEPENDPATH += /usr/local/cuda/lib .
INCLUDEPATH += /usr/local/cuda/lib .

After you’ve got your include path set up, just add -lcudart:
LIBS += -lcudart

The difficult part, at least in Windows, was finding the right architecture settings. MS Visual Studio only compiled CUDA code to 32-bit (x86) at the time, hence the ‘Win32’ folder in the include path. Be sure to check the architecture settings of Qt as well, under the “Toolchain” settings. After you’re sure the architecture is right, add the CUDA object files to the LIBS variable as we did with -lcudart.

Now when you attempt to build the project in Qt, it should still fail! Don’t worry. We still need to add the CUDA object files into the now-generated output directory. Drop them into the YourProjectName-build* folder, including any .h files from your CUDA project that you need to link.

Finally, attempt to build again. It should finally compile and then link with your CUDA object files.

Numerical Analysis is all about algorithms that efficiently solve problems which may contain errors or require approximation. One of the most recognizable uses of numerical analysis is the family of algorithms used to compute the square root, which requires finding successively better approximations to arrive at a solution for numbers that are not perfect squares. I wrote two papers on various Numerical Analysis topics, including using Newton’s method and solving linear systems of equations using a computer.

ABSTRACT: Linear systems of equations are notoriously difficult to solve. Most numerical methods for solving linear systems are plagued by inaccuracy in special cases or just being too slow. In this report, four methods will be compared in accuracy when solving a linear system with a very small number as one of the coefficients.

The four methods compared are Naïve Gaussian Elimination, the Gauss-Seidel Method, Iterative Refinement, and Scaled Partial Pivoting. Each of these is plagued by its own problems. Naïve Gaussian Elimination, the standard for hand-solving linear systems, can actually return the wrong answer in some cases. The Gauss-Seidel method only works when the matrix representing the linear system is diagonally dominant. Finally, Iterative Refinement adds extra work onto the already slow Gaussian Elimination process.

In addition, this report will attempt to discover ways to determine if a linear system is fit to be solved by one of the methods described above.
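A small example shows how a very small coefficient breaks Naïve Gaussian Elimination. The sketch below is not taken from the report: the 2x2 system and the `solve_2x2` helper are chosen for illustration, and it uses plain partial pivoting (swap in the row with the larger leading entry) rather than the scaled variant.

```python
def solve_2x2(a, b, pivot=False):
    """Solve the 2x2 system a * [x, y] = b by Gaussian elimination.
    With pivot=True, the row with the larger leading entry goes first."""
    rows = [a[0][:] + [b[0]], a[1][:] + [b[1]]]
    if pivot and abs(rows[1][0]) > abs(rows[0][0]):
        rows[0], rows[1] = rows[1], rows[0]
    # Eliminate x from the second row.
    m = rows[1][0] / rows[0][0]
    rows[1] = [r1 - m * r0 for r0, r1 in zip(rows[0], rows[1])]
    # Back substitution.
    y = rows[1][2] / rows[1][1]
    x = (rows[0][2] - rows[0][1] * y) / rows[0][0]
    return x, y

# 1e-20*x + y = 1 and x + y = 2; the true solution is very close to x = 1, y = 1.
a = [[1e-20, 1.0], [1.0, 1.0]]
b = [1.0, 2.0]
naive = solve_2x2(a, b)
pivoted = solve_2x2(a, b, pivot=True)
```

Without pivoting, dividing by the tiny leading coefficient produces a multiplier of 1e20 that swamps the other entries in floating point, and the computed x comes out as 0 instead of 1. Swapping the rows first keeps the multiplier small and recovers the right answer.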

ABSTRACT: In a small, embedded system where small size and power efficiency are crucial, a programmer must limit the use of memory and processor power to absolutely essential tasks. The firmware we are developing for our new implantable pacemaker is no exception. Our developers have expressed interest in using the square root function, but the microchip we have chosen does not have built-in functionality for this calculation. This report describes our options for implementing sqrt(x), where 0 <= x <= 4, given the constraints on processing power and memory footprint.
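One of the options for sqrt(x) is Newton's method applied to f(r) = r*r - x, which gives the update r = (r + x / r) / 2 and needs only addition and division. The sketch below is an illustration in Python, not the firmware code; the starting guess and the fixed iteration count are assumptions, and inputs very close to zero would need more iterations from this starting guess.

```python
def newton_sqrt(x, iterations=6):
    """Approximate sqrt(x) for 0 <= x <= 4 using Newton's method."""
    if x == 0:
        return 0.0
    r = (x + 1) / 2  # simple starting guess, never below the true root
    for _ in range(iterations):
        # Newton update for f(r) = r*r - x.
        r = (r + x / r) / 2
    return r

root_of_two = newton_sqrt(2.0)
```

A fixed number of iterations keeps the running time predictable, which matters on a small embedded chip; the error roughly squares with each step once the guess is close.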