Project iDiary

From DRLWiki


Motivation/Challenges

"What did you do today?" When we hear this question, we try to think back to our day's activities and locations. When we end up drawing a blank on the details of our day, we reply with a simple, "not much." Remembering our daily activities is a difficult task. For some, a manual diary works. For the rest of us, however, we don't have the time to (or simply don't want to) manually enter diary entries. The goal of this project is to create a system that automatically generates answers to questions about a user's history of activities and locations.

This system uses a user's GPS data to identify locations that have been visited. Activities and terms associated with these locations are found using latent semantic analysis and then presented as a searchable diary. One of the big challenges of working with GPS data is its sheer volume, which makes it difficult to store and analyze. This project addresses the challenge by first reducing the data with compression algorithms. It is important that this compression neither reduces the fidelity of the information in the data nor significantly alters the results of any analyses that may be performed on it. After this compression, the system analyzes the reduced dataset to answer queries about the user's history.
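As an illustration of the first step, turning raw GPS fixes into visited locations, the sketch below detects "stay points": maximal runs of consecutive fixes that remain within a small radius for a minimum dwell time. The function names, radius, and dwell threshold here are illustrative assumptions, not the project's actual algorithm or parameters.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))

def stay_points(track, radius_m=100, min_dwell_s=300):
    """Detect visits in `track`, a time-ordered list of
    (lat, lon, unix_time) tuples: a visit is a maximal run of fixes
    within radius_m of the run's first fix lasting >= min_dwell_s.
    Returns (mean_lat, mean_lon, t_arrive, t_depart) per visit."""
    visits, i = [], 0
    while i < len(track):
        j = i
        while (j + 1 < len(track)
               and haversine_m(track[i][:2], track[j + 1][:2]) <= radius_m):
            j += 1
        if track[j][2] - track[i][2] >= min_dwell_s:
            n = j - i + 1
            lat = sum(p[0] for p in track[i:j + 1]) / n
            lon = sum(p[1] for p in track[i:j + 1]) / n
            visits.append((lat, lon, track[i][2], track[j][2]))
        i = j + 1
    return visits
```

Each detected visit can then be matched against external location information (e.g. nearby points of interest) to attach activities and searchable terms.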

Why is it hard? This challenge is hard for two reasons. First, the system has to keep track of a user's visited locations. This is difficult because GPS data comes in very large quantities, as a few back-of-the-envelope calculations show. One GPS packet (which includes latitude, longitude, and a timestamp) is on the order of 100 bytes. If a single phone collects one GPS packet every second, it would accumulate about 10 megabytes of data per day. In 2010, approximately 300 million smartphones were sold [3]. If even a third of these phones were to continuously collect GPS data, about 1 petabyte of data would be generated each day. That is enough to fill one thousand 1-terabyte external hard drives every day.
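The estimate above can be checked with a few lines of arithmetic (the figures are order-of-magnitude, as in the text):

```python
# Back-of-the-envelope check of the data-volume estimate.
PACKET_BYTES = 100               # one GPS packet: lat, lon, timestamp
SECONDS_PER_DAY = 24 * 60 * 60

per_phone_per_day = PACKET_BYTES * SECONDS_PER_DAY
# 8,640,000 bytes ~ 8.6 MB, i.e. roughly 10 MB per phone per day.

phones = 100_000_000             # a third of ~300 million smartphones
total = phones * per_phone_per_day
# ~8.6e14 bytes ~ 0.9 PB, i.e. roughly 1 petabyte per day.
```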

Large quantities of data are difficult to store and even harder to analyze. Generally, having more data also means having more noise in the data. This noise makes it difficult to distill the important information. We need to find ways to analyze this large amount of noisy data.

The second reason why this challenge is difficult is the need for activity recognition: the system must convert GPS data into the activities performed by the user. This translation of raw data into human-readable text requires the system to associate external information about locations with the locations' coordinates, and then to parse this information to determine the activities performed at each location. Both steps are error-prone, which makes activity recognition hard.

Why is it Interesting? Being able to automatically generate answers to queries about a user's history is interesting not only because the question comes up on a daily basis, but also because a solution to this problem can be applied to many other problems. A solution to this challenge would be able to manage large quantities of GPS data. Looking beyond GPS data, the ability to manage large quantities of data would be valuable in commercial businesses, scientific research, government analyses, and many other applications. As such, this challenge is of interest not only to forgetful users who would like to remember their previous activities and visited locations, but also to many other fields.

Our approach

The solution is to compress this large amount of data into a smaller, less noisy sketch of the data, and then run algorithms to analyze this compressed data. Compressing the data first is the key insight, as it allows the system to manage large quantities of data. This solution uses novel coreset creation and trajectory clustering algorithms to compress the data. After compression, the solution uses latent semantic analysis with the compressed data to perform search queries.
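The project's compression relies on novel coreset and trajectory-clustering algorithms. As a stand-in to illustrate what trajectory compression does, here is the classic Douglas-Peucker line-simplification algorithm, which discards points that deviate from a straight-line approximation by less than a tolerance. This is a sketch in planar coordinates, not the project's actual method.

```python
def simplify(points, eps):
    """Douglas-Peucker simplification of a polyline given as a list of
    (x, y) points: keep the interior point farthest from the chord
    between the endpoints, and recurse only if its deviation exceeds
    eps; otherwise replace the whole run by the chord."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = (dx * dx + dy * dy) ** 0.5 or 1.0

    def dist(p):
        # Perpendicular distance from p to the chord.
        return abs(dy * (p[0] - x1) - dx * (p[1] - y1)) / norm

    k, d = max(((i, dist(p)) for i, p in enumerate(points[1:-1], 1)),
               key=lambda t: t[1])
    if d <= eps:
        return [points[0], points[-1]]
    # Split at the farthest point and simplify both halves.
    return simplify(points[:k + 1], eps)[:-1] + simplify(points[k:], eps)
```

A nearly straight GPS trace collapses to its two endpoints, while genuine turns are preserved, which is the kind of fidelity-preserving reduction the text describes.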

What is a coreset?

A coreset is a small subset of the data, possibly augmented with additional information, that represents the original dataset with respect to a family of queries or cost-function computations. This representation is approximate, and it allows us to control the tradeoff between approximation accuracy and coreset size/complexity.
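To make the definition concrete, here is a toy "coreset" built by uniform sampling with reweighting, so that weighted costs over the sample approximate costs over the full data. Real coresets use more careful constructions with provable guarantees; the names and the 1-mean cost below are illustrative assumptions.

```python
import random

def uniform_coreset(points, m, seed=0):
    """Toy coreset: a uniform sample of m points, each carrying weight
    n/m so that weighted sums over the sample estimate sums over the
    full dataset."""
    rng = random.Random(seed)
    n = len(points)
    return [(p, n / m) for p in rng.sample(points, m)]

def cost(data_or_coreset, q):
    """Sum of (weighted) squared distances to a query point q,
    i.e. the 1-mean cost. Accepts raw points or (point, weight) pairs."""
    total = 0.0
    for item in data_or_coreset:
        p, w = item if isinstance(item, tuple) else (item, 1.0)
        total += w * (p - q) ** 2
    return total
```

Evaluating `cost` on the small weighted sample approximates evaluating it on all the data, for any query point `q` — this is the query-family guarantee the definition refers to, here only in expectation rather than with a worst-case bound.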

iDiary system

During the project, several generations of the system have been developed. An overall schematic of the system is given below.

The core of the system is the coreset construction, which generates a compact representation of the data streams obtained from the user (GPS, images).