Abstract:

More and more data about users and their behavior is being collected both on the web and in mobile applications. Aggregating, processing, and making this data available in ways that help to guide decision-making is difficult, however. This thesis presents an algorithm for pre-processing user behavior data to produce a data structure that answers queries about the numbers of distinct users matching a given set of criteria. These kinds of queries can be made directly by planners and managers, or by systems such as forecasting and reporting systems that need information about counts of distinct users.

In our algorithm, users are aggregated according to their transaction profile, that is, the set of behaviors that each user exhibits in the input data. This aggregation greatly reduces the memory footprint of the information. Our data structure can only be queried for counts, however, and not for the behavior of specific users: data identifying individual users is lost upon aggregation.
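The aggregation step can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the event format of (user_id, behavior) pairs, and the query semantics (a user matches when their profile contains every requested behavior) are all assumptions made for illustration.

```python
from collections import Counter


def build_profile_counts(events):
    """Aggregate (user_id, behavior) pairs into counts per transaction profile.

    Hypothetical helper: the input format is an assumption, not the thesis's.
    """
    behaviors_by_user = {}
    for user_id, behavior in events:
        behaviors_by_user.setdefault(user_id, set()).add(behavior)
    # User identifiers are discarded here; only profile -> user count remains.
    return Counter(frozenset(b) for b in behaviors_by_user.values())


def count_users(profile_counts, required):
    """Count distinct users whose profile contains every required behavior."""
    required = frozenset(required)
    return sum(n for profile, n in profile_counts.items() if required <= profile)


# Illustrative input data (made up for this sketch).
events = [
    ("u1", "view"), ("u1", "buy"), ("u1", "view"),
    ("u2", "view"),
    ("u3", "view"), ("u3", "buy"),
    ("u4", "search"),
]
profile_counts = build_profile_counts(events)
view_and_buy = count_users(profile_counts, ["view", "buy"])
```

Four users collapse into three profiles here, and repeated events by the same user do not inflate the counts. With real data the reduction depends on how many users share a profile.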

We advocate pruning some transaction profiles to further reduce the size of the resulting data structures, and provide insights into how this might best be done. After pruning, our output no longer provides exact counts of distinct users, but rather approximations. Such data structures are known in the literature as synopsis data structures.
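As a rough sketch of what pruning could look like, the example below drops every profile shared by fewer than a threshold number of users and optionally rescales the remaining counts by a single global factor. Both the threshold rule and the uniform scaling are assumptions chosen for illustration; the names `prune`, `approx_count`, and `min_users` are hypothetical and the data is made up.

```python
def prune(profile_counts, min_users):
    """Keep only profiles with at least min_users users (assumed pruning rule).

    Also returns a uniform scale factor: total users / users retained.
    """
    kept = {p: n for p, n in profile_counts.items() if n >= min_users}
    total = sum(profile_counts.values())
    retained = sum(kept.values())
    scale = total / retained if retained else 1.0
    return kept, scale


def approx_count(kept, required, scale=1.0):
    """Approximate count of users whose profile contains all required behaviors."""
    required = frozenset(required)
    return scale * sum(n for p, n in kept.items() if required <= p)


# Illustrative profile counts (profile -> number of distinct users).
profile_counts = {
    frozenset({"view"}): 50,
    frozenset({"view", "buy"}): 30,
    frozenset({"view", "search"}): 15,
    frozenset({"view", "buy", "search"}): 5,
}

kept, scale = prune(profile_counts, min_users=10)
unscaled = approx_count(kept, ["buy"])        # undercounts: exact answer is 35
scaled = approx_count(kept, ["buy"], scale)   # uniform correction, still approximate
```

The pruned structure stores three profiles instead of four, at the cost of an answer that is no longer exact. A single global factor moves the estimate closer to the true count in this example, but it cannot recover which queries the pruned users would have matched.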

The method presented in this thesis reduced the memory footprint of our user behavioral test data to less than 5 % of its original size while providing accurate counts, and to less than 1.2 % with pruning. Pruning provides substantial benefits in terms of memory, but can introduce a significant amount of error into the approximate counts. This can be counteracted with scaling. The effectiveness of the algorithm depends on its ability to aggregate the input data enough that the resulting data structures are small and that pruning does not have too large an effect on accuracy.

More and more information about users is being collected on the internet and in mobile applications. This information is highly useful: it can be used to understand users better, or to serve as a basis for forecasts. The information must first be made available, however, either directly to people or to other systems. The algorithm presented in this thesis groups the information collected about the users of a service and builds a data structure from which the sizes of user groups can be queried, that is, the number of unique users matching a given set of criteria.