Currently, I am working on the development of Network Intelligence (NI) software solutions, which involve traffic classification, analysis, and complete decoding of detected protocols and applications. These solutions are characterized by high performance for core network links with speeds of 100 Gbit/s and beyond. They use various technologies (e.g., Deep Packet Inspection, behavioral, heuristic, and statistical analysis) to reliably detect network protocols, applications, and services, and to extract metadata in real time. I am a daily user of G Suite, Atlassian software (JIRA, Confluence), and Git. Since taking this position, I have been working with Agile development methodologies, including SCRUM. During that time, I identified many aspects of SCRUM that are critical from the quality and development-productivity points of view. I also served as a developer channel for customer support, which enabled me to better understand how customers see and use our software and what their priorities are for product development and maintenance. As sharing technical knowledge is my passion, I have organized multiple training workshops related to computer networks and network traffic analysis.

I obtained my PhD in Classification and Analysis of Computer Network Traffic from Aalborg University in Denmark on June 6, 2014. My PhD project was co-financed and co-supervised by Bredbånd Nord A/S, a regional electricity and Internet provider. Through this industrial collaboration, I learned how to collect and understand customer requirements, present high-level concepts and results to the company management, and structure the work in order to reach both the scientific and industrial goals on time.

I was the founder and developer of nDPIng, a next-generation open-source computer network traffic classification tool, which aims at consistent real-time traffic identification on multiple levels: transport-layer protocol, all application-level protocols, type of content, service provider, and content provider. I was also the principal investigator of the Volunteer-Based System for Research on the Internet project, which focused on designing and developing a system able to provide detailed data about applications used on the Internet. This information can be used to learn which applications are most frequently used in a network, to provide users with basic statistics about their Internet connection usage (for example, which kinds of applications their connection is used for the most), and to create scientific profiles of the traffic generated by different applications or groups of applications.

I am used to working independently and covering the entire development process, from architecture, design, implementation, and customer feedback up to bug fixing. Apart from the nDPIng and Volunteer-Based System for Research on the Internet projects, I fully authored 2 industrial projects. Web-Based Client for InDesign Server uses web-based techniques and tools, in collaboration with a headless version of InDesign Server controlled by scripts produced by the designed web interface, to render InDesign documents in real time. The Efficient Invoicing Solution with Offline Synchronization Capabilities project concentrated on creating an invoicing system for a mining company, with a significant number of features that differ from other systems already on the market. The designed and implemented system was in use in around 30 departments of OPA-LABOR for 4 years, successfully satisfying all the requirements set in the project.

I am quick to learn new technologies (e.g., programming languages, development platforms, and frameworks) and to put the new knowledge and skills into practice, which allows me to easily switch between different IT-related fields. I value the ability to solve problems with the help of the Internet, books, or other people above encyclopedic knowledge (e.g., knowing by heart the syntax of a particular programming language or an already existing and documented algorithm).

During my PhD, I was a visiting researcher at Universitat Politècnica de Catalunya (UPC) in Barcelona, Spain, where I worked with the Broadband Communications Research Group on a comparison of Deep Packet Inspection tools for traffic classification. I also visited ntop in Pisa, Italy (collaboration on nDPI) and TELECOM SudParis in Évry, France (collaboration on traffic classification in 802.11 networks). I am the author of 4 journal articles, 8 conference papers, and 3 technical reports on topics related to traffic monitoring and analysis. Two of my papers received awards as top-7% and top-5% papers, respectively. Since 2011, I have given 11 presentations in seminars and guest lectures at Aalborg University in Denmark, TELECOM SudParis in France, the University of Pisa in Italy, the Polytechnic University of Turin in Italy, RWTH Aachen University in Germany, Universitat Politècnica de Catalunya in Spain, the IDA House of Engineers in Denmark, and the Albena Resort in Bulgaria. I have reviewed 15 articles submitted to various journals and conferences.

During my postdoctoral research, I investigated techniques used for tracking users' activity online. Many content providers and online retailers collect large amounts of personal information from their users as they browse the web. The large-scale collection and analysis of personal information constitutes the core business of most of these companies, which use this information for lucrative purposes, such as online advertising and price discrimination. However, most mechanisms used to track users and collect personal information are not well known or are intentionally obfuscated. The main objective was to uncover these mechanisms and understand how they collect, analyze, store, and (possibly) sell this information.

I am also a holder of 2 language certificates: TOEFL iBT (98/120) and Prøve i Dansk 3 (9/12).

Does that sound interesting? If yes, you are welcome to contact me, as I am currently looking for new job opportunities worldwide! I am open to almost any form of employment - I can work as a full-time company employee as well as a contracted project-based consultant. However, I would like to be able to work at least half of the time remotely from home (arranged according to the needs of both the company and me).

R&S®PACE 2 is a next-generation software library that identifies thousands of protocols, applications, and services, and provides deeper insight into application attributes (e.g., real-time performance metrics). R&S®PACE 2 combines the power of the Protocol and Application Classification Engine (PACE) and the decoding engine (PADE), and is also capable of advanced metadata extraction. This solution is characterized by high performance for core network links with speeds of 100 Gbit/s and beyond. It uses various technologies (e.g., Deep Packet Inspection, behavioral, heuristic, and statistical analysis) to reliably detect network protocols, applications, and services, and to extract metadata in real time. Key performance indicators are calculated for deeper insight.

The decoding results of R&S®PACE 2 provide the deepest available information about the current connection. R&S®PACE 2 extracts all important and relevant metadata from a number of network classification results, with a configurable level of detail to suit different use cases. For example, it is possible to decompress HTTP payload and reconstruct all images or videos from Internet sites. The depth of information can be flexibly adjusted to provide just the data actually needed. Internal aggregators gather decoding information from certain decoders and bundle it into classes. For example, even if an e-mail connection takes a long time, the full session decoding information still provides all of the data in one single place. The decoding feature of R&S®PACE 2 is especially useful in network security applications, e.g., the playback of VoIP calls, websites, and chat sessions, or gathering upload and download statistics for various documents.

The aim of this unique project is to bring new quality to the field of traffic classification by providing results on many levels. The clear, unambiguous identification of network flows is meant to be ensured by various classification techniques combined into a single tool. The following information is intended to be given for each flow inspected by the classifier: transport-layer protocol, all application-layer protocols, type of content, service provider, and content provider. See the Projects section for a detailed description.

It is widely known that content providers and online retailers (e.g., Google, Facebook, and Amazon) collect large amounts of personal information from their users as they browse the web. The large-scale collection and analysis of personal information constitutes the core business of most of these companies, which use this information for lucrative purposes, such as online advertising and price discrimination. However, most mechanisms used to track users and collect personal information are still unknown. Our main objective is to uncover these mechanisms and understand how they collect, analyze, store, and (possibly) sell this information.

Personal information on the web can be given voluntarily by the user (e.g., by filling in web forms) or collected indirectly, without the user's explicit knowledge, through the analysis of IP headers, HTTP requests, and search engine queries, or even by JavaScript and Flash programs embedded in web pages. Among the collected data, we can find information of a technical nature (e.g., the browser in use) as well as more sensitive information (e.g., the geographical location or the visited web pages). Webmail services are also known for scanning and processing users' e-mails, even those received from users who did not consent to any kind of message inspection. Online services use various methods to track their users. The most popular techniques are different kinds of browser cookies, fingerprinting the user in the background, and suggesting (or requiring) that the user fill in a profile, so that the web identity can be extended by associating it with the user's real identity.
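
As a toy illustration of the fingerprinting technique mentioned above: a tracker can hash a stable set of browser and system attributes into a single identifier that survives cookie deletion. The attribute names below are illustrative only; real trackers combine dozens of such signals.

```python
import hashlib

def browser_fingerprint(attributes: dict) -> str:
    """Toy sketch of passive fingerprinting: reduce a set of browser/system
    attributes to one stable identifier. Attribute names are illustrative."""
    # Sort keys so the same attributes always produce the same canonical string.
    canonical = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

fp1 = browser_fingerprint({"user_agent": "Mozilla/5.0", "timezone": "UTC+1",
                           "screen": "1920x1080", "fonts": "Arial,Verdana"})
fp2 = browser_fingerprint({"user_agent": "Mozilla/5.0", "timezone": "UTC+1",
                           "screen": "1920x1080", "fonts": "Arial,Verdana"})
print(fp1 == fp2)  # True: identical attributes yield the same identifier
```

Because the identifier is derived from the environment rather than stored on the client, clearing cookies does not change it; only changing the underlying attributes does.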

We investigate whether services use other, unexpected mechanisms to track user activity, such as whether a user's network of contacts and interests are used to build their profile, and what impact this has on their privacy. We also analyze whether online services collect information, using cookies or user fingerprints, while users are logged out of a service, and later combine this information with their online profiles when they log in. We investigate the ability of web services to follow users' activity in private browsing mode and analyze special privacy-focused search engines, testing their capabilities and comparing them with standard search engines. On another front, we investigate the impact of user tracking on price discrimination. Product pricing can be based on the geographical location of the user, but also on user profiles sold by online services.

Our objective: to evaluate the performance of various applications in a high-speed Internet infrastructure.

1. We performed substantial testing of widely used DPI classifiers (PACE, OpenDPI, L7-filter, nDPI, Libprotoident, and NBAR) and assessed their usefulness in generating ground truth, which can be used as training data for Machine Learning Algorithms (MLAs).

2. Because the existing methods (DPI, port-based, statistical) were shown not to be sufficient, we built our own host-based system (VBS) for collecting and labeling network data. The packets are grouped into flows, which are labeled with the process name obtained from the system sockets. See the Projects section for a detailed description.

3. We assessed the usefulness of C5.0 MLA in the classification of computer network traffic. We showed that the application-layer payload is not needed to train the C5.0 classifier, defined the sets of classification attributes and tested various classification modes.

4. We showed how to use our VBS tool to obtain per-flow, per-application, and per-content statistics of traffic in computer networks. Furthermore, we created two datasets composed of various applications, which can be used to assess the accuracy of different traffic classification tools. The datasets contain full packet payloads and they are available to the research community as a set of PCAP files and their per-flow description in the corresponding text files.

5. We designed and implemented our own system for multilevel traffic classification, which provides consistent results on all of the 6 levels: Ethernet, IP protocol, application, behavior, content, and service provider. The system is able to deal with unknown traffic, leaving it unclassified on all the levels, instead of assigning the traffic to the most fitting class. Our system was implemented in Java and released as an open-source project.

6. Finally, we created a method for assessing the Quality of Service in computer networks.
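
The rule-based classification in step 3 can be illustrated with a toy set of decision rules over per-flow statistics. The features and thresholds below are invented purely for illustration; the real attribute sets and rules were learned by the C5.0 algorithm, not hand-written.

```python
def classify_flow(avg_pkt_size: float, avg_inter_arrival_ms: float, dst_port: int) -> str:
    """Toy hand-written stand-in for C5.0-style decision rules over flow
    statistics. Note that no application-layer payload is consulted."""
    if dst_port == 53 and avg_pkt_size < 300:
        return "DNS"
    if avg_pkt_size > 1000 and avg_inter_arrival_ms < 50:
        return "streaming"           # large, closely spaced packets
    if avg_pkt_size < 200 and avg_inter_arrival_ms > 500:
        return "interactive"         # small, sporadic packets
    return "unknown"

print(classify_flow(1400, 10, 443))  # streaming
```

The point of the sketch is that such rules operate only on flow-level statistics (sizes, timings, ports), which is why payload inspection is unnecessary once the classifier has been trained.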

The outcomes are thoroughly described in the technical report Comparison of Deep Packet Inspection (DPI) Tools for Traffic Classification, listed below in the Publications section.

1. We created a dataset of 10 different applications (eDonkey, BitTorrent, FTP, DNS, NTP, RDP, NETBIOS, SSH, HTTP, RTMP), which is available to the research community. It contains 1 262 022 flows captured over 66 days. The dataset is available as a set of PCAP files containing full flows, including packet payloads, together with corresponding text files that describe the flows with all the necessary details, including the application name and the start and end timestamps based on the system sockets.

2. We tested the accuracy of several Deep Packet Inspection tools (PACE, OpenDPI, L7-filter, nDPI, Libprotoident, and NBAR) on our dataset. To test NBAR, we needed to replay the packets to a Cisco router and process the Flexible NetFlow logs. The other tools were tested directly as libraries by a dedicated program, which read packets from the PCAP files and fed them to the classifiers.
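
The first half of such a pipeline - reading raw packets back out of a PCAP file before handing them to a classifier - can be sketched in pure Python. This is a minimal, hedged illustration for the classic little-endian libpcap format only; the real test harness and the classifier interfaces are not reproduced here.

```python
import struct

def read_pcap_packets(data: bytes):
    """Parse a classic libpcap capture (magic 0xa1b2c3d4, little-endian,
    microsecond timestamps) and yield each packet's raw bytes."""
    magic, = struct.unpack_from("<I", data, 0)
    assert magic == 0xA1B2C3D4, "only little-endian microsecond pcap handled"
    offset = 24  # the pcap global header is 24 bytes
    while offset + 16 <= len(data):
        # Per-record header: ts_sec, ts_usec, captured length, original length.
        ts_sec, ts_usec, incl_len, orig_len = struct.unpack_from("<IIII", data, offset)
        offset += 16
        yield data[offset:offset + incl_len]
        offset += incl_len

# Build a tiny in-memory capture with two fake packets to demonstrate.
header = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)  # linktype 1 = Ethernet
record = lambda payload: struct.pack("<IIII", 0, 0, len(payload), len(payload)) + payload
capture = header + record(b"\x00" * 60) + record(b"\xff" * 42)

packets = list(read_pcap_packets(capture))
print(len(packets))  # 2
```

Each yielded byte string would then be passed to the classifier library under test, which is the part that necessarily differs per tool.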

Development of an application for creating, managing, and printing invoices. This program was in use in around 30 departments of OPA-LABOR for 4 years. See the Projects section for a detailed description.

The aim of this unique project is to bring new quality to the field of traffic classification by providing results on many levels. The results obtained from nDPIng are easy to account for, and they are given as: protocol (beginning from TCP/UDP and going into higher levels), content type, service provider (the well-known name of the remote host, e.g., Facebook for web browser flows from Facebook), and content provider (the content delivery network (CDN), e.g., Akamai or Google). Examples of the results provided in the non-verbose mode:
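
The actual nDPIng example outputs are not reproduced here; as a rough sketch of the multilevel result described above, one verdict per flow could be modeled like this (all field names and values are illustrative, not nDPIng's real output format):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClassificationResult:
    """One hypothetical multilevel verdict for a single flow."""
    protocol_stack: list             # e.g., ["TCP", "HTTP"] - transport upwards
    content_type: Optional[str]      # e.g., "image", "video"
    service_provider: Optional[str]  # e.g., "Facebook"
    content_provider: Optional[str]  # e.g., the CDN, such as "Akamai"

r = ClassificationResult(["TCP", "HTTP"], "image", "Facebook", "Akamai")
print("/".join(r.protocol_stack))  # TCP/HTTP
```

The key design point is that the levels are independent: a flow can be classified on the protocol level while the service or content provider remains unknown (`None`), instead of being forced into a single flat label.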

This project focuses on designing and developing a system able to provide detailed data about applications used on the Internet. This information can be used to learn which applications are most frequently used in a network, to provide users with basic statistics about their Internet connection usage (for example, which kinds of applications their connection is used for the most), to create scientific profiles of the traffic generated by different applications or groups of applications, etc.

The developed Volunteer-Based System has a client-server architecture. Clients are installed on machines belonging to volunteers, while the server is installed on a computer located on the premises of the data-collecting entity. Each client registers information about the data passing through the computer's network interfaces. Captured packets are grouped into flows. A flow is defined as a group of packets that share the same local and remote IP addresses, local and remote ports, and transport-layer protocol. For every flow, the client registers: the anonymized identifier of the client, the start timestamp of the flow, the anonymized local and remote IP addresses, the local and remote ports, the transport protocol, the anonymized global IP address of the client, and the name of the application associated with the flow. The name of the application is taken from the system sockets. For every packet, the client additionally registers: direction, size, the state of all TCP flags (for TCP connections only), the time in microseconds elapsed since the previous packet in the flow, and the type of the transmitted HTTP content.

We do not inspect the payload; the type of the HTTP content is obtained from the HTTP header, which is present in the first packet carrying that specific content. One HTTP flow (for example, a connection to a web server) can carry multiple files: HTML documents, JPEG images, CSS stylesheets, etc. Thanks to this ability implemented in our VBS, we are able to split the flow and separate the particular HTTP contents. The data collected by VBS are stored in a local file and periodically sent to the server. The task of the server is to receive the data from the clients and store them in a MySQL database.
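
The flow definition above boils down to a 5-tuple grouping key. A minimal sketch, using a simplified, hypothetical subset of the fields the real clients register:

```python
from collections import defaultdict

# Hypothetical packet records: (local_ip, local_port, remote_ip, remote_port, proto, size)
packets = [
    ("10.0.0.2", 51000, "93.184.216.34", 80, "TCP", 60),
    ("10.0.0.2", 51000, "93.184.216.34", 80, "TCP", 1500),
    ("10.0.0.2", 53000, "8.8.8.8", 53, "UDP", 75),
]

def flow_key(pkt):
    """A flow = same local/remote IPs, local/remote ports, and transport protocol."""
    local_ip, local_port, remote_ip, remote_port, proto, _size = pkt
    return (local_ip, local_port, remote_ip, remote_port, proto)

flows = defaultdict(list)
for pkt in packets:
    flows[flow_key(pkt)].append(pkt)

print(len(flows))  # 2: one TCP flow (two packets) and one UDP flow
```

In the real system each such flow is then labeled with the application name read from the system sockets, and per-packet details (direction, TCP flags, inter-arrival time) are stored alongside it.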

This open-source tool is released under the GNU General Public License v3.0 and published as a SourceForge project. Both Windows and Linux versions are available. VBS is designed to collect traffic from numerous volunteers spread around the world; therefore, with a sufficient number of volunteers, the collected data can provide a good statistical base.

Compare Testlab in Karlstad, NettOp at the University of Stavanger, and CNP at Aalborg University are three living labs for the development of new ICT services, infrastructure, and media by involving users (i.e., end users as well as companies). The industrial partners Ipark (Stavanger Innovation Park), ICTNORCOM, and Greater Stavanger Development will present real cases for which users will be invited to co-create and test ICT services.

The aim of this project is to build on and improve the work of existing Living Labs and generate knowledge on how to innovate new services, media and infrastructure in Living Labs in three different Nordic countries.

This project aims at defining a standard Deep Packet Inspection API that most DPI implementations will support. To achieve this goal, the API will be released under an open license. This will promote the interchange of DPI libraries, so that it will be possible to plug and unplug implementations as needed. The standardization group consists of developers of both commercial and open-source DPI software.
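
The standardized API itself is not published here; as a hedged illustration of what "plug and unplug implementations as needed" could look like, consider a vendor-neutral interface that every engine implements. All names and the toy stand-in engine below are hypothetical:

```python
from abc import ABC, abstractmethod

class DPIEngine(ABC):
    """Hypothetical vendor-neutral DPI interface: callers feed packets of a
    flow to any implementation and read back its current verdict."""

    @abstractmethod
    def process_packet(self, flow_id: int, packet: bytes) -> str:
        """Return the protocol label for the flow after seeing this packet."""

class PortBasedEngine(DPIEngine):
    """Toy stand-in 'implementation' that classifies by a port number,
    here pretended to sit in the last two bytes of the packet."""
    PORTS = {80: "HTTP", 53: "DNS"}

    def process_packet(self, flow_id, packet):
        port = int.from_bytes(packet[-2:], "big")
        return self.PORTS.get(port, "UNKNOWN")

# Because callers only depend on DPIEngine, swapping in another
# implementation requires no changes on their side.
engine: DPIEngine = PortBasedEngine()
print(engine.process_packet(1, b"\x00\x50"))  # HTTP (port 80)
```

The design benefit is exactly the one the project names: applications written against the common interface can exchange the underlying DPI library without code changes.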

The goal of this project was to design and implement a new module for Imento, a web-based system for creating fliers and advertisements, which is in use by many well-known companies in Denmark, e.g., 727, Cosmographic, Lidl, Spar, Bong, Nordal, Intersport, Bygma, and Tempur. The system consists of a media bank and a product database, which store all the information about the products needed by the customers. The module developed in this project allows easy production of real advertisements, in the InDesign and PDF formats, using the web-based Imento interface.

The built solution uses web-based techniques and tools (e.g., HTML, JavaScript, jQuery, and AJAX) in collaboration with a headless version of InDesign Server, controlled by scripts produced by the web interface. First, the user chooses a template for building the advertisement. Then the website turns into an environment known from drawing and painting applications, where the user can use existing snippets (per-product graphical templates) to build a multi-page, multi-layer document by dragging and dropping the selected objects. The information about the products (e.g., images, prices, and descriptions) is automatically imported from the database and rendered in the document in real time. The user can save the document and return to it later. The document can be saved in the InDesign format or exported to PDF.

The project concentrated on creating an invoicing system for a mining company, characterized by a significant number of features that differ from other systems already on the market. These requirements were imposed by the very specific way in which the company works and makes its revenue. The company consists of a main headquarters and more than 30 departments in different geographical locations. The tariffs used by the particular departments differ; they can be created and entered into the system only at the main headquarters, while both the headquarters and the departments can use the tariffs for invoicing purposes. Additionally, the departments are allowed to create custom invoices, which are not based on tariffs, but these must be properly marked so they can be checked at the headquarters. The departments cannot directly print any invoices; this ability is reserved for the headquarters. The departments had only dial-up Internet connections and, therefore, the tariffs and generated invoices needed to be synchronized between the headquarters and the departments using small files distributed by e-mail. Additionally, the headquarters needed the ability to edit any invoice or create a memo. The designed and implemented system was in use in around 30 departments of OPA-LABOR for 4 years, successfully satisfying all the requirements set in the project.
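
The offline synchronization described above - tariffs and invoices exchanged as small files over e-mail - can be sketched as a compact serialization round trip. The record schema and function names below are invented for illustration; the real system's file format is not reproduced here.

```python
import base64
import json
import zlib

def pack_sync_file(records: list) -> str:
    """Serialize tariff/invoice records into a small, text-safe blob
    suitable for a dial-up-friendly e-mail attachment."""
    raw = json.dumps(records, separators=(",", ":")).encode()
    return base64.b64encode(zlib.compress(raw, 9)).decode()

def unpack_sync_file(blob: str) -> list:
    """Reverse of pack_sync_file: restore the original records."""
    return json.loads(zlib.decompress(base64.b64decode(blob)))

tariffs = [{"id": 7, "name": "standard", "rate": 12.5},
           {"id": 8, "name": "night", "rate": 8.0}]
blob = pack_sync_file(tariffs)
print(unpack_sync_file(blob) == tariffs)  # True: lossless round trip
```

Compression plus a text-safe encoding matters here because the payload travels as an e-mail attachment over a slow link; the round trip must be lossless so headquarters and departments stay consistent.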

Traffic monitoring and analysis can be done for multiple different reasons: to investigate the usage of network resources, adjust Quality of Service (QoS) policies in the network, log the traffic to comply with the law, or create realistic models of traffic for academic purposes. The core activity in this area is traffic classification, which is the main topic of this thesis.

We introduced the already known methods for traffic classification (such as transport-layer port numbers, Deep Packet Inspection (DPI), and statistical classification) and assessed their usefulness in particular areas. Statistical classifiers based on Machine Learning Algorithms (MLAs) were shown to be accurate while consuming few resources and raising no privacy concerns. However, they require good-quality training data. We performed substantial testing of widely used DPI classifiers and assessed their usefulness in generating ground truth, which can be used as training data for MLAs. Because the existing methods were shown to be incapable of generating proper training data, we built our own host-based system for collecting and labeling network data, which relies on volunteers. Afterwards, we designed and implemented our own system for traffic classification based on various statistical methods, which provides consistent results on all 6 levels: Ethernet, IP protocol, application, behavior, content, and service provider. Finally, we contributed to the open-source community by improving the accuracy of the nDPI traffic classifier. The thesis also evaluates the possibilities of using various traffic classifiers to assess the per-application QoS level.

Privacy seems to be the Achilles' heel of today's web. Most web services make continuous efforts to track their users and to obtain as much personal information as they can from the things they search, the sites they visit, the people they contact, and the products they buy. This information is mostly used for commercial purposes, which go far beyond targeted advertising. Although many users are already aware of the privacy risks involved in the use of Internet services, the particular methods and technologies used for tracking them are much less known. In this survey, we review the existing literature on the methods used by web services to track users online, as well as their purposes, implications, and possible user defenses. We present five main groups of methods used for user tracking, which are based on sessions, client storage, client cache, fingerprinting, and other approaches. A special focus is placed on mechanisms that use web caches, operational caches, and fingerprinting, as they are usually very rich in terms of using various creative methodologies. We also show how users can be identified on the web and associated with their real names, e-mail addresses, phone numbers, or even street addresses. We show why tracking is being used and its possible implications for the users. For each of the tracking methods, we present possible defenses. Some of them are specific to a particular tracking approach, while others are more universal (blocking more than one threat). Finally, we present the future trends in user tracking and show that they can potentially pose significant threats to the users' privacy.

Deep Packet Inspection (DPI) is the state-of-the-art technology for traffic classification. According to the conventional wisdom, DPI is the most accurate classification technique. Consequently, most popular products, either commercial or open-source, rely on some sort of DPI for traffic classification. However, the actual performance of DPI is still unclear to the research community, since the lack of public datasets prevents the comparison and reproducibility of their results. This paper presents a comprehensive comparison of 6 well-known DPI tools, which are commonly used in the traffic classification literature. Our study includes 2 commercial products (PACE and NBAR) and 4 open-source tools (OpenDPI, L7-filter, nDPI, and Libprotoident). We studied their performance in various scenarios (including packet and flow truncation) and at different classification levels (application protocol, application, and web service). We carefully built a labeled dataset with more than 750K flows, which contains traffic from popular applications. We used the Volunteer-Based System (VBS), developed at Aalborg University, to guarantee the correct labeling of the dataset. We released this dataset, including full packet payloads, to the research community. We believe this dataset could become a common benchmark for the comparison and validation of network traffic classifiers. Our results present PACE, a commercial tool, as the most accurate solution. Surprisingly, we find that some open-source tools, such as nDPI and Libprotoident, also achieve very high accuracy.

Monitoring of the Quality of Service (QoS) in high-speed Internet infrastructures is a challenging task. However, precise assessments must take into account the fact that the requirements for a given quality level are service-dependent. Backbone QoS monitoring and analysis require processing of large amounts of data and knowledge about the kinds of applications that generate the traffic. To overcome the drawbacks of existing methods for traffic classification, we proposed and evaluated a centralized solution based on the C5.0 Machine Learning Algorithm (MLA) and decision rules. The first task was to collect high-quality training data, divided into groups corresponding to different types of applications, and provide it to C5.0. It was found that the currently existing means of collecting data (classification by ports, Deep Packet Inspection, statistical classification, public data sources) are not sufficient and do not comply with the required standards. We developed a new system to collect the training data, in which the major role is performed by volunteers. Client applications installed on volunteers' computers collect detailed data about each flow passing through the network interface, together with the application name taken from the description of the system sockets. This paper proposes a new method for measuring the level of Quality of Service in broadband networks. It is based on our Volunteer-Based System to collect the training data, Machine Learning Algorithms to generate the classification rules, and application-specific rules for assessing the QoS level. We combine both passive and active monitoring technologies. The paper evaluates different implementation possibilities, presents the current implementation of the particular parts of the system, their initial runs, and the obtained results, highlighting the parts relevant from the QoS point of view.

To overcome the drawbacks of the existing methods for traffic classification (by ports, Deep Packet Inspection, statistical classification), a new system was developed, in which the data are collected and classified directly by clients installed on machines belonging to volunteers. Our approach combines the information obtained from the system sockets, the HTTP content types, and the data transmitted through the network interfaces. It allows grouping packets into flows and associating them with particular applications or types of service. This paper presents the design and implementation of our system, the testing phase, and the obtained results. The performed threat assessment highlights potential security issues and proposes solutions to mitigate the risks. Furthermore, it proves that the system is feasible in terms of uptime and resource usage, assesses its performance, and proposes future enhancements. We released the system under the GNU General Public License v3.0 and published it as a SourceForge project called Volunteer-Based System for Research on the Internet.

Network traffic analysis was traditionally limited to the packet header, because the transport protocol and application ports were usually sufficient to identify the application protocol. With the advent of port-independent, peer-to-peer, and encrypted protocols, the task of identifying application protocols became increasingly challenging, motivating the creation of tools and libraries for network protocol classification. This paper covers the design and implementation of nDPI, an open-source library for protocol classification using both the packet header and payload. nDPI was extensively validated in various monitoring projects, ranging from Linux kernel protocol classification to the analysis of 10 Gbit traffic, reporting both high protocol detection accuracy and efficiency.

The validation of the different proposals in the traffic classification literature is a controversial issue. Usually, these works base their results on a ground-truth built from private datasets and labeled by techniques of unknown reliability. This makes the validation and comparison with other solutions an extremely difficult task.

This paper aims to be a first step towards addressing the validation and trustworthiness problem of network traffic classifiers. We perform a comparison between 6 well-known DPI-based techniques, which are frequently used in the literature for ground-truth generation. In order to evaluate these tools, we carefully built a labeled dataset of more than 500 000 flows, which contains traffic from popular applications. Our results present PACE, a commercial tool, as the most reliable solution for ground-truth generation. However, among the available open-source tools, nDPI, and especially Libprotoident, also achieve very high precision, while other, more frequently used tools (e.g., L7-filter) are not reliable enough and should not be used for ground-truth generation in their current form.

Understanding Internet traffic is crucial to facilitating academic research and practical network engineering, e.g., when doing traffic classification, traffic prioritization, or creating realistic scenarios and models for Internet traffic development. In this paper, we demonstrate how the Volunteer-Based System for Research on the Internet, developed at Aalborg University, is capable of providing detailed statistics of Internet usage. Since an increasing amount of HTTP traffic has been observed during the last few years, the system also supports creating statistics of different kinds of HTTP traffic, such as audio, video, file transfers, etc. All statistics can be obtained for individual users of the system, for groups of users, or for all users altogether. This paper presents results with real data collected from a limited number of real users over six months. We demonstrate that the system can be useful for studying the characteristics of computer network traffic in an application-oriented or content-type-oriented way, and is now ready for a larger-scale implementation. The paper concludes with a discussion of various applications of the system and the possibilities of further enhancements.

In this paper, we demonstrate how the Volunteer-Based System for Research on the Internet, developed at Aalborg University, can be used for creating statistics of Internet usage. Since the data are collected on individual machines, the statistics can be made on the basis of both individual users and groups of users, and as such are also useful for segmenting users into groups. We present results with data collected from real users over several months; in particular, we demonstrate how the system can be used for studying flow characteristics: the number of TCP and UDP flows, average flow lengths, and average flow durations. The paper is concluded with a discussion of what further statistics can be made and of the further development of the system.
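The flow characteristics mentioned above (flow counts per transport protocol, average flow lengths, and average durations) can be sketched with a minimal aggregation over captured packet records. The record layout and field names below are illustrative assumptions, not the system's actual data format:

```python
from collections import defaultdict

def flow_statistics(packets):
    """Group packet records into bidirectional flows by 5-tuple and compute
    per-protocol statistics (flow count, average packets, bytes, duration).

    Each packet record is a tuple:
        (timestamp, proto, src_ip, src_port, dst_ip, dst_port, size_bytes)
    """
    flows = defaultdict(list)
    for ts, proto, src, sport, dst, dport, size in packets:
        # Sort the endpoints so both directions map to the same flow key.
        key = (proto,) + tuple(sorted([(src, sport), (dst, dport)]))
        flows[key].append((ts, size))

    stats = {}
    for proto in ("TCP", "UDP"):
        fl = [v for k, v in flows.items() if k[0] == proto]
        if not fl:
            continue
        stats[proto] = {
            "flows": len(fl),
            "avg_packets": sum(len(v) for v in fl) / len(fl),
            "avg_bytes": sum(sum(s for _, s in v) for v in fl) / len(fl),
            "avg_duration": sum(max(t for t, _ in v) - min(t for t, _ in v)
                                for v in fl) / len(fl),
        }
    return stats
```

In a real deployment the grouping would also need flow timeouts (so that reused port pairs start new flows); the sketch omits this for brevity.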

Our previous work demonstrated the possibility of distinguishing several kinds of applications with an accuracy of over 99%. Today, most of the traffic is generated by web browsers, which provide different kinds of services based on the HTTP protocol: web browsing, file downloads, audio and voice streaming through third-party plugins, etc. This paper suggests and evaluates two approaches to distinguishing various types of HTTP content: a distributed one, running on volunteers' machines, and a centralized one, running in the core of the network. We also assess the accuracy of the global classifier for both HTTP and non-HTTP traffic. We achieved an accuracy of 94%, which is expected to be even higher in real-life usage. Finally, we provide graphical characteristics of different kinds of HTTP traffic.
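One simple building block for distinguishing HTTP content types on the volunteer side is a coarse mapping from the Content-Type header to a traffic category. The mapping below is a minimal illustrative sketch; the category names and MIME assignments are assumptions, not the taxonomy used in the paper:

```python
def http_content_category(content_type):
    """Map an HTTP Content-Type header value to a coarse traffic category.

    Illustrative only: real classification would combine this with flow
    behavior, since Content-Type can be missing or misleading.
    """
    mime = content_type.split(";")[0].strip().lower()
    major, _, _minor = mime.partition("/")
    if major in ("audio", "video"):
        return major
    if major in ("text", "image") or mime in (
        "application/xhtml+xml", "application/javascript"
    ):
        return "web browsing"
    if mime in ("application/octet-stream", "application/zip",
                "application/pdf"):
        return "file transfer"
    return "other"
```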

Monitoring of Quality of Service (QoS) in high-speed Internet infrastructure is a challenging task. However, precise assessments must take into account the fact that the requirements for a given quality level are service-dependent. Backbone QoS monitoring and analysis requires processing of large amounts of data and knowledge of which kind of application the traffic belongs to. To overcome the drawbacks of existing methods for traffic classification, we proposed and evaluated a centralized solution based on the C5.0 Machine Learning Algorithm (MLA) and decision rules. The first task was to collect and provide C5.0 with high-quality training data, divided into groups corresponding to different types of applications. It was found that the currently existing means of collecting data (classification by ports, Deep Packet Inspection, statistical classification, public data sources) are not sufficient and do not comply with the required standards. To collect training data, a new system was developed, in which the major role is performed by volunteers. Client applications installed on their computers collect detailed data about each flow passing through the network interface, together with the application name taken from the description of system sockets. This paper proposes a new method for measuring the QoS level in broadband networks, based on our Volunteer-Based System for collecting the training data, Machine Learning Algorithms for generating the classification rules, and application-specific rules for assessing the QoS level. We combine both passive and active monitoring technologies. The paper evaluates different implementation possibilities, presents the current implementation of particular parts of the system, their initial runs, and the obtained results, highlighting the parts relevant from the QoS point of view.

Monitoring of network performance in a high-speed Internet infrastructure is a challenging task, as the requirements for a given quality level are service-dependent. Therefore, backbone QoS monitoring and analysis in multi-hop networks requires knowledge about the types of applications forming the current network traffic. To overcome the drawbacks of existing methods for traffic classification, the use of the C5.0 Machine Learning Algorithm (MLA) was proposed. On the basis of the statistical traffic information received from volunteers and the C5.0 algorithm, we constructed a boosted classifier, which was shown to be able to distinguish between 7 different applications in test sets of 76,632 - 1,622,710 unknown cases with an average accuracy of 99.3 - 99.9%. This high accuracy was achieved by using high-quality training data collected by our system, a unique set of parameters used for both training and classification, an algorithm for recognizing flow direction, and the C5.0 itself. The classified applications include Skype, FTP, torrent, web browser traffic, web radio, interactive gaming, and SSH. We performed subsequent runs using different sets of parameters and different training and classification options. This paper shows how we collected accurate traffic data, presents the parameters used in the classification process, introduces the C5.0 classifier and its options, and finally evaluates and compares the obtained results.
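A C5.0 classifier ultimately emits decision rules over per-flow statistical features. The toy rule set below conveys the flavor of such rules; every threshold, feature name, and class label here is an illustrative assumption, not a rule actually learned in the work above:

```python
def classify_flow(features):
    """Toy hand-written decision rules over per-flow statistics, in the
    spirit of rules a C5.0 classifier might learn. Illustrative only.

    features: dict with
        port          - server port number
        avg_payload   - average payload size per packet (bytes)
        pkts_per_sec  - packet rate of the flow
        payload_ratio - fraction of packets carrying payload
    """
    if features["port"] == 22:
        return "SSH"
    if features["avg_payload"] > 1200 and features["pkts_per_sec"] > 100:
        return "FTP"                      # bulk-transfer pattern
    if features["avg_payload"] < 150 and features["pkts_per_sec"] > 20:
        return "interactive gaming"       # many small, frequent packets
    if features["payload_ratio"] > 0.9 and features["avg_payload"] > 400:
        return "web browser traffic"
    return "unknown"
```

In the actual system, such rules are induced automatically from the volunteer-collected training data rather than written by hand, and boosting combines many trees into one classifier.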

To overcome the drawbacks of existing methods for traffic classification (by ports, Deep Packet Inspection, statistical classification), a new system was developed, in which the data are collected from client machines. This paper presents the design of the system, its implementation, initial runs, and the obtained results. Furthermore, it proves that the system is feasible in terms of uptime and resource usage, assesses its performance, and proposes future enhancements.

This article surveys the existing literature on the methods currently used by web services to track users online, as well as their purposes, implications, and possible user defenses. A significant majority of the reviewed articles and web resources are from the years 2012 - 2014. Privacy seems to be the Achilles' heel of today's web. Web services make continuous efforts to obtain as much information as they can about the things we search for, the sites we visit, the people we contact, and the products we buy. Tracking is usually performed for commercial purposes. We present five main groups of methods used for user tracking, based on sessions, client storage, client cache, fingerprinting, or yet other approaches. A special focus is placed on mechanisms that use web caches, operational caches, and fingerprinting, as they usually employ particularly creative methodologies. We also show how users can be identified on the web and associated with their real names, e-mail addresses, phone numbers, or even street addresses. We show why tracking is used and its possible implications for users. For example, we describe recent cases of price discrimination, assessing financial credibility, determining insurance coverage, government surveillance, and identity theft. For each of the tracking methods, we present possible defenses. Some of them are specific to a particular tracking approach, while others are more universal (blocking more than one threat) and are discussed separately. Apart from describing the methods and tools used for keeping personal data from being tracked, we also present several tools that were used for research purposes: their main goal is to discover how and by which entity users are being tracked on their desktop computers or smartphones, provide this information to the users, and visualize it in an accessible and easy-to-follow way. Finally, we present the currently proposed future approaches to tracking users and show that they can potentially pose significant threats to users' privacy.

Existing tools for traffic classification are shown to be incapable of identifying the traffic in a consistent manner. For some flows only the application is identified, for others only the content, for yet others only the service provider. Furthermore, Deep Packet Inspection is characterized by extensive resource needs as well as privacy and legal concerns. Techniques based on Machine Learning Algorithms require good-quality training data, which are difficult to obtain. They usually cannot properly deal with types of traffic other than those they were trained on, and they are unable to detect the content carried by the flow or the service provider. To overcome the drawbacks of the already existing methods, we developed a novel hybrid method to provide accurate identification of computer network traffic on six levels: Ethernet, IP protocol, application, behavior, content, and service provider. Our system, built on this method, also provides traffic accounting, and it was tested on 2 datasets. We have shown that our system gives a consistent, accurate output on all the levels. We also showed that the results provided by our system on the application level outperformed the results obtained from the most commonly used DPI tools.
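The consistency requirement across levels can be sketched as merging a partial per-flow classification with a small knowledge base, so that recognizing a flow on one level fills in the related levels. The `APP_INFO` entries and the `behavior` hint below are hypothetical, simplified stand-ins for the system's real multi-level logic:

```python
# Hypothetical knowledge base: a recognized client identity (behavior level)
# implies labels on the other levels. Entries are illustrative only.
APP_INFO = {
    "YouTube player": {"application": "HTTP", "content": "video",
                       "service_provider": "Google"},
    "Skype": {"application": "Skype", "content": "voice",
              "service_provider": "Microsoft"},
}

def label_flow(partial):
    """Merge a partial classification with knowledge-base inference so that
    every level (application, content, service provider) gets a value when
    the behavior level was recognized."""
    levels = ("application", "content", "service_provider")
    result = {lvl: partial.get(lvl) for lvl in levels}
    hint = partial.get("behavior")  # e.g. a recognized client identity
    if hint in APP_INFO:
        for lvl in levels:
            result[lvl] = result[lvl] or APP_INFO[hint][lvl]
    return result
```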

Network traffic classification has become an essential input for many network-related tasks. However, the continuous evolution of Internet applications and their techniques to avoid being detected (such as dynamic port numbers, encryption, or protocol obfuscation) has considerably complicated their classification. We start the report by introducing and shortly describing several well-known DPI tools, which will later be evaluated: PACE, OpenDPI, L7-filter, NDPI, Libprotoident, and NBAR.

This report has several major contributions. First, by using VBS, we created 3 datasets of 17 application protocols, 19 applications (including various configurations of the same application), and 34 web services, which are available to the research community. The first dataset contains full flows with entire packets, the second dataset contains truncated packets (the Ethernet frames were overwritten by 0s after the 70th byte), and the third dataset contains truncated flows (we took only the first 10 packets of each flow). The datasets contain 767 690 flows labeled on a multidimensional level. These datasets are available as a set of PCAP files containing full flows including the packet payload, together with corresponding text files, which describe the flows in the order in which they were originally captured and stored in the PCAP files.
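The truncation applied to produce the second dataset (overwriting each Ethernet frame with zeros after the 70th byte, while keeping the original frame length) can be expressed as a short helper; the function name is an illustrative assumption:

```python
def truncate_frame(frame, keep=70):
    """Overwrite an Ethernet frame with zeros after byte `keep`, preserving
    the original frame length, as done for the truncated-packet dataset.

    Keeping the length intact means packet- and flow-size statistics stay
    valid while the application payload is anonymized.
    """
    frame = bytearray(frame)
    if len(frame) > keep:
        frame[keep:] = b"\x00" * (len(frame) - keep)
    return bytes(frame)
```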

Second, we developed a method for labeling non-HTTP flows that belong to web services (such as YouTube). Labeling based on the corresponding domain names taken from the HTTP header can identify only HTTP flows; other flows (such as encrypted SSL / HTTPS flows or RTMP flows) are left unlabeled. Therefore, we implemented a heuristic method for detecting non-HTTP flows that belong to specific services. Then, we examined the ability of the DPI tools to accurately label the flows included in our datasets.
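One way such a heuristic can work is in two passes: first learn which server IPs belong to a service from the Host headers of labeled HTTP flows, then propagate those service labels to non-HTTP flows (e.g., HTTPS or RTMP) reaching the same server IPs. The flow layout and field names below are illustrative assumptions, not the report's actual implementation:

```python
def label_service_flows(flows, domain_to_service):
    """Two-pass heuristic sketch for labeling non-HTTP service flows.

    Pass 1: map server IPs seen in HTTP flows (via the Host header) to a
            service, based on a domain suffix match.
    Pass 2: label remaining flows whose destination IP was mapped in pass 1.

    Each flow: {"dst_ip": str, "host": str or None, "label": str or None}
    """
    ip_service = {}
    for f in flows:
        if f.get("host"):
            for domain, service in domain_to_service.items():
                if f["host"].endswith(domain):
                    f["label"] = service
                    ip_service[f["dst_ip"]] = service
    for f in flows:
        if f["label"] is None and f["dst_ip"] in ip_service:
            f["label"] = ip_service[f["dst_ip"]]
    return flows
```

A real implementation would also need to handle shared hosting and CDNs, where one server IP serves several services; the sketch ignores that case.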

Nowadays, there are many tools able to classify the traffic in computer networks. Each of these tools claims to have a certain accuracy, but it is hard to assess which tool is better, because they are tested on various datasets. Therefore, we set out to create a dataset that can be used to test all the traffic classifiers. In order to do that, we used our system to collect complete packets from the network interfaces. The packets are grouped into flows, and each flow is collected together with the process name taken from Windows / Linux sockets, so researchers not only have the full payloads but are also provided with the information which application created each flow. Therefore, the dataset is useful for testing Deep Packet Inspection (DPI) tools, as well as statistical and port-based classifiers. The dataset was created in a fully manual way, which ensures that all the time parameters inside the dataset are comparable with the parameters of usual network data of the same type. The system for collecting the data, as well as the dataset itself, are made available to the public. Afterwards, we compared the classification accuracy on our dataset of PACE, OpenDPI, NDPI, Libprotoident, NBAR, four different variants of L7-filter, and a statistic-based tool developed at UPC. We performed a comprehensive evaluation of the classifiers on different levels of granularity: the application level, the content level, and the service provider level. We found that the best-performing classifier on our dataset is PACE. Among the non-commercial tools, NDPI and Libprotoident provided the most accurate results, while the worst accuracy was obtained from all 4 versions of L7-filter.

Consistency, Accuracy, and Usefulness of Techniques and Tools for Network Traffic Identification

Event

Seminar organized by the Networks, Systems, Services, and Security (R3S) research team from the Distributed Services, Architectures, Modelling, Validation, and Network Administration (SAMOVAR) research unit