To successfully prevent attacks, it is vital to have a complete and accurate detection system. Signature-based intrusion detection systems (IDS) are one of the most popular approaches, but they are not adequate for detecting web-based or novel attacks. The purpose of this project is to study and design an anomaly-based intrusion detection system capable of detecting those kinds of attacks.

Anomaly-based IDS can create a model of normal behavior from a set of training data and then use it to detect novel attacks. In most cases, this model represents more instances than those in the training data set, a characteristic that we designate as generalization and which is necessary for accurate anomaly detection. The accuracy of such systems, which determines their effectiveness, is considerably influenced by the model building phase (often called training), which depends on having attack-free data that resembles the normal operation of the protected application. Good models are particularly important; otherwise, the IDS is likely to generate significant numbers of false positives and false negatives during the detection phase.

This dissertation details our research on the use of anomaly-based methods to detect attacks against web servers and applications. Our contributions focus on three different strands: i) advanced training procedures that enable anomaly-based learning systems to perform well even in the presence of complex and dynamic web applications; ii) a system comprising several anomaly detection techniques capable of recognizing and identifying attacks against web servers and applications; and iii) an evaluation of the system and of the most suitable techniques for anomaly detection of web attacks, using a large data set of real-world traffic belonging to a large web application hosted on production servers of a Portuguese ISP.

Keywords: anomaly detection, intrusion detection, web attacks, computers, networks.


Ever since computers have existed, computer networks have not ceased to grow and evolve. Computer networks have become essential tools in the development of most enterprises and organizations in fields such as banking, insurance, and the military. The increase in interconnection between systems and networks has made them accessible to a vast and diversified user population that keeps on growing. These users, known or not, do not always have good intentions when accessing these networks. They can try to read, modify, or destroy critical information, or simply attack and disturb the system. Since these networks are potential targets of attacks, making them safe is an important aspect that cannot be ignored. Over the past two decades, the growing number of computer security incidents witnessed on the Internet has reflected the growth of the Internet itself.

As a facet of the Internet's growth, the Hypertext Transfer Protocol (HTTP) has been widely used throughout the years. Applications are nowadays designed to run as web services and are deployed across the Internet using HTTP as their standard communication protocol. The success and broad acceptance of web applications shaped a trend suggesting that people will rely heavily on web applications in the years to come. The popularity of such web applications has caught the attention of attackers, who try to exploit their vulnerabilities. According to the CSI/FBI Computer Crime and Security Survey of 2005 [1], 95% of the respondent organizations reported having experienced more than 10 incidents related to their web sites.

Additionally, as shown by the CVE vulnerability trends [2] over a period of five years, the total number of publicly reported web application vulnerabilities rose sharply, to the point where they overtook buffer overflows. Three of the most typical web-related attacks that contributed to this trend are SQL injection, Cross-Site Scripting (XSS), and Remote File Inclusion (RFI). This tendency can be explained mainly by the ease of discovery and exploitation of web vulnerabilities, combined with the proliferation of low-grade software applications written by inexperienced developers. Figure 1.1, which shows the evolution of attack sophistication versus the necessary technical knowledge of the intruders throughout the years, is solid evidence of this tendency.

Motivation and Objectives

Anomaly-based intrusion detection systems have the potential to detect novel, or previously unknown, attacks (also referred to as zero-day attacks[1]). Nevertheless, these systems typically suffer from high rates of false positives and can be evaded by mimicry attacks (i.e., variations of attacks that attempt to pass as normal behavior), for example by using byte substitution and/or padding techniques. In order to address these problems and increase the accuracy of anomaly-based IDS, one of the main challenges lies in controlling model generalization. An anomaly-based IDS learns and builds a model of normal behavior derived from observations in a set of training data. Normal behavior assumes that a service is used as its administrators intend; in other words, it assumes no attacks are present. After the model is created, subsequent observations that deviate from it are categorized as anomalies. In order to achieve more than simply memorizing the training data, an anomaly-based IDS must generalize. By generalizing, an anomaly-based IDS accepts input that is similar, but not necessarily identical, to that of the training data set, which implies that the set of instances considered normal is larger than the set of instances in the training data. An anomaly detection system that undergeneralizes generates too many false positives, while one that overgeneralizes misses attacks. Therefore, correct generalization is a necessary condition for the accuracy of an anomaly-based IDS.
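
The generalization trade-off described above can be illustrated with a minimal sketch. It is not one of the models developed in this dissertation: it assumes a hypothetical detector that learns only the length of a request parameter value, and a hypothetical tolerance parameter that controls how far from the training data an input may fall while still being considered normal. The training values and the attack string are invented for illustration.

```python
from statistics import mean, pstdev


def train_length_model(training_values):
    """Learn the mean and standard deviation of value lengths.

    Assumes the training values are attack-free (hypothetical data).
    """
    lengths = [len(v) for v in training_values]
    return mean(lengths), pstdev(lengths)


def is_anomalous(value, model, tolerance):
    """Flag a value whose length deviates from the learned mean
    by more than `tolerance` standard deviations.

    `tolerance` is an illustrative knob for generalization:
    small values undergeneralize, large values overgeneralize.
    """
    mu, sigma = model
    return abs(len(value) - mu) > tolerance * sigma


# Hypothetical attack-free training data for a "username" parameter.
training = ["alice", "bob", "carol", "dave", "eve12"]
model = train_length_model(training)

unseen_normal = "frank"        # legitimate value not present in training
attack = "' OR '1'='1"         # SQL injection style payload

for tolerance in (0.5, 3, 20):
    print(
        tolerance,
        is_anomalous(unseen_normal, model, tolerance),
        is_anomalous(attack, model, tolerance),
    )
```

With a small tolerance the sketch flags the unseen legitimate value (a false positive from undergeneralization); with a very large tolerance it accepts the attack payload (a false negative from overgeneralization); an intermediate tolerance separates the two in this toy setting.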

Another major challenge, which we explicitly address with our research, is the significant influence that the model creation phase (often called training) has on the system’s overall detection accuracy. This phase depends on the quality of the traffic or data used to represent the normal activity of the protected system or application, which should be free from attacks. In this dissertation, we propose an algorithm to sanitize (i.e., remove attacks from) a dataset of web requests collected from captures of network traffic, in order to make it suitable for training the anomaly detection system. The main goal of this process is to enable anomaly-based systems to build their statistical models from attack-free input data.

Another objective of our research was to devise an anomaly-based intrusion detection system comprising multiple anomaly detection techniques able to recognize and identify attacks against web servers and applications, based on models that allow normal requests to be distinguished from anomalous ones.

As the final and main goal of our research, we evaluate the anomaly-based IDS with a large data set of real-world traffic from a production environment hosting a large web application, and we assess the accuracy of each of the anomaly detection techniques it includes.

A zero-day attack exploits a software vulnerability that has not yet been made public.