Anomaly Detection through Behavioral Data

Application domain Computer and Network Security
Source Dip. di Informatica, Univ. di Torino
Dataset size 343 examples of 41 attributes each
Data format FOIL-like/plain
Systems used RELIC
References [1, 2]
Pointers {bergadan,gunetti,ruffo}

Data Acquisition

The data used in the experiments were collected over a period of three months. Ten volunteers agreed to be monitored as described below. The volunteers were asked to behave as usual, with the only constraint of not allowing other people to sit at the keyboard and use the workstation with the volunteer's account. The monitored people include two system administrators, one PhD student, one professor and six researchers. Each user is monitored by two programs that are launched when the user logs on and starts the X server. After ten minutes, the programs save the relevant data and stop.

The first program, time, connects to the X server and records the elapsed time between every two keystrokes on the keyboard. To guarantee privacy, the typed characters themselves are blurred and not recorded. Since we want the average elapsed time in continuous typing, time does not take into consideration times larger than 600 milliseconds; in such cases we assume the user stopped typing. The elapsed times are summed, and after ten minutes this sum is divided by the number of typed keys. The average elapsed time between two strokes and the number of strokes are recorded, and then the time process terminates.
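The timing computation above can be sketched as follows, assuming keystroke timestamps in milliseconds (the function name and input representation are illustrative, not the authors' code): gaps above 600 ms are treated as pauses and excluded, and the sum of the remaining gaps is divided by the number of typed keys, as described.

```python
PAUSE_THRESHOLD_MS = 600  # gaps above this are treated as pauses, per the text

def keystroke_stats(timestamps_ms):
    """Return (number of keystrokes, average elapsed time per typed key)."""
    total = 0
    for prev, cur in zip(timestamps_ms, timestamps_ms[1:]):
        gap = cur - prev
        if gap <= PAUSE_THRESHOLD_MS:  # ignore pauses in continuous typing
            total += gap
    n_keys = len(timestamps_ms)
    return n_keys, (total / n_keys if n_keys else 0.0)
```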

The second program, command, records the commands executed by the user in the first ten minutes of his working session, together with the number of times each command was run. This is done through the lastcomm command provided by the Unix system.
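Tallying the commands from accounting output can be sketched as below; this assumes the command name is the first whitespace-separated field of each lastcomm-style line (the exact output format varies across systems, so the parsing here is illustrative).

```python
from collections import Counter

def count_commands(lastcomm_lines):
    """Count how many times each command appears in lastcomm-style output,
    taking the first field of each non-empty line as the command name."""
    return Counter(line.split()[0] for line in lastcomm_lines if line.strip())
```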

Together with the above information, the login time is also recorded. Hence, after the first ten minutes of a session, each user produces a set of parameters such as the following:
user: User-1; login time: 09:15; number of keystrokes: 157; average elapsed time between keystrokes: 243 (milliseconds); command: cat, how-many: 3; command: elm, how-many: 1; command: more, how-many: 3; command: rm, how-many: 2; ...

Every such set is a positive example of User-1, and a negative example for every other user.

Every example must be turned into something that the learning system can handle. The first three parameters can be used directly as continuous attributes (the login time is first converted into the number of minutes from midnight). To handle the commands, they are grouped into classes of 'homogeneous' commands. For example, one class contains the commands used to view the contents of files; commands that manipulate files as a whole (such as cp and mv) may form another class. We initially identified 24 classes of Unix commands; these were later increased to 37.
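The encoding step above can be sketched as follows. The class names and the command-to-class mapping are hypothetical (the paper identifies 24, later 37, classes but does not list them here); only the overall scheme, login time as minutes from midnight plus per-class command counts, follows the text.

```python
# Hypothetical mapping from commands to 'homogeneous' classes.
COMMAND_CLASSES = {
    "cat": "view", "more": "view", "less": "view",  # view file contents
    "cp": "file", "mv": "file", "rm": "file",       # manipulate whole files
    "elm": "mail",                                  # mail clients
}
CLASS_NAMES = ["view", "file", "mail"]

def encode_session(login_time, n_keys, avg_ms, commands):
    """Encode one session as a fixed-length attribute vector.
    login_time is 'HH:MM'; commands is a {name: count} dict."""
    hh, mm = login_time.split(":")
    minutes = int(hh) * 60 + int(mm)  # minutes from midnight
    counts = {c: 0 for c in CLASS_NAMES}
    for cmd, n in commands.items():
        cls = COMMAND_CLASSES.get(cmd)
        if cls is not None:
            counts[cls] += n  # aggregate counts per command class
    return [minutes, n_keys, avg_ms] + [counts[c] for c in CLASS_NAMES]
```

Applied to the sample session above, User-1's record becomes the vector [555, 157, 243, 6, 2, 1] under this (made-up) class mapping.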


Given a set of positive and negative examples of a user, the goal is to synthesize a decision tree representing a model of that user. When given a new example as input, the decision tree must correctly classify it as a positive or negative example of the user. Note that some of the available users are ignored entirely during the training phase: these users are only used for testing the system's performance, and are equivalent to external, previously unknown intruders. More precisely, six users are selected to learn a model; we call these the known users. The remaining four are left out to be used only in the testing phase, as explained below; we call them the unknown users. The set of positive examples of each known user is randomly split into a training set containing 2/3 of the examples and a test set containing the remaining examples. A decision tree for a known user is learned from a set of positive and negative examples of that user: the training set of this user's examples is used as positive examples, and the training sets of the other five known users are used as negative examples. The learned decision tree is then tested on the test examples, in order to compute the percentage of positive examples of the user classified as positive and the percentage of negative examples classified as negative.
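The construction of one known user's labeled training data can be sketched as below; this is an illustrative reading of the protocol (a 2/3 random split per known user, positives from the target user, negatives from the other five), not the authors' code.

```python
import random

def make_training_data(examples_by_user, target, known_users, seed=0):
    """Split each known user's examples 2/3 train / 1/3 test, then label
    the target's training examples 1 (positive) and the other known
    users' training examples 0 (negative)."""
    rng = random.Random(seed)
    train, test = {}, {}
    for u in known_users:
        ex = examples_by_user[u][:]
        rng.shuffle(ex)                 # random split, as in the text
        cut = (2 * len(ex)) // 3
        train[u], test[u] = ex[:cut], ex[cut:]
    pos = [(x, 1) for x in train[target]]
    neg = [(x, 0) for u in known_users if u != target for x in train[u]]
    return pos + neg, test              # labeled data plus held-out examples
```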

The testing set for each known user is built by putting together: a) the examples of that user not used in the learning phase (marked as 'positive'); b) the examples of the other five known users not used in the learning phase (marked as 'negative'); c) all the examples of the four unknown users, marked as 'negative'. The presence in the testing sets of negative examples from unknown users is important, because it simulates the real situation in which an intruder comes from the outside, so that his behavior is completely unknown to the 'guards'. For this reason the four unknown users are selected to be as heterogeneous as possible: the professor, the PhD student, one of the researchers and one of the system administrators. The learning/testing process just described is repeated six times for each user, each time with a different random split of his examples into a training and a testing set. The mean error rate over the six runs available for each known user is then computed. The whole procedure is repeated in six experiments in which different attributes are used. For the best choice of attributes, a 90% accuracy on the test set described above is obtained. More details can be found in [1, 2].
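The assembly of a known user's testing set from parts a), b) and c), and the resulting accuracy computation, can be sketched as follows (an illustrative sketch of the evaluation protocol; the classifier is passed in as a plain function rather than a learned decision tree).

```python
def build_test_set(target, known_users, unknown_users, held_out, unknown_examples):
    """Testing set for one known user: a) held-out positives of the target,
    b) held-out negatives of the other known users, c) all examples of the
    unknown users as negatives."""
    data = [(x, 1) for x in held_out[target]]
    data += [(x, 0) for u in known_users if u != target for x in held_out[u]]
    data += [(x, 0) for u in unknown_users for x in unknown_examples[u]]
    return data

def accuracy(classify, test_set):
    """Fraction of test examples whose predicted label matches the mark."""
    correct = sum(1 for x, y in test_set if classify(x) == y)
    return correct / len(test_set)
```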


  1. D. Gunetti and G. Ruffo, 'Intrusion Detection through Behavioral Data', in Third Symposium on Intelligent Data Analysis (IDA-99), LNCS, Springer-Verlag, 1999.

  2. F. Bergadano, D. Gunetti and G. Ruffo, 'Online Anomaly Detection from Sample User Sessions', in Deliverable TO1e.
