Adaptive System Management (KUL)

Application domain: Adaptive System Management
Further specification: data + script + background knowledge
Pointers: Contact Luc Dehaspe ldh@cs.kuleuven.ac.be
Data complexity: 1000 facts
Data format: Prolog

With the ever-increasing complexity of Client/Server environments, the problem of managing such environments becomes both more important and more difficult. Automatic techniques, and learning techniques in particular, could help a lot in solving these problems. As most system management is still done by hand, applying ILP techniques would be a big step forward.

Two of the main goals of ASM are to keep the performance and availability of the system as high as possible. By systems we usually mean a network of UNIX machines, with databases and applications running on them. It is of course very valuable to discover what determines the performance and availability of such systems. System management can be roughly divided into four areas.

It has been decided that the focus for now will be on problems involving management of single UNIX-systems with possibly databases and applications running on them. In the future the focus will shift towards management of more complex networks involving several machines.

Problem definitions

Four problem areas have been defined. For these problems the focus has been on simplicity and on the availability of datasets; they should thus be considered preliminary. No real attempt has been made to come up with problems that are specifically suited for ILP, so a comparison with other ML techniques may not always be in favor of ILP. The definitions as they stand are meant as a first step; once some results are in, we should look at more complex problems.


Data sets

The availability of datasets will mostly be the responsibility of Syllogic.

For the first three problems datasets can easily be generated. Marc Gathier has written a script that periodically produces a list of system parameters. The output can be in the format of Prolog clauses, of insert statements into a database, or of any other ASCII format you like. The script will enable everybody to create their own datasets on various types of machines. It has been agreed that each party will generate a dataset in this way and distribute it, so that performance on different machines may be compared. A standard format for this still needs to be agreed upon. For those who have difficulties using the script, Syllogic will give support or send a dataset that has already been generated.
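The actual scripts are Syllogic's and are not reproduced here. Purely as an illustration of the idea, a minimal sketch of a periodic parameter dump in Prolog-fact format might look as follows; the predicate name `sys_sample` and the two sampled parameters are invented (the real script dumps 18 parameters per fact):

```python
import os
import time

def prolog_fact(timestep, load1, n_procs):
    """Format one sample as a Prolog fact (hypothetical predicate and arity;
    the actual script emits facts of arity 18)."""
    return "sys_sample(%d, %.2f, %d)." % (timestep, load1, n_procs)

def dump(interval=20, samples=3):
    """Print one fact every `interval` seconds (20 s, as in the experiment)."""
    for t in range(samples):
        load1, _, _ = os.getloadavg()  # 1-minute load average (Unix only)
        # rough process count via /proc, where available
        n_procs = sum(1 for d in os.listdir('/proc') if d.isdigit()) \
            if os.path.isdir('/proc') else 0
        print(prolog_fact(t, load1, n_procs))
        time.sleep(interval)

if __name__ == "__main__":
    dump()
```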

Availability of datasets for the last problem still needs to be discussed.

Jeroen Speekenbrink will look into the availability of larger and more complex datasets at customer sites. New problem definitions will probably result from this.

ILP Considerations

The datasets for the first three problems, as they stand now, correspond to a single table, with a column for each system parameter and a row for each timestep. Attribute-value learners have been applied to these datasets to induce dependencies, holding at a single timestep, between system performance/availability and the other parameters. The challenge for ILP is to find more useful rules that also take the history of previous timesteps into account.

This problem might seem less suitable for ILP as the system parameters are all purely numerical values. However, either the list of values is limited, or fairly straightforward discretization is possible.

Nevertheless, this application has several aspects that make it attractive for ILP.

Experimental setup

We further refined the learning task to finding sufficient conditions for a high system load at timestep Future. These conditions are to be selected from the information available at an earlier timestep Now, which is equal to Future - PredictionGap. To make the rules more flexible we introduced a time interval Recently, which starts at Now - PredictionMemory and stops at Now. We then want to induce rules of the type
IF Recently the system parameters have been such-and-such
THEN the system load will be high at timestep Future
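The timestep arithmetic above can be sketched as follows (the function name is ours, for illustration only):

```python
def recently_window(future, prediction_gap, prediction_memory):
    """Return (start, now) of the Recently interval for a given Future.
    Now = Future - PredictionGap; Recently = [Now - PredictionMemory, Now]."""
    now = future - prediction_gap
    return (now - prediction_memory, now)

# Predicting 20 s ahead with a 60 s memory: a rule about timestep
# Future = 1000 may only use parameter values observed in [920, 980].
```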
Raw data

To generate the data for the experiment we used a slightly modified version of the two scripts provided by Syllogic.

We executed the scripts on a SPARCstation 2 running SunOS 4.1.3. In about three days, with a parameter dump every 20 seconds, we gathered about 13000 facts of arity 18. From these we selected an ``interesting'' sequence of 1000 facts; interesting here means that the system load showed some variance in this 5.5-hour period (e.g. during the night it was rather uninterestingly stable).

In 100 of the 1000 selected timesteps the system load is high.
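The report does not specify how the interesting sequence was selected. One simple procedure, sketched here only as an assumption about what such a selection could look like, is to slide a fixed-length window over the load series and keep the window with maximal variance:

```python
def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def most_interesting_window(loads, width=1000):
    """Return the start index of the `width`-long window whose system-load
    values have maximal variance (a sketch, not the actual procedure)."""
    best_start, best_var = 0, -1.0
    for start in range(len(loads) - width + 1):
        v = variance(loads[start:start + width])
        if v > best_var:
            best_start, best_var = start, v
    return best_start
```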

Background knowledge

We mainly used two types of background knowledge: one related to discretization, the other to model the history of previous timesteps.

Discretization

For all system parameters we generated a list of the values that occur in our sample of 1000 facts. Where appropriate we introduced an extremely rough discretization into three intervals (low, average, high).
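Such a three-way discretization might look like the sketch below; the cut points are invented for illustration, whereas the actual ones were derived per parameter from the 1000-fact sample:

```python
def discretize(value, low_cut, high_cut, labels=("lo", "avg", "hi")):
    """Map a numeric parameter value to one of three classes.
    Cut points are per-parameter. The default labels match the lah family
    (lo/avg/hi) used in the DLAB bias; a few/avg/lot family was used for
    count-like parameters."""
    if value < low_cut:
        return labels[0]
    if value <= high_cut:
        return labels[1]
    return labels[2]
```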

Modeling the history of timesteps

To model the history of previous timesteps we implemented two concepts: the prediction-gap, i.e. the distance between the timestep Now at which information is available and the timestep Future for which the system load is predicted, and the prediction-memory, i.e. the length of the interval Recently before Now over which parameter values are considered.

The prediction-memory is then used in the condition part of the induced rules, the prediction-gap in the conclusion part.
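Our reading of the h/4 conditions that appear in the induced rules below is that h(Memory, Now, Param, Class) holds when the parameter stayed in the given class throughout the last Memory seconds. A hedged sketch of that check (the actual background definition may differ, e.g. requiring the class to hold "mostly" rather than "always"):

```python
def held_recently(history, now, memory, param, cls, step=20):
    """True iff `param` was in class `cls` at every sampled timestep in
    [now - memory, now]. `history` maps (timestep, param) to the
    discretized class; `step` is the 20 s sampling interval."""
    return all(history.get((t, param)) == cls
               for t in range(now - memory, now + 1, step))
```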

The learning engine: CLAUDIEN

We used the clausal discovery engine CLAUDIEN. Here are some of the CLAUDIEN-specific inputs.

The language

The declarative language bias formalism DLAB was used to define the search space.

dlab_template('
sysload(F, hi) 
<--
len-len:[t(N,Prediction_Gap,F),
         0-len:[toggle(h(tax, N, users,fam)),
                toggle(h(tax, N, non_root,fam)),
                toggle(h(tax, N, processes,fam)),
                toggle(h(tax, N, non_root_processes,fam)),
                toggle(h(tax, N, defunct_processes,fam)),
                toggle(h(tax, N, free_tmp,fam)),
                toggle(h(tax, N, paging_space_used,fam)),
                toggle(h(tax, N, sysload,lah)),
                toggle(h(tax, N, cpu_us,lah)),
                toggle(h(tax, N, cpu_sy,lah)),
                toggle(h(tax, N, cpu_id,lah))
               ]
        ]
').

dlab_variable(toggle,1-1, [y,n]).
dlab_variable(fam, 1-1, [few, avg, lot]).
dlab_variable(lah, 1-1, [lo, avg, hi]).
dlab_taxonomy(tax, [60,40,20,0]).
dlab_variable(term, 1-1, [20,60,300]).
The total number of clauses in this hypothesis space is very large.

Settings

The following non-default settings were used:

talking(2).
scope(local).
local_transform(1).
min_lower_accuracy(0.7).
min_accuracy(1).
non_redundancy(off).
min_refine_coverage(20).
min_coverage(20).
max_complexity(1000).

Results

We did three experiments, with prediction gaps of 20, 60 and 300 seconds, respectively.

Prediction-gap = 20

Of the 8387 solutions found in less than 3 days of cpu time, 137 have high accuracy. There are 3 rules with the maximum accuracy of 95%, e.g.:

rule(7230,
     [accuracy(0.95),total(20),pos(19),neg(1),
      comp(37),cpu(221908),realtime(223763)],
     (sysload(F,hi) :- 
          t(N,20,F) ,
          y(h(60,N,users,avg)) , 
          n(h(60,N,free_tmp,avg)) ,
          n(h(60,N,sysload,avg)) , 
          n(h(60,N,cpu_us,lo)) ,
          n(h(60,N,cpu_sy,lo)))).
This rule, number 7230, was discovered after 61.2 hours of cpu time (62.2 hours real time). It says that the system load will be high 20 seconds from now if in the last 60 seconds users has been average, free_tmp has not been average, the system load has not been average, cpu_us has not been low, and cpu_sy has not been low. This rule holds for 19 of the 20 observed timepoints that meet the condition part, hence 95% accuracy. With 100 timepoints in total in which the system load is high, this rule has a coverage of about 20%.
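The accuracy and coverage figures quoted for each rule follow directly from the counters in the rule header; a minimal sketch of that arithmetic:

```python
def rule_stats(pos, total, all_positives):
    """accuracy: fraction of body-matching timesteps where the head holds;
    coverage: fraction of all high-load timesteps the rule accounts for."""
    return pos / total, pos / all_positives

# rule 7230: pos(19), total(20), 100 high-load timesteps in the sample
# -> accuracy 0.95, coverage 0.19 (about 20%)
```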

Prediction-gap = 60

Of the 3164 solutions found in less than 3 days of cpu time, 65 have high accuracy. There are 40 rules with the maximum accuracy of 86%, e.g.:

rule(208,
     [accuracy(0.857143),total(21),pos(18),neg(3),
      comp(31),cpu(16707),realtime(16965)],
     (sysload(F,hi) :- 
          t(N,60,F) , 
          y(h(60,N,users,avg)) , 
          n(h(60,N,cpu_us,lo)) , 
          n(h(60,N,cpu_sy,lo)) , 
          n(h(40,N,cpu_sy,hi)))).
This rule, number 208, was discovered after 4.6 hours of cpu time (4.7 hours real time). It says that the system load will be high 60 seconds from now if in the last 60 seconds users has been average, cpu_us has not been low, and cpu_sy has not been low, and in the last 40 seconds cpu_sy has not been high. This rule holds for 18 of the 21 observed timepoints that meet the condition part, hence 86% accuracy. With 100 timepoints in total in which the system load is high, this rule has a coverage of about 20%.

Prediction-gap = 300

Of the 44 solutions found in less than 3 days of cpu time, none has an accuracy above 75%. There are 8 rules with the maximum accuracy of 75%, e.g.:

rule(5,
     [accuracy(0.75),total(28),pos(21),neg(7),
      comp(31),cpu(53235.4),realtime(54062)],
     (sysload(F,hi) :- 
          t(N,300,F) , 
          y(h(60,N,processes,avg)) , 
          y(h(40,N,processes,lot)) , 
          y(h(60,N,sysload,hi)) , 
          y(h(60,N,cpu_sy,avg)))).
This rule, number 5, was discovered after 14.8 hours of cpu time (15 hours real time). It says that the system load will be high 5 minutes from now if in the last 60 seconds processes has been average, the system load has been high, and cpu_sy has been average, and in the last 40 seconds processes has been ``lot''. This rule holds for 21 of the 28 observed timepoints that meet the condition part, hence 75% accuracy. With 100 timepoints in total in which the system load is high, this rule has a coverage of about 20%.

Discussion and conclusion

The experiments and results described here merely give a flavor of the kind of rules a relational learner like CLAUDIEN can discover in ASM data. We briefly comment on some current shortcomings and possible next steps.

About the results. Probably the most interesting (though rather unsurprising) piece of ASM information obtained from these experiments is that the number of accurate rules predicting high system load drops as the prediction gap widens from 20 seconds to 5 minutes. Besides that, the cpu parameters in general seem to play an important part.

About the learning task. We only looked at one of many possible learning tasks: finding sufficient conditions in the recent history of the system parameters for future high system load. The notions of prediction-gap and prediction-memory should be further discussed and possibly supplemented or replaced.

About the data. We believe the data collected at our site are not very interesting. The system load varies between 0 and 4, and never reaches a critical level. Some of the other parameters remained constant or varied little. More extreme environments should probably be monitored if we want to discover ``exciting'' rules.

About the background knowledge. Probably the hardest next step will be the definition of additional background knowledge. More sophisticated discretization strategies should be applied, and both the application domain and the history of timesteps should be modeled further.
