Adaptive System Management (KUL)

Application domain: Adaptive System Management
Further specification: data + script + background knowledge
Pointers: Contact Luc Dehaspe ldh@cs.kuleuven.ac.be
Data complexity: 1000 facts
Data format: Prolog

With the ever-increasing complexity of Client/Server environments, the problem of managing such environments becomes both more important and more difficult. Automatic techniques, and learning techniques in particular, could help a lot in solving these problems. As most system management is still done by hand, applying ILP techniques would be a big step forward.

Two of the main goals of ASM are to keep the performance and availability of the system as high as possible. By systems we usually mean a network of UNIX machines, with databases and applications running on them. It is of course very valuable to discover what determines the performance and availability of such systems. System management can be roughly divided into four areas.

It has been decided that the focus for now will be on problems involving management of single UNIX-systems with possibly databases and applications running on them. In the future the focus will shift towards management of more complex networks involving several machines.

Problem definitions

Four problem areas have been defined. For these problems the focus has been on simplicity and on the availability of datasets; they should thus be considered preliminary. No real attempt has been made to come up with problems that are specifically suited for ILP, so a comparison with other ML techniques may not always be in favor of ILP. The definitions as they stand are meant as a first step; once some results are in, we should look at more complex problems.


Data sets

The availability of datasets will mostly be the responsibility of Syllogic.

For the first three problems datasets can easily be generated. Marc Gathier has written a script that periodically produces a list of system parameters. The output can be in the format of Prolog clauses, of insert statements into a database, or of any other ASCII format you like. The script will enable everybody to create their own datasets on various types of machines. It has been agreed that each party will generate a dataset in this way and distribute it, so that performance on different machines may be compared. A standard format for this still needs to be agreed upon. For those who have difficulties using the script, Syllogic will give support or send a dataset that has already been generated.
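The actual scripts are Syllogic's and are not reproduced here. Purely as an illustration of the idea, a minimal sketch of a periodic parameter dump in Prolog-fact format might look as follows; the predicate name `sys_sample` and the two sampled parameters are invented (the real script dumps 18 parameters per fact):

```python
import os
import time

def prolog_fact(timestep, load1, n_procs):
    """Format one sample as a Prolog fact (hypothetical predicate and arity;
    the actual script emits facts of arity 18)."""
    return "sys_sample(%d, %.2f, %d)." % (timestep, load1, n_procs)

def dump(interval=20, samples=3):
    """Print one fact every `interval` seconds (20 s, as in the experiment)."""
    for t in range(samples):
        load1, _, _ = os.getloadavg()  # 1-minute load average (Unix only)
        # rough process count via /proc, where available
        n_procs = sum(1 for d in os.listdir('/proc') if d.isdigit()) \
            if os.path.isdir('/proc') else 0
        print(prolog_fact(t, load1, n_procs))
        time.sleep(interval)

if __name__ == "__main__":
    dump()
```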

Availability of datasets for the last problem still needs to be discussed.

Jeroen Speekenbrink will look into the availability of larger and more complex datasets at customer sites. New problem definitions will probably result from this.

ILP Considerations

The datasets for the first three problems, as they stand now, correspond to a single table, with a column for each system parameter and a row for each timestep. Attribute-value learners have been applied to these datasets to induce dependencies, holding at a single timestep, between system performance/availability and the other parameters. The challenge for ILP is to find more useful rules that also take the history of previous timesteps into account.

This problem might seem less suitable for ILP as the system parameters are all purely numerical values. However, either the list of values is limited, or fairly straightforward discretization is possible.

Nevertheless, this application has several aspects that make it attractive for ILP.

Experimental setup

We further refined the learning task to finding sufficient conditions for a high system load at timestep Future. These conditions are to be selected from the information available at an earlier timestep Now, which is equal to Future - PredictionGap. To make the rules more flexible we introduced a time interval Recently, which starts at Now - PredictionMemory and stops at Now. We then want to induce rules of the type
IF Recently the system parameters have been such-and-such
THEN the system load will be high at timestep Future
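The timestep arithmetic above can be sketched as follows (the function name is ours, for illustration only):

```python
def recently_window(future, prediction_gap, prediction_memory):
    """Return (start, now) of the Recently interval for a given Future.
    Now = Future - PredictionGap; Recently = [Now - PredictionMemory, Now]."""
    now = future - prediction_gap
    return (now - prediction_memory, now)

# Predicting 20 s ahead with a 60 s memory: a rule about timestep
# Future = 1000 may only use parameter values observed in [920, 980].
```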
Raw data

To generate the data for the experiment we used a slightly modified version of the two scripts provided by Syllogic.

We executed the scripts on a SPARCstation 2 running SunOS 4.1.3. In about three days, with a parameter dump every 20 seconds, we gathered about 13000 facts of arity 18. From these we selected an ``interesting'' sequence of 1000 facts; interesting here means that the system load showed some variance in this 5.5-hour period (e.g. during the night it was rather uninterestingly stable).

In 100 of the 1000 selected timesteps the system load is high.
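The report does not specify how the interesting sequence was selected. One simple procedure, sketched here only as an assumption about what such a selection could look like, is to slide a fixed-length window over the load series and keep the window with maximal variance:

```python
def variance(xs):
    """Population variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def most_interesting_window(loads, width=1000):
    """Return the start index of the `width`-long window whose system-load
    values have maximal variance (a sketch, not the actual procedure)."""
    best_start, best_var = 0, -1.0
    for start in range(len(loads) - width + 1):
        v = variance(loads[start:start + width])
        if v > best_var:
            best_start, best_var = start, v
    return best_start
```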

Background knowledge

We mainly used two types of background knowledge: one related to discretization, the other to model the history of previous timesteps.

Discretization

For all system parameters we generated a list of the values that occur in our sample of 1000 facts. Where appropriate we introduced an extremely rough discretization into three intervals (low, average, high).
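Such a three-way discretization might look like the sketch below; the cut points are invented for illustration, whereas the actual ones were derived per parameter from the 1000-fact sample:

```python
def discretize(value, low_cut, high_cut, labels=("lo", "avg", "hi")):
    """Map a numeric parameter value to one of three classes.
    Cut points are per-parameter. The default labels match the lah family
    (lo/avg/hi) used in the DLAB bias; a few/avg/lot family was used for
    count-like parameters."""
    if value < low_cut:
        return labels[0]
    if value <= high_cut:
        return labels[1]
    return labels[2]
```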

Modeling the history of timesteps

To model the history of previous timesteps we implemented two concepts: the prediction-gap, i.e. the distance between the timestep Now at which information is available and the timestep Future for which the system load is predicted, and the prediction-memory, i.e. the length of the interval Recently before Now over which parameter values are considered.

The prediction-memory is then used in the condition part of the induced rules, the prediction-gap in the conclusion part.
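Our reading of the h/4 conditions that appear in the induced rules below is that h(Memory, Now, Param, Class) holds when the parameter stayed in the given class throughout the last Memory seconds. A hedged sketch of that check (the actual background definition may differ, e.g. requiring the class to hold "mostly" rather than "always"):

```python
def held_recently(history, now, memory, param, cls, step=20):
    """True iff `param` was in class `cls` at every sampled timestep in
    [now - memory, now]. `history` maps (timestep, param) to the
    discretized class; `step` is the 20 s sampling interval."""
    return all(history.get((t, param)) == cls
               for t in range(now - memory, now + 1, step))
```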

The learning engine: CLAUDIEN

We used the clausal discovery engine CLAUDIEN. Here are some of the CLAUDIEN-specific inputs.

The language

The declarative language bias formalism DLAB was used to define the search space.

dlab_template('
sysload(F, hi) 
<--
len-len:[t(N,Prediction_Gap,F),
         0-len:[toggle(h(tax, N, users,fam)),
                toggle(h(tax, N, non_root,fam)),
                toggle(h(tax, N, processes,fam)),
                toggle(h(tax, N, non_root_processes,fam)),
                toggle(h(tax, N, defunct_processes,fam)),
                toggle(h(tax, N, free_tmp,fam)),
                toggle(h(tax, N, paging_space_used,fam)),
                toggle(h(tax, N, sysload,lah)),
                toggle(h(tax, N, cpu_us,lah)),
                toggle(h(tax, N, cpu_sy,lah)),
                toggle(h(tax, N, cpu_id,lah))
               ]
        ]
').

dlab_variable(toggle,1-1, [y,n]).
dlab_variable(fam, 1-1, [few, avg, lot]).
dlab_variable(lah, 1-1, [lo, avg, hi]).
dlab_taxonomy(tax, [60,40,20,0]).
dlab_variable(term, 1-1, [20,60,300]).
The total number of clauses in this hypothesis space is very large.

Settings

The following non-default settings were used:

talking(2).
scope(local).
local_transform(1).
min_lower_accuracy(0.7).
min_accuracy(1).
non_redundancy(off).
min_refine_coverage(20).
min_coverage(20).
max_complexity(1000).

Results

We did three experiments, with prediction gaps of 20, 60 and 300 seconds, respectively.

Prediction-gap = 20

Of the 8387 solutions found in less than 3 days of cpu time, 137 have high accuracy. There are 3 rules with the maximum accuracy of 95%, e.g.:

rule(7230,
     [accuracy(0.95),total(20),pos(19),neg(1),
      comp(37),cpu(221908),realtime(223763)],
     (sysload(F,hi) :- 
          t(N,20,F) ,
          y(h(60,N,users,avg)) , 
          n(h(60,N,free_tmp,avg)) ,
          n(h(60,N,sysload,avg)) , 
          n(h(60,N,cpu_us,lo)) ,
          n(h(60,N,cpu_sy,lo)))).
This rule, number 7230, was discovered after 61.2 hours of cpu time (62.2 hours real time). It says that the system load will be high 20 seconds from now if in the last 60 seconds users has been average, free_tmp has not been average, the system load has not been average, cpu_us has not been low, and cpu_sy has not been low. This rule holds for 19 of the 20 observed timepoints that meet the condition part, hence 95% accuracy. With 100 timepoints in total in which the system load is high, this rule has a coverage of about 20%.
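The accuracy and coverage figures quoted for each rule follow directly from the counters in the rule header; a minimal sketch of that arithmetic:

```python
def rule_stats(pos, total, all_positives):
    """accuracy: fraction of body-matching timesteps where the head holds;
    coverage: fraction of all high-load timesteps the rule accounts for."""
    return pos / total, pos / all_positives

# rule 7230: pos(19), total(20), 100 high-load timesteps in the sample
# -> accuracy 0.95, coverage 0.19 (about 20%)
```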

Prediction-gap = 60

Of the 3164 solutions found in less than 3 days of cpu time, 65 have high accuracy. There are 40 rules with the maximum accuracy of 86%, e.g.:

rule(208,
     [accuracy(0.857143),total(21),pos(18),neg(3),
      comp(31),cpu(16707),realtime(16965)],
     (sysload(F,hi) :- 
          t(N,60,F) , 
          y(h(60,N,users,avg)) , 
          n(h(60,N,cpu_us,lo)) , 
          n(h(60,N,cpu_sy,lo)) , 
          n(h(40,N,cpu_sy,hi)))).
This rule, number 208, was discovered after 4.6 hours of cpu time (4.7 hours real time). It says that the system load will be high 60 seconds from now if in the last 60 seconds users has been average, cpu_us has not been low, and cpu_sy has not been low, and in the last 40 seconds cpu_sy has not been high. This rule holds for 18 of the 21 observed timepoints that meet the condition part, hence 86% accuracy. With 100 timepoints in total in which the system load is high, this rule has a coverage of about 20%.

Prediction-gap = 300

Of the 44 solutions found in less than 3 days of cpu time, none has an accuracy above 75%. There are 8 rules with the maximum accuracy of 75%, e.g.:

rule(5,
     [accuracy(0.75),total(28),pos(21),neg(7),
      comp(31),cpu(53235.4),realtime(54062)],
     (sysload(F,hi) :- 
          t(N,300,F) , 
          y(h(60,N,processes,avg)) , 
          y(h(40,N,processes,lot)) , 
          y(h(60,N,sysload,hi)) , 
          y(h(60,N,cpu_sy,avg)))).
This rule, number 5, was discovered after 14.8 hours of cpu time (15 hours real time). It says that the system load will be high 5 minutes from now if in the last 60 seconds processes has been average, the system load has been high, and cpu_sy has been average, and in the last 40 seconds processes has been ``lot''. This rule holds for 21 of the 28 observed timepoints that meet the condition part, hence 75% accuracy. With 100 timepoints in total in which the system load is high, this rule has a coverage of about 20%.

Discussion and conclusion

The experiments and results described here merely give a flavor of the kind of rules a relational learner like CLAUDIEN can discover in ASM data. We briefly comment on some current shortcomings and possible next steps.

About the results. Probably the most interesting (though rather unsurprising) piece of ASM information obtained from these experiments is that the number of accurate rules predicting high system load drops as the prediction gap widens from 20 seconds to 5 minutes. Besides that, the cpu parameters in general seem to play an important part.

About the learning task. We only looked at one of many possible learning tasks: finding sufficient conditions in the recent history of the system parameters for future high system load. The notions of prediction-gap and prediction-memory should be further discussed and possibly supplemented or replaced.

About the data. We believe the data collected at our site are not very interesting. The system load varies between 0 and 4, and never reaches a critical level. Some of the other parameters remained constant or varied little. More extreme environments should probably be monitored if we want to discover ``exciting'' rules.

About the background knowledge. Probably the hardest next step will be the definition of additional background knowledge. More sophisticated discretization strategies should be applied, and both the application domain and the history of timesteps should be modeled further.
