Application domain: | Adaptive System Management |
Further specification: | data + script + background knowledge |
Pointers: | Contact Luc Dehaspe ldh@cs.kuleuven.ac.be |
Data complexity: | 1000 facts |
Data format: | Prolog |
With the ever increasing complexity of Client/Server environments, managing such environments becomes both more important and more difficult. Automatic techniques, and in particular learning techniques, could help a lot here. As most system management is still done by hand, successfully applying ILP techniques would be a big step forward.
Two of the main goals of ASM are keeping the performance and the availability of the system as high as possible. By systems we usually mean a network of UNIX machines, with databases and applications running on them. It is of course very valuable to discover what determines the performance and availability of such systems. System management can be roughly divided into four areas.
Four problem areas have been defined. For these problems the focus has been on simplicity and on the availability of datasets; they should thus be considered preliminary. No real attempt has been made to come up with problems that are specifically suited for ILP, so a comparison with other ML techniques may not always be in favor of ILP. The definitions as they stand are meant as a first step; once some results are in, we should look at more complex problems.
The availability of datasets will mostly be the responsibility of Syllogic.
For the first three problems, datasets can easily be generated. Marc Gathier has written a script that periodically produces a list of system parameters. The output can be formatted as Prolog clauses, as insert statements for a database, or as any other ASCII format you like. The script enables everybody to create their own datasets, on various types of machines. It has been agreed that each party will generate and distribute a dataset in this way, so that performance on different machines can be compared. A standard format for this still needs to be agreed upon. For those who have difficulties using the script, Syllogic will give support or send a dataset that has already been generated.
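The Syllogic script itself is not reproduced here. Purely as an illustration of the idea, a periodic collector that dumps system parameters as Prolog facts could look like the following sketch; the parameter names, the parameter/3 fact layout, and the stubbed readings are all assumptions, not the actual script:

```python
import time

def read_parameters():
    # Stub with fixed values; a real collector would parse the
    # output of commands such as uptime, ps and df.
    return {"users": 4, "processes": 112, "sysload": 1.7}

def collect(steps, interval):
    """Collect `steps` parameter dumps, `interval` seconds apart,
    each encoded as one hypothetical parameter/3 Prolog fact per value."""
    facts = []
    for step in range(steps):
        for name, value in read_parameters().items():
            facts.append("parameter(%d, %s, %s)." % (step, name, value))
        if step < steps - 1:
            time.sleep(interval)
    return facts

for fact in collect(steps=2, interval=0):
    print(fact)
```

The same loop could just as well emit SQL insert statements or comma-separated lines, which is the flexibility the script offers.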
Availability of datasets for the last problem still needs to be discussed.
Jeroen Speekenbrink will look into the availability of larger and more complex datasets at customer sites. New problem definitions will probably result from this.
The datasets for the first three problems, as they stand now, correspond to a single table, with a column for each system parameter and a row for each timestep. Attribute-value learners have been applied to these datasets to induce dependencies, holding at a single timestep, between system performance/availability and the other parameters. The challenge for ILP is to find more useful rules that also take the history of previous timesteps into account.
This use of the history of previous timesteps is among the attractive aspects of this application for ILP.
The target rules are of the form:

  IF recently system parameters have been as such
  THEN system load will be high at timestep Future

Raw data To generate the data for the experiment we used a slightly modified version of the two scripts provided by Syllogic.
We executed the scripts on a SPARCstation 2 running SunOS 4.1.3. In about three days, with a parameter dump every 20 seconds, we gathered about 13000 facts of arity 18. From these we selected an "interesting" sequence of 1000 facts. Interesting here means that the system load showed some variance in this 5.5 hour period (during the night, for instance, it was rather uninterestingly stable).
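The 1000-fact window was selected by hand, but the underlying criterion, preferring a stretch where the load actually varies, can be sketched as a sliding-window search for maximal variance. This is an illustration only, not the selection procedure actually used:

```python
def most_varied_window(loads, size):
    """Return the start index of the length-`size` window of `loads`
    with the largest (population) variance of the load values."""
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    return max(range(len(loads) - size + 1),
               key=lambda i: variance(loads[i:i + size]))

# Toy load series: flat "at night", varying afterwards.
loads = [0.1] * 10 + [0.2, 1.5, 3.8, 0.4, 2.9, 1.1]
print(most_varied_window(loads, 5))
```

On the real data the windows would be 1000 timesteps long, and one would probably also want a minimum number of high-load timesteps inside the window.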
In 100 of the 1000 selected timesteps the systemload is high.
Background knowledge We mainly used two types of background knowledge, one related to discretization, one to model the history of previous timesteps.
Discretization For all system parameters we generated a list of values that occur in our sample of 1000 facts. Where appropriate we introduced an extremely rough discretization into three intervals (low, average, high).
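The interval boundaries are not specified above. One simple way to get three roughly equal-frequency intervals is to cut at the tertiles of the observed values; this boundary choice is an assumption, not necessarily the one used:

```python
def make_discretizer(values, labels=("low", "average", "high")):
    """Cut the observed values at their tertiles and return a function
    mapping a value to one of the three interval labels."""
    ordered = sorted(values)
    n = len(ordered)
    cut1, cut2 = ordered[n // 3], ordered[2 * n // 3]
    def discretize(v):
        if v <= cut1:
            return labels[0]
        if v <= cut2:
            return labels[1]
        return labels[2]
    return discretize

# Toy sample of system load values.
loads = [0.2, 0.4, 0.5, 0.9, 1.1, 1.3, 2.5, 3.1, 3.9]
disc = make_discretizer(loads)
print([disc(v) for v in [0.3, 1.2, 3.5]])
```

For parameters such as users or processes the labels would be few/avg/lot instead of lo/avg/hi, matching the fam and lah variables in the DLAB template below.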
Modeling the history of timesteps To model the history of previous timesteps we implemented two concepts: the prediction-gap (how far ahead of the observed history the system load is predicted; the t/3 literal in the rules below) and the prediction-memory (how long a parameter has stayed in a given interval; the h/4 literals below).
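Reading the h/4 literals in the rules below as "parameter P has been in interval V throughout the last Term seconds", the derivation of such history facts can be sketched as follows; the exact predicate reading and the 20-second sampling step are assumptions based on the rules reported later:

```python
def history_holds(series, step, term, sample_interval, predicate):
    """True iff `predicate` held for every sample in the last `term`
    seconds up to and including timestep `step`.  `series` holds one
    (discretized) value per `sample_interval` seconds."""
    n_samples = term // sample_interval
    start = max(0, step - n_samples + 1)
    return all(predicate(v) for v in series[start:step + 1])

# Discretized sysload samples, one every 20 seconds.
sysload = ["lo", "lo", "avg", "hi", "hi", "hi"]

# h(60, 5, sysload, hi): high throughout the last 60 seconds (3 samples).
print(history_holds(sysload, step=5, term=60, sample_interval=20,
                    predicate=lambda v: v == "hi"))
```

The y/n wrappers in the rules then simply assert or negate such a history fact.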
The language The declarative language bias formalism DLAB was used for defining the search space.
dlab_template('
  sysload(F, hi) <--
  len-len:[t(N, Prediction_Gap, F),
           0-len:[toggle(h(tax, N, users, fam)),
                  toggle(h(tax, N, non_root, fam)),
                  toggle(h(tax, N, processes, fam)),
                  toggle(h(tax, N, non_root_processes, fam)),
                  toggle(h(tax, N, defunct_processes, fam)),
                  toggle(h(tax, N, free_tmp, fam)),
                  toggle(h(tax, N, paging_space_used, fam)),
                  toggle(h(tax, N, sysload, lah)),
                  toggle(h(tax, N, cpu_us, lah)),
                  toggle(h(tax, N, cpu_sy, lah)),
                  toggle(h(tax, N, cpu_id, lah))]]
').
dlab_variable(toggle, 1-1, [y, n]).
dlab_variable(fam, 1-1, [few, avg, lot]).
dlab_variable(lah, 1-1, [lo, avg, hi]).
dlab_taxonomy(tax, [60, 40, 20, 0]).
dlab_variable(term, 1-1, [20, 60, 300]).

The total number of clauses in this hypothesis space is astronomical.
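As a rough, back-of-the-envelope estimate of the size of this hypothesis space: assume each of the 11 inner literals is either dropped or included with one of its 2 x 4 x 3 = 24 ground variants (toggle value x taxonomy level x interval label), and that the t/3 literal takes one of the 3 prediction-gap values. This counting model is an assumption and ignores any pruning DLAB itself performs:

```python
# Per inner literal: y/n toggle (2) x taxonomy level (4) x value label (3),
# plus the option of leaving the literal out entirely: 2*4*3 + 1 = 25 choices.
per_literal = 2 * 4 * 3 + 1
# 11 independent inner literals, times 3 choices for the prediction gap.
clauses = 3 * per_literal ** 11
print(clauses)  # about 7.2e15
```

Even as an order-of-magnitude figure this makes clear why the coverage and complexity bounds in the settings below are needed to keep the search feasible.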
Settings The following non-default settings were used:
talking(2).
scope(local).
local_transform(1).
min_lower_accuracy(0.7).
min_accuracy(1).
non_redundancy(off).
min_refine_coverage(20).
min_coverage(20).
max_complexity(1000).
Prediction-gap = 20

Of the 8387 solutions found in less than 3 days of cpu time, 137 have an accuracy of at least 0.8. There are 3 rules with the maximum accuracy of 0.95, e.g.:
rule(7230, [accuracy(0.95), total(20), pos(19), neg(1),
            comp(37), cpu(221908), realtime(223763)],
     (sysload(F, hi) :-
        t(N, 20, F),
        y(h(60, N, users, avg)),
        n(h(60, N, free_tmp, avg)),
        n(h(60, N, sysload, avg)),
        n(h(60, N, cpu_us, lo)),
        n(h(60, N, cpu_sy, lo)))).

This rule, number 7230, was discovered after 61.2 hours of cpu time (62.2 hours real time). It says that sysload will be high 20 seconds from now if in the last 60 seconds users has been average, free_tmp has not been average, the sysload has not been average, cpu_us has not been low, and cpu_sy has not been low. This rule holds for 19 of the 20 observed timepoints that meet its condition part, hence the 0.95 accuracy. With in total 100 timepoints in which the systemload is high, this rule has a coverage of about 19%.
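The accuracy and coverage figures follow directly from the pos/total counts in the rule header and the 100 high-load timesteps mentioned in the data description. As a quick check:

```python
def rule_stats(pos, total, n_positive_examples):
    """Accuracy = fraction of covered timepoints that are truly high-load;
    coverage = fraction of all high-load timepoints the rule covers."""
    return pos / total, pos / n_positive_examples

# Counts taken from rule 7230 above: pos(19), total(20), 100 high-load steps.
acc, cov = rule_stats(pos=19, total=20, n_positive_examples=100)
print(acc, cov)  # 0.95 0.19
```

The same arithmetic gives the figures quoted for the rules in the other two experiments.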
Prediction-gap = 60
Of the 3164 solutions found in less than 3 days of cpu time, 65 have an accuracy of at least 0.8. There are 40 rules with the maximum accuracy of 0.857143, e.g.:
rule(208, [accuracy(0.857143), total(21), pos(18), neg(3),
           comp(31), cpu(16707), realtime(16965)],
     (sysload(F, hi) :-
        t(N, 60, F),
        y(h(60, N, users, avg)),
        n(h(60, N, cpu_us, lo)),
        n(h(60, N, cpu_sy, lo)),
        n(h(40, N, cpu_sy, hi)))).

This rule, number 208, was discovered after 4.6 hours of cpu time (4.7 hours real time). It says that sysload will be high 60 seconds from now if in the last 60 seconds users has been average, cpu_us has not been low, and cpu_sy has not been low, and in the last 40 seconds cpu_sy has not been high. This rule holds for 18 of the 21 observed timepoints that meet its condition part, hence the 0.857143 accuracy. With in total 100 timepoints in which the systemload is high, this rule has a coverage of about 18%.
Prediction-gap = 300
Of the 44 solutions found in less than 3 days of cpu time, none has an accuracy above 0.8. There are 8 rules with the maximum accuracy of 0.75, e.g.:
rule(5, [accuracy(0.75), total(28), pos(21), neg(7),
         comp(31), cpu(53235.4), realtime(54062)],
     (sysload(F, hi) :-
        t(N, 300, F),
        y(h(60, N, processes, avg)),
        y(h(40, N, processes, lot)),
        y(h(60, N, sysload, hi)),
        y(h(60, N, cpu_sy, avg)))).

This rule, number 5, was discovered after 14.8 hours of cpu time (15 hours real time). It says that sysload will be high 5 minutes from now if in the last 60 seconds the number of processes has been average, sysload has been high, and cpu_sy has been average, and in the last 40 seconds the number of processes has been "lot". This rule holds for 21 of the 28 observed timepoints that meet its condition part, hence the 0.75 accuracy. With in total 100 timepoints in which the systemload is high, this rule has a coverage of about 21%.
About the results. Probably the most interesting (though rather unsurprising) piece of ASM information obtained from these experiments is that the number of accurate rules predicting high system load drops as the prediction gap widens from 20 seconds to 5 minutes. Besides that, the cpu parameters in general seem to play an important part.
About the learning task. We only looked at one of the many possible learning tasks: finding sufficient conditions, in the recent history of the system parameters, for future high systemload. The notions of prediction-gap and prediction-memory should be discussed further and possibly supplemented or replaced.
About the data. We believe the data collected at our site are not very interesting. The systemload varies between 0 and 4 and never reaches a critical level, and some of the other parameters remained constant or varied little. More extreme environments should probably be monitored if we want to discover "exciting" rules.
About background knowledge. Probably the hardest next step will be the definition of additional background knowledge. More sophisticated discretization strategies should be applied, and both the application domain and the history of timesteps should be modeled further.