- C.H.Bryant: Data Mining for Chemistry: the Application of Three Machine Induction Tools to a Database of Enantioseparations
- Gunther Sablon: Iterative versionspaces with an application in ILP
- Patrick van der Laag: An analysis of refinement operators in ILP
- Aram Karalic: First order regression
- Saso Dzeroski: Numerical constraints and learnability in ILP

A validated first step is taken towards meeting the need for a
symbolic expert system for enantioseparations by High-Performance
Liquid Chromatography on chiral stationary phases (CSPs). Three data
mining tools are applied to data from a database of published
enantioseparations performed on commercially available Pirkle CSPs
with the aim of inducing generalisations that recommend particular CSP
chiral selectors based on the structural features of an enantiomer
pair. Two of the tools, Golem and Progol, are from the field of
inductive logic programming (ILP) and are available in the public
domain. The other, DataMariner, is a commercially
available tool which generates essentially propositional rules.

The ability of each of these tools to induce useful generalisations
from the database is revealed by investigating whether each one can
make generalisations about the values of attributes, attributes
themselves, relationships between attributes and negations of
literals. The usefulness of generalisations is measured by their
accuracy, size, ease of interpretation and chemical justification.

The ability of DataMariner to make generalisations about the values of
attributes and negations of literals is shown to be sufficient for the
induction of accurate rules from the database: the accuracy of the
most useful rule-set induced by DataMariner is 63% ± 3%, which is more
than ten times the accuracy that would have resulted from a random
choice. However, it is difficult to provide a chemical justification
for the rules. Golem can induce clauses that represent knowledge about
relationships between attributes in the database which cannot be
represented by the essentially propositional rules generated by
DataMariner. A scheme devised for Progol, which allows
generalisations to be made about both the attributes in the database
and their values, enables Progol to induce clauses which are more
concise and easier to justify chemically than those which can be
induced by DataMariner. Progol uses a substantial amount of general
chemical knowledge when making these generalisations. This supports
the claim made by the proponents of ILP that one of the advantages of
ILP over classical machine induction is that ILP allows substantial
background knowledge to be used during induction.
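The representational difference at issue can be made concrete with a toy sketch (invented data, not the chemistry database): a propositional rule can only test attribute values, while a first-order clause can test a relationship between attributes.

```python
# Illustrative contrast on invented toy records: a propositional rule
# tests fixed attribute *values*, while a relational (first-order)
# hypothesis tests a *relationship between attributes*, whatever their
# values. The attribute names are hypothetical.

def propositional_rule(x):
    # value tests only: attr1 = 'aryl' and attr2 = 'aryl'
    return x['attr1'] == 'aryl' and x['attr2'] == 'aryl'

def relational_clause(x):
    # a relationship between the two attributes, independent of values
    return x['attr1'] == x['attr2']

examples = [{'attr1': 'aryl', 'attr2': 'aryl'},
            {'attr1': 'alkyl', 'attr2': 'alkyl'}]
# the relational clause covers both examples; the fixed-value rule
# covers only the first, and no finite set of value tests generalizes
# over all "attr1 equals attr2" cases at once
```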

This thesis consists of two parts: in the first part we develop a
language-independent framework for efficiently solving concept
learning problems; in the second part we apply this framework to
Inductive Logic Programming.

We view concept learning as a search problem. Within the well-known
framework of versionspaces, we present a bi-directional depth-first
search algorithm that learns a maximally general and a maximally
specific concept representation, and contrast it with the
breadth-first approach of Mellish's Description Identification
algorithm. In this context, we identify redundant information elements
in order to reduce the memory needed to store them. We also describe
how automatically generated information elements can replace less
informative ones.
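For readers unfamiliar with versionspaces, the core idea of maintaining maximally specific and maximally general boundaries can be sketched in the classical candidate-elimination style over conjunctive attribute concepts. This is only a baseline illustration of the S/G boundary representation, not the bi-directional depth-first algorithm developed in the thesis.

```python
# Minimal candidate-elimination sketch for conjunctive concepts over
# attribute tuples. '?' matches any value. S is the (unique) maximally
# specific hypothesis, G the set of maximally general ones. Inconsistent
# data (S covering a negative) is not handled in this sketch.

def covers(h, x):
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def more_general(h1, h2):
    # h1 is at least as general as h2
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    S = next(x for x, label in examples if label)   # seed with a positive
    G = [tuple('?' for _ in domains)]
    for x, label in examples:
        if label:
            G = [g for g in G if covers(g, x)]
            # minimally generalize S to cover the positive example
            S = tuple(sv if sv == xv else '?' for sv, xv in zip(S, x))
        else:
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                # minimally specialize g so it excludes x but stays above S
                for i, gv in enumerate(g):
                    if gv != '?':
                        continue
                    for v in domains[i]:
                        if v != x[i]:
                            h = g[:i] + (v,) + g[i + 1:]
                            if more_general(h, S):
                                new_G.append(h)
            G = new_G
    return S, G
```

On two attributes with examples (sunny, warm)+ and (rainy, cold)-, this yields S = (sunny, warm) and G = {(sunny, ?), (?, warm)}: everything between the two boundaries is still consistent with the data.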

Next, we extend this framework to describe the more complex
versionspaces that arise from introducing disjunctions. To make these
practically useful, we gradually restrict the disjunctive versionspaces
by imposing preference criteria based on notions of minimality. This
leads to extensions of the non-disjunctive algorithms to the disjunctive
case.

In the second part of the thesis we show in detail how this general
framework can be instantiated to Inductive Logic Programming. In this
respect we also discuss the integration of machine learning in a
planning system based on Horn clause logic. This illustrates that the use
of a logical representation allows a smooth integration of Machine
Learning and Problem Solving.

In summary, the thesis contributes to the understanding and the
development of search algorithms for concept learning in general, by
developing a language independent framework, and by introducing
several novel and generally applicable concept learning techniques.
The application in the second part of the thesis shows that the
framework is practically useful, and that it contributes to the field of
Inductive Logic Programming.

Our goals were to develop a system which can:

- induce first-order logic concepts which incorporate continuous variables;
- make use of the background knowledge in intensional form;
- model dynamic systems (learn from time series);
- partition attribute space to subspaces and find a submodel for each subspace;
- handle noisy data.

FORS constructs a model in the form of a Prolog program. A covering approach, similar to that of FOIL, is used. The clause-building part of the algorithm is top-down: it starts with the most general candidate clause, covering the entire example set, and then specializes the clause by adding literals. Clause construction uses beam search to guide the algorithm through the space of possible clauses.
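The covering-plus-beam-search scheme can be sketched with a propositional stand-in (FORS itself builds Prolog clauses with first-order literals and regression; the rule language and scoring here are simplified assumptions).

```python
# Propositional stand-in for a covering loop with top-down beam search:
# grow one rule at a time from the empty (most general) rule by adding
# the attribute tests that best improve accuracy, keeping the
# `beam_width` best partial rules; then remove the positives it covers.

def covers(rule, x):
    return all(x[attr] == val for attr, val in rule)

def score(rule, examples):
    covered = [lbl for x, lbl in examples if covers(rule, x)]
    # (accuracy on covered examples, coverage) -- lexicographic quality
    return (sum(covered) / len(covered), len(covered)) if covered else (0.0, 0)

def best_rule(examples, tests, beam_width=3, max_len=3):
    beam, best = [[]], []
    for _ in range(max_len):
        candidates = [r + [t] for r in beam for t in tests if t not in r]
        candidates.sort(key=lambda r: score(r, examples), reverse=True)
        beam = candidates[:beam_width]
        if beam and score(beam[0], examples) > score(best, examples):
            best = beam[0]
    return best

def covering(examples, tests):
    rules, remaining = [], list(examples)
    while any(lbl for _, lbl in remaining):
        rule = best_rule(remaining, tests)
        if not any(covers(rule, x) and lbl for x, lbl in remaining):
            break                      # no progress: stop covering
        rules.append(rule)
        remaining = [(x, lbl) for x, lbl in remaining
                     if not (covers(rule, x) and lbl)]
    return rules
```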

As part of the system, pruning based on the minimum description length (MDL) principle was developed that can also handle continuous variables. MDL pruning turned out to help build more comprehensible models while at the same time preserving the models' predictive power.
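The trade-off MDL pruning exploits can be shown in miniature: the total cost of a model is the bits needed to encode the model plus the bits needed to encode its errors, and a pruned model wins when its total cost is no larger. The encoding constants below are illustrative assumptions, not the encoding used in FORS.

```python
import math

# Hedged sketch of the MDL trade-off behind pruning: total cost =
# bits to encode the model + bits to encode its exceptions (errors).
BITS_PER_LITERAL = 8.0   # assumed, illustrative cost of one literal

def description_length(n_literals, n_errors, n_examples):
    model_bits = n_literals * BITS_PER_LITERAL
    # cost of pointing out which of n_examples the model gets wrong
    error_bits = n_errors * math.log2(max(n_examples, 2))
    return model_bits + error_bits

def prefer_pruned(full, pruned, n_examples):
    # each model is (n_literals, n_errors); prune when the pruned model
    # does not increase the total description length
    return (description_length(*pruned, n_examples)
            <= description_length(*full, n_examples))
```

A shorter model that makes a few extra errors can still have the smaller total code length, which is exactly why MDL pruning tends to produce more comprehensible models without losing predictive power.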

In experiments with physical domains, FORS rediscovered Kepler's third law of planetary motion and the ideal gas law (including rediscovery of the gas constant and the absolute temperature scale).
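The arithmetic core of such a rediscovery is small: fitting the exponent k in T = C * a^k from (semi-major axis, period) data by regression in log space recovers Kepler's k = 3/2. FORS induces clauses with embedded linear regression; the sketch below shows only the numeric fit, on idealized data (a in AU, T in years).

```python
import math

# Idealized planetary data: (semi-major axis in AU, orbital period in
# years) for Mercury, Venus, Earth, Mars, Jupiter.
data = [(0.39, 0.24), (0.72, 0.62), (1.0, 1.0), (1.52, 1.88), (5.20, 11.86)]

# Fit log T = k * log a + c by least squares; k is Kepler's exponent.
xs = [math.log(a) for a, _ in data]
ys = [math.log(T) for _, T in data]
n = len(data)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
print(round(slope, 2))   # exponent close to 1.5, i.e. T^2 proportional to a^3
```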

In all the real-world domains, models using linear regression appealed most to the experts, since linear regression greatly increased the expressive power of the models in comparison with pure first-order logic. The experts were able to analyse the induced models thoroughly, exploit the selected regions of attribute space, and more reliably evaluate a model's quality within those regions.

Modelling of water behaviour in a surge tank showed that FORS can successfully handle time-series data. Additionally, FORS's ability to partition the attribute space, coupled with its linear regression capabilities, proved crucial in inducing a useful model without prior knowledge of the absolute value function.

Models induced from the Lake of Bled data describe the growth of algae in the lake adequately well. Newly induced background literals defining seasons improve the comprehensibility of the induced models. For the expert this was not the first approach to modelling the Lake of Bled's behaviour, and the experiments with FORS have confirmed his opinion that not much more can be done with the current data without additional, more precise and frequent measurements.

A domain expert in the steel-grinding domain reports being satisfied with the results of the machine learning, since our models enabled him to grasp some additional process properties which he would not have been able to discover with classical statistical tools alone. However, he suggests that the machine learning approach should not be used on its own, but should be considered a powerful supplement to existing instruments. In this domain it also turned out that simple background knowledge can sometimes significantly improve the usefulness of induced models.

FORS was also used in the domains of finite element mesh design and the construction of rules for predicting the mutagenic activity of nitroaromatic compounds, where its performance was at the level of the other machine learning algorithms used in those domains.

During the work on the electrical discharge machining process, a data-acquisition environment was established which enabled us to monitor and record crucial process parameters as well as the operator's control actions. Models were induced which, according to the expert, capture the main behaviour patterns of the operator. During the knowledge acquisition process, several important guidelines emerged, concerning mainly the interaction with the domain experts. They confirm that the comprehensibility of the induced model plays an important role in behaviour cloning, and that for successful modelling of expert behaviour one cannot rely on interviews with the expert alone.

We first present a new formalization of the nonmonotonic ILP setting, where examples are interpretations and concepts are clausal theories. Examples may also be given in the form of definite logic programs, from which interpretations are obtained as minimal Herbrand models. A background theory may be taken into account. This formalization makes nonmonotonic ILP a natural extension of propositional learning in which a first-order representation and background knowledge are used.
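The basic test of this setting is whether a clausal theory is true in an example given as a Herbrand interpretation, i.e. a set of facts. A propositional sketch (ground atoms only, invented predicate names) is enough to show the idea:

```python
# Sketch of the nonmonotonic setting's basic test: is a clausal theory
# true in an interpretation (a set of ground atoms)? A clause
# body -> head is true if, whenever every body atom holds in the
# interpretation, at least one head atom holds too. Ground/propositional
# for brevity; the actual setting quantifies over first-order instances.

def clause_true(interpretation, body, head):
    if all(atom in interpretation for atom in body):
        return any(atom in interpretation for atom in head)
    return True          # body not satisfied: clause holds vacuously

def theory_true(interpretation, clauses):
    return all(clause_true(interpretation, b, h) for b, h in clauses)
```

An interpretation containing both `bird` and `flies` satisfies the clause `bird -> flies`, while one containing only `bird` falsifies it; a learner in this setting seeks theories true in all positive example interpretations.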

We then present a study of the computational complexity of inductive logic programming. We introduce PAC-learnability models for both semantics and prove positive learnability results in both. Results concerning learning in the presence of noise are also presented. The results for normal ILP are based on the transformation of ILP problems to propositional form performed by DINUS. The results for nonmonotonic ILP are based on our new formalization. The concept classes considered in our results can be induced by existing ILP systems, such as GOLEM and CLAUDIEN.

Finally, we develop methods for handling real numbers and numerical constraints in both semantics. The DINUS system can be used for learning relational rules with real numbers and relational regression in normal ILP. We extend number handling in CLAUDIEN, adding the capability of using linear equations, thus combining quantitative and qualitative discovery. We finally present LAGRANGE, a machine discovery system tightly related to the two ILP semantics. From example behaviors, LAGRANGE discovers differential equations that describe the dynamics of a given system.
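The kind of discovery LAGRANGE performs can be illustrated on its numeric core: from a sampled behaviour, estimate the derivative numerically and fit a candidate equation dx/dt = k*x. LAGRANGE itself searches a space of candidate term combinations; this sketch fixes the form and shows only one least-squares fit, on a synthetic trajectory.

```python
import math

# Synthetic observed behaviour: a sampled trajectory of x' = -x,
# i.e. x(t) = exp(-t).
dt = 0.01
ts = [i * dt for i in range(200)]
xs = [math.exp(-t) for t in ts]

# Central-difference derivative estimates at interior sample points.
dxdt = [(xs[i + 1] - xs[i - 1]) / (2 * dt) for i in range(1, len(xs) - 1)]
xmid = xs[1:-1]

# Least-squares fit of the candidate equation dx/dt = k * x (no intercept).
k = sum(d * x for d, x in zip(dxdt, xmid)) / sum(x * x for x in xmid)
print(round(k, 2))   # recovered coefficient close to -1.0
```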