A Report on the Summer School on
Relational Data Mining
17-18 August 2002, Helsinki, Finland

Saso Dzeroski and Bernard Zenko

Relational Data Mining (RDM) is the multi-disciplinary field dealing with knowledge discovery from relational databases consisting of multiple tables. To emphasize the contrast to typical data mining approaches that look for patterns in a single relation of a database, the name Multi-Relational Data Mining (MRDM) is often used as well. Mining data which consists of complex/structured objects also falls within the scope of this field: the normalized representation of such objects in a relational database requires multiple tables. The field aims at integrating results from existing fields such as inductive logic programming (ILP), KDD, data mining, machine learning and relational databases; producing new techniques for mining multi- relational data; and practical applications of such techniques. Present RDM approaches consider all of the main data mining tasks, including association analysis, classification, clustering, learning probabilistic models and regression. The pattern languages used by single-table data mining approaches for these data mining tasks have been extended to the multiple-table case. Relational pattern languages now include relational association rules, relational classification rules, relational decision trees, and probabilistic relational models, among others. RDM algorithms have been developed to mine for patterns expressed in relational pattern languages. Typically, data mining algorithms have been upgraded from the single-table case: for example, distance-based algorithms for prediction and clustering have been upgraded by defining distance measures between examples/instances represented in relational logic. RDM methods have been successfully applied across many application areas, ranging from the analysis of business data, through bioinformatics (including the analysis of complete genomes) and pharmacology (drug design) to Web mining (e.g., information extraction from Web sources). The Summer School on Relational Data Mining provided a comprehensive introduction to the techniques and applications of relational data mining by leading experts in the field. The lectures given at the school are summarized below. The slides of the lectures were published as handouts and are also available for download at the Web page of the school http://www-ai.ijs.si/SasoDzeroski/RDMSchool .

The introductory lecture, "An introduction to relational data mining" by Saso Dzeroski, first summarized the standard data mining tasks (classification, regression, clustering, association discovery) and approaches (trees, rules, nearest-neighbor) and illustrated them for the propositional case. After demonstrating the need to mine (multi)-relational data, the topic of RDM was introduced and a brief overview of RDM approaches was given.
The second lecture, "An introduction to inductive logic programming", given by Peter Flach on behalf of Nada Lavrac, introduced the field of ILP, which is concerned with learning logic programs from examples and background knowledge. Fundamental topics such as searching the space of program clauses and the generality lattice induced by theta-subsumption were covered, as well as generic ILP approaches based on the search of refinement graphs and least general generalization.
The lecture "Propositionalization as a way of understanding RDM and ILP", also presented by Peter Flach, covered another generic approach to ILP, namely the approach of transforming a relational learning problem to a propositional one. The talk clarified the relationship between relational and propositional learning and the role of representation in RDM. It also presented an approach to propositionalization for individual-centered strongly typed representations, where feature construction is guided by types.
The three main messages of the lecture "A methodology of ILP" by Luc De Raedt were that: (1) ILP applies essentially to any machine learning / data mining task, not just concept learning, (2) There is a recipe for deriving new ILP algorithms from propositional ones, and (3) ILP as an expressive framework has many special cases, which are often studied separately.
The lecture "Logical trees for classification, regression and clustering" by Hendrik Blockeel presented an approach to using decision trees for relational problems. The approach extends well known techniques for induction of decision trees in two ways: (1) Predictive clustering generalizes over several induction tasks (classification regression and clustering), while (2) First order logical decision trees make trees usable in the context of ILP. The resulting technique therefore combines the high expressiveness of first order logical decision trees with the efficiency and accuracy of decision trees.
The lecture "Relational subgroup discovery", given by Stefan Wrobel, dealt with the problem of finding interesting patterns which are only valid in selected regions of the instance space. These local patterns are often impossible to find if we try to model the entire instance space. The talk first introduced several medical and business problems which can be successfully solved by relational subgroup discovery, and then presented the algorithm MIDOS which can be used for finding relational descriptions of subgroups.
The next lecture by Stefan Wrobel, titled "Relational distance-based methods", presented several relational instance based learning systems and explained the distance functions used in these systems. The talk also presented several successful applications of these techniques in the fields of chemistry and biology.
The lecture "Kernel-based learning from structured data", given by Thomas Gaertner, first gave a clear explanation of the basics of support vector machines. The problem of selecting an appropriate kernel function for a given domain was pointed out, emphasizing that the kernel is a part of the background knowledge about a domain. The talk then discussed how to construct kernel functions for structured data and concluded with some example applications of this technique.
The lecture "Learning statistical models from relational data" given by Lise Getoor presented an extension of statistical models for relational problems. It first covered the topic of Bayesian networks, which are the foundation of the Probabilistic Relational Models (PRMs). PRMs, ways to learn their structure and parameters from data, and their applications to practical problems were then discussed.
"Bayesian logic programs" (BLPs), discussed in the lecture by Luc De Raedt (presenting joint work with Kristian Kersting), are the first order equivalent of Bayesian networks. In addition, they generalize pure Prolog, dynamic Bayesian netorks, dynamic Bayesian multinets, hidden Markov models etc. The talk first focussed on the language of BLPs, then on leaning the parameters and structure of BLPs.
In the lecture "Applications of ILP/RDM to bioinformatics", Ross King presented a number of successful applications of ILP/RDM to real problems in the field of bioinformatics. The applications presented included structure activity relationship modeling, functional genomics and "The Robot Scientist Project".
The lecture "RDM Applications; An overview", given by Saso Dzeroski, presented a variety of ILP/RDM applications with a more detailed description of some selected applications. Besides the applications in the area of bioinformatics, which were the topic of the previous talk, the lecture presented applications from areas of medicine, environmental science, traffic engineering, mechanical engineering, text/web mining, natural language processing, business data analysis, music, software engineering and adaptive systems management.
The last lecture on "Inductive databases" was given by Luc De Raedt. Inductive databases store both data and patterns valid in the data. In inductive databases, data mining is viewed as a querying process. The first part of the talk presented several examples of (preliminary) inductive databases while the second part described a logic oriented view on the principles of inductive databases.

The Summer School concluded with a panel discussion, where the lecturers and participants raised a number of interesting issues concerning the future of relational data mining. Interesting statements from the panel include: "The future of RDM is in upgrading (probabilistic models, SVMs, neural networks) and downgrading (to make it more efficient for specific applications, e.g., on sequences) (Luc De Raedt)"; "The different approaches to RDM should be integrated" (Hendrik Blockeel); "RDM should be more of an engineering discipline than art" (Thomas Gaertner); "The European and US scientific communities working on RDM topics should communicate and not develop in isolation" (Lise Getoor). The Summer School was organized by the Jozef Stefan Institute,
Ljubljana, with the help and support of the University of Helsinki. It was financially supported by ILPnet2 (The Network of Excellence in Inductive Logic Programming). It was attended by 34 participants from 13 countries (including the USA and New Zealand).

Created by LaTeX2html on 2002-11-04.

A Report on the Summer School on Relational Data Mining 17-18 August 2002, Helsinki, Finland

A Report on the Summer School on
Relational Data Mining
17-18 August 2002, Helsinki, Finland