The International Summer School on Knowledge Discovery in Databases (KDD) and Data Mining took place in the picturesque town of Caminha in the north of Portugal, between 28 September and 3 October 1998. The school had about 80 participants, who came from both academia (55) and industry (25). Of the industrial participants, 16 came from 6 Portuguese companies and 9 came from about 8 other European companies.
The first three days covered various KDD topics, including machine learning, clustering and data reduction, among others. A large proportion of the lectures presented during the last three days of the summer school was dedicated to Inductive Logic Programming (ILP) and KDD, under the title of the ILPnet2 Seminar on ILP and KDD. The organization of this seminar was supported by the ILPnet2 End-User Club (EUC).
In one of his lectures, Ashwin Srinivasan from the University of Oxford explained what ILP is and described Progol as an example system. He then discussed some medium- and large-scale applications of ILP, particularly in biology and biochemistry. He stressed that ILP overcomes some limitations of propositional approaches, such as the difficulty of introducing new features automatically and of using background knowledge, both of which are relevant to KDD. The talk concluded with some of the current limitations of ILP methods.
Stefan Wrobel from GMD, Germany, stressed that ILP is better suited to databases with several tables (relations) and to problems with complex relational descriptions (as in chemistry, for instance). Another advantage is the ability to use background knowledge. S. Wrobel also described some issues related to the scalability of ILP systems. These included global sampling and local sampling techniques (the latter selecting a different sample for each hypothesis to be tested). S. Wrobel then contrasted propositional with multi-relational instance-based learning, as in the RIBL system. In this part he described approaches to calculating first-order similarity, and how these measures can be applied, for instance, to protein data. One typical KDD task is deviation detection within subgroups. A particularly interesting part of S. Wrobel's presentation concentrated on how this can be done within an ILP setting. In this context he described the methods used in the MIDOS system, which include refinement operators, systematic search and sampling for estimating subgroup frequencies, among others.
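The idea of deviation detection within subgroups, with frequencies estimated from a sample, can be illustrated with a minimal single-table sketch. This is not the MIDOS algorithm: the quality function (coverage times the difference between the target share in the subgroup and in the whole population), the attribute names and the toy records are all assumptions made for illustration only.

```python
import random

def covers(conditions, row):
    """True if the row satisfies every attribute = value condition."""
    return all(row.get(attr) == value for attr, value in conditions.items())

def target_share(rows, target):
    """Fraction of rows for which the target predicate holds."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if target(r)) / len(rows)

def subgroup_quality(rows, conditions, target):
    """Size-weighted deviation: coverage * (share in subgroup - share overall)."""
    members = [r for r in rows if covers(conditions, r)]
    if not members:
        return 0.0
    coverage = len(members) / len(rows)
    return coverage * (target_share(members, target) - target_share(rows, target))

def sampled_quality(rows, conditions, target, sample_size, seed=0):
    """Estimate the same quality on a random sample instead of all rows."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    return subgroup_quality(sample, conditions, target)

# Hypothetical toy records: 4 "north" rows (3 flagged), 6 "south" rows (1 flagged).
ROWS = (
    [{"region": "north", "flag": True}] * 3
    + [{"region": "north", "flag": False}]
    + [{"region": "south", "flag": True}]
    + [{"region": "south", "flag": False}] * 5
)

def is_flagged(row):
    return row["flag"]

exact = subgroup_quality(ROWS, {"region": "north"}, is_flagged)
estimate = sampled_quality(ROWS, {"region": "north"}, is_flagged, sample_size=6)
```

Here `exact` uses every record, while `estimate` looks at only six of them. Drawing a fresh sample for each candidate subgroup corresponds loosely to the local sampling mentioned above, and reusing one sample for all candidates to global sampling.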
On the issue of scalability of ILP systems, Alipio Jorge explained in his lecture how the problem can be alleviated with the help of a stronger language bias, and how the language bias can be set in a declarative way. The technical part then concentrated on various mechanisms useful for this aim, such as rule models, grammars and restricted refinement rules. A. Jorge then described various meta-languages useful in data mining. In this context he presented various extensions to SQL, such as DMQL, which supports finding characteristic rules, and MSQL, which provides facilities for the generation and selection of association rules.
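The generation and selection of association rules mentioned above can be sketched in a few lines. The example below is not MSQL or DMQL syntax; it is a hypothetical Python illustration that generates single-item rules from toy transactions and keeps those meeting assumed minimum support and confidence thresholds.

```python
from itertools import permutations

def single_item_rules(transactions, min_support, min_confidence):
    """Generate rules lhs -> rhs over single items and select the strong ones.

    support    = share of transactions containing both lhs and rhs
    confidence = share of transactions containing lhs that also contain rhs
    """
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    rules = []
    for lhs, rhs in permutations(items, 2):
        lhs_count = sum(1 for t in transactions if lhs in t)
        both = sum(1 for t in transactions if lhs in t and rhs in t)
        if lhs_count == 0:
            continue
        support = both / n
        confidence = both / lhs_count
        if support >= min_support and confidence >= min_confidence:
            rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical toy transactions, for illustration only.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

strong_rules = single_item_rules(TRANSACTIONS, min_support=0.5, min_confidence=0.9)
```

With these thresholds only the rule from "butter" to "bread" is selected. A declarative query language would state the same thresholds as part of the query rather than as function arguments.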
Arno Knobbe from Syllogic (NL) discussed the issues of embedding ILP within KDD systems. In this context the advantages of visual presentation were stressed. On the scalability of systems, some important principles were pointed out: one should compute as much as possible during pre-processing, exploit the column-oriented nature of many systems and design algorithms that allow data to be distributed among different processors or processes. He concluded on an optimistic note: scalable ILP is possible, although it requires a lot of engineering.
Concerning applications, Peter van der Putten from Sentient Machine Research, Amsterdam (EUC member), stressed that it is important to build vertical, niche applications focused on a specific industrial sector or purpose. The applications mentioned included profiling, matching cases to profiles and discovery of segments. One interesting application involved coding a large image in terms of a pre-stored (but extensible) set of smaller images (part of a sky or a tree). Indeed, according to the speaker, more attention should be paid to image processing and information retrieval.
Stefan Wrobel from GMD, whose lectures were already referred to earlier, presented applications of MIDOS to a survey of German hospitals, purchase data and production process quality data. The applications of RIBL were oriented towards chemistry.
Ashwin Srinivasan described joint work in the area of pharmacophore discovery, which is important for lowering blood pressure. The discoveries obtained with the help of Progol were well received by specialists in the area.
Arno Knobbe from Syllogic (EUC member) described applications in the area of aircraft maintenance and adaptive systems management. The latter aims at finding causes of poor performance and finding optimal configurations of system parameters. Both Progol and Tilde were used. Both were found to be a bit slow, although Tilde seemed to scale up better.
Olga Stepankova from the Czech Technical University, who represented Atlantis Telekom (EUC member), plans to use ILP as well as CBR and dedicated special-purpose methods for automated intelligent handling of phone calls. Petr Miksovsky used FOIL to compress component sheets in the electronics industry, obtaining results which are now being analyzed by experts.
Saso Dzeroski presented several interesting environmental applications of machine learning, including: modelling the dynamics of red deer populations (mainly with the RETIS system); assessment and forecasting of changes in agricultural ecosystems, in particular modelling of Collembola insects in soil (mainly with M5, NN and linear regression); modelling and prediction of phytoplankton growth with equation discovery (Lagramge); and environmental epidemiology studies oriented towards influenza, acute bronchitis etc. (mainly with C4.5 and CN2). All these applications support the view that ML techniques can help to understand and solve environmental problems.
Kathy Astrahantseff from Dialogis Software & Services (Germany) - EUC member - presented a problem from the area of protein interaction. The problem is how to separate false positives from true positives in experimental outcomes.
Gabor Proszeky from Morphologic (Hungary) - EUC member - presented several applications in the area of natural language processing within his company.
Nada Lavrac and Tanja Urbancic from the Jozef Stefan Institute and Milan Stepanovic from Studio Phi (EUC member) presented the idea of forming a business oriented network consisting of companies and academic institutions.
The school also included a panel, "On Collaboration between Universities and Industry: What should be the Role of End-User-Club and Similar Organizations". The participants of this panel were Reza Nakhaeizadeh, Saso Dzeroski, Arno Knobbe, Filipe Silverio, Stefan Wrobel and Pavel Brazdil. Part of the discussion concentrated on which organizations the EUC should target. The widely accepted conclusion was that these must include tool vendors and solution providers. Academic partners are not in a position to provide complete turnkey solutions to end-users that are not willing to deal with technology directly. Here an EUC involving both final end-users and solution providers can help overcome the communication gap. According to Filipe Silverio, this gap represents the most important barrier to collaboration between universities and industry. He stated that for most real end-users, i.e., organizations which have the data to be mined, techniques must have been proven successful before they are accepted in decision making. A. Knobbe expects collaboration between the academic partners and EUC members to become tighter and more frequent, thus minimizing this gap.
The organization of this event, carried out mainly by Dr. Fernando Moura-Pires and Dr. Pavel Brazdil, was highly acclaimed. The support of Carlos Soares is also acknowledged, particularly regarding contacts with the several Portuguese companies that participated in this event.