Discovery Science 2006

Conference Program

The Joint (DS and ALT) Conference Program is now available in PDF format. You can also take a look at the DS-06 program schedule or the ALT-06 program schedule in HTML format.

Joint Invited Speaker (DS and ALT)

Andrew Ng, Stanford University, USA

Reinforcement Learning and Apprenticeship Learning for Robotic Control

Many control problems, such as autonomous helicopter flight, legged robot locomotion, and autonomous driving, are difficult because (i) it is hard to write down, in closed form, a formal specification of the control task (for example, what is the cost function for "driving well"?), (ii) it is difficult to learn good models of the robot's dynamics, and (iii) it is expensive to find closed-loop controllers for high dimensional, highly stochastic domains. Using apprenticeship learning, in which we learn from a human demonstration of a task, as a unifying theme, I will present formal results showing how many control problems can be efficiently addressed given access to a demonstration. In presenting these ideas, I will also draw from a number of case studies, including applications in autonomous helicopter flight, quadruped obstacle negotiation, snake robot locomotion, and high-speed off-road navigation.

Finally, I will also describe the application of these ideas to the STAIR (STanford AI Robot) project, which has the long term goal of integrating methods from all major areas of AI—including spoken dialog/NLP, manipulation, vision, navigation, and planning—to build a general-purpose, "intelligent" home/office robotic assistant.

Joint work with Pieter Abbeel, Adam Coates, Ashutosh Saxena, Jeremy Kolter, Honglak Lee, Yirong Shen, Justin Driemeyer, Justin Kearns, and Chioma Osondu.

Invited Speakers (DS)

Carole Goble, University of Manchester, UK

Putting Semantics into e-Science and the Grid

e-Science is scientific investigation performed through distributed global collaborations between scientists and their resources, and the computing infrastructure that enables this. Scientific progress increasingly depends on pooling know-how and results; making connections between ideas, people, and data; and finding and interpreting knowledge generated by strangers in ways other than those intended at the time of its collection. It is about harvesting and harnessing the collective intelligence of the scientific community. It has as much to do with intelligent information management as with sharing scarce resources like large scale compute power or expensive instrumentation.

The Semantic Web is an initiative to enable and operate a semantic infrastructure for gathering and exploiting the Web's collective intelligence, exploiting technologies primarily from artificial intelligence and data management computing. Applying the Semantic Web paradigm to e-Science seems like it might be a winner. Moreover, e-Science looks promising as the nursery that the fledgling Semantic Web needs in order to mature.

This talk makes a case for why e-Science needs the Semantic Web and the Semantic Web needs e-Science, using my experiences from working in the Life Sciences.

Padhraic Smyth, University of California, Irvine, USA

Data-Driven Discovery using Probabilistic Hidden Variable Models

Generative probabilistic models have proven to be a very useful framework for machine learning from scientific data. Key ideas that underlie the generative approach include (a) representing complex stochastic phenomena using the structured language of graphical models, (b) using latent (hidden) variables to make inferences about unobserved phenomena, and (c) leveraging Bayesian ideas for learning and prediction. This talk will begin with a brief review of learning from data with hidden variables and then discuss some exciting recent work in this area that has direct application to a broad range of scientific problems. A number of different scientific data sets will be used as examples to illustrate the application of these ideas in probabilistic learning, such as time-course microarray expression data, functional magnetic resonance imaging (fMRI) data of the human brain, text documents from the biomedical literature, and sets of cyclone trajectories.

Invited Speakers (ALT)

Gunnar Raetsch, Max Planck Society, Germany

The Solution of Semi-Infinite Linear Programs using Boosting-like Methods

We consider methods for the solution of large linear optimization problems, in particular so-called Semi-Infinite Linear Programs (SILPs) that have a finite number of variables but infinitely many linear constraints. We illustrate that such optimization problems frequently appear in machine learning and discuss several examples including maximum margin boosting, multiple kernel learning and structure learning.

In the second part we review methods for solving SILPs. Here, we are particularly interested in methods related to boosting. We review recent theoretical results concerning the convergence of these algorithms and conclude this work with a discussion of empirical results comparing these algorithms.
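
For readers unfamiliar with the term, the problem class described in this abstract can be written in a generic form, followed by one possible boosting-flavoured instance. This is only an illustrative sketch: the symbols (c, a(t), b(t), the index set T, the example weights d, the labels y_i and the hypothesis class H) are notation assumed here, not taken from the talk, and the second program is a standard hard-margin formulation chosen as an example rather than the speaker's own.

```latex
% Generic SILP: finitely many variables x, one linear constraint
% for each index t in a (possibly infinite) set T.
\[
\min_{x \in \mathbb{R}^n} \; c^\top x
\quad \text{subject to} \quad
a(t)^\top x \ge b(t) \quad \text{for all } t \in T .
\]
% Assumed illustration: a hard-margin boosting problem with N training
% examples (x_i, y_i) and one constraint per hypothesis h in the class H.
\[
\min_{\gamma \in \mathbb{R},\; d \in \mathbb{R}^N} \; \gamma
\quad \text{subject to} \quad
\sum_{i=1}^{N} d_i \, y_i \, h(x_i) \le \gamma \;\; \text{for all } h \in H,
\qquad \sum_{i=1}^{N} d_i = 1, \quad d_i \ge 0 .
\]
```

In the second program the variables are the N example weights and the bound γ, so the number of variables is finite, while the constraints range over every hypothesis in H, which may be infinite.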

Hans Ulrich Simon, Ruhr-University Bochum, Germany

The Usage of the Spectral Norm in Learning Theory: Some Selected Topics

In the talk, we review some known results about the statistical query complexity of a concept class and the spectral norm of its correlation matrix. Since spectral norms are widely used in various other areas, we are then able to put statistical query complexity in a broader context. We briefly describe some surprising connections to (seemingly) different topics in learning theory, complexity theory, and cryptography. A connection to the so-called Hidden Number Problem, which plays an important role in proving bit-security of cryptographic functions, will be discussed in somewhat more detail.

The slides of the talk are available online.

Tutorials

Michael May, Fraunhofer Institute for Autonomous Intelligent Systems, Germany

Geographic and Spatial Data Mining

The widespread use of ubiquitous and mobile technologies such as sensor networks, GPS, mobile phones and RFID, as well as the recent success of Google Earth, has led to a situation in which more and more data mining applications will have to deal with non-trivial problems of spatio-temporal data analysis. Applications range from telecommunication, retail and market research to scientific applications from ecology or epidemiology.

Despite the importance of spatial information, standard data mining tools and methods cannot adequately deal with it. Consequently, important information is thrown away, leading to non-optimal results. Recent years have seen several lines of research that try to change this situation. Various classes of data mining algorithms (e.g. clustering, association rules, decision trees, subgroup discovery) have been upgraded to handle geographic objects such as lines, points and polygons and their spatial relationships. Nicely complementing classical approaches that have been pioneered in geostatistics (e.g. Kriging, Point Pattern Analysis), those approaches are often rooted in some form of Multi-Relational Data Mining.

In this tutorial, we will first clarify the various data types relevant for geographic data mining and work out the specific characteristics and challenges of geographic data. Next, we discuss several examples of algorithms that take advantage of these data types. Finally, we present a wide range of applications to illustrate the potential, successes and shortcomings of current Spatial Data Mining approaches. We conclude by pointing out some future challenges and directions.

Luis Torgo, University of Porto, Portugal

Using R for Data Mining and Scientific Discovery

R is a freely downloadable language and environment for data analysis. The R community has been growing at a very fast rate, as has the list of available add-on packages, which address a very large set of application domains. The main purpose of this tutorial is to illustrate R's capabilities on typical data mining and scientific discovery tasks. We aim to convince you that R is an excellent tool for implementing ideas to solve specific tasks within these areas. We will pursue our goal by presenting a set of concrete case studies. These case studies will be described, and all steps necessary to reach the results using R will be provided, both to introduce you to R and to allow you to continue, adapt, and change these "solutions" after attending the tutorial. An associated web site will be made available containing all code and data necessary for you to replicate what will be shown in the tutorial, following the open source spirit of the R project.

Our presentation of R will be illustrated by three different case studies. The first is an ecological modelling task, where the main objective is to obtain models that provide early forecasts of harmful algae blooms in a river dam used to collect potable water. The second case study is related to stock market trading. We will show how to obtain models for these complex dynamic systems, and also how to use these models for decision making. Finally, the third case study addresses the exploratory analysis of microarray genomic data, which is so common in bioinformatics applications.

List of Accepted Papers

Long Papers

Jae-Seong Ahn, Yang-Won Lee and Key-Ho Park. Classification of Changing Regions Based on Temporal Context in Local Spatial Association
Albert Bifet and Ricard Gavalda. Kalman Filters and Adaptive Windows for Learning in Data Streams
Catherine Blake and Meredith Rendall. Scientific Discovery: A View from the Trenches
Marc Boulle and Carine Hue. Optimal Bayesian 2D-discretization for variable ranking in regression
Krzysztof Ciesielski and Mieczyslaw Klopotek. Text Data Clustering by Contextual Graphs
Armando Fernandes and Susana Nascimento. Automatic Water Eddy Detection in SST Maps Using Random Ellipse Fitting and Vectorial Fields for Image Segmentation
Pedro Ferreira, Paulo J. Azevedo, Rui Brito and Candida G. Silva. Mining Approximate Motifs in Time Series
Yaakov HaCohen-Kerner, Hananya Beck, Elchai Yehudai and Dror Mughaz. Identifying Historical Period and Ethnic Origin of Documents Using Stylistic Feature Sets
Yasuto Higa, Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda. A new family of string classifiers based on local relatedness
Ata Kaban, Jianyong Sun, Somak Raychaudhury and Louisa Nolan. On Class Visualisation for High Dimensional Data: Exploring Scientific Data Sets
Takashi Katoh, Kouichi Hirata and Masateru Harao. Mining Sectorial Episodes from Event Sequences
Heidi Koivistoinen, Minna Ruuska and Tapio Elomaa. A Voronoi Diagram Approach to Autonomous Clustering
Taneli Mielikainen, Pance Panov and Saso Dzeroski. Itemset Support Queries using Frequent Itemsets and Their Condensed Representations
Yukihiro Nakamura, Shin Ando, Kenji Aoki, Hiroyuki Mano and Einoshin Suzuki. Strategy Diagram for Identifying Play Strategies in Multi-view Soccer Video Data
ThanhPhuong Nguyen and Tu Bao Ho. Prediction of domain-domain interactions using inductive logic programming from multiple genome databases
Jan Poland and Thomas Zeugmann. Clustering Pairwise Distances with Missing Data: Maximum Cuts versus Normalized Cuts
Antti Rasinen, Jaakko Hollmen and Heikki Mannila. Analysis of Linux evolution using aligned source code segments
Rita Ribeiro and Luis Torgo. Rule-based Prediction of Rare Extreme Values
Jean Sallantin and Christopher Dartnell. A Pragmatic Logic of Scientific Discovery
Milton Severo and Joao Gama. Change detection with Kalman Filter and CUSUM
Tomasz Stepinski, Soumya Ghosh and Ricardo Vilalta. Automatic Recognition of Landforms on Mars Using Terrain Segmentation and Classification
Gyorgy Szarvas, Richard Farkas and Andras Kocsor. A Multilingual Named Entity Recognition System Using Boosting and C4.5 Decision Tree Learning Algorithms
Xin Wang and Ata Kaban. Model-based estimation of word saliency in text

Regular (Short) Papers

Hanene Borchani, Nahla Ben Amor and Khaled Mellouli. Learning Bayesian Network Equivalence Classes from Incomplete Data
Laurent Brisson. Interesting patterns extraction using prior knowledge
Isabel Nepomuceno Chamorro, Juan Antonio Nepomuceno Chamorro, Roberto Ruiz and Jesus S. Aguilar-Ruiz. Databases Reduction simultaneously by Ordered Projection
Alipio M. Jorge, Fernando Pereira and Paulo J. Azevedo. Visual interactive subgroup discovery with numerical properties of interest
Lobna Karoui, Nacera Bennacer and Marie-Aude Aufaure. Contextual Ontological Concepts Extraction
Fabio B. Losa, Pau Origoni and Gilbert Ritschard. Experiences from a Socio-Economic Application of Induction Trees
Ricardo Martinez, Nicolas Pasquier, Claude Pasquier and Lucero Lopez-Perez. Interpreting Microarray Experiments Via Co-expressed Gene Groups Analysis (CGGA)
Shin-ichi Minato. Symmetric Item Set Mining Based on Zero-suppressed BDDs
Mizuho Mishima and Makoto Kikuchi. Mathematical Models of Category-Based Induction
Makoto Miwa, Daisaku Yokoyama and Takashi Chikayama. Automatic Construction of Static Evaluation Functions for Computer Game Players
Mihaela Oprea. Mapping Ontologies in an Air Pollution Monitoring and Control Agent-based System
Aritz Perez, Pedro Larranaga and Inaki Inza. Information Theory and Classification Error in Probabilistic Classifiers
Joe Phillips, Ronald Edwards and Raghuveer Kumarakrishnan. Checking Scientific Assumptions by Modeling
Gonzalo Ramos-Jimenez, Jose del Campo-Avila and Rafael Morales-Bueno. Incremental Algorithm Driven by Error Margins
Nazha Selmaoui, Claire Leschi, Dominique Gay and Jean-Francois Boulicaut. Feature Construction and delta-Free Sets in 0/1 Samples
Roberto Theron. Visual Knowledge Discovery in Paleoclimatology with Parallel Coordinates
Hye Sung Yoon, Sang Ho Lee, Sung Bum Cho and Ju Han Kim. A Novel Framework for Discovering Robust Cluster Results
Li Juan Zhang and Zhou Jun Li. Gene Selection for Classifying Microarray Data Using Grey Relation Analysis