Data Mining Institute

Data Mining Institute*

Computer Sciences Department

University of Wisconsin -- Madison

Annual Review -- June 1, 2001

Abstracts of Talks

Reduced Data Classifiers

Olvi Mangasarian with Yuh-Jye Lee

An algorithm is proposed which generates a nonlinear

kernel-based separating surface that requires as little

as 1% of a large dataset for its explicit evaluation.

Although the entire dataset is used

to generate the nonlinear surface, the final kernel makes explicit use

of a very small portion of the data, the remainder of which

can be thrown away. This is achieved

by making use of a rectangular by kernel

that greatly reduces the size of the quadratic

program to be solved and simplifies the characterization of

the nonlinear separating surface.

Here, the rows of represent the original

data points while the rows of represent

a greatly reduced subset of data points. Computational results

indicate that test set correctness for the reduced support

vector machine (RSVM), with a nonlinear separating surface

that depends on a small randomly selected portion

of the dataset, is better or much better than that of a conventional support

vector machine with a nonlinear surface that explicitly depends on

either the entire dataset or the smaller randomly selected one.

ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-07.ps

Proximal Plane Classification

Glenn Fung with Olvi Mangasarian

Instead of a standard support vector machine (SVM)

that classifies points by assigning them to one

of two disjoint halfspaces,

points are classified

by assigning them to the closest of two parallel planes (in input

or feature space) that

are pushed apart as far as possible. This formulation, which

can also be interpreted as regularized least squares and

considered in the much more general context of regularized

networks of Evgeniou-Pontil-Poggio, leads to an extremely

fast and simple algorithm for generating a linear or nonlinear classifier

that merely requires

the solution of a single system of linear equations. In contrast, standard

SVMs solve a quadratic or a linear

program that require considerably longer computational time.

Computational results on publicly available datasets

indicate that the proposed proximal SVM classifier has comparable test

set correctness to that of standard SVM classifiers,

but with considerably faster computational time that can

be an order of magnitude faster. The linear proximal SVM can

easily handle large datasets as indicated by the classification

of a 2 million point 10-attribute set in 20.8 seconds.

All computational results are based on 6 lines

of MATLAB code.

ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-02.ps

Survival-Time Classification of Breast Cancer Patients

Yuh-Jye Lee with Olvi Mangsarian & William H. Wolberg

The principal objective of this work

is to classify 253 breast cancer patients into three survival

groups each of which having a Kaplan-Meier survival curve distinct

from the other two groups. This is achieved by a two stage process.

Stage I consists of actually generating the three groups:

good, intermediate and poor survival groups, using as group

criteria: lymph node status, tumor size and (adjunctive) chemotherapy.

Stage II consists of using six features (not including lymph node status)

in a support vector machine (SVM) classifier to classify

a given patient into one of these three groups which

we are able to achieve with 82.7% tenfold cross validation correctness.

Important findings include the following:

1. The good group consists of 69 patients all without chemotherapy.

2. The poor group consists of 73 patients all with chemotherapy

3. The intermediate group consists of 44 patients without chemotherapy

and 67 with chemotherapy.

4. Pairwise p-values, based

on the logrank statistic, for the distinct survival curves for the

three groups above is no greater than 0.0076.

5. The intermediate group's 67 patients with chemotherapy

have a better survival curve than the group's 44 patients

without chemotherapy. The p-value for this pair of distinct

survival curves is 0.0306. Furthermore, the survival curve of

the 67 patients with chemotherapy in this group is not significantly

different (p-value 0.0817) from the good group survival curve.

Of particular significance is the last item above, because we have

identified a classifiable intermediate group, for which patients

with chemotherapy do better than those without chemotherapy,

which is the reverse of that for the overall population of 253 patients.

ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-03.ps

Mass Collaboration and Data Mining

Raghu Ramakrishnan

Mass Collaboration is a new "P2P"-style approach to large-scale knowledge sharing, with applications in customer support, focused community development, and capturing knowledge distributed within large organizations. Effectively supporting this paradigm raises many technical challenges, and offers intriguing opportunities for mining massive amounts of data captured continually from user interactions. Data mining offers the promise of increased business intelligence, and also improved user experiences, leading to increased participation and greater quality in the knowledge that is captured, both of which are central objectives in Mass Collaboration. In this talk, I will introduce Mass Collaboration and discuss some important data mining tasks motivated by this paradigm.

On Evaluating Joins with Different Classes of Predicates

Jeff Naughton with Jin-Yi Cai, Venkatesan Chakaravarthy & Raghav Kaushik

Joins are a fundamental operation in database systems, forming a basic

building block of all but the most trivial of queries. Historically,

with relational systems and their limited type systems, join algorithm

research has focussed on "equijoins," in which the join predicate is

equality. More recently, with the advent of object-relational systems

with their richer type systems, research has turned to developing

algorithms for spatial joins and set-containment joins. In this talk,

we consider the complexity of joins with different join predicates.

We use a graph pebbling model to characterize joins by the length of

their optimal pebbling strategies and the complexity of discovering

these strategies. Our results show that equijoins are the easiest of

all joins, with pebbling strategies that meet the lower bound over all

join problems and that can be found in polynomial time. By contrast,

spatial-overlap and set-containment joins are the hardest joins, with

optimal pebbling strategies that reach the upper bound over all join

problems, and discovering optimal pebbling strategies NP-complete.

For set-containment joins, discovering the optimal pebbling is also

MAX-SNP-Complete. Our results shed some light on the difficulty the

applied community has had in finding "good" algorithms for

spatial-overlap and set-containment joins.

Estimating the Selectivity of XML Path Expressions for Internet Scale Applications

Ashraf Aboulnaga with Alaa Alameldeen and Jeff Naughton

Data on the Internet is increasingly presented in XML format. This

allows for novel applications that query all this data using some XML

query language. All XML query languages use path expressions to

navigate through the tree structure of the data. Estimating the

selectivity of these path expressions is therefore essential for

optimizing queries in these languages. In this work, we propose two

techniques for capturing the structure of complex large-scale XML data

as would be handled by Internet-scale applications in a small amount

of memory for estimating the selectivity of XML path expressions:

summarized path trees and summarized Markov tables. We

experimentally demonstrate the accuracy of our proposed techniques,

and explore the different situations that would favor one technique

over the other. We also demonstrate that our proposed techniques are

more accurate than the best previously known technique.

Slice Modeling in Classification

Meta Voelker with Michael Ferris

A common method for testing the effectiveness of a classifier or a
predictive model is to use cross-validation. This procedure generates
a collection of similar models having the same structure, but
different data instantiations. Furthermore, in many cases, large
portions of the data remain constant. We term such collections of
problems slice models. Slice models tend to be data-intensive and
time consuming to solve, due to the fact that every slice is generated
and solved independently. By incorporating additional information in
the solution process, such as the common structure and shared data, we
are able to solve these models much more efficiently and are able to
process much larger real-world problems. To demonstrate the slice
modeling approach, we apply it to cross-validation models concerned
with feature selection. In addition, we apply the same approach to
parameter estimation in classification models using likelihood basis
pursuit.

Optimization Issues for Huge Datasets and Long Computation

Michael C. Ferris with Todd S. Munson, Qun Chen, Jeffrey Linderoth & Meta Voelker

Many problems arising in data mining give rise to optimization models

that are currently intractable due to memory or time restrictions.

Examples from classification, feature selection, imaging, and treatment

planning immediately spring to mind. In this talk, we will review

several approaches that are being tested for such problems and outline

the design of a simple toolkit that provides data miners with relevant

optimization technology.

Three key issues will be discussed in some detail:

- Out of core data handling

- Grid computation, including fault tolerant computation and data sharing

- Structure exploitation

We will describe some semismooth optimization approaches and the

FATCOP optimization algorithm in the context of these three issues, illuminating

how approaches for these problems can significantly improve computational

time, accuracy and increase the size of datasets treated.

* Funded by a gift from the Microsoft Corporation.