Data Mining Institute*
Computer Sciences Department
University of Wisconsin -- Madison
Annual Review -- June 1, 2001
Abstracts of Talks
Reduced Data
Classifiers
Olvi Mangasarian with Yuh-Jye Lee
An
algorithm is proposed which generates a nonlinear
kernel-based
separating surface that requires as little
as 1%
of a large dataset for its explicit evaluation.
Although
the entire dataset is used
to
generate the nonlinear surface, the final kernel makes explicit use
of a
very small portion of the data, the remainder of which
can be
thrown away. This is achieved
by
making use of a rectangular
by
kernel
that
greatly reduces the size of the quadratic
program
to be solved and simplifies the characterization of
the
nonlinear separating surface.
Here,
the
rows of
represent the
original
data points while the
rows of
represent
a
greatly reduced subset of
data points.
Computational results
indicate
that test set correctness for the reduced support
vector
machine (RSVM), with a nonlinear separating surface
that
depends on a small randomly selected portion
of the
dataset, is better or much better than that of a conventional support
vector
machine with a nonlinear surface that explicitly depends on
either
the entire dataset or the smaller randomly selected one.
ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/00-07.ps
Proximal Plane Classification
Glenn Fung with Olvi Mangasarian
Instead
of a standard support vector machine (SVM)
that
classifies points by assigning them to one
of two
disjoint halfspaces,
points
are classified
by
assigning them to the closest of two parallel planes (in input
or
feature space) that
are
pushed apart as far as possible. This formulation, which
can
also be interpreted as regularized least squares and
considered
in the much more general context of regularized
networks
of Evgeniou-Pontil-Poggio, leads to an extremely
fast
and simple algorithm for generating a linear or nonlinear classifier
that
merely requires
the
solution of a single system of linear equations. In contrast, standard
SVMs
solve a quadratic or a linear
program
that require considerably longer computational time.
Computational
results on publicly available datasets
indicate
that the proposed proximal SVM classifier has comparable test
set
correctness to that of standard SVM classifiers,
but
with considerably faster computational time that can
be an
order of magnitude faster. The linear proximal SVM can
easily
handle large datasets as indicated by the classification
of a 2
million point 10-attribute set in 20.8 seconds.
All
computational results are based on 6 lines
of
MATLAB code.
ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-02.ps
Survival-Time Classification of Breast Cancer Patients
Yuh-Jye Lee with Olvi Mangsarian & William H. Wolberg
The principal
objective of this work
is to
classify 253 breast cancer patients into three survival
groups
each of which having a Kaplan-Meier survival curve distinct
from
the other two groups. This is achieved by a two stage process.
Stage I
consists of actually generating the three groups:
good,
intermediate and poor survival groups, using as group
criteria:
lymph node status, tumor size and (adjunctive) chemotherapy.
Stage
II consists of using six features (not including lymph node status)
in a
support vector machine (SVM) classifier to classify
a given
patient into one of these three groups which
we are
able to achieve with 82.7% tenfold cross validation correctness.
Important
findings include the following:
1. The
good group consists of 69 patients all without chemotherapy.
2. The
poor group consists of 73 patients all with chemotherapy
3. The
intermediate group consists of 44 patients without chemotherapy
and 67
with chemotherapy.
4.
Pairwise p-values, based
on the
logrank statistic, for the distinct survival curves for the
three
groups above is no greater than 0.0076.
5. The
intermediate group's 67 patients with chemotherapy
have a better
survival curve than the group's 44 patients
without chemotherapy. The p-value for this
pair of distinct
survival
curves is 0.0306. Furthermore, the survival curve of
the 67
patients with chemotherapy in this group is not significantly
different
(p-value 0.0817) from the good group survival curve.
Of particular
significance is the last item above, because we have
identified
a classifiable intermediate group, for which patients
with
chemotherapy do better than those without chemotherapy,
which
is the reverse of that for the overall population of 253 patients.
ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-03.ps
Mass
Collaboration and Data Mining
Raghu
Ramakrishnan
Mass
Collaboration is a new "P2P"-style approach to large-scale knowledge
sharing, with applications in customer support, focused community development,
and capturing knowledge distributed within large organizations. Effectively supporting this paradigm raises
many technical challenges, and offers intriguing opportunities for mining
massive amounts of data captured continually from user interactions. Data
mining offers the promise of increased business intelligence, and also improved
user experiences, leading to increased participation and greater quality in the
knowledge that is captured, both of which are central objectives in Mass
Collaboration. In this talk, I will introduce Mass Collaboration and discuss
some important data mining tasks motivated by this paradigm.
On
Evaluating Joins with Different Classes of Predicates
Jeff
Naughton with Jin-Yi Cai, Venkatesan Chakaravarthy & Raghav Kaushik
Joins
are a fundamental operation in database systems, forming a basic
building
block of all but the most trivial of queries.
Historically,
with
relational systems and their limited type systems, join algorithm
research
has focussed on "equijoins," in which the join predicate is
equality. More recently, with the advent of
object-relational systems
with
their richer type systems, research has turned to developing
algorithms
for spatial joins and set-containment joins.
In this talk,
we
consider the complexity of joins with different join predicates.
We use
a graph pebbling model to characterize joins by the length of
their
optimal pebbling strategies and the complexity of discovering
these
strategies. Our results show that
equijoins are the easiest of
all
joins, with pebbling strategies that meet the lower bound over all
join
problems and that can be found in polynomial time. By contrast,
spatial-overlap
and set-containment joins are the hardest joins, with
optimal
pebbling strategies that reach the upper bound over all join
problems,
and discovering optimal pebbling strategies NP-complete.
For
set-containment joins, discovering the optimal pebbling is also
MAX-SNP-Complete. Our results shed some light on the
difficulty the
applied
community has had in finding "good" algorithms for
spatial-overlap
and set-containment joins.
Estimating
the Selectivity of XML Path Expressions for Internet Scale Applications
Ashraf
Aboulnaga with Alaa Alameldeen and Jeff Naughton
Data on
the Internet is increasingly presented in XML format. This
allows
for novel applications that query all this data using some XML
query
language. All XML query languages use
path expressions to
navigate
through the tree structure of the data.
Estimating the
selectivity
of these path expressions is therefore essential for
optimizing
queries in these languages. In this
work, we propose two
techniques
for capturing the structure of complex large-scale XML data
as
would be handled by Internet-scale applications in a small amount
of
memory for estimating the selectivity of XML path expressions:
summarized
path trees and summarized Markov tables.
We
experimentally
demonstrate the accuracy of our proposed techniques,
and
explore the different situations that would favor one technique
over
the other. We also demonstrate that our
proposed techniques are
more
accurate than the best previously known technique.
Slice
Modeling in Classification
Meta Voelker
with Michael Ferris
A
common method for testing the effectiveness of a classifier or a
predictive model is to use cross-validation.
This procedure generates
a collection of similar models having the same structure, but
different data instantiations.
Furthermore, in many cases, large
portions of the data remain constant.
We term such collections of
problems slice models. Slice models
tend to be data-intensive and
time consuming to solve, due to the fact that every slice is generated
and solved independently. By incorporating
additional information in
the solution process, such as the common structure and shared data, we
are able to solve these models much more efficiently and are able to
process much larger real-world problems.
To demonstrate the slice
modeling approach, we apply it to cross-validation models concerned
with feature selection. In addition, we
apply the same approach to
parameter estimation in classification models using likelihood basis
pursuit.
Optimization
Issues for Huge Datasets and Long Computation
Michael C.
Ferris with Todd S. Munson, Qun Chen, Jeffrey Linderoth & Meta Voelker
Many
problems arising in data mining give rise to optimization models
that
are currently intractable due to memory or time restrictions.
Examples
from classification, feature selection, imaging, and treatment
planning
immediately spring to mind. In this
talk, we will review
several
approaches that are being tested for such problems and outline
the
design of a simple toolkit that provides data miners with relevant
optimization
technology.
Three
key issues will be discussed in some detail:
- Out of core data handling
- Grid computation, including fault
tolerant computation and data sharing
- Structure exploitation
We will
describe some semismooth optimization approaches and the
FATCOP
optimization algorithm in the context of these three issues, illuminating
how
approaches for these problems can significantly improve computational
time,
accuracy and increase the size of datasets treated.