Speaker: | Inderjit S. Dhillon |
Department of Computer Sciences | |
University of Texas at Austin |
Title: Co-clustering, Matrix Approximations and Bregman Divergences
Abstract:
Many applications in data mining and machine learning require the analysis of large two-dimensional matrices. Depending on the application, these matrices can have very different characteristics: in text mining, co-occurrence matrices are large, sparse and non-negative while DNA microarray analysis yields smaller, dense matrices with positive as well as negative entries. In data analysis, it is often desirable (a) to find "low-parameter" matrix approximations, and (b) to "co-cluster" such matrices, i.e., simultaneously cluster rows as well as columns.
In this talk, I will discuss a framework that inextricably links a certain class of matrix approximations with co-clustering. The approximation error can be measured using a non-trivial class of distortion measures, called Bregman divergences, that have connections to the exponential family of probability distributions. Our algorithms allow us to handle a wide variety of matrices and distortion measures within a unifying framework, and are able to efficiently construct good co-clusterings and the corresponding low-parameter matrix approximations. I will conclude by presenting experimental results on text and microarray data.