UCF Data Systems Group
    |     Search
Projects / Multidimensional Data
  Multidimensional Data Minimize  

Similarity search is to retrieve objects considered ‘similar’ to the object of interest within some user-specified threshold. This problem is of fundamental importance in a large variety of application domains such as image, Web, time series, and so on. Mainly, our research focus on three main aspects of the similarity search, namely data representations, distance metrics, and query processing.

A significant challenge in managing data in the conventional high dimensional vector space model is the curse of dimensionality, which causes most multidimensional index structures to perform poorly. To address this problem, we have proposed to reduce dimensionalities of the original data and execute quick-and-dirty search over the approximate data with no false dismissal guarantee. In our method, the dimensions are partitioned into disjoint subsets a given data vector is approximated by the aggregate values extracted from each dimension subset. The algorithm to generate the optimal dimension partition is also designed. Consequently, our reduction technique has robust performances and can well adapt to different characteristics of data in diverse applications.

Besides to speed up the expensive similarity search, we also investigate to embed the multi-dimensional data onto the intrinsic low-dimensional manifold structure in order to improve the retrieval accuracy. The conventional technique, Principal Component Analysis seeks to derive a set of axes along which the data exhibit greater variances than others and it mainly preserves global structures of the data. Locality Preserving Projection encodes the neighborhood information into a similarity matrix and derives a linear manifold embedding as the optimal approximation to this local structure. However, neither of them handles both local and global structures in a systematic way. We propose the Local and Global Structures Preserving Projection to address this problem, which minimizes the distances of the points in each local neighborhood while dispersing them far away from their corresponding remote points.

We also conduct research study on constrained clustering. The pairwise instance-level constraints indicate whether the two data points belong to the same cluster. This information is utilized to boost the clustering accuracy. We derive the error probability of the assignment for a group of points. Based on the results, we design an algorithm to place data points into their most suitable feasible clusters to minimize the chance of the mis-assignments. This paper is created by Hao Cheng (haocheng AT eecs.ucf.edu).

    
   
© 2008 UCF Data Systems Group All Rights Reserved
Department of Electrical Engineering & Computer Science | University of Central Florida | Orlando, FL 32816
Phone: 407.823.2341   Comments or questions: