The Database Society of Japan

A Study on Distance-based Outlier Detection on Uncertain Data


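No.: 8
Name: Salman Ahmed SHAIKH
Name (katakana reading): サルマンアーマットシェイク
Degree: Doctor of Philosophy in Engineering (博士(工学))
Awarding university: University of Tsukuba
Date conferred: March 25, 2014
Advisor: Hiroyuki Kitagawa (北川 博之)
Thesis title (main): A Study on Distance-based Outlier Detection on Uncertain Data
Thesis title (secondary): 不確実データに対する距離に基づく外れ値検出に関する研究
Abstract and related information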

■ Abstract
Managing, querying, and mining uncertain data have become important because the majority of real-world data is now accompanied by uncertainty. Uncertainty in data is often caused by limitations of the underlying data-collection equipment, and it is sometimes introduced deliberately to preserve data privacy. The uncertainty information in the data is useful and can be used to improve the quality of the results derived from it. This dissertation therefore studies the problem of outlier detection on uncertain data.

To address this problem, we focus on the distance-based approach, because it is the simplest and most commonly used approach and can serve as a preprocessing step before more sophisticated, application-dependent outlier detection techniques are applied. The uncertainty of the data is modelled by the Gaussian probability density function, because it is the most important and most commonly used distribution in statistics. The dissertation solves three problems related to outlier detection on uncertain data.

1) Distance-based outlier detection on uncertain data: We give a novel definition of distance-based outliers on uncertain data. Because computing distance probabilities is expensive, we propose a cell-based approach that indexes the dataset objects and speeds up the outlier detection process. The cell-based approach identifies and prunes the cells that contain only inliers, based on bounds on each cell's outlier score (#D-neighbors). In the same way, it can detect the cells that contain only outliers. Finally, exact #D-neighbors are computed for the unpruned objects using the naïve nested-loop approach.

2) Top-k outlier detection on uncertain data: We present a top-k distance-based outlier detection approach. To detect the top-k outliers in uncertain data efficiently, we propose a data structure called the populated-cells list (PC-list). Using the PC-list, the top-k outlier detection algorithm needs to consider only a fraction of the dataset objects and can therefore quickly identify candidate objects for the top-k outliers.

3) Continuous outlier detection on uncertain data streams: We propose a distance-based approach that continuously detects outliers from a set of uncertain object states that originate synchronously from a group of data sources (e.g., sensors in a wireless sensor network (WSN)). The set of object states at a timestamp is called a state set. The interval between two consecutive timestamps is usually very short, so the states of the objects may not change much within it. Therefore, to eliminate unnecessary computation at every timestamp, we propose an incremental approach that uses the outlier detection results obtained at the previous timestamp to detect outliers at the current timestamp.

Finally, extensive experimental evaluations on real and synthetic datasets are presented for each of the proposed outlier detection approaches to demonstrate their accuracy, efficiency, and scalability.
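
The abstract does not state the dissertation's exact outlier definition, so the following is only a minimal sketch of the general idea under stated assumptions: each object is an isotropic Gaussian with a known mean and a shared standard deviation `sigma`, the outlier score is the expected number of D-neighbors, and the distance probability is estimated by Monte Carlo sampling rather than computed exactly. All function names and parameters (`neighbor_probability`, `expected_d_neighbors`, `min_neighbors`, and so on) are hypothetical. The sketch uses only a plain nested loop; it implements neither the cell-based pruning nor the PC-list.

```python
import math
import random


def neighbor_probability(mean_a, mean_b, sigma, d, rng, samples=500):
    """Monte Carlo estimate of Pr(dist(A, B) <= d) for two uncertain objects.

    Assumption: each object is an isotropic Gaussian around its mean.
    """
    hits = 0
    for _ in range(samples):
        a = [rng.gauss(m, sigma) for m in mean_a]
        b = [rng.gauss(m, sigma) for m in mean_b]
        if math.dist(a, b) <= d:
            hits += 1
    return hits / samples


def expected_d_neighbors(means, sigma, d, seed=0, samples=500):
    """Nested-loop score: expected number of D-neighbors of every object."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    scores = [0.0] * len(means)
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            p = neighbor_probability(means[i], means[j], sigma, d, rng, samples)
            scores[i] += p  # the relation is symmetric, so update both objects
            scores[j] += p
    return scores


def detect_outliers(means, sigma, d, min_neighbors, seed=0):
    """Indices whose expected #D-neighbors falls below a hypothetical threshold."""
    scores = expected_d_neighbors(means, sigma, d, seed)
    return [i for i, s in enumerate(scores) if s < min_neighbors]


def top_k_outliers(means, sigma, d, k, seed=0):
    """Indices of the k objects with the fewest expected D-neighbors."""
    scores = expected_d_neighbors(means, sigma, d, seed)
    return sorted(range(len(means)), key=lambda i: scores[i])[:k]


# Toy data: four objects close together and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (10.0, 10.0)]
outliers = detect_outliers(points, sigma=0.1, d=1.0, min_neighbors=2)
top1 = top_k_outliers(points, sigma=0.1, d=1.0, k=1)
```

In this toy input, the isolated object at index 4 has essentially no chance of lying within distance 1.0 of any other object, so it is the one flagged by both functions. The quadratic number of pairwise probability estimates is the cost that pruning techniques such as the cell-based approach aim to avoid.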
