A Study on Distance-based Outlier Detection on Uncertain Data
No. | 8 |
---|---|
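Name | Salman Ahmed SHAIKH |
Name (katakana reading) | サルマンアーマットシェイク |
Degree | Doctor of Engineering |
Awarding university | University of Tsukuba |
Date conferred | March 25, 2014 |
Supervisor | Hiroyuki Kitagawa (北川 博之) |
Thesis title (main) | A Study on Distance-based Outlier Detection on Uncertain Data |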
Thesis title (secondary, Japanese) | 不確実データに対する距離に基づく外れ値検出に関する研究 |
Abstract, etc. | ■ Abstract: To address the problem of outlier detection on uncertain data, we focus on the distance-based approach, because it is the simplest and most commonly used approach and can serve as a preprocessing step before more sophisticated, application-dependent outlier detection techniques are applied. In this dissertation, the uncertainty of data is modelled by the Gaussian probability density function, because it is the most important and most commonly used distribution in statistics. The dissertation addresses three problems related to outlier detection on uncertain data. 1) Distance-based outlier detection on uncertain data: We give a novel definition of distance-based outliers on uncertain data. Since the distance probability computation is expensive, a cell-based approach is proposed to index the dataset objects and speed up the outlier detection process. The cell-based approach uses bounds on the outlier score (#D-neighbors) to identify and prune cells that contain only inliers; similarly, it can detect cells that contain only outliers. Finally, the exact #D-neighbors are computed for the un-pruned objects using the naive nested-loop approach. 2) Top-k outlier detection on uncertain data: A top-k distance-based outlier detection approach is presented. To detect top-k outliers from uncertain data efficiently, we propose a data structure called the populated-cells list (PC-list). Using the PC-list, the top-k outlier detection algorithm needs to consider only a fraction of the dataset objects and hence quickly identifies candidate objects for the top-k outliers. 3) Continuous outlier detection on uncertain data streams: A distance-based approach is proposed to continuously detect outliers from a set of uncertain objects’ states that originate synchronously from a group of data sources (e.g., sensors in a wireless sensor network (WSN)). The set of objects’ states at a timestamp is called a state set. Usually, the duration between two consecutive timestamps is very short, and the states of the objects may not change much in this duration. Therefore, to eliminate unnecessary computation at every timestamp, an incremental outlier detection approach is proposed that uses the outlier detection results obtained at the previous timestamp to detect outliers at the current timestamp. Finally, extensive experimental evaluations on real and synthetic datasets are presented for each of the proposed outlier detection approaches to demonstrate their accuracy, efficiency and scalability. |
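
The naive nested-loop scoring and the top-k selection described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation, and it rests on several assumptions not stated in the record: objects are one-dimensional, all share the same standard deviation `sigma`, the outlier score is the expected number of D-neighbors, and an object is reported as an outlier when that score is at most `threshold`. All function and parameter names are hypothetical. The cell-based pruning and the PC-list are not modelled here.

```python
import heapq
import math


def phi(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def neighbor_probability(mu_a: float, mu_b: float, sigma: float, d: float) -> float:
    """P(|A - B| <= d) for independent A ~ N(mu_a, sigma^2), B ~ N(mu_b, sigma^2)."""
    # A - B is Gaussian with mean (mu_a - mu_b) and standard deviation sigma * sqrt(2).
    delta = mu_a - mu_b
    spread = sigma * math.sqrt(2.0)
    return phi((d - delta) / spread) - phi((-d - delta) / spread)


def expected_neighbors(means: list[float], sigma: float, d: float) -> list[float]:
    """Naive nested loop: expected number of D-neighbors for every object."""
    scores = []
    for i, mu_i in enumerate(means):
        total = 0.0
        for j, mu_j in enumerate(means):
            if i != j:  # an object is not counted as its own neighbor
                total += neighbor_probability(mu_i, mu_j, sigma, d)
        scores.append(total)
    return scores


def detect_outliers(means: list[float], sigma: float, d: float, threshold: float) -> list[int]:
    """Indices of objects whose expected #D-neighbors is at most `threshold` (assumed rule)."""
    scores = expected_neighbors(means, sigma, d)
    return [i for i, score in enumerate(scores) if score <= threshold]


def top_k_outliers(means: list[float], sigma: float, d: float, k: int) -> list[int]:
    """Indices of the k objects with the fewest expected D-neighbors."""
    scores = expected_neighbors(means, sigma, d)
    return heapq.nsmallest(k, range(len(means)), key=scores.__getitem__)
```

For example, with `means = [0.0, 0.1, 0.2, 0.3, 10.0]`, `sigma = 0.1` and `d = 1.0`, the object at 10.0 has an expected neighbor count close to zero while each of the others has a count close to three, so both functions single out the last object. The nested loop costs quadratic time in the number of objects, which is the expense that pruning techniques such as the cell-based approach aim to reduce.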