Unsupervised Anomaly Detection for High Dimensional Data with Embedded Manifold
Problem Background
Real world data is often high dimensional. However, we may only need a few of the attributes or a specific combination of them to carry out the subsequent learning process, be it clustering, classification or identifying the rare events. A comparatively reduced feature space also saves us from the noise aggregation effects in higher dimensional space. The idea of representing data with fewer but more vital, information rich features is known as the dimensionality reduction. This is closely tied with the manifold approximation or intrinsic structure recovery problem, credit goes to the manifold hypothesis (Chapelle:2006, fefferman:2016). According to this hypothesis, high dimensional data can be efficiently summarized in a much lower dimensional feature space, typically a non-linear manifold. There are two associated problems to this approach. First, learning a manifold structure is a difficult task as we neither know the structure beforehand nor can we visualize it. Second, whatever dimensionality reduction method we design, the underlying manifold structure embedded in the original data should be preserved during the transformation to the reduced space.
Let us start with explaining the first problem a bit more in detail. The traditional measure of distance such as Euclidean distance can not accurately measure the relative closeness of data points sampled from the non linear manifold. Consider the data points E, D, F in Fig. 1(a). With the structure, E-to-D are closer than E-to-F, meaning that E and D are more similar than E and F. But if using the Euclidean distance, represented by the dotted lines, EF is shorter than ED, making the learning algorithm to believe E and F more similar than E and D. Furthermore, these examples happen in a 3 dimensional space. It means that when the space structure exists, the Euclidean distance can break down even in a relatively low-dimensional space. A similar real life example could be the air travel route selection between two places on a earth surface. It is not the direct Euclidean distance that governs the route selection but the distance constrained by earth shape and structure which is known as geodesic distance. In Fig. 1(a), the solid green line highlights the geodesic path between E and F.
To solve this problem, we devise a new similarity measure based on the concept of minimum spanning tree (MST). MST has the capability of approximating the geodesic distance in presence of non linear manifold. To achieve that it only uses the knowledge of neighboring points, nothing more than that. Please refer to Fig. 1(b), where MST track the geodesic path to provide an approximation of geodesic distance between E an F.
The second problem that concerns us is how can we preserve the MST approximated structure in the low dimensional space. Whatever dimensionality reduction method we choose to follow be it PCA, NMF or even Autoencoder, we need to find a way to incorporate the MST guided distance there. Here, we see a chance to provide an integrated solution to the both intrinsic structure recovery and preservation problem and that's what our research is about. We also provide online version of each of our algorithms for streaming data.
Though the algorithms developed by us is specific to anomaly detection, however, they can be easily altered to be applied in clustering and classification problem.
Work 1: LoMST: A MST based anomaly detection method
(*This work Won the best poster award, at TAMU Conference on Energy, 2019*)
In this work, we propose a new MST based anomaly detection scheme for structured data space. Rather than applying MST to the entire data set, we choose to follow a two-stage procedure. At first, a global MST is used to separate the distant anomalous clusters from the rest of the data set. Then, we build local MSTs for the remaining instances to detect point wise anomalies. The proposed MST-based method is effective and registers the best performance when it is compared with a wide array of methods on 20 benchmark data sets. The superiority of the proposed method inspires us to apply it to a real life hydropower data set. The MST-based method is successful in detecting different families of anomalies, achieving the merit of subspace-based methods without losing the benefits of local neighborhood based methods. The validity of the anomalies detected is cross validated by two other anomaly detection methods and confirmed by the domain experts and maintenance operators who provided us the hydropower data in the first place. Root causes and threshold values for key attributes that contribute to the anomalies are determined in the form of a decision tree. The knowledge generated from the anomaly detection analyses helps service engineers continuously monitor the turbine operation and potentially diagnose and predict the malfunctions of turbines in time.
Figure 2: Global and Local Anomaly Detection
Figure 3: Decision Tree Based on Anomalies Identified by LoMST method
Work 2: Neighborhood Structure Assisted Non-negative Matrix Factorization and its Application in Unsupervised Point Anomaly Detection
(*This work Won the first place in IISE QCRE best student paper competition, 2019*)
Non-negative matrix factorization (NMF) is a popular and widely used method to accomplish the goal of clustering and anomaly detection from the resulting clusters. But NMF, together with its recent, enhanced version, like graph regularized NMF or symmetric NMF, do not have the provision to include the neighborhood structure information and, as a result, may fail to provide satisfactory performance in presence of nonlinear manifold structure. To address that shortcoming, we propose to consider and incorporate the neighborhood structural similarity information within the NMF framework by modeling the data through a minimum spanning tree. What motivates our choice is the understanding that in the presence of complicated data structure, a minimum spanning tree can approximate the intrinsic distance between two data points better than a simple Euclidean distance does, and consequently, it constitutes a more reasonable basis for differentiating anomalies from the normal class data. We label the resulting method as the neighborhood structure assisted NMF. By comparing the formulation and properties of the neighborhood structure assisted NMF with other versions of NMF including graph regularized NMF and symmetric NMF, it is apparent that the inclusion of the neighborhood structure information using minimum spanning tree makes a key difference. We further devise both offline and online algorithmic versions of the proposed method.
Figure 4: Proposed NS-NMF Formulation
Figure 5: Conceptual Idea of NS-NMF
Work 3: Graph Regularized Autoencoder and its Application in Anomaly Detection
(*This work Won the first place in INFORMS best student poster competition, 2019*)
Autoencoders are considered an unsupervised learning technique as they do not need labels for training, which makes them an ideal candidate for anomaly detection. If we use Autoencoder to reconstruct the data from their low dimensional representation, we can safely assume that anomalies will be subjected to higher reconstruction error compared to the regular observations as they do not belong to the regular data pattern or distribution. Autoencoder uses loss functions and optimize it to minimize the distance between the original data and the reconstruction. The choice of loss function often depends on the ultimate learning tasks. However, these loss functions do not have any provision to preserve the neighborhood structure in the reduced representation, which is the key piece in non-linear embedding based approaches.
So, Autoencoders need an additional embedding loss component to guarantee that data points maintain the structural similarity in low dimensional space. On the other hand, nonlinear embedding based approaches have their own limitations as they do not perform well in presence of high dimensional data and also not scalable to big data scenario. So, plain Autoencoder and non-linear embedding approaches can complement each other and they should be combined to overcome their drawbacks and achieve both the benefit of multi-layer deep learning framework and the neighborhood embedding. We devise a new graph regularizer based on MST and plant it inside the Autoencoder framework as an extra loss component in addition to the original reconstruction loss. We provide two alternative formulations for the graph regularizer. It will essentially guide the Autoencoder to preserve the original data structure when generating the latent features. These latent features in turn will help to detect the anomalies or rare events from the data.
We compare our graph regularized Autoencoder with the no regularizer scenario. To specially highlight the efficacy of MST, we also compare with the scenario when the regularizer is built on Euclidean distance. In both cases, our approach shows evident advantage. Also, to further demonstrate the superiority of our graph regularizer we incorporate it in two GAN based anomaly detection methods (Schlegl et al., 2017; Zenati et al., 2018). We find that adding the MST based graph regularizer significantly improves the detection capability.