Favorite quote:

"If we knew what it was we were doing, it would not be called research, would it?" –Albert Einstein

Saturday, 17 October 2009

Finding Structural Similarity in Time Series Data Using Bag-of-Patterns Representation.

Time series data can be represented as a bag of segments like  bag of words representation for text data.

· What is bag of words approach?

Wikipedia

The BoW in NLP is a popular method for representing documents, which ignores the word orders. For example, "a good book" and "book good a" are the same under this model. The BoW representation serves as the basic element for further processing, such as object categorization.

· There are two kinds of similarities:

o Shape-based similarity:

§ Determines the similarity of two datasets by comparing their local patterns.

§ Work well with short time series.

o Structure-based similarity: determines the similarity by comparing their global structures.

· Why we need Structure-based similarity approach?

Because most existing approaches focus on finding shape-based similarity that work well with short time series and fail to produce satisfactory results with long sequences.

o Example for textual data: To compare two strings, we can use the string edit distance to compute their similarity. However, if we want to compare two documents, we use a higher-level representation that can capture the structure or semantic of the document

· Time Series: A time series T = t1,…,tp is an ordered set of p real-valued variables.

· Challenges of algorithm:

o The definition and construction of the patterns “vocabulary.”

o Time series data are composed of consecutive data points

· Algorithm:

o Step1: Find Pattern for each time series using PAA.

§ Piecewise Aggregate Approximation (PAA) transform a time series representation into a user defined number of equal segments which is the mean of the data points in that segment. The PAA values are then transformed into symbols by using a breakpoint table.

o Step2: Construct the Bag Of Patterns (BOP) matrix given 2 parameters (α:Size of alphabet ;ω: Size of patterns produced).

BOP Matrix (Time Series, Pattern) and Mij is the frequency of pattern i in sequence j.

 

1

2

3

.

.

aaa

10

2

3

.

.

aab

0

0

8

.

.

aac

 

.

.

.

.

.

.

.

.

.

.

..

.

.

.

.

.

· Empirical Evaluation:

1. Clustering

o Hierarchical Clustering: Compare BOP with CDM, Euclidean distance on raw time series, Dynamic Time Warping on raw time series, and Euclidean distance on DFT coefficients.

§ Result of experiment on a small dataset: BOP can find clusters while shape-based approaches cannot, with only six datasets.

§ Result of experiment on a large dataset: On 13 pairs of Dataset

· DFT and Euclidean distance: only 3 pairs are successfully clustered

· DTW: successfully clusters 10 pairs of dataset. However, its prohibitive time complexity makes it an unrealistic choice of distance measure for large datasets.

· CDM and BOP: All 13 pairs are successfully clustered.

o Advantages of BOP over CDM

1. BOP shows distribution of patterns from histograms, to help understanding the underlying structures of the data.

2. BOP build the final representation from subsequences, so it is suitable for streaming data

o Partitional Clustering: Performing k-means using the Euclidean distance on the raw data, and on our bag-of-patterns representation. Using an evaluation method compares the similarity between two sets of cluster labels, and returns a number between 0 and 1 denoting how similar they are. Their approach achieves the best clustering quality (0.7133 vs. 0.4644).

2. Classification: Using nearest neighbor classification algorithm. Through studying the accuracy as the ratio of count the number of correctly classified objects and the total number of objects and result was 0.996 means that there is only 1 misclassified object, out of 250 objects.

Source:Finding Structural Similarity in Time Series Data Using Bag-of-Patterns Representation
Lin, J. and Li, Y. 2009. Finding Structural Similarity in Time Series Data Using Bag-of-Patterns Representation. In Proceedings of the 21st international Conference on Scientific and Statistical Database Management (New Orleans, LA, USA, June 02 - 04, 2009). M. Winslett, Ed. Lecture Notes In Computer Science, vol. 5566. Springer-Verlag, Berlin, Heidelberg, 461-477. DOI= http://dx.doi.org/10.1007/978-3-642-02279-1_33.
 

Thursday, 8 October 2009

My First Blog Post.

Welcome to my first blog post. I have been planning to start blogging for a long time, but I postponed it because I'm not really very good about this whole blogging things. Nevertheless, today I decided to start my own blog to share knowledge with you, specially research papers that I read. I will always try to write up on things that are of some interest to you. Meanwhile, congratulate me for making my first blog post. Finally I want to say “THANKS” to GOD then to all people around me who always support and encourage me. In addition to special “THANKS” to my friend and teammate “Ebeid”, he has helped me to choose my blog name. If you can’t choose your blog name ask him :)