The sunshine of a young mind.: 2009

Time series data can be represented as a bag of segments like bag of words representation for text data.

· What is bag of words approach?

The BoW in NLP is a popular method for representing documents, which ignores the word orders. For example, "a good book" and "book good a" are the same under this model. The BoW representation serves as the basic element for further processing, such as object categorization.

· There are two kinds of similarities:

o Shape-based similarity:

§ Determines the similarity of two datasets by comparing their local patterns.

§ Work well with short time series.

o Structure-based similarity: determines the similarity by comparing their global structures.

· Why we need Structure-based similarity approach?

Because most existing approaches focus on finding shape-based similarity that work well with short time series and fail to produce satisfactory results with long sequences.

o Example for textual data: To compare two strings, we can use the string edit distance to compute their similarity. However, if we want to compare two documents, we use a higher-level representation that can capture the structure or semantic of the document

· Time Series: A time series T = t1,…,tp is an ordered set of p real-valued variables.

· Challenges of algorithm:

o The definition and construction of the patterns “vocabulary.”

o Time series data are composed of consecutive data points

· Algorithm:

o Step1: Find Pattern for each time series using PAA.

§ Piecewise Aggregate Approximation (PAA) transform a time series representation into a user defined number of equal segments which is the mean of the data points in that segment. The PAA values are then transformed into symbols by using a breakpoint table.

o Step2: Construct the Bag Of Patterns (BOP) matrix given 2 parameters (α:Size of alphabet ;ω: Size of patterns produced).

BOP Matrix (Time Series, Pattern) and Mij is the frequency of pattern i in sequence j.

	1	2	3	.	.
aaa	10	2	3	.	.
aab	0	0	8	.	.
aac		.	.	.	.
.	.	.	.	.	.
..	.	.	.	.	.

· Empirical Evaluation:

1. Clustering

o Hierarchical Clustering: Compare BOP with CDM, Euclidean distance on raw time series, Dynamic Time Warping on raw time series, and Euclidean distance on DFT coefficients.

§ Result of experiment on a small dataset: BOP can find clusters while shape-based approaches cannot, with only six datasets.

§ Result of experiment on a large dataset: On 13 pairs of Dataset

· DFT and Euclidean distance: only 3 pairs are successfully clustered

· DTW: successfully clusters 10 pairs of dataset. However, its prohibitive time complexity makes it an unrealistic choice of distance measure for large datasets.

· CDM and BOP: All 13 pairs are successfully clustered.

o Advantages of BOP over CDM

1. BOP shows distribution of patterns from histograms, to help understanding the underlying structures of the data.

2. BOP build the final representation from subsequences, so it is suitable for streaming data

o Partitional Clustering: Performing k-means using the Euclidean distance on the raw data, and on our bag-of-patterns representation. Using an evaluation method compares the similarity between two sets of cluster labels, and returns a number between 0 and 1 denoting how similar they are. Their approach achieves the best clustering quality (0.7133 vs. 0.4644).

2. Classification: Using nearest neighbor classification algorithm. Through studying the accuracy as the ratio of count the number of correctly classified objects and the total number of objects and result was 0.996 means that there is only 1 misclassified object, out of 250 objects.

Source:Finding Structural Similarity in Time Series Data Using Bag-of-Patterns Representation

Lin, J. and Li, Y. 2009. Finding Structural Similarity in Time Series Data Using Bag-of-Patterns Representation. In Proceedings of the 21st international Conference on Scientific and Statistical Database Management (New Orleans, LA, USA, June 02 - 04, 2009). M. Winslett, Ed. Lecture Notes In Computer Science, vol. 5566. Springer-Verlag, Berlin, Heidelberg, 461-477. DOI= http://dx.doi.org/10.1007/978-3-642-02279-1_33.

The sunshine of a young mind.

Favorite quote:

Saturday 17 October 2009

Finding Structural Similarity in Time Series Data Using Bag-of-Patterns Representation.

Thursday 8 October 2009

My First Blog Post.

About Me

Blog Archive

Followers