Problem Definition: Given a set of sequences, where each sequence consists of a list elements and eaxh element consists a set of items, and given a user-specified min_support threshold, sequential pattern mining is to find all of the frequent subsequences, i.e., the subsequences whose frequency in the set of the sequences is no less than min_support.
Observation: Apriori property--any super-pattern of a non-frequent pattern cannot be frequent.
Solutions: Apriori-All, GSP, FreeSpqn and PrefixSpan
GSP: use Apriori property, consider contiguous subsequence (must not drop the whole element in the sequence except the the first and last element) Step 1: find frequent items Step 2: considering forming length-(k+1) candidate, 2 length-(k) frequent patterns can join to form a candidate if their subsequences are the same after dropping the first item in one sequence and the last item in the other sequence. Only candidates having infrequent contiguous length-(k) subsequence(s) are pruned if time constraints exist. Otherwise we can igore the term contiguous. Step 3: candidate check and repeat the procedure
FreeSpan: use database projection to narrow the search of relevant sequential patterns Step 1: find frequent items Step 2: let frequent item list be a,b,c,d,e in descending support order, partition the database such that the projected database for an item containing no items following it in the list. Step 3: form and count candidates in the projected database STep 4: repeat the procedure Optimation: after obtaining frequent items, form a triangular matrix and count for all combinations in the matrix to obtain length-2 pattern and check which items can occur in the projected databases. Then, project databases for length-2 patterns.
PrefixSpan: use database projection to narrow the search of relevant sequential patterns and prefix grow Step 1: find frequent items Step 2: use frequent items as prefixes and project databases Step 3: find items in projected database to extend the prefix and repeat Optimization: 1. use sequence-match matrix to find length-2 patterns to reduce no. of projected datbases. 2. 3-way Apriori checking: while projecting database, some items can be removed by checking if the item has any conflict in order with the prefix (use the S-matrix)
Comparisons GSP is Apriori-based and its fatal disadvantage is checking the whole database for every candidate which is unnessary. Both FreeSpan and PrefixSpan improve Apriori-based methods by only checking the relevant candidate in the projected databases. One disadvantage of FreeSpan is that it may not reduce the length of the data sequence while projecting. PrefixSpan improves FreeSpan by removing prefix which must reduce the data sequence length. Also, PrefixSpan enhances performance by reducing the combinations of items (i.e. no. of candidate in a projected database). One problem of PrefixSpan is that it cannot remove any frequent item in the postfix of a data sequence while projecting. Both FreeSpan and PrefixSpan can use bi-level projection to reduce the number of projected databases. Furthermore, PrefixSpan can use pointer and offset to reduce the number of projected databases if the database can reside in main memory.
Reference
1. R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements", Proc. of the Fifth Int'l Conf. on Extending Database Technology (EDBT), Avignon, France, March 1996. PDF format. Expanded version available as IBM Research Report RJ 9994, December 1995.
2. J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, `` FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining'', Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.
3. J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, `` PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth '', Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.