Papers I have read   My ideas   Survey1

Current Objective

I am looking for a research topic. The topic is probably related to data warehousing and data mining.


Progress

Nov, 2000
My research began.
I started reading papers about data mining and data warehousing. I read Knowledge Discovery in Databases: An Attribute-Oriented Approach(Han92), Efficient and Effective Clustering Method for Spatial Data Mining(Ng94), Discovery of Multiple-Level Association Rules from Large Databases(Han95), Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique(Cheung96), A Fast Distributed Algorithm for Mining Association Rules(Cheung96), Exploratory Mining and Pruning Optimizations of Constrained Associations Rules(Ng98), Mining Frequent Patterns without Candidate Generation(Han00), An Effective Hash-Based Algorithm for Mining Association Rules(Park95), CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets(Pei00), Mining Generalized Association Rules(Srikant95) and Selective Materialization: An Efficient Method for Spatial Data Cube Construction(Han98). I also read the book "Data mining: concepts and techniques".

Dec, 2000
I read Efficiently Mining Long Patterns from Databases(Bayardo98), Discovering Frequent Closed Itermsets for association Rules(Pasquier98), Implementing Data Cubes Efficiently(Hairnarayan96), Efficient Mining of Partial Periodic Patterns in Time Series Database(Han99), Constraint-Based Clustering in Large Databases(Tung01), PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Pattern Growth(Pei01), Generalization and Decision Tree Induction: Efficient Classification in Data Mining(Kamber97), Scalable Parallel Data Mining for Association Rules(EHHan97) and On the Computation of Multidimensional Aggregates(Agarwal96).

Jan, 2001
I read Constraint-Based Clustering in Large Databases(Tung01), Efficient Algorithms for Mining Outliers from Large Data Sets(Ramaswamy99), FreeSpan: Frequent Pattern-projected Sequential Pattern Mining(Han00), Mining Sequential patterns: Generalizations and Performance Improvements(Srikant96), Depth First Generation of Long Patterns(Agarwal00), An Efficient Algorithm for Mining Association Rules in Large Databases(Savasere95), Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set(Lin98), CURE: An efficient algorithm for clustering large databases(Guha98), BIRCH: An Efficient Data Clustering Method for Very Large Databases(Zhang96), ROCK: a robust clustering algorithm for categorical attributes(Guha99), A Database Interface for Clustering in Large Spatial Databases(Ester95), A Density-Based Algorithm for Discovering Clusters(Ester96), A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases(Xu98), Mining Access patterns Efficiently from Web Logs(Pei00), Mining Association Rules with Item Constraints(Srikant97), Efficient Mining of Constrained Correlated Sets(Grahne00), Can We Push More Constraints into Frequent Pattern Mining?(Pei00), Mining Frequent Itemsets with Convertible Constraints(Pei01).
12/1 (1st meeting with Dr. Cheung): I talked about PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Pattern Growth(Pei01). Dr. Cheung told me to read papers about mining with constraints and see if the techniques can be applied to long pattern mining.
19/1 (2nd meeting with Dr. Cheung): I talked about Mining Frequent Itemsets with Convertible Constraints(Pei01) and Exploratory Mining and Pruning Optimizations of Constrained Association Rules(Ng98). I introduced the input of the algorithm, anti-monotone constraint and monotone constraints. I also gave the definition of succinct constraint. I tried to explain the usage of succinct constraint but was not successful. Dr. Cheung told me that I need to convey a global picture of the algorithm to him instead of several key points. I need to define the input clearly and prepare some examples to illustrate the algorithm.

Feb, 2001
Courses begin.
I read Constraint-Based Rule Mining in Large, Dense Databases(Bayardo99).
2/2 (3rd meeting with Dr. Cheung): I talked about Exploratory Mining and Pruning Optimizations of Constrained Association Rules(Ng98). How to prove a succinct constraint must has a member generating functions and how to compute the member generating function for a succinct constraint were the problems we could not solve. Dr. Cheung told me to talk about Mining frequent Itemsets with Convertible Constraints(Pei01) in next meeting.
10/2 (4th meeting with Dr. Cheung): I talked about Mining frequent Itemsets with Convertible Constraints(Pei01) but did not mentioned the algorithm. 3 points were highlighted in the meeting. One is why not use the member generating function(MGF) as the definition of succinct constraint. Another point is why it is possible to convert succinct constraint into convertible constraint. The 3rd point is how the performance of this algorithm is compared with that of the previous algorithm. Dr. Cheung asked me to do a survey on frequent-pattern growth approach and constrained mining.
16/2 (5th meeting with Dr. Cheung): We recapped the anti-monotone and succinct constraint properties. Dr. Cheung asked what we could do if there were changes to the iteminfo table and/or the transaction table while the constraints are fixed. Dr. Cheung also mentioned that the predicates used in member generating function of succinct constraints should be worked out by the human. I pointed out the problems of Ng98. The problems are how to obtain the predicates, how to induce weak constraints and how to classify the constraints into four classes(combinations of anti-monotone and succinct). Next week, continue the survey on the papers I talked in the previous neetings and think about how to deal with the incremental updating problem.
23/2 (6th meeting with Dr. Cheung): Dr. Cheung asked me how I felt about the courses I were taking. I talked about how to mine frequent itemsets with constraints if the transaction table and the iteminfo table can be modified after doing the mining process. Dr. Cheung asked me to think of situations my approach did not work. Also, Dr. Cheung asked me to talk about the frequent pattern growth approach next time.

March, 2001
I read Rapid Association Rule Mining(Ng), Monitoring Change in Mining Results(Baron) and LGen -- A Lattice-Based Candidate Set Generation Algorithm for I/O Efficient Association Rule Mining(Yip99).
2/3 (7th meeting with Dr. Cheung): Dr. Cheung thought that the partition of items is a non-trivial observation. Dr. Cheung wants me to analyze the number of candidate itemsets considered in our approach and that in re-run of the approach in Ng98. Dr. Cheung questioned if changes in "iteminfo" table would affect the properties of the constraints. Dr. Cheung asked what sorts of constraints were dealt with in Ng98. Dr. Cheung said that we may also use top-down approach to generate new frequent itemsets and told me to read the paper (L-Gen). I suggested we may use database projection technique in this case. Our objective is to minimize the I/O scans and the number of candidates considered. Dr. Cheung asked me to think about 2 other questions. One is that if frequent-pattern growth is used when we mine valid and frequent itemsets at the first time, how can we do the update? Another is that can we do something at the first time to further facilitate our subsequent updates? In this meeting, we also discussed the number f candidates considered by Apriori and that by frequent-pattern growth. Moreover, we talked about the convertible constraints briefly.
9/3 (8th meeting with Dr. Cheung): I talked about the reduction of candidates considered if our approach is used. Also, I talked about what happens if pattern-growth method is used in the first mining. Moreover, I talked about the situation if all frequent itemsets are saved in the first mining. We found that our approach is good if the transaction table is changed greatly and the iteminfo table is changed relatively mildly. Dr. Cheung told me to start to write a paper for this problem and our approach.
24/3 (9th meeting with Dr. Cheung): I was frustrated by my performance in the meeting. We continued to discuss the update problem with constraints and we concentrated on anti-monotone constraints. Dr. Cheung suggested 2 way in order to make our original approach work. One is to do the mining in both bottom-up and top-down directions. First, count in the delta db to see if the old valid & frequent itemsets that are not affected by the updates in the iteminfo table. Then use the result and all modified items to form the upper bound of the valid & frequent after updates. My response was that we could do the same for other approaches. The other way suggested by Dr. Cheung was as follows. The first step is the same as the preceding way. The second step is to find all valid & frequent itemsets consisting of modified items only. The last step is to do self-join of the result of step 1 and a join between the results of step 1 and step 2. Dr. Cheung pointed out that if a candidate consists of unchanged items and modified items and the unchanged items are infrequent previously, the candidate must be frequent in delta db in order to be frequent in the new whole db. My response was that it worked.
30/3 (10th meeting with Dr. Cheung): I told Dr. Cheung that there was a serious problem in the motivation of the update with constraints problem. Then, Dr. Cheung suggested changing the parameters in the constraints and keeping iteminfo table unchanged. I found that one easy way is to find all frequent itemsets first and test those frequent itemsets against the constraints later. In this way, we only need to scan the database in the first time. Dr. Cheung told me to keep on think about the motivations of update with constraints problem.

April, 2001
I read Efficient Computation of Iceberg Cubes with Complex Measures(Han01), An Array-Based Algorithm for simultaneous Multidimensional Aggregate(Zhao97) and Bottom-Up Computation of Sparse and Iceberg CUBEs(Beyer99).
6/4 (11th meeting with Dr. Cheung): We talked about the update with constraints problem and the paper Efficient Computation of Iceberg Cubes with Complex Measures(Han01). Dr. Cheung told me to talk about the paper Bottom-Up computation of Sparse and Iceberg CUBEs next time.
20/4 (12th meeting with Dr. Cheung): I told Dr. Cheung my feeling towards PAKDD01. We talked about data cube as well as iceberg data cube. Dr. Cheung mentioned one of his paper on data cube. Dr. Cheung told me to continue to study the cube problems.
27/4 (13th meeting with Dr. Cheung): We talked about how to compute the iceberg data cube in the presence of attribute hierarchy so that the iceberg cube can support roll up and drill down operations.

May, 2001
I read Optimizing Selections over Datacubes(Ross00) and Fault-Tolerant Frequent Pattern Mining: Problems and Challenges(Pei01).
3/5 (14th meeting with Dr. Cheung): We talked about how to compute multi-level iceberg cubes. Dr. Cheung said that the cost of doing roll-up operation based on the base cube was reasonable. Hence, we shifted our target to optimize the storage space and the construction cost. We thought of a way to compress the multi-level iceberg cubes. We would incorporate attribute hierarchies into the lattice for different group-bys and use the bottom-up approach to traverse the lattice.
11/5 (15th meeting with Dr. Cheung): We continued to talk about how to make iceberg cube support hierarchy roll-up operations. Dr. Cheung suggested to expand the bottom-level iceberg cube. The expanded botton-level iceberg cube keeps some un-interesting cells such that it have enough information to support hierarchocal roll-up. Dr. Cheung also suggested to use the average(or min/max) of all the un-interesting cells to represent all theose cells. I suggested that we may distribute the values of un-interesting cells to neighbouring interesting cells. The problem is how we can approximate the results of roll-up more accurately.
18/5 (16th meeting with Dr. Cheung): I talked about selective materialization of cuboids in data cube and approximate query processing. We discussed the merits of full approximation and iceberg plus approximation(e.g. wavelet). Dr. Cheung argued that the later approach would probably produce more accurate answers. He suggested me to study various approximation methods and solve the problem before think of anything else. Dr. Cheung also mentioned k-d tree and dense region OLAP.
25/5 (17th meeting with Dr. Cheung): We talked about how to do roll-up given the base data cube. Dr. Cheung said that people would select some cuboids to materialize. The goal of Ullman's papers is to optimize the time needed to compute the full cube in real time.

June, 2001
1/6 (18th meeting with Dr. Cheung): We talked about the weakness of the method I proposed in the previous meeting. Dr. Cheund asked me to think about how to approximate the small cells.
8/6 (19th meeting with Dr. Cheung): Dr. Cheung said that my arguments are quite confusing sometimes. We discussed about what we should approximate (the base cuboid, the un-interesting cells in the base cuboids or the upper cuboids ). I mentioned my observations (differences between the 3 approaches and the relationship between the constraints and approximation). Finally, I asked Dr. Cheung what data cube appproach he wants to adopt. He told me that we should only focus on how to support roll-up operations and leave the selection of cuboids to materialize to the users.
15/6 (20th meeting with Dr. Cheung): I talked about lots about histogram. At the end, we changed our problem definition as "How to contruct a complete base cube with approximate cuboids?".
22/6 (21st meeting with Dr. Cheung): I mentioned the simple method to solve the problem brought out in the previous meeting. The method is to first compute all the cells in the cuboid lattice and then compress the light cells. I pointed out that we may not need to compute the actual values in the light cells. We may increase the efficiency at the cost of accuracy. I said that we may assume that only the actual values of the light cells in the base cuboid are needed to compute. Then we can use approximated light cells in the base cuboid to compute the light cells in other cuboids. (actually, this cannot increase efficiency.) Another approach I mentioned is to use bottom-up approach but use sampling to approximate the light along the way. I also metioned using average. Dr. Cheung pointed out that we may use wavelet to compute all the approximate all the light cells together (use compressed light cells to compute other compressed light cells).
29/6 (22nd meeting with Dr. Cheung) We went through our ideas once. Especially, we discussed the wavelet-based compression technique. Dr. Cheung suggested thinking about how to use histogram effectively.