I am looking for a research topic. The topic is probably related to data warehousing and data mining.
Nov, 2000
My research began.
I started reading papers about data mining and data warehousing. I read Knowledge Discovery in Databases: An Attribute-Oriented Approach(Han92), Efficient and Effective Clustering Method for Spatial Data Mining(Ng94), Discovery of Multiple-Level Association Rules from Large Databases(Han95), Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique(Cheung96), A Fast Distributed Algorithm for Mining Association Rules(Cheung96), Exploratory Mining and Pruning Optimizations of Constrained Associations Rules(Ng98), Mining Frequent Patterns without Candidate Generation(Han00), An Effective Hash-Based Algorithm for Mining Association Rules(Park95), CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets(Pei00), Mining Generalized Association Rules(Srikant95) and Selective Materialization: An Efficient Method for Spatial Data Cube Construction(Han98). I also read the book "Data mining: concepts and techniques".
Dec, 2000
I read Efficiently Mining Long Patterns from Databases(Bayardo98), Discovering Frequent Closed Itermsets for association Rules(Pasquier98), Implementing Data Cubes Efficiently(Hairnarayan96), Efficient Mining of Partial Periodic Patterns in Time Series Database(Han99), Constraint-Based Clustering in Large Databases(Tung01), PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Pattern Growth(Pei01), Generalization and Decision Tree Induction: Efficient Classification in Data Mining(Kamber97), Scalable Parallel Data Mining for Association Rules(EHHan97) and On the Computation of Multidimensional Aggregates(Agarwal96).
Jan, 2001
I read Constraint-Based Clustering in Large Databases(Tung01), Efficient
Algorithms for Mining Outliers from Large Data Sets(Ramaswamy99),
FreeSpan: Frequent Pattern-projected Sequential Pattern Mining(Han00),
Mining Sequential patterns: Generalizations and Performance
Improvements(Srikant96), Depth First Generation of Long
Patterns(Agarwal00), An Efficient Algorithm for Mining Association Rules
in Large Databases(Savasere95), Pincer-Search: A New Algorithm for
Discovering the Maximum Frequent Set(Lin98), CURE: An efficient algorithm
for clustering large databases(Guha98), BIRCH: An Efficient Data
Clustering Method for Very Large Databases(Zhang96), ROCK: a robust
clustering algorithm for categorical attributes(Guha99), A Database
Interface for Clustering in Large Spatial Databases(Ester95), A
Density-Based Algorithm for Discovering Clusters(Ester96), A
Distribution-Based Clustering Algorithm for Mining in Large Spatial
Databases(Xu98), Mining Access patterns Efficiently from Web Logs(Pei00),
Mining Association Rules with Item Constraints(Srikant97), Efficient
Mining of Constrained Correlated Sets(Grahne00), Can We Push More
Constraints into Frequent Pattern Mining?(Pei00), Mining Frequent Itemsets
with Convertible Constraints(Pei01).
12/1 (1st meeting with Dr. Cheung): I talked about PrefixSpan: Mining
Sequential Patterns Efficiently by Prefix-Projected Pattern Pattern
Growth(Pei01). Dr. Cheung told me to read papers about mining with
constraints and see if the techniques can be applied to long pattern
mining.
19/1 (2nd meeting with Dr. Cheung): I talked about Mining Frequent
Itemsets with Convertible Constraints(Pei01) and Exploratory Mining and
Pruning Optimizations of Constrained Association Rules(Ng98). I introduced
the input of the algorithm, anti-monotone constraint and monotone
constraints. I also gave the definition of succinct constraint. I
tried to explain the usage of succinct constraint but was not
successful. Dr. Cheung told me that I need to convey a global picture of
the algorithm to him instead of several key points. I need to
define the input clearly and prepare some examples to illustrate the
algorithm.
Feb, 2001
Courses begin.
I read Constraint-Based Rule Mining in Large, Dense
Databases(Bayardo99).
2/2 (3rd meeting with Dr. Cheung): I talked about Exploratory
Mining and Pruning Optimizations of Constrained Association Rules(Ng98).
How to prove a succinct constraint must has a member generating
functions and how to compute the member generating function for a succinct
constraint were the problems we could not solve. Dr. Cheung told me to
talk about Mining frequent Itemsets with Convertible Constraints(Pei01) in
next meeting.
10/2 (4th meeting with Dr. Cheung): I talked about Mining frequent
Itemsets with Convertible Constraints(Pei01) but did not mentioned the
algorithm. 3 points were highlighted in the meeting. One is why not
use the member generating function(MGF) as the definition of succinct
constraint. Another point is why it is possible to convert succinct
constraint into convertible constraint. The 3rd point is how the
performance of this algorithm is compared with that of the previous
algorithm. Dr. Cheung asked me to do a survey on frequent-pattern growth
approach and constrained mining.
16/2 (5th meeting with Dr. Cheung): We recapped the anti-monotone
and succinct constraint properties. Dr. Cheung asked what we could do if
there were changes to the iteminfo table and/or the transaction table
while the constraints are fixed. Dr. Cheung also mentioned that the
predicates used in member generating function of succinct constraints
should be worked out by the human. I pointed out the problems of Ng98. The
problems are how to obtain the predicates, how to induce weak constraints
and how to classify the constraints into four classes(combinations of
anti-monotone and succinct). Next week, continue the survey on the papers
I talked in the previous neetings and think about how to deal with the
incremental updating problem.
23/2 (6th meeting with Dr. Cheung): Dr. Cheung asked me how I felt
about the courses I were taking. I talked about how to mine frequent
itemsets with constraints if the transaction table and the iteminfo table
can be modified after doing the mining process. Dr. Cheung asked me to
think of situations my approach did not work. Also, Dr. Cheung asked me to
talk about the frequent pattern growth approach next time.
March, 2001
I read Rapid Association Rule Mining(Ng), Monitoring Change in Mining
Results(Baron) and LGen -- A Lattice-Based Candidate Set Generation
Algorithm for I/O Efficient Association Rule Mining(Yip99).
2/3 (7th meeting with Dr. Cheung): Dr. Cheung thought that the
partition of items is a non-trivial observation. Dr. Cheung wants me to
analyze the number of candidate itemsets considered in our approach and
that in re-run of the approach in Ng98. Dr. Cheung questioned if changes
in "iteminfo" table would affect the properties of the constraints. Dr.
Cheung asked what sorts of constraints were dealt with in Ng98. Dr. Cheung
said that we may also use top-down approach to generate new frequent
itemsets and told me to read the paper (L-Gen). I suggested we may use
database projection technique in this case. Our objective is to
minimize the I/O scans and the number of candidates considered. Dr. Cheung
asked me to think about 2 other questions. One is that if frequent-pattern
growth is used when we mine valid and frequent itemsets at the first time,
how can we do the update? Another is that can we do something at the first
time to further facilitate our subsequent updates? In this meeting, we
also discussed the number f candidates considered by Apriori and that by
frequent-pattern growth. Moreover, we talked about the convertible
constraints briefly.
9/3 (8th meeting with Dr. Cheung): I talked about the reduction of
candidates considered if our approach is used. Also, I talked about what
happens if pattern-growth method is used in the first mining. Moreover, I
talked about the situation if all frequent itemsets are saved in the first
mining. We found that our approach is good if the transaction table is
changed greatly and the iteminfo table is changed relatively mildly. Dr.
Cheung told me to start to write a paper for this problem and our
approach.
24/3 (9th meeting with Dr. Cheung): I was frustrated by my
performance in the meeting. We continued to discuss the update problem
with constraints and we concentrated on anti-monotone constraints. Dr.
Cheung suggested 2 way in order to make our original approach work. One is
to do the mining in both bottom-up and top-down directions. First, count
in the delta db to see if the old valid & frequent itemsets that are not
affected by the updates in the iteminfo table. Then use the result and all
modified items to form the upper bound of the valid & frequent after
updates. My response was that we could do the same for other approaches.
The other way suggested by Dr. Cheung was as follows. The first step is
the same as the preceding way. The second step is to find all valid &
frequent itemsets consisting of modified items only. The last step is to
do self-join of the result of step 1 and a join between the results of
step 1 and step 2. Dr. Cheung pointed out that if a candidate consists of
unchanged items and modified items and the unchanged items are infrequent
previously, the candidate must be frequent in delta db in order to be
frequent in the new whole db. My response was that it worked.
30/3 (10th meeting with Dr. Cheung): I told Dr. Cheung that there
was a serious problem in the motivation of the update with constraints
problem. Then, Dr. Cheung suggested changing the parameters in the
constraints and keeping iteminfo table unchanged. I found that one easy
way is to find all frequent itemsets first and test those frequent
itemsets against the constraints later. In this way, we only need to scan
the database in the first time. Dr. Cheung told me to keep on think about
the motivations of update with constraints problem.
April, 2001
I read Efficient Computation of Iceberg Cubes with Complex
Measures(Han01), An Array-Based Algorithm for simultaneous
Multidimensional Aggregate(Zhao97) and Bottom-Up Computation of Sparse and
Iceberg CUBEs(Beyer99).
6/4 (11th meeting with Dr. Cheung): We talked about the update with
constraints problem and the paper Efficient Computation of Iceberg Cubes
with Complex Measures(Han01). Dr. Cheung told me to talk about the paper
Bottom-Up computation of Sparse and Iceberg CUBEs next time.
20/4 (12th meeting with Dr. Cheung): I told Dr. Cheung my feeling
towards PAKDD01. We talked about data cube as well as iceberg data
cube. Dr. Cheung mentioned one of his paper on data cube. Dr. Cheung told
me to continue to study the cube problems.
27/4 (13th meeting with Dr. Cheung): We talked about how to
compute the iceberg data cube in the presence of attribute hierarchy so
that the iceberg cube can support roll up and drill down operations.
May, 2001
I read Optimizing Selections over Datacubes(Ross00) and Fault-Tolerant
Frequent Pattern Mining: Problems and Challenges(Pei01).
3/5 (14th meeting with Dr. Cheung): We talked about how to
compute multi-level iceberg cubes. Dr. Cheung said that the cost of doing
roll-up operation based on the base cube was reasonable. Hence, we
shifted our target to optimize the storage space and the construction
cost. We thought of a way to compress the multi-level iceberg cubes.
We would incorporate attribute hierarchies into the lattice for different
group-bys and use the bottom-up approach to traverse the lattice.
11/5 (15th meeting with Dr. Cheung): We continued to talk about
how to make iceberg cube support hierarchy roll-up operations. Dr. Cheung
suggested to expand the bottom-level iceberg cube. The expanded
botton-level iceberg cube keeps some un-interesting cells such that it
have enough information to support hierarchocal roll-up. Dr. Cheung also
suggested to use the average(or min/max) of all the un-interesting cells
to represent all theose cells. I suggested that we may distribute the
values of un-interesting cells to neighbouring interesting cells. The
problem is how we can approximate the results of roll-up more
accurately.
18/5 (16th meeting with Dr. Cheung): I talked about selective
materialization of cuboids in data cube and approximate query processing.
We discussed the merits of full approximation and iceberg plus
approximation(e.g. wavelet). Dr. Cheung argued that the later approach
would probably produce more accurate answers. He suggested me to study
various approximation methods and solve the problem before think of
anything else. Dr. Cheung also mentioned k-d tree and dense region
OLAP.
25/5 (17th meeting with Dr. Cheung): We talked about how to do
roll-up given the base data cube. Dr. Cheung said that people would
select some cuboids to materialize. The goal of Ullman's papers is to
optimize the time needed to compute the full cube in real time.
June, 2001
1/6 (18th meeting with Dr. Cheung): We talked about the
weakness of the method I proposed in the previous meeting. Dr.
Cheund asked me to think about how to approximate the small cells.
8/6 (19th meeting with Dr. Cheung): Dr. Cheung said that my
arguments are quite confusing sometimes. We discussed about what we should
approximate (the base cuboid, the un-interesting cells in the base
cuboids or the upper cuboids ). I mentioned my observations (differences
between the 3 approaches and the relationship between the constraints
and approximation). Finally, I asked Dr. Cheung what data cube appproach
he wants to adopt. He told me that we should only focus on how to support
roll-up operations and leave the selection of cuboids to materialize to
the users.
15/6 (20th meeting with Dr. Cheung): I talked about lots about
histogram. At the end, we changed our problem definition as "How to
contruct a complete base cube with approximate cuboids?".
22/6 (21st meeting with Dr. Cheung): I mentioned the simple method
to solve the problem brought out in the previous meeting. The method is to
first compute all the cells in the cuboid lattice and then compress the
light cells. I pointed out that we may not need to compute the actual
values in the light cells. We may increase the efficiency at the cost of
accuracy. I said that we may assume that only the actual values of the
light cells in the base cuboid are needed to compute. Then we can use
approximated light cells in the base cuboid to compute the light cells in
other cuboids. (actually, this cannot increase efficiency.) Another
approach I mentioned is to use bottom-up approach but use sampling to
approximate the light along the way. I also metioned using average. Dr.
Cheung pointed out that we may use wavelet to compute all the
approximate all the light cells together (use compressed light cells to
compute other compressed light cells).
29/6 (22nd meeting with Dr. Cheung) We went through our ideas
once. Especially, we discussed the wavelet-based compression technique.
Dr. Cheung suggested thinking about how to use histogram effectively.