
CarenDF - depth-first expansion version of Caren

		and

CarenClass - CarenDF for classification




README2 - January 28, 2005


This is a novel implementation of the Caren package.
The main novelty lies in the new frequent itemset computation algorithm.

The main idea is to drop the breadth-first (bottom-up) approach of the Apriori algorithm
and move to a depth-first construction of the itemset trie.
Counting is now performed by representing the cover list of each frequent item as a bitmap.
Itemsets are extracted and counted through itemset expansion (adding a new item to the itemset being explored).
Counting is performed through bitwise operations on the bitmaps.
The algorithm makes two scans of the dataset. The first finds the frequent items and counts the number of transactions
in the dataset. In the second scan, 2-itemsets are counted and the bitmaps representing the frequent item covers are built.
The 2-itemset counts are useful to restrain itemset expansion (since, in a depth-first setting, the counts of an itemset's subsets are not available for pruning).
Counts for 2-itemsets are stored in a flattened matrix (an array of integers).
In the subsequent steps the depth-first itemset expansion is performed.
An itemset is counted by bit counting over its cover bitmap; the bit-counting algorithm is the standard precomputed 16-bit table method.
The order of the frequent items remains crucial for the performance of the algorithm.
The algorithm preserves the ascending support order of the earlier Caren versions.
This ordering drives the expansion process.
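The expansion-and-count loop described above can be sketched as follows. This is an illustrative sketch in modern Java on a toy dataset, not the actual carendf source: the class and method names (DfSketch, expand, support, POP16) are hypothetical, and the item covers are assumed to be already built from the second scan.

```java
import java.util.Arrays;

// Illustrative sketch of depth-first itemset expansion over bitmap covers.
public class DfSketch {
    // Precomputed 16-bit popcount table, as used for itemset counting.
    static final int[] POP16 = new int[1 << 16];
    static {
        for (int i = 1; i < POP16.length; i++) POP16[i] = POP16[i >> 1] + (i & 1);
    }

    // Support of an itemset = number of set bits in its cover bitmap,
    // counted 16 bits at a time via the lookup table.
    static int support(int[] cover) {
        int s = 0;
        for (int w : cover) s += POP16[w & 0xFFFF] + POP16[(w >>> 16) & 0xFFFF];
        return s;
    }

    // Cover of an extended itemset = bitwise AND of the current cover
    // with the cover of the item being added.
    static int[] intersect(int[] a, int[] b) {
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) r[i] = a[i] & b[i];
        return r;
    }

    // Depth-first expansion: extend the current itemset with each later item
    // (items are kept in ascending support order), pruning by minimum support.
    static void expand(int[] itemset, int[] cover, int[][] itemCovers,
                       int next, int minsup) {
        for (int i = next; i < itemCovers.length; i++) {
            int[] c = intersect(cover, itemCovers[i]);
            int sup = support(c);
            if (sup < minsup) continue;               // infrequent: do not expand
            int[] ext = Arrays.copyOf(itemset, itemset.length + 1);
            ext[itemset.length] = i;
            System.out.println(Arrays.toString(ext) + " sup=" + sup);
            expand(ext, c, itemCovers, i + 1, minsup);  // recurse depth-first
        }
    }

    public static void main(String[] args) {
        // Toy vertical dataset: 3 frequent items, covers over 5 transactions
        // (bits 0..4 of a single 32-bit word).
        int[][] covers = { {0b11011}, {0b11110}, {0b01110} };
        for (int i = 0; i < covers.length; i++)
            expand(new int[]{ i }, covers[i], covers, i + 1, 2);
    }
}
```

Note that at any moment the recursion holds one intersection bitmap per level, which is why the number of extra bitmaps is bounded by the size of the largest itemset.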

The new implementation drastically reduces memory consumption compared with the former Caren system.
We estimate an 80% performance improvement over the former implementation.


The algorithm bears some resemblance to the ECLAT algorithm [Zaki 2000, IEEE Transactions on Knowledge and Data Engineering],
since it is also depth-first and uses a vertical representation of the dataset.
However, only the frequent items have a bitmap representation of their cover.
During execution, apart from the list of bitmaps for the frequent items, the algorithm holds at most
as many additional bitmaps as the size of the largest itemset.


To check the new features, try the command line

> java carendf -help


There are two new programs: 

	carendf,  
and 
	carenclass.




Please note:

* Due to the depth-first expansion and the ordering of the frequent items
(and the fact that the chi^2 metric is not downward closed),
the -X2 switch in this Caren version does not give the same results (itemsets or rules) as version 1.7.
A breadth-first approach filters more itemsets, since for each N-itemset it has access to the (N-1)-subsets.


* Theta-improvement. A new algorithm applies the improvement filter to derived rules.
Improvement is now applied as rules are derived (not as post-processing, as implemented in caren1.7).
This approach gives results different from caren1.7's. The differences can be illustrated with the following example:

a <-         conf = 0.75
a <- b       conf = 0.8
a <- b & c   conf = 0.85

Assume that rules are derived in this order. With minimp = 0.1, caren1.7 would give the following result:

a <-         conf = 0.75

However carendf gives the result:

a <-         conf = 0.75
a <- b & c   conf = 0.85

The reason for this output is that carendf applies improvement during the derivation of rules.
The third rule would be eliminated by the presence of the second rule; however, the second rule is derived first and is
eliminated by the first rule. When the third rule is derived, there is no kept rule that witnesses a lack of improvement in the
interest metric (confidence).
In light of this characteristic we rename the filter theta-improvement.
The new theta-improvement implementation is also faster than the original improvement.
Note that the procedure for this filter is sound and complete.
However, the result is uniquely determined by the ordering of the frequent sets;
that is, different orders give different results.
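The derivation-order behaviour can be reproduced with a small sketch. The code below is illustrative only (the class, record, and method names are hypothetical, not carendf internals): a newly derived rule is kept only if its confidence improves, by at least minimp, on every already-kept rule whose antecedent is a subset of the new rule's antecedent.

```java
import java.util.*;

// Sketch of theta-improvement applied during rule derivation.
public class ThetaImprovement {
    record Rule(Set<String> antecedent, double conf) {}

    static List<Rule> filter(List<Rule> derived, double minImp) {
        List<Rule> kept = new ArrayList<>();
        for (Rule r : derived) {
            boolean improves = true;
            for (Rule k : kept)
                // Compare only against kept rules whose antecedent is a
                // subset of r's; the epsilon guards floating-point rounding.
                if (r.antecedent().containsAll(k.antecedent())
                        && r.conf() - k.conf() < minImp - 1e-9) {
                    improves = false;
                    break;
                }
            if (improves) kept.add(r);
        }
        return kept;
    }

    public static void main(String[] args) {
        // The README example: rules for consequent 'a', in derivation order.
        List<Rule> derived = List.of(
            new Rule(Set.of(), 0.75),            // a <-
            new Rule(Set.of("b"), 0.80),         // a <- b
            new Rule(Set.of("b", "c"), 0.85));   // a <- b & c
        // Keeps the empty-antecedent rule and a <- b & c; a <- b is dropped
        // because 0.80 - 0.75 < 0.1.
        for (Rule r : filter(derived, 0.1))
            System.out.println("a <- " + new TreeSet<>(r.antecedent())
                               + " conf=" + r.conf());
    }
}
```

Processing the rules in a different order would change which rules survive, which is exactly the order dependence noted above.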


* There are now constraints on the minimal and maximal number of items (including the head of the rule) contained in a rule
  (switches -rs and -RS).

* A new switch (-A) filters out rules whose antecedent is not covered by the user-specified list of attributes.
The filtering occurs during itemset expansion.
NOTE: switches of type 'a' and 'A' cannot be combined!

* A module $convert$ for pre-processing numeric attributes is included with Caren.
It contains several discretization algorithms and features. Try

> java convert -help




There are two new programs: 

* Carendf, which includes the new features described above.

* Carenclass, a Caren module for classification, designed to interact optimally with the $predict$ module.
It is a more efficient version of $carendf$ but always requires the user to specify the consequent.
Here, unless required, the trie of itemsets is not materialized. Moreover, rules are derived on the fly during the
itemset expansion process, which results in a much faster association rule generator.
This is achieved by reordering the frequent items so that the consequent items are allocated
to the last positions.
Again, the -X2 switch gives results different from $carendf$ (due to the new reordering).
Also due to the new reordering, for this module theta-improvement and standard improvement give the same results.
Note that if more than one consequent is defined then this module is potentially incomplete.
That is, a rule with an item/attribute in the antecedent that also occurs in the user-defined consequent list
is potentially not generated.






README2 - February, 2005

Carenclass now generates subgroup rules. These are useful for analysing subgroup behaviour in relation to a predefined
attribute: the rules help characterize the distribution of attribute values within a specific population (subgroup).
The rules are of the form:

	{ value_1/#, value_2/#, ..., value_n/# } <-- items defining subgroup.

Here's an example for dataset 'test' using attribute CC:

	{ 2/1,34/2,43/1,8/1,9/1 }    <--    M3 in [-2.0 : 3.0[  &  ORIGEM=olga

If a numeric attribute is predefined then one can apply the Caren discretization methods $cin$ and $Srik$,
but not $bin$, to this attribute.
Here's another example for the same dataset, applying $cin$ to attribute CC:

	{ [0.0 : 5.0[/1,[9.0 : 43.0[/3 }    <--    M3 in [-2.0 : 3.0[  &  ORIGEM=olga

A rule expressing the a priori distribution is always generated. This rule has an empty antecedent.
The defined minsup is now meant as a filter on antecedent support. Interest metrics for these rules
should be ignored. Notice that attribute values in the consequent are listed in lexicographic order.
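The consequent of such a rule is just the distribution of the target attribute's values over the transactions covered by the antecedent. A minimal sketch (illustrative names and toy data, not the carenclass implementation):

```java
import java.util.*;

// Sketch of computing a subgroup distribution rule's consequent:
// count target-attribute values over transactions matching the antecedent.
public class SubgroupSketch {
    static SortedMap<String, Integer> distribution(
            List<Map<String, String>> data,
            Map<String, String> antecedent, String target) {
        // TreeMap keeps the value/# pairs in lexicographic order of the values.
        SortedMap<String, Integer> dist = new TreeMap<>();
        for (Map<String, String> tx : data) {
            // Transaction must satisfy every attribute=value pair of the subgroup.
            if (!tx.entrySet().containsAll(antecedent.entrySet())) continue;
            String v = tx.get(target);
            if (v != null) dist.merge(v, 1, Integer::sum);
        }
        return dist;
    }

    public static void main(String[] args) {
        // Toy transactions (hypothetical data, loosely modelled on the example).
        List<Map<String, String>> data = List.of(
            Map.of("ORIGEM", "olga", "CC", "34"),
            Map.of("ORIGEM", "olga", "CC", "34"),
            Map.of("ORIGEM", "olga", "CC", "2"),
            Map.of("ORIGEM", "rui",  "CC", "43"));
        // Produces a consequent of the form { 2/1, 34/2 } for ORIGEM=olga.
        System.out.println(distribution(data, Map.of("ORIGEM", "olga"), "CC")
                           + "    <--    ORIGEM=olga");
    }
}
```

Passing an empty antecedent map yields the a priori distribution rule with the empty antecedent mentioned above.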


Use switch -G to define the attribute for which subgroup rules are generated.
This switch only applies in -Att mode.







If you run into memory problems, try increasing the heap size allocated by the Java interpreter
(check java -X, and include the appropriate option when invoking the Java virtual machine with carendf or carenclass).


Notice: the implementation requires Java 1.4 (or higher). This version was compiled using JSDK 1.4.2.


For questions or comments, send email to Paulo Azevedo (pja@di.uminho.pt)


