

CAREN - Class Association Rules ENgine

README - June 18, 2001

Welcome to the distribution of CAREN, a Java-based application
for finding association rules in datasets.

This is a Java implementation of the classic Apriori algorithm.
The original Apriori algorithm is beefed up with several optimizations.
The most important is the trie-based structure used to represent the itemsets.
There is a fast function to check itemset occurrence in transactions, i.e. transactions
are projected onto the trie structure. Candidate generation through the $join$ function
is obtained by a very fast mechanism that inspects the leaf nodes of the trie structure.
The implementation is disk-based. That is, the dataset has no in-memory representation
and in every pass the data is read from disk.
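As an illustration of the trie idea, here is a minimal sketch in Java of storing itemsets in a prefix tree and projecting a transaction onto it to count occurrences. All class and method names are hypothetical; this is not CAREN's actual code.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: itemsets (sorted lists of item ids) are stored as
// paths in a prefix tree, and a transaction is "projected" onto the trie
// to increment the count of every stored itemset it contains.
class ItemsetTrie {
    final Map<Integer, ItemsetTrie> children = new TreeMap<>();
    int count = 0;             // support count of the itemset ending here
    boolean isItemset = false; // true if a stored itemset ends at this node

    // Insert a sorted itemset as a path in the trie.
    void insert(int[] itemset) {
        ItemsetTrie node = this;
        for (int item : itemset)
            node = node.children.computeIfAbsent(item, k -> new ItemsetTrie());
        node.isItemset = true;
    }

    // Project a sorted transaction onto the trie: every stored itemset
    // that is a subset of the transaction gets its count incremented.
    void project(int[] tx, int from) {
        if (isItemset) count++;
        for (int i = from; i < tx.length; i++) {
            ItemsetTrie child = children.get(tx[i]);
            if (child != null) child.project(tx, i + 1);
        }
    }

    // Look up the current count of a stored itemset (0 if absent).
    int countOf(int[] itemset) {
        ItemsetTrie node = this;
        for (int item : itemset) {
            node = node.children.get(item);
            if (node == null) return 0;
        }
        return node.count;
    }
}
```

The recursion in project is what makes counting fast: a transaction descends only into branches whose first item it actually contains.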

There are two programs: $aprioriatt$ and $aprioribas$. The first tackles datasets in the
attribute/value format.
The second is targeted at basket-data-like datasets,
i.e. each line has the form: TRANS-ID    ITEM.
Note that CAREN is case sensitive on the TRANS-ID identifiers. Make sure that your dataset
has this data normalized.
The attribute/value format requires that the first line of the dataset be a description
with the names of the attributes.

To check the different options try the command line 

> java aprioriatt -help

CAREN outputs its execution time. Please bear in mind that this value refers to the
overall execution time (system + user time)!


README - June 14, 2002

This new version (1.6.3) has two forms of attribute discretization.
One is binary discretization, implemented in the C4.5 style using
an entropy measure to select cut points. A class attribute is assumed; the default
is the last attribute described in the dataset.

The second form of discretization is the one of Srikant & Agrawal in SIGMOD'96.
The number of intervals for a discretized attribute is calculated using the following 
formula:

	num_int = (2 * N)/ (minsup * (K - 1))

where K is the partial completeness level and N is the number of attributes to be discretized.
A maximal support is used to control the process of joining adjacent equi-depth intervals.
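As a worked example of the formula above (the class and method names are illustrative, not part of CAREN): with N = 3 attributes to discretize, minsup = 0.2 and partial completeness level K = 3, the formula gives (2 * 3) / (0.2 * 2) = 15 intervals.

```java
// Direct translation of the interval-count formula above.
// Names are illustrative, not CAREN's.
class IntervalCount {
    static int numIntervals(int n, double minsup, double k) {
        // num_int = (2 * N) / (minsup * (K - 1))
        return (int) Math.round((2.0 * n) / (minsup * (k - 1.0)));
    }
}
```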


Also, check the new options for text output of the derived rules.
For the moment we support pure text, CSV format and Prolog format; PMML format may be added in the future.



README - September 30, 2002

There is a batch file called caren for accessing the java programs. Use

> ./caren -help

to obtain the usage format. Do not forget to change the paths in the caren file according to
your java installation.



README - January 13, 2003


The PMML format is now implemented. Check the output options.
PMML is an XML format for describing data mining models.

A few improvements have been made to the candidate counting procedure, so one should
see a slight improvement in the overall execution time.



README - April 22, 2003

Solved a bug in attribute discretization with negative values. Intervals now have the format [a,b]
instead of [a - b].



README - May 22, 2003

New version (1.6.4).

Caren now deals with null values. The switch "-null" defines the character to be interpreted
as the null symbol. Null value replacement is done as follows:
in discrete attributes, null values are replaced by the most frequent value of the attribute;
in continuous attributes, nulls are replaced by the attribute value that minimizes the standard deviation,
that is, the value closest (in absolute difference) to the attribute's average.
The new switch "-ignore" is used to ignore null values instead. In this mode, the item obtained from an attribute/value
pair where the null symbol occurs is removed from the transaction (tuple) before counting.
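The replacement policy can be sketched as follows (a simplified illustration of the stated rules; the class and method names are hypothetical, not CAREN's):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the null-replacement policy described above: discrete
// attributes take the most frequent value, continuous attributes take
// the observed value closest to the mean (which minimizes the standard
// deviation added by the replacement).
class NullReplacement {
    // Replacement for a discrete attribute: the modal value.
    static String mostFrequent(List<String> values) {
        Map<String, Integer> freq = new HashMap<>();
        for (String v : values) freq.merge(v, 1, Integer::sum);
        return Collections.max(freq.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    // Replacement for a continuous attribute: the value closest to the mean.
    static double closestToMean(double[] values) {
        double mean = Arrays.stream(values).average().orElse(0.0);
        double best = values[0];
        for (double v : values)
            if (Math.abs(v - mean) < Math.abs(best - mean)) best = v;
        return best;
    }
}
```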


This version also includes a general performance optimization for $aprioriatt$.

The basket version is now replaced by a much faster version called $aprioriGRAPH$.
This new version can be up to 60% faster than the older one. It is also more efficient in terms of memory 
consumption.


README - September 25, 2003

New version (1.7).

AprioriGraph 1.3.

Includes selection of rules by defining the items that should occur in the
antecedent of the rule. Three options: at least one item of the list provided by
the user; all the items provided; or all rules whose antecedent is covered by
the items provided by the user (suitable for classification purposes).


Implements the conviction and lift metrics. Rules can be filtered by these new metrics.
Lift is calculated as:

		Lift(A -> C) = conf(A -> C) / sup(C)

Conviction is:

		Conv(A -> C) = (1 - sup(C)) / (1 - conf(A -> C))

For conviction, infinity is handled and represented by the symbol "+oo".
Only one metric at a time can be used to filter rules.
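The two formulas translate directly to code. This is a sketch (the class name is hypothetical), with conviction returning positive infinity, printed as "+oo", when confidence is 1:

```java
// Sketch of the lift and conviction formulas above. supC is sup(C) and
// conf is conf(A -> C), both as fractions in [0, 1]. Names illustrative.
class RuleMetrics {
    // Lift(A -> C) = conf(A -> C) / sup(C)
    static double lift(double conf, double supC) {
        return conf / supC;
    }

    // Conv(A -> C) = (1 - sup(C)) / (1 - conf(A -> C)), +oo when conf == 1
    static double conviction(double conf, double supC) {
        if (conf == 1.0) return Double.POSITIVE_INFINITY; // the "+oo" case
        return (1.0 - supC) / (1.0 - conf);
    }
}
```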


In .csv format all metrics are printed along with the rules, for convenience of post-processing.


Improvement was updated to cope with the implemented metrics.
Improvement is calculated according to the user-specified metric (which plays the role of strength measure).
A rule's improvement is evaluated against its generalizations:

	        imp(A->C) = min(strength(A->C) - strength(As->C) : As subset of A),

where strength() can be confidence, lift or conviction.
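Since the minimum of the differences equals the difference to the strongest generalization, the computation can be sketched as follows (illustrative names, not CAREN's code; the caller supplies the strengths of the generalizations As -> C):

```java
import java.util.Collection;
import java.util.Collections;

// Sketch of the improvement formula above:
//   imp(A->C) = min(strength(A->C) - strength(As->C) : As subset of A)
// which is just the difference to the maximal generalization strength.
class Improvement {
    static double improvement(double strengthRule,
                              Collection<Double> strengthOfGeneralizations) {
        return strengthRule - Collections.max(strengthOfGeneralizations);
    }
}
```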



Intervals are now represented in the [a : b] format. This change is required because
commas are used to separate items in lists, and using a comma inside intervals could confuse the parsing
of lists where intervals occur.

When referring to discretized items in lists, e.g. when filtering rules by consequent or antecedent occurrence, write
the item name between double quotes. Ex: -a"CC in [5.0 : 5.0]".



The former java programs $apriorigraph$ and $aprioriatt$ are now combined into a single command.
The caren file includes both versions, for dealing with attribute/value and basket data.
Two switches declare the dataset format (-Bas, -Att); the default is -Bas.
Use:

> java caren -help

to obtain the list of options.



Simple update on null values: the replacement of a null value is only displayed if the replacement is a frequent item.

The class attribute is now defined by name (formerly by number).


Includes a switch (-rs) to define the maximal rule size in number of items (including the consequent).
It effectively bounds the number of database scans.

There is now an option (-H) to filter rules according to the consequent attribute. This option can be combined with (-h).
The several options for selecting rules by consequent and antecedent occurrence are treated as a conjunction of constraints.

Also included is a switch (+A) to specify which attributes should occur in the antecedent (along the same lines
as the +a switch).


Confidence and the other metric filters are applied like support filtering.
That is, the constraint is evaluated as >= rather than >.








README - November, 2003

A novel prediction model is generated for building a classifier out of the derived rules.
In the near future a classifier will be added to the CAREN package.


Small optimization of itemset counting when the *a switch is used. Since only candidates and transactions with subsets of the items
described in the *a switch need to be counted, a considerable performance improvement is obtained.





README - February, 2004

Solved a bug in the improvement calculation. A rule satisfies a minimal improvement if its metric is greater than the maximal value
of the metric among the rules that are more general. If these two values coincide, the more specific rule is preserved.
Example:   

     c = 1.0  s = 0.55    a <- b & c
     c = 1.0  s = 0.56    a <- b

In this case the first rule (more specific) is preserved, assuming a value of minimp = 0.0.



README - March, 2004			ver 1.7.2.

An important algorithmic reformulation improves the implementation, resulting in a considerable speedup
(more than an order of magnitude).


The predict module now includes new classification methods (voting and class distribution).
Do not forget to specify the class attribute when generating the model with caren.


A chi-squared test for independence was added. A new switch (-chi) is used to filter out rules that fail the
test for dependence. The test uses the standard confidence level (95%) and 1 degree of freedom, which is equivalent
to pruning rules whose chi^2 test value is less than 3.84.
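For a rule A -> C this amounts to a 2x2 contingency table over A and C. A minimal sketch, with hypothetical names (CAREN's internals may differ):

```java
// Sketch of the chi^2 independence test with 1 degree of freedom behind -chi.
// a = #(A and C), b = #(A and not C), c = #(not A and C), d = #(not A and not C).
class ChiSquared {
    static double statistic(double a, double b, double c, double d) {
        double n = a + b + c + d;
        double diff = a * d - b * c;
        return n * diff * diff / ((a + b) * (c + d) * (a + c) * (b + d));
    }

    // At the 95% confidence level with 1 degree of freedom the critical
    // value is 3.84: rules below it are pruned as independent.
    static boolean dependent(double a, double b, double c, double d) {
        return statistic(a, b, c, d) >= 3.84;
    }
}
```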


A bug was detected and solved in the Srikant discretization method. Base intervals generated by the equi-depth intervals procedure
are no longer discarded. Thus, truly overlapping intervals are now generated.
The general outcome is that in the new version the frequent itemset counting of discretized attributes is slower and more itemsets
are generated. Typically, for $n$ values of an attribute, the number of generated intervals is O(n^2)!





README - October, 2004			ver 1.7.3.

A new discretization method was implemented: the (-cin) switch for contiguous class intervals discretization.
It uses the "cut points" of binary discretization for deriving intervals. Intervals are contiguous and open on the right-hand side.
Ties are resolved by voting. That is, cases where the same attribute value is associated with different classes are resolved
by taking the most voted class. For example:

		1 2 3 3 3 3 4 5
		y y y n y y n n

The algorithm derives the intervals [1-4[, [4-+oo[.
When voting yields a draw between classes, the algorithm picks the class that leads to a larger left-hand-side interval.
Example:

		1 2 5 5 6 7
		y y n y n n

The algorithm derives the intervals [1-6[, [6-+oo[.

Notice that infinity is represented by the value +1.79E+308 (the largest double value!).




A novel switch for itemset filtering is introduced. The switch -X2 applies a chi^2 test to itemsets that contain any item
(or attribute) specified by the user in options -h or -H.
The test is only applied to frequent itemsets (those that pass the support constraint).
With this option the system is potentially incomplete, since chi^2 does not preserve the downward closure property (unlike support).
That is, if itemset ABC is considered independent then itemset ABCD is never derived, although the latter could pass
the chi^2 test.
Option -X2 can give different results from option -chi in terms of derived association rules:
the latter is a post-processing filter, whereas -X2 is a constraint applied on the fly.





README - December, 2004			ver 1.7.4.

Novel reformulation of the caren itemset counting engine. A slight speed-up was obtained.
This new implementation includes:

* Candidate counting inference as in
  [Pasquier et al.] "Discovering frequent closed itemsets for association rules", ICDT'99.
  Basically, the rule says: if sup(X) == sup(XY) then sup(XYZ) = sup(XZ).
  We apply this rule during candidate generation.
  If the antecedent of the rule holds, a candidate for that itemset is generated but
  not considered in the counting process. Its count is already known, and at the end of the process
  the candidate is treated like the others, i.e. moved to the itemsets trie.

* Reformulation of 2-candidate counting. No candidates are generated for these itemsets. Instead, a matrix
(represented as a flattened array) is used. Counting reduces to a two-level for loop over each transaction.

* Transactions that do not contribute to k-candidate counting are registered. These transactions are not considered
when counting (k+1)-candidates.
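The flattened-matrix idea for 2-candidates can be sketched like this (an illustrative reconstruction, not CAREN's code): a pair (i, j) with i < j over m frequent items maps to one slot of an upper-triangular array, and counting is a nested loop over each transaction.

```java
// Sketch of 2-candidate counting with a flattened triangular matrix.
// Item ids are assumed to be 0..m-1 after the frequent-item remapping.
class PairCounter {
    final int m;          // number of frequent items
    final int[] counts;   // flattened upper-triangular matrix

    PairCounter(int m) {
        this.m = m;
        this.counts = new int[m * (m - 1) / 2];
    }

    // Slot of pair (i, j) with 0 <= i < j < m in the flattened array:
    // rows 0..i-1 contribute (m-1) + (m-2) + ... entries before row i.
    int index(int i, int j) {
        return i * m - i * (i + 1) / 2 + (j - i - 1);
    }

    // Count every 2-itemset of a transaction of sorted, distinct item ids.
    void count(int[] tx) {
        for (int x = 0; x < tx.length; x++)
            for (int y = x + 1; y < tx.length; y++)
                counts[index(tx[x], tx[y])]++;
    }
}
```

This avoids building explicit 2-candidates entirely: every pair has a preallocated counter, which is why the pass is so cheap.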








If you run into memory problems, try increasing the heap size allocated by the java interpreter
(check java -X, and include the flag in the -X caren option).


Notice: the implementation requires the Java 1.2 (or higher) platform. This version was compiled using JSDK 1.4.2.


For questions, comments, send email to Paulo Azevedo (pja@di.uminho.pt)



