Workshop on Frequent Itemset Mining Implementations (FIMI'04)
1 November 2004, Brighton, UK in conjunction with ICDM'04
Submission Rules
All submissions to the workshop must adhere to the following
guidelines.
- All submitted code should use the ANSI C/C++ programming language.
The target compiler and platform will be g++ and Linux respectively (g++ (GCC) 3.3.3 20040412 (Red Hat Linux 3.3.3-7)).
The memory usage of each implementation will be tested using the 'memusage' command
(memusage (GNU libc) 2.3.3), and the time usage will be measured using the 'time'
command (GNU time 1.7).
-
Source code that is accepted will be made publicly available
(via a web link or source code) on the FIM algorithm repository.
Note that we are flexible about the licensing issues.
The base line is that all code must be made freely available for research purposes.
The authors are free to add restrictions for commercial uses.
Thus the code may use GPL/Lesser GPL licensing or an NDA
(non disclosure agreement) for research purposes.
- The code must be accompanied by a paper describing the implementation,
with a detailed performance study on public datasets,
and a detailed explanation on the efficacy/effectiveness/efficiency of the implementation.
- Any existing (including variants) or new algorithm can be
implemented for the FIM tasks (all, closed and maximal set
mining). As mentioned above, the authors must describe why the
techniques used are likely to be efficient/effective.
- The code must follow minimal standards of documentation and must be understandable.
-
All submissions should contain a single Makefile
which, after issuing the command 'make' (GNU Make version 3.79.1),
creates an executable 'fim_all' (for all frequent itemsets),
'fim_closed' (for closed frequent itemsets) and 'fim_maximal' (for maximal
frequent itemsets)
-
The program should be executable with 3 parameters:
- The path of the dataset-file
- The minimal support threshold, which should be a non-negative integer.
An itemset is frequent if its absolute support is larger or equal to this threshold.
- The output file. (should be able to be /dev/null!)
-
The dataset input must use the following ascii format only!
Each transaction is stored on a separate line as a list of items separated by
white space and ending with a newline. Each item is a non-negative integer.
The items in the test datasets will be consecutively numbered starting from 0, and each transaction
will be sorted ascending. (Note that this is not yet the case for the provided test datasets.)
For fairness of comparison, the use of multi-threading, advanced pipelining, low level memory optimizations,
or direct expolitation of the hardware, etc. are disallowed!
-
After execution of the program, the output file should contain
all frequent itemsets together with their absolute support.
Each line contains a single frequent itemset as a list of items separated by whitespace.
At the end of the line, its absolute support is printed between parantheses.
For example:
1 2 3 (100)
represents the itemset containing items 1,2 and 3 and it has an absolute support of 100.
Note that also the empty set is considered as an itemset and its support should also be printed
on a separate line in the output file.
-
During or after execution, the program must print
the total number of frequent itemsets and this number for each length between
0 and k separately to the standard output stream,
with k the size of the largest (closed/maximal) frequent set.
If there is no maximal/closed frequent itemset for a given length, print 0!
For example:
32
1
5
10
10
5
1
means that there are 32 frequent itemsets, of which 1 frequent itemset is of length 0,
5 of length 1,
10 of length 2,
10 of length 3,
5 of length 4 and
1 of length 5.
No other output is printed to the standard output stream.
-
The exit status of the algorithm is 0 unless incorrect termination of
the algorithm.
-
No output is sent to the error output stream unless incorrect termination of the algorithm.
Frequently Asked Qestions
-
If the minimal support threshold is set too low,
the number of frequent sets can become too large to
be send to output. Shouldn't output of the frequent sets be optional?
Since
FIM algorithms are meaningless if they do not write the
wanted sets to output, and
different optimizations can be used when itemsets
or not printed to the output, the output file will always
be used as a parameter. This can, however, be /dev/null in order
to prevent a disk overflow.
- Can we use temporary files?
Yes, but they should be created in the tmp subdirectory of
the current working directory.
- Can we assume that items are consecutively numbered?
Yes, the items in the test datasets will be consecutively numbered starting from 0.
(Note that this is not yet the case for the provided test datasets.)
- Can we assume that each transaction is sorted?
Yes, each transaction will be sorted ascending.
(Note that this is not yet the case for some of the provided test datasets.)
- Can we assume that an item is contained at most once in each transaction?
Yes.
(Note that this might not be the case for some of the provided test datasets.)
- According to the submission rules we should perform
"a detailed performance study on public datasets".
Do we have to present experiments on all datasets that are
and will be provided?
No, the experiments should be sufficiently illustrative.
You do not have to use all provided datasets and you can also use your own datasets.
In the latter case, we request that you send us all used datasets to be published
on the FIMI dataset repository.
- Do our implementations have to be object oriented?
No, they should only be compilable with an ANSI C/C++ compiler, such as g++.
- We have our own implementation of some of the well known FIM algorithms,
but we don't have any specific 'new' contributions, can we submit them?
We welcome such contributions. We will select the best implementations
out of all submissions to be posted on the archive.
We expect that these implementations will get a lot of exposure over time.
-
Do we have to submit an implementation for all three FIM tasks, i.e. all, closed and maximal sets?
No, you can choose which category or categories you want us to consider your code for.
|