In this quick post we will prepare a few CSV files containing data for the Training and Test sets in order to load them into Weka and train a classifier.

Prerequisites

To install Weka, use this link

Weka

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.

Weka accepts ARFF files, which have two distinct sections. The first section is the Header, containing the name of the relation, a list of the attributes (the columns in the data), and their types. The second is the Data.
Please note that the @RELATION, @ATTRIBUTE and @DATA declarations are case insensitive.

Header
The ARFF Header section of the file contains the relation and attribute declarations.
The instance data
Each instance is represented on a single line, with carriage returns denoting the end of the instance. Attribute values for each instance can be delimited by commas or tabs. Attribute values must appear in the order in which they were declared in the header section.

The Weka ARFF file we need

Let’s consider the following ARFF file in which we have 2 string attributes, 8 integer attributes and a binary class {yes,no} that reflects the interest of the user.
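As a rough sketch of what such a file could look like, the snippet below writes a small ARFF file with a here-document. The relation name and the attribute names (events, a1, genre, source and so on) are placeholders and not the names used in the original data. The column order (column 1 and columns 4-10 numeric, columns 2-3 string, class last) follows the CSVLoader options used later in this post, and the two data rows come from the test commands at the end.

```shell
#!/usr/bin/env bash
# Write a small example ARFF file. All names below are hypothetical placeholders.
arff_file="$(mktemp)"

cat > "$arff_file" <<'EOF'
@RELATION events

@ATTRIBUTE a1     NUMERIC
@ATTRIBUTE genre  STRING
@ATTRIBUTE source STRING
@ATTRIBUTE a4     NUMERIC
@ATTRIBUTE a5     NUMERIC
@ATTRIBUTE a6     NUMERIC
@ATTRIBUTE a7     NUMERIC
@ATTRIBUTE a8     NUMERIC
@ATTRIBUTE a9     NUMERIC
@ATTRIBUTE a10    NUMERIC
@ATTRIBUTE class  {yes,no}

@DATA
1,Fantasy,final_cinema_event,612578,784652,412593,1,1,0,0,yes
0,Drammatico,final_cinema_event,748987,475122,412593,0,1,1,0,no
EOF

# Show the result: 2 string attributes, 8 numeric ones and the nominal class.
cat "$arff_file"
```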

The CSV file we have

Let’s suppose that we have the required data stored in MySQL. We can create a CSV file with the relevant attributes for each event. In order to train the classifier we need a Training set, a Validation set and a Test set. Fortunately, Weka will create the Training and the Validation sets from a single list of instances. Hence, our goal is to generate only two ARFF files.

Training Set
Let’s suppose we have two files: one list of comma-separated positive samples and one of negative samples. Unfortunately, we still need to append ‘,yes’ to each positive line and ‘,no’ to each negative one.

Positive instances: path /tmp/TEST_POS_userk.csv in which userk is the userID and changes for each user.

Negative instances: path /tmp/TEST_NEG_userk.csv

The Bash Script

Actions required:

  • Append labels to pos/neg files
  • Check the length of the files
  • Optional: sort lines in file
  • Merge files
  • Create Folder if it does not exist
  • Add first line for attribute’s details
  • Call Weka CSVLoader to create Arff file

Append labels to files

First, we need to create a few variables containing the paths of the files.
Then we can call ls on the directory that holds the files, /tmp, with the -t option to sort by modification time, newest first. We send the output of ls to grep through a pipe, using the -v option to invert the sense of matching (that is, to select the non-matching lines), and keep only the first result. Finally, we use sed to append a string at the end of each line. For more details see this previous post: Linux utility commands or google sed.
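The steps above could be sketched as follows. This is not the original script: the variable names are made up, head is my choice for keeping the first result, and sed -i assumes GNU sed as found on most Linux systems. The demo works in a scratch directory so nothing in /tmp is modified.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so real files in /tmp are left untouched.
dir="$(mktemp -d)"
user="userk"   # initial portion of the filename (hypothetical variable name)

# Two sample rows taken from the test data at the end of this post.
echo '1,Fantasy,final_cinema_event,612578,784652,412593,1,1,0,0' > "$dir/TEST_POS_${user}.csv"
echo '0,Drammatico,final_cinema_event,748987,475122,412593,0,1,1,0' > "$dir/TEST_NEG_${user}.csv"

# ls -t lists newest first; grep -v drops the lines that match,
# so excluding NEG leaves the positive file (and vice versa).
pos_file="$(ls -t "$dir" | grep "$user" | grep -v 'NEG' | head -n 1)"
neg_file="$(ls -t "$dir" | grep "$user" | grep -v 'POS' | head -n 1)"

# In the sed pattern, $ matches the end of each line, so the label is appended there.
sed -i 's/$/,yes/' "$dir/$pos_file"
sed -i 's/$/,no/'  "$dir/$neg_file"

cat "$dir/$pos_file" "$dir/$neg_file"
```

Running the block twice on the same files would append the label twice, so a real script should label each file only once.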

Check the length of files

We can use wc -l, which prints the newline count for a specified file (in this case, for the output of cat), and save the result in a variable:
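A minimal sketch of the counting step, using throwaway files and made-up variable names rather than the ones from the original script:

```shell
#!/usr/bin/env bash
# Throwaway input files standing in for the labelled CSV files.
dir="$(mktemp -d)"
printf 'a,yes\nb,yes\nc,yes\n' > "$dir/pos.csv"
printf 'd,no\ne,no\nf,no\n'    > "$dir/neg.csv"

# Reading from a pipe makes wc print only the number, without the filename.
n_pos="$(cat "$dir/pos.csv" | wc -l)"
n_neg="$(cat "$dir/neg.csv" | wc -l)"

# Shell arithmetic gives the total number of instances.
total=$((n_pos + n_neg))

echo "Positive: $n_pos"
echo "Negative: $n_neg"
echo "Total: $total"
```

Since wc -l counts newline characters, a file whose last line lacks a trailing newline is reported as one line shorter.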

Sort lines and Merge files

Let’s use cat to merge the files and sort to order the lines.
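One possible way to do this is shown below. The script output at the end of this post mentions a pseudo-random ordering, so the sketch assumes sort -R (GNU coreutils), which may not be the exact option the original script uses. File names are placeholders.

```shell
#!/usr/bin/env bash
# Throwaway labelled files standing in for the positive and negative sets.
dir="$(mktemp -d)"
printf '1,a,yes\n2,b,yes\n' > "$dir/pos.csv"
printf '3,c,no\n4,d,no\n'   > "$dir/neg.csv"

# cat concatenates both files; sort -R orders the lines by a random hash,
# which mixes positive and negative instances.
cat "$dir/pos.csv" "$dir/neg.csv" | sort -R > "$dir/merged.csv"

cat "$dir/merged.csv"
```

Note that sort -R keeps identical lines next to each other; if duplicates matter, shuf is an alternative.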

Add first line for attribute’s details

The CSVLoader class needs to know the names of the attributes it will deal with. We specify them in the first line of the CSV file, which we can create with the sed command.
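A sketch of that step, assuming GNU sed. The column names in the header are placeholders, since the original ones are not shown here; what matters is that there is one name per column, including the class label.

```shell
#!/usr/bin/env bash
# Throwaway merged file with one labelled row from the test data.
dir="$(mktemp -d)"
csv="$dir/merged.csv"
echo '1,Fantasy,final_cinema_event,612578,784652,412593,1,1,0,0,yes' > "$csv"

# Hypothetical column names: 10 attributes plus the class label.
header="a1,genre,source,a4,a5,a6,a7,a8,a9,a10,class"

# GNU sed: "1i" inserts the given text before line 1, -i edits the file in place.
sed -i "1i $header" "$csv"

head -n 2 "$csv"
```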

Create new folder and ARFF file

In order to convert the CSV file to an ARFF file we use the CSVLoader class, which reads a source in comma-separated format and outputs an ARFF file.
From the documentation we notice that we can force specific attributes to be of a certain type: nominal, string or numeric.

In this case we need to specify the options:

  • -N “last” to indicate that the last value in the file refers to the nominal class
  • -S “2,3” to specify STRING types
  • -R “1,4-10” for numeric values

Here are the resulting commands:
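A sketch of how these two steps might look, not the original commands. The location of weka.jar, the file names and the header names are assumptions, and so is the placement of the CSV file name before the options; running the loader with -h prints the usage if your Weka version expects something different. The conversion is only attempted when weka.jar is found.

```shell
#!/usr/bin/env bash
# Hypothetical locations; adjust them to your setup.
WEKA_JAR="${WEKA_JAR:-$HOME/weka/weka.jar}"
work_dir="$(mktemp -d)"
csv="$work_dir/merged_userk.csv"
out_dir="$work_dir/arff"
arff="$out_dir/userk.arff"

# Tiny CSV with a header line (placeholder names) and one labelled row.
printf '%s\n' \
  'a1,genre,source,a4,a5,a6,a7,a8,a9,a10,class' \
  '1,Fantasy,final_cinema_event,612578,784652,412593,1,1,0,0,yes' > "$csv"

# -p creates the folder only if it is missing and does not fail if it exists.
mkdir -p "$out_dir"

# Build the command as an array so every option stays a separate argument.
cmd=(java -cp "$WEKA_JAR" weka.core.converters.CSVLoader "$csv" -N last -S 2,3 -R 1,4-10)

if [ -f "$WEKA_JAR" ]; then
  # CSVLoader prints the ARFF to standard output, so redirect it to a file.
  "${cmd[@]}" > "$arff"
else
  echo "weka.jar not found, would run: ${cmd[*]} > $arff"
fi
```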

Final Script and Results

Here is the final script:

To test the script run the following commands


userk@dopamine:~$ echo '1,Fantasy,final_cinema_event,612578,784652,412593,1,1,0,0
0,Animazione,final_cinema_event,665355,765651,412593,1,1,0,1
1,Commedia,final_cinema_event,326814,448159,456148,1,1,0,1' > /tmp/TEST_POS_userk.csv
userk@dopamine:~$ echo '0,Drammatico,final_cinema_event,748987,475122,412593,0,1,1,0
0,Commedia,final_cinema_event,778845,784652,787974,0,0,1,0
1,Biografia,final_cinema_event,998511,475122,143153,0,0,1,0' > /tmp/TEST_NEG_userk.csv
userk@dopamine:~$ git clone https://gist.github.com/8ddbb6418c0e707729d08e9152ba49af.git
userk@dopamine:~$ cd 8ddbb6418c0e707729d08e9152ba49af
userk@dopamine:~/8ddbb6418c0e707729d08e9152ba49af$ sudo chmod 770 mergeAndConvert2Arff.sh 
userk@dopamine:~/8ddbb6418c0e707729d08e9152ba49af$ ./mergeAndConvert2Arff.sh
Please, enter the initial portion of the filename you want to modify
userk
The following file will be modified:
Labelled file: TEST_POS_userk.csv
Unlabelled file: TEST_NEG_userk.csv
Controlling lenght of files:
Labelled:  3
UnLabelled:  3
Total Test Set instances:  6
Sorting pseudo-randomly lines

Hope it helps!

References