User Guide
mltool is a command line tool that can be used to build ranking models.
Getting Started
Installation
mltool can be easily installed from PyPI using pip:
$ pip install mltool
If you want to install it from source, you need to install numpy first. You can install numpy with easy_install or pip. This is an example using pip:
$ pip install numpy
Then clone the repository and run setup.py:
$ git clone git@bitbucket.org:duilio/mltool.git && cd mltool
$ python setup.py install
Now you should be able to run mltool:
$ mltool -h
Building the first model
Let’s see how to build our first ranking model. In this example we will use a sample dataset provided with svmrank. You can download the dataset here: http://download.joachims.org/svm_light/examples/example3.tar.gz
A dataset contains query–sample tuples. A sample is a vector of feature values. Datasets used for training and testing must provide a label for each sample. Models built with mltool predict labels for unlabelled samples.
Download and unpack the example3.tar.gz file:
$ wget http://download.joachims.org/svm_light/examples/example3.tar.gz
$ tar zxvf example3.tar.gz
The archive contains two files:
train.dat
- This file contains 3 queries with some ranked results.
test.dat
- This file contains a single ranking that we will use to evaluate the model.
mltool uses a different file format than svmrank, so we cannot use these files directly. Fortunately, mltool’s conv command can convert files from svmrank’s format to mltool’s. Here is how to use it:
$ mltool conv example3/train.dat train.tsv
$ mltool conv example3/test.dat test.tsv
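To give a rough idea of what the conversion involves, here is an illustrative sketch, not mltool’s actual code. It assumes the standard svmlight/svmrank input format (``label qid:N index:value ... # comment``) and emits the qid/label/feature columns that appear in the converted files later in this guide; the function name is hypothetical:

```python
def svmrank_to_tsv(lines, num_features):
    """Illustrative sketch: convert svmrank-formatted lines to TSV rows."""
    # Header row matching the column layout seen in the converted files
    yield "\t".join(["qid_", "label_"] + ["f%d" % i for i in range(num_features)])
    for line in lines:
        fields = line.split("#")[0].split()   # strip trailing comments
        if not fields:
            continue
        label = fields[0]
        qid = fields[1].split(":")[1]          # "qid:4" -> "4"
        feats = ["0.0"] * num_features         # absent features default to 0
        for tok in fields[2:]:
            idx, value = tok.split(":")
            feats[int(idx) - 1] = value        # svmlight indices are 1-based
        yield "\t".join([qid, label] + feats)

# One line in svmrank format, converted to a tab-separated row
for row in svmrank_to_tsv(["4 qid:4 1:1.0 4:0.2 5:1.0 # a comment"], 5):
    print(row)
```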
The first argument on the mltool command line is the command to execute. mltool supports several commands; you can see a list of them by running mltool -h. Each command has its own list of arguments. We’ve just run the conv command, which needs only two parameters: the input and the output files. You can have a look at the help for the conv command with mltool conv -h.
Now we have both a train set to build a model and a test set to evaluate it. mltool currently supports only two algorithms to build a model: CART and Random Forest. Let’s see how to build a model:
$ mltool rf-train -t 2 -s 1 train.tsv test.tsv -o ex3.pkl
2012-04-09 23:05:26,762 Reading training set...
2012-04-09 23:05:26,765 Reading validation set test.tsv...
2012-04-09 23:05:26,765 Training forest...
1 0.288675 1.000000 0.500000 0.861688
2012-04-09 23:05:26,805 Trees 1/2 generated.
2 0.322749 1.000000 0.250000 1.000000
2012-04-09 23:05:26,816 Trees 2/2 generated.
Feature gains:
f0 91.694444444444457
f2 41.333333333333336
f1 30.666666666666668
f4 18.722222222222221
rf-train builds a model using Random Forest. -t 2 sets the number of trees in the forest to 2; a forest would normally be bigger, but for this example that’s fine. -s 1 sets the seed for the random number generator, which lets us rebuild the same model in the future. The remaining command-line arguments are the datasets and the name of the output file that will contain the model.
rf-train shows some information while building the model. The lines starting with a date are log messages that keep us updated on the building process; we can safely ignore them for now.
The other lines are more interesting for our purpose. First, mltool prints some statistics for each tree added to the model. Random Forest models are built from several trees; each tree adds some logic to the model, which generally improves it. We can measure the improvement by looking at some metrics: each statistics line shows the number of trees in the model evaluated so far, followed by RMSE and NDCG scores for each dataset passed on the command line. The first pair of RMSE/NDCG scores refers to the train data, the second pair to the test data.
Observing the output, we can see that the final model gets an NDCG score of 1.0 on the test data. This means the model produced a perfect ordering for each ranking in the test set (just one). The RMSE score on the train data is about 0.32; it tells us something more about the distance between the predicted labels and the actual labels in the dataset: the lower the RMSE, the better the predicted labels fit the expected ones.
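mltool’s exact NDCG formulation isn’t spelled out here, but one common variant is enough to see why a perfect ordering scores 1.0. The sketch below uses illustrative function names, not mltool’s API:

```python
import math

def dcg(labels):
    # Discounted cumulative gain, one common formulation:
    # sum of (2^relevance - 1) / log2(position + 1), positions starting at 1
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(labels))

def ndcg(labels, scores):
    # Order the true labels by descending predicted score,
    # then normalize by the DCG of the ideal (sorted) ordering.
    ranked = [lab for _, lab in sorted(zip(scores, labels), reverse=True)]
    ideal = sorted(labels, reverse=True)
    return dcg(ranked) / dcg(ideal)

# The test query from this example: true labels vs. predicted labels.
# The predictions rank the samples in the ideal order, hence NDCG = 1.0.
print(ndcg([4, 3, 2, 1], [3.5, 3.0, 2.0, 1.0]))  # 1.0
```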
Finally, mltool shows a feature gain table. This table gives a quick hint about how useful each feature in the train set is, which can be helpful when selecting features.
Now we want to know something more about the predicted ranking: the predicted labels themselves. mltool provides an eval command that gives us exactly that:
$ mltool eval -o preds.txt ex3.pkl test.tsv
2012-04-09 23:06:53,577 Reading dataset...
RMSE: 0.25
NDCG: 1.0
2012-04-09 23:06:53,578 Writing predicted labels...
$ cat preds.txt
3.5
3.0
2.0
1.0
$ paste <(echo; cat preds.txt) test.tsv
qid_ label_ f0 f1 f2 f3 f4
3.5 4 4.0 1.0 0.0 0.0 0.20000000000000001 1.0
3.0 4 3.0 1.0 1.0 0.0 0.29999999999999999 0.0
2.0 4 2.0 0.0 0.0 0.0 0.20000000000000001 1.0
1.0 4 1.0 0.0 0.0 1.0 0.20000000000000001 0.0
eval needs just the model and the file we want to evaluate. -o filename must be used to output the predicted labels; otherwise mltool only prints the RMSE and NDCG scores.
The preds.txt file contains one number per line: the predicted labels, in the same order as the samples in the evaluated file. The last command we’ve run shows preds.txt and test.tsv side by side so we can compare the predicted and the expected labels.
We have correctly guessed the label value for three samples out of four. The remaining sample was predicted with a slightly different score, but this doesn’t affect the final ranking, which is why our NDCG score is 1.0. We can compute the RMSE score manually:
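A minimal sketch in Python, using the predicted labels from preds.txt and the true labels shown in the paste output above:

```python
import math

# Predicted labels (preds.txt) and true labels (label_ column of test.tsv)
preds = [3.5, 3.0, 2.0, 1.0]
labels = [4.0, 3.0, 2.0, 1.0]

# RMSE: square root of the mean of the squared prediction errors.
# Only the first sample is off (by 0.5), so RMSE = sqrt(0.25 / 4) = 0.25.
rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, labels)) / len(preds))
print(rmse)  # 0.25
```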
That explains the RMSE score seen in the statistics of the eval command.
Now you know everything you need to build a ranker using mltool, and you can start experimenting with your own rankers. When you want to embed a built model into your application, have a look at Embed the model in the API Documentation section.