User Guide
mltool is a command line tool that can be used to build ranking models.
Getting Started
Installation
mltool can be easily installed from PyPI using pip:
$ pip install mltool
If you want to install it from source, you need to install numpy first. You can install numpy with easy_install or pip. This is an example using pip:
$ pip install numpy
Then clone the repository and run setup.py:
$ git clone git@bitbucket.org:duilio/mltool.git && cd mltool
$ python setup.py install
Now you should be able to run mltool:
$ mltool -h
Building the first model
Let’s see how to build our first ranking model. In this example we will use a sample dataset provided with svmrank. You can download the dataset here: http://download.joachims.org/svm_light/examples/example3.tar.gz
A dataset contains query–sample tuples. A sample is a vector of feature values. Datasets used for training and testing must provide a label for each sample. Models built with mltool predict labels for unlabelled samples.
Download and unpack the example3.tar.gz file:
$ wget http://download.joachims.org/svm_light/examples/example3.tar.gz
$ tar zxvf example3.tar.gz
The archive contains two files:
train.dat
- This file contains 3 queries with some ranked results.
test.dat
- This file contains a single ranking that we will use to evaluate the model.
mltool uses a different file format than svmrank, so we cannot use these files directly. Fortunately, mltool’s conv command can convert files from svmrank’s format to mltool’s. Here is how to use it:
$ mltool conv example3/train.dat train.tsv
$ mltool conv example3/test.dat test.tsv
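To give a rough idea of what the conversion involves, here is an illustrative sketch, not mltool’s actual code. It assumes the standard svmlight/svmrank input format (``label qid:N index:value ... # comment``) and emits the qid/label/feature columns that appear in the converted files later in this guide; the function name is hypothetical:

```python
def svmrank_to_tsv(lines, num_features):
    """Illustrative sketch: convert svmrank-formatted lines to TSV rows."""
    # Header row matching the column layout seen in the converted files
    yield "\t".join(["qid_", "label_"] + ["f%d" % i for i in range(num_features)])
    for line in lines:
        fields = line.split("#")[0].split()   # strip trailing comments
        if not fields:
            continue
        label = fields[0]
        qid = fields[1].split(":")[1]          # "qid:4" -> "4"
        feats = ["0.0"] * num_features         # absent features default to 0
        for tok in fields[2:]:
            idx, value = tok.split(":")
            feats[int(idx) - 1] = value        # svmlight indices are 1-based
        yield "\t".join([qid, label] + feats)

# One line in svmrank format, converted to a tab-separated row
for row in svmrank_to_tsv(["4 qid:4 1:1.0 4:0.2 5:1.0 # a comment"], 5):
    print(row)
```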
The first argument on the mltool command line is the command to execute. mltool supports several commands; you can see a list of them by running mltool -h. Each command has its own list of arguments. We’ve just run the conv command, which needs only two parameters: the input and the output files. You can have a look at the help for the conv command with mltool conv -h.
Now we have both a train set to build a model and a test set to evaluate it. mltool currently supports only two algorithms to build a model: CART and Random Forest. Let’s see how to build a model:
$ mltool rf-train -t 2 -s 1 train.tsv test.tsv -o ex3.pkl
2012-04-09 23:05:26,762 Reading training set...
2012-04-09 23:05:26,765 Reading validation set test.tsv...
2012-04-09 23:05:26,765 Training forest...
1 0.288675 1.000000 0.500000 0.861688
2012-04-09 23:05:26,805 Trees 1/2 generated.
2 0.322749 1.000000 0.250000 1.000000
2012-04-09 23:05:26,816 Trees 2/2 generated.
Feature gains:
f0 91.694444444444457
f2 41.333333333333336
f1 30.666666666666668
f4 18.722222222222221
rf-train builds a model using Random Forest. -t 2 sets the number of trees in the forest to 2; a forest would normally be bigger, but for this example that’s fine. -s 1 sets the seed for the random number generator, which lets us rebuild the same model in the future. The remaining command-line arguments are the datasets and the name of the output file that will contain the model.
rf-train shows some information while building the model. The lines starting with a date are log messages that keep us updated on the building process; we can safely ignore them for now.
The other lines are more interesting for our purpose. First, mltool prints some statistics for each tree added to the model. Random Forest models are built from several trees; each tree adds some logic to the model, which generally improves it. We can measure the improvement by looking at some metrics: each statistics line shows the number of trees in the model evaluated so far, followed by RMSE and NDCG scores for each dataset passed on the command line. The first pair of RMSE/NDCG scores refers to the train data, the second pair to the test data.
Observing the output, we can see that the final model gets an NDCG score of 1.0 on the test data. This means the model produced a perfect ordering for each ranking in the test set (just one). The RMSE score on the train data is about 0.32; it tells us something more about the distance between the predicted labels and the actual labels in the dataset: the lower the RMSE, the better the predicted labels fit the expected ones.
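mltool’s exact NDCG formulation isn’t spelled out here, but one common variant is enough to see why a perfect ordering scores 1.0. The sketch below uses illustrative function names, not mltool’s API:

```python
import math

def dcg(labels):
    # Discounted cumulative gain, one common formulation:
    # sum of (2^relevance - 1) / log2(position + 1), positions starting at 1
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(labels))

def ndcg(labels, scores):
    # Order the true labels by descending predicted score,
    # then normalize by the DCG of the ideal (sorted) ordering.
    ranked = [lab for _, lab in sorted(zip(scores, labels), reverse=True)]
    ideal = sorted(labels, reverse=True)
    return dcg(ranked) / dcg(ideal)

# The test query from this example: true labels vs. predicted labels.
# The predictions rank the samples in the ideal order, hence NDCG = 1.0.
print(ndcg([4, 3, 2, 1], [3.5, 3.0, 2.0, 1.0]))  # 1.0
```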
Finally, mltool shows a feature gain table. This table gives a quick hint about how useful each feature in the train set is, which can be helpful when selecting features.
Now we want to know something more about the predicted ranking: the predicted labels themselves. mltool provides an eval command that gives us exactly that:
$ mltool eval -o preds.txt ex3.pkl test.tsv
2012-04-09 23:06:53,577 Reading dataset...
RMSE: 0.25
NDCG: 1.0
2012-04-09 23:06:53,578 Writing predicted labels...
$ cat preds.txt
3.5
3.0
2.0
1.0
$ paste <(echo; cat preds.txt) test.tsv
qid_ label_ f0 f1 f2 f3 f4
3.5 4 4.0 1.0 0.0 0.0 0.20000000000000001 1.0
3.0 4 3.0 1.0 1.0 0.0 0.29999999999999999 0.0
2.0 4 2.0 0.0 0.0 0.0 0.20000000000000001 1.0
1.0 4 1.0 0.0 0.0 1.0 0.20000000000000001 0.0
eval needs just the model and the file we want to evaluate. -o filename must be used to output the predicted labels; otherwise mltool only prints the RMSE and NDCG scores.
The preds.txt file contains one number per line: the predicted labels, in the same order as the samples in the evaluated file. The last command we’ve run shows preds.txt and test.tsv side by side so we can compare the predicted and the expected labels.
We have correctly guessed the label value for three samples out of four. The remaining sample was predicted with a slightly different score, but this doesn’t affect the final ranking, which is why our NDCG score is 1.0. We can compute the RMSE score manually:
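A minimal sketch in Python, using the predicted labels from preds.txt and the true labels shown in the paste output above:

```python
import math

# Predicted labels (preds.txt) and true labels (label_ column of test.tsv)
preds = [3.5, 3.0, 2.0, 1.0]
labels = [4.0, 3.0, 2.0, 1.0]

# RMSE: square root of the mean of the squared prediction errors.
# Only the first sample is off (by 0.5), so RMSE = sqrt(0.25 / 4) = 0.25.
rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, labels)) / len(preds))
print(rmse)  # 0.25
```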
That explains the RMSE score seen in the statistics of the eval command.
Now you know everything you need to build a ranker using mltool, and you can start experimenting with your own rankers. When you want to embed a built model into your application, have a look at Embed the model in the API Documentation section.