API Documentation

Embed the model

One of the most common uses of the mltool API is to embed a ranker model built with the command line tool into your own application.

The models are picklable files, and you can load them with the standard pickle module:

import pickle
with open('model.pkl', 'rb') as fmodel:
    model = pickle.load(fmodel)

Then you can use the mltool API to predict the score of a sample:

from mltool.predict import predict
pred = predict(model, {'f0': 1.0,
                       'f1': 0.0,
                       'f2': 0.0,
                       'f3': 0.2,
                       'f4': 1.0})

See the predict and predict_all functions for more details.

Prediction

mltool.predict

The predict module contains functions to predict the target variable score given a model.


mltool.predict.predict(model, sample)

Predict the score of a sample

Parameters:
  • model – the prediction model
  • sample – a dictionary/namedtuple with all the features required by the model
Returns:

the predicted score
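A sample can be passed as a namedtuple instead of a dictionary. The following is a minimal sketch, assuming a model trained on the features f0 to f4 used in the embedding example. The Sample type name is hypothetical, and the predict call is left commented out because it needs mltool and a loaded model:

```python
from collections import namedtuple

# Hypothetical sample type; field names must match the model's feature names.
Sample = namedtuple('Sample', ['f0', 'f1', 'f2', 'f3', 'f4'])

sample = Sample(f0=1.0, f1=0.0, f2=0.0, f3=0.2, f4=1.0)

# The same features expressed as a dictionary.
sample_dict = dict(sample._asdict())

# With mltool installed and a model loaded:
# from mltool.predict import predict
# pred = predict(model, sample)
```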


mltool.predict.predict_all(model, dataset)

Predict targets for each sample in the dataset

Parameters:
  • model – the prediction model
  • dataset – a Dataset object
Returns:

the predicted scores

Model Training

mltool implements some algorithms to train regression models. Currently it mainly supports Random Forest and regression trees.

mltool.forest

mltool.forest.train_random_forest(dataset, num_trees, max_depth, ff, seed=1, processors=None, callback=None)

Train a random forest model.

Parameters:
  • dataset – the labelled dataset used for training
  • num_trees – number of trees in the forest
  • max_depth – maximum depth of the trees
  • ff – feature fraction to use for the split (1.0 means all)
  • seed – seed for the random number generator
  • processors – number of processors to use (all if None)
  • callback (None or callable) – function called for each trained tree; it takes the new tree as its parameter
Returns:

An mltool model with a forest of trees.
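The callback parameter can be used to report progress while the forest is being trained. The following is a minimal sketch: the ProgressCallback class is hypothetical, the loop only simulates what the trainer would do, and it assumes the callback runs in the calling process, which this page does not state for the multi-processor case:

```python
class ProgressCallback:
    """Count trained trees and keep a reference to each one."""

    def __init__(self, num_trees):
        self.num_trees = num_trees
        self.trees = []

    def __call__(self, tree):
        # Called once per trained tree, with the new tree as the argument.
        self.trees.append(tree)
        print('trained tree %d of %d' % (len(self.trees), self.num_trees))


progress = ProgressCallback(num_trees=3)

# Stand-in for the training loop. With mltool the call would look like:
# model = train_random_forest(dataset, 3, 4, 0.5, callback=progress)
for fake_tree in ('tree-a', 'tree-b', 'tree-c'):
    progress(fake_tree)
```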

mltool.decisiontree

mltool.decisiontree.train_decision_tree(dataset, max_depth, split_func=None, seed=None)

Train a decision tree for regression.

It is possible to customize the function used to find the split for each node in the tree. The split_func is a callable that accepts two parameters: an array of labels and a 2d-array of samples. It returns None if no split is found; otherwise it returns a tuple with the index of the feature, the value for the split, and a gain score.

The 2d-array of samples has one column for each sample and one row per feature.

Parameters:
  • dataset – the labelled dataset used for training
  • max_depth – maximum depth of the tree
  • split_func (None or callable) – function to use to find the split. If None then a default one is used.
  • seed – seed for the random number generator
Returns:

An mltool model with a single decision tree.
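The split_func contract described above can be illustrated with a small example. The following is a sketch, not mltool's default split function: it scores a split by the reduction in the sum of squared errors, and it assumes that samples with a feature value less than or equal to the split value go to one side. Both choices, and the names variance_split and sum_squared_error, are assumptions made for illustration. It relies only on indexing and iteration, so it runs on nested lists as well:

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if len(values) == 0:
        return 0.0
    mean = sum(values) / float(len(values))
    return sum((v - mean) ** 2 for v in values)


def variance_split(labels, samples):
    """Return (feature_index, split_value, gain), or None if no split helps.

    samples[i][j] is the value of feature i for sample j.
    """
    labels = list(labels)
    total_error = sum_squared_error(labels)
    best = None
    for feature_index in range(len(samples)):
        row = list(samples[feature_index])
        # Candidate thresholds: every distinct value except the largest,
        # so that the right-hand side is never empty.
        for value in sorted(set(row))[:-1]:
            left = [y for x, y in zip(row, labels) if x <= value]
            right = [y for x, y in zip(row, labels) if x > value]
            gain = (total_error
                    - sum_squared_error(left)
                    - sum_squared_error(right))
            if gain > 0 and (best is None or gain > best[2]):
                best = (feature_index, value, gain)
    return best


labels = [1.0, 1.0, 5.0, 5.0]
samples = [[0.0, 1.0, 0.0, 1.0],   # feature 0, one column per sample
           [0.1, 0.2, 0.8, 0.9]]   # feature 1
print(variance_split(labels, samples))  # (1, 0.2, 16.0)

# With mltool installed:
# model = train_decision_tree(dataset, 3, split_func=variance_split)
```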

Model Evaluation

mltool.evaluate

Two metrics are considered for the evaluation: RMSE and NDCG.


mltool.evaluate.evaluate_preds(preds, dataset, ndcg_at=10)

Evaluate predicted values against a labelled dataset.

Parameters:
  • preds (list-like) – predicted values, in the same order as the samples in the dataset
  • dataset – a Dataset object with all labels set
  • ndcg_at – position at which to evaluate NDCG
Returns:

The pair of RMSE and NDCG scores.
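For reference, the two metrics can be computed by hand as follows. This is a sketch of one common formulation, not necessarily the one mltool uses: the NDCG gain is taken as 2^label - 1 with a log2 position discount, and all samples are treated as belonging to a single query. How mltool groups samples by query id is not described here. The function names are hypothetical:

```python
import math


def rmse(preds, labels):
    """Root mean squared error between predictions and labels."""
    squared = [(p - y) ** 2 for p, y in zip(preds, labels)]
    return math.sqrt(sum(squared) / len(squared))


def dcg(relevances, k):
    """Discounted cumulative gain over the first k relevances."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg(preds, labels, k=10):
    """NDCG at k for one query; samples are ranked by descending prediction."""
    ranked = [y for _, y in sorted(zip(preds, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    ideal_dcg = dcg(sorted(labels, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg(ranked, k) / ideal_dcg


labels = [2.0, 1.0, 0.0]
good_preds = [3.0, 2.0, 1.0]   # ranks the samples in the ideal order
bad_preds = [1.0, 2.0, 3.0]    # ranks them in the reverse order

print(rmse(good_preds, labels), ndcg(good_preds, labels))
print(rmse(bad_preds, labels), ndcg(bad_preds, labels))
```

A perfect ranking gives an NDCG of 1.0 even when the RMSE is not zero, which is why the two scores are reported together.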

mltool.evaluate.evaluate_model(model, dataset, ndcg_at=10, return_preds=False)

Evaluate a model against a labelled dataset.

Parameters:
  • model – the model to evaluate
  • dataset – a Dataset object with all labels set
  • ndcg_at – position at which to evaluate NDCG
Returns:

The pair of RMSE and NDCG scores.

Utilities

Handling Dataset

class mltool.utils.Dataset

The Dataset class is a namedtuple which represents a set of samples with their labels and query ids.

labels

An array of labels. Each label is a float.

queries

A list of query ids.

samples

A 2d-array (numpy.array) of samples. It consists of one sample per column and one row for each feature.

feature_names

A sequence of feature names. Features are in the same order as the rows of samples.
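The column-per-sample layout is easy to get backwards, so here is a small illustration. It is a sketch using a hypothetical stand-in type, DatasetSketch, that mirrors the four documented fields. The field order is an assumption, and nested lists are used in place of the numpy array so that the example has no dependencies:

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the documented fields of Dataset.
DatasetSketch = namedtuple('DatasetSketch',
                           ['labels', 'queries', 'samples', 'feature_names'])

dataset = DatasetSketch(
    labels=[1.0, 0.0, 2.0],
    queries=['q1', 'q1', 'q2'],
    # One row per feature, one column per sample.
    samples=[[0.5, 0.1, 0.9],    # f0 for samples 0, 1, 2
             [1.0, 0.0, 1.0]],   # f1 for samples 0, 1, 2
    feature_names=['f0', 'f1'],
)


def sample_as_dict(dataset, index):
    """Collect column `index` into a {feature_name: value} dictionary."""
    return {name: row[index]
            for name, row in zip(dataset.feature_names, dataset.samples)}


print(sample_as_dict(dataset, 2))  # {'f0': 0.9, 'f1': 1.0}
```

The resulting dictionary has the same shape as the sample argument accepted by predict.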


mltool.utils.read_input_file(fin)

Read a dataset from a file

Parameters:
  • fin – a file-like object to read the dataset from
Returns:

a Dataset object.