API Documentation

Embed the model

One of the most common uses of the mltool API is embedding the ranker model built with the command-line tool into your own application.

The models are picklable files, so you can load them with the standard pickle module:

import pickle
with open('model.pkl', 'rb') as fmodel:
    model = pickle.load(fmodel)

Then you can use the mltool API to predict the score of a sample:

from mltool.predict import predict
pred = predict(model, {'f0': 1.0,
                       'f1': 0.0,
                       'f2': 0.0,
                       'f3': 0.2,
                       'f4': 1.0})

See the predict and predict_all functions for more details.

Prediction

mltool.predict

The predict module contains functions to predict the target variable score given a model.

mltool.predict.predict(model, sample)

Predict the score of a sample.

Parameters:

model – the prediction model
sample – a dictionary/namedtuple with all the features required by the model

Returns:

The predicted score.

mltool.predict.predict_all(model, dataset)

Predict targets for each sample in the dataset.

Parameters:

model – the prediction model
dataset – a dataset

Returns:

The predicted scores.

Model train

mltool implements some algorithms to train regression models. Currently it mainly supports Random Forest and regression trees.

mltool.forest

mltool.forest.train_random_forest(dataset, num_trees, max_depth, ff, seed=1, processors=None, callback=None)

Train a random forest model.

Parameters:

dataset – the labelled dataset used for training
num_trees – number of trees in the forest
max_depth – maximum depth of the trees
ff – feature fraction to use for the split (1.0 means all)
seed – seed for the random number generator
processors – number of processors to use (all if None)
callback (None or callable) – function to call for each trained tree; it takes the new tree as a parameter

Returns:

An mltool model with a forest of trees.
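As a minimal sketch of the callback parameter: the description above only says that the callable receives each newly trained tree, so the helper below (make_progress_callback, a hypothetical name that is not part of mltool) just counts trees and reports progress. The loop at the end is a stand-in for the training run, with placeholder objects playing the role of trees.

```python
def make_progress_callback(num_trees, report=print):
    """Build a callback that reports progress after each trained tree.

    `make_progress_callback` is an illustrative helper, not part of mltool.
    """
    state = {'done': 0}

    def on_tree(tree):
        # The new tree is received but only counted here; a real callback
        # could also inspect it or store it.
        state['done'] += 1
        report('trained %d/%d trees' % (state['done'], num_trees))

    on_tree.state = state
    return on_tree


# Stand-in for the training loop: three placeholder objects play the
# role of the trees that would be passed to the callback.
messages = []
on_tree = make_progress_callback(3, report=messages.append)
for placeholder_tree in (object(), object(), object()):
    on_tree(placeholder_tree)
```

With mltool installed, the same callable would presumably be passed as callback=on_tree to train_random_forest.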

mltool.decisiontree

mltool.decisiontree.train_decision_tree(dataset, max_depth, split_func=None, seed=None)

Train a decision tree for regression.

It is possible to customize the function used to find the split for each node in the tree. The split_func is a callable that accepts two parameters: an array of labels and a 2d-array of samples. It returns None if no split is found; otherwise it returns a tuple with the index of the feature, the value for the split, and a gain score.

The 2d-array of samples has one column for each sample and one row per feature.

Parameters:

dataset – the labelled dataset used for training
max_depth – maximum depth of the tree
split_func (None or callable) – function used to find the split. If None, a default one is used.
seed – seed for the random number generator

Returns:

An mltool model with a single decision tree.
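The split_func contract described above can be illustrated with a small sketch. The function below follows the documented shape (labels, then a 2d-array with one row per feature and one column per sample; None or a (feature index, value, gain) tuple as the result). The gain definition (reduction in sum of squared errors) and the "less than or equal" threshold convention are assumptions made for illustration, not necessarily what mltool's default split function does. Plain Python sequences are used so the sketch runs without extra dependencies.

```python
def variance_reduction_split(labels, samples):
    """Find the (feature, threshold) pair that most reduces squared error.

    labels  -- sequence of floats, one per sample
    samples -- 2d sequence, one row per feature and one column per sample
    Returns None if no split is found, else (feature_index, value, gain).
    """
    n = len(labels)

    def sse(values):
        # Sum of squared errors around the mean.
        if not values:
            return 0.0
        mean = sum(values) / float(len(values))
        return sum((v - mean) ** 2 for v in values)

    total = sse(list(labels))
    best = None
    for f, row in enumerate(samples):
        for threshold in sorted(set(row)):
            # Assumed convention: values <= threshold go to the left child.
            left = [labels[j] for j in range(n) if row[j] <= threshold]
            right = [labels[j] for j in range(n) if row[j] > threshold]
            if not left or not right:
                continue  # a split must actually separate the samples
            gain = total - sse(left) - sse(right)
            if gain > 0 and (best is None or gain > best[2]):
                best = (f, threshold, gain)
    return best


# Two features (rows), four samples (columns).
samples = [[0.0, 0.0, 1.0, 1.0],   # feature 0 separates the labels
           [0.3, 0.7, 0.3, 0.7]]   # feature 1 does not
labels = [1.0, 1.0, 3.0, 3.0]
split = variance_reduction_split(labels, samples)
```

A callable like this would be passed as split_func to train_decision_tree. The exhaustive threshold scan is quadratic in the number of samples and is meant to show the interface, not to be efficient.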

Model Evaluation

mltool.evaluate

Two metrics are considered for the evaluation:

NDCG
RMSE
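For readers unfamiliar with the two metrics, here is a minimal sketch of how they are commonly computed. The NDCG variant shown (gain 2**label - 1 with a log2 position discount, computed for the samples of a single query) is one widespread formulation and is an assumption here; the exact definition used by mltool, and how it aggregates NDCG across queries, is not stated in this document. The function names rmse, dcg_at and ndcg_at are illustrative.

```python
import math


def rmse(preds, labels):
    """Root mean squared error between predictions and labels."""
    squared = [(p - y) ** 2 for p, y in zip(preds, labels)]
    return math.sqrt(sum(squared) / float(len(squared)))


def dcg_at(labels_in_rank_order, k):
    """Discounted cumulative gain of the first k labels (gain 2**label - 1)."""
    return sum((2.0 ** y - 1.0) / math.log(i + 2, 2)
               for i, y in enumerate(labels_in_rank_order[:k]))


def ndcg_at(preds, labels, k=10):
    """NDCG@k for the samples of a single query."""
    # Rank labels by descending predicted score, then compare with the
    # ideal ordering (labels sorted in descending order).
    ranked = [y for _, y in sorted(zip(preds, labels), key=lambda t: -t[0])]
    ideal = dcg_at(sorted(labels, reverse=True), k)
    return dcg_at(ranked, k) / ideal if ideal > 0 else 0.0


labels = [3.0, 1.0, 0.0]
perfect = ndcg_at([0.9, 0.5, 0.1], labels)         # scores agree with labels
reversed_order = ndcg_at([0.1, 0.5, 0.9], labels)  # scores invert the ranking
error = rmse([1.0, 3.0], [2.0, 2.0])
```

RMSE measures how close the predicted values are to the labels, while NDCG only depends on the ordering the predictions induce, which is why a ranker is usually judged on both.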

mltool.evaluate.evaluate_preds(preds, dataset, ndcg_at=10)

Evaluate predicted values against a labelled dataset.

Parameters:

preds (list-like) – predicted values, in the same order as the samples in the dataset
dataset – a Dataset object with all labels set
ndcg_at – position at which to evaluate NDCG

Returns:

The pair of RMSE and NDCG scores.

mltool.evaluate.evaluate_model(model, dataset, ndcg_at=10, return_preds=False)

Evaluate a model against a labelled dataset.

Parameters:

model – the model to evaluate
dataset – a Dataset object with all labels set
ndcg_at – position at which to evaluate NDCG

Returns:

The pair of RMSE and NDCG scores.

Utilities

Handling Dataset

class mltool.utils.Dataset

The Dataset class is a namedtuple which represents a set of samples with their labels and query ids.

labels: An array of labels. Each label is a float.

queries: A list of query ids.

samples: A 2d-array (numpy.array) of samples. It consists of one column per sample and one row per feature.

feature_names: A sequence of feature names. Features are in the same order as the rows of samples.
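The column-per-sample layout is easy to get backwards, so here is a small sketch of it. The Dataset defined below is a hypothetical stand-in built with collections.namedtuple that mirrors the four documented attributes; the real class lives in mltool.utils, its field order may differ, and its samples attribute is a numpy array rather than nested lists. The helper sample_as_dict is likewise illustrative.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the four documented attributes; the real
# class lives in mltool.utils and its field order may differ.
Dataset = namedtuple('Dataset',
                     ['labels', 'queries', 'samples', 'feature_names'])

dataset = Dataset(
    labels=[1.0, 0.0, 2.0],
    queries=['q1', 'q1', 'q2'],
    # One row per feature, one column per sample.
    samples=[[0.5, 0.1, 0.9],    # f0 for samples 0, 1, 2
             [1.0, 0.0, 1.0]],   # f1 for samples 0, 1, 2
    feature_names=['f0', 'f1'],
)


def sample_as_dict(dataset, j):
    """Return sample j (column j) as a {feature_name: value} dictionary."""
    return {name: row[j]
            for name, row in zip(dataset.feature_names, dataset.samples)}


second = sample_as_dict(dataset, 1)
```

The resulting dictionary has the same shape as the sample passed to predict in the embedding example at the top of this page.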

mltool.utils.read_input_file(fin)

Read a dataset from a file.

Parameters:

fin – a file-like object to read the dataset from

Returns:

A Dataset object.