API Documentation
Embed the model
One of the most common uses of the mltool API is to embed a ranker model built with the command-line tool into your own application.
The models are picklable files, and you can load them with the standard pickle module:
import pickle

with open('model.pkl', 'rb') as fmodel:
    model = pickle.load(fmodel)
Then you can use the mltool API to predict the score of a sample:
from mltool.predict import predict

pred = predict(model, {'f0': 1.0,
                       'f1': 0.0,
                       'f2': 0.0,
                       'f3': 0.2,
                       'f4': 1.0})
See the predict and predict_all functions for more details.
Prediction
mltool.predict.predict(model, sample)
Predict the score of a sample.
Parameters:
- model – the prediction model
- sample – a dictionary/namedtuple with all the features required by the model
Returns: the predicted score
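Since a sample may be either a dictionary or a namedtuple, the following minimal sketch builds the same sample in both forms. The feature names are taken from the embedding example earlier in this document, and `Sample` is a hypothetical name introduced here for illustration; it is not part of mltool.

```python
from collections import namedtuple

# Feature names reused from the embedding example; a real model defines
# its own set of required features.
FEATURES = ['f0', 'f1', 'f2', 'f3', 'f4']

# The sample expressed as a dictionary...
sample_dict = {'f0': 1.0, 'f1': 0.0, 'f2': 0.0, 'f3': 0.2, 'f4': 1.0}

# ...and the same sample as a namedtuple (Sample is a hypothetical name).
Sample = namedtuple('Sample', FEATURES)
sample_tuple = Sample(**sample_dict)
```

Either object carries one value per feature name, which is what the sample parameter is documented to require.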
mltool.predict.predict_all(model, dataset)
Predict targets for each sample in the dataset.
Parameters:
- model – the prediction model
- dataset – a dataset
Returns: the predicted scores
Model training
mltool implements some algorithms to train regression models. Currently it mainly supports Random Forest and regression trees.
mltool.forest
mltool.forest.train_random_forest(dataset, num_trees, max_depth, ff, seed=1, processors=None, callback=None)
Train a random forest model.
Parameters:
- dataset – the labelled dataset used for training
- num_trees – number of trees in the forest
- max_depth – maximum depth of the trees
- ff – fraction of the features to use for each split (1.0 means all)
- seed – seed for the random number generator
- processors – number of processors to use (all if None)
- callback (None or callable) – function called for each trained tree; it takes the new tree as its only parameter
Returns: an mltool model containing a forest of trees.
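The callback parameter can be used to follow training progress. The sketch below is one possible callback, assuming only what is documented: it is called once per trained tree and receives that tree. `make_progress_callback` is a hypothetical helper, and the strings in the loop are stand-ins for the tree objects the trainer would pass.

```python
def make_progress_callback(num_trees):
    """Return (callback, trees); the callback stores each tree and reports progress."""
    trees = []

    def on_tree_trained(tree):
        trees.append(tree)
        print('trained tree %d of %d' % (len(trees), num_trees))

    return on_tree_trained, trees


callback, trained = make_progress_callback(num_trees=3)

# Stand-in objects simulate what the trainer would pass. With mltool the
# callback would instead be given as train_random_forest(..., callback=callback).
for fake_tree in ('tree-a', 'tree-b', 'tree-c'):
    callback(fake_tree)
```

Keeping the collected trees in a closure avoids global state, so several training runs can each have their own callback.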
mltool.decisiontree
mltool.decisiontree.train_decision_tree(dataset, max_depth, split_func=None, seed=None)
Train a decision tree for regression.
It is possible to customize the function used to find the split for each node in the tree. The split_func is a callable that accepts two parameters: an array of labels and a 2d-array of samples. It returns None if no split is found, otherwise a tuple with the index of the feature, the value for the split, and a gain score. The 2d-array of samples has one column per sample and one row per feature.
Parameters:
- dataset – the labelled dataset used for training
- max_depth – maximum depth of the tree
- split_func (None or callable) – function used to find the split. If None, a default one is used.
- seed – seed for the random number generator
Returns: an mltool model with a single decision tree.
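To illustrate the split_func contract described above, here is a minimal sketch of a custom split function based on squared-error reduction. It is not mltool's default split function. It assumes that samples with a feature value less than or equal to the split value go to one side and the rest to the other, which the documentation does not specify. It indexes the samples as `samples[feature][sample]`, so it works with a 2d numpy array or with nested lists.

```python
def variance_split(labels, samples):
    """Find the (feature, value) pair with the largest drop in squared error.

    labels  -- sequence of floats, one per sample
    samples -- 2d sequence indexed as samples[feature][sample]
    Returns None if no split improves the error, otherwise a tuple
    (feature_index, split_value, gain).
    """
    n = len(labels)
    if n < 2:
        return None

    def sse(values):
        # Sum of squared errors around the mean.
        if not values:
            return 0.0
        mean = sum(values) / float(len(values))
        return sum((v - mean) ** 2 for v in values)

    total = sse(list(labels))
    best = None
    for feature_index in range(len(samples)):
        row = samples[feature_index]
        for value in sorted(set(row)):
            # Assumption: "<= value" goes left, "> value" goes right.
            left = [labels[i] for i in range(n) if row[i] <= value]
            right = [labels[i] for i in range(n) if row[i] > value]
            if not left or not right:
                continue
            gain = total - sse(left) - sse(right)
            if gain > 0 and (best is None or gain > best[2]):
                best = (feature_index, value, gain)
    return best


labels = [0.0, 0.0, 1.0, 1.0]
samples = [[1.0, 2.0, 3.0, 4.0],   # feature 0, one column per sample
           [5.0, 5.0, 5.0, 5.0]]   # feature 1 is constant, so it cannot split
print(variance_split(labels, samples))  # → (0, 2.0, 1.0)
```

A function like this would be passed as `split_func=variance_split`. The exhaustive search is quadratic in the number of samples and is meant only to show the expected inputs and return value.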
Model Evaluation
mltool.evaluate.evaluate_preds(preds, dataset, ndcg_at=10)
Evaluate predicted values against a labelled dataset.
Parameters:
- preds (list-like) – predicted values, in the same order as the samples in the dataset
- dataset – a Dataset object with all labels set
- ndcg_at – position at which to evaluate NDCG
Returns: the pair of RMSE and NDCG scores.
mltool.evaluate.evaluate_model(model, dataset, ndcg_at=10, return_preds=False)
Evaluate a model against a labelled dataset.
Parameters:
- model – the model to evaluate
- dataset – a Dataset object with all labels set
- ndcg_at – position at which to evaluate NDCG
Returns: the pair of RMSE and NDCG scores.
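For readers unfamiliar with the two metrics, the sketch below computes RMSE and NDCG at a cutoff for a single ranked list. It uses one common NDCG formulation (gain `2**rel - 1`, logarithmic discount); mltool's exact gain, discount, and per-query aggregation are not documented here and may differ. The function names are hypothetical.

```python
import math


def rmse(preds, labels):
    """Root mean squared error between predictions and labels."""
    n = len(labels)
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels)) / float(n))


def dcg_at(relevances, k):
    """Discounted cumulative gain of the first k relevances, in the given order."""
    return sum((2 ** rel - 1) / math.log(i + 2, 2)
               for i, rel in enumerate(relevances[:k]))


def ndcg_at(preds, labels, k=10):
    """NDCG@k for one query: rank labels by predicted score, normalise by the ideal order."""
    ranked = [y for _, y in sorted(zip(preds, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    ideal = dcg_at(sorted(labels, reverse=True), k)
    return dcg_at(ranked, k) / ideal if ideal > 0 else 0.0
```

A prediction list that orders the samples exactly by their labels yields an NDCG of 1.0 under this formulation, regardless of how far the predicted values are from the labels; RMSE captures that second aspect.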
Utilities
Handling Dataset
class mltool.utils.Dataset
The Dataset class is a namedtuple which represents a set of samples with their labels and query ids.
labels
    An array of labels. Each label is a float.
queries
    A list of query ids.
samples
    A 2d-array (numpy.array) of samples. It consists of one sample per column and one row per feature.
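The column-per-sample layout is easy to get backwards, so here is a small sketch of it. `DatasetSketch` is a hypothetical stand-in that mirrors the three documented fields; the real class is mltool.utils.Dataset, whose field order and array types may differ. Nested lists are used in place of numpy arrays to keep the sketch dependency-free.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the documented fields of mltool.utils.Dataset.
DatasetSketch = namedtuple('DatasetSketch', ['labels', 'queries', 'samples'])

# Three samples, two features. Indexing is samples[feature][sample]:
# one row per feature, one column per sample.
data = DatasetSketch(
    labels=[1.0, 0.0, 2.0],
    queries=['q1', 'q1', 'q2'],
    samples=[[0.5, 0.1, 0.9],    # feature 0 for samples 0..2
             [1.0, 0.0, 1.0]],   # feature 1 for samples 0..2
)


def sample_at(dataset, index):
    """Return the feature values of one sample (a column of dataset.samples)."""
    return [row[index] for row in dataset.samples]
```

With this layout, the number of labels and query ids equals the number of columns in samples, not the number of rows.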