Evaluation

The evaluation module is based on k-fold cross-validation. A stratified random selection procedure divides the rated items of each user into \(k\) folds such that each user's ratings are uniformly distributed across the folds, i.e., the number of ratings of each user in any two folds differs by at most one. In k-fold cross-validation, each of the \(k\) disjoint fractions of the ratings is used \(k-1\) times for training (i.e., \(R_{train}\)) and once for testing (i.e., \(R_{test}\)). In practice, ratings in \(R_{test}\) are set as missing in the original dataset, and predictions/recommendations are compared to \(R_{test}\) to compute the performance measures.
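The per-user stratified split can be illustrated in base R. This is only a minimal sketch of the idea, not the code rrecsys runs internally; the helper `assign_folds` and the toy `users` vector are hypothetical names introduced for illustration.

```r
# Hypothetical helper (not part of rrecsys): assign every rating of every
# user to one of k folds so that a user's fold sizes differ by at most one.
assign_folds <- function(user_ids, k, seed = 1) {
  set.seed(seed)
  fold <- integer(length(user_ids))
  for (u in unique(user_ids)) {
    idx <- which(user_ids == u)
    # Cycle through a random ordering of the k fold labels, so each label
    # is used either floor(n / k) or ceiling(n / k) times for this user.
    labels <- rep_len(sample.int(k), length(idx))
    # Shuffle which rating receives which label.
    fold[idx] <- labels[sample.int(length(idx))]
  }
  fold
}

# Toy data: user 1 has 7 ratings, user 2 has 5, user 3 has 11.
users <- rep(c(1, 2, 3), times = c(7, 5, 11))
folds <- assign_folds(users, k = 5)

# Rows are users, columns are folds; counts within a row differ by at most one.
fold_table <- table(user = users, fold = factor(folds, levels = 1:5))
print(fold_table)
```

For a given fold f, the ratings with `folds == f` would play the role of \(R_{test}\) and all remaining ratings the role of \(R_{train}\).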

We include many popular performance metrics: mean absolute error (MAE), root mean square error (RMSE), Precision, Recall, F1, True and False Positives, True and False Negatives, normalized discounted cumulative gain (NDCG), rank score, area under the ROC curve (AUC), and catalog coverage. RMSE and MAE are computed in two variants, user-based and global. The user-based variant weights each user uniformly by computing the metric for each user separately and averaging over all users, while in the global variant users with larger test sets carry more weight.
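The difference between the two variants is easiest to see on a small example. The following base R sketch uses made-up test ratings and predictions (all variable names are illustrative and not part of rrecsys): user 1 has four perfectly predicted test ratings, user 2 has two ratings that are each off by 2.

```r
# Toy test-set data: the user of each rating, the true rating, the prediction.
user   <- c(1, 1, 1, 1, 2, 2)
actual <- c(4, 3, 5, 2, 1, 5)
pred   <- c(4, 3, 5, 2, 3, 3)
err    <- pred - actual

# Global variant: pool all test ratings, so users with more test ratings
# contribute more to the result.
global_mae  <- mean(abs(err))       # 4/6, about 0.667
global_rmse <- sqrt(mean(err^2))    # sqrt(8/6), about 1.155

# User-based variant: compute the metric per user, then take the
# unweighted mean over users.
user_mae  <- mean(tapply(abs(err), user, mean))                      # (0 + 2) / 2 = 1
user_rmse <- mean(tapply(err^2, user, function(e2) sqrt(mean(e2))))  # (0 + 2) / 2 = 1
```

Because user 1 supplies four of the six test ratings, the global values are pulled toward that user's zero error, whereas the user-based values treat both users equally.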

To evaluate with rrecsys, we must start by creating the evaluation model:

e <- evalModel(smallML, folds = 5)

rrecsys addresses the two most common scenarios in Recommender Systems:

Rating Prediction (e.g. on a scale of 1 to 5 stars), and
Item Recommendations (e.g. a list of top-N recommended items).

To evaluate the task of Rating Prediction, the command is the following:

evalPred(e, "ibknn", neigh = 5)

To evaluate the task of Item Recommendation, the command is the following:

evalRec(e, "funk", k = 5, positiveThreshold = 3, topN = 3)

The remaining arguments of evalPred and evalRec correspond to the arguments of the algorithm that we want to evaluate. In addition to topN, evalRec takes two further arguments: positiveThreshold, which defines the threshold that distinguishes a negative rating from a positive one, and alpha, the ranking's half-life, which must be set for the rank score metric.
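To give an intuition for how positiveThreshold and alpha interact, here is a minimal base R sketch of a half-life rank score in its commonly used form, where a hit at rank r contributes 2^(-(r - 1) / alpha). It rests on several assumptions that the text above does not confirm for rrecsys: the helper `rank_score` is hypothetical, a rating equal to the threshold is treated as positive, and the score is normalized by the best ordering achievable within the list length.

```r
# Hypothetical helper (not part of rrecsys): half-life rank score for one user.
rank_score <- function(recommended, test_ratings, positiveThreshold, alpha) {
  # Assumption: a test rating at or above the threshold counts as positive.
  relevant <- names(test_ratings)[test_ratings >= positiveThreshold]
  # 1-based positions in the recommendation list that hold a relevant item.
  hits <- which(recommended %in% relevant)
  # A hit at rank r contributes 2^(-(r - 1) / alpha), so a hit at rank
  # alpha + 1 counts half as much as a hit at rank 1.
  achieved <- sum(2^(-(hits - 1) / alpha))
  # Assumption: normalize by the best case, with all relevant items on top.
  n_best <- min(length(relevant), length(recommended))
  best <- sum(2^(-(seq_len(n_best) - 1) / alpha))
  if (best == 0) NA_real_ else achieved / best
}

# Toy data: items A and C are positive for this user at threshold 3.
test_ratings <- c(A = 5, B = 2, C = 4, D = 1)
top3 <- c("C", "B", "A")

score_mixed   <- rank_score(top3, test_ratings, positiveThreshold = 3, alpha = 2)
score_perfect <- rank_score(c("A", "C", "B"), test_ratings,
                            positiveThreshold = 3, alpha = 2)
```

A smaller alpha penalizes hits far down the list more strongly, while a larger alpha makes the score less sensitive to the exact position of each hit.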