The Second Week

The second week of the summer internship at HPCC Systems went by without any issues. The goal for the week was to summarize the evaluation metrics already available in the HPCC Systems Machine Learning bundles, and then to come up with a list of evaluation metrics that could potentially be implemented.

The summary was compiled using the documentation as the source. The documentation was clear and complete, and I had no problem understanding the metrics available, their capabilities, and their requirements.

The list of metrics to implement includes many important tests, such as the chi-squared test for feature selection and the area under the ROC curve for logistic regression. It also focuses on the evaluation of clustering algorithms.
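To make the area under the ROC curve concrete, here is a minimal sketch in plain Python. It is only an illustration and not the design planned for the ML bundle: the function name `roc_auc` and the sample data are my own assumptions. It relies on the fact that the AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve for binary labels (0 or 1).

    Hypothetical helper for illustration only. Compares every
    positive/negative pair, so it is O(n^2) and meant for small inputs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    # A pair counts 1 if the positive is ranked above the negative,
    # 0.5 if the scores are tied, and 0 otherwise.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, e.g. from a classifier
# such as logistic regression.
labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, probabilities))  # 3 of the 4 pairs are ordered correctly
```

A value of 1.0 means every positive example is ranked above every negative one, while 0.5 is what random scoring would give on average.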

Clustering algorithms (such as k-means, which has been implemented as part of the ML bundle) are unsupervised learning algorithms that group a given set of unlabeled data points into clusters and, for centroid-based methods like k-means, compute the centroid of each cluster. The trained model can then determine the cluster that any new point belongs to. The inherent challenge in evaluating such models is that there is often no absolute ‘ground truth’ to compare their results against. Many well-known evaluation metrics, such as the Adjusted Rand Index, nevertheless make use of ground truth labels, since comparing against known labels is the most reliable way of evaluating the model when such labels exist. Some methods, like Silhouette Analysis, do not rely on ground truth labels. Silhouette Analysis measures how well separated the clusters are and how close the points within each cluster are to one another, and uses both to judge the quality of the clustering.
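The idea behind Silhouette Analysis can be sketched in a few lines of plain Python. This is a rough illustration under my own assumptions (Euclidean distance, a hypothetical function named `silhouette_score`, made-up points), not the implementation proposed for the bundle. For each point, a is the mean distance to the other points in its own cluster, b is the lowest mean distance to the points of any other cluster, and the point's silhouette is (b - a) / max(a, b).

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance).

    Hypothetical helper for illustration only; needs Python 3.8+ for math.dist.
    """
    # Group the points by their assigned cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            # By convention, a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so it does not affect the sum).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: lowest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        largest = max(a, b)
        scores.append((b - a) / largest if largest > 0 else 0.0)
    return sum(scores) / len(scores)


# Two tight, well-separated groups of made-up points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(points, [0, 0, 1, 1]))  # sensible grouping
print(silhouette_score(points, [0, 1, 0, 1]))  # poor grouping
```

Scores close to 1 indicate compact, well-separated clusters, while negative scores suggest that points have been assigned to the wrong cluster. No ground truth labels are needed at any step.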

The goal for the next week is to write proposals detailing the design of the metrics to be implemented, keeping the structure of the ML Library in mind, while continuing to add more metrics to the list.

The second week was interesting, and it gave me a much better understanding of the framework and of the various ways in which machine learning models can be tweaked and modified, using the right evaluation metrics as a guide.
