Support vector ranking with Data Partition in PA 3.2

We are trying to build a "support vector ranking" model for a ranking case in PA 3.2 connected to HANA DB.

Before doing so, we have a general question about the HANA Partition function available in PA 3.2, which we would like to use so that we can also get the model statistics and model comparison functions.

Currently this function offers only a random or a sequential option. For "support vector ranking", however, a random split should be made on the basis of the query ID column rather than on individual rows, so that the grouping ID column is taken into account. During configuration of the HANA partition step, the user is not asked for this information in the SVR case.

So we wanted to ask: does the HANA partition step currently handle the SVR algorithm's specific requirement that the data be split by group for training, validation, and so on?

2 Answers

Just so everyone else watching this thread is updated: the problem you described is a valid issue. The existing partition node in Expert Analytics slices the data either randomly or stratified on a feature, so it won't work in your scenario. Unfortunately, at this time I am unable to suggest a workaround, but I'll add this to the team's backlog so that we support this kind of partitioning in the future.

Stratified sampling won't fit in this case. The SVR algorithm takes as input a set of records identified by a column called the group ID. There are multiple groups, and records are ranked within each group.

So the strategy here should probably be to split by group ID: for example, 70% of the groups go to training, 10% to validation, and 20% to test. That way each group's internal ranking labels are not lost when the validation metrics are calculated.
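To make the idea concrete, here is a minimal sketch in plain Python, outside of PA, of what such a group-wise split could look like on data pulled from HANA. The function name `split_by_group` and the column name `query_id` are made up for illustration; this is not something the HANA partition step provides.

```python
import random


def split_by_group(rows, group_key, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign whole groups (not single rows) to train/validation/test.

    Hypothetical helper: `rows` is a list of dicts, `group_key` names the
    group ID column. Only the first two fractions are used; the test set
    receives whatever groups remain.
    """
    # Shuffle the distinct group IDs, not the rows themselves.
    group_ids = sorted({row[group_key] for row in rows})
    random.Random(seed).shuffle(group_ids)

    # Cut the shuffled ID list at the train and validation boundaries.
    n_train = round(fractions[0] * len(group_ids))
    n_valid = round(fractions[1] * len(group_ids))
    train_ids = set(group_ids[:n_train])
    valid_ids = set(group_ids[n_train:n_train + n_valid])

    # Every row follows its group, so no group is ever split across sets.
    parts = {"train": [], "validation": [], "test": []}
    for row in rows:
        if row[group_key] in train_ids:
            parts["train"].append(row)
        elif row[group_key] in valid_ids:
            parts["validation"].append(row)
        else:
            parts["test"].append(row)
    return parts
```

Calling `split_by_group(rows, "query_id")` returns a dict with `train`, `validation` and `test` lists. Note that the percentages apply to the number of groups, so the row counts can deviate from 70/10/20 when group sizes differ.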

Stratified sampling, on the other hand, helps produce a balanced data set when the population is distributed across multiple classes.

Edit: Just to add further, the validation metrics suited to ranking algorithms are not MAPE, MSE, RMSE, etc. Ranking algorithms usually use search engine ranking metrics such as NDCG (https://en.wikipedia.org/wiki/Discounted_cumulative_gain). Could you please check whether this is covered in the HANA model statistics and model compare functions?
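For reference, a rough sketch of how NDCG could be computed per group and then averaged, again in plain Python and not tied to any PA or HANA function. It uses the linear-gain form of DCG from the Wikipedia page linked above; the helper names and the `score_key` / `label_key` parameters are hypothetical.

```python
import math
from collections import defaultdict


def dcg(relevances, k=None):
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), i from 1."""
    rels = relevances if k is None else relevances[:k]
    # enumerate() is 0-based, so position i corresponds to log2(pos + 2).
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(rels))


def ndcg(relevances_in_predicted_order, k=None):
    """DCG of the predicted order divided by DCG of the ideal order."""
    ideal = sorted(relevances_in_predicted_order, reverse=True)
    ideal_dcg = dcg(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # assumption: groups with no relevant record score 0
    return dcg(relevances_in_predicted_order, k) / ideal_dcg


def mean_ndcg(rows, group_key, score_key, label_key, k=None):
    """Average NDCG over groups; each group is ranked by model score."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    scores = []
    for members in groups.values():
        ranked = sorted(members, key=lambda r: r[score_key], reverse=True)
        scores.append(ndcg([r[label_key] for r in ranked], k))
    return sum(scores) / len(scores)
```

The metric only makes sense if each group arrives intact in the validation set, which is the same reason the split has to be done by group ID.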