A new approach for rating prediction system using collaborative filtering

Abstract

Recommendation systems are widely used to recommend items to web users. They assist users in selecting a product from among millions of products. E-commerce websites such as Amazon recommend items to their customers. A recommendation system depends mainly on the previous history of its users. In this paper, a new User Rating Prediction (URP) algorithm is proposed to predict ratings for items. The proposed URP algorithm relies on the similarity of users and assumes that users with similar tastes may be interested in similar items. The proposed system first builds a list of related users for every user and then uses this information to predict ratings for different items. The results of the proposed algorithm were compared with those of existing methods. The proposed algorithm yields smaller values of Mean Absolute Error (MAE) and root-mean-square error (RMSE) than the other methods.

1 Introduction

Nowadays, e-commerce and social networking websites play an important role in our daily life [10, 11, 18]. Many people depend heavily on these websites for daily activities [13]. Because of the fast growth of the World Wide Web, finding appropriate content on the web has become a difficult problem. A Recommender System (RS) is an approach that assists the user in handling this vast quantity of information. The RS is a tool [9, 15] that offers useful products to the target user from among many possible options. The goal of the RS is to supply accurate recommendations of products that a user prefers from a given list of items [15]. E-commerce websites can use such a system to provide recommendations to their users based on their choices and to personalize the website. E-commerce websites such as Netflix.com and Amazon.com personalize their sites for each customer by showing products related to the customer's interests and tastes; similarly, websites for books, music, movies, news, and restaurants can recommend items to their users based on their past history [7]. This saves the users' valuable time by recommending items that match their preferences and choices. It also increases the profitability of the business by increasing sales at online stores.

There are three main categories of recommendation techniques: collaborative filtering (CF), content-based filtering (CBF), and hybrid filtering [7]. A taxonomy of recommender systems [8] is shown in Fig. 1. In personalized recommendation systems, one of the most commonly used methods is CF. CF uses the opinions of other, similar users to recommend items or products to the active user. CBF uses a user's past history and recommends items similar to those the user has used in the past. In CBF, the system learns the importance of item characteristics and builds a model of what the user likes. A hybrid recommendation system combines collaborative filtering and content-based methods in different ways.

In this paper, we propose an algorithm that predicts the rating of an item. The proposed method first finds the similarity between users from their previous history of choices and then predicts ratings for items using the list of similar users. The basic idea behind the similar-user list is that if two users rate similar types of items and give similar ratings to these items, then these users are similar to each other. To calculate the predicted rating for a given user, the ratings given by similar users and the corresponding similarity values are taken into consideration. The proposed algorithm does not hybridize collaborative filtering and content-based filtering and is therefore easy to implement. The algorithm aims to improve the user experience and, in turn, business sales.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 presents the proposed work. Section 4 reports the results and their analysis. Section 5 concludes the paper and outlines the future scope.

2 Related work

This section examines recent work on recommendation systems, including content-based filtering, collaborative filtering, and hybrid recommendation techniques, together with their advantages and disadvantages.

C. M. Rodrigues et al. [16] introduced a cluster-based hybrid collaborative approach that combines an item-based CF algorithm with a user demographic-based algorithm in a cluster-weighted mechanism to compute results and make recommendations to the user. This approach saves resources and time and performs dynamic predictions. A weighted system is implemented to make consolidated decisions. The approach is also useful for the user cold-start, item cold-start, and sparsity problems. However, it cannot generate clusters from cross-domain data; for example, if a user likes certain music, the system should suggest movies based on the kinds of music that user likes, but it cannot. In addition, the system is unable to filter out questionable user profiles and ratings.

Li and Murata [12] proposed a hybrid recommendation approach that provides a flexible solution by incorporating multidimensional clustering into a collaborative filtering recommendation system to improve recommendation quality. This yields user clusters with different preferences across multiple views, which increases the effectiveness and diversity of recommendations. The algorithm is divided into three phases. In Phase I, training data are collected in the form of user and item profiles, and clustering is performed using the proposed algorithm. In Phase II, clusters with similar characteristics are discarded. In Phase III, the rating of an item is predicted as a weighted average of deviations from the neighbor mean. The method requires background data in the form of user and item profiles for clustering.

P. Devika et al. [3] introduced a new pattern mining algorithm for recommendation systems, called the Frequent Pattern Intersect (FPIntersect) algorithm, that overcomes a shortcoming of the Apriori algorithm. E-commerce and social networking websites produce a huge amount of data, and traditional data mining approaches such as Apriori suffer from high latency when scanning a large database to generate association rules. The FPIntersect algorithm overcomes this shortcoming by decreasing the number of scans needed to produce association rules. However, the system extracts user ratings and comments from user reviews to obtain its information, and extracting information from user comments is a costly process.

Wang and Han [19] proposed a collaborative filtering algorithm that calculates the rating of an item that has not been rated by the user, based on an analysis of the item's characteristics. This algorithm improves the accuracy of recommendation and prediction when user rating data are sparse. Because the system relies on content-based filtering, it will not predict accurately if the items do not contain enough information to differentiate them from each other.

Gupta and Gadge [6] proposed a framework that combines item-based collaborative filtering with demographic-based user clusters in an adaptive-weighted scheme to perform prediction. The framework addresses the cold-start, scalability, and sparsity problems faced by conventional collaborative filtering algorithms. When few ratings are available for a user, prediction quality is improved by giving more weight to the demographic-based user cluster. Adding new users to an appropriate cluster remains a challenging task.

Moghaddam and Ali Selamat [14] proposed a novel hybrid recommender system that gains the advantages of both model-based and memory-based collaborative filtering by combining a user-based collaborative filtering method with DBSCAN clustering on users' demographic information. The method improves accuracy along with scalability. Further examination is required to determine the effect of intra-cluster rating smoothing on the sparsity problem in collaborative filtering algorithms. The system uses density-based user clustering, and the time complexity of DBSCAN, \(O(n^{2})\), is greater than that of K-means, \(O(n)\).

Deshpande and Karypis [2] noted that the computational complexity of user-based collaborative filtering methods grows linearly with the number of customers. To resolve this scalability problem, a model-based recommendation method is introduced. The model-based recommendation algorithm analyzes the user–item matrix to find relations between the different items and makes recommendations based on these relations. The algorithm is divided into two phases. In Phase I, the similarity between items is computed. Phase II uses the similarities computed in Phase I to compute the similarity between a basket of items and a candidate recommended item. The proposed algorithms are faster, independent of the size of the user–item matrix, and allow real-time recommendations. The system gives good results for smaller values of \(k\) \((10 \le k \le 30)\); for higher values of \(k\), it gives very little or no improvement.

3 Proposed work

In this work, a new approach for predicting the ratings of items is proposed. The proposed work predicts the rating of a given item for a given user. It takes the ratings table of the MovieLens data set [5] as input. The data set contains m users (U) and n items (I). The system reads the first three attributes of the ratings table, i.e., userid, movieid, and rating. Ratings are given on a scale of 1–5. The flowchart for the calculation of the predicted rating is shown in Fig. 2. The proposed system works in the following steps.
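The input step can be sketched as follows. This is a minimal Python illustration, not the Java implementation used in the paper; it assumes the ratings table is stored as tab-separated text whose first three fields are userid, movieid, and rating, and the function name `load_ratings` and the sample rows are hypothetical.

```python
from collections import defaultdict


def load_ratings(lines):
    """Parse 'userid<TAB>movieid<TAB>rating[<TAB>...]' lines (assumed layout).

    Returns {userid: {movieid: rating}}. Fields after the third are ignored.
    """
    ratings = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        user_id = int(fields[0])
        movie_id = int(fields[1])
        rating = float(fields[2])
        ratings[user_id][movie_id] = rating
    return dict(ratings)


# Hypothetical sample rows, not taken from the data set.
sample = [
    "1\t10\t4\t881250949",
    "1\t20\t5\t881250950",
    "2\t10\t3\t881250951",
]
ratings = load_ratings(sample)
```

Because the function accepts any iterable of lines, an open file handle for the ratings file can be passed in place of the sample list.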

3.1 Finding similarity between users

The similarity is calculated on the basis of the ratings that users give to items. If two users give almost the same ratings to an item, then these two users are related to each other. A formula is derived to calculate how closely two users are related to each other. The formula is as follows:

where a, b: users; Sim(a, b): similarity between user a and user b; \(r_{h}\): highest rating; \(r_{ai}\): rating of user a for item i; \(r_{bi}\): rating of user b for item i; i ranges over the set of items rated by both user a and user b.
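As an illustration only, one scoring rule that is consistent with the symbols listed above is to award each co-rated item \(r_{h}-|r_{ai}-r_{bi}|\) points and sum over the co-rated items. This particular form is an assumption made for the sketch below (it is not confirmed to be the authors' equation), and the function name `similarity` and the example ratings are hypothetical.

```python
def similarity(ratings_a, ratings_b, r_h=5):
    """Assumed similarity: sum of (r_h - |r_ai - r_bi|) over co-rated items.

    ratings_a, ratings_b: {item: rating} for users a and b.
    r_h: highest possible rating (5 on a 1-5 scale).
    """
    common_items = ratings_a.keys() & ratings_b.keys()
    return sum(r_h - abs(ratings_a[i] - ratings_b[i]) for i in common_items)


# Hypothetical ratings: items 1 and 2 are rated by both users.
user_a = {1: 5, 2: 3, 3: 4}
user_b = {1: 4, 2: 3, 4: 2}
score = similarity(user_a, user_b)  # (5 - 1) + (5 - 0) = 9
```

Under this assumed rule, two users who rate the same items identically reach the maximum score of \(r_{h}\) per co-rated item, and users with no co-rated items score 0.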

3.2 Find list of related users for every user

A list of related users is created for every user after the similarity between users has been calculated. This list is created on the basis of the values in the similarity matrix. All users whose similarity value in the row of the similarity matrix corresponding to user a is greater than a threshold value \(\theta \) are added to the list of related users for a. The threshold \(\theta \) is calculated as follows:

$$\begin{aligned} \theta =n\times r_{h}\times p, \end{aligned}$$

(2)

where n: maximum number of items; \(p = 0.2\) (20% of the highest similarity).
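The thresholding step can be sketched as below. The sketch assumes the similarity matrix is held as a nested dictionary; the names `threshold` and `related_users`, the choice of n = 10 items, and the similarity values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def threshold(n_items, r_h=5, p=0.2):
    """Eq. (2): theta = n * r_h * p."""
    return n_items * r_h * p


def related_users(sim, n_items, r_h=5, p=0.2):
    """For each user a, list the users b != a with sim[a][b] > theta."""
    theta = threshold(n_items, r_h, p)
    return {
        a: [b for b, s in row.items() if b != a and s > theta]
        for a, row in sim.items()
    }


# Hypothetical similarity matrix for three users and n = 10 items,
# so theta = 10 * 5 * 0.2 = 10.
sim = {
    "u1": {"u1": 0, "u2": 12, "u3": 4},
    "u2": {"u1": 12, "u2": 0, "u3": 25},
    "u3": {"u1": 4, "u2": 25, "u3": 0},
}
related = related_users(sim, n_items=10)
```

With these values, u1 and u3 are not related to each other because their similarity of 4 does not exceed the threshold of 10, while u2 is related to both.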

3.3 Predicting ratings for users

In this step, the system predicts the rating of a given item for a given user. To predict this rating, the lists of related users found in Sect. 3.2 are used. The formula derived for the prediction of the rating is as follows:
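As a rough illustration of how the ratings of related users and their similarity values can be combined, the sketch below uses a similarity-weighted average. This weighting is an assumption made for the example and is not claimed to be the authors' prediction formula; the function name `predict_rating` and all data values are hypothetical.

```python
def predict_rating(user, item, ratings, sim, related):
    """Similarity-weighted average of related users' ratings (assumed form).

    ratings: {user: {item: rating}}
    sim:     {user: {other_user: similarity}}
    related: {user: [related users]}
    Returns None when no related user has rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other in related.get(user, []):
        rating = ratings.get(other, {}).get(item)
        if rating is None:
            continue  # this related user has not rated the item
        weight = sim[user][other]
        weighted_sum += weight * rating
        weight_total += weight
    if weight_total <= 0:
        return None
    return weighted_sum / weight_total


# Hypothetical data: u2 and u3 are related to u1 and both rated movie "m1".
ratings = {"u2": {"m1": 4}, "u3": {"m1": 2}}
sim = {"u1": {"u2": 30, "u3": 10}}
related = {"u1": ["u2", "u3"]}
predicted = predict_rating("u1", "m1", ratings, sim, related)
# (30 * 4 + 10 * 2) / (30 + 10) = 3.5
```

Returning None when no related user has rated the item keeps the missing-prediction case explicit instead of silently substituting a default rating.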

4 Results and analysis

The proposed recommendation system suggests the most relevant items to its users from a large list of items. As the relevancy between user expectations and the items recommended by the proposed system increases, the system recommends better items to its users. The experimental results show lower values of MAE and RMSE than the existing methods. In this section, we describe the data set (Sect. 4.1), the performance evaluation metrics (Sect. 4.2), and the results of our implemented approach compared with the existing ones (Sect. 4.3).

4.1 Data set

We use the MovieLens data set [5] to assess the performance of our proposed algorithm. This data set was collected by GroupLens Research at the University of Minnesota. The data set contains 100k ratings of 1682 movies by 943 users. Every user has rated at least 20 movies. The ratings are on a scale of 1 (poor) to 5 (awesome) stars. A rating indicates the user's interest in the item.

4.2 Performance measurement

The accuracy of the proposed algorithm can be measured with statistical accuracy metrics or decision-support metrics. In this paper, we use statistical accuracy metrics to evaluate the accuracy of the rating prediction algorithm. The most frequently used statistical metrics are Mean Absolute Error (MAE) and Root-Mean-Square Error (RMSE) [4]. Let \(r_{1}, r_{2}, r_{3}, \ldots , r_{n}\) be the actual ratings and \(p_{1}, p_{2}, p_{3}, \ldots , p_{n}\) the corresponding predicted ratings.

Table 1 Similarity among different users

| User | \(u_{1}\) | \(u_{2}\) | \(u_{3}\) | \(u_{4}\) | \(u_{5}\) | \(u_{6}\) | \(u_{7}\) | \(u_{8}\) | \(u_{9}\) | \(u_{10}\) | \(u_{11}\) | \(u_{12}\) | \(u_{13}\) | \(u_{14}\) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| \(u_{1}\) | 0 | 0 | 0 | 17.5 | 20.5 | 20.5 | 37.5 | 37.5 | 42.5 | 42.5 | 42.5 | 42.5 | 42.5 | 42.5 |
| \(u_{2}\) | 0 | 0 | 34 | 93 | 130 | 130 | 204 | 241 | 266 | 273 | 281 | 281 | 319 | 319 |
| \(u_{3}\) | 0 | 34 | 0 | 61 | 110 | 120.5 | 163 | 238 | 262 | 284 | 302 | 302 | 353.5 | 362 |
| \(u_{4}\) | 17.5 | 76.5 | 103.5 | 0 | 180.5 | 208.5 | 369.5 | 477.5 | 487.5 | 532.5 | 541 | 541.5 | 603.5 | 617.5 |
| \(u_{5}\) | 3 | 40 | 89 | 166 | 0 | 184.5 | 225.5 | 299.5 | 323.5 | 332.5 | 341.5 | 341.5 | 404.5 | 419.5 |
| \(u_{6}\) | 0 | 0 | 10.5 | 38.5 | 57 | 0 | 57 | 91.5 | 95.5 | 102.5 | 102.5 | 102.5 | 130.5 | 138 |
| \(u_{7}\) | 17 | 91 | 133.5 | 294.5 | 335.5 | 335.5 | 0 | 388 | 400 | 447 | 448 | 448 | 485.5 | 491.5 |
| \(u_{8}\) | 0 | 37 | 122 | 220 | 294 | 328.5 | 381 | 0 | 423 | 471 | 488.5 | 488.5 | 583.5 | 592.5 |
| \(u_{9}\) | 5 | 30 | 54 | 64 | 88 | 92 | 104 | 146 | 0 | 168 | 172 | 172 | 198 | 208 |
| \(u_{10}\) | 0 | 7 | 29 | 74 | 83 | 90 | 137 | 185 | 207 | 0 | 212 | 212 | 224.5 | 229.5 |
| \(u_{11}\) | 0 | 8 | 26 | 35 | 44 | 44 | 45 | 62.5 | 66.5 | 71.5 | 0 | 0 | 88.5 | 88.5 |
| \(u_{12}\) | 0 | 4 | 17.5 | 34.5 | 47.5 | 54 | 68.5 | 80.5 | 82.5 | 87.5 | 87.5 | 87.5 | 90.5 | 95.5 |
| \(u_{13}\) | 0 | 34 | 72 | 117 | 167 | 188 | 211.5 | 294.5 | 318.5 | 326 | 343 | 343 | 0 | 353 |
| \(u_{14}\) | 0 | 0 | 8.5 | 22.5 | 37.5 | 45 | 51 | 60 | 70 | 75 | 75 | 75 | 87 | 0 |

4.2.1 Mean absolute error (MAE)

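MAE is defined as the mean of the absolute differences between the actual ratings and the predicted ratings [4]. A lower MAE value indicates better prediction [1]. It is defined as follows:

$$\begin{aligned} \mathrm{MAE}=\frac{1}{n}\sum _{i=1}^{n}\left| r_{i}-p_{i}\right| . \end{aligned}$$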

4.2.2 Root-mean-square error (RMSE)

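The root-mean-square error (RMSE) is another measure for model evaluation. RMSE is the square root of the mean squared error [1]. It is defined as follows:

$$\begin{aligned} \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum _{i=1}^{n}\left( r_{i}-p_{i}\right) ^{2}}. \end{aligned}$$

The MAE value is always less than or equal to the RMSE value.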
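Both metrics are straightforward to compute from paired lists of actual and predicted ratings. The following Python sketch uses the standard definitions of MAE and RMSE; the function names and the four example ratings are hypothetical and are not results from the paper.

```python
import math


def mae(actual, predicted):
    """Mean of |r_i - p_i| over all rating pairs."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(r - p) for r, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the mean of (r_i - p_i)^2 over all rating pairs."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = sum((r - p) ** 2 for r, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Hypothetical ratings; the absolute errors are 0.5, 0, 1, and 1.
actual = [4, 3, 5, 2]
predicted = [3.5, 3.0, 4.0, 3.0]
mae_value = mae(actual, predicted)    # 2.5 / 4 = 0.625
rmse_value = rmse(actual, predicted)  # sqrt(2.25 / 4) = 0.75
```

In this example the RMSE exceeds the MAE because squaring gives the two larger errors more weight; the two metrics coincide only when all absolute errors are equal.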

4.3 Discussion of results

In this work, a new approach for predicting the ratings of items is proposed. The proposed algorithm predicts the rating of a given item for a given user. The system is easy to implement because it does not analyze item characteristics and avoids clustering, which reduces the complexity of the algorithm. The proposed algorithm is implemented in Java. It was first run on 962 records of 14 users. Table 1 shows the calculated similarity between users. Here, the first row and first column show the userid, and \(M_{a,b}\) denotes the similarity value between user a and user b.

On the basis of the similarity between the users, Table 2 lists the related users for every user. The predicted ratings for 20 randomly chosen user–item pairs are shown in Table 3, which has five attributes. The predicted rating is calculated from the similarity values between the users and the list of similar users. The error column in Table 3 shows the difference between the actual rating and the predicted rating; the values of MAE and RMSE are calculated from this error.

Table 4 compares the results of the proposed algorithm with those of the existing algorithms in terms of MAE and RMSE.

Figure 3 gives a graphical representation of the measured Mean Absolute Error (MAE) of the proposed method URP and three existing methods: Rodrigues [16], Sarwar [17], and Gong [4]. The MAE of the URP method is the lowest of the four methods. The method with the lowest MAE provides the best predictions; as the MAE of a method increases, the accuracy of its predictions decreases. The MAE of the URP method is 0.315, and the MAE of Gong [4] is 0.80, which is the highest among the four methods; Gong's method therefore yields the least accurate predictions. The root-mean-square error is always greater than or equal to the Mean Absolute Error. Figure 4 compares the RMSE of the existing algorithms and the proposed algorithm URP. URP again shows a smaller RMSE than the other methods, which indicates that URP is more accurate. From the above results and discussion, it is concluded that the proposed method URP performs better than the existing algorithms.

5 Conclusion and future scope

In this work, a new approach for predicting the ratings of items is proposed. The proposed system first finds the similarity among users using the previous data available in the data set. Using that similarity, the system predicts the ratings of different items for different users. The results of the implementation suggest that the proposed algorithm predicts ratings more accurately, as its Mean Absolute Error is lower than that of the existing algorithms. However, this work has some limitations, which future work can address: (1) working on a larger data set or testing the algorithm on other data sets; (2) addressing other problems of recommendation systems, such as scalability and sparsity.
