
1 Collaborative Filtering

  Collaborative filtering is one of the most widely used algorithms in today's recommender systems. It comes in two variants: user-based CF (user-CF) and item-based CF (item-CF).

  The movie recommender in this article uses item-CF, mainly because the number of users is far larger than the number of movies, so building the matrix is much cheaper; in addition, item-based recommendations are more convincing to users in a movie recommendation setting. Therefore user-CF is only briefly introduced here, and the focus is on item-CF.

  1.1 User-based collaborative filtering

      a Compute the pairwise similarity between users to obtain the user similarity matrix;

      b Predict the user's preference for an item with the formula

        p(u,i) = \sum_{v \in S(u,K) \cap N(i)} w_{uv} \cdot r_{vi}

     where p(u,i) is user u's predicted interest in item i, S(u,K) is the set of K users whose interests are closest to user u, N(i) is the set of users who have interacted with item i, w_{uv} is the interest similarity between users u and v, and r_{vi} is user v's interest in item i (a minimal code sketch follows this list);

     c Make recommendations based on the predicted preference scores.
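   Although this article focuses on item-CF, a minimal in-memory sketch of the user-CF prediction formula may make it more concrete. All names here (UserCFSketch, predict, the similarity and ratings maps) are hypothetical and are not part of the MapReduce pipeline described below:

   import java.util.*;

   // Hypothetical in-memory sketch of p(u, i); not part of the MapReduce pipeline below.
   public class UserCFSketch {
       // similarity.get(u).get(v) = w_uv;  ratings.get(v).get(i) = r_vi
       public static double predict(String u, String i, int k,
                                    Map<String, Map<String, Double>> similarity,
                                    Map<String, Map<String, Double>> ratings) {
           // S(u, K): the K users whose interests are closest to u
           List<Map.Entry<String, Double>> neighbors = new ArrayList<>(similarity.get(u).entrySet());
           neighbors.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
           double p = 0.0;
           for (Map.Entry<String, Double> e : neighbors.subList(0, Math.min(k, neighbors.size()))) {
               Map<String, Double> rv = ratings.get(e.getKey());
               if (rv != null && rv.containsKey(i)) {   // v must also be in N(i)
                   p += e.getValue() * rv.get(i);       // w_uv * r_vi
               }
           }
           return p;
       }
   }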

   1.2 Item-based collaborative filtering

     1.2.1 Computing item similarity

     There are several ways to compute item similarity; here a co-occurrence matrix is used. Its element in row m, column n is the similarity between movies m and n, computed as follows: whenever a single user has watched both movie m and movie n, the similarity of m and n is incremented by 1.

     The co-occurrence matrix is then normalized row by row; note that after normalization the matrix is no longer symmetric. A small worked example follows:
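     For example (hypothetical viewing history): suppose user 1 has watched movies 10001 and 10002, and user 2 has watched movies 10001 and 10003. The co-occurrence matrix (left) and its row-normalized version (right) are then:

              10001  10002  10003              10001  10002  10003
     10001      2      1      1       10001     0.5    0.25   0.25
     10002      1      1      0       10002     0.5    0.5    0
     10003      1      0      1       10003     0.5    0      0.5

     The normalized matrix is indeed asymmetric: cell (10001, 10002) is 0.25 while cell (10002, 10001) is 0.5.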

      1.2.2 Predicting ratings for unwatched movies

      The predicted rating is computed as

        \hat{r}_{ui} = \sum_{j} w_{ij} \cdot r_{uj}

      where w_{ij} is the normalized co-occurrence value of movies i and j, and r_{uj} is user u's rating of movie j (taken as 0 if the user has not rated it).

      Therefore, the final prediction matrix can be obtained directly by multiplying the normalized co-occurrence matrix with the rating matrix.
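      As a small worked check (continuing the hypothetical example from 1.2.1 and additionally assuming user 2 rated movie 10001 with 4.0 and movie 10003 with 2.0, leaving 10002 unrated): using the normalized row of movie 10002, which is (0.5, 0.5, 0), user 2's predicted score for movie 10002 is 0.5 · 4.0 + 0.5 · 0 + 0 · 2.0 = 2.0.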

      1.2.3 Recommendation

      Based on the predicted ratings, the top-k unwatched movies are selected to form the recommendation list.

2 MapReduce Workflow

2.1 Input data format

Each input line represents userID, movieID, rating.
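For example (hypothetical values, reused as the running example in the MR steps below):

1,10001,5.0
1,10002,3.0
2,10001,4.0
2,10003,2.0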

2.2 Overall workflow

The pipeline chains five MapReduce jobs: MR1 groups the raw ratings by user, MR2 builds the co-occurrence matrix, MR3 normalizes it, MR4 multiplies the normalized co-occurrence matrix with the rating matrix cell by cell, and MR5 sums the partial products into the final predicted scores.

2.3 MR1

  MR1 performs data preprocessing, merging all records of the same user together.

  The mapper splits each line into a (userID, movieID:rating) pair.

  The reducer merges all pairs belonging to the same user, as in the example below:
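  With the hypothetical input above, the MR1 mapper emits (1, 10001:5.0), (1, 10002:3.0), (2, 10001:4.0), (2, 10003:2.0), and the reducer output is:

  1    10001:5.0,10002:3.0
  2    10001:4.0,10003:2.0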

2.4 MR2

  MR2 builds the co-occurrence matrix.

  The mapper takes every ordered pair of movies watched by the same user and emits it with value 1:

  The reducer sums these values, producing one cell of the co-occurrence matrix per key (rowID:columnID), as in the example below:
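  Continuing the running example, for user 1 the MR2 mapper emits (10001:10001, 1), (10001:10002, 1), (10002:10001, 1), (10002:10002, 1), and similarly for user 2 with movies 10001 and 10003. After the reducer sums over all users, the output contains, among others:

  10001:10001    2
  10001:10002    1
  10002:10001    1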

2.5 MR3

   MR3 normalizes the co-occurrence matrix.

   The mapper reads the co-occurrence matrix cells produced by the previous job and sends them to the reducers keyed by row number (normalization is done per row, so the row ID has to be the key).

     The reducer sums up one row, divides each original value by that sum to obtain the normalized value, and then writes each cell to HDFS keyed by column number (keying by column prepares for the matrix multiplication that follows).

   The input and output of MR3 therefore look as follows:
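   For the running example, row 10001 consists of the cells 10001:10001 2, 10001:10002 1, 10001:10003 1, so its row sum is 4 and the reducer writes, keyed by column:

   10001    10001=0.5
   10002    10001=0.25
   10003    10001=0.25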

2.6 MR4

  MR4 performs the cell-by-cell multiplication of the two matrices.

  mapper1 reads the cells of the normalized co-occurrence matrix and emits them keyed by column number (they were already stored by column, so it simply reads and forwards them).

  mapper2 reads the raw rating file, i.e. each cell of the rating matrix, and emits it keyed by row number (the movie ID).

   In the reducer, the received values come from column x of the co-occurrence matrix and row x of the rating matrix. Recall that cell (i, j) of the final prediction matrix equals the co-occurrence cell (i, x) times the rating cell (x, j), summed over all x. Each reducer gathers all cells from the two matrices that share the same x, so every pair of cells, one from each matrix, can be multiplied together. The separators '=' and ':' are used to tell the two kinds of cells apart. After separating the cells coming from the co-occurrence matrix and from the rating matrix, the reducer multiplies them pairwise; each product corresponds to one (row, column) combination of the prediction matrix, which is used as the key when writing to HDFS. A worked example is given below.
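   In the running example, the reducer for key 10001 receives the co-occurrence cells 10001=0.5, 10002=0.5, 10003=0.5 (column 10001 of the normalized matrix) and the rating cells 1:5.0, 2:4.0 (row 10001 of the rating matrix), and writes the partial products:

   1:10001    2.5
   1:10002    2.5
   1:10003    2.5
   2:10001    2.0
   2:10002    2.0
   2:10003    2.0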

2.7 MR5

  MR5 sums up the partial products.
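  Continuing the running example, the reducer for key 1:10001 receives 2.5 (from the MR4 group for movie 10001) and 0.75 (0.25 × 3.0, from the group for movie 10002) and writes 1:10001 3.25, which is user 1's predicted score for movie 10001.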

  

3 Main Code

DataDividerByUser.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

public class DataDividerByUser {

    public static class DataDividerMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        // map method
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // input: user,movie,rating
            String[] user_movie_rating = value.toString().split(",");
            int userId = Integer.parseInt(user_movie_rating[0]);
            String outPutKey = user_movie_rating[1] + ":" + user_movie_rating[2];
            // divide data by user
            context.write(new IntWritable(userId), new Text(outPutKey));
        }
    }

    public static class DataDividerReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        // reduce method
        @Override
        public void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            // merge data for one user
            for (Text value : values) {
                sb.append(value.toString());
                sb.append(",");
            }
            sb.deleteCharAt(sb.length() - 1);
            context.write(key, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setMapperClass(DataDividerMapper.class);
        job.setReducerClass(DataDividerReducer.class);
        job.setJarByClass(DataDividerByUser.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);
        TextInputFormat.setInputPaths(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

CoOccurrenceMatrixGenerator.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

public class CoOccurrenceMatrixGenerator {

    public static class MatrixGeneratorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // map method
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // value = userid \t movie1:rating,movie2:rating...
            String[] movie_rating = value.toString().split("\t")[1].split(",");
            // key = movie1:movie2, value = 1
            // enumerate each user's rating list: <movieA, movieB>
            for (int i = 0; i < movie_rating.length; i++) {
                for (int j = 0; j < movie_rating.length; j++) {
                    String outPutKey = movie_rating[i].split(":")[0] + ":" + movie_rating[j].split(":")[0];
                    context.write(new Text(outPutKey), new IntWritable(1));
                }
            }
        }
    }

    public static class MatrixGeneratorReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        // reduce method
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // key = movie1:movie2, value = iterable<1, 1, 1>
            // count how many people have watched each pair of movies
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setMapperClass(MatrixGeneratorMapper.class);
        job.setReducerClass(MatrixGeneratorReducer.class);
        job.setJarByClass(CoOccurrenceMatrixGenerator.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        TextInputFormat.setInputPaths(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Normalize.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class Normalize {

    public static class NormalizeMapper extends Mapper<LongWritable, Text, Text, Text> {
        // map method
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // input: movieA:movieB \t relation
            String movieA = value.toString().split("\t")[0].split(":")[0];
            String movieB = value.toString().split("\t")[0].split(":")[1];
            String relation = value.toString().split("\t")[1];
            // collect the relationship list for movieA
            context.write(new Text(movieA), new Text(movieB + ":" + relation));
        }
    }

    public static class NormalizeReducer extends Reducer<Text, Text, Text, Text> {
        // reduce method
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // key = movieA, value = <movieB:relation, movieC:relation...>
            // normalize each unit of the co-occurrence matrix
            Map<String, Double> map = new HashMap<String, Double>();
            double sum = 0;
            for (Text value : values) {
                String[] movie_relation = value.toString().split(":");
                map.put(movie_relation[0], Double.parseDouble(movie_relation[1]));
                sum += Double.parseDouble(movie_relation[1]);
            }
            for (Map.Entry<String, Double> entry : map.entrySet()) {
                String outputKey = entry.getKey();
                String outputValue = key.toString() + "=" + String.valueOf(entry.getValue() / sum);
                context.write(new Text(outputKey), new Text(outputValue));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setMapperClass(NormalizeMapper.class);
        job.setReducerClass(NormalizeReducer.class);
        job.setJarByClass(Normalize.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        TextInputFormat.setInputPaths(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Multiplication.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class Multiplication {

    public static class CooccurrenceMapper extends Mapper<LongWritable, Text, Text, Text> {
        // map method
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // input: movieB \t movieA=relation
            // pass data to reducer
            String[] movieB_movieARelation = value.toString().split("\t");
            context.write(new Text(movieB_movieARelation[0]), new Text(movieB_movieARelation[1]));
        }
    }

    public static class RatingMapper extends Mapper<LongWritable, Text, Text, Text> {
        // map method
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // input: user,movie,rating
            // pass data to reducer
            String[] user_movie_rating = value.toString().split(",");
            String outputKey = user_movie_rating[0] + ":" + user_movie_rating[2];
            context.write(new Text(user_movie_rating[1]), new Text(outputKey));
        }
    }

    public static class MultiplicationReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        // reduce method
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // key = movieB, value = <movieA=relation, movieC=relation..., userA:rating, userB:rating...>
            // collect the data for each movie, then do the multiplication
            Map<String, Double> coMap = new HashMap<String, Double>();
            Map<String, Double> ratingMap = new HashMap<String, Double>();
            for (Text value : values) {
                String s = value.toString();
                if (s.contains("=")) {
                    coMap.put(s.split("=")[0], Double.parseDouble(s.split("=")[1]));
                } else {
                    ratingMap.put(s.split(":")[0], Double.parseDouble(s.split(":")[1]));
                }
            }
            for (Map.Entry<String, Double> entry1 : coMap.entrySet()) {
                for (Map.Entry<String, Double> entry2 : ratingMap.entrySet()) {
                    double mult = entry1.getValue() * entry2.getValue();
                    String outputKey = entry2.getKey() + ":" + entry1.getKey();
                    context.write(new Text(outputKey), new DoubleWritable(mult));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(Multiplication.class);
        job.setReducerClass(MultiplicationReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        // MultipleInputs assigns each mapper to its own input path
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, CooccurrenceMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, RatingMapper.class);
        TextOutputFormat.setOutputPath(job, new Path(args[2]));
        job.waitForCompletion(true);
    }
}

Sum.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

/**
 * Created by Michelle on 11/12/16.
 */
public class Sum {

    public static class SumMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        // map method
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // pass data to reducer
            String[] key_value = value.toString().split("\t");
            context.write(new Text(key_value[0]), new DoubleWritable(Double.parseDouble(key_value[1])));
        }
    }

    public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        // reduce method
        @Override
        public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            // key = user:movie, value = partial products
            // calculate the sum
            double sum = 0;
            for (DoubleWritable value : values) {
                sum += value.get();
            }
            context.write(key, new DoubleWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setMapperClass(SumMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setJarByClass(Sum.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        TextInputFormat.setInputPaths(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Driver.java

public class Driver {
    public static void main(String[] args) throws Exception {
        String rawInput = args[0];
        String userMovieListOutputDir = args[1];
        String coOccurrenceMatrixDir = args[2];
        String normalizeDir = args[3];
        String multiplicationDir = args[4];
        String sumDir = args[5];

        String[] path1 = {rawInput, userMovieListOutputDir};
        String[] path2 = {userMovieListOutputDir, coOccurrenceMatrixDir};
        String[] path3 = {coOccurrenceMatrixDir, normalizeDir};
        String[] path4 = {normalizeDir, rawInput, multiplicationDir};
        String[] path5 = {multiplicationDir, sumDir};

        // run the five jobs in sequence; each main() blocks until its job completes
        DataDividerByUser.main(path1);
        CoOccurrenceMatrixGenerator.main(path2);
        Normalize.main(path3);
        Multiplication.main(path4);
        Sum.main(path5);
    }
}
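The whole pipeline could be launched with a command along these lines (the jar name and HDFS paths are hypothetical; the six arguments correspond to the paths expected by Driver):

hadoop jar movie-recommender.jar Driver \
    /input/ratings \
    /output/userMovieList \
    /output/coOccurrenceMatrix \
    /output/normalized \
    /output/multiplication \
    /output/prediction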