Frank Kane's Taming Big Data with Apache Spark and Python

上QQ阅读APP看书，第一时间看更新

Counting up the sum of friends and number of entries per age

Alright, now I'm going to throw you into the deep end of the pool here, look at this big scary line:

totalsByAge = rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))

However, if we break it down into its components, what's going on here is pretty straightforward. What we need to do next is to aggregate our RDD information somehow. So let's just break this totalsByAge line down, one component at a time. You can see, we have sort of a compound operation going on here; we're taking our RDD of age and number of friend key/value pairs and we're calling mapValues on it, and then we're taking the resulting RDD and calling reduceByKey on it. Let's take this one step at a time, we'll start with the mapValues piece of it.

This piece transforms every value in my key/value pair, because remember we're calling mapValues, so the x that's getting passed is only going to be the value piece of the original RDD:

rdd.mapValues(lambda x: (x, 1))

Let's take the first entry in our rdd RDD for example. This entry has a 33-year-old who had 385 friends:

So the value 385 gets passed in through mapValues for every line. This value is our x in the lambda function. As you can see in the following line, our output is going to be a new value that is actually a pair, a list if you will, of 385 and the number 1:

rdd.mapValues(lambda x: (x, 1))

The method behind our madness here is that in order to get an average, we need to count up all the total number of friends seen for a given age and the number of times that age occurred. Later on, if we sum up all of these pairs of information, we will get the total number of friends for that age and the total number of times that age occurred when we look at the totals for a given age. So that's kind of our strategy here-to build up a running total of how many times 33-year-olds were seen and the total number of friends that they had.

Let's get back to the syntax of this lambda function:

rdd.mapValues(lambda x: (x, 1))

Just to review again, mapValues will receive each value, which in our case is the number of friends, and output a new value, which is actually a tuple-(x,1). This tuple contains the original number of friends and the value 1. This is an example of ending up with a key/value pair, where the value is not just a single value or a single number, it's actually a collection of numbers, a list, and that's perfectly okay:

Our output is a new RDD that is still a key/value pair, but the keys are untouched because we called mapValues and the values are now transformed from just a number of friends to a pair value of 385 and the number 1.

Now we need to add everything up together and that's where the reduceByKey part comes in. Let's have a look at the second part of our big scary totalsByAge line. reduceByKey just tells us how we combine things together for the same key:

reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))

Again, going back to our example here, let's say that we are looking at every 33-year-old, so we're looking at keys of 33:

Our lambda function takes in two values, shown as x and y, and says how do we add them up? So for example, in our data shown here, we have x coming in as (385, 1), and y might be (2, 1):

Then this part of the function shown here just says add up each component. We take the first element of each value and add them together, take the second element of each value and add them together:

(x[0] + y[0], x[1] + y[1]))

The output in this case would be 385 plus 2, that is 387, and 1 plus 1, that is 2, and we'll keep doing that repeatedly for every time we encounter values for the key 33 and add them all up together:

So you see what we have at this stage is the grand total of number of friends, and the number of times we saw that key for the given key, which in this case is the age of 33 years old. It will do that for every single key. That's what reduceByKey does, and with that we have the information we need to actually compute the averages we want. Before we do that, it might be a good idea to go back over what we just covered and make sure that you understand how we count up the sum of friends and number of entries per age; even I had a hard time wrapping my head around this at first. Once you're sure you've understood, let's move on.