If you stop to consider how much data and content users add to Facebook every day, the volume is truly staggering. I’ve often wondered what the Facebook data team does with it all. Recently, I came across two insightful articles and a video series that shed some light on this.
The first article discusses how the Facebook data team uses statistical analysis to make informed product development decisions (the article also touches on Google’s use of data modeling and statistics).
Facebook’s Data Team used R in 2007 to answer two questions about new users: (i) which data points predict whether a user will stay? and (ii) if they stay, which data points predict how active they’ll be after three months?
For the first question, Itamar’s team used recursive partitioning (via the rpart package) to infer that just two data points are significantly predictive of whether a user remains on Facebook: (i) having more than one session as a new user, and (ii) entering basic profile information.
For the second question, they fit the data to a logistic model using a least angle regression approach (via the lars package), and found that activity at three months was predicted by variables related to three classes of behavior: (i) how often a user was reached out to by others, (ii) frequency of third party application use, and (iii) what Itamar termed “receptiveness” — related to how forthcoming a user was on the site.
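To make the first technique a little more concrete, here is a minimal Python sketch of recursive partitioning. It is not the rpart package the article mentions, and it is not Facebook's analysis: the ten rows are invented toy data, and the feature names (multiple_sessions, filled_profile, uploaded_photo) are hypothetical labels I chose for illustration. The idea is simply to pick, at each step, the yes/no feature that best separates users who stayed from users who left, and then repeat within each resulting group.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Return (feature, weighted impurity) for the best yes/no split, or None."""
    best = None
    for f in features:
        no = [y for r, y in zip(rows, labels) if not r[f]]
        yes = [y for r, y in zip(rows, labels) if r[f]]
        if not no or not yes:
            continue  # this feature does not divide the group at all
        score = (len(no) * gini(no) + len(yes) * gini(yes)) / len(labels)
        if best is None or score < best[1]:
            best = (f, score)
    return best

def grow(rows, labels, features, depth=0, max_depth=2):
    """Recursively partition the rows; each leaf stores the retained fraction."""
    split = best_split(rows, labels, features) if depth < max_depth else None
    if split is None or split[1] >= gini(labels):
        return {"leaf": sum(labels) / len(labels), "n": len(labels)}
    f = split[0]
    rest = [g for g in features if g != f]
    no = [(r, y) for r, y in zip(rows, labels) if not r[f]]
    yes = [(r, y) for r, y in zip(rows, labels) if r[f]]
    return {
        "feature": f,
        "no": grow([r for r, _ in no], [y for _, y in no], rest, depth + 1, max_depth),
        "yes": grow([r for r, _ in yes], [y for _, y in yes], rest, depth + 1, max_depth),
    }

# Invented toy data: (multiple_sessions, filled_profile, uploaded_photo, retained)
FEATURES = ["multiple_sessions", "filled_profile", "uploaded_photo"]
toy = [
    (1, 1, 0, 1), (1, 1, 1, 1), (1, 0, 0, 0), (1, 0, 1, 0),
    (0, 1, 0, 0), (0, 1, 1, 0), (0, 0, 0, 0), (0, 0, 1, 0),
    (0, 1, 0, 0), (0, 1, 1, 0),
]
rows = [dict(zip(FEATURES, t[:3])) for t in toy]
labels = [t[3] for t in toy]

tree = grow(rows, labels, FEATURES)
print(tree)
```

On this toy data the tree splits first on multiple_sessions and then on filled_profile, and never uses uploaded_photo. That mirrors the shape of the finding described above, but only because I built the data that way.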
The second article, posted by the Facebook data team in response to this Economist article, describes how the team uses statistical analysis to answer an important question:
We were asked a simple question: is Facebook increasing the size of people’s personal networks? This is a particularly difficult question to answer, so as a first attempt we looked into the types of relationships people do maintain, and the relative size of these groups.
The Facebook data team found that a user’s passive network is 2 to 2.5 times larger than their active network (i.e., the reciprocal network, in which communication flows both ways), and that the passive network is just as important as the reciprocal network in building buzz.
The stark contrast between reciprocal and passive networks shows the effect of technologies such as News Feed. If these people were required to talk on the phone to each other, we might see something like the reciprocal network, where everyone is connected to a small number of individuals. Moving to an environment where everyone is passively engaged with each other, some event, such as a new baby or engagement can propagate very quickly through this highly connected network.
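For readers who think in code, here is a small Python sketch of the distinction between the two kinds of network. It rests on my own assumptions: I model a reciprocal tie as a pair of users who have each messaged the other, and a passive tie as one user viewing another's content whether or not they ever reply. The function names, the user names, and the data are all made up, and I picked the numbers so the ratio lands inside the 2 to 2.5 range quoted above.

```python
def reciprocal_network(user, messages):
    """People the user has written to who have also written back."""
    sent = {dst for src, dst in messages if src == user}
    received = {src for src, dst in messages if dst == user}
    return sent & received

def passive_network(user, views):
    """People whose content the user has looked at, with or without a reply."""
    return {dst for src, dst in views if src == user}

# Invented toy data: (from, to) pairs
messages = [
    ("ann", "bob"), ("bob", "ann"),   # two-way
    ("ann", "cat"), ("cat", "ann"),   # two-way
    ("ann", "dan"),                   # one-way only, so not reciprocal
]
views = [
    ("ann", "bob"), ("ann", "cat"), ("ann", "dan"),
    ("ann", "eve"), ("ann", "fay"),
]

reciprocal = reciprocal_network("ann", messages)
passive = passive_network("ann", views)
ratio = len(passive) / len(reciprocal)
print(sorted(reciprocal), sorted(passive), ratio)
```

Here ann keeps up a two-way conversation with two people but passively follows five, so news about any of those five can reach her without a single message being exchanged.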
I’ll take a leap and say that these findings helped drive some of the reasoning behind the updated profile home page and business page “lifestreaming” functionality. Facebook’s focus on having people set up a profile, keep it updated, and immediately engage with other people, coupled with an emphasis on increasing a user’s penetration within their passive network, is critical to Facebook’s continued growth. [Update: for an excellent three-part analysis of the new Facebook pages go here, here, and here]. We can see an example of this passive network effect below, where a Facebook user posted a short note that his twins are soon to be featured on CSI; the news spread quickly and opened up several channels of commentary:
Here’s an additional link to some interesting insights from Facebook’s former head of data and analytics, Jeff Hammerbacher, into Facebook’s approach to data analytics and the lessons learned (these are fairly long videos, but really fun to watch). Hammerbacher discusses how they analyze terabytes of data in near real time so that their various business units can make more informed decisions. My key takeaway from the videos is that a graphical display of data that also lets users “hack” the data to gain deeper insights yields great gains in product development and customer relationship management.