Andreas Weigend
Social Data Revolution
MS&E 237, Stanford University, Spring 2010

Class 5: Content Discovery

Class Date: April 13, 2010
Audio | Transcript
Powerpoint: NA
Paper: NA

You will find all updates on @socialdata
Guest speakers will be recorded and videos can be accessed here:
Lars Backstrom from Facebook will be here on April 15; read his bio and leave a question for him at
Extra Credit Project: RTn -- where 'n' is an integer that indicates the retweet's importance to the retweeter

Review of Last Class: Product Discovery

eBusiness statistics

  • Level of analysis -- focusing on 'clicks' or 'sessions' depends on the questions we're trying to answer and the type of business.
  • Becoming familiar with the data -- as we saw with the histogram, it is very important to become familiar with the data and to understand what is causing any anomalies (e.g., bots, time zone differences).
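
To make the click-versus-session distinction concrete, here is a minimal sketch that groups one visitor's click timestamps into sessions. The 30-minute inactivity cutoff, the function name sessionize, and the sample timestamps are assumptions chosen for illustration; none of them come from the lecture.

```python
from datetime import datetime, timedelta

# Assumption: 30 minutes of inactivity ends a session (the notes give no cutoff).
SESSION_GAP = timedelta(minutes=30)


def sessionize(click_times):
    """Group one visitor's click timestamps into a list of sessions."""
    sessions = []
    for t in sorted(click_times):
        # Extend the current session if the gap since its last click is small.
        if sessions and t - sessions[-1][-1] <= SESSION_GAP:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


# Made-up sample data: three clicks in the morning, one in the afternoon.
clicks = [
    datetime(2010, 4, 13, 9, 0),
    datetime(2010, 4, 13, 9, 5),
    datetime(2010, 4, 13, 9, 20),
    datetime(2010, 4, 13, 14, 0),
]
sessions = sessionize(clicks)
print(len(clicks), "clicks in", len(sessions), "sessions")
```

The same log gives different counts depending on the level of analysis, which is why the choice should follow from the question being asked.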

Product discovery

  • Context goes further than just actions in the past or demographic facts such as age or gender. Encouraging users to indicate their intent provides the most robust method for discovering products.
  • The big improvements in discovery have come not from better algorithms (these have stayed largely the same), but from higher quality and volume of data.

This Class: Content Discovery

Remember the PHAME framework: (Problem, Hypothesis, Action, Metrics, Experiments)

Problem: help people discover relevant content
  • Content is becoming more and more abundant, but attention remains as scarce as ever.
  • For example, not all of the status info that is now broadcast is relevant to its recipients' situations.

Hypothesis: the best channel for finding data could be:

  • Specific sources (e.g., a subscription to the New York Times or Wall Street Journal)
  • Specific content / topics (e.g., YourVersion)
  • Metadata (e.g., web 2.0)
  • Social graph (e.g., sharing through Facebook, Twitter, or StumbleUpon)
  • Others?

Action: just do it! (quote from Dan Olsen)

Metrics: how to evaluate whether a discovery system really works?

  • Short Click vs Long Click -- a short click (i.e., the user spends almost no time on the linked page) is a bad sign: the link was an unnecessary distraction that did not hold the user's interest. A long click (i.e., the user spends a lot of time on the linked page) indicates more interest. However, how do you decide what is long versus what is short?
  • Short-term vs Long-term -- what is the short-term versus long-term impact on behavior?
  • Trade-offs -- any choice of metrics will involve some trade-off. For example, YourVersion's focus on text-matching and topics may trade off relevance against uniqueness.
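
The short-click versus long-click metric can be sketched as a simple dwell-time threshold. The 30-second default, the name long_click_rate, and the sample dwell times below are hypothetical; the lecture leaves the cutoff as an open question, so the sketch prints the metric at several thresholds to show how sensitive it is to that choice.

```python
# Assumption: a click counts as "long" if the user stays at least this many
# seconds. The value is illustrative, not one given in the lecture.
LONG_CLICK_SECONDS = 30.0


def long_click_rate(dwell_seconds, threshold=LONG_CLICK_SECONDS):
    """Return the fraction of clicks whose dwell time meets the threshold."""
    if not dwell_seconds:
        return 0.0
    long_clicks = sum(1 for d in dwell_seconds if d >= threshold)
    return long_clicks / len(dwell_seconds)


# Made-up dwell times (in seconds) for five clicks.
dwell = [2.0, 5.0, 45.0, 120.0, 8.0]
for threshold in (10.0, 30.0, 60.0):
    print(f"threshold={threshold:>5}s  long-click rate={long_click_rate(dwell, threshold):.2f}")
```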

Experiments: test the hypotheses
  • Don't just assume one solution is better than the other; e.g., run two algorithms against each other and see which is better.
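
One way to run two algorithms against each other is to show each to a random half of users and compare their long-click rates. The sketch below uses a standard two-proportion z-test; the choice of test, the function name, and the counts are assumptions for illustration and were not specified in class.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value), where z > 0 means variant B has the higher rate.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both variants are equal.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: long clicks out of impressions for algorithms A and B.
z, p = two_proportion_z(success_a=120, n_a=1000, success_b=150, n_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value suggests the difference is unlikely to be noise, but it says nothing about long-term effects, so the short-term versus long-term question from the metrics list still applies.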

Guest Speaker: Dan Olsen
  • YourVersion is pitched as "Pandora for your real-time web content."
  • Content on the web is growing too fast for anyone to keep track of. According to Google (2008), there are a trillion URLs and that number is growing by several billion each day.
  • In Olsen's view, the social graph is not the way to go -- the only way to discover users' intent or interests is to have them tell you and then search for those interests.
  • Smartphones are so popular because they let people squeeze in a few more minutes of online time (on the couch, etc.).

Next Class - Topic: People Discovery

Student 1: Thomas Haymore
Student 2: Felix Huber