The Social Data Revolution: HW3_Bitly MS&E237 Spring 2010 / Weigend

The goal of this homework is to understand the dynamics of how links propagate across the web, and to apply these insights to marketing campaigns.

Note: Please post any questions you might have to the HW3 Bitly thread. Also feel free to edit this wiki entry or to answer questions in the discussion thread.

There are 3 sets of data in bitly_data_sample.tar.gz (129mb):
stanford_decodes_100urls.csv (2.9mb)
stanford_encodes_100urls.csv (74kb)
sm_nytimes_decodes_1kurls.csv (241mb)
sm_nytimes_encodes_1kurls.csv (2.7mb)
nytimes_decodes_10kurls.csv (9.8mb)
nytimes_encodes_10kurls.csv (340mb)

You can download datasets individually here. Please email if you have any trouble getting the data.

Background example: An Interview on “Marketplace”

Broadcast date: November 18, 2009
URL: reports some numbers:

1. Total number of clicks on the shortened URL

2. Clicks as a function of time

The response to the initial posting (“impulse response”) might be described as exponential decay over time. Such a process is characterized by a time constant (the inverse of the decay rate): the time it takes for the click rate to drop to 1/e of its initial value.

A total of 6 files were created for us.
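The exponential-decay description of the impulse response can be made concrete with a small fit. The following is a minimal sketch, not a prescribed method: it assumes you have already binned the clicks on one link into equal time intervals after the initial posting (for example, clicks per hour), and it fits a straight line to the logarithm of the counts. The function name `fit_time_constant` and the binning are illustrative assumptions; the data files do not come in this form.

```python
import math

def fit_time_constant(counts, bin_width=1.0):
    """Fit log(count) = log(c0) - t / tau by ordinary least squares.

    counts[i] is the number of clicks in bin i after the initial posting.
    Returns (c0, tau); tau is the 1/e time, in the same unit as bin_width.
    """
    # Use bin midpoints as times; skip empty bins because log(0) is undefined.
    points = [((i + 0.5) * bin_width, math.log(c))
              for i, c in enumerate(counts) if c > 0]
    if len(points) < 2:
        raise ValueError("need at least two non-empty bins")
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    slope = sxy / sxx
    if slope >= 0:
        raise ValueError("counts do not decay over time")
    intercept = mean_y - slope * mean_t
    return math.exp(intercept), -1.0 / slope

# Noise-free synthetic clicks per hour with a true 1/e time of 3 hours.
synthetic = [1000.0 * math.exp(-(i + 0.5) / 3.0) for i in range(12)]
c0, tau = fit_time_constant(synthetic)
print(f"c0 = {c0:.1f}, tau = {tau:.2f} hours")
```

On real data the counts are noisy and the late bins are sparse, so a log-linear fit tends to be dominated by the tail; restricting the fit to the first few time constants, or comparing it with the mean click delay, is a reasonable sanity check.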

There are three sets:

Set_A: Small. Understand concepts without computational burden.

Set_B: Medium. Begin testing at scale.

Set_C: Large. For extra credit.

Each set consists of two files:

ENCODES: Information about the creation of the shortened url


DECODES: Information about each click on the shortened url

Example user agent string:
Mozilla/5.0 (Windows; U; Windows NT 5.1; ja; rv:1.9) Gecko/2008051206 Firefox/3.0

Examples of questions you might ask

· Does day of week and/or the time of day influence the number of clicks?
· Does day of week and/or the time of day influence the rate of decay?

· What is the effect of the initial channel (Twitter, Facebook, …)?
· Do traditional media content, blog posts (e.g., Techcrunch), and other content differ from one another in their propagation characteristics?
· Does the popularity of the content site (e.g., via Google’s PageRank) predict its propagation?
· Are properties of the initial poster important?
· How does the topic influence the link characteristics? For the ambitious, you can use Alchemy to tag the links: categ/urls.html
Just to add another quick thought: the New York Times actually includes the topic categorization of each article in the page's meta tags.
For example, from this article: 04/17/science/17plume.html?ref=science
We see:
<meta name="des" content="Volcanoes;Science and Technology" />
<meta name="geo" content="Iceland" />
<meta name="dat" content="April 16, 2010" />
<meta name="tom" content="News" />
<meta name="dsk" content="Science" />
This is a human-curated label, so it should be even better than Alchemy for topic identification on this set.
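The meta tags shown above can be pulled out of a page with Python's standard library. This is a minimal sketch under stated assumptions: the helper names `extract_meta` and `split_topics` are made up for illustration, the sample HTML is just the tags quoted above, and treating the `des` field as a semicolon-separated topic list is inferred from that one example rather than from any documentation.

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs into a dict."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags like <meta ... /> are also routed here by default.
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content is not None:
            self.meta[name] = content

def extract_meta(html):
    parser = MetaTagParser()
    parser.feed(html)
    parser.close()
    return parser.meta

def split_topics(meta, field="des"):
    # Assumption: topics are separated by semicolons, as in the example.
    return [t.strip() for t in meta.get(field, "").split(";") if t.strip()]

SAMPLE = """
<meta name="des" content="Volcanoes;Science and Technology" />
<meta name="geo" content="Iceland" />
<meta name="dat" content="April 16, 2010" />
<meta name="tom" content="News" />
<meta name="dsk" content="Science" />
"""

meta = extract_meta(SAMPLE)
print(split_topics(meta), meta.get("geo"), meta.get("dsk"))
```

To use this on the data set you would first fetch each article's HTML yourself; that step is left out here because it depends on how the long URLs appear in your copy of the files.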


Due at noon on Thursday, 22 Apr, via email to

A) Insights on stanford_decodes_100urls and stanford_encodes_100urls

This is an individual assignment. You are welcome to discuss with your classmates, but each student needs to submit their own assignment and insights.

B) Recommendations for the Chief Marketing Officer of a firm of your choice

Apply what you found in part A to a real world situation. Ideally, please discuss these with a CMO prior to submission.

C) Extra credit: Insights on Set_B and Set_C

The larger sample allows you to come up with some statistically significant insights.
You might also hold out some of the data in Set_B as a test set and see whether a model built on the rest can predict it.
This can be submitted in groups; all members will get the same number of points for extra credit.
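One way to reserve a test set for part C is to split by link rather than by click, so that all clicks on the same link land on the same side of the split. The sketch below is only an illustration: it assumes each row is a dict with a `global_hash` field (the actual column layout of the files may differ), and the function names and the 20% test fraction are arbitrary choices.

```python
import hashlib

def in_test_set(key, test_fraction=0.2):
    """Deterministically assign a link to the test set based on its hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the digest to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < test_fraction

def split_by_link(rows, key_field="global_hash", test_fraction=0.2):
    """Split rows into (train, test), keeping each link on one side only."""
    train, test = [], []
    for row in rows:
        target = test if in_test_set(row[key_field], test_fraction) else train
        target.append(row)
    return train, test

# Hypothetical rows: 50 made-up links with 3 clicks each.
rows = [{"global_hash": f"link{i}", "click": j}
        for i in range(50) for j in range(3)]
train, test = split_by_link(rows)
print(len(train), len(test))
```

Because the assignment depends only on the hash, rerunning the split gives the same partition, and the encodes and decodes files can be split consistently without sharing any state.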

Helpful Tips (added by Nick Hwang):

  • You can view lots of useful information beyond what's in the encodes/decodes files by looking at the following URLs in your browser: and
  • Note that user_hash is the unique identifier for a link that someone created, while global_hash is a higher level identifier that summarizes all links users have created for a particular URL.
  • I suggest looking at the Python API and using the stats function. Note that the stats function is technically deprecated in the API as of 4/9/10, but it is still active for now. You'll have to hack the Python module a bit to get referrer information.
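The relationship between user_hash and global_hash described in the tips above suggests rolling clicks up from individual links to the underlying URL. The following is a rough sketch only: it assumes the encodes rows carry both `user_hash` and `global_hash` and that each decodes row carries the `user_hash` that was clicked. Check the actual column names in the files before relying on it. The hashes in the example are made up.

```python
from collections import Counter

def clicks_per_global_hash(encodes, decodes):
    """Sum clicks over all user-created links that point at the same URL.

    Returns (totals, unmatched), where totals maps global_hash to a click
    count and unmatched counts clicks whose user_hash is not in encodes.
    """
    user_to_global = {e["user_hash"]: e["global_hash"] for e in encodes}
    totals = Counter()
    unmatched = 0
    for click in decodes:
        global_hash = user_to_global.get(click["user_hash"])
        if global_hash is None:
            unmatched += 1
        else:
            totals[global_hash] += 1
    return totals, unmatched

# Hypothetical data: two users shortened the same URL, a third a different one.
encodes = [
    {"user_hash": "u1", "global_hash": "g1"},
    {"user_hash": "u2", "global_hash": "g1"},
    {"user_hash": "u3", "global_hash": "g2"},
]
decodes = [{"user_hash": h} for h in ["u1", "u1", "u2", "u3", "u9"]]

totals, unmatched = clicks_per_global_hash(encodes, decodes)
print(dict(totals), unmatched)
```

Comparing the per-user_hash counts with the rolled-up totals is one way to see how much of a URL's traffic comes from a single poster versus many independent posters.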