The Social Data Revolution: HW3_Bitly MS&E237 Spring 2010 / Weigend

The goal of this homework is to understand the dynamics of how links propagate across the web, and to apply these insights to marketing campaigns.

Note: Please post any questions you might have to the HW3 Bitly thread. Also feel free to edit this wiki entry, or to answer questions in the discussion thread.

There are 3 sets of data in bitly_data_sample.tar.gz (129 MB):
stanford_decodes_100urls.csv (2.9 MB)
stanford_encodes_100urls.csv (74 KB)
sm_nytimes_decodes_1kurls.csv (241 MB)
sm_nytimes_encodes_1kurls.csv (2.7 MB)
nytimes_decodes_10kurls.csv (9.8 MB)
nytimes_encodes_10kurls.csv (340 MB)

You can download the datasets individually here. Please email mse237@gmail.com if you have any trouble getting the data.

Background example: An Interview on “Marketplace”

Broadcast date: November 18, 2009
URL:
http://bit.ly/dataNPR

Bit.ly reports some numbers:

1. Total number of clicks on the shortened URL

2. Clicks as a function of time

The response to the initial posting (“impulse response”) might be described as exponential decay over time. Such a process is characterized by a time constant: the time it takes for the click rate to drop to 1/e of its initial value.
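
As a rough illustration of the decay described above, the sketch below models the click rate as N(t) = N0 * exp(-t / tau) and estimates tau by fitting a straight line to the logarithm of binned click counts. The function name fit_decay_time and the synthetic hourly counts are made up for illustration. Real click data is noisy and may not follow a single exponential, so treat this as a starting point rather than a prescribed method.

```python
import math


def fit_decay_time(times, counts):
    """Least-squares fit of log(count) = log(n0) - t / tau.

    Returns (n0, tau). Bins with zero clicks are skipped because
    log(0) is undefined.
    """
    pts = [(t, math.log(c)) for t, c in zip(times, counts) if c > 0]
    if len(pts) < 2:
        raise ValueError("need at least two non-empty bins")
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in pts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return math.exp(intercept), -1.0 / slope


# Synthetic example (not real bit.ly data): 1000 clicks in the first
# hour and a time constant of 5 hours.
hours = list(range(24))
clicks = [1000 * math.exp(-h / 5.0) for h in hours]
n0, tau = fit_decay_time(hours, clicks)
print(round(n0), round(tau, 2))
```

The log-linear fit is simple but gives heavy weight to sparse late bins; a weighted or nonlinear fit may be more appropriate for links with few clicks.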

Bit.ly created for us a total of 6 files.

There are three sets:

Set_A: Small. Understand concepts without computational burden.

Set_B: Medium. Begin testing at scale.

Set_C: Large. For extra credit.

Each set consists of two files:

ENCODES: Information about the creation of the shortened url

user_hash: brPoFW
global_hash: dBs8r6
user_id: f4ef95bfae0dc7aeab267eea2791c11b837c8932327557637af624e721cbc112
timestamp: 1270687304
long_url: http://www.stanford.edu/class/cs193p/cgi-bin/drupal/downloads-2010-winter

DECODES: Information about each click on the shortened url

user_hash: airDgA
global_hash: dBs8r6
user_id: 2b966c88fa4e02dcdb46f9634271b46be5c330dcfe941166cf26067439c43697
iso_country: JP
ip_hash: 4bc505e7-00346-041e8-b2a08fa8
cookie_id: a640ab950d538470449ad2ca35dbb9cda18b0babd793c08f793fd5f410a72d4a
timestamp: 1271203303
system_info: Mozilla/5.0 (Windows; U; Windows NT 5.1; ja; rv:1.9) Gecko/2008051206 Firefox/3.0
long_url: http://www.gsb.stanford.edu/news/headlines/Desaideath.html
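
To make the relationship between the two files concrete, here is a minimal sketch that joins DECODES to ENCODES on global_hash and computes how long after creation each click happened. It assumes the CSV files have a header row using the field names listed above, and that timestamps are Unix seconds; neither is confirmed here, so check the files first. The helper names are hypothetical, and the two in-memory rows reuse the example values above.

```python
import csv
import io


def first_encode_times(encode_rows):
    """Map each global_hash to the earliest timestamp at which it was encoded."""
    first = {}
    for row in encode_rows:
        ts = int(row["timestamp"])
        key = row["global_hash"]
        if key not in first or ts < first[key]:
            first[key] = ts
    return first


def click_delays(decode_rows, first_seen):
    """Yield (global_hash, seconds since first encode) for each click."""
    for row in decode_rows:
        key = row["global_hash"]
        if key in first_seen:
            yield key, int(row["timestamp"]) - first_seen[key]


# Tiny in-memory stand-ins for the real files (header row is an assumption).
encodes_csv = "user_hash,global_hash,timestamp\nbrPoFW,dBs8r6,1270687304\n"
decodes_csv = (
    "user_hash,global_hash,iso_country,timestamp\n"
    "airDgA,dBs8r6,JP,1271203303\n"
)

first_seen = first_encode_times(csv.DictReader(io.StringIO(encodes_csv)))
delays = list(click_delays(csv.DictReader(io.StringIO(decodes_csv)), first_seen))
print(delays)
```

For the real data, replace the io.StringIO objects with open("stanford_encodes_100urls.csv", newline="") and the matching decodes file. If the files turn out to have no header row, pass the field names to csv.DictReader through its fieldnames argument.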


Examples of questions you might ask

· Does day of week and/or the time of day influence the number of clicks?
· Does day of week and/or the time of day influence the rate of decay?

· What is the effect of the initial channel (Twitter, Facebook, …)?
· Do traditional media content (e.g., NYT.com), blog posts (e.g., TechCrunch), and other content differ from one another in their propagation characteristics?
· Does the popularity of the content site (e.g., as measured by Google’s PageRank) predict its propagation?
· Are properties of the initial poster important?
· How does the topic influence the link characteristics (as characterized by bit.ly)? For the ambitious, you can use the Alchemy API to tag the links:

http://www.alchemyapi.com/api/categ/urls.html
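
For the day-of-week and time-of-day questions in the list above, one possible starting point is to bucket click timestamps by weekday and hour. The sketch below assumes the timestamp field holds Unix seconds and interprets them as UTC, which is not necessarily the local time of the person clicking (iso_country could be used to approximate that). The function name is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def clicks_by_day_and_hour(timestamps):
    """Count clicks per (weekday, hour) bucket, with times read as UTC."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.fromtimestamp(int(ts), tz=timezone.utc)
        counts[(DAYS[dt.weekday()], dt.hour)] += 1
    return counts


# The two example timestamps from the ENCODES and DECODES records.
counts = clicks_by_day_and_hour([1270687304, 1271203303])
for (day, hour), n in sorted(counts.items()):
    print(day, hour, n)
```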
---
Just to add another quick thought -- the New York Times actually
includes the topic categorization of each article in the page
metadata.
For example, from this article:
http://www.nytimes.com/2010/04/17/science/17plume.html?ref=science
We see:
<meta name="des" content="Volcanoes;Science and Technology" />
<meta name="geo" content="Iceland" />
<meta name="dat" content="April 16, 2010" />
<meta name="tom" content="News" />
<meta name="dsk" content="Science" />
This is a human-curated label, so it should be even better than
Alchemy for topic identification on this set.

---
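
The metadata tags quoted in the comment above can be read with Python's standard html.parser module. The sketch below is one way to do it, using the five tags from the comment as input; the class name MetaTagParser is made up, and fetching the page itself (and whether every article carries these tags) is left out.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta name=... content=...> pairs for a chosen set of names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here by default.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name in self.wanted and attr.get("content") is not None:
            self.found[name] = attr["content"]


html = """
<meta name="des" content="Volcanoes;Science and Technology" />
<meta name="geo" content="Iceland" />
<meta name="dat" content="April 16, 2010" />
<meta name="tom" content="News" />
<meta name="dsk" content="Science" />
"""

parser = MetaTagParser(["des", "geo", "dsk"])
parser.feed(html)
topics = parser.found["des"].split(";")
print(topics, parser.found["geo"], parser.found["dsk"])
```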

Due at noon on Thursday, April 22, via email to mse237@gmail.com:

A) Insights on stanford_decodes_100urls and stanford_encodes_100urls

This is an individual assignment. You are welcome to discuss with your classmates, but each student needs to submit their own assignment and insights.

B) Recommendations for the Chief Marketing Officer of a firm of your choice

Apply what you found in part A to a real world situation. Ideally, please discuss these with a CMO prior to submission.

C) Extra credit: Insights on Set_B and Set_C

The larger samples allow you to come up with statistically significant insights.
You might also hold out part of Set_B as a test set and see whether a model built on the rest can predict it.
This part can be submitted in groups; all members will receive the same number of extra-credit points.
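
One way to set aside a test set, sketched below, is to split by link rather than by click, so that all clicks on a given global_hash land on the same side of the split. The function name, the 20% test fraction, and the placeholder link IDs are illustrative choices, not part of the assignment.

```python
import random


def split_by_link(global_hashes, test_fraction=0.2, seed=0):
    """Split unique links into train and test sets with no overlap."""
    unique = sorted(set(global_hashes))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test = set(unique[:n_test])
    train = set(unique[n_test:])
    return train, test


# Placeholder IDs standing in for the global_hash column; each repeated
# three times to mimic several clicks per link.
links = [f"link{i:03d}" for i in range(50)] * 3
train, test = split_by_link(links)
print(len(train), len(test))
```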


Helpful Tips (added by Nick Hwang):

  • You can view lots of useful information beyond what's in the encodes/decodes files by looking at the following URLs in your browser: http://bit.ly/global_hash+ and http://bit.ly/user_hash+.
  • Note that user_hash is the unique identifier for a bit.ly link that someone created, while global_hash is a higher level identifier that summarizes all bit.ly links users have created for a particular URL.
  • I suggest looking at the Python bit.ly API and using the stats function. Note that the stats function is technically deprecated in the bit.ly API as of 4/9/10, but it is still active for now. You'll have to modify the Python module a bit to get referrer information.
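
To illustrate the first two tips, the sketch below builds the info-page URL for a hash by appending "+" and groups user_hash values under their shared global_hash. The helper names are hypothetical, and the two rows reuse the example hashes from the ENCODES and DECODES records; it makes no calls to the bit.ly API.

```python
from collections import defaultdict


def info_url(hash_value):
    """Return the bit.ly info-page URL for a user_hash or global_hash."""
    return f"http://bit.ly/{hash_value}+"


def user_hashes_by_global(rows):
    """Group the user_hash values that share the same global_hash."""
    groups = defaultdict(set)
    for row in rows:
        groups[row["global_hash"]].add(row["user_hash"])
    return groups


rows = [
    {"user_hash": "brPoFW", "global_hash": "dBs8r6"},
    {"user_hash": "airDgA", "global_hash": "dBs8r6"},
]
groups = user_hashes_by_global(rows)
print(info_url("dBs8r6"), sorted(groups["dBs8r6"]))
```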