HW3_Bitly

= The Social Data Revolution: HW3_Bitly MS&E237 Spring 2010 / Weigend =

**The goal of this homework is to understand the dynamics of how links propagate across the web, and to apply these insights to marketing campaigns.**

**Note: Please post any questions you might have to the [|HW3 Bitly thread]. Also feel free to edit this wiki entry, or answer any questions in the discussions thread.**

There are 3 sets of data in [|bitly_data_sample.tar.gz] (129mb):
 * stanford_decodes_100urls.csv (2.9mb)
 * stanford_encodes_100urls.csv (74kb)
 * sm_nytimes_decodes_1kurls.csv (241mb)
 * sm_nytimes_encodes_1kurls.csv (2.7mb)
 * nytimes_decodes_10kurls.csv (9.8mb)
 * nytimes_encodes_10kurls.csv (340mb)

You can download datasets individually [|here]. Please email mse237@gmail.com if you have any trouble getting the data.

**Background example: An Interview on “Marketplace”**
Broadcast date: November 18, 2009 URL: []

Bit.ly reports some numbers:
1. [|Total number of clicks on the shortened URL]

2. [|Clicks as a function of time]

The response to the initial posting (“impulse response”) might be described as exponential decay over time. Such a process is characterized by a decay time constant: the time it takes for the number of clicks per unit of time to drop to 1/e of its initial value.
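
As a rough sketch of how that constant might be estimated: if the clicks per time bin follow N(t) = N0 * exp(-t / tau), then ln N(t) is linear in t with slope -1/tau, so a least-squares line through (t, ln N) gives an estimate of tau. The function name, the hourly binning, and the synthetic counts below are illustrative assumptions and are not taken from the homework data.

```python
import math

def estimate_time_constant(times, counts):
    """Estimate (tau, n0) in N(t) = n0 * exp(-t / tau) from binned click counts.

    Fits a least-squares line to (t, ln N). Bins with zero clicks are
    skipped because ln(0) is undefined.
    """
    pts = [(t, math.log(c)) for t, c in zip(times, counts) if c > 0]
    if len(pts) < 2:
        raise ValueError("need at least two non-empty bins")
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in pts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in pts)
    slope = sxy / sxx
    if slope >= 0:
        raise ValueError("counts do not decay over time")
    intercept = mean_y - slope * mean_t
    return -1.0 / slope, math.exp(intercept)

# Synthetic hourly counts with a known time constant of 3 hours.
hours = list(range(10))
clicks = [1000 * math.exp(-h / 3.0) for h in hours]
tau, n0 = estimate_time_constant(hours, clicks)
print(f"tau = {tau:.2f} hours, n0 = {n0:.0f} clicks")
```

Real click data is noisy and often has a long tail, so a log-linear fit like this is only a starting point; it weights small counts heavily and ignores empty bins.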

**// ENCODES: Information about the creation of the shortened url //**

 * user_hash || brPoFW ||
 * global_hash || dBs8r6 ||
 * user_id || f4ef95bfae0dc7aeab267eea2791c11b837c8932327557637af624e721cbc112 ||
 * timestamp || 1270687304 ||
 * long_url || http://www.stanford.edu/class/cs193p/cgi-bin/drupal/downloads-2010-winter ||
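
The timestamp value in the record above looks like Unix epoch seconds; that is an assumption, since this page does not state the unit. Under that assumption, a minimal sketch for turning it into a UTC date, day of week, and hour (useful for day-of-week or time-of-day analyses) could look like this. The helper name is hypothetical.

```python
from datetime import datetime, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def time_features(timestamp):
    """Return (UTC datetime, weekday name, hour) for an epoch-seconds value.

    Assumes the field is seconds since 1970-01-01 UTC. Accepts a string,
    since values read from a CSV file arrive as text.
    """
    dt = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return dt, WEEKDAYS[dt.weekday()], dt.hour

dt, weekday, hour = time_features("1270687304")
print(dt.isoformat(), weekday, hour)
```

The result is in UTC. The local time of the person who created or clicked the link is not known from the timestamp alone.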

**// DECODES: Information about each click on the shortened url //**
 * user_hash || airDgA ||
 * global_hash || dBs8r6 ||
 * user_id || 2b966c88fa4e02dcdb46f9634271b46be5c330dcfe941166cf26067439c43697 ||
 * iso_country || JP ||
 * ip_hash || 4bc505e7-00346-041e8-b2a08fa8 ||
 * cookie_id || a640ab950d538470449ad2ca35dbb9cda18b0babd793c08f793fd5f410a72d4a ||
 * timestamp || 1271203303 ||
 * system_info || Mozilla/5.0 (Windows; U; Windows NT 5.1; ja; rv:1.9) Gecko/2008051206 Firefox/3.0 ||
 * long_url || http://www.gsb.stanford.edu/news/headlines/Desaideath.html ||

= Examples of questions you might ask =

 * Does the day of the week and/or the time of day influence the number of clicks?
 * Does the day of the week and/or the time of day influence the rate of decay?
 * What is the effect of the initial channel (Twitter, Facebook, …)?
 * Do traditional media content (e.g., NYT.com), blog posts (e.g., TechCrunch), and other content differ from one another in their propagation characteristics?
 * Does the popularity of the content site (e.g., via Google’s PageRank) predict its propagation?
 * Are properties of the initial poster important?
 * How does the topic influence the link characteristics (characterized by bit.ly)? For the ambitious, you can use Alchemy to tag the links: http://www.alchemyapi.com/api/categ/urls.html

---
Just to add another quick thought: the New York Times actually includes the topic categorization of each article in the page metadata. For example, from this article: http://www.nytimes.com/2010/04/17/science/17plume.html?ref=science
We see:
This is a human-curated label, so it should be even better than Alchemy for topic identification on this set.
---

= Due at noon on Thursday 22Apr, via email to mse237@gmail.com: =

A) Insights on stanford_decodes_100urls and stanford_encodes_100urls
This is an individual assignment. You are welcome to discuss with your classmates, but each student needs to submit their own assignment and insights.

B) Recommendations for the Chief Marketing Officer of a firm of your choice
Apply what you found in part A to a real-world situation. Ideally, discuss your recommendations with a CMO prior to submission.

C) Extra credit: Insights on Set_B and Set_C
The larger sample allows you to come up with some statistically significant insights. You might also hold out part of SAMPLE_B as a test set and see whether a model built on the remaining data can predict it. This part can be submitted in groups; all members will get the same number of points for extra credit.
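
One way the hold-out idea might be sketched: split by URL rather than by individual click, so that every click on a given link ends up on the same side and the test links are genuinely unseen. The function name, the 20% test fraction, the fixed seed, and the toy rows are all illustrative assumptions; only the global_hash field name comes from the data description on this page.

```python
import random

def split_by_url(rows, test_fraction=0.2, seed=237):
    """Split click rows into (train, test) by global_hash.

    All rows that share a global_hash land in the same split, which
    avoids leaking information about a test link into the training set.
    """
    hashes = sorted({row["global_hash"] for row in rows})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(hashes)
    n_test = max(1, round(len(hashes) * test_fraction))
    test_hashes = set(hashes[:n_test])
    train = [r for r in rows if r["global_hash"] not in test_hashes]
    test = [r for r in rows if r["global_hash"] in test_hashes]
    return train, test

# Toy data: 10 hypothetical links with 3 clicks each.
rows = [{"global_hash": f"link{i}", "timestamp": 1271203303 + j}
        for i in range(10) for j in range(3)]
train, test = split_by_url(rows)
print(len(train), "training rows,", len(test), "test rows")
```

Splitting by time (train on earlier links, test on later ones) would be a reasonable alternative if the goal is to predict future campaigns.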

=** Helpful Tips (added by Nick Hwang): **=
 * You can view lots of useful information beyond what's in the encodes/decodes files by looking at the following URLs in your browser: http://bit.ly/global_hash+ and http://bit.ly/user_hash+.
 * Note that user_hash is the unique identifier for a bit.ly link that someone created, while global_hash is a higher level identifier that summarizes all bit.ly links users have created for a particular URL.
 * I suggest looking at the Python bit.ly API and using the stats function. Note that the stats function is technically deprecated in the bit.ly API as of 4/9/10, but it is still active for now. You'll have to hack the Python module a bit to get referrer information.
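
To make the user_hash versus global_hash distinction from the tips above concrete, here is a small sketch that builds the info-page URL for a hash and rolls per-link click counts up to the underlying URL. It does not call the bit.ly API; the function names and the toy decode rows are hypothetical, and only the field names and the `http://bit.ly/<hash>+` pattern come from this page.

```python
from collections import defaultdict

def info_url(bitly_hash):
    """Build the info-page URL for a user_hash or global_hash."""
    return "http://bit.ly/" + bitly_hash + "+"

def count_clicks(decodes):
    """Count clicks per user_hash and per global_hash.

    Each decode row is one click. Several user_hash values can map to
    the same global_hash when different users shortened the same URL.
    """
    per_user = defaultdict(int)
    per_global = defaultdict(int)
    for row in decodes:
        per_user[row["user_hash"]] += 1
        per_global[row["global_hash"]] += 1
    return dict(per_user), dict(per_global)

# Toy rows (made up): two users shortened the same URL.
decodes = [
    {"user_hash": "userA1", "global_hash": "glob01"},
    {"user_hash": "userA1", "global_hash": "glob01"},
    {"user_hash": "userB2", "global_hash": "glob01"},
]
per_user, per_global = count_clicks(decodes)
print(per_user)
print(per_global)
print(info_url("dBs8r6"))
```

Comparing the per-user counts with the per-global total shows how much of a URL's traffic came through each individual shortened link.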