HW4_Twitter

[|Andreas Weigend] Spring 2010 MS&E 237 Stanford University
**Social Data Revolution**

=Homework 4: Twitter=
Assigned: Thu Apr 22, 2010
Due: Thu Apr 29, 2010, noon
Submit to: mse237@gmail.com

Note: You are welcome to work in a group for this assignment, but please submit your own code. Feel free to discuss questions about Python programming broadly.

There are two main goals of this assignment. The first is to introduce you to another API, one that offers a wealth of information in tweets and social graphs. The second is to get you to discover interesting people, in the hope of fostering long-term engagement.

**Whitelisting Information**
To get a reasonable amount of data from Twitter (20k requests instead of 150 requests), we have contacted Twitter to get everyone in the class whitelisted. To do this, please do the following:
1) Log in to Twitter with the account you will be using for this homework.
2) Go [|here]
3) In the reason, make sure to include "ATTN: Brian Sutorius. FOR: STANFORD MS&E 237"
4) Enter your IP address (see below)
It will probably take a day or two, so do this early.

How to get your IP address: Go to the network (physical location) where you plan on completing most of the assignment. Mac: [] PC: [][]

=The Assignment=
For this assignment, please create 10 relevant friend recommendations for each of 10 friends (100 recommendations in total). Before you send the recommendations to each friend, rank them yourself (1 worst, 10 best) according to how relevant you expect your friend to find them. Keep this ranking to yourself. Then ask each friend to rank them (1 worst, 10 best) and to give a short response about why they are good and/or bad. Please submit the recommendations, ratings, and comments using this [|Google Form].

Deliverables to mse237@gmail.com:
 * Your script/program
 * A short write-up that clearly describes your conceptual approach and briefly discusses the constraints you faced and the rationale for taking the approach.

**Please note: you are now allowed to work and submit your solution in teams of 2! Please do still submit 10 recommendations each.**

**PROBLEM**
What do users really want? To discover information for certain domains? Business networking? Perhaps even dating?

**HYPOTHESES**
Data sources and recommendation approaches. These are just some hypotheses; we would love to see you use your creativity and think outside the box.
 * Analyzing a user's social graph (who follows whom, who replies to whom, who retweets whom, etc.)
 * Collaborative filtering (You follow A, B and C. Other people who follow A, B and C also follow D, thus we recommend D to you)
 * Similarity (You are followed by A, B and C, all of whom also follow D)
 * Transitive attention (You follow A, B, and C, all of whom follow D)
 * Semantics (tweets)
 * Hashtags (who else uses a similar set of hashtags as I do? are all hashtags created equal?)
 * Keywords (identify interesting keywords? bag of words?)
 * Location (how can geolocation be used, and when is it relevant, e.g., for finding people with similar interests near me?)
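As one illustration of the transitive attention hypothesis above, here is a minimal sketch, not a reference solution. It assumes you have already collected who follows whom into a dictionary; the `following` structure, the function name, and the toy usernames are all made up for this example, and real data would come from the Twitter API.

```python
from collections import Counter


def recommend_by_transitive_attention(user, following, top_n=10):
    """Rank accounts followed by the people `user` follows.

    `following` maps a username to the set of usernames that account
    follows (a hypothetical in-memory structure for this sketch).
    """
    already_followed = following.get(user, set())
    votes = Counter()
    for friend in already_followed:
        for candidate in following.get(friend, set()):
            # Skip the user and anyone the user already follows.
            if candidate != user and candidate not in already_followed:
                votes[candidate] += 1
    # Most votes first; ties broken alphabetically so results are stable.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Toy graph: "you" follow A, B and C, all of whom follow D.
following = {
    "you": {"A", "B", "C"},
    "A": {"D", "E"},
    "B": {"D", "you"},
    "C": {"D", "E", "A"},
}
recommendations = recommend_by_transitive_attention("you", following)
print(recommendations)  # [('D', 3), ('E', 2)]
```

The vote count is only one possible score. You might, for instance, want to discount accounts that almost everybody follows.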

**ACTION**
Pick 10 of your friends who have been using Twitter. Randomly assign them to two groups, each of which will see one version of your algorithm.
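The random assignment described here could be done along these lines. This is only a sketch: the function name, the group labels, and the placeholder friend names are assumptions made for illustration, and the fixed seed is there just so that you can reproduce the split later for your write-up.

```python
import random


def assign_groups(friends, seed=None):
    """Shuffle the friends and split them into two equally sized groups."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    shuffled = list(friends)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"algorithm_a": shuffled[:half], "algorithm_b": shuffled[half:]}


# Placeholder names; substitute your friends' Twitter usernames.
friends = ["friend%d" % i for i in range(1, 11)]
groups = assign_groups(friends, seed=237)
print(groups["algorithm_a"])
print(groups["algorithm_b"])
```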

**METRICS**
Ask each friend to rank your recommendations from 1-10 (1 lowest, 10 highest) and give qualitative feedback in free text about why each recommendation is good and/or bad.

Once you have collected this information, fill out the [|Google Form] once for each of the friends for whom you make recommendations.


 * Output for each friend:
 * username of friend
 * Output for each recommendation:
 * Please write all of this on one line, separated by **commas**. The username you're recommending, the rank you think your friend will give (not shared with friend), the actual rank given by your friend, free text about what made the recommendation good, free text about how the recommendation could have been better.
 * It should look like this: suggested_twitter_username, your_rank, friends_rank, "good free text", "bad free text"
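If you keep your results in Python, a small helper can produce lines in the format shown above. This is a sketch under assumptions: the function name is made up, and doubling embedded quotation marks (the usual CSV convention) is a guess at what is safest for free text that itself contains quotes or commas.

```python
def format_recommendation(username, your_rank, friends_rank, good_text, bad_text):
    """Build one comma-separated line for a single recommendation."""
    for rank in (your_rank, friends_rank):
        if not 1 <= rank <= 10:
            raise ValueError("ranks must be between 1 and 10, got %r" % (rank,))

    def quote(text):
        # Wrap free text in quotes; double any embedded quotes (CSV-style).
        return '"%s"' % text.replace('"', '""')

    fields = [username, str(your_rank), str(friends_rank),
              quote(good_text), quote(bad_text)]
    return ", ".join(fields)


line = format_recommendation(
    "suggested_twitter_username", 7, 9,
    "shares my interest in data", "tweets too often, mostly links")
print(line)
# suggested_twitter_username, 7, 9, "shares my interest in data", "tweets too often, mostly links"
```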

**EVALUATION [quality is more important than quantity]**
Create a short write-up comparing your two hypotheses/algorithms, the constraints you faced, and what you learned.
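One way to quantify, for the write-up, how well your private ranking anticipated a friend's ranking is Spearman's rank correlation. This is a suggestion rather than a required metric. The sketch below uses the simple no-ties formula, so it assumes both lists are complete rankings (each of 1..n used exactly once); the function name and the sample numbers are invented for illustration.

```python
def spearman_rho(predicted_ranks, actual_ranks):
    """Spearman rank correlation for two tie-free rankings of the same items.

    Returns 1.0 for identical rankings and -1.0 for exactly reversed ones.
    """
    n = len(predicted_ranks)
    if n != len(actual_ranks) or n < 2:
        raise ValueError("need two rankings of the same length (at least 2)")
    expected = list(range(1, n + 1))
    if sorted(predicted_ranks) != expected or sorted(actual_ranks) != expected:
        raise ValueError("each ranking must use 1..n exactly once (no ties)")
    squared_diffs = sum((p - a) ** 2
                        for p, a in zip(predicted_ranks, actual_ranks))
    return 1.0 - 6.0 * squared_diffs / (n * (n * n - 1))


# Invented example: position i holds the ranks given to recommendation i.
your_ranks = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
friends_ranks = [9, 10, 8, 6, 7, 5, 4, 2, 3, 1]
rho = spearman_rho(your_ranks, friends_ranks)
print(round(rho, 3))
```

Averaging this value over the friends in each group gives one number per algorithm version, which can sit next to the qualitative feedback in your comparison.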

=Getting Started=
 * You are welcome to use the programming language of your choice; however, you may find the [|files] we've put together helpful
 * Twitter API: []
 * Language specific libraries: []
 * How do you manage the necessary data (in-memory data structures? text files? database tables?)
 * Note that your system does not have to be interactive. Offline computation is good enough.
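On the data management question, plain text files are often enough for offline computation. As a minimal sketch, assuming you hold the social graph as a dictionary of sets, the following saves it to a JSON file and loads it back, so you only spend API requests once. The file name, function names, and usernames other than aweigend are hypothetical.

```python
import json
import os
import tempfile


def save_graph(path, graph):
    """Write {username: set of usernames} to a JSON file."""
    # JSON has no set type, so each set is stored as a sorted list.
    serializable = dict((user, sorted(names)) for user, names in graph.items())
    with open(path, "w") as handle:
        json.dump(serializable, handle, indent=2)


def load_graph(path):
    """Read the graph back, or return an empty one if the file is missing."""
    if not os.path.exists(path):
        return {}
    with open(path) as handle:
        raw = json.load(handle)
    return dict((user, set(names)) for user, names in raw.items())


# A temporary directory keeps this example from cluttering your project.
path = os.path.join(tempfile.mkdtemp(), "following.json")
graph = load_graph(path)                                   # empty on first run
graph["aweigend"] = {"example_user_1", "example_user_2"}   # made-up data
save_graph(path, graph)
reloaded = load_graph(path)
print(reloaded == graph)
```

A database becomes worthwhile once the graph no longer fits comfortably in memory or you want to query it from several scripts.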

=Working with Python=
We have put together a few functions in Python that should make getting data a little easier. The fastest way to get started is to download the TwitterSearch.py file and the simple-json*.tar file. The two sample code files should help you get started. //Note that there are two ways to access the Twitter API detailed here.//

Download at least these **4** files from []:
 * **ReadMe.txt** - instructions for downloading and installing the python-twitter package as well as simplejson
 * **TwitterSearch-sample-code.py**
 * **TwitterSearch.py** - functions for downloading following/followers and a function to search updates
 * **simplejson-2.0.9.tar** - needed for working with JSON objects returned by the Twitter API

You may also want to use the python-twitter library, which has more features and objects than TwitterSearch.py:
 * **python-twitter-sample-code.py** - example calls to the functions in the python-twitter package
 * **python-twitter-0.5.tar** - functions and objects for working with the API; the file is cached here as a convenience for you

If you download the .tar files from the link above, you won't need to use curl. If you have any trouble installing the packages, try moving the **python-twitter.py** file and the **simple-json** directory to your working directory. The library can then be used without installing it (this should work on shared machines where you don't have write permissions on public directories).

=If you have your own Debian/Ubuntu machine=
If you have your own Debian/Ubuntu desktop or server, the process is easier. The following installs Python 2.6, setuptools, and the python-twitter and simplejson packages (feel free to remove any you don't need):

code
sudo apt-get install python2.6 python-setuptools python-twitter python-simplejson
code

That's it. Now you can create your script.py in your favorite editor. Here's some filler:

code format="python"
# script.py
import twitter

def myFunc():
    print("Hello")

# Call the function so that running the script prints something.
myFunc()
code

And run the script with:

code
> python script.py
Hello
code

- Ashwin (purohit@stanford.edu)

From here you're on your own. Good luck and be creative. If you find any helpful links, feel free to post them here. (There are some good hints below.)
Here's some code to help you cache requests like the last assignment:

code format="python"
import sqlite3 as sqlite
import pickle


class CacheMissError(Exception):
    pass


class CachedAPIWrapper(object):
    """
    Delegate everything to the api, but cache the results in a DB.

    Usage:

    def get_api():
        tapi = twitter.Api(username=USER, password=PASSWORD)
        cache = DBCache('tweeting.db', False)
        return CachedAPIWrapper(tapi, cache)
    """
    def __init__(self, api, cache):
        self._api = api
        self._cache = cache

    def __getattr__(self, name):
        fn = getattr(self._api, name)
        if not callable(fn):
            return fn

        def wrapper(*args, **kwargs):
            # Keyword arguments are part of the cache key, so calls that
            # differ only in their keywords do not share a cached result.
            key = (args, sorted(kwargs.items()))
            try:
                return self._cache.load(name, key)
            except CacheMissError:
                value = fn(*args, **kwargs)
                self._cache.save(name, key, value)
                return value
            except TypeError:
                # Uncachable -- for instance, passing a list as an argument.
                # Better to not cache than to blow up entirely.
                print('Warning: Uncachable!')
                return fn(*args, **kwargs)
        return wrapper


class DBCache(object):
    def __init__(self, dbfile, debug=False):
        self.__connection = sqlite.connect(dbfile)
        self.__connection.text_factory = str
        self.__cursor = self.__connection.cursor()
        self.__debug = debug
        # Create the table on first use; does nothing if it already exists.
        self.__cursor.execute(
            'CREATE TABLE IF NOT EXISTS `cache` (id INTEGER PRIMARY KEY, '
            '`func` varchar(255), `args` blob, `result` blob);')

    def save(self, func, args, result):
        self.__cursor.execute(
            'INSERT INTO `cache` (`func`, `args`, `result`) VALUES (?, ?, ?)',
            (func, pickle.dumps(args), pickle.dumps(result)))
        self.__connection.commit()
        if self.__debug:
            print('Saving %s(%s) -> %s!' % (func, args, result))

    def load(self, func, args):
        self.__cursor.execute(
            'SELECT `result` FROM `cache` WHERE `func` = ? AND `args` = ?;',
            (func, pickle.dumps(args)))
        result = self.__cursor.fetchone()
        if result:
            value = pickle.loads(result[0])
            if self.__debug:
                print('Loaded %s(%s) -> %s!' % (func, args, value))
            return value
        else:
            raise CacheMissError()

    def clear(self):
        self.__cursor.execute('DELETE FROM `cache`;')
        self.__connection.commit()
code

---polcari

Download this [|patch] if you are using simplejson 2.0.x and you get this error message when running the tests:

code format="bash"
$ cd python-twitter-0.5
$ python setup.py test

======================================================================
FAIL: Test the twitter.Status AsJsonString method
----------------------------------------------------------------------
Traceback (most recent call last):
  File "twitter_test.py", line 121, in testAsJsonString
    self._GetSampleStatus().AsJsonString())
AssertionError: '{"created_at": "Fri Jan 26 23:17:14 +0000 2007", "id": 4391023, "text": "A l\\u00e9gp\\u00e1rn\\u00e1s haj\\u00f3m tele van angoln\\u00e1kkal.", "user": {"description": "Canvas. JC Penny. Three ninety-eight.", "id": 718443, "location": "Okinawa, Japan", "name": "Kesuke Miyagi", "profile_image_url": "http:\\/\\/twitter.com\\/system\\/user\\/profile_image\\/718443\\/normal\\/kesuke.png", "screen_name": "kesuke", "url": "http:\\/\\/twitter.com\\/kesuke"}}' != '{"created_at": "Fri Jan 26 23:17:14 +0000 2007", "id": 4391023, "text": "A l\\u00e9gp\\u00e1rn\\u00e1s haj\\u00f3m tele van angoln\\u00e1kkal.", "user": {"description": "Canvas. JC Penny. Three ninety-eight.", "id": 718443, "location": "Okinawa, Japan", "name": "Kesuke Miyagi", "profile_image_url": "http://twitter.com/system/user/profile_image/718443/normal/kesuke.png", "screen_name": "kesuke", "url": "http://twitter.com/kesuke"}}'

======================================================================
FAIL: Test the twitter.User AsJsonString method
----------------------------------------------------------------------
Traceback (most recent call last):
  File "twitter_test.py", line 224, in testAsJsonString
    self._GetSampleUser().AsJsonString())
AssertionError: '{"description": "Indeterminate things", "id": 673483, "location": "San Francisco, CA", "name": "DeWitt", "profile_image_url": "http:\\/\\/twitter.com\\/system\\/user\\/profile_image\\/673483\\/normal\\/me.jpg", "screen_name": "dewitt", "status": {"created_at": "Fri Jan 26 17:28:19 +0000 2007", "id": 4212713, "text": "\\"Select all\\" and archive your Gmail inbox. The page loads so much faster!"}, "url": "http:\\/\\/unto.net\\/"}' != '{"description": "Indeterminate things", "id": 673483, "location": "San Francisco, CA", "name": "DeWitt", "profile_image_url": "http://twitter.com/system/user/profile_image/673483/normal/me.jpg", "screen_name": "dewitt", "status": {"created_at": "Fri Jan 26 17:28:19 +0000 2007", "id": 4212713, "text": "\\"Select all\\" and archive your Gmail inbox.  The page loads so much faster!"}, "url": "http://unto.net/"}'

----------------------------------------------------------------------
Ran 36 tests in 0.178s

FAILED (failures=2)

# apply the patch
$ patch < python-twitter-0.5-fixjsontests.patch
Hmm... Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|diff -up python-twitter-0.5/twitter_test.py.BAD python-twitter-0.5/twitter_test.py
|--- python-twitter-0.5/twitter_test.py.BAD    2008-10-20 15:02:40.000000000 -0400
|+++ python-twitter-0.5/twitter_test.py 2008-10-20 15:04:53.000000000 -0400
--------------------------
Patching file twitter_test.py using Plan A...
Hunk #1 succeeded at 17.
Hunk #2 succeeded at 146.
done

# run the test again
$ python setup.py test
running test
running egg_info
writing requirements to python_twitter.egg-info/requires.txt
writing python_twitter.egg-info/PKG-INFO
writing top-level names to python_twitter.egg-info/top_level.txt
writing dependency_links to python_twitter.egg-info/dependency_links.txt
reading manifest file 'python_twitter.egg-info/SOURCES.txt'
writing manifest file 'python_twitter.egg-info/SOURCES.txt'
running build_ext
/.amd_mnt/hut/vol/vol0/home/tirto/stanford/hw6/python-twitter-0.5/twitter.py:12: DeprecationWarning: the md5 module is deprecated; use hashlib instead
  import md5
Test the twitter._FileCache.Get method ... ok
Test the twitter._FileCache.GetCachedTime method ... ok
Test the twitter._FileCache constructor ... ok
Test the twitter._FileCache.Remove method ... ok
Test the twitter._FileCache.Set method ... ok
Test the twitter.Status AsDict method ... ok
Test the twitter.Status AsJsonString method ... ok
Test the twitter.Status __eq__ method ... ok
Test all of the twitter.Status getters and setters ... ok
Test the twitter.Status constructor ... ok
Test the twitter.Status NewFromJsonDict method ... ok
Test all of the twitter.Status properties ... ok
Test various permutations of Status relative_created_at ... ok
Test the twitter.User AsDict method ... ok
Test the twitter.User AsJsonString method ... ok
Test the twitter.User __eq__ method ... ok
Test all of the twitter.User getters and setters ... ok
Test the twitter.User constructor ... ok
Test the twitter.User NewFromJsonDict method ... ok
Test all of the twitter.User properties ... ok
Test the twitter.Api CreateFriendship method ... ok
Test the twitter.Api DestroyDirectMessage method ... ok
Test the twitter.Api DestroyFriendship method ... ok
Test the twitter.Api DestroyStatus method ... ok
Test the twitter.Api GetDirectMessages method ... ok
Test the twitter.Api GetFeatured method ... ok
Test the twitter.Api GetFollowers method ... ok
Test the twitter.Api GetFriends method ... ok
Test the twitter.Api GetFriendsTimeline method ... ok
Test the twitter.Api GetPublicTimeline method ... ok
Test the twitter.Api GetReplies method ... ok
Test the twitter.Api GetStatus method ... ok
Test the twitter.Api GetUser method ... ok
Test the twitter.Api GetUserTimeline method ... ok
Test the twitter.Api PostDirectMessage method ... ok
Test the twitter.Api PostUpdate method ... ok

----------------------------------------------------------------------
Ran 36 tests in 0.176s

OK
code

hth, tirto

[Update 5/26] You may find the built-in set operations in Python helpful as you traverse the social graph.

code format="python"
>>> ryans_friends = ['mike','jeff']
>>> mikes_friends = ['jeff','kelly']
>>>
>>> ryan_set = set(ryans_friends)
>>> mike_set = set(mikes_friends)
>>> mike_set - ryan_set
set(['kelly'])
>>> redundant_set = set(['mike','mike','mike'])
>>> redundant_set
set(['mike'])
>>> ryan_set & mike_set
set(['jeff'])
>>> ryan_set | mike_set #union
set(['kelly', 'mike', 'jeff'])
>>> ryan_set & mike_set #intersection
set(['jeff'])

# Also try
fol = set( a.get_followers('aweigend') )
fri = set( a.get_friends('aweigend') )
fol & fri
# this gets the intersection of the two sets. In this case it tells you who follows in both directions.
# look at the online documentation for other set operations
code