Andreas Weigend
Social Data Revolution
Spring 2010
MS&E 237
Stanford University


Homework 4: Twitter

Assigned: Thu Apr 22, 2010 Due: Thu Apr 29, 2010, noon
Submit to: mse237@gmail.com
Note: You are welcome to work in a group for this assignment, but please submit your own code. Feel free to discuss questions about Python programming broadly.

This assignment has two main goals. The first is to introduce you to another API, one that exposes a wealth of information in tweets and social graphs. The second is to get you to discover interesting people, in the hope of finding long-term engagement.

Whitelisting Information

To get a reasonable amount of data from Twitter (20k requests per hour instead of 150), we have contacted Twitter to get everyone in the class whitelisted. To do this, please do the following:
1) Log in to Twitter with the account you will be using for this homework.
2) Go here
3) In the reason, make sure to include "ATTN: Brian Sutorius. FOR: STANFORD MS&E 237"
4) Enter your IP address (see below)
It will probably take a day or two, so do this early.

How to get your IP address:
Go to the network (physical location) where you plan on completing most of the assignment.
Mac:
http://www.wikihow.com/Find-Your-IP-Address-on-a-Mac
PC:
http://www.wikihow.com/Find-the-IP-Address-of-Your-PC
http://www.dowling.edu/mydowling/tech/ipaddress.html

The Assignment

For this assignment, please create 10 relevant friend recommendations for each of 10 friends (100 recommendations in total). Before you send the recommendations to each friend, rank them yourself (1 worst, 10 best) according to how you think your friend will rank them (i.e., by relevance). Keep this ranking to yourself. Then ask each friend to rank them (1 worst, 10 best) and give a short response about why they are good and/or bad. Please submit the recommendations, ratings, and comments using this Google Form.

Deliverables to mse237@gmail.com:
- Your script/program
- A short write-up that describes clearly your conceptual approach, and briefly discusses the constraints you face and the rationale for taking the approach.
----- Please note, you are now allowed to work and submit your solution in teams of 2! Please do still submit 10 recommendations each. -----

PROBLEM
What do users really want... To discover information for certain domains? Business networking? Perhaps even dating?

HYPOTHESES
Data sources and recommendation approaches:
  • Analyzing a user's social graph (who follows whom, who replies to whom, who retweets whom, etc.)
    • Collaborative filtering (You follow A, B and C. Other people who follow A, B and C also follow D, thus we recommend D to you)
    • Similarity (You are followed by A, B and C, all of whom also follow D)
    • Transitive attention (You follow A, B, and C, all of whom follow D)
  • Semantics (tweets)
    • Hashtags (who else uses a similar set of hashtags to mine? Are all hashtags created equal?)
    • Keywords (how do you identify interesting keywords? Bag of words?)
    • Location (how can geolocation be used, and when is it relevant, e.g., for finding people with similar interests near me?)
These are just some hypotheses; we would love to see you use your creativity and think outside the box.
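
As a rough illustration of the "transitive attention" idea above, the sketch below counts how many of the accounts you follow also follow each candidate. The follows dictionary is made-up toy data and recommend_by_transitive_attention is a hypothetical helper name, not part of the course files; in practice you would fill the dictionary from the Twitter API.

```python
def recommend_by_transitive_attention(user, follows, top_n=10):
    """Rank accounts followed by the people `user` follows.

    `follows` maps a username to the set of usernames that user follows.
    """
    already_following = follows.get(user, set())
    counts = {}
    for friend in already_following:
        for candidate in follows.get(friend, set()):
            # Skip the user and anyone they already follow.
            if candidate != user and candidate not in already_following:
                counts[candidate] = counts.get(candidate, 0) + 1
    # Sort by count (descending), then by name so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Toy data, made up for illustration only.
follows = {
    'you': set(['a', 'b', 'c']),
    'a': set(['d', 'e', 'you']),
    'b': set(['d', 'c']),
    'c': set(['d', 'e']),
}
print(recommend_by_transitive_attention('you', follows))
```

Here d is followed by all three of your friends and e by two, so d comes first. The same counting loop can be adapted to the other graph hypotheses by changing which edges you walk.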

ACTION
Pick 10 of your friends who have been using Twitter. Randomly assign them to two groups, each of which will see one version of your algorithm.
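
One possible way to do the random assignment is to shuffle the list of friends and cut it in half. This is only a sketch: assign_groups is a hypothetical helper, the usernames are placeholders, and the seed is there just so that you can reproduce the split in your write-up.

```python
import random


def assign_groups(friends, seed=None):
    """Shuffle a copy of `friends` and split it into two halves."""
    rng = random.Random(seed)
    shuffled = list(friends)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Placeholder usernames; replace them with your friends' real ones.
friends = ['friend%02d' % i for i in range(1, 11)]
group_a, group_b = assign_groups(friends, seed=237)
print(group_a)  # these friends see version A of your algorithm
print(group_b)  # these friends see version B
```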

METRICS
Ask each friend to rank your recommendations from 1-10 (1 lowest, 10 highest) and give qualitative feedback in free text about why each recommendation is good and/or bad.

Once you have collected this information, fill out the Google Form once for each friend for whom you make recommendations.

  • Output for each friend:
    • username of friend
  • Output for each recommendation:
    • Please write all of this on one line, separated by commas. The username you're recommending, the rank you think your friend will give (not shared with friend), the actual rank given by your friend, free text about what made the recommendation good, free text about how the recommendation could have been better.
    • It should look like this: suggested_twitter_username, your_rank, friends_rank, "good free text", "bad free text"
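
To keep the lines consistent, you could build them with a small helper like the one sketched below. The function names and the sample values are made up for illustration, and doubling any quote characters inside the free text is an assumption borrowed from the usual CSV convention, not a requirement of the form.

```python
def quote_text(text):
    """Wrap free text in double quotes, doubling any quotes inside it."""
    return '"' + text.replace('"', '""') + '"'


def format_recommendation(username, your_rank, friends_rank, good, bad):
    """Build one output line in the format described above."""
    return '%s, %d, %d, %s, %s' % (
        username, your_rank, friends_rank, quote_text(good), quote_text(bad))


# Sample values, made up for illustration only.
line = format_recommendation('aweigend', 8, 9,
                             'Tweets about data', 'Posts too often')
print(line)
# aweigend, 8, 9, "Tweets about data", "Posts too often"
```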

EVALUATION [quality is more important than quantity]
Create a short write-up comparing your two hypotheses/algorithms, the constraints you faced, and what you learned.
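
If you want a single number for how well your predicted ranks matched a friend's actual ranks, one option (a suggestion, not a requirement) is the Spearman rank correlation. The sketch below uses the simple formula that is valid only when neither ranking contains ties, which holds if each rank from 1 to 10 is used exactly once. The sample ranks are made up.

```python
def spearman(predicted, actual):
    """Spearman rank correlation for two rankings without ties."""
    n = len(predicted)
    if n != len(actual) or n < 2:
        raise ValueError('need two rankings of equal length, at least 2')
    d_squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Made-up ranks for one friend's 10 recommendations.
your_ranks = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
friends_ranks = [9, 10, 8, 6, 7, 5, 4, 2, 3, 1]
print(round(spearman(your_ranks, friends_ranks), 3))
```

A value near 1 means you predicted your friend's ordering well, and a value near -1 means you had it backwards. Averaging this value, or the friends' ratings themselves, within each of your two groups gives one way to compare the two algorithms.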





Getting Started

  • You are welcome to use the programming language of your choice; however, you may find the files we've put together helpful.
  • Twitter API: http://apiwiki.twitter.com/
  • Language specific libraries: http://apiwiki.twitter.com/Libraries
  • How do you manage the necessary data (in-memory data structures? text files? database tables?)
  • Note that your system does not have to be interactive. Offline computation is good enough.
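
As one possible answer to the data-management question above, the sketch below stores "who follows whom" edges in a SQLite table, so that you only need to download each user's graph once. The table layout and the function names are assumptions made for illustration; an in-memory dictionary or text files would work too.

```python
import sqlite3


def open_graph_db(path=':memory:'):
    """Open (or create) a database with one row per follow edge."""
    conn = sqlite3.connect(path)
    conn.execute('CREATE TABLE IF NOT EXISTS follows ('
                 'follower TEXT NOT NULL, followed TEXT NOT NULL, '
                 'PRIMARY KEY (follower, followed))')
    return conn


def add_follows(conn, follower, followed_names):
    """Record that `follower` follows every name in `followed_names`."""
    conn.executemany('INSERT OR IGNORE INTO follows VALUES (?, ?)',
                     [(follower, name) for name in followed_names])
    conn.commit()


def get_followed(conn, follower):
    """Return the sorted list of accounts that `follower` follows."""
    rows = conn.execute('SELECT followed FROM follows '
                        'WHERE follower = ? ORDER BY followed', (follower,))
    return [row[0] for row in rows]


conn = open_graph_db()  # pass a filename instead to keep data between runs
add_follows(conn, 'ryan', ['mike', 'jeff'])
add_follows(conn, 'ryan', ['jeff'])  # duplicate edge is ignored
print(get_followed(conn, 'ryan'))
```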


Working with Python

We have put together a few functions in Python that should make getting data a little easier. The fastest way to get started is to download the TwitterSearch.py file and the simplejson*.tar file. The two sample code files should help you get started.
Note that there are two ways to access the Twitter API detailed here.

Download at least these 4 files from http://tr.im/pytwit
ReadMe.txt - instructions for downloading and installing python-twitter package as well as simplejson
TwitterSearch-sample-code.py
TwitterSearch.py - functions for downloading following/followers and a function to search updates
simplejson-2.0.9.tar - needed for working with json objects returned by the Twitter API

You may also want to use the python-twitter library, which has more features and objects than TwitterSearch.py:
python-twitter-sample-code.py - calls to the functions in python-twitter package for examples
python-twitter-0.5.tar - functions and objects for working with the API, the file is cached here as a convenience for you

If you download the .tar files from the link above, you won't need to use curl. If you have any trouble installing the packages, try moving the python-twitter.py file and the simplejson directory to your working directory. The library can be accessed without installing (this should work on shared machines where you don't have write permissions on public directories).

If you have your own Linux (Ubuntu) machine

If you have your own Linux desktop/server that uses apt-get (e.g., Ubuntu), the process is easier. The following installs Python 2.6, setuptools, and the python-twitter and simplejson packages (feel free to remove as necessary):

sudo apt-get install python2.6 python-setuptools python-twitter python-simplejson

That's it. Now you can create your script.py in your favorite editor. Here's some filler:

#script.py
import twitter

def myFunc():
    print "Hello"

# Call the function; without this line the script prints nothing.
myFunc()

And run the script with:
> python script.py
Hello    (output)

- Ashwin (purohit@stanford.edu)

From here you're on your own; good luck and be creative. If you find any helpful links, feel free to post them here. (There are some good hints below.)




Here's some code to help you cache requests, as in the last assignment:
import sqlite3 as sqlite
import pickle


class CachedAPIWrapper(object):
    """
    Delegate everything to the api, but cache the results in a DB.

    Usage:
        def get_api():
            tapi = twitter.Api(username=USER, password=PASSWORD)
            cache = DBCache('tweeting.db', False)
            return CachedAPIWrapper(tapi, cache)
    """
    def __init__(self, api, cache):
        self._api = api
        self._cache = cache

    def __getattr__(self, name):
        fn = getattr(self._api, name)
        if not callable(fn):
            return fn

        def wrapper(*args, **kwargs):
            # Keyword arguments are part of the cache key, so calls that
            # differ only in their kwargs do not share a cached result.
            key = (args, sorted(kwargs.items()))
            try:
                return self._cache.load(name, key)
            except CacheMissError:
                value = fn(*args, **kwargs)
                self._cache.save(name, key, value)
                return value
            except TypeError:
                # Uncachable -- the arguments could not be pickled.
                # Better to not cache than to blow up entirely.
                print 'Warning: Uncachable!'
                return fn(*args, **kwargs)
        return wrapper


class CacheMissError(Exception):
    """Raised by DBCache.load when nothing is cached for the call."""
    pass


class DBCache(object):
    def __init__(self, dbfile, debug=False):
        self.__connection = sqlite.connect(dbfile)
        self.__connection.text_factory = str
        self.__cursor = self.__connection.cursor()
        self.__debug = debug
        # IF NOT EXISTS lets us reopen an existing cache file safely.
        self.__cursor.execute('CREATE TABLE IF NOT EXISTS `cache` (id INTEGER PRIMARY KEY, `func` varchar(255), `args` blob, `result` blob);')

    def save(self, func, args, result):
        # Arguments and results are pickled so arbitrary objects can be stored.
        self.__cursor.execute('INSERT INTO `cache` (`func`, `args`, `result`) VALUES (?, ?, ?)', (func, pickle.dumps(args), pickle.dumps(result)))
        self.__connection.commit()
        if self.__debug:
            print 'Saving %s(%s) -> %s!' % (func, args, result)

    def load(self, func, args):
        self.__cursor.execute('SELECT `result` FROM `cache` WHERE `func` = ? AND `args` = ?;', (func, pickle.dumps(args)))
        result = self.__cursor.fetchone()
        if result:
            value = pickle.loads(result[0])
            if self.__debug:
                print 'Loaded %s(%s) -> %s!' % (func, args, value)
            return value
        else:
            raise CacheMissError

    def clear(self):
        # Remove every cached result.
        self.__cursor.execute('DELETE FROM `cache`;')
        self.__connection.commit()
 
---polcari




Download this patch if you are using simplejson 2.0.x and you get this error message when running the tests:

$ cd python-twitter-0.5
$ python setup.py test
======================================================================
FAIL: Test the twitter.Status AsJsonString method
----------------------------------------------------------------------
Traceback (most recent call last):
File "twitter_test.py", line 121, in testAsJsonString
self._GetSampleStatus().AsJsonString())
AssertionError: '{"created_at": "Fri Jan 26 23:17:14 +0000 2007", "id": 4391023, "text": "A l\\u00e9gp\\u00e1rn
\\u00e1s haj\\u00f3m tele van angoln\\u00e1kkal.", "user": {"description": "Canvas. JC Penny. Three ninety-
eight.", "id": 718443, "location": "Okinawa, Japan", "name": "Kesuke Miyagi", "profile_image_url": "http:\\/
\\/twitter.com\\/system\\/user\\/profile_image\\/718443\\/normal\\/kesuke.png", "screen_name": "kesuke",
"url": "http:\\/\\/twitter.com\\/kesuke"}}' != '{"created_at": "Fri Jan 26 23:17:14 +0000 2007", "id":
4391023, "text": "A l\\u00e9gp\\u00e1rn\\u00e1s haj\\u00f3m tele van angoln\\u00e1kkal.", "user":
{"description": "Canvas. JC Penny. Three ninety-eight.", "id": 718443, "location": "Okinawa, Japan", "name":
"Kesuke Miyagi", "profile_image_url": "http://twitter.com/system/user/profile_image/718443/normal/kesuke.png",
"screen_name": "kesuke", "url": "http://twitter.com/kesuke"}}'
 
======================================================================
FAIL: Test the twitter.User AsJsonString method
----------------------------------------------------------------------
Traceback (most recent call last):
File "twitter_test.py", line 224, in testAsJsonString
self._GetSampleUser().AsJsonString())
AssertionError: '{"description": "Indeterminate things", "id": 673483, "location": "San Francisco, CA",
"name": "DeWitt", "profile_image_url": "http:\\/\\/twitter.com\\/system\\/user\\/profile_image\\/673483
\\/normal\\/me.jpg", "screen_name": "dewitt", "status": {"created_at": "Fri Jan 26 17:28:19 +0000 2007", "id":
4212713, "text": "\\"Select all\\" and archive your Gmail inbox.  The page loads so much faster!"}, "url":
"http:\\/\\/unto.net\\/"}' != '{"description": "Indeterminate things", "id": 673483, "location": "San
Francisco, CA", "name": "DeWitt", "profile_image_url": "http://twitter.com/system/user/profile_image/673483
/normal/me.jpg", "screen_name": "dewitt", "status": {"created_at": "Fri Jan 26 17:28:19 +0000 2007", "id":
4212713, "text": "\\"Select all\\" and archive your Gmail inbox.  The page loads so much faster!"}, "url":
"http://unto.net/"}'
 
----------------------------------------------------------------------
Ran 36 tests in 0.178s
 
FAILED (failures=2)
 
## apply the patch
 
$ patch < python-twitter-0.5-fixjsontests.patch
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|diff -up python-twitter-0.5/twitter_test.py.BAD python-twitter-0.5/twitter_test.py
|--- python-twitter-0.5/twitter_test.py.BAD     2008-10-20 15:02:40.000000000 -0400
|+++ python-twitter-0.5/twitter_test.py 2008-10-20 15:04:53.000000000 -0400
--------------------------
Patching file twitter_test.py using Plan A...
Hunk #1 succeeded at 17.
Hunk #2 succeeded at 146.
done
 
## run the test again
$ python setup.py test
running test
running egg_info
writing requirements to python_twitter.egg-info/requires.txt
writing python_twitter.egg-info/PKG-INFO
writing top-level names to python_twitter.egg-info/top_level.txt
writing dependency_links to python_twitter.egg-info/dependency_links.txt
reading manifest file 'python_twitter.egg-info/SOURCES.txt'
writing manifest file 'python_twitter.egg-info/SOURCES.txt'
running build_ext
/.amd_mnt/hut/vol/vol0/home/tirto/stanford/hw6/python-twitter-0.5/twitter.py:12: DeprecationWarning: the md5
module is deprecated; use hashlib instead
import md5
Test the twitter._FileCache.Get method ... ok
Test the twitter._FileCache.GetCachedTime method ... ok
Test the twitter._FileCache constructor ... ok
Test the twitter._FileCache.Remove method ... ok
Test the twitter._FileCache.Set method ... ok
Test the twitter.Status AsDict method ... ok
Test the twitter.Status AsJsonString method ... ok
Test the twitter.Status __eq__ method ... ok
Test all of the twitter.Status getters and setters ... ok
Test the twitter.Status constructor ... ok
Test the twitter.Status NewFromJsonDict method ... ok
Test all of the twitter.Status properties ... ok
Test various permutations of Status relative_created_at ... ok
Test the twitter.User AsDict method ... ok
Test the twitter.User AsJsonString method ... ok
Test the twitter.User __eq__ method ... ok
Test all of the twitter.User getters and setters ... ok
Test the twitter.User constructor ... ok
Test the twitter.User NewFromJsonDict method ... ok
Test all of the twitter.User properties ... ok
Test the twitter.Api CreateFriendship method ... ok
Test the twitter.Api DestroyDirectMessage method ... ok
Test the twitter.Api DestroyFriendship method ... ok
Test the twitter.Api DestroyStatus method ... ok
Test the twitter.Api GetDirectMessages method ... ok
Test the twitter.Api GetFeatured method ... ok
Test the twitter.Api GetFollowers method ... ok
Test the twitter.Api GetFriends method ... ok
Test the twitter.Api GetFriendsTimeline method ... ok
Test the twitter.Api GetPublicTimeline method ... ok
Test the twitter.Api GetReplies method ... ok
Test the twitter.Api GetStatus method ... ok
Test the twitter.Api GetUser method ... ok
Test the twitter.Api GetUserTimeline method ... ok
Test the twitter.Api PostDirectMessage method ... ok
Test the twitter.Api PostUpdate method ... ok
 
----------------------------------------------------------------------
Ran 36 tests in 0.176s
 
OK
 


hth,
tirto


[Update 5/26] You may find this information on the built-in set operations in Python helpful as you traverse the social graph.
>>> ryans_friends = ['mike','jeff']
>>> mikes_friends = ['jeff','kelly']
>>>
>>> ryan_set = set(ryans_friends)
>>> mike_set = set(mikes_friends)
>>> mike_set - ryan_set
set(['kelly'])
>>> redundant_set = set(['mike','mike','mike'])
>>> redundant_set
set(['mike'])
>>> ryan_set & mike_set
set(['jeff'])
>>> ryan_set | mike_set #union
set(['kelly', 'mike', 'jeff'])
>>> ryan_set & mike_set #intersection
set(['jeff'])
 
#Also try
fol = set( a.get_followers('aweigend') )
fri = set( a.get_friends('aweigend') )
# This gets the intersection of the two sets. In this case it tells you who follows in both directions.
# Look at the online documentation for other set operations.
fol & fri
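
Building on the set operations above, here is a minimal sketch of one way to turn an overlap into a similarity score: the Jaccard index, i.e. the size of the intersection divided by the size of the union. The jaccard helper is a hypothetical name, and the same function could be applied to sets of hashtags instead of sets of friends.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets: define the similarity as 0
    return float(len(a & b)) / len(union)


ryans_friends = ['mike', 'jeff']
mikes_friends = ['jeff', 'kelly']
# One shared friend (jeff) out of three distinct friends overall.
print(jaccard(ryans_friends, mikes_friends))
```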