08_Search.topsy

Andreas Weigend, MS&E 237, Stanford University, Spring 2010
**Social Data Revolution**

=Class 8: Search=
Class Date: Apr 22, 2010
Company: Topsy

=Highlights=
 * How we define relevance has evolved from information production to information finding to information filtering/ranking
 * A significant information signal lies in how people interact on the Web, but Google and other traditional search engines ignore it
 * We are moving from a system where people pay to send information to a system where people pay to have other people read their information

=Follow-ups=
 * **Some criticisms of a search engine that sorts based on popularity and influence on Twitter**
 * Twitter has too much influence on the popularity of tweets and tweeters given its "Suggested Users List"
 * Link: Topsy is a good idea in theory only
 * **Some competition for Topsy**
 * TipTop launched in November 2009 and was featured at the Web 2.0 Expo in NYC (also in November 2009)
 * Link: TipTop
 * Link: Launch review of TipTop
 * Twitter search, though criticized for returning poor results, is still the primary way users search for information on Twitter
 * Link: Twitter Search
 * Google's PageRank vs. real-time search
 * Video: YouTube, key 1f-FSWgH07s
 * **An overview of real-time search engines**
 * Link: Real-time search in its infancy
 * Link: Alternative Search Engines

=History of Search=
Prof. Andreas Weigend

-Technology

 * index
 * crawl (or spider)

-Speed

 * Store everything -> need search -> to be fast, need to build an index
 * Trade-off: queries are very fast, but the index requires pre-computation and storage
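
The speed trade-off above can be illustrated with a minimal inverted index. This is only a sketch under simple assumptions (whitespace tokenization, documents held in memory); the names `build_index`, `search`, and the sample `docs` are hypothetical and not from the lecture. The index is computed once up front (the pre-computation and storage cost), after which each query is a fast set intersection instead of a scan over every document.

```python
from collections import defaultdict


def build_index(docs):
    """Pre-compute a mapping from each term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the postings of the first term, then intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, for illustration only.
docs = {
    1: "real time search on twitter",
    2: "web search with link analysis",
    3: "twitter social graph",
}
index = build_index(docs)
matches = search(index, "twitter search")
```

With these sample documents, only document 1 contains both "twitter" and "search".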

-Relevance (Algorithmic)

 * Evolution of how relevance is defined:
 * Information production -> Information finding -> Information filtering/ranking
 * How do you pay? Money or attention?
 * you pay with your money, buy software

Finding without search: before search engines

 * Guess
 * common gateway to internet
 * Browse
 * Yahoo! manually generated search results
 * Not scalable
 * Difficult to maintain
 * DMOZ: Open Directory Project

Early search engines

 * Infoseek
 * Lycos
 * AltaVista

New Problem: relevance

 * How to rank pages? What to show on top?
 * What information can be used to help with the decision?
 * Within page
 * Location of search term on page
 * Number of occurrences of search term on page
 * Static: link structure (reliable unless the links come from a link farm)
 * Number of links going into a page
 * Leverages other websites
 * Dynamic: click behavior: choice within set of links
 * Short vs. Long clicks
 * What information does the user use? This leverages users
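
As a rough illustration of how the three kinds of signal listed above (within page, static links, dynamic clicks) could be combined into one score, here is a minimal sketch. The function `page_score`, its parameters, and the weights are all assumptions made for illustration; no real search engine's formula is implied.

```python
def page_score(text, query_term, inbound_links, long_clicks, short_clicks):
    """Combine within-page, link, and click signals into one illustrative score."""
    words = text.lower().split()
    term = query_term.lower()
    positions = [i for i, word in enumerate(words) if word == term]
    if not positions:
        return 0.0  # the page does not mention the term at all

    # Within page: number of occurrences and location of the first occurrence.
    occurrences = len(positions)
    position_score = 1.0 - positions[0] / len(words)  # 1.0 when the term comes first

    # Static: number of links going into the page (weight 0.1 is arbitrary).
    link_score = 0.1 * inbound_links

    # Dynamic: share of long clicks, i.e. visits where the user stayed on the page.
    total_clicks = long_clicks + short_clicks
    click_score = long_clicks / total_clicks if total_clicks else 0.0

    return occurrences + position_score + link_score + click_score


early = page_score("search engines rank pages", "search", 10, 3, 1)
late = page_score("engines rank pages for search", "search", 10, 3, 1)
```

All else being equal, the page that mentions the term earlier (`early`) scores higher than the one that mentions it at the end (`late`).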

Vertical search

 * Music: information sources
 * payment: from buy to rent

Personalized search
Video: YouTube, key EKuG2M6R4VM
 * Explicit: customization
 * Implicit: based on user's past behavior; needs persistent history
 * Problem: multiple personalities
 * Examples: A9, Google, ...

Relevance is everything

 * The search paradigm: 2.4 words, a few clicks, and done
 * Relevance is 'speed'
 * Relevance is relative

Relevance is hard to measure

 * poorly defined, subjective notion
 * analysts have focused on surrogates that are easier to measure
 * methodology important
 * development cycle

=Topsy=
The past and future of search on the internet
Vipul Prakash and Rishab Ghosh



The AltaVista model of the web

 * a collection of documents
 * key innovations
 * full-text search
 * comprehensive crawling
 * multi-word queries

The AltaVista Conundrum

 * the expanding web
 * web goes from large to very large
 * to say nothing of spam
 * content is no longer a sufficient indicator of relevance

The Google model of the web
PageRank: "The Anatomy of a Large-Scale Hypertextual Web Search Engine"; Map/Reduce: "MapReduce: Simplified Data Processing on Large Clusters"; Anchor Text
 * websites as journals and authors, pages as articles, and links as citations
 * key innovation
 * PageRank + Map/Reduce + Anchor Text
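
To make the "links as citations" idea concrete, here is a minimal sketch of PageRank computed by power iteration on a tiny hypothetical link graph. This is the textbook formulation, not Google's production system, which ran at a vastly larger scale on distributed infrastructure. The function name `pagerank`, the damping factor of 0.85, the iteration count, and the sample `links` graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to (its citations)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline, then receives shares from its citers.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page with no outgoing links: spread its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is cited by three pages, "d" by none.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

In this sample graph, page "c" ends up with the highest rank because it collects the most citations, and "a" benefits from being the only page that "c" cites.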

The Google conundrum

 * but the web became social
 * many unrelated authors per site, and in many cases per page
 * breaks the central assumption of PageRank: a website is no longer equivalent to an author
 * comment spam and "Google bombing" allowed bloggers to define Google results
 * in response, Google introduced "nofollow" in 2005, ignoring the new source of signal
 * revolutionary changes on the web, but no innovation in search

The Topsy model of the web

 * web as a network of authors, links as social citations
 * track each author, independently of sites and pages
 * citation quality derived from author reputation
 * author reputation derived from social graph
 * social citations are a gold mine, not a garbage heap
 * social citations rank search results
 * key innovations: RPF, cargo, autoauto
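
The chain described above (author reputation from the social graph, citation quality from author reputation, search ranking from social citations) can be sketched as follows. This is not Topsy's actual algorithm, which these notes do not describe; the reputation formula (a log-scaled follower count), the function names, and the sample data are all hypothetical.

```python
import math
from collections import defaultdict


def author_reputation(follows):
    """follows: (follower, followed) pairs taken from the social graph."""
    followers = defaultdict(set)
    authors = set()
    for follower, followed in follows:
        followers[followed].add(follower)
        authors.update((follower, followed))
    # Assumed reputation measure: log-scaled follower count, always positive.
    return {author: math.log(2 + len(followers[author])) for author in authors}


def rank_by_social_citations(citations, reputation):
    """citations: (author, url) pairs, e.g. links shared in tweets."""
    citing_authors = defaultdict(set)
    for author, url in citations:
        citing_authors[url].add(author)  # each author counts once per url
    scores = {
        url: sum(reputation.get(author, 0.0) for author in authors)
        for url, authors in citing_authors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data: alice has three followers, carol and dave have none.
follows = [("bob", "alice"), ("carol", "alice"), ("dave", "alice"), ("alice", "bob")]
citations = [
    ("alice", "http://example.com/a"),
    ("carol", "http://example.com/b"),
    ("dave", "http://example.com/b"),
    ("carol", "http://example.com/b"),  # repeat citation, ignored
]
reputation = author_reputation(follows)
ranked = rank_by_social_citations(citations, reputation)
```

In this sample, the page cited once by a well-followed author outranks the page cited by two authors with no followers, which is the sense in which citation quality depends on who is citing rather than on raw counts.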

How Topsy interprets a twitter page

 * author
 * social graph
 * social citation

Features on topsy.com

 * web
 * photos
 * tweets
 * trending

Q&A
 * **Why are you only crawling Twitter?**
 * It's a good starting point: a lot of citations and a dense network of people, all giving good rankings
 * **What's the difference between Topsy and Google?**
 * Topsy: new document bias
 * Google: old document bias
 * **Things to consider after class**
 * There will be search for a long time, as long as there is information overload
 * Our new notion of identity: different from the Google model of the web
 * Where does real-time really matter?

Edited by: Neha Kothari (nkot@stanford.edu), Yinfeng (yinfeng@stanford.edu)