Andreas Weigend
Social Data Revolution
MS&E 237, Stanford University, Spring 2010

Class 8: Search

Class Date: Apr 22, 2010
Company: Topsy

Highlights

  • How we define relevance has evolved from information production to information finding to information filtering/ranking
  • There is a significant information signal which is a part of how people interact on the Web, but is ignored by Google and other traditional search engines
  • We are moving from a system where people pay to send information to a system where people pay to have other people read their information

Follow-ups

Some criticisms of a search engine that sorts based on popularity and influence on Twitter
Some competition for Topsy
  • TipTop launched in November of 2009 and was featured at the Web 2.0 Expo in NYC (also in November of 2009)
  • Twitter search, though criticized for returning poor results, is still the primary way users search for information on Twitter
  • Google's PageRank vs. real-time search
An overview of real-time search engines

History of Search

Prof. Andreas Weigend

-Technology

  • index
  • crawl (or spider)

-Speed

  • Store everything -> need search -> to be fast, need to build an index
  • Trade-off: results are very fast, but pre-computing and storage are needed
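
The speed trade-off above can be illustrated with a tiny inverted index. This is a minimal sketch, not any particular engine's implementation: the function names `build_index` and `search` and the sample documents are assumptions made for illustration. The index is pre-computed once (costing time and storage), after which a query only needs set lookups instead of a scan over every document.

```python
from collections import defaultdict


def build_index(docs):
    """Pre-compute a map from each word to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the postings of the first word, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "social data revolution",
    2: "real time search on social networks",
    3: "search engines build an index",
}
index = build_index(docs)
hits = search(index, "social search")
```

Here `hits` contains only document 2, the one document that mentions both query words.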

-Relevance (Algorithmic)

  • Evolution of how relevance is defined:
    • Information production -> Information finding -> Information filtering/ranking
  • How do you pay? Money or attention?
    • you pay with your money, buy software

Find without search: before search engines

  • Guess
  • common gateway to internet
  • Browse
  • Yahoo! manually generated search results
    • Not scalable
    • Difficult to maintain
  • dmoz : open directory project


Early search engines

  • infoseek
  • Lycos
  • AltaVista

New Problem: relevance

  • How to rank pages? what to show on top?
  • What information can be used to help with decision?
    • Within page
      • Location of search term on page
      • Number of occurrences of search term on page
  • Static: link structure (unreliable if the links come from a link farm)
    • Number of links going into a page
    • Leverages other websites
  • Dynamic: click behavior: choice within set of links
    • Short vs. Long clicks
  • What information does the user use? This leverages users
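
Two of the signals listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the scoring formula in `on_page_score`, the 30-second threshold in `long_click_rate`, and both function names are made up here and are not taken from any real search engine. A "long click" is treated as a visit where the user stayed on the result for a while, a "short click" as one where the user bounced back quickly.

```python
def on_page_score(text, term):
    """Within-page signal: occurrence count plus a bonus for an early first occurrence."""
    words = text.lower().split()
    term = term.lower()
    positions = [i for i, word in enumerate(words) if word == term]
    if not positions:
        return 0.0
    # Bonus is 1.0 when the term is the first word and shrinks toward 0 later on.
    position_bonus = 1.0 - positions[0] / len(words)
    return len(positions) + position_bonus


def long_click_rate(dwell_seconds, threshold=30.0):
    """Dynamic signal: fraction of clicks where the user stayed at least `threshold` seconds."""
    if not dwell_seconds:
        return 0.0
    long_clicks = sum(1 for seconds in dwell_seconds if seconds >= threshold)
    return long_clicks / len(dwell_seconds)


page_score = on_page_score("search engines rank search results", "search")
click_score = long_click_rate([5, 45, 60, 2])
```

A real ranker would combine many such signals; the point here is only that the within-page signal comes from the document itself, while the click signal comes from user behavior.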

Vertical search

  • Music: information sources
  • payment: from buy to rent

Personalized search

  • Explicit: customization
  • Implicit: based on user's past behavior; needs persistent history
  • Problem: multiple personalities
  • A9, Google, ...


Relevance is everything

  • The search paradigm: 2.4 words, a few clicks, and done
  • Relevance is 'speed'
  • Relevance is relative

Relevance is hard to measure

  • poorly defined, subjective notion
  • analysts have focused on surrogates that are easier to measure
  • methodology important
  • development cycle


Topsy: The past and future of search on the internet

Vipul Prakash and Rishab Ghosh


The AltaVista model of the web

  • a collection of documents
  • key innovations
    • full text search
    • comprehensive crawling
    • multi-word queries

The AltaVista Conundrum

  • the expanding web
  • web goes from large to very large
  • to say nothing of spam
  • content is no longer a sufficient indicator of relevance

The Google model of the web

  • websites as journals and authors, pages as articles, and links as citations
  • key innovation
    • PageRank + Map/Reduce + Anchor Text
PageRank: The Anatomy of a Large-Scale Hypertextual Web Search Engine
Map/Reduce: MapReduce: Simplified Data Processing on Large Clusters
Anchor Text
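
The "links as citations" idea behind PageRank can be sketched with a simple power iteration. This is a simplified illustration, not Google's production algorithm: the damping factor of 0.85 is the commonly cited value, while the iteration count, the handling of pages without out-links, and the toy link graph are assumptions chosen for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank for a graph given as {page: [pages it links to]}."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly across its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is cited by three pages, "d" by none.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

In this toy graph the most-cited page "c" ends up with the highest rank and the uncited page "d" with the lowest, which is the citation intuition described above.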

The Google conundrum

  • but the web became social
  • many unrelated authors per site, and in many cases per page
  • breaks the central assumption of PageRank: a website was no longer equivalent to an author
  • comment spam and "GoogleBombing" allowed bloggers to define Google results
  • in response, Google introduced "nofollow" in 2005, ignoring the new source of signal
  • revolutionary changes on the web, but no innovation in search

The Topsy model of the web

Social Graph on Twitter

  • web as a network of authors, links as social citations
  • track each author, independently of sites and pages
  • citation quality derived from author reputation
  • author reputation derived from social graph
  • social citations are a gold mine, not a garbage heap
  • social citations rank search results
  • key innovations: RPF, cargo, autoauto
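
The chain "social graph -> author reputation -> citation quality -> ranking" can be sketched as follows. This is not Topsy's actual algorithm, which these notes do not describe in detail. As a loudly labeled assumption, reputation is approximated here by an author's number of distinct followers, and each citation counts as 1 plus the citing author's reputation. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def author_reputation(follows):
    """Assumed stand-in for reputation: number of distinct followers per author."""
    followers = defaultdict(set)
    for follower, followee in follows:
        followers[followee].add(follower)
    return {author: len(fans) for author, fans in followers.items()}


def rank_urls(citations, reputation):
    """Score each URL by the reputation of the distinct authors who cited it."""
    citing_authors = defaultdict(set)
    for author, url in citations:
        citing_authors[url].add(author)  # repeat citations by one author count once
    scores = {
        url: sum(1 + reputation.get(author, 0) for author in authors)
        for url, authors in citing_authors.items()
    }
    # Highest score first; ties broken alphabetically by URL.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical social graph as (follower, followee) pairs.
follows = [("bob", "alice"), ("carol", "alice"), ("dave", "alice"),
           ("erin", "alice"), ("alice", "bob")]
# Hypothetical social citations as (author, url) pairs.
citations = [("alice", "http://example.com/a"),
             ("bob", "http://example.com/b"),
             ("carol", "http://example.com/b"),
             ("dave", "http://example.com/b"),
             ("bob", "http://example.com/b")]

ranked = rank_urls(citations, author_reputation(follows))
```

In this example one citation from a well-followed author outranks three citations from less-followed authors, which is the sense in which citation quality is derived from author reputation rather than from raw citation counts.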

How Topsy interprets a twitter page

  • author
  • social graph
  • social citation

Features on topsy.com


Q&A

Why are you only crawling Twitter?
It's a good starting point; a lot of citation, dense network of people, all giving good rankings
What's the difference between Topsy and Google?
Topsy: new document bias
Google: old document bias
There will be search for a long time, so long as there is information overload.
Things to Consider after Class
Our new notion of identity: different from the Google model of the web.
Where does real-time really matter?


Edited by: Neha Kothari nkot@stanford.edu
Yinfeng yinfeng@stanford.edu