What are the essential tools that every IT person should have in the modern IT world? That is the idea behind a new Kingston University module called IT Toolbox. Over a 12-week semester, first-year students will be guided through a series of activities such as blogging, running a server, client- and server-side scripting, search, social networking and problem solving. Each of those activities will be published here, and anyone is welcome to join in.

Google: big and interesting? (Toolbox session 2)

Written by: Jonathan Briggs

October 3, 2009

Video from Jonathan Briggs on Vimeo.

This is a lecture for IT Toolbox. As I mentioned last week, I am trying to get students to ask themselves hard questions about common technology systems, to prepare them for analysis, design and build projects of their own in the future.

Key questions for this session

  1. What was search like before Google and how did Google become so powerful?
  2. What sorts of technologies are involved?
  3. How does Google decide the order for search results?
  4. How does Google make money?
  5. Why does Google do all the other things it does?
  6. What will happen in the future?
  7. Terminology you should know about search and Google
  8. What can we learn from looking at Google that will be useful in the future?

Before Google

  • In September 2009, Netcraft (http://news.netcraft.com/) recorded 226,099,841 accessible web sites.
  • Search engines are needed to help people locate the sites they need.
  • Before search there were human-selected lists and then directories.
  • Automated search engines such as AltaVista were able to catalogue much greater numbers of sites.
  • AltaVista (Dec 1995) ranked results according to relevance (a statistical match between a page and a search query).
  • AltaVista developed a “portal-based” model (1998) to improve stickiness (loyalty) and provide revenue (http://searchenginewatch.com/2165951).
  • Google (1998) introduced reputation-based ranking (PageRank), which was less prone to manipulation. Google’s interface was kept very simple and very fast.

A history of search engines

Technologies of search engines

Search engine = crawler + indexer + front end

Specialist servers and software are used for crawling, indexing, load balancing, ranking, proxy serving, image searching, image caching, advertising matching, ad serving, spelling, suggestions, tracking, reporting, data mining, spam filtering and click fraud checking.

  • crawlers (spiders) collect data from web sites and pages by visiting sites and returning data
  • indexers take data and create indices about their content
  • front end matches visitor queries to indexed data and returns content
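
To make the indexer and front end concrete, here is a minimal sketch in Python. It is not how Google does it: the `pages` dictionary, the example URLs and the function names `build_index` and `search` are all made up for illustration, and the "index" is just a map from each word to the pages that contain it.

```python
from collections import defaultdict

def build_index(pages):
    """Indexer: map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Front end: return the URLs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Hypothetical data standing in for what a crawler would bring back.
pages = {
    "http://a.example/": "kingston university it toolbox",
    "http://b.example/": "search engine toolbox",
    "http://c.example/": "university search",
}
index = build_index(pages)
print(sorted(search(index, "toolbox")))            # ['http://a.example/', 'http://b.example/']
print(sorted(search(index, "university search")))  # ['http://c.example/']
```

The point of building the index in advance is that the front end never has to read the pages themselves when a visitor types a query; it only looks words up.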

A crawl of the web probably takes 6-10 weeks.

A crawler starts with a list of URLs, ‘visits’ each one and returns the contents of each document, including any additional URLs (internal and external links), which are added to the list to be crawled - recursively!
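
The crawl loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real spider: `FAKE_WEB` is a made-up miniature web held in memory, and `fake_fetch` stands in for the step that would download a page and extract its links. The sketch uses a queue rather than literal recursion, but the effect is the same: every newly discovered link is added to the list still to be crawled.

```python
from collections import deque

def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages breadth-first, following each newly discovered link once.

    `fetch(url)` must return (content, list_of_links) for that URL.
    """
    queue = deque(seed_urls)
    seen = set(seed_urls)
    collected = {}
    while queue and len(collected) < max_pages:
        url = queue.popleft()
        content, links = fetch(url)
        collected[url] = content
        for link in links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return collected

# Hypothetical miniature "web": url -> (content, outgoing links)
FAKE_WEB = {
    "http://a.example/": ("home page", ["http://a.example/about", "http://b.example/"]),
    "http://a.example/about": ("about us", ["http://a.example/"]),
    "http://b.example/": ("another site", ["http://c.example/"]),
    "http://c.example/": ("dead end", []),
}

def fake_fetch(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, ("", []))

pages = crawl(["http://a.example/"], fake_fetch)
print(list(pages))
```

Starting from a single seed URL, the crawl reaches all four pages, including a site it was never told about directly. The `seen` set is what stops it going round in circles when pages link back to each other.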

Interesting things about crawlers
  • Web analytics packages allow you to see when crawlers have visited your site
  • Google, Bing and Yahoo! provide tools to control how often the spider visits
  • Data collected includes the relative links between sites as well as the data on the pages - this is only one of many types of metadata that are recorded
  • Sites that are updated more frequently are crawled more often
  • You could write a simple crawler in a few lines of code http://www.example-code.com/python/spider_begin.asp

PageRank - how Google decides on the order of its results

  • The original algorithm was produced as part of a computer science project at Stanford University and is published. The current algorithm is a commercial secret.
  • The algorithm ranks pages according to the popularity of the page as measured by the number (and quality) of in-bound links to that page.
  • Pages with more links rank higher than pages with fewer links but…
  • Pages with better links rank higher than pages with poorer links and…
  • Better links come from pages with higher ranks - recursion again!
  • The home page of this site has about 774 links pointing at it including 94 links from other sites.
  • A link from the Guardian is worth more than a link from this site because the Guardian has a higher ranking than this site
  • It is very hard for spammers to create artificial reputation without getting real links from other high ranking sites (and Google works hard to detect these and modifies its algorithm as necessary)
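
The recursive idea (a link is worth more when it comes from a page that itself ranks highly) can be sketched as a repeated calculation. What follows is a simplified version of the published idea, not Google's current algorithm, which is secret. The page names are invented, the damping value of 0.85 is the one usually quoted for the original algorithm, and the number of iterations is an arbitrary choice.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links shares its rank with every page.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical link graph: "a" and "b" each have exactly one in-bound link,
# but the link to "a" comes from a well-linked page ("hub").
links = {
    "fan1": ["hub"],
    "fan2": ["hub"],
    "fan3": ["hub"],
    "hub": ["a"],
    "quiet": ["b"],
    "a": [],
    "b": [],
}
ranks = pagerank(links)
print(ranks["a"] > ranks["b"])  # True
```

Pages "a" and "b" have the same number of in-bound links, yet "a" ends up with the higher rank because its single link comes from a page that is itself well linked.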

BUT relevance is still one of the most important factors in selecting which result is shown first. Within the most relevant pages, the latest version of PageRank is used to define the order.

  • You cannot pay to be listed higher in the results
  • Trying to trick Google’s algorithms can get you banned from the index completely
  • PageRank is not a simple algorithm these days and some sites are more trusted than others
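
One way to picture how relevance and reputation work together is a two-step process: first keep only the pages that match the query, then order those by reputation. The sketch below is a toy under stated assumptions. The `relevance` measure (the fraction of query words found on the page), the `min_relevance` cut-off, the page names and the reputation scores are all invented for illustration and say nothing about how Google actually combines the two.

```python
def relevance(query, text):
    """Crude relevance: fraction of query words that appear in the page text."""
    query_words = set(query.lower().split())
    page_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & page_words) / len(query_words)

def rank_results(query, pages, reputation, min_relevance=0.5):
    """Keep pages relevant enough to the query, then order them by reputation."""
    relevant = [url for url, text in pages.items()
                if relevance(query, text) >= min_relevance]
    return sorted(relevant, key=lambda url: reputation.get(url, 0.0), reverse=True)

# Hypothetical pages and reputation scores.
pages = {
    "news.example": "google search results explained",
    "blog.example": "how google search works",
    "shop.example": "cheap shoes online",
}
reputation = {"news.example": 0.6, "blog.example": 0.1, "shop.example": 0.9}
print(rank_results("google search", pages, reputation))  # ['news.example', 'blog.example']
```

The shop has the highest reputation of the three but never appears, because it is not relevant to the query. Reputation only decides the order among pages that are.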

How does Google make money?

  • Google is an advertising company: it auctions the space on the right-hand side of the search results and also displays ads on other sites.
  • Google made $21 billion from AdWords in 2008
  • AdWords is successful because it presents relevant advertising just as a potential customer has expressed a need or desire (via a search query)

Why does Google do all the other things it does?

  • Sometimes these other things create additional advertising opportunities such as Gmail
  • Sometimes they create additional data insights that help improve search
  • Sometimes they create tools to help advertisers reach their potential customers
  • Often they create additional opportunities for Google to keep talking to its audience
  • Always they are thinking about how they enlarge their business for the benefit of their shareholders

What will happen in the future?

  • Google will continue to improve its algorithm to try to provide the best possible search experience
  • Google will continue to provide tools to encourage advertisers to spend money
  • Bing and Yahoo! will try to create better search experiences for customers
  • Location-based, thematic and ‘meaning-based’ searches are very much in fashion: cinema times, flight bookings, product search, image and video
  • “Semantic Web” ideas provide a fertile ground for future developments
  • Someone may come up with a new idea (and make billions!)

Some terminology you should know about search and Google

  • crawlers, spiders
  • paid and sponsored results versus natural or organic results
  • snippets - the little descriptions under each search result
  • PageRank and trust
  • Relevance and reputation
  • On-page and off-page search engine optimisation - things you can do to improve your relevance and reputation
  • Search engine marketing - buying sponsored links, pay-per-click advertising

What can we learn from looking at Google that will be useful in the future?

  • Simple looking systems can be very complex behind the scenes
  • Understanding the basic principles can help us understand some of the complexity
  • Very big systems require lots of hardware, software, support and power
  • New ideas can disrupt traditional businesses
  • Data about data can be key to creating a business
  • Computer science projects can turn into billion-dollar projects

Recent comments:

On October 5, 2009 at 2:02 PM, Kunal Ramchandani wrote:

A good intro into all things Google! An interesting thing to note for your section on the future is that Google is working on Google Caffeine ("the new Google").
Mashable's take here: http://mashable.com/2009/08/10/google-caffeine/
Where you can test it: http://www2.sandbox.google.com/

www.kunalramchandani.com

Jonathan replies: Thanks Kunal (works with me at the OTHER media on all things Google)
