Project 3 -- CMPSCI 377 (Fall 2008)

NOTE: All sample files for this assignment are on the edlab machines and can be found in the following directory:

    cd /courses/cs300/cs377/cs377.f2008/project2

1 Overview

Goals of this assignment:

- understanding threads;
- understanding mutual exclusion;
- using concurrency to hide I/O latency;
- using the work-queue model.

For this assignment, you will write a mini web spider. Search engines use web spiders (also called crawlers) to retrieve documents recursively from the Internet.

2 Spider

2.1 Input

Your program, to be called spider, will take three command-line inputs:

1) The root URL to start from
2) The maximum depth to crawl
3) The number of worker threads to spawn

2.1.1 Input: Root URL

The root website will be specified in one of the following forms:

    http://www.cs.umass.edu
    http://www.cs.umass.edu/
    http://www.cs.umass.edu/index.html

See the section on the helper functions for using parse_single_URL() to parse this.

2.1.2 Input: Depth

The maximum depth tells your crawler how far to recurse. A depth of zero means that the root URL should be retrieved, but no others. A depth of one indicates that the root URL should be retrieved, along with all pages it links to, but no others.

2.1.3 Input: Threads

This is the number of worker threads to spawn for crawling web pages. You will have one additional thread that does the parsing of pages to find new URLs.

2.2 Crawling Web Pages

Your program will use a work-queue style of concurrency, where multiple threads pull work off of a single queue. Each worker thread will pull a URL off the front of a queue, retrieve that web page into a buffer, then insert that buffer into another queue for parsing. A separate thread will pull the buffers off of the parsing work queue, parse them for new URLs, insert those URLs back onto the work queue, and so on. A worker thread may not retrieve another web page until its previous page has been processed. This implies an ordering constraint!

No page should be crawled more than once, so your program should track which pages have been visited. A page is identified by the combination of host and file, so if the same file appears on more than one host, you should crawl each of them.

3 Threading, Retrieving and Parsing Pages

3.1 Retrieving Web Pages

You will be provided with a simple socket library that makes it easy to connect to a given web server and read the contents of a particular file. The interface to that library is contained in simplesocket.h. The following code provides the basics of retrieving a page:

    clientsocket sock (host.c_str(), 80, 0, false);
    if (sock.connect()) {
        sprintf (buf, "GET /%s HTTP/1.0\r\n\r\n", file.c_str());
        sock.write (buf, strlen(buf));
        int ret;
        int size = 0;
        sock.setTimeout(5);
        while ((ret = sock.read(buf+size, MAX_READ-1-size)) > 0) {
            size += ret;
        }
        sock.close();
    }

This code will time out after 5 seconds if it fails to retrieve any data. The return value of read() is the number of bytes that have been read from the socket. Multiple reads may be required to retrieve the page, up to MAX_READ bytes. Your code should not read more than MAX_READ bytes, so it will only get URLs that appear in the first MAX_READ bytes of the page. The code is also set to fail connecting if it doesn't complete after 5 seconds.

3.2 Parsing Web Pages for URLs

We have written a simple URL parser for you (see url.h) that can be called using:

    parse_URLs(buf, size, urls);

where buf is the buffer you read from the web server, size is the size of the buffer, and urls is a set containing url_t structs (see url.h). This isn't the smartest parser ever, so don't expect it to get every URL, just the simpler ones. It should find plenty of URLs in most web pages to crawl.
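Putting sections 3.1 and 3.2 together, a single fetch-then-parse step might look like the sketch below. This is only a sketch, written under the assumption that clientsocket, parse_URLs(), url_t, and MAX_READ have exactly the interfaces shown above; the provided headers simplesocket.h and url.h are the authoritative reference, and the function name fetch_and_parse is made up for illustration.

    // Sketch: retrieve one page into a buffer, then extract its URLs.
    // Assumes clientsocket (simplesocket.h), parse_URLs()/url_t (url.h),
    // and a MAX_READ constant, as described in the handout.
    #include <cstdio>
    #include <cstring>
    #include <set>
    #include <string>
    #include <vector>

    #include "simplesocket.h"
    #include "url.h"

    void fetch_and_parse (const std::string &host, const std::string &file,
                          std::set<url_t> &urls)
    {
        std::vector<char> page(MAX_READ);    // heap buffer; keeps stack usage small
        char *buf = &page[0];

        clientsocket sock (host.c_str(), 80, 0, false);
        if (!sock.connect())
            return;                          // could not connect; skip this page

        sprintf (buf, "GET /%s HTTP/1.0\r\n\r\n", file.c_str());
        sock.write (buf, strlen(buf));

        int ret;
        int size = 0;
        sock.setTimeout(5);                  // give up if no data arrives for 5 seconds
        while ((ret = sock.read(buf+size, MAX_READ-1-size)) > 0) {
            size += ret;                     // accumulate until EOF, error, or MAX_READ
        }
        sock.close();

        // Hand the raw page to the provided parser; it adds every url_t it
        // finds in the first MAX_READ bytes to the 'urls' set.
        parse_URLs(buf, size, urls);
    }

In the multi-threaded version of your spider, the fetching half of this function belongs in a worker thread, while the parse_URLs() call belongs in the parsing thread, with the buffer handed between them through the parsing work queue.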
3.3 Starting Threads

You should get your program to work as a single-threaded program first, then make it multi-threaded. This is the hard part. You will be using the popular pthreads threading package to complete your spider. The types and functions you should be concerned with are:

    mutexes:                                 pthread_mutex_t
    condition variables:                     pthread_cond_t
    pthread identifiers:                     pthread_t
    pthread attributes:                      pthread_attr_t
    initialization for mutexes:              pthread_mutex_init
    initialization for condition variables:  pthread_cond_init
    thread join:                             pthread_join
    thread create:                           pthread_create
    lock:                                    pthread_mutex_lock
    unlock:                                  pthread_mutex_unlock
    cv signal:                               pthread_cond_signal
    cv broadcast:                            pthread_cond_broadcast
    cv wait:                                 pthread_cond_wait

3.3.1 Limiting Stack Sizes

For all of your threads, please limit their stack sizes. Remember to initialize the attribute object with pthread_attr_init before calling pthread_attr_setstacksize:

    status = pthread_attr_init(&attr);
    if (status) {
        cout << "pthread_attr_init returned " << status << endl;
        exit(1);
    }
    status = pthread_attr_setstacksize(&attr, 5*1024*1024);
    if (status) {
        cout << "pthread_attr_setstacksize returned " << status << endl;
        exit(1);
    }

3.3.2 Starting Threads

    status = pthread_create(&thread_id[i], &attr,
                            (void * (*)(void *)) worker_thread,
                            (void *) i);

This assumes that worker_thread is declared as:

    void worker_thread (void *arg);

Note that you can "cheat" and use arg to pass integers. For instance:

    int thread_id = (int) arg;

3.3.3 Joining Threads

You may need to wait for a thread to complete using join:

    pthread_join(parse_thread_id, NULL);

4 Output

Your program should create only two pieces of output, and it must match these formats exactly. The requester (worker thread) should output:

    cout << "requester " << thread_id << " url " << host << "/" << file << endl;

right before adding a buffer to the parser's work queue. The parsing thread should output:

    cout << "service requester " << thread << " url " << url.host << "/" << url.file << endl;

after parsing the page, and before adding the new URLs to the work queue. Note that when using cout, you should be careful about mutual exclusion, as you don't want two pieces of output corrupting one another.

5 Compiling, Testing and Hints

5.1 Compiling

Use the following command to compile your spider:

    g++ -Wall -g -o spider spider.cc libspider.a -lpthread

5.2 Debugging

Notice that your program should be much faster when running with a number of threads than when it runs with just one thread. Verify this by running it with /usr/bin/time. Make sure you link your program with -lpthread, or it won't actually spawn any threads (thanks, GNU libc). You should not be holding any locks when connecting to a server or retrieving a web page.

We have set up a tree-structured web page here:

    http://www.cs.umass.edu/~mcorner/cs377/root_tree_1000.html

It has 3 levels, with a branching factor of 10. This might be helpful in debugging your spider.

5.3 Hints

The stack size for each thread is limited to STACK_SIZE in thread.h. You will get odd segfaults if you go over this size, so be careful about creating a huge number of stack variables, or very large ones.

The hardest part may be deciding when to quit! One way is to track how many pages are currently in the work queue, plus the number that have been removed from the work queue and are currently being retrieved. If the sum of those two things is zero, and the parser is not parsing anything, the program is done.
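The sketch below shows one way to implement that counting idea with a pthreads mutex and condition variable: a URL work queue that also tracks how many pages are outstanding (queued or currently being handled) and wakes every blocked worker once the count reaches zero. The class and its method names are hypothetical, it uses std::queue internally, and it leaves out the depth limit and the visited set; only the pthread calls and the url_t type come from the handout.

    #include <pthread.h>
    #include <queue>

    #include "url.h"

    class url_queue {
    public:
        url_queue () : outstanding_(0), done_(false) {
            pthread_mutex_init(&lock_, NULL);
            pthread_cond_init(&cond_, NULL);
        }

        // Called for every new URL that should be crawled (by main for the
        // root URL, and by the parsing thread for URLs it discovers).
        void push (const url_t &u) {
            pthread_mutex_lock(&lock_);
            queue_.push(u);
            outstanding_++;                     // queued work counts as outstanding
            pthread_cond_signal(&cond_);
            pthread_mutex_unlock(&lock_);
        }

        // Called by a worker thread; returns false once the crawl is finished.
        bool pop (url_t &u) {
            pthread_mutex_lock(&lock_);
            while (queue_.empty() && !done_)
                pthread_cond_wait(&cond_, &lock_);
            if (queue_.empty()) {               // woke up because the crawl is done
                pthread_mutex_unlock(&lock_);
                return false;
            }
            u = queue_.front();
            queue_.pop();                       // still outstanding until page_done()
            pthread_mutex_unlock(&lock_);
            return true;
        }

        // Called once a page has been retrieved *and* parsed (or abandoned).
        // When nothing is queued and nothing is in flight, the crawl is over.
        void page_done () {
            pthread_mutex_lock(&lock_);
            outstanding_--;
            if (outstanding_ == 0) {
                done_ = true;
                pthread_cond_broadcast(&cond_); // wake every blocked worker so it can exit
            }
            pthread_mutex_unlock(&lock_);
        }

    private:
        std::queue<url_t> queue_;
        int               outstanding_;        // queued + currently-being-processed pages
        bool              done_;
        pthread_mutex_t   lock_;
        pthread_cond_t    cond_;
    };

For the count to stay correct, the parser should push any newly discovered URLs before calling page_done() for the page they came from, and every URL a worker pops must eventually be matched by exactly one page_done() call, even when the connection fails.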
6 Handing Project In

Your project will be handed in using the autograding system. The autograder is checking for correctness and performance. Please see the web page for details on how to submit your solution. Submit to the autograder as follows:

    submit 3 spider.cc

NOTE: you have to specify a 3, which corresponds to this spider assignment.