crawler::Link Class Reference

Detailed Description

This is a basic class representing a url.

Some basic information about a url is stored in instances of this class:

  url        - the url this link represents
  scheme     - the scheme part of the url
  netloc     - the netloc part of the url
  path       - the path part of the url
  query      - the query part of the url
  parents    - list of parent links (all the Links that link to this page)
  children   - list of child links (the Links that this page links to)
  pagechildren - list of child pages, including children of embedded content
  embedded   - list of links to embedded content
  anchors    - list of anchors defined on the page
  reqanchors - list of anchors requested for this page (anchor->link*)
  depth      - the number of clicks needed to reach this page from the base urls
  isinternal - whether the link is considered to be internal
  isyanked   - whether the link should be checked at all
  isfetched  - whether the link has already been fetched
  ispage     - whether the link represents a page
  mtime      - modification time (in seconds since the Epoch)
  size       - the size of this document
  mimetype   - the content-type of the document
  encoding   - the character set used in the document
  title      - the title of this document (unicode)
  author     - the author of this document (unicode)
  status     - the result of retrieving the document
  linkproblems - list of problems with retrieving the link
  pageproblems - list of problems in the parsed page
  redirectdepth - the number of this redirect (0 if not a redirect)
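
As an illustration of how the url-related attributes relate to each other, the following is a minimal sketch, not the code from crawler.py. The class name LinkSketch is hypothetical, only a handful of the attributes are modelled, and the use of urllib.parse.urlsplit to derive scheme, netloc, path and query is an assumption about how such a split could be done.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class LinkSketch:
    """Hypothetical stand-in for crawler.Link; NOT the real class."""

    url: str
    # Parts of the url, filled in by __post_init__ below.
    scheme: str = ""
    netloc: str = ""
    path: str = ""
    query: str = ""
    # Graph relations between links.
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)
    # Bookkeeping flags and counters.
    depth: int = 0
    isinternal: bool = False
    isfetched: bool = False

    def __post_init__(self):
        # Assumption: the url parts follow the standard library's split.
        parts = urlsplit(self.url)
        self.scheme = parts.scheme
        self.netloc = parts.netloc
        self.path = parts.path
        self.query = parts.query


link = LinkSketch("http://example.com/docs/index.html?lang=en")
print(link.scheme, link.netloc, link.path, link.query)
```

Each instance gets its own parents and children lists through default_factory, so links do not accidentally share one list.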

   Instances of this class should be made through a site instance
   by adding internal urls and calling crawl().
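
The following toy sketch shows one way the workflow above could hang together: urls are registered on a site object, and crawling fills in parents, children, depth and isfetched on shared link instances. ToySite, ToyLink, get_link and add_internal are hypothetical names, and the in-memory pages dict stands in for real fetching; none of this is taken from crawler.py.

```python
from collections import deque


class ToyLink:
    """Hypothetical, much reduced link object."""

    def __init__(self, url):
        self.url = url
        self.parents = []
        self.children = []
        self.depth = None
        self.isfetched = False


class ToySite:
    """Hypothetical site that owns all link instances."""

    def __init__(self, pages):
        # pages maps a url to the urls it links to (replaces HTTP fetching).
        self._pages = pages
        self._links = {}
        self._bases = []

    def get_link(self, url):
        # One shared instance per url, so parents/children stay consistent.
        if url not in self._links:
            self._links[url] = ToyLink(url)
        return self._links[url]

    def add_internal(self, url):
        self._bases.append(self.get_link(url))

    def crawl(self):
        # Breadth-first walk: depth is the number of clicks from a base url.
        queue = deque()
        for base in self._bases:
            base.depth = 0
            queue.append(base)
        while queue:
            link = queue.popleft()
            if link.isfetched:
                continue
            link.isfetched = True
            for target in self._pages.get(link.url, []):
                child = self.get_link(target)
                link.children.append(child)
                child.parents.append(link)
                if child.depth is None:
                    child.depth = link.depth + 1
                    queue.append(child)
        return self._links


pages = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
}
site = ToySite(pages)
site.add_internal("http://example.com/")
links = site.crawl()
```

Creating links only through the site keeps a single instance per url, which is one plausible reason for the advice above; the actual motivation in crawler.py may differ.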

Definition at line 282 of file crawler.py.

