
def crawler::Site::_is_yanked ( self, link ) [private]

Check whether the specified url should not be checked at all.
This uses the regular expressions passed with add_yanked_re() and the
robots.txt information that is present.
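
A caller typically registers yank patterns before the crawl starts. A minimal
usage sketch, assuming add_yanked_re() accepts a regular expression string;
the Site constructor arguments shown here are hypothetical and not taken from
this page:

    import crawler

    site = crawler.Site()                # hypothetical constructor call
    site.add_yanked_re(r'\.pdf$')        # skip PDF documents
    site.add_yanked_re(r'/private/')     # skip a private subtree
    # _is_yanked() will now return 'yanked' for any matching url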

Definition at line 150 of file crawler.py.

    def _is_yanked(self, link):
        """Check whether the specified url should not be checked at all.
        This uses the regualr expressions passed with add_yanked_re() and the
        robots information present."""
        # check if it is yanked through the regexps
        for regexp in self._yanked_res.values():
            # if the url matches it is yanked and we can stop
            if regexp.search(link.url) is not None:
                return 'yanked'
        # check if we should avoid external links
        if not link.isinternal and config.AVOID_EXTERNAL_LINKS:
            return 'external avoided'
        # check if we should use robot parsers
        if not config.USE_ROBOTS:
            return False
        # skip schemes not having robots.txt files
        if link.scheme != 'http' and link.scheme != 'https':
            return False
        # skip robot checks for external urls
        # TODO: make this configurable
        if not link.isinternal:
            return False
        # check robots for remaining links
        rp = self._get_robotparser(link)
        if rp is not None and not rp.can_fetch('webcheck', link.url):
            return 'robot restricted'
        # fall back to allowing the url
        return False
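
_get_robotparser() is defined elsewhere in crawler.py and is not shown on this
page. A minimal sketch of what such a helper might look like, assuming
Python's standard robotparser module and a hypothetical per-host cache (the
cache name and the standalone function form are assumptions):

    import robotparser

    _robotparsers = {}   # hypothetical cache, one parser per host

    def get_robotparser(scheme, host):
        """Return a (cached) RobotFileParser for the given host."""
        if host not in _robotparsers:
            rp = robotparser.RobotFileParser()
            rp.set_url('%s://%s/robots.txt' % (scheme, host))
            rp.read()   # fetch and parse the site's robots.txt
            _robotparsers[host] = rp
        return _robotparsers[host]

    # _is_yanked() then asks the parser whether webcheck may fetch the url:
    rp = get_robotparser('http', 'www.example.com')
    print(rp.can_fetch('webcheck', 'http://www.example.com/private/'))

In Python 3 the same class lives in urllib.robotparser.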
