Softpedia
 


LINUX CATEGORIES:



GLOBAL PAGES >>
NEWS ARCHIVE >>
SOFTPEDIA REVIEWS >>
MEET THE EDITORS >>
WEEK'S BEST
  • Linux Kernel 3.9.2 / 3....
  • LibreOffice 3.6.6 / 4.0.3
  • MPlayer 1.1.1
  • systemd 204
  • Arch Linux 2013.05.01
  • Blender 2.67
  • KDE Software Compilatio...
  • CrunchBang Linux Stable...
  • Elementary OS 0.1 / 0.2...
  • SystemRescueCd 3.6.0
  • Home > Linux > Internet > HTTP (WWW)

    reppy 0.1.0

    Download button

    No screenshots available
    Downloads: 97  View global page NEW!  Tell us about an update
    User Rating:
    Rated by:
    NOT RATED
    0 user(s)
    Developer:

    License / Price:

    Last Updated:

    Category:
    Dan Lecocq | More programs
    MIT/X Consortium Lic... / FREE
    November 12th, 2011, 08:33 GMT
    ROOT / Internet / HTTP (WWW)

     Read user reviews (0)  Refer to a friend  Subscribe

    reppy description

    Modern robots.txt Parser for Python

    reppy started out of a lack of memoization support in other robots.txt parsers encountered, and the lack of support for Crawl-delay and Sitemap in the built-in robotparser.

    Matching

    This package supports the 1996 RFC, as well as additional commonly-implemented features, like wildcard matching, crawl-delay, and sitemaps. There are varying approaches to matching Allow and Disallow. One approach is to use the longest match. Another is to use the most specific. This package chooses to follow the directive that is longest, the assumption being that it's the one that is most specific -- a term that is a little difficult to define in this context.

    Usage


    The easiest way to use reppy is to just ask if a url or urls is/are allowed:

    import reppy
    # This implicitly fetches example.com's robot.txt
    reppy.allowed('http://example.com/howdy')
    # => True
    # Now, it's cached based on when it should expire (read more in `Expiration`)
    reppy.allowed('http://example.com/hello')
    # => True
    # It also supports batch queries
    reppy.allowed(['http://example.com/allowed1', 'http://example.com/allowed2', 'http://example.com/disallowed'])
    # => ['http://example.com/allowed1', 'http://example.com/allowed2']
    # Batch queries are even supported accross several domains (though fetches are not done in parallel)
    reppy.allowed(['http://a.com/allowed', 'http://b.com/allowed', 'http://b.com/disallowed'])
    # => ['http://a.com/allowed', 'http://b.com/allowed']


    It's pretty easy to use. The default behavior is to fetch it for you with urllib2

    import reppy
    # Make a reppy object associated with a particular domain
    r = reppy.fetch('http://example.com/robots.txt')


    but you can just as easily parse a string that you fetched.

    import urllib2
    data = urllib2.urlopen('http://example.com/robots.txt').read()
    r = reppy.parse(data)


    Expiration

    The main advantage of having reppy fetch the robots.txt for you is that it can automatically refetch after its data has expired. It's completely transparent to you, so you don't even have to think about it -- just keep using it as normal. Or, if you'd prefer, you can set your own time-to-live, which takes precedence:

    import reppy
    r = reppy.fetch('http://example.com/robots.txt')
    r.ttl
    # => 10800 (How long to live?)
    r.expired()
    # => False (Has it expired?)
    r.remaining()
    # => 10798 (How long until it expires)
    r = reppy.fetch('http://example.com/robots.txt', ttl=1)
    # Wait 2 seconds
    r.expired()
    # => True


    Queries

    Reppy tries to keep track of the host so that you don't have to. This is done automatically when you use fetch, or you can optionally provide the url you fetched it from with parse. Doing so allows you to provide just the path when querying. Otherwise, you must provide the whole url:

    # This is doable
    r = reppy.fetch('http://example.com/robots.txt')
    r.allowed('/')
    r.allowed(['/hello', '/howdy'])
    # And so is this
    data = urllib2.urlopen('http://example.com/robots.txt').read()
    r = reppy.parse(data, url='http://example.com/robots.txt')
    r.allowed(['/', '/hello', '/howdy'])
    # However, we don't implicitly know which domain these are from
    reppy.allowed(['/', '/hello', '/howdy'])


    Crawl-Delay and Sitemaps

    Reppy also exposes the non-RFC, but widely-used Crawl-Delay and Sitemaps attributes. The crawl delay is considered on a per-user agent basis, but the sitemaps are considered global. If they are not specified, the crawl delay is None, and sitemaps is an empty list. For example, if this is my robots.txt:

    User-agent: *
    Crawl-delay: 1
    Sitemap: http://example.com/sitemap.xml
    Sitemap: http://example.com/sitemap2.xml


    Then these are accessible:

    with file('myrobots.txt', 'r') as f:
     r = reppy.parse(f.read())
    r.sitemaps
    # => ['http://example.com/sitemap.xml', 'http://example.com/sitemap2.xml']
    r.crawlDelay
    # => 1

    User-Agent Matching

    You can provide a user agent of your choosing for fetching robots.txt, and then the user agent string we match is defaulted to what appears before the first /. For example, if you provide the user agent as 'MyCrawler/1.0', then we'll use 'MyCrawler' as the string to match against User-agent. Comparisons are case-insensitive, and we do not support wildcards in User-Agent. If this default doesn't suit you, you can provide an alternative:

    # This will match against 'myuseragent' by default
    r = reppy.fetch('http://example.com/robots.txt', userAgent='MyUserAgent/1.0')
    # This will match against 'someotheragent' instead
    r = reppy.fetch('http://example.com/robots.txt', userAgent='MyUserAgent/1.0', userAgentString='someotheragent')


    Path-Matching

    Path matching supports both * and $


    Product's homepage

    Here are some key features of "reppy":

    · Memoization of fetched robots.txt
    · Expiration taken from the Expires header
    · Batch queries
    · Configurable user agent for fetching robots.txt
    · Automatic refetching basing on expiration
    · Support for Crawl-delay
    · Support for Sitemaps
    · Wildcard matching

    Requirements:

    · Python

      


    TAGS:

    robots.txt parser | robots parser | robots.txt | robots | parser

    Go to top

    WindowsGamesDriversMacLinuxScriptsMobileHandheldNews

    SUBMIT PROGRAM   |   ADVERTISE   |   GET HELP   |   SEND US FEEDBACK   |   RSS FEEDS   |   UPDATE YOUR SOFTWARE   |   ROMANIAN FORUM