unfluff 0.2

Statistical HTML content extraction in Python
unfluff is a statistical content extraction tool written in Python. It removes the useless fluff from arbitrary HTML pages.

Based on methods discussed (and implemented) in various places, but most directly:

 * http://www.spicylogic.com/allenday/blog/2008/05/27/statistical-html-content-extraction/
 * http://www2003.org/cdrom/papers/refereed/p583/p583-gupta.html

An experiment / work in progress.
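
To give a feel for what "statistical" extraction means, here is a minimal sketch of one common heuristic in this family: split the page into blocks, then keep only blocks that carry enough text and are not dominated by link text. This is an illustration only, not unfluff's actual algorithm and not necessarily what the articles above describe. The names BlockCollector and extract_content, the tag sets, and both thresholds are assumptions made up for this example. It uses only the Python standard library.

```python
from html.parser import HTMLParser

# Hypothetical tag sets for this sketch; a real tool would tune these.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "body"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Collects (text, char_count, link_char_count) for each block of text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._parts = []       # raw text pieces of the current block
        self._link_chars = 0   # non-whitespace chars seen inside <a> tags
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block and record it if it contains any text.
        text = " ".join("".join(self._parts).split())
        if text:
            chars = len(text.replace(" ", ""))
            self.blocks.append((text, chars, self._link_chars))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style bodies
        self._parts.append(data)
        if self._in_link:
            self._link_chars += len("".join(data.split()))


def extract_content(html, min_chars=40, max_link_density=0.3):
    """Return the text of blocks that look like content rather than fluff."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    kept = []
    for text, chars, link_chars in collector.blocks:
        # Keep blocks with enough text and a low share of link text.
        if chars >= min_chars and link_chars / chars <= max_link_density:
            kept.append(text)
    return kept
```

Calling extract_content(html_string) returns a list of strings, one per kept block. Navigation bars tend to be short and almost entirely link text, so they fall below both thresholds, while article paragraphs pass.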

Usage:

The command line tool can either take a file or a URL to extract. It prints the content tree to stdout:

unfluff /path/to/something.html

or

unfluff -u 'http://some-website.com/interesting-article.html'
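
For anyone wrapping similar behaviour in their own script, the file-or-URL choice shown above could be parsed roughly as follows. This is a hedged sketch using argparse from the standard library, not unfluff's real command line code. The function name parse_source and the program name extract-sketch are hypothetical; only the positional path and the -u option come from the usage above.

```python
import argparse


def parse_source(argv):
    """Return ("file", path) or ("url", url) from a list of arguments."""
    parser = argparse.ArgumentParser(prog="extract-sketch")
    parser.add_argument("path", nargs="?", default=None,
                        help="path to a local HTML file")
    parser.add_argument("-u", dest="url", default=None,
                        help="URL of a page to fetch and extract")
    args = parser.parse_args(argv)
    # Exactly one of the two input sources must be given.
    if (args.path is None) == (args.url is None):
        parser.error("give either a file path or -u URL, but not both")
    if args.url is not None:
        return ("url", args.url)
    return ("file", args.path)
```

A caller could then hand the value to the matching library function, for example from_file for a ("file", ...) result and from_url for a ("url", ...) result.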

The unfluff library exposes a few functions that all perform the same extraction and differ only in where the input comes from:

import unfluff
unfluff.from_url('http://whatever/')
unfluff.from_file('/tmp/input.html')
unfluff.from_string("<html>inline content</html>")


Both of these are native (C) extensions, which means you're best off looking for them in your friendly neighborhood package manager.

last updated on: January 12th, 2011, 10:14 GMT
price: FREE!
developed by: Tim Cuthbertson
homepage: github.com
license type: BSD License
