| <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| "http://www.w3.org/TR/html4/strict.dtd"> |
| <html> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> |
| <meta name="author" content="John J. Lee <jjl@@pobox.com>"> |
| <meta name="date" content="2005-01"> |
| <meta name="keywords" content="Python,HTML,browser,stateful,web,client,client-side,mechanize,form,ClientForm,ClientCookie,pullparser,WWW::Mechanize"> |
| <title>mechanize</title> |
| <style type="text/css" media="screen">@@import "../styles/style.css";</style> |
| <base href="http://wwwsearch.sourceforge.net/mechanize/"> |
| </head> |
| <body> |
| |
| @# This file is processed by EmPy to colorize Python source code |
| @# http://wwwsearch.sf.net/bits/colorize.py |
| @{from colorize import colorize} |
| |
| <div id="sf"><a href="http://sourceforge.net"> |
| <img src="http://sourceforge.net/sflogo.php?group_id=48205&amp;type=2" |
| width="125" height="37" alt="SourceForge.net Logo"></a></div> |
| <!--<img src="../images/sflogo.png"--> |
| |
| <h1>mechanize</h1> |
| |
| <div id="Content"> |
| |
| <p>Stateful programmatic web browsing in Python, after Andy Lester's Perl |
| module <a |
| href="http://search.cpan.org/dist/WWW-Mechanize/"><code>WWW::Mechanize</code> |
| </a>. |
| |
| <ul> |
| <li><code>mechanize.Browser</code> is a subclass of |
| <code>mechanize.UserAgent</code>, which is in turn a subclass of |
| <code>ClientCookie.OpenerDirector</code> (which is like |
| <code>urllib2.OpenerDirector</code>), so any URL can be opened, not just |
| <code>http:</code> URLs. <code>mechanize.UserAgent</code> offers easy dynamic |
| configuration of user-agent features such as protocol, cookie, redirection and |
| <code>robots.txt</code> handling, without having to build a new |
| <code>OpenerDirector</code> each time (e.g. by calling |
| <code>build_opener()</code>). The <code>UserAgent</code> interface is not stable yet, though. |
| <li>Easy HTML form filling, using the <a href="../ClientForm/">ClientForm</a> |
| interface. |
| <li>Convenient link parsing and following. |
| <li>Browser history (<code>.back()</code> and <code>.reload()</code> |
| methods). |
| <li>The <code>Referer</code> HTTP header is added properly (optional). |
| <li>Automatic observance of <a |
| href="http://www.robotstxt.org/wc/norobots.html"> |
| <code>robots.txt</code></a>. |
| <li>In future, it should be possible to use DOMForm (an implementation of the |
| ClientForm interface on top of the HTML DOM) instead of ClientForm. This would |
| allow an easy "escape" to the lower-level HTML DOM API in cases |
| where the higher-level <code>mechanize.Browser</code> / ClientForm API is |
| not sufficient. |
| </ul> |
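| |
| <p>To illustrate the <code>build_opener()</code> point in the first item above, |
| here is a minimal sketch that uses only the standard library, not mechanize |
| itself. The helper names <code>make_opener</code> and |
| <code>has_cookie_handler</code> are hypothetical, introduced only for this |
| illustration, and the imports fall back to the Python 2.4 module names where |
| needed. With plain <code>build_opener()</code>, the usual way to switch a |
| feature such as cookie handling on or off is to build another |
| <code>OpenerDirector</code>: |

```python
try:  # Python 3 module names
    from urllib.request import build_opener, HTTPCookieProcessor
    from http.cookiejar import CookieJar
except ImportError:  # Python 2.4+ module names
    from urllib2 import build_opener, HTTPCookieProcessor
    from cookielib import CookieJar


def make_opener(use_cookies):
    """Hypothetical helper: build a fresh OpenerDirector for one configuration."""
    handlers = []
    if use_cookies:
        handlers.append(HTTPCookieProcessor(CookieJar()))
    return build_opener(*handlers)


def has_cookie_handler(opener):
    """Hypothetical helper: report whether the opener has a cookie handler installed."""
    for handler in opener.handlers:
        if isinstance(handler, HTTPCookieProcessor):
            return True
    return False


# Each configuration needs its own opener object.
with_cookies = make_opener(True)
without_cookies = make_opener(False)
```

| <p>Here <code>with_cookies</code> and <code>without_cookies</code> are two |
| separate opener objects. This is the kind of rebuilding that |
| <code>mechanize.UserAgent</code> is meant to make unnecessary. |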
| |
| <p>An example: |
| |
| @{colorize(r""" |
| import re |
| from mechanize import Browser |
| |
| b = Browser() |
| b.open("http://www.example.com/") |
| # follow second link with element text matching regular expression |
| response = b.follow_link(text_regex=re.compile(r"cheese\s*shop"), nr=1) |
| assert b.viewing_html() |
| print b.title() |
| print response.geturl() |
| print response.info() # headers |
| print response.read() # body |
| response.close() |
| |
| b.select_form(name="order") |
| # Browser passes through unknown attributes (including methods) |
| # to the selected HTMLForm (from ClientForm). |
| b["cheeses"] = ["mozzarella", "caerphilly"] # (the method here is __setitem__) |
| response2 = b.submit() # submit current form |
| |
| response3 = b.back() # back to cheese shop |
| # the history mechanism uses cached requests and responses |
| assert response3 is response |
| # we can still use the response, even though we closed it: |
| response3.seek(0) |
| response3.read() |
| response4 = b.reload() |
| assert response4 is not response3 |
| |
| for form in b.forms(): |
|     print form |
| # .links() optionally accepts the keyword args of .follow_/.find_link() |
| for link in b.links(url_regex=re.compile("python.org")): |
|     print link |
|     b.follow_link(link) # takes EITHER Link instance OR keyword args |
|     b.back() |
| """)} |
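| |
| <p>As a rough idea of what filtering links by <code>url_regex</code> involves, |
| the following sketch collects <code>href</code> values with the standard |
| library HTML parser and keeps those that the regular expression finds a match |
| in. It is not mechanize's implementation: the <code>LinkCollector</code> class |
| and <code>find_links</code> function are hypothetical names, and the use of |
| <code>search()</code> rather than <code>match()</code> is an assumption made |
| for the illustration. |

```python
import re

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class LinkCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> tag it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.urls.append(href)


def find_links(html, url_regex):
    """Return the hrefs in html that url_regex matches anywhere in the URL."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [url for url in collector.urls if url_regex.search(url)]


page = ('<a href="http://www.python.org/">Python</a> '
        '<a href="http://www.example.com/">Other</a>')
matches = find_links(page, re.compile(r"python\.org"))
```

| <p>With the sample <code>page</code> above, <code>matches</code> contains only |
| the <code>python.org</code> URL. |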
| |
| <p>Full documentation is in the docstrings. |
| |
| <p>Thanks to Ian Bicking, for persuading me that a <code>UserAgent</code> class |
| would be useful. |
| |
| |
| <h2>Todo</h2> |
| |
| <ul> |
| <li>Fix <code>.response()</code> method (each call should return independent |
| pointer to same data). |
| <li>Should work with either Python 2.4 <code>urllib2</code> or ClientCookie |
| (currently depends on the latter: it is just a matter of deciding on a way to |
| specify this). |
| <li>Stabilise <code>mechanize.UserAgent</code>. |
| <li>Test with non-http URLs. |
| <li>History cache expiration. |
| <li>Do auth and proxies properly (ClientCookie probably needs some work here, |
| too -- and maybe urllib2 also). Need to configure local squid and apache, |
| yawn... |
| <li>Integrate with DOMForm, and sort out any resulting interface issues |
| (including replacing pullparser). DOMForm will be an optional replacement |
| for ClientForm. |
| <li>Add some utilities useful for testing (e.g. fetching images and stylesheets |
| in a page, and easy assertion of things like cookies sent by the server, |
| redirections, HTTP error codes etc.). |
| </ul> |
| |
| |
| <a name="download"></a> |
| <h2>Download</h2> |
| <p>All documentation (including this web page) is included in the distribution. |
| |
| <p>This is an alpha release: interfaces may change, and there will be bugs. |
| |
| <p><em>Development release.</em> |
| |
| <ul> |
| <li><a href="./src/mechanize-0.0.9a.tar.gz">mechanize-0.0.9a.tar.gz</a> |
| <li><a href="./src/mechanize-0_0_9a.zip">mechanize-0_0_9a.zip</a> |
| <li><a href="./src/ChangeLog.txt">Change Log</a> (included in distribution) |
| <li><a href="./src/">Older versions.</a> |
| </ul> |
| |
| <p>For installation instructions, see the INSTALL file included in the |
| distribution. |
| |
| <h2>See also</h2> |
| |
| <p>Richard Jones' <a href="http://mechanicalcat.net/tech/webunit/">webunit</a> |
| (this is not the same as Steven Purcell's <a |
| href="http://webunit.sourceforge.net/">code of the same name</a>). webunit and |
| mechanize are quite similar. On the minus side, webunit is missing things like |
| browser history, high-level forms and links handling, thorough cookie handling, |
| refresh redirection, adding of the Referer header, observance of robots.txt and |
| easy extensibility. On the plus side, webunit has a bunch of utility functions |
| bound up in its WebFetcher class, which look useful for writing tests (though |
| they'd be easy to duplicate using mechanize). In general, webunit has more of |
| a frameworky emphasis, with aims limited to writing tests, where mechanize and |
| the modules it depends on try hard to be general-purpose libraries. |
| |
| <p>There are many related links in the <a |
| href="../bits/GeneralFAQ.html">General FAQ</a> page, too. |
| |
| |
| <a name="faq"></a> |
| <h2>FAQs</h2> |
| <ul> |
| <li>Which version of Python do I need? |
| <p>2.2 or above. |
| <li>What else do I need? |
| <p><a href="../ClientCookie/">ClientCookie</a> 0.4.19 or newer (<strong>note |
| the required version!</strong>), <a href="../ClientForm/">ClientForm</a> |
| 0.1.x, and <a href="../pullparser/">pullparser</a> 0.0.4b or newer. |
| <li>Which license? |
| <p>The <a href="http://www.opensource.org/licenses/bsd-license.php"> |
| BSD license</a> (included in distribution). |
| </ul> |
| |
| <p><a href="mailto:jjl@@pobox.com">John J. Lee</a>, January 2005. |
| |
| <hr> |
| |
| </div> |
| |
| <div id="Menu"> |
| |
| <a href="..">Home</a><br> |
| <!--<a href=""></a><br>--> |
| |
| <br> |
| |
| <a href="../ClientCookie/">ClientCookie</a><br> |
| <a href="../ClientForm/">ClientForm</a><br> |
| <a href="../DOMForm/">DOMForm</a><br> |
| <a href="../python-spidermonkey/">python-spidermonkey</a><br> |
| <a href="../ClientTable/">ClientTable</a><br> |
| <span class="thispage">mechanize</span><br> |
| <a href="../pullparser/">pullparser</a><br> |
| <a href="../bits/GeneralFAQ.html">General FAQs</a><br> |
| <a href="../bits/urllib2_152.py">1.5.2 urllib2.py</a><br> |
| <a href="../bits/urllib_152.py">1.5.2 urllib.py</a><br> |
| |
| <br> |
| |
| <a href="./#download">Download</a><br> |
| <a href="./#faq">FAQs</a><br> |
| |
| </div> |
| |
| |
| </body> |
| </html> |