| <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| "http://www.w3.org/TR/html4/strict.dtd"> |
| @# This file is processed by EmPy: do not edit |
| @# http://wwwsearch.sf.net/bits/colorize.py |
| @{from colorize import colorize} |
| @{import time} |
| @{import release} |
| @{last_modified = release.svn_id_to_time("$Id$")} |
| <html> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> |
| <meta name="author" content="John J. Lee <jjl@@pobox.com>"> |
| <meta name="date" content="@(time.strftime("%Y-%m-%d", last_modified))"> |
| <meta name="keywords" content="Python,HTML,browser,stateful,web,client,client-side,mechanize,form,ClientForm,ClientCookie,pullparser,WWW::Mechanize"> |
| <title>mechanize</title> |
| <style type="text/css" media="screen">@@import "../styles/style.css";</style> |
| <base href="http://wwwsearch.sourceforge.net/mechanize/"> |
| </head> |
| <body> |
| |
| <div id="sf"><a href="http://sourceforge.net"> |
| <img src="http://sourceforge.net/sflogo.php?group_id=48205&amp;type=2" |
| width="125" height="37" alt="SourceForge.net Logo"></a></div> |
| <!--<img src="../images/sflogo.png"--> |
| |
| <h1>mechanize</h1> |
| |
| <div id="Content"> |
| |
| <p>Stateful programmatic web browsing in Python, after Andy Lester's Perl |
| module <a |
| href="http://search.cpan.org/dist/WWW-Mechanize/"><code>WWW::Mechanize</code> |
| </a>. |
| |
| <ul> |
| <li><code>mechanize.Browser</code> is a subclass of |
| <code>mechanize.UserAgent</code>, which is, in turn, a subclass of |
| <code>urllib2.OpenerDirector</code> (or of |
| <code>ClientCookie.OpenerDirector</code> for pre-2.4 versions of Python), so: |
| <ul> |
| <li>any URL can be opened, not just <code>http:</code> |
| <li><code>mechanize.UserAgent</code> offers easy dynamic configuration of |
| user-agent features like protocol, cookie, redirection and |
| <code>robots.txt</code> handling, without having to build a new |
| <code>OpenerDirector</code> each time (e.g. by calling |
| <code>build_opener()</code>). |
| </ul> |
| <li>Easy HTML form filling, using the <a href="../ClientForm/">ClientForm</a> |
| interface. |
| <li>Convenient link parsing and following. |
| <li>Browser history (<code>.back()</code> and <code>.reload()</code> |
| methods). |
| <li>The <code>Referer</code> HTTP header is added properly (optional). |
| <li>Automatic observance of <a |
| href="http://www.robotstxt.org/wc/norobots.html"> |
| <code>robots.txt</code></a>. |
| <li>Automatic handling of HTTP-Equiv and Refresh, using <a |
| href="../ClientCookie/">ClientCookie</a>. |
| </ul> |
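<p>To illustrate the point that an <code>OpenerDirector</code> is not limited to
<code>http:</code> URLs, here is a minimal sketch. It assumes Python 3, where
the <code>urllib2</code> functionality lives in <code>urllib.request</code>,
and it uses only the standard library rather than mechanize itself. A
<code>data:</code> URL is used so that no network access is needed.

```python
import urllib.request

# build_opener() returns an OpenerDirector with the default handlers
# installed (HTTP, FTP, file:, data: and so on).
opener = urllib.request.build_opener()

# Open a non-HTTP URL through the same .open() interface.
response = opener.open("data:text/plain,hello%20world")
body = response.read()  # bytes, already percent-decoded
print(response.geturl())
print(body.decode("ascii"))
response.close()
```

<p>Since <code>mechanize.Browser</code> inherits from
<code>OpenerDirector</code>, the same <code>.open()</code> call style applies
to it.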
| |
| |
| <a name="examples"></a> |
| <h2>Examples</h2> |
| |
| <p>The two examples below are just to give the gist. There are also some <a |
| href="./#tests">actual working examples</a>. |
| |
| @{colorize(r""" |
| import re |
| from mechanize import Browser |
| |
| br = Browser() |
| br.open("http://www.example.com/") |
| # follow second link with element text matching regular expression |
| response1 = br.follow_link(text_regex=re.compile(r"cheese\s*shop"), nr=1) |
| assert br.viewing_html() |
| print br.title() |
| print response1.geturl() |
| print response1.info() # headers |
| print response1.read() # body |
| response1.close() # (shown for clarity; in fact Browser does this for you) |
| |
| br.select_form(name="order") |
| # Browser passes through unknown attributes (including methods) |
| # to the selected HTMLForm (from ClientForm). |
| br["cheeses"] = ["mozzarella", "caerphilly"] # (the method here is __setitem__) |
| response2 = br.submit() # submit current form |
| |
| response3 = br.back() # back to cheese shop (same data as response1) |
| # the history mechanism returns cached response objects |
| # we can still use the response, even though we closed it: |
| response3.seek(0) |
| response3.read() |
| response4 = br.reload() # fetches from server |
| |
| for form in br.forms(): |
| print form |
| # .links() optionally accepts the keyword args of .follow_/.find_link() |
| for link in br.links(url_regex=re.compile(r"python\.org")): |
| print link |
| br.follow_link(link) # takes EITHER Link instance OR keyword args |
| br.back() |
| """)} |
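<p>The <code>text_regex</code> and <code>nr</code> arguments used above select
the <code>nr</code>-th (zero-based) link whose text matches a regular
expression. The sketch below shows that selection rule using only the Python 3
standard library. It is not mechanize's implementation:
<code>LinkCollector</code>, <code>find_link</code> and the sample page are
hypothetical names and data made up for illustration.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def find_link(html, text_regex, nr=0):
    """Return the nr-th (zero-based) link whose text matches text_regex."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    matches = [link for link in collector.links if text_regex.search(link[1])]
    if nr >= len(matches):
        raise LookupError("no matching link")
    return matches[nr]


page = """
<a href="/about">About</a>
<a href="/shop1">Cheese shop</a>
<a href="/shop2">The other cheese  shop</a>
"""
# nr=1 asks for the second matching link, as in the example above.
url, text = find_link(page, re.compile(r"cheese\s*shop", re.IGNORECASE), nr=1)
print(url, text)
```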
| |
| <p>You may control the browser's policy by using the methods of |
| <code>mechanize.Browser</code>'s base class, <code>mechanize.UserAgent</code>. |
| For example: |
| |
| @{colorize(""" |
| import logging |
| import sys |
| from mechanize import Browser |
| |
| br = Browser() |
| # Don't handle HTTP-EQUIV headers (HTTP headers embedded in HTML). |
| br.set_handle_equiv(False) |
| # Ignore robots.txt. Do not do this without thought and consideration. |
| br.set_handle_robots(False) |
| # Don't handle cookies |
| br.set_cookiejar() |
| # Supply your own ClientCookie.CookieJar (NOTE: cookie handling is ON by |
| # default: no need to do this unless you have some reason to use a |
| # particular cookiejar) |
| br.set_cookiejar(cj) |
| # Log information about HTTP redirects and Refreshes. |
| br.set_debug_redirects(True) |
| # Log HTTP response bodies (i.e. the HTML, most of the time). |
| br.set_debug_responses(True) |
| # Print HTTP headers. |
| br.set_debug_http(True) |
| |
| # To make sure you're seeing all debug output: |
| for logger in [ |
| logging.getLogger("ClientCookie"), |
| logging.getLogger("cookielib"), |
| ]: |
| logger.addHandler(logging.StreamHandler(sys.stdout)) |
| logger.setLevel(logging.INFO) |
| """)} |
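<p>To make concrete what observing <code>robots.txt</code> means before you
decide to switch it off, here is a small sketch using the Python 3 standard
library module <code>urllib.robotparser</code>. It is not how mechanize
performs the check internally; the <code>robots.txt</code> content and the
user-agent name <code>my-client</code> are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from memory so no network access is needed.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch() answers: may this user agent request this URL?
allowed = parser.can_fetch("my-client", "http://www.example.com/index.html")
blocked = parser.can_fetch("my-client", "http://www.example.com/private/data.html")
print(allowed, blocked)
```

<p>A client that observes <code>robots.txt</code> refuses to fetch URLs for
which this kind of check fails; calling <code>set_handle_robots(False)</code>
turns that behaviour off.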
| |
| |
| <a name="docs"></a> |
| <h2>Documentation</h2> |
| |
| <p>Full documentation is in the docstrings. |
| |
| |
| <a name="credits"></a> |
| <h2>Credits</h2> |
| |
| <p>Thanks to Ian Bicking, for persuading me that a <code>UserAgent</code> class |
| would be useful, and to everyone who has reported bugs. |
| |
| <p>And of course thanks to Andy Lester for the original, <a |
| href="http://search.cpan.org/dist/WWW-Mechanize/"><code>WWW::Mechanize</code> |
| </a>. |
| |
| |
| <a name="todo"></a> |
| <h2>Todo</h2> |
| |
| <p>Contributions welcome! |
| |
| <h3>Specific to mechanize</h3> |
| <ul> |
| <li>Make encoding_finder public, I guess. |
| <li>Fix BeautifulSoup support to use a single BeautifulSoup instance |
| per page. |
| <li>Test BeautifulSoup support better / fix encoding issue. |
| <li>Support Mark Pilgrim's universal encoding detector? |
| <li>Add another History implementation or two and finalise interface. |
| <li>History cache expiration. |
| <li>Investigate possible leak (see Balazs Ree's list posting). |
| <li>Add <code>Browser.form_as_string()</code> and |
| <code>Browser.__str__()</code> methods. |
| <li>Add two-way links between BeautifulSoup &amp; ClientForm object models. |
| </ul> |
| |
| <h3>mechanize documentation</h3> |
| <ul> |
| <li>Auth / proxies. |
| <li>Document means of processing response on ad-hoc basis with |
| .set_response() - e.g. to fix bad encoding in Content-type header or |
| clean up bad HTML. |
| <li>Add an example to the documentation showing that you can pass None as the |
| handle argument to <code>mechanize.UserAgent</code> methods and then call |
| .add_handler() if you need to give it a specific handler instance to use for |
| one of the things UserAgent already handles. |
| <li>Add more functional tests. |
| </ul> |
| |
| <h3>Basic protocols / standards support</h3> |
| <ul> |
| <li>Implement RFC 3986 URL absolutization. |
| <li>Figure out the Right Thing (if such a thing exists) for %-encoding. |
| <li>How do IRIs fit into the world? |
| <li>IDNA (ClientCookie) -- must read about security stuff first. |
| <li>Unicode support in general (not sure yet how/when/whether this will |
| happen). |
| <li>Provide per-connection access to timeouts (ClientCookie). |
| <li>Keep-alive / connection caching. |
| <li>Pipelining?? |
| <li>Content negotiation. |
| </ul> |
| |
| |
| <a name="download"></a> |
| <h2>Getting mechanize</h2> |
| |
| <p>You can install the <a href="./#source">old-fashioned way</a>, or using <a |
| href="http://peak.telecommunity.com/DevCenter/EasyInstall">EasyInstall</a>. I |
| recommend the latter even though EasyInstall is still in alpha, because it will |
| automatically ensure you have the necessary dependencies, downloading them if |
| necessary. |
| |
| <p><a href="./#svn">Subversion (SVN) access</a> is also available. |
| |
| <p>Since EasyInstall is new, I include some instructions below, but mechanize |
| follows standard EasyInstall / <code>setuptools</code> conventions, so you |
| should refer to the <a |
| href="http://peak.telecommunity.com/DevCenter/EasyInstall">EasyInstall</a> and |
| <a href="http://peak.telecommunity.com/DevCenter/setuptools">setuptools</a> |
| documentation if you need more detailed or up-to-date instructions. |
| |
| <h2>EasyInstall / setuptools</h2> |
| |
| <p>The benefit of EasyInstall and the new <code>setuptools</code>-supporting |
| <code>setup.py</code> is that they grab all dependencies for you (viz. |
| ClientForm, ClientCookie, and either pullparser or |
| <a href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a>). |
| |
| <p><strong>You need EasyInstall version 0.6a8 or newer.</strong> |
| |
| <h3>Using EasyInstall to download and install mechanize</h3> |
| |
| <ol> |
| <li><a href="http://peak.telecommunity.com/DevCenter/EasyInstall#installing-easy-install"> |
| Install easy_install</a> (you need version 0.6a8 or newer) |
| <li><code>easy_install mechanize</code> |
| </ol> |
| |
| <p>If you're on a Unix-like OS, you may need root permissions for that last |
| step (or see the <a |
| href="http://peak.telecommunity.com/DevCenter/EasyInstall">EasyInstall |
| documentation</a> for other installation options). |
| |
| <p>If you already have mechanize installed as a <a |
| href="http://peak.telecommunity.com/DevCenter/PythonEggs">Python Egg</a> (as |
| you do if you installed using EasyInstall, or using <code>setup.py |
| install</code> from mechanize 0.0.10a or newer), you can upgrade to the latest |
| version using: |
| |
| <pre>easy_install --upgrade mechanize</pre> |
| |
| <p>You probably want to read up on the <code>-m</code> option to |
| <code>easy_install</code>, which lets you install multiple versions of a |
| package. |
| |
| <a name="svnhead"></a> |
| <h3>Using EasyInstall to download and install the latest in-development (SVN HEAD) version of mechanize</h3> |
| |
| <pre>easy_install "mechanize==dev"</pre> |
| |
| <p>Note that this will not necessarily grab the SVN versions of dependencies, |
| such as ClientCookie: it will use SVN to fetch dependencies if and only if the |
| SVN HEAD version of mechanize declares itself to depend on the SVN versions of |
| those dependencies; even then, those declared dependencies won't necessarily be |
| on SVN HEAD, but rather a particular revision. If you want SVN HEAD for a |
| dependency project, you should ask for it explicitly by running |
| <code>easy_install "projectname==dev"</code> for that project. |
| |
| <p>Note also that you can still carry on using a plain old SVN checkout as |
| usual if you like (optionally in conjunction with <a |
| href="./#develop"><code>setup.py develop</code></a> – this is |
| particularly useful on Windows, since it functions rather like symlinks). |
| |
| <h3>Using setup.py from a .tar.gz, .zip or an SVN checkout to download and install mechanize</h3> |
| |
| <p><code>setup.py</code> should correctly resolve and download dependencies: |
| |
| <pre>python setup.py install</pre> |
| |
| <p>Or, to get access to the same options that <code>easy_install</code> |
| accepts, use the <code>easy_install</code> distutils command instead of |
| <code>install</code> (see <code>python setup.py --help easy_install</code>) |
| |
| <pre>python setup.py easy_install mechanize</pre> |
| |
| <a name="develop"></a> |
| <h3>Using setup.py to install mechanize for development work on mechanize</h3> |
| |
| <p><strong>Note: this section is only useful for people who want to change |
| mechanize</strong>: It is not useful to do this if all you want is to <a |
| href="./#svnhead">keep up with SVN</a>. |
| |
| <p>For development of mechanize using EasyInstall (see the <a |
| href="http://peak.telecommunity.com/DevCenter/setuptools">setuptools</a> docs |
| for details), you have the option of using the <code>develop</code> distutils |
| command. This is particularly useful on Windows, since it functions rather |
| like symlinks. Get the mechanize source, then: |
| |
| <pre>python setup.py develop</pre> |
| |
| <p>Note that after every <code>svn update</code> on a |
| <code>develop</code>-installed project, you should run <code>setup.py |
| develop</code> to ensure that project's dependencies are updated if required. |
| |
| <p>Also note that, currently, if you also use the <code>develop</code> |
| distutils command on the <em>dependencies</em> of mechanize to keep up with |
| SVN, you must run <code>setup.py develop</code> for each dependency of |
| mechanize before running it for mechanize itself. As a result, in this case |
| it's probably simplest to just set up your sys.path manually rather than using |
| <code>setup.py develop</code>. |
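<p>Setting up <code>sys.path</code> manually might look like the sketch below.
The directory layout is an assumption made for illustration: it supposes each
project is an SVN checkout under <code>~/svn</code>, with the importable
package sitting directly inside the checkout directory. Adjust the names to
match your own checkouts.

```python
import os
import sys

# Hypothetical layout: one SVN checkout per project under ~/svn.
checkout_root = os.path.expanduser(os.path.join("~", "svn"))
projects = ["mechanize", "ClientCookie", "ClientForm", "pullparser"]

# Prepend in reverse order so that "mechanize" ends up first on sys.path
# and the checkouts take precedence over any installed copies.
for name in reversed(projects):
    path = os.path.join(checkout_root, name)
    if path not in sys.path:
        sys.path.insert(0, path)

print(sys.path[:len(projects)])
```

<p>Run this (or put the equivalent in a start-up script) before importing
mechanize, so that the imports resolve to the checkouts.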
| |
| <p>One convenient way to get the latest source is: |
| |
| <pre>easy_install --editable --build-directory mybuilddir "mechanize==dev"</pre> |
| |
| |
| <a name="source"></a> |
| <h2>Download</h2> |
| <p>All documentation (including this web page) is included in the distribution. |
| |
| <p>This is an alpha release: interfaces may change, and there will be bugs. |
| |
| <p><em>Development release.</em> |
| |
| <ul> |
| @{version = "0.1.0a"} |
| @{win_version = release.win_version(version)} |
| <li><a href="./src/mechanize-@(version).tar.gz">mechanize-@(version).tar.gz</a> |
| <li><a href="./src/mechanize-@(win_version).zip">mechanize-@(win_version).zip</a> |
| <li><a href="./src/ChangeLog.txt">Change Log</a> (included in distribution) |
| <li><a href="./src/">Older versions.</a> |
| </ul> |
| |
| <p>For old-style installation instructions, see the <code>INSTALL.txt</code> file included in |
| the distribution. Better, <a href="./#download">use EasyInstall</a>. |
| |
| |
| <a name="svn"></a> |
| <h2>Subversion</h2> |
| |
| <p>The <a href="http://subversion.tigris.org/">Subversion (SVN)</a> trunk is <a href="http://codespeak.net/svn/wwwsearch/mechanize/trunk#egg=mechanize-dev">http://codespeak.net/svn/wwwsearch/mechanize/trunk</a>, so to check out the source: |
| |
| <pre> |
| svn co http://codespeak.net/svn/wwwsearch/mechanize/trunk mechanize |
| </pre> |
| |
| <a name="tests"></a> |
| <h2>Tests and examples</h2> |
| |
| <h3>Examples</h3> |
| |
| <p>The <code>examples</code> directory in the <a href="./#source">source |
| packages</a> contains a couple of silly, but working, scripts to demonstrate |
| basic use of the module. Note that it's in the nature of web scraping for such |
| scripts to break, so don't be too surprised if that happens – do let me |
| know, though! |
| |
| <p>It's worth knowing also that the examples on the <a |
| href="../ClientForm/">ClientForm web page</a> are useful for mechanize users, |
| and are now real runnable scripts rather than just documentation. |
| |
| <h3>Unit tests</h3> |
| |
| <p>Note that the dependencies of mechanize (ClientCookie, ClientForm, and |
| pullparser) have their own unit tests, which must be run separately. |
| |
| <p>To run the unit tests (none of which access the network), run the following |
| command: |
| |
| <pre>python test.py</pre> |
| |
| <p>This runs the tests against the source files extracted from the |
| package. For help on command line options: |
| |
| <pre>python test.py --help</pre> |
| |
| |
| <h2>See also</h2> |
| |
| <p>There are several wrappers around mechanize designed for functional testing |
| of web applications: |
| |
| <ul> |
| |
| <li><a href="http://cheeseshop.python.org/pypi?:action=display&amp;name=zope.testbrowser"> |
| <code>zope.testbrowser</code></a> (or |
| <a href="http://cheeseshop.python.org/pypi?%3Aaction=display&amp;name=ZopeTestbrowser"> |
| <code>ZopeTestBrowser</code></a>, the standalone version). |
| <li><a href="http://www.idyll.org/~t/www-tools/twill.html">twill</a>. |
| </ul> |
| |
| <p>Another related project is Richard Jones' <a href="http://mechanicalcat.net/tech/webunit/">webunit</a> |
| (this is not the same as Steven Purcell's <a |
| href="http://webunit.sourceforge.net/">code of the same name</a>). webunit and |
| mechanize are quite similar. On the minus side, webunit is missing things like |
| browser history, high-level forms and links handling, thorough cookie handling, |
| refresh redirection, adding of the Referer header, observance of robots.txt and |
| easy extensibility. On the plus side, webunit has a bunch of utility functions |
| bound up in its WebFetcher class, which look useful for writing tests (though |
| they'd be easy to duplicate using mechanize). In general, webunit has more of |
| a frameworky emphasis, with aims limited to writing tests, where mechanize and |
| the modules it depends on try hard to be general-purpose libraries. |
| |
| <p>There are many related links in the <a |
| href="../bits/GeneralFAQ.html">General FAQ</a> page, too. |
| |
| |
| <a name="faq"></a> |
| <h2>FAQs</h2> |
| <ul> |
| <li>Which version of Python do I need? |
| <p>2.2.1 or above. |
| <li>What else do I need? |
| <p><a href="../ClientCookie/">ClientCookie</a>, |
| <a href="../ClientForm/">ClientForm</a> and |
| <a href="../pullparser/">pullparser</a>. |
| <p>The versions of those required modules are listed in the |
| <code>setup.py</code> for mechanize (included with the download). The |
| dependencies are automatically fetched by <a |
| href="http://peak.telecommunity.com/DevCenter/EasyInstall">EasyInstall</a> |
| (or by <a href="./#source">downloading</a> a mechanize source package and |
| running <code>python setup.py install</code>). If you like you can fetch |
| and install them manually, instead – see the <code>INSTALL.txt</code> |
| file (included with the distribution). |
| <li>Which license? |
| <p>mechanize is dual-licensed: you may pick either the |
| <a href="http://www.opensource.org/licenses/bsd-license.php">BSD license</a>, |
| or the <a href="http://www.zope.org/Resources/ZPL">ZPL 2.1</a> (both are |
| included in the distribution). |
| </ul> |
| |
| <p>I prefer questions and comments to be sent to the <a |
| href="http://lists.sourceforge.net/lists/listinfo/wwwsearch-general"> |
| mailing list</a> rather than direct to me. |
| |
| <p><a href="mailto:jjl@@pobox.com">John J. Lee</a>, |
| @(time.strftime("%B %Y", last_modified)). |
| |
| <hr> |
| |
| </div> |
| |
| <div id="Menu"> |
| |
| @(release.navbar('mechanize')) |
| |
| <br> |
| |
| <a href="./#examples">Examples</a><br> |
| <a href="./#docs">Documentation</a><br> |
| <a href="./#todo">To-do</a><br> |
| <a href="./#download">Download</a><br> |
| <a href="./#svn">Subversion</a><br> |
| <a href="./#tests">More examples</a><br> |
| <a href="./#faq">FAQs</a><br> |
| |
| </div> |
| |
| |
| </body> |
| </html> |