<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
@# This file is processed by EmPy: do not edit
@# http://wwwsearch.sf.net/bits/colorize.py
@{
from colorize import colorize
import time
import release
last_modified = release.svn_id_to_time("$Id$")
try:
    base
except NameError:
    base = False
}
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<meta name="author" content="John J. Lee &lt;jjl@@pobox.com&gt;">
<meta name="date" content="@(time.strftime("%Y-%m-%d", last_modified))">
<meta name="keywords" content="Python,HTML,HTTP,browser,stateful,web,client,client-side,mechanize,cookie,form,META,HTTP-EQUIV,Refresh,ClientForm,ClientCookie,pullparser,WWW::Mechanize">
<meta name="keywords" content="cookie,HTTP,Python,web,client,client-side,HTML,META,HTTP-EQUIV,Refresh">
<title>mechanize</title>
<style type="text/css" media="screen">@@import "../styles/style.css";</style>
@[if base]<base href="http://wwwsearch.sourceforge.net/mechanize/">@[end if]
</head>
<body>
<div id="sf"><a href="http://sourceforge.net">
<img src="http://sourceforge.net/sflogo.php?group_id=48205&amp;type=2"
width="125" height="37" alt="SourceForge.net Logo"></a></div>
<!--<img src="../images/sflogo.png"-->
<h1>mechanize</h1>
<div id="Content">
<p>Stateful programmatic web browsing in Python, after Andy Lester's Perl
module <a
href="http://search.cpan.org/dist/WWW-Mechanize/"><code>WWW::Mechanize</code>
</a>.
<ul>
<li><code>mechanize.Browser</code> is a subclass of
<code>mechanize.UserAgentBase</code>, which is, in turn, a subclass of
<code>urllib2.OpenerDirector</code> (in fact, of
<code>mechanize.OpenerDirector</code>), so:
<ul>
<li>any URL can be opened, not just <code>http:</code>
<li><code>mechanize.UserAgentBase</code> offers easy dynamic
configuration of user-agent features like protocol, cookie,
redirection and <code>robots.txt</code> handling, without having
to make a new <code>OpenerDirector</code> each time, e.g. by
calling <code>build_opener()</code>.
</ul>
<li>Easy HTML form filling, using <a href="../ClientForm/">ClientForm</a>
interface.
<li>Convenient link parsing and following.
<li>Browser history (<code>.back()</code> and <code>.reload()</code>
methods).
<li>The <code>Referer</code> HTTP header is added properly (optional).
<li>Automatic observance of <a
href="http://www.robotstxt.org/wc/norobots.html">
<code>robots.txt</code></a>.
<li>Automatic handling of HTTP-Equiv and Refresh.
</ul>
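<p>As a rough illustration of what the last two items involve, the sketch
below splits a <code>Refresh</code> value into a delay and a target URL, then
checks two URLs against some made-up <code>robots.txt</code> rules. It uses
only the Python standard library, not mechanize's own handlers, and the
<code>parse_refresh</code> helper is hypothetical (written for this page
only), so treat it as a sketch of the idea rather than of the implementation.
@{colorize(r"""
try:
    from urllib import robotparser  # Python 3 location
except ImportError:
    import robotparser  # Python 2 location

def parse_refresh(value):
    # Hypothetical helper: split "5; url=..." into (delay_seconds, url_or_None).
    delay, sep, rest = value.partition(";")
    url = None
    if sep:
        key, eq, target = rest.strip().partition("=")
        if eq and key.strip().lower() == "url":
            url = target.strip()
    return float(delay), url

delay, url = parse_refresh("5; url=http://www.example.com/next")
# delay is 5.0 and url is "http://www.example.com/next"

# Made-up robots.txt rules, parsed without touching the network.
parser = robotparser.RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /private/"])
can_fetch_public = parser.can_fetch(
    "my-agent", "http://www.example.com/index.html")
can_fetch_private = parser.can_fetch(
    "my-agent", "http://www.example.com/private/x.html")
# can_fetch_public is True; can_fetch_private is False
""")}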
<a name="examples"></a>
<h2>Examples</h2>
<p class="docwarning">This documentation is in need of reorganisation and
extension!</p>
<p>The two examples below are just to give the gist. There are also some <a
href="./#tests">actual working examples</a>.
@{colorize(r"""
import re
from mechanize import Browser
br = Browser()
br.open("http://www.example.com/")
# follow second link with element text matching regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info() # headers
print response1.read() # body
response1.close() # (shown for clarity; in fact Browser does this for you)
br.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm (from ClientForm).
br["cheeses"] = ["mozzarella", "caerphilly"] # (the method here is __setitem__)
response2 = br.submit() # submit current form
# print currently selected form (don't call .submit() on this, use br.submit())
print br.form
response3 = br.back() # back to cheese shop (same data as response1)
# the history mechanism returns cached response objects
# we can still use the response, even though we closed it:
response3.seek(0)
response3.read()
response4 = br.reload() # fetches from server
for form in br.forms():
    print form
# .links() optionally accepts the keyword args of .follow_/.find_link()
for link in br.links(url_regex="python.org"):
    print link
    br.follow_link(link) # takes EITHER Link instance OR keyword args
    br.back()
""")}
<p>You may control the browser's policy by using the methods of
<code>mechanize.Browser</code>'s base class, <code>mechanize.UserAgentBase</code>.
For example:
@{colorize("""
import logging
import sys
from mechanize import Browser
br = Browser()
# Explicitly configure proxies (Browser will attempt to set good defaults).
# Note the userinfo ("joe:password@") and port number (":3128") are optional.
br.set_proxies({"http": "joe:password@myproxy.example.com:3128",
                "ftp": "proxy.example.com",
                })
# Add HTTP Basic/Digest auth username and password for HTTP proxy access.
# (equivalent to using "joe:password@..." form above)
br.add_proxy_password("joe", "password")
# Add HTTP Basic/Digest auth username and password for website access.
br.add_password("http://example.com/protected/", "joe", "password")
# Don't handle HTTP-EQUIV headers (HTTP headers embedded in HTML).
br.set_handle_equiv(False)
# Ignore robots.txt. Do not do this without thought and consideration.
br.set_handle_robots(False)
# Don't add Referer (sic) header
br.set_handle_referer(False)
# Don't handle Refresh redirections
br.set_handle_refresh(False)
# Don't handle cookies
br.set_cookiejar(None)
# Supply your own mechanize.CookieJar (NOTE: cookie handling is ON by
# default: no need to do this unless you have some reason to use a
# particular cookiejar; cj is a mechanize.CookieJar instance you created)
br.set_cookiejar(cj)
# Log information about HTTP redirects and Refreshes.
br.set_debug_redirects(True)
# Log HTTP response bodies (i.e. the HTML, most of the time).
br.set_debug_responses(True)
# Print HTTP headers.
br.set_debug_http(True)
# To make sure you're seeing all debug output:
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)
# Sometimes it's useful to process bad headers or bad HTML:
response = br.response() # this is a copy of response
headers = response.info() # currently, this is a mimetools.Message
headers["Content-type"] = "text/html; charset=utf-8"
response.set_data(response.get_data().replace("<!---", "<!--"))
br.set_response(response)
""")}
<p>mechanize exports the complete interface of <code>urllib2</code>:
@{colorize("""
import mechanize
response = mechanize.urlopen("http://www.example.com/")
print response.read()
""")}
<p>so anything you would normally import from <code>urllib2</code> can
(and should, by preference, to insulate you from future changes) be
imported from mechanize instead. In many cases if you import an
object from mechanize it will be the very same object you would get if
you imported from urllib2. In many other cases, though, the
implementation comes from mechanize, either because bug fixes have
been applied or the functionality of urllib2 has been extended in some
way.
<a name="useragentbase"></a>
<h2>UserAgent vs UserAgentBase</h2>
<p><code>mechanize.UserAgent</code> is a trivial subclass of
<code>mechanize.UserAgentBase</code>, adding just one method,
<code>.set_seekable_responses()</code> (see the <a
href="./doc.html#seekable">documentation on seekable responses</a>).
<p>The reason for the extra class is that
<code>mechanize.Browser</code> depends on seekable response objects
(because response objects are used to implement the browser history).
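<p>To see why seekability matters for history, consider the toy sketch below.
It is not mechanize's implementation: <code>SeekableBody</code> is a
hypothetical wrapper, and it assumes Python 2.6 or newer for the
<code>io</code> module. A body that has been read once can only be read again
if the object holding it supports <code>.seek()</code>.
@{colorize(r"""
import io

class SeekableBody(object):
    # Hypothetical wrapper: buffers a one-shot stream so it can be re-read.
    def __init__(self, stream):
        self._buffer = io.BytesIO(stream.read())

    def read(self, size=-1):
        return self._buffer.read(size)

    def seek(self, offset, whence=0):
        return self._buffer.seek(offset, whence)

# Stand-in for a network response; a real one could not be rewound.
body = SeekableBody(io.BytesIO(b"<html>cheese shop</html>"))
first = body.read()
body.seek(0)          # rewind, as a history mechanism would before reuse
second = body.read()  # same bytes again, without refetching
""")}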
<a name="compatnotes"></a>
<h2>Compatibility</h2>
<p>These notes explain the relationship between mechanize, ClientCookie,
<code>cookielib</code> and <code>urllib2</code>, and which to use when. If
you're just using mechanize, and not any of those other libraries, you can
ignore this section.
<ol>
<li>mechanize works with Python 2.4, Python 2.5, and Python 2.6.
<li>ClientCookie is no longer maintained as a separate package. The code is
now part of mechanize, and its interface is now exported through module
mechanize (since mechanize 0.1.0). Old code can simply be changed to
<code>import mechanize as ClientCookie</code> and should continue to
work.
<li>The cookie handling parts of mechanize are in the Python 2.4 standard
library as module <code>cookielib</code> and extensions to module
<code>urllib2</code>.
</ol>
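<p>For comparison, the standard-library-only arrangement mentioned in the last
item might look like the sketch below. It deliberately does not import
mechanize at all, so it does not mix the two libraries; the fallback imports
are an assumption added so that the same snippet also runs on Python 3, where
the modules were renamed.
@{colorize(r"""
try:
    import http.cookiejar as cookielib   # Python 3 name
    import urllib.request as urllib2     # Python 3 name
except ImportError:
    import cookielib                     # Python 2.4+
    import urllib2

# Build an opener whose requests and responses pass through a cookie jar.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# Nothing has been fetched yet, so the jar is still empty.
handler_names = [h.__class__.__name__ for h in opener.handlers]
""")}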
<p><strong>IMPORTANT:</strong> The following are the ONLY cases where
<code>mechanize</code> and <code>urllib2</code> code are intended to work
together. For all other code, use mechanize
<em><strong>exclusively</strong></em>: do NOT mix use of mechanize and
<code>urllib2</code>!
<ol>
<li>Handler classes that are missing from 2.4's <code>urllib2</code>
(e.g. <code>HTTPRefreshProcessor</code>, <code>HTTPEquivProcessor</code>,
<code>HTTPRobotRulesProcessor</code>) may be used with the
<code>urllib2</code> of Python 2.4 or newer. There are not currently any
functional tests for this in mechanize, however, so this feature may be
broken.
<li>If you want to use <code>mechanize.HTTPRefreshProcessor</code> with Python
&gt;= 2.4's <code>urllib2</code>, you must also use
<code>mechanize.HTTPRedirectHandler</code>.
<li><code>mechanize.HTTPRefererProcessor</code> requires special support from
<code>mechanize.Browser</code>, so cannot be used with vanilla
<code>urllib2</code>.
<li><code>mechanize.HTTPRequestUpgradeProcessor</code> and
<code>mechanize.ResponseUpgradeProcessor</code> are not useful outside of
mechanize.
<li>Request and response objects from code based on <code>urllib2</code> work
with mechanize, and vice-versa.
<li>The classes and functions exported by mechanize in its public interface
that come straight from <code>urllib2</code>
(e.g. <code>FTPHandler</code>, at the time of writing) do work with
mechanize (duh ;-). Exactly which of these classes and functions come
straight from <code>urllib2</code> without extension or modification will
change over time, though, so don't rely on it; instead, just import
everything you need from mechanize, never from <code>urllib2</code>. The
exception is usage as described in the first item in this list, which is
explicitly OK (though not well tested ATM), subject to the other
restrictions in the list above.
</ol>
<a name="docs"></a>
<h2>Documentation</h2>
<p>Full documentation is in the docstrings.
<p>The documentation in the web pages is in need of reorganisation at the
moment, after the merge of ClientCookie into mechanize.
<a name="credits"></a>
<h2>Credits</h2>
<p>Thanks to all the too-numerous-to-list people who reported bugs and provided
patches. Also thanks to Ian Bicking, for persuading me that a
<code>UserAgent</code> class would be useful, and to Ronald Tschalar for advice
on Netscape cookies.
<p>A lot of credit must go to Gisle Aas, who wrote libwww-perl, from which
large parts of mechanize originally derived, and Andy Lester for the original,
<a href="http://search.cpan.org/dist/WWW-Mechanize/"><code>WWW::Mechanize</code>
</a>. Finally, thanks to the (coincidentally-named) Johnny Lee for the MSIE
CookieJar Perl code from which mechanize's support for that is derived.
<a name="todo"></a>
<h2>To do</h2>
<p>Contributions welcome!
<p>The documentation to-do list has moved to the new "docs-in-progress"
directory in SVN.
<p><em>This is <strong>very</strong> roughly in order of priority</em>
<ul>
<li>Test <code>.any_response()</code> two handlers case: ordering.
<li>Test referer bugs (frags and don't add in redirect unless orig
req had Referer)
<li>Remove use of urlparse from _auth.py.
<li>Proper XHTML support!
<li>Fix BeautifulSoup support to use a single BeautifulSoup instance
per page.
<li>Test BeautifulSoup support better / fix encoding issue.
<li>Support BeautifulSoup 3.
<li>Add another History implementation or two and finalise interface.
<li>History cache expiration.
<li>Investigate possible leak further (see Balazs Ree's list posting).
<li>Make <code>EncodingFinder</code> public, I guess (but probably
improve it first). (For example: support Mark Pilgrim's universal
encoding detector?)
<li>Add two-way links between BeautifulSoup &amp; ClientForm object
models.
<li>In 0.2: switch to Python unicode strings everywhere appropriate
(HTTP level should still use byte strings, of course).
<li><code>clean_url()</code>: test browser behaviour. I <em>think</em>
this is correct...
<li>Use a nicer RFC 3986 join / split / unsplit implementation.
<li>Figure out the Right Thing (if such a thing exists) for %-encoding.
<li>How do IRIs fit into the world?
<li>IDNA -- must read about security stuff first.
<li>Unicode support in general.
<li>Provide per-connection access to timeouts.
<li>Keep-alive / connection caching.
<li>Pipelining??
<li>Content negotiation.
<li>gzip transfer encoding (there's already a handler for this in
mechanize, but it's poorly implemented ATM).
<li>proxy.pac parsing (I don't think this needs JS interpretation)
<li>Topological sort for handlers, instead of .handler_order
attribute. Ordering and other dependencies (where unavoidable)
should be defined separate from handlers themselves. Add new
build_opener and deprecate the old one? Actually, _useragent is
probably not far off what I'd have in mind (would just need a
method or two and a base class adding I think), and it's not a high
priority since I guess most people will just use the UserAgent and
Browser classes.
</ul>
<a name="download"></a>
<h2>Getting mechanize</h2>
<p>You can install the <a href="./#source">old-fashioned way</a>, or using <a
href="http://peak.telecommunity.com/DevCenter/EasyInstall">EasyInstall</a>. I
recommend the latter even though EasyInstall is still in alpha, because it will
automatically ensure you have the necessary dependencies, downloading if
necessary.
<p><a href="./#svn">Subversion (SVN) access</a> is also available.
<p>Since EasyInstall is new, I include some instructions below, but mechanize
follows standard EasyInstall / <code>setuptools</code> conventions, so you
should refer to the <a
href="http://peak.telecommunity.com/DevCenter/EasyInstall">EasyInstall</a> and
<a href="http://peak.telecommunity.com/DevCenter/setuptools">setuptools</a>
documentation if you need more detailed or up-to-date instructions.
<h2>EasyInstall / setuptools</h2>
<p>The benefit of EasyInstall and the new <code>setuptools</code>-supporting
<code>setup.py</code> is that they grab all dependencies for you. Also, using
EasyInstall is a one-liner for the common case, to be compared with the usual
download-unpack-install cycle with <code>setup.py</code>.
<h3>Using EasyInstall to download and install mechanize</h3>
<ol>
<li><a href="http://peak.telecommunity.com/DevCenter/EasyInstall#installing-easy-install">
Install easy_install</a>
<li><code>easy_install mechanize</code>
</ol>
<p>If you're on a Unix-like OS, you may need root permissions for that last
step (or see the <a
href="http://peak.telecommunity.com/DevCenter/EasyInstall">EasyInstall
documentation</a> for other installation options).
<p>If you already have mechanize installed as a <a
href="http://peak.telecommunity.com/DevCenter/PythonEggs">Python Egg</a> (as
you do if you installed using EasyInstall, or using <code>setup.py
install</code> from mechanize 0.0.10a or newer), you can upgrade to the latest
version using:
<pre>easy_install --upgrade mechanize</pre>
<p>You may want to read up on the <code>-m</code> option to
<code>easy_install</code>, which lets you install multiple versions of a
package.
<a name="svnhead"></a>
<h3>Using EasyInstall to download and install the latest in-development (SVN HEAD) version of mechanize</h3>
<pre>easy_install "mechanize==dev"</pre>
<p>Note that this will not necessarily grab the SVN versions of dependencies,
such as ClientForm: it will use SVN to fetch dependencies if and only if the
SVN HEAD version of mechanize declares itself to depend on the SVN versions of
those dependencies; even then, those declared dependencies won't necessarily be
on SVN HEAD, but rather a particular revision. If you want SVN HEAD for a
dependency project, you should ask for it explicitly by running
<code>easy_install "projectname==dev"</code> for that project.
<p>Note also that you can still carry on using a plain old SVN checkout as
usual if you like.
<h3>Using setup.py from a .tar.gz, .zip or an SVN checkout to download and install mechanize</h3>
<p><code>setup.py</code> should correctly resolve and download dependencies:
<pre>python setup.py install</pre>
<p>Or, to get access to the same options that <code>easy_install</code>
accepts, use the <code>easy_install</code> distutils command instead of
<code>install</code> (see <code>python setup.py --help easy_install</code>)
<pre>python setup.py easy_install mechanize</pre>
<a name="source"></a>
<h2>Download</h2>
<p>All documentation (including this web page) is included in the distribution.
<p>This is a stable release.
<p><em>Development release.</em>
<ul>
@{version = "0.1.10"}
<li><a href="./src/mechanize-@(version).tar.gz">mechanize-@(version).tar.gz</a>
<li><a href="./src/mechanize-@(version).zip">mechanize-@(version).zip</a>
<li><a href="./src/ChangeLog.txt">Change Log</a> (included in distribution)
<li><a href="./src/">Older versions.</a>
</ul>
<p>For old-style installation instructions, see the <code>INSTALL.txt</code>
file included in the distribution. Better, <a href="./#download">use
EasyInstall</a>.
<a name="svn"></a>
<h2>Subversion</h2>
<p>The <a href="http://subversion.tigris.org/">Subversion (SVN)</a> trunk is <a href="http://codespeak.net/svn/wwwsearch/mechanize/trunk#egg=mechanize-dev">http://codespeak.net/svn/wwwsearch/mechanize/trunk</a>, so to check out the source:
<pre>
svn co http://codespeak.net/svn/wwwsearch/mechanize/trunk mechanize
</pre>
<a name="tests"></a>
<h2>Tests and examples</h2>
<h3>Examples</h3>
<p>The <code>examples</code> directory in the <a href="./#source">source
packages</a> contains a couple of silly, but working, scripts to demonstrate
basic use of the module. Note that it's in the nature of web scraping for such
scripts to break, so don't be too surprised if that happens &#8211; do let me
know, though!
<p>It's worth knowing also that the examples on the <a
href="../ClientForm/">ClientForm web page</a> are useful for mechanize users,
and are now real run-able scripts rather than just documentation.
<h3>Functional tests</h3>
<p>To run the functional tests (which <strong>do</strong> access the network),
run the following
command:
<pre>python functional_tests.py</pre>
<h3>Unit tests</h3>
<p>Note that ClientForm (a dependency of mechanize) has its own unit tests,
which must be run separately.
<p>To run the unit tests (none of which access the network), run the following
command:
<pre>python test.py</pre>
<p>This runs the tests against the source files extracted from the
package. For help on command line options:
<pre>python test.py --help</pre>
<h2>See also</h2>
<p>There are several wrappers around mechanize designed for functional testing
of web applications:
<ul>
<li><a href="http://cheeseshop.python.org/pypi?:action=display&amp;name=zope.testbrowser">
<code>zope.testbrowser</code></a> (or
<a href="http://cheeseshop.python.org/pypi?%3Aaction=display&amp;name=ZopeTestbrowser">
<code>ZopeTestBrowser</code></a>, the standalone version).
<li><a href="http://www.idyll.org/~t/www-tools/twill.html">twill</a>.
</ul>
<p>Also of interest is Richard Jones' <a href="http://mechanicalcat.net/tech/webunit/">webunit</a>
(this is not the same as Steven Purcell's <a
href="http://webunit.sourceforge.net/">code of the same name</a>). webunit and
mechanize are quite similar. On the minus side, webunit is missing things like
browser history, high-level forms and links handling, thorough cookie handling,
refresh redirection, adding of the Referer header, observance of robots.txt and
easy extensibility. On the plus side, webunit has a bunch of utility functions
bound up in its WebFetcher class, which look useful for writing tests (though
they'd be easy to duplicate using mechanize). In general, webunit has more of
a frameworky emphasis, with aims limited to writing tests, where mechanize and
the modules it depends on try hard to be general-purpose libraries.
<p>There are many related links in the <a
href="../bits/GeneralFAQ.html">General FAQ</a> page, too.
<a name="faq"></a>
<h2>FAQs - pre install</h2>
<ul>
<li>Which version of Python do I need?
<p>Python 2.4, 2.5 or 2.6.
<li>What else do I need?
<p>mechanize depends on <a href="../ClientForm/">ClientForm</a>.
<li>Does mechanize depend on BeautifulSoup?
<p>No. mechanize offers a few (still rather experimental) classes that make
use of BeautifulSoup, but these classes are not required to use mechanize.
mechanize bundles BeautifulSoup version 2, so you do not need to install
that module separately. A future version of mechanize will support BeautifulSoup
version 3, at which point mechanize will likely no longer bundle the
module.
<p>The versions of those required modules are listed in the
<code>setup.py</code> for mechanize (included with the download). The
dependencies are automatically fetched by <a
href="http://peak.telecommunity.com/DevCenter/EasyInstall">EasyInstall</a>
(or by <a href="./#source">downloading</a> a mechanize source package and
running <code>python setup.py install</code>). If you like you can fetch
and install them manually, instead &#8211; see the <code>INSTALL.txt</code>
file (included with the distribution).
<li>Which license?
<p>mechanize is dual-licensed: you may pick either the
<a href="http://www.opensource.org/licenses/bsd-license.php">BSD license</a>,
or the <a href="http://www.zope.org/Resources/ZPL">ZPL 2.1</a> (both are
included in the distribution).
</ul>
<a name="usagefaq"></a>
<h2>FAQs - usage</h2>
<ul>
<li>I'm not getting the HTML page I expected to see.
<ul>
<li><a href="http://wwwsearch.sourceforge.net/mechanize/doc.html#debugging">Debugging tips</a>
<li><a href="http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html">More tips</a>
</ul>
<li>I'm <strong><em>sure</em></strong> this page is HTML, why does
<code>mechanize.Browser</code> think otherwise?
@{colorize("""
b = mechanize.Browser(
    # mechanize's XHTML support needs work, so is currently switched off. If
    # we want to get our work done, we have to turn it on by supplying a
    # mechanize.Factory (with XHTML support turned on):
    factory=mechanize.DefaultFactory(i_want_broken_xhtml_support=True)
    )
""")}
</ul>
<p>I prefer questions and comments to be sent to the <a
href="http://lists.sourceforge.net/lists/listinfo/wwwsearch-general">
mailing list</a> rather than direct to me.
<p><a href="mailto:jjl@@pobox.com">John J. Lee</a>,
@(time.strftime("%B %Y", last_modified)).
<hr>
</div>
<div id="Menu">
@(release.navbar('mechanize'))
<br>
<a href="./#examples">Examples</a><br>
<a href="./#compatnotes">Compatibility</a><br>
<a href="./#docs">Documentation</a><br>
<a href="./#todo">To-do</a><br>
<a href="./#download">Download</a><br>
<a href="./#svn">Subversion</a><br>
<a href="./#tests">More examples</a><br>
<a href="./#faq">FAQs</a><br>
</div>
</body>
</html>