| <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| "http://www.w3.org/TR/html4/strict.dtd"> |
| @# This file is processed by EmPy |
| @{ |
| from colorize import colorize |
| import time |
| import release |
| last_modified = release.last_modified(empy.identify()[0]) |
| try: |
|     base |
| except NameError: |
|     base = False |
| try: |
|     version |
| except NameError: |
|     version = "dummy version" |
| } |
| <html> |
| <!--This file was generated by EmPy: do not edit--> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> |
| <meta name="author" content="John J. Lee <jjl@@pobox.com>"> |
| <meta name="date" content="@(time.strftime("%Y-%m-%d", last_modified))"> |
| <meta name="keywords" content="Python,HTML,HTTP,browser,stateful,web,client,client-side,mechanize,cookie,form,META,HTTP-EQUIV,Refresh,ClientForm,ClientCookie,pullparser,WWW::Mechanize"> |
| <title>mechanize</title> |
| <style type="text/css" media="screen">@@import "../styles/style.css";</style> |
| <!--[if IE 6]> |
| <style type="text/css" media="screen">@@import "../styles/style-ie6.css";</style> |
| <![endif]--> |
| @[if base]<base href="http://wwwsearch.sourceforge.net/mechanize/">@[end if] |
| </head> |
| <body> |
| |
| <div id="sf"><a href="http://sourceforge.net"> |
| <img src="http://sourceforge.net/sflogo.php?group_id=48205&type=2" |
| width="125" height="37" alt="SourceForge.net Logo"></a></div> |
| <!--<img src="../images/sflogo.png"--> |
| |
| <h1>mechanize</h1> |
| |
| <div id="Content"> |
| |
| <p>Stateful programmatic web browsing in Python, after Andy Lester's Perl |
| module <a |
| href="http://search.cpan.org/dist/WWW-Mechanize/"><code>WWW::Mechanize</code> |
| </a>. |
| |
| <ul> |
| |
| <li><code>mechanize.Browser</code> and <code>mechanize.UserAgentBase</code> |
| implement the interface of <code>urllib2.OpenerDirector</code>, so: |
| <ul> |
| <li>any URL can be opened, not just <code>http:</code> |
| |
| <li><code>mechanize.UserAgentBase</code> offers easy dynamic |
| configuration of user-agent features like protocol, cookie, |
| redirection and <code>robots.txt</code> handling, without having |
| to make a new <code>OpenerDirector</code> each time, e.g. by |
| calling <code>build_opener()</code>. |
| |
| </ul> |
| <li>Easy HTML form filling. |
| <li>Convenient link parsing and following. |
| <li>Browser history (<code>.back()</code> and <code>.reload()</code> |
| methods). |
| <li>The <code>Referer</code> HTTP header is added properly (optional). |
| <li>Automatic observance of <a |
| href="http://www.robotstxt.org/wc/norobots.html"> |
| <code>robots.txt</code></a>. |
| <li>Automatic handling of HTTP-Equiv and Refresh. |
| </ul> |
| |
| |
| <a name="examples"></a> |
| <h2>Examples</h2> |
| |
| <p class="docwarning">This documentation is in need of reorganisation and |
| extension!</p> |
| |
| <p>The examples below are written for a website that does not exist |
| (<code>example.com</code>), so they cannot be run. There are also |
| some <a href="./#tests">working examples</a> that you can run. |
| |
| @{colorize(r""" |
| import re |
| import mechanize |
| |
| br = mechanize.Browser() |
| br.open("http://www.example.com/") |
| # follow second link with element text matching regular expression |
| response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1) |
| assert br.viewing_html() |
| print br.title() |
| print response1.geturl() |
| print response1.info() # headers |
| print response1.read() # body |
| |
| br.select_form(name="order") |
| # Browser passes through unknown attributes (including methods) |
| # to the selected HTMLForm. |
| br["cheeses"] = ["mozzarella", "caerphilly"] # (the method here is __setitem__) |
| # Submit current form. Browser calls .close() on the current response on |
| # navigation, so this closes response1 |
| response2 = br.submit() |
| |
| # print currently selected form (don't call .submit() on this, use br.submit()) |
| print br.form |
| |
| response3 = br.back() # back to cheese shop (same data as response1) |
| # the history mechanism returns cached response objects |
| # we can still use the response, even though it was .close()d |
| response3.get_data() # like .seek(0) followed by .read() |
| response4 = br.reload() # fetches from server |
| |
| for form in br.forms(): |
|     print form |
| # .links() optionally accepts the keyword args of .follow_/.find_link() |
| for link in br.links(url_regex="python.org"): |
|     print link |
|     br.follow_link(link) # takes EITHER Link instance OR keyword args |
|     br.back() |
| """)} |
| |
| <p>You may control the browser's policy by using the methods of |
| <code>mechanize.Browser</code>'s base class, <code>mechanize.UserAgent</code>. |
| For example: |
| |
| @{colorize(""" |
| import logging |
| import sys |
| |
| import mechanize |
| |
| br = mechanize.Browser() |
| # Explicitly configure proxies (Browser will attempt to set good defaults). |
| # Note the userinfo ("joe:password@") and port number (":3128") are optional. |
| br.set_proxies({"http": "joe:password@myproxy.example.com:3128", |
|                 "ftp": "proxy.example.com", |
|                 }) |
| # Add HTTP Basic/Digest auth username and password for HTTP proxy access. |
| # (equivalent to using "joe:password@..." form above) |
| br.add_proxy_password("joe", "password") |
| # Add HTTP Basic/Digest auth username and password for website access. |
| br.add_password("http://example.com/protected/", "joe", "password") |
| # Don't handle HTTP-EQUIV headers (HTTP headers embedded in HTML). |
| br.set_handle_equiv(False) |
| # Ignore robots.txt. Do not do this without thought and consideration. |
| br.set_handle_robots(False) |
| # Don't add Referer (sic) header |
| br.set_handle_referer(False) |
| # Don't handle Refresh redirections |
| br.set_handle_refresh(False) |
| # Don't handle cookies |
| br.set_cookiejar(None) |
| # Supply your own mechanize.CookieJar (NOTE: cookie handling is ON by |
| # default: no need to do this unless you have some reason to use a |
| # particular cookiejar) |
| cj = mechanize.CookieJar() |
| br.set_cookiejar(cj) |
| # Log information about HTTP redirects and Refreshes. |
| br.set_debug_redirects(True) |
| # Log HTTP response bodies (ie. the HTML, most of the time). |
| br.set_debug_responses(True) |
| # Print HTTP headers. |
| br.set_debug_http(True) |
| |
| # To make sure you're seeing all debug output: |
| logger = logging.getLogger("mechanize") |
| logger.addHandler(logging.StreamHandler(sys.stdout)) |
| logger.setLevel(logging.INFO) |
| |
| # Sometimes it's useful to process bad headers or bad HTML: |
| response = br.response() # this is a copy of response |
| headers = response.info() # currently, this is a mimetools.Message |
| headers["Content-type"] = "text/html; charset=utf-8" |
| response.set_data(response.get_data().replace("<!---", "<!--")) |
| br.set_response(response) |
| """)} |
| |
| <p>mechanize exports the complete interface of <code>urllib2</code>: |
| |
| @{colorize(""" |
| import mechanize |
| response = mechanize.urlopen("http://www.example.com/") |
| print response.read() |
| """)} |
| |
| |
| <p>When using mechanize, anything you would normally import |
| from <code>urllib2</code> should be imported from mechanize instead. In many |
| cases, objects imported from mechanize are the same objects provided by |
| <code>urllib2</code>. In many other cases, though, the implementation comes |
| from mechanize, either because bug fixes have been applied or the functionality |
| of <code>urllib2</code> has been extended in some way. |
| |
| |
| <a name="useragentbase"></a> |
| <h2>UserAgent vs UserAgentBase</h2> |
| |
| <p><code>mechanize.UserAgent</code> is a trivial subclass of |
| <code>mechanize.UserAgentBase</code>, adding just one method, |
| <code>.set_seekable_responses()</code> (see the <a |
| href="./doc.html#seekable">documentation on seekable responses</a>). |
| |
| <p>The reason for the extra class is that |
| <code>mechanize.Browser</code> depends on seekable response objects |
| (because response objects are used to implement the browser history). |
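| |
| <p>As a rough illustration of what "seekable" means here, the sketch below |
| reads a response body twice, rewinding in between. The helper name |
| <code>read_twice</code> is hypothetical and not part of mechanize; it only |
| assumes that the object passed in supports <code>.read()</code> and |
| <code>.seek()</code>, as <code>mechanize.Browser</code>'s responses do. |

```python
def read_twice(response):
    """Read a seekable response body twice, rewinding in between.

    Hypothetical helper for illustration only; not part of mechanize.
    """
    first = response.read()
    # Rewind to the start of the body.  A response that is not seekable
    # has no usable .seek(), so this step is what the history relies on.
    response.seek(0)
    second = response.read()
    return first, second
```

| <p>With a <code>mechanize.Browser</code>, the response returned by |
| <code>.open()</code> could be passed in directly. With a plain |
| <code>mechanize.UserAgent</code>, the assumption is that |
| <code>.set_seekable_responses(True)</code> has been called first. |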
| |
| |
| <a name="compatnotes"></a> |
| <h2>Compatibility</h2> |
| |
| <p>These notes explain the relationship between mechanize, ClientCookie, |
| <code>cookielib</code> and <code>urllib2</code>, and which to use when. If |
| you're just using mechanize, and not any of those other libraries, you can |
| ignore this section. |
| |
| <ol> |
| |
| <li>mechanize works with Python 2.4, Python 2.5, and Python 2.6. |
| |
| <li>When using mechanize, anything you would normally import |
| from <code>urllib2</code> should be imported from <code>mechanize</code> |
| instead. |
| |
| <li>Use of mechanize classes with <code>urllib2</code> (and vice-versa) is no |
| longer supported. However, existing classes implementing the urllib2 |
| Handler interface are likely to work unchanged with mechanize. |
| |
| <li>mechanize now only imports <code>urllib2.URLError</code> and |
| <code>urllib2.HTTPError</code>. The rest of <code>urllib2</code> is forked. |
| I intend to merge fixes from Python trunk frequently. |
| |
| <li>ClientCookie is no longer maintained as a separate package. The code is |
| now part of mechanize, and its interface is now exported through module |
| mechanize (since mechanize 0.1.0). Old code can simply be changed to |
| <code>import mechanize as ClientCookie</code> and should continue to |
| work. |
| |
| <li>The cookie handling parts of mechanize are in the Python 2.4 standard |
| library as module <code>cookielib</code> and extensions to module |
| <code>urllib2</code>. mechanize does not currently use |
| <code>cookielib</code>, due to the presence of thread synchronisation code |
| in <code>cookielib</code> that is not present in the mechanize fork of |
| <code>cookielib</code>. |
| |
| </ol> |
| |
| |
| |
| <a name="docs"></a> |
| <h2>Documentation</h2> |
| |
| <p>Full API documentation is in the docstrings. |
| |
| <p>The documentation in the web pages is in need of reorganisation at the |
| moment, after the merge of ClientCookie into mechanize. |
| |
| |
| <a name="credits"></a> |
| <h2>Credits</h2> |
| |
| <p>Thanks to all the too-numerous-to-list people who reported bugs and provided |
| patches. Also thanks to Ian Bicking, for persuading me that a |
| <code>UserAgent</code> class would be useful, and to Ronald Tschalar for advice |
| on Netscape cookies. |
| |
| <p>A lot of credit must go to Gisle Aas, who wrote libwww-perl, from which |
| large parts of mechanize originally derived, and Andy Lester for the original, |
| <a href="http://search.cpan.org/dist/WWW-Mechanize/"><code>WWW::Mechanize</code> |
| </a>. Finally, thanks to the (coincidentally named) Johnny Lee for the MSIE |
| CookieJar Perl code from which mechanize's MSIE cookie support is derived. |
| |
| |
| <a name="download"></a> |
| <h2>Download</h2> |
| |
| <p>You can install from source, or |
| using <a href="http://peak.telecommunity.com/DevCenter/EasyInstall">EasyInstall</a>: |
| |
| <pre>easy_install mechanize</pre> |
| |
| <p><a href="./#git">git access</a> is also available. |
| |
| <p>All documentation (including this web page) is included in the distribution. |
| |
| <p>This is a stable release. |
| |
| <ul> |
| <li><a href="./src/mechanize-@(version).tar.gz">mechanize-@(version).tar.gz</a> |
| <li><a href="./src/mechanize-@(version).zip">mechanize-@(version).zip</a> |
| <li><a href="./src/ChangeLog.txt">Change Log</a> (included in distribution) |
| <li><a href="./src/">Older versions.</a> |
| </ul> |
| |
| <p>For an installation procedure that does not invoke EasyInstall's dependency |
| resolution system, see the INSTALL file included with the distribution. |
| |
| |
| <a name="git"></a> |
| <h2>git repository</h2> |
| |
| <p>The <a href="http://git-scm.com/">git</a> repository is <a href="http://github.com/">here</a>. |
| To check it out: |
| |
| <pre> |
| git clone git://github.com/jjlee/mechanize.git |
| </pre> |
| |
| <a name="tests"></a> |
| <h2>Tests and examples</h2> |
| |
| <h3>Examples</h3> |
| |
| <p>The <code>examples</code> directory in the source packages contains a couple |
| of silly, but working, scripts to demonstrate basic use of the module. Note |
| that it's in the nature of web scraping for such scripts to break, so don't be |
| too surprised if that happens – do let me know, though! |
| |
| <p>See also the <a href="./forms/">forms examples</a> (these examples use the |
| forms code independently of Browser). |
| |
| <h3>Functional tests</h3> |
| |
| <p>To run the functional tests (which <strong>do</strong> access the network): |
| |
| <pre>python functional_tests.py</pre> |
| |
| <p>To start a local server and run the functional tests against that (depends |
| on <code>twisted.web2</code>): |
| |
| <pre>python functional_tests.py -l</pre> |
| |
| <h3>Unit tests</h3> |
| |
| <p>To run the unit tests (none of which access the network), run the following |
| command: |
| |
| <pre>python test.py</pre> |
| |
| |
| <h2>See also</h2> |
| |
| <p>There are several wrappers around mechanize designed for functional testing |
| of web applications: |
| |
| <ul> |
| |
| <li><a href="http://cheeseshop.python.org/pypi?:action=display&name=zope.testbrowser"> |
| <code>zope.testbrowser</code></a> (or |
| <a href="http://cheeseshop.python.org/pypi?%3Aaction=display&name=ZopeTestbrowser"> |
| <code>ZopeTestBrowser</code></a>, the standalone version). |
| <li><a href="http://www.idyll.org/~t/www-tools/twill.html">twill</a>. |
| </ul> |
| |
| <p>See <a href="../bits/GeneralFAQ.html">General FAQ</a> page for other links |
| to related software. |
| |
| |
| <a name="faq"></a> |
| <h2>FAQs - pre install</h2> |
| <ul> |
| <li>Which version of Python do I need? |
| <p>Python 2.4, 2.5 or 2.6. Python 3 is not yet supported. |
| <li>Does mechanize depend on BeautifulSoup? |
| <p>No. mechanize offers a few (still rather experimental) classes that make |
| use of BeautifulSoup, but these classes are not required to use mechanize. |
| mechanize bundles BeautifulSoup version 2, so that module is no longer |
| required. A future version of mechanize will support BeautifulSoup |
| version 3, at which point mechanize will likely no longer bundle the |
| module. |
| <li>Does mechanize depend on ClientForm? |
| <p>No, ClientForm is now part of mechanize. |
| <li>Which license? |
| <p>mechanize is dual-licensed: you may pick either the |
| <a href="http://www.opensource.org/licenses/bsd-license.php">BSD license</a>, |
| or the <a href="http://www.zope.org/Resources/ZPL">ZPL 2.1</a> (both are |
| included in the distribution). |
| </ul> |
| |
| <a name="usagefaq"></a> |
| <h2>FAQs - usage</h2> |
| <ul> |
| <li>I'm not getting the HTML page I expected to see. |
| <ul> |
| <li><a href="http://wwwsearch.sourceforge.net/mechanize/doc.html#debugging">Debugging tips</a> |
| <li><a href="http://wwwsearch.sourceforge.net/bits/GeneralFAQ.html">More tips</a> |
| </ul> |
| <li><code>Browser</code> doesn't have all of the forms/links I see in the |
| HTML. Why not? |
| <p>Perhaps the default parser can't cope with invalid HTML. Try using the |
| included BeautifulSoup 2 parser instead: |
| @{colorize(""" |
| import mechanize |
| |
| browser = mechanize.Browser(factory=mechanize.RobustFactory()) |
| browser.open("http://example.com/") |
| for form in browser.forms(): |
|     print form |
| """)} |
| <li>Is JavaScript supported? |
| <p>No, sorry. Try <a href="http://htmlunit.sourceforge.net/">htmlunit</a>. |
| <li>My HTTP response data is truncated. |
| <p><code>mechanize.Browser</code>'s response objects support the |
| <code>.seek()</code> method, and can still be used after |
| <code>.close()</code> has been called. Response data is not fetched until |
| it is needed, so navigating away from a URL before fetching all of the |
| response will truncate it. Call <code>response.get_data()</code> before |
| navigation if you don't want that to happen. |
| <li>I'm <strong><em>sure</em></strong> this page is HTML, why does |
| <code>mechanize.Browser</code> think otherwise? |
| @{colorize(""" |
| b = mechanize.Browser( |
|     # mechanize's XHTML support needs work, so is currently switched off. If |
|     # we want to get our work done, we have to turn it on by supplying a |
|     # mechanize.Factory (with XHTML support turned on): |
|     factory=mechanize.DefaultFactory(i_want_broken_xhtml_support=True) |
|     ) |
| """)} |
| <li>Why don't timeouts work for me? |
| <p>Timeouts are ignored with versions of Python earlier than 2.6. |
| Timeouts do not apply to DNS lookups. |
| </ul> |
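| |
| <p>The answers above on truncated responses and timeouts can be combined in |
| a small helper. This is only a sketch: <code>open_and_read_all</code> is a |
| hypothetical name, the 30 second default is arbitrary, and it assumes that |
| <code>browser</code> behaves like <code>mechanize.Browser</code>, with an |
| <code>.open()</code> that accepts a <code>timeout</code> keyword (which only |
| takes effect on Python 2.6). |

```python
def open_and_read_all(browser, url, timeout=30.0):
    """Open url and fetch the whole body before any further navigation.

    Hypothetical helper for illustration only; not part of mechanize.
    `browser` is assumed to behave like mechanize.Browser.
    """
    # The timeout is in seconds and does not cover the DNS lookup.
    response = browser.open(url, timeout=timeout)
    # Read the complete body now, so that navigating away later cannot
    # truncate it.  The returned data is deliberately discarded here; the
    # caller can call .get_data() again on the returned response.
    response.get_data()
    return response
```

| <p>Because the body has already been read, the returned response can be |
| kept and inspected after later calls such as <code>br.follow_link()</code> |
| or <code>br.back()</code>. |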
| |
| <a name="bug_tracker"></a> |
| <h2>Bug tracker</h2> |
| |
| <p>The (rather new) bug tracker is <a href="http://github.com/jjlee/mechanize/issues">here on GitHub</a>. It's equally acceptable to file bugs on the tracker or post about them to the mailing list. |
| |
| <a name="mailing_list"></a> |
| <h2>Mailing list</h2> |
| |
| <p>There is |
| a <a href="http://lists.sourceforge.net/lists/listinfo/wwwsearch-general"> |
| mailing list</a>. I prefer questions and comments to be sent there rather than |
| direct to me. |
| |
| <p><a href="mailto:jjl@@pobox.com">John J. Lee</a>, |
| @(time.strftime("%B %Y", last_modified)). |
| |
| <hr> |
| |
| </div> |
| |
| <div id="Menu"> |
| |
| @(release.navbar('mechanize')) |
| |
| <br> |
| |
| <a href="./#examples">Examples</a><br> |
| <a href="./#compatnotes">Compatibility</a><br> |
| <a href="./#docs">Documentation</a><br> |
| <a href="./#download">Download</a><br> |
| <a href="./#git">git</a><br> |
| <a href="./#faq">FAQs</a><br> |
| <a href="./#bug_tracker">Bug tracker</a><br> |
| <a href="./#mailing_list">Mailing list</a><br> |
| |
| </div> |
| |
| |
| </body> |
| </html> |