| <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| "http://www.w3.org/TR/html4/strict.dtd"> |
| @# This file is processed by EmPy |
| @{ |
| from colorize import colorize |
| import time |
| import release |
| last_modified = release.last_modified(empy.identify()[0]) |
| try: |
| base |
| except NameError: |
| base = False |
| } |
| <html> |
| <!--This file was generated by EmPy: do not edit--> |
| <head> |
| <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> |
| <meta name="author" content="John J. Lee <jjl@@pobox.com>"> |
| <meta name="date" content="@(time.strftime("%Y-%m-%d", last_modified))"> |
| <title>mechanize documentation</title> |
| <style type="text/css" media="screen">@@import "../../styles/style.css";</style> |
| <!--[if IE 6]> |
| <style type="text/css" media="screen">@@import "../../styles/style-ie6.css";</style> |
| <![endif]--> |
| @[if base]<base href="http://wwwsearch.sourceforge.net/mechanize/doc.html">@[end if] |
| </head> |
| <body> |
| |
| <div id="sf"><a href="http://sourceforge.net"> |
| <img src="http://sourceforge.net/sflogo.php?group_id=48205&amp;type=2" |
| width="125" height="37" alt="SourceForge.net Logo"></a></div> |
| |
| <h1>mechanize documentation: handlers</h1> |
| |
| <div id="Content"> |
| |
| <p class="docwarning">This documentation is in need of reorganisation!</p> |
| |
| <p>This page is the old ClientCookie documentation. It deals with operation on |
| the level of urllib2 Handler objects, and also with adding headers, debugging, |
| and cookie handling. Documentation for the higher-level browser-style |
| interface is <a href="./mechanize">elsewhere</a>. |
| |
| |
| <a name="examples"></a> |
| <h2>Examples</h2> |
| |
| @{colorize(r""" |
| import mechanize |
| response = mechanize.urlopen("http://foo.bar.com/") |
| """)} |
| |
| <p>This function behaves identically to <code>urllib2.urlopen()</code>, except |
| that it deals with cookies automatically. |
| |
| <p>Here is a more complicated example, involving <code>Request</code> objects |
| (useful if you want to pass <code>Request</code>s around, add headers to them, |
| etc.): |
| |
| @{colorize(r""" |
| import mechanize |
| request = mechanize.Request("http://www.acme.com/") |
| # note we're using the urlopen from mechanize, not urllib2 |
| response = mechanize.urlopen(request) |
| # let's say this next request requires a cookie that was set in response |
| request2 = mechanize.Request("http://www.acme.com/flying_machines.html") |
| response2 = mechanize.urlopen(request2) |
| |
| print response2.geturl() |
| print response2.info() # headers |
| print response2.read() # body (readline and readlines work too) |
| """)} |
| |
| <p>(The above example would also work with <code>urllib2.Request</code> |
| objects, since <code>mechanize.HTTPRequestUpgradeProcessor</code> knows about |
| that class, but avoid relying on this where you can: it is an obscure hack |
| kept only for compatibility purposes.) |
| |
| <p>In these examples, the workings are hidden inside the |
| <code>mechanize.urlopen()</code> function, which is an extension of |
| <code>urllib2.urlopen()</code>. Redirects, proxies and cookies are handled |
| automatically by this function (note that you may need a bit of configuration |
| to get your proxies correctly set up: see <code>urllib2</code> documentation). |
| |
| <p>Cookie processing (etc.) is handled by processor objects, which are an |
| extension of <code>urllib2</code>'s handlers: <code>HTTPCookieProcessor</code>, |
| <code>HTTPRefererProcessor</code> etc. They are used like any other handler. |
| |
| <p>There is also a <code>urlretrieve()</code> function, which works like |
| <code>urllib.urlretrieve()</code>. |
| |
| <p>An example at a slightly lower level shows more clearly how the module |
| processes cookies: |
| |
| @{colorize(r""" |
| # Don't copy this blindly! You probably want to follow the examples |
| # above, not this one. |
| import mechanize |
| |
| # Build an opener that *doesn't* automatically call .add_cookie_header() |
| # and .extract_cookies(), so we can do it manually without interference. |
| class NullCookieProcessor(mechanize.HTTPCookieProcessor): |
| def http_request(self, request): return request |
| def http_response(self, request, response): return response |
| opener = mechanize.build_opener(NullCookieProcessor) |
| |
| request = mechanize.Request("http://www.acme.com/") |
| response = mechanize.urlopen(request) |
| cj = mechanize.CookieJar() |
| cj.extract_cookies(response, request) |
| # let's say this next request requires a cookie that was set in response |
| request2 = mechanize.Request("http://www.acme.com/flying_machines.html") |
| cj.add_cookie_header(request2) |
| response2 = mechanize.urlopen(request2) |
| """)} |
| |
| <p>The <code>CookieJar</code> class does all the work. There are essentially |
| two operations: <code>.extract_cookies()</code> extracts HTTP cookies from |
| <code>Set-Cookie</code> (the original <a |
| href="http://www.netscape.com/newsref/std/cookie_spec.html">Netscape cookie |
| standard</a>) and <code>Set-Cookie2</code> (<a |
| href="http://www.ietf.org/rfc/rfc2965.txt">RFC 2965</a>) headers from a |
| response if and only if they should be set given the request, and |
| <code>.add_cookie_header()</code> adds <code>Cookie</code> headers if and only |
| if they are appropriate for a particular HTTP request. Incoming cookies are |
| checked for acceptability based on the host name, etc. Cookies are only set on |
| outgoing requests if they match the request's host name, path, etc. |
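| <p>The same pair of methods survives in today's Python standard library as |
| <code>http.cookiejar</code> (the descendant of this code). The following |
| self-contained sketch exercises the extract/add cycle without touching the |
| network; the <code>FakeResponse</code> class is invented for illustration, |
| since <code>.extract_cookies()</code> only consults <code>.info()</code> on |
| the response: |

```python
# A self-contained sketch (no network) of the extract/add cycle, using
# http.cookiejar, the standard-library descendant of this code.  The
# FakeResponse class is a stand-in invented for illustration:
# extract_cookies() only consults the response's .info() headers.
from email.message import Message
from http.cookiejar import CookieJar
from urllib.request import Request

class FakeResponse:
    def __init__(self, set_cookie):
        self._headers = Message()
        self._headers["Set-Cookie"] = set_cookie

    def info(self):
        return self._headers

jar = CookieJar()
request = Request("http://www.acme.com/")
response = FakeResponse("sid=abc123; Path=/")
# stored only because the cookie is acceptable for this request's host
jar.extract_cookies(response, request)

request2 = Request("http://www.acme.com/flying_machines.html")
jar.add_cookie_header(request2)
# request2 now carries the header "Cookie: sid=abc123"
```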
| |
| <p><strong>Note that if you're using <code>mechanize.urlopen()</code> (or if |
| you're using <code>mechanize.HTTPCookieProcessor</code> by some other |
| means), you don't need to call <code>.extract_cookies()</code> or |
| <code>.add_cookie_header()</code> yourself</strong>. If, on the other hand, |
| you want to use mechanize to provide cookie handling for an HTTP client other |
| than mechanize itself, you will need to use this pair of methods. You can make |
| your own <code>request</code> and <code>response</code> objects, which must |
| support the interfaces described in the docstrings of |
| <code>.extract_cookies()</code> and <code>.add_cookie_header()</code>. |
| |
| <p>There are also some <code>CookieJar</code> subclasses which can store |
| cookies in files and databases. <code>FileCookieJar</code> is the abstract |
| class for <code>CookieJar</code>s that can store cookies in disk files. |
| <code>LWPCookieJar</code> saves cookies in a format compatible with the |
| libwww-perl library. This class is convenient if you want to store cookies in |
| a human-readable file: |
| |
| @{colorize(r""" |
| import mechanize |
| cj = mechanize.LWPCookieJar() |
| cj.revert("cookie3.txt") |
| opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj)) |
| r = opener.open("http://foobar.com/") |
| cj.save("cookie3.txt") |
| """)} |
| |
| <p>The <code>.revert()</code> method discards all existing cookies held by the |
| <code>CookieJar</code> (it won't lose any existing cookies if the load fails). |
| The <code>.load()</code> method, on the other hand, adds the loaded cookies to |
| existing cookies held in the <code>CookieJar</code> (old cookies are kept |
| unless overwritten by newly loaded ones). |
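| <p>The distinction is easy to demonstrate with the standard library's |
| <code>LWPCookieJar</code> (the same class mechanize provides); the |
| <code>make_cookie()</code> helper below fabricates a minimal version-0 |
| cookie purely for illustration: |

```python
# Demonstrating the .load()/.revert() difference with the standard
# library's LWPCookieJar.  The make_cookie() helper fabricates a minimal
# version-0 session cookie purely for illustration.
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar

def make_cookie(name, value, domain):
    # positional Cookie() arguments: version, name, value, port,
    # port_specified, domain, domain_specified, domain_initial_dot,
    # path, path_specified, secure, expires, discard, comment,
    # comment_url, rest
    return Cookie(0, name, value, None, False, domain, False, False,
                  "/", False, False, None, False, None, None, {})

path = os.path.join(tempfile.mkdtemp(), "cookies.lwp")

on_disk = LWPCookieJar()
on_disk.set_cookie(make_cookie("b", "2", "b.example.com"))
on_disk.save(path, ignore_discard=True)

jar = LWPCookieJar()
jar.set_cookie(make_cookie("a", "1", "a.example.com"))

jar.load(path, ignore_discard=True)    # merge: "a" is kept, "b" added
assert sorted(c.name for c in jar) == ["a", "b"]

jar.revert(path, ignore_discard=True)  # replace: only "b" remains
assert [c.name for c in jar] == ["b"]
```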
| |
| <p><code>MozillaCookieJar</code> can load and save to the |
| Mozilla/Netscape/lynx-compatible <code>'cookies.txt'</code> format. This |
| format loses some information (unusual and nonstandard cookie attributes such |
| as comment, and also information specific to RFC 2965 cookies). The subclass |
| <code>MSIECookieJar</code> can load (but not save, yet) from Microsoft Internet |
| Explorer's cookie files (on Windows). <code>BSDDBCookieJar</code> (NOT FULLY |
| TESTED!) saves to a BSDDB database using the standard library's |
| <code>bsddb</code> module. There's an unfinished <code>MSIEDBCookieJar</code>, |
| which uses (reads and writes) the Windows MSIE cookie database directly, rather |
| than storing copies of cookies as <code>MSIECookieJar</code> does. |
| |
| <h2>Important note</h2> |
| |
| <p>Only use names you can import directly from the <code>mechanize</code> |
| package, and that don't start with a single underscore. Everything else is |
| subject to change or disappearance without notice. |
| |
| <a name="browsers"></a> |
| <h2>Cooperating with Mozilla/Netscape, lynx and Internet Explorer</h2> |
| |
| <p>The subclass <code>MozillaCookieJar</code> differs from |
| <code>CookieJar</code> only in storing cookies using a different, |
| Mozilla/Netscape-compatible, file format. The lynx browser also uses this |
| format. This file format can't store RFC 2965 cookies, so they are downgraded |
| to Netscape cookies on saving. <code>LWPCookieJar</code> itself uses a |
| libwww-perl specific format (`Set-Cookie3') - see the example above. Python |
| and your browser should be able to share a cookies file (note that the file |
| location here will differ on non-unix OSes): |
| |
| <p><strong>WARNING:</strong> you may want to back up your browser's cookies |
| file if you use <code>MozillaCookieJar</code> to save cookies. I <em>think</em> |
| it works, but there have been bugs in the past! |
| |
| @{colorize(r""" |
| import os, mechanize |
| cookies = mechanize.MozillaCookieJar() |
| # note: no leading slash on the second part, or os.path.join discards $HOME |
| cookies.load(os.path.join(os.environ["HOME"], ".netscape/cookies.txt")) |
| # see also the save and revert methods |
| """)} |
| |
| <p>Note that cookies saved while Mozilla is running will get clobbered by |
| Mozilla - see <code>MozillaCookieJar.__doc__</code>. |
| |
| <p><code>MSIECookieJar</code> does the same for Microsoft Internet Explorer |
| (MSIE) 5.x and 6.x on Windows, but does not allow saving cookies in this |
| format. In future, the Windows API calls might be used to load and save |
| (though the index has to be read directly, since there is no API for that, |
| AFAIK; see also the unfinished <code>MSIEDBCookieJar</code> mentioned |
| above). |
| |
| @{colorize(r""" |
| import mechanize |
| cj = mechanize.MSIECookieJar(delayload=True) |
| cj.load_from_registry() # finds cookie index file from registry |
| """)} |
| |
| <p>A true <code>delayload</code> argument speeds things up by loading cookies |
| lazily, on demand, rather than all at once. |
| |
| <p>On Windows 9x (win 95, win 98, win ME), you need to supply a username to the |
| <code>.load_from_registry()</code> method: |
| |
| @{colorize(r""" |
| cj.load_from_registry(username="jbloggs") |
| """)} |
| |
| <p>Konqueror/Safari and Opera use different file formats, which aren't yet |
| supported. |
| |
| <a name="file"></a> |
| <h2>Saving cookies in a file</h2> |
| |
| <p>If you have no need to co-operate with a browser, the most convenient way to |
| save cookies on disk between sessions in human-readable form is to use |
| <code>LWPCookieJar</code>. This class uses a libwww-perl specific format |
| (`Set-Cookie3'). Unlike <code>MozillaCookieJar</code>, this file format |
| doesn't lose information. |
| |
| <a name="cookiejar"></a> |
| <h2>Using your own CookieJar instance</h2> |
| |
| <p>You might want to do this to <a href="./doc.html#browsers">use your |
| browser's cookies</a>, to customize <code>CookieJar</code>'s behaviour by |
| passing constructor arguments, or to be able to get at the cookies it will hold |
| (for example, for saving cookies between sessions and for debugging). |
| |
| <p>If you're using the higher-level <code>urllib2</code>-like interface |
| (<code>urlopen()</code>, etc.), you'll have to let it know what |
| <code>CookieJar</code> it should use: |
| |
| @{colorize(r""" |
| import mechanize |
| cookies = mechanize.CookieJar() |
| # build_opener() adds standard handlers (such as HTTPHandler and |
| # HTTPCookieProcessor) by default. The cookie processor we supply |
| # will replace the default one. |
| opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies)) |
| |
| r = opener.open("http://acme.com/") # GET |
| data = "spam=eggs" # a urlencoded POST body (illustrative) |
| r = opener.open("http://acme.com/", data) # POST |
| """)} |
| |
| <p>The <code>urlopen()</code> function uses a global |
| <code>OpenerDirector</code> instance to do its work, so if you want to use |
| <code>urlopen()</code> with your own <code>CookieJar</code>, install the |
| <code>OpenerDirector</code> you built with <code>build_opener()</code> using |
| the <code>mechanize.install_opener()</code> function, then proceed as usual: |
| |
| @{colorize(r""" |
| mechanize.install_opener(opener) |
| r = mechanize.urlopen("http://www.acme.com/") |
| """)} |
| |
| <p>Of course, everyone using <code>urlopen</code> is using the same global |
| <code>CookieJar</code> instance! |
| |
| <a name="policy"></a> |
| |
| <p>You can set a policy object (must satisfy the interface defined by |
| <code>mechanize.CookiePolicy</code>), which determines which cookies are |
| allowed to be set and returned. Use the <code>policy</code> argument to the |
| <code>CookieJar</code> constructor, or use the <code>.set_policy()</code> |
| method. The default implementation has some useful switches: |
| |
| @{colorize(r""" |
| from mechanize import CookieJar, DefaultCookiePolicy as Policy |
| cookies = CookieJar() |
| # turn on RFC 2965 cookies, be more strict about domains when setting and |
| # returning Netscape cookies, and block some domains from setting cookies |
| # or having them returned (read the DefaultCookiePolicy docstring for the |
| # domain matching rules here) |
| policy = Policy(rfc2965=True, strict_ns_domain=Policy.DomainStrict, |
| blocked_domains=["ads.net", ".ads.net"]) |
| cookies.set_policy(policy) |
| """)} |
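| <p>The same switches exist on the standard library's |
| <code>DefaultCookiePolicy</code>, so the blocking behaviour can be checked |
| without mechanize or a network connection (the <code>FakeResponse</code> |
| stand-in below is invented for illustration): |

```python
# The same switches exist on http.cookiejar.DefaultCookiePolicy, so the
# blocking behaviour can be checked without a network connection (the
# FakeResponse stand-in is invented for illustration).
from email.message import Message
from http.cookiejar import CookieJar, DefaultCookiePolicy
from urllib.request import Request

class FakeResponse:
    def __init__(self, set_cookie):
        self._headers = Message()
        self._headers["Set-Cookie"] = set_cookie

    def info(self):
        return self._headers

policy = DefaultCookiePolicy(
    blocked_domains=["ads.example.net", ".ads.example.net"])
jar = CookieJar(policy=policy)

# cookie from a blocked domain: refused
jar.extract_cookies(FakeResponse("track=1"),
                    Request("http://ads.example.net/"))
# cookie from an ordinary domain: stored
jar.extract_cookies(FakeResponse("sid=2"),
                    Request("http://www.example.com/"))
# the jar now holds only the "sid" cookie
```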
| |
| |
| <a name="extras"></a> |
| <h2>Optional extras: robots.txt, HTTP-EQUIV, Refresh, Referer</h2> |
| |
| <p>These are implemented as processor classes. Processors are an extension of |
| <code>urllib2</code>'s handlers (now a standard part of urllib2 in Python 2.4): |
| you just pass them to <code>build_opener()</code> (example code below). |
| |
| <dl> |
| |
| <dt><code>HTTPRobotRulesProcessor</code> |
| |
| <dd><p>WWW Robots (also called wanderers or spiders) are programs that traverse |
| many pages in the World Wide Web by recursively retrieving linked pages. This |
| kind of program can place significant loads on web servers, so there is a <a |
| href="http://www.robotstxt.org/wc/norobots.html">standard</a> for a <code> |
| robots.txt</code> file by which web site operators can request robots to keep |
| out of their site, or out of particular areas of it. This processor uses the |
| standard Python library's <code>robotparser</code> module. It raises |
| <code>mechanize.RobotExclusionError</code> (subclass of |
| <code>mechanize.HTTPError</code>) if an attempt is made to open a URL prohibited |
| by <code>robots.txt</code>. |
| |
| <dt><code>HTTPEquivProcessor</code> |
| |
| <dd><p>The <code><META HTTP-EQUIV></code> tag is a way of including data |
| in HTML to be treated as if it were part of the HTTP headers. mechanize can |
| automatically read these tags and add the <code>HTTP-EQUIV</code> headers to |
| the response object's real HTTP headers. The HTML is left unchanged. |
| |
| <dt><code>HTTPRefreshProcessor</code> |
| |
| <dd><p>The <code>Refresh</code> HTTP header is a non-standard header which is |
| widely used. It requests that the user-agent follow a URL after a specified |
| time delay. mechanize can treat these headers (which may have been set in |
| <code><META HTTP-EQUIV></code> tags) as if they were 302 redirections. |
| Exactly when and how <code>Refresh</code> headers are handled is configurable |
| using the constructor arguments. |
| |
| <dt><code>HTTPRefererProcessor</code> |
| |
| <dd><p>The <code>Referer</code> HTTP header lets the server know which URL |
| you've just visited. Some servers use this header as state information, and |
| don't like it if this is not present. It's a chore to add this header by hand |
| every time you make a request. This adds it automatically. |
| <strong>NOTE</strong>: this only makes sense if you use each processor for a |
| single chain of HTTP requests (so, for example, if you use a single |
| HTTPRefererProcessor to fetch a series of URLs extracted from a single page, |
| <strong>this will break</strong>). <a |
| href="../mechanize/">mechanize.Browser</a> does this properly.</p> |
| |
| </dl> |
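| <p><code>HTTPRobotRulesProcessor</code> defers to the standard library's |
| robots.txt parser, so its rule matching can be exercised directly, with no |
| network access (shown here under the modern module name |
| <code>urllib.robotparser</code>): |

```python
# What HTTPRobotRulesProcessor consults under the hood: the standard
# library's robots.txt parser (robotparser in Python 2, urllib.robotparser
# today).  Feeding rules in directly avoids any network access.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
rp.modified()  # mark the rules as loaded (normally done by .read())

allowed = rp.can_fetch("MyRobot/0.1", "http://www.example.com/index.html")
blocked = rp.can_fetch("MyRobot/0.1", "http://www.example.com/private/x.html")
# allowed is True, blocked is False
```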
| |
| @{colorize(r""" |
| import mechanize |
| |
| opener = mechanize.build_opener(mechanize.HTTPRefererProcessor, |
| mechanize.HTTPEquivProcessor, |
| mechanize.HTTPRefreshProcessor, |
| ) |
| opener.open("http://www.rhubarb.com/") |
| """)} |
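| <p>The <code>HTTP-EQUIV</code> extraction that |
| <code>HTTPEquivProcessor</code> performs can be sketched with the standard |
| library's HTML parser alone; the <code>EquivExtractor</code> class below is |
| an illustration, not mechanize's actual implementation: |

```python
# A sketch of the idea behind HTTPEquivProcessor: pull <META HTTP-EQUIV>
# pseudo-headers out of an HTML body.  Standard library only; this is an
# illustration, not mechanize's actual parser.
from html.parser import HTMLParser

class EquivExtractor(HTMLParser):
    """Collect (header, value) pairs from <meta http-equiv=...> tags."""
    def __init__(self):
        super().__init__()
        self.headers = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "http-equiv" in d and "content" in d:
                self.headers.append((d["http-equiv"], d["content"]))

parser = EquivExtractor()
parser.feed('<html><head>'
            '<meta http-equiv="Refresh" content="5; url=/next.html">'
            '</head><body>hello</body></html>')
# parser.headers is now [("Refresh", "5; url=/next.html")]
```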
| |
| |
| |
| <a name="seekable"></a> |
| <h2>Seekable responses</h2> |
| |
| <p>Response objects returned from (or raised as exceptions by) |
| <code>mechanize.SeekableResponseOpener</code>, <code>mechanize.UserAgent</code> |
| (if <code>.set_seekable_responses(True)</code> has been called) and |
| <code>mechanize.Browser()</code> have <code>.seek()</code>, |
| <code>.get_data()</code> and <code>.set_data()</code> methods: |
| |
| @{colorize(r""" |
| import mechanize |
| opener = mechanize.OpenerFactory(mechanize.SeekableResponseOpener).build_opener() |
| response = opener.open("http://example.com/") |
| # same return value as .read(), but without affecting seek position |
| total_nr_bytes = len(response.get_data()) |
| assert len(response.read()) == total_nr_bytes |
| assert len(response.read()) == 0 # we've already read the data |
| response.seek(0) |
| assert len(response.read()) == total_nr_bytes |
| response.set_data("blah\n") |
| assert response.get_data() == "blah\n" |
| ... |
| """)} |
| |
| <p>This caching behaviour can be avoided by using |
| <code>mechanize.OpenerDirector</code> (as long as |
| <code>SeekableProcessor</code>, <code>HTTPEquivProcessor</code> and |
| <code>HTTPResponseDebugProcessor</code> are not used). It can also be avoided |
| with <code>mechanize.UserAgent</code>: |
| |
| @{colorize(r""" |
| import mechanize |
| ua = mechanize.UserAgent() |
| ua.set_seekable_responses(False) |
| ua.set_handle_equiv(False) |
| ua.set_debug_responses(False) |
| """)} |
| |
| <p>Note that if you turn on features that use seekable responses (currently: |
| HTTP-EQUIV handling and response body debug printing), returned responses |
| <em>may</em> be seekable as a side-effect of these features. However, this is |
| not guaranteed (currently, in these cases, returned response objects are |
| seekable, but raised response objects, i.e. <code>mechanize.HTTPError</code> |
| instances, are not seekable). This applies regardless of whether you |
| use <code>mechanize.UserAgent</code> or <code>mechanize.OpenerDirector</code>. |
| If you explicitly request seekable responses by calling |
| <code>.set_seekable_responses(True)</code> on a |
| <code>mechanize.UserAgent</code> instance, or by using |
| <code>mechanize.Browser</code> or |
| <code>mechanize.SeekableResponseOpener</code>, which always return seekable |
| responses, then both returned and raised responses are guaranteed to be |
| seekable. |
| |
| <p>Handlers should call <code>response = |
| mechanize.seek_wrapped_response(response)</code> if they require the |
| <code>.seek()</code>, <code>.get_data()</code> or <code>.set_data()</code> |
| methods. |
| |
| <p>Note that <code>SeekableProcessor</code> (and |
| <code>ResponseUpgradeProcessor</code>) are deprecated since mechanize 0.1.6b. |
| The reason for the deprecation is that these were really abuses of the response |
| processing chain (the <code>.process_response()</code> support documented by |
| urllib2). The response processing chain is sensibly used only for processing |
| response headers and data, not for processing response <em>objects</em>, |
| because the same data may occur as different Python objects (this can occur for |
| example when <code>HTTPError</code> is raised by |
| <code>HTTPDefaultErrorHandler</code>), but should only get processed once |
| (during <code>.open()</code>). |
| |
| |
| |
| <a name="requests"></a> |
| <h2>Confusing fact about headers and Requests</h2> |
| |
| <p>mechanize automatically upgrades <code>urllib2.Request</code> objects to |
| <code>mechanize.Request</code>, as a backwards-compatibility hack. This |
| means that you won't see any headers that are added to Request objects by |
| handlers unless you use <code>mechanize.Request</code> in the first place. |
| Sorry about that. |
| |
| <p>Note also that handlers may create new <code>Request</code> instances (for |
| example when performing redirects) rather than adding headers to existing |
| <code>Request</code> objects. |
| |
| |
| <a name="headers"></a> |
| <h2>Adding headers</h2> |
| |
| <p>Adding headers is done like so: |
| |
| @{colorize(r""" |
| import mechanize |
| req = mechanize.Request("http://foobar.com/") |
| req.add_header("Referer", "http://wwwsearch.sourceforge.net/mechanize/") |
| r = mechanize.urlopen(req) |
| """)} |
| |
| <p>You can also use the headers argument to the <code>mechanize.Request</code> |
| constructor. |
| |
| <p>mechanize adds some headers to <code>Request</code> objects automatically - |
| see the next section for details. |
| |
| |
| <h2>Changing the automatically-added headers (User-Agent)</h2> |
| |
| <p><code>OpenerDirector</code> automatically adds a <code>User-Agent</code> |
| header to every <code>Request</code>. |
| |
| <p>To change this and/or add similar headers, use your own |
| <code>OpenerDirector</code>: |
| |
| @{colorize(r""" |
| import mechanize |
| cookies = mechanize.CookieJar() |
| opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies)) |
| opener.addheaders = [("User-agent", "Mozilla/5.0 (compatible; MyProgram/0.1)"), |
| ("From", "responsible.person@example.com")] |
| """)} |
| |
| <p>Again, to use <code>urlopen()</code>, install your |
| <code>OpenerDirector</code> globally: |
| |
| @{colorize(r""" |
| mechanize.install_opener(opener) |
| r = mechanize.urlopen("http://acme.com/") |
| """)} |
| |
| <p>Also, a few standard headers (<code>Content-Length</code>, |
| <code>Content-Type</code> and <code>Host</code>) are added when the |
| <code>Request</code> is passed to <code>urlopen()</code> (or |
| <code>OpenerDirector.open()</code>). You shouldn't need to change these |
| headers, but since this is done by <code>AbstractHTTPHandler</code>, you can |
| change the way it works by passing a subclass of that handler to |
| <code>build_opener()</code> (or, as always, by constructing an opener yourself |
| and calling <code>.add_handler()</code>). |
| |
| |
| <a name="unverifiable"></a> |
| <h2>Initiating unverifiable transactions</h2> |
| |
| <p>This section is only of interest for correct handling of third-party HTTP |
| cookies. See <a href="./doc.html#standards">below</a> for an explanation of |
| 'third-party'. |
| |
| <p>First, some terminology. |
| |
| <p>An <em>unverifiable request</em> (defined fully by RFC 2965) is one whose |
| URL the user did not have the option to approve. For example, a transaction is |
| unverifiable if the request is for an image in an HTML document, and the user |
| had no option to approve the fetching of the image from a particular URL. |
| |
| <p>The <em>request-host of the origin transaction</em> (defined fully by RFC |
| 2965) is the host name or IP address of the original request that was initiated |
| by the user. For example, if the request is for an image in an HTML document, |
| this is the request-host of the request for the page containing the image. |
| |
| <p><strong>mechanize knows that redirected transactions are unverifiable, |
| and will handle that on its own (ie. you don't need to think about the origin |
| request-host or verifiability yourself).</strong> |
| |
| <p>If you want to initiate an unverifiable transaction yourself (which you |
| should if, for example, you're downloading the images from a page, and 'the |
| user' hasn't explicitly OKed those URLs): |
| |
| @{colorize(r""" |
| import mechanize |
| # the image URL here is illustrative |
| request = mechanize.Request("http://www.example.com/images/logo.gif", |
|                             origin_req_host="www.example.com", |
|                             unverifiable=True) |
| """)} |
| |
| |
| <a name="rfc2965"></a> |
| <h2>RFC 2965 handling</h2> |
| |
| <p>RFC 2965 handling is switched off by default, because few browsers implement |
| it, so the RFC 2965 protocol is essentially never seen on the internet. To |
| switch it on, see <a href="./doc.html#policy">here</a>. |
| |
| |
| <a name="debugging"></a> |
| <h2>Debugging</h2> |
| |
| <!--XXX move as much as poss. to General page--> |
| |
| <p>First, a few common problems. The most frequent mistake people seem to make |
| is to use <code>mechanize.urlopen()</code>, <em>and</em> the |
| <code>.extract_cookies()</code> and <code>.add_cookie_header()</code> methods |
| on a <code>CookieJar</code> object themselves. If you use <code>mechanize.urlopen()</code> |
| (or <code>OpenerDirector.open()</code>), the module handles extraction and |
| adding of cookies by itself, so you should not call |
| <code>.extract_cookies()</code> or <code>.add_cookie_header()</code>. |
| |
| <p>Are you sure the server is sending you any cookies in the first place? |
| Maybe the server is keeping track of state in some other way |
| (<code>HIDDEN</code> HTML form entries (possibly in a separate page referenced |
| by a frame), URL-encoded session keys, IP address, HTTP <code>Referer</code> |
| headers)? Perhaps some embedded script in the HTML is setting cookies (see |
| below)? Maybe you messed up your request, and the server is sending you some |
| standard failure page (even if the page doesn't appear to indicate any |
| failure). Sometimes, a server wants particular headers set to the values it |
| expects, or it won't play nicely. The most frequent offenders here are the |
| <code>Referer</code> [<em>sic</em>] and / or <code>User-Agent</code> HTTP |
| headers (<a href="./doc.html#headers">see above</a> for how to set these). The |
| <code>User-Agent</code> header may need to be set to a value like that of a |
| popular browser. The <code>Referer</code> header may need to be set to the URL |
| that the server expects you to have followed a link from. Occasionally, it may |
| even be that operators deliberately configure a server to insist on precisely |
| the headers that the popular browsers (MS Internet Explorer, Mozilla/Netscape, |
| Opera, Konqueror/Safari) generate, but remember that incompetence (possibly on |
| your part) is more probable than deliberate sabotage (and if a site owner is |
| that keen to stop robots, you probably shouldn't be scraping it anyway). |
| |
| <p>When you <code>.save()</code> to or |
| <code>.load()</code>/<code>.revert()</code> from a file, single-session cookies |
| will expire unless you explicitly request otherwise with the |
| <code>ignore_discard</code> argument. This may be your problem if you find |
| cookies are going away after saving and loading. |
| |
| @{colorize(r""" |
| import mechanize |
| cj = mechanize.LWPCookieJar() |
| opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj)) |
| mechanize.install_opener(opener) |
| r = mechanize.urlopen("http://foobar.com/") |
| cj.save("/some/file", ignore_discard=True, ignore_expires=True) |
| """)} |
| |
| <p>If none of the advice above solves your problem quickly, try comparing the |
| headers and data that you are sending out with those that a browser emits. |
| Often this will give you the clue you need. Of course, you'll want to check |
| that the browser is able to do manually what you're trying to achieve |
| programmatically before minutely examining the headers. Make sure that what you |
| do manually is <em>exactly</em> the same as what you're trying to do from |
| Python - you may simply be hitting a server bug that only gets revealed if you |
| view pages in a particular order, for example. In order to see what your |
| browser is sending to the server (even if HTTPS is in use), see <a |
| href="../clientx.html">the General FAQ page</a>. If nothing is obviously wrong |
| with the requests your program is sending and you're out of ideas, you can try |
| the last resort of good old brute force binary-search debugging. Temporarily |
| switch to sending HTTP requests by hand (with <code>httplib</code>). Start by copying |
| Netscape/Mozilla or IE slavishly (apart from session IDs, etc., of course), |
| then begin the tedious process of mutating your headers and data until they |
| match what your higher-level code was sending. This will at least reliably |
| find your problem. |
| |
| <p>You can turn on display of HTTP headers: |
| |
| @{colorize(r""" |
| import mechanize |
| hh = mechanize.HTTPHandler() # you might want HTTPSHandler, too |
| hh.set_http_debuglevel(1) |
| opener = mechanize.build_opener(hh) |
| response = opener.open(url) |
| """)} |
| |
| <p>Alternatively, you can examine your individual request and response |
| objects to see what's going on. Note, though, that mechanize upgrades |
| <code>urllib2.Request</code> objects to <code>mechanize.Request</code>, so you |
| won't see any headers that are added to requests by handlers unless you use |
| <code>mechanize.Request</code> in the first place. In addition, requests may |
| involve "sub-requests" in cases such as redirection, in which case you will |
| also not see everything that's going on just by examining the original request |
| and final response. mechanize's responses can be made to |
| have <code>.seek()</code> and <code>.get_data()</code> methods. It's often |
| useful to use the <code>.get_data()</code> method during debugging |
| (see <a href="./doc.html#seekable">above</a>). |
| |
| <p>Also, note <code>HTTPRedirectDebugProcessor</code> (which prints information |
| about redirections) and <code>HTTPResponseDebugProcessor</code> (which prints |
| out all response bodies, including those that are read during redirections). |
| <strong>NOTE</strong>: as well as having these processors in your |
| <code>OpenerDirector</code> (for example, by passing them to |
| <code>build_opener()</code>) you have to turn on logging at the |
| <code>INFO</code> level or lower in order to see any output. |
| |
| <p>If you would like to see what is going on in mechanize's tiny mind, do |
| this: |
| |
| @{colorize(r""" |
| import sys, logging |
| # logging.DEBUG covers masses of debugging information, |
| # logging.INFO just shows the output from HTTPRedirectDebugProcessor. |
| logger = logging.getLogger("mechanize") |
| logger.addHandler(logging.StreamHandler(sys.stdout)) |
| logger.setLevel(logging.DEBUG) |
| """)} |
| |
| <p>The <code>DEBUG</code> level (as opposed to the <code>INFO</code> level) can |
| actually be quite useful, as it explains why particular cookies are accepted or |
| rejected and why they are or are not returned. |
| |
| <p>One final thing to note is that there are some catch-all bare |
| <code>except:</code> statements in the module, which are there to handle |
| unexpected bad input without crashing your program. If this happens, it's a |
| bug in mechanize, so please mail me the warning text. |
| |
| |
| <a name="script"></a> |
| <h2>Embedded script that sets cookies</h2> |
| |
| <p>It is possible to embed script in HTML pages (sandwiched between |
| <code><SCRIPT>here</SCRIPT></code> tags, and in |
| <code>javascript:</code> URLs) - JavaScript / ECMAScript, VBScript, or even |
| Python - that causes cookies to be set in a browser. See the <a |
| href="../bits/clientx.html">General FAQs</a> page for what to do about this. |
| |
| |
| <a name="dates"></a> |
| <h2>Parsing HTTP date strings</h2> |
| |
| <p>A function named <code>str2time</code> is provided by the package, |
| which may be useful for parsing dates in HTTP headers. |
| <code>str2time</code> is intended to be liberal, since HTTP date/time |
| formats are poorly standardised in practice. There is no need to use this |
| function in normal operations: <code>CookieJar</code> instances keep track |
| of cookie lifetimes automatically. This function will stay around in some |
| form, though the supported date/time formats may change. |
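| <p>For a quick feel for what such a parser does, here is a sketch using |
| <code>http.cookiejar.http2time</code> from the modern Python standard |
| library, which descends from this code (epoch seconds on success, |
| <code>None</code> on failure): |

```python
from http.cookiejar import http2time

# A classic RFC 1123 date, as seen in Expires and Date headers.
t = http2time("Wed, 09 Feb 1994 22:23:32 GMT")
print(isinstance(t, (int, float)))  # True: seconds since the epoch

# Unparseable input yields None rather than raising an exception.
print(http2time("not a date") is None)  # True
```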
| |
| |
| <a name="badhtml"></a> |
| <h2>Dealing with bad HTML</h2> |
| |
| <p>XXX Intro |
| |
| <p>XXX Test me |
| |
| @{colorize("""\ |
| import copy |
| import re |
| import mechanize |
| class CommentCleanProcessor(mechanize.BaseProcessor): |
| def http_response(self, request, response): |
| if not hasattr(response, "seek"): |
| response = mechanize.response_seek_wrapper(response) |
| response.seek(0) |
| new_response = copy.copy(response) |
| new_response.set_data( |
| re.sub("<!-([^-]*)->", "<!--\\1-->", response.read())) |
| return new_response |
| https_response = http_response |
| """)} |
| |
| <p>XXX TidyProcessor: mxTidy? tidylib? tidy? |
| |
| |
| <a name="standards"></a> |
| <h2>Note about cookie standards</h2> |
| |
| <p>The various cookie standards and their history form a case study of the |
| terrible things that can happen to a protocol. The long-suffering David |
| Kristol has written a <a |
| href="http://arxiv.org/abs/cs.SE/0105018">paper</a> about it, if you |
| want to know the gory details. |
| |
| <p>Here is a summary. |
| |
| <p>The <a href="http://www.netscape.com/newsref/std/cookie_spec.html">Netscape |
| protocol</a> (cookie_spec.html) is still the only standard supported by most |
| browsers (including Internet Explorer and Netscape). Be aware that |
| cookie_spec.html is not, and never was, actually followed to the letter (or |
| anything close) by anyone (including Netscape, IE and mechanize): the |
| Netscape protocol standard is really defined by the behaviour of Netscape (and |
| now IE). Netscape cookies are also known as V0 cookies, to distinguish them |
| from RFC 2109 or RFC 2965 cookies, which have a version cookie-attribute with a |
| value of 1. |
| |
| <p><a href="http://www.ietf.org/rfcs/rfc2109.txt">RFC 2109</a> was introduced |
| to fix some problems identified with the Netscape protocol, while still keeping |
| the same HTTP headers (<code>Cookie</code> and <code>Set-Cookie</code>). The |
| most prominent of these problems is the 'third-party' cookie issue, which was |
| an accidental feature of the Netscape protocol. When one visits www.bland.org, |
| one doesn't expect to get a cookie from www.lurid.com, a site one has never |
| visited. Depending on browser configuration, this can still happen, because |
| the unreconstructed Netscape protocol is happy to accept cookies from, say, an |
| image embedded in a webpage on www.bland.org but fetched from an |
| advertiser's server (www.lurid.com). This kind of event, where your browser |
| talks to a server that you haven't explicitly okayed by some means, is what the |
| RFCs call an 'unverifiable transaction'. In addition to the potential for |
| embarrassment caused by the presence of lurid.com's cookies on one's machine, |
| this may also be used to track your movements on the web, because advertising |
| agencies like doubleclick.net place ads on many sites. RFC 2109 tried to |
| change this by requiring cookies to be turned off during unverifiable |
| transactions with third-party servers - unless the user explicitly asks them to |
| be turned on. This clashed with the business model of advertisers like |
| doubleclick.net, who had started to take advantage of the third-party cookies |
| 'bug'. Since the browser vendors were more interested in the advertisers' |
| concerns than those of the browser users, this arguably doomed both RFC 2109 |
| and its successor, RFC 2965, from the start. RFC 2109 also fixed problems |
| other than the third-party cookie issue. However, even ignoring the |
| advertising issue, 2109 was stillborn, because Internet Explorer and Netscape |
| behaved differently in response to its extended <code>Set-Cookie</code> |
| headers. This was not really RFC 2109's fault: it worked the way it did to |
| keep compatibility with the Netscape protocol as implemented by Netscape. |
| Microsoft Internet Explorer (MSIE) was very new when the standard was designed, |
| but was starting to be very popular when the standard was finalised. XXX P3P, |
| and MSIE &amp; Mozilla options |
| |
| <p>XXX Apparently MSIE implements bits of RFC 2109 - but not very compliantly |
| (surprise). Presumably other browsers do too, as a result. mechanize |
| already does allow Netscape cookies to have <code>max-age</code> and |
| <code>port</code> cookie-attributes, and as far as I know that's the extent of |
| the support present in MSIE. I haven't tested, though! |
| |
| <p><a href="http://www.ietf.org/rfcs/rfc2965.txt">RFC 2965</a> attempted to fix |
| the compatibility problem by introducing two new headers, |
| <code>Set-Cookie2</code> and <code>Cookie2</code>. Unlike the |
| <code>Cookie</code> header, <code>Cookie2</code> does <em>not</em> carry |
| cookies to the server - rather, it simply advertises to the server that RFC |
| 2965 is understood. <code>Set-Cookie2</code> <em>does</em> carry cookies, from |
| server to client: because the header is new, both IE and Netscape completely |
| ignore these cookies. This prevents breakage, but introduces a chicken-and-egg |
| problem that means 2965 may never be widely adopted, especially since Microsoft |
| shows no interest in it. XXX Rumour has it that the European Union is unhappy |
| with P3P, and might introduce legislation that requires something better, |
| forming a gap that RFC 2965 might fill - any truth in this? Opera is the only |
| browser I know of that supports the standard. On the server side, Apache's |
| <code>mod_usertrack</code> supports it. One confusing point to note about RFC |
| 2965 is that it uses the same value (1) of the Version attribute in HTTP |
| headers as does RFC 2109. |
| |
| <p>Most recently, it was discovered that RFC 2965 does not fully take account |
| of issues arising when 2965 and Netscape cookies coexist, and errata were |
| discussed on the W3C http-state mailing list, but the list traffic died and it |
| seems RFC 2965 is dead as an internet protocol (but still a useful basis for |
| implementing the de-facto standards, and perhaps as an intranet protocol). |
| |
| <p>Because Netscape cookies are so poorly specified, the general philosophy |
| of the module's Netscape cookie implementation is to start with RFC 2965 |
| and open holes where required for Netscape protocol-compatibility. RFC |
| 2965 cookies are <em>always</em> treated as RFC 2965 requires, of course! |
| |
| |
| <a name="faq_pre"></a> |
| <h2>FAQs - pre install</h2> |
| <ul> |
| <li>Doesn't the standard Python library module, <code>Cookie</code>, do |
| this? |
| <p>No: Cookie.py does the server end of the job. It doesn't know when to |
| accept cookies from a server or when to pass them back. |
| <li>Where can I find out more about the HTTP cookie protocol? |
| <p>There is more than one protocol, in fact (see the <a href="./doc.html">docs</a> |
| for a brief explanation of the history): |
| <ul> |
| <li>The original <a href="http://www.netscape.com/newsref/std/cookie_spec.html"> |
| Netscape cookie protocol</a> - the standard still in use today, in |
| theory (in reality, the protocol implemented by all the major browsers |
| only bears a passing resemblance to the protocol sketched out in this |
| document). |
| <li><a href="http://www.ietf.org/rfcs/rfc2109.txt">RFC 2109</a> - obsoleted |
| by RFC 2965. |
| <li><a href="http://www.ietf.org/rfcs/rfc2965.txt">RFC 2965</a> - the |
| Netscape protocol with the bugs fixed (not widely used - the Netscape |
| protocol still dominates, and seems likely to remain dominant |
| indefinitely, at least on the Internet). |
| <a href="http://www.ietf.org/rfcs/rfc2964.txt">RFC 2964</a> discusses use |
| of the protocol. |
| <a href="http://kristol.org/cookie/errata.html">Errata</a> to RFC 2965 |
| are currently being discussed on the |
| <a href="http://lists.bell-labs.com/mailman/listinfo/http-state"> |
| http-state mailing list</a> (update: list traffic died months ago and |
| hasn't revived). |
| <li>A <a href="http://doi.acm.org/10.1145/502152.502153">paper</a> by David |
| Kristol setting out the history of the cookie standards in exhausting |
| detail. |
| <li>HTTP cookies <a href="http://www.cookiecentral.com/">FAQ</a>. |
| </ul> |
| <li>Which protocols does mechanize support? |
| <p>Netscape and RFC 2965. RFC 2965 handling is switched off by default. |
| <li>What about RFC 2109? |
| <p>RFC 2109 cookies are currently parsed as Netscape cookies, and treated |
| by default as RFC 2965 cookies thereafter if RFC 2965 handling is enabled, |
| or as Netscape cookies otherwise. RFC 2109 is officially obsoleted by RFC |
| 2965. Browsers do use a few RFC 2109 features in their Netscape cookie |
| implementations (<code>port</code> and <code>max-age</code>), and |
| mechanize knows about that, too. |
| </ul> |
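| <p>Turning RFC 2965 handling on is a cookie-policy setting. A minimal |
| sketch using the standard library's descendant of this module |
| (<code>http.cookiejar</code>), whose <code>DefaultCookiePolicy</code> |
| keeps the same <code>rfc2965</code> switch: |

```python
from http.cookiejar import CookieJar, DefaultCookiePolicy

# RFC 2965 handling is off unless requested explicitly.
print(DefaultCookiePolicy().rfc2965)  # False

policy = DefaultCookiePolicy(rfc2965=True)
jar = CookieJar(policy=policy)  # this jar will honour Set-Cookie2
print(policy.rfc2965)  # True
```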
| |
| |
| <a name="faq_use"></a> |
| <h2>FAQs - usage</h2> |
| <ul> |
| <li>Why don't I have any cookies? |
| <p>Read the <a href="./doc.html#debugging">debugging section</a> of this page. |
| <li>My response claims to be empty, but I know it's not! |
| <p>Did you call <code>response.read()</code> (e.g., in a debug statement), |
| then forget that all the data has already been read? In that case, you |
| may want to use <code>mechanize.response_seek_wrapper</code>. |
| <li>How do I download only part of a response body? |
| <p>Just call the <code>.read()</code> or <code>.readline()</code> methods on your |
| response object as many times as you need. The <code>.seek()</code> |
| method (which is not always present, see <a |
| href="./doc.html#seekable">above</a>) still works, because mechanize |
| caches read data. |
| <li>What's the difference between the <code>.load()</code> and |
| <code>.revert()</code> methods of <code>CookieJar</code>? |
| <p><code>.load()</code> <em>appends</em> cookies from a file. |
| <code>.revert()</code> discards all existing cookies held by the |
| <code>CookieJar</code> first (but it won't lose any existing cookies if |
| the loading fails). |
| <li>Is it threadsafe? |
| <p>No. <em>Tested</em> patches welcome. Clarification: As far as I know, |
| it's perfectly possible to use mechanize in threaded code, but it |
| provides no synchronisation: you have to provide that yourself. |
| <li>How do I do &lt;X&gt;? |
| <p>The module docstrings are worth reading if you want to do something |
| unusual. |
| </ul> |
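| <p>The <code>.load()</code> / <code>.revert()</code> distinction can be |
| sketched with the standard library's descendant of this module, whose |
| <code>FileCookieJar</code> keeps both methods (the filename here is |
| made up): |

```python
import os
import tempfile
from http.cookiejar import LWPCookieJar

path = os.path.join(tempfile.mkdtemp(), "cookies.lwp")
jar = LWPCookieJar()
jar.save(path)    # write out the current (empty) cookie set
jar.load(path)    # load() appends cookies from the file
jar.revert(path)  # revert() clears the jar first, then reloads the file
print(len(jar))   # 0: still no cookies
```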
| |
| <p>I prefer questions and comments to be sent to the <a |
| href="http://lists.sourceforge.net/lists/listinfo/wwwsearch-general"> |
| mailing list</a> rather than direct to me. |
| |
| <p><a href="mailto:jjl@@pobox.com">John J. Lee</a>, |
| @(time.strftime("%B %Y", last_modified)). |
| |
| <hr> |
| |
| </div> |
| |
| <div id="Menu"> |
| |
| @(release.navbar('ccdocs')) |
| |
| <br> |
| |
| <a href="./doc.html#examples">Examples</a><br> |
| <a href="./doc.html#browsers">Mozilla & MSIE</a><br> |
| <a href="./doc.html#file">Cookies in a file</a><br> |
| <a href="./doc.html#cookiejar">Using a <code>CookieJar</code></a><br> |
| <a href="./doc.html#extras">Processors</a><br> |
| <a href="./doc.html#seekable">Seekable responses</a><br> |
| <a href="./doc.html#requests">Request confusion</a><br> |
| <a href="./doc.html#headers">Adding headers</a><br> |
| <a href="./doc.html#unverifiable">Verifiability</a><br> |
| <a href="./doc.html#rfc2965">RFC 2965</a><br> |
| <a href="./doc.html#debugging">Debugging</a><br> |
| <a href="./doc.html#script">Embedded scripts</a><br> |
| <a href="./doc.html#dates">HTTP date parsing</a><br> |
| <a href="./doc.html#standards">Standards</a><br> |
| <a href="./doc.html#faq_use">FAQs - usage</a><br> |
| |
| </div> |
| |
| </body> |
| |
| </html> |