doc.html.in - external/github.com/jjlee/mechanize - Git at Google

 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
         "http://www.w3.org/TR/html4/strict.dtd">
 @# This file is processed by EmPy
 @{
 from colorize import colorize
 import time
 import release
 last_modified = release.last_modified(empy.identify()[0])
 try:
     base
 except NameError:
     base = False
 }
 <html>
 <!--This file was generated by EmPy: do not edit-->
 <head>
   <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
   <meta name="author" content="John J. Lee &lt;jjl@@pobox.com&gt;">
   <meta name="date" content="@(time.strftime("%Y-%m-%d", last_modified))">
   <title>mechanize documentation</title>
   <style type="text/css" media="screen">@@import "../../styles/style.css";</style>
   <!--[if IE 6]>
   <style type="text/css" media="screen">@@import "../../styles/style-ie6.css";</style>
   <![endif]-->
   @[if base]<base href="http://wwwsearch.sourceforge.net/mechanize/doc.html">@[end if]
 </head>
 <body>

 <div id="sf"><a href="http://sourceforge.net">
 <img src="http://sourceforge.net/sflogo.php?group_id=48205&amp;type=2"
  width="125" height="37" alt="SourceForge.net Logo"></a></div>

 <h1>mechanize documentation: handlers</h1>

 <div id="Content">

 <p class="docwarning">This documentation is in need of reorganisation!</p>

 <p>This page is the old ClientCookie documentation.  It deals with operation on
 the level of urllib2 Handler objects, and also with adding headers, debugging,
 and cookie handling.  Documentation for the higher-level browser-style
 interface is <a href="./mechanize">elsewhere</a>.


 <a name="examples"></a>
 <h2>Examples</h2>

 @{colorize(r"""
 import mechanize
 response = mechanize.urlopen("http://foo.bar.com/")
 """)}

 <p>This function behaves identically to <code>urllib2.urlopen()</code>, except
 that it deals with cookies automatically.

 <p>Here is a more complicated example, involving <code>Request</code> objects
 (useful if you want to pass <code>Request</code>s around, add headers to them,
 etc.):

 @{colorize(r"""
 import mechanize
 request = mechanize.Request("http://www.acme.com/")
 # note we're using the urlopen from mechanize, not urllib2
 response = mechanize.urlopen(request)
 # let's say this next request requires a cookie that was set in response
 request2 = mechanize.Request("http://www.acme.com/flying_machines.html")
 response2 = mechanize.urlopen(request2)

 print response2.geturl()
 print response2.info()  # headers
 print response2.read()  # body (readline and readlines work too)
 """)}

 <p>(The above example would also work with <code>urllib2.Request</code> objects
 too, since <code>mechanize.HTTPRequestUpgradeProcessor</code> knows about
 that class, but don't if you can avoid it, because this is an obscure hack for
 compatibility purposes only).

 <p>In these examples, the workings are hidden inside the
 <code>mechanize.urlopen()</code> function, which is an extension of
 <code>urllib2.urlopen()</code>.  Redirects, proxies and cookies are handled
 automatically by this function (note that you may need a bit of configuration
 to get your proxies correctly set up: see <code>urllib2</code> documentation).

 <p>Cookie processing (etc.) is handled by processor objects, which are an
 extension of <code>urllib2</code>'s handlers: <code>HTTPCookieProcessor</code>,
 <code>HTTPRefererProcessor</code> etc.  They are used like any other handler.

 <p>There is also a <code>urlretrieve()</code> function, which works like
 <code>urllib.urlretrieve()</code>.

 <p>An example at a slightly lower level shows how the module processes
 cookies more clearly:

 @{colorize(r"""
 # Don't copy this blindly!  You probably want to follow the examples
 # above, not this one.
 import mechanize

 # Build an opener that *doesn't* automatically call .add_cookie_header()
 # and .extract_cookies(), so we can do it manually without interference.
 class NullCookieProcessor(mechanize.HTTPCookieProcessor):
     def http_request(self, request): return request
     def http_response(self, request, response): return response
 opener = mechanize.build_opener(NullCookieProcessor)

 request = mechanize.Request("http://www.acme.com/")
 response = mechanize.urlopen(request)
 cj = mechanize.CookieJar()
 cj.extract_cookies(response, request)
 # let's say this next request requires a cookie that was set in response
 request2 = mechanize.Request("http://www.acme.com/flying_machines.html")
 cj.add_cookie_header(request2)
 response2 = mechanize.urlopen(request2)
 """)}

 <p>The <code>CookieJar</code> class does all the work.  There are essentially
 two operations: <code>.extract_cookies()</code> extracts HTTP cookies from
 <code>Set-Cookie</code> (the original <a
 href="http://www.netscape.com/newsref/std/cookie_spec.html">Netscape cookie
 standard</a>) and <code>Set-Cookie2</code> (<a
 href="http://www.ietf.org/rfc/rfc2965.txt">RFC 2965</a>) headers from a
 response if and only if they should be set given the request, and
 <code>.add_cookie_header()</code> adds <code>Cookie</code> headers if and only
 if they are appropriate for a particular HTTP request.  Incoming cookies are
 checked for acceptability based on the host name, etc.  Cookies are only set on
 outgoing requests if they match the request's host name, path, etc.

 <p><strong>Note that if you're using <code>mechanize.urlopen()</code> (or if
 you're using <code>mechanize.HTTPCookieProcessor</code> by some other
 means), you don't need to call <code>.extract_cookies()</code> or
 <code>.add_cookie_header()</code> yourself</strong>.  If, on the other hand,
 you want to use mechanize to provide cookie handling for an HTTP client other
 than mechanize itself, you will need to use this pair of methods.  You can make
 your own <code>request</code> and <code>response</code> objects, which must
 support the interfaces described in the docstrings of
 <code>.extract_cookies()</code> and <code>.add_cookie_header()</code>.

 <p>There are also some <code>CookieJar</code> subclasses which can store
 cookies in files and databases.  <code>FileCookieJar</code> is the abstract
 class for <code>CookieJar</code>s that can store cookies in disk files.
 <code>LWPCookieJar</code> saves cookies in a format compatible with the
 libwww-perl library.  This class is convenient if you want to store cookies in
 a human-readable file:

 @{colorize(r"""
 import mechanize
 cj = mechanize.LWPCookieJar()
 cj.revert("cookie3.txt")
 opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
 r = opener.open("http://foobar.com/")
 cj.save("cookie3.txt")
 """)}

 <p>The <code>.revert()</code> method discards all existing cookies held by the
 <code>CookieJar</code> (it won't lose any existing cookies if the load fails).
 The <code>.load()</code> method, on the other hand, adds the loaded cookies to
 existing cookies held in the <code>CookieJar</code> (old cookies are kept
 unless overwritten by newly loaded ones).

 <p><code>MozillaCookieJar</code> can load and save to the
 Mozilla/Netscape/lynx-compatible <code>'cookies.txt'</code> format.  This
 format loses some information (unusual and nonstandard cookie attributes such
 as comment, and also information specific to RFC 2965 cookies).  The subclass
 <code>MSIECookieJar</code> can load (but not save, yet) from Microsoft Internet
 Explorer's cookie files (on Windows).  <code>BSDDBCookieJar</code> (NOT FULLY
 TESTED!) saves to a BSDDB database using the standard library's
 <code>bsddb</code> module.  There's an unfinished <code>MSIEDBCookieJar</code>,
 which uses (reads and writes) the Windows MSIE cookie database directly, rather
 than storing copies of cookies as <code>MSIECookieJar</code> does.

 <h2>Important note</h2>

 <p>Only use names you can import directly from the <code>mechanize</code>
 package, and that don't start with a single underscore.  Everything else is
 subject to change or disappearance without notice.

 <a name="browsers"></a>
 <h2>Cooperating with Mozilla/Netscape, lynx and Internet Explorer</h2>

 <p>The subclass <code>MozillaCookieJar</code> differs from
 <code>CookieJar</code> only in storing cookies using a different,
 Mozilla/Netscape-compatible, file format.  The lynx browser also uses this
 format.  This file format can't store RFC 2965 cookies, so they are downgraded
 to Netscape cookies on saving.  <code>LWPCookieJar</code> itself uses a
 libwww-perl specific format (`Set-Cookie3') - see the example above.  Python
 and your browser should be able to share a cookies file (note that the file
 location here will differ on non-unix OSes):

 <p><strong>WARNING:</strong> you may want to backup your browser's cookies file
 if you use <code>MozillaCookieJar</code> to save cookies.  I <em>think</em> it
 works, but there have been bugs in the past!

 @{colorize(r"""
 import os, mechanize
 cookies = mechanize.MozillaCookieJar()
 cookies.load(os.path.join(os.environ["HOME"], "/.netscape/cookies.txt"))
 # see also the save and revert methods
 """)}

 <p>Note that cookies saved while Mozilla is running will get clobbered by
 Mozilla - see <code>MozillaCookieJar.__doc__</code>.

 <p><code>MSIECookieJar</code> does the same for Microsoft Internet Explorer
 (MSIE) 5.x and 6.x on Windows, but does not allow saving cookies in this
 format.  In future, the Windows API calls might be used to load and save
 (though the index has to be read directly, since there is no API for that,
 AFAIK; there's also an unfinished <code>MSIEDBCookieJar</code>, which uses
 (reads and writes) the Windows MSIE cookie database directly, rather than
 storing copies of cookies as <code>MSIECookieJar</code> does).

 @{colorize(r"""
 import mechanize
 cj = mechanize.MSIECookieJar(delayload=True)
 cj.load_from_registry()  # finds cookie index file from registry
 """)}

 <p>A true <code>delayload</code> argument speeds things up.

 <p>On Windows 9x (win 95, win 98, win ME), you need to supply a username to the
 <code>.load_from_registry()</code> method:

 @{colorize(r"""
 cj.load_from_registry(username="jbloggs")
 """)}

 <p>Konqueror/Safari and Opera use different file formats, which aren't yet
 supported.

 <a name="file"></a>
 <h2>Saving cookies in a file</h2>

 <p>If you have no need to co-operate with a browser, the most convenient way to
 save cookies on disk between sessions in human-readable form is to use
 <code>LWPCookieJar</code>.  This class uses a libwww-perl specific format
 (`Set-Cookie3').  Unlike <code>MozilliaCookieJar</code>, this file format
 doesn't lose information.

 <a name="cookiejar"></a>
 <h2>Using your own CookieJar instance</h2>

 <p>You might want to do this to <a href="./doc.html#browsers">use your
 browser's cookies</a>, to customize <code>CookieJar</code>'s behaviour by
 passing constructor arguments, or to be able to get at the cookies it will hold
 (for example, for saving cookies between sessions and for debugging).

 <p>If you're using the higher-level <code>urllib2</code>-like interface
 (<code>urlopen()</code>, etc), you'll have to let it know what
 <code>CookieJar</code> it should use:

 @{colorize(r"""
 import mechanize
 cookies = mechanize.CookieJar()
 # build_opener() adds standard handlers (such as HTTPHandler and
 # HTTPCookieProcessor) by default.  The cookie processor we supply
 # will replace the default one.
 opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

 r = opener.open("http://acme.com/")  # GET
 r = opener.open("http://acme.com/", data)  # POST
 """)}

 <p>The <code>urlopen()</code> function uses a global
 <code>OpenerDirector</code> instance to do its work, so if you want to use
 <code>urlopen()</code> with your own <code>CookieJar</code>, install the
 <code>OpenerDirector</code> you built with <code>build_opener()</code> using
 the <code>mechanize.install_opener()</code> function, then proceed as usual:

 @{colorize(r"""
 mechanize.install_opener(opener)
 r = mechanize.urlopen("http://www.acme.com/")
 """)}

 <p>Of course, everyone using <code>urlopen</code> is using the same global
 <code>CookieJar</code> instance!

 <a name="policy"></a>

 <p>You can set a policy object (must satisfy the interface defined by
 <code>mechanize.CookiePolicy</code>), which determines which cookies are
 allowed to be set and returned.  Use the policy argument to the
 <code>CookieJar</code> constructor, or use the .set_policy() method.  The
 default implementation has some useful switches:

 @{colorize(r"""
 from mechanize import CookieJar, DefaultCookiePolicy as Policy
 cookies = CookieJar()
 # turn on RFC 2965 cookies, be more strict about domains when setting and
 # returning Netscape cookies, and block some domains from setting cookies
 # or having them returned (read the DefaultCookiePolicy docstring for the
 # domain matching rules here)
 policy = Policy(rfc2965=True, strict_ns_domain=Policy.DomainStrict,
                 blocked_domains=["ads.net", ".ads.net"])
 cookies.set_policy(policy)
 """)}


 <a name="extras"></a>
 <h2>Optional extras: robots.txt, HTTP-EQUIV, Refresh, Referer</h2>

 <p>These are implemented as processor classes.  Processors are an extension of
 <code>urllib2</code>'s handlers (now a standard part of urllib2 in Python 2.4):
 you just pass them to <code>build_opener()</code> (example code below).

 <dl>

 <dt><code>HTTPRobotRulesProcessor</code>

 <dd><p>WWW Robots (also called wanderers or spiders) are programs that traverse
 many pages in the World Wide Web by recursively retrieving linked pages.  This
 kind of program can place significant loads on web servers, so there is a <a
 href="http://www.robotstxt.org/wc/norobots.html">standard</a> for a <code>
 robots.txt</code> file by which web site operators can request robots to keep
 out of their site, or out of particular areas of it.  This processor uses the
 standard Python library's <code>robotparser</code> module.  It raises
 <code>mechanize.RobotExclusionError</code> (subclass of
 <code>mechanize.HTTPError</code>) if an attempt is made to open a URL prohibited
 by <code>robots.txt</code>.

 <dt><code>HTTPEquivProcessor</code>

 <dd><p>The <code>&lt;META HTTP-EQUIV&gt;</code> tag is a way of including data
 in HTML to be treated as if it were part of the HTTP headers.  mechanize can
 automatically read these tags and add the <code>HTTP-EQUIV</code> headers to
 the response object's real HTTP headers.  The HTML is left unchanged.

 <dt><code>HTTPRefreshProcessor</code>

 <dd><p>The <code>Refresh</code> HTTP header is a non-standard header which is
 widely used.  It requests that the user-agent follow a URL after a specified
 time delay.  mechanize can treat these headers (which may have been set in
 <code>&lt;META HTTP-EQUIV&gt;</code> tags) as if they were 302 redirections.
 Exactly when and how <code>Refresh</code> headers are handled is configurable
 using the constructor arguments.

 <dt><code>HTTPRefererProcessor</code>

 <dd><p>The <code>Referer</code> HTTP header lets the server know which URL
 you've just visited.  Some servers use this header as state information, and
 don't like it if this is not present.  It's a chore to add this header by hand
 every time you make a request.  This adds it automatically.
 <strong>NOTE</strong>: this only makes sense if you use each processor for a
 single chain of HTTP requests (so, for example, if you use a single
 HTTPRefererProcessor to fetch a series of URLs extracted from a single page,
 <strong>this will break</strong>).  <a
 href="../mechanize/">mechanize.Browser</a> does this properly.</p>

 </dl>

 @{colorize(r"""
 import mechanize
 cookies = mechanize.CookieJar()

 opener = mechanize.build_opener(mechanize.HTTPRefererProcessor,
                                 mechanize.HTTPEquivProcessor,
                                 mechanize.HTTPRefreshProcessor,
 				)
 opener.open("http://www.rhubarb.com/")
 """)}


 <a name="seekable"></a>
 <h2>Seekable responses</h2>

 <p>Response objects returned from (or raised as exceptions by)
 <code>mechanize.SeekableResponseOpener</code>, <code>mechanize.UserAgent</code>
 (if <code>.set_seekable_responses(True)</code> has been called) and
 <code>mechanize.Browser()</code> have <code>.seek()</code>,
 <code>.get_data()</code> and <code>.set_data()</code> methods:

 @{colorize(r"""
 import mechanize
 opener = mechanize.OpenerFactory(mechanize.SeekableResponseOpener).build_opener()
 response = opener.open("http://example.com/")
 # same return value as .read(), but without affecting seek position
 total_nr_bytes = len(response.get_data())
 assert len(response.read()) == total_nr_bytes
 assert len(response.read()) == 0  # we've already read the data
 response.seek(0)
 assert len(response.read()) == total_nr_bytes
 response.set_data("blah\n")
 assert response.get_data() == "blah\n"
 ...
 """)}

 <p>This caching behaviour can be avoided by using
 <code>mechanize.OpenerDirector</code> (as long as
 <code>SeekableProcessor</code>, <code>HTTPEquivProcessor</code> and
 <code>HTTPResponseDebugProcessor</code> are not used).  It can also be avoided
 with <code>mechanize.UserAgent</code>:

 @{colorize(r"""
 import mechanize
 ua = mechanize.UserAgent()
 ua.set_seekable_responses(False)
 ua.set_handle_equiv(False)
 ua.set_debug_responses(False)
 """)}

 <p>Note that if you turn on features that use seekable responses (currently:
 HTTP-EQUIV handling and response body debug printing), returned responses
 <em>may</em> be seekable as a side-effect of these features.  However, this is
 not guaranteed (currently, in these cases, returned response objects are
 seekable, but raised respose objects &#8212; <code>mechanize.HTTPError</code>
 instances &#8212; are not seekable).  This applies regardless of whether you
 use <code>mechanize.UserAgent</code> or <code>mechanize.OpenerDirector</code>.
 If you explicitly request seekable responses by calling
 <code>.set_seekable_responses(True)</code> on a
 <code>mechanize.UserAgent</code> instance, or by using
 <code>mechanize.Browser</code> or
 <code>mechanize.SeekableResponseOpener</code>, which always return seekable
 responses, then both returned and raised responses are guaranteed to be
 seekable.

 <p>Handlers should call <code>response =
 mechanize.seek_wrapped_response(response)</code> if they require the
 <code>.seek()</code>, <code>.get_data()</code> or <code>.set_data()</code>
 methods.

 <p>Note that <code>SeekableProcessor</code> (and
 <code>ResponseUpgradeProcessor</code>) are deprecated since mechanize 0.1.6b.
 The reason for the deprecation is that these were really abuses of the response
 processing chain (the <code>.process_response()</code> support documented by
 urllib2).  The response processing chain is sensibly used only for processing
 response headers and data, not for processing response <em>objects</em>,
 because the same data may occur as different Python objects (this can occur for
 example when <code>HTTPError</code> is raised by
 <code>HTTPDefaultErrorHandler</code>), but should only get processed once
 (during <code>.open()</code>).


 <a name="requests"></a>
 <h2>Confusing fact about headers and Requests</h2>

 <p>mechanize automatically upgrades <code>urllib2.Request</code> objects to
 <code>mechanize.Request</code>, as a backwards-compatibility hack.  This
 means that you won't see any headers that are added to Request objects by
 handlers unless you use <code>mechanize.Request</code> in the first place.
 Sorry about that.

 <p>Note also that handlers may create new <code>Request</code> instances (for
 example when performing redirects) rather than adding headers to existing
 <code>Request objects</code>.


 <a name="headers"></a>
 <h2>Adding headers</h2>

 <p>Adding headers is done like so:

 @{colorize(r"""
 import mechanize
 req = mechanize.Request("http://foobar.com/")
 req.add_header("Referer", "http://wwwsearch.sourceforge.net/mechanize/")
 r = mechanize.urlopen(req)
 """)}

 <p>You can also use the headers argument to the <code>mechanize.Request</code>
 constructor.

 <p>mechanize adds some headers to <code>Request</code> objects automatically -
 see the next section for details.


 <h2>Changing the automatically-added headers (User-Agent)</h2>

 <p><code>OpenerDirector</code> automatically adds a <code>User-Agent</code>
 header to every <code>Request</code>.

 <p>To change this and/or add similar headers, use your own
 <code>OpenerDirector</code>:

 @{colorize(r"""
 import mechanize
 cookies = mechanize.CookieJar()
 opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))
 opener.addheaders = [("User-agent", "Mozilla/5.0 (compatible; MyProgram/0.1)"),
                      ("From", "responsible.person@example.com")]
 """)}

 <p>Again, to use <code>urlopen()</code>, install your
 <code>OpenerDirector</code> globally:

 @{colorize(r"""
 mechanize.install_opener(opener)
 r = mechanize.urlopen("http://acme.com/")
 """)}

 <p>Also, a few standard headers (<code>Content-Length</code>,
 <code>Content-Type</code> and <code>Host</code>) are added when the
 <code>Request</code> is passed to <code>urlopen()</code> (or
 <code>OpenerDirector.open()</code>).  You shouldn't need to change these
 headers, but since this is done by <code>AbstractHTTPHandler</code>, you can
 change the way it works by passing a subclass of that handler to
 <code>build_opener()</code> (or, as always, by constructing an opener yourself
 and calling .add_handler()).


 <a name="unverifiable"></a>
 <h2>Initiating unverifiable transactions</h2>

 <p>This section is only of interest for correct handling of third-party HTTP
 cookies.  See <a href="./doc.html#standards">below</a> for an explanation of
 'third-party'.

 <p>First, some terminology.

 <p>An <em>unverifiable request</em> (defined fully by RFC 2965) is one whose
 URL the user did not have the option to approve.  For example, a transaction is
 unverifiable if the request is for an image in an HTML document, and the user
 had no option to approve the fetching of the image from a particular URL.

 <p>The <em>request-host of the origin transaction</em> (defined fully by RFC
 2965) is the host name or IP address of the original request that was initiated
 by the user.  For example, if the request is for an image in an HTML document,
 this is the request-host of the request for the page containing the image.

 <p><strong>mechanize knows that redirected transactions are unverifiable,
 and will handle that on its own (ie. you don't need to think about the origin
 request-host or verifiability yourself).</strong>

 <p>If you want to initiate an unverifiable transaction yourself (which you
 should if, for example, you're downloading the images from a page, and 'the
 user' hasn't explicitly OKed those URLs):

 @{colorize(r"""
 request = Request(origin_req_host="www.example.com", unverifiable=True)
 """)}


 <a name="rfc2965"></a>
 <h2>RFC 2965 handling</h2>

 <p>RFC 2965 handling is switched off by default, because few browsers implement
 it, so the RFC 2965 protocol is essentially never seen on the internet.  To
 switch it on, see <a href="./doc.html#policy">here</a>.


 <a name="debugging"></a>
 <h2>Debugging</h2>

 <!--XXX move as much as poss. to General page-->

 <p>First, a few common problems.  The most frequent mistake people seem to make
 is to use <code>mechanize.urlopen()</code>, <em>and</em> the
 <code>.extract_cookies()</code> and <code>.add_cookie_header()</code> methods
 on a cookie object themselves.  If you use <code>mechanize.urlopen()</code>
 (or <code>OpenerDirector.open()</code>), the module handles extraction and
 adding of cookies by itself, so you should not call
 <code>.extract_cookies()</code> or <code>.add_cookie_header()</code>.

 <p>Are you sure the server is sending you any cookies in the first place?
 Maybe the server is keeping track of state in some other way
 (<code>HIDDEN</code> HTML form entries (possibly in a separate page referenced
 by a frame), URL-encoded session keys, IP address, HTTP <code>Referer</code>
 headers)?  Perhaps some embedded script in the HTML is setting cookies (see
 below)?  Maybe you messed up your request, and the server is sending you some
 standard failure page (even if the page doesn't appear to indicate any
 failure).  Sometimes, a server wants particular headers set to the values it
 expects, or it won't play nicely.  The most frequent offenders here are the
 <code>Referer</code> [<em>sic</em>] and / or <code>User-Agent</code> HTTP
 headers (<a href="./doc.html#headers">see above</a> for how to set these).  The
 <code>User-Agent</code> header may need to be set to a value like that of a
 popular browser.  The <code>Referer</code> header may need to be set to the URL
 that the server expects you to have followed a link from.  Occasionally, it may
 even be that operators deliberately configure a server to insist on precisely
 the headers that the popular browsers (MS Internet Explorer, Mozilla/Netscape,
 Opera, Konqueror/Safari) generate, but remember that incompetence (possibly on
 your part) is more probable than deliberate sabotage (and if a site owner is
 that keen to stop robots, you probably shouldn't be scraping it anyway).

 <p>When you <code>.save()</code> to or
 <code>.load()</code>/<code>.revert()</code> from a file, single-session cookies
 will expire unless you explicitly request otherwise with the
 <code>ignore_discard</code> argument.  This may be your problem if you find
 cookies are going away after saving and loading.

 @{colorize(r"""
 import mechanize
 cj = mechanize.LWPCookieJar()
 opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
 mechanize.install_opener(opener)
 r = mechanize.urlopen("http://foobar.com/")
 cj.save("/some/file", ignore_discard=True, ignore_expires=True)
 """)}

 <p>If none of the advice above solves your problem quickly, try comparing the
 headers and data that you are sending out with those that a browser emits.
 Often this will give you the clue you need.  Of course, you'll want to check
 that the browser is able to do manually what you're trying to achieve
 programatically before minutely examining the headers.  Make sure that what you
 do manually is <em>exactly</em> the same as what you're trying to do from
 Python - you may simply be hitting a server bug that only gets revealed if you
 view pages in a particular order, for example.  In order to see what your
 browser is sending to the server (even if HTTPS is in use), see <a
 href="../clientx.html">the General FAQ page</a>.  If nothing is obviously wrong
 with the requests your program is sending and you're out of ideas, you can try
 the last resort of good old brute force binary-search debugging.  Temporarily
 switch to sending HTTP headers (with <code>httplib</code>).  Start by copying
 Netscape/Mozilla or IE slavishly (apart from session IDs, etc., of course),
 then begin the tedious process of mutating your headers and data until they
 match what your higher-level code was sending.  This will at least reliably
 find your problem.

 <p>You can turn on display of HTTP headers:

 @{colorize(r"""
 import mechanize
 hh = mechanize.HTTPHandler()  # you might want HTTPSHandler, too
 hh.set_http_debuglevel(1)
 opener = mechanize.build_opener(hh)
 response = opener.open(url)
 """)}

 <p>Alternatively, you can examine your individual request and response
 objects to see what's going on.  Note, though, that mechanize upgrades
 <code>urllib2.Request</code> objects to <code>mechanize.Request</code>, so you
 won't see any headers that are added to requests by handlers unless you use
 <code>mechanize.Request</code> in the first place.  In addition, requests may
 involve "sub-requests" in cases such as redirection, in which case you will
 also not see everything that's going on just by examining the original request
 and final response.  mechanize's responses can be made to
 have <code>.seek()</code> and <code>.get_data()</code> methods.  It's often
 useful to use the <code>.get_data()</code> method during debugging
 (see <a href="./doc.html#seekable">above</a>).

 <p>Also, note <code>HTTPRedirectDebugProcessor</code> (which prints information
 about redirections) and <code>HTTPResponseDebugProcessor</code> (which prints
 out all response bodies, including those that are read during redirections).
 <strong>NOTE</strong>: as well as having these processors in your
 <code>OpenerDirector</code> (for example, by passing them to
 <code>build_opener()</code>) you have to turn on logging at the
 <code>INFO</code> level or lower in order to see any output.

 <p>If you would like to see what is going on in mechanize's tiny mind, do
 this:

 @{colorize(r"""
 import sys, logging
 # logging.DEBUG covers masses of debugging information,
 # logging.INFO just shows the output from HTTPRedirectDebugProcessor,
 logger = logging.getLogger("mechanize")
 logger.addHandler(logging.StreamHandler(sys.stdout))
 logger.setLevel(logging.DEBUG)
 """)}

 <p>The <code>DEBUG</code> level (as opposed to the <code>INFO</code> level) can
 actually be quite useful, as it explains why particular cookies are accepted or
 rejected and why they are or are not returned.

 <p>One final thing to note is that there are some catch-all bare
 <code>except:</code> statements in the module, which are there to handle
 unexpected bad input without crashing your program.  If this happens, it's a
 bug in mechanize, so please mail me the warning text.


 <a name="script"></a>
 <h2>Embedded script that sets cookies</h2>

 <p>It is possible to embed script in HTML pages (sandwiched between
 <code>&lt;SCRIPT&gt;here&lt;/SCRIPT&gt;</code> tags, and in
 <code>javascript:</code> URLs) - JavaScript / ECMAScript, VBScript, or even
 Python - that causes cookies to be set in a browser.  See the <a
 href="../bits/clientx.html">General FAQs</a> page for what to do about this.


 <a name="dates"></a>
 <h2>Parsing HTTP date strings</h2>

 <p>A function named <code>str2time</code> is provided by the package,
 which may be useful for parsing dates in HTTP headers.
 <code>str2time</code> is intended to be liberal, since HTTP date/time
 formats are poorly standardised in practice.  There is no need to use this
 function in normal operations: <code>CookieJar</code> instances keep track
 of cookie lifetimes automatically.  This function will stay around in some
 form, though the supported date/time formats may change.


 <a name="badhtml"></a>
 <h2>Dealing with bad HTML</h2>

 <p>XXX Intro

 <p>XXX Test me

 @{colorize("""\
 import copy
 import mechanize
 class CommentCleanProcessor(mechanize.BaseProcessor):
       def http_response(self, request, response):
           if not hasattr(response, "seek"):
               response = mechanize.response_seek_wrapper(response)
           response.seek(0)
           new_response = copy.copy(response)
           new_response.set_data(
               re.sub("<!-([^-]*)->", "<!--\\1-->", response.read()))
           return new_response
       https_response = http_response
 """)}

 <p>XXX TidyProcessor: mxTidy?  tidylib?  tidy?


 <a name="standards"></a>
 <h2>Note about cookie standards</h2>

 <p>The various cookie standards and their history form a case study of the
 terrible things that can happen to a protocol.  The long-suffering David
 Kristol has written a <a
 href="http://arxiv.org/abs/cs.SE/0105018">paper</a> about it, if you
 want to know the gory details.

 <p>Here is a summary.

 <p>The <a href="http://www.netscape.com/newsref/std/cookie_spec.html">Netscape
 protocol</a> (cookie_spec.html) is still the only standard supported by most
 browsers (including Internet Explorer and Netscape).  Be aware that
 cookie_spec.html is not, and never was, actually followed to the letter (or
 anything close) by anyone (including Netscape, IE and mechanize): the
 Netscape protocol standard is really defined by the behaviour of Netscape (and
 now IE).  Netscape cookies are also known as V0 cookies, to distinguish them
 from RFC 2109 or RFC 2965 cookies, which have a version cookie-attribute with a
 value of 1.

 <p><a href="http://www.ietf.org/rfcs/rfc2109.txt">RFC 2109</a> was introduced
 to fix some problems identified with the Netscape protocol, while still keeping
 the same HTTP headers (<code>Cookie</code> and <code>Set-Cookie</code>).  The
 most prominent of these problems is the 'third-party' cookie issue, which was
 an accidental feature of the Netscape protocol.  When one visits www.bland.org,
 one doesn't expect to get a cookie from www.lurid.com, a site one has never
 visited.  Depending on browser configuration, this can still happen, because
 the unreconstructed Netscape protocol is happy to accept cookies from, say, an
 image in a webpage (www.bland.org) that's included by linking to an
 advertiser's server (www.lurid.com).  This kind of event, where your browser
 talks to a server that you haven't explicitly okayed by some means, is what the
 RFCs call an 'unverifiable transaction'.  In addition to the potential for
 embarrassment caused by the presence of lurid.com's cookies on one's machine,
 this may also be used to track your movements on the web, because advertising
 agencies like doubleclick.net place ads on many sites.  RFC 2109 tried to
 change this by requiring cookies to be turned off during unverifiable
 transactions with third-party servers - unless the user explicitly asks them to
 be turned on.  This clashed with the business model of advertisers like
 doubleclick.net, who had started to take advantage of the third-party cookies
 'bug'.  Since the browser vendors were more interested in the advertisers'
 concerns than those of the browser users, this arguably doomed both RFC 2109
 and its successor, RFC 2965, from the start.  Other problems than the
 third-party cookie issue were also fixed by 2109.  However, even ignoring the
 advertising issue, 2109 was stillborn, because Internet Explorer and Netscape
 behaved differently in response to its extended <code>Set-Cookie</code>
 headers.  This was not really RFC 2109's fault: it worked the way it did to
 keep compatibility with the Netscape protocol as implemented by Netscape.
 Microsoft Internet Explorer (MSIE) was very new when the standard was designed,
 but was starting to be very popular when the standard was finalised.  XXX P3P,
 and MSIE &amp; Mozilla options

 <p>XXX Apparently MSIE implements bits of RFC 2109 - but not very compliant
 (surprise).  Presumably other browsers do too, as a result.  mechanize
 already does allow Netscape cookies to have <code>max-age</code> and
 <code>port</code> cookie-attributes, and as far as I know that's the extent of
 the support present in MSIE.  I haven't tested, though!

 <p><a href="http://www.ietf.org/rfcs/rfc2965.txt">RFC 2965</a> attempted to fix
 the compatibility problem by introducing two new headers,
 <code>Set-Cookie2</code> and <code>Cookie2</code>.  Unlike the
 <code>Cookie</code> header, <code>Cookie2</code> does <em>not</em> carry
 cookies to the server - rather, it simply advertises to the server that RFC
 2965 is understood.  <code>Set-Cookie2</code> <em>does</em> carry cookies, from
 server to client: the new header means that both IE and Netscape completely
 ignore these cookies.  This prevents breakage, but introduces a chicken-egg
 problem that means 2965 may never be widely adopted, especially since Microsoft
 shows no interest in it.  XXX Rumour has it that the European Union is unhappy
 with P3P, and might introduce legislation that requires something better,
 forming a gap that RFC 2965 might fill - any truth in this?  Opera is the only
 browser I know of that supports the standard.  On the server side, Apache's
 <code>mod_usertrack</code> supports it.  One confusing point to note about RFC
 2965 is that it uses the same value (1) of the Version attribute in HTTP
 headers as does RFC 2109.

 <p>Most recently, it was discovered that RFC 2965 does not fully take account
 of issues arising when 2965 and Netscape cookies coexist, and errata were
 discussed on the W3C http-state mailing list, but the list traffic died and it
 seems RFC 2965 is dead as an internet protocol (but still a useful basis for
 implementing the de-facto standards, and perhaps as an intranet protocol).

 <p>Because Netscape cookies are so poorly specified, the general philosophy
 of the module's Netscape cookie implementation is to start with RFC 2965
 and open holes where required for Netscape protocol-compatibility.  RFC
 2965 cookies are <em>always</em> treated as RFC 2965 requires, of course!


 <a name="faq_pre"></a>
 <h2>FAQs - pre install</h2>
 <ul>
   <li>Doesn't the standard Python library module, <code>Cookie</code>, do
      this?
   <p>No: Cookie.py does the server end of the job.  It doesn't know when to
      accept cookies from a server or when to pass them back.
   <li>Where can I find out more about the HTTP cookie protocol?
   <p>There is more than one protocol, in fact (see the <a href="./doc.html">docs</a>
      for a brief explanation of the history):
   <ul>
     <li>The original <a href="http://www.netscape.com/newsref/std/cookie_spec.html">
         Netscape cookie protocol</a> - the standard still in use today, in
         theory (in reality, the protocol implemented by all the major browsers
         only bears a passing resemblance to the protocol sketched out in this
         document).
     <li><a href="http://www.ietf.org/rfcs/rfc2109.txt">RFC 2109</a> - obsoleted
         by RFC 2965.
      <li><a href="http://www.ietf.org/rfcs/rfc2965.txt">RFC 2965</a> - the
         Netscape protocol with the bugs fixed (not widely used - the Netscape
         protocol still dominates, and seems likely to remain dominant
         indefinitely, at least on the Internet).
         <a href="http://www.ietf.org/rfcs/rfc2964.txt">RFC 2964</a> discusses use
         of the protocol.
         <a href="http://kristol.org/cookie/errata.html">Errata</a> to RFC 2965
         are currently being discussed on the
         <a href="http://lists.bell-labs.com/mailman/listinfo/http-state">
         http-state mailing list</a> (update: list traffic died months ago and
         hasn't revived).
     <li>A <a href="http://doi.acm.org/10.1145/502152.502153">paper</a> by David
         Kristol setting out the history of the cookie standards in exhausting
         detail.
     <li>HTTP cookies <a href="http://www.cookiecentral.com/">FAQ</a>.
   </ul>
   <li>Which protocols does mechanize support?
      <p>Netscape and RFC 2965.  RFC 2965 handling is switched off by default.
   <li>What about RFC 2109?
      <p>RFC 2109 cookies are currently parsed as Netscape cookies, and treated
      by default as RFC 2965 cookies thereafter if RFC 2965 handling is enabled,
      or as Netscape cookies otherwise.  RFC 2109 is officially obsoleted by RFC
      2965.  Browsers do use a few RFC 2109 features in their Netscape cookie
      implementations (<code>port</code> and <code>max-age</code>), and
      mechanize knows about that, too.
 </ul>


 <a name="faq_use"></a>
 <h2>FAQs - usage</h2>
 <ul>
   <li>Why don't I have any cookies?
   <p>Read the <a href="./doc.html#debugging">debugging section</a> of this page.
   <li>My response claims to be empty, but I know it's not!
   <p>Did you call <code>response.read()</code> (eg., in a debug statement),
      then forget that all the data has already been read?  In that case, you
      may want to use <code>mechanize.response_seek_wrapper</code>.
   <li>How do I download only part of a response body?
   <p>Just call <code>.read()</code> or <code>.readline()</code> methods on your
      response object as many times as you need.  The <code>.seek()</code>
      method (which is not always present, see <a
      href="./doc.html#seekable">above</a>) still works, because mechanize
      caches read data.
   <li>What's the difference between the <code>.load()</code> and
       <code>.revert()</code> methods of <code>CookieJar</code>?
   <p><code>.load()</code> <em>appends</em> cookies from a file.
      <code>.revert()</code> discards all existing cookies held by the
      <code>CookieJar</code> first (but it won't lose any existing cookies if
      the loading fails).
   <li>Is it threadsafe?
   <p>No.  <em>Tested</em> patches welcome.  Clarification: As far as I know,
      it's perfectly possible to use mechanize in threaded code, but it
      provides no synchronisation: you have to provide that yourself.
   <li>How do I do &lt;X&gt;
   <p>The module docstrings are worth reading if you want to do something
      unusual.
 </ul>

 <p>I prefer questions and comments to be sent to the <a
 href="http://lists.sourceforge.net/lists/listinfo/wwwsearch-general">
 mailing list</a> rather than direct to me.

 <p><a href="mailto:jjl@@pobox.com">John J. Lee</a>,
 @(time.strftime("%B %Y", last_modified)).

 <hr>

 </div>

 <div id="Menu">

 @(release.navbar('ccdocs'))

 <br>

 <a href="./doc.html#examples">Examples</a><br>
 <a href="./doc.html#browsers">Mozilla &amp; MSIE</a><br>
 <a href="./doc.html#file">Cookies in a file</a><br>
 <a href="./doc.html#cookiejar">Using a <code>CookieJar</code></a><br>
 <a href="./doc.html#extras">Processors</a><br>
 <a href="./doc.html#seekable">Seekable responses</a><br>
 <a href="./doc.html#requests">Request confusion</a><br>
 <a href="./doc.html#headers">Adding headers</a><br>
 <a href="./doc.html#unverifiable">Verifiability</a><br>
 <a href="./doc.html#rfc2965">RFC 2965</a><br>
 <a href="./doc.html#debugging">Debugging</a><br>
 <a href="./doc.html#script">Embedded scripts</a><br>
 <a href="./doc.html#dates">HTTP date parsing</a><br>
 <a href="./doc.html#standards">Standards</a><br>
 <a href="./doc.html#faq_use">FAQs - usage</a><br>

 </div>

 </body>

 </html>