| =head1 NAME |
| |
| XML::LibXML::Parser - Parsing XML Data with XML::LibXML |
| |
| =head1 SYNOPSIS |
| |
| |
| |
| use XML::LibXML 1.70; |
| |
| # Parser constructor |
| |
| $parser = XML::LibXML->new(); |
| $parser = XML::LibXML->new(option=>value, ...); |
| $parser = XML::LibXML->new({option=>value, ...}); |
| |
| # Parsing XML |
| |
| $dom = XML::LibXML->load_xml( |
| location => $file_or_url |
| # parser options ... |
| ); |
| $dom = XML::LibXML->load_xml( |
| string => $xml_string |
| # parser options ... |
| ); |
| $dom = XML::LibXML->load_xml( |
| string => (\$xml_string) |
| # parser options ... |
| ); |
| $dom = XML::LibXML->load_xml({ |
| IO => $perl_file_handle |
| # parser options ... |
| }); |
| $dom = $parser->load_xml(...); |
| |
| # Parsing HTML |
| |
| $dom = XML::LibXML->load_html(...); |
| $dom = $parser->load_html(...); |
| |
| # Parsing well-balanced XML chunks |
| |
| $fragment = $parser->parse_balanced_chunk( $wbxmlstring, $encoding ); |
| |
| # Processing XInclude |
| |
| $parser->process_xincludes( $doc ); |
| $parser->processXIncludes( $doc ); |
| |
| # Old-style parser interfaces |
| |
| $doc = $parser->parse_file( $xmlfilename ); |
| $doc = $parser->parse_fh( $io_fh ); |
| $doc = $parser->parse_string( $xmlstring ); |
| $doc = $parser->parse_html_file( $htmlfile, \%opts ); |
| $doc = $parser->parse_html_fh( $io_fh, \%opts ); |
| $doc = $parser->parse_html_string( $htmlstring, \%opts ); |
| |
| # Push parser |
| |
| $parser->parse_chunk($string, $terminate); |
| $parser->init_push(); |
| $parser->push(@data); |
| $doc = $parser->finish_push( $recover ); |
| |
| # Set/query parser options |
| |
| $parser->option_exists($name); |
| $parser->get_option($name); |
| $parser->set_option($name,$value); |
| $parser->set_options({$name=>$value,...}); |
| |
| # XML catalogs |
| |
| $parser->load_catalog( $catalog_file ); |
| |
| =head1 PARSING |
| |
| An XML document is read into a data structure such as a DOM tree by a piece of |
| software, called a parser. XML::LibXML currently provides four different parser |
| interfaces: |
| |
| |
| =over 4 |
| |
| =item * |
| |
| A DOM Pull-Parser |
| |
| |
| |
| =item * |
| |
| A DOM Push-Parser |
| |
| |
| |
| =item * |
| |
| A SAX Parser |
| |
| |
| |
| =item * |
| |
| A DOM based SAX Parser. |
| |
| |
| |
| =back |
| |
| |
| =head2 Creating a Parser Instance |
| |
| XML::LibXML provides an OO interface to the libxml2 parser functions. Thus you |
| have to create a parser instance before you can parse any XML data. |
| |
| =over 4 |
| |
| =item new |
| |
| |
| $parser = XML::LibXML->new(); |
| $parser = XML::LibXML->new(option=>value, ...); |
| $parser = XML::LibXML->new({option=>value, ...}); |
| |
| Create a new XML and HTML parser instance. Each parser instance holds default |
| values for various parser options. Optionally, one can pass a hash reference or |
| a list of option => value pairs to set a different default set of options. |
| Unless specified otherwise, the options C<<<<<< load_ext_dtd >>>>>>, C<<<<<< expand_entities >>>>>>, and C<<<<<< huge >>>>>> are set to 1. See L<<<<<< Parser Options >>>>>> for a list of the libxml2 parser options. |
| |
| |
| |
| =back |
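| |
| As a minimal sketch of the two calling conventions of C<<<<<< new >>>>>> (assuming |
| XML::LibXML 1.70 or later is installed; the variable names are only |
| illustrative), the options can be passed either way and read back afterwards: |
| |
```perl
use strict;
use warnings;
use XML::LibXML;

# Options as a flat list of name => value pairs ...
my $strict = XML::LibXML->new( no_network => 1, no_blanks => 1 );

# ... or as a single hash reference; both forms set the same defaults.
my $same = XML::LibXML->new( { no_network => 1, no_blanks => 1 } );

# Each instance keeps its own defaults, which can be read back.
for my $name (qw(no_network no_blanks)) {
    printf "%s: %s / %s\n", $name,
        $strict->get_option($name), $same->get_option($name);
}
```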
| |
| |
| =head2 DOM Parser |
| |
| One of the common parser interfaces of XML::LibXML is the DOM parser. This |
| parser reads XML data into a DOM-like data structure, so that each tag can be |
| accessed and transformed. |
| |
| XML::LibXML's DOM parser can parse not only XML data but also (strict) HTML |
| files. There are three ways to parse documents: from a string, from a Perl |
| filehandle, or from a filename/URL. The return value from each is a L<<<<<< XML::LibXML::Document >>>>>> object, which is a DOM object. |
| |
| All of the functions listed below throw an exception if the document is |
| invalid. To prevent this from terminating your program, wrap the call in an |
| eval{} block. |
| |
| =over 4 |
| |
| =item load_xml |
| |
| |
| $dom = XML::LibXML->load_xml( |
| location => $file_or_url |
| # parser options ... |
| ); |
| $dom = XML::LibXML->load_xml( |
| string => $xml_string |
| # parser options ... |
| ); |
| $dom = XML::LibXML->load_xml( |
| string => (\$xml_string) |
| # parser options ... |
| ); |
| $dom = XML::LibXML->load_xml({ |
| IO => $perl_file_handle |
| # parser options ... |
| }); |
| $dom = $parser->load_xml(...); |
| |
| |
| This function is available since XML::LibXML 1.70. It provides an easy-to-use |
| interface to the XML parser that parses a given file (or URL), string, or input |
| stream into a DOM tree. The arguments can be passed in a HASH reference or as |
| name => value pairs. The function can be called as a class method or an object |
| method. In both cases it internally creates a new parser instance, passing the |
| specified parser options; if called as an object method, it clones the original |
| parser (preserving its settings) and additionally applies the specified options |
| to the new parser. See the constructor C<<<<<< new >>>>>> and L<<<<<< Parser Options >>>>>> for more information. |
| |
| |
| =item load_html |
| |
| |
| $dom = XML::LibXML->load_html(...); |
| $dom = $parser->load_html(...); |
| |
| |
| This function is available since XML::LibXML 1.70. It has the same usage as C<<<<<< load_xml >>>>>>, providing an interface to the HTML parser. See C<<<<<< load_xml >>>>>> for more information. |
| |
| |
| |
| =back |
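| |
| The following sketch illustrates calling C<<<<<< load_xml >>>>>> both as a class method |
| and as an object method. The sample XML and the variable names are made up |
| for illustration. |
| |
```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = '<list><item id="1">one</item><item id="2">two</item></list>';

# Class-method call: a fresh parser is created behind the scenes.
my $dom = XML::LibXML->load_xml( string => $xml );
print $dom->documentElement->nodeName, "\n";

# Object-method call: the parser's settings are cloned and the extra
# options given here are applied on top of them for this call.
my $parser = XML::LibXML->new( no_network => 1 );
my $dom2   = $parser->load_xml( string => \$xml, line_numbers => 1 );

my @ids = map { $_->getAttribute('id') } $dom2->findnodes('//item');
print "@ids\n";
```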
| |
| Parsing HTML may cause problems, especially if the ampersand ('&') is used. |
| This is a common problem when the HTML being parsed contains links to |
| CGI scripts. Such links cause the parser to throw errors. In such cases libxml2 |
| still parses the entire document as if there were no error, but the error causes |
| XML::LibXML to stop the parsing process. However, the document is not lost. |
| Such HTML documents should be parsed using the I<<<<<< recover >>>>>> flag. By default, recovery is deactivated. |
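| |
| A minimal sketch of parsing such a document with recovery enabled follows. |
| The HTML snippet is hypothetical, and whether the bare ampersand is reported |
| as an error at all may depend on the libxml2 version in use. |
| |
```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input: the bare "&" in the query string is not valid markup.
my $html = <<'HTML';
<html><body>
<a href="/cgi-bin/search?q=perl&page=2">next page</a>
</body></html>
HTML

# recover => 2 enables recovery and also silences the warnings that
# recover => 1 would print for recoverable errors.
my $dom = XML::LibXML->load_html( string => $html, recover => 2 );

for my $link ( $dom->findnodes('//a') ) {
    print $link->getAttribute('href'), "\n";
}
```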
| |
| The functions described above are implemented to parse well-formed documents. |
| In some cases a program gets well-balanced XML instead of well-formed documents |
| (e.g. an XML fragment from a database). With XML::LibXML it is not necessary to |
| wrap such fragments in the code, because XML::LibXML can also parse |
| well-balanced XML fragments. |
| |
| =over 4 |
| |
| =item parse_balanced_chunk |
| |
| $fragment = $parser->parse_balanced_chunk( $wbxmlstring, $encoding ); |
| |
| This function parses a well-balanced XML string into a L<<<<<< XML::LibXML::DocumentFragment >>>>>>. The first argument contains the input string; the optional second argument |
| can be used to specify the character encoding of the input (UTF-8 is assumed by |
| default). |
| |
| |
| =item parse_xml_chunk |
| |
| This is the old name of parse_balanced_chunk(). Because it may cause confusion |
| with the push parser interface, this function should no longer be used. |
| |
| |
| |
| =back |
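| |
| As an illustration of parse_balanced_chunk(), the sketch below parses a |
| chunk that has no single root element and appends the resulting fragment to |
| an existing document. The input strings are invented for the example. |
| |
```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;

# Two elements and a text node with no single root: not a well-formed
# document, but a well-balanced chunk.
my $fragment = $parser->parse_balanced_chunk('<a>1</a>text<b>2</b>');
my @kinds = map { $_->nodeName } $fragment->childNodes;
print "@kinds\n";

# Appending a fragment inserts its children into the target node.
my $doc = XML::LibXML->load_xml( string => '<root/>' );
$doc->documentElement->appendChild($fragment);
print $doc->documentElement->toString, "\n";
```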
| |
| By default XML::LibXML does not process XInclude tags within an XML document |
| (see the options section below). XML::LibXML allows you to post-process a |
| document to expand XInclude tags. |
| |
| =over 4 |
| |
| =item process_xincludes |
| |
| $parser->process_xincludes( $doc ); |
| |
| After a document is parsed into a DOM structure, you may want to expand the |
| document's XInclude tags. This function processes the given document structure |
| and expands all XInclude tags (or throws an error) by using the flags and |
| callbacks of the given parser instance. |
| |
| Note that the resulting tree contains some extra nodes (of type |
| XML_XINCLUDE_START and XML_XINCLUDE_END) after successfully processing the |
| document. These nodes indicate where data was included into the original tree. |
| If the document is serialized, these extra nodes will not show up. |
| |
| Remember: A Document with processed XIncludes differs from the original |
| document after serialization, because the original XInclude tags will not get |
| restored! |
| |
| If the parser flag "expand_xinclude" is set to 1, you do not need to |
| post-process the parsed document. |
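| |
| The following is a sketch of post-processing XIncludes, not a definitive |
| recipe. It writes two small files (with made-up names) into a temporary |
| directory so that the relative href can be resolved, and assumes a local |
| file system where such relative paths work: |
| |
```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);
use XML::LibXML;

# Write two small files into a scratch directory (file names are made up).
my $dir   = tempdir( CLEANUP => 1 );
my %files = (
    'part.xml' => '<part>included text</part>',
    'main.xml' => '<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
                . '<xi:include href="part.xml"/></doc>',
);
for my $name ( keys %files ) {
    open my $out, '>', File::Spec->catfile( $dir, $name ) or die "open: $!";
    print {$out} $files{$name};
    close $out or die "close: $!";
}

my $parser = XML::LibXML->new;
my $doc    = $parser->parse_file( File::Spec->catfile( $dir, 'main.xml' ) );

# Before processing, the included content is not part of the tree yet.
my $before = $doc->findnodes('//part')->size;

$parser->process_xincludes($doc);
my $after = $doc->findvalue('//part');
print "part elements before: $before, text after: $after\n";
```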
| |
| |
| =item processXIncludes |
| |
| $parser->processXIncludes( $doc ); |
| |
| This is an alias for process_xincludes with a Java-like function name. |
| |
| |
| =item parse_file |
| |
| $doc = $parser->parse_file( $xmlfilename ); |
| |
| This function parses an XML document from a file or network; $xmlfilename can |
| be either a filename or a URL. Note that for parsing files, this function is |
| the fastest choice, about 6-8 times faster than parse_fh(). |
| |
| |
| =item parse_fh |
| |
| $doc = $parser->parse_fh( $io_fh ); |
| |
| parse_fh() parses an IOREF or a subclass of IO::Handle. |
| |
| Because the data comes from an open handle, libxml2's parser does not know |
| about the base URI of the document. To set the base URI one should use |
| parse_fh() as follows: |
| |
| |
| |
| my $doc = $parser->parse_fh( $io_fh, $baseuri ); |
| |
| |
| =item parse_string |
| |
| $doc = $parser->parse_string( $xmlstring ); |
| |
| This function is similar to parse_fh(), but it parses an XML document that is |
| available as a single string in memory, or alternatively as a reference to a |
| scalar containing a string. Again, you can pass an optional base URI to the |
| function. |
| |
| |
| |
| my $doc = $parser->parse_string( $xmlstring, $baseuri ); |
| my $doc = $parser->parse_string(\$xmlstring, $baseuri); |
| |
| |
| =item parse_html_file |
| |
| $doc = $parser->parse_html_file( $htmlfile, \%opts ); |
| |
| Similar to parse_file() but parses HTML (strict) documents; $htmlfile can be |
| a filename or a URL. |
| |
| An optional second argument can be used to pass some options to the HTML parser |
| as a HASH reference. See options labeled with HTML in L<<<<<< Parser Options >>>>>>. |
| |
| |
| =item parse_html_fh |
| |
| $doc = $parser->parse_html_fh( $io_fh, \%opts ); |
| |
| Similar to parse_fh() but parses HTML (strict) streams. |
| |
| An optional second argument can be used to pass some options to the HTML parser |
| as a HASH reference. See options labeled with HTML in L<<<<<< Parser Options >>>>>>. |
| |
| Note: the encoding option may not work correctly with this function in libxml2 < |
| 2.6.27 if the HTML file declares its charset using a META tag. |
| |
| |
| =item parse_html_string |
| |
| $doc = $parser->parse_html_string( $htmlstring, \%opts ); |
| |
| Similar to parse_string() but parses HTML (strict) strings. |
| |
| An optional second argument can be used to pass some options to the HTML parser |
| as a HASH reference. See options labeled with HTML in L<<<<<< Parser Options >>>>>>. |
| |
| |
| |
| =back |
| |
| |
| =head2 Push Parser |
| |
| XML::LibXML provides a push parser interface. Rather than pulling the data from |
| a given source the push parser waits for the data to be pushed into it. |
| |
| This allows one to parse large documents without waiting for the parser to |
| finish. The interface is especially useful if a program needs to pre-process |
| the incoming pieces of XML (e.g. to detect document boundaries). |
| |
| While the XML::LibXML parse_*() functions require the data to be well-formed |
| XML, the push parser will take any arbitrary string that contains some XML |
| data. The only requirement is that all the pushed strings together form a |
| well-formed document. With the push parser interface a program can interrupt |
| the parsing process as required, where the parse_*() functions do not give |
| enough flexibility. |
| |
| Unlike the pull parser implemented in parse_fh() or parse_file(), the |
| push parser is not able to find out about the document's end itself. Thus the |
| calling program needs to indicate explicitly when the parsing is done. |
| |
| In XML::LibXML this is done by a single function: |
| |
| =over 4 |
| |
| =item parse_chunk |
| |
| $parser->parse_chunk($string, $terminate); |
| |
| parse_chunk() tries to parse a given chunk of data, which isn't necessarily |
| well-balanced data. The function takes two parameters: the chunk of data as a |
| string and, optionally, a termination flag. If the termination flag is set to a |
| true value (e.g. 1), the parsing will be stopped and the resulting document |
| will be returned, as the following example shows: |
| |
| |
| |
| my $parser = XML::LibXML->new; |
| for my $string ( "<", "foo", ' bar="hello world"', "/>") { |
| $parser->parse_chunk( $string ); |
| } |
| my $doc = $parser->parse_chunk("", 1); # terminate the parsing |
| |
| |
| |
| =back |
| |
| Internally XML::LibXML provides three functions that control the push parser |
| process: |
| |
| =over 4 |
| |
| =item init_push |
| |
| $parser->init_push(); |
| |
| Initializes the push parser. |
| |
| |
| =item push |
| |
| $parser->push(@data); |
| |
| This function pushes the data stored inside the array to libxml2's parser. Each |
| entry in @data must be a normal scalar! This method can be called repeatedly. |
| |
| |
| =item finish_push |
| |
| $doc = $parser->finish_push( $recover ); |
| |
| This function returns the result of the parsing process. If this function is |
| called without a parameter it will complain about non-well-formed documents. If |
| $recover is 1, the push parser can be used to restore broken or non-well-formed |
| (XML) documents, as the following examples show: |
| |
| |
| |
| eval { |
| $parser->push( "<foo>", "bar" ); |
| $doc = $parser->finish_push(); # will report broken XML |
| }; |
| if ( $@ ) { |
| # ... |
| } |
| |
| This can be annoying if the closing tag is missed by accident. The following |
| code will restore the document: |
| |
| |
| |
| eval { |
| $parser->push( "<foo>", "bar" ); |
| $doc = $parser->finish_push(1); # will return the data parsed |
| # unless an error happened |
| }; |
| |
| print $doc->toString(); # returns "<foo>bar</foo>" |
| |
| Of course finish_push() will return nothing if there was no data pushed to the |
| parser before. |
| |
| |
| |
| =back |
| |
| |
| =head2 Pull Parser (Reader) |
| |
| XML::LibXML also provides a pull-parser interface similar to the XmlReader |
| interface in .NET. This interface is almost streaming, and is usually faster |
| and simpler to use than SAX. See L<<<<<< XML::LibXML::Reader >>>>>>. |
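| |
| As a brief, hedged illustration of the reader interface (see the |
| XML::LibXML::Reader documentation for the authoritative API), the sketch |
| below collects the element names of a small, made-up document while |
| stepping through it node by node: |
| |
```perl
use strict;
use warnings;
use XML::LibXML::Reader;    # also exports the XML_READER_TYPE_* constants

my $reader = XML::LibXML::Reader->new( string => '<a><b>x</b><c/></a>' );

# read() advances to the next node and returns a true value while
# there is more input to process.
my @names;
while ( $reader->read ) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    push @names, $reader->name;
}
print "@names\n";
```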
| |
| |
| =head2 Direct SAX Parser |
| |
| XML::LibXML provides a direct SAX parser in the L<<<<<< XML::LibXML::SAX >>>>>> module. |
| |
| |
| =head2 DOM based SAX Parser |
| |
| XML::LibXML also provides a DOM based SAX parser. The SAX parser is defined in |
| the module XML::LibXML::SAX::Parser. As it is not a stream based parser, it |
| parses documents into a DOM and traverses the DOM tree instead. |
| |
| The API of this parser is exactly the same as any other Perl SAX2 parser. See |
| XML::SAX::Intro for details. |
| |
| Aside from the regular parsing methods, you can access the DOM tree traverser |
| directly, using the generate() method: |
| |
| |
| |
| my $doc = build_yourself_a_document(); |
| my $saxparser = XML::LibXML::SAX::Parser->new( ... ); |
| $saxparser->generate( $doc ); |
| |
| This is useful for serializing DOM trees, for example trees that you have |
| already processed or that result from XSLT processing. |
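| |
| A rough sketch of driving a SAX handler from an existing DOM tree is shown |
| here. The ElementLogger package is a hypothetical handler written only for |
| this example, and the sketch assumes that XML::SAX is installed alongside |
| XML::LibXML: |
| |
```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::SAX::Parser;

# A tiny hand-written SAX handler (hypothetical, for illustration only)
# that records the name of every element it is told about.
{
    package ElementLogger;
    sub new { my $class = shift; return bless { names => [] }, $class }

    sub start_element {
        my ( $self, $element ) = @_;
        push @{ $self->{names} }, $element->{Name};
    }
}

my $doc     = XML::LibXML->load_xml( string => '<a><b/><c/></a>' );
my $handler = ElementLogger->new;

# The DOM is traversed and the handler receives the SAX events.
my $saxparser = XML::LibXML::SAX::Parser->new( Handler => $handler );
$saxparser->generate($doc);

print "@{ $handler->{names} }\n";
```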
| |
| I<<<<<< WARNING >>>>>> |
| |
| This is NOT a streaming SAX parser. As I said above, this parser reads the |
| entire document into a DOM and serialises it. Some people couldn't read that in |
| the paragraph above so I've added this warning. If you want a streaming SAX |
| parser look at the L<<<<<< XML::LibXML::SAX >>>>>> man page. |
| |
| |
| =head1 SERIALIZATION |
| |
| XML::LibXML provides some functions to serialize nodes and documents. The |
| serialization functions are described on the L<<<<<< XML::LibXML::Node >>>>>> manpage or the L<<<<<< XML::LibXML::Document >>>>>> manpage. XML::LibXML checks three global flags that alter the serialization |
| process: |
| |
| |
| =over 4 |
| |
| =item * |
| |
| skipXMLDeclaration |
| |
| |
| |
| =item * |
| |
| skipDTD |
| |
| |
| |
| =item * |
| |
| setTagCompression |
| |
| |
| |
| =back |
| |
| Of these three flags, only setTagCompression is available for all |
| serialization functions. |
| |
| Because XML::LibXML does not set these flags itself, one has to define them |
| locally, as the following example shows: |
| |
| |
| |
| local $XML::LibXML::skipXMLDeclaration = 1; |
| local $XML::LibXML::skipDTD = 1; |
| local $XML::LibXML::setTagCompression = 1; |
| |
| If skipXMLDeclaration is defined and not '0', the XML declaration is omitted |
| during serialization. |
| |
| If skipDTD is defined and not '0', an existing DTD would not be serialized with |
| the document. |
| |
| If setTagCompression is defined and not '0', empty tags are displayed as open |
| and closing tags rather than the shortcut. For example the empty tag I<<<<<< foo >>>>>> will be rendered as I<<<<<< <foo></foo> >>>>>> rather than I<<<<<< <foo/> >>>>>>. |
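| |
| The effect of these flags can be sketched as follows. The document is a |
| made-up example, and the block scope is used so that local() restores the |
| previous values afterwards: |
| |
```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml( string => '<root><foo/></root>' );

# Default serialization: XML declaration present, <foo/> kept short.
my $plain = $doc->toString;

my $tweaked;
{
    # local() limits the change to this block.
    local $XML::LibXML::skipXMLDeclaration = 1;
    local $XML::LibXML::setTagCompression  = 1;
    $tweaked = $doc->toString;
}

print $plain, "\n", $tweaked, "\n";
```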
| |
| |
| =head1 PARSER OPTIONS |
| |
| Handling of libxml2 parser options has been unified and improved in XML::LibXML |
| 1.70. You can now set default options for a particular parser instance by |
| passing them to the constructor as C<<<<<< XML::LibXML->new({name=>value, ...}) >>>>>> or C<<<<<< XML::LibXML->new(name=>value,...) >>>>>>. The options can be queried and changed using the following methods (pre-1.70 |
| interfaces such as C<<<<<< $parser->load_ext_dtd(0) >>>>>> also exist, see below): |
| |
| =over 4 |
| |
| =item option_exists |
| |
| $parser->option_exists($name); |
| |
| Returns 1 if the current XML::LibXML version supports the option C<<<<<< $name >>>>>>, otherwise returns 0 (note that this does not necessarily mean that the option |
| is supported by the underlying libxml2 library). |
| |
| |
| =item get_option |
| |
| $parser->get_option($name); |
| |
| Returns the current value of the parser option C<<<<<< $name >>>>>>. |
| |
| |
| =item set_option |
| |
| $parser->set_option($name,$value); |
| |
| Sets option C<<<<<< $name >>>>>> to value C<<<<<< $value >>>>>>. |
| |
| |
| =item set_options |
| |
| $parser->set_options({$name=>$value,...}); |
| |
| Sets multiple parsing options at once. |
| |
| |
| |
| =back |
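| |
| A short sketch of the four methods above in use (the option name |
| no_such_option is deliberately invented to show the negative case): |
| |
```perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;

# option_exists() only says whether this XML::LibXML knows the name.
for my $name (qw(no_blanks no_such_option)) {
    printf "%-15s %s\n", $name,
        $parser->option_exists($name) ? 'known' : 'unknown';
}

# Change one option, then several at once, and read them back.
$parser->set_option( no_blanks => 1 );
$parser->set_options( { no_network => 1, pedantic_parser => 1 } );

printf "%s=%s\n", $_, $parser->get_option($_)
    for qw(no_blanks no_network pedantic_parser);
```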
| |
| IMPORTANT NOTE: This documentation reflects the parser flags available in |
| libxml2 2.7.3. Some options have no effect if an older version of libxml2 is |
| used. |
| |
| Each of the flags listed below is labeled |
| |
| =over 4 |
| |
| =item /parser/ |
| |
| if it can be used with a C<<<<<< XML::LibXML >>>>>> parser object (i.e. passed to C<<<<<< XML::LibXML->new >>>>>>, C<<<<<< XML::LibXML->set_option >>>>>>, etc.) |
| |
| |
| =item /html/ |
| |
| if it can be passed to the C<<<<<< parse_html_* >>>>>> methods |
| |
| |
| =item /reader/ |
| |
| if it can be used with the C<<<<<< XML::LibXML::Reader >>>>>>. |
| |
| |
| |
| =back |
| |
| Unless specified otherwise, the default for boolean valued options is 0 |
| (false). |
| |
| The available options are: |
| |
| =over 4 |
| |
| =item URI |
| |
| /parser, html, reader/ |
| |
| When parsing strings or file handles, XML::LibXML doesn't know about the |
| base URI of the document. To make relative references such as XIncludes work, |
| one has to set a base URI, which is then used for the parsed document. |
| |
| |
| =item line_numbers |
| |
| /parser, html, reader/ |
| |
| If this option is activated, libxml2 will store the line number of each element |
| node in the parsed document. The line number can be obtained using the C<<<<<< line_number() >>>>>> method of the C<<<<<< XML::LibXML::Node >>>>>> class (for non-element nodes this may report the line number of the containing |
| element). The line numbers are also used for reporting positions of validation |
| errors. |
| |
| IMPORTANT: Due to limitations in the libxml2 library, line numbers greater than |
| 65535 will be returned as 65535. Unfortunately, this is a long and sad story; |
| please see L<<<<<< http://bugzilla.gnome.org/show_bug.cgi?id=325533 >>>>>> for more details. |
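| |
| A small sketch of reading line numbers back (the input string is invented |
| for illustration): |
| |
```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = "<root>\n  <a/>\n  <b/>\n</root>\n";
my $dom = XML::LibXML->load_xml( string => $xml, line_numbers => 1 );

# line_number() reports where each element started in the input.
my %line;
for my $element ( $dom->findnodes('/root/*') ) {
    $line{ $element->nodeName } = $element->line_number;
    printf "%s starts on line %d\n", $element->nodeName,
        $element->line_number;
}
```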
| |
| |
| =item encoding |
| |
| /html/ |
| |
| character encoding of the input |
| |
| |
| =item recover |
| |
| /parser, html, reader/ |
| |
| recover from errors; possible values are 0, 1, and 2 |
| |
| A true value turns on recovery mode which allows one to parse broken XML or |
| HTML data. The recovery mode allows the parser to return the successfully |
| parsed portion of the input document. This is useful for almost well-formed |
| documents, where for example a closing tag is missing somewhere. Still, |
| XML::LibXML will only parse until the first fatal (non-recoverable) error |
| occurs, reporting recoverable parsing errors as warnings. To suppress even |
| these warnings, use recover=>2. |
| |
| Note that validation is switched off automatically in recovery mode. |
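| |
| The difference between strict and recovering parsing can be sketched as |
| below. The broken input is a made-up example, and exactly how much of it |
| survives recovery may vary between libxml2 versions: |
| |
```perl
use strict;
use warnings;
use XML::LibXML;

# The second <item> is never closed.
my $broken = '<root><item>one</item><item>two</root>';

# Without recover, the mismatched end tag makes the parse fail.
my $strict = eval { XML::LibXML->load_xml( string => $broken ) };
print "strict parse failed\n" unless defined $strict;

# recover => 2 returns what could be parsed and prints no warnings.
my $dom   = XML::LibXML->load_xml( string => $broken, recover => 2 );
my @items = $dom->findnodes('//item');
printf "recovered <%s> with %d item element(s)\n",
    $dom->documentElement->nodeName, scalar @items;
```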
| |
| |
| =item expand_entities |
| |
| /parser, reader/ |
| |
| substitute entities; possible values are 0 and 1; default is 1 |
| |
| Note that although setting this flag to 0 disables entity substitution, it |
| does not prevent the parser from loading external entities; when substitution |
| of an external entity is disabled, the entity will be represented in the |
| document tree by an XML_ENTITY_REF_NODE node whose subtree will be the content |
| obtained by parsing the external resource. Although this nesting is visible |
| from the DOM, it is transparent to the XPath data model, so it is possible to |
| match nodes in an unexpanded entity by the same XPath expression as if the |
| entity were expanded. See also ext_ent_handler. |
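| |
| To make the difference visible, the sketch below parses the same made-up |
| document twice, using an internal entity for simplicity. It passes |
| load_ext_dtd explicitly, on the assumption that entity expansion depends |
| on it as described under load_ext_dtd: |
| |
```perl
use strict;
use warnings;
use XML::LibXML;    # exports the XML_*_NODE constants by default

my $xml = '<!DOCTYPE r [ <!ENTITY greeting "hello"> ]><r>&greeting;</r>';

my %type;
for my $expand ( 0, 1 ) {
    my $dom = XML::LibXML->load_xml(
        string          => $xml,
        expand_entities => $expand,
        load_ext_dtd    => 1,    # set explicitly; see load_ext_dtd
        no_network      => 1,
    );

    # With expansion the child is a text node, without it an
    # entity reference node.
    $type{$expand} = $dom->documentElement->firstChild->nodeType;
    printf "expand_entities=%d: first child has node type %d\n",
        $expand, $type{$expand};
}
```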
| |
| |
| =item ext_ent_handler |
| |
| /parser/ |
| |
| Provide a custom external entity handler to be used when expand_entities is set |
| to 1. Possible value is a subroutine reference. |
| |
| This feature does not work properly in libxml2 < 2.6.27! |
| |
| The subroutine provided is called whenever the parser needs to retrieve the |
| content of an external entity. It is called with two arguments: the system ID |
| (URI) and the public ID. The value returned by the subroutine is parsed as the |
| content of the entity. |
| |
| This method can be used to completely disable entity loading, e.g. to prevent |
| exploits of the type described at (L<<<<<< http://searchsecuritychannel.techtarget.com/generic/0,295582,sid97_gci1304703,00.html >>>>>>), where a service is tricked to expose its private data by letting it parse a |
| remote file (RSS feed) that contains an entity reference to a local file (e.g. C<<<<<< /etc/fstab >>>>>>). |
| |
| A more granular solution to this problem, however, is provided by custom URL |
| resolvers, as in |
| |
| my $c = XML::LibXML::InputCallback->new(); |
| sub match { # accept file:/ URIs except for XML catalogs in /etc/xml/ |
| my ($uri) = @_; |
| return ($uri=~m{^file:/} |
| and $uri !~ m{^file:///etc/xml/}) |
| ? 1 : 0; |
| } |
| $c->register_callbacks([ \&match, sub{}, sub{}, sub{} ]); |
| $parser->input_callbacks($c); |
| |
| |
| |
| |
| =item load_ext_dtd |
| |
| /parser, reader/ |
| |
| load the external DTD subset while parsing; possible values are 0 and 1. Unless |
| specified, XML::LibXML sets this option to 1. |
| |
| This flag is also required for DTD validation, to provide complete attributes, |
| and to expand entities, regardless of whether the document has an internal |
| subset. Thus switching off external DTD loading will disable entity expansion, |
| validation, and complete attributes on internal subsets as well. |
| |
| |
| =item complete_attributes |
| |
| /parser, reader/ |
| |
| create default DTD attributes; possible values are 0 and 1 |
| |
| |
| =item validation |
| |
| /parser, reader/ |
| |
| validate with the DTD; possible values are 0 and 1 |
| |
| |
| =item suppress_errors |
| |
| /parser, html, reader/ |
| |
| suppress error reports; possible values are 0 and 1 |
| |
| |
| =item suppress_warnings |
| |
| /parser, html, reader/ |
| |
| suppress warning reports; possible values are 0 and 1 |
| |
| |
| =item pedantic_parser |
| |
| /parser, html, reader/ |
| |
| pedantic error reporting; possible values are 0 and 1 |
| |
| |
| =item no_blanks |
| |
| /parser, html, reader/ |
| |
| remove blank nodes; possible values are 0 and 1 |
| |
| |
| =item no_defdtd |
| |
| /html/ |
| |
| do not add a default DOCTYPE; possible values are 0 and 1 |
| |
| The default (0) is to add a DTD when the input HTML lacks one. |
| |
| |
| =item expand_xinclude or xinclude |
| |
| /parser, reader/ |
| |
| Implement XInclude substitution; possible values are 0 and 1 |
| |
| Expands XInclude tags immediately while parsing the document. Note that the |
| parser will use the URI resolvers installed via C<<<<<< XML::LibXML::InputCallback >>>>>> to parse the included document (if any). |
| |
| |
| =item no_xinclude_nodes |
| |
| /parser, reader/ |
| |
| do not generate XINCLUDE START/END nodes; possible values are 0 and 1 |
| |
| |
| =item no_network |
| |
| /parser, html, reader/ |
| |
| Forbid network access; possible values are 0 and 1 |
| |
| If set to true, all attempts to fetch non-local resources (such as DTD or |
| external entities) will fail (unless custom callbacks are defined). |
| |
| It may be necessary to use the flag C<<<<<< recover >>>>>> for processing documents requiring such resources while networking is off. |
| |
| |
| =item clean_namespaces |
| |
| /parser, reader/ |
| |
| remove redundant namespaces declarations during parsing; possible values are 0 |
| and 1. |
| |
| |
| =item no_cdata |
| |
| /parser, html, reader/ |
| |
| merge CDATA as text nodes; possible values are 0 and 1 |
| |
| |
| =item no_basefix |
| |
| /parser, reader/ |
| |
| do not fix up XInclude xml:base URIs; possible values are 0 and 1 |
| |
| |
| =item huge |
| |
| /parser, html, reader/ |
| |
| relax any hardcoded limit from the parser; possible values are 0 and 1. Unless |
| specified, XML::LibXML sets this option to 1. |
| |
| |
| =item gdome |
| |
| /parser/ |
| |
| THIS OPTION IS EXPERIMENTAL! |
| |
| Although quite powerful, XML::LibXML's DOM implementation is incomplete with |
| respect to the DOM level 2 or level 3 specifications. XML::GDOME is based on |
| libxml2 as well and provides a rather complete DOM implementation by |
| wrapping libgdome. This flag allows you to make use of XML::LibXML's full |
| parser options and XML::GDOME's DOM implementation at the same time. |
| |
| To make use of this function, one has to install libgdome and configure |
| XML::LibXML to use this library. For this you need to rebuild XML::LibXML! |
| |
| Note: this feature was not seriously tested in recent XML::LibXML releases. |
| |
| |
| |
| =back |
| |
| For compatibility with XML::LibXML versions prior to 1.70, the following |
| methods are also supported for querying and setting the corresponding parser |
| options (if called without arguments, the methods return the current value of |
| the corresponding parser options; with an argument sets the option to a given |
| value): |
| |
| |
| |
| $parser->validation(); |
| $parser->recover(); |
| $parser->pedantic_parser(); |
| $parser->line_numbers(); |
| $parser->load_ext_dtd(); |
| $parser->complete_attributes(); |
| $parser->expand_xinclude(); |
| $parser->gdome_dom(); |
| $parser->clean_namespaces(); |
| $parser->no_network(); |
| |
| The following obsolete methods trigger parser options in some special way: |
| |
| =over 4 |
| |
| =item recover_silently |
| |
| |
| |
| $parser->recover_silently(1); |
| |
| If called without an argument, returns true if the current value of the C<<<<<< recover >>>>>> parser option is 2 and returns false otherwise. With a true argument sets the C<<<<<< recover >>>>>> parser option to 2; with a false argument sets the C<<<<<< recover >>>>>> parser option to 0. |
| |
| |
| =item expand_entities |
| |
| |
| |
| $parser->expand_entities(0); |
| |
| Get/set the C<<<<<< expand_entities >>>>>> option. If called with a true argument, also turns the C<<<<<< load_ext_dtd >>>>>> option to 1. |
| |
| |
| =item keep_blanks |
| |
| |
| |
| $parser->keep_blanks(0); |
| |
| This is actually the opposite of the C<<<<<< no_blanks >>>>>> parser option. If used without an argument, it retrieves the negated value of C<<<<<< no_blanks >>>>>>. If used with an argument, it sets C<<<<<< no_blanks >>>>>> to the opposite value. |
| |
| |
| =item base_uri |
| |
| |
| |
| $parser->base_uri( $your_base_uri ); |
| |
| Get/set the C<<<<<< URI >>>>>> option. |
| |
| |
| |
| =back |
| |
| |
| =head1 XML CATALOGS |
| |
| C<<<<<< libxml2 >>>>>> supports XML catalogs. Catalogs are used to map remote resources to their local |
| copies. Using catalogs can speed up parsing processes if many external |
| resources from remote addresses are loaded into the parsed documents (such as |
| DTDs or XIncludes). |
| |
| Note that libxml2 has a global pool of loaded catalogs, so if you apply the |
| method C<<<<<< load_catalog >>>>>> to one parser instance, all parser instances will start using the catalog (in |
| addition to other previously loaded catalogs). |
| |
| Note also that catalogs are not used when a custom external entity handler is |
| specified. At the current state it is not possible to make use of both types of |
| resolving systems at the same time. |
| |
| =over 4 |
| |
| =item load_catalog |
| |
| $parser->load_catalog( $catalog_file ); |
| |
| Loads the XML catalog file $catalog_file. |
| |
| |
| |
| # Global external entity loader (similar to ext_ent_handler option |
| # but this works really globally, also in XML::LibXSLT include etc..) |
| |
| XML::LibXML::externalEntityLoader(\&my_loader); |
| |
| |
| |
| =back |
| |
| |
| =head1 ERROR REPORTING |
| |
| XML::LibXML throws exceptions during parsing, validation or XPath processing |
| (and some other occasions). These errors can be caught by using I<<<<<< eval >>>>>> blocks. The error is stored in I<<<<<< $@ >>>>>>. There are two implementations: the old one throws $@ which is just a message |
| string, in the new one $@ is an object from the class XML::LibXML::Error; this |
| class overrides the operator "" so that when printed, the object flattens to |
| the usual error message. |
| |
| XML::LibXML throws errors as they occur. Not expecting this is a very common |
| misunderstanding in the use of XML::LibXML. If the eval is omitted, an error |
| will always halt your script by "croaking" (see the Carp man page for details). |
| |
| Also note that an increasing number of functions throw errors if bad data is |
| passed as arguments. If you cannot ensure that valid data is passed to |
| XML::LibXML, you should wrap these calls in eval. |
| |
| Note: since version 1.59, get_last_error() is no longer available in |
| XML::LibXML for thread-safety reasons. |
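| |
| A minimal sketch of catching a parse error and handling both kinds of $@ |
| follows. The malformed input is made up, and the message() and line() |
| accessors are taken from the XML::LibXML::Error class; consult its |
| documentation for the full list: |
| |
```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);
use XML::LibXML;

# The end tag does not match the open <b> element, so parsing fails.
my $doc = eval { XML::LibXML->load_xml( string => '<a><b></a>' ) };
my $err = $@;

if ( !defined $doc ) {
    if ( blessed $err and $err->isa('XML::LibXML::Error') ) {
        # Structured error object: individual fields are available.
        printf "line %d: %s\n", $err->line, $err->message;
    }
    else {
        # Plain string (old-style error, or an unrelated die).
        print "parse failed: $err\n";
    }
}
```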
| |
| =head1 AUTHORS |
| |
| Matt Sergeant, |
| Christian Glahn, |
| Petr Pajas |
| |
| |
| =head1 VERSION |
| |
| 1.98 |
| |
| =head1 COPYRIGHT |
| |
| 2001-2007, AxKit.com Ltd. |
| |
| 2002-2006, Christian Glahn. |
| |
| 2006-2009, Petr Pajas. |
| |
| =cut |