| =head1 NAME |
| |
| XML::SAX::Intro - An Introduction to SAX Parsing with Perl |
| |
| =head1 Introduction |
| |
| XML::SAX is a new way to work with XML Parsers in Perl. In this article |
| we'll discuss why you should be using SAX, why you should be using |
| XML::SAX, and we'll see some of the finer implementation details. The |
| text below assumes some familiarity with callback, or push-based, |
| parsing, but if you are unfamiliar with these techniques then a good |
| place to start is Kip Hampton's excellent series of articles on XML.com. |
| |
| =head1 Replacing XML::Parser |
| |
| The de facto way of parsing XML under Perl is to use Larry Wall and |
| Clark Cooper's XML::Parser. This module is a Perl and XS wrapper around |
| the expat XML parser library by James Clark. It has been a hugely |
| successful project, but suffers from a couple of rather major flaws. |
| Firstly it is a proprietary API, designed before the SAX API was |
| conceived, which means that it is not easily replaceable by other |
| streaming parsers. Secondly, its callbacks are subrefs. This doesn't |
| sound like much of an issue, but unfortunately it leads to code like: |
| |
|     sub handle_start { |
|         my ($e, $el, %attrs) = @_; |
|         if ($el eq 'foo') { |
|             $e->{inside_foo}++; # BAD! $e is an XML::Parser::Expat object. |
|         } |
|     } |
| |
| As you can see, we're using the $e object to hold our state |
| information, which is a bad idea because we don't own that object - we |
| didn't create it. It's an internal object of XML::Parser, that happens |
| to be a hashref. We could all too easily overwrite XML::Parser internal |
| state variables by using this, or Clark could change it to an array ref |
| (not that he would, because it would break so much code, but he could). |
| |
| Currently, the only safe way to maintain state with XML::Parser is to |
| use a closure: |
| |
|     my $state = MyState->new(); |
|     $parser->setHandlers(Start => sub { handle_start($state, @_) }); |
| |
| This closure traps the $state variable, which now gets passed as the |
| first parameter to your callback. Unfortunately very few people use |
| this technique, as it is not documented in the XML::Parser POD files. |
| |
| Another reason you might not want to use XML::Parser is because you |
| need some feature that it doesn't provide (such as validation), or you |
| might need to use a library that doesn't use expat, due to it not being |
| installed on your system, or due to having a restrictive ISP. Using SAX |
| allows you to work around these restrictions. |
| |
| =head1 Introducing SAX |
| |
| SAX stands for the Simple API for XML. And simple it really is. |
| Constructing a SAX parser and passing events to handlers is done as |
| simply as: |
| |
|     use XML::SAX; |
|     use MySAXHandler; |
| |
|     my $parser = XML::SAX::ParserFactory->parser( |
|         Handler => MySAXHandler->new |
|     ); |
| |
|     $parser->parse_uri("foo.xml"); |
| |
| The important concept to grasp here is that SAX uses a factory class |
| called XML::SAX::ParserFactory to create a new parser instance. The |
| reason for this is so that you can support other underlying |
| parser implementations for different feature sets. This is one thing |
| that XML::Parser has always sorely lacked. |
| |
| In the code above we see the parse_uri method used, but we could |
| equally well have called parse_file, parse_string, or parse(). Please |
| see XML::SAX::Base for what these methods take as parameters, but don't |
| be fooled into believing parse_file takes a filename. No, it takes a |
| file handle, a glob, or a subclass of IO::Handle. Beware. |
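
The difference between the parse methods can be sketched as follows. This is only an illustration under stated assumptions: CountingHandler is a hypothetical handler invented for the example, the document is written to a temporary file so the script is self-contained, and XML::SAX must be installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use XML::SAX::ParserFactory;

# Hypothetical handler (not part of XML::SAX) that only counts start tags.
package CountingHandler;
use base qw(XML::SAX::Base);

sub start_element {
    my ($self, $el) = @_;
    $self->{elements}++;
}

package main;

# Write a small document to a temporary file for the example.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} '<doc><item/></doc>';
close $out or die "Cannot close $path: $!";

my $handler = CountingHandler->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);

# parse_file wants an already opened handle, not a file name.
open my $fh, '<', $path or die "Cannot open $path: $!";
$parser->parse_file($fh);
close $fh;

# parse_uri is the method that accepts a file name or URI.
$parser->parse_uri($path);

# parse_string takes the XML text itself.
$parser->parse_string('<doc><item/></doc>');

print "start tags seen: $handler->{elements}\n";
```

Passing a file name to parse_file is the mistake this sketch is meant to help avoid; use parse_uri when all you have is a name.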
| |
| SAX works very similarly to XML::Parser's default callback method, |
| except it has one major difference: rather than setting individual |
| callbacks, you create a new class in which to receive the callbacks. |
| Each callback is called as a method call on an instance of that handler |
| class. An example will best demonstrate this: |
| |
|     package MySAXHandler; |
|     use base qw(XML::SAX::Base); |
| |
|     sub start_document { |
|         my ($self, $doc) = @_; |
|         # process document start event |
|     } |
| |
|     sub start_element { |
|         my ($self, $el) = @_; |
|         # process element start event |
|     } |
| |
| Now, when we instantiate this as above, and parse some XML with this as |
| the handler, the methods start_document and start_element will be |
| called as method calls, so this would be the equivalent of directly |
| calling: |
| |
|     $object->start_element($el); |
| |
| Notice how this is different to XML::Parser's calling style, which |
| calls: |
| |
|     start_element($e, $name, %attribs); |
| |
| It is this difference between function calls and method calls that |
| allows you to subclass SAX handlers, which contributes to SAX being a |
| powerful solution. |
| |
| As you can see, unlike XML::Parser, we have to define a new package in |
| which to do our processing (there are hacks you can do to make this |
| unnecessary, but I'll leave figuring those out to the experts). The |
| biggest benefit of this is that you maintain your own state variable |
| ($self in the above example) thus freeing you of the concerns listed |
| above. It is also an improvement in maintainability - you can place the |
| code in a separate file if you wish to, and your callback methods are |
| always called the same thing, rather than having to choose a suitable |
| name for them as you had to with XML::Parser. This is an obvious win. |
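
To make the point about state concrete, here is a minimal sketch of a handler that keeps its own counters in $self. The ElementCounter package and its "count" key are hypothetical names chosen for this example, and the script assumes XML::SAX is installed.

```perl
use strict;
use warnings;

# Hypothetical handler: counts how often each element name occurs.
package ElementCounter;
use base qw(XML::SAX::Base);

sub start_document {
    my ($self, $doc) = @_;
    $self->{count} = {};    # our own state, living in our own object
}

sub start_element {
    my ($self, $el) = @_;
    $self->{count}{ $el->{Name} }++;
}

package main;
use XML::SAX::ParserFactory;

my $handler = ElementCounter->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_string('<doc><foo/><foo/><bar/></doc>');

# Report the totals gathered during the parse.
for my $name (sort keys %{ $handler->{count} }) {
    print "$name: $handler->{count}{$name}\n";
}
```

Because the handler object belongs to us, nothing here touches the parser's internals, and no closure is needed to carry the counters around.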
| |
| SAX parsers are also very flexible in how you pass a handler to them. |
| You can use a constructor parameter as we saw above, or you can pass the |
| handler directly in the call to one of the parse methods: |
| |
|     $parser->parse(Handler => $handler, |
|                    Source  => { SystemId => "foo.xml" }); |
|     # or... |
|     $parser->parse_file($fh, Handler => $handler); |
| |
| This flexibility allows for one parser to be used in many different |
| scenarios throughout your script (though one shouldn't feel pressure to |
| use this method, as parser construction is generally not a time |
| consuming process). |
| |
| =head1 Callback Parameters |
| |
| The only other thing you need to know to understand basic SAX is the |
| structure of the parameters passed to each of the callbacks. In |
| XML::Parser, all parameters are passed as multiple options to the |
| callbacks, so for example the Start callback would be called as |
| my_start($e, $name, %attributes), and the PI callback would be called |
| as my_processing_instruction($e, $target, $data). In SAX, every |
| callback is passed a hash reference, containing entries that define our |
| "node". The key callbacks and the structures they receive are: |
| |
| =head2 start_element |
| |
| The start_element handler is called whenever a parser sees an opening |
| tag. It is passed an element structure consisting of: |
| |
| =over 4 |
| |
| =item LocalName |
| |
| The name of the element minus any namespace prefix it may |
| have come with in the document. |
| |
| =item NamespaceURI |
| |
| The URI of the namespace associated with this element, |
| or the empty string for none. |
| |
| =item Attributes |
| |
| A set of attributes as described below. |
| |
| =item Name |
| |
| The name of the element as it was seen in the document (i.e. |
| including any prefix associated with it) |
| |
| =item Prefix |
| |
| The prefix used to qualify this element's namespace, or the |
| empty string if none. |
| |
| =back |
| |
| The B<Attributes> are a hash reference, keyed by what we have called |
| "James Clark" notation. This means that the attribute name has been |
| expanded to include any associated namespace URI, and put together as |
| {ns}name, where "ns" is the expanded namespace URI of the attribute if |
| and only if the attribute had a prefix, and "name" is the LocalName of |
| the attribute. |
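
The key format can be illustrated with a few lines of plain Perl. The clark_key helper below is a hypothetical function written for this sketch; it is not part of XML::SAX.

```perl
use strict;
use warnings;

# Hypothetical helper: build the "{ns}name" key for an attribute.
# Attributes without a prefix get an empty namespace part.
sub clark_key {
    my ($namespace_uri, $local_name) = @_;
    $namespace_uri = '' unless defined $namespace_uri;
    return '{' . $namespace_uri . '}' . $local_name;
}

my $plain = clark_key('', 'id');
my $xlink = clark_key('http://www.w3.org/1999/xlink', 'href');

print "$plain\n";    # prints "{}id"
print "$xlink\n";    # prints "{http://www.w3.org/1999/xlink}href"
```

Note that an unprefixed attribute is keyed as "{}id", not as plain "id".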
| |
| The value of each entry in the attributes hash is another hash |
| structure consisting of: |
| |
| =over 4 |
| |
| =item LocalName |
| |
| The name of the attribute minus any namespace prefix it may have |
| come with in the document. |
| |
| =item NamespaceURI |
| |
| The URI of the namespace associated with this attribute. If the |
| attribute had no prefix, then this consists of just the empty string. |
| |
| =item Name |
| |
| The attribute's name as it appeared in the document, including any |
| namespace prefix. |
| |
| =item Prefix |
| |
| The prefix used to qualify this attribute's namespace, or the |
| empty string if none. |
| |
| =item Value |
| |
| The value of the attribute. |
| |
| =back |
| |
| So a full example, as output by Data::Dumper, might be: |
| |
| .... |
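
The dump itself is elided above. As a stand-in, the following sketch builds by hand the kind of structure described in this section and prints it. The element, its namespaces, and its attribute values are all invented for illustration; a real parser's output may differ in detail.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # stable key order for readability
$Data::Dumper::Indent   = 1;

# Hand-built, hypothetical structure for something like:
#   <doc:para id="p1" xlink:href="intro.xml"> ... </doc:para>
my $element = {
    Name         => 'doc:para',
    Prefix       => 'doc',
    LocalName    => 'para',
    NamespaceURI => 'http://example.com/doc',
    Attributes   => {
        '{}id' => {
            Name         => 'id',
            Prefix       => '',
            LocalName    => 'id',
            NamespaceURI => '',
            Value        => 'p1',
        },
        '{http://www.w3.org/1999/xlink}href' => {
            Name         => 'xlink:href',
            Prefix       => 'xlink',
            LocalName    => 'href',
            NamespaceURI => 'http://www.w3.org/1999/xlink',
            Value        => 'intro.xml',
        },
    },
};

print Dumper($element);

# Looking up one attribute value by its "{ns}name" key.
my $id = $element->{Attributes}{'{}id'}{Value};
print "id = $id\n";
```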
| |
| =head2 end_element |
| |
| The end_element handler is called either when a parser sees a closing |
| tag, or after start_element has been called for an empty element. Do |
| note, however, that a parser may, if it is so inclined, call characters |
| with an empty string when it sees an empty element. There is no simple |
| way in SAX to determine whether the parser in fact saw an empty element |
| or a start and end tag with no content. |
| |
| The end_element handler receives exactly the same structure as |
| start_element, minus the Attributes entry. One must note though that it |
| should not be a reference to the same data as start_element receives, |
| so you may change the values in start_element but this will not affect |
| the values later seen by end_element. |
| |
| =head2 characters |
| |
| The characters callback may be called in several circumstances. The |
| most obvious one is when seeing ordinary character data in the markup. |
| But it is also called for text in a CDATA section, and is also called |
| in other situations. A SAX parser has to make no guarantees whatsoever |
| about how many times it may call characters for a stretch of text in an |
| XML document - it may call once, or it may call once for every |
| character in the text. In order to work around this it is often |
| important for the SAX developer to use a bundling technique, where text |
| is gathered up and processed in one of the other callbacks. This is not |
| always necessary, but it is a worthwhile technique to learn, which we |
| will cover in XML::SAX::Advanced (when I get around to writing it). |
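
Until then, one possible shape of the bundling technique is sketched below. It is a simplified simulation, not a definitive implementation: TextCollector is a hypothetical handler, it is written as a plain class so the script runs without XML::SAX (a real handler would normally subclass XML::SAX::Base), and the events are fed in by hand to mimic a parser that delivers text in several chunks.

```perl
use strict;
use warnings;

# Hypothetical handler that gathers text and processes it in end_element.
package TextCollector;

sub new {
    my ($class) = @_;
    return bless { buffer => '', texts => [] }, $class;
}

sub start_element {
    my ($self, $el) = @_;
    $self->{buffer} = '';                  # start a fresh bundle
}

sub characters {
    my ($self, $chars) = @_;
    $self->{buffer} .= $chars->{Data};     # append, never assume one call
}

sub end_element {
    my ($self, $el) = @_;
    push @{ $self->{texts} }, "$el->{Name}: $self->{buffer}";
    $self->{buffer} = '';
}

package main;

# Simulate a parser that splits one stretch of text into three events.
my $handler = TextCollector->new;
$handler->start_element({ Name => 'title' });
$handler->characters({ Data => $_ }) for ('SAX ', 'in ', 'Perl');
$handler->end_element({ Name => 'title' });

print "$_\n" for @{ $handler->{texts} };   # prints "title: SAX in Perl"
```

Resetting the buffer in start_element keeps the sketch short, but it discards text that precedes a child element, so mixed content would need a stack of buffers instead.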
| |
| The characters handler is called with a very simple structure - a hash |
| reference consisting of just one entry: |
| |
| =over 4 |
| |
| =item Data |
| |
| The text data that was received. |
| |
| =back |
| |
| =head2 comment |
| |
| The comment callback is called for comment text. Unlike with |
| C<characters()>, the comment callback B<must> be invoked just once for an |
| entire comment string. It receives a single simple structure - a hash |
| reference containing just one entry: |
| |
| =over 4 |
| |
| =item Data |
| |
| The text of the comment. |
| |
| =back |
| |
| =head2 processing_instruction |
| |
| The processing instruction handler is called for all processing |
| instructions in the document. Note that these processing instructions |
| may appear before the document root element, or after it, or anywhere |
| where text and elements would normally appear within the document, |
| according to the XML specification. |
| |
| The handler is passed a structure containing just two entries: |
| |
| =over 4 |
| |
| =item Target |
| |
| The target of the processing instruction. |
| |
| =item Data |
| |
| The text data in the processing instruction. Can be an empty |
| string for a processing instruction that has no data element. |
| For example E<lt>?wiggle?E<gt> is a perfectly valid processing instruction. |
| |
| =back |
| |
| =head1 Tip of the iceberg |
| |
| What we have discussed above is really the tip of the SAX iceberg. And |
| so far it looks like there's not much of interest to SAX beyond what we |
| have seen with XML::Parser. But it does go much further than that, I |
| promise. |
| |
| People who hate Object Oriented code for the sake of it may be thinking |
| here that creating a new package just to parse something is a waste |
| when they've been parsing things just fine up to now using procedural |
| code. But there's a reason for all this madness. And that reason is SAX |
| Filters. |
| |
| As you saw right at the very start, to let the parser know about our |
| class, we pass it an instance of our class as the Handler to the |
| parser. But now imagine what would happen if our class could also take |
| a Handler option, and simply do some processing and pass on our data |
| further down the line? That in a nutshell is how SAX filters work. It's |
| Unix pipes for the 21st century! |
| |
| There are two downsides to this. Firstly, writing SAX filters can be |
| tricky. If you look into the future and read the advanced tutorial I'm |
| writing, you'll see that Handler can come in several shapes and sizes, |
| so making sure your filter does the right thing takes some care. |
| Secondly, constructing complex filter chains can be difficult, and |
| simple thinking tells us that we only get one pass at our document, |
| when often we'll need more than that. |
| |
| Luckily though, those downsides have been fixed by the release of two |
| very cool modules. What's even better is that I didn't write either of |
| them! |
| |
| The first module is XML::SAX::Base. This is a VITAL SAX module that |
| acts as a base class for all SAX parsers and filters. It provides an |
| abstraction away from calling the handler methods, that makes sure your |
| filter or parser does the right thing, and it does it FAST. So, if you |
| ever need to write a SAX filter (and if you're processing XML -> XML, |
| or XML -> HTML, you probably do), then you need to be writing it as |
| a subclass of XML::SAX::Base. Really - this is advice not to be taken |
| lightly. I will not go into the details of writing a SAX filter here. |
| Kip Hampton, the author of XML::SAX::Base, has covered this nicely in |
| his article on XML.com here <URI>. |
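
For orientation only, the general shape of such a subclass might look like the sketch below. UppercaseNames and NameRecorder are hypothetical packages invented for this example, the script assumes XML::SAX is installed, and Kip's article remains the place to go for the real details.

```perl
use strict;
use warnings;

# Hypothetical filter: upper-cases element names, then passes events on.
package UppercaseNames;
use base qw(XML::SAX::Base);

sub _upcased {
    my ($el) = @_;
    # Work on a copy so the structure we were handed stays untouched.
    my %copy = (
        %$el,
        Name      => uc $el->{Name},
        LocalName => uc $el->{LocalName},
    );
    return \%copy;
}

sub start_element {
    my ($self, $el) = @_;
    # SUPER:: hands the event to whatever Handler this filter was given.
    $self->SUPER::start_element(_upcased($el));
}

sub end_element {
    my ($self, $el) = @_;
    $self->SUPER::end_element(_upcased($el));
}

# Hypothetical final handler: records the element names it receives.
package NameRecorder;
use base qw(XML::SAX::Base);

sub start_element {
    my ($self, $el) = @_;
    push @{ $self->{seen} }, $el->{Name};
}

package main;
use XML::SAX::ParserFactory;

my $recorder = NameRecorder->new;
my $filter   = UppercaseNames->new(Handler => $recorder);
my $parser   = XML::SAX::ParserFactory->parser(Handler => $filter);

$parser->parse_string('<doc><item/></doc>');
print join(', ', @{ $recorder->{seen} }), "\n";
```

Events the filter does not override, such as characters, are passed straight through to the next handler by XML::SAX::Base.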
| |
| To construct SAX pipelines, Barrie Slaymaker, a long time Perl hacker |
| whose modules you will probably have heard of or used, wrote a very |
| clever module called XML::SAX::Machines. This combines some really |
| clever SAX filter-type modules, with a construction toolkit for filters |
| that makes building pipelines easy. But before we see how it makes |
| things easy, first let's see how tricky it looks to build complex SAX |
| filter pipelines. |
| |
|     use XML::SAX::ParserFactory; |
|     use XML::SAX::Filter1; |
|     use XML::SAX::Filter2; |
|     use XML::SAX::Writer; |
| |
|     my $output_string; |
|     # Build the chain back to front: each stage needs its downstream handler. |
|     my $writer = XML::SAX::Writer->new(Output => \$output_string); |
|     my $filter2 = XML::SAX::Filter2->new(Handler => $writer); |
|     my $filter1 = XML::SAX::Filter1->new(Handler => $filter2); |
|     my $parser = XML::SAX::ParserFactory->parser(Handler => $filter1); |
| |
|     $parser->parse_uri("foo.xml"); |
| |
| This is a lot easier with XML::SAX::Machines: |
| |
|     use XML::SAX::Machines qw(Pipeline); |
| |
|     my $output_string; |
|     my $parser = Pipeline( |
|         XML::SAX::Filter1 => XML::SAX::Filter2 => \$output_string |
|     ); |
| |
|     $parser->parse_uri("foo.xml"); |
| |
| One of the main benefits of XML::SAX::Machines is that the pipelines |
| are constructed in natural order, rather than the reverse order we saw |
| with manual pipeline construction. XML::SAX::Machines takes care of all |
| the internals of pipe construction, providing you at the end with just |
| a parser you can use (and you can re-use the same parser as many times |
| as you need to). |
| |
| Just a final tip. If you ever get stuck and are confused about what is |
| being passed from one SAX filter or parser to the next, then |
| Devel::TraceSAX will come to your rescue. This Perl debugger plugin |
| will allow you to dump the SAX stream of events as it goes by. Usage is |
| really very simple: just call your Perl script that uses SAX as follows: |
| |
|     $ perl -d:TraceSAX <scriptname> |
| |
| And preferably pipe the output to a pager of some sort, such as more or |
| less. The output is extremely verbose, but should help clear some |
| issues up. |
| |
| =head1 AUTHOR |
| |
| Matt Sergeant, matt@sergeant.org |
| |
| $Id: Intro.pod,v 1.3 2002/04/30 07:16:00 matt Exp $ |
| |
| =cut |