[Python-projects] xmldiff?

Alexandre Fayolle alexandre.fayolle at logilab.fr
Wed Sep 26 10:07:28 CEST 2007

On Tue, Sep 25, 2007 at 04:52:55PM -0700, Dan Stromberg wrote:
> Alexandre Fayolle wrote:
>> Hi Dan,
>> I'm Cc'ing the python-projects list at Logilab where people may have
>> some better ideas (the main author of xmldiff dwells there).
> Good idea/thanks.
>> On Mon, Sep 24, 2007 at 06:19:40PM -0700, Dan Stromberg wrote:
>>> Hi Alexandre.
>>> I'm looking for a simple tool for making small changes to large XML 
>>> documents.
>>> xmldiff seems a likely candidate.
>> Depending on how large the XML documents are, xmldiff may not be suitable.
>> The diff algorithm used is O(n^2), where n is the number of XML nodes
>> in the document, as far as I remember, so the cost of diffing increases
>> faster than the size of the document.
> I've tried xmldiff against a "large" (for my employer) xml document, and 
> performance was acceptable.  I'll keep the O(n^2) -ness in mind.  BTW, 
> could it not be reduced to O(nlogn) with some sorting or perhaps even 
> O(c*n) with some hashing?

No idea, I have not looked at the xmldiff internals for years.

>>> Are you aware of any related tools that might work better?
>> No, but I'm not very familiar with that field.
> OK.
>> An approach which seems commonly used is to somehow normalize the way
>> the XML content is presented and to apply traditional unix diff on the
>> resulting files.   
> This is interesting.
> Do you happen to know of any FLOSS tools for XML normalization?  Do any of 
> them preserve comments and whitespace to some extent?  If you have a 
> favorite, I may try it first - otherwise I'll just "go fish" in google.

Since you seem to be a Debian user, try apt-cache search before
googling; there are loads of things in there :-) A good keyword is c14n
(canonicalization), and the xmlstarlet package in Debian, which is based
on libxml, seems to provide such a tool. In the python-xml package
(unfortunately dead upstream), the xmlproc_parse utility can canonicalize
or normalize the input if the -o option is provided.

As for the behaviour of these tools with respect to comments and
whitespace, you'll have to check the W3C's definition of "normalized" and
"canonicalized" XML (http://www.w3.org/TR/xml-c14n).
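
To make the normalize-then-diff approach concrete, here is a minimal
sketch in Python. It assumes Python 3.8 or later, where the standard
library provides xml.etree.ElementTree.canonicalize (C14N 2.0); it is not
what xmldiff, xmlstarlet or xmlproc_parse do internally. The function
names canonical_lines and xml_unified_diff are made up for this example.
With the default options comments are dropped, and strip_text=True
discards whitespace around text, so check whether that suits your
documents (with_comments=True keeps comments).

```python
import difflib
from xml.etree.ElementTree import canonicalize  # requires Python 3.8+


def canonical_lines(xml_text):
    """Return the canonical (C14N 2.0) form of xml_text as a list of lines."""
    # strip_text=True removes leading/trailing whitespace in text nodes,
    # so differences in indentation do not show up in the diff.
    canon = canonicalize(xml_data=xml_text, strip_text=True)
    # Illustrative choice: start each tag on its own line so that a
    # line-based diff stays readable.
    return canon.replace("><", ">\n<").splitlines()


def xml_unified_diff(old_xml, new_xml):
    """Unified diff of two XML strings after canonicalization."""
    return list(difflib.unified_diff(
        canonical_lines(old_xml),
        canonical_lines(new_xml),
        fromfile="old", tofile="new", lineterm="",
    ))


if __name__ == "__main__":
    old = '<root b="2" a="1"><item>x</item></root>'
    new = "<root a='1' b='2'>\n  <item>y</item>\n</root>"
    print("\n".join(xml_unified_diff(old, new)))
```

Attribute order, quoting style and indentation are normalized away, so
only the change in the item text should appear in the output.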

Alexandre Fayolle                              LOGILAB, Paris (France)
Python, Zope, Plone, Debian training:    http://www.logilab.fr/formations
Custom software development:             http://www.logilab.fr/services
Scientific computing:                    http://www.logilab.fr/science
Takeover and maintenance of CPS sites:   http://www.migration-cms.com/