alexandre.fayolle at logilab.fr
Wed Sep 26 10:07:28 CEST 2007
On Tue, Sep 25, 2007 at 04:52:55PM -0700, Dan Stromberg wrote:
> Alexandre Fayolle wrote:
>> Hi Dan,
>> I'm Cc'ing the python-projects list at Logilab where people may have
>> some better ideas (the main author of xmldiff dwells there).
> Good idea/thanks.
>> On Mon, Sep 24, 2007 at 06:19:40PM -0700, Dan Stromberg wrote:
>>> Hi Alexandre.
>>> I'm looking for a simple tool for making small changes to large XML
>>> documents. xmldiff seems a likely candidate.
>> Depending on how large the XML documents are, xmldiff may not be suitable.
>> As far as I remember, the diff algorithm used is O(n^2), where n is the
>> number of XML nodes in the document, so the cost of diffing grows faster
>> than the size of the document.
> I've tried xmldiff against a "large" (for my employer) xml document, and
> performance was acceptable. I'll keep the O(n^2)-ness in mind. BTW,
> could it not be reduced to O(n log n) with some sorting or perhaps even
> O(c*n) with some hashing?
No idea, I have not looked at the xmldiff internals for years.
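For what it's worth, the hashing idea can at least be illustrated outside
of xmldiff. The following is a minimal sketch and not xmldiff's algorithm:
it hashes every subtree bottom-up, so that subtrees which are identical in
both documents can be found in roughly linear time. The function names
subtree_hashes and unchanged_subtrees are made up for this example, and
finding unchanged subtrees is only a first step; it does not by itself
produce a minimal edit script.

```python
import hashlib
import xml.etree.ElementTree as ET


def subtree_hashes(root):
    """Map every element under root to a digest of its whole subtree."""
    digests = {}

    def visit(elem):
        h = hashlib.sha1()
        h.update(elem.tag.encode("utf-8"))
        # Sort attributes so that their order in the source does not matter.
        for name, value in sorted(elem.attrib.items()):
            h.update(f"\0{name}={value}".encode("utf-8"))
        h.update(b"\0" + (elem.text or "").strip().encode("utf-8"))
        for child in elem:
            # A child contributes only its fixed-size digest, so the total
            # work stays roughly proportional to the document size.
            h.update(visit(child))
            h.update(b"\0" + (child.tail or "").strip().encode("utf-8"))
        digest = h.digest()
        digests[elem] = digest
        return digest

    visit(root)
    return digests


def unchanged_subtrees(old_root, new_root):
    """Return elements of new_root whose subtree also occurs in old_root."""
    old_digests = set(subtree_hashes(old_root).values())
    new_digests = subtree_hashes(new_root)
    return [elem for elem, digest in new_digests.items() if digest in old_digests]


old = ET.fromstring("<doc><a x='1'>hi</a><b><c/></b></doc>")
new = ET.fromstring("<doc><a x='2'>hi</a><b><c/></b></doc>")
same = sorted(elem.tag for elem in unchanged_subtrees(old, new))
print(same)  # prints ['b', 'c']
```

Only the attribute on <a> differs here, so <b> and <c> are reported as
unchanged while <a> and the enclosing <doc> are not.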
>>> Are you aware of any related tools that might work better?
>> No, but I'm not too much in that field.
>> An approach which seems commonly used is to somehow normalize the way
>> the XML content is presented and to apply traditional unix diff on the
>> resulting files.
> This is interesting.
> Do you happen to know of any FLOSS tools for XML normalization? Do any of
> them preserve comments and whitespace to some extent? If you have a
> favorite, I may try it first - otherwise I'll just "go fish" in google.
Since you seem to be a Debian user, try apt-cache search before
googling, there are loads of things in there :-) A good keyword is c14n
(canonicalization), and the xmlstarlet package in Debian, based on libxml,
seems to provide such a tool. In the python-xml package (unfortunately
dead upstream) the xmlproc_parse utility can canonicalize or normalize
the input if the -o option is provided.
As for the behaviour of these tools wrt comments and whitespace, you'll
have to check the W3C's definition of "normalized" and "canonicalized".
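If you would rather stay in Python, the normalize-then-diff approach can be
sketched as below. This assumes Python 3.8 or later, whose standard
xml.etree.ElementTree.canonicalize() implements C14N 2.0; the helper names
canonical_lines and xml_diff are made up for the example. Whether comments
and whitespace survive is controlled by the with_comments and strip_text
options, so treat the choices here as one possible setting rather than a
recommendation.

```python
import difflib
from xml.etree.ElementTree import canonicalize  # available since Python 3.8


def canonical_lines(xml_text, keep_comments=True):
    """Canonicalize xml_text and split it into one line per tag boundary."""
    canon = canonicalize(
        xml_text,
        with_comments=keep_comments,  # comments are dropped by default
        strip_text=True,              # ignore indentation-only differences
    )
    # Crude line splitting so that a line-based diff has something to work
    # with; a comment containing "><" would be split as well.
    return canon.replace("><", ">\n<").splitlines()


def xml_diff(old_text, new_text):
    """Return a unified diff of the canonical forms of two XML documents."""
    return list(
        difflib.unified_diff(
            canonical_lines(old_text),
            canonical_lines(new_text),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


old_xml = '<doc b="2" a="1"><item>one</item><!-- note --><item>two</item></doc>'
new_xml = """<doc a="1" b="2">
  <item>one</item>
  <!-- note -->
  <item>2</item>
</doc>"""

for line in xml_diff(old_xml, new_xml):
    print(line)
```

The two inputs differ in attribute order and indentation as well as in the
second item, but after canonicalization only the item change shows up in
the diff.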
Alexandre Fayolle LOGILAB, Paris (France)
Python, Zope, Plone, Debian training: http://www.logilab.fr/formations
Custom software development: http://www.logilab.fr/services
Scientific computing: http://www.logilab.fr/science
Takeover and maintenance of CPS sites: http://www.migration-cms.com/