Author |
Message
|
goffinf |
Posted: Wed Mar 27, 2013 9:16 am Post subject: Removing whitespace in XML element content |
|
|
Chevalier
Joined: 05 Nov 2005 Posts: 401
|
version: 6.1.0.10
Thought I'd seen something similar to this recently but search as I might I can't find it so ...
An XML message is received into a flow via an HTTPInput. It is configured to use the XMLNSC domain with On-Demand parsing.
Sometimes we get a message where the content of an element is blank but the start and end tags are on different lines (if you happened to look at them in an editor). For example
Code: |
<foobarbaz>
<foo>foo</foo>
<bar>bar</bar>
<baz>
</baz>
</foobarbaz>
|
The logical message tree in Broker will show that the content of <baz> is 0a09 (LF + TAB).
When we send this message out of the flow (after mapping to an MRM definition - TDS Fixed Length) it causes that part of the 'record' to appear on a separate line and the software which reads the resulting message barfs on it.
Now obviously I used a simplified example above. The real one has hundreds of fields any of which could have this problem.
Is there a simple and/or efficient way of getting rid of unwanted white-space in element content either when we parse the XML input or the MRM output ?
I did see a post that suggested that switching to XMLNS rather than XMLNSC would do that, but it didn't.
Regards
Fraser. |
mqjeff |
Posted: Wed Mar 27, 2013 9:22 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
there's a "remove mixed-content" or something like that switch for XMLNSC (maybe it's 'retain mixed content'?). |
goffinf |
Posted: Wed Mar 27, 2013 9:27 am Post subject: |
|
|
Chevalier
Joined: 05 Nov 2005 Posts: 401
|
mqjeff wrote: |
there's a "remove mixed-content" or something like that switch for XMLNSC (maybe it's 'retain mixed content'?). |
There is, although it's 'Retain mixed content' and it's definitely 'unchecked' (i.e. off).
Technically this isn't mixed content as it's inside an element not between them.
Fraser. |
mqjeff |
Posted: Wed Mar 27, 2013 9:36 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
goffinf wrote: |
mqjeff wrote: |
there's a "remove mixed-content" or something like that switch for XMLNSC (maybe it's 'retain mixed content'?). |
There is, although it's 'Retain mixed content' and it's definitely 'unchecked' (i.e. off).
Technically this isn't mixed content as it's inside an element not between them.
Fraser. |
Oh, you meant the "baz" element content, not the formatting between the elements.
I think you have to deal with this on a field-by-field basis.  |
kimbert |
Posted: Wed Mar 27, 2013 9:42 am Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
The white space in the <baz> tag is not mixed content. It is the actual text value of this tag.
You could
- add a whiteSpace facet to the XML Schema simple type that describes <baz>
- switch on validation in the message flow, and ensure that 'Build tree using XML Schema' is enabled on the input node
That should ensure that the whitespace in the XML is replaced by the empty string. Any whitespace within a string value will be collapsed to a single space character. |
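To make the effect of the facet concrete, here is a minimal Python sketch of the whiteSpace normalisation that XML Schema defines, assuming the facet is set to `collapse` (the value that matches the behaviour described above). The function names `xsd_replace` and `xsd_collapse` are illustrative only; they are not part of Broker or of any XML library.

```python
import re

# XML Schema treats exactly these four characters as whitespace:
# tab, line feed, carriage return and space.
_XSD_WS_RUN = re.compile(r"[\t\n\r ]+")


def xsd_replace(value):
    """whiteSpace="replace": every tab, LF and CR becomes one space."""
    return re.sub(r"[\t\n\r]", " ", value)


def xsd_collapse(value):
    """whiteSpace="collapse": replace, squeeze runs to one space, trim both ends."""
    return _XSD_WS_RUN.sub(" ", value).strip(" ")


# The content of <baz> in the example is LF + TAB.
print(repr(xsd_collapse("\n\t")))
# Whitespace inside a real value is reduced to single spaces.
print(repr(xsd_collapse("  foo \n\t bar ")))
```

With `collapse`, the LF + TAB value becomes the empty string, while `replace` alone would leave two spaces in its place.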
mqjeff |
Posted: Wed Mar 27, 2013 9:49 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
kimbert wrote: |
You could
- add a whiteSpace facet to the XML Schema simple type that describes <baz> and every other field that needs to be trimmed
- switch on validation in the message flow, and ensure that 'Build tree using XML Schema' is enabled on the input node
That should ensure that the whitespace in the XML is replaced by the empty string. Any whitespace within a string value will be collapsed to a single space character. |
adjusted that for you.
It's probably more correct to adjust the MRM model to ensure that it translates the LF and TABs into spaces, although again this needs to be done on each and every field.
But it's better in the long run to fix the sending application not to pretty-print the XML. |
McueMart |
Posted: Wed Mar 27, 2013 9:50 am Post subject: |
|
|
 Chevalier
Joined: 29 Nov 2011 Posts: 490 Location: UK...somewhere
|
Hacky way to do it: Read the message in as a BLOB and do a global replace on the 0a bytes. Then reparse as XMLNSC. Obviously this method isn't completely safe if you are using a multi-byte character set! Also this assumes you don't want linefeeds anywhere! |
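The multi-byte caveat can be shown with a small Python sketch. It assumes the payload might arrive as UTF-16 (an assumption for illustration, not something stated in this thread) and uses a hypothetical helper, `strip_lf_bytes`, to mimic the byte-level replace.

```python
LF = b"\x0a"


def strip_lf_bytes(raw):
    """Naive byte-level removal of 0x0A, as in the BLOB approach."""
    return raw.replace(LF, b"")


# U+010A is an ordinary letter, not a line feed.
text = "<baz>\u010a\n</baz>"

# UTF-8: the byte 0x0A only ever encodes LF, so the letter survives.
utf8_result = strip_lf_bytes(text.encode("utf-8")).decode("utf-8")

# UTF-16LE: U+010A is encoded as the bytes 0A 01, so the replace also
# removes half of a real character and shifts the bytes that follow.
utf16_result = strip_lf_bytes(text.encode("utf-16-le")).decode("utf-16-le")

# Safer: decode first, then remove the character rather than the byte.
safe_result = text.encode("utf-16-le").decode("utf-16-le").replace("\n", "")

print(repr(utf8_result))
print(repr(utf16_result))
print(repr(safe_result))
```

The byte replace happens to be harmless for UTF-8, but in UTF-16 it turns the letter into a control character. Removing the character after decoding avoids the problem in either encoding.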
goffinf |
Posted: Wed Mar 27, 2013 10:48 am Post subject: |
|
|
Chevalier
Joined: 05 Nov 2005 Posts: 401
|
mqjeff wrote: |
It's probably more correct to adjust the MRM model to ensure that it translates the LF and TABs into spaces, although again this needs to be done on each and every field.
|
Ah, I was hoping for something on the MRM side rather than the input XML.
Not having had much to do with MRM, how do I cause the LFs and TABs to be converted? (Actually, it would be preferable if they didn't turn into spaces but perhaps NULs?)
mqjeff wrote: |
But it's better in the long run to fix the sending application not to pretty-print the XML.
|
I hear that. |
goffinf |
Posted: Wed Mar 27, 2013 10:50 am Post subject: |
|
|
Chevalier
Joined: 05 Nov 2005 Posts: 401
|
McueMart wrote: |
Hacky way to do it: Read the message in as a BLOB and do a global replace on the 0a bytes. Then reparse as XMLNSC. Obviously this method isn't completely safe if you are using a multi-byte character set! Also this assumes you don't want linefeeds anywhere! |
Yes indeed. I had considered that but ... as you said, it does feel a bit of a hack, and I have had my fingers burnt with multi-byte character encoding before and am not keen to repeat that. Thanks for the suggestion though.
Fraser. |
kimbert |
Posted: Thu Mar 28, 2013 1:09 am Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
Quote: |
It's probably more correct to adjust the MRM model to ensure that it translates the LF and TABs into spaces, although again this needs to be done on each and every field. |
Why is that more correct than using a whiteSpace facet? I'm not disagreeing (not yet, anyway), just curious to know what your reasons are. |
goffinf |
Posted: Thu Mar 28, 2013 2:38 am Post subject: |
|
|
Chevalier
Joined: 05 Nov 2005 Posts: 401
|
kimbert wrote: |
Quote: |
It's probably more correct to adjust the MRM model to ensure that it translates the LF and TABs into spaces, although again this needs to be done on each and every field. |
Why is that more correct than using a whiteSpace facet? I'm not disagreeing (not yet, anyway), just curious to know what your reasons are. |
I know this question was directed at mqjeff, but here's some of my rationale.
I agree both approaches are valid. In fact, when the Dev who's looking at this asked me about it, my first response was that we should be able to affect the way in which element content is normalized by the XML parser to get what we want; I just didn't know at that time how to do so in Broker. So thanks for illuminating that approach.
As to the broader question of whether to apply normalization to the input or when producing the output thru MRM ...
For XML input I prefer to follow what others might recognize as the 'Postel's Law' approach (even though that attribution isn't entirely correct). The mantra goes something like this: 'be liberal in what you accept and conservative in what you emit' (other variations can be found).
In practice this means that whilst I am a supporter of a constraint model that provides benefit for the aspects of the interface that I (or rather 'the business') care about, it is very easy to create a brittle interface that would otherwise reject messages because of limitations in the expressiveness of XSD as a constraint language rather than whether they are 'business processable' or not.
That means that I can tolerate some unexpected data turning up if I don't care about it in my particular context (aka: the 'must ignore unknown' pattern) as well as some data which is 'missing' or perhaps even 'invalid', again, if it doesn't impact the validity of the parts that I do care about.
I'm not saying this approach is a model for everyone, but techniques like selective validation (complementary to and/or as a complete replacement for XSD) and a greater ability to evolve the interface in ways that are less likely to introduce breaking change can be of significant benefit, especially when your callers are external trading partners (i.e. resources that you don't control and on which you usually prefer not to spend any more than necessary). Again, I'm not preaching, and there *are* challenges with doing this, but I'm just making a comment about reality, at least in the integration environment that I work in.
I also suspect that our Service Design team probably doesn't have an XSD in the case I am currently looking at and I'm not at all confident they could come up with one any time soon (but that's a separate story altogether).
So, .... whilst I am attracted to leveraging the XML parser behaviour, I would like to know more about how to remove these extraneous and unwelcome characters in the MRM model definition, another area where my personal knowledge is somewhat lacking.
Regards
Fraser. |
mqjeff |
Posted: Thu Mar 28, 2013 3:38 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
I agree with everything that Fraser said.
In addition, it's a question of enforcing the rules of the *correct* contract. The MRM contract is entirely separate and entirely different from the XML contract. You shouldn't enforce rules of one contract by changing the other, necessarily.
That is, it may be perfectly reasonable and "okay" for the XML message to contain whitespace characters of various kinds in this field.
But it's not okay for the MRM message to contain anything other than specific characters from a much more restricted character range. |
kimbert |
Posted: Thu Mar 28, 2013 4:39 am Post subject: |
|
|
 Jedi Council
Joined: 29 Jul 2003 Posts: 5542 Location: Southampton
|
@goffinf: I get it. It's not the whitespace facet itself that is the problem - it is the fact that you have to switch on schema validation in order to use it.
@mqjeff: I agree - this is primarily a transformation problem. The task is to make the input data safe for the output format. The ESQL solution is fiddly, but is probably the best solution. |
goffinf |
Posted: Thu Mar 28, 2013 4:56 am Post subject: |
|
|
Chevalier
Joined: 05 Nov 2005 Posts: 401
|
kimbert wrote: |
@goffinf: I get it. It's not the whitespace facet itself that is the problem - it is the fact that you have to switch on schema validation in order to use it.
@mqjeff: I agree - this is primarily a transformation problem. The task is to make the input data safe for the output format. The ESQL solution is fiddly, but is probably the best solution. |
Does that mean there isn't anything that can be done in the MRM model itself, as mqjeff proposed?
mqjeff wrote: |
It's probably more correct to adjust the MRM model to ensure that it translates the LF and TABs into spaces, although again this needs to be done on each and every field.
|
How would you suggest removing the whitespace chars in ESQL? |
mqjeff |
Posted: Thu Mar 28, 2013 4:59 am Post subject: |
|
|
Grand Master
Joined: 25 Jun 2008 Posts: 17447
|
goffinf wrote: |
How would you suggest removing the whitespace chars in ESQL? |
That's the bit about accepting the message using the BLOB parser and then doing a REPLACE.
Or you can just do a replace as part of any SET statement that assigns a field to an output field. |
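The per-field variant can be sketched as follows. This is Python used purely to illustrate the logic; in a flow the same idea would be an ESQL REPLACE wrapped around each assignment. The helper names and the field widths are hypothetical, since the real TDS Fixed Length model is not shown in this thread.

```python
def strip_lf_tab(value):
    """Remove line feeds and tabs from a single field value."""
    return value.replace("\n", "").replace("\t", "")


def fixed_length_record(fields, widths):
    """Clean each field as it is assigned, then pad or truncate it to its width."""
    return "".join(
        strip_lf_tab(fields[name]).ljust(width)[:width]
        for name, width in widths.items()
    )


# Values as they appear in the logical tree for the example message.
message = {"foo": "foo", "bar": "bar", "baz": "\n\t"}
# Hypothetical widths for the fixed-length output record.
widths = {"foo": 5, "bar": 5, "baz": 5}

record = fixed_length_record(message, widths)
print(repr(record))
```

Because the cleaning happens at assignment time, the input tree is left as received and only the output record is constrained, which fits the point made earlier about keeping the two contracts separate.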