[Solved] Parsing Error for UTF-8 file containing Chinese chars

abhyyy (Voyager, joined 29 Sep 2011, posts: 83)
Posted: Sun Jan 15, 2012 2:01 am

Hi Friends,
I have an input file that another application drops into a folder, with the record format shown below (pipe delimited). The file contains some Chinese characters, which is why the other application writes it in UTF-8 and not in ANSI.

Problem: When I try to parse the file using a message set, with a FileInput node set up to parse the delimited file, I receive a parsing error.

If I remove the Chinese characters and replace them with English characters, it still does not work. But the same message set works with an ANSI file containing only English characters. I have already tried forcing CCSID 1208 in the message set and the file node, but it didn't work.
 
 
Sample file record:
0|6594993543|XMAS2011|XMASOFFER|123456789|OFFER_FOR_MALAYSIANS|X|特殊字符测试|20111225121212|P|

Please advise if I need to make any changes in the FileInput node and message set in order to correctly read and parse a UTF-8 file with Chinese characters.
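
As a quick check outside Message Broker, the sample record can be decoded and split with a few lines of Python. This is only a sketch: the variable names are illustrative, and it assumes the file really is UTF-8 with a single trailing pipe after the last field.

```python
# Sample record from the post, encoded the way the file is assumed to be stored.
raw = (
    "0|6594993543|XMAS2011|XMASOFFER|123456789|OFFER_FOR_MALAYSIANS|X|"
    "特殊字符测试|20111225121212|P|"
).encode("utf-8")

# Decoding raises UnicodeDecodeError if the bytes are not valid UTF-8.
record = raw.decode("utf-8")

# Drop the single trailing delimiter, then split on the pipe.
if record.endswith("|"):
    record = record[:-1]
fields = record.split("|")

for index, value in enumerate(fields):
    # Character count and UTF-8 byte count differ only for non-ASCII fields.
    print(index, repr(value), len(value), len(value.encode("utf-8")))
```

If this decodes cleanly and yields the expected fields, the file encoding itself is probably fine and the message set definition is the next place to look.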
 _________________
 ----------------------
 NeVeR StOp LeaRnInG.
 
 Last edited by abhyyy on Sun Jan 15, 2012 9:42 am; edited 1 time in total

smdavies99 (Jedi Council, joined 10 Feb 2003, posts: 6076, location: Somewhere over the Rainbow this side of Never-never land.)
Posted: Sun Jan 15, 2012 3:48 am

abhyyy wrote: "If I remove the Chinese characters and replace them with some English characters, even then it is not working."

Does that not tell you that it might not be the Chinese characters that are causing the failure?
 
 What is the exact error you are seeing? Take a user trace and post the relevant output. You might be surprised by the information it gives you.
 _________________
 WMQ User since 1999
 MQSI/WBI/WMB/'Thingy' User since 2002
 Linux user since 1995
 
 Every time you reinvent the wheel the more square it gets (anon). If in doubt think and investigate before you ask silly questions.

abhyyy (Voyager, joined 29 Sep 2011, posts: 83)
Posted: Sun Jan 15, 2012 9:39 am

Found the problem!!
The problem was in my first input field. If you check the sample message that I posted earlier, its first field is 0 (an integer).

A general principle of UTF-8 is that the first byte either is a single-byte character or indicates the length of a multi-byte code by the number of 1's before the first 0, and is then filled up with data bits.
So, I cannot keep my first field in TDS format as an integer (it has to be a character) when I am reading tag-delimited records from a UTF-8 encoded file using the FileInput node.
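
The lead-byte rule described above can be illustrated with a short Python sketch. The function name `utf8_sequence_length` is hypothetical and has nothing to do with Message Broker; it only inspects the high bits of the first byte and does not reject overlong or out-of-range encodings.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, given its first byte."""
    if (lead_byte & 0b1000_0000) == 0b0000_0000:
        return 1  # 0xxxxxxx: single-byte character, same as ASCII
    if (lead_byte & 0b1110_0000) == 0b1100_0000:
        return 2  # 110xxxxx: two 1's before the first 0
    if (lead_byte & 0b1111_0000) == 0b1110_0000:
        return 3  # 1110xxxx: three 1's before the first 0
    if (lead_byte & 0b1111_1000) == 0b1111_0000:
        return 4  # 11110xxx: four 1's before the first 0
    # 10xxxxxx is a continuation byte, not the start of a character.
    raise ValueError(f"0x{lead_byte:02X} is not a UTF-8 lead byte")


zero = "0".encode("utf-8")   # first field of the sample record
han = "特".encode("utf-8")    # first Chinese character of the sample record

print(utf8_sequence_length(zero[0]), len(zero))
print(utf8_sequence_length(han[0]), len(han))
```

Under this rule the character '0' occupies a single byte, exactly as it does in ASCII, while each Chinese character in the sample occupies several bytes.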
 
Thanks a lot, friends, for your precious time.

kimbert (Jedi Council, joined 29 Jul 2003, posts: 5543, location: Southampton)
Posted: Mon Jan 16, 2012 2:16 am

Glad you got it working... but your explanation is not correct.

This statement is correct:
Quote: "Since a general principle of UTF-8 is that the first byte either is a single-byte character or indicates length of multi-byte code by the number of 1's before the first 0 and is then filled up with data bits."

This statement is not correct:
Quote: "So, I cannot keep my first field in TDS format as integer (it has to be a character) when I am reading tag delimited records from UTF-8 encoded file using File input Node."

The '0' at the start of the line is a character. It does not matter whether that character is encoded as UTF-8 or ASCII. If you tell the TDS parser that the field is one character long, then it will consume as many bytes as it needs to (assuming that you have set the CCSID property correctly).

Perhaps you had incorrectly set the 'length units' field to 'bytes' for that first field? But when you changed the type to 'xs:string' you changed the 'length units' to 'characters', which made it work again?
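
The difference between counting a field length in bytes and in characters can be sketched in Python. This is not how the TDS parser is implemented; `take_field` and its `units` argument are hypothetical names used purely to illustrate the two interpretations of a length.

```python
def take_field(data: bytes, length: int, units: str, encoding: str = "utf-8") -> str:
    """Read a fixed-length field from the start of data, counting in the given units."""
    if units == "bytes":
        # Slice raw bytes first, then decode whatever was cut out.
        return data[:length].decode(encoding)
    if units == "characters":
        # Decode first, then count characters, consuming as many bytes as needed.
        return data.decode(encoding)[:length]
    raise ValueError(f"unknown length units: {units!r}")


first = "0".encode("utf-8")
chinese = "特殊字符测试".encode("utf-8")

# '0' is one byte and one character, so both interpretations agree.
print(take_field(first, 1, "bytes"), take_field(first, 1, "characters"))

# The Chinese field is 6 characters but 18 bytes, so the results differ.
print(take_field(chinese, 6, "characters"))
print(take_field(chinese, 6, "bytes"))
```

For a leading '0' the two settings give the same result, so a mismatch in length units would only show up on fields that contain multi-byte characters. A byte length that ends in the middle of a character would make the decode fail outright.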