How to parse Word-created special chars in Java

I am trying to parse some Word documents in Java. Some of the values are things like a date range, and instead of showing up as StartDate - EndDate I am getting some funky characters, like so:
StartDate ΓÇô EndDate
This is where Word puts in a special hyphen character. Can you search for these characters and replace them with a regular - or something in the string so that I can then tokenize on a "-"? And what is that character: ASCII, Unicode, or what?
Edited to add some code:
String projDateString = "08/2010 ΓÇô Present";
Charset charset = Charset.forName("Cp1252");
CharsetDecoder decoder = charset.newDecoder();
ByteBuffer buf = ByteBuffer.wrap(projDateString.getBytes("Cp1252"));
CharBuffer cbuf = decoder.decode(buf);
String s = cbuf.toString();
System.out.println("S: " + s);
System.out.println("projDatestring: " + projDateString);
Outputs the following:
S: 08/2010 ΓÇô Present
projDatestring: 08/2010 ΓÇô Present
Also, using the same projDateString above, if I do:
projDateString.replaceAll("\u0096", "\u2013");
projDateString.replaceAll("\u0097", "\u2014");
and then print out projDateString, it still prints as
projDatestring: 08/2010 ΓÇô Present

You are probably getting Windows-1252, which is a character set, not an encoding. (Torgamus - Googling for Windows-1232 didn't give me anything.)
Windows-1252, formerly "Cp1252", is almost Unicode, but keeps some characters that came from Cp1252 in their same places. The En Dash is character 150 (0x96), which falls within the Unicode C1 reserved control character range and shouldn't be there.
You can search for char 150 and replace it with \u2013 which is the proper Unicode code point for En Dash.
There are quite a few other characters that MS has in the 0x80 to 0x9f range, which is reserved in the Unicode standard, including the Em Dash, bullets, and their "smart" quotes.
Edit: By the way, Java uses Unicode code point values for characters internally. UTF-8 is an encoding, which Java uses as the default encoding when writing Strings to files or network connections.
Say you have
String stuff = MSWordUtil.getNextChunkOfText();
Where MSWordUtil would be something that you've written to somehow get pieces of an MS-Word .doc file. It might boil down to
File myDocFile = new File(pathAndFileFromUser);
InputStream input = new FileInputStream(myDocFile);
// and then start reading chunks of the file
By default, as you read byte buffers from the file and make Strings out of them, Java will treat it as UTF-8 encoded text. There are ways, as Lord Torgamus says, to tell what encoding should be used, but without doing that Windows-1252 is pretty close to UTF-8, except there are those pesky characters that are in the C1 control range.
After getting some String like stuff above, you won't find \u2013 or \u2014 in it, you'll find 0x96 and 0x97 instead.
At that point you should be able to do
stuff.replaceAll("\u0096", "\u2013");
I don't do that in my code where I've had to deal with this issue. I loop through an input CharSequence one char at a time, decide based on 0x80 <= charValue <= 0x9f if it has to be replaced, and look up in an array what to replace it with. The above replaceAll() is far easier if all you care about is the 1252 En Dash vs. the Unicode En Dash.
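A minimal sketch of that loop-and-lookup approach (the class name and the mapping table are hypothetical; only a handful of the 0x80-0x9f code points are filled in here, and a full table would come from the Windows-1252 mapping):

```java
public class Cp1252Fixer {
    // Partial map: index = (c - 0x80), value = the Unicode replacement.
    // A char value of 0 means "no mapping known, leave the char unchanged".
    private static final char[] MAP = new char[0x20];
    static {
        MAP[0x91 - 0x80] = '\u2018'; // left single quote
        MAP[0x92 - 0x80] = '\u2019'; // right single quote
        MAP[0x93 - 0x80] = '\u201C'; // left double quote
        MAP[0x94 - 0x80] = '\u201D'; // right double quote
        MAP[0x96 - 0x80] = '\u2013'; // en dash
        MAP[0x97 - 0x80] = '\u2014'; // em dash
    }

    public static String fix(CharSequence in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            // Replace only chars in the C1 range that we have a mapping for.
            if (c >= 0x80 && c <= 0x9f && MAP[c - 0x80] != 0) {
                out.append(MAP[c - 0x80]);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u0096 is the stray Cp1252 en dash; it becomes the real U+2013.
        System.out.println(fix("08/2010 \u0096 Present"));
    }
}
```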

s = s.replace( (char)145, '\'' );
s = s.replace( (char)8216, '\'' ); // left single quote
s = s.replace( (char)146, '\'' );
s = s.replace( (char)8217, '\'' ); // right single quote
s = s.replace( (char)147, '\"' );
s = s.replace( (char)148, '\"' );
s = s.replace( (char)8220, '\"' ); // left double quote
s = s.replace( (char)8221, '\"' ); // right double quote
s = s.replace( (char)8211, '-' ); // en dash
s = s.replace( (char)150, '-' );
http://www.coderanch.com/how-to/java/WeirdWordCharacters

Your problem almost certainly has to do with your encoding scheme not matching the encoding scheme Word saves in. Your code is probably using the Java default, likely UTF-8 if you haven't done anything to it. Your input, on the other hand, is likely Windows-1252, the default for Microsoft Word's .doc documents. See this site for more info. Notably,
Within Windows, ISO-8859-1 is replaced by Windows-1252, which often means that text copied from, say, a Microsoft Word document and pasted straight into a web page produces HTML validation errors.
So what does this mean for you? You'll have to tell your program that the input is using Windows-1252 encoding, and convert it to UTF-8. You can do this in varying flavors of "manually." Probably the most natural way is to take advantage of Java's built-in Charset class.
Windows-1252 is recognized by the IANA Charset Registry
Name: windows-1252
MIBenum: 2252
Source: Microsoft (http://www.iana.org/assignments/charset-reg/windows-1252) [Wendt]
Alias: None
so it should be Charset-compatible. I haven't done this before myself, so I can't give you a code sample, but I will point out that there is a String constructor that takes a byte[] and a Charset as arguments.
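A minimal sketch of that constructor-based approach (the byte array here stands in for bytes read from the Word document; 0x96 is the Windows-1252 en dash):

```java
import java.nio.charset.Charset;

public class DecodeDemo {
    public static void main(String[] args) {
        // Bytes as they might come out of a Windows-1252 encoded document dump.
        byte[] raw = { '0', '8', '/', '2', '0', '1', '0', ' ', (byte) 0x96, ' ' };
        // Decode with the right charset: 0x96 becomes U+2013 (en dash),
        // so the result can be tokenized on the dash reliably.
        String s = new String(raw, Charset.forName("windows-1252"));
        System.out.println(s.indexOf('\u2013')); // prints 8
    }
}
```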

Probably, that character is an en dash, and the strange blurb you see is due to a difference between the way Word encodes that character and the way that character is decoded by whatever (other) system you are using to display it.
If I remember correctly from when I did some work on character encodings in Java, String instances always use UTF-16 internally; so, within such an instance, you may search and replace a single character by its Unicode form. For example, let's say you would like to substitute smart quotes with plain double quotes: given a String s, you may write
s = s.replace('\u201c', '"');
s = s.replace('\u201d', '"');
where 201c and 201d are the Unicode code points for the opening and closing smart quotes. According to the link above on Wikipedia, the Unicode code point for the en dash is 2013.


Java byte stream non english characters

I read this code. As the xanadu.txt content I use "test". The file is 4 bytes in size. If I use the debugger to run out.write(c) one byte at a time, and after each write open the output file outagain.txt (with Notepad), I see successively: t --> te --> tes --> test. OK.
BUT if we change the content of the source file (xanadu.txt) to the Greek equivalent of test (τέστ), then the file is now 8 bytes in size (I think because with UTF we have 2 bytes per character). When I debug again, a meaningless hieroglyphic character appears each time out.write(c) runs. When the last (8th) byte is written, the original Greek word (τέστ) suddenly appears. Why? The same happens if we choose the console stream (in NetBeans) as the destination, but in that case the strange characters remain at the end when debugging, yet not when running normally(!!!).
As you observe, a single char (16 bits in Java internal representation) turns into a variable number of bytes in a byte-stream representation, in particular UTF-8.
(Some characters occupy two char values; I shall ignore those, but the answer still applies, only more so)
If you're outputting 'byte-wise' as in your experiment, in some cases you'll have output a fractional character. That is an illegal sequence that makes no sense; some software (such as Notepad) will nevertheless try to make sense of it. That may even include guessing at the encoding. For example, I don't know this to be the case, but if the file is not valid UTF-8 in its first several bytes -- and we know your half-a-character output is not valid UTF-8 -- then maybe Notepad guesses at an entirely different encoding, one that treats the byte sequence as a valid representation of entirely different characters.
tl;dr - garbage out, garbage displayed.
Modern computers have this gigantic table of characters, each identified by a single 32-bit number. All characters you can think of are in here; from the basic 'test' to 'τέστ' to the snowman ( ☃ ), to special non-visible ones that indicate a right-to-left spelled word is coming up, to a bunch of ligatures (such as ﬀ - a single character representing the ff ligature), to emoji, coloured and all: 😀.
This entire answer is essentially a sequence of these 32-bit numbers. But how would you like to store these in a file? That's where 'encoding' comes in. There are many, many encodings, and a crucial problem is that (almost) no encodings are 'detectable'.
It's like this:
If a complete stranger walks up to you and says "Hey!", what language are they speaking? Probably English. But maybe Dutch, which also has 'Hey!'. It could also be Japanese, and they're not even greeting you; they're saying 'Yes' (more or less). How would you know?
The answer is: either from external context (if you're in the middle of Newcastle, UK, it's probably English), or because they explicitly tell you; but the one is, well, external, and the other isn't common practice.
Text files are the same way.
They just contain the encoded text; they do not indicate what encoding it is. That means you need to tell the editor, or your newBufferedReader in Java, or your browser when saving that txt content, which encoding you want. However, because that's annoying to have to do every time, most systems have a default choice. Some text editors even try to figure out what encoding it is, but just as that person saying 'Hey!' to you might be speaking English or Japanese, with wildly different interpretations, the same happens with this semi-intelligent guessing at charset encoding.
This gets us to the following explanation:
You write τέστ in your editor and hit 'save'. What is your editor doing? Is it saving in UTF-16? UTF-8? UCS-4? ISO-8859-7? Completely different files are produced for each of these encodings! Given that it made 8 bytes, that means it's UTF-16 or UTF-8. Probably UTF-8.
You then copy these bytes over one by one, which is problematic: in UTF-8, a single byte can be half of a character. (You said: UTF-8 stores characters as 2 bytes; that's not quite true. UTF-8 stores characters such that every character is 1, 2, 3, or 4 bytes; it's variable length per character! Each character in τέστ does happen to be stored as 2 bytes, though.) That means if you've copied over, say, 3 bytes, your text editor's ability to guess what it might be is severely hampered: it might guess UTF-8 but then realize that it isn't valid UTF-8 at all (because of that half-of-a-character you ended up with), so it guesses wrong, and shows you gobbledygook.
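You can see that half-a-character effect directly in Java: decoding only the first byte of a two-byte UTF-8 sequence yields the Unicode replacement character U+FFFD.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HalfCharDemo {
    public static void main(String[] args) {
        byte[] full = "\u03C4".getBytes(StandardCharsets.UTF_8); // τ is 2 bytes in UTF-8
        System.out.println(full.length); // prints 2
        // Chop off the trailing byte, as the byte-by-byte copy does mid-write.
        byte[] half = Arrays.copyOf(full, 1);
        // The lone lead byte is an incomplete sequence; the decoder
        // substitutes the replacement character U+FFFD.
        String garbled = new String(half, StandardCharsets.UTF_8);
        System.out.println(garbled.charAt(0) == '\uFFFD'); // prints true
    }
}
```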
The lesson to learn here is twofold:
When you want to process characters, use char, Reader, Writer, String, and other character-oriented things.
When you want to process bytes, use byte, byte[], InputStream, OutputStream, and other byte-oriented things.
Never make the mistake of thinking these two are easily interchangeable, because they are not. Whenever you go from one 'world' to the other, you MUST specify the charset encoding, because if you don't, Java picks the 'platform default', which you don't want (because now you have software that depends on an external factor and which cannot be tested. Yikes).
Default to UTF-8 for everything you can.
tl;dr
Read: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Do not parse text files by octets (bytes). Use classes purpose-built for handling text. For example, use Files and its readAllLines method.
Details
Notice at the bottom of that tutorial page the caution that this is not the proper way to handle text files:
CopyBytes seems like a normal program, but it actually represents a kind of low-level I/O that you should avoid. Since xanadu.txt contains character data, the best approach is to use character streams, as discussed in the next section.
Text files may or may not use single octets to represent single characters, such as US-ASCII files. Your example code assumes one octet per character, which works for test as the content but not for τέστ as the content.
As a programmer, you must know from the publisher of your data file what encoding was used in writing the data representing the original text. Generally best to use UTF-8 encoding when writing text.
Write a text file with two lines:
test
τέστ
…and save using a text-editor with an encoding of UTF-8.
Read that file as a collection of String objects.
Path path = Paths.get( "/Users/basilbourque/some_text.txt" );
try
{
List < String > lines = Files.readAllLines( path , StandardCharsets.UTF_8 );
for ( String line : lines )
{
System.out.println( "line = " + line );
}
}
catch ( IOException e )
{
e.printStackTrace();
}
When run:
line = test
line = τέστ
UTF-16 versus UTF-8
You said:
I think because UTF we have 2 bytes per character)
No such thing as “UTF”.
UTF-16 encoding uses one or more pairs of octets per character.
UTF-8 encoding uses 1, 2, 3, or 4 octets per character.
Text content such as τέστ can be written to a file using either encoding, UTF-16 or UTF-8. Be aware that UTF-16 is "considered harmful", and UTF-8 is generally preferred nowadays. Note that UTF-8 is a superset of US-ASCII, so any US-ASCII file is also a UTF-8 file.
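Those variable lengths are easy to verify yourself; for instance (the snowman and emoji are just sample characters from the 3- and 4-octet ranges):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        System.out.println("t".getBytes(StandardCharsets.UTF_8).length);       // 1 octet
        System.out.println("\u03C4".getBytes(StandardCharsets.UTF_8).length);  // τ: 2 octets
        System.out.println("\u2603".getBytes(StandardCharsets.UTF_8).length);  // ☃: 3 octets
        System.out.println("\uD83D\uDE00".getBytes(StandardCharsets.UTF_8).length); // 😀: 4 octets
    }
}
```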
Characters as code points
If you want to examine each character in text, treat them as code point numbers.
Never use the char type in Java. That type is unable to represent even half of the characters defined in Unicode, and is now obsolete.
We can interrogate each character in our example file seen above by adding these two lines of code.
IntStream codePoints = line.codePoints();
codePoints.forEach( System.out :: println );
Like this:
Path path = Paths.get( "/Users/basilbourque/some_text.txt" );
try
{
List < String > lines = Files.readAllLines( path , StandardCharsets.UTF_8 );
for ( String line : lines )
{
System.out.println( "line = " + line );
IntStream codePoints = line.codePoints();
codePoints.forEach( System.out :: println );
}
}
catch ( IOException e )
{
e.printStackTrace();
}
When run:
line = test
116
101
115
116
line = τέστ
964
941
963
964
If you are not yet familiar with streams, convert IntStream to a collection, such as a List of Integer objects.
Path path = Paths.get( "/Users/basilbourque/some_text.txt" );
try
{
List < String > lines = Files.readAllLines( path , StandardCharsets.UTF_8 );
for ( String line : lines )
{
System.out.println( "line = " + line );
List < Integer > codePoints = line.codePoints().boxed().collect( Collectors.toList() );
for ( Integer codePoint : codePoints )
{
System.out.println( "codePoint = " + codePoint );
}
}
}
catch ( IOException e )
{
e.printStackTrace();
}
When run:
line = test
codePoint = 116
codePoint = 101
codePoint = 115
codePoint = 116
line = τέστ
codePoint = 964
codePoint = 941
codePoint = 963
codePoint = 964
Given a code point number, we can determine the intended character.
String s = Character.toString( 941 ) ; // έ character.
Be aware that some textual characters may be represented as multiple code points, such as a letter with a diacritical. (Text-handling is not a simple matter.)
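For example, é can be one code point (U+00E9) or two (e plus the combining acute accent U+0301); java.text.Normalizer converts between the two forms:

```java
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String composed = "\u00E9";    // é as a single precomposed code point
        String decomposed = "e\u0301"; // e followed by a combining acute accent
        // The two render identically but are different code point sequences.
        System.out.println(composed.equals(decomposed)); // prints false
        // NFC (canonical composition) folds the pair into the single code point.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals(composed)); // prints true
    }
}
```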

How do I convert a single character code to a `char` given a character set?

I want to convert decimal to ascii and this is the code returns the unexpected results. Here is the code I am using.
public static void main(String[] args) {
char ret= (char)146;
System.out.println(ret);// returns nothing.
I expect to get a single quote character "'" as per http://www.ascii-code.com/
Anyone came across this? Thanks.
So, a couple of things.
First of all the page you linked to says this about the code point range in question:
The extended ASCII codes (character code 128-255)
There are several different variations of the 8-bit ASCII table. The table below is according to ISO 8859-1, also called ISO Latin-1. Codes 128-159 contain the Microsoft® Windows Latin-1 extended characters.
This is incorrect, or at least, to me, misleadingly worded. ISO 8859-1 / Latin-1 does not define code point 146 (and another reference just because). So that's already asking for trouble. You can see this also if you do the conversion through String:
String s = new String(new byte[] {(byte)146}, "iso-8859-1");
System.out.println(s);
Outputs the same "unexpected" result. It appears that what they are actually referring to is the Windows-1252 set (aka "Windows Latin-1", but this name is almost completely obsolete these days), which does define that code point as a right single quote (for other charsets that provide this character at 146 see this list and look for encodings that provide it at 0x92), and we can verify this as such:
String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
So the first mistake is that page is confusing.
But the big mistake is you can't do what you're trying to do in the way you are doing it. A char in Java is a UTF-16 code point (or half of one, if you're representing the supplementary characters > 0xFFFF, a single char corresponds to a BMP point, a pair of them or an int corresponds to the full range, including the supplementary ones).
Unfortunately, Java doesn't really expose a lot of API for single-character conversions. Even Character doesn't have any readily available ways to convert from the charset of your choice to UTF-16.
So one option is to do it via String as hinted at in the examples above, e.g. express your code points as a raw byte[] array and convert from there:
String s = new String(new byte[] {(byte)146}, "windows-1252");
System.out.println(s);
char c = s.charAt(0);
System.out.println(c);
You could grab the char again via s.charAt(0). Note that you have to be mindful of your character set when doing this. Here we know that our byte sequence is valid for the specified encoding, and we know that the result is only one char long, so we can do this.
However, you have to watch out for things in the general case. For example, perhaps your byte sequence and character set yield a result that is in the UTF-16 supplementary character range. In that case s.charAt(0) would not be sufficient and s.codePointAt(0) stored in an int would be required instead.
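A concrete illustration of that supplementary-range pitfall, using an emoji (U+1F600) as the sample character:

```java
public class SupplementaryDemo {
    public static void main(String[] args) {
        // One code point, but stored as two chars (a surrogate pair).
        String s = "\uD83D\uDE00"; // 😀
        System.out.println(s.length());                 // prints 2: counts chars, not characters
        System.out.printf("0x%x%n", (int) s.charAt(0)); // 0xd83d: just the high surrogate
        System.out.printf("0x%x%n", s.codePointAt(0));  // 0x1f600: the real code point
    }
}
```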
As an alternative, with the same caveats, you could use Charset to decode, although it's just as clunky, e.g.:
Charset cs = Charset.forName("windows-1252");
CharBuffer cb = cs.decode(ByteBuffer.wrap(new byte[] {(byte)146}));
char c = cb.get(0);
System.out.println(c);
Note that I am not entirely sure how Charset#decode handles supplementary characters and can't really test right now (but anybody, feel free to chime in).
As an aside: In your case, 146 (0x92) cast directly to char corresponds to the UTF-16 character "PRIVATE USE TWO" (see also), and all bets are off for what you'll end up displaying there. This character is classified by Unicode as a control character, and seems to fall in the range of characters reserved for ANSI terminal control (although AFAIK isn't actually used, but it's in that range regardless). I wouldn't be surprised if perhaps browsers in some locales rendered it as a right-single-quote for compatibility, but terminals did something weird with it.
Also, fyi, the official UTF-16 code point for right single quote is 0x2019. You could reliably store that in a char by using that value, e.g.:
System.out.println((char)0x2019);
You can also see this for yourself by looking at the value after the conversion from windows-1252:
String s = new String(new byte[] {(byte)146}, "windows-1252");
char c = s.charAt(0);
System.out.printf("0x%x\n", (int)c); // outputs 0x2019
Or, for completeness:
String s = new String(new byte[] {(byte)146}, "windows-1252");
int cp = s.codePointAt(0);
System.out.printf("0x%x\n", cp); // outputs 0x2019
The page you refer to mentions that values 160 to 255 correspond to the ISO-8859-1 (aka Latin-1) table; as for values in the range 128 to 159, they are from the Windows-specific variant of Latin-1 (ISO-8859-1 leaves that range undefined, to be assigned by the operating system).
Java characters are based on UTF-16, which is itself based on the Unicode table. If you want to specifically refer to the right quote character, you can specify it as '\u2019' in Java (see http://www.fileformat.info/info/unicode/char/2019/index.htm).

Unable to convert Hyphen to UTF-8

I'm reading some text that I got from Wikipedia.
The text contains hyphen like in this String: "Australia for the [[2011–12 NBL season]]"
I'm trying to do is to convert the text to utf-8, using this code:
String myStr = "Australia for the [[2011–12 NBL season]]";
new String(myStr.getBytes(), "utf-8");
The result is:
Australia for the [[2011�12 NBL season]]
The problem is that the hyphen is not being mapped correctly.
The hyphen value in bytes is [-106] (I have no idea what to do with it...)
Do you know how to convert it to a hyphen that utf-8 encoding recognizes?
I would be happy to replace other special characters as well by some general code, but also specific "hyphens" replacement code will help.
The problem code point is U+2013 EN DASH which can be represented with the escape \u2013.
Try replacing the string with "2011\u201312". If this works then there is a mismatch between your editor character encoding and the one the compiler is using.
Otherwise, the problem is with the transcoding operation from string to whatever device you are writing to. Anywhere where you convert from bytes to chars or chars to bytes is a potential point of corruption when the wrong encoding is used; this can include System.out.
Note: Java strings are always UTF-16.
new String(myStr.getBytes(), "utf-8");
This code takes UTF-16, converts it to the platform encoding, which might be anything, then pretends it's UTF-8 and converts it back to UTF-16. At best, the platform encoding is UTF-8 and this is a no-op; otherwise it will just corrupt the data.
This is how you create UTF-8 in Java:
byte[] utf8 = str.getBytes(StandardCharsets.UTF_8); // Java 7
You can read more here.
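The decode direction takes the same explicit charset. A round trip, using the sample string from the question (with the en dash written as its escape):

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        String original = "Australia for the [[2011\u201312 NBL season]]";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8); // encode: chars -> bytes
        String back = new String(utf8, StandardCharsets.UTF_8);  // decode with the SAME charset
        System.out.println(back.equals(original)); // prints true: no corruption
    }
}
```

The corruption in the question comes from mixing charsets between the two steps; as long as encode and decode agree, every Unicode character survives.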
This is because the source code (editor) is maybe in Windows-1252 (extended Latin-1), and it is compiled with another encoding, UTF-8 (compiler). These two encodings must be the same; or use the Unicode escape for the en dash in the source, "\u2013", which works regardless of the source-file encoding.

Replacing Unicode character codes with characters in String in Java

I have a Java String like this: "peque\u00f1o". Note that it has an embedded Unicode character: '\u00f1'.
Is there a method in Java that will replace these Unicode character sequences with the actual characters? That is, a method that would return "pequeño" if you gave it "peque\u00f1o" as input?
Note that I have a string that has 12 chars (those that we see, that happen to be in the ASCII range).
Actually the string is "pequeño".
String s = "peque\u00f1o";
System.out.println(s.length());
System.out.println(s);
yields
7
pequeño
i.e. seven chars and the correct representation on System.out.
I remember giving the same response last week, use org.apache.commons.lang.StringEscapeUtils.
If you have the appropriate fonts, a println or setting the string in a JLabel or JTextArea should do the trick. The escaping is only for the compiler.
If you plan to copy-paste the readable strings in source, remember to also choose a suitable file encoding like UTF8.
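Note that an escape in a source literal is resolved by the compiler, as shown above; a hand-rolled unescape is only needed when the backslash-u sequence arrives as literal data (e.g. read from a file). A minimal sketch handling only the \uXXXX form (the class name is hypothetical; commons-lang StringEscapeUtils.unescapeJava is the full-featured equivalent, also covering \n, \t, octal escapes, etc.):

```java
public class UnicodeUnescape {
    // Replaces literal \uXXXX sequences with the characters they name.
    public static String unescape(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            // A \uXXXX escape occupies 6 chars: backslash, 'u', 4 hex digits.
            if (i + 5 < in.length() && in.charAt(i) == '\\' && in.charAt(i + 1) == 'u') {
                out.append((char) Integer.parseInt(in.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(in.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The input contains a literal backslash-u sequence, not a compiled escape.
        System.out.println(unescape("peque\\u00f1o")); // prints pequeño
    }
}
```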

Java Unicode Confusion

Hey all, I have only just started attempting to learn Java and have run into something that is really confusing!
I was typing out an example from the book I am using. It is to demonstrate the
char data type.
The code is as follows :
public class CharDemo
{
public static void main(String [] args)
{
char a = 'A';
char b = (char) (a + 1);
System.out.println(a + b);
System.out.println("a + b is " + a + b);
int x = 75;
char y = (char) x;
char half = '\u00AB';
System.out.println("y is " + y + " and half is " + half);
}
}
The bit that is confusing me is the statement, char half = '\u00AB'. The book states that \u00AB is the code for the symbol '1/2'. As described, when I compile and run the program from cmd the symbol that is produced on this line is in fact a '1/2'.
So everything appears to be working as it should. I decided to play around with the code and try some different unicodes. I googled multiple unicode tables and found none of them to be consistent with the above result.
In every one I found, it stated that the code \u00AB was not for '1/2' and was in fact for this:
http://www.fileformat.info/info/unic...r/ab/index.htm
So what character set is Java using? I thought Unicode was supposed to be just that: Uni, only one. I have searched for hours and nowhere can I find a character set that states \u00AB is equal to a 1/2, yet this is what my Java compiler interprets it as.
I must be missing something obvious here! Thanks for any help!
It's a well-known problem with console encoding mismatch on Windows platforms.
Java Runtime expects that encoding used by the system console is the same as the system default encoding. However, Windows uses two separate encodings: ANSI code page (system default encoding) and OEM code page (console encoding).
So, when you try to write Unicode character U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK to the console, Java runtime expects that console encoding is the ANSI encoding (that is Windows-1252 in your case), where this Unicode character is represented as 0xAB. However, the actual console encoding is the OEM encoding (CP437 in your case), where 0xAB means ½.
Therefore printing data to Windows console with System.out.println() produces wrong results.
To get correct results you can use System.console().writer().println() instead.
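The mismatch can be reproduced without a console at all, by encoding with one charset and decoding with the other (Cp437 is the OEM code page in question; charset availability is assumed):

```java
import java.nio.charset.Charset;

public class ConsoleMismatch {
    public static void main(String[] args) {
        // Java encodes « (U+00AB) with what it thinks the console uses: the
        // ANSI code page, Windows-1252, where it is the single byte 0xAB.
        byte[] sent = "\u00AB".getBytes(Charset.forName("windows-1252"));
        // The console actually decodes those bytes as the OEM code page, CP437,
        // where 0xAB means ½ (U+00BD).
        String shown = new String(sent, Charset.forName("Cp437"));
        System.out.println(shown.equals("\u00BD")); // prints true: ½, not «
    }
}
```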
The \u00ab character is not the 1/2 character; see this definitive code page from the Unicode.org website.
What you are seeing is (I think) a consequence of using the System.out PrintStream on a platform where the default character encoding is not UTF-8 or Latin-1. Maybe it is some Windows character set, as suggested by @axtavt's answer? (It also has a plausible explanation of why \u00ab is displayed as 1/2 ... and not some "splat" character.)
(In Unicode and Latin-1, \u00BD is the code point for the 1/2 character.)
0xAB is 1/2 in good old Codepage 437, which is what Windows terminals will use by default, no matter what codepage you actually set.
So, in fact, the char value represents the "«" character to a Java program, and if you render that char in a GUI or run it on a sane operating system, you will get that character. If you want to see proper output in Windows as well, switch your Font settings in CMD away from "Raster Fonts" (click top-left icon, Properties, Font tab). For example, with Lucida Console, I can do this:
C:\Users\Documents>java CharDemo
131
a + b is AB
y is K and half is ½
C:\Users\Documents>chcp 1252
Active code page: 1252
C:\Users\Documents>java CharDemo
131
a + b is AB
y is K and half is «
C:\Users\Documents>chcp 437
Active code page: 437
One thing great about Java is that it is unicode based. That means, you can use characters from writing systems that are not english alphabets (e.g. Chinese or math symbols), not just in data strings, but in function and variable names too.
Here's an example of code using Unicode characters in class names and variable names.
class 方 {
String 北 = "north";
double π = 3.14159;
}
class UnicodeTest {
public static void main(String[] arg) {
方 x1 = new 方();
System.out.println( x1.北 );
System.out.println( x1.π );
}
}
Java was created around the time when the Unicode standard had values defined for a much smaller set of characters. Back then it was felt that 16-bits would be more than enough to encode all the characters that would ever be needed. With that in mind Java was designed to use UTF-16. In fact, the char data type was originally used to be able to represent a 16-bit Unicode code point.
The UTF-8 charset is specified by RFC 2279;
The UTF-16 charsets are specified by RFC 2781
The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\uFEFF'. Byte-order marks are handled as follows:
When decoding, the UTF-16BE and UTF-16LE charsets ignore byte-order marks; when encoding, they do not write byte-order marks.
When decoding, the UTF-16 charset interprets a byte-order mark to indicate the byte order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.
Also see this
Well, when I use that code I get the « as I should, and 1/2 for \u00BD, as it should be.
http://www.unicode.org/charts/
