why is String.split("£", 2) not working? - java

I have a text file with 1000 lines in the following format:
19 x 75 Bullnose Architrave/Skirting £1.02
I am writing a method that reads the file line by line in - This works OK.
I then want to split each string using the "£" as a deliminater & write it out to
an ArrayList<String> in the following format:
19 x 75 Bullnose Architrave/Skirting, Metre, 1.02
This is how I have approached it (productList is the ArrayList, declared/instantiated outside the try block):
try{
br = new BufferedReader(new FileReader(aFile));
String inputLine = br.readLine();
String delim = "£";
while (inputLine != null){
String[]halved = inputLine.split(delim, 2);
String lineOut = halved[0] + ", Metre, " + halved[1];//Array out of bounds
productList.add(lineOut);
inputLine = br.readLine();
}
}
The String is not splitting and I keep getting an ArrayIndexOutOfBoundsException. I'm not very familiar with regex. I've also tried using the old StringTokenizer but get the same result.
Is there an issue with £ as a delim or is it something else? I did wonder if it is something to do with the second token not being read as a String?
Any ideas would be helpful.

Here are some of the possible causes:
The encoding of the file doesn't match the encoding that you are using to read it, and the "pound" character in the file is getting "mangled" into something else.
The file and your source code are using different pound-like characters. For instance, Unicode has two code points that look like a "pound sign" - the Pound Sterling character (00A3) and the Lira character (2084) ... then there is the Roman semuncia character (10192).
You are trying to compile a UTF-8 encoded source file without tell the compiler that it is UTF-8 encoded.
Judging from your comments, this is an encoding mismatch problem; i.e. the "default" encoding being used by Java doesn't match the actual encoding of the file. There are two ways to address this:
Change the encoding of the file to match Java's default encoding. You seem to have tried that and failed. (And it wouldn't be the way I'd do this ...)
Change the program to open the file with a specific (non default) encoding; e.g. change
new FileReader(aFile)
to
new FileReader(aFile, encoding)
where encoding is the name of the file's actual character encoding. The names of the encodings understood by Java are listed here, but my guess is that it is "ISO-8859-1" (aka Latin-1).

This is probably a case of encoding mismatch. To check for this,
Print delim.length and make sure it is 1.
Print inputLine.length and make sure it is the right value (42).
If one of them is not the expected value then you have to make sure you are using UTF-8 everywhere.
You say delim.length is 1, so this is good. On the other hand if inputLine.length is 34, this is very wrong. For "19 x 75 Bullnose Architrave/Skirting £1.02" you should get 42 if all was as expected. If your file was UTF-8 encoded but read as ISO-8859-1 or similar you would have gotten 43.
Now I am a little at a loss. To debug this you could print individually each character of the string and check what is wrong with them.
for (int i = 0; i < inputLine.length; i++)
System.err.println("debug: " + i + ": " + inputLine.charAt(i) + " (" + inputLine.codePointAt(i) + ")");

Many thanks for all your replies.
Specifying the encoding within the read & saving the original text file as UTF -8 has worked.
However, the experience has taught me that delimiting text using "£" or indeed other characters that may have multiple representations in different encodings is a poor strategy.
I have decided to take a different approach:
1) Find the last space in the input string & replace it with "xxx" or similar.
2) Split this using the delimiter "xxx." which should split the strings & rip out the "£".
3) Carry on..

Related

How to print the symbols and < and > to a file as space < and > respectively

How do you print symbols in Java to a file when you have only the symbol description?
I received a string from DB2 which contains symbols.
Two samples:
1) <0800>
2) 51V 3801Z
Such a string goes to two different places. One is a JSP rendering it as HTML. That is perfect; I get <0800> and 51V 3801Z, respectively. The other place is a CSV file created with java.io.FileWriter, and it does not convert to "<", ">", and " ". Instead, it is printed exactly as it came from DB2:
<0800>
and 51V 3801Z.
Is there anything the "new" nio library could help me? I have tried apache.commons.lang3.StringScapeUtils.escapeHTML4 without success.
I suggest looking into Apache's StringEscapeUtils, namely the unescapeHtml4() method.
Example:
String input = "<0800>";
String output = StringEscapeUtils.unescapeHtml4(input);
Ensure you are using the unescapeHtml4 method, and not the regular escapeHtml4 method!

Appending extended ascii in strings

I'm trying to use extended ascii character 179(looks like pipe).
Here is how I use it.
String cmd = "";
char pipe = (char) 179;
// cmd ="02|CO|0|101|03|0F""
cmd ="02"+pipe+"CO"+pipe+"0"+pipe+"101"+pipe+"03"+pipe+"0F";
System.out.println("cmd "+cmd);
Output
cmd 02³CO³0³101³03³0F
But the output is like this . I have read that extended ascii characters are not displayed correctly.
Is my code correct and just the ascii is not correctly displayed
or my code is wrong.
I'm not concerned about showing this string to user I need to send it to server.
EDIT
The vendor's api document states that we need to use ascii 179 (looks like pipe) . The server side code needs 179(part of extended ascii) as pipe/vertical line so I cannot use 124(pipe)
EDIT 2
Here is the table for extended ascii
On the other hand this table shows that ascii 179 is "3" . Why
are there different interpretation of the same and which one should I
consider??
EDIT 3
My default charset value is (is this related to my problem?)
System.out.println("Default Charset=" + Charset.defaultCharset());
Default Charset=windows-1252
Thanks!
I have referred to
How to convert a char to a String?
How to print the extended ASCII code in java from integer value
Thanks
Use the below code.
String cmd = "";
char pipe = '\u2502';
cmd ="02"+pipe+"CO"+pipe+"0"+pipe+"101"+pipe+"03"+pipe+"0F";
System.out.println("cmd "+cmd);
System.out.println("int value: " + (int)pipe);
Output:
cmd 02│CO│0│101│03│0F
int value: 9474
I am using IntelliJ. This is the output I am getting.
Your code is correct; concatenating String values and char values does what one expects. It's the value of 179 that is wrong. You can google "unicode 179", and you'll find "Unicode Character 'SUPERSCRIPT THREE' (U+00B3)", as one might expect. And, you could simply say "char pipe = '|';" instead of using an integer. Or even better: String pipe = "|"; which also allows you the flexibility to use more than one character :)
In response to the new edits...
May I suggest that you fix this rather low-level problem not at the Java String level, but instead replace the byte encoding this character before sending the bytes to the server?
E.g. something like this (untested)
byte[] bytes = cmd.getBytes(); // all ascii, so this should be safe.
for (int i = 0; i < bytes.length; i++) {
if (bytes[i] == '|') {
bytes[i] = (byte)179;
}
}
// send command bytes to server
// don't forget endline bytes/chars or whatever the protocol might require. good luck :)

How to read special characters from file system in libgdx

String jsonData = Gdx.files.internal("data/" + spreadsheet + ".json").readString();
When try to print this String , characters such as ö,ä,ü , show up as √º and other characters not similar to original for e.g. ü shows up as √º.
How can I remedy this?
I want to serialise this later into a class instance.
Should I use some other method instead of readString?
Something else I tried - I passed the filehandle itself to the Json object to serialise, but still characters show up as some other characters
I specified the charset and the problem is solved now
....readString("UTF-8");

How to remove Ascii code from the JTextArea?

My java program gets some weather information from an API. But it has weird letters in the text. Looks like ASCII code.
Here is an example:
Min temp: 0°C (32°F) which should be: Min temp: 0C (32F) (i think).
How can I change it?
Well one solution can be before posting you can do following
String withoutDegSymbol = str.replaceAll("°", "");
Where str contains you temperature data.
try this
String s = "0°C (32°F)".replaceAll("[\u0080-\u00FF]", "");
or if you have HTML character references in your text use
String s = "Min temp: 0°C (32°F)".replaceAll("&#x.+?;", "");
If using ASCII character encoding in your codes, when you saved your code, did your IDE asked you in what format you want to save it. Because in Eclipse IDE, if you are using an ASCII character, it prompts you to save your code in UTF-8 format. Hope this helps.
You need to know that :
° :
is Unicode Character for 'DEGREE SIGN'
Encoding :HTML Entity (hex)
so if this is the just problem you have (i mean this is the only character you use "Degree sign") , so you can convert it manually Like that :
String s = "Min temp: 0°C (32°F)".replaceAll("°", "°");
System.out.println(s);
if this is not the special character you have , so you may use : Class StringEscapeUtils
or jsoup library to convert it to java

Java Western + Arabic String concatenation issues

I'm having trouble in concatenating pieces of text mixing Western and Arabic chars.
I've a list of tokens like this:
-LRB-
دریای
مازندران
-RRB-
,
I use the following procedure to concatenate these list of tokens:
String str = "";
for (String tok : tokens) {
str += tok + " ";
}
This is the output of my procedure:
-LRB- دریای مازندران -RRB- ,
As can be seen, the position of the Arabic words is inverted.
How can I solve this (maybe suggesting to Java to ignore the information about text direction)?
EDIT
Actually, it seems that my problem was a false problem.
Now I've a new one. I need to wrap each word inside a string like this (word *) so that my output will be like this:
(word1 *)(word2 *)(word3 *)...
The procedure that I use is the following:
String str = "";
for (String tok : tokens) {
str += "(" + tok + "*)";
}
However, the result that I got is this:
(-LRB- *)(دریای *)(مازندران *)(-RRB- *)(, *)
instead of:
(-LRB- *)(دریای)(* مازندران *)(-RRB- *)(, *)
** EDIT2 **
Actually, I've discovered that my problem is not a problem. I wrote my string on a file and I opened it with nano (in the console). And it was correctly concatenated.
So the problem was due to the Eclipse console (and also gedit) which --let's say-- incorrectly rendered the string.
Anyway, thanks for your help!
The output is correct, and if you are presenting this text to an Arabic-speaking user you should not override the directionality of the text. Arabic is written from right to left. When you concatenate two Arabic strings together, the first will appear to the right of the second. This is controlled by the BiDi algorithm, the details of which are covered in http://www.unicode.org/reports/tr9/.
First, I would suggest using StringBuilder instead of raw String concatination. You will make your Garbage Collector a lot happier. Second, not seeing the input or how your StringTokenizer is setup, I would venture a guess that it seems like you are having problems tokenizing the string properly.

Categories