I'm having trouble in concatenating pieces of text mixing Western and Arabic chars.
I've a list of tokens like this:
-LRB-
دریای
مازندران
-RRB-
,
I use the following procedure to concatenate these list of tokens:
String str = "";
for (String tok : tokens) {
str += tok + " ";
}
This is the output of my procedure:
-LRB- دریای مازندران -RRB- ,
As can be seen, the position of the Arabic words is inverted.
How can I solve this (maybe suggesting to Java to ignore the information about text direction)?
EDIT
Actually, it seems that my problem was a false problem.
Now I've a new one. I need to wrap each word inside a string like this (word *) so that my output will be like this:
(word1 *)(word2 *)(word3 *)...
The procedure that I use is the following:
String str = "";
for (String tok : tokens) {
str += "(" + tok + "*)";
}
However, the result that I got is this:
(-LRB- *)(دریای *)(مازندران *)(-RRB- *)(, *)
instead of:
(-LRB- *)(دریای)(* مازندران *)(-RRB- *)(, *)
** EDIT2 **
Actually, I've discovered that my problem is not a problem. I wrote my string on a file and I opened it with nano (in the console). And it was correctly concatenated.
So the problem was due to the Eclipse console (and also gedit) which --let's say-- incorrectly rendered the string.
Anyway, thanks for your help!
The output is correct, and if you are presenting this text to an Arabic-speaking user you should not override the directionality of the text. Arabic is written from right to left. When you concatenate two Arabic strings together, the first will appear to the right of the second. This is controlled by the BiDi algorithm, the details of which are covered in http://www.unicode.org/reports/tr9/.
First, I would suggest using StringBuilder instead of raw String concatination. You will make your Garbage Collector a lot happier. Second, not seeing the input or how your StringTokenizer is setup, I would venture a guess that it seems like you are having problems tokenizing the string properly.
Related
My java program gets some weather information from an API. But it has weird letters in the text. Looks like ASCII code.
Here is an example:
Min temp: 0°C (32°F) which should be: Min temp: 0C (32F) (i think).
How can I change it?
Well one solution can be before posting you can do following
String withoutDegSymbol = str.replaceAll("°", "");
Where str contains you temperature data.
try this
String s = "0°C (32°F)".replaceAll("[\u0080-\u00FF]", "");
or if you have HTML character references in your text use
String s = "Min temp: 0°C (32°F)".replaceAll("&#x.+?;", "");
If using ASCII character encoding in your codes, when you saved your code, did your IDE asked you in what format you want to save it. Because in Eclipse IDE, if you are using an ASCII character, it prompts you to save your code in UTF-8 format. Hope this helps.
You need to know that :
° :
is Unicode Character for 'DEGREE SIGN'
Encoding :HTML Entity (hex)
so if this is the just problem you have (i mean this is the only character you use "Degree sign") , so you can convert it manually Like that :
String s = "Min temp: 0°C (32°F)".replaceAll("°", "°");
System.out.println(s);
if this is not the special character you have , so you may use : Class StringEscapeUtils
or jsoup library to convert it to java
I have a text file with 1000 lines in the following format:
19 x 75 Bullnose Architrave/Skirting £1.02
I am writing a method that reads the file line by line in - This works OK.
I then want to split each string using the "£" as a deliminater & write it out to
an ArrayList<String> in the following format:
19 x 75 Bullnose Architrave/Skirting, Metre, 1.02
This is how I have approached it (productList is the ArrayList, declared/instantiated outside the try block):
try{
br = new BufferedReader(new FileReader(aFile));
String inputLine = br.readLine();
String delim = "£";
while (inputLine != null){
String[]halved = inputLine.split(delim, 2);
String lineOut = halved[0] + ", Metre, " + halved[1];//Array out of bounds
productList.add(lineOut);
inputLine = br.readLine();
}
}
The String is not splitting and I keep getting an ArrayIndexOutOfBoundsException. I'm not very familiar with regex. I've also tried using the old StringTokenizer but get the same result.
Is there an issue with £ as a delim or is it something else? I did wonder if it is something to do with the second token not being read as a String?
Any ideas would be helpful.
Here are some of the possible causes:
The encoding of the file doesn't match the encoding that you are using to read it, and the "pound" character in the file is getting "mangled" into something else.
The file and your source code are using different pound-like characters. For instance, Unicode has two code points that look like a "pound sign" - the Pound Sterling character (00A3) and the Lira character (2084) ... then there is the Roman semuncia character (10192).
You are trying to compile a UTF-8 encoded source file without tell the compiler that it is UTF-8 encoded.
Judging from your comments, this is an encoding mismatch problem; i.e. the "default" encoding being used by Java doesn't match the actual encoding of the file. There are two ways to address this:
Change the encoding of the file to match Java's default encoding. You seem to have tried that and failed. (And it wouldn't be the way I'd do this ...)
Change the program to open the file with a specific (non default) encoding; e.g. change
new FileReader(aFile)
to
new FileReader(aFile, encoding)
where encoding is the name of the file's actual character encoding. The names of the encodings understood by Java are listed here, but my guess is that it is "ISO-8859-1" (aka Latin-1).
This is probably a case of encoding mismatch. To check for this,
Print delim.length and make sure it is 1.
Print inputLine.length and make sure it is the right value (42).
If one of them is not the expected value then you have to make sure you are using UTF-8 everywhere.
You say delim.length is 1, so this is good. On the other hand if inputLine.length is 34, this is very wrong. For "19 x 75 Bullnose Architrave/Skirting £1.02" you should get 42 if all was as expected. If your file was UTF-8 encoded but read as ISO-8859-1 or similar you would have gotten 43.
Now I am a little at a loss. To debug this you could print individually each character of the string and check what is wrong with them.
for (int i = 0; i < inputLine.length; i++)
System.err.println("debug: " + i + ": " + inputLine.charAt(i) + " (" + inputLine.codePointAt(i) + ")");
Many thanks for all your replies.
Specifying the encoding within the read & saving the original text file as UTF -8 has worked.
However, the experience has taught me that delimiting text using "£" or indeed other characters that may have multiple representations in different encodings is a poor strategy.
I have decided to take a different approach:
1) Find the last space in the input string & replace it with "xxx" or similar.
2) Split this using the delimiter "xxx." which should split the strings & rip out the "£".
3) Carry on..
I have a string
Mr praneel PIDIKITI
When I use this regular expression
String[] nameParts = name.split("\\s+");
instead of getting three parts I am only getting two, Mr and Praneel PIDIKITI.
I am unable to split the second string. Does anyone know what could be the problem?
I even used split(" ");.
The problem is I used replaceAll("\\<.*?>", " ").trim(); to convert html into this string and then I am using name.split("\\s+"); to get the name value.
I think it must be something other than space (some special character).
Your code should work. I suspect your input. There could be a non printable junk character between Praneel and PIDIKITI. For example,
String name = "Mr praneel" + (char)1 +"PIDIKITI";
String[] nameParts = name.split("\\s+");
for(String s : nameParts)
System.out.println(s);
Are you sure that there is no junk character between Praneel and PIDIKITI?
Remove non printable characters like this:
// remove non printable characters excluding white space characters
name = name.replaceAll("[^\\p{Print}\\s]","");
If you're parsing HTML, may I recommend JSoup? Its a good HTML parser for java
I have a .txt text file, containing some lines..
I load the contain using the RequestBuilder object, and split the responseText with
words = String.split("\n"); but i wonder, why the result is contains the "\n" part..
For example, my text:
abc
def
ghi
the result is,
words[0] = "abc\n"
words[1] = "def\n"
words[2] = "ghi\n"
Any help is highly appreciated. Thanks in advance.
Try using string.split("\\n+"). Or even better - split("[\\r\\n]+")
You may also want to consider String[] lines = text.split("\\\\n");
Windows carriage returns ("\r\n") shouldn't make a visible difference to your results, nor should you need to escape the regular expression you pass to String.split().
Here's proof of both the above using str.split("\n"): http://ideone.com/4PnZi
If you do have Windows carriage returns though, you should (although strictly not necessary) use str.split("\r\n"): http://ideone.com/XcF3C
If split uses regex you should use "\\n" instead of "\n"
Try using string.split("\\\\n") . It works for me.
Maybe it's trivial, but the .split method is sensitive to spaces and text breaks. If we don't know how the original text is written, we have to consider that this could make some differences (single line, breaks, multilines, etc).
Single line text:
const inlineText = "Hello world!";
console.log(inlineText.split(' ')) //['Hello', 'world!']
Multi lines text:
const multilinesText = `
Hello world!
`
console.log(multilinesText.split(' ')) // ['\nHello', 'world!', '\n']
This has been asked several times for several languages but I can't get it to work.
I have a string like this
String str = "This is a string.\nThis is a long string.";
And I'm trying to replace the \n with <br /> using
str = str.replaceAll("(\r\n|\n)", "<br />");
but the \n is not getting replaced.
I tried to use this RegEx Tool to verify and I see the same result. The input string does not have a match for "(\r\n|\n)". What am i doing wrong ?
It works for me.
public class Program
{
public static void main(String[] args) {
String str = "This is a string.\nThis is a long string.";
str = str.replaceAll("(\r\n|\n)", "<br />");
System.out.println(str);
}
}
Result:
This is a string.<br />This is a long string.
Your problem is somewhere else.
For me, this worked:
rawText.replaceAll("(\\\\r\\\\n|\\\\n)", "\\\n");
Tip: use regex tester for quick testing without compiling in your environment
A little more robust version of what you're attempting:
str = str.replaceAll("(\r\n|\n\r|\r|\n)", "<br />");
Since my account is new I can't up-vote Nino van Hooff's answer. If your strings are coming from a Windows based source such as an aspx based server, this solution does work:
rawText.replaceAll("(\\\\r\\\\n|\\\\n)", "<br />");
Seems to be a weird character set issue as the double back-slashes are being interpreted as single slash escape characters. Hence the need for the quadruple slashes above.
Again, under most circumstances "(\\r\\n|\\n)" should work, but if your strings are coming from a Windows based source try the above.
Just an FYI tried everything to correct the issue I was having replacing those line endings. Thought at first was failed conversion from Windows-1252 to UTF-8. But that didn't working either. This solution is what finally did the trick. :)
It works for me. The Java code works exactly as you wrote it. In the tester, the input string should be:
This is a string.
This is a long string.
...with a real linefeed. You can't use:
This is a string.\nThis is a long string.
...because it treats \n as the literal sequence backslash 'n'.
That should work, but don't kill yourself trying to figure it out. Just use 2 passes.
str = str.replaceAll("(\r\n)", "<br />");
str = str.replaceAll("(\n)", "<br />");
Disclaimer: this is not very efficient.
This should work. You need to put in two slashes
str = str.replaceAll("(\\r\\n|\\n)", "<br />");
In this Reference, there is an example which shows
private final String REGEX = "\\d"; // a single digit
I have used two slashes in many of my projects and it seems to work fine!