This is the relevant part of the code:
java.util.List<Element> elems = src.getAllElements();
Iterator it = elems.iterator();
Element el;
String key,value,date="",place="";
String [] data;
int k=0;
Segment content;
String contentstr;
String classname;
while(it.hasNext()){
    el = (Element)it.next();
    if(el.getName().equals("span"))
    {
        classname=el.getAttributeValue("class");
        if(classname.equals("edit_body"))
        {
            //java.util.List<Element> elemsinner = el.getChildElements();
            //Iterator itinner = elemsinner.iterator();
            content=el.getContent();
            contentstr=content.toString();
            if(true)
            {
                System.out.println("Done!");
                System.out.println(classname);
                System.out.println(contentstr);
            }
        }
    }
}
No output. But if I remove the if(classname.equals("edit_body")) condition it does print (in one of the iterations):
Done!
edit_body
"I honestly think it is better to be a failure at something you love than to be a success at something you hate."
I can't find the bug... help!
By the way, I am using an external Java library for HTML parsing.
Also, there are two errors at the start of the output, which appear in both cases, with or without the if condition:
Dec 20, 2012 11:53:11 AM net.htmlparser.jericho.LoggerProviderJava$JavaLogger error SEVERE: EndTag br at (r1992,c60,p94048) not recognised as type '/normal' because its name and closing delimiter are separated by characters other than white space
Dec 20, 2012 11:53:11 AM net.htmlparser.jericho.LoggerProviderJava$JavaLogger error SEVERE: Encountered possible EndTag at (r1992,c60,p94048) whose content does not match a registered EndTagType
I hope that isn't causing the problem.
OK guys, can somebody please explain this? "edit_body".equals(el.getAttributeValue("class")) worked!
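One plausible explanation, which the thread itself does not confirm: if getAttributeValue returns null for span elements that have no class attribute, then classname.equals(...) throws a NullPointerException on the first such element, while "edit_body".equals(...) simply returns false for null. A minimal sketch of that difference, where attributeOrNull is a hypothetical stand-in for the parser call:

```java
class NullSafeEquals {
    // Hypothetical stand-in for el.getAttributeValue("class"); it is ASSUMED
    // to return null when the element has no class attribute.
    static String attributeOrNull(boolean hasClass) {
        return hasClass ? "edit_body" : null;
    }

    // Literal first: equals(null) is defined to return false, so no exception.
    static boolean literalFirst(String classname) {
        return "edit_body".equals(classname);
    }

    // Variable first: calling a method on a null reference throws.
    static boolean variableFirst(String classname) {
        return classname.equals("edit_body");
    }

    public static void main(String[] args) {
        String missing = attributeOrNull(false);
        System.out.println(literalFirst(missing));
        try {
            variableFirst(missing);
        } catch (NullPointerException e) {
            System.out.println("variable-first comparison threw NullPointerException");
        }
        System.out.println(literalFirst(attributeOrNull(true)));
    }
}
```

If that is what happened here, the loop would stop at the first span without a class attribute, before it ever reached the edit_body span.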
I just had exactly the same problem.
I managed to solve it by using: SomeStringVar.replaceAll("\\P{Print}","");.
This removes every character that is not printable ASCII from the variable (including characters that you can't see, which make strings look equal even though they are not).
I applied it to each variable I needed in the comparison, and it worked for me as well.
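A small sketch of what that replaceAll call does, with made-up sample strings. Note that by default \p{Print} in java.util.regex covers printable ASCII only, so this also strips legitimate non-ASCII letters; keepPrintableAscii is just an illustrative helper name.

```java
class PrintableFilter {
    // Remove every character that is not printable ASCII (space through tilde).
    static String keepPrintableAscii(String s) {
        return s.replaceAll("\\P{Print}", "");
    }

    public static void main(String[] args) {
        String withBom = "\uFEFFadd";          // invisible byte order mark in front
        String withNul = "edit_body\u0000";    // embedded NUL character at the end
        String accented = "caf\u00e9";         // a legitimate non-ASCII letter

        System.out.println(keepPrintableAscii(withBom).equals("add"));
        System.out.println(keepPrintableAscii(withNul).equals("edit_body"));
        // The accented letter is removed as well, which may not be what you want.
        System.out.println(keepPrintableAscii(accented));
    }
}
```

So this is fine for ASCII-only keys such as class names, but it is too aggressive for general text.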
It looks like your classname has leading or trailing whitespace.
Try this:
if(classname.trim().equals("edit_body"))
This trims any whitespace at both ends.
Firstly, String.equals() is NOT broken. It works for millions of other programs / programmers. This is NOT the cause of your problems (unless you or someone has deliberately modified ... and broken your Java installation ...)
So why can two apparently equal strings compare as unequal?
There could be leading or trailing whitespace characters on the String.
There could be embedded non-printing characters.
There could be pairs of Unicode characters that look the same when you display them with a typical font, but in fact are not the same. For instance, the Greek block contains characters that look like Latin vowels ... but are in fact different code points, and hence are not equal.
Change the code to:
classname="edit_body"; //<- hardcode
if(classname.equals("edit_body"))
If the code enters the if statement now, then there must be some difference in the string content when you use the original classname=el.getAttributeValue("class");.
In that case, loop over the individual characters and compare them to find the difference.
If the code still doesn't enter the if statement, either your code is not compiling and you are running old code, or your Java installation is broken ;-)
Or:
if Java is anything like .NET (I don't know Java),
is el.getAttributeValue typed as String?
If it is typed as Object, then the if statement would not be entered, since those are two different instances of the same string.
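The character-by-character comparison suggested above could look roughly like this. It is only a sketch: the class and method names are made up, and the "actual" value is a hypothetical string with a trailing no-break space standing in for whatever the parser really returned.

```java
class StringDiff {
    // Render each UTF-16 code unit as U+XXXX so invisible characters show up.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Index of the first position where the strings differ, or -1 if they are equal.
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : n;
    }

    public static void main(String[] args) {
        String expected = "edit_body";
        String actual = "edit_body\u00a0"; // hypothetical: trailing no-break space
        System.out.println(dump(expected));
        System.out.println(dump(actual));
        System.out.println("first difference at index " + firstDifference(expected, actual));
    }
}
```

Printing the lengths alongside the dump is usually enough to spot stray whitespace, control characters, or look-alike letters.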
equals() is a method of String class. So, it works with double quotes.
if(someString.equals("something")) ✓
if(someString.equals('something')) ×
I'm reading a CSV file using Java. Inside the file, each row is in this format:
operation, start, end.
I need to perform a different operation for each kind of input, but something weird happens when I try to compare two strings.
I use equals to compare the strings. One of the operations is "add", but the first element I fetch from the file always gives me the wrong answer. I know it's an "add", and when I print it out it looks like "add", but operation.equals("add") is false. For all the other strings the result is correct; only the first one fails. Is there anything special about the first row in a CSV file?
Here is my code:
while ((line = br.readLine()) != null) {
    String[] data = line.split(",");
    String operation = data[0];
    int start = Integer.parseInt(data[1]);
    int end = Integer.parseInt(data[2]);
    System.out.println(operation + " " + start + " " + end);
    System.out.println(operation.equals("add"));
}
For example, it printed out
add 1 3
false
add 4 6
true
And I really don't know why. These two "add" strings look exactly the same.
And here is what my CSV file looks like:
(screenshot of the CSV file)
There are (at least) 4 reasons why two strings that "look" the same when you display / print them could turn out to be non-equal:
1. If you compare Strings using == rather than equals(Object), then you will often get the wrong answer. (This is not the problem here ... since you are using the equals method. However, this is a common problem.)
2. Unexpected leading or trailing whitespace characters on one string. These can be removed using trim().
3. Other leading, trailing or embedded control characters or Unicode "funky" characters. For example, stray Unicode BOM (byte order mark) characters.
4. Homoglyphs. There are a number of examples where two or more distinct Unicode code points are rendered on the screen using the same or virtually the same glyphs.
Cases 3 and 4 can only be reliably detected by using traceprints or a debugger to examine the lengths and the char values in the two strings.
(Screen shots of the CSV file won't help us to diagnose this! A cut-and-paste of the CSV file might help.)
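As an illustration of the BOM case, here is a sketch that assumes (the thread does not confirm it) that the first line of the file starts with U+FEFF. The sample line and the stripBom helper are hypothetical.

```java
class BomStrip {
    // Drop a single leading U+FEFF (byte order mark) if present.
    static String stripBom(String line) {
        if (!line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        // Hypothetical first line of the CSV file as returned by readLine()
        String firstLine = "\uFEFFadd,1,3";
        String operation = firstLine.split(",")[0];

        System.out.println(operation.length());                 // one more than the visible "add"
        System.out.println(operation.equals("add"));            // the invisible character breaks equality
        System.out.println(operation.trim().equals("add"));     // trim() only removes chars up to U+0020
        System.out.println(stripBom(operation).equals("add"));  // equal once the BOM is gone
    }
}
```

Printing the length of the string is the quickest check: a visible "add" with length 4 means there is a hidden character somewhere.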
You should remove the double quotes from the first element and then check with the equals method.
Try this:
operation = operation.substring(1, operation.length() - 1);
operation.equals("add")
Hope it works for you.
The line in your image looks fine, so I suspect the file is being read with the wrong encoding. For example, if the file is UTF with a byte order mark and you don't specify the encoding, the file has a special header at the beginning. That could be the reason why the first word is read incorrectly.
I ran into a wee problem with Java regex. (I must say in advance, I'm not very experienced in either Java or regex.)
I have a string, and a set of three characters. I want to find out if the string is built from only these characters. Additionally (just to make it even more complicated), two of the characters must be in the string, while the third one is optional.
I do have a solution, my question is rather if anyone can offer anything better/nicer/more elegant, because this makes me cry blood when I look at it...
The set-up
The mandatory characters are: | (pipe) and - (dash).
The string in question should be built from a combination of these. They can be in any order, but both have to be in it.
The optional character is: : (colon).
The string can contain colons, but it does not have to. This is the only other character allowed, apart from the above two.
Any other characters are forbidden.
Expected results
Following strings should work/not work:
"------" = false
"||||" = false
"---|---" = true
"|||-|||" = true
"--|-|--|---|||-" = true
...and...
"----:|--|:::|---::|" = true
":::------:::---:---" = false
"|||:|:::::|" = false
"--:::---|:|---G---n" = false
...etc.
The "ugly" solution
Now, I have a solution that seems to work, based on this stackoverflow answer. The reason I'd like a better one will become obvious when you've recovered from seeing this:
if (string.matches("^[(?\\:)?\\|\\-]*(([\\|\\-][(?:\\:)?])|([(?:\\:)?][\\|\\-]))[(?\\:)?\\|\\-]*$") || string.matches("^[(?\\|)?\\-]*(([\\-][(?:\\|)?])|([(?:\\|)?][\\-]))[(?\\|)?\\-]*$")) {
    //do funny stuff with a meaningless string
} else {
    //don't do funny stuff with a meaningless string
}
Breaking it down
The first regex
"^[(?\\:)?\\|\\-]*(([\\|\\-][(?:\\:)?])|([(?:\\:)?][\\|\\-]))[(?\\:)?\\|\\-]*$"
checks for all three characters
The next one
"^[(?\\|)?\\-]*(([\\-][(?:\\|)?])|([(?:\\|)?][\\-]))[(?\\|)?\\-]*$"
checks for the two mandatory ones only.
...Yeah, I know...
But believe me, I tried. Nothing else gave the desired result; every other attempt let through strings without the mandatory characters, etc.
The question is...
Does anyone know how to do it a simpler / more elegant way?
Bonus question: There is one thing I don't quite get in the regexes above (more than one, but this one bugs me the most):
As far as I understand(?) regular expressions, (?\\|)? should mean that the character | is either contained or not (unless I'm very much mistaken). Still, in the above setup it seems to enforce that character. This of course suits my purpose, but I cannot understand why it works that way.
So if anyone can explain what I'm missing there, that'd be really great. Besides, I suspect this holds the key to a simpler solution (checking for both mandatory and optional characters in one regex would be ideal).
Thank you all for reading (and suffering) through my question, and even bigger thanks to those who reply. :)
PS
I did try stuff like ^[\\|\\-(?:\\:)?)]$, but that would not enforce all mandatory characters.
Use a lookahead based regex.
^(?=.*\\|)(?=.*-)[-:|]+$
or
^(?=.*\\|)[-:|]*-[-:|]*$
or
^[-:|]*(?:-:*\\||\\|:*-)[-:|]*$
(?=.*\\|) expects at least one pipe.
(?=.*-) expects at least one hyphen.
[-:|]+ matches any character from the list, one or more times.
$ matches the end of the line.
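To check the lookahead version against the strings from the question, a small test harness like the following can be used. The class and method names are just for illustration; the pattern is the first one given above.

```java
import java.util.regex.Pattern;

class PipeDashCheck {
    // At least one pipe, at least one dash, and nothing but pipe, dash or colon.
    private static final Pattern VALID = Pattern.compile("^(?=.*\\|)(?=.*-)[-:|]+$");

    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "------", "||||", "---|---", "|||-|||", "--|-|--|---|||-",
            "----:|--|:::|---::|", ":::------:::---:---", "|||:|:::::|",
            "--:::---|:|---G---n"
        };
        for (String s : samples) {
            System.out.println("\"" + s + "\" = " + isValid(s));
        }
    }
}
```

Each lookahead only asserts that its character occurs somewhere, so adding another mandatory character later means adding one more lookahead rather than rewriting the whole expression.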
Here is a simple answer:
(?=.*\|.*-|.*-.*\|)^([-|:]+)$
This says that the string needs to have a '-' followed by '|', or a '|' followed by a '-', via the look-ahead. Then the string only matches the allowed characters.
Demo: http://fiddle.re/1hnu96
Here is one without lookahead or lookbehind.
^(?:[-:|]*\\|[-:|]*-[-:|]*|[-:|]*-[-:|]*\\|[-:|]*)$
This doesn't scale, so Avinash's solution is to be preferred if your regex engine supports lookahead.
I am unable to print/compare the letters æøå with the uppercase letters ÆØÅ. My code is running on Mac OS X 10.6.4 in Eclipse STS 2.5, and I have set Eclipse to use UTF-8 instead of MacRoman. It seems that none of equalsIgnoreCase, toUpperCase and toLowerCase works, and I cannot print the letters correctly to the console. Any idea what I am missing?
Example:
String ae1 = "æ";
String ae2 = "Æ";
System.out.println(ae1);
System.out.println(ae2.toLowerCase());
if(ae1.equalsIgnoreCase(ae2))
    System.out.println("match");
else
    System.out.println("no match");
Returns:
æ
ß
no match
Well, it's not at all clear which of the following situations you're in:
Your string literals are being compiled correctly, equalsIgnoreCase is failing, and the console is failing
Your string literals are being compiled incorrectly - and once you've got garbage data, nothing else is going to work
I strongly suggest you try using the \uxxxx format to make sure you get the right input data. You could analyze your current code by printing out the value of (int) ae1.charAt(0) and seeing which Unicode character that is.
Once you've separated things out to work out exactly which stage is failing, you can adjust the code appropriately - whether that's using a Collator or some other approach.
equals() is not meant for comparing natural languages. You should be using Collator: http://java-x.blogspot.com/2006/09/javatextcollator-for-string-comparison.html
Your output clearly shows that your source files are UTF-8, but the compiler is configured to read sources as Mac OS Roman.
Since you say you configured Eclipse to use UTF-8, perhaps your configuration is somehow wrong or incomplete.
To make sure that it's a problem with source encoding mismatch, you can replace these characters by their Unicode escapes. In this case equalsIgnoreCase() works as expected:
String ae1 = "\u00e6";
String ae2 = "\u00c6";
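A self-contained version of that check might look like this. It is only a sketch; the helper name is made up. Because the literals are written as Unicode escapes, the result does not depend on how the compiler decodes the source file.

```java
import java.util.Locale;

class CaseCompare {
    static boolean sameIgnoringCase(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    public static void main(String[] args) {
        String ae1 = "\u00e6"; // LATIN SMALL LETTER AE
        String ae2 = "\u00c6"; // LATIN CAPITAL LETTER AE

        // Numeric values show what the compiler actually put into the strings.
        System.out.println((int) ae1.charAt(0));
        System.out.println((int) ae2.charAt(0));

        System.out.println(ae2.toLowerCase(Locale.ROOT).equals(ae1));
        System.out.println(sameIgnoringCase(ae1, ae2) ? "match" : "no match");
    }
}
```

If the numeric values printed for the original, non-escaped literals differ from 230 and 198, the source file is being decoded with the wrong encoding.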
I guess my string literals are being compiled incorrectly because the compiler or Eclipse is not configured properly, but I have not figured out what the problem is. Using the \uxxxx format did, however, solve my issues, so I will leave it at that for now.
If I stumble upon a solution I will post it here.
Thanks for your answers!
Testing out someone else's code, I noticed a few JSP pages printing funky non-ASCII characters. Taking a dip into the source I found this tidbit:
// remove any periods from first name e.g. Mr. John --> Mr John
firstName = firstName.trim().replace('.','\0');
Does replacing a character in a String with a null character even work in Java? I know that '\0' will terminate a C-string. Would this be the culprit to the funky characters?
Does replacing a character in a String with a null character even work in Java? I know that '\0' will terminate a c-string.
That depends on how you define what is working. Does it replace all occurrences of the target character with '\0'? Absolutely!
String s = "food".replace('o', '\0');
System.out.println(s.indexOf('\0')); // "1"
System.out.println(s.indexOf('d')); // "3"
System.out.println(s.length()); // "4"
System.out.println(s.hashCode() == 'f'*31*31*31 + 'd'); // "true"
Everything seems to work fine to me! indexOf can find it, it counts as part of the length, and its value for hash code calculation is 0; everything is as specified by the JLS/API.
It DOESN'T work if you expect replacing a character with the null character would somehow remove that character from the string. Of course it doesn't work like that. A null character is still a character!
String s = Character.toString('\0');
System.out.println(s.length()); // "1"
assert s.charAt(0) == 0;
It also DOESN'T work if you expect the null character to terminate a string. It's evident from the snippets above, but it's also clearly specified in JLS (10.9. An Array of Characters is Not a String):
In the Java programming language, unlike C, an array of char is not a String, and neither a String nor an array of char is terminated by '\u0000' (the NUL character).
Would this be the culprit to the funky characters?
Now we're talking about an entirely different thing, i.e. how the string is rendered on screen. Truth is, even "Hello world!" will look funky if you use dingbats font. A unicode string may look funky in one locale but not the other. Even a properly rendered unicode string containing, say, Chinese characters, may still look funky to someone from, say, Greenland.
That said, the null character probably will look funky regardless; usually it's not a character that you want to display. Still, since the null character is not the string terminator, Java is more than capable of handling it one way or another.
Now to address what we assume is the intended effect, i.e. removing all periods from a string: the simplest solution is to use the replace(CharSequence, CharSequence) overload.
System.out.println("A.E.I.O.U".replace(".", "")); // AEIOU
The replaceAll solution is mentioned here too, but that works with regular expressions, which is why you need to escape the dot metacharacter, and it is likely to be slower.
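To make the difference between these variants concrete, here is a short comparison using the name from the question's comment. The class and method names are illustrative only.

```java
class PeriodRemoval {
    // Replaces the period with a NUL character; the string keeps its length.
    static String withNulChar(String s) {
        return s.replace('.', '\0');
    }

    // Literal replacement: removes every period.
    static String withReplace(String s) {
        return s.replace(".", "");
    }

    // Regex replacement with the dot escaped: also removes every period.
    static String withEscapedRegex(String s) {
        return s.replaceAll("\\.", "");
    }

    // Regex replacement with an unescaped dot: matches ANY character.
    static String withUnescapedRegex(String s) {
        return s.replaceAll(".", "");
    }

    public static void main(String[] args) {
        String name = "Mr. John";
        System.out.println(withNulChar(name).length() == name.length());
        System.out.println(withReplace(name));
        System.out.println(withEscapedRegex(name));
        System.out.println(withUnescapedRegex(name).isEmpty());
    }
}
```

The last call shows why forgetting the escape is dangerous: an unescaped dot wipes out the whole string.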
It should probably be changed to
firstName = firstName.trim().replaceAll("\\.", "");
I think that is the cause. To erase the character, you should use replace(".", "") instead.
Does replacing a character in a String with a null character even work in Java?
No.
Would this be the culprit to the funky characters?
Quite likely.
This does cause "funky characters":
System.out.println( "Mr. Foo".trim().replace('.','\0'));
produces:
Mr[] Foo
in my Eclipse console, where the [] is shown as a square box. As others have posted, use String.replace().
In my Java application I have been passed in a string that looks like this:
"\u00a5123"
When printing that string into the console, I get the same string as the output (as expected).
However, I want to print that out by having the unicode converted into the actual yen symbol (\u00a5 -> yen symbol) - how would I go about doing this?
i.e. so it looks like this: "[yen symbol]123"
I wrote a little program:
public static void main(String[] args) {
System.out.println("\u00a5123");
}
Its output:
¥123
i.e. it output exactly what you stated in your post. I suspect something else is going on. What version of Java are you using?
edit:
In response to your clarification, there are a couple of different techniques. The most straightforward is to look for a "\u" followed by 4 hex digits, extract that piece and replace it with the Unicode character for that hex code (using the Character class). This of course assumes the string will not have a \u in front of it.
I am not aware of any particular system to parse the String as though it was an encoded Java String.
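A rough sketch of that technique using java.util.regex. The class and method names are made up, and it deliberately handles only the simple case of a backslash, a u and exactly four hex digits; it does not treat a doubled backslash specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeUnescape {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Parse the four hex digits and turn them into a single char.
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against a decoded '$' or backslash.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Six literal characters (backslash, u, 0, 0, a, 5) followed by 123,
        // as if the text had been read from a file or passed in at runtime.
        String raw = "\\u00a5123";
        System.out.println(raw);
        System.out.println(unescape(raw));
    }
}
```

Whether the yen sign then shows up correctly still depends on the console's encoding.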
As has been mentioned before, these strings will have to be parsed to get the desired result.
Tokenize the string by using \u as separator. For example: \u63A5\u53D7 => { "63A5", "53D7" }
Process these strings as follows:
String hex = "63A5";
int intValue = Integer.parseInt(hex, 16);
System.out.println((char)intValue);
You're probably going to have to write a parser for these, unless you can find one in a third-party library. There is nothing in the JDK to parse these for you; I know because I fairly recently had an idea to use this kind of escape as a way to smuggle Unicode through a Latin-1-only database. (I ended up doing something else, by the way.)
I will tell you that java.util.Properties escapes and unescapes Unicode characters in this manner when reading and writing files (since the files have to be ASCII). The methods it uses for this are private, so you can't call them, but you could use the JDK source code to inspire your solution.
Could replace the above with this:
System.out.println((char)0x63A5);
Here is the code to print all of the box-drawing Unicode characters.
public static void printBox()
{
    for (int i=0x2500;i<=0x257F;i++)
    {
        System.out.printf("0x%x : %c\n",i,(char)i);
    }
}