I'm trying to do some parsing in Java. I'm using the Cobra HTML Parser to get the HTML into a DOM, then XPath to get the nodes I want. When I get down to the desired level I call node.getTextContent(), but this gives me a string like
"\n\n\nValue\n-\nValue\n\n\n"
Is there a built-in way to get rid of the line breaks? I would like to run a regex like
(?:\s*([^-]+)\s*-\s*([^-]+)\s*)
on the inner text and would really prefer not to have to deal with the possible different white space symbols in between the text.
Example Input:
Value
-
Value
Thanks
You can use String.replaceAll().
String trimmed = original_string.replaceAll("\n", "");
The first argument is a regular expression: you could replace all contiguous blocks of whitespace in the original string with replaceAll("\\s+", "") for instance.
I'm not totally sure I understood the question correctly, but the simplest way to remove all the whitespace would be:
String s = node.getTextContent().replaceAll("\\s","");
If you just want to get rid of the leading/trailing whitespace, use trim().
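To tie the suggestions above back to the question's regex, here is a minimal sketch that first collapses each run of whitespace into a single space and then splits the "Value - Value" pair. The class and method names (WhitespaceCleanup, normalize, splitPair) are made up for illustration, and the pattern is a slightly simplified variant of the one in the question, not the asker's exact expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Cobra or the DOM API.
class WhitespaceCleanup {
    // Two dash-free chunks separated by a dash, with optional whitespace around the dash.
    private static final Pattern PAIR = Pattern.compile("([^-]+?)\\s*-\\s*([^-]+)");

    // Collapse every run of whitespace (including \n) into one space, then trim the ends.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Returns {left, right}, or null when the text is not of the form "left - right".
    static String[] splitPair(String text) {
        Matcher m = PAIR.matcher(normalize(text));
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String raw = "\n\n\nValue\n-\nValue\n\n\n";
        System.out.println(normalize(raw));               // Value - Value
        String[] parts = splitPair(raw);
        System.out.println(parts[0] + " | " + parts[1]);  // Value | Value
    }
}
```

Replacing with a single space rather than an empty string keeps words inside a value from running together.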
I get a string from a server with known and unknown parts. For example:
<simp>example1</simp><op>example2</op><val>example2</val>
I do not want to parse the XML or use any kind of parser. What I want to do is replace
<op>example2</op>
with an empty string (""), so that the string looks like:
<simp>example1</simp><val>example2</val>
What I know is that it starts with op (in <>) and ends with /op (in <>), but the content (example2) may vary.
Can you give me a pointer on how to accomplish this?
You can use regex. Something like
<op>[A-Za-z0-9]*<\/op>
should match. But you can adapt it so that it fits your requirements better. For example if you know that only certain characters can be shown, you can change it.
Afterwards you can use the String#replaceAll method to remove all matching occurrences with the empty string.
Take a look here to test the regex: https://regex101.com/r/WhPIv4/3
and here to check the replaceAll method that takes the regex and the replacement as a parameter: https://developer.android.com/reference/java/lang/String#replaceall
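As a rough sketch of the approach described above, the pattern can be compiled once and reused. The class name OpTagRemover is made up for illustration, and the character class [A-Za-z0-9] is only the assumption from this answer about what the content may contain.

```java
import java.util.regex.Pattern;

// Hypothetical helper; assumes the <op> content is purely alphanumeric.
class OpTagRemover {
    // Compiled once so the regex is not re-parsed on every call.
    private static final Pattern OP = Pattern.compile("<op>[A-Za-z0-9]*</op>");

    // Removes every <op>...</op> block whose content matches the character class.
    static String removeOp(String input) {
        return OP.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String s = "<simp>example1</simp><op>example2</op><val>example2</val>";
        System.out.println(removeOp(s)); // <simp>example1</simp><val>example2</val>
    }
}
```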
You can try
str.replace(str.substring(str.indexOf("<op>"),str.indexOf("</op>")+5),"");
To remove all, use replaceAll()
str.replaceAll(str.substring(str.indexOf("<op>"),str.indexOf("</op>")+5),"");
I tried a sample:
String str="<simp>example1</simp><op>example2</op><val>example2</val><simp>example1</simp><op>example2</op><val>example2</val><simp>example1</simp><op>example2</op><val>example2</val>";
Log.d("testit", str.replaceAll(str.substring(str.indexOf("<op>"), str.indexOf("</op>") + 5), ""));
And the log output was
D/testit: <simp>example1</simp><val>example2</val><simp>example1</simp><val>example2</val><simp>example1</simp><val>example2</val>
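One caveat with the replaceAll variant: its first argument is treated as a regular expression, so content containing characters such as + or ( may not match itself. The sketch below uses String.replace instead, which takes the text literally and still replaces every occurrence. The class and method names are made up for illustration, and it assumes every <op> block has the same content as the first one.

```java
// Hypothetical helper; assumes all <op> blocks are identical to the first one found.
class LiteralOpRemoval {
    static String removeAllOp(String str) {
        int start = str.indexOf("<op>");
        int end = str.indexOf("</op>", start);
        if (start < 0 || end < 0) {
            return str; // nothing to remove
        }
        String block = str.substring(start, end + "</op>".length());
        // replace() treats 'block' as plain text and replaces all occurrences.
        return str.replace(block, "");
    }

    public static void main(String[] args) {
        String s = "<simp>a</simp><op>1+1</op><val>b</val><op>1+1</op>";
        System.out.println(removeAllOp(s)); // <simp>a</simp><val>b</val>
    }
}
```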
Edit
As @Elsafar said, str.replaceAll("<op>.*?</op>", "") will work.
Use like this:
String str = "<simp>example1</simp><op>example2</op><val>example2</val>";
String garbage = str.substring(str.indexOf("<op>"),str.indexOf("</op>")+5).trim();
String newString = str.replace(garbage,"");
I combined all the answers and eventually used:
st.replaceAll("<op>.*?<\\/op>","");
Thank you all for the help
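For anyone wondering why the ? after .* matters here, this small sketch compares the lazy and greedy forms on a string with two <op> blocks. The class and method names are made up for illustration.

```java
// Hypothetical demo class comparing the two quantifiers.
class LazyVsGreedy {
    // Lazy: each match stops at the nearest </op>.
    static String lazy(String s) {
        return s.replaceAll("<op>.*?</op>", "");
    }

    // Greedy: a single match runs from the first <op> to the last </op>.
    static String greedy(String s) {
        return s.replaceAll("<op>.*</op>", "");
    }

    public static void main(String[] args) {
        String s = "<op>a</op><val>b</val><op>c</op>";
        System.out.println(lazy(s));             // <val>b</val>
        System.out.println(greedy(s).isEmpty()); // true
    }
}
```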
Suppose I have a String containing static tags that looks like this:
mystring = "[tag]some text[/tag] untagged text [tag]some more text[/tag]"
I want to remove everything between each tag pair. I've figured out how to do so by using the following regex:
mystring = mystring.replaceAll("(?<=\\[tag])(.*?)(?=\\[/tag])", "");
The result of which will be:
mystring = "[tag][/tag] untagged text [tag][/tag]"
However, I'm unsure how to accomplish the same goal if the opening tag is dynamic. Example:
mystring = "[tag parameter="123"]some text[/tag] untagged text [tag parameter="456"]some more text[/tag]"
The "value" of the parameter portion of the tag is dynamic. Somehow, I have to introduce a wildcard to my current regex, but I am unsure how to do this.
Essentially, replace the contents of all pairings of "[tag*]" and "[/tag]" with empty string.
An obvious solution would be to do something like this:
mystring = mystring.replaceAll("(?<=\\[tag)(.*?)(?=\\[/tag])", "");
However, I feel like that would be hacking around the problem because I'm not really capturing a full tag.
Could anyone provide me with a solution to this problem? Thanks!
I guess I've got it.
I thought long and hard about what @AshishMathew said, and yes, lookbehinds can't have unbounded lengths, but maybe instead of replacing the match with nothing, we replace it with a ], like so:
mystring = mystring.replaceAll("(?<=\\[tag)(.*?)(?=\\[/tag])", "]");
(?<=\\[tag) is the look-behind which matches [tag
(.*?) is all the code between [tag and [/tag], which may even be the parameters of the tag, all of which is replaced by a ]
When I tried this code by replacing the match with "", I got [tag[/tag] untagged text [tag[/tag] as the output. Hence, by replacing the match with a ] instead of nothing, you get the (hopefully) desired output.
So this is my lazy solution (pardon the regex pun) to the problem.
I suggest matching the whole tag with content and replacing with the opening/closing tags without content :
mystring.replaceAll("\\[tag[^\\]]*\\][^\\[]*\\[/tag]", "[tag][/tag]")
Ideone test.
Note that I didn't bother preserving the tag attributes, since you mentioned in another answer's comments that you didn't need them, but they could be kept by using a capturing group.
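In case the attributes do need to survive, here is a minimal sketch of the capturing-group variant. The class and method names are made up for illustration, and it assumes, like the expression above, that the tag content contains no [ character.

```java
// Hypothetical helper; assumes tag content never contains '['.
class TagContentStripper {
    static String stripContent(String s) {
        // Group 1 captures the whole opening tag, parameters included;
        // $1 writes it back in front of the closing tag.
        return s.replaceAll("(\\[tag[^\\]]*\\])[^\\[]*\\[/tag\\]", "$1[/tag]");
    }

    public static void main(String[] args) {
        String s = "[tag parameter=\"123\"]some text[/tag] untagged text "
                 + "[tag parameter=\"456\"]some more text[/tag]";
        System.out.println(stripContent(s));
        // [tag parameter="123"][/tag] untagged text [tag parameter="456"][/tag]
    }
}
```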
Good morning. I realize there are a ton of questions out there regarding replace() and replaceAll(), but I haven't seen this one.
What I'm looking to do is parse a string (which contains valid HTML, to a point). After I see the second instance of <p> in the string, I want to remove everything that starts with & and ends with ; until I see the next </p>.
To do the second part I was hoping to use something along the lines of s.replaceAll("&*;","")
That doesn't work, but hopefully it gets my point across that I am looking to replace anything that starts with & and ends with ;
You should probably leave the parsing to a DOM parser (see this question). I can almost guarantee you'll have to do this to find text within the <p> tags.
For the replacement logic, String.replaceAll uses regular expressions, which can do the matching you want.
The "wildcard" in regular expressions that you want is the .* expression. Using your example:
String ampStr = "This &escape;String";
String removed = ampStr.replaceAll("&.*;", "");
System.out.println(removed);
This outputs This String. This is because the . represents any character, and the * means "this character 0 or more times." So .* basically means "any number of characters." However, feeding it:
"This &escape;String &anotherescape;Extended"
will probably not do what you want, and it will output This Extended. To fix this, you specify exactly what you want to look for instead of the . character. This is done using [^;], which means "any character that's not a semicolon":
String removed = ampStr.replaceAll("&[^;]*;", "");
This has performance benefits over &.*?; for non-matching strings, so I highly recommend using this version, especially since not all HTML files will contain a &abc; token and the &.*?; version can cause huge performance bottlenecks as a result.
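Putting the two expressions side by side on the longer input makes the difference in output visible. This is only a sketch; the class and method names are made up for illustration.

```java
// Hypothetical demo class for the two patterns discussed above.
class EntityRemoval {
    // Greedy dot: one match runs from the first & to the last ; in the string.
    static String greedy(String s) {
        return s.replaceAll("&.*;", "");
    }

    // Negated class: each match stops at the first ; after the &.
    static String negated(String s) {
        return s.replaceAll("&[^;]*;", "");
    }

    public static void main(String[] args) {
        String s = "This &escape;String &anotherescape;Extended";
        System.out.println(greedy(s));  // This Extended
        System.out.println(negated(s)); // This String Extended
    }
}
```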
The expression you want is:
s.replaceAll("&.*?;","");
But do you really want to be parsing HTML this way? You may be better off using an XML parser.
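If a full parser is not an option, the "only between the second <p> and the next </p>" part of the question could be sketched with plain index lookups. This assumes the paragraphs are written as a literal <p> with no attributes; the class and method names are made up for illustration.

```java
// Hypothetical helper; assumes literal "<p>" tags without attributes.
class SecondParagraphCleaner {
    static String clean(String html) {
        int first = html.indexOf("<p>");
        if (first < 0) return html;
        int second = html.indexOf("<p>", first + 3);
        if (second < 0) return html;
        int close = html.indexOf("</p>", second);
        if (close < 0) return html;

        // Strip &...; tokens only inside the second paragraph.
        String inner = html.substring(second + 3, close).replaceAll("&[^;]*;", "");
        return html.substring(0, second + 3) + inner + html.substring(close);
    }

    public static void main(String[] args) {
        String html = "<p>a&amp;b</p><p>c&nbsp;d&lt;e</p><p>f&gt;</p>";
        System.out.println(clean(html)); // <p>a&amp;b</p><p>cde</p><p>f&gt;</p>
    }
}
```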
I have a String like below:
<script language="JavaScript" type="text/javascript" src="http://dns.net/adj/myhost.com/index;size=5x10;zipc=12345;myzon=north_west;|en;tile=10;ord=7jkllk456?"></script>
I want to access whatever is between src=" and ">. I have written some code like the following:
int i=str.indexOf("src=\"");
str=str.substring(i+5);
i=str.indexOf("\">");
str=str.substring(0,i);
System.out.println(str);
Do you know if this is the right way? My only worry is that sometimes there could be a space between src and = or space between " and > and in this case my code will not work so I was thinking to use Regex. But I am not able to come up with any Regular expression. Do you have any suggestions?
This will work, but you should look into regular expressions; they provide a powerful way to spot patterns and extract text accordingly.
If you don't want to bother with regex, you can do this:
testString.split("src\\=")[1].split(">")[0];
Of course it still doesn't solve your other concerns with different formats, but you can still use an applicable regex (like RanRag's answer) with the String.split() instead of the 5 lines of code you were using.
You can also try this regex: src\s*[=]\s*"(.*)"\s*>
Let's break it down:
src matches the literal src in the string
\s* matches zero or more whitespace characters
[=] matches the equals sign
"(.*)" captures the text between the opening quote and the last " that is followed by optional whitespace and >
Perhaps this is overkill for your situation, but you might want to consider using an HTML parser. This would take care of all the document formatting issues and let you get at the tags and attributes in a standard way. While Regex may work for simple HTML, once things become more complicated you could run into trouble (false matches or missed matches).
Here is a list of available open source parsers for Java: http://java-source.net/open-source/html-parsers
If there can't be any escaped double quotes in the string you want, try this expression: src="([^"]*)". This will match src=" and then anything up to the first " that follows, capturing the text between the double quotes into group 1 (group 0 is always the entire matched string).
Since whitespace around = is allowed, you might extend the expression to src\s*=\s*"([^"]*)".
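A minimal sketch of how the extended expression might be used from Java follows. The class and method names are made up for illustration, and it assumes the value never contains a double quote.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes the src value contains no double quotes.
class SrcExtractor {
    private static final Pattern SRC = Pattern.compile("src\\s*=\\s*\"([^\"]*)\"");

    // Returns the value of the first src attribute, or null if there is none.
    static String extractSrc(String tag) {
        Matcher m = SRC.matcher(tag);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tag = "<script type=\"text/javascript\" src = \"http://dns.net/x;size=5x10\" ></script>";
        System.out.println(extractSrc(tag)); // http://dns.net/x;size=5x10
    }
}
```

Using find() rather than matches() lets the pattern locate the attribute anywhere inside the tag.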
Just a word of warning: HTML isn't a regular language and thus it can't be parsed using regular expressions. For simple cases like this it is OK, but don't fall into the trap of thinking you can parse more complex HTML structures.
I have a Java string which looks like this, it is actually an XML tag:
"article-idref="527710" group="no" height="267" href="pc011018.pct" id="pc011018" idref="169419" print-rights="yes" product="wborc" rights="licensed" type="photo" width="322" "
Now I want to remove the article-idref="527710" segment using a regular expression. I came up with the following one:
trimedString.replaceAll("\\article-idref=.*?\"","");
but it doesn't seem to work. Could anybody give me an idea of where I went wrong in my regular expression? I need this to be represented as a String in my Java class, so HTMLParser probably won't help me much here.
Thanks in advance!
Try this:
trimedString.replaceAll("article-idref=\"[^\"]*\" *","");
I corrected the regular expression by adding quotes and a word boundary (to prevent false matches). Also, in case you didn't, remember to reassign to your string after the replacement:
trimmedString = trimmedString.replaceAll("\\barticle-idref=\".*?\"", "");
See it working at ideone.
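Here is a small sketch of what the word boundary buys you. It combines \b with the [^"]* form and a trailing \s* so the leftover space goes too; the attribute name subarticle-idref and the class name are invented purely to show a near miss.

```java
// Hypothetical helper; "subarticle-idref" is an invented attribute used only as a counterexample.
class AttributeRemover {
    static String removeArticleIdref(String attrs) {
        // \b requires that "article-idref" does not continue a longer word.
        return attrs.replaceAll("\\barticle-idref=\"[^\"]*\"\\s*", "");
    }

    public static void main(String[] args) {
        String attrs = "subarticle-idref=\"9\" article-idref=\"527710\" group=\"no\"";
        System.out.println(removeArticleIdref(attrs)); // subarticle-idref="9" group="no"
    }
}
```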
Also since this is from an XML document it might be better to use an XML parser to extract the correct attributes instead of a regular expression. This is because XML is quite a complex data format to parse correctly. The example in your question is simple enough. However a regular expression could break on a more complex case, such as a document that includes XML comments. This could be an issue if you are reading data from an untrusted source.
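For completeness, a rough sketch of the parser route using the JDK's built-in DOM API. It assumes the attribute list is well-formed XML and wraps it in a made-up element called wrapper so it can be parsed; the class and method names are also invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper; "wrapper" is an invented element name used only to make the input parseable.
class AttributeRemovalWithDom {
    static Element parseAndStrip(String attributes, String unwanted) {
        String xml = "<wrapper " + attributes + "/>";
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            root.removeAttribute(unwanted); // no-op if the attribute is absent
            return root;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse attribute list", e);
        }
    }

    public static void main(String[] args) {
        Element tag = parseAndStrip(
                "article-idref=\"527710\" group=\"no\" height=\"267\"", "article-idref");
        System.out.println(tag.hasAttribute("article-idref")); // false
        System.out.println(tag.getAttribute("height"));        // 267
    }
}
```

Turning the element back into a String would need a serializer on top of this, so the regex remains the shorter option when the result must stay a plain String.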
If you are sure the article-idref is always at the beginning, try this:
// removes everything from the beginning up to and including the first run of whitespace
trimedString = trimedString.replaceFirst("^\\S+\\s+","");
Be sure to assign the result to trimedString again, since replaceFirst does not modify the string itself but returns a new string.