Need a little help on this regular expression - java

I have a Java string that looks like this; it is actually an XML tag:
"article-idref="527710" group="no" height="267" href="pc011018.pct" id="pc011018" idref="169419" print-rights="yes" product="wborc" rights="licensed" type="photo" width="322" "
Now I want to remove the article-idref="527710" segment using a regular expression. I came up with the following:
trimedString.replaceAll("\\article-idref=.*?\"","");
but it doesn't seem to work. Could anybody tell me where I went wrong in my regular expression? I need the result as a String in my Java class, so HTMLParser probably won't help me much here.
Thanks in advance!

Try this:
trimedString.replaceAll("article-idref=\"[^\"]*\" *","");

I corrected the regular expression by adding quotes and a word boundary (to prevent false matches). Also, in case you didn't, remember to reassign to your string after the replacement:
trimmedString = trimmedString.replaceAll("\\barticle-idref=\".*?\"", "");
See it working at ideone.
Also since this is from an XML document it might be better to use an XML parser to extract the correct attributes instead of a regular expression. This is because XML is quite a complex data format to parse correctly. The example in your question is simple enough. However a regular expression could break on a more complex case, such as a document that includes XML comments. This could be an issue if you are reading data from an untrusted source.
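To make the parser suggestion concrete, here is a minimal sketch using the JDK's built-in DOM parser. It assumes the attribute list sits on a complete, well-formed element; the <media> element name and the AttributeStripper class are hypothetical, invented for the example. It returns the remaining attributes as a map, and turning them back into a String is left out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class AttributeStripper {

    /** Parses a single XML element and returns its attributes, minus the named one. */
    static Map<String, String> attributesWithout(String elementXml, String attributeName) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(elementXml)))
                    .getDocumentElement();
            // Let the DOM do the removal instead of pattern matching on text.
            root.removeAttribute(attributeName);

            // Copy what is left into a sorted map so the result is deterministic.
            Map<String, String> remaining = new TreeMap<>();
            NamedNodeMap attributes = root.getAttributes();
            for (int i = 0; i < attributes.getLength(); i++) {
                Node attribute = attributes.item(i);
                remaining.put(attribute.getNodeName(), attribute.getNodeValue());
            }
            return remaining;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not a well-formed XML element: " + elementXml, e);
        }
    }

    public static void main(String[] args) {
        // <media .../> is a hypothetical wrapper element, not taken from the question.
        String xml = "<media article-idref=\"527710\" group=\"no\" idref=\"169419\"/>";
        System.out.println(attributesWithout(xml, "article-idref"));
    }
}
```

For input from an untrusted source you would probably also want to harden the DocumentBuilderFactory (for example by disabling DOCTYPE declarations) before parsing.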

If you are sure that article-idref is always at the beginning, try this:
// removes everything from the beginning through the first run of whitespace
trimedString = trimedString.replaceFirst("^\\S+\\s*","");
Be sure to assign the result to trimedString again, since replaceFirst does not modify the string itself but returns a new string.
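A small sketch of that idea, assuming the first attribute value contains no whitespace (true for the string in the question). The class and method names are made up for the example. It also shows why the reassignment matters: the original string is left untouched.

```java
final class LeadingTokenRemover {

    /** Drops the first whitespace-delimited token and the whitespace after it. */
    static String dropFirstToken(String s) {
        // ^\S+ : the leading run of non-whitespace characters
        // \s*  : any whitespace that follows it
        return s.replaceFirst("^\\S+\\s*", "");
    }

    public static void main(String[] args) {
        String original = "article-idref=\"527710\" group=\"no\" height=\"267\"";
        String shortened = dropFirstToken(original);
        System.out.println(original);   // unchanged, because Java strings are immutable
        System.out.println(shortened);  // the rest of the attribute list
    }
}
```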

Related

Different result between Javascript and Java regular expression matches

Now I am trying to match some patterns from a String containing elasticsearch's structured bulk requests. Here is an example:
index {[event_20191209][event][null], source[{"haha":"haha","jaja":"jaja"}]}, update {[event_20191209][event][xxx], doc_as_upsert[false], doc[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]}, delete {[event_20191208][_doc][sjdos]}, update {[event_20191209][event][yyy], doc_as_upsert[false], upsert[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]}
My goal is to match every separate request out of the bulk requests string, i.e. to get strings like:
index {[event_20191209][event][null], source[{"haha":"haha","jaja":"jaja"}]},
update {[event_20191209][event][xxx], doc_as_upsert[false], doc[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]},
delete {[event_20191208][_doc][sjdos]},
update {[event_20191209][event][yyy], doc_as_upsert[false], upsert[index {[null][_doc][null], source[{"haha":"haha","jaja":"jaja"}]}], scripted_upsert[false], detect_noop[true]}
My pattern expression is [a-z]+\s\{.+?\}[,\w\t\r\n]+? which works fine in a JavaScript-based online regular expression tester.
However, when I copied this pattern to my Java code, the output was not what I expected.
So I realized there are some differences between the JavaScript and Java regular expression engines, but I cannot figure out how to update my expression so that it works in Java, even after a lot of coding and googling.
I would be grateful if someone could give me a hint.
After a short nap, I had an epiphany. I was a fool in the morning...
The workaround is easy to implement: Elasticsearch already overrides toString() well for us.
At first glance, I wouldn't suggest using regex right away. It looks like those lines follow some kind of pattern that you could parse and split up first.
After that, if you're talking about regex, I'd try:
Taking a look at the java regex format: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
How about using an online java regex tool instead?
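As one way to "parse and split up first" without a regex, the sketch below walks the string and cuts it wherever the curly-brace depth returns to zero. It is only an illustration under a strong assumption: braces inside the JSON payloads are always balanced and never appear inside string values. BulkRequestSplitter is a hypothetical name, not an Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.List;

final class BulkRequestSplitter {

    /** Splits "index {...}, update {...}, ..." into its top-level requests. */
    static List<String> split(String bulk) {
        List<String> requests = new ArrayList<>();
        int depth = 0;   // current nesting level of curly braces
        int start = 0;   // index where the current request begins
        for (int i = 0; i < bulk.length(); i++) {
            char c = bulk.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    // A top-level request just closed.
                    requests.add(bulk.substring(start, i + 1).trim());
                    start = i + 1;
                    // Skip the ", " separator before the next request.
                    while (start < bulk.length()
                            && (bulk.charAt(start) == ',' || Character.isWhitespace(bulk.charAt(start)))) {
                        start++;
                    }
                }
            }
        }
        return requests;
    }

    public static void main(String[] args) {
        String bulk = "index {[event_20191209][event][null], source[{\"haha\":\"haha\"}]}, "
                + "delete {[event_20191208][_doc][sjdos]}";
        split(bulk).forEach(System.out::println);
    }
}
```

Unlike the lazy .+? in the question's pattern, the depth counter cannot stop at an inner closing brace, which matters for the nested update requests.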

Using regex to remove JSON quotes

I am being given some JSON from an external process that I can't change, and I need to modify this JSON string for a downstream Java process to work. The JSON string looks like:
{"widgets":"blah","is_dog":"1"}
But it needs to look like:
{"widgets":blah,"is_dog":"1"}
I have to remove the quotes around blah. In reality, blah is a huge JSON object, and so I've simplified it for the sake of this question. So I figured I'd attack the problem by doing two String#replace calls, one before blah, and one after it:
dataString = dataString.replaceAll("{\"widgets\":\"", "{\"widgets\":");
dataString = dataString.replaceAll("\",\"is_dog\":\"1\"}", ",\"is_dog\":\"1\"}");
When I run this I get a vague runtime error:
Illegal repetition
Can any regex maestros spot where I'm going awry? Thanks in advance.
I believe you need to escape braces. Braces are used for repetition ((foo){3} looks for foo three times in a row); hence the error.
Note: in this case it needs to be double escaping: \\{.
{ and } have a special meaning in a regex: they specify how many times a pattern may repeat. So they have to be escaped here.
Use "\\{\"widgets\":\"" as the pattern instead of "{\"widgets\":\"". The replacement string is not a regex, so "{\"widgets\":" can stay as it is.
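A minimal sketch of both fixes, using the simplified JSON from the question. The first call keeps replaceAll and escapes the brace. The second uses String#replace, which treats its arguments as literal text, so nothing needs escaping. WidgetUnquoter is a made-up name, and the sketch assumes the sequence ","is_dog" occurs only once in the input.

```java
final class WidgetUnquoter {

    /** Removes the quotes around the value of "widgets" in the sample JSON. */
    static String unquoteWidgets(String json) {
        // replaceAll takes a regex, so the opening brace must be escaped.
        String opened = json.replaceAll("\\{\"widgets\":\"", "{\"widgets\":");
        // replace takes plain text, so no regex escaping is involved at all.
        return opened.replace("\",\"is_dog\"", ",\"is_dog\"");
    }

    public static void main(String[] args) {
        String input = "{\"widgets\":\"blah\",\"is_dog\":\"1\"}";
        System.out.println(unquoteWidgets(input));
    }
}
```

If the text never needs pattern matching, String#replace is probably the simpler choice for both calls.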
Since the input string looks like valid JSON, your best bet would be to parse it with an actual parser into a map-like structure. Regexes are not the right tool for this. Serializing this structure to something that is not quite JSON would then be relatively simple.
I do wonder if you're better off taking the code for JSONObject and modifying the toString() method to make this a more reliable transformation than using regexps. Here's the source code, and you're looking for invocations of the quote() method
Well, why don't you simply do the following?
1) Decode the first JSON (which is correct with quotes) into varJSON1
2) Get the String "blah" in varJSON1 into varJSON2
3) Then decode the varJSON2

Java Regex for Finding a Pattern and Getting Value in It?

I am working on a plugin that will parse HTML files. I have a naming convention like this:
<!--$include="a.html" -->
or
<!--$include="a.html"-->
The two forms are equivalent.
I want to search an HTML file for this pattern (it is similar to server-side includes).
The question is:
How do I find that pattern and get the value (a.html in my example; it varies)?
It should work like this:
while(!notFinishedWholeFile){
fileName = findPatternFunc(htmlFile)
replaceFunc(fileName,something)
}
PS: I don't know whether using a regex in Java or implementing it differently (e.g. with .indexOf()) is better. If a regex performs well in this situation, I want to use it.
Any ideas?
You mean like this?
<!--\$include=\"(?<htmlName>[a-z-_]*).html\"\s?-->
Read a file into a string then
str = str.replaceAll("(?<=<!--\\$include=\")[^\"]+(?=\" ?-->)", something);
will replace the filenames with the string something, then the string can be written back to the file.
(Note: this replaces any text inside the double quotes, not just valid filenames.)
If you only want to replace filenames with the html extension, swap the [^\"]+ for [^.]+\\.html.
Using regex for this task is fine performance wise, but see e.g.
How to use regular expressions to parse HTML in Java? and Java Regex performance etc.
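For the "find the pattern and get the value" half of the question, a Matcher loop with a capturing group is one option. This is a sketch rather than a vetted parser: it assumes the include comments look exactly like the two forms in the question, and IncludeFinder is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IncludeFinder {

    // <!--$include="NAME"--> with optional whitespace before the closing -->
    private static final Pattern INCLUDE =
            Pattern.compile("<!--\\$include=\"([^\"]+)\"\\s*-->");

    /** Returns every included file name, in the order it appears in the HTML. */
    static List<String> findIncludes(String html) {
        List<String> names = new ArrayList<>();
        Matcher matcher = INCLUDE.matcher(html);
        while (matcher.find()) {
            names.add(matcher.group(1)); // group 1 is the text between the quotes
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<p>top</p><!--$include=\"a.html\" --><p>mid</p><!--$include=\"b.html\"-->";
        System.out.println(findIncludes(html));
    }
}
```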
I have used that pattern:
"<!--\\$include=\"(.+)(.)(html|htm)\"-->"

is it possible to use replaceAll() with wildcards

Good morning. I realize there are a ton of questions out there regarding replace() and replaceAll(), but I haven't seen this one.
What I'm looking to do is parse a string (which contains valid HTML, to a point). After I see the second instance of <p> in the string, I want to remove everything that starts with & and ends with ; until I see the next </p>.
To do the second part I was hoping to use something along the lines of s.replaceAll("&*;","")
That doesn't work, but hopefully it gets my point across: I am looking to replace anything that starts with & and ends with ;
You should probably leave the parsing to a DOM parser (see this question). I can almost guarantee you'll have to do this to find text within the <p> tags.
For the replacement logic, String.replaceAll uses regular expressions, which can do the matching you want.
The "wildcard" in regular expressions that you want is the .* expression. Using your example:
String ampStr = "This &escape;String";
String removed = ampStr.replaceAll("&.*;", "");
System.out.println(removed);
This outputs This String. This is because the . represents any character, and the * means "this character 0 or more times." So .* basically means "any number of characters." However, feeding it:
"This &escape;String &anotherescape;Extended"
will probably not do what you want: it outputs This Extended, because the greedy .* runs from the first & to the last ; in the string. To fix this, you specify exactly what you want to look for instead of the . character. This is done using [^;], which means "any character that's not a semicolon":
String removed = ampStr.replaceAll("&[^;]*;", "");
This has performance benefits over &.*?; for non-matching strings, so I highly recommend using this version, especially since not all HTML files will contain a &abc; token, and the &.*?; version can become a serious performance bottleneck as a result.
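The difference between the two patterns can be seen side by side in this short sketch, which reuses the sample string from above. EntityStripper is a made-up class name.

```java
final class EntityStripper {

    /** Greedy version: removes everything from the first & to the last ; */
    static String stripGreedy(String s) {
        return s.replaceAll("&.*;", "");
    }

    /** Negated-class version: removes each &...; token on its own. */
    static String stripEntities(String s) {
        return s.replaceAll("&[^;]*;", "");
    }

    public static void main(String[] args) {
        String text = "This &escape;String &anotherescape;Extended";
        System.out.println(stripGreedy(text));   // This Extended
        System.out.println(stripEntities(text)); // This String Extended
    }
}
```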
The expression you want is:
s.replaceAll("&.*?;","");
But do you really want to be parsing HTML this way? You may be better off using an XML parser.

Java Inner Text (getTextContent()) Problem

I'm trying to do some parsing in Java. I'm using the Cobra HTML Parser to get the HTML into a DOM, then XPath to get the nodes I want. When I get down to the desired level I call node.getTextContent(), but this gives me a string like
"\n\n\nValue\n-\nValue\n\n\n"
Is there a built in way to get rid of the line breaks? I would like to do a RegEx like
(?:\s*([^-]+)\s*-\s*([^-]+)\s*)
on the inner text and would really prefer not to have to deal with the possible different white space symbols in between the text.
Example Input:
Value
-
Value
Thanks
You can use String.replaceAll().
String trimmed = original_string.replaceAll("\n", "");
The first argument is a regular expression: you could remove every contiguous block of whitespace from the original string with replaceAll("\\s+", ""), for instance.
I'm not totally sure I understood the question correctly, but the simplest way to remove all the whitespace would be:
String s = node.getTextContent().replaceAll("\\s","");
If you just want to get rid of the leading/trailing whitespace, use trim().
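Putting the two steps together: strip the whitespace first, then apply a simpler version of the regex from the question. This is a sketch that assumes the values themselves contain no whitespace or hyphens, since stripping all whitespace would also remove spaces inside a value. ValuePairParser is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ValuePairParser {

    // After whitespace removal the text should look like "Value-Value".
    private static final Pattern PAIR = Pattern.compile("([^-]+)-([^-]+)");

    /** Returns the two values on either side of the hyphen. */
    static String[] parse(String innerText) {
        String compact = innerText.replaceAll("\\s+", "");
        Matcher matcher = PAIR.matcher(compact);
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Unexpected inner text: " + innerText);
        }
        return new String[] { matcher.group(1), matcher.group(2) };
    }

    public static void main(String[] args) {
        String[] pair = parse("\n\n\nValue\n-\nValue\n\n\n");
        System.out.println(pair[0] + " / " + pair[1]);
    }
}
```

If the values may contain spaces, keeping the \s* parts of the original pattern and calling trim() on each group is probably the safer variant.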
