Encoding URL strings with regular expression - java

I'm trying to replace several different characters with different values. For example, if I have #love hate, I would like to get back %23love%20hate.
Is it something to do with groups? I tried to use groups but I really didn't understand them.

You can try this:
String encodedString = URLEncoder.encode("#love hate", "UTF-8");
It does the escaping for you, but note that URLEncoder produces form encoding, so the space comes back as + (%23love+hate) rather than %20. To reverse it, do this:
String loveHate = URLDecoder.decode(encodedString, "UTF-8");
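A quick way to check what these calls actually return is the sketch below. It assumes Java 10 or later (for the overloads that take a Charset), and the class and helper names are made up for illustration. Swapping + for %20 afterwards is a common workaround, not something URLEncoder offers itself:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class UrlEncodeDemo {
    // Form-encodes the text, then rewrites '+' as "%20".
    // A literal '+' in the input is already encoded as %2B, so every
    // remaining '+' in the encoder's output stands for a space.
    static String encode(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8).replace("+", "%20");
    }

    static String decode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(URLEncoder.encode("#love hate", StandardCharsets.UTF_8)); // %23love+hate
        System.out.println(encode("#love hate"));     // %23love%20hate
        System.out.println(decode("%23love%20hate")); // #love hate
    }
}
```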

You don't need regex to replace single characters; it is overkill for such purposes. You can simply call the plain replace method of the String class in a loop, once for each character that you want to replace.
String output = input.replace("#", "%23");
output = output.replace(" ", "%20");
How many such characters do you want to get replaced?
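If the list of characters grows, the loop mentioned above could be driven by a table. This is only a sketch: the table contents and the class name are assumptions. One thing to watch is ordering: if % itself has to be escaped, it must go first, otherwise the percent signs introduced by earlier replacements get escaped again.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ReplaceLoopDemo {
    static String escape(String input) {
        // Hypothetical replacement table; LinkedHashMap keeps insertion order,
        // so "%" is handled before any replacement that introduces a "%".
        Map<String, String> table = new LinkedHashMap<>();
        table.put("%", "%25");
        table.put("#", "%23");
        table.put(" ", "%20");

        String output = input;
        for (Map.Entry<String, String> entry : table.entrySet()) {
            output = output.replace(entry.getKey(), entry.getValue());
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(escape("#love hate")); // %23love%20hate
        System.out.println(escape("100% #1"));    // 100%25%20%231
    }
}
```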

If you are trying to encode a URL as UTF-8 or some other encoding, using existing classes will be much easier, e.g. from the commons-httpclient project:
URIUtil.encodeWithinQuery(input, "UTF-8");

No, you will need multiple replaces. Another option is to use a group to find the next occurrence of one of several strings, inspect which string it is and replace it appropriately, perhaps using a map.

I think what you want to achieve is URL encoding rather than plain replacement.
See the answers in this SO thread, especially the one with 7 votes, which may be more interesting for you:
HTTP URL Address Encoding in Java

As Mat said, the best way to solve this problem is with URLEncoder. However, if you insist on using regex, then see the sample code in the documentation for java.util.regex.Matcher.appendReplacement:
Pattern p = Pattern.compile("cat");
Matcher m = p.matcher("one cat two cats in the yard");
StringBuffer sb = new StringBuffer();
while (m.find()) {
    m.appendReplacement(sb, "dog");
}
m.appendTail(sb);
System.out.println(sb.toString());
Within the loop, you can use m.group() to see what substring matched and then do a custom substitution based on that. This technique can be used for replacing ${variables} by looking them up in a map, etc.
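Applied to the question, that technique might look like the sketch below. The class name, the character class [# ] and the map contents are assumptions chosen to fit the #love hate example. Matcher.quoteReplacement is used so that a replacement containing $ or \ would be taken literally.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexEncodeDemo {
    // Hypothetical lookup table: matched character -> replacement text.
    private static final Map<String, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put("#", "%23");
        REPLACEMENTS.put(" ", "%20");
    }

    // Matches exactly the characters that have an entry in the table.
    private static final Pattern SPECIAL = Pattern.compile("[# ]");

    static String encode(String input) {
        Matcher m = SPECIAL.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // m.group() is the character that matched; look up its substitute.
            String replacement = REPLACEMENTS.get(m.group());
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("#love hate")); // %23love%20hate
    }
}
```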

Related

Removing specific part of string

I'm parsing every line of a file (an XML file) and I need to find path="this_is_my_path". After this, I need to extract what's inside the quotes, i.e. I need to get this_is_my_path.
This is what I'm doing:
String pattern = ".*path=\"(.*?)\"";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(the_text_file);
while (m.find()) {
System.out.println(m.group().trim());
}
After running this, I'm getting this:
path="path_to_file"
test="ui_test" path="path_to_other_file"
.....
I should be printing this:
path_to_file
path_to_other_file
path_to_other_fileX
path_to_other_fileW
If you need to use regex, try with this:
(?<=path=\")(.*?)(?=\")
Or you can use your own regex, but without the .* at the beginning, because it also matches any content before path= on the same line. Then get the value from group 1 with m.group(1) instead of m.group().
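A minimal sketch of the group 1 variant, with the leading .* dropped. The class and method names and the sample input are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathAttributeDemo {
    // No leading ".*", so the match starts at "path=" itself.
    private static final Pattern PATH = Pattern.compile("path=\"(.*?)\"");

    static List<String> extractPaths(String text) {
        List<String> paths = new ArrayList<>();
        Matcher m = PATH.matcher(text);
        while (m.find()) {
            // group(1) is only the part inside the quotes; group() would be the whole match.
            paths.add(m.group(1));
        }
        return paths;
    }

    public static void main(String[] args) {
        String text = "<a path=\"path_to_file\"/>\n<b test=\"ui_test\" path=\"path_to_other_file\"/>";
        System.out.println(extractPaths(text)); // [path_to_file, path_to_other_file]
    }
}
```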
Why reinvent the wheel? Unless this is a challenge or something?
http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
One should really try and collect the many reasons why using a regular expression is insufficient for getting anything out reliably from an XML file, even if that "anything" is just a measly attribute, e.g. path and its (string) value. A simple pattern such as "path=\"(.*?)\"" is doomed to fail due to the tiniest amount of freedom the XML spec leaves for writing legal XML, and more.
White space, including line breaks, may occur before and after the equal sign.
Apostrophes can be used instead of quotes.
Any character can be written as a numeric or named entity.
The string could be part of an element or attribute value.
The string could occur in an XML comment.
The XML file may be written in an encoding which naive reading as a vanilla text file fails to take into account; hence data may be garbage.
So, just for the record: I strongly suggest using an XSLT transformation to extract the desired attribute values. This requires just a very simple template. Using an XML parser requires more lines of code, but it is equally reliable.
And here is the Java code I strongly advocate not to use; it covers only two of the points mentioned above.
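String theText = ...;
String pattern = "\\bpath\\s*=\\s*(\"(.*?)\"|'(.*?)')";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(theText);
while (m.find()) {
    // Group 1 still includes the quote characters; group 2 (double quotes)
    // or group 3 (apostrophes) holds the bare value.
    String value = m.group(2) != null ? m.group(2) : m.group(3);
    System.out.println(value.trim());
}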
(And did you notice the word boundary preceding path? Just another chance to go wrong with this approach.)
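For comparison, the parser-based route recommended above could look roughly like this. It is a sketch using the JDK's built-in DOM parser and XPath rather than XSLT; the class name, the method name and the sample document are assumptions. The parser takes care of whitespace around =, apostrophes, character references and comments.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathAttributeDemo {
    // Returns the value of every attribute named "path" in the document.
    static List<String> pathValues(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList attrs = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//@path", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                values.add(attrs.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample: spaces around '=', apostrophes, a comment, a character reference.
        String xml = "<files><file path = 'a.txt'/><!-- path=\"ignored\" -->"
                + "<file test=\"ui_test\" path=\"b&#46;txt\"/></files>";
        System.out.println(pathValues(xml)); // [a.txt, b.txt]
    }
}
```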

How to extract substrings from a string in java

I am not so confident in Java, so I need some help extracting multiple substrings from a string. The string is given below.
I have a text file with possibly thousands of similar POS-tagged lines, and I need to extract the original text from them. I have tried using a tokenizer but didn't really get the result I wanted. I also tried Pattern and Matcher, and I am having problems with the regex.
String="I_PRP recently_RB purchased_VBD this_DT camera_NN";
I want to get the output: I recently purchased this camera.
I use this regex:
[\/](.*?)\s\b
But it's not working. Please help me.
try
String s= "I_PRP recently_RB purchased_VBD this_DT camera_NN";
s = s.replaceAll("_\\w+(?=(\\s|$))", "");
System.out.println(s);
prints
I recently purchased this camera
It seems that you are attaching a tag to indicate the word type (e.g. noun, verb or pronoun). If this suffix will always be capital letters, it is safer to use the following regex in your replaceAll:
s = s.replaceAll("_[A-Z]+(?=(\\s|$))", "");

Pattern match numbers/operators

Hey, I've been trying to figure out why this regular expression isn't matching correctly.
List l_operators = Arrays.asList(Pattern.compile(" (\\d+)").split(rtString.trim()));
The input string is "12+22+3"
The output I get is -- [,+,+]
There's a match at the beginning of the list which shouldn't be there? I really can't see it and I could use some insight. Thanks.
Well, technically, there is an empty string in front of the first delimiter (first sequence of digits). If you had, say a line of CSV, such as abc,def,ghi and another one ,jkl,mno you would clearly want to know that the first value in the second string was the empty string. Thus the behaviour is desirable in most cases.
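The behaviour described above is easy to reproduce with plain String.split. In this small sketch the class and helper names are invented, and the third call uses the digit pattern from the question without its capturing group:

```java
import java.util.Arrays;

class SplitDemo {
    // Splits the input and renders the resulting array for printing.
    static String show(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        System.out.println(show("abc,def,ghi", ",")); // [abc, def, ghi]
        System.out.println(show(",jkl,mno", ","));    // [, jkl, mno]
        // The input starts with a delimiter ("12"), so the first element is empty.
        // The empty string after the final "3" is dropped, because split
        // discards trailing empty strings by default.
        System.out.println(show("12+22+3", "\\d+"));  // [, +, +]
    }
}
```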
For your particular case, you need to deal with it manually, or refine your regular expression somehow. Like this for instance:
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(rtString);
if (m.find()) {
    List l_operators = Arrays.asList(p.split(rtString.substring(m.end()).trim()));
    // ...
}
Ideally, however, you should be using a parser for this type of string. You can't, for instance, deal with parentheses in expressions using just regular expressions.
That's the behavior of split in Java. You just have to take it (and deal with it) or use another library to split the string. I personally try to avoid Java's split.
One alternative is Splitter from Google Guava.
Try Guava's Splitter.
Splitter.onPattern("\\d+").omitEmptyStrings().split(rtString)
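If adding Guava is not an option, roughly the same effect can be had with the JDK alone by filtering out the empty strings after the split. This is a sketch assuming Java 8 streams, with invented class and method names; it is not Guava's implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class OperatorSplitDemo {
    // Splits on runs of digits and drops the empty leading element.
    static List<String> operators(String expression) {
        return Arrays.stream(expression.trim().split("\\d+"))
                .filter(piece -> !piece.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(operators("12+22+3")); // [+, +]
    }
}
```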

Java Regex - exclude empty tags from xml

Let's say I have three XML strings:
String logToSearch = "<abc><number>123456789012</number></abc>";
String logToSearch2 = "<abc><number xsi:type=\"soapenc:string\" /></abc>";
String logToSearch3 = "<abc><number /></abc>";
I need a pattern which finds the number tag only if the tag contains a value, i.e. the match should be found only in logToSearch.
I'm not looking for the number itself; rather, matcher.find() should return true only for the first string.
For now I have this:
Pattern pattern = Pattern.compile("<(" + patternString + ").*?>",
        Pattern.CASE_INSENSITIVE);
where patternString is simply "number". I tried "<(" + patternString + ")[^/>].*?>" but it didn't work, because in [^/>] each character is treated separately.
Thanks
This is absolutely the wrong way to parse XML. In fact, if you need more than just the basic example given here, there's provably no way to solve the more complex cases with regex.
Use an easy XML parser, like XOM. Then, using XPath, query for the elements and filter out those without data. I can only imagine that this question is a precursor to future headaches unless you modify your approach right now.
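As a rough illustration of the XPath idea, here is a sketch that uses the JDK's built-in DOM parser and XPath instead of XOM, with invented class and method names. It relies on the default, non-namespace-aware parser so that the undeclared xsi: prefix in the second string is accepted. Unlike the CASE_INSENSITIVE pattern in the question, the tag name is matched case-sensitively here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NonEmptyTagDemo {
    // True if the document has at least one <tag> element with non-blank text content.
    static boolean hasValue(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // normalize-space() is empty, and therefore false, for empty elements.
            String expression = "//" + tag + "[normalize-space()]";
            return (Boolean) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.BOOLEAN);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasValue("<abc><number>123456789012</number></abc>", "number"));          // true
        System.out.println(hasValue("<abc><number xsi:type=\"soapenc:string\" /></abc>", "number")); // false
        System.out.println(hasValue("<abc><number /></abc>", "number"));                             // false
    }
}
```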
So a search for "<number[^/>]*>" would find the opening tag. If you want to be sure it isn't empty, try "<number[^/>]*>[^<]" or "<number[^/>]*>[0-9]"

replacing regex in java string

I have this java string:
String bla = "<my:string>invalid_content</my:string>";
How can I replace the "invalid_content" piece?
I know I should use something like this:
bla.replaceAll(regex,"new_content");
in order to have:
"<my:string>new_content</my:string>";
but I can't discover how to create the correct regex
help please :)
You could do something like
String ResultString = subjectString.replaceAll("(<my:string>)(.*)(</my:string>)", "$1whatever$3");
Mark's answer will work, but can be improved with two simple changes:
The central parentheses are redundant if you're not using that group.
Making it non-greedy will help if you have multiple my:string tags to match.
Giving:
String ResultString = subjectString.replaceAll(
    "(<my:string>).*?(</my:string>)", "$1whatever$2");
But that's still not how I'd write it - the replacement can be simplified using lookbehind and lookahead, and you can avoid repeating the tag name, like this:
String ResultString = subjectString.replaceAll(
    "(?<=<(my:string)>).*?(?=</\\1>)", "whatever");
Of course, this latter one may not be as friendly to those who don't yet know regex - it is however more maintainable/flexible, so worth using if you might need to match more than just my:string tags.
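To see the difference between the variants discussed above, here is a small sketch that runs the greedy, non-greedy and lookaround patterns against an input with two tags. The class name, method names and sample input are invented. In Java source the backreference has to be written as \\1.

```java
class TagReplaceDemo {
    // Greedy: .* runs from the first opening tag to the last closing tag.
    static String greedy(String s) {
        return s.replaceAll("(<my:string>)(.*)(</my:string>)", "$1whatever$3");
    }

    // Non-greedy: .*? stops at the nearest closing tag, so each element is handled separately.
    static String nonGreedy(String s) {
        return s.replaceAll("(<my:string>).*?(</my:string>)", "$1whatever$2");
    }

    // Lookaround: only the content is matched; \\1 reuses the captured tag name.
    static String lookaround(String s) {
        return s.replaceAll("(?<=<(my:string)>).*?(?=</\\1>)", "whatever");
    }

    public static void main(String[] args) {
        String input = "<my:string>a</my:string><my:string>b</my:string>";
        System.out.println(greedy(input));     // <my:string>whatever</my:string>
        System.out.println(nonGreedy(input));  // <my:string>whatever</my:string><my:string>whatever</my:string>
        System.out.println(lookaround(input)); // same as the non-greedy result
    }
}
```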
See Java regex tutorial and check out character classes and capturing groups.
The PCRE would be:
/invalid_content/
For a simple substitution. What more do you want?
Is invalid_content a fixed value? If so, you could simply replace it with your new content using:
bla = bla.replaceAll("invalid_content","new_content");
