Regex for finding http and https url from a string - java

I have a string that contains multiple URLs starting with http or https. I need to fetch all of those URLs and put them into a list.
I have tried the code below.
List<String> httpLinksList = new ArrayList<>();
String hyperlinkRegex = "((http:\/\/|https:\/\/)?(([a-zA-Z0-9-]){2,}\.){1,4}([a-zA-Z]){2,6}(\/([a-zA-Z-_\/\.0-9#:?=&;,]*)?)?)";
String synopsis = "This is http://stackoverflow.com/questions and https://test.com/method?param=wasd The code below catches all urls in text and returns urls in list";
Pattern pattern = Pattern.compile(hyperlinkRegex);
Matcher matcher = pattern.matcher(synopsis);
while(matcher.find()){
System.out.println(matcher.find()+" "+matcher.group(1)+" "+matcher.groupCount()+" "+matcher.group(2));
httpLinksList.add(matcher.group());
}
System.out.println(httpLinksList);
I need the result below:
[http://stackoverflow.com/questions,
https://test.com/method?param=wasd]
But I am getting the output below:
[https://test.com/method?param=wasd]

This regex matches URLs with several common schemes, including http, https, and ftp:
String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class xmlValue {
public static void main(String[] args) {
String text = "This is http://stackoverflow.com/questions and https://test.com/method?param=wasd The code below catches all urls in text and returns urls in list";
System.out.println(extractUrls(text));
}
public static List<String> extractUrls(String text)
{
List<String> containedUrls = new ArrayList<String>();
String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
Pattern pattern = Pattern.compile(urlRegex, Pattern.CASE_INSENSITIVE);
Matcher urlMatcher = pattern.matcher(text);
while (urlMatcher.find())
{
containedUrls.add(text.substring(urlMatcher.start(0),
urlMatcher.end(0)));
}
return containedUrls;
}
}
Output:
[http://stackoverflow.com/questions,
https://test.com/method?param=wasd]
Credits: @BullyWiiPlaza
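As for why the code in the question returns only the second URL: the println inside the loop calls matcher.find() a second time, and every call to find() advances the matcher to the next match. By the time group() is added to the list, the matcher has already moved past the first URL. A minimal sketch of the difference follows. The class name, method names, and the deliberately simple https?://\S+ pattern are assumptions made for this illustration, not part of the question or the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindTwiceDemo {
    // Deliberately simple pattern for the demo (an assumption, not a full URL grammar).
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    // One find() per iteration: every match is collected.
    static List<String> collect(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    // Mirrors the loop in the question: the extra find() in the body
    // moves past the current match before group() is read.
    static List<String> collectWithExtraFind(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            boolean foundNext = matcher.find(); // advances to the next match
            if (foundNext) {
                urls.add(matcher.group());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "This is http://stackoverflow.com/questions and "
                + "https://test.com/method?param=wasd in some text";
        System.out.println(collect(text));
        System.out.println(collectWithExtraFind(text));
    }
}
```

Removing the find() call from the debug print (or printing matcher.group() only) should be enough to get both URLs with the original loop.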

So I know this is not exactly what you asked, since you are specifically looking for a regex, but I thought this would be fun to try out with an indexOf variant. I will leave it here as an alternative to whatever regex someone comes up with:
public static void main(String[] args){
String synopsis = "This is http://stackoverflow.com/questions and https://test.com/method?param=wasd The code below catches all urls in text and returns urls in list";
ArrayList<String> list = splitUrl(synopsis);
for (String s : list) {
System.out.println(s);
}
}
public static ArrayList<String> splitUrl(String s)
{
ArrayList<String> list = new ArrayList<>();
int spaceIndex = 0;
while (true) {
int httpIndex = s.indexOf("http", spaceIndex);
if (httpIndex < 0) {
break;
}
spaceIndex = s.indexOf(" ", httpIndex);
if (spaceIndex < 0) {
list.add(s.substring(httpIndex));
break;
}
else {
list.add(s.substring(httpIndex, spaceIndex));
}
}
return list;
}
All the logic is contained in the splitUrl(String s) method. It takes a String as a parameter and returns an ArrayList<String> of all the URLs it found.
It first searches for the index of any http, then for the first space that occurs after the URL, and takes the substring between the two. It then passes the index of that space as the second parameter to indexOf(String, int), so the next search starts after the URL that was already found and does not repeat it.
Additionally, a special case is needed when the URL is the final part of the String, as there is no space afterward. When indexOf for the space returns a negative value, I use substring(int) instead of substring(int, int), which takes everything from the current location to the end of the String.
The loop ends when either indexOf call returns a negative value, though if the space search is the one that fails, the final substring operation runs before the break.
Output:
http://stackoverflow.com/questions
https://test.com/method?param=wasd
Note: As someone mentioned in the comments too, this implementation will work with non-Latin characters such as Hiragana too, which could be an advantage over regex.

Related

Splitting string by new line with a condition

I am trying to split a String by \n only when it's not in my "action block".
Here is an example of a text: message\n [testing](hover: actions!\nnew line!) more\nmessage
I want to split whenever the \n is not inside the [](this \n should be ignored). I made a regex for it that you can see here: https://regex101.com/r/RpaQ2h/1/. In the example it seems to work correctly, so I followed up with an implementation in Java:
final List<String> lines = new ArrayList<>();
final Matcher matcher = NEW_LINE_ACTION.matcher(message);
String rest = message;
int start = 0;
while (matcher.find()) {
if (matcher.group("action") != null) continue;
final String before = message.substring(start, matcher.start());
if (!before.isEmpty()) lines.add(before.trim());
start = matcher.end();
rest = message.substring(start);
}
if (!rest.isEmpty()) lines.add(rest.trim());
return lines;
This should ignore any \n inside the pattern shown above. However, it never matches the "action" group; it seems that once the pattern is used in Java and a \n is present, the group never matches. I am a bit confused as to why, since it worked perfectly on regex101.
Instead of checking whether the group is action, you can simply use regex replacement with the group $1 (the first capture group).
I also changed your regex to (?<action>\[[^\]]*]\([^)]*\))|(?<break>\\n) as [^\]]* doesn't backtrack (.*? backtracks and causes more steps). I did the same with [^)]*.
Working code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
final String regex = "(?<action>\\[[^\\]]*\\]\\([^)]*\\))|(?<break>\\\\n)";
final String string = "message\\n [testing test](hover: actions!\\nnew line!) more\\nmessage";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
final String result = matcher.replaceAll("$1");
System.out.println(result);
}
}
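If you want the list of lines itself rather than a replaced string, the same alternation can drive a find() loop that splits only on the "break" alternative. The following is a rough sketch under a few assumptions: the class and method names are made up for this example, and the message is assumed to contain real newline characters (so the pattern uses \n, not a literal backslash followed by n as in the code above).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ActionAwareSplitter {
    // Either a whole [text](action) block, or a single newline character.
    // Because the action alternative comes first and consumes the whole block,
    // newlines inside the block are never seen by the "break" alternative.
    private static final Pattern TOKEN =
            Pattern.compile("(?<action>\\[[^\\]]*\\]\\([^)]*\\))|(?<break>\\n)");

    static List<String> splitLines(String message) {
        List<String> lines = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(message);
        int start = 0;
        while (matcher.find()) {
            if (matcher.group("break") == null) {
                continue; // an action block matched: keep it inside the current line
            }
            String before = message.substring(start, matcher.start()).trim();
            if (!before.isEmpty()) {
                lines.add(before);
            }
            start = matcher.end();
        }
        String rest = message.substring(start).trim();
        if (!rest.isEmpty()) {
            lines.add(rest);
        }
        return lines;
    }

    public static void main(String[] args) {
        String message = "message\n [testing](hover: actions!\nnew line!) more\nmessage";
        // Three entries; the middle one still contains the newline from the action block.
        System.out.println(splitLines(message));
    }
}
```

If your input really contains the two characters backslash and n, swap the break alternative for \\\\n as in the answer above.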

regex to get two different words from a string in java

I will be getting a string such as app1(down) and app2(up).
The words in the brackets indicate the status of each app; they may be up or down.
Now I need to use a regex to get the statuses of the apps as a comma-separated string.
Example: I'll get app1(UP) and app2(DOWN)
Required result: UP,DOWN
It's easy using RegEx like this:
\\((.*?)\\)
String x = "app1(UP) and app2(DOWN)";
Matcher m = Pattern.compile("\\((.*?)\\)").matcher(x);
String tmp = "";
while(m.find()) {
tmp+=(m.group(1))+",";
}
System.out.println(tmp);
Output:
UP,DOWN,
Java 8: using StringJoiner
String x = "app1(UP) and app2(DOWN)";
Matcher m = Pattern.compile("\\((.*?)\\)").matcher(x);
StringJoiner sj = new StringJoiner(",");
while(m.find()) {
sj.add((m.group(1)));
}
System.out.print(sj.toString());
Output:
UP,DOWN
(Last , is removed)
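On Java 9 or later, Matcher.results() gives a stream of matches, so the loop and the joiner can be collapsed into one expression. A small sketch, assuming Java 9+; the class and method names are invented for the example, and the pattern uses [^)]* instead of .*? (same result for this input).

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StatusJoiner {
    // Captures whatever sits between a pair of parentheses.
    private static final Pattern IN_PARENS = Pattern.compile("\\(([^)]*)\\)");

    static String joinStatuses(String input) {
        return IN_PARENS.matcher(input)
                .results()                          // Stream<MatchResult>, Java 9+
                .map(match -> match.group(1))       // text inside the parentheses
                .collect(Collectors.joining(","));  // no trailing comma
    }

    public static void main(String[] args) {
        System.out.println(joinStatuses("app1(UP) and app2(DOWN)")); // UP,DOWN
    }
}
```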
import java.util.ArrayList;
import java.util.List;
import java.util.regex.*;
public class ValidateDemo
{
public static void main(String[] args)
{
String input = "ill get app1(UP) and app2(DOWN)";
Pattern p = Pattern.compile("app[0-9]+\\(([A-Z]+)\\)");
Matcher m = p.matcher(input);
List<String> found = new ArrayList<String>();
while (m.find())
{
found.add(m.group(1));
}
System.out.println(found.toString());
}
}
My first Java program, have mercy.
Consider this code:
private static final Pattern RX_MATCH_APP_STATUS = Pattern.compile("\\s*(?<name>[^(\\s]+)\\((?<status>[^)\\s]+)\\)");
final String input = "app1(UP) or app2(down) let's have also app-3(DOWN)";
final Matcher m = RX_MATCH_APP_STATUS.matcher(input);
while (m.find()) {
final String name = m.group("name");
final String status = m.group("status");
System.out.printf("%s:%s\n", name, status);
}
This pulls as many app status entries from the input line as are actually there, and puts each app name and its status into its own variable. It's then up to you how you want to handle them (print or whatever).
Plus: this still works if states other than UP and DOWN (like UNKNOWN) show up later.
Minus: it also matches a name followed by bracketed text where the name is not actually an app and the content of the brackets is not an app state.
Use this as the regex and test it on http://regexr.com/
UP|DOWN

Replace pattern Java

I am making a program that allows the user to set variables and then use them in their messages such as %variable1% and I need a way of detecting the pattern which indicates a variable (%STRING%) . I am aware that I can use regex to find the patterns but am unsure how to use it to replace text.
I can also see a problem arising when using multiple variables in a single string as it may detect the space between 2 variables as a third variable
e.g. %var1%<-text that may be detected as a variable->%var2%, would this happen and is there any way to stop it?
Thanks.
A non-greedy regex would be helpful in extracting the variables which are within the 2 distinct % signs:
Pattern regex = Pattern.compile("\\%.*?\\%");
In this case, if your String is %variable1%mndhokajg%variable2%, it should print
%variable1%
%variable2%
If your String is %variable1%variable2% it should print
%variable1%
%variable1%%variable2% should print
%variable1%
%variable2%
You can now manipulate/use the extracted variables for your purpose:
Code:
public static void main(String[] args) {
try {
String tag = "%variable1%%variable2%";
Pattern regex = Pattern.compile("\\%.*?\\%");
Matcher regexMatcher = regex.matcher(tag);
while (regexMatcher.find()) {
System.out.println(regexMatcher.group());
}
} catch (Exception e) {
e.printStackTrace();
}
}
Try playing around with different Strings, there can be invalid scenarios with % as part of the String but your requirement doesn't seem to be that stringent.
Oracle's tutorial on the Pattern and Matcher classes should get you started. Here is an example from the tutorial that you may be interested in:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class ReplaceDemo {
private static String REGEX = "dog";
private static String INPUT =
"The dog says meow. All dogs say meow.";
private static String REPLACE = "cat";
public static void main(String[] args) {
Pattern p = Pattern.compile(REGEX);
// get a matcher object
Matcher m = p.matcher(INPUT);
INPUT = m.replaceAll(REPLACE);
System.out.println(INPUT);
}
}
Your second problem shouldn't happen if you use regex properly.
You can use this method to detect variables and replace them with values from a passed Map:
// regex to detect variables
private final Pattern varRE = Pattern.compile("%([^%]+)%");
public String varReplace(String input, Map<String, String> dictionary) {
Matcher matcher = varRE.matcher( input );
// StringBuffer to hold replaced input
StringBuffer buf = new StringBuffer();
while (matcher.find()) {
// get variable's value from dictionary
String value = dictionary.get(matcher.group(1));
// if found, replace the variable with its value; quoteReplacement keeps
// any '$' or '\' in the value from being read as a group reference
if (value != null)
matcher.appendReplacement(buf, Matcher.quoteReplacement(value));
}
matcher.appendTail(buf);
return buf.toString();
}
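For completeness, here is a self-contained variant of the same idea that you can run as-is. It is only a sketch: it assumes Java 9+ (for Matcher.replaceAll with a function and for Map.of), and the class name, method name, and sample variables are made up for the example. Unknown variables are left in the text unchanged.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VariableExpander {
    // %name% where name is one or more characters other than '%'.
    private static final Pattern VAR = Pattern.compile("%([^%]+)%");

    static String expand(String input, Map<String, String> variables) {
        return VAR.matcher(input).replaceAll(match -> {
            String value = variables.get(match.group(1));
            // Keep the original %name% text when the variable is not defined.
            String replacement = (value != null) ? value : match.group();
            // Escape '$' and '\' so the value is inserted literally.
            return Matcher.quoteReplacement(replacement);
        });
    }

    public static void main(String[] args) {
        Map<String, String> vars = Map.of("var1", "Hello", "var2", "World");
        System.out.println(expand("%var1%<-plain text->%var2%", vars));
        // Hello<-plain text->World
    }
}
```

Because each match consumes both of its % signs, the text between two variables is not picked up as a third variable, which was the concern in the question.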

Performing multiple string replacements with metacharacter regex patterns

I am trying to perform multiple string replacements using Java's Pattern and Matcher, where the regex pattern may include metacharacters (e.g. \b, (), etc.). For example, for the input string fit i am, I would like to apply the replacements:
\bi\b --> EYE
i --> I
I then followed the coding pattern from two questions (Java Replacing multiple different substring in a string at once, Replacing multiple substrings in Java when replacement text overlaps search text). In both, they create an or'ed search pattern (e.g foo|bar) and a Map of (pattern, replacement), and inside the matcher.find() loop, they look up and apply the replacement.
The problem I am having is that the matcher.group() function does not contain information on matching metacharacters, so I cannot distinguish between i and \bi\b. Please see the code below. What can I do to fix the problem?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.*;
public class ReplacementExample
{
public static void main(String argv[])
{
Map<String, String> replacements = new HashMap<String, String>();
replacements.put("\\bi\\b", "EYE");
replacements.put("i", "I");
String input = "fit i am";
String result = doit(input, replacements);
System.out.printf("%s\n", result);
}
public static String doit(String input, Map<String, String> replacements)
{
String patternString = join(replacements.keySet(), "|");
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(input);
StringBuffer resultStringBuffer = new StringBuffer();
while (matcher.find())
{
System.out.printf("match found: %s at start: %d, end: %d\n",
matcher.group(), matcher.start(), matcher.end());
String matchedPattern = matcher.group();
String replaceWith = replacements.get(matchedPattern);
// Do the replacement here.
matcher.appendReplacement(resultStringBuffer, replaceWith);
}
matcher.appendTail(resultStringBuffer);
return resultStringBuffer.toString();
}
private static String join(Set<String> set, String delimiter)
{
StringBuilder sb = new StringBuilder();
int numElements = set.size();
int i = 0;
for (String s : set)
{
sb.append(Pattern.quote(s));
if (i++ < numElements-1) { sb.append(delimiter); }
}
return sb.toString();
}
}
This prints out:
match found: i at start: 1, end: 2
match found: i at start: 4, end: 5
fIt I am
Ideally, it should be fIt EYE am.
You mistyped one of your regexes:
replacements.put("\\bi\\", "EYE"); //Should be \\bi\\b
replacements.put("i", "I");
You may also want to make your regexes unique. A HashMap gives no guarantee about the order of map.keySet(), so it may just be replacing i with I before checking \\bi\\b.
You could use capture groups, without straying too far from your existing design. So instead of using the matched pattern as the key, you look up based on the order within a List.
You would need to change the join method to put parentheses around each of the patterns, something like this:
private static String join(Set<String> set, String delimiter) {
StringBuilder sb = new StringBuilder();
sb.append("(");
int numElements = set.size();
int i = 0;
for (String s : set) {
sb.append(s);
if (i++ < numElements - 1) {
sb.append(")");
sb.append(delimiter);
sb.append("("); }
}
sb.append(")");
return sb.toString();
}
As a side note, the use of Pattern.quote in the original code listing would have caused the match to fail where those metacharacters were present.
Having done this, you would now need to determine which of the capture groups was responsible for the match. For simplicity I'm going to assume that none of the match patterns will themselves contain capture groups, in which case something like this would work, within the matcher while loop:
int index = -1;
for (int j=1;j<=replacements.size();j++){
if (matcher.group(j) != null) {
index = j;
break;
}
}
if (index >= 0) {
System.out.printf("Match on index %d = %s %d %d\n", index, matcher.group(index), matcher.start(index), matcher.end(index));
}
Next, we would like to use the resulting index value to index straight back into the replacements. The original code uses a HashMap, which is not suitable for this; you're going to have to refactor that to use a pair of Lists in some form, one containing the list of match patterns and the other the corresponding list of replacement strings. I won't do that here, but I hope that provides enough detail to create a working solution.
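To make that last step concrete, here is one way the pieces could fit together. Treat it as a sketch under stated assumptions, not a drop-in replacement: the class and method names are invented, a LinkedHashMap stands in for the pair of lists (its iteration order is the insertion order), and, as above, the individual patterns are assumed to contain no capture groups of their own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiReplace {
    // rules: regex -> replacement, in priority order (earlier alternatives win).
    static String apply(String input, LinkedHashMap<String, String> rules) {
        List<String> replacements = new ArrayList<>(rules.values());

        // Build (regex1)|(regex2)|... so that group n belongs to rule n.
        StringBuilder alternation = new StringBuilder();
        for (String regex : rules.keySet()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append('(').append(regex).append(')');
        }

        Matcher matcher = Pattern.compile(alternation.toString()).matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // The first non-null group tells us which rule produced the match.
            for (int g = 1; g <= replacements.size(); g++) {
                if (matcher.group(g) != null) {
                    matcher.appendReplacement(out,
                            Matcher.quoteReplacement(replacements.get(g - 1)));
                    break;
                }
            }
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        LinkedHashMap<String, String> rules = new LinkedHashMap<>();
        rules.put("\\bi\\b", "EYE");
        rules.put("i", "I");
        System.out.println(apply("fit i am", rules)); // fIt EYE am
    }
}
```

The more specific pattern (\bi\b) has to be inserted before the general one (i), since the regex engine tries the alternatives from left to right at each position.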

replace StringTokenizer by String.split(..)

Is it possible to build a regexp for use with Java's Pattern.split(..) method to reproduce the StringTokenizer("...", "...", true) behaviour?
So that the input is split into an alternating sequence of the predefined token characters and any arbitrary strings running between them.
The Javadoc for StringTokenizer calls it a legacy class whose use is discouraged in new code and recommends String.split(..) or java.util.regex instead, so it should be possible.
The reason I want to use split is that regular expressions are often highly optimized. StringTokenizer, for example, is quite slow on the Android platform's VM, while regex patterns seem to be executed by optimized native code there.
Considering that the documentation for split doesn't specify this behavior and has only one optional parameter (the limit, which caps how large the resulting array can be), no, you can't.
Also, the only other class I can think of that could have this feature, Scanner, doesn't either. So I think the easiest option is to continue using StringTokenizer, even if its use is discouraged. That's better than writing your own class; while that shouldn't be too hard (quite trivial really), I can think of better ways to spend one's time.
A regex Pattern can help you:
Pattern p = Pattern.compile("(.*?)(\\s+)");
//put the boundary regex in the second pair of brackets (where the \\s+ now is);
//it must not match the empty string, or the loop below never advances
Matcher m = p.matcher(string);
int endindex=0;
while(m.find(endindex)){
//m.group(1) is the part between the pattern
//m.group(2) is the match found of the pattern
endindex = m.end();
}
//then the remainder of the string is string.substring(endindex);
import java.util.List;
import java.util.LinkedList;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Splitter {
public Splitter(String s, String delimiters) {
this.string = s;
this.delimiters = delimiters;
Pattern pattern = Pattern.compile(delimiters);
this.matcher = pattern.matcher(string);
}
public String[] split() {
String[] strs = string.split(delimiters);
String[] delims = delimiters();
if (strs.length == 0) { return new String[0];}
assert(strs.length == delims.length + 1);
List<String> output = new LinkedList<String>();
int i;
for(i = 0;i < delims.length;i++) {
output.add(strs[i]);
output.add(delims[i]);
}
output.add(strs[i]);
return output.toArray(new String[0]);
}
private String[] delimiters() {
List<String> delims = new LinkedList<String>();
while(matcher.find()) {
delims.add(string.subSequence(matcher.start(), matcher.end()).toString());
}
return delims.toArray(new String[0]);
}
public static void main(String[] args) {
Splitter s = new Splitter("a b\tc", "[ \t]");
String[] tokensanddelims = s.split();
assert(tokensanddelims.length == 5);
System.out.print(tokensanddelims[0].equals("a"));
System.out.print(tokensanddelims[1].equals(" "));
System.out.print(tokensanddelims[2].equals("b"));
System.out.print(tokensanddelims[3].equals("\t"));
System.out.print(tokensanddelims[4].equals("c"));
}
private Matcher matcher;
private String string;
private String delimiters;
}
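Another option that stays with plain split is to split on zero-width lookarounds, so that the delimiters are not consumed and end up as their own array elements. This is only a sketch: the class and method names are invented, the delimiter set is assumed to be passed as a regex character class such as [ \t], and each delimiter character becomes a separate element (which is also what StringTokenizer does with returnDelims set to true).

```java
import java.util.Arrays;

class DelimiterSplit {
    // delimiterClass is a regex character class, e.g. "[ \\t]" (assumed well-formed).
    static String[] splitKeepingDelimiters(String input, String delimiterClass) {
        // Split at every position that has a delimiter right before it
        // or right after it; nothing is consumed, so delimiters are kept.
        String boundary = "(?<=" + delimiterClass + ")|(?=" + delimiterClass + ")";
        return input.split(boundary);
    }

    public static void main(String[] args) {
        String[] parts = splitKeepingDelimiters("a b\tc", "[ \\t]");
        // Five elements: "a", " ", "b", a tab, and "c"
        System.out.println(Arrays.toString(parts));
    }
}
```

Whether this is actually faster than StringTokenizer on a given platform is something you would have to measure.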
