Find URL in String - java

hi im tring to find a URL in a string, i founded many topics about this using regex but i have a problem. Using this pattern:
String regex = "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" +
"(\\w+:\\w+#)?(([-\\w]+\\.)+(com|org|net|gov" +
"|mil|biz|info|mobi|name|aero|jobs|museum" +
"|travel|[a-z]{2}))(:[\\d]{1,5})?" +
"(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
"((\\?([-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
"([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
"(&(?:[-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
"([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
"(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b";
Its works pretty well in most of pages, but i have an issue with other. For example:
http://hello.com/hello world
returns
http://hello.com/hello
The problems is that space.
Anyone have a nice pattern that solve this?
Thanks.
EDIT:: this is my code
private ArrayList<String> pullLinks(String text) {
ArrayList<String> links = new ArrayList<String>();
String regex = "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" +
"(\\w+:\\w+#)?(([-\\w]+\\.)+(com|org|net|gov" +
"|mil|biz|info|mobi|name|aero|jobs|museum" +
"|travel|[a-z]{2}))(:[\\d]{1,5})?" +
"(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
"((\\?([-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
"([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
"(&(?:[-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
"([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
"(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(text);
while(m.find()) {
String urlStr = m.group();
if (urlStr.startsWith("(") && urlStr.endsWith(")"))
{
urlStr = urlStr.substring(1, urlStr.length() - 1);
}
links.add(urlStr);
}
return links;
}

Spaces are not allowed in URLs (they need to be replaced by %20). See for instance the answer to this question:
Spaces in URLs?
If you allow URLs to include spaces anyway, then how would you interpret for instance http://www.google.com/ig is a nice webpage? Clearly the part after /ig should not be included!

Space is not a valid URL character.
Also, if you don't use whitespace as your terminator how are you going to find the end of the URL?
Your regex is also failing to account for other top level domains (like .int). I'm not actually sure why it is looking for specific TLDs at all as they are not required to form a valid URL.

Related

Pattern matching and get multiple values from URL using java

I am using Java-8, I would like to check whether the URL is valid or not based on pattern.
If valid then I should get the attributes bookId, authorId, category, mediaId
Pattern: <basepath>/books/<bookId>/author/<authorId>/<isbn>/<category>/mediaId/<filename>
And this is the sample URL
URL => https:/<baseurl>/v1/files/library/books/1234-4567/author/56784589/32475622347586/media/324785643257567/507f1f77bcf86cd799439011_400.png
Here Basepath is /v1/files/library.
I see some pattern matchings but I couldn't relate with my use-case, probably I was not good at reg-ex. I am also using apache-common-utils but I am not sure How to achieve it either.
Any help or hint would be really appreciable.
Try this solution (uses named capture groups in regex):
public static void main(String[] args)
{
Pattern p = Pattern.compile("http[s]?:.+/books/(?<bookId>[^/]+)/author/(?<authorId>[^/]+)/(?<isbn>[^/]+)/media/(?<mediaId>[^/]+)/(?<filename>.+)");
Matcher m = p.matcher("https:/<baseurl>/v1/files/library/books/1234-4567/author/56784589/32475622347586/media/324785643257567/507f1f77bcf86cd799439011_400.png");
if (m.matches())
{
System.out.println("bookId = " + m.group("bookId"));
System.out.println("authorId = " + m.group("authorId"));
System.out.println("isbn = " + m.group("isbn"));
System.out.println("mediaId = " + m.group("mediaId"));
System.out.println("filename = " + m.group("filename"));
}
}
prints:
bookId = 1234-4567
authorId = 56784589
isbn = 32475622347586
mediaId = 324785643257567
filename = 507f1f77bcf86cd799439011_400.png

How to replace all domains with pattern in a XML string in Java?

I have an XML output like this (<xml> element or xlink:href attribute are just fiction and you cannot rely on them to create regex pattern.)
<xml>http://localhost:8080/def/abc/xyx</xml>
<element xlink:href="http://localhostABCDEF/def/ABC/XYZ">Some Text</element>
...
What I want to do is using Java regex to replace the domain pattern (I don't know about existing domains):
"http(s)?://.*/def/.*
with an input domain (e.g: http://google.com/def) and the result will be:
<xml>http://google.com/def/abc/xyx</xml>
<element xlink:href="http://google.com.com/def/ABC/XYZ">Some Text</element>
...
How can I do it? I think Regex in Java can do or String.replaceAll (but this one seems not possible).
Regex: http[s]?:\/{2}.+\/def Substitution: http://google.com/def
Details:
? Matches between zero and one times
[] Match a single character present in the list
. Matches any character
+ Matches between one and unlimited times
Java code:
String domain = "http://google.com/def";
String html = "<xml>http://localhost:8080/def/abc/xyx</xml>\r\n<element xlink:href=\"http://localhostABCDEF/def/ABC/XYZ\">Some Text</element>";
html = html.replaceAll("http[s]?:\\/{2}.+\\/def", domain);
System.out.print(html);
Output:
<xml>http://google.com/def/abc/xyx</xml>
<element xlink:href="http://google.com/def/ABC/XYZ">Some Text</element>
Actually, this could be done with Regex and it is simple enough than parsing XML document. Here is the answer:
String text = "<epsg:CommonMetaData>\n"
+ " <epsg:type>geographic 2D</epsg:type>\n"
+ " <epsg:informationSource>EPSG. See 3D CRS for original information source.</epsg:informationSource>\n"
+ " <epsg:revisionDate>2007-08-27</epsg:revisionDate>\n"
+ " <epsg:changes>\n"
+ " <epsg:changeID xlink:href=\"http://www.opengis.net/def/change-request/EPSG/0/2002.151\"/>\n"
+ " <epsg:changeID xlink:href=\"http://www.opengis.net/def/change-request/EPSG/0/2003.370\"/>\n"
+ " <epsg:changeID xlink:href=\"http://www.opengis.net/def/change-request/EPSG/0/2006.810\"/>\n"
+ " <epsg:changeID xlink:href=\"http://www.opengis.net/def/change-request/EPSG/0/2007.079\"/>\n"
+ " </epsg:changes>\n"
+ " <epsg:show>true</epsg:show>\n"
+ " <epsg:isDeprecated>false</epsg:isDeprecated>\n"
+ " </epsg:CommonMetaData>\n"
+ " </gml:metaDataProperty>\n"
+ " <gml:metaDataProperty>\n"
+ " <epsg:CRSMetaData>\n"
+ " <epsg:projectionConversion xlink:href=\"http://www.opengis.net/def/coordinateOperation/EPSG/0/15593\"/>\n"
+ " <epsg:sourceGeographicCRS xlink:href=\"http://www.opengis.net/def/crs/EPSG/0/4979\"/>\n"
+ " </epsg:CRSMetaData>\n"
+ " </gml:metaDataProperty>"
+ "<gml:identifier codeSpace=\"OGP\">http://www.opengis.net/def/area/EPSG/0/1262</gml:identifier>";
String patternString1 = "(http(s)?://.*/def/.*)";
Pattern pattern = Pattern.compile(patternString1);
Matcher matcher = pattern.matcher(text);
String prefixDomain = "http://localhost:8080/def";
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
String url = prefixDomain + matcher.group(1).split("def")[1];
matcher.appendReplacement(sb, url);
System.out.println(url);
}
matcher.appendTail(sb);
System.out.println(sb.toString());
which returns output https://www.diffchecker.com/CyJ8fY8p

Nested/Repeated Group in Regex

I have to parse a multi line string and retrieve the email addresses in a specific location.
And I have done it using the below code:
String input = "Content-Type: application/ms-tnef; name=\"winmail.dat\"\r\n"
+ "Content-Transfer-Encoding: binary\r\n" + "From: ABC aa DDD <aaaa.b#abc.com>\r\n"
+ "To: DDDDD dd <sssss.r#abc.com>\r\n" + "CC: Rrrrr rrede <sssss.rv#abc.com>, Dsssssf V R\r\n"
+ " <dsdsdsds.vr#abc.com>, Psssss A <pssss.a#abc.com>, Logistics\r\n"
+ " <LOGISTICS#abc.com>, Gssss Bsss P <gdfddd.p#abc.com>\r\n"
+ "Subject: RE: [MyApps] (PRO-34604) PR for Additional Monitor allocation [CITS\r\n"
+ " Ticket:258849]\r\n" + "Thread-Topic: [MyApps] (PRO-34604) PR for Additional Monitor allocation\r\n"
+ " [CITS Ticket:258849]\r\n" + "Thread-Index: AQHRXMJHE6KqCFxKBEieNqGhdNy7Pp8XHc0A\r\n"
+ "Date: Mon, 1 Feb 2016 17:56:17 +0530\r\n"
+ "Message-ID: <B7F84439E634A44AB586E3FF2EA0033A29E27E47#JETWINSRVRPS01.abc.com>\r\n"
+ "References: <JA.101.1453963700000#myapps.abc.com>\r\n"
+ " <JA.101.1453963700000.978.1454311765375#myapps.abc.com>\r\n"
+ "In-Reply-To: <JIRA.450101.1453963700000.978.1454311765375#myapps.abc.com>\r\n"
+ "Accept-Language: en-US\r\n" + "Content-Language: en-US\r\n" + "X-MS-Has-Attach:\r\n"
+ "X-MS-Exchange-Organization-SCL: -1\r\n"
+ "X-MS-TNEF-Correlator: <B7F84439E634A44AB586E3FF2EA0033A29E27E47#JETWINSRVRPS01.abc.com>\r\n"
+ "MIME-Version: 1.0\r\n" + "X-MS-Exchange-Organization-AuthSource: TURWINSRVRPS01.abc.com\r\n"
+ "X-MS-Exchange-Organization-AuthAs: Internal\r\n" + "X-MS-Exchange-Organization-AuthMechanism: 04\r\n"
+ "X-Originating-IP: [1.1.1.7]";
Pattern pattern = Pattern.compile("To:(.*<([^>]*)>).*Message-ID", Pattern.DOTALL);
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
Pattern innerPattern = Pattern.compile("<([^>]*)>");
Matcher innerMatcher = innerPattern.matcher(matcher.group(1));
while (innerMatcher.find()) {
System.out.println("-->:" + innerMatcher.group(1));
}
}
Here it works fine. I'm first grouping the part from To till the Message which is the required part. And then I have another grouping to extract the email ids.
Is there any better way to do this? Can we do it with one pattern matcher set?
Update:
This is the expected output:
-->:sssss.r#abc.com
-->:sssss.rv#abc.com
-->:dsdsdsds.vr#abc.com
-->:pssss.a#abc.com
-->:LOGISTICS#abc.com
-->:gdfddd.p#abc.com
Ideally, you could have used lookarounds:
(?<=To:.*)<([^>]+)>(?=.*Message-ID)
Visualization by Debuggex
Unfortunately, Java doesn't support variable length in lookbehinds. A workaround could be:
(?<=To:.{0,1000})<([^>]+)>(?=.*Message-ID)
I think you are looking for all the emails inside <...> that come after To: and before Message-ID. So, you may use a \G based regex for one pass:
Pattern pt = Pattern.compile("(?:\\bTo:|(?!^)\\G).*?<([^>]*)>(?=.*Message-ID)", Pattern.DOTALL);
Matcher m = pt.matcher(input);
while (m.find()) {
System.out.println(m.group(1));
}
See IDEONE demo and a regex demo
The regex matches:
(?:\\bTo:|(?!^)\\G) - a leading boundary, either To: as a whole word or the location after the previous successful match
.*? - any characters, any number of occurrences up to the first
<([^>]*)> - substring starting with < followed with zero or more characters other than > (Group 1) and followed with a closing >
(?=.*Message-ID) - a positive lookahead that makes sure there is Message-ID somewhere ahead of the current match.

Intelligent String Parsing in java

I have an email Subject line that i need to parse. I need to find first occurance of any word given in a list of words and get the next word which can be separated by
('=' or ',' or ';' or 'blank' or '.').
for example
list of word for customer ["customer","client","kunden","kd.nr."]
list of word for Order ["order","auftrag","auftragsnummer","auftragnr."]
separator : [= , ; .]
subjectline: Customer 2013ABC has send an Aufrag 2056899A for Motif=A
I need to parse the information like
customer=2013ABC
order=2056899A
Motif=A
I am using Java 7 so Scanner class can be used as well.
Thanks for any tips in advance
You can achieve this by using regular expressions, here is a sample code:
Pattern p = Pattern.compile(".*(customer|client|kunden|kd\\.nr\\.)[=,;\\. ]*(\\w*).*(order|auftrag|auftragsnummer|auftragnr\\.)[=,;\\. ]*(\\w*).*[ ](.*)$", Pattern.CASE_INSENSITIVE);
String subject = "subjectline: kd.nr. 2013ABC has send an Auftrag 2056899A for Motif=A";
Matcher m = p.matcher(subject);
if(m.matches()) {
System.out.println(m.group(1) + " : " + m.group(2) );
System.out.println(m.group(3) + " : " + m.group(4));
System.out.println(m.group(5));
}
Hope this helps.

Regular expression not returning the .group() value

I'm new with java and using regular expressions. The method seems to be OK, and it's finding results on the subject string, but when I try to get the actual string using .group(), it's empty. here's the code:
public String TestRegularExpression(){
try{
Pattern regex = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
Matcher regexMatcher = regex.matcher(sourceCode);
while (regexMatcher.find()) {
results += "<li>" + regexMatcher.group() + "</li>";
matches ++;
}
} catch (PatternSyntaxException ex) {
results = "<li><strong class='ibm-important'>Syntax error in the regular expression</strong></li>";
}
if(results == null){results = "<li><strong class='ibm-important'>No meta tags found</strong></li>";}
return "<h3>" + h3Title + " (" + matches + " found)</h3><ul>" + results + "</ul>";
}
Any help will be much appreciated!!!
Couldn't it be that you're just not seeing the output? If you output the match directly to HTML without quoting it, that'll just insert the META tag in the HTML code, and the web browser won't render it.

Categories