I am trying to replace URLs inside a String with browser-compatible hyperlinks.
My initial String looks like this:
"hello, i'm some text with an url like http://www.the-url.com/ and I need to have an hypertext link !"
What I want to get is a String looking like this:
"hello, i'm some text with an url like <a href="http://www.the-url.com/">http://www.the-url.com/</a> and I need to have an hypertext link !"
I can match the URL with this line of code:
String withUrlString = myString.replaceAll(".*://[^<>[:space:]]+[[:alnum:]/]", "HereWasAnURL");
The regular expression may need some correction, but it works so far; I will test it further later.
So the question is how to keep the text matched by the regexp and just add what is needed to create the link: <a href="matched string">matched string</a>
Thanks in advance for your interest and responses!
Try to use:
myString.replaceAll("(.*://[^<>[:space:]]+[[:alnum:]/])", "<a href=\"$1\">HereWasAnURL</a>");
I didn't check your regex.
By using () you can create groups. The $1 indicates the group index.
$1 will replace the url.
I asked a similar question: my question
Some examples: Capturing Text in a Group in a regular expression
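As a minimal sketch of the capture-group approach described above: the pattern here is a simplified stand-in (an assumption, not the asker's exact expression). It uses \s and \p{Alnum} because java.util.regex does not support the POSIX bracket forms [:space:] and [:alnum:]. The names LinkifyDemo and linkify are hypothetical.

```java
import java.util.regex.Pattern;

class LinkifyDemo {
    // Group 1 captures the whole URL: a scheme, "://", then characters that are
    // not whitespace or angle brackets, ending on an alphanumeric or a slash.
    private static final Pattern URL =
            Pattern.compile("((?:https?|ftp)://[^\\s<>]+[\\p{Alnum}/])");

    static String linkify(String text) {
        // $1 in the replacement refers to whatever group 1 matched.
        return URL.matcher(text).replaceAll("<a href=\"$1\">$1</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("an url like http://www.the-url.com/ and more"));
    }
}
```

Note that this sketch does not HTML-escape the surrounding text; if the input can contain markup, escape it before linkifying.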
public static String textToHtmlConvertingURLsToLinks(String text) {
    if (text == null) {
        return text;
    }
    String escapedText = HtmlUtils.htmlEscape(text);
    return escapedText.replaceAll("(\\A|\\s)((http|https|ftp|mailto):\\S+)(\\s|\\z)",
            "$1<a href=\"$2\">$2</a>$4");
}
There may be better REGEXs out there, but this does the trick as long as there is white space after the end of the URL or the URL is at the end of the text. This particular implementation also uses org.springframework.web.util.HtmlUtils to escape any other HTML that may have been entered.
For anybody who is looking for a more robust solution, I can suggest the Twitter Text Libraries.
Replacing the URLs with this library works like this:
new Autolink().autolink(plainText)
The code below replaces links starting with "http" or "https", then links starting with just "www.", and finally email addresses.
Pattern httpLinkPattern = Pattern.compile("(http[s]?)://(www\\.)?([\\S&&[^.@]]+)(\\.[\\S&&[^@]]+)");
Pattern wwwLinkPattern = Pattern.compile("(?<!http[s]?://)(www\\.+)([\\S&&[^.@]]+)(\\.[\\S&&[^@]]+)");
Pattern mailAddressPattern = Pattern.compile("[\\S&&[^@]]+@([\\S&&[^.@]]+)(\\.[\\S&&[^@]]+)");
String textWithHttpLinksEnabled =
    "ajdhkas www.dasda.pl/asdsad?asd=sd www.absda.pl maiandrze@asdsa.pl klajdld http://dsds.pl httpsda http://www.onet.pl https://www.onsdas.plad/dasda";
if (Objects.nonNull(textWithHttpLinksEnabled)) {
    // $0 is the whole match; each pass wraps it in an anchor tag.
    Matcher httpLinksMatcher = httpLinkPattern.matcher(textWithHttpLinksEnabled);
    textWithHttpLinksEnabled = httpLinksMatcher.replaceAll("<a href=\"$0\">$0</a>");
    final Matcher wwwLinksMatcher = wwwLinkPattern.matcher(textWithHttpLinksEnabled);
    textWithHttpLinksEnabled = wwwLinksMatcher.replaceAll("<a href=\"http://$0\">$0</a>");
    final Matcher mailLinksMatcher = mailAddressPattern.matcher(textWithHttpLinksEnabled);
    textWithHttpLinksEnabled = mailLinksMatcher.replaceAll("<a href=\"mailto:$0\">$0</a>");
    System.out.println(textWithHttpLinksEnabled);
}
Prints:
ajdhkas <a href="http://www.dasda.pl/asdsad?asd=sd">www.dasda.pl/asdsad?asd=sd</a> <a href="http://www.absda.pl">www.absda.pl</a> <a href="mailto:maiandrze@asdsa.pl">maiandrze@asdsa.pl</a> klajdld <a href="http://dsds.pl">http://dsds.pl</a> httpsda <a href="http://www.onet.pl">http://www.onet.pl</a> <a href="https://www.onsdas.plad/dasda">https://www.onsdas.plad/dasda</a>
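Assuming your regex captures the correct text, you can use backreferences in your substitution. See the Java regexp tutorial.
In that case, wrap the pattern in a capturing group and do the following (Java refers to groups in the replacement string as $1, not \1):
myString.replaceAll(....., "<a href=\"$1\">$1</a>")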
In case of multiline text you can use this:
text.replaceAll("(\\s|\\^|\\A)((http|https|ftp|mailto):\\S+)(\\s|\\$|\\z)",
"$1<a href='$2'>$2</a>$4");
And here is a full example of my code, where I need to show users' posts with URLs in them:
private static final Pattern urlPattern = Pattern.compile(
"(\\s|\\^|\\A)((http|https|ftp|mailto):\\S+)(\\s|\\$|\\z)");
String userText = ""; // user content from db
String replacedValue = HtmlUtils.htmlEscape(userText);
replacedValue = urlPattern.matcher(replacedValue).replaceAll("$1<a href='$2'>$2</a>$4");
replacedValue = StringUtils.replace(replacedValue, "\n", "<br>");
System.out.println(replacedValue);
I want to parse all google map links inside a String. The format is as follows :
1st example
https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7bcdecbb1df:0x715969d86d0b76bf!8m2!3d38.8976763!4d-77.0365298
https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z
https://www.google.com/maps/place//#38.8976763,-77.0387185,17z
https://maps.google.com/maps/place//#38.8976763,-77.0387185,17z
https://www.google.com/maps/place/#38.8976763,-77.0387185,17z
https://google.com/maps/place/#38.8976763,-77.0387185,17z
http://google.com/maps/place/#38.8976763,-77.0387185,17z
https://www.google.com.tw/maps/place/#38.8976763,-77.0387185,17z
These are all valid google map URLs (linking to White House)
Here is what I tried
String gmapLinkRegex = "(http|https)://(www\\.)?google\\.com(\\.\\w*)?/maps/(place/.*)?#(.*z)[^ ]*";
Pattern patternGmapLink = Pattern.compile(gmapLinkRegex , Pattern.CASE_INSENSITIVE);
Matcher m = patternGmapLink.matcher(s);
while (m.find()) {
logger.info("group0 = {}" , m.group(0));
String place = m.group(4);
place = StringUtils.stripEnd(place , "/"); // remove trailing '/'
place = StringUtils.removeStart(place , "place/"); // remove leading 'place/' (stripStart would strip any of these characters, not the prefix)
logger.info("place = '{}'" , place);
String latLngZ = m.group(5);
logger.info("latLngZ = '{}'" , latLngZ);
}
It works in simple cases, but it is still buggy.
For example:
It needs post-processing to grab the optional place information.
It cannot handle a line that contains two URLs, such as:
s = "https://www.google.com/maps/place//#38.8976763,-77.0387185,17z " +
" and http://google.com/maps/place/#38.8976763,-77.0387185,17z";
This should yield two URLs, but the regex matches the whole line.
The requirements:
The whole URL should be matched in group(0), including the trailing data part in the 1st example.
In the 1st example, if the zoom level (17z) is removed, it is still a valid Google Maps URL, but my regex cannot match it.
The optional place info should be easier to extract.
Lat/lng extraction is a must; the zoom level is optional.
It should be able to parse multiple URLs in one line.
It should be able to handle maps.google.com(.xx)/maps. I tried (www|maps\.)? but it still seems buggy.
Any suggestions for improving this regex? Thanks a lot!
The dot-asterisk
.*
will always allow anything to the end of the last url.
You need "tighter" regexes, which match a single URL but not several with anything in between.
The "[^ ]*" might include the next URL if it is separated by something other than " ", which includes line break, tab, shift-space...
I propose (sorry, not tested on java), to use "anything but #" and "digit, minus, comma or dot" and "optional special string followed by tailored charset, many times".
"(http|https)://(www\.)?google\.com(\.\w*)?/maps/(place/[^#]*)?#([0123456789\.,-]*z)(\/data=[\!:\.\-0123456789abcdefmsx]+)?"
I tested the one above on a perl-regex compatible engine (np++).
Please adapt it yourself if I guessed anything wrong. The explicit list of digits can probably be replaced by "\d"; I tried to minimise assumptions about the regex flavor.
In order to match "URL" or "URL and URL", store the regex in a variable, then build "(URL and )*URL", replacing "URL" with that variable. (Assuming this is possible in Java.) If the question is how to then retrieve the multiple matches: that is Java, and I cannot help. Let me know and I will delete this answer, so as not to provoke deserved downvotes ;-)
(Edited to catch the data part in, previously not seen, first example, first line; and the multi URLs in one line.)
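For the Java side that the answer above leaves open, here is a minimal sketch of retrieving several links from one line with Matcher.find(). The pattern is an assumption written for this illustration, not the asker's or the answerer's exact expression: it only covers google.com(.xx) hosts, and it accepts either # (as in the sample URLs of this thread) or @ (which real Google Maps links use) before the coordinates. The names GmapLinkDemo and parse are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GmapLinkDemo {
    private static final Pattern GMAP_LINK = Pattern.compile(
            "https?://(?:www\\.|maps\\.)?google\\.com(?:\\.\\w+)?/maps/"
            + "(?:place/(?<place>[^#@/\\s]*)/*)?"            // optional place name
            + "[#@](?<lat>-?\\d+(?:\\.\\d+)?),(?<lng>-?\\d+(?:\\.\\d+)?)"
            + "(?:,(?<zoom>\\d+(?:\\.\\d+)?)z)?"             // optional zoom level
            + "(?:/data=\\S*)?",                              // optional data tail
            Pattern.CASE_INSENSITIVE);

    /** Returns one {url, place, lat, lng, zoom} array per link; absent parts are null. */
    static List<String[]> parse(String s) {
        List<String[]> result = new ArrayList<>();
        Matcher m = GMAP_LINK.matcher(s);
        // find() resumes after the previous match, so each URL is reported separately.
        while (m.find()) {
            result.add(new String[] {
                m.group(0), m.group("place"), m.group("lat"), m.group("lng"), m.group("zoom")
            });
        }
        return result;
    }
}
```

Because no part of the pattern can cross whitespace, two URLs separated by " and " come back as two matches, and the named groups remove the need for post-processing the place string.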
I wrote this regex to validate google maps links:
"(http:|https:)?\\/\\/(www\\.)?(maps.)?google\\.[a-z.]+\\/maps/?([\\?]|place/*[^#]*)?/*#?(ll=)?(q=)?(([\\?=]?[a-zA-Z]*[+]?)*/?#{0,1})?([0-9]{1,3}\\.[0-9]+(,|&[a-zA-Z]+=)-?[0-9]{1,3}\\.[0-9]+(,?[0-9]+(z|m))?)?(\\/?data=[\\!:\\.\\-0123456789abcdefmsx]+)?"
I tested with the following list of google maps links:
String location1 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location2 = "https://www.google.com.tw/maps/place/#38.8976763,-77.0387185,17z";
String location3 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location4 = "https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z/data=!3m1!4b1!4m5!3m4!1s0x89b7b7bcdecbb1df:0x715969d86d0b76bf!8m2!3d38.8976763!4d-77.0365298";
String location5 = "https://www.google.com/maps/place/white+house/#38.8976763,-77.0387185,17z";
String location6 = "https://www.google.com/maps/place//#38.8976763,-77.0387185,17z";
String location7 = "https://maps.google.com/maps/place//#38.8976763,-77.0387185,17z";
String location8 = "https://www.google.com/maps/place/#38.8976763,-77.0387185,17z";
String location9 = "https://google.com/maps/place/#38.8976763,-77.0387185,17z";
String location10 = "http://google.com/maps/place/#38.8976763,-77.0387185,17z";
String location11 = "https://www.google.com/maps/place/#/data=!4m2!3m1!1s0x3135abf74b040853:0x6ff9dfeb960ec979";
String location12 = "https://maps.google.com/maps?q=New+York,+NY,+USA&hl=no&sll=19.808054,-63.720703&sspn=54.337928,93.076172&oq=n&hnear=New+York&t=m&z=10";
String location13 = "https://www.google.com/maps";
String location14 = "https://www.google.fr/maps";
String location15 = "https://google.fr/maps";
String location16 = "http://google.fr/maps";
String location17 = "https://www.google.de/maps";
String location18 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location19 = "https://www.google.de/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location20 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4&layer=t&lci=com.panoramio.all,com.google.webcams,weather";
String location21 = "https://www.google.com/maps?ll=37.370157,0.615234&spn=45.047033,93.076172&t=m&z=4&layer=t";
String location22 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location23 = "https://www.google.de/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4";
String location24 = "https://www.google.com/maps?ll=37.0625,-95.677068&spn=45.197878,93.076172&t=h&z=4&layer=t&lci=com.panoramio.all,com.google.webcams,weather";
String location25 = "https://www.google.com/maps?ll=37.370157,0.615234&spn=45.047033,93.076172&t=m&z=4&layer=t";
String location26 = "http://www.google.com/maps/place/21.01196755,105.86306012";
String location27 = "http://google.com/maps/bylatlng?lat=21.01196022&lng=105.86298748";
String location28 = "https://www.google.com/maps/place/C%C3%B4ng+vi%C3%AAn+Th%E1%BB%91ng+Nh%E1%BA%A5t,+354A+%C4%90%C6%B0%E1%BB%9Dng+L%C3%AA+Du%E1%BA%A9n,+L%C3%AA+%C4%90%E1%BA%A1i+H%C3%A0nh,+%C4%90%E1%BB%91ng+%C4%90a,+H%C3%A0+N%E1%BB%99i+100000,+Vi%E1%BB%87t+Nam/#21.0121535,105.8443773,13z/data=!4m2!3m1!1s0x3135ab8ee6df247f:0xe6183d662696d2e9";
Currently, I have a String like
http://www.example.com/defg-/\nletters
I put this String into a TextView, and make the url clickable by setAutoLinkMask(Linkify.WEB_URLS) and setMovementMethod(LinkMovementMethod.getInstance())
However, the link is recognized incorrectly: only
http://www.example.com/defg <--missing "-/"
is highlighted instead of
http://www.example.com/defg-/ <--I want this
which results in a wrong URL.
How can I get the URL recognized correctly?
The Sample Result (2nd link is wrongly recognized)
Code Implementation
txtNorm = (TextView) findViewById(R.id.txtNorm);
txtNorm.setText("http://www.example.com/defg-/");
txtNorm.setAutoLinkMask(Linkify.WEB_URLS);
txtNorm.setMovementMethod(LinkMovementMethod.getInstance());
txtCustom = (TextView) findViewById(R.id.txtCustom);
txtCustom.setText("http://www.example.com/defg-/\nletters");
txtCustom.setAutoLinkMask(Linkify.WEB_URLS);
txtCustom.setMovementMethod(LinkMovementMethod.getInstance());
I found an approach you can try. First, note that a URL ending in -/ is not a common format for a standard web URL, so I made a custom pattern:
String urlRegex="[://.a-zA-Z_-]+-/"; // carefully set your pattern.
Pattern pattern = Pattern.compile(urlRegex);
String url1="press http://www.example.com/defg-/\\ or on Android& to search it on google";
text.setText(url1);
Matcher matcher1=Pattern.compile(urlRegex).matcher(url1);
while (matcher1.find()) {
final String tag = matcher1.group(0);
Linkify.addLinks(text, pattern, tag);
}
text.setMovementMethod(LinkMovementMethod.getInstance());
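To check a custom pattern like this outside Android, it can be exercised with plain java.util.regex first. The sketch below is an assumption-laden illustration: it treats a URL as everything up to the next whitespace character, which is broader than the pattern above, and the names TrailingDashUrlDemo and findUrls are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingDashUrlDemo {
    // Assumption: a URL runs until the next whitespace character,
    // so a trailing "-/" stays part of the match.
    static final Pattern WEB_URL = Pattern.compile("https?://\\S+");

    static List<String> findUrls(CharSequence text) {
        List<String> urls = new ArrayList<>();
        Matcher m = WEB_URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

On Android, the same Pattern could then be handed to Linkify.addLinks(TextView, Pattern, String), whose third argument is the scheme to prepend to matches that do not already start with it.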
Hi, can someone tell me how to remove the footer from an email?
I just need to store the body of the email and remove everything else, be it a disclaimer or a footer.
There is meant to be a standard marker for email footers - see https://en.wikipedia.org/wiki/Signature_block#Standard_delimiter
Which is a line consisting of two hyphens followed by a single space:
"-- "
You can use a regex to look for that, e.g.
Pattern pattern = Pattern.compile("^-- $", Pattern.MULTILINE);
Matcher m = pattern.matcher(emailBodyText);
if (m.find()) {
emailBodyText = emailBodyText.substring(0, m.start());
}
Sadly it is not widely used these days. For example, Gmail does not apply it.
For a gmail message - You can look for data-smartmail="gmail_signature" in the email's html.
You might have to implement custom clean-up code for each major email system.
You could use a regular expression.
Let's say the email looks like
String emailContents =
"AAA this is the email header BBB\n" +
"This is the body\n" +
"CCC this is the email footer DDD";
You could do something like:
Pattern pattern = Pattern.compile("AAA.*BBB(.*)CCC.*DDD", Pattern.DOTALL);
Matcher matcher = pattern.matcher(emailContents);
if (!matcher.matches()) throw new Exception("Invalid email");
String emailBody = matcher.group(1).trim(); // drop the line breaks around the body
System.out.println(emailBody); // prints 'This is the body'
Note that .* matches any run of characters (Pattern.DOTALL lets it cross line breaks, which this multi-line input needs), and ( and ) delimit a capture group. Full regex syntax here
I am creating a simple utility to retrieve all HTTP URLs from a web page.
Initially I planned to use an HTML parsing library to extract the href attributes, but I also need the URLs contained inside scripts (example below). So I started using a regular expression to get all the HTTP URLs from the page, but for some reason it is not working properly.
The URL can be inside JavaScript:
<script>
if(jQuery.browser.msie)
{
var v= 'http://test.com/test/test';
}
</script>
My program:
try {
BufferedReader in=new BufferedReader(new FileReader("c:\\sample\\sample.html"));
while ((inputLine = in.readLine()) != null) {
System.out.println(inputLine);
String pattern = "http?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(inputLine.replaceAll("http://", "\nhttp://"));
while (!m.hitEnd()) {
if (m.find()) {
System.out.println("Found value: " + m.group(0));
} else {
//System.out.println("NO MATCH");
}
}
}
in.close();
} catch (Exception e) {
e.printStackTrace();
}
Can someone help me fix this issue or let me know the best way to retrieve all URL's from a web page?
Description
Your expression has a typo. It should make the s optional.
https?://([-\\w\\.]+)+(:\\d+)?(/([\\w/_\\.]*(\\?\\S+)?)?)?
    ^
Also I recommend:
replacing the (...) capture groups with non-capturing groups like (?:...)
not escaping the . inside a character class such as [.], since it is not needed there
adding a test to ensure you are not capturing the closing quote that surrounds your URL
rewriting the part that looks for /folder/subfolder sections as a repeating non-capturing group that matches the initial slash followed by the folder name
regex: https?:\/\/(?:[\w-]+.)+(?::\d+)?(?:\/[\w\/_.]*)*?(?:\?\S+)?(?=['"\s])
as a Java string: "https?:\\/\\/(?:[\\w-]+.)+(?::\\d+)?(?:\\/[\\w\\/_.]*)*?(?:\\?\\S+)?(?=['\"\\s])"
Example
Sample Text
<script>
if(jQuery.browser.msie)
{
var v= 'http://test.com/test/test';
}
</script>
<a class="test" href="http://blablablablabla.com">Third Link</a>
Matches
[0] => http://test.com/test/test
[1] => http://blablablablabla.com
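A minimal sketch of collecting every match from the whole page text in Java, rather than line by line with hitEnd(). The pattern is a simplified stand-in chosen for this illustration (it assumes a URL ends at whitespace, a quote, or an angle bracket) and is not the expression recommended above; UrlExtractorDemo and extract are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlExtractorDemo {
    // Simplified stand-in: stop at whitespace, quotes and angle brackets so that
    // the closing quote of a script string or href attribute is not captured.
    private static final Pattern HTTP_URL = Pattern.compile("https?://[^\\s'\"<>]+");

    /** Returns the distinct URLs in order of first appearance. */
    static Set<String> extract(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = HTTP_URL.matcher(html);
        // find() returns false once no further match exists, so no hitEnd() check is needed.
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

The page can be read into one String first (for example by joining the lines from the BufferedReader), so the pattern is compiled once instead of once per line.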
try using this
\A'http:\/\/[\w\W]+'\z
This checks that the input is a single-quoted string starting with http://. Because almost any character can appear in a URL nowadays (special characters like ?:,-_/\ as well as digits), [\w\W]+ allows anything in between.
Note that \A and \z anchor the match to the whole input, so this validates one quoted URL per input; to collect every URL in a file, apply it to each candidate string rather than to the file as a whole.
I need to get a file name from file's absolute path (I am aware of the file.getName() method, but I cannot use it here).
EDIT: I cannot use file.getName() because I don't need the file name only; I need part of the file's path as well (but again, not the entire absolute path). I need the part of the file's path AFTER a certain given path.
Let's say the file is located in the folder:
C:\Users\someUser
On windows machine, if I make a pattern string as follows:
String patternStr = "C:\\Users\\someUser\\(.*+)";
I get an exception: java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence for backslash.
If I use Pattern.quote(File.pathSeparator):
String patternStr = "C:" + Pattern.quote(File.separator) + "Users" + Pattern.quote(File.separator) + "someUser" + Pattern.quote(File.separator) + "(.*+)";
the resulting pattern string is: C:\Q;\EUsers\Q;\EsomeUser\Q;\E(.*+) which of course has no match with the actual fileName "C:\Users\someUser\myFile.txt".
What am I missing here? What is the proper way to parse file name?
What is the proper way to parse file name?
The proper way to parse a file name is to use File(String). Using a regex for this is going to hard-wire platform dependencies into your code. That's a bad idea.
I know you said you can't use File.getName() ... but that is the proper solution. If you would care to say why you can't use File.getName() perhaps I could suggest an alternative solution.
If you indeed want to use a regular expression, you should use
String patternStr = "C:\\\\Users\\\\someUser\\\\(.*+)";
                         ^^       ^^          ^^
instead.
Why? Your string literal
"C:\\Users\\someUser\\(.*+)"
is compiled to
C:\Users\someUser\(.*+)
Since \ is used for escaping in regular expressions too, you'll have to escape them "twice".
Regarding your edit:
You probably want to have a look at URI.relativize(). Example:
File base = new File("C:/Users/someUser");
File file = new File("C:/Users/someUser/someDir/someFile.txt");
String relativePath = base.toURI().relativize(file.toURI()).getPath();
System.out.println(relativePath); // prints "someDir/someFile.txt"
(Note that / works as file-separator on Windows machines too.)
Btw, I don't know what you have as File.separator on your system, but if it's set to \, then
"C:" + Pattern.quote(File.separator) + "Users" + Pattern.quote(File.separator) +
"someUser" + Pattern.quote(File.separator) + "(.*+)";
should yield
C:\Q\\EUsers\Q\\EsomeUser\Q\\E(.*+)
String patternStr = "C:\\Users\\someUser\\(.*+)";
Backslashes (\) are escape characters in the Java Language. Your string contains the following after compilation:
C:\Users\someUser\(.*+)
This string is then parsed as a regex, which uses backslashes as an escape character as well. The regex parser tries to understand the escaped \U, \s and \(. One of them is incorrect regarding the regex syntax (hence your exception), and none of them are what you are trying to achieve.
Try
String patternStr = "C:\\\\Users\\\\someUser\\\\(.*+)";
If you want to solve it by pattern you need to escape your Pattern properly
String patternStr = "C:\\\\Users\\\\someUser\\\\(.*+)";
Try putting double-double-backslashes in your pattern. You need a second backslash to escape one in the pattern, plus you'll need to double each one to escape them in the string. Hence you'll end up with something like:
String patternStr = "C:\\\\Users\\\\someUser\\\\(.*+)";
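If the base directory comes from a variable, quoting it avoids hand-counting backslashes. This is a minimal sketch assuming Windows-style paths with \ as the separator; PathSuffixDemo and after are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathSuffixDemo {
    /** Returns the part of path after baseDir plus one backslash, or null if path is not under baseDir. */
    static String after(String baseDir, String path) {
        // Pattern.quote wraps baseDir in \Q...\E, so its backslashes are taken literally.
        // The Java literal "\\\\" is the regex \\, i.e. one literal backslash.
        Pattern p = Pattern.compile(Pattern.quote(baseDir) + "\\\\(.*+)");
        Matcher m = p.matcher(path);
        return m.matches() ? m.group(1) : null;
    }
}
```

For example, after("C:\\Users\\someUser", "C:\\Users\\someUser\\myFile.txt") yields myFile.txt.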
Scan backwards from the end of the string to the first file path separator, or to the beginning of the string if there is none.
The file path separator can be / or \ (a volume separator : also ends the scan).
public static final char ALTERNATIVE_DIRECTORY_SEPARATOR_CHAR = '/';
public static final char DIRECTORY_SEPARATOR_CHAR = '\\';
public static final char VOLUME_SEPARATOR_CHAR = ':';
public static String getFileName(String path) {
if(path == null || path.isEmpty()) {
return path;
}
int length = path.length();
int index = length;
while(--index >= 0) {
char c = path.charAt(index);
if(c == ALTERNATIVE_DIRECTORY_SEPARATOR_CHAR || c == DIRECTORY_SEPARATOR_CHAR || c == VOLUME_SEPARATOR_CHAR) {
return path.substring(index + 1, length);
}
}
return path;
}
Try to keep it simple ;-).
Try this :
String ResultString = null;
try {
Pattern regex = Pattern.compile("([^\\\\/:*?\"<>|\r\n]+$)");
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
ResultString = regexMatcher.group(1);
}
} catch (PatternSyntaxException ex) {
// Syntax error in the regular expression
}
Output :
myFile.txt
Also for input : C:/Users/someUser/myFile.txt
Output : myFile.txt
What am I missing here? What is the proper way to parse file name?
The proper way to parse a file name is to use the APIs that are already provided for the purpose. You've stated that you can't use File.getName(), without explanation. You are almost certainly mistaken about that.
I cannot use file.getName() because I don't need the file name only; I need part of the file's path as well (but again, not the entire absolute path).
OK. So what you want is something like this.
// Canonicalize paths to deal with ".", "..", symlinks,
// relative files and case sensitivity issues.
String directory = new File(someDirectory).getCanonicalPath();
String test = new File(somePathname).getCanonicalPath();
if (!directory.endsWith(File.separator)) {
directory += File.separator;
}
if (test.startsWith(directory)) {
String pathInDirectory = test.substring(directory.length());
...
}
Advantages:
No regexes needed.
Doesn't break if the path separator is something other than \.
Doesn't break if there are symbolic links on the path.
Doesn't break due to case sensitivity issues.
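A related sketch using java.nio.file instead of string prefixes. Unlike the canonical-path code above, this version is purely lexical: normalize() only removes "." and ".." segments and does not resolve symbolic links or case differences (Path.toRealPath() would be the call to look at for that). RelativizeDemo and relativeTo are hypothetical names.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class RelativizeDemo {
    /** Returns the path of file relative to baseDir, or throws if file is not inside baseDir. */
    static Path relativeTo(String baseDir, String file) {
        Path base = Paths.get(baseDir).normalize();
        Path target = Paths.get(file).normalize();
        // Path.startsWith compares whole name elements, not raw characters.
        if (!target.startsWith(base)) {
            throw new IllegalArgumentException(file + " is not inside " + baseDir);
        }
        return base.relativize(target);
    }
}
```

Because the comparison works on path elements, a sibling directory such as someUser2 is not mistaken for being inside someUser, which a plain String.startsWith check without the trailing separator would allow.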
Suppose the file name contains special characters, especially when supporting Mac, where special characters are allowed in file names. Server-side, Path.GetFileName(fileName) then fails and throws an error because of illegal characters in the path. The following regex-based code comes to the rescue.
The regex takes care of two things:
In IE, when a file is uploaded, the file path contains the folders as well (e.g. c:\samplefolder\subfolder\sample.xls). The expression below replaces all folders with an empty string and retains the file name.
On a Mac, the file name is the only thing supplied, since the browser is Safari, which allows special characters in the file name.
var regExpDir = @"(^[\w]:\\)([\w].+\w\\)";
fileName = Regex.Replace(fileName, regExpDir, string.Empty);