Creating string array from a delimited string. - java

I have a string of urls, example "domain.com/url1, domain.com/url2 etc". Sometimes they are comma, tab, or pipe delimited. What I'd like to do is split them up in a string array and automatically handle any potential use case. Does anybody know of a good way to handle this?
I started with something like this, but it doesn't function correctly nor does it handle all use cases.
Collection<String> newUrls = Arrays.asList(photoHolder.getPhotoURLs().replaceAll("\\|", ",").replaceAll("\\s+", "").split(","));

I believe this should be possible with only using the split method and providing a regex that will match any of your delimiters.
Collection<String> newUrls = Arrays.asList(photoHolder.getPhotoURLs().split("\\t|\\||,"));
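A minimal sketch of the same idea that also swallows whitespace around each delimiter and drops empty entries. The names `UrlListSplitter` and `splitUrls` are made up for illustration, and it assumes the URLs themselves never contain a comma, pipe, or tab.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

class UrlListSplitter {
    // A comma, pipe, or tab, together with any whitespace around it.
    private static final String DELIMITERS = "\\s*[,|\\t]\\s*";

    static List<String> splitUrls(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.stream(raw.trim().split(DELIMITERS))
                .filter(s -> !s.isEmpty()) // skip empty entries such as the one in "a,,b"
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String raw = "domain.com/url1, domain.com/url2|domain.com/url3\tdomain.com/url4";
        splitUrls(raw).forEach(System.out::println);
    }
}
```

Inside a character class the pipe needs no escaping, which keeps the pattern a little easier to read than the alternation form.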

You might use some "smart" regular expressions that are independent of the delimiter, but use the domain names (.com, .co.uk, IP addresses...) to separate the URLs.

I think you have to write several split methods to cover the different scenarios and then put the results into one list. For the cases you cannot anticipate, wrap the parsing in a try/catch and handle the exception there, because we cannot handle every scenario up front. I think this will be helpful somehow.
Uri uri = Uri.parse("domain.com/url1/what ever_the_url");
String protocol = uri.getScheme();
String server = uri.getAuthority();
String path = uri.getPath();
Set<String> args = uri.getQueryParameterNames();
String limit = uri.getQueryParameter("limit");
Try this also.

Related

Regex pattern to split colon char with a condition

I have a string like this :
http://schemas/identity/claims/usertype:External
My goal is to split that string into two parts at the colon delimiter, but the regex must not split at the colon in "http://". So the string should be split into:
http://schemas/identity/claims/usertype
External
I have tried regex like this :
(http:\/\/+schemas\/identity\/claims\/usertype)
So it will be :
http://schemas/identity/claims/usertype
:External
After that I would replace the remaining colon with an empty string.
But I don't think that is best practice, and I rarely use regex.
Do you have any suggestion to simplify the regex?
Thanks in advance
This is an X/Y problem. Fortunately, you asked the question in a great way, by explaining the underlying problem you are trying to solve (namely: Pull some string out of a URL), and then describing the direction you've chosen to solve your problem (which is bad, see below), and then asking about a problem you have with this solution (which is irrelevant, as the entire solution is bad).
URLs aren't parsable like this. You shouldn't treat them as a string you can lop into pieces like this. For example, the server part can contain a colon too, in front of the port number. In front of the server part there can also be authentication credentials, which can contain a colon as well. That is rarely used, of course.
Try this one, which shows the problem with your approach:
https://joe:joe@google.com:443/
That link just works. Port 443 was the default anyway, and google ignores the authentication header that the browser ends up sending, but the point is, a URL may contain this stuff.
But rzwitserloot, it.. won't! I know!
That's a bad programming mindset. That mindset leads to security issues. Why go for a solution that burdens your codebase with unstated assumptions (assumption: The places that provide a URL to this code are under my control and will never send port or auth headers)? If the 'server' part is configurable in a config file, will you mention in said config file that you cannot add a port? Will you remember 4 years from now?
The solution that does it right isn't going to burden your code with all these unstated (or very unwieldy if stated) assumptions.
Okay, so what is the right way?
First, toss that string into the constructor of java.net.URI. Then, use the methods there to get what you actually want, which is the path part. That is a string you can pull apart:
URI uri = new URI("http://schemas/identity/claims/usertype:External");
String path = uri.getPath();
String newPath = path.replaceAll(":.*", "");
String type = path.replaceAll(".*?:", "");
URI newUri = uri.resolve(newPath);
System.out.println(newUri);
System.out.println(type);
prints:
http://schemas/identity/claims/usertype
External
NB: Toss some ports or auth stuff in there, or make it a relative URL - do whatever you like, this code is far more robust in the face of changing the base URL than any attempt to count colons is going to be.
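To illustrate that last point, here is a small sketch that wraps the same approach in a helper and feeds it a URL with credentials and a port. The helper name `splitAtPathColon` and the host `joe:secret@example.com:8443` are placeholders, not anything from the question, and it uses URI.create so that a malformed input throws an unchecked IllegalArgumentException.

```java
import java.net.URI;

class ClaimSplitter {
    /**
     * Returns {urlWithoutSuffix, suffix}. Only colons in the path are considered,
     * so colons in the scheme, credentials, or port are left alone.
     * If the path has no colon, the input is returned unchanged with an empty suffix.
     */
    static String[] splitAtPathColon(String url) {
        URI uri = URI.create(url);
        String path = uri.getRawPath(); // null for opaque URIs such as mailto:
        int colon = (path == null) ? -1 : path.lastIndexOf(':');
        if (colon < 0) {
            return new String[] { url, "" };
        }
        // Resolving an absolute path keeps the scheme and authority of the original URI.
        URI base = uri.resolve(path.substring(0, colon));
        return new String[] { base.toString(), path.substring(colon + 1) };
    }

    public static void main(String[] args) {
        String[] parts = splitAtPathColon(
                "https://joe:secret@example.com:8443/identity/claims/usertype:External");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```

Note that resolving against the trimmed path drops any query string or fragment, which may or may not be what you want.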
Use Negative Lookbehind and split
Regex:
"(?<!(http|https)):"
Regex in context:
public static void main(String[] args) {
String input = "http://schemas/identity/claims/usertype:External";
validateURI(input);
List<String> result = Arrays.asList(input.split("(?<!(http|https)):"));
result.forEach(System.out::println);
}
private static void validateURI(String input) {
try {
new URI(input);
} catch (URISyntaxException e) {
System.out.println("Invalid URI!!!");
e.printStackTrace();
}
}
Output:
http://schemas/identity/claims/usertype
External
I think this might help you:
public class Separator{
public static void main(String[] args) {
String input = "http://schemas/identity/claims/usertype:External";
String[] splitted = input.split("\\:");
System.out.println(splitted[splitted.length-1]);
}
}
Output
External

Java Regexp to match domain of url

I would like to use Java regex to match a domain of a url, for example,
for www.table.google.com, I would like to get 'google' out of the url, namely, the second last word in this URL string.
Any help will be appreciated !!!
It really depends on the complexity of your inputs...
Here is a pretty simple regex:
.+\\.(.+)\\..+
It captures whatever sits between the last two dots (\\.).
And here are some examples for that pattern: https://regex101.com/r/L52oz6/1.
As you can see, it works for simple inputs but not for complex urls.
But why reinvent the wheel? There are plenty of really good libraries that correctly parse any complex URL. Still, for simple inputs a small regex is easily built. If this does not solve the problem for your inputs, please let me know and I will adjust the regex pattern.
Note that you can also just use simple splitting like:
String[] elements = input.split("\\.");
String secondToLastElement = elements[elements.length - 2];
But don't forget the index-bound checking.
Or, if you are looking for a very quick solution, walk through the input starting from the last position. Work your way backwards until you find the first dot, then continue until you find the second dot. Then extract the part between them with input.substring(secondDotIndex + 1, firstDotIndex);.
There is also already a delegate method for exactly that purpose, namely String#lastIndexOf (see the documentation).
Take a look at this code snippet:
String input = ...
int indexLastDot = input.lastIndexOf('.');
// lastIndexOf searches backwards from the given index inclusive, so start one before the last dot
int indexSecondToLastDot = input.lastIndexOf('.', indexLastDot - 1);
String secondToLastWord = input.substring(indexSecondToLastDot + 1, indexLastDot);
Don't forget bounds checking: lastIndexOf returns -1 when no dot is found, and substring then throws.
The advantage of this approach is that it is really fast, it can directly work on the internal structures of Strings without creating copies.
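As a rough, self-contained version of the lastIndexOf idea with the bounds checks included: the class and method names below are invented for the example, and it assumes the input is a bare host name without scheme or path. It also knows nothing about multi-part suffixes, so for stackoverflow.co.uk it would return co.

```java
class DomainWord {
    /**
     * Returns the second-to-last dot-separated label of a host name,
     * or null if the host has no dot or starts with its only dot.
     */
    static String secondToLastLabel(String host) {
        int lastDot = host.lastIndexOf('.');
        if (lastDot <= 0) {
            return null; // no dot at all, or nothing in front of it
        }
        // Search backwards, starting one position before the last dot.
        int prevDot = host.lastIndexOf('.', lastDot - 1);
        // prevDot is -1 when there is only one dot; + 1 then starts at index 0.
        return host.substring(prevDot + 1, lastDot);
    }

    public static void main(String[] args) {
        System.out.println(secondToLastLabel("www.table.google.com"));
    }
}
```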
My attempt:
(?<scheme>https?:\/\/)?(?<subdomain>\S*?)(?<domainword>[^.\s]+)(?<tld>\.[a-z]+|\.[a-z]{2,3}\.[a-z]{2,3})(?=\/|$)
Demo. Works correctly for:
http://www.foo.stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.com/
http://stackoverflow.com
https://www.stackoverflow.com
www.stackoverflow.com
stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.co.uk
foo.www.stackoverflow.com
foo.www.stackoverflow.co.uk
foo.www.stackoverflow.co.uk/a/b/c
private static final Pattern URL_MATCH_GET_SECOND_AND_LAST =
Pattern.compile("www\\.(.*)\\.google\\.(.*)", Pattern.CASE_INSENSITIVE);
String sURL = "www.table.google.com";
// Create the matcher once; a second matcher with a second find() is redundant
Matcher matchURL = URL_MATCH_GET_SECOND_AND_LAST.matcher(sURL);
if (matchURL.find()) {
String sFirst = matchURL.group(1); // "table"
String sSecond = matchURL.group(2); // "com"
}

How do I split the rest of the URL from the last path of it

I have this file URL: http://xxx.xxx.xx.xx/resources/upload/2014/09/02/new sample.pdf which will be converted to http://xxx.xxx.xx.xx/resources/upload/2014/09/02/new%20sample.pdf later.
Now I can get the last path by:
public static String getLastPathFromUrl(String url) {
return url.replaceFirst(".*/([^/?]+).*", "$1");
}
which will give me new sample.pdf
but how do I get the rest of the URL: http://xxx.xxx.xx.xx/resources/upload/2014/09/02/ ?
An easier way to get the last path from the URL is to use String.split, like this:
String url = "http://xxx.xxx.xx.xx/resources/upload/2014/09/02/new sample.pdf";
String[] urlArray = url.split("/");
String lastPath = urlArray[urlArray.length-1];
This converts your URL into an array, which can then be used in many ways. There are various ways to get the URL without the last path; one is to join the array generated above using this answer. Or use lastIndexOf() and substring like this:
String restOfUrl = url.substring(0, url.lastIndexOf("/") + 1); // + 1 keeps the trailing slash
PS: Although you can learn something by doing this, I think your best solution would be to replace the space with %20 in the complete URL string; that would be the fastest and make the most sense.
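Putting the lastIndexOf variant together, a small sketch that returns both halves at once might look like this. `UrlTail` and `splitLastSegment` are hypothetical names, and the sketch assumes the URL has no query string or fragment containing a slash.

```java
class UrlTail {
    /** Returns {everything up to and including the last slash, the last path segment}. */
    static String[] splitLastSegment(String url) {
        int slash = url.lastIndexOf('/');
        if (slash < 0) {
            return new String[] { "", url }; // no slash: the whole input is the last segment
        }
        return new String[] { url.substring(0, slash + 1), url.substring(slash + 1) };
    }

    public static void main(String[] args) {
        String[] parts = splitLastSegment(
                "http://xxx.xxx.xx.xx/resources/upload/2014/09/02/new sample.pdf");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```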
I am not sure if I understood it correctly but when you say
I have this file URL: URL/new sample.pdf which will be converted to URL/new%20sample.pdf later.
It looks like you are trying to replace the space with %20 in the URL or, put simply, trying to take care of characters that are not allowed in a URL. If that is what you need, use the built-in
URLEncoder.encode(String s, String enc). You can use UTF-8 as the encoding.
http://docs.oracle.com/javase/7/docs/api/java/net/URLEncoder.html
If you really need to split it, and assuming you are interested in the part after http://, remove http:// and store the remaining URL in a string variable called, say, remainingURL. Then use
List<String> myList = new ArrayList<>(Arrays.asList(remainingURL.split("/")));
You can iterate over myList to get the remaining URL fragments.
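One caveat worth keeping in mind: URLEncoder performs HTML form encoding, so it turns a space into + rather than %20 and also escapes / and :, which makes it unsuitable for a whole URL. A possible alternative, sketched here under the assumption that you already have the scheme, host, and path as separate strings, is the multi-argument java.net.URI constructor, which quotes illegal characters in the path. The host example.com and the names `PathEncoder` and `encode` are placeholders.

```java
import java.net.URI;
import java.net.URISyntaxException;

class PathEncoder {
    /** Builds a URL string from its parts, percent-encoding illegal path characters. */
    static String encode(String scheme, String host, String path) {
        try {
            // The four-argument constructor takes scheme, host, path, fragment.
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build a URI from the given parts", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("http", "example.com",
                "/resources/upload/2014/09/02/new sample.pdf"));
    }
}
```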
I've found it:
File file=new File("http://xxx.xxx.xx.xx/resources/upload/2014/09/02/new sample.pdf");
System.out.println(file.getPath().replaceAll(file.getName(),""));
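Output (as reported; note that java.io.File is meant for file system paths, and on Unix-like systems it collapses the // after http: into a single /, so a URI-based approach is safer for real URLs):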
http://xxx.xxx.xx.xx/resources/upload/2014/09/02/
Spring solution:
List<String> pathSegments = UriComponentsBuilder.fromUriString(url).build().getPathSegments();
String lastPath = pathSegments.get(pathSegments.size()-1);

How to replace double slash with single slash for an url

Given a URL like "http://google.com//view/All/builds", I want to replace the double slash with a single slash. For example, the above URL should display as "http://google.com/view/All/builds".
I don't know regular expressions. Can anyone help me achieve this using regular expressions?
To avoid replacing the first // in http:// use the following regex :
String to = from.replaceAll("(?<!http:)//", "/");
PS: if you want to handle https use (?<!(http:|https:))// instead.
Is Regex the right approach?
In case you wanted this solution as part of an exercise to improve your regex skills, then fine. But what is it that you're really trying to achieve? You're probably trying to normalize a URL. Replacing // with / is one aspect of normalizing a URL. But what about other aspects, like removing redundant ./ and collapsing ../ with their parent directories? What about different protocols? What about ///? What about the // at the start? What about /// at the start in case of file:///?
If you want to write a generic, reusable piece of code, using a regular expression is probably not the best approach. And it's reinventing the wheel. Instead, consider java.net.URI.normalize().
java.net.URI.normalize()
java.lang.String
String inputUrl = "http://localhost:1234//foo//bar//buzz";
String normalizedUrl = new URI(inputUrl).normalize().toString();
java.net.URL
URL inputUrl = new URL("http://localhost:1234//foo//bar//buzz");
URL normalizedUrl = inputUrl.toURI().normalize().toURL();
java.net.URI
URI inputUri = new URI("http://localhost:1234//foo//bar//buzz");
URI normalizedUri = inputUri.normalize();
Regex
In case you do want to use a regular expression, think of all possibilities. What if, in future, this should also process other protocols, like https, file, ftp, fish, and so on? So, think again, and probably use URI.normalize(). But if you insist on a regular expression, maybe use this one:
String normalizedUri = uri.replaceAll("(?<!\\w+:/?)//+", "/");
Compared to other solutions, this works with all URLs that look similar to HTTP URLs but use a different protocol instead of http, like https, file, ftp and so on, and it will keep the triple slash /// in case of file:///. But, unlike java.net.URI.normalize(), this does not remove redundant ./, it does not collapse ../ with their parent directories, it does not handle other aspects of URL normalization that you and I might have forgotten about, and it will not be updated automatically with newer RFCs about URLs, URIs, and such.
String to = from.replaceAll("(?<!(http:|https:))[//]+", "/");
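The character class [//]+ matches any run of one or more slashes that is not directly preceded by http: or https:, so longer runs such as /// are collapsed to a single slash as well.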
Here is the regexp:
/(?<=[^:\s])(\/+\/)/g
It finds multiple slashes in a URL while preserving the ones after the protocol, whatever the protocol is.
It also handles protocol-relative URLs, which start with //.
@Test
public void shouldReplaceMultipleSlashes() {
assertEquals("http://google.com/?q=hi", replaceMultipleSlashes("http://google.com///?q=hi"));
assertEquals("https://google.com/?q=hi", replaceMultipleSlashes("https:////google.com//?q=hi"));
assertEquals("//somecdn.com/foo/", replaceMultipleSlashes("//somecdn.com/foo///"));
}
private static String replaceMultipleSlashes(String url) {
return url.replaceAll("(?<=[^:\\s])(\\/+\\/)", "/");
}
Literally, this means:
(\/+\/) - the group to find: /+ one or more slashes followed by one more / slash
(?<=[^:\s]) - a positive lookbehind requiring that the group is preceded by a character from the negated set [^:\s], that is, anything except a : colon or \s whitespace
g - global search flag
I suggest you simply use String.replace, which is documented at http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#replace(java.lang.CharSequence, java.lang.CharSequence)
Something like
myString.replace("//", "/");
If you want to keep the first occurrence (the // after the protocol) and only replace the later ones:
String[] parts = str.split("//", 2);
str = parts[0] + "//" + parts[1].replaceAll("//", "/");
This is the simplest way (without a regular expression). I don't know the corresponding regular expression; maybe an expert looking at the thread does... ;)

Regular expression Hostname

I am developing a http robot, and I developed this regular expression
(((?:f|ht)tp(?:s)?\\://)?|www)([^/]+) to detect and extract the hostname from a link (href).
Now I put here the results of the tests:
String -> http://www.meloteca.com/empresas-editoras.htm
Returns http://www.meloteca.com
String -> www.meloteca.com/empresas-editoras.htm
Returns www.meloteca.com
String -> /empresas-editoras.htm
Returns empresas-editoras.htm (without the slash)
In this case I was expecting that the regular expressions did not return any value? Why is this happening?
The same thing if I try with the following String
String -> empresas-editoras.htm
Returns empresas-editoras.htm
The snippet of code :
Pattern padrao = Pattern.compile("(((?:f|ht)tp(?:s)?\\://)?|www)([^/]+)");
Matcher mat = padrao.matcher("empresas-editoras.htm");
if(mat.find())
System.out.println("Host->"+mat.group());
It'd be better to use the URI class, and its methods like getHost() and getPath(), rather than a regular expression. The rules for constructing URIs are more complex than you probably realize, and your regex is likely to have lots of corner cases that won't be handled correctly.
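A short sketch of what that could look like for the test strings in the question. `HostExtractor` and `hostOf` are names invented here. One thing to be aware of: a link without a scheme, such as www.meloteca.com/empresas-editoras.htm, is parsed as a relative path, so getHost() returns null for it and the robot would have to prepend a scheme first if it wants to treat such links as absolute.

```java
import java.net.URI;
import java.net.URISyntaxException;

class HostExtractor {
    /** Returns the host of an absolute link, or null for relative or unparsable links. */
    static String hostOf(String href) {
        try {
            return new URI(href).getHost();
        } catch (URISyntaxException e) {
            return null; // not a syntactically valid URI
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://www.meloteca.com/empresas-editoras.htm"));
        System.out.println(hostOf("/empresas-editoras.htm"));
        System.out.println(hostOf("empresas-editoras.htm"));
    }
}
```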
If you remove one of the question marks, like this:
(((?:f|ht)tp(?:s)?\\://)|www)([^/]+)
it should work better.
The alternative ((?:f|ht)tp(?:s)?\\://)? is optional, so it can be the empty string, and then ([^/]+) just will match any string not containing /.
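To check that behaviour against the strings from the question, here is a quick sketch using the pattern with the question mark removed. The class and method names are only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HostRegexCheck {
    // Same pattern as above, with the prefix group no longer optional.
    private static final Pattern HOST =
            Pattern.compile("(((?:f|ht)tp(?:s)?\\://)|www)([^/]+)");

    /** Returns the matched prefix plus host, or null if the link has no recognised prefix. */
    static String match(String href) {
        Matcher m = HOST.matcher(href);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(match("http://www.meloteca.com/empresas-editoras.htm"));
        System.out.println(match("www.meloteca.com/empresas-editoras.htm"));
        System.out.println(match("/empresas-editoras.htm"));
        System.out.println(match("empresas-editoras.htm"));
    }
}
```

Because find() scans the whole input, a prefix such as www that appears later in a relative link would still match, so anchoring the pattern with ^ may be worth considering.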
