Grabbing partial data from URL in Selenium [duplicate] - java

In Selenium, is there a way to grab part of a dynamically changing URL? For example, I ran driver.getCurrentUrl() in my code and retrieved the URL below. I only want the numeric portions, concatenated so that I end up with 819499-1. The values will be different on the next run, however. Is there a way to capture this information without capturing the entire URL?
http://custmaint/clnt_estimate_overview.jsp?estimateNbr=819499&versionNbr=1&estimateMgmt=true&clntAutoAssign=true

I think this is a more generic question, not only related to Selenium.
Basically, you could use regular expression matching, like so:
String url = "http://custmaint/clnt_estimate_overview.jsp?estimateNbr=819499&versionNbr=1&estimateMgmt=true&clntAutoAssign=true";
final Pattern p = Pattern.compile("^.*\\?estimateNbr=(\\d+)&versionNbr=(\\d+).*$");
Matcher m = p.matcher(url);
if (m.find()) {
System.out.print(m.group(1) + "-" + m.group(2));
}
Another approach (more flexible) is to use a dedicated library like httpcomponents; see How to convert map to url query string?
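For readers who want the more flexible approach without pulling in a library, here is a minimal sketch that splits the query string into a map using only java.net classes. The class name QueryParams and its parse method are made up for this illustration, and the sketch keeps only the last value when a parameter name repeats.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Selenium or of any library named above.
class QueryParams {

    // Splits the query part of a URL into decoded name/value pairs.
    // If a name repeats, the last value wins (a simplification).
    static Map<String, String> parse(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return params; // no '?' part at all
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(name), decode(value));
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // In a Selenium test this string would come from driver.getCurrentUrl().
        String url = "http://custmaint/clnt_estimate_overview.jsp"
                + "?estimateNbr=819499&versionNbr=1&estimateMgmt=true&clntAutoAssign=true";
        Map<String, String> params = parse(url);
        System.out.println(params.get("estimateNbr") + "-" + params.get("versionNbr"));
    }
}
```

Unlike the regular expression, this does not depend on the order of the parameters in the URL.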

String url = driver.getCurrentUrl();
String estimateNbr = url.substring(url.indexOf("eNbr=")+5, url.indexOf("&v"));
String versionNbr = url.substring(url.indexOf("nNbr=")+5, url.indexOf("&e"));
System.out.println(estimateNbr + "-" + versionNbr);

Related

Crawling & parsing results of querying google-like search engine

I have to write a parser in Java (my first HTML parser, by the way). For now I'm using the jsoup library, and I think it is a very good solution for my problem.
The main goal is to get some information from Google Scholar (h-index, number of publications, years of scientific career). I know how to parse an HTML page listing 10 people, like this one:
http://scholar.google.pl/citations?mauthors=Cracow+University+of+Economics&hl=pl&view_op=search_authors
for( Element element : htmlDoc.select("a[href*=/citations?user]") ){
if( element.hasText() ) {
String findUrl = element.absUrl("href");
pagesToVisit.add(findUrl);
}
}
BUT I need to find information about all the scientists from the requested university. How can I do that? I was thinking about getting the URL from the button that leads to the next 10 results, like this:
Elements elem = htmlDoc.getElementsByClass("gs_btnPR");
String nextUrl = elem.attr("onclick");
But I get a URL like this:
citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dAGH+University+of+Science+and+Technology\x26after_author\x3dslQKAC78__8J\x26astart\x3d10
Do I have to translate the \x sequences and add that page to my "toVisit" pages? Or is there a better way in the jsoup library, or maybe in another library? Please let me know! I don't have any other idea how to parse something like this...
I have to translate \x signs and add that site to my "toVisit" sites...I don't have any other idea, how to parse something like this...
Each \xNN sequence is a hexadecimal-encoded ASCII character. For instance, \x3d is = and \x26 is &. These values can be converted by passing the two hex digits (without the \x prefix) to Integer.parseInt with the radix set to 16.
char c = (char)Integer.parseInt("3d", 16); // '='
System.out.println(c);
If you need to decode these values without a 3rd party library, you can do so using regular expressions. For example, using the String supplied in your question:
String st = "citations?view_op\\x3dsearch_authors\\x26hl\\x3dpl\\x26oe\\x3dLatin2\\x26mauthors\\x3dAGH+University+of+Science+and+Technology\\x26after_author\\x3dslQKAC78__8J\\x26astart\\x3d10";
System.out.println("Before Decoding: " + st);
Pattern p = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");
Matcher m = p.matcher(st);
while ( m.find() ){
String c = Character.toString((char)Integer.parseInt(m.group(1), 16));
st = st.replace(m.group(0), c); // literal replacement of every occurrence of this escape
m = p.matcher(st); // optional, but keeps the matcher in sync now that st has changed
}
System.out.println("After Decoding: " + st);
You currently get a URL like this using your code:
citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dAGH+University+of+Science+and+Technology\x26after_author\x3dQPQwAJz___8J\x26astart\x3d10
You have to extract the after_author value (QPQwAJz___8J here) using a regex, and use it to construct the URL for the next page of search results, which looks like this:
scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=Cracow+University+of+Economics&after_author=QPQwAJz___8J
You can then fetch the next page from this URL, parse it using Jsoup, and repeat for all remaining pages.
Will put together some example code later.
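A minimal sketch of the steps described in this answer, assuming the onclick attribute contains the \x-escaped string shown in the question. The class and method names are invented for the illustration, and whether the site still accepts a URL built this way is an assumption that has not been checked.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes \xNN escapes and pulls out the after_author token.
class NextPageUrl {

    private static final Pattern HEX_ESCAPE = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");
    // Stops at '&' or at a quote, in case the attribute wraps the URL in quotes.
    private static final Pattern AFTER_AUTHOR = Pattern.compile("[?&]after_author=([^&'\"]+)");

    // Replaces every \xNN escape with the character it encodes, in one pass.
    static String decodeHexEscapes(String s) {
        Matcher m = HEX_ESCAPE.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Returns the after_author token, or null when the string has none.
    static String afterAuthor(String decoded) {
        Matcher m = AFTER_AUTHOR.matcher(decoded);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened version of the value read with elem.attr("onclick").
        String onclick = "citations?view_op\\x3dsearch_authors\\x26hl\\x3dpl"
                + "\\x26after_author\\x3dQPQwAJz___8J\\x26astart\\x3d10";
        String token = afterAuthor(decodeHexEscapes(onclick));
        // Assumption: the next page is the first-page URL plus the token.
        String firstPage = "http://scholar.google.pl/citations?view_op=search_authors"
                + "&hl=pl&mauthors=Cracow+University+of+Economics";
        if (token != null) {
            System.out.println(firstPage + "&after_author=" + token);
        }
    }
}
```

Looping until afterAuthor returns null (no "next" button left) would visit every result page.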

What's the best way to url encode only query keys and params in Java?

I would like to encode only the query keys and parameters of a url (don't want to encode the /, ? or &). What's the best way to do this in Java?
For example, I want to convert
http://www.hello.com/bar/foo?a=,b &c =d
to
http://www.hello.com/bar/foo?a=%2Cb%20&c%20=d
Build the url something like this:
String url = "http://www.hello.com/bar/foo?";
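// URLEncoder performs form encoding: a space becomes "+", not "%20".
// This overload throws the checked UnsupportedEncodingException.
url += "a=" + URLEncoder.encode(value_of_a, "UTF-8");
url += "&c=" + URLEncoder.encode(value_of_c, "UTF-8");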
I'm going to leave the actual component encoding as a user-supplied function, because it is an existing, well-discussed problem without a trivial JCL solution. In any case, the following is how I would approach this particular problem without the use of third-party libraries.
While regular expressions sometimes result in two problems, I am hesitant to suggest a stricter approach such as URI, because I don't know how it will work, or even if it will work, with such funky invalid URLs. As such, here is a solution using a regular expression with a dynamic replacement value.
// The following pattern is pretty liberal on what it matches;
// It ought to work as long as there is no unencoded ?, =, or & in the URL
// but, being liberal, it will also match absolute garbage input.
Pattern p = Pattern.compile("\\b(\\w[^=?]*)=([^&]*)");
Matcher m = p.matcher("http://www.hello.com/bar/foo?a=,b &c =d");
StringBuffer sb = new StringBuffer();
while (m.find()) {
String key = m.group(1);
String value = m.group(2);
// quoteReplacement keeps any '$' or '\' in the encoded text literal
m.appendReplacement(sb, Matcher.quoteReplacement(
encodeURIComponent(key) + "=" + encodeURIComponent(value)));
}
m.appendTail(sb);
See the ideone example with a fill-in encodeURIComponent.
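Since the linked example is not reproduced here, the following is one possible fill-in, offered as a sketch rather than as the answerer's own code. It assumes that URLEncoder with "+" rewritten to "%20" is close enough to encodeURIComponent for the inputs at hand. The two are not identical for every character (for example, URLEncoder also escapes "~" and "!"). The class name QueryEncoder and the method encodeQuery are invented for the illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QueryEncoder {

    // Hypothetical stand-in for encodeURIComponent: form-encode, then turn
    // the "+" that URLEncoder uses for spaces into "%20".
    static String encodeURIComponent(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Encodes each key and value in place, leaving '/', '?', '=' and '&' alone.
    static String encodeQuery(String url) {
        Pattern p = Pattern.compile("\\b(\\w[^=?]*)=([^&]*)");
        Matcher m = p.matcher(url);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = encodeURIComponent(m.group(1))
                    + "=" + encodeURIComponent(m.group(2));
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encodeQuery("http://www.hello.com/bar/foo?a=,b &c =d"));
        // prints http://www.hello.com/bar/foo?a=%2Cb%20&c%20=d
    }
}
```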

How to validate URL in java using regex? [duplicate]

I want to validate a URL that starts with http/https/www/ftp, check for / and \ slashes, and check for .com, .org, etc. at the end of the URL, using a regular expression. Is there a regex pattern for URL validation?
This works:
Pattern p = Pattern.compile("(#)?(href=')?(HREF=')?(HREF=\")?(href=\")?(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+(/[#&\\n\\-=?\\+\\%/\\.\\w]+)?");
Matcher m = p.matcher("your url here");
I use the following code for that:
String lRegex = "^(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
By the way, a Google search would have turned up this solution.

Getting the last part of the referrer url [duplicate]

I want to get the last part of the referrer URL. What would be the best way to obtain this information?
String referrer = request.getHeader("referer");
Currently this is how I am getting the referrer URL. This is giving me the entire URL, but I only want part of the URL.
For Example: Requested URL: http://localhost:8080/TEST/ABC.html
If this is the referrer URL, I only want the ABC.html.
Thank you for the assistance, please let me know if there are any misunderstandings to my question.
This will give you XYZ.html :
String url = "http://localhost:8080/TEST/XYZ.html";
url = url.substring(url.lastIndexOf("/") + 1);
Use java.net.URL.getPath(), which returns the path without the query string:
String path = new URL( request.getHeader( "referer" )).getPath();
int sep = path.lastIndexOf( '/' );
return ( sep < 0 ) ? path : path.substring( sep + 1 );
It's basic string manipulation, treat it like any other string. You can use String.indexOf/String.lastIndexOf and String.substring, or you can use String.split, or you can use StringTokenizer, or you can use Regular Expressions (the most flexible option, but requires learning regex).
Just get the substring after the last / occurrence.
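The answers above can be combined into a small null-safe helper. This is a sketch with invented names (LastSegment, lastPathSegment). It uses java.net.URI so that a query string or fragment in the referrer does not end up in the result, and it returns an empty string when the header is missing or cannot be parsed, which is one possible choice among several.

```java
import java.net.URI;
import java.net.URISyntaxException;

class LastSegment {

    // Returns the text after the final '/' of the URL's path, ignoring any
    // query string or fragment. Returns "" for null or unparsable input.
    static String lastPathSegment(String url) {
        if (url == null) {
            return ""; // the Referer header is optional and may be absent
        }
        try {
            String path = new URI(url).getPath();
            if (path == null) {
                return "";
            }
            // lastIndexOf returns -1 when there is no '/', so this keeps the whole path
            return path.substring(path.lastIndexOf('/') + 1);
        } catch (URISyntaxException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        // In a servlet the argument would be request.getHeader("referer").
        System.out.println(lastPathSegment("http://localhost:8080/TEST/ABC.html"));
        System.out.println(lastPathSegment("http://localhost:8080/TEST/ABC.html?x=1#top"));
    }
}
```

Both calls in main print ABC.html.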

How to validate the URL using regex in Java?

I need to check whether a URL is valid or not. The URL should contain some subdirectories, such as:
example.com/test/test1/example/a.html
The URL should contain subdirectories test, test1 and example. How can I check if the URL is valid using regex in Java?
String url = "example.com/test/test1/example/a.html";
List<String> parts = Arrays.asList(url.split("/"));
return (parts.contains("test") && parts.contains("test1") && parts.contains("example"));
Since you want to do it with a regex, how about this:
Pattern p = Pattern.compile("example\\.com/test/test1/example/[\\w\\W]*");
System.out.println("OK: " + p.matcher("example.com/test/test1/example/a.html").find());
System.out.println("KO: " + p.matcher("example.com/test/test2/example/a.html").find());
You can simply pass your URL as an argument to the java.net.URL(String) constructor and check if the constructor throws java.net.MalformedURLException.
EDIT If, however, you simply want to check if a given string contains a given substring, use the String.contains(CharSequence) method. For example:
String url = "example.com/test/test1/example/a.html";
if (url.contains("/test/test1/")) {
// ...
}
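The two checks from this answer (the constructor test and the substring test) can be combined as sketched below. The class and method names are invented for the illustration. Note that java.net.URL requires a protocol, so the bare example.com/... string from the question is rejected unless a scheme such as http:// is prepended first. Recent JDKs also mark the URL constructors as deprecated, so this may compile with a warning.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlCheck {

    // True when the string parses as a URL and its path contains the
    // required subdirectories in order.
    static boolean hasRequiredPath(String s) {
        try {
            return new URL(s).getPath().contains("/test/test1/example/");
        } catch (MalformedURLException e) {
            return false; // not a well-formed URL (for example, no protocol)
        }
    }

    public static void main(String[] args) {
        System.out.println(hasRequiredPath("http://example.com/test/test1/example/a.html")); // true
        System.out.println(hasRequiredPath("http://example.com/test/test2/example/a.html")); // false
        System.out.println(hasRequiredPath("example.com/test/test1/example/a.html"));        // false: no protocol
    }
}
```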
This question is answered here using regular expressions:
Regular expression to match URLs in Java
Alternatively, you can use the Apache Commons Validator library, which provides tested validators, instead of writing your own.
Here is the library:
http://commons.apache.org/validator/
And here is the Javadoc of the URL validator:
http://commons.apache.org/validator/apidocs/org/apache/commons/validator/UrlValidator.html
