Crawling & parsing results of querying google-like search engine - java

I have to write a parser in Java (my first HTML parser, by the way). For now I'm using the jsoup library, and I think it is a very good solution for my problem.
The main goal is to get some information from Google Scholar (h-index, number of publications, years of scientific career). I know how to parse an HTML page that lists 10 people, like this one:
http://scholar.google.pl/citations?mauthors=Cracow+University+of+Economics&hl=pl&view_op=search_authors
for( Element element : htmlDoc.select("a[href*=/citations?user]") ){
if( element.hasText() ) {
String findUrl = element.absUrl("href");
pagesToVisit.add(findUrl);
}
}
BUT I need to find information about all of the scientists from the given university. How can I do that? I was thinking about getting the URL from the button that leads to the next 10 results, like this:
Elements elem = htmlDoc.getElementsByClass("gs_btnPR");
String nextUrl = elem.attr("onclick");
But I get a URL like this:
citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dAGH+University+of+Science+and+Technology\x26after_author\x3dslQKAC78__8J\x26astart\x3d10
Do I have to translate the \x sequences and add that page to my "toVisit" list? Or is there a better way within the jsoup library, or maybe in another library? Please let me know! I don't have any other idea how to parse something like this...

I have to translate \x signs and add that site to my "toVisit" sites...I don't have any other idea, how to parse something like this...
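The \xNN sequences are hexadecimal-encoded ASCII characters. For instance, \x3d is = and \x26 is &. The two hex digits after \x can be converted using Integer.parseInt with the radix set to 16:
char c = (char) Integer.parseInt("3d", 16); // pass only the hex digits; including the \x prefix throws NumberFormatException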
System.out.println(c);
If you need to decode these values without a 3rd party library, you can do so using regular expressions. For example, using the String supplied in your question:
String st = "citations?view_op\\x3dsearch_authors\\x26hl\\x3dpl\\x26oe\\x3dLatin2\\x26mauthors\\x3dAGH+University+of+Science+and+Technology\\x26after_author\\x3dslQKAC78__8J\\x26astart\\x3d10";
System.out.println("Before Decoding: " + st);
Pattern p = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");
Matcher m = p.matcher(st);
while ( m.find() ){
String c = Character.toString((char)Integer.parseInt(m.group(1), 16));
st = st.replace(m.group(0), c); // literal replacement of every occurrence of this escape
m = p.matcher(st); // re-create the matcher because st has changed
}
System.out.println("After Decoding: " + st);
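The same decoding can also be done in a single pass with Matcher.appendReplacement, which avoids re-creating the matcher after every replacement. This is only a sketch; the class name HexEscapeDecoder and the shortened sample string are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HexEscapeDecoder {
    // A literal backslash, an 'x', then exactly two hex digits (captured).
    private static final Pattern HEX_ESCAPE = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");

    static String decode(String input) {
        Matcher m = HEX_ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or '\' being read as replacement syntax
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out); // copy whatever follows the last escape
        return out.toString();
    }

    public static void main(String[] args) {
        // prints citations?view_op=search_authors&astart=10
        System.out.println(decode("citations?view_op\\x3dsearch_authors\\x26astart\\x3d10"));
    }
}
```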

You currently get a URL like this using your code:
citations?view_op\x3dsearch_authors\x26hl\x3dpl\x26oe\x3dLatin2\x26mauthors\x3dAGH+University+of+Science+and+Technology\x26after_author\x3dQPQwAJz___8J\x26astart\x3d10
You have to extract the after_author value (QPQwAJz___8J in this example) using a regex, and use it to construct the URL for the next page of search results, which looks like this:
scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=Cracow+University+of+Economics&after_author=QPQwAJz___8J
You can then fetch the next page from this URL, parse it using Jsoup, and repeat for all the remaining pages.
Will put together some example code later.
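In the meantime, here is a rough sketch of the extraction step only, without any Jsoup or network code. The class and method names are made up, and the URL layout simply follows the example above; whether Google Scholar also needs the astart parameter is not something this sketch checks.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NextPageUrl {
    // "after_author", the escaped '=' (\x3d), then the token itself.
    private static final Pattern AFTER_AUTHOR =
            Pattern.compile("after_author\\\\x3d([\\w-]+)");

    /** Pulls the after_author token out of the raw onclick attribute value. */
    static Optional<String> afterAuthorToken(String onclick) {
        Matcher m = AFTER_AUTHOR.matcher(onclick);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Builds the next-page URL; mauthors is expected to be URL-encoded already. */
    static String nextPageUrl(String mauthors, String token) {
        return "http://scholar.google.pl/citations?view_op=search_authors&hl=pl"
                + "&mauthors=" + mauthors
                + "&after_author=" + token;
    }

    public static void main(String[] args) {
        String onclick = "citations?view_op\\x3dsearch_authors\\x26hl\\x3dpl"
                + "\\x26after_author\\x3dQPQwAJz___8J\\x26astart\\x3d10";
        afterAuthorToken(onclick)
                .map(t -> nextPageUrl("Cracow+University+of+Economics", t))
                .ifPresent(System.out::println);
    }
}
```

An empty Optional means there is no next-page button, which is a natural place to stop the crawl loop.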

Related

Java Regexp to match domain of url

I would like to use Java regex to match a domain of a url, for example,
for www.table.google.com, I would like to get 'google' out of the URL, namely the second-to-last word in this URL string.
Any help will be appreciated !!!
It really depends on the complexity of your inputs...
Here is a pretty simple regex:
.+\\.(.+)\\..+
It captures whatever sits between the last two dots (\\. matches a literal dot).
And here are some examples for that pattern: https://regex101.com/r/L52oz6/1.
As you can see, it works for simple inputs but not for complex urls.
But why reinvent the wheel? There are plenty of really good libraries that correctly parse any complex URL. Still, for simple inputs a small regex is easily built. If this does not solve the problem for your inputs, please let me know and I will adjust the regex pattern.
Note that you can also just use simple splitting like:
String[] elements = input.split("\\.");
String secondToLastElement = elements[elements.length - 2];
But don't forget the index-bound checking.
Or, if you are after a very quick solution, walk through the input starting from the last position. Work your way back until you find the first dot, then continue until you find the second dot. Then extract that part with input.substring(index1, index2);.
There is already a method for exactly that purpose, namely String#lastIndexOf (see the documentation).
Take a look at this code snippet:
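String input = ...
int indexLastDot = input.lastIndexOf('.');
// Start the backwards search just before the last dot; otherwise the same dot is found again
int indexSecondToLastDot = input.lastIndexOf('.', indexLastDot - 1);
// substring takes (beginIndex, endIndex); the + 1 skips the dot itself
String secondToLastWord = input.substring(indexSecondToLastDot + 1, indexLastDot);
Don't forget bound checking: lastIndexOf returns -1 when no dot is found, and substring then throws.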
The advantage of this approach is that it is really fast: it scans the String directly, with no regex and no intermediate array, and allocates only the final substring.
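Wrapped in a method with the bound checks filled in, the lastIndexOf idea could look roughly like this. The class and method names are made up, and returning null for inputs without a usable dot is just one possible convention.

```java
final class SecondLastLabel {
    /**
     * Returns the text between the last two dots, or everything before the
     * last dot if there is only one. Returns null when nothing precedes a dot.
     */
    static String secondToLast(String host) {
        int lastDot = host.lastIndexOf('.');
        if (lastDot <= 0) {
            return null; // no dot at all, or the input starts with its only dot
        }
        // Returns -1 when there is no earlier dot, so prevDot + 1 becomes 0.
        int prevDot = host.lastIndexOf('.', lastDot - 1);
        return host.substring(prevDot + 1, lastDot);
    }

    public static void main(String[] args) {
        System.out.println(secondToLast("www.table.google.com")); // google
        System.out.println(secondToLast("google.com"));           // google
        System.out.println(secondToLast("localhost"));            // null
    }
}
```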
My attempt:
(?<scheme>https?:\/\/)?(?<subdomain>\S*?)(?<domainword>[^.\s]+)(?<tld>\.[a-z]+|\.[a-z]{2,3}\.[a-z]{2,3})(?=\/|$)
Demo. Works correctly for:
http://www.foo.stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.com/
http://stackoverflow.com
https://www.stackoverflow.com
www.stackoverflow.com
stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.co.uk
foo.www.stackoverflow.com
foo.www.stackoverflow.co.uk
foo.www.stackoverflow.co.uk/a/b/c
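For use from Java, the pattern above needs its backslashes doubled inside a string literal, while the forward slashes need no escaping at all. A minimal sketch, with a made-up class name, that reads the domainword named group:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DomainWord {
    // Same pattern as above, written as a Java string literal.
    private static final Pattern URL = Pattern.compile(
            "(?<scheme>https?://)?(?<subdomain>\\S*?)(?<domainword>[^.\\s]+)"
            + "(?<tld>\\.[a-z]+|\\.[a-z]{2,3}\\.[a-z]{2,3})(?=/|$)");

    /** Returns the word in front of the TLD, or null if the input does not match. */
    static String domainWord(String url) {
        Matcher m = URL.matcher(url);
        return m.find() ? m.group("domainword") : null;
    }

    public static void main(String[] args) {
        System.out.println(domainWord("www.table.google.com"));              // google
        System.out.println(domainWord("http://www.stackoverflow.co.uk"));    // stackoverflow
        System.out.println(domainWord("foo.www.stackoverflow.co.uk/a/b/c")); // stackoverflow
    }
}
```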
private static final Pattern URL_MATCH_GET_SECOND_AND_LAST =
Pattern.compile("www\\.(.*)\\.google\\.(.*)", Pattern.CASE_INSENSITIVE); // \\. matches a literal dot
String sURL = "www.table.google.com";
// Create the Matcher once; calling find() on a second Matcher is redundant
Matcher matchURL = URL_MATCH_GET_SECOND_AND_LAST.matcher(sURL);
if (matchURL.find()) {
String sFirst = matchURL.group(1); // "table"
String sSecond = matchURL.group(2); // "com"
}

How to copy certain text from a website with Java

I have a website that is in plain text. The website is in a format like this:
{"code1":"Text I want copied","code2":"Second text I want to copy"}
Every time the website refreshes though, the texts I want copied change in length. I am curious how I could retrieve the text starting after ' :" ' and before ' ", ', using Java. I want the same thing to happen with the second text as well. I also would like to remove the html tags. Help will be greatly appreciated.
Using the org.json library, you could parse the JSON like:
String myJSONString = "{\"code1\":\"Text I want copied\",\"code2\":\"Second text I want to copy\"}";
JSONObject object = new JSONObject(myJSONString);
// Look the values up by key: JSONObject does not guarantee key order, so indexing into getNames() is unreliable
String firstText = object.getString("code1");
String secondText = object.getString("code2");
For parsing the web page, you can use the JSoup library. See an example from this answer.

Replace & only in links in a partial html document

I've tried a few methods (jsoup shown below) to turn &amp; into & only in links. The difficulty I'm encountering suggests I'm going about this all wrong. I suspect I'll be facepalming when solutions are offered, but maybe good old regex is the best answer (as I need to only do the replacing in hrefs) unless the reader code is modified?
The parsing libraries (also tried NekoHTML) want to convert all &amp;s to & so I'm having trouble using them to even get the true link hrefs with which to use String's replace method.
Input:
String toParse = "The Link with an encoded ampersand (&) is challenging."
Desired output:
The Link with an encoded ampersand (&) is challenging.
I'm encountering this trying to read an RSS feed that is rendering <link>s with &amp; instead of &.
Update
I ended up using regex to identify the links, then using replace to insert a decoded link in place of the one with &amp;s. Pattern.quote() turned out to be very handy, but I had to manually close and re-open the quoted portions so I could add a regex alternation for my ampersand condition:
final String cleanLink = StringUtils.strip(link).replaceAll(" ", "%20").replaceAll("'", "%27");
String regex = Pattern.quote(link);
// end and re-start literal matching around my or condition
regex = regex.replaceAll("&", "\\\\E(&amp;|&)\\\\Q");
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(result);
while (matcher.find()) {
int index = result.indexOf(matcher.group());
while (index != -1) {
// this replaces the links with &amp; with the same links with &
// because cleanLink is from the DOM and has been properly decoded
result.replace(index, index + matcher.group().length(), cleanLink);
index += cleanLink.length();
index = result.indexOf(matcher.group(), index);
linkReplaced = true;
}
}
I'm not thrilled with this approach, but I had to handle too many conditions myself without using a DOM tool to identify links.
Have a look at StringEscapeUtils from Apache Commons Lang. Try unescapeHtml() on your String.
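If only the hrefs should be touched and no extra library is wanted, a regex-only variant is also possible. The sketch below is not the approach from the update above: it rewrites &amp; inside quoted href attribute values and leaves the rest of the text alone. The class name and the example.com URL are made up, and it does not handle unquoted attributes or the text content of RSS <link> elements.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HrefAmpersandFixer {
    // group 1: href= prefix, group 2: opening quote, group 3: attribute value
    private static final Pattern HREF = Pattern.compile(
            "(href\\s*=\\s*)([\"'])(.*?)\\2", Pattern.CASE_INSENSITIVE);

    static String decodeAmpInHrefs(String html) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String fixed = m.group(1) + m.group(2)
                    + m.group(3).replace("&amp;", "&") + m.group(2);
            // quoteReplacement keeps '$' and '\' in URLs from being treated as syntax
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "The <a href=\"http://example.com/?a=1&amp;b=2\">Link</a> (&amp;) stays.";
        System.out.println(decodeAmpInHrefs(in));
    }
}
```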

Formatting a JSON-Like String Without External Library

How can I go about formatting this string to give me the "gs" value?
{"status":1,"gs":"a2fdee457d64cd48f399f1a9fea4a977","user_type_id":1,"uid":-980}
I have looked at many of the similar questions on stack overflow, however, most suggest to use an external library. I do not want to use an external library because I am using this for libgdx, and an external library will just create unnecessary complications.
If you know specifically that you are looking for the value after "gs", then you can simply do the following:
// String input = [string in your question];
input = input.substring(input.indexOf("\"gs\"") + "\"gs\"".length()); // move past the "gs" key; input now starts at the ':'
input = input.substring(input.indexOf('"') + 1); // move past the first '"' after "gs"
String gs = input.substring(0, input.indexOf('"'));
And now gs contains the string a2fdee457d64cd48f399f1a9fea4a977.
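The same indexOf steps can be wrapped in a small helper that works for any key and does not modify the input. This is only a sketch with made-up names; it assumes the value is a quoted string with no escaped quotes inside, and it returns null for anything else (missing key, numeric value).

```java
final class JsonLikeValue {
    /** Returns the quoted string value stored under key, or null if there is none. */
    static String stringValue(String json, String key) {
        String quotedKey = "\"" + key + "\"";
        int keyPos = json.indexOf(quotedKey);
        if (keyPos < 0) return null;
        int colon = json.indexOf(':', keyPos + quotedKey.length());
        if (colon < 0) return null;
        int open = json.indexOf('"', colon + 1);
        if (open < 0) return null;
        // Only whitespace may sit between ':' and the opening quote;
        // otherwise the value is not a string (e.g. "status":1).
        if (!json.substring(colon + 1, open).trim().isEmpty()) return null;
        int close = json.indexOf('"', open + 1);
        if (close < 0) return null;
        return json.substring(open + 1, close);
    }

    public static void main(String[] args) {
        String input = "{\"status\":1,\"gs\":\"a2fdee457d64cd48f399f1a9fea4a977\","
                + "\"user_type_id\":1,\"uid\":-980}";
        System.out.println(stringValue(input, "gs")); // a2fdee457d64cd48f399f1a9fea4a977
    }
}
```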

Extracting an "encompassing" string based on a term within the string

I have a java function to extract a string out of the HTML Page source for any website...The function basically accepts the site name, along with a term to search for. Now, this search term is always contained within javascript tags. What I need to do is to pull the entire javascript (within the tags) that contains the search term.
Here is an example -
<script type="text/javascript">
//Roundtrip
rtTop = Number(new Date());
document.documentElement.className += ' jsenabled';
</script>
For the javascript snippet above, my search term would be "rtTop". Once found, I want my function to return the string containing everything within the script tags.
Any novel solution? Thanks.
You could use a regular expression along the lines of
String someHTML = //get your HTML from wherever
Pattern pattern = Pattern.compile("<script type=\"text/javascript\">(.*?rtTop.*?)</script>",Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
// group(1) throws IllegalStateException if find() did not succeed, so check it first
String result = myMatcher.find() ? myMatcher.group(1) : null;
I wish I could just comment on JacobM's answer, but I think I need more stackCred.
You could use an HTML parser; that's usually the better solution. That said, for limited scopes, I often use regex. It's a mean beast though. One change I would make to JacobM's pattern is to replace the attributes within the opening element with [^>]*
That will allow you to match even if the "type" isn't present or if it has some other oddities. I'd also wrap the .*? with parens to make using the values later a little easier.
* UPDATE *
Borrowing from JacobM's answer. I'd change the pattern a little bit to handle multiple elements.
String someHTML = //get your HTML from wherever
String lKeyword = "rtTop";
String lRegexPattern = "(.*)(<script[^>]*>(((?!</).)*)"+lKeyword +"(((?!</).)*)</script>)(.*)";
Pattern pattern = Pattern.compile(lRegexPattern ,Pattern.DOTALL);
Matcher myMatcher = pattern.matcher(someHTML);
myMatcher.find();
String lPreKeyword = myMatcher.group(3);
String lPostKeyword = myMatcher.group(5);
String result = lPreKeyword + lKeyword + lPostKeyword;
An example of this pattern in action can be found here. Like I said, parsing HTML via regex can get real ugly real fast.
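A simpler variation, which is a different technique from the lookahead pattern above, is to match every script element on its own and then filter with a plain String.contains. Because the keyword never goes into the regex, it needs no escaping. The class and method names below are made up, and the sketch shares the usual limits of regex-based HTML handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ScriptFinder {
    // The lazy body match stops at the first closing tag, so blocks never merge.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>(.*?)</script>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the bodies of all script elements that contain the keyword. */
    static List<String> scriptsContaining(String html, String keyword) {
        List<String> hits = new ArrayList<>();
        Matcher m = SCRIPT.matcher(html);
        while (m.find()) {
            String body = m.group(1);
            if (body.contains(keyword)) {
                hits.add(body);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String html = "<script>var other = 1;</script>"
                + "<script type=\"text/javascript\">\nrtTop = Number(new Date());\n</script>";
        System.out.println(scriptsContaining(html, "rtTop"));
    }
}
```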
