How to get URL of the first google search result? - java

I am trying to get the URL of the first search result. So far, I have tried downloading the page's HTML into a string using InputStream and AsyncTask, and then extracting the first URL from that string with a Java regex.
String str = result;
String regex = "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
    System.out.println(matcher.group());
    Toast.makeText(getBaseContext(), matcher.group(), Toast.LENGTH_LONG).show();
}
My code works very well for extracting the first URL from an HTML file, but I have noticed that there are no URLs in the HTML file when I save it on an Android device. There must be a better way of doing this.

Instead of if (matcher.find()) {} use while (matcher.find()) {}.
With if, find() is called only once, so if there are multiple URLs in the input you only get the first one and ignore any other important ones.
For example:
while ((line = reader.readLine()) != null) {
    Matcher matcher = pattern.matcher(line);
    while (matcher.find()) {
        String url = matcher.group();
    }
}
Your code, modified:
Pattern pattern = Pattern.compile("\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");
Matcher matcher = pattern.matcher(result);
while (matcher.find()) {
    String url = matcher.group();
}
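If you want to keep every match rather than overwrite url on each pass, the loop can collect them into a list. This is only a rough sketch: the class name UrlCollector, the method findUrls and the sample HTML are made up for illustration, and the pattern is modeled on the one in the question (I'm assuming '@' and '#' belong in the character classes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlCollector {
    // URL pattern modeled on the one in the question; compiled once and reused.
    static final Pattern URL_PATTERN = Pattern.compile(
            "\\b(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]");

    // Returns every URL found in the given text, in order of appearance.
    static List<String> findUrls(String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL_PATTERN.matcher(html);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical input, not real search-result markup.
        String html = "<a href=\"http://example.com/a\">a</a> "
                + "<a href=\"https://example.org/b?x=1\">b</a>";
        for (String url : findUrls(html)) {
            System.out.println(url);
        }
    }
}
```

The first element of the returned list is then the first URL in the page, which may or may not be the first search result.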
I'm guessing you're trying to get the first result, though, and with a regex you're bound to pick up a lot of unrelated google.com URLs first. I recommend using Jsoup instead: parsing XML/HTML with regex is strongly discouraged because it gets messy, and Jsoup takes care of it for you easily.
For example:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("https://www.google.com/search?q=query").get();
// all results are grouped into containers using the class "g" (group)
Elements groups = doc.getElementsByClass("g");
// check if any results were found
if (groups.isEmpty()) {
    System.out.println("no results found!");
    return;
}
// get the first result
Element firstGroup = groups.first();
// get the href from the first link in the first result
String href = firstGroup.getElementsByTag("a").first().attr("href");

Related

What's the correct regex to use for parsing data from a webpage source between tags?

Hi, I'm having some trouble parsing some data from a web source between two "tags".
Here's a sample of the web source and the code I'm using to try to parse it.
<div class="ProfileTweet-contents">
<p class="ProfileTweet-text js-tweet-text u-dir"
dir="ltr">Come join us now! </span><span class="invisible">http://</span><span class="js-display-url">www.google.com</span><span class="invisible">/</span><span class="tco-ellipsis"><span class="invisible"> </span></span> <a href="http://t.co/jIw2344dDZz" class="twitter-timeline-link u-isHiddenVisually" data-pre-embedded="true" dir="ltr" >pic.twitter.com/jIwtc23juZz</a></p>
Code
while ((line = in.readLine()) != null) {
Pattern pattern = Pattern.compile("dir=.?!<a href=");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
tweets[0] = matcher.group();
System.out.println(matcher.group());
}
}
The item of data I'm trying to fetch is the following
dir="ltr">Come join us now! <a href=
For some reason it's not fetching the data in between dir= and <a href=
Another working example which is parsing the web source just fine
URL addr = new URL(url);
URLConnection con = addr.openConnection();
ArrayList<String> data = new ArrayList<String>();
BufferedReader in = new BufferedReader(new InputStreamReader(con.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
Pattern p = Pattern.compile("<span itemprop=.*?</span>");
Pattern p2 = Pattern.compile(">.*?<");
Matcher m = p.matcher(inputLine);
Matcher m2;
while (m.find()) {
m2 = p2.matcher(m.group());
while (m2.find()) {
data.add(m2.group().replaceAll("<", "").replaceAll(">", "").replaceAll("&", "").replaceAll("#", "").replaceAll(";", "").replaceAll("3",""));
}
}
}
in.close();
addr = null;
con = null;
Edit: Sorry, I have just realised I was using a different regex from my other code example.
(dir=).*?(<a href=)
works fine.
You're probably looking for a pattern such as:
(dir=\".+\">.+<a\\shref=).+rel
The reason your original pattern doesn't work is that it leaves out several characters, such as ", and misuses .?: that matches at most one character, and the ! that follows is matched literally, so nothing between dir= and <a href= gets captured.
Here is a working example of the pattern above:
http://ideone.com/wbH9O6
The short version of the answer is: use an XML parser. If the HTML is mangled, use an HTML parser that will try to make sense of the madness. As a bonus, read this post:
RegEx match open tags except XHTML self-contained tags
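For a one-off where a regex is good enough, here is a small sketch of the reluctant .*? approach from the question's edit, with a capturing group so you get only the text in between. The class name BetweenMarkers, the method between and the shortened sample line are my own assumptions, not the asker's real page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenMarkers {
    // Reluctant .*? stops at the first "<a href=" after "dir=";
    // group 1 holds whatever sits between the two markers.
    static final Pattern BETWEEN = Pattern.compile("dir=(.*?)<a href=");

    // Returns the text between the markers, or null if they are not both present.
    static String between(String line) {
        Matcher m = BETWEEN.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical, shortened version of the sample markup.
        String line = "dir=\"ltr\">Come join us now! <a href=\"http://t.co/x\">pic</a>";
        System.out.println(between(line));

        // The original pattern needs a literal '!' at most one character after "dir=",
        // so it finds nothing in this line.
        System.out.println(Pattern.compile("dir=.?!<a href=").matcher(line).find());
    }
}
```

Note that this works line by line, so it will miss a tweet whose markup is split across several lines, which is one more reason to prefer a parser.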

Processing a String and replacing all URLs with working links via Java

How would I transform a text like the following, or any other text containing a URL (http, ftp, etc.)
Go to this link http://www.google.com (of course Stack Overflow already does this; on my website this is just plain text)
Into this
Go to this link www.google.com
I've come up with this method
public String transformURLIntoLinks(String text){
    String urlValidationRegex = "(https?|ftp)://(www\\d?|[a-zA-Z0-9]+)?.[a-zA-Z0-9-]+(\\:|.)([a-zA-Z0-9.]+|(\\d+)?)([/?:].*)?";
    Pattern p = Pattern.compile(urlValidationRegex);
    Matcher m = p.matcher(text);
    StringBuffer sb = new StringBuffer();
    while(m.find()){
        String found = m.group(1); //this String is only made of the http or ftp (just the first part of the link)
        m.appendReplacement(sb, "<a href='"+found+"'>"+found+"</a>");
    }
    m.appendTail(sb);
    return sb.toString();
}
The problem is that the regexes I've tried match only the first part ("http" or "ftp").
My output becomes: Go to this link <a href='http'>http</a>
It should be this
Go to this link <a href='http://www.google.com'>http://www.google.com</a>
public String transformURLIntoLinks(String text){
    String urlValidationRegex = "(https?|ftp)://(www\\d?|[a-zA-Z0-9]+)?.[a-zA-Z0-9-]+(\\:|.)([a-zA-Z0-9.]+|(\\d+)?)([/?:].*)?";
    Pattern p = Pattern.compile(urlValidationRegex);
    Matcher m = p.matcher(text);
    StringBuffer sb = new StringBuffer();
    while(m.find()){
        String found = m.group(0);
        m.appendReplacement(sb, "<a href='"+found+"'>"+found+"</a>");
    }
    m.appendTail(sb);
    return sb.toString();
}
m.group(1) was the mistake. m.group(0) works.
This will transform any URL found in the text into an anchor.
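One thing to watch with appendReplacement: '$' and '\' in the replacement string are treated as group references and escapes, so a URL containing them can break the output or throw. A possible variant, sketched under my own assumptions (the class name Linkifier, the method linkify and the much simpler "scheme plus non-whitespace" pattern are not from the question), wraps the replacement in Matcher.quoteReplacement:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Linkifier {
    // Simplified, assumed pattern: a scheme followed by any run of non-whitespace characters.
    static final Pattern URL = Pattern.compile("(?:https?|ftp)://\\S+");

    // Wraps every URL in the text in an anchor tag pointing at that URL.
    static String linkify(String text) {
        Matcher m = URL.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String found = m.group(); // same as group(0): the whole match
            String anchor = "<a href='" + found + "'>" + found + "</a>";
            // quoteReplacement keeps '$' and '\' in the URL from being read as group references
            m.appendReplacement(sb, Matcher.quoteReplacement(anchor));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(linkify("Go to this link http://www.google.com now"));
    }
}
```

This sketch does no HTML escaping and will swallow punctuation that directly follows a URL, so treat it as a starting point only.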

JMeter - regex in BeanShell (Matcher/Pattern) is cutting national characters

I need to extract some words from the server response data.
Using a Regular Expression Extractor I get
<span class="snippet_word">Działalność</span> <span class="snippet_word">lecznicza</span>.</a>
From that I need just: "Działalność lecznicza"
So I wrote a program in BeanShell that should do that, but there's a problem because I get
"lecznicza lecznicza"
Here is my program:
import java.util.regex;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String pattern = "\\w+(?=\\<)";
String co = vars.get("tresc");
int len = Integer.parseInt(vars.get("length"));
String phrase="";
StringBuffer sb = new StringBuffer();
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(co);
for(i=0; i < len ;i++){
if (m.find()){
strbuf = new StringBuffer(m.group(0));
}
else {
phrase="notfound";
}
sb.append(" ");
sb.append(strbuf);
}
phrase = sb.toString();
return phrase;
tresc is the source from which I extract the words matching the pattern.
length tells me how many words I'm extracting.
The program works fine for phrases without national characters. That's why I think there is some problem with encoding, or somewhere here:
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(co);
but I don't know how to change my code.
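By default, \w in Java matches only ASCII word characters ([a-zA-Z_0-9]), so it does not match letters such as ł, ś or ć. The first word ends in ć, so it is skipped, and when the second find() fails your loop appends the previous match again, which is why you see "lecznicza lecznicza". To match any Unicode letter, you can use \p{L}:
String pattern = "\\p{L}+(?=\\<)";
That said, for this type of work I would recommend using an XML parser, as regular expressions are completely unsuitable for parsing HTML/XML, as described in this post.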
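To see the difference outside JMeter, here is a plain Java sketch that runs both patterns over the sample response. The class name UnicodeWords and the method wordsBeforeTags are hypothetical, and the Polish letters are written as Unicode escapes only so the file compiles regardless of source encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeWords {
    // Joins, with single spaces, every match of the regex found in the given text.
    static String wordsBeforeTags(String regex, String html) {
        Matcher m = Pattern.compile(regex).matcher(html);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The sample response from the question, with the Polish letters escaped.
        String html = "<span class=\"snippet_word\">Dzia\u0142alno\u015B\u0107</span> "
                + "<span class=\"snippet_word\">lecznicza</span>.</a>";
        // ASCII-only word characters: the first word is missed.
        System.out.println(wordsBeforeTags("\\w+(?=<)", html));
        // Any Unicode letter: both words are found.
        System.out.println(wordsBeforeTags("\\p{L}+(?=<)", html));
    }
}
```

Looping with while (m.find()) also removes the need for the length variable. Another option, if you prefer to keep \w, is to compile the pattern with the Pattern.UNICODE_CHARACTER_CLASS flag (Java 7 and later).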

Extracting specific urls from a text file using java

I have a text document in which I have a bunch of urls of the form /courses/......./.../..
and from among these urls, I only want to extract those urls that are of the form /courses/.../lecture-notes. Meaning the urls that begin with /courses and ends with /lecture-notes.
Would anyone know of a good way to do this with regular expressions or just by string matching?
Here's one alternative:
Scanner s = new Scanner(new FileReader("filename.txt"));
String str;
while (null != (str = s.findWithinHorizon("/courses/\\S*/lecture-notes", 0)))
System.out.println(str);
Given a filename.txt with the content
Here /courses/lorem/lecture-notes and
here /courses/ipsum/dolor/lecture-notes perhaps.
the above snippet prints
/courses/lorem/lecture-notes
/courses/ipsum/dolor/lecture-notes
The following will only return the middle part (i.e. excluding /courses/ and /lecture-notes):
Pattern p = Pattern.compile("/courses/(.*)/lecture-notes");
Matcher m = p.matcher(yourString);
if (m.find()) {
    return m.group(1); // group 1 is the part of the regex inside the first pair of parentheses
}
Assuming that you have one URL per line, you could use:
BufferedReader br = new BufferedReader(new FileReader("urls.txt"));
String urlLine;
while ((urlLine = br.readLine()) != null) {
if (urlLine.matches("/courses/.*/lecture-notes")) {
// use url
}
}
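Since the question also asks about plain string matching: with one URL per line you may not need a regex at all. A rough sketch, where the class name LectureNotesFilter, the method lectureNotes and the sample paths are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LectureNotesFilter {
    static final String PREFIX = "/courses/";
    static final String SUFFIX = "/lecture-notes";

    // Keeps only the URLs that begin with /courses/ and end with /lecture-notes.
    static List<String> lectureNotes(List<String> urls) {
        List<String> result = new ArrayList<>();
        for (String url : urls) {
            String u = url.trim();
            // The length check stops prefix and suffix from sharing the same '/'.
            if (u.length() >= PREFIX.length() + SUFFIX.length()
                    && u.startsWith(PREFIX) && u.endsWith(SUFFIX)) {
                result.add(u);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample paths.
        List<String> urls = Arrays.asList(
                "/courses/lorem/lecture-notes",
                "/courses/ipsum/syllabus",
                "/about/lecture-notes",
                "/courses/ipsum/dolor/lecture-notes");
        for (String url : lectureNotes(urls)) {
            System.out.println(url);
        }
    }
}
```

This only fits the one-URL-per-line case; for URLs embedded in running text the Scanner or Matcher approaches above are a better fit.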

extract data between two tags but not the tags using java regex [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
I have a list of 100 names to extract that lie between the tags. I need to extract just the data and not the tags, using Java regular expressions.
E.g. I need the data Aaron, Teb; Abacha, Jui; Abashidze, Harry, each on a new line.
<a class="listing" href=http://eeee/a/hank_aaron/index.html">Aaron, Teb</a><br>
<a class="listing" href=http://eeee/t/sani_abacha/index.html">Abacha, Jui</a><br>
<a class="listing" href=http://eeee/i/aslan_abashidze/index.html">Abashidze, Harry</a><br>
I wrote the following code, but it extracts the tags too. Where am I going wrong? How do I strip the tags, or is the regexp wrong?
public static void main(String[] args) throws Exception {
URL oracle = new URL("http://eeee/all/people/index.html");
BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));
String input;
String REGEX = "<a class=\"listing\"[^>]*>";
while ((input = in.readLine()) != null){
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(input);
while(m.find()) {
System.out.println(input);
}
}
in.close();
}
Use this regexp:
(?:<a class=\"listing\"[^>]*>)([^<]*)(?:<)
Its group 1 will capture the name.
P.S. You should move Pattern p = Pattern.compile(REGEX); outside the loop.
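Put together, it could look roughly like this. It is a sketch under a few assumptions: the class name ListingNames and the method names are made up, the input is passed in as a string instead of being read from the URL, and I dropped the non-capturing groups since they don't change what group 1 captures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListingNames {
    // Compiled once, outside any loop. Group 1 is the text between the opening tag and the next '<'.
    static final Pattern LISTING = Pattern.compile("<a class=\"listing\"[^>]*>([^<]*)<");

    // Returns the link text of every <a class="listing"> element in the input.
    static List<String> names(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = LISTING.matcher(html);
        while (m.find()) {
            names.add(m.group(1).trim()); // print or store the group, not the whole input line
        }
        return names;
    }

    public static void main(String[] args) {
        // Two lines copied from the sample in the question.
        String html = "<a class=\"listing\" href=http://eeee/a/hank_aaron/index.html\">Aaron, Teb</a><br>\n"
                + "<a class=\"listing\" href=http://eeee/t/sani_abacha/index.html\">Abacha, Jui</a><br>";
        for (String name : names(html)) {
            System.out.println(name);
        }
    }
}
```

The original code printed input inside the loop, which is why the tags showed up; printing m.group(1) is the key change.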
