I am using the Jsoup package to run a Google search and pick out results with a specific title, such as Facebook pages. The code below prints the title and URL of each result. From those results I want to select the Facebook URL.
PROGRAM :
package googlesearch;
import java.io.IOException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class SearchRegexDiv {
private static String REGEX = ".?[facebook]";
public static void main(String[] args) throws IOException {
Pattern p = Pattern.compile(REGEX);
String google = "http://www.google.com/search?q=";
//String search = "stackoverflow";
String search = "hortonworks";
String charset = "UTF-8";
String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!
Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset)).userAgent(userAgent).get().select(".g>.r>a");
for (Element link: links) {
String title = link.text();
String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
if (!url.startsWith("http")) {
continue; // Ads/news/etc.
}
//.?facebook
if (title.matches(REGEX)) {
System.out.println("Done");
title.substring(title.lastIndexOf(" ") + 1); //split the String
//(example.substring(example.lastIndexOf(" ") + 1));
}
System.out.println("Title: " + title);
System.out.println("URL: " + url);
}
}
}
OUTPUT :
Title: Hortonworks - Facebook logo
URL: https://www.facebook.com/hortonworks/
The output is a list of URLs and titles in the format above.
I am trying to match titles containing the word Facebook and then split the result into two strings, like
String social_media = "facebook";
String org = "hortonworks";
Use this code to split your String on more than one character.
Here is a demo that splits on both '/' and '.':
String word = "https://www.facebook.com/hortonworks/";
String [] array = word.split("[/.]");
for (String each1 : array)
System.out.println(each1);
Output (each token is printed on its own line; the "//" after "https:" also produces two empty strings, printed as blank lines and omitted here):
https:
www
facebook
com
hortonworks
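As an alternative to splitting the whole URL, here is a minimal sketch that lets java.net.URI separate the host from the path first. The class name SocialUrlParts and the method parse are made up for this illustration, and the rule "take the label just before the top-level domain" is an assumption that would not hold for hosts such as example.co.uk.

```java
import java.net.URI;

public class SocialUrlParts {

    // Returns {site, organisation}, e.g. {"facebook", "hortonworks"}.
    public static String[] parse(String url) {
        URI uri = URI.create(url);
        String[] hostLabels = uri.getHost().split("\\.");
        // Assumption: the label before the top-level domain names the site,
        // so "www.facebook.com" gives "facebook".
        String site = hostLabels.length >= 2
                ? hostLabels[hostLabels.length - 2]
                : hostLabels[0];
        // The first non-empty path segment: "/hortonworks/" gives "hortonworks".
        String org = "";
        for (String segment : uri.getPath().split("/")) {
            if (!segment.isEmpty()) {
                org = segment;
                break;
            }
        }
        return new String[] { site, org };
    }

    public static void main(String[] args) {
        String[] parts = parse("https://www.facebook.com/hortonworks/");
        System.out.println("social_media = " + parts[0]);
        System.out.println("org = " + parts[1]);
    }
}
```

Working on the parsed host and path avoids the empty strings that "//" produces when the raw URL is split.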
I am learning jsoup. I want to parse the below script :
<script>
_cUq="1lj9lodlnq";
</script>
After parsing output : 1lj9lodlnq
Here is what I am trying:
String str = element.ownText().toString();
str = str.replace("\r","");
str = str.replace("\n","");
str = str.replace("<script>","");
str = str.replace("</script>","");
System.out.println(str);
if(str.contains("="))
split = str.split("=");
On debugging I can see the script is stored in the element tag but on assigning to str I get "". Correct me where I am going wrong.
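You can extract the inner JavaScript with Jsoup's data() method. ownText() returns an empty string here because Jsoup stores the content of a script element as a data node, not as a text node. Letting Jsoup do the extraction also makes your code much easier to maintain. In addition, you can use a regular expression to remove all whitespace at once instead of calling String.replace() for each character.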
import org.jsoup.Jsoup;
import org.junit.Test;
import static org.hamcrest.core.Is.is;
import static org.junit.Assert.assertThat;
public class JSoupSO {
@Test
public void script() {
String s = "<script>\n" +
"_cUq=\"1lj9lodlnq\";\n" +
"</script>";
// let Jsoup parse the HTML
String innerJavascript = Jsoup.parse(s).data();
// remove all whitespaces
innerJavascript = innerJavascript.replaceAll("\\s", "");
assertThat(innerJavascript, is("_cUq=\"1lj9lodlnq\";"));
}
}
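To get from the script body to the bare value the question asks for (1lj9lodlnq), one option is a capturing group. The following is a minimal sketch that assumes the script body has already been obtained as a String, for example via data() as above; the class name ScriptValue and the method extract are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScriptValue {

    // Matches _cUq = "value" (optional whitespace around '=') and captures the value.
    private static final Pattern ASSIGNMENT =
            Pattern.compile("_cUq\\s*=\\s*\"([^\"]*)\"");

    // Returns the quoted value assigned to _cUq, or null if there is none.
    public static String extract(String scriptBody) {
        Matcher m = ASSIGNMENT.matcher(scriptBody);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String scriptBody = "\n_cUq=\"1lj9lodlnq\";\n";
        System.out.println(extract(scriptBody)); // prints 1lj9lodlnq
    }
}
```

Compared with split("="), the capturing group also drops the surrounding quotes and the trailing semicolon.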
I am looking for the simplest method in java which takes a XML string, and converts all tags (not their contents) to camel case, such as
<HeaderFirst>
<HeaderTwo>
<ID>id1</ID>
<TimeStamp>2016-11-04T02:46:34Z</TimeStamp>
<DetailedDescription>
<![CDATA[di]]>
</DetailedDescription>
</HeaderTwo>
</HeaderFirst>
will be converted to
<headerFirst>
<headerTwo>
<id>id1</id>
<timeStamp>2016-11-04T02:46:34Z</timeStamp>
<detailedDescription>
<![CDATA[di]]>
</detailedDescription>
</headerTwo>
</headerFirst>
Try something like this:
public void tagToCamelCase(String input){
char[] inputArray = input.toCharArray();
for (int i = 0; i < inputArray.length-2; i++){
if (inputArray[i] == '<'){
if(inputArray[i+1]!= '/')
inputArray[i+1] = Character.toLowerCase(inputArray[i+1]);
else
inputArray[i+2] = Character.toLowerCase(inputArray[i+2]);
}
}
System.out.println(new String(inputArray));
}
Note: the tag ID will be iD and not id. Hope this helps.
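The same idea can be sketched with a regular expression instead of a character loop. This is only an illustration under stated assumptions: the class name TagCamelCase is made up, and the rule "a tag name written entirely in upper case is lowercased completely" is a heuristic chosen here so that ID becomes id. Text inside a CDATA section that looks like a tag would also be rewritten, so this is not a substitute for a real XML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagCamelCase {

    // "<" or "</" followed by a tag name. "<![CDATA[" does not match because
    // the character after "<" (or "</") must be a letter or an underscore.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z_][\\w.-]*)");

    public static String convert(String xml) {
        Matcher m = TAG.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(2);
            // Heuristic (assumption): all-upper-case names such as "ID" are
            // lowercased completely; otherwise only the first character is.
            String lowered = name.equals(name.toUpperCase())
                    ? name.toLowerCase()
                    : Character.toLowerCase(name.charAt(0)) + name.substring(1);
            m.appendReplacement(out,
                    Matcher.quoteReplacement("<" + m.group(1) + lowered));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<HeaderTwo><ID>id1</ID><TimeStamp>2016-11-04T02:46:34Z</TimeStamp>"
                + "<![CDATA[di]]></HeaderTwo>";
        System.out.println(convert(xml));
    }
}
```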
Here is a solution based on splitting the string on the ">" character and then processing the tokens in three different cases: CDATA, open tag, and close tag.
The following code should work (see the program output below). There is, however, a problem with the tag "ID": how do we know that its camel case should be "id" and not "iD"? This needs a dictionary to capture that knowledge. So the routine convert() below has two modes, with useDictionary being true or false. See if this solution satisfies your requirement.
To use the "useDictionary" mode you also need to maintain a proper dictionary (the HashMap called "dict" in the program; right now it has only one entry, saying that "ID" should be camel-cased to "id"). Note that the dictionary can be built up incrementally: you only need to add the special cases to it (e.g. the camel case of "ID" is "id", not "iD").
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class CamelCase {
private static Map<String, String> dict = new HashMap<>();
static {
dict.put("ID", "id");
}
public static void main(String[] args) {
String input = "<HeaderFirst> "
+ "\n <HeaderTwo>"
+ "\n <ID>id1</ID>"
+ "\n <TimeStamp>2016-11-04T02:46:34Z</TimeStamp>"
+ "\n <DetailedDescription>"
+ "\n <![CDATA[di]]>"
+ "\n </DetailedDescription>"
+ "\n </HeaderTwo> "
+ "\n</HeaderFirst>";
System.out.println("===== output without using a dictionary =====");
System.out.println(convert(input, false /* useDictionary */));
System.out.println("===== output using a dictionary =====");
System.out.println(convert(input, true /* useDictionary */));
}
private static String convert(String input, boolean useDictionary) {
String splitter = ">";
String[] tokens = input.split(splitter);
StringBuilder sb = new StringBuilder();
Pattern cdataPattern = Pattern.compile("([^<]*)<!\\[CDATA\\[([^\\]]*)\\]\\]");
Pattern oTagPattern = Pattern.compile("([^<]*)<(\\w+)");
Pattern cTagPattern = Pattern.compile("([^<]*)</(\\w+)");
String prefix;
String tag;
String newTag;
for (String token : tokens) {
Matcher cdataMatcher = cdataPattern.matcher(token);
Matcher oTagMatcher = oTagPattern.matcher(token);
Matcher cTagMatcher = cTagPattern.matcher(token);
if (cdataMatcher.find()) { // CDATA - do not change
sb.append(token);
} else if (oTagMatcher.find()) {// open tag - change first char to lower case
prefix = oTagMatcher.group(1);
tag = oTagMatcher.group(2);
newTag = camelCaseOneTag(tag, useDictionary);
sb.append(prefix + "<" + newTag);
} else if (cTagMatcher.find()) {// close tag - change first char to lower case
prefix = cTagMatcher.group(1);
tag = cTagMatcher.group(2);
newTag = camelCaseOneTag(tag, useDictionary);
sb.append(prefix + "</" + newTag); // keep the "/" of the close tag
}
sb.append(splitter);
}
return sb.toString();
}
private static String camelCaseOneTag(String tag, boolean useDictionary) {
String newTag;
if (useDictionary
&& dict.containsKey(tag)) {
newTag = dict.get(tag);
} else {
newTag = tag.substring(0, 1).toLowerCase()
+ tag.substring(1);
}
return newTag;
}
}
The output of this program is this:
===== output without using a dictionary =====
<headerFirst>
<headerTwo>
<iD>id1</iD>
<timeStamp>2016-11-04T02:46:34Z</timeStamp>
<detailedDescription>
<![CDATA[di]]>
</detailedDescription>
</headerTwo>
</headerFirst>
===== output using a dictionary =====
<headerFirst>
<headerTwo>
<id>id1</id>
<timeStamp>2016-11-04T02:46:34Z</timeStamp>
<detailedDescription>
<![CDATA[di]]>
</detailedDescription>
</headerTwo>
</headerFirst>
I played a little with JSOUP, but I can't get the information I want with it from a website and I need some help.
For example I have this website, and I would like to extract some info like this :
-ROCK
--ACID ROCK
--PSYCHEDELIC ROCK
--BLUES ROCK
-----Aerosmith
----------One Way Street
-----AC/DC
----------Ain't No Fun (Waiting Round to Be a Millionaire)
In other words, I want a list of genres, each containing a list of artists, each containing a list of songs:
-Genre1
--Artist1
---Song1
---Song2
---Song3
--Artist2
---Song1
-Genre2
...
This is what I have so far (sorry for the messy code):
package parser;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class HTMLParser {
public static void main(String[] args) {
String HTMLSTring = "<!DOCTYPE html>"
+ "<html>"
+ "<head>"
+ "<title>Music</title>"
+ "</head>"
+ "<body>"
+ "<table><tr><td><h1>Artists</h1></tr>"
+ "</table>"
+ "</body>"
+ "</html>";
Document html = Jsoup.parse(HTMLSTring);
String genre = "genres";
String artist = "artist";
String album = "album";
String song = "song";
//Document htmlFile = null;
//Element div = html.body().getElementsByAttributeValueMatching(genre, null);
// String div = html.body().getElementsByAttributeValueMatching(genre, null).text();
//String cssClass = div.className();
List genreList = new ArrayList();
String title = html.title();
String h1 = html.body().getElementsByTag("h1").text();
String h2 = html.body().getElementsByClass(genre).text();
String gen = html.body().getAllElements().text();
Document doc;
try {
doc = Jsoup.connect("http://allmusic.com/").get();
title = doc.title();
h1 = doc.text();
h2 = doc.text();
} catch (IOException e)
{
e.printStackTrace();
}
System.out.println("Title : " + title);
//System.out.println("h1 : "+ h1);
//System.out.println("h2 : "+ h2);
System.out.println("gen : all elements : " + gen);
}
}
And this is my output:
Title : AllMusic | Record Reviews, Streaming Songs, Genres & Bands
gen : all elements : Artists Artists Artists Artists Artists Artists
I haven't gotten very far.
I don't know how to extract the information (e.g. the genre types, the artist names, and so on).
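The selectors needed to scrape the site depend on its markup, which is not shown here, so the following sketch covers only the target data structure: a nested map from genre to artist to songs, printed in the dashed format above. The class name MusicCatalog and its methods are hypothetical names for this illustration, and the values passed to add() would come from Jsoup selections in a real scraper.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MusicCatalog {

    // genre -> artist -> songs, all kept in insertion order
    private final Map<String, Map<String, List<String>>> genres = new LinkedHashMap<>();

    public void add(String genre, String artist, String song) {
        genres.computeIfAbsent(genre, g -> new LinkedHashMap<>())
              .computeIfAbsent(artist, a -> new ArrayList<>())
              .add(song);
    }

    // Renders "-Genre", "--Artist", "---Song", one entry per line.
    public String render() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Map<String, List<String>>> genre : genres.entrySet()) {
            sb.append('-').append(genre.getKey()).append('\n');
            for (Map.Entry<String, List<String>> artist : genre.getValue().entrySet()) {
                sb.append("--").append(artist.getKey()).append('\n');
                for (String song : artist.getValue()) {
                    sb.append("---").append(song).append('\n');
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        MusicCatalog catalog = new MusicCatalog();
        // Hard-coded sample values; a scraper would supply these instead.
        catalog.add("Blues Rock", "Aerosmith", "One Way Street");
        catalog.add("Blues Rock", "AC/DC", "Ain't No Fun (Waiting Round to Be a Millionaire)");
        System.out.print(catalog.render());
    }
}
```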
I have a text file with lines like these:
1st line - 20-01-01 Abs Def est (xabcd)
2nd line - 290-01-01 Abs Def est ghj gfhj (xabcd fgjh fgjh)
3rd line - 20-1-1 Absfghfgjhgj (xabcd ghj 5676gyj)
I want to end up with 3 different String arrays:
[0]20-01-01 [1]290-01-01 [2] 20-1-1
[0]Abs Def est [1]Abs Def est ghj gfhj [2] Absfghfgjhgj
[0]xabcd [1]xabcd fgjh fgjh [2] xabcd ghj 5676gyj
Using String[] array1 = myLine.split(" ") I only get the piece 20-01-01, but I also want to keep the other 2 strings.
EDIT: I want to do this using regular expressions (the text file is large).
This is my code so far.
Please help; I have searched but did not find anything.
Thanks.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Comparator;
import java.util.Date;
import java.util.Set;
import java.util.TreeSet;
public class Holiday implements Comparable<Date>{
Date date;
String name;
public Holiday(Date date, String name){
this.date=date;
this.name=name;
}
public static void main(String[] args) throws IOException {
FileInputStream fis = new FileInputStream(new File("c:/holidays.txt"));
InputStreamReader isr = new InputStreamReader(fis, "windows-1251");
BufferedReader br = new BufferedReader(isr);
TreeSet<Holiday> tr=new TreeSet<>();
System.out.println(br.readLine());
String myLine = null;
while ( (myLine = br.readLine()) != null)
{
String[] array1 = myLine.split(" "); //OR use this
//String array1 = myLine.split(" ")[0];//befor " " read 1-st string
//String array2 = myLine.split("")[1];
//Holiday h=new Holiday(array1, name)
//String array1 = myLine.split(" ");
// check to make sure you have valid data
// String[] array2 = array1[1].split(" ");
System.out.println(array1[0]);
}
}
@Override
public int compareTo(Date o) {
// TODO Auto-generated method stub
return 0;
}
}
Pattern p = Pattern.compile("(.*?) (.*?) (\\(.*\\))");
Matcher m = p.matcher("20-01-01 Abs Def est (abcd)");
if (!m.matches()) throw new Exception("Invalid string");
String s1 = m.group(1); // 20-01-01
String s2 = m.group(2); // Abs Def est
String s3 = m.group(3); // (abcd)
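Building on the pattern above, here is a minimal sketch that applies the same idea to every line and collects the three pieces into three lists. The class name HolidayLineParser and the method parse are made up for this illustration, and the pattern is a variation that captures the text inside the parentheses without the parentheses themselves; it assumes every line ends with a parenthesised part.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HolidayLineParser {

    // group 1: date token, group 2: name, group 3: text inside the parentheses
    private static final Pattern LINE =
            Pattern.compile("^(\\S+)\\s+(.*?)\\s*\\((.*)\\)\\s*$");

    // Returns {date, name, note}, or null if the line does not have the expected shape.
    public static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] lines = {
            "20-01-01 Abs Def est (xabcd)",
            "290-01-01 Abs Def est ghj gfhj (xabcd fgjh fgjh)",
            "20-1-1 Absfghfgjhgj (xabcd ghj 5676gyj)"
        };
        List<String> dates = new ArrayList<>();
        List<String> names = new ArrayList<>();
        List<String> notes = new ArrayList<>();
        for (String line : lines) {
            String[] parts = parse(line);
            if (parts != null) {
                dates.add(parts[0]);
                names.add(parts[1]);
                notes.add(parts[2]);
            }
        }
        System.out.println(dates);
        System.out.println(names);
        System.out.println(notes);
    }
}
```

Each list can be turned into an array with toArray(new String[0]) if arrays are required.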
Use a StringTokenizer, whose default delimiters are the whitespace characters (space, tab, newline, carriage return and form feed).
You seem to be splitting on whitespace. Each element of the resulting string array contains one whitespace-separated substring, and you can piece these back together later via string concatenation.
For instance,
array1[0] would be 20-01-01
array1[1] would be Abs
array1[2] would be Def
so on and so forth.
Another option is to use Java regular expressions, but that may only be worthwhile if your input text file has consistent formatting and there are a lot of lines to process. They are very powerful, but require some experience.
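For comparison, here is a sketch of a regex-free variant that cuts each line at the first space and at the parentheses instead of splitting on every space, so nothing has to be pieced back together. The class name HolidayLineSplitter and the method split are hypothetical names for this illustration; it assumes the date contains no spaces and that the line ends with one parenthesised part.

```java
public class HolidayLineSplitter {

    // Returns {date, name, note}, or null if the line does not have the expected shape.
    public static String[] split(String line) {
        int firstSpace = line.indexOf(' ');
        if (firstSpace < 0) {
            return null;
        }
        int open = line.indexOf('(', firstSpace);
        int close = line.lastIndexOf(')');
        if (open < 0 || close < open) {
            return null;
        }
        String date = line.substring(0, firstSpace);
        String name = line.substring(firstSpace + 1, open).trim();
        String note = line.substring(open + 1, close);
        return new String[] { date, name, note };
    }

    public static void main(String[] args) {
        String[] parts = split("290-01-01 Abs Def est ghj gfhj (xabcd fgjh fgjh)");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
        System.out.println(parts[2]);
    }
}
```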
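Match the required text with a regular expression.
The regexp below requires exactly 3 words in the middle and 1 word in the brackets, so for the sample string below, which has 4 words in the middle, find() returns false and nothing is printed.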
String txt = "20-01-01 Abs Def est hhh (abcd)";
Pattern p = Pattern.compile("(\\d\\d-\\d\\d-\\d\\d) (\\w+ \\w+ \\w+) ([(](\\w)+[)])");
Matcher matcher = p.matcher(txt);
if (matcher.find()) {
String s1 = matcher.group(1);
String s2 = matcher.group(2);
String s3 = matcher.group(3);
System.out.println(s1);
System.out.println(s2);
System.out.println(s3);
}
However, if you need more flexibility, you may want to use the code provided by Lence Java.
Need help in parsing html string
String str = "<div id=\"test\" > Amrit </div><div><a href=\"#bbbb\" > Amrit </a> </div><a href=\"#cccc\" ><a href=\"#dddd\" >";
String reg = ".*(<\\s*a\\s+href\\s*=\\s*\\\"(.+?)\"\\s*>).*";
str is my sample string and reg is the regex I use to parse all the anchor tags, especially the value of href. With this regex, only the last anchor in the string is found.
Pattern MY_PATTERN = Pattern.compile(reg);
Matcher m = MY_PATTERN.matcher(str);
while (m.find()) {
for(int i=0; i<m.groupCount(); i++){
String s = m.group(i);
System.out.println("->" + s);
}
}
This is the code I wrote.
What is missing?
Also, what if I want a particular occurrence of a string to be replaced? In general, I want to change my URL from the form [string]_[string] to [string]-[string]. How can I find "_" and replace it with "-"?
Instead of parsing HTML with a regex (regular expressions describe regular languages, and HTML is not a regular language), use HtmlUnit:
http://htmlunit.sourceforge.net/
This may help: Options for HTML scraping?
It looks like you have one double escape too many.
This segment may fix it: "<\\s*a\\s+href\\s*=\\s*\"(.+?)\"\\s*>", though I can't say whether the entire regex works.
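As a sketch of how the segment above behaves when used on its own: without the leading and trailing ".*", each call to find() moves on to the next anchor, and group(1) holds the href value. The class name HrefFinder and the method hrefs are made up for this illustration, and the underscore replacement is demonstrated on a hypothetical value, since the sample string contains no "_".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefFinder {

    // The segment from the answer above, with no ".*" on either side.
    private static final Pattern ANCHOR =
            Pattern.compile("<\\s*a\\s+href\\s*=\\s*\"(.+?)\"\\s*>");

    // Collects the href value of every anchor the pattern finds.
    public static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "<div id=\"test\" > Amrit </div><div><a href=\"#bbbb\" > Amrit </a> </div>"
                + "<a href=\"#cccc\" ><a href=\"#dddd\" >";
        System.out.println(hrefs(str)); // prints [#bbbb, #cccc, #dddd]

        // Hypothetical value, only to show the "_" to "-" replacement.
        System.out.println("some_page".replace('_', '-')); // prints some-page
    }
}
```

Note that groups are numbered from 1 up to and including groupCount(); group(0) is the whole match.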
I would suggest using Jsoup. It can be much more flexible than a regex. Sample code is given below.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ListLinks {
public static void main(String[] args) throws Exception {
String url = "http://www.umovietv.com/EntertainmentList.aspx";
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
for (Element link : links) {
print("%s", link.attr("abs:href"));
}
}
private static void print(String msg, Object... args) {
System.out.println(String.format(msg, args));
}
}
Refer to http://jsoup.org/ for more information.