How to get images using HTML parsing with Jsoup - Java

I want to get all images using HTML parsing with Jsoup.
I use the code below:
Elements images = doc.select("img[src~=(?i)\\.(jpe?g)]");
for (Element image : images) {
    //System.out.println("\nsrc : " + image.attr("src"));
    arrImageItem.add(image.attr("src"));
}
This parses all the images, but I only want to parse URLs like this one:
http://tvrehberi.hurriyet.com.tr/images/742/403742.jpg
That is, URLs beginning like this:
http://tvrehberi.hurriyet.com.tr/images .... .jpg
How can I parse them like this?

This will probably give you what you ask for, though your question is a bit unclear, so I can't be sure.
public static void main(String args[]) {
    Document doc = null;
    String url = "http://tvrehberi.hurriyet.com.tr";
    try {
        doc = Jsoup.connect(url).get();
    } catch (IOException e1) {
        e1.printStackTrace();
    }
    for (Element e : doc.select("img[src~=(?i)\\.(jpe?g)]")) {
        if (e.attr("src").startsWith("http://tvrehberi.hurriyet.com.tr/images")) {
            System.out.println(e.attr("src"));
        }
    }
}
So, this might not be a very "clean" solution, but the if-statement will make sure it only prints out the image URLs from the /images/ directory on the server.

If I understood correctly, you want to retrieve the URL path up to a certain point and cut off the rest. Do you even have to do that every time?
If you are only using URLs from the one site in your example, you could store "http://tvrehberi.hurriyet.com.tr/images" as a constant since it never changes. If, on the other hand, you fetch URLs from many different sites, you can parse your URL as described here.
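If you do need to derive that prefix from an arbitrary image URL rather than hard-coding it, a minimal sketch using java.net.URL (the class name is made up for illustration, and the URL is the one from the question) could look like this:
import java.net.MalformedURLException;
import java.net.URL;

public class UrlPrefixExample {
    public static void main(String[] args) throws MalformedURLException {
        URL imageUrl = new URL("http://tvrehberi.hurriyet.com.tr/images/742/403742.jpg");

        // Protocol and host give the site root: "http://tvrehberi.hurriyet.com.tr"
        String siteRoot = imageUrl.getProtocol() + "://" + imageUrl.getHost();

        // The path is "/images/742/403742.jpg"; keep only its first segment
        String path = imageUrl.getPath();
        String firstSegment = path.substring(0, path.indexOf('/', 1)); // "/images"

        System.out.println(siteRoot + firstSegment);
    }
}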
Anyway, if you shared the purpose of parsing the URLs, we could certainly help you more.

Related

How do you use Jsoup to search for a string on a website?

I am trying to use Jsoup to search a website to see if it contains a string. Is this even possible, and if so, how is it done?
Yes, it is possible, and it is actually quite easy if you are using Jsoup. To see whether a specific web page contains a specific string, you can do something like the following example.
Let's say we want to see if the following string exists within the Jsoup home page (https://jsoup.org/):
If you have any questions on how to use jsoup
Your code could look something like this:
String stringToFind = "If you have any questions on how to use jsoup";
try {
    Document doc = Jsoup.connect("https://jsoup.org/").get();
    if (doc.text().contains(stringToFind)) {
        System.out.println("Yes...String exists in web-page.");
    }
    else {
        System.out.println("No...String does not exist in web-page.");
    }
}
catch (IOException ex) {
    // Do whatever you like to handle the exception...
}

Unable to fetch the text from the div, although I am getting that content in the response

I am unable to fetch the text from the div using the code below, and I'm not sure what the issue is:
try {
    connection = new URL("https://en-gb.facebook.com/Aeysunnna/videos/265359950651320/")
            .openConnection();
    Scanner scanner = new Scanner(connection.getInputStream(), "UTF-8");
    scanner.useDelimiter("\\A");
    content = scanner.next();
} catch (Exception ex) {
    ex.printStackTrace();
    System.out.println("error");
}
Document doc = Jsoup.parse(content);
Thread.sleep(10000);
Elements elements = doc.select("div._1t6k");
System.out.println(elements.text());
It is probably because your Java client is not logged in to Facebook. If you're not logged in, you will see nothing except the login page, so you cannot read the content of the div; you are still on the start page.
Try printing the whole content you're getting and check whether it is actually the start page.
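As a rough sketch of that check (the "login" marker is only an assumption about what the returned page contains), you could inspect the fetched HTML before selecting the div:
// 'content' is the HTML fetched by the Scanner above
System.out.println(content); // inspect what was actually returned

// Crude, assumed heuristic: the logged-out page typically contains a login form
if (content.toLowerCase().contains("login")) {
    System.out.println("Looks like the login page was returned, not the video page.");
}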
You can find an in-depth explanation here: Finding a word in a web page using java
As shown there, you will probably not find an easy fix for this issue.
Also note that scraping Facebook without using their own API violates their Terms of Service.

JSoup seems to ignore character codes?

I'm building a small CMS-like application in Java that takes a .txt file with shirt names/descriptions and loads them into an ArrayList of customShirts (a small class I made). Then it iterates through the ArrayList and uses JSoup to parse a template (template.html) and insert the unique details of each shirt into the HTML. Finally, it writes each shirt out to its own HTML file in an output folder.
When the descriptions are loaded into the ArrayList of customShirts, I replace all special characters with the appropriate character codes so they can be properly displayed (for example, replacing apostrophes with &rsquo;). The issue is that JSoup seems to automatically turn the character codes back into the actual characters, which is a problem, since I need the output to be displayable (which requires character codes). Is there anything I can do to fix this? I've looked at other workarounds, such as Jsoup unescapes special characters, but they seem to require parsing the file before insertion with replaceAll, and I insert the character-code-sensitive text with JSoup, which doesn't seem to make this an option.
Below is the code for the HTML generator I made:
public void generateShirtHTML() {
    for (int i = 0; i < arrShirts.size(); i++) {
        File input = new File("html/template/template.html");
        Document doc = null;
        try {
            doc = Jsoup.parse(input, "UTF-8", "");
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        Element title = doc.select("title").first();
        title.append(arrShirts.get(i).nameToCapitalized());
        Element headingTitle = doc.select("h1#headingTitle").first();
        headingTitle.html(arrShirts.get(i).nameToCapitalized());
        Element shirtDisplay = doc.select("p#alt1").first();
        shirtDisplay.html(arrShirts.get(i).name);
        Element descriptionBox = doc.select("div#descriptionbox p").first();
        descriptionBox.html(arrShirts.get(i).desc);
        System.out.println(arrShirts.get(i).desc);
        PrintWriter output;
        try {
            output = new PrintWriter("html/output/" + arrShirts.get(i).URL);
            output.println(doc.outerHtml());
            //System.out.println(doc.outerHtml());
            output.close();
        } catch (FileNotFoundException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        System.out.println("Shirt " + i + " HTML generated!");
    }
}
Thanks in advance!
Expanding a little on my comment (since Stephan encouraged me), you can use
doc.outputSettings().escapeMode(Entities.EscapeMode.extended);
To tell Jsoup to escape/encode special characters in the output, e.g. a left double quote (“) as &ldquo;. To make Jsoup encode all special characters, you may also need to add
doc.outputSettings().charset("ASCII");
In order to ensure that all Unicode special characters will be HTML encoded.
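Putting both settings together, a minimal sketch of the relevant part of your loop (reusing your doc and output variables, and requiring an import of org.jsoup.nodes.Entities) might look like this:
// Escape special characters using the extended entity table,
// and force non-ASCII characters to be written as entities
doc.outputSettings()
   .escapeMode(Entities.EscapeMode.extended)
   .charset("ASCII");

output.println(doc.outerHtml()); // entities such as &rsquo; now survive in the output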
For larger projects where you have to fill in data into HTML files, you can look at using a template engine such as Thymeleaf - it's easier to use for this kind of job (less code and such), and it offers many more features specifically for this process. For small projects (like yours), Jsoup is good (I've used it like this in the past), but for bigger (or even small) projects, you'll want to look into some more specialized tools.

Open a link from Java, how to hide GET parameter

I want to open a link from Java. I tried this:
public static void main(String[] args) {
    try {
        // Set your page URL in this string. For example, I'm using a URL for the Google Search engine
        String url = "http://myurl.com?id=xx";
        java.awt.Desktop.getDesktop().browse(java.net.URI.create(url));
    }
    catch (java.io.IOException e) {
        System.out.println(e.getMessage());
    }
}
It works fine, but the problem is that the query string is visible in the URL. I don't want to pass the value as a query string because it is a secret key; it should be passed hidden in the web page request. How can I do this?
You can't, directly.
You'd need to use a POST instead of a GET to hide the value, and a URL does not encode the method used to access it, so it will always use GET.
You could conceivably write out an HTML file that automagically does a POST to the desired URL (using some JavaScript) and open that (using a file:// URL).
But note that "hiding" the parameter like this adds no real security! An interested user who wants to know the value their PC sends to some site will be able to see it. It might take slightly more effort to find, but it's definitely not impossible.
If there is no need to show the particular URL in a browser, then you could handle the link as an HttpURLConnection (see the JavaDoc).
And here you have an example.
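In the same spirit, here is a minimal sketch of sending the secret value in a POST body with HttpURLConnection; the URL and the id parameter are placeholders taken from the question, and this fetches the page in the background rather than opening the user's browser:
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PostExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://myurl.com");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // we are going to write a request body
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

        // The secret travels in the request body instead of the query string
        byte[] body = "id=xx".getBytes(StandardCharsets.UTF_8);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }

        System.out.println("Response code: " + conn.getResponseCode());
        conn.disconnect();
    }
}
Keep in mind the caveat above: the value is no longer visible in the URL, but a determined user can still observe what their machine sends.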

Java program to retrieve page source from Google search automatically [duplicate]

I have 30000 dictionary words. I want to search for each word on Google and find the number of hits for each one using a Java program. Is this possible?
Look up <estimatedTotalResultsCount> using Google's SOAP search API.
You'll be limited to 1000 queries per day though. This limit is removed if you use their AJAX API.
Since your duplicate post is closed, I'll post my answer here as well:
Whether this is possible or not doesn't really matter: Google doesn't want you to do that. They have a public AJAX-search API developers can use: http://code.google.com/apis/ajaxsearch/web.html
Here is a Sun tutorial on Reading from and Writing to a URLConnection.
The simplest URL I can see to make a Google search is like:
http://www.google.com/#q=wombat
Reading from a URL with Java is pretty straightforward. A basic working example is as follows:
public Set<String> readUrl(String address) {
    String line;
    Set<String> lines = new HashSet<String>();
    try {
        URL url = new URL(address);
        URLConnection page = url.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(page.getInputStream()));
        while ((line = in.readLine()) != null) {
            lines.add(line);
        }
        in.close();
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return lines;
}
I'd recommend separating your problem into pieces. Get each one working, then marry them together for the solution you want.
You have a few things going on here:
Downloading text from a URL
Scanning a stream of characters and breaking it up into words
Iterating through a list of words and tallying up the hits from your dictionary (see the sketch below for the last two pieces)
Computer science is all about taking large problems and decomposing them into smaller ones. I'd recommend that you start learning how to do that now.
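As a hedged sketch of those last two pieces, assuming the page text has already been downloaded into a String (the method name and variables are made up for illustration):
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Count how often each dictionary word occurs in the downloaded page text
public Map<String, Integer> tallyHits(String pageText, Set<String> dictionary) {
    Map<String, Integer> hits = new HashMap<String, Integer>();
    // Break the text into words on anything that is not a letter
    for (String word : pageText.toLowerCase().split("[^a-z]+")) {
        if (dictionary.contains(word)) {
            Integer current = hits.get(word);
            hits.put(word, current == null ? 1 : current + 1);
        }
    }
    return hits;
}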
