Why is JSoup printing a question mark - java

I'm trying to understand the following: I have some code that reads a page from gutenberg.org. Almost everything comes through fine, but a few characters do not, even though they display correctly in the browser.
package nl.atticworks.gutenberg;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Gutenberg {

    private static final String GET_URL = "http://www.gutenberg.org/browse/languages/nl";

    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect(GET_URL).get();
            Elements data = doc.select("div.pgdbbylanguage");
            for (Element d : data) {
                Elements children = d.select("*");
                for (Element child : children) {
                    if (child.tagName().equals("ul")) {
                        Element author = children.get(children.indexOf(child) - 1);
                        String a1 = author.select("a:last-child").text();
                        if (a1.startsWith("Kara")) {
                            System.out.println(a1);
                            Elements titles = child.select("li.pgdbetext a");
                            for (Element title : titles) {
                                System.out.println("\t" + title.text());
                            }
                        }
                    }
                }
            }
        } catch (IOException ex) {
            // do something...
        }
    }
}
The string a1 prints "Karadži?, Vuk Stefanovi?, 1787-1864" but should print "Karadžić, Vuk Stefanović, 1787-1864".
I'm fairly sure the page encoding is fine (UTF-8), but the c with acute (ć) isn't coming through properly.
Still, browsers show the correct character and my Jsoup program doesn't. Why?
Regards,
Hans

As you haven't said what you are running your program in, it is difficult to give a definitive answer, but basically there is nothing wrong with your code. Jsoup is not responsible for your display problem; whichever console you are printing to is replacing the characters it cannot encode with ?.
If you set your console (or IDE) output encoding to UTF-8, the text should display correctly.
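For example, you can force the standard output stream itself to encode as UTF-8, regardless of the platform default. This is only a minimal sketch; whether the characters actually render still depends on the terminal or IDE console being set to UTF-8 and using a font that has the glyphs:

import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class Utf8Out {
    public static void main(String[] args) throws Exception {
        // Wrap System.out in a PrintStream that encodes output as UTF-8.
        PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8.name());
        out.println("Karadžić, Vuk Stefanović, 1787-1864");
    }
}

Alternatively, starting the JVM with -Dfile.encoding=UTF-8 changes the default charset used by System.out, which is often enough when running inside an IDE.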

I tried this code in my own IntelliJ IDEA and the output was just as you expected.
So I also think the console encoding is the problem.

Related

Is there a difference between HttpAsyncClient and multithreaded Jsoup connection class?

I'm implementing a web scraper in Java. After playing a little with the websites I'm going to crawl, I want to follow best practice for concurrent HTTP connections in Java. I'm currently using Jsoup's connection method. I'd like to know whether it's possible to create threads and make connections inside those threads, similarly to HttpAsyncClient.
Jsoup does not use HttpAsyncClient. Jsoup's Jsoup.connect(String url) method uses the blocking URL.openConnection() method under the hood.
If you want to use Jsoup asynchronously, you can parallelize all the Jsoup.connect() executions yourself. In Java 8 you can use a parallel stream to do so. Let's say you have a list of URLs you want to scrape in parallel. Take a look at the following example:
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;

public class ConcurrentJsoupExample {

    public static void main(String[] args) throws ExecutionException, InterruptedException {
        final List<String> urls = Arrays.asList(
                "https://google.com",
                "https://stackoverflow.com/questions/48298219/is-there-a-difference-between-httpasyncclient-and-multithreaded-jsoup-connection",
                "https://mvnrepository.com/artifact/org.jsoup/jsoup",
                "https://docs.oracle.com/javase/7/docs/api/java/net/URL.html#openConnection()",
                "https://docs.oracle.com/javase/7/docs/api/java/net/URLConnection.html"
        );

        final List<String> titles = urls.parallelStream()
                .map(url -> {
                    try {
                        return Jsoup.connect(url).get();
                    } catch (IOException e) {
                        return null;
                    }
                })
                .filter(Objects::nonNull)
                .map(doc -> doc.select("title"))
                .map(Elements::text)
                .peek(it -> System.out.println(Thread.currentThread().getName() + ": " + it))
                .collect(Collectors.toList());
    }
}
Here we have 5 URLs defined, and the goal of this simple application is to get the text value of the <title> HTML tag from each of these websites. We create a parallel stream from the list of URLs and map each URL to Jsoup's Document object. The .get() method throws a checked exception, so we have to try-catch it, and if an exception occurs we return null. All null values get filtered out by .filter(Objects::nonNull), and after that we can extract the elements we need - the text value of the <title> tag in this case. I also added .peek(), which prints the extracted value and the name of the thread it runs on. Exemplary output may look like this:
ForkJoinPool.commonPool-worker-1: java - Is there a difference between HttpAsyncClient and multithreaded Jsoup connection class? - Stack Overflow
main: Maven Repository: org.jsoup » jsoup
ForkJoinPool.commonPool-worker-4: URL (Java Platform SE 7 )
ForkJoinPool.commonPool-worker-2: URLConnection (Java Platform SE 7 )
ForkJoinPool.commonPool-worker-3: Google
In the end we call .collect(Collectors.toList()) to terminate the stream, execute all transformations, and return a list of titles.
It is just a simple example, but it should give you a hint of how to use Jsoup in parallel.
Alternatively, you can use urls.parallelStream().forEach() if the functional-style approach does not convince you:
urls.parallelStream().forEach(url -> {
    try {
        final Document doc = Jsoup.connect(url).get();
        final String title = doc.select("title").text();
        System.out.println(Thread.currentThread().getName() + ": " + title);
        // do something with extracted title...
    } catch (IOException e) {
        e.printStackTrace();
    }
});
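Note that parallel streams run on the common ForkJoinPool, whose size is tied to the number of CPU cores rather than to how many connections the remote servers can comfortably handle. If you want explicit control over the number of concurrent connections, a plain ExecutorService works as well. Here is a minimal sketch of that idea - the pool size of 4 and the URL list are just placeholders:

import org.jsoup.Jsoup;

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ExecutorJsoupExample {

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "https://google.com",
                "https://mvnrepository.com/artifact/org.jsoup/jsoup"
        );

        // Fixed-size pool: at most 4 Jsoup connections run at the same time.
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (String url : urls) {
            pool.submit(() -> {
                try {
                    String title = Jsoup.connect(url).get().select("title").text();
                    System.out.println(Thread.currentThread().getName() + ": " + title);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }

        pool.shutdown(); // stop accepting new tasks, let submitted ones finish
    }
}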

Trying to use jSoup to scrape data from a table

First-time poster and fairly new coder, so please go easy on me. I'm trying to use jSoup to scrape data from a table. However, I'm having a couple of problems:
1) I'm using NetBeans. I get a "stop" error on Line 30 (Elements tds...) that says cannot find symbol: method getElementsByTag. I'm confused because I thought I imported the correct package, and I use the same method a couple of lines above with no error.
2) When I run the code, I get an error that says:
Exception in thread "main" java.lang.NullPointerException
at mytest.JsoupTest1.main(JsoupTest1.java:26)
I thought this means that a variable with a value of NULL is being used. Did I incorrectly declare the "row" variable used in my for loop?
Here's my code. I truly appreciate any help!
package mytest;

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupTest1 {

    private static Object row;

    public static void main(String[] args) {
        Document doc = null;
        try {
            doc = Jsoup.connect( "http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2015&month=0&season1=2015&ind=0&team=18&rost=0&age=0&filter=&players=0" ).get();
        }
        catch (IOException ioe) {
            ioe.printStackTrace();
        }

        Element table = doc.getElementById( "LeaderBoard1_dg1_ct100" );
        Elements rows = table.getElementsByTag( "tr" );

        for( Element row:rows ) {
        }

        Elements tds = row.getElementsByTag( "td" );
        for( int i=0; i < tds.size(); i++ ) {
            System.out.println(tds.get(i).text());
        }
    }
}
Welcome to StackOverflow.
This works.
Document doc = null;
try {
    doc = Jsoup
            .connect(
                    "http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2015&month=0&season1=2015&ind=0&team=18&rost=0&age=0&filter=&players=0")
            .get();
}
catch (IOException ioe) {
    ioe.printStackTrace();
}

Element table = doc.getElementById("LeaderBoard1_dg1_ctl00");
Elements rows = table.getElementsByTag("tr");

for (Element row : rows) {
    Elements tds = row.getElementsByTag("td");
    for (int i = 0; i < tds.size(); i++) {
        System.out.println(tds.get(i).text());
    }
}
There are three problems with your code.
The id you are using is wrong. Instead of LeaderBoard1_dg1_ct100 use LeaderBoard1_dg1_ctl00. You mistook the lowercase l for a 1.
The second problem is the static Object row field. There is no need for it; remove it.
You had the processing of the table cells outside of the loop over the rows. And because the Object row field existed, there were no compilation errors, which hid the problem.
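One more defensive tweak worth considering: because the connect call is wrapped in its own try/catch, doc stays null if the request fails, and getElementById returns null if the id doesn't match, either of which produces exactly the NullPointerException you saw. A small sketch of guarding against both (same URL and id as above):

Document doc;
try {
    doc = Jsoup.connect("http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2015&month=0&season1=2015&ind=0&team=18&rost=0&age=0&filter=&players=0").get();
} catch (IOException ioe) {
    ioe.printStackTrace();
    return; // no page, nothing to parse
}

Element table = doc.getElementById("LeaderBoard1_dg1_ctl00");
if (table == null) {
    System.err.println("Table not found - check the element id");
    return;
}

for (Element row : table.getElementsByTag("tr")) {
    for (Element td : row.getElementsByTag("td")) {
        System.out.println(td.text());
    }
}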

jsoup: How to extract correct data from this website

I am trying to extract data from a Spanish dictionary using jsoup. Essentially, the user will input words he wants to define as command line arguments and the program will return a formatted list of definitions. Here is what I have done so far:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Main {

    public static void main(String[] args) {
        String[] urls = new String[args.length];
        for (int i = 0; i < args.length; i++) {
            urls[i] = "http://www.diccionarios.com/detalle.php?palabra="
                    + args[i]
                    + "&Buscar.x=0&Buscar.y=0&Buscar=submit&dicc_100=on&dicc_100=on";
            try {
                Document doc = Jsoup.connect(urls[i]).get();
                Elements htmly = doc.getElementsByTag("html");
                String untokenized = htmly.text();
                System.out.println(untokenized);
            } catch (Exception e) {
                System.out.println("EXCEPTION: Word is probably not in this dictionary.");
            }
        }
    }
}
That url array builds the correct URLs where the information for the definition lives.
Now, what I'm expecting to be returned is what you would get if you went to the try.jsoup website, used (for example) this link: http://www.diccionarios.com/detalle.php?palabra=libro&Buscar.x=0&Buscar.y=0&Buscar=submit&dicc_100=on&dicc_100=on
and typed in html as the CSS query. I need that data so I can tokenize the definition from it.
So I guess my question is: what method would I use to obtain the same data that you can see on the try.jsoup website? Thanks a lot!
Edit: This is about interpreting the data from the URL. The end result I want (in this example) is "Conjunto de hojas escritas unidas o cosidas por uno de sus lados y cubiertas por tapas de cartón u otro material." That is the definition shown on the website. However, I noticed on the try.jsoup website that if I put html in the CSS query box, the result was a huge bunch of text. My assumption was that the following two lines of code would capture this huge bunch of text and save it as a string:
Elements htmly = doc.getElementsByTag("html");
String untokenized = htmly.text();
However, the output when I print untokenized is instead this: "Usuario Clave ¿Olvidaste tu clave? Condiciones Privacidad Versión completa © 2011 Larousse Editorial, SL." So my question is, how do I obtain the string data for that huge bunch of text found on the try.jsoup website?
EDIT: I followed the advice of the question here: Jsoup - CSS Query selector issue (?) and it worked great.
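The exact fix from that linked question isn't reproduced in this thread, but the usual Jsoup approach is to select only the element that actually wraps the definition with a CSS query, instead of taking the text of the whole html tag. A minimal sketch - the selector div.definicion is purely a hypothetical placeholder; the real one has to be found by inspecting the page in the browser's developer tools:

String url = "http://www.diccionarios.com/detalle.php?palabra=libro"
        + "&Buscar.x=0&Buscar.y=0&Buscar=submit&dicc_100=on&dicc_100=on";
Document doc = Jsoup.connect(url).get();

// Hypothetical selector: replace "div.definicion" with whatever element
// actually contains the definition text on the page.
Elements definition = doc.select("div.definicion");
System.out.println(definition.text());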

How to get specific lines in a text file

I am trying to read the contents of a text file. The idea is to get the first line with the 'title :' keyword, keep reading the file, get the next 'title :' keyword, and continue until the whole file is read. I am trying to store the results in a database. Other ideas for doing this are welcome as well. Thanks.
This is the text file I am trying to read from.
title : Mothers Day
mattiebelle : YEA! A movie that grabbed me from beginning to end! Love to come across this kind of movie. A must see for all! Enjoy!
title : Pregnant in Heels
CuittePie : I CAN'T WATCH ANY OF THEES. :#
title : The Flintstones
Row_Sweet_Girl : Nice one to watch
title : Barter Kings
dragon3476 : Barter Kings - Season 1 Episode 4 - Rock and a Hard Place Air Date: 19/06/2012 Summary:Traders barter for a car and a pool table.
I think the easiest way would be using FileUtils from Apache Commons IO like this:
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.commons.io.FileUtils;

public class ReadFileLines {

    public static void main(final String[] args) throws IOException {
        List<String> lines = FileUtils.readLines(new File("/tmp/myFile.txt"), "UTF-8");
        for (String line : lines) {
            if (line.startsWith("title : ")) {
                System.out.println(line); // here you store it in the database
            }
        }
    }
}
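If you'd rather avoid the Commons IO dependency, the same thing can be done with the JDK alone. Here is a small sketch that also pairs each title line with the comment lines that follow it, so both pieces are ready to store together; the file path is a placeholder and the actual database insert is left as a comment:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ReadTitles {

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/tmp/myFile.txt"), StandardCharsets.UTF_8);

        String currentTitle = null;
        for (String line : lines) {
            if (line.startsWith("title : ")) {
                // Start of a new entry: remember the title text.
                currentTitle = line.substring("title : ".length()).trim();
            } else if (currentTitle != null && !line.trim().isEmpty()) {
                // This line is a comment belonging to currentTitle;
                // insert the (title, comment) pair into the database here.
                System.out.println(currentTitle + " -> " + line);
            }
        }
    }
}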

How to extract the data from a list?

I'm developing an Android application and I want to recognize hashtags, mentions and links in a piece of text. I have Objective-C code that does what I want, and after asking about it I now have this Java code:
import java.net.MalformedURLException;
import java.net.URL;
import java.util.List;

String input = /* text from edit text */;
String[] words = input.split("\\s");
List<URL> urls = null;
for (String s : words) {
    try {
        urls.add(new URL(s));
    } catch (MalformedURLException e) {
        // not a url
    }
}
Now I want to put these into a tweet. I have already written the code to post the tweet, and the tweet is built from a string. My question is: how do I put the data from the list into the string?
// I tested this
String tweet = "Using my app" + urls;
But in the tweet appears "Using my appnull".
And how do I reuse this code to recognize hashtags and mentions? I think it means replacing the input.split("\\s") pattern with something like "#\\s" or "@\\s".
You could just use a library here: https://github.com/twitter/twitter-text-java - it does what you're trying to do.
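As for the "Using my appnull" output: the urls list in your code is never initialized (it is declared as null), so nothing is ever collected and concatenating the list prints "null"; it needs to be created with new ArrayList<>() and then joined into the tweet text. With the twitter-text library the extraction itself becomes much simpler. A rough sketch using its Extractor class - the class lived in the com.twitter package in older releases, so check the package name for the version you actually pull in:

import com.twitter.Extractor;

import java.util.List;

public class TweetEntities {

    public static void main(String[] args) {
        String input = "Check out #jsoup with @bob at https://jsoup.org";

        Extractor extractor = new Extractor();

        List<String> hashtags = extractor.extractHashtags(input);             // [jsoup]
        List<String> mentions = extractor.extractMentionedScreennames(input); // [bob]
        List<String> urls = extractor.extractURLs(input);                     // [https://jsoup.org]

        // Join the extracted URLs into the tweet text instead of
        // concatenating the list object directly.
        String tweet = "Using my app " + String.join(" ", urls);
        System.out.println(tweet);
        System.out.println(hashtags + " " + mentions);
    }
}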
