I'm implementing a web scraper in Java. After experimenting a little with the websites I'm going to crawl, I want to follow best practices for concurrent HTTP connections in Java. I'm currently using Jsoup's connect method. I'd like to know whether it's possible to create threads and make connections inside those threads, similarly to HttpAsyncClient.
Jsoup does not use HttpAsyncClient. Jsoup's Jsoup.connect(String url) method uses the blocking URL.openConnection() method.
If you want to use Jsoup concurrently, you can run all the Jsoup.connect() calls in parallel. In Java 8 you can use a parallel stream to do so. Let's say you have a list of URLs you want to scrape in parallel. Take a look at the following example:
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;
public class ConcurrentJsoupExample {
public static void main(String[] args) throws ExecutionException, InterruptedException {
final List<String> urls = Arrays.asList(
"https://google.com",
"https://stackoverflow.com/questions/48298219/is-there-a-difference-between-httpasyncclient-and-multithreaded-jsoup-connection",
"https://mvnrepository.com/artifact/org.jsoup/jsoup",
"https://docs.oracle.com/javase/7/docs/api/java/net/URL.html#openConnection()",
"https://docs.oracle.com/javase/7/docs/api/java/net/URLConnection.html"
);
final List<String> titles = urls.parallelStream()
.map(url -> {
try {
return Jsoup.connect(url).get();
} catch (IOException e) {
return null;
}
})
.filter(Objects::nonNull)
.map(doc -> doc.select("title"))
.map(Elements::text)
.peek(it -> System.out.println(Thread.currentThread().getName() + ": " + it))
.collect(Collectors.toList());
}
}
Here we have 5 URLs defined, and the goal of this simple application is to get the text of the <title> HTML tag from each website. We create a parallel stream from the list of URLs and map each URL to Jsoup's Document object. The .get() method throws a checked exception, so we have to wrap it in try-catch, and if an exception occurs we return null. All null values get filtered out by .filter(Objects::nonNull), and after that we can extract the elements we need - the text of the <title> tag in this case. I also added .peek(), which prints each extracted value and the name of the thread it runs on. Example output may look like this:
ForkJoinPool.commonPool-worker-1: java - Is there a difference between HttpAsyncClient and multithreaded Jsoup connection class? - Stack Overflow
main: Maven Repository: org.jsoup » jsoup
ForkJoinPool.commonPool-worker-4: URL (Java Platform SE 7 )
ForkJoinPool.commonPool-worker-2: URLConnection (Java Platform SE 7 )
ForkJoinPool.commonPool-worker-3: Google
In the end we call .collect(Collectors.toList()) to terminate the stream, execute all transformations and return a list of titles.
It is just a simple example, but it should give you a hint of how to use Jsoup in parallel.
Alternatively, you can use urls.parallelStream().forEach() if the functional approach does not convince you:
urls.parallelStream().forEach(url -> {
try {
final Document doc = Jsoup.connect(url).get();
final String title = doc.select("title").text();
System.out.println(Thread.currentThread().getName() + ": " + title);
// do something with extracted title...
} catch (IOException e) {
e.printStackTrace();
}
});
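If you would rather control the number of threads yourself than rely on the common fork-join pool that parallel streams use, a fixed-size ExecutorService is one option. The sketch below is an illustration rather than part of the approach above: FixedPoolFetcher, fetchAll and the fetcher parameter are hypothetical names, and the fetcher is passed in as a function so the example does not depend on Jsoup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class FixedPoolFetcher {

    /**
     * Applies fetcher to every URL on a pool of poolSize threads and returns
     * the results in the same order as the input list.
     * A fetch that throws is reported as null.
     */
    public static List<String> fetchAll(List<String> urls,
                                        Function<String, String> fetcher,
                                        int poolSize) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetcher.apply(url));
            }
            List<String> results = new ArrayList<>();
            // invokeAll blocks until every task has completed
            for (Future<String> future : pool.invokeAll(tasks)) {
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    results.add(null); // the fetcher threw for this URL
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

With Jsoup on the classpath, the fetcher could be a lambda that returns Jsoup.connect(url).get().title() and rethrows IOException as UncheckedIOException. The pool size caps how many connections are open at once, which is worth keeping small when all URLs point at the same host.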
I am currently having issues with my programmable thermostat application. I run the app but do not get the required data back in a format that is acceptable. These are the directions to my assignment and the code I use.
Directions
Open your ProgrammableThermostat project in NetBeans, then write Java classes that employ standard Java coding conventions for building the display driver specified in the u03a1 User Story:
Part 1 – Create Classes and Build the Display Driver
Do the following:
Your code will need to read temperature data stored in JSON format using a web service to access the JSON information stored on a Capella web server. You will create a web query to retrieve the data. The result of the query will be returned as JSON data.
Develop Java networking classes/methods that read the temperature data from the web service.
Develop Java classes/methods that convert the raw temperature data into Java objects.
Build a driver to display the temperature data on the Java Console. Note the following:
You do not need to create a GUI; your data should be displayed in the NetBeans Java Console.
https://courserooma.capella.edu/bbcswebdav/institution/IT/IT4774/180100/Course_Files/cf_u03a1_user_story.docx
http://media.capella.edu/BBCourse_Production/IT4774/temperature.json
I already have the json-simple 1.1.1 and the org.json package.
My code:
package programmable.thermostat;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.ProtocolException;
import java.net.URL;
import java.util.Scanner;
import org.json.JSONArray;
import org.json.JSONObject;
import org.json.simple.parser.JSONParser;
public class ProgrammableThermostat {
public static void main(String[] args) throws MalformedURLException, ProtocolException, IOException, org.json.simple.parser.ParseException{
//Need to string info to initialize variable for later use
String info = null;
//Here we have the http of the site that was given to us
URL url = new URL("http://media.capella.edu/BBCourse_Production/IT4774/temperature.json");
//Forms an http connection to the URL called conn and opens it
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
//sets get request type
conn.setRequestMethod("GET");
//Opens connection to API
conn.connect();
//Gets the response code for format
int responseCode = conn.getResponseCode();
//if response code not 200 throws runtime exception
if(responseCode != 200)
throw new RuntimeException("HttpResponseCode: " +responseCode);
/* else the scanner opens the URL stream and reads in the code line by line*/
else
{
Scanner sc = new Scanner(url.openStream());
while(sc.hasNext()) {
info += sc.nextLine();
}
System.out.println("JSON data in string format!");
System.out.println(info);
sc.close();
}
JSONParser parse = new JSONParser();
JSONObject object = (JSONObject)parse.parse(info);
JSONArray thermostatInfo = (JSONArray)object.get("results");
//get data for results array
for(int i = 0; i < thermostatInfo.length(); i++)
{
JSONObject jsonObject_1 = (JSONObject)thermostatInfo.get(i);
System.out.println("Items in result array!");
System.out.println("Identifier: " + jsonObject_1.get("identifier"));
System.out.println("\nName: " + jsonObject_1.get("name"));
System.out.println("\nThermostat Time: " + jsonObject_1.get("thermostatTime"));
System.out.println("\nutcTime: " + jsonObject_1.get("utcTime"));
System.out.println("\nRuntime: " + jsonObject_1.get("runtime"));
System.out.println("\nStatus: " + jsonObject_1.get("status"));
}
}
}
There are two expected results for this project. The first is to read each line of the JSON data and parse it into separate values for display. The second is to display the data nicely on the screen.
The output of my application is:
JSON data in string format!
null{ "identifier": "318324702718", "name": "ProgrammableThermostat", "thermostatTime": "2015-02-11 15:58:03", "utcTime": "2015-02-11 20:58:03", "runtime": { "actualTemperature": 711, "actualHumidity": 42, }, "status": { "code": 0, "message": "" }}
Exception in thread "main" Unexpected token LEFT BRACE({) at position 4.
at org.json.simple.parser.JSONParser.parse(JSONParser.java:146)
at org.json.simple.parser.JSONParser.parse(JSONParser.java:81)
at org.json.simple.parser.JSONParser.parse(JSONParser.java:75)
at programmable.thermostat.ProgrammableThermostat.main(ProgrammableThermostat.java:52)
C:\Users\Deb\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 4 seconds)
Any help on this would be greatly appreciated! Thank you guys in advance.
Apart from doing everything in the main() method, which I guess your assignment wants you to avoid, your problem isn't in your code.
It is with the web service itself. If you open the URL you use in your browser (I did it with Firefox), you can see the same kind of error that your Java console is reporting: the JSON is not well formed.
I hope this can help you!
You're initializing your info variable to null. When you start appending the JSON to the variable, you get null{json stuff here}, which is why the parser fails at position 4. Initialize it to the empty string instead:
String info = "";
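The same fix can be written with a StringBuilder, which starts out empty and avoids repeated string concatenation in the loop. This is a minimal sketch: StreamToString and readAll are hypothetical names, and the ByteArrayInputStream in main is a made-up stand-in for the stream of the HTTP connection.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class StreamToString {

    /** Reads the whole stream as UTF-8 text, dropping line terminators. */
    public static String readAll(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder(); // starts empty, not null
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the connection's input stream; the payload is made up.
        InputStream fake = new ByteArrayInputStream(
                "{ \"name\":\n\"ProgrammableThermostat\" }".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(fake));
    }
}
```

In the question's code, the stream could come from conn.getInputStream() rather than url.openStream(), since the latter opens a second connection to the same URL.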
I'm trying to write a small Java program, with a simple UI, that lets you use some of Google's search operators to refine your search.
I have 2 text fields (one for the site and one for the keywords) and 2 date pickers to let the user select the date range for the search results.
When I press the search button it connects to the following URL:
"https://www.google.it/search?q=" + site + Keywords + daterange
site = "site:SITE_MAIN_URL"
keywords are the keywords I am looking for
daterange = "daterange:JULIAN_DATE_1 - JULIAN_DATE_2"
After all this I fetch the first 10 results, but here's the problem...
If I select no dates, I can easily fetch the links.
If I set the daterange, I get an HTTP 503 error, the one for "service unavailable" (if I paste the generated URL into my web browser, everything works fine).
(The User-Agent is set to Mozilla/5.0.)
EDIT: didn't post any code :P
//here i generate the site
site = "site:" + website_field.getText();
//here i convert the dates using a class found on the net
d1 = (int) DateLabelFormatter.dateToJulian(date1);
d2 = (int) DateLabelFormatter.dateToJulian(date2);
daterange += "+daterange:" + d1 + "-" + d2;
//here i generate the keywords
keywords = keyword_field.getText();
String[] keyword = keywords.split(" ");
for (int i = 0; i < keyword.length; i++) {
tempKeyword += "+" + keyword[i];
}
//the query
query = "https://www.google.it/search?q=" + site + tempKeyword + daterange;
//the connection (wrapped in a try-catch)
Document jSoupDoc = Jsoup.connect(query).userAgent("Mozilla/5.0").timeout(5000).get();
//fetching the links
Elements links = jSoupDoc.select("a[href]");
Element link;
for (int i = 0; i < links.size(); i++) {
link = links.get(i);
String temp = link.attr("href");
// filtering the first 10 google links
if (temp.contains("url")) //donothing
if (temp.contains("webcache")) { //donothing
} else {
String[] splitTemp = temp.split("=");
String[] splitTemp2 = splitTemp[1].split("&sa");
System.out.println(splitTemp2[0]);
}
}
After executing all this (NotSoWellWritten) code, if I select no dates and use just the "site" and the "keywords", I can see on the console the first 10 results found on the Google search page.
If I select a daterange from the date pickers, I get the 503 error.
If you want to try a working query, here's one that searches facebook.com for the keyword "dog" from the 1st to the 15th of November, generated with this "tool":
https://www.google.it/search?q=site:facebook.com+dog+daterange:2457328-2457342
I have no problems using the following code:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Main
{
public static void main(String[] args) throws IOException
{
// the connection (wrapped in a try-catch)
Document jSoupDoc = Jsoup.connect("https://www.google.it/search?q=site:facebook.com+dog+daterange:2457328-2457342").userAgent("Mozilla/5.0").timeout(5000).get();
// fetching the links
Elements links = jSoupDoc.select("a[href]");
Element link;
for (int i = 0; i < links.size(); i++)
{
link = links.get(i);
String temp = link.attr("href");
// filtering the first 10 google links
if (temp.contains("url") && !temp.contains("webcache"))
{
String[] splitTemp = temp.split("=");
String[] splitTemp2 = splitTemp[1].split("&sa");
System.out.println(splitTemp2[0]);
}
}
}
}
The code gives this as output on my computer:
https://www.facebook.com/uniladmag/videos/1912071728815877/
https://it-it.facebook.com/DogEvolutionAsd
https://it-it.facebook.com/DylanDogSergioBonelliEditore
https://www.facebook.com/DelawareCountyDogShelter/
https://www.facebook.com/LostDogAlert/
https://it-it.facebook.com/pages/Toelettatura-Vanity-DOG/270854126382923
https://it-it.facebook.com/washdogsgm
https://www.facebook.com/thedailystar/videos/1193933410623520/
https://www.facebook.com/OakhurstDogPark/
https://www.facebook.com/bigdogdinerco/
A 503 error usually means that the web server is having temporary issues. Specifically:
503: The Web server (running the Web site) is currently unable to handle the HTTP request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay.
If this code works but your original code still does not, then your code is not generating the URL you posted and you should investigate further.
Besides the coding style, I don't see any functional problems with the provided code, and it returns the results correctly (I tested it locally). The problem might lie in dateToJulian: I don't know what it returns or how the result is cast to int (information may be lost in the cast).
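One way to rule out the date conversion is to compare its output against the JDK. This is a minimal sketch, assuming Java 8 or later; JulianDayCheck and toJulianDay are hypothetical names, and the year 2015 is an assumption (the question gives no year; 2015 is simply the year for which the numbers in the sample query match).

```java
import java.time.LocalDate;
import java.time.temporal.JulianFields;

public class JulianDayCheck {

    /** Julian Day Number of the given calendar date. */
    public static long toJulianDay(LocalDate date) {
        return date.getLong(JulianFields.JULIAN_DAY);
    }

    public static void main(String[] args) {
        // Assumed year: 2015, inferred from the numbers in the sample query.
        System.out.println(toJulianDay(LocalDate.of(2015, 11, 1)));  // 2457328
        System.out.println(toJulianDay(LocalDate.of(2015, 11, 15))); // 2457342
    }
}
```

If the values printed by dateToJulian for the same dates differ from these, the daterange part of the generated URL is the first thing to inspect.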
Also, consider the case in which the keywords contain dangerous characters and they are unescaped. They should be sanitized beforehand.
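A sketch of what escaping could look like, using only the JDK's URLEncoder. SearchUrlBuilder and build are hypothetical names; the base URL and the site/keywords/daterange layout are taken from the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchUrlBuilder {

    private static final String BASE = "https://www.google.it/search?q=";

    /** Builds the search URL with the whole query value percent-encoded. */
    public static String build(String site, String keywords,
                               long julianFrom, long julianTo) {
        // Assemble the raw query first, then encode it in one go so that
        // characters such as '&', '#' or '+' in the keywords cannot break the URL.
        String raw = "site:" + site + " " + keywords.trim()
                + " daterange:" + julianFrom + "-" + julianTo;
        try {
            return BASE + URLEncoder.encode(raw, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("facebook.com", "dog & cat", 2457328, 2457342));
    }
}
```

URLEncoder turns spaces into '+' and the colons into %3A, so the result looks different from the hand-built query even though it decodes to the same text.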
Another possibility is that Google is rejecting your queries if you are sending too many too fast. If this was done using a visual browser, you'd get a "We want to make sure you're not a robot." and a CAPTCHA page. That is why I'd recommend leveraging the Google API for your searches. See this SO for more info: How can you search Google Programmatically Java API
I'm developing an Android application and I want to recognize hashtags, mentions and links. I have code that does this in Objective-C. I asked about it and now I have this code:
import java.net.URL;
import java.util.List;
String input = /* text from edit text */;
String[] words = input.split("\\s");
List<URL> urls=null;
for (String s : words){
try
{
urls.add(new URL(s));
}
catch (MalformedURLException e) {
// not a url
}
}
Now I want to put these in a tweet. I have already written the code to post it, and the tweet is built from a string. My question is: how do I put the data from the list into the string?
//I test these
String tweet="Using my app"+urls
But the tweet shows "Using my appnull".
How can I reuse this code to recognize hashtags and mentions?
I think it is a matter of changing input.split("\\s") to "#\\s" or "#\\s".
You could just use a library here:
https://github.com/twitter/twitter-text-java
that does what you're trying to do.
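If pulling in a library is not an option, a rough approximation is possible with java.util.regex. The sketch below is a deliberate simplification, not Twitter's actual entity rules: TweetEntities and find are hypothetical names, and a hashtag or mention is assumed to be '#' or '@' followed by word characters. It also shows String.join as one way to turn a list back into text for the tweet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TweetEntities {

    // '#' or '@' not preceded by a word character, followed by word characters.
    // The lookbehind keeps e-mail addresses such as user@example.com out.
    private static final Pattern TAG = Pattern.compile("(?<!\\w)([#@])(\\w+)");

    /** Returns the names that follow the given marker ('#' or '@'). */
    public static List<String> find(String input, char marker) {
        List<String> found = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            if (m.group(1).charAt(0) == marker) {
                found.add(m.group(2));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Trying #java on #android with @alice http://example.com";
        List<String> hashtags = find(input, '#');
        List<String> mentions = find(input, '@');
        // String.join concatenates the list elements with a separator
        System.out.println("Using my app " + String.join(" ", hashtags));
        System.out.println(mentions);
    }
}
```

The list in the question must also be created (for example with new ArrayList<>()) before anything is added to it; concatenating a null reference is what produces the text "null".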
I am looking for a SQL Library that will parse an SQL statement and return some sort of Object representation of the SQL statement. My main objective is actually to be able to parse the SQL statement and retrieve the list of table names present in the SQL statement (including subqueries, joins and unions).
I am looking for a free library with a license business friendly (e.g. Apache license). I am looking for a library and not for an SQL Grammar. I do not want to build my own parser.
The best I could find so far was JSQLParser, and the example they give is actually pretty close to what I am looking for. However it fails parsing too many good queries (DB2 Database) and I'm hoping to find a more reliable library.
I doubt you'll find anything prewritten that you can just use. The problem is that ISO/ANSI SQL is a very complicated grammar — something like more than 600 production rules IIRC.
Terence Parr's ANTLR parser generator (Java, but can generate parsers in any one of a number of target languages) has several SQL grammars available, including a couple for PL/SQL, one for a SQL Server SELECT statement, one for MySQL, and one for ISO SQL.
No idea how complete/correct/up-to-date they are.
http://www.antlr.org/grammar/list
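To illustrate why a real grammar matters, here is a deliberately naive regex sketch that only looks at the identifier directly after FROM or JOIN. NaiveTableNames and tables are hypothetical names, and this is not how ANTLR or any of the libraries mentioned here work. It copes with simple joins, unions and subqueries, but it misses comma-separated FROM lists and quoted identifiers, and it is fooled by keywords inside string literals or comments.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveTableNames {

    // An identifier, optionally schema-qualified, directly after FROM or JOIN.
    private static final Pattern AFTER_FROM_OR_JOIN = Pattern.compile(
            "\\b(?:FROM|JOIN)\\s+([A-Za-z_][A-Za-z0-9_]*(?:\\.[A-Za-z_][A-Za-z0-9_]*)?)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the distinct table names in order of first appearance. */
    public static Set<String> tables(String sql) {
        Set<String> names = new LinkedHashSet<>();
        Matcher m = AFTER_FROM_OR_JOIN.matcher(sql);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String sql = "SELECT a FROM t1 JOIN s.t2 ON t1.id = s.t2.id "
                + "UNION SELECT b FROM (SELECT c FROM t3) x";
        System.out.println(tables(sql));
    }
}
```

Every limitation listed above is a case a grammar-based parser handles for free, which is the argument for using one.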
You needn't reinvent the wheel: such a reliable SQL parser library already exists (it's commercial, not free), and this article shows how to retrieve the list of table names present in an SQL statement (including subqueries, joins and unions), which is exactly what you are looking for.
http://www.dpriver.com/blog/list-of-demos-illustrate-how-to-use-general-sql-parser/get-columns-and-tables-in-sql-script/
This SQL parser library supports Oracle, SQL Server, DB2, MySQL, Teradata and ACCESS.
You need the ultra light, ultra fast library to extract table names from SQL (Disclaimer: I am the owner)
Just add the following in your pom
<dependency>
<groupId>com.github.mnadeem</groupId>
<artifactId>sql-table-name-parser</artifactId>
<version>0.0.1</version>
</dependency>
And do the following
new TableNameParser(sql).tables()
For more details, refer to the project
Old question, but I think this project contains what you need:
Data Tools Project - SQL Development Tools
Here's the documentation for the SQL Query Parser.
Also, here's a small sample program. I'm no Java programmer so use with care.
package org.lala;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.util.Iterator;
import java.util.List;
import org.eclipse.datatools.modelbase.sql.query.QuerySelectStatement;
import org.eclipse.datatools.modelbase.sql.query.QueryStatement;
import org.eclipse.datatools.modelbase.sql.query.TableReference;
import org.eclipse.datatools.modelbase.sql.query.ValueExpressionColumn;
import org.eclipse.datatools.modelbase.sql.query.helper.StatementHelper;
import org.eclipse.datatools.sqltools.parsers.sql.SQLParseErrorInfo;
import org.eclipse.datatools.sqltools.parsers.sql.SQLParserException;
import org.eclipse.datatools.sqltools.parsers.sql.SQLParserInternalException;
import org.eclipse.datatools.sqltools.parsers.sql.query.SQLQueryParseResult;
import org.eclipse.datatools.sqltools.parsers.sql.query.SQLQueryParserManager;
import org.eclipse.datatools.sqltools.parsers.sql.query.SQLQueryParserManagerProvider;
public class SQLTest {
private static String readFile(String path) throws IOException {
FileInputStream stream = new FileInputStream(new File(path));
try {
FileChannel fc = stream.getChannel();
MappedByteBuffer bb = fc.map(FileChannel.MapMode.READ_ONLY, 0,
fc.size());
/* Instead of using default, pass in a decoder. */
return Charset.defaultCharset().decode(bb).toString();
} finally {
stream.close();
}
}
/**
* @param args
* @throws IOException
*/
public static void main(String[] args) throws IOException {
try {
// Create an instance of the Parser Manager.
// SQLQueryParserManagerProvider.getInstance().getParserManager
// returns the best compliant SQLQueryParserManager
// supporting the SQL dialect of the database described by the given
// database product information. Passing null for both the database
// and the version would return a generic parser; here the DB2 UDB
// v9.1 dialect is requested.
SQLQueryParserManager parserManager = SQLQueryParserManagerProvider
.getInstance().getParserManager("DB2 UDB", "v9.1");
// Sample query
String sql = readFile("c:\\test.sql");
// Parse
SQLQueryParseResult parseResult = parserManager.parseQuery(sql);
// Get the Query Model object from the result
QueryStatement resultObject = parseResult.getQueryStatement();
// Get the SQL text
String parsedSQL = resultObject.getSQL();
System.out.println(parsedSQL);
// Here we have the SQL code parsed!
QuerySelectStatement querySelect = (QuerySelectStatement) parseResult
.getSQLStatement();
List columnExprList = StatementHelper
.getEffectiveResultColumns(querySelect);
Iterator columnIt = columnExprList.iterator();
while (columnIt.hasNext()) {
ValueExpressionColumn colExpr = (ValueExpressionColumn) columnIt
.next();
// DataType dataType = colExpr.getDataType();
System.out.println("effective result column: "
+ colExpr.getName());// + " with data type: " +
// dataType.getName());
}
List tableList = StatementHelper.getTablesForStatement(resultObject);
// List tableList = StatementHelper.getTablesForStatement(querySelect);
for (Object obj : tableList) {
TableReference t = (TableReference) obj;
System.out.println(t.getName());
}
} catch (SQLParserException spe) {
// handle the syntax error
System.out.println(spe.getMessage());
@SuppressWarnings("unchecked")
List<SQLParseErrorInfo> syntacticErrors = spe.getErrorInfoList();
Iterator<SQLParseErrorInfo> itr = syntacticErrors.iterator();
while (itr.hasNext()) {
SQLParseErrorInfo errorInfo = (SQLParseErrorInfo) itr.next();
// Example usage of the SQLParseErrorInfo object
// the error message
String errorMessage = errorInfo.getParserErrorMessage();
String expectedText = errorInfo.getExpectedText();
String errorSourceText = errorInfo.getErrorSourceText();
// the line numbers of error
int errorLine = errorInfo.getLineNumberStart();
int errorColumn = errorInfo.getColumnNumberStart();
System.err.println("Error in line " + errorLine + ", column "
+ errorColumn + ": " + expectedText + " "
+ errorMessage + " " + errorSourceText);
}
} catch (SQLParserInternalException spie) {
// handle the exception
System.out.println(spie.getMessage());
}
System.exit(0);
}
}
I need to create an automated process (preferably using Java) that will:
1. Open browser with specific url.
2. Login, using the username and password specified.
3. Follow one of the links on the page.
4. Refresh the browser.
5. Log out.
This is basically done to gather some statistics for analysis. Every time a user follows the link a bunch of data is generated for this particular user and saved in database. The thing I need to do is, using around 10 fake users, ping the page every 5-15 min.
Can you think of a simple way of doing that? There has to be an alternative to the endless manual login-refresh-logout process...
Try Selenium.
It's not Java, but Javascript. You could do something like:
window.location = "<url>"
document.getElementById("username").value = "<email>";
document.getElementById("password").value = "<password>";
document.getElementById("login_box_button").click();
...
etc
With this kind of structure you can easily cover 1-3. Throw in some for loops for page refreshes and you're done.
Use HtmlUnit if you want fast, simple, Java-based web interaction/crawling.
For example: here is some simple code showing a bunch of output and an example of accessing all IMG elements of the loaded page.
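// Package names as in HtmlUnit 2.x
import java.io.IOException;
import java.net.MalformedURLException;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class HtmlUnitTest {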
public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
final WebClient webClient = new WebClient();
final HtmlPage page = webClient.getPage("http://www.google.com");
System.out.println(page.getTitleText());
for (HtmlElement node : page.getHtmlElementDescendants()) {
if (node.getTagName().toUpperCase().equals("IMG")) {
System.out.println("NAME: " + node.getTagName());
System.out.println("WIDTH:" + node.getAttribute("width"));
System.out.println("HEIGHT:" + node.getAttribute("height"));
System.out.println("TEXT: " + node.asText());
System.out.println("XMl: " + node.asXml());
}
}
}
}
Example #2 Accessing named input fields and entering data/clicking:
final HtmlPage page = webClient.getPage("http://www.google.com");
HtmlElement inputField = page.getElementByName("q");
inputField.type("Example input");
HtmlElement btnG = page.getElementByName("btnG");
Page secondPage = btnG.click();
if (secondPage instanceof HtmlPage) {
System.out.println(page.getTitleText());
System.out.println(((HtmlPage)secondPage).getTitleText());
}
NB: You can use page.refresh() on any Page object.
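To repeat such a visit for several users every 5-15 minutes, as the question describes, the JDK's ScheduledExecutorService is one option. This is a minimal sketch: SessionScheduler, nextDelayMinutes, start and stop are hypothetical names, and the Runnable passed to start stands for whatever performs one login, link click, refresh and logout (for example the HtmlUnit calls above).

```java
import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SessionScheduler {

    private final ScheduledExecutorService scheduler;
    private final Random random = new Random();

    /** One worker thread per simulated user. */
    public SessionScheduler(int users) {
        this.scheduler = Executors.newScheduledThreadPool(users);
    }

    /** Random delay in whole minutes, between 5 and 15 inclusive. */
    public static int nextDelayMinutes(Random random) {
        return 5 + random.nextInt(11);
    }

    /** Runs the session now, then again after a fresh random delay, repeatedly. */
    public void start(Runnable session) {
        scheduler.execute(() -> runAndReschedule(session));
    }

    private void runAndReschedule(Runnable session) {
        try {
            session.run();
        } catch (RuntimeException e) {
            e.printStackTrace(); // keep the loop alive if one visit fails
        }
        scheduler.schedule(() -> runAndReschedule(session),
                nextDelayMinutes(random), TimeUnit.MINUTES);
    }

    /** Stops all scheduled sessions. */
    public void stop() {
        scheduler.shutdownNow();
    }
}
```

Calling start once per fake user, each with its own credentials, gives every user an independent random schedule.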
You could use Jakarta JMeter