I'm using Java to download the HTML contents of websites whose URLs are stored in a database. I'd like to store their HTML in the database, too.
I'm using Jsoup for this purpose:
public String downloadHTML(String byLink) {
String htmlInPage = "";
try {
Document doc = Jsoup.connect(byLink).get();
htmlInPage = doc.html();
} catch (org.jsoup.UnsupportedMimeTypeException e) {
// process this and some other exceptions
}
return htmlInPage;
}
I'd like to download websites concurrently and use this function:
public void downloadURL(int websiteId, String url,
String categoryName, ExecutorService executorService) {
executorService.submit((Runnable) () -> {
String htmlInPage = downloadHTML(url);
System.out.println("Category: " + categoryName + " " + websiteId + " " + url);
String insertQuery =
"INSERT INTO html_data (website_id, html_contents) VALUES (?,?)";
dbUtils.query(insertQuery, websiteId, htmlInPage);
});
}
dbUtils is my class based on Apache Commons DbUtils. Details are here: http://pastebin.com/iAKXchbQ
I'm using everything mentioned above as follows (the details of List<Object[]> are explained on pastebin, too):
public static void main(String[] args) {
DbUtils dbUtils = new DbUtils("host", "db", "driver", "user", "pass");
List<String> categoriesList =
Arrays.asList("weapons", "planes", "cooking", "manga");
String sql = "SELECT lw.id, lw.website_url, category_name " +
"FROM list_of_websites AS lw JOIN list_of_categories AS lc " +
"ON lw.category_id = lc.id " +
"where category_name = ? ";
ExecutorService executorService = Executors.newFixedThreadPool(10);
for (String category : categoriesList) {
List<Object[]> sitesInCategory = dbUtils.select(sql, category );
for (Object[] entry : sitesInCategory) {
int websiteId = (int) entry[0];
String url = (String) entry[1];
String categoryName = (String) entry[2];
downloadURL(websiteId, url, categoryName, executorService);
}
}
executorService.shutdown();
}
I'm not sure if this solution is correct, but it works. Now I want to modify the code so that it saves HTML not from all websites in my database, but only from a fixed number of them in each category.
For example, download and save the HTML of 50 websites from the "weapons" category, 50 from "planes", etc. I don't think doing this in SQL is the answer: if we select 50 sites per category, that doesn't mean all of them get saved, because of possibly incorrect syntax and connection problems.
I've tried to create a separate class implementing Runnable with the fields counter and maxWebsitesPerCategory, but these variables aren't updated. Another idea was to create a field Map<String,Integer> sitesInCategory instead of counter, put each category there as a key, and increment its value until it reaches maxWebsitesPerCategory, but that didn't work either. Please help me!
P.S.: I'd also be grateful for any recommendations about my implementation of concurrent downloading (I haven't worked with concurrency in Java before, and this is my first attempt).
How about this?
for (String category : categoriesList) {
dbUtils.select(sql, category).stream()
.limit(50)
.forEach(entry -> {
int websiteId = (int) entry[0];
String url = (String) entry[1];
String categoryName = (String) entry[2];
downloadURL(websiteId, url, categoryName, executorService);
});
}
sitesInCategory has been replaced with a stream of at most 50 elements, then your code is run on each entry.
EDIT
In response to the comments: I've gone ahead and restructured things a bit; you can modify or implement the content of the methods I've suggested.
public void werk(Queue<Object[]> q, ExecutorService executorService) {
executorService.submit(() -> {
try {
Object[] o = q.remove();
try {
String html = downloadHTML(o); // this takes one of your object arrays and returns the text of an html page
insertIntoDB(html); // this is the code in the latter half of your downloadURL method
}catch (/*narrow exception type indicating download failure*/Exception e) {
werk(q, executorService);
}
}catch (NoSuchElementException e) {}
});
}
^^^ This method does most of the work.
for (String category : categoriesList) {
Queue<Object[]> q = new ConcurrentLinkedQueue<>(dbUtils.select(sql, category));
IntStream.range(0, 50).forEach(i -> werk(q, executorService));
}
^^^ this is the for loop in your main
Now each category tries to download 50 pages; when a download fails, it moves on and tries another page. This way, you will either download 50 pages or have attempted to download every page in the category.
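As a hedged sketch of the same idea without resubmitting tasks from inside a task: each of the N workers keeps polling the shared queue until one page succeeds or the queue runs out. One thing to watch with werk above is that, if main still calls executorService.shutdown() right after the loop, a later resubmission from a failed download would be rejected by the executor. The class name PerCategoryLimit, the method saveUpTo, and the Predicate standing in for "download the page and insert it" are assumptions made for illustration, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

public class PerCategoryLimit {

    /**
     * Tries URLs from the (thread-safe) queue until `limit` of them have been saved
     * or the queue is empty. `downloadAndSave` is a stand-in for "download the page
     * and insert it"; it returns false on failure.
     */
    static int saveUpTo(int limit, Queue<String> urls,
                        Predicate<String> downloadAndSave, ExecutorService pool) {
        AtomicInteger saved = new AtomicInteger(); // thread-safe success counter
        List<Callable<Void>> workers = new ArrayList<>();
        for (int i = 0; i < limit; i++) {
            workers.add(() -> {
                String url;
                // Each worker owns one "slot": keep taking URLs until one succeeds.
                while ((url = urls.poll()) != null) {
                    if (downloadAndSave.test(url)) {
                        saved.incrementAndGet();
                        break;
                    }
                }
                return null;
            });
        }
        try {
            pool.invokeAll(workers); // blocks until every worker has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return saved.get();
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        Queue<String> urls = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 10; i++) {
            urls.add("http://example.invalid/" + i);
        }
        // Fake download: pretend URLs ending in an odd digit fail.
        Predicate<String> fake = url -> (url.charAt(url.length() - 1) - '0') % 2 == 0;
        System.out.println("saved = " + saveUpTo(3, urls, fake, pool));
        pool.shutdown(); // safe here: no task submits further work
    }
}
```

Because the retry loop lives inside each worker, the count of saved pages per category is just the return value, and shutting the pool down afterwards cannot race with a retry.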
I wrote some code to look up movie names on IMDB, but if, for instance, I search for "Harry Potter", I find more than one movie. I would like to use multithreading, but I don't have much knowledge in this area.
I am using the strategy design pattern to search across several websites, and inside one of the methods I have this code:
for (Element element : elements) {
String searchedUrl = element.select("a").attr("href");
String movieName = element.select("h2").text();
if (movieName.matches(patternMatcher)) {
Result result = new Result();
result.setName(movieName);
result.setLink(searchedUrl);
result.setTitleProp(super.imdbConnection(movieName));
System.out.println(movieName + " " + searchedUrl);
resultList.add(result);
}
}
For each element (a movie name), the super.imdbConnection(movieName) line opens a new connection to IMDB to look up ratings and other details.
The problem is that I would like to open all the connections at the same time, because with 5-6 movies found, the process takes much longer than expected.
I am not asking for code; I just want some ideas. I thought about creating an inner class that implements Runnable and using it, but I don't see the point of that.
How can I rewrite that loop to use multithreading?
I am using Jsoup for parsing, Element and Elements are from that library.
The simplest way is parallelStream():
List<Result> resultList = elements.parallelStream()
.map(element -> {
String searchedUrl = element.select("a").attr("href");
String movieName = element.select("h2").text();
if(movieName.matches(patternMatcher)){
Result result = new Result();
result.setName(movieName);
result.setLink(searchedUrl);
result.setTitleProp(super.imdbConnection(movieName));
System.out.println(movieName + " " + searchedUrl);
return result;
}else{
return null;
}
}).filter(Objects::nonNull)
.collect(Collectors.toList());
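A minimal, self-contained sketch of the same filter-then-map shape, using plain strings in place of Jsoup elements so it runs without network access. The names ParallelLookup and lookupMatching, and the lookup function standing in for imdbConnection, are assumptions for illustration. Note that parallel streams run on the shared common ForkJoinPool, whose size is tied to the number of CPU cores, so for slow network calls the speed-up may be smaller than with a dedicated thread pool.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ParallelLookup {

    /** Keeps the titles matching `pattern` and runs `lookup` on each of them in parallel. */
    static List<String> lookupMatching(List<String> titles, String pattern,
                                       Function<String, String> lookup) {
        return titles.parallelStream()
                .filter(title -> title.matches(pattern)) // cheap check first
                .map(lookup)                              // the slow call, run concurrently
                .collect(Collectors.toList());            // results stay in encounter order
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Harry Potter 1", "Some Other Movie", "Harry Potter 2");
        // Stand-in for the real IMDB request.
        List<String> results =
                lookupMatching(titles, "Harry Potter.*", title -> title + " (looked up)");
        System.out.println(results);
    }
}
```

Filtering before mapping avoids returning null from the mapping step, so no Objects::nonNull filter is needed afterwards.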
If you don't like parallelStream() and want to use threads, you can do this:
List<Element> elements = new ArrayList<>();
//create a function which returns an implementation of `Callable`
//input: Element
//output: Callable<Result>
Function<Element, Callable<Result>> scrapFunction = (element) -> new Callable<Result>() {
@Override
public Result call() throws Exception{
String searchedUrl = element.select("a").attr("href");
String movieName = element.select("h2").text();
if(movieName.matches(patternMatcher)){
Result result = new Result();
result.setName(movieName);
result.setLink(searchedUrl);
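//note: inside an anonymous class, plain `super` means the anonymous class's own superclass,
//so qualify this call with the enclosing class name or write the Callable as a lambda
result.setTitleProp(super.imdbConnection(movieName));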
System.out.println(movieName + " " + searchedUrl);
return result;
}else{
return null;
}
}
};
//create a fixed pool of threads (at least one: a pool size of 0 is rejected)
ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, elements.size()));
//submit a Callable<Result> for every Element
//by using scrapFunction.apply(...)
List<Future<Result>> futures = elements.stream()
.map(e -> executor.submit(scrapFunction.apply(e)))
.collect(Collectors.toList());
//collect all results from Callable<Result>
List<Result> resultList = futures.stream()
.map(e -> {
try{
return e.get();
}catch(Exception ignored){
return null;
}
}).filter(Objects::nonNull)
.collect(Collectors.toList());
//all results are in, so let the pool's threads terminate
executor.shutdown();
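A compact variant of the thread-pool approach, sketched under a few assumptions: it uses ExecutorService.invokeAll to submit everything at once, caps the pool at a caller-chosen size (assumed to be at least 1) rather than one thread per element, and shuts the pool down in a finally block. The names BoundedLookups, lookupAll and maxThreads are illustrative, and plain strings stand in for Jsoup elements and Result objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class BoundedLookups {

    /** Runs `lookup` for every title on at most `maxThreads` threads; failed lookups are skipped. */
    static List<String> lookupAll(List<String> titles, Function<String, String> lookup,
                                  int maxThreads) {
        List<String> results = new ArrayList<>();
        if (titles.isEmpty()) {
            return results; // nothing to do, and a pool of size 0 is not allowed
        }
        ExecutorService executor =
                Executors.newFixedThreadPool(Math.min(maxThreads, titles.size()));
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String title : titles) {
                tasks.add(() -> lookup.apply(title));
            }
            // invokeAll waits for all tasks and returns the futures in task order.
            for (Future<String> future : executor.invokeAll(tasks)) {
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    // this lookup threw an exception; skip it
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            executor.shutdown(); // otherwise the pool's threads keep the JVM alive
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("Harry Potter 1", "", "Harry Potter 2");
        List<String> results = lookupAll(titles, title -> {
            if (title.isEmpty()) {
                throw new IllegalArgumentException("empty title");
            }
            return title + " (looked up)";
        }, 4);
        System.out.println(results);
    }
}
```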
I need to show elements in a table depending on the element (Person) clicked in another table. The problem is that, using a Service, if the user clicks on two elements of the first table very quickly, the data for both elements is shown in the table, and I only want to show the data for the last one clicked. I hope you can help me.
Here is my code:
personTable.getSelectionModel().selectedItemProperty().addListener(
(observable, oldValue, newValue) -> {
try {
contactoTable.setPlaceholder(new Label("Cargando..."));
showPersonDetails(newValue);
} catch (SQLException ex) {
Logger.getLogger(PersonOverviewController.class.getName()).log(Level.SEVERE, null, ex);
}
});
And showPersonDetails:
contactoTable.setVisible(true);
contactoTable.getItems().clear();
firstNameLabel.setText(person.getFirstName());
lastNameLabel.setText(person.getLastName());
mailLabel.setText(person.getMail());
phoneLabel.setText(person.getPhone());
descriptionLabel.setText(person.getDescription());
service = new Service<Void>() {
@Override
protected Task<Void> createTask() {
return new Task<Void>() {
@Override
protected Void call() throws Exception {
//Background work
DBManager db = new DBManager();
String query = "SELECT * FROM eventos";
ResultSet r = db.executeSelect(query);
contactoTable.getItems().clear();
contactoData.clear();
while (r.next()) {
String q = "SELECT * FROM " + r.getString("Nombre").replace(" ", "_") + " WHERE Nombre = '" + person.getFirstName() + "' AND Apellidos = '" + person.getLastName() + "' AND Correo = '" + person.getMail() + "'";
ResultSet result = db.executeSelect(q);
while (result.next()) {
contactoData.add(new Row(r.getString("Nombre"), result.getString("Asistencia")));
}
}
final CountDownLatch latch = new CountDownLatch(1);
Platform.runLater(() -> {
try {
//FX Stuff done here
contactoTable.setPlaceholder(new Label("No invitado a ningún evento"));
contactoTable.setItems(contactoData);
} finally {
latch.countDown();
}
});
latch.await();
//Keep with the background work
return null;
}
};
}
};
service.start();
You are referencing the same data list (contactoData) from multiple threads, with apparently no synchronization on the list. If the user selects two different items in rapid succession, you launch a service for each one, each service running its task in a different thread. Consequently you have no control over the order the two different threads perform their (multiple) manipulations on contactoData. For example, it is possible (even probable) that the order for two services executing asynchronously is:
1. The first service clears the list.
2. The second service clears the list.
3. The first service adds elements to the list.
4. The second service adds elements to the list.
and in this case the list contains elements generated by both services, not just one of them.
So you should have your tasks operate on, and return, a new list they create. Then process that list on the FX Application Thread.
It's also not clear why you need a service here, as you only seem to ever use each service once. You may as well just use a task directly.
You also probably want to ensure that the last selection is the one displayed. Since the tasks are running asynchronously, it's possible that if two tasks were started in quick succession, the second would complete before the first. This would result in the second selection being displayed, and then the first selection replacing it. You can avoid this by doing the UI update in an onSucceeded handler, and canceling any current task when you start a new one (thus preventing the currently-executing task from invoking its onSucceeded handler).
Finally, it's really not clear to me why you are making the task wait until the UI is updated.
Here is an updated version of your code:
private Task<List<Row>> updateContactTableTask ;
// ...
private void showPersonDetails(Person person) {
contactoTable.getItems().clear();
firstNameLabel.setText(person.getFirstName());
lastNameLabel.setText(person.getLastName());
mailLabel.setText(person.getMail());
phoneLabel.setText(person.getPhone());
descriptionLabel.setText(person.getDescription());
if (updateContactTableTask != null && updateContactTableTask.isRunning()) {
updateContactTableTask.cancel();
}
updateContactTableTask = new Task<List<Row>>() {
@Override
protected List<Row> call() throws Exception {
List<Row> resultList = new ArrayList<>() ;
//Background work
DBManager db = new DBManager();
String query = "SELECT * FROM eventos";
ResultSet r = db.executeSelect(query);
// quit if we got canceled here...
if (isCancelled()) {
return resultList;
}
while (r.next() && ! isCancelled()) {
// Note: building a query like this is inherently unsafe
// You should use a PreparedStatement in your DBManager class instead
String q = "SELECT * FROM " + r.getString("Nombre").replace(" ", "_") + " WHERE Nombre = '" + person.getFirstName() + "' AND Apellidos = '" + person.getLastName() + "' AND Correo = '" + person.getMail() + "'";
ResultSet result = db.executeSelect(q);
while (result.next()) {
resultList.add(new Row(r.getString("Nombre"), result.getString("Asistencia")));
}
}
return resultList ;
}
};
updateContactTableTask.setOnSucceeded(e -> {
// not really clear you still need contactoData, but if you do:
contactoData.setAll(updateContactTableTask.getValue());
contactoTable.setPlaceholder(new Label("No invitado a ningún evento"));
contactoTable.setItems(contactoData);
});
updateContactTableTask.setOnFailed(e -> {
// handle database errors here...
});
new Thread(updateContactTableTask).start();
}
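The "only the latest selection may update the view" rule can also be illustrated outside JavaFX. The following is a plain-Java sketch, not the Task API used above: instead of cancelling the previous task, it stamps each request with a generation number and drops any result whose stamp is no longer the newest. The class LatestWins and its methods are hypothetical names, and the lock plays the role that the single FX Application Thread plays in the real application.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

public class LatestWins {
    private long generation;                                   // guarded by "this"
    private List<String> displayed = Collections.emptyList();  // guarded by "this"

    /** Loads rows in the background; they are shown only if no newer request was made meanwhile. */
    CompletableFuture<Void> request(Supplier<List<String>> loader, Executor executor) {
        final long mine;
        synchronized (this) {
            mine = ++generation;
        }
        // Each loader builds and returns its own list, so no list is shared between tasks.
        return CompletableFuture.supplyAsync(loader, executor).thenAccept(rows -> {
            synchronized (this) {
                if (mine == generation) { // still the latest request?
                    displayed = rows;
                }
            }
        });
    }

    synchronized List<String> displayed() {
        return displayed;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        LatestWins view = new LatestWins();
        CountDownLatch releaseSlow = new CountDownLatch(1);

        // First click: a slow load that finishes only after the second one.
        CompletableFuture<Void> first = view.request(() -> {
            try {
                releaseSlow.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return Arrays.asList("events for Alice");
        }, pool);
        // Second click: a fast load.
        CompletableFuture<Void> second =
                view.request(() -> Arrays.asList("events for Bob"), pool);

        second.join();
        releaseSlow.countDown();
        first.join();
        pool.shutdown();
        System.out.println(view.displayed()); // the stale result for Alice was dropped
    }
}
```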
As an aside, it's not clear to me if, and if so, how, you are closing your database resources. E.g. the result sets never seem to get closed. This could cause resource leaks. However this is incidental to the question (and relies on knowing how your DBManager class is implemented), so I won't address it here.
I am currently working on a Java program that crawls a webpage and prints out some information from it.
There is one part that I can't figure out: when I try to print one specific String array with some information in it, all I get for that line is " ] ". However, a few lines earlier I print another String array in exactly the same way, and it prints fine. When I check what is actually being assigned to the "categories" variable, it's the correct information, and it can be printed there.
public class Crawler {
private Document htmlDocument;
String [] keywords, categories;
public void printData(String urlToCrawl)
{
nextURL=urlToCrawl;
crawl();
//This does what it's supposed to do. (Print Statement 1)
System.out.print("Keywords: ");
for (String i :keywords) {System.out.print(i+", ");}
//This doesn't. (Print Statement 2)
System.out.print("Categories: ");
for (String b :categories) {System.out.print(b+", ");}
}
public void crawl()
{
//Gather Data
//open up JSOUP for HTTP parsing.
Connection connection = Jsoup.connect(nextURL).userAgent(USER_AGENT);
Document htmlDocument = connection.get();
this.htmlDocument=htmlDocument;
System.out.println("Received Webpage "+ nextURL);
int guacCounter = 0;
for(Element guac : htmlDocument.select("script"))
{
if(guacCounter==5)
{
//String concentratedGuac = guac.toString();
String[] items = guac.toString().split("\\n");
categories = processGuac(items);
break;
}
else if(guacCounter<5) {
guacCounter++;
}
}
}
public String[] processKeywords(String totalKeywords)
{
String [] separatedKeywords = totalKeywords.split(",");
//System.out.println(separatedKeywords.toString());
return separatedKeywords;
}
public String[] processGuac(String[] inputGuac)
{
int categoryIsOnLine = 6;
String categoryData = inputGuac[categoryIsOnLine-1];
categoryData = categoryData.replace(",","");
categoryData = categoryData.replace("'","");
categoryData = categoryData.replace("|",",");
categoryData = categoryData.split(":")[1];
//this prints out the list of categories in string form.(Print Statement 3)
System.out.println("Testing here: " + categoryData.toString());
String [] categoryList=categoryData.split(",");
//This prints out the list of categories in array form correctly.(Print statement 4)
System.out.println("Testing here too: " );
for(String a : categoryList) {System.out.println(a);}
return categoryList;
}
}
I cut out a lot of the irrelevant parts of my code so there might be some missing variables.
Here is what my printouts look like:
PS1:
Keywords: What makes a good friend, making friends, signs of a good friend, supporting friends, conflict management,
PS2:
]
PS3:
Testing here: wellbeing,friends-and-family,friendships
PS4:
Testing here too:
wellbeing
friends-and-family
friendships
I am trying to retrieve and process data from JIRA. Unfortunately, the pieces of information (which come from the Metadata plugin) are stored down a column, one key/value pair per row, rather than across a single row.
Picture of JIRA-MySQL-Database
The goal is to save this in an object with following attributes:
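public class DesiredObject {
private String objectKey;
// one field per metadata entry; a Java identifier cannot contain dots
private String azeKundeName; // Aze.kunde.name
private Long azeKundeSchluessel; // Aze.kunde.schluessel
private String azeProjektName; // Aze.projekt.name
private Long azeProjektSchluessel; // Aze.projekt.schluessel
//getters and setters here
}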
My workbench is STS and it's a Spring-Boot-Application.
I can fetch a List of Object-Keys with the JRJC using:
JiraController jiraconnect = new JiraController();
List<JiraProject> jiraprojects = new ArrayList<JiraProject>();
jiraprojects = jiraconnect.findJiraProjects();
This works perfectly, and USER_KEY and USER_VALUE are also easy to retrieve, but I hope there is a better way than performing three SQL queries for each project and then somehow building an object from all those lists.
I was starting with
for (JiraProject jp : jiraprojects) {
String SQL = "select * from jira_metadata where ENRICHED_OBJECT_KEY = ?";
// "do" is a reserved word in Java, so the list needs another name
List<DesiredObject> desiredObjects = jdbcTemplateObject.query(SQL, new Object[] { "com.atlassian.jira.project.Project:" + jp.getProjectkey() }, XXX);
}
to get a list with every object, but I'm stuck because I can't figure out an object mapper (XXX) that is able to write this into an object.
Usually I go with
object.setter(rs.getString("SQL-Column"));
But that doesn't work here, because every row uses the same column names (USER_KEY and USER_VALUE).
The Database is automatically created by JIRA, so I can't "fix" it.
The Object_Keys are unique which is why I tried to use those to collect all the data from my SQL-Table.
I hope all you need to enlighten me is in this post, if not feel free to ask for more!
Edit: Don't worry if you see both 'project' and 'projekt'; that's because I gave most of my classes German names and descriptions.
I created a HashMap keyed by the object key prefixed with a unique token in brackets, e.g. "(1)JIRA".
String SQL = "select * from ao_cc6aeb_jira_metadata";
List<JiraImportObjekt> jioList = jdbcTemplateObject.query(SQL, new JiraImportObjektMapper());
HashMap<String, String> hmap = new HashMap<String, String>();
Integer unique = 1;
for (JiraImportObjekt jio : jioList) {
hmap.put("(" + unique.toString() + ")" + jio.getEnriched_Object_Key(),
jio.getUser_Key() + "(" + jio.getUser_Value() + ")");
unique++;
}
I changed this into a TreeMap
Map<String, String> tmap = new TreeMap<String, String>(hmap);
And then I iterated through that TreeMap via
String aktuProj = new String();
for (String s : tmap.keySet()) {
if (aktuProj.equals(s.replaceAll("\\([^\\(]*\\)", ""))) {
} else {
//Add Element to list and start new Element
}
//a lot of other stuff
}
What I did was to put all the data in the right order, iterate through and process everything like I wanted it.
Object hinfo = hmap.get(s);
if (hinfo.toString().replaceAll("\\([^\\(]*\\)", "").equals("aze.kunde.schluessel")) {
Matcher m = Pattern.compile("\\(([^)]+)\\)").matcher(hinfo.toString());
while (m.find()) {
jmo[obj].setAzeKundeSchluessel(Long.parseLong(m.group(1), 10));
// logger.info("AzeKundeSchluessel: " +
// jmo[obj].getAzeKundeSchluessel());
}
} else ...
After the loop I needed to add the last Element.
Now I have a List with the Elements which is easy to use and ready for further steps.
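For comparison, here is a hedged sketch of the same pivot done with a nested map instead of token-prefixed keys and regular expressions: all rows are read once, grouped by object key, and each group becomes a USER_KEY to USER_VALUE map from which one object per project can be filled. The class names MetadataPivot and MetaRow, the field names, and the sample values are made up for illustration; only the column roles come from the question.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MetadataPivot {

    /** One key/value row of the metadata table (field names are assumptions). */
    static final class MetaRow {
        final String objectKey;
        final String userKey;
        final String userValue;

        MetaRow(String objectKey, String userKey, String userValue) {
            this.objectKey = objectKey;
            this.userKey = userKey;
            this.userValue = userValue;
        }
    }

    /** Groups the rows by object key: one USER_KEY -> USER_VALUE map per project. */
    static Map<String, Map<String, String>> pivot(List<MetaRow> rows) {
        Map<String, Map<String, String>> byObjectKey = new TreeMap<>();
        for (MetaRow row : rows) {
            byObjectKey.computeIfAbsent(row.objectKey, key -> new HashMap<>())
                       .put(row.userKey, row.userValue);
        }
        return byObjectKey;
    }

    public static void main(String[] args) {
        String jira = "com.atlassian.jira.project.Project:JIRA";
        String demo = "com.atlassian.jira.project.Project:DEMO";
        // Made-up sample rows; real ones would come from one SELECT over the metadata table.
        List<MetaRow> rows = Arrays.asList(
                new MetaRow(jira, "aze.kunde.name", "Example GmbH"),
                new MetaRow(jira, "aze.kunde.schluessel", "42"),
                new MetaRow(demo, "aze.projekt.name", "Demo"));

        Map<String, Map<String, String>> pivoted = pivot(rows);
        long schluessel = Long.parseLong(pivoted.get(jira).get("aze.kunde.schluessel"));
        System.out.println(pivoted.size() + " projects, aze.kunde.schluessel = " + schluessel);
    }
}
```

With the rows grouped this way, there is no "last element" special case after the loop, because each project's map is complete once all rows have been read.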
I cut out a lot of code because most of it is customized for my problem. The roadmap should be enough to solve it, though.
Good luck!
This is actually a re-do of an older question of mine that I have completely redone because my old question seemed to confuse people.
I have written a Java program that queries a database and is intended to retrieve several rows of data. I previously wrote the program in Informix-4GL, where I use an SQL cursor to loop through the database and store each row in a "dynamic row of record". I understand there are no rows of records in Java, so I have ended up with the following code.
public class Main {
// DB CONNECT VARIABLE ===========================
static Connection gv_conn = null;
// PREPARED STATEMENT VARIABLES ==================
static PreparedStatement users_sel = null;
static ResultSet users_curs = null;
static PreparedStatement uinfo_sel = null;
static ResultSet uinfo_curs = null;
// MAIN PROGRAM START ============================
public static void main(String[] args) {
try {
// CONNECT TO DATABASE CODE
} catch(Exception log) {
// YOU FAILED CODE
}
f_prepare(); // PREPARE THE STATEMENTS
ArrayList<Integer> list_id = new ArrayList<Integer>();
ArrayList<String> list_name = new ArrayList<String>();
ArrayList<String> list_info = new ArrayList<String>();
ArrayList<String> list_extra = new ArrayList<String>();
try {
users_sel.setInt(1, 1);
users_curs = users_sel.executeQuery();
// RETRIEVE ROWS FROM USERS
while (users_curs.next()) {
int lv_u_id = users_curs.getInt("u_id");
String lv_u_name = users_curs.getString("u_name");
uinfo_sel.setInt(1, lv_u_id);
uinfo_curs = uinfo_sel.executeQuery();
// RETRIEVE DATA FROM UINFO RELATIVE TO USER
// (a ResultSet starts before its first row, so advance it before reading)
if (uinfo_curs.next()) {
String lv_ui_info = uinfo_curs.getString("ui_info");
String lv_ui_extra = uinfo_curs.getString("ui_extra");
// STORE DATA I WANT IN THESE ARRAYS
list_id.add(lv_u_id);
list_name.add(lv_u_name);
list_info.add(lv_ui_info);
list_extra.add(lv_ui_extra);
}
}
} catch(SQLException log) {
// EVERYTHING BROKE
}
// MAKING SURE IT WORKED
System.out.println(
list_id.get(0) +
list_name.get(0) +
list_info.get(0) +
list_extra.get(0)
);
// TESTING WITH ARBITRARY ROWS
System.out.println(
list_id.get(2) +
list_name.get(5) +
list_info.get(9) +
list_extra.get(14)
);
}
// PREPARE STATEMENTS SEPARATELY =================
public static void f_prepare() {
String lv_sql = null;
try {
lv_sql = "select * from users where u_id >= ?";
users_sel = gv_conn.prepareStatement(lv_sql);
lv_sql = "select * from uinfo where ui_u_id = ?";
uinfo_sel = gv_conn.prepareStatement(lv_sql);
} catch(SQLException log) {
// IT WON'T FAIL COZ I BELIEEEVE
}
}
}
class DBConn {
// connect to SQLite3 code
}
All in all, this code works: I can hit the database once, get all the data I need, store it in variables, and work with it as I please. However, it does not feel right, and I think it's far from the most suitable way to do this in Java, considering I can do it in only 15 lines of code in Informix-4GL.
Can anyone give me advice on a better way to achieve a similar result?
In order to use Java effectively you need to use custom objects. What you have here is a lot of static methods inside a class. It seems that you are coming from a procedural background, and if you try to use Java as a procedural language, you will not get much value from using it. So first off, create a type; you can put it right inside your class or create it as a separate file:
class User
{
final int id;
final String name;
final String info;
final String extra;
User(int id, String name, String info, String extra)
{
this.id = id;
this.name = name;
this.info = info;
this.extra = extra;
}
void print()
{
System.out.println(id + name + info + extra);
}
}
Then the loop becomes:
List<User> list = new ArrayList<User>();
try {
users_sel.setInt(1, 1);
users_curs = users_sel.executeQuery();
// RETRIEVE ROWS FROM USERS
while (users_curs.next()) {
int lv_u_id = users_curs.getInt("u_id");
String lv_u_name = users_curs.getString("u_name");
uinfo_sel.setInt(1, lv_u_id);
uinfo_curs = uinfo_sel.executeQuery();
// RETRIEVE DATA FROM UINFO RELATIVE TO USER
// (a ResultSet starts before its first row, so advance it before reading)
if (uinfo_curs.next()) {
String lv_ui_info = uinfo_curs.getString("ui_info");
String lv_ui_extra = uinfo_curs.getString("ui_extra");
User user = new User(lv_u_id, lv_u_name, lv_ui_info, lv_ui_extra);
// STORE DATA
list.add(user);
}
}
} catch(SQLException log) {
// EVERYTHING BROKE
}
// MAKING SURE IT WORKED
list.get(0).print();
This doesn't necessarily reduce the number of lines. Most people who use Java don't interact with databases through this low-level API, and in general, if you are looking to get down to the fewest lines possible (a questionable goal), Java isn't going to be your best choice.
Your code is actually quite close to box stock JDBC.
The distinction is that in Java, rather than having a discrete collection per field, we'd have a simple JavaBean and a collection of those.
Some examples:
public class ListItem {
Integer id;
String name;
String info;
String extra;
// … constructors and setters/getters elided …
}
List<ListItem> items = new ArrayList<>();
// …
while(curs.next()) {
ListItem item = new ListItem();
item.setId(curs.getInt(1));
item.setName(curs.getString(2));
item.setInfo(curs.getString(3));
item.setExtra(curs.getString(4));
items.add(item);
}
This is more idiomatic, and of course does not touch on the several frameworks and libraries available to make DB access a bit easier.
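One practical benefit of keeping each row in a single object, shown here as a small sketch with made-up data: when the collection is sorted or filtered, every field travels with its row, whereas separate per-field lists would have to be reordered in lockstep. The class RowsAsObjects and its cut-down User type are illustrative assumptions, not code from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RowsAsObjects {

    /** A cut-down row type; the field names are illustrative. */
    static final class User {
        final int id;
        final String name;

        User(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** Returns a copy sorted by name; each id stays attached to its name. */
    static List<User> sortedByName(List<User> users) {
        List<User> copy = new ArrayList<>(users);
        copy.sort(Comparator.comparing((User user) -> user.name));
        return copy;
    }

    public static void main(String[] args) {
        List<User> users = Arrays.asList(
                new User(3, "carol"), new User(1, "alice"), new User(2, "bob"));
        for (User user : sortedByName(users)) {
            System.out.println(user.id + " " + user.name);
        }
    }
}
```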