I'm trying to concurrently download the HTML of websites whose URLs are stored in a database (about 3 million entries).
It's obvious that I should use multithreading, but I'm having trouble working out how to do it in Java.
Here's how I used to do it without multithreading:
final Connection c = dbConnect(); // register the JDBC driver and establish a connection
checkRequiredDbAndTables(); // check that the DB and the necessary tables exist
try {
    // now get the list of URLs from the DB
    String sql = "select id, website_url, category_id from list_of_websites";
    PreparedStatement ps = c.prepareStatement(sql);
    ResultSet rs = ps.executeQuery();
    while (rs.next()) {
        // column numbering in a ResultSet starts from 1!
        final long id = rs.getLong(1); // get website id
        final String url = rs.getString(2); // get website url
        System.out.println("Category: " + rs.getString(3) + " " + id + " " + url);
        if (isValidURL(url) && connectionOK(url)) {
            // URL syntax and connection checked
            String htmlInPage = downloadHTML(url);
            if (!htmlInPage.equals("")) {
                // add the result to the DB
                insertDataToDb(c, id, htmlInPage);
            }
        }
    }
    rs.close();
} catch (SQLException e) {
    e.printStackTrace();
}
closeConnection(c); // database connection closed
The function downloadHTML uses the JSoup library to do the main work.
My task feels like a kind of "producer-consumer problem". I suppose it can be represented this way: there is a buffer containing N links; several worker processes take links from it and download the HTML; and one process whose aim is to load new URLs from the DB into the buffer as it empties.
But I have no idea how to implement this. I've heard of Threads and of ExecutorService providing thread pools, but it's all really confusing to me.
You may want to use a thread pool with a fixed number of threads. Your program would first create the pool and then read URLs from the database; for each URL read, it starts a new task to download that page's content.
Your program may also maintain a queue. When a task finishes downloading the HTML, it pushes the URL together with the result onto the queue. Once the main thread has finished reading URLs and starting tasks, it waits on the queue; whenever the queue holds a response, it takes the response out and writes it to the database. The main thread can count how many responses it has received; when the count reaches the number of URLs, all tasks have finished.
Your program can write a class for storing the response together with its URL, for example:
class Response {
    public String url;
    public String result;
    public Response(String u, String r) { this.url = u; this.result = r; }
}
If you still have any problems implementing or understanding this (I may not have explained it clearly enough; it is 00:40 now and I will probably go to sleep soon), please leave comments. If you want code, please also leave comments.
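In the meantime, here is a minimal sketch of that approach, assuming the question's own helpers (dbConnect, downloadHTML, insertDataToDb, closeConnection) exist and throw no checked exceptions. It uses a small Result holder instead of the Response class above, so that the row id travels with the HTML; the pool size of 20 is arbitrary:

import java.sql.*;
import java.util.concurrent.*;

public class Crawler {
    // small holder so the database row id travels with the downloaded HTML
    static class Result {
        final long id; final String html;
        Result(long id, String html) { this.id = id; this.html = html; }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(20); // fixed-size pool
        BlockingQueue<Result> results = new LinkedBlockingQueue<>();
        Connection c = dbConnect();
        int submitted = 0;
        PreparedStatement ps = c.prepareStatement("select id, website_url from list_of_websites");
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            final long id = rs.getLong(1);
            final String url = rs.getString(2);
            // one download task per URL (validation and error handling omitted)
            pool.submit(() -> results.add(new Result(id, downloadHTML(url))));
            submitted++;
        }
        rs.close();
        for (int i = 0; i < submitted; i++) {   // drain exactly one result per task
            Result r = results.take();          // blocks until a worker finishes
            if (!r.html.equals("")) insertDataToDb(c, r.id, r.html);
        }
        pool.shutdown();
        closeConnection(c);
    }
}

Note that the pool's internal queue holds all pending tasks at once; with 3 million rows, a bounded queue (as in the answer below) keeps memory in check.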
Main thread:
    Start X "downloading" threads
    Run the query shown in the question; for each record:
        Add the data from the query to an ArrayBlockingQueue
    Add an end-of-data marker to the queue
    Wait for the threads to stop (optional)
    Return from main

Download thread:
    Get data from the queue; while not the end-of-data marker:
        Download the HTML
        Insert the HTML into the database
    Put the end-of-data marker back into the queue for other threads to find
    Exit the thread
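A hedged Java sketch of the download-thread side of this scheme, again assuming the question's downloadHTML and insertDataToDb helpers (signatures simplified), with an END marker as the usual "poison pill":

import java.util.concurrent.*;

class Work {
    static final Work END = new Work(-1, null); // end-of-data marker ("poison pill")
    final long id; final String url;
    Work(long id, String url) { this.id = id; this.url = url; }
}

class DownloadThread implements Runnable {
    private final BlockingQueue<Work> queue;
    DownloadThread(BlockingQueue<Work> queue) { this.queue = queue; }

    public void run() {
        try {
            Work w;
            while ((w = queue.take()) != Work.END) {  // take() blocks while the queue is empty
                String html = downloadHTML(w.url);
                insertDataToDb(w.id, html);
            }
            queue.put(Work.END); // put the marker back so the other threads see it too
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

The main thread fills the queue with Work items and finally puts Work.END; each worker that sees END puts it back, so the remaining workers terminate as well.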
I need to work on an AJAX response, which is one of the responses received upon visiting a page. I use Selenium DevTools and Java. I create a listener that intercepts a specific request, and then I want to work on the response it brings. However, I need to set up a static wait, or else Selenium doesn't have time to save the RequestId. I have read the Chrome DevTools documentation, but it's a new thing for me. I wonder if there is a method, other than a static wait, that would allow me to wait for this call to be completed.
Here is my code:
@Test(groups = "test")
public void x() throws InterruptedException, JsonProcessingException {
    User user = User.builder();
    ManageAccountStep manageAccountStep = new ManageAccountStep(getDriver());
    DashboardPO dashboardPO = new DashboardPO(getDriver());
    manageAccountStep.login(user);

    DevTools devTools = ((HasDevTools) getDriver()).maybeGetDevTools().orElseThrow();
    devTools.createSessionIfThereIsNotOne();
    devTools.send(Network.enable(Optional.empty(), Optional.empty(), Optional.empty()));
    // end of boilerplate

    final RequestId[] id = new RequestId[1];
    devTools.addListener(Network.responseReceived(), response -> {
        log.info(response.getResponse().getUrl());
        if (response.getResponse().getUrl().contains(DESIRED_URL)) {
            id[0] = response.getRequestId();
        }
    });

    dashboardPO
        .clickLink(); // here is when my DESIRED_URL happens

    Utils.sleep(5000); // something like Thread.sleep(5000)

    String responseBody = devTools.send(Network.getResponseBody(id[0])).getBody();
    // some operations on responseBody

    devTools.clearListeners();
    devTools.disconnectSession();
}
If I don't use the 5-second wait, the id variable never gets assigned and I get a NullPointerException ("requestId is required"). During those 5 seconds, log.info prints all the API calls that are happening, and it almost always finds my id. I would like to refrain from a static wait, though. I am thinking about something similar to jQuery.active() == 0, but my page doesn't use jQuery.
You may try an explicit wait with a custom condition. Something like this:
public String getResponseBody(WebDriver driver, DevTools devTools, RequestId[] id) {
    return new WebDriverWait(driver, 5)
            .ignoring(NullPointerException.class)
            .until(d ->
                devTools.send(Network.getResponseBody(id[0])).getBody());
}
So it won't always wait the full 5 seconds: the moment it gets the data, it comes out of the until method. Also add, to the ignoring chain, whichever other exceptions come up here.
I have put these lines in a separate method because the devTools object is defined locally. In order to use it inside this anonymous inner function, it has to be final or effectively final.
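Hypothetical usage inside the test above, assuming the listener has had a chance to store the RequestId while the wait polls (the extra id parameter is my addition to keep the snippet self-contained):

// replaces Utils.sleep(5000) in the question's test
String responseBody = getResponseBody(getDriver(), devTools, id);
// some operations on responseBody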
I seem to run into this issue when running tests in parallel (and headless) while trying to capture the requests and responses; I get:
{"No data found for resource with given identifier"},"sessionId" ...
However, .until now seems to take only an ExpectedCondition, so here is a similar solution (to the accepted answer) that I use, without WebDriverWait.until:
public static String getResponseBody(DevTools devTools, RequestId id) throws InterruptedException {
    String requestPostData = "";
    LocalDateTime then = LocalDateTime.now();
    String err = "";
    int it = 0;
    while (true) {
        err = "";
        try {
            requestPostData = devTools.send(Network.getResponseBody(id)).getBody();
        } catch (Exception e) {
            err = e.getMessage();
        }
        if (requestPostData != null && !requestPostData.equals("")) { break; }
        if (err.equals("")) { break; } // no error message: the response body may really be an empty string
        long timeTaken = ChronoUnit.SECONDS.between(then, LocalDateTime.now());
        if (timeTaken >= 5) { requestPostData = err + ", timeTaken:" + timeTaken; break; }
        if (it > 0) { TimeUnit.SECONDS.sleep(it); } // back off: wait a little longer on each retry
        it++;
    }
    return requestPostData;
}
It just loops until it stops erroring and returns the string back as soon as it can (though I actually set timeTaken >= 60 because of the many parallel requests).
A cron job fires this script off once a day. When the script runs, it seems to work as expected: the code builds a map, iterates over that map, creates points which are added to a batch, and finally writes those batched points to InfluxDB. I can connect to InfluxDB, and I can query my database and see that the points were added. I am using influxdb-java 2.2.
The issue I am having is that when InfluxDB is restarted, all of my data is removed. The database still exists and the series still exist; however, all of the points/rows are gone (each table is empty). Mine is not the only database; there are several others, and those databases are restored correctly. My guess is that the transaction is not being finalized. I am not aware of a way to force a flush and ensure that my points are persisted. I tried adding:
influxDB.write(batchPoints);
influxDB.disableBatch(); // calls this.batchProcessor.flush() in InfluxDBImpl.java
This was an attempt to force a flush, but it didn't work as expected. I am using InfluxDB 0.13.x:
InfluxDB influxDB = InfluxDBFactory.connect(host, user, pass);
String dbName = "dataName";
influxDB.createDatabase(dbName);

BatchPoints batchPoints = BatchPoints
        .database(dbName)
        .tag("async", "true")
        .retentionPolicy("default")
        .consistency(ConsistencyLevel.ALL)
        .build();

for (Tags type : Tags.values()) {
    List<LinkedHashMap<String, Object>> myList = this.trendsMap.get(type.getDisplay());
    if (myList != null) {
        for (LinkedHashMap<String, Object> data : myList) {
            long time = (long) data.get("time");
            if (data.get("date").equals(this.sdf.format(new Date()))) {
                time = System.currentTimeMillis();
            }
            Point point = Point.measurement(type.getDisplay())
                    .time(time, TimeUnit.MILLISECONDS)
                    .field("count", data.get("count"))
                    .field("date", data.get("date"))
                    .field("day_of_week", data.get("day_of_week"))
                    .field("day_of_month", data.get("day_of_month"))
                    .build();
            batchPoints.point(point);
        }
    }
}
influxDB.write(batchPoints);
Can you upgrade InfluxDB to 0.11.0? There have been many important changes since then and it would be best to test against that.
I am new to AJAX. I am making an AJAX call; it hits a servlet, which fetches data and prints it to the JSP using out.println(). It works fine, but I feel it's not a good way to do it. Here is my code.
The AJAX call:
xmlHttpReq.open('GET', "RTMonitor?rtype=groups&groupname="+temp, true);
In the RTMonitor servlet I have:
sql ="SELECT a.vehicleno,a.lat,a.lng,a.status,a.rdate,a.rtime from latlng a,vehicle_details b where a.vehicleno=b.vehicleno and b.clientid="+clientid +" and b.groupid in(select groupid from group_details where groupname='"+gname+"' and clientid='"+clientid+"')";
resultSet = statement.executeQuery(sql);
while(resultSet.next())
{
response.setContentType("text/html");
out.println("<tr>"+
"<td>"+" "+" "+resultSet.getString("vehicleno")+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+" "+"<br>"+"<br>"+"</td>"+);
//other <td>s
}
I think this is not a good way, so I thought about returning the response as a JSON object. Tell me how to return the object as JSON and set the values in the <td>s. Also tell me whether JSON is a good way, or suggest a better one if there is.
Object-to-JSON conversion is explained in this question: Converting Java Object to Json using Marshaller.
Other things:
Your SQL is unsafe! Please refer to the following question, which explains prepared statements and has examples too: Difference between Statement and PreparedStatement.
Generally you should not write your low-level AJAX code yourself unless you are aiming to learn how it works. There are many cross-browser JavaScript libraries that provide these things in a robust manner, such as jQuery. jQuery's API has getJSON, which you will undoubtedly find very useful (API doc):
var params = {myObjectId: 1337}; // just some parameters
$.getJSON("myUrl/myAjaxAction", params, function(data) { /* this is the success handler */
    alert(data.myObject.name); // assuming the returned JSON is: {myObject: {name: 'Hello World!'}}
});
You should avoid mixing server-side code with client-side code as much as possible.
Your client-side code should only offer a nice, rich user interface by manipulating the data provided by the server. The server-side code should only process the data, whether it comes from different calls or from storage, usually a database.
Usually the communication (asynchronous or not) between a client and a server goes like this:
the client sends a request to the server
the server processes the request and gives a response, usually some HTML or JSON/XML
the client processes the response from the server
OK, now let's turn our attention to your specific problem.
Your AJAX call, xmlHttpReq.open('GET', "RTMonitor?rtype=groups&groupname="+temp, true);, should send the request to the servlet and expect some data back to process and render nicely to the user. Your servlet should handle the request by querying the database (you should definitely change your code to use prepared statements, as they prevent SQL injection). By doing so, you separate your client-side code from your server-side code.
private List<YourObject> loadObjectsBySomeLogic() throws SQLException {
    String sql = "SELECT a.vehicleno, a.lat, a.lng, a.status, a.rdate, a.rtime FROM latlng a, vehicle_details b WHERE a.vehicleno = b.vehicleno AND b.clientid = ? AND b.groupid IN (SELECT groupid FROM group_details WHERE groupname = ? AND clientid = ?)";
    List<YourObject> list = new ArrayList<YourObject>(); // new ArrayList<>(); on Java 1.7 and later
    PreparedStatement ps = null;
    ResultSet rs = null;
    try {
        ps = connection.prepareStatement(sql);
        ps.setLong(1, clientId);
        ps.setString(2, gname);
        ps.setLong(3, clientId);
        rs = ps.executeQuery();
        while (rs.next()) {
            // load data from the ResultSet into an object/list of objects, e.g. (columns hypothetical):
            // list.add(new YourObject(rs.getString("vehicleno"), rs.getString("lat"), rs.getString("lng")));
        }
    } finally {
        closeResources(rs, ps);
    }
    return list;
}
private static final void closeResources(final ResultSet rs, final PreparedStatement ps) {
    if (rs != null) {
        try {
            rs.close();
        } catch (SQLException e) {
            // nasty rs. log the exception?
            LOGGER.error("Could not close the ResultSet!", e);
        }
    }
    if (ps != null) {
        try {
            ps.close();
        } catch (SQLException e) {
            // nasty ps. log the exception?
            LOGGER.error("Could not close the PreparedStatement!", e);
        }
    }
}
You could delegate this method to a different object that handles the business/application domain logic, but that's not the point in this case.
You can use JSON as your data format: it has a nice, easy-to-understand structure, and it is more lightweight than XML. You can use any Java library to encode data as JSON; I'll give an example that uses the Gson library.
List<YourObject> list = loadObjectsBySomeLogic();
String json = new Gson().toJson(list);
response.setContentType("application/json");
response.setCharacterEncoding("UTF-8");
response.getWriter().write(json);
Now your AJAX request should handle the JSON data coming from the server (I recommend using jQuery to make AJAX calls, as it's well tested and works great on all major browsers).
$.get('RTMonitor', function(responseJson) {
    // handle your JSON response by rendering it using HTML + CSS.
});
Good morning to the community. I have a question: I need to import 14 million records containing the information of a company's clients.
The flat .txt file weighs 2.8 GB. I have developed a Java program that reads the flat file line by line, processes the information, and puts it into an object, which in turn is inserted into a table in a PostgreSQL database. I've measured that 100,000 records are inserted in 112 minutes, so the issue is that I have to insert in parts.
public static void main(String[] args) {
    // PROCESSING 100,000 records in 112 minutes
    // PROCESSING 1,000,000 records in 1,120 minutes = 18.66 hours
    loadData(0L, 0L, 100000L);
}
/**
 * Loads a number of records depending on the input parameters.
 * @param counterInitial - initial counter, type long.
 * @param loadInitial - initial load, type long.
 * @param loadLimit - load limit, type long.
 */
private static void loadData(long counterInitial, long loadInitial, long loadLimit) {
    Session session = HibernateUtil.getSessionFactory().openSession();
    try {
        FileInputStream fstream = new FileInputStream("C:\\sppadron.txt");
        DataInputStream entrada = new DataInputStream(fstream);
        BufferedReader buffer = new BufferedReader(new InputStreamReader(entrada));
        String strLinea;
        while ((strLinea = buffer.readLine()) != null) {
            if (counterInitial > loadInitial) {
                if (counterInitial > loadLimit) {
                    break;
                }
                Sppadron spadron = new Sppadron();
                spadron.setSpId(counterInitial);
                spadron.setSpNle(strLinea.substring(0, 9).trim());
                spadron.setSpLib(strLinea.substring(9, 16).trim());
                spadron.setSpDep(strLinea.substring(16, 19).trim());
                spadron.setSpPrv(strLinea.substring(19, 22).trim());
                spadron.setSpDst(strLinea.substring(22, 25).trim());
                spadron.setSpApp(strLinea.substring(25, 66).trim());
                spadron.setSpApm(strLinea.substring(66, 107).trim());
                spadron.setSpNom(strLinea.substring(107, 143).trim());
                String cadenaGriSecDoc = strLinea.substring(143, strLinea.length()).trim();
                String[] tokensVal = cadenaGriSecDoc.split("\\s+");
                if (tokensVal.length == 5) {
                    spadron.setSpNac(tokensVal[0]);
                    spadron.setSpSex(tokensVal[1]);
                    spadron.setSpGri(tokensVal[2]);
                    spadron.setSpSec(tokensVal[3]);
                    spadron.setSpDoc(tokensVal[4]);
                } else {
                    spadron.setSpNac(tokensVal[0]);
                    spadron.setSpSex(tokensVal[1]);
                    spadron.setSpGri(tokensVal[2]);
                    spadron.setSpSec(null);
                    spadron.setSpDoc(tokensVal[3]);
                }
                try {
                    session.getTransaction().begin();
                    session.save(spadron); // insert
                    session.getTransaction().commit();
                } catch (Exception e) {
                    session.getTransaction().rollback();
                    e.printStackTrace();
                }
            }
            counterInitial++;
        }
        entrada.close();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        session.close();
    }
}
The main issue is this: if you check my code, when I insert the first million records the parameters are loadData(0L, 0L, 1000000L).
The issue is that when I insert the following records, in this case the next million, the call would be loadData(0L, 1000000L, 2000000L).
This causes it to scroll back through the first 1,000,000 records, and only when the counter reaches 1,000,001 does it begin to insert the following records. Can someone give me a more optimal suggestion for inserting the records, knowing that the information has to be processed as shown in the code above?
See How to speed up insertion performance in PostgreSQL.
The first thing you should do is bypass Hibernate. ORMs are convenient, but you pay a price in speed for that convenience, especially with bulk operations.
You could group your inserts into reasonably sized transactions and use multi-valued inserts, using a JDBC PreparedStatement; see the sketch below.
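A minimal sketch of chunked JDBC batching, assuming a plain Connection conn and illustrative table/column names loosely derived from the question's setters (newer PgJDBC versions can additionally rewrite such batches into multi-valued inserts when reWriteBatchedInserts=true):

String sql = "insert into sppadron (sp_id, sp_nle, sp_lib) values (?, ?, ?)"; // remaining columns omitted
conn.setAutoCommit(false);
PreparedStatement ps = conn.prepareStatement(sql);
long count = 0;
for (Sppadron s : records) {
    ps.setLong(1, s.getSpId());
    ps.setString(2, s.getSpNle());
    ps.setString(3, s.getSpLib());
    ps.addBatch();
    if (++count % 10_000 == 0) { // send and commit in chunks rather than per row
        ps.executeBatch();
        conn.commit();
    }
}
ps.executeBatch(); // flush the final partial chunk
conn.commit();
ps.close();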
Personally, though, I'd use PgJDBC's support for the COPY protocol to do the inserts more directly. Unwrap your Hibernate Session object to get the underlying java.sql.Connection, get the PGConnection interface for it, call getCopyAPI() to get the CopyManager, and use copyIn to feed your data into the DB.
Since it looks like your data isn't in CSV form but in fixed-width field form, what you'll need to do is start a thread that reads your data from the file, converts each record into CSV form suitable for PostgreSQL input, and writes it to a buffer that copyIn can consume through the passed Reader. This sounds more complicated than it is, and there are lots of examples of Java producer/consumer threading implementations using the java.io.Reader and java.io.Writer interfaces out there.
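A hedged sketch of the CopyManager plumbing (bulkLoad and the COPY statement are illustrative, and sppadron is assumed to be the target table):

import java.io.Reader;
import java.sql.Connection;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

static long bulkLoad(Connection conn, Reader csv) throws Exception {
    // get the PostgreSQL-specific connection interface and its COPY API
    CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
    // stream the CSV rows straight into the table, bypassing per-row INSERTs
    return copy.copyIn("COPY sppadron FROM STDIN WITH (FORMAT csv)", csv);
}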
It's possible you may instead be able to write a filter for the Reader that wraps the underlying file reader and transforms each line. This would be much simpler than producer/consumer threading; research it as the preferred option first. A sketch follows.
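One hedged way such a filtering Reader might look; toCsvLine is a hypothetical converter that applies the same substring offsets as the question's code:

import java.io.*;

static Reader csvReader(BufferedReader fixedWidth) {
    return new Reader() {
        private String pending = "";
        private int pos = 0;

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            if (pos >= pending.length()) {          // current line fully consumed
                String line = fixedWidth.readLine();
                if (line == null) return -1;        // end of input
                pending = toCsvLine(line) + "\n";   // transform one fixed-width record
                pos = 0;
            }
            int n = Math.min(len, pending.length() - pos);
            pending.getChars(pos, pos + n, buf, off);
            pos += n;
            return n;
        }

        @Override
        public void close() throws IOException {
            fixedWidth.close();
        }
    };
}

The resulting Reader can then be handed to copyIn as shown above.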
When I create a new H2 database via ORMLite, the database file gets created, but after I close my application, all the data it stored in the database is lost:
JdbcConnectionSource connection =
        new JdbcConnectionSource("jdbc:h2:file:" + path.getAbsolutePath() + ".h2.db");
TableUtils.createTable(connection, SomeClass.class);
Dao<SomeClass, Integer> dao = DaoManager.createDao(connection, SomeClass.class);

SomeClass sc = new SomeClass(id, ...);
dao.create(sc);
SomeClass retrieved = dao.queryForId(id);
System.out.println("" + retrieved);
This code produces good results: it prints the object that I stored.
But when I start the application again, this time without creating the table and storing a new object, I get an exception telling me that the required table does not exist:
JdbcConnectionSource connection =
        new JdbcConnectionSource("jdbc:h2:file:" + path.getAbsolutePath() + ".h2.db");
Dao<SomeClass, Integer> dao = DaoManager.createDao(connection, SomeClass.class);

SomeClass retrieved = dao.queryForId(id); // will produce an exception..
System.out.println("" + retrieved);
The following worked fine for me when I ran it once, and then a second time with the createTable line turned off. The second insert gave me a primary-key violation, of course, but that was expected. It created the file with (as @Thomas mentioned) a ".h2.db.h2.db" suffix.
Some questions:
After you run your application the first time, can you see the path file being created?
Is it on permanent storage and not in some temporary location cleared by the OS?
Any chance some other part of your application is clearing it before the database code begins?
Hope this helps.
@Test
public void testStuff() throws Exception {
    File path = new File("/tmp/x");
    JdbcConnectionSource connection = new JdbcConnectionSource("jdbc:h2:file:"
            + path.getAbsolutePath() + ".h2.db");
    // TableUtils.createTable(connection, SomeClass.class);
    Dao<SomeClass, Integer> dao = DaoManager.createDao(connection,
            SomeClass.class);
    int id = 131233;
    SomeClass sc = new SomeClass(id, "fopewjfew");
    dao.create(sc);
    SomeClass retrieved = dao.queryForId(id);
    System.out.println("" + retrieved);
    connection.close();
}
I can see Russia from my house:
> ls -l /tmp/
...
-rw-r--r-- 1 graywatson wheel 14336 Aug 31 08:47 x.h2.db.h2.db
Did you close the database? It is closed automatically but it's better to close it manually (so recovery is faster).
In many cases the database URL is the problem. Are you sure the same path is used in both cases? Otherwise you end up with two databases. By the way, ".h2.db" is appended automatically; you don't need to add it manually.
To better analyze the problem, you could append ;TRACE_LEVEL_FILE=2 to the database URL, and then check in the *.trace.db file which SQL statements were executed against the database.
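For instance, a sketch of the corrected URL from the question's code, with the manual suffix dropped and tracing enabled:

JdbcConnectionSource connection = new JdbcConnectionSource(
        "jdbc:h2:file:" + path.getAbsolutePath() + ";TRACE_LEVEL_FILE=2");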