Web scraping using multithreading

I wrote some code to look up movie names on IMDB, but if, for instance, I search for "Harry Potter", I find more than one movie. I would like to use multithreading, but I don't have much knowledge in this area.
I am using the strategy design pattern to search across several websites, and inside one of the methods I have this code:
for (Element element : elements) {
String searchedUrl = element.select("a").attr("href");
String movieName = element.select("h2").text();
if (movieName.matches(patternMatcher)) {
Result result = new Result();
result.setName(movieName);
result.setLink(searchedUrl);
result.setTitleProp(super.imdbConnection(movieName));
System.out.println(movieName + " " + searchedUrl);
resultList.add(result);
}
}
which, for each element (a movie name), opens a new connection to IMDB to look up ratings and other details, on the super.imdbConnection(movieName) line.
The problem is that I would like to make all the connections at the same time, because with 5-6 movies found, the process takes much longer than expected.
I am not asking for code, I want some ideas. I thought about creating an inner class that implements Runnable and using it, but I don't see how that would help.
How can I rewrite that loop to use multithreading?
I am using Jsoup for parsing; Element and Elements are from that library.

The simplest way is parallelStream():
List<Result> resultList = elements.parallelStream()
    .map(element -> {
        String searchedUrl = element.select("a").attr("href");
        String movieName = element.select("h2").text();
        if (movieName.matches(patternMatcher)) {
            Result result = new Result();
            result.setName(movieName);
            result.setLink(searchedUrl);
            result.setTitleProp(super.imdbConnection(movieName));
            System.out.println(movieName + " " + searchedUrl);
            return result;
        } else {
            return null; // non-matching elements are filtered out below
        }
    }).filter(Objects::nonNull)
    .collect(Collectors.toList());
If you don't like parallelStream() and want to use threads, you can do this:
List<Element> elements = new ArrayList<>(); // placeholder: the elements selected with Jsoup
//create a function which returns an implementation of `Callable`
//input: Element
//output: Callable<Result>
Function<Element, Callable<Result>> scrapFunction = (element) -> new Callable<Result>() {
    @Override
    public Result call() throws Exception {
        String searchedUrl = element.select("a").attr("href");
        String movieName = element.select("h2").text();
        if (movieName.matches(patternMatcher)) {
            Result result = new Result();
            result.setName(movieName);
            result.setLink(searchedUrl);
            //inside the anonymous class `super` no longer refers to the strategy's
            //parent, so the inherited method is called directly
            result.setTitleProp(imdbConnection(movieName));
            System.out.println(movieName + " " + searchedUrl);
            return result;
        } else {
            return null;
        }
    }
};
//create a fixed pool of threads (the pool size must be at least 1)
ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, elements.size()));
//submit a Callable<Result> for every Element
//by using scrapFunction.apply(...)
List<Future<Result>> futures = elements.stream()
    .map(e -> executor.submit(scrapFunction.apply(e)))
    .collect(Collectors.toList());
//collect all results from Callable<Result>
List<Result> resultList = futures.stream()
    .map(e -> {
        try {
            return e.get();
        } catch (Exception ignored) {
            return null;
        }
    }).filter(Objects::nonNull)
    .collect(Collectors.toList());
//release the pool's threads once all results are in
executor.shutdown();
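For readers who want to try the thread-pool idea in isolation, here is a minimal self-contained sketch. It makes several assumptions that are not in the original: slowLookup is a hypothetical stand-in for super.imdbConnection(movieName) that just sleeps, the pool is capped at 8 threads (an arbitrary choice), and plain strings replace the Result objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelLookupSketch {

    // Hypothetical stand-in for super.imdbConnection(movieName): simulates a slow network call.
    static String slowLookup(String movieName) throws InterruptedException {
        Thread.sleep(200);
        return "details for " + movieName;
    }

    // Runs one lookup per name on a bounded pool and returns the results in input order.
    static List<String> lookupAll(List<String> movieNames) {
        List<String> results = new ArrayList<>();
        if (movieNames.isEmpty()) {
            return results; // a pool of size 0 would throw IllegalArgumentException
        }
        ExecutorService executor = Executors.newFixedThreadPool(Math.min(movieNames.size(), 8));
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String name : movieNames) {
                tasks.add(() -> slowLookup(name));
            }
            // invokeAll blocks until every task has finished and keeps the task order
            for (Future<String> future : executor.invokeAll(tasks)) {
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    // one failed lookup should not discard the others
                    System.err.println("Lookup failed: " + e.getCause());
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        } finally {
            executor.shutdown();
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Harry Potter 1", "Harry Potter 2", "Harry Potter 3");
        long start = System.nanoTime();
        List<String> results = lookupAll(names);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(results + " in about " + elapsedMs + " ms");
    }
}
```

Capping the pool size rather than using one thread per element keeps the number of simultaneous connections to the remote site bounded when a search returns many matches.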

Related

How to access an object attribute from a String in Java?

I have a String that tells me what attribute I should use to make some filtering. How can I use this String to actually access the data in the object ?
I have a method that returns a List of strings telling me how to filter my List of objects. Such as:
String[] { "id=123", "name=foo" }
So my first idea was to split the String into 2 parts with:
filterString.split("=") and use the first part of the String (e.g. "id") to identify the attribute being filtered.
Coming for a JS background, I would do it like this:
const attr = filterString.split('=')[0]; // grabs the "id" part from the string "id=123", for example
const filteredValue = filterString.split('=')[1]; // grabs the "123" part from the string "id=123", for example
items.filter(el => el[`${attr}`] === filteredValue) // returns an array with the items where the id == "123"
How would I be able to do that with Java ?

You can use reflection to get the fields of a class by a dynamic name.
@Test
void test() throws NoSuchFieldException, IllegalAccessException {
    String[] filters = {"id=123", "name=foo"};
    List<Item> list = newArrayList(new Item(123, "abc"), new Item(2, "foo"), new Item(123, "foo"));
    Class<Item> itemClass = Item.class;
    for (String filter : filters) {
        String key = StringUtils.substringBefore(filter, "=");
        String value = StringUtils.substringAfter(filter, "=");
        Iterator<Item> iterator = list.iterator();
        while (iterator.hasNext()) {
            Item item = iterator.next();
            Field field = itemClass.getDeclaredField(key);
            field.setAccessible(true);
            Object itemValue = field.get(item);
            // compare as strings so that numeric fields such as id also match
            if (!value.equals(String.valueOf(itemValue))) {
                iterator.remove();
            }
        }
    }
    assertEquals(1, list.size());
}
But I agree with the comment from sp00m: it's slow and potentially dangerous.
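If the set of filterable attributes is known in advance, one way to avoid the reflection approach above is an explicit map from attribute name to getter. The sketch below is only an illustration: the Item class, its getters and the GETTERS map are assumptions, since the original does not show the class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class GetterMapFilterSketch {

    // Hypothetical Item class; the original does not show its definition.
    static class Item {
        final int id;
        final String name;

        Item(int id, String name) {
            this.id = id;
            this.name = name;
        }

        int getId() { return id; }
        String getName() { return name; }
    }

    // Whitelist of filterable attributes mapped to their getters.
    static final Map<String, Function<Item, Object>> GETTERS = new HashMap<>();
    static {
        GETTERS.put("id", Item::getId);
        GETTERS.put("name", Item::getName);
    }

    // Keeps only the items that satisfy every "attribute=value" filter.
    static List<Item> filter(List<Item> items, String[] filters) {
        List<Item> result = new ArrayList<>(items);
        for (String filter : filters) {
            String[] parts = filter.split("=", 2); // limit 2 keeps any '=' inside the value
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed filter: " + filter);
            }
            Function<Item, Object> getter = GETTERS.get(parts[0]);
            if (getter == null) {
                throw new IllegalArgumentException("Unknown attribute: " + parts[0]);
            }
            String expected = parts[1];
            // compare as strings, as the filter values arrive as text
            result.removeIf(item -> !expected.equals(String.valueOf(getter.apply(item))));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(new Item(123, "abc"), new Item(2, "foo"), new Item(123, "foo"));
        List<Item> filtered = filter(items, new String[] {"id=123", "name=foo"});
        System.out.println(filtered.size() + " item(s) passed the filters");
    }
}
```

Unknown attribute names fail fast here instead of reaching arbitrary private fields, at the cost of registering each filterable attribute by hand.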

This code should work:
//create the filter map
Map<String, String> expectedFieldValueMap = new HashMap<>();
for (String currentDataValue : input) {
    String[] keyValue = currentDataValue.split("=");
    String expectedField = keyValue[0];
    String expectedValue = keyValue[1];
    expectedFieldValueMap.put(expectedField, expectedValue);
}
Then iterate over the list of input objects (I used an Employee class with id and name fields and prepared a test list of Employee objects called inputEmployeeList) and check whether every filter passes. Reflection, though slow, is one way to do it:
for (Employee e : inputEmployeeList) {
    try {
        boolean filterPassed = true;
        for (String expectedField : expectedFieldValueMap.keySet()) {
            String expectedValue = expectedFieldValueMap.get(expectedField);
            Field fieldData = e.getClass().getDeclaredField(expectedField);
            fieldData.setAccessible(true);
            // convert the field value to a String, otherwise a numeric id never equals "123"
            if (!expectedValue.equals(String.valueOf(fieldData.get(e)))) {
                filterPassed = false;
                break;
            }
        }
        if (filterPassed) {
            System.out.println(e + " object passed the filter");
        }
    } catch (Exception any) {
        any.printStackTrace();
        // handle
    }
}

JPA - find by multiple attributes in collections of objects

I have an event object with following attributes:
class Event {
String name;
String location;
LocalDateTime date;
String description;
}
Let's say I get a list of events from a web API:
List<Event> events = getEvents(); // e.g. 5 events
And now I want to check how many of these events I already have in my DB.
An event is unique if the combination of name, location and date is unique.
So basically I want to create a query to do this:
Optional<Event> getByNameAndLocationAndDate(String name, String location, LocalDateTime date);
but for a list of items in just one query. Something like:
Optional<Event> getByNameAndLocationAndDate(List<Event> events);
Is it possible with JPA?

There is no built-in or especially pretty way of doing this, but you could generate a query in a loop:
public List<Event> getByNameAndLocationAndDate(List<Event> events) {
    if (events.isEmpty()) {
        return new ArrayList<>();
    }
    final StringBuilder queryBuilder = new StringBuilder("select e from Event e where ");
    int i = 0;
    for (final Event event : events) {
        if (i > 0) {
            queryBuilder.append("or");
        }
        queryBuilder.append(" (e.name = :name" + i);
        queryBuilder.append(" and e.location = :location" + i);
        queryBuilder.append(" and e.date = :date" + i + ") ");
        i++;
    }
    // pass the entity class so that createQuery returns a TypedQuery<Event>
    final TypedQuery<Event> query = em.createQuery(queryBuilder.toString(), Event.class);
    int j = 0;
    for (final Event event : events) {
        query.setParameter("name" + j, event.getName());
        query.setParameter("location" + j, event.getLocation());
        query.setParameter("date" + j, event.getDate());
        j++; // move on to the next group of parameters
    }
    return query.getResultList();
}
Like I said, not very pretty. It might be better with the Criteria API. Then again, unless you have very strict requirements for execution speed, you might be better off looping through the list and checking one event at a time. That results in more queries being run against the database, but also in much prettier code.
Edit: Here is an attempt using the Criteria API. I haven't used it much, so this was put together by googling; no guarantee that it works as is.
public List<Event> getByNameAndLocationAndDate(List<Event> events) {
    if (events.isEmpty()) {
        return new ArrayList<>();
    }
    final CriteriaBuilder cb = em.getCriteriaBuilder();
    final CriteriaQuery<Event> query = cb.createQuery(Event.class);
    final Root<Event> root = query.from(Event.class);
    // one "name and location and date" predicate per event
    final List<Predicate> predicates = events.stream().map(event -> {
        return cb.and(cb.equal(root.get("name"), event.getName()),
                cb.equal(root.get("location"), event.getLocation()),
                cb.equal(root.get("date"), event.getDate()));
    }).collect(Collectors.toList());
    query.select(root).where(cb.or(predicates.toArray(new Predicate[]{})));
    return em.createQuery(query).getResultList();
}

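Try
List<Event> findByNameInAndLocationInAndDateIn(List<String> names, List<String> locations, List<LocalDateTime> dates);
Note that this returns a list, not a single event. If you need to verify whether one particular event is missing from the database, the only way is to search for it individually;
you can use this method to decide whether that is necessary.
I'm not sure this method behaves as you wish, because each In condition is checked independently: an event can match with a name, a location and a date taken from three different input events.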
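Because an In-based query like the one above can return rows whose name, location and date come from different input events, the candidates it returns would still need an exact comparison. The sketch below shows one way to do that in memory with a composite key. The Event class here is a plain stand-in for the entity, and key and alreadyStored are hypothetical helper names.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class EventKeyMatchSketch {

    // Plain stand-in for the Event entity from the question.
    static class Event {
        final String name;
        final String location;
        final LocalDateTime date;

        Event(String name, String location, LocalDateTime date) {
            this.name = name;
            this.location = location;
            this.date = date;
        }
    }

    // A List has value-based equals/hashCode, so it can serve as a composite key.
    static List<Object> key(Event e) {
        return Arrays.<Object>asList(e.name, e.location, e.date);
    }

    // Returns the incoming events whose exact (name, location, date) triple is among the candidates.
    static List<Event> alreadyStored(List<Event> incoming, List<Event> candidatesFromDb) {
        Set<List<Object>> storedKeys = new HashSet<>();
        for (Event candidate : candidatesFromDb) {
            storedKeys.add(key(candidate));
        }
        List<Event> matches = new ArrayList<>();
        for (Event event : incoming) {
            if (storedKeys.contains(key(event))) {
                matches.add(event);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        LocalDateTime d1 = LocalDateTime.of(2020, 1, 1, 10, 0);
        LocalDateTime d2 = LocalDateTime.of(2020, 2, 2, 20, 0);
        List<Event> incoming = Arrays.asList(new Event("Concert", "Berlin", d1), new Event("Fair", "Paris", d2));
        // "Concert"/"Paris"/d1 mixes values from both incoming events and must not count as a match
        List<Event> candidates = Arrays.asList(new Event("Concert", "Paris", d1), new Event("Fair", "Paris", d2));
        System.out.println(alreadyStored(incoming, candidates).size() + " event(s) already stored");
    }
}
```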

Combine 2 array lists of objects that have null values

I'm trying to combine 2 array lists of objects into one, but I can't figure out how to do it. I've tried addAll and add, but those methods don't really do what I want.
Basically, I have one array list with values like this:
SearchResult1 [title=null, url=null, price=19 690 EUR]
And another one with values like this:
SearchResult2 [title=Ford Car, url=http://www.something.com, price=null]
How can I combine those 2 lists into one with values like this:
SearchResult3 [title=Ford Car, url=http://www.something.com, price=19 690 EUR]
This is the code so far:
public List searchMethod() {
try {
final String query = "ford";
final Document page = Jsoup.connect("link" + URLEncoder.encode(query, "UTF-8")).userAgent(USER_AGENT).get();
List<SearchResult> resultList1 = new ArrayList<SearchResult>();
List<SearchResult> resultList2 = new ArrayList<SearchResult>();
List<SearchResult> resultList3 = new ArrayList<SearchResult>();
for(Element searchResult : page.select(".offer-price")) {
String price = searchResult.text();
resultList1.add(new SearchResult(price));
}
for(Element searchResult : page.select(".offer-title__link")) {
String title = searchResult.text();
String url = searchResult.attr("href");
resultList2.add(new SearchResult(title, url));
}
resultList3.addAll(resultList1);
resultList3.addAll(resultList2);
return resultList3;
}catch(Exception e) {
e.printStackTrace();
}
return Collections.emptyList();
}
The values that I put in those lists are extracted from a web page.
Thanks for helping!

From the comment, you have said that you just want to correlate/merge the objects from both lists by each index.
You can simply loop through the lists, constructing a new SearchResult for each index (assuming you have getters for the fields and both lists have the same size):
for (int i = 0; i < resultList1.size(); i++) {
    resultList3.add(new SearchResult(resultList1.get(i).getPrice(),
            resultList2.get(i).getTitle(),
            resultList2.get(i).getUrl()));
}
You may have to change the order of the arguments passed to the SearchResult constructor that takes price, title and url, as you haven't shown it.
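As a runnable illustration of the index-based merge, here is a sketch that also guards against the two lists having different lengths, which can happen if the page contains an offer without a price. The SearchResult class and its (title, url, price) constructor are assumptions, since the original class is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MergeByIndexSketch {

    // Hypothetical SearchResult; the constructor argument order is an assumption.
    static class SearchResult {
        final String title;
        final String url;
        final String price;

        SearchResult(String title, String url, String price) {
            this.title = title;
            this.url = url;
            this.price = price;
        }

        @Override
        public String toString() {
            return "SearchResult [title=" + title + ", url=" + url + ", price=" + price + "]";
        }
    }

    // Pairs element i of one list with element i of the other; extra trailing elements are ignored.
    static List<SearchResult> merge(List<SearchResult> prices, List<SearchResult> titles) {
        int size = Math.min(prices.size(), titles.size());
        List<SearchResult> merged = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            merged.add(new SearchResult(titles.get(i).title, titles.get(i).url, prices.get(i).price));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<SearchResult> prices = Arrays.asList(new SearchResult(null, null, "19 690 EUR"));
        List<SearchResult> titles = Arrays.asList(new SearchResult("Ford Car", "http://www.something.com", null));
        System.out.println(merge(prices, titles));
    }
}
```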

Why don't you do it in one shot?
List<SearchResult> resultList1 = new ArrayList<SearchResult>();
for (Element searchResult : page.select(".offer-title__link")) {
    String title = searchResult.text();
    String url = searchResult.attr("href");
    resultList1.add(new SearchResult(title, url));
}
int index = 0;
for (Element searchResult : page.select(".offer-price")) {
    String price = searchResult.text();
    //since you have already assumed
    //that the prices come in the same order as the titles and urls
    resultList1.get(index++).setPrice(price);
}
return resultList1;

Make custom code to reduce number of repetitive lines

I have to get 'tags' from the database and store them in an array so I could check if my document contains them. Due to the number of tag categories (customers, system_dependencies, keywords) I have multiple arrays to compare my document with. Is there an easy way to simplify and make my code look nicer?
This is my approach but it looks terrible with all the repetitive for loops.
ArrayList<String> KEYWORDS2 = new ArrayList<String>();
ArrayList<String> CUSTOMERS = new ArrayList<String>();
ArrayList<String> SYSTEM_DEPS = new ArrayList<String>();
ArrayList<String> MODULES = new ArrayList<String>();
ArrayList<String> DRIVE_DEFS = new ArrayList<String>();
ArrayList<String> PROCESS_IDS = new ArrayList<String>();
while (resultSet2.next()) {
CUSTOMERS.add(resultSet2.getString(1));
}
sql = "SELECT da_tag_name FROM da_tags WHERE da_tag_type_id = 6";
stmt = conn.prepareStatement(sql);
resultSet2 = stmt.executeQuery();
while (resultSet2.next()) {
SYSTEM_DEPS.add(resultSet2.getString(1));
}
while (resultSet.next()) {
String da_document_id = resultSet.getString(1);
String file_name = resultSet.getString(2);
try {
if(file_name.endsWith(".docx") || file_name.endsWith(".docm")) {
System.out.println(file_name);
XWPFDocument document = new XWPFDocument(resultSet.getBinaryStream(3));
XWPFWordExtractor wordExtractor = new XWPFWordExtractor(document);
//Return what's inside the document
System.out.println("Keywords found in the document:");
for (String keyword : KEYWORDS) {
if (wordExtractor.getText().contains(keyword)) {
System.out.println(keyword);
}
}
System.out.println("\nCustomers found in the document:");
for (String customer : CUSTOMERS) {
if (wordExtractor.getText().contains(customer)) {
System.out.println(customer);
}
}
System.out.println("\nSystem dependencies found in the document:");
for (String systemDeps : SYSTEM_DEPS) {
if (wordExtractor.getText().contains(systemDeps)) {
System.out.println(systemDeps);
}
}
System.out.println("Log number: " + findLogNumber(wordExtractor));
System.out.println("------------------------------------------");
wordExtractor.close();
}
As you can see there are 3 more to come and this doesn't look good already. Maybe there's a way to compare all of them at the same time.
I have made another attempt at this creating this method:
public void genericForEachLoop(ArrayList<String> al, POITextExtractor te) {
for (String item : al) {
if (te.getText().contains(item)) {
System.out.println(item);
}
}
}
Then calling it like so: genericForEachLoop(MODULES, wordExtractor);
Any better solutions?

I've got two ideas to shorten this. First, you can write a general for-loop in a separate method that takes an ArrayList as a parameter and pass it each of your ArrayLists in turn, so that at least you do not have to repeat the for-loops. Second, you can create an ArrayList of ArrayLists, store your lists inside it and iterate over the whole thing. The only apparent disadvantage of both ideas (or a combination of them) is that the same query-string variable has to be reused for every ArrayList.
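A combination of the two ideas could look like the sketch below. It is deliberately independent of POI and JDBC: the document text is a plain String (in the original it would come from a single call to wordExtractor.getText()), and the category names and tag values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TagScanSketch {

    // The one general loop: returns the tags that occur in the text.
    static List<String> findTags(List<String> tags, String text) {
        List<String> found = new ArrayList<>();
        for (String tag : tags) {
            if (text.contains(tag)) {
                found.add(tag);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the categories in insertion order; the values are made up
        Map<String, List<String>> tagsByCategory = new LinkedHashMap<>();
        tagsByCategory.put("Keywords", Arrays.asList("invoice", "backup"));
        tagsByCategory.put("Customers", Arrays.asList("Acme", "Globex"));
        tagsByCategory.put("System dependencies", Arrays.asList("Linux", "Oracle"));

        // extract the text once instead of once per tag
        String text = "Acme requested a backup before the Linux upgrade.";

        for (Map.Entry<String, List<String>> entry : tagsByCategory.entrySet()) {
            System.out.println(entry.getKey() + " found in the document:");
            for (String tag : findTags(entry.getValue(), text)) {
                System.out.println(tag);
            }
        }
    }
}
```

Adding a new tag category then means adding one map entry rather than another loop.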

What you could do is use a Map and an enum like this:
enum TagType {
    KEYWORDS2(2), // or whatever its da_tag_type_id is
    CUSTOMERS(4),
    SYSTEM_DEPS(6),
    MODULES(8),
    DRIVE_DEFS(10),
    PROCESS_IDS(12);
    public final int daTagTypeId; // this will be used in queries
    TagType(int daTagTypeId) {
        this.daTagTypeId = daTagTypeId;
    }
}
Map<TagType, List<String>> tags = new HashMap<>();
XWPFDocument document = new XWPFDocument(resultSet.getBinaryStream(3));
XWPFWordExtractor wordExtractor = new XWPFWordExtractor(document);
for (TagType tagType : TagType.values()) {
    tags.put(tagType, new ArrayList<>()); // initialize
    String sql = String.format("SELECT da_tag_name FROM da_tags WHERE da_tag_type_id = %d", tagType.daTagTypeId); // build query
    stmt = conn.prepareStatement(sql);
    resultSet2 = stmt.executeQuery();
    while (resultSet2.next()) { // fill from DB
        tags.get(tagType).add(resultSet2.getString(1));
    }
    System.out.println(String.format("%s found in the document:", tagType.name()));
    for (String tag : tags.get(tagType)) { // search in text
        if (wordExtractor.getText().contains(tag)) {
            System.out.println(tag);
        }
    }
}
But at this point I'm not sure you need those lists at all:
enum TagType {
    KEYWORDS2(2), // or whatever its da_tag_type_id is
    CUSTOMERS(4),
    SYSTEM_DEPS(6),
    MODULES(8),
    DRIVE_DEFS(10),
    PROCESS_IDS(12);
    public final int daTagTypeId; // this will be used in queries
    TagType(int daTagTypeId) {
        this.daTagTypeId = daTagTypeId;
    }
}
XWPFDocument document = new XWPFDocument(resultSet.getBinaryStream(3));
XWPFWordExtractor wordExtractor = new XWPFWordExtractor(document);
for (TagType tagType : TagType.values()) {
    String sql = String.format("SELECT da_tag_name FROM da_tags WHERE da_tag_type_id = %d", tagType.daTagTypeId); // build query
    stmt = conn.prepareStatement(sql);
    resultSet2 = stmt.executeQuery();
    System.out.println(String.format("%s found in the document:", tagType.name()));
    while (resultSet2.next()) {
        String tag = resultSet2.getString(1);
        if (wordExtractor.getText().contains(tag)) {
            System.out.println(tag);
        }
    }
}
Note that I don't know where resultSet is declared and initialised, nor where resultSet2 is initialised.
Basically you just fetch the tags for each type from the DB and search for them in the text directly, without storing them first and then iterating over the stored ones again. I mean, that's what the DB is there for.

Java ExecutorService Runnable doesn't update value

I'm using Java to download HTML contents of websites whose URLs are stored in a database. I'd like to put their HTML into database, too.
I'm using Jsoup for this purpose:
public String downloadHTML(String byLink) {
String htmlInPage = "";
try {
Document doc = Jsoup.connect(byLink).get();
htmlInPage = doc.html();
} catch (org.jsoup.UnsupportedMimeTypeException e) {
// process this and some other exceptions
}
return htmlInPage;
}
I'd like to download websites concurrently and use this function:
public void downloadURL(int websiteId, String url,
String categoryName, ExecutorService executorService) {
executorService.submit((Runnable) () -> {
String htmlInPage = downloadHTML(url);
System.out.println("Category: " + categoryName + " " + websiteId + " " + url);
String insertQuery =
"INSERT INTO html_data (website_id, html_contents) VALUES (?,?)";
dbUtils.query(insertQuery, websiteId, htmlInPage);
});
}
dbUtils is my class based on Apache Commons DbUtils. Details are here: http://pastebin.com/iAKXchbQ
And I'm using everything mentioned above in a such way: (List<Object[]> details are explained on pastebin, too)
public static void main(String[] args) {
DbUtils dbUtils = new DbUtils("host", "db", "driver", "user", "pass");
List<String> categoriesList =
Arrays.asList("weapons", "planes", "cooking", "manga");
String sql = "SELECT lw.id, lw.website_url, category_name " +
"FROM list_of_websites AS lw JOIN list_of_categories AS lc " +
"ON lw.category_id = lc.id " +
"where category_name = ? ";
ExecutorService executorService = Executors.newFixedThreadPool(10);
for (String category : categoriesList) {
List<Object[]> sitesInCategory = dbUtils.select(sql, category );
for (Object[] entry : sitesInCategory) {
int websiteId = (int) entry[0];
String url = (String) entry[1];
String categoryName = (String) entry[2];
downloadURL(websiteId, url, categoryName, executorService);
}
}
executorService.shutdown();
}
I'm not sure if this solution is correct, but it works. Now I want to modify the code to save HTML not from all websites in my database, but only from a fixed number in each category.
For example, download and save the HTML of 50 websites from the "weapons" category, 50 from "planes", etc. I don't think SQL alone can do this: selecting 50 sites per category doesn't mean all of them get saved, because of possibly incorrect syntax and connection problems.
I've tried to create a separate class implementing Runnable with the fields counter and maxWebsitesPerCategory, but these variables aren't updated. Another idea was to create a field Map<String,Integer> sitesInCategory instead of counter, put each category there as a key and increment its value until it reaches maxWebsitesPerCategory, but that didn't work either. Please help me!
P.S. I'd also be grateful for any recommendations about my implementation of concurrent downloading (I haven't worked with concurrency in Java before and this is my first attempt).

How about this?
for (String category : categoriesList) {
dbUtils.select(sql, category).stream()
.limit(50)
.forEach(entry -> {
int websiteId = (int) entry[0];
String url = (String) entry[1];
String categoryName = (String) entry[2];
downloadURL(websiteId, url, categoryName, executorService);
});
}
sitesInCategory has been replaced with a stream of at most 50 elements, then your code is run on each entry.
EDIT
In regard to comments. I've gone ahead and restructured a bit, you can modify/implement the content of the methods I've suggested.
public void werk(Queue<Object[]> q, ExecutorService executorService) {
executorService.submit(() -> {
try {
Object[] o = q.remove();
try {
String html = downloadHTML(o); // this takes one of your object arrays and returns the text of an html page
insertIntoDB(html); // this is the code in the latter half of your downloadURL method
}catch (/*narrow exception type indicating download failure*/Exception e) {
werk(q, executorService);
}
}catch (NoSuchElementException e) {}
});
}
^^^ This method does most of the work.
for (String category : categoriesList) {
Queue<Object[]> q = new ConcurrentLinkedQueue<>(dbUtils.select(sql, category));
IntStream.range(0, 50).forEach(i -> werk(q, executorService));
}
^^^ this is the for loop in your main
Now each category tries to download 50 pages, upon failure of downloading a page it moves on and tries to download another page. In this way, you will either download 50 pages or have attempted to download all pages in the category.
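The same idea can be sketched without re-submitting tasks on failure: each of the quota tasks keeps polling a shared queue until one of its downloads succeeds or the queue is empty. Everything below is a simplified illustration under stated assumptions: the download predicate is a hypothetical stand-in for downloadHTML plus the database insert, and the pool size of 4 is arbitrary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

class QuotaDownloadSketch {

    // Tries URLs until `quota` downloads have succeeded or the list is exhausted.
    // `download` is a hypothetical stand-in that returns true on success.
    static int downloadUpTo(int quota, List<String> urls, Predicate<String> download) {
        Queue<String> queue = new ConcurrentLinkedQueue<>(urls);
        AtomicInteger successes = new AtomicInteger(); // safe to update from several threads
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < quota; i++) {
                pending.add(executor.submit(() -> {
                    String url;
                    // poll() returns null once the queue is empty, which ends the task
                    while ((url = queue.poll()) != null) {
                        if (download.test(url)) {
                            successes.incrementAndGet();
                            return; // this task has filled its slot
                        }
                    }
                }));
            }
            for (Future<?> future : pending) {
                future.get(); // wait for every task to finish
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            executor.shutdown();
        }
        return successes.get();
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("ok1", "bad1", "ok2", "bad2", "ok3");
        // pretend that only URLs starting with "ok" can be downloaded
        int saved = downloadUpTo(2, urls, url -> url.startsWith("ok"));
        System.out.println("Saved " + saved + " page(s)");
    }
}
```

Since every task stops after its first success, the number of saved pages can never exceed the quota, and the AtomicInteger avoids the lost updates that a plain int counter shared between threads can suffer.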
