Webpage collector using Google bot - Java

I'm continuing a project that has been under way for a few years at my university. One of the things this project does is collect web pages by identifying itself as the Google bot.
Due to a problem I cannot understand, the project is not getting past this part. I have already researched a lot about what might be happening and whether some part of the code is outdated.
The code is in Java and uses Maven for project management.
I've tried updating some entries in Maven's pom.xml.
I've also tried changing the part of the code that uses the bot, but nothing works.
I'm posting the part of the code that isn't working as it should:
private List<JSONObject> querySearch(int numSeeds, String query) {
    List<JSONObject> result = new ArrayList<>();
    int start = 0;
    do {
        String url = SEARCH_URL + query.replaceAll(" ", "+") + FILE_TYPE + "html" + START + start;
        Connection conn = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
                .timeout(5000);
        try {
            Document doc = conn.get();
            result.addAll(formatter(doc));
        } catch (IOException e) {
            System.err.println("Could not search for seed pages in IO.");
            System.err.println(e);
        } catch (ParseException e) {
            System.err.println("Could not search for seed pages in Parse.");
            System.err.println(e);
        }
        start += 10;
    } while (result.size() < numSeeds);
    return result;
}
What some variables do:
private static final String SEARCH_URL = "https://www.google.com/search?q=";
private static final String FILE_TYPE = "&fileType=";
private static final String START = "&start=";
private QueryBuilder queryBuilder;
public GoogleAjaxSearch() {
this.queryBuilder = new QueryBuilder();
}
Up to this point everything is OK: it connects as the bot and gets an HTML page from Google. The problem is separating what was found and keeping only the links, which should be inside ("h3.r > a").
That is done in this part, with result.addAll(formatter(doc)):
public List<JSONObject> formatter(Document doc) throws ParseException {
    List<JSONObject> entries = new ArrayList<>();
    Elements results = doc.select("h3.r > a");
    for (Element result : results) {
        //System.out.println(result.toString());
        JSONObject entry = new JSONObject();
        entry.put("url", (result.attr("href").substring(6, result.attr("href").indexOf("&")).substring(1)));
        entry.put("anchor", result.text());
        entries.add(entry);
    }
    return entries;
}
So when it gets to Elements results = doc.select("h3.r > a"), it probably finds no h3 elements, so the for loop is never entered and nothing is added to the entries list. Control then goes back to querySearch, which tries again without the result list ever growing, and the code ends up in an infinite loop trying to fetch the requested data and never finding it.
If anyone here can help me, I've been trying for a while and I don't know what else to do. Thanks in advance.
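Google dropped the h3.r markup from its result pages a while ago, so that selector now silently matches nothing. One option that has worked against the plain (non-JavaScript) HTML results page is to select the /url?q= redirect anchors directly. Below is a minimal sketch of a drop-in alternative to formatter; the selector and the /url?q= link format are assumptions to verify against the HTML you actually receive from Google:

// Sketch: select result links by their /url?q=... redirect form instead of "h3.r > a".
public List<JSONObject> formatterByRedirectLinks(Document doc) {
    List<JSONObject> entries = new ArrayList<>();
    // On the basic HTML page, result anchors typically look like <a href="/url?q=http://example.com&sa=...">
    for (Element link : doc.select("a[href^=/url?q=]")) {
        String href = link.attr("href");
        int end = href.indexOf('&');
        String target = end > 0 ? href.substring("/url?q=".length(), end)
                                : href.substring("/url?q=".length());
        JSONObject entry = new JSONObject();
        entry.put("url", target);
        entry.put("anchor", link.text());
        entries.add(entry);
    }
    return entries;
}

Whichever selector you use, it is also worth capping the do/while in querySearch (for example, stop after a fixed number of pages or after two consecutive empty results), so that the next markup change degrades into an error message instead of an infinite loop.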

Related

Can I update my TableView items one at a time without blocking the UI thread?

I have two methods. The first retrieves a list of results from a search method in another class.
/* 2 - Retrieve list of results */
qmitResultsList = QMITSearchUtil.execute(URL, keyword);
/* 3 - Show results */
populateTable(qmitResultsList, tableView);
The second, populateTable() adds all the items to the table at once by calling:
ObservableList<QMITResult> dataPriority = FXCollections.observableArrayList(
qmitResultsList
);
tableView.setItems(dataPriority);
My goal is to add each new element to the TableView as it is being processed in real-time. For example, instead of processing and returning the entire list in the first method, QMITSearchUtil.execute(), I would like to update the UI with each result that is returned, one at a time. How can this be accomplished? I've tried a few ways, using a Platform.runLater() hack for example, with no success...
I discovered the answer to my question. I first define the ObservableList for my TableView:
ObservableList<QMITResult> dataPriority = FXCollections.observableArrayList();
Then I pass that into the execute() method that runs the background thread:
private void execute(String URL, String keyword, ObservableList<QMITResult> dataPriority) throws Exception {
    /* Download HTML page and create list of URLs from relevant links */
    Elements links = getLinkList(URL);
    List<QMITResult> qmitResults = new ArrayList<>();
    new Thread(() -> {
        for (Element link : links) {
            try {
                /* Create a list of formatted URLs to loop through */
                String linkText = link.toString();
                String titleText = link.text();
                String formattedLink = StringUtils.substringBetween(linkText, "<a href=\"", "\"").replace("\\", "/");
                System.out.println(titleText);
                System.out.println(formattedLink);
                /* Create Word Document for each link and parse for keyword */
                QMITResult qmitResultNode = null;
                try {
                    qmitResultNode = parseDocument(keyword, formattedLink, titleText);
                } catch (Exception e) {
                    e.printStackTrace();
                }
                qmitResults.add(qmitResultNode);
                dataPriority.add(qmitResultNode);
                Thread.sleep(200);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }).start();
    tableView.setItems(dataPriority);
}
The result is that while the list is being formed each TableView item is being individually published without blocking the main UI thread. They come one at a time.
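One caveat: the ObservableList backs the TableView, and JavaFX expects anything that touches the scene graph to happen on the FX Application Thread, so adding to the list straight from the background thread can throw or mis-render. A small variant of the loop body that stays on the safe side is to hand each add to Platform.runLater (only the list mutation changes; the rest of the method is as above):

// Sketch: push each parsed result onto the JavaFX Application Thread before mutating the bound list.
new Thread(() -> {
    for (Element link : links) {
        try {
            String formattedLink = StringUtils.substringBetween(link.toString(), "<a href=\"", "\"").replace("\\", "/");
            QMITResult node = parseDocument(keyword, formattedLink, link.text());
            // The TableView observes dataPriority, so mutate it on the FX thread.
            Platform.runLater(() -> dataPriority.add(node));
            Thread.sleep(200);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}).start();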

Create a new directory in SharePoint using SOAP in Java

I have just started working with SharePoint using Java. My firm uses SharePoint 2010, so I have to work with SOAP, and I already have the following functionality working:
Download a document from SharePoint.
Upload a document to SharePoint.
But I am stuck at creating a new directory. I have tried some code found by googling, but no luck.
Here is my code:
public void createFolder(ListsSoap ls, String filePathToCreate,
String fileName, LoginDO loginDO, HttpServletResponse response)
throws Exception {
try {
String query = "";
String folderName = "NewFolder";
// 1. Prepare Query, Query Options and field Options
if (CommonUtilities.isValidStr(folderName)) {
// Prepare Query & Query Options for child folders
query = "<Batch OnError=\"Continue\" PreCalc=\"TRUE\" ListVersion=\"0\" " +
"RootFolder=\"https://xxx/Shared%20Documents/FPL%20SOs%20-%20JD%20Documents\">"
+ "<Method ID=\"1\" Cmd=\"New\">"
+ "<Field Name=\"FSObjType\">1</Field>"
+"<Field Name=\"ID\">New</Field>"
+ "<Field Name=\"BaseName\">" + folderName + "</Field>"
+ "</Method></Batch>";
} else {
// Prepare Query & Query Options for Parent folders
query = "<Query><Where><Eq><FieldRef Name=\"FSObjType\" />"
+ "<Value Type=\"Lookup\">1</Value></Eq></Where></Query>";
}
UpdateListItems.Updates updates = null;
// 2. Prepare Query, QueryOptions & ViewFields object as per options
if (CommonUtilities.isValidStr(query)) {
updates = new UpdateListItems.Updates();
updates.getContent().add(
sharepointUtil.createSharePointCAMLNode(query));
}
// 3. Call Web service to get result for selected options
UpdateListItemsResult result = ls
.updateListItems(SHAREPOINT_FOLDER_NAME,updates);
/*
* CommonUtilities
* .getApplicationProperty(ApplicationConstants.SHAREPOINT_FOLDER_NAME
* ), "", msQuery, viewFields, "", msQueryOptions, "");
*/
// 4. Get elements from share point result
Element element = (Element) result.getContent().get(0);
NodeList nl = element.getElementsByTagName("z:row");
for (int i = 0; i < nl.getLength(); i++) {
Node node = nl.item(i);
System.out.println("Some.!");
}
} catch (Exception e) {
e.printStackTrace();
}
// logger.logMethodEnd();
}
Anyhow, the updateListItems() method executes without error, but there is nothing in the result.
Any help is appreciated.
Thank you :)
It seems there was a problem in
"RootFolder=\"https://xxx/Shared%20Documents/FPL%20SOs%20-%20JD%20Documents\">"
while passing the root directory to create the folder within.
I found a workaround by passing folderName with the full directory hierarchy.
Thank you all for your inputs :)
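For anyone hitting the same thing, here is a minimal sketch of what that workaround can look like: drop the RootFolder attribute and put the folder path (relative to the document library) into BaseName instead. The path below is illustrative, not the actual value, and it reuses the helper calls already shown in the question:

// Sketch: create a nested folder by passing the full path in BaseName instead of RootFolder.
String folderPath = "FPL SOs - JD Documents/NewFolder"; // hypothetical path relative to the library
String query = "<Batch OnError=\"Continue\" PreCalc=\"TRUE\" ListVersion=\"0\">"
        + "<Method ID=\"1\" Cmd=\"New\">"
        + "<Field Name=\"FSObjType\">1</Field>"
        + "<Field Name=\"ID\">New</Field>"
        + "<Field Name=\"BaseName\">" + folderPath + "</Field>"
        + "</Method></Batch>";

UpdateListItems.Updates updates = new UpdateListItems.Updates();
updates.getContent().add(sharepointUtil.createSharePointCAMLNode(query));
UpdateListItemsResult result = ls.updateListItems(SHAREPOINT_FOLDER_NAME, updates);

Whether BaseName accepts a nested path can depend on the list; the answer above reports that passing the full hierarchy is what made it work.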

How to get more than 1000 search results with the GitHub Eclipse Java API

I am trying to search a large number of repositories with the searchRepository method in https://github.com/eclipse/egit-github/tree/master/org.eclipse.egit.github.core
However, there is a limitation that prevents getting more than 1000 results
https://developer.github.com/v3/search/#about-the-search-api
and the client throws the exception "Only the first 1000 search results are available (422)" (based on the code example below).
A solution is presented in github search limit results.
My question is how I can split up the search into segments by date (as mentioned in that thread), or whether there is another way to do this with the Java GitHub API.
int countRepos = 0;
Map<String, String> searchQuery = new HashMap<String, String>();
searchQuery.put("language", "java");
List<SearchRepository> searchRes = null;
GitHubClient client = new GitHubClient();
client.setCredentials("xxx", "xxxxx");
RepositoryService service = new RepositoryService(client);
for (int page = 1; page < 12; page++) {
    try {
        searchRes = service.searchRepositories(searchQuery, page);
    } catch (IOException e) {
        e.printStackTrace();
    }
    for (SearchRepository repo : searchRes) {
        System.out.println("Repository" + countRepos + ": " + repo.getOwner() + "/" + repo.getName());
        countRepos++;
    }
}
System.out.println("Total number of repositories are=" + countRepos);
Thanks in advance.
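One way to stay under the 1000-result cap is to add a created qualifier to the query map and iterate over date ranges, so that each segment returns fewer than 1000 repositories. A sketch of that idea, assuming the Map-based searchRepositories turns each entry into a key:value search qualifier; the date ranges are arbitrary examples and may need to be narrowed further if a segment still exceeds 1000 hits:

// Sketch: segment the repository search by creation date to stay under the 1000-result cap.
String[] createdRanges = {
        "2014-01-01..2014-06-30",   // hypothetical segments; split further if one still tops 1000
        "2014-07-01..2014-12-31",
        "2015-01-01..2015-06-30"
};

GitHubClient client = new GitHubClient();
client.setCredentials("xxx", "xxxxx");
RepositoryService service = new RepositoryService(client);

int countRepos = 0;
for (String range : createdRanges) {
    Map<String, String> searchQuery = new HashMap<String, String>();
    searchQuery.put("language", "java");
    searchQuery.put("created", range); // intended to become the created:YYYY-MM-DD..YYYY-MM-DD qualifier
    for (int page = 1; page <= 10; page++) {
        List<SearchRepository> pageResults;
        try {
            pageResults = service.searchRepositories(searchQuery, page);
        } catch (IOException e) {
            e.printStackTrace();
            break;
        }
        if (pageResults == null || pageResults.isEmpty()) {
            break;
        }
        for (SearchRepository repo : pageResults) {
            System.out.println("Repository" + countRepos + ": " + repo.getOwner() + "/" + repo.getName());
            countRepos++;
        }
    }
}
System.out.println("Total number of repositories are=" + countRepos);

If the Map overload does not pass the created qualifier through to the search endpoint, the same segmentation can be done by building the query string yourself and using the String-based searchRepositories overload.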

NetSuite SOAP API (SuiteTalk) to dump General Ledger

Can anyone give me advice on how to read the general ledger using SuiteTalk, the SOAP API from NetSuite?
For example, if you look at an account or a transaction on the NetSuite UI, there is an option to select "GL Impact". This produces a list of relevant general ledger entries.
However, I couldn't figure out a way to get the same list using SuiteTalk. One initially promising SOAP operation I tried calling was getPostingTransactionSummary(), but that is just a summary and lacks detail such as transaction dates. Another way is to call search() passing a TransactionSearchBasic object. That returns too many types of transaction and I'm not sure which of those actually have an impact on the general ledger.
I'm using Java and Axis toolkit for the SOAP operations, but examples in any language whatsoever (or raw SOAP XML) would be appreciated.
You are on the right track with your transaction search.
You are looking for transactions where posting is true and where the line has an account.
However, I'd set this up in the saved search editor, at least until you've figured out how you are going to filter down to a manageable number of lines. Then use TransactionSearchAdvanced with savedSearchId to pull that info via SuiteTalk.
I am able to search GL transactions with the code below; this could help you.
public void GetTransactionData()
{
DataTable dtData = new DataTable();
string errorMsg = "";
LoginToService(ref errorMsg);
TransactionSearch objTransSearch = new TransactionSearch();
TransactionSearchBasic objTransSearchBasic = new TransactionSearchBasic();
SearchEnumMultiSelectField semsf = new SearchEnumMultiSelectField();
semsf.@operator = SearchEnumMultiSelectFieldOperator.anyOf;
semsf.operatorSpecified = true;
semsf.searchValue = new string[] { "Journal" };
objTransSearchBasic.type = semsf;
objTransSearchBasic.postingPeriod = new RecordRef() { internalId = "43" };
objTransSearch.basic = objTransSearchBasic;
//Set Search Preferences
SearchPreferences _searchPreferences = new SearchPreferences();
Preferences _prefs = new Preferences();
_serviceInstance.preferences = _prefs;
_serviceInstance.searchPreferences = _searchPreferences;
_searchPreferences.pageSize = 1000;
_searchPreferences.pageSizeSpecified = true;
_searchPreferences.bodyFieldsOnly = false;
//Set Search Preferences
try
{
SearchResult result = _serviceInstance.search(objTransSearch);
List<JournalEntry> lstJEntry = new List<JournalEntry>();
List<JournalEntryLine> lstLineItems = new List<JournalEntryLine>();
if (result.status.isSuccess)
{
for (int i = 0; i <= result.recordList.Length - 1; i += 1)
{
JournalEntry JEntry = (JournalEntry)result.recordList[i];
lstJEntry.Add((JournalEntry)result.recordList[i]);
if (JEntry.lineList != null)
{
foreach (JournalEntryLine line in JEntry.lineList.line)
{
lstLineItems.Add(line);
}
}
}
}
try
{
_serviceInstance.logout();
}
catch (Exception ex)
{
}
}
catch (Exception ex)
{
throw ex;
}
}
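Since the question is about Java with Axis, here is the same idea expressed against the Axis-generated SuiteTalk stubs. This is only a minimal sketch under the assumption that your generated classes follow the usual setter naming (NetSuitePortType, TransactionSearch, SearchEnumMultiSelectField, SearchBooleanField); exact class, method, and enum names can differ between WSDL versions, and the "_journal" type value and posting flag are assumptions to adapt to whichever transaction types you consider GL-relevant:

// Sketch: SuiteTalk search for posting journal entries via the Axis-generated Java stubs.
TransactionSearchBasic basic = new TransactionSearchBasic();

// Restrict to journal entries (widen or change the type list as needed).
SearchEnumMultiSelectField type = new SearchEnumMultiSelectField();
type.setOperator(SearchEnumMultiSelectFieldOperator.anyOf);
type.setSearchValue(new String[] { "_journal" });
basic.setType(type);

// Only transactions that actually post to the general ledger.
SearchBooleanField posting = new SearchBooleanField();
posting.setSearchValue(Boolean.TRUE);
basic.setPosting(posting);

TransactionSearch search = new TransactionSearch();
search.setBasic(basic);

// 'port' is your authenticated NetSuitePortType; set bodyFieldsOnly=false in the
// search preferences beforehand so that line-level GL detail is returned.
SearchResult result = port.search(search);
Record[] records = result.getRecordList().getRecord();
for (Record record : records) {
    JournalEntry je = (JournalEntry) record;
    // je.getLineList().getLine() holds the per-line account/amount detail.
}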

Issue using Sibling Index in JSoup 1.8.2

New to Java. Trying to learn/practice JSoup.
Goal: Extract the "Close" column from the Historical Prices page on Yahoo Finance.
The code below returns both the "Close" prices and the "Adj Close" prices.
"s" below is any ticker in the S&P 500.
I found another post which I've been using as a model (here: How to parse the cells of the 3rd column of a table?).
public class MovingAverage200 {

    private String ticker;
    private String movingAverageURL;
    double movingAverage200;

    public MovingAverage200(String s) {
        ticker = s;
        movingAverageURL = ("https://finance.yahoo.com/q/hp?s=" + ticker + "+Historical+Prices");
    }

    public void setMovingAverage() {
        try {
            Document document = Jsoup.connect(movingAverageURL).get();
            Elements prices = document.select("td.yfnc_tabledata1:eq(4)");
            for (Element price : prices) {
                System.out.println(price.text());
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }
}
Hint posted by alkis:
If you are using a previous release (1.8.2) then just use the latest. There were some performance tweaks that introduced bugs regarding siblings.
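Besides upgrading jsoup, a selector-independent way to keep only the "Close" column is to walk each table row and index into its cells, so the result does not depend on :eq() sibling-index behaviour. A small sketch that could replace the body of setMovingAverage()'s try block, assuming the historical-prices table still uses td.yfnc_tabledata1 cells with Close as the fifth column and Adj Close as the seventh (verify against the current markup, since Yahoo has changed this page over the years):

// Sketch: pick the "Close" cell (5th column) from each data row instead of relying on :eq().
// Intended to run inside the existing try/catch of setMovingAverage().
Document document = Jsoup.connect(movingAverageURL).get();
for (Element row : document.select("tr")) {
    Elements cells = row.select("td.yfnc_tabledata1");
    // A full price row has 7 cells: Date, Open, High, Low, Close, Volume, Adj Close.
    if (cells.size() == 7) {
        System.out.println(cells.get(4).text()); // index 4 = "Close"
    }
}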
