I'm having an issue with Jsoup hanging. I was testing Jsoup code that was previously working when, out of nowhere, it stopped working. I haven't changed any of the code in about a week, and until now it has been working.
I've been trying to hit Wikipedia's Main Page to scrape links off of it for a homework assignment.
It hangs without throwing any errors, and the program never moves past the connection's .get() method. I waited about 10 minutes and still nothing happened.
Below is my code:
private WikiPage pullData(String url, WikiPage parent) {
    WikiPage wp;
    try {
        String decodedURL = URLDecoder.decode(url, "UTF-8");
        Document doc = Jsoup.connect(decodedURL).get();
        Elements links = doc.select("a");
        Elements paragraphs = doc.select("p");
        Element t = doc.select("title").first();
        StringBuilder words = new StringBuilder();
        String title = t.text().replace(" - Wikipedia", "");
        paragraphs.forEach(e -> {
            words.append(e.text().toLowerCase());
        });
        wp = new WikiPage(url, title, parent);
        for (int i = 0; i < AMOUNT_LINKS; i++) {
            boolean properLink = false;
            while (!properLink) {
                //int rnd = R_G.nextInt(links.size());
                String a = links.get(i).attr("href");
                if (a.length() >= 5 && a.substring(0, 5).equals("/wiki") && containsChecker(a)) {
                    String BASE_URL = "https://en.wikipedia.org";
                    String decode = URLDecoder.decode(BASE_URL + a, "UTF-8");
                    wp.addChildren(decode);
                    properLink = true;
                }
            }
        }
        String[] splitWords = words.toString().replaceAll("[_$&+,:;=?@#|'<>.^*()%!\\[\\]\\-\"/{}]", " ").split(" ");
        for (String s : splitWords) {
            if (s.length() >= 1) {
                wp.addToWords(new WordCount(s, 1, 0));
            }
        }
        System.out.printf("%1$-10s %2$-45s\n", counter, title);
        counter++;
    } catch (Exception e) {
        e.printStackTrace();
        return null;
    }
    return wp;
}
Here is a screenshot after running the program for 10 minutes with a breakpoint at Elements links = doc.select("a");:
[Screenshot: before hitting the .get() method]
[Screenshot: hanging for about 10 minutes]
I can't see where the issue lies; I've even tried different websites, but it just does not work at all.
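One thing I still plan to try, on the assumption that it's a network or blocking issue rather than the parsing code, is forcing an explicit timeout and a browser-like User-Agent on the fetch, so the call at least fails fast with an exception instead of hanging indefinitely:

// Diagnostic variant (sketch): fail fast instead of hanging
Document doc = Jsoup.connect(decodedURL)
        .userAgent("Mozilla/5.0") // some servers stall requests from the default Java agent
        .timeout(10000)           // give up after 10 seconds and throw instead of hanging
        .get();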
Thanks for the help!
I am trying to use twitter4j to query Twitter status data. I need only the tweets that a user posted on their own timeline for a given day.
So far, I have used this code:
try {
    for (int i = 0; i < userNames.length; i++) {
        int totalCount = 0;
        Query query = new Query(userNames[i]);
        query.setCount(100);
        int searchResultCount;
        long lowestTweetId = Long.MAX_VALUE;
        totalCount = 0;
        Date date = new Date(DateTimeUtils.getNDaysbackDailySliceStamp(1));
        String modifiedDate = new SimpleDateFormat("yyyy-MM-dd").format(date);
        System.out.println(modifiedDate);
        query.setSince(modifiedDate);
        date = new Date(DateTimeUtils.getDailySliceStamp());
        modifiedDate = new SimpleDateFormat("yyyy-MM-dd").format(date);
        System.out.println(modifiedDate);
        query.setUntil(modifiedDate);
        List<DBObject> dbl = new ArrayList<DBObject>();
        Set<Long> ste = new HashSet<Long>();
        do {
            QueryResult queryResult = twitter.search(query);
            searchResultCount = queryResult.getTweets().size();
            for (Status st : queryResult.getTweets()) {
                if (!st.isRetweet()) {
                    URLEntity[] uEn = st.getURLEntities();
                    StringBuilder url = new StringBuilder();
                    for (URLEntity urle : uEn) {
                        if (urle.getURL() != null && !urle.getURL().isEmpty()) {
                            url.append(urle.getExpandedURL());
                        }
                    }
                    ste.add(st.getId());
                    dbl.add(createTweetObject(userNames[i]/*, total*/,
                            st.getText(), st.getRetweetCount(), st.getId(),
                            url.toString(), st.getCreatedAt(), st.isRetweet()));
                }
            }
        } while (searchResultCount != 0 && searchResultCount % 100 == 0);
        System.out.println(dbl.size());
        System.out.println(dbl);
        if (dbl != null && !dbl.isEmpty()) {
            // populateTweetCollection(dbl);
        }
        System.out.println("TweetCount" + ste.size());
        System.out.println(ste);
    }
} catch (Exception e) {
    log.error("Exception in TwitterTime line api---"
            + Config.getStackTrace(e));
}
But this code also gives me tweets made by others that mention the user I am looking for.
For example, I searched for one day's worth of my tweets. There were actually 8, but the search returned 12 results, because some of my friends had tweeted on their own timelines mentioning my Twitter name with the @username operator.
I would also like to confirm whether a truncated tweet has the same ID for the whole group.
Try this:
try {
    ResponseList<User> users = twitter.lookupUsers("user name");
    for (User auser : users) {
        System.out.println("Friend's Name " + auser.getName());
        if (auser.getStatus() != null) {
            System.out.println("Friend timeline");
            List<Status> statusess = twitter.getHomeTimeline();
            for (Status status3 : statusess) {
                System.out.println(status3.getText());
            }
        }
    }
} catch (TwitterException e) {
    e.printStackTrace();
}
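Note that getHomeTimeline() returns the authenticated account's home feed, not a specific user's own posts. If the goal is the tweets one account posted itself, twitter4j also exposes a per-user timeline call; a minimal sketch (assuming twitter is an initialized Twitter instance, and the screen name is illustrative):

// Fetch the most recent tweets posted by one specific account
ResponseList<Status> timeline = twitter.getUserTimeline("someScreenName");
for (Status status : timeline) {
    System.out.println(status.getText());
}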
It worked using this code:
if (!st.isRetweet() && st.getUser().getScreenName().equalsIgnoreCase(userNames[i])) {
    URLEntity[] uEn = st.getURLEntities();
    StringBuilder url = new StringBuilder();
    // ... rest of the collection logic as before ...
}
if (st.getId() < lowestTweetId) {
    lowestTweetId = st.getId();
    query.setMaxId(lowestTweetId);
}
I verify that the status is not a retweet and that its screen name matches the user I am looking for.
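For completeness, here is a sketch of how the screen-name filter and the maxId pagination fit together inside the search loop (the surrounding loop is my assumption; twitter is an initialized Twitter instance):

long lowestTweetId = Long.MAX_VALUE;
QueryResult queryResult;
do {
    queryResult = twitter.search(query);
    for (Status st : queryResult.getTweets()) {
        // keep only original tweets posted by the target account itself
        if (!st.isRetweet()
                && st.getUser().getScreenName().equalsIgnoreCase(userNames[i])) {
            // ... collect the tweet as before ...
        }
        // remember the lowest id seen so the next page starts below it
        if (st.getId() < lowestTweetId) {
            lowestTweetId = st.getId();
        }
    }
    query.setMaxId(lowestTweetId - 1); // -1 so the boundary tweet is not fetched twice
} while (!queryResult.getTweets().isEmpty());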
I'm trying to manipulate a JSONArray, rawJArr (taken from the Reddit API), and get the url and a bitmap (taken from the gfycat "API") from each object, in order to build an ArrayList (listing) of Highlight instances. Each Highlight will be converted into a CardView containing a picture, a short description, and a link to the gfycat.
try {
    int count = 0;
    int highlightMax;
    Bitmap bitmap = null;
    Highlight curHighlight;
    myJSON = new JSONObject(rawJSON);
    curJSON = myJSON.getJSONObject("data");
    rawJArr = curJSON.getJSONArray("children");
    String strHighlightNo = mySPrefs.getString("pref_highlightNo", "notFound");
    if (strHighlightNo.equals("notFound")) {
        Log.w("FT", "shared pref not found");
        return null;
    }
    highlightMax = Integer.parseInt(strHighlightNo);
    Log.w("Arr Length", Integer.toString(rawJArr.length()));
    Log.w("Highlight No", Integer.toString(highlightMax));
    for (int i = 0; i < rawJArr.length(); i++) {
        Log.w("Count", Integer.toString(count));
        Log.w("I", Integer.toString(i));
        if (count == highlightMax) {
            Log.w("FT", "Breakpoint reached!");
            break;
        }
        curJSON = rawJArr.getJSONObject(i).getJSONObject("data");
        String url = curJSON.getString("url");
        String[] parts = url.split("//");
        String imageUrl = "http://thumbs." + parts[1] + "-thumb100.jpg";
        try {
            bitmap = BitmapFactory.decodeStream((InputStream) new URL(imageUrl).getContent());
        } catch (MalformedURLException e) {
            e.printStackTrace();
        }
        // if there is no available picture, then don't include one in the Highlight
        if (bitmap == null) {
            Log.w("FT", "Null bitmap");
            curHighlight = new Highlight(curJSON.getString("title"), url, null);
            listing.add(curHighlight);
            count++;
        } else {
            Log.w("FT", "Bitmap Available");
            curHighlight = new Highlight(curJSON.getString("title"), url, bitmap);
            listing.add(curHighlight);
            count++;
        }
    }
} catch (JSONException e) {
    e.printStackTrace();
    return listing;
}
However, my for loop terminates way too early. The current JSONArray I'm using has a length of 25, and I've specified a pref_highlightNo of 15, but my for loop terminates after 6 iterations.
My Log.w tests in the for loop all record the same count (Count: 1, Integer: 1 - Count: 6, Integer: 6).
I'm struggling to see why my loop is terminating: there is no stack trace printed to my console, and my app doesn't crash.
Any idea what's going on?
Turns out the issue was specific to the last URL I was trying to build to fetch the required gfycat: I didn't have any code to handle cases where the link began with http://www.
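A sketch of the kind of guard that covers it (the thumbs. host pattern is the same one from the code above; the www. handling is the part that was missing):

// strip an optional "www." prefix before building the thumbnail host
String[] parts = url.split("//");
String host = parts[1].startsWith("www.") ? parts[1].substring(4) : parts[1];
String imageUrl = "http://thumbs." + host + "-thumb100.jpg";

Catching IOException (not just MalformedURLException) around the decode should also keep one bad thumbnail URL from silently ending the loop:

try {
    bitmap = BitmapFactory.decodeStream((InputStream) new URL(imageUrl).getContent());
} catch (IOException e) { // MalformedURLException is a subclass; 404s also land here
    bitmap = null;        // fall through to the "no picture" branch
}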
I have written code to find broken links on a website using Selenium WebDriver in Java. Links get added to a HashSet while the different URLs are being visited, but when I read the added URLs back out of the HashSet, execution stops after some time. This happens because the iterator stays the same even as new links are added to the HashSet. I want execution to continue for all links present in the HashSet.
(I have tried converting the Set to an array, but then duplicate links are executed multiple times.)
public Set<String> unique_links;
HashMap<String, String> result;
Set<String> finalLinkSet = new HashSet<>();
Set<String> hs = new HashSet<>();
Set<String> uniqueLinkSet = new HashSet<>();
// String[] finalLinkArray;
String[] finalLinkArray;
boolean isValid = false;
FileWriter fstream;
BufferedWriter out;
int count = 1;
int FC = 0;
Set<String> secondaryset = new HashSet<>();
// String Responsecode = null;

@Test
public void LinkTesting() throws IOException, RowsExceededException,
        WriteException {
    w.manage().deleteAllCookies();
    unique_links = new HashSet<String>();
    w.get("http://www.skyscape.com");
    ArrayList<WebElement> urlList = new ArrayList<WebElement>();
    urlList = (ArrayList<WebElement>) w.findElements(By.tagName("a"));
    setFinalLinkSet(getUniqueList(urlList));
    for (Iterator<String> i = finalLinkSet.iterator(); i.hasNext(); ) {
        System.out.println(finalLinkSet.size());
        String currenturl = (String) i.next();
        if ((currenturl.length() > 0 && currenturl
                .startsWith("http://www.skyscape.com"))) {
            if (!currenturl.startsWith("http://www.skyscape.com/estore/") &&
                    (!currenturl.startsWith("http://www.skyscape.com/demos/"))) {
                System.out.println(currenturl);
                getResponseCode(currenturl);
            }
        }
    }
    writetoexcel();
}

public void setFinalLinkSet(Set<String> finalLinkSet) {
    this.finalLinkSet = finalLinkSet;
}

// function to get links from the current page and return them as a set
public Set<String> getLinksOnPage(String url) {
    ArrayList<WebElement> secondaryUrl = new ArrayList<WebElement>();
    secondaryUrl = (ArrayList<WebElement>) w.findElements(By.tagName("a"));
    for (int i = 0; i < secondaryUrl.size(); i++) {
        secondaryset.add((secondaryUrl.get(i).getAttribute("href").toString()));
    }
    return secondaryset;
}

// function to fetch links from the array list and store unique links in a hashset
public Set<String> getUniqueList(ArrayList<WebElement> url_list) {
    for (int i = 0; i < url_list.size(); i++) {
        uniqueLinkSet.add(url_list.get(i).getAttribute("href").toString());
    }
    return uniqueLinkSet;
}

public boolean getResponseCode(String url) {
    boolean isValid = false;
    if (result == null) {
        result = new HashMap<String, String>();
    }
    try {
        URL u = new URL(url);
        w.navigate().to(url);
        HttpURLConnection h = (HttpURLConnection) u.openConnection();
        h.setRequestMethod("GET");
        h.connect();
        System.out.println(h.getResponseCode());
        if ((h.getResponseCode() != 500) && (h.getResponseCode() != 404)
                && (h.getResponseCode() != 403)
                && (h.getResponseCode() != 402)
                && (h.getResponseCode() != 400)
                && (h.getResponseCode() != 401)) {
            // && (h.getResponseCode() != 302)) {
            //getLinksOnPage(url);
            Set<String> unique2 = getLinksOnPage(url);
            setFinalLinkSet(unique2);
            result.put(url.toString(), "" + h.getResponseCode());
        } else {
            result.put(url.toString(), "" + h.getResponseCode());
            FC++;
        }
        return isValid;
    } catch (Exception e) {
    }
    return isValid;
}

private void writetoexcel() throws IOException, RowsExceededException,
        WriteException {
    FileOutputStream fo = new FileOutputStream("OldLinks.xls");
    WritableWorkbook wwb = Workbook.createWorkbook(fo);
    WritableSheet ws = wwb.createSheet("Links", 0);
    int recordsToPrint = result.size();
    Label HeaderUrl = new Label(0, 0, "Urls");
    ws.addCell(HeaderUrl);
    Label HeaderCode = new Label(1, 0, "Response Code");
    ws.addCell(HeaderCode);
    Label HeaderStatus = new Label(2, 0, "Status");
    ws.addCell(HeaderStatus);
    Iterator<Entry<String, String>> it = result.entrySet().iterator();
    while (it.hasNext() && count < recordsToPrint) {
        String Responsecode = null;
        Map.Entry<String, String> pairs = it.next();
        System.out.println("Value is --" + pairs.getKey() + " - "
                + pairs.getValue() + "\n");
        Label Urllink = new Label(0, count, pairs.getKey());
        Label RespCode = new Label(1, count, pairs.getValue());
        Responsecode = pairs.getValue();
        System.out.println(Responsecode);
        if ((Responsecode.equals("500")) || (Responsecode.equals("404"))
                || (Responsecode.equals("403"))
                || (Responsecode.equals("400"))
                || (Responsecode.equals("402"))
                || (Responsecode.equals("401"))) {
            // || (Responsecode.equals("302"))) {
            Label Status1 = new Label(2, count, "Fail");
            ws.addCell(Status1);
        } else {
            Label Status2 = new Label(2, count, "Pass");
            ws.addCell(Status2);
        }
        try {
            ws.addCell(Urllink);
        } catch (RowsExceededException e) {
            e.printStackTrace();
        } catch (WriteException e) {
            e.printStackTrace();
        }
        ws.addCell(RespCode);
        count++;
    }
    Label FCS = new Label(4, 1, "Fail Urls Count is = " + FC);
    ws.addCell(FCS);
    wwb.write();
    wwb.close();
}
}
In short, as far as I understand the problem: you have (at least) two threads (although I couldn't find them in the overly long code example): one adds entries to the HashSet, and the other should continuously list elements as they are added to the HashSet.
1st: You should use a concurrent data structure for this, not a simple HashSet.
2nd: Iterators of HashSet do not support concurrent modification, so you cannot have an iterator "waiting" for new entries to be added.
Best is to change your code to use some kind of event-message pattern (sometimes also called broadcaster/listener), where finding a new URL generates an event that other parts of your code listen to and then write to the file.
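A minimal single-threaded sketch of that idea, using a work queue as the "event" channel (the names are illustrative, not from the original code):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Crawl frontier: the queue holds URLs still to visit, and the set remembers
// every URL ever enqueued, so nothing is checked twice and the crawl naturally
// ends when no new links turn up.
class CrawlFrontier {
    private final Deque<String> pending = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    boolean enqueue(String url) {
        // Set.add returns false for duplicates, so each URL is queued at most once
        return seen.add(url) && pending.add(url);
    }

    String next() {
        return pending.poll(); // null once the frontier is exhausted
    }
}

The main loop then becomes while ((url = frontier.next()) != null) { ... }, and every link found on a page is simply passed to enqueue(), with no iterator left to invalidate.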
Your loop finishes (earlier than desired) for the following reasons:
The initialization part Iterator<String> i = finalLinkSet.iterator() of your for-loop
for(Iterator<String> i = finalLinkSet.iterator(); i.hasNext(); ) {
is evaluated only once, when the loop is started. Hence it will not react to changes to finalLinkSet even if there were some.
You are not making any changes to finalLinkSet. Instead you are overwriting it with a new set when calling
setFinalLinkSet(unique2);
So instead you should:
Use a list, so you have ordered elements. (Adding entries to an unordered set would make it impossible to know which ones you have already iterated over.) I suggest an ArrayList<String>: you get constant access time, with only a small performance cost for resizing when adding new entries.
Modify your for-loop to use an index, so that evaluating the initialization part once is sufficient and the loop reacts to the changing size of the list:
for (int i = 0; i < finalLinkList.size(); i++) {
    System.out.println(finalLinkList.size());
    String currenturl = finalLinkList.get(i);
Then instead of overwriting the list you should:
// for both occurrences
addToFinalLinkList(...); // see new code below
and
public void addToFinalLinkList(Set<String> tempSet) {
    for (String url : tempSet) {
        if (!finalLinkList.contains(url)) {
            finalLinkList.add(url);
        }
    }
}
I know this is not best from the performance point of view, but since you are inside a test, this shouldn't be a problem from what I see...
Firstly, yes I'm calling this from a web browser. It's quite a long piece of code but I've tried shortening it as much as possible.
Basically, I need to wait, say, 1 second on every iteration of the loop. I've tried pretty much everything (.sleep(), etc.), but it just doesn't seem to pause. I need this because SimpleSocketClient calls a socket with a low per-second request limit.
@Override
public String execute(HttpServletRequest request, HttpServletResponse response) {
    String forwardToJsp = null;
    HttpSession session = request.getSession();
    String allUrls = request.getParameter("domains");
    ArrayList domainList = new ArrayList<String>();
    Scanner sc = new Scanner(allUrls);
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        domainList.add(line);
        // process the line
    }
    sc.close();
    String pageHtml = null;
    String domain = "";
    String status = "";
    String registrant = "";
    String dates = "";
    String tag = "";
    String email = "";
    ArrayList domains = new ArrayList<Domain>();
    Domain theDomain;
    String ipAddress = request.getHeader("X-FORWARDED-FOR");
    if (ipAddress == null) {
        ipAddress = request.getRemoteAddr();
    }
    for (int i = 0; i < domainList.size(); i++) {
        //NEED TO WAIT 1 SECOND HERE / ANYWHERE IN LOOP
        String singleDomain = domainList.get(i).toString();
        SimpleSocketClient tester = new SimpleSocketClient(singleDomain, ipAddress);
        pageHtml = tester.getResult();
        try {
            String whoIs2 = ipAddress + " " + ipAddress + " " + singleDomain + "\r\n";
            byte[] data = whoIs2.getBytes();
            //details of each domain
            //domain name
            domain = singleDomain;
            //status
            status = "FTR";
            //registrant
            registrant = "N/A";
            //dates
            dates = "N/A";
            //tag
            tag = "N/A";
            //email
            email = "N/A";
        } catch (Exception e) {
            Logger.getLogger("ip is " + ipAddress + bulkWhoIsCommand.class.getName()).log(Level.SEVERE, null, e);
            forwardToJsp = "index.jsp";
            return forwardToJsp;
        }
        //single one
        theDomain = new Domain(domain, status, registrant, dates, tag, email);
        //now add to arrayList
        domains.add(theDomain);
        // try {
        //     Thread.sleep(230000);
        // } catch (InterruptedException ex) {
        //     Logger.getLogger(bulkWhoIsCommand.class.getName()).log(Level.SEVERE, null, ex);
        // }
        // try {
        //     pause.poll(100 * 300, TimeUnit.MILLISECONDS);
        // } catch (InterruptedException ex) {
        //     Logger.getLogger(bulkWhoIsCommand.class.getName()).log(Level.SEVERE, null, ex);
        // }
    }
EDIT: A friend recommended using Ajax to poll for updates, but surely there's a way of doing this in plain Java.
You can try putting a while-loop inside the while-loop to pause it. It should look like this:
while (!done) {
    long start = System.currentTimeMillis();
    while (System.currentTimeMillis() - start < 1000L) { }
}
I didn't test it, but the approach should work. I had the idea of combining both: whenever Thread.sleep() is interrupted, you fall back to the busy-wait loop. Something like this:
while (!done) {
    long start = System.currentTimeMillis();
    try {
        Thread.sleep(1000);
    } catch (InterruptedException ex) {
        System.err.println(ex);
    }
    while (System.currentTimeMillis() - start < 1000L) { }
}
When Thread.sleep() works, the busy-wait loop is passed through just once; otherwise it burns some CPU time until the second is up. This could be the more CPU-economical version.
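A note on the sleep-based version: Thread.sleep() does pause the current thread, also inside a servlet; if it appears not to, the InterruptedException branch is usually being hit. A minimal sketch of the conventional pattern for the loop in the question (the surrounding names are the asker's):

for (int i = 0; i < domainList.size(); i++) {
    try {
        Thread.sleep(1000); // pause ~1 second before each socket request
    } catch (InterruptedException ex) {
        Thread.currentThread().interrupt(); // restore the interrupt flag
        break;                              // stop the batch instead of hammering the socket
    }
    // ... perform the lookup for domainList.get(i) as before ...
}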
I am fairly new to Java, at least regarding interacting with the web. Anyway, I am making an app that has to grab HTML from a webpage and parse it.
By parsing I mean finding out what an element has in its class="" attribute, or in any attribute available on the element, and also finding out what is inside the element. This is where I have searched so far: http://www.java2s.com/Code/Java/Development-Class/HTMLDocumentElementIteratorExample.htm
I found very little regarding this.
I know there are lots of Java parsers out there. I have tried JTidy and the default Swing parser. I would prefer to use the parser built into Java.
Here is what I have so far (this is just a method for testing how it works; proper code will come when I know what and how. Also, connection is a URLConnection variable, and the connection has been established before this method gets called, just to clarify):
public void parse() {
    try {
        InputStream is = connection.getInputStream();
        InputStreamReader isr = new InputStreamReader(is);
        BufferedReader br = new BufferedReader(isr);
        String line;
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
        // copied from http://www.java2s.com/Code/Java/Development-Class/HTMLDocumentElementIteratorExample.htm
        HTMLEditorKit htmlKit = new HTMLEditorKit();
        HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        HTMLEditorKit.Parser parser = new ParserDelegator();
        HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);
        parser.parse(br, callback, true);
        // Parse
        ElementIterator iterator = new ElementIterator(htmlDoc);
        Element element;
        while ((element = iterator.next()) != null) {
            AttributeSet attributes = element.getAttributes();
            Object name = attributes.getAttribute(StyleConstants.NameAttribute);
            System.out.println("All attrs of " + name + ": " + attributes.getAttributeNames().toString());
            Enumeration e = attributes.getAttributeNames();
            Object obj;
            while (e.hasMoreElements()) {
                obj = e.nextElement();
                System.out.println(obj.toString());
                System.out.println("attribute of class = " + attributes.containsAttribute("class", "login"));
            }
            if ((name instanceof HTML.Tag)
                    && ((name == HTML.Tag.H1) || (name == HTML.Tag.H2) || (name == HTML.Tag.H3))) {
                // Build up content text as it may be within multiple elements
                StringBuffer text = new StringBuffer();
                int count = element.getElementCount();
                for (int i = 0; i < count; i++) {
                    Element child = element.getElement(i);
                    AttributeSet childAttributes = child.getAttributes();
                    if (childAttributes.getAttribute(StyleConstants.NameAttribute) == HTML.Tag.CONTENT) {
                        int startOffset = child.getStartOffset();
                        int endOffset = child.getEndOffset();
                        int length = endOffset - startOffset;
                        text.append(htmlDoc.getText(startOffset, length));
                    }
                }
                System.out.println(name + ": " + text.toString());
            }
        }
    } catch (IOException e) {
        System.out.println("Exception?1 " + e.getMessage());
    } catch (Exception e) {
        System.out.println("Exception? " + e.getMessage());
    }
}
The question is: How do I get any element's attributes and print them out?
This code is needlessly verbose. I would suggest using a better library like Jsoup. Here's some code to find all the attributes of all divs on this page:
String url = "http://stackoverflow.com/questions/7311269"
        + "/java-print-any-detail-of-html-element";
Document doc = Jsoup.connect(url).get();
Elements divs = doc.select("div");
int i = 0;
for (Element div : divs) {
    System.out.format("Div #%d:\n", ++i);
    for (Attribute attr : div.attributes()) {
        System.out.format("%s = %s\n", attr.getKey(), attr.getValue());
    }
}
Follow the Jsoup Cookbook for a gentle introduction to this powerful library.
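Since the original code was probing for class="login", it may help to know that Jsoup's selector syntax can target a class directly; a small sketch (the class name is the one from the question's test output):

// Select every element whose class attribute contains "login"
Elements logins = doc.select(".login");
for (Element el : logins) {
    System.out.println(el.attr("class") + " -> " + el.text());
}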