This is my first time using JSoup, so I'm not familiar with it. I have already read the Cookbook, but I still don't know why the Elements are empty. Am I missing something?
Document doc = Jsoup.connect("http://sports.news.naver.com/sports/" +
"index.nhn?category=baseball.html").get();
Elements teams= doc.select("td.t_name");
Elements wins= doc.select("td.win");
System.out.println(teams.isEmpty());
System.out.println(wins.isEmpty());
Probably because there is no "td.t_name" or "td.win" in the document.
You should make sure that the document you get from http://sports.news.naver.com/sports/index.nhn?category=baseball.html
contains the data you want to select.
When I debugged your code, I didn't see any "td.win" or "td.t_name" in the document.
Note that data loaded via AJAX will not be downloaded by JSoup.
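A quick way to check this is to count selector matches against the raw HTML the server sends. The sketch below is only an illustration: the class name StaticHtmlCheck, the helper countMatches and both HTML strings are made up, standing in for "what the server returns" and "what the browser shows after scripts have run".

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticHtmlCheck {

    // Counts the elements in the given HTML that match the CSS query.
    static int countMatches(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Hypothetical markup as served: the table is empty until a script fills it.
        String served = "<table id='standings'></table>"
                + "<script>/* rows are added later via AJAX */</script>";

        // Hypothetical markup as seen in the browser after the script has run.
        String rendered = "<table id='standings'><tr>"
                + "<td class='t_name'>Team A</td><td class='win'>10</td>"
                + "</tr></table>";

        System.out.println("served:   " + countMatches(served, "td.t_name"));
        System.out.println("rendered: " + countMatches(rendered, "td.t_name"));
    }
}
```

If the count for the served HTML is zero while the browser shows the cells, the data is being added by JavaScript and JSoup alone will not see it.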
I have files containing HTML and I am trying to parse that file and then tokenise the text of the body.
I achieve this through:
Document docs = Jsoup.parse(new File("myFile"), "UTF-8", "");
System.out.println(docs.body().text());
The above code works fine, but the problem is that text sitting outside of any HTML tag is also printed as part of the body.
I need to find a way to stop this text outside of the HTML tags from being read.
Please help, this is a time-sensitive question!
You can select and remove unwanted elements in your document.
doc.select("body > :matchText").remove();
The above statement will remove all text nodes that are direct children of the body element. The :matchText selector is rather new, so please make sure to use a reasonably recent version of JSoup (1.11.3 definitely works, but 1.10.2 does not).
You can find more information on the selector syntax at https://jsoup.org/cookbook/extracting-data/selector-syntax
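If the :matchText selector is not available in your JSoup version, a similar result can be had by walking the body's direct text nodes and removing them. This is a different technique from the selector above, shown only as a sketch: the class name StripLooseBodyText, the helper method and the sample HTML are made up for illustration.

```java
import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.TextNode;

public class StripLooseBodyText {

    // Parses the HTML, drops text that is a direct child of <body>,
    // and returns the remaining body text.
    static String bodyTextWithoutLooseText(String html) {
        Document doc = Jsoup.parse(html);
        // textNodes() returns only the direct child text nodes of the element;
        // iterate over a copy because the nodes are removed inside the loop.
        for (TextNode node : new ArrayList<>(doc.body().textNodes())) {
            node.remove();
        }
        return doc.body().text();
    }

    public static void main(String[] args) {
        // Hypothetical input: stray text before and after a paragraph.
        String html = "stray text<p>Kept paragraph</p>more stray text";
        System.out.println(bodyTextWithoutLooseText(html));
    }
}
```

Text nested inside elements such as the paragraph is left alone, because only direct children of the body are touched.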
I'm trying to create a Java program where I can insert a String into a search bar and then record or print out the results.
The site is: http://maple.fm/khroa
I'm fairly new to JSoup and have spent several hours reading the HTML of that page. I have come across variables that could be used to insert the String I need and get results, but I'm not sure exactly how to do that. Would someone be able to point me in the right direction?
I think you missed the point of JSOUP.
JSOUP can parse a page that is already loaded - it is not used to interact with a page (as you want). You could use Selenium to interact with the page (http://www.seleniumhq.org/) and then use JSOUP to parse the loaded page's source code.
In this case, the search results seem to be loaded all at once when the page loads, and the Item Search function only filters the (already existing) results with JavaScript.
There are no absolute links you could use to get the results for a particular search.
Hello, I have been googling for hours now and can't find an answer (or something close to it).
What I am trying to do is this. Let's say I have this code (very simplified):
<div id="one"><div id="two"><div id="three"></div></div></div>
What I want to do is delete a specific number of these closing elements, let's say 2 of them, so the result would be:
<div id="one"><div id="two"><div id="three"></div>
Or I want to delete the opening elements (again a specific number of them, let's say 2 again) but without knowing their full name (so if the real name is id="one_54486464", we can assume I only know it starts with one_).
So after deleting I would get this result:
<div id="three"></div></div></div>
Can anyone suggest a way to achieve these results? It does not have to involve JSOUP; any better, simpler or more efficient way is welcome :) (But I am using JSOUP to parse the document to get to the point where I am left with )
I hope I explained myself clearly. If you have any questions, please do ask. Thanks :)
EDIT: The closing elements that I want to delete are at the very end of the HTML document (nothing comes after them: no body tag, no html tag, nothing).
Please keep in mind that the HTML document has many of them throughout the code, and I want to delete only a specific number at the end of the document.
The opening divs are at the very beginning of my HTML document and nothing comes before them, so I need to remove a specific number from the beginning without knowing their exact ID, only how it starts. Each of these divs also has a closing tag somewhere in the document, and I want to keep that closing tag.
For the first case, you can get the element's html (using the html() method) and use some String methods on it to delete a couple of its closing tags.
Example:
e.html().replaceAll("(((\\s|\n)+)?<\\/div>){2}$","");
This will remove the last 2 closing div tags. To change the number of tags to be removed, just change the number between the curly brackets {n}.
(This is just an example and is probably unreliable; you should use some other String methods to decide which parts to discard.)
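The same idea can be wrapped in a small helper that takes the number of tags as a parameter. This is a minimal sketch using only the standard library: the class name TrailingTags and the method name are made up, the regex is a simplified variant of the one above, and it assumes the closing div tags sit at the very end of the string with nothing but whitespace between and after them.

```java
public class TrailingTags {

    // Removes the last `count` closing </div> tags from the end of the string.
    // Assumes only whitespace separates and follows those tags.
    static String stripTrailingDivCloses(String html, int count) {
        String pattern = "(?:\\s*</div>){" + count + "}\\s*$";
        return html.replaceFirst(pattern, "");
    }

    public static void main(String[] args) {
        String html = "<div id=\"one\"><div id=\"two\"><div id=\"three\"></div></div></div>";
        // Strip the last two closing tags, as in the question.
        System.out.println(stripTrailingDivCloses(html, 2));
    }
}
```

If fewer than `count` closing tags are found at the end, the pattern does not match and the string is returned unchanged.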
For the second case, you can select the inner element(s) and add some additional closing tags to it/them.
Example:
String s = e.select("#two").first().html() + "</div></div>";
To select an element whose ID starts with a given String, you can use e.select("div[id^=two]")
You can find more details on how to select elements here
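As an illustration of the prefix selector, the sketch below looks up a div by the start of its ID and reports the full ID. The class name IdPrefixLookup, the helper method and the sample IDs are made up for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class IdPrefixLookup {

    // Returns the full id of the first div whose id starts with the prefix,
    // or null when no such div exists.
    static String findIdByPrefix(String html, String prefix) {
        Document doc = Jsoup.parse(html);
        // [id^=...] matches elements whose id attribute starts with the given value.
        Element match = doc.select("div[id^=" + prefix + "]").first();
        return match == null ? null : match.id();
    }

    public static void main(String[] args) {
        // Hypothetical markup with generated ID suffixes.
        String html = "<div id='one_54486464'><div id='two_1'>"
                + "<div id='three_9'></div></div></div>";
        System.out.println(findIdByPrefix(html, "one_"));
        System.out.println(findIdByPrefix(html, "three_"));
    }
}
```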
After Titus suggested regular expressions, I decided to write a regex for deleting the opening divs too.
So I converted the Jsoup Document to a String, did the parsing on the string, and then converted it back to a Jsoup Document so I could use Jsoup functions.
ADD: What I was doing was joining two pages into one seamless page, so no opening or closing div was left unmatched. My HTML therefore stayed free of errors, and I was able to convert it back to a Jsoup Document without complications.
I am trying to find some elements on a web page that refreshes automatically every 4 seconds.
When I try to locate some of the page elements after parsing the web page, it throws this exception:
"org.openqa.selenium.StaleElementReferenceException: Element is no longer attached to the DOM".
Because the page refreshes periodically, the DOM keeps changing. I can only parse a few elements located in the upper portion of the DOM structure; since the page refreshes, the internal DOM parser may not be able to go beyond a certain depth. So in this situation I am not able to traverse the whole DOM and cannot fetch the lower portion of it.
Please give me a solution so that I can parse the whole page and traverse the entire DOM tree.
Thanks in advance,
I don't understand why you need to identify the elements in the web page if the page is refreshing continuously, as you won't be able to perform any operation on them. Nevertheless, try the following code to get the body tag as a WebElement object:
WebElement body = driver.findElement(By.tagName("body"));
Use this body object to find the rest of the elements.
body.findElements(By.tagName("input"));
I use jsoup to parse an HTML page, and doc.select("tr") should return a list with all <tr> elements. When I check the size of that list, it tells me 242. However, when I double-check against the source with a simple search in Chrome, I get 264 hits.
This makes my code break. It almost seems like jsoup doesn't handle a large number of Elements very well.
I'm parsing a page with a table of 262 * 88 cells and almost as many helper tags. Is this the reason why jsoup doesn't have all the objects in the list? Or why do you think I'm having this problem?
There may be a difference in what the website serves. You often get a different view with a desktop browser than with, e.g., a mobile device.
You can try this with jsoup:
Set a browser's user agent
Print the parsed document (System.out.println(doc)) and check whether all tags are included
Check the website using another browser
Check whether there is any JavaScript (or similar) that creates additional HTML (jsoup can't handle that)