I'm trying to create a Java program where I can insert a String into a search bar and then record/print out the results.
The site is: http://maple.fm/khroa
I'm fairly new to JSoup. I've spent several hours reading the HTML of that page and have come across fields that look like they could take the String I need and return results, although I'm not sure exactly how to do that. Would someone be able to point me in the right direction?
I think you missed the point of Jsoup.
Jsoup can parse a page that is already loaded; it is not used to interact with a page (as you want). You could use Selenium to interact with the page (http://www.seleniumhq.org/) and then use Jsoup to parse the loaded page's source code.
In this case, though, the search results all seem to be loaded when the page loads, and the Item Search function only filters the (already existing) results with JavaScript.
There are no absolute links you could use to get the results of a particular search.
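Since all the results are in the initial HTML and JavaScript only filters them, a Jsoup-only sketch can do the same filtering in Java. This is a minimal sketch: the tr selector and the example query are assumptions, so inspect the page source and adjust both.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class KhroaSearch {
        public static void main(String[] args) throws Exception {
            String query = "Scroll"; // the String you would type into the search bar (example value)

            // Fetch the page once; the rows are already in the initial HTML.
            Document doc = Jsoup.connect("http://maple.fm/khroa").get();

            // "tr" is a guess at the row markup -- inspect the page and adjust the selector.
            for (Element row : doc.select("tr")) {
                if (row.text().toLowerCase().contains(query.toLowerCase())) {
                    System.out.println(row.text()); // record/print the matching result
                }
            }
        }
    }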
Currently using Java to scrape the HTML code from this page: http://counter.onlineclock.net/
I want to get the value from the counter, but this is unique for each version of the webpage; that is, if it's open in different browsers or by different people, it will be a different value.
Because of this, when I scrape the HTML, the value that I am looking for is just blank. I am wondering if there is any way at all for me to get the current value I am looking for.
For example, if I have the counter at 4, I would like to be able to get that value. It does not have to be in Java; any language or any approach will do.
JSoup is a great library for scraping data out of a web page. There are a lot of good examples of its usage on the web.
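For completeness, a minimal sketch of that usage. The #counter selector is hypothetical, and note the caveat from the question: a value filled in by JavaScript after the page loads will be blank here, because Jsoup only sees the raw HTML the server sends.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class CounterScrape {
        public static void main(String[] args) throws Exception {
            Document doc = Jsoup.connect("http://counter.onlineclock.net/").get();

            // "#counter" is a hypothetical selector -- inspect the page for the real id/class.
            // A value injected by JavaScript after load will not appear in this output.
            System.out.println(doc.select("#counter").text());
        }
    }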
Let me just start by saying that this is a soft question.
I am rather new to application development, which is why I'm asking a question without presenting you with any actual code. I know the basics of Java coding, and I was wondering if anyone could enlighten me on the following topic:
Say I have an external website, Craigslist, or some other site that allows me to search through products/services/results manually by typing a query into a search box somewhere on the page. The trouble is that there is no API for this site for me to use.
However, I do know that http://sfbay.craigslist.org/search/sss?query=QUERYHERE&sort=rel points me to a list of results, where QUERYHERE is replaced by what I'm looking for.
What I'm wondering here is: is it possible to store these results in an Array (or List or some form of Collection) in Java?
Is there perhaps some library or external tool that can allow me to specify a query to search for, paste it into a search link, perform the search, and fill an Array with the results?
Or is what I am describing impossible without an API?
This depends. If the queried website can return the result as XML or JSON (usually with .xml or .json at the end of the URL), you can parse it easily with a DOM parser for XML in Java, or download and use a JSON library to parse the JSON.
Otherwise you will receive HTML, i.e. the page a user would see in a browser. You can then try to parse it as XML, but you will have a lot of work mapping all the fields in the HTML to get the list you want.
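As a rough sketch of the HTML route with Jsoup, using the search URL from the question. The a[href] selector is a deliberately broad placeholder; you would narrow it to the markup of the actual result rows.

    import java.net.URLEncoder;
    import java.util.ArrayList;
    import java.util.List;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class CraigslistSearch {
        public static void main(String[] args) throws Exception {
            String query = URLEncoder.encode("mountain bike", "UTF-8"); // example query
            Document doc = Jsoup.connect(
                    "http://sfbay.craigslist.org/search/sss?query=" + query + "&sort=rel").get();

            // Store the result texts in a List, as the question asks.
            List<String> results = new ArrayList<>();
            for (Element link : doc.select("a[href]")) { // placeholder selector: narrow to result rows
                results.add(link.text());
            }
            System.out.println(results.size() + " results captured");
        }
    }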
I use jsoup to parse an HTML page, and doc.select("tr") should return a list with all <tr> elements. When I check the size of that list, it tells me 242, although when I use Chrome to double-check against the source with a simple search, I get 264 hits.
This makes my code break. It almost seems like jsoup doesn't handle a large number of elements very well.
I'm parsing a page with a table of 262 × 88 cells and almost as many helper tags. Is this the reason why jsoup doesn't have all the objects in the list? Or why do you think I'm having this problem?
There may be a difference in the served websites: you often get a different view with a desktop browser than with, e.g., a mobile device.
You can try this with jsoup:
Set the user agent of a desktop browser (see the sketch after this list)
Print the parsed document (System.out.println(doc)) and check if all tags are included
Check the website using another browser
Check whether JavaScript (or similar) creates additional HTML (jsoup can't handle that)
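A minimal sketch of the first two checks (the URL is a placeholder for the page being scraped):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class UserAgentCheck {
        public static void main(String[] args) throws Exception {
            Document doc = Jsoup.connect("http://example.com/table-page") // placeholder URL
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") // a desktop browser UA
                    .get();

            System.out.println(doc); // dump the parsed document to check all tags are included
            System.out.println(doc.select("tr").size()); // compare with the count the browser reports
        }
    }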
I want to collect domain names (crawling). I have written a simple Java application that reads an HTML page and saves the code in a text file. Now, I want to parse this text in order to collect all domain names without duplicates. But I need the domain names without "http://www.", just domainname.topleveldomain, or possibly domainname.subdomain.topleveldomain, or with whatever number of subdomains (then the collected links need to be extracted the same way, and the links inside them collected, until I reach a certain number of links, say 100).
I have asked about this in a previous post (https://stackoverflow.com/questions/11113568/simple-efficient-java-web-crawler-to-extract-hostnames) and searched. JSoup seems like a good solution, but I have not worked with JSoup before, so before going deep into it I just want to ask: does it achieve what I want to do? Any other suggestions for achieving my simple crawling in a simple way are welcome.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
So yes, you can connect to a website, extract its HTML, and parse it with jsoup.
The logic of extracting the top-level domain is "your part"; you will need to write that code yourself.
Take a look at the docs for more options...
Use selector-syntax to find elements
Use DOM methods to navigate a document
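A minimal sketch of that "your part", assuming a hypothetical seed page: it pulls every link, keeps only the hostname, strips a leading "www.", and lets a Set drop duplicates.

    import java.net.URI;
    import java.net.URISyntaxException;
    import java.util.HashSet;
    import java.util.Set;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class DomainCollector {
        public static void main(String[] args) throws Exception {
            Document doc = Jsoup.connect("http://example.com").get(); // hypothetical seed page

            Set<String> domains = new HashSet<>(); // a Set discards duplicates automatically
            for (Element link : doc.select("a[href]")) {
                try {
                    String host = new URI(link.attr("abs:href")).getHost();
                    if (host != null) {
                        // keep domainname.topleveldomain (plus any subdomains), without "www."
                        domains.add(host.startsWith("www.") ? host.substring(4) : host);
                    }
                } catch (URISyntaxException ignored) {
                    // skip malformed hrefs
                }
            }
            domains.forEach(System.out::println);
        }
    }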
I want to retrieve a set of results consisting of all the results produced by looping over all the options of one of the request-form fields.
I'm using the Java language and the HtmlUnit API.
I have managed to do this looping form-fill by using the URL to 'fill' the field's variables (I don't know if it's the best method, and I'm actually quite worried it's one of the worst... but it's the one I could do with the knowledge I have).
But I'm having problems figuring out how to make the program submit the form in order to reach the result page, and how to download (scrape) that page before moving on to the next.
NOTES:
-If you have a better way of filling the 'request-form', that is welcome as well.
UPDATE:
This solves the issue when using the HtmlUnit API (thank you, touti):
HtmlPage resultado = pageNow.getElementByName("buscar").click();
System.out.println(resultado.asText());
A better way than loading both the request and response pages would still be hugely welcome, though!
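For context, here is a fuller sketch of the loop around that snippet. The URL, query parameter, and option values are hypothetical; only the "buscar" element name comes from the update above.

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class FormLoop {
        public static void main(String[] args) throws Exception {
            WebClient webClient = new WebClient();
            for (String option : new String[] {"opt1", "opt2"}) { // loop over the field's options
                // 'Fill' the field via the URL, as described above (hypothetical URL and parameter).
                HtmlPage pageNow = webClient.getPage("http://example.com/form?campo=" + option);

                // Submit the form by clicking the element named "buscar" (from the update above)...
                HtmlPage resultado = pageNow.getElementByName("buscar").click();

                // ...and scrape the result page before moving to the next option.
                System.out.println(resultado.asText());
            }
            webClient.close();
        }
    }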
You can simulate the click on your submit input using jQuery, like this:
$("#submit_id").trigger("click");