I'm absolutely new to parsing and quite stuck now. I've downloaded code from a webpage with the R function htmlParse and now want to extract some data from it. It has lots of elements, but I'd like to get data from one exact element, starting with <script type="text/javascript">. How is it possible to do so and get the needed part of the code in character format, so it'll then be possible to extract other small elements from this bulk with, probably, regexps?
Let me just start by saying that this is a soft question.
I am rather new to application development, which is why I'm asking a question without presenting you with any actual code. I know the basics of Java coding, and I was wondering if anyone could enlighten me on the following topic:
Say I have an external website, Craigslist, or some other site that allows me to search through products/services/results manually by typing a query into a searchbox somewhere on the page. The trouble is that there is no API for this site for me to use.
However I do know that http://sfbay.craigslist.org/search/sss?query=QUERYHERE&sort=rel points me to a list of results, where QUERYHERE is replaced by what I'm looking for.
What I'm wondering here is: is it possible to store these results in an Array (or List or some form of Collection) in Java?
Is there perhaps some library or external tool that allows me to specify a query, paste it into the search link, perform the search, and fill an Array with the results?
Or is what I am describing impossible without an API?
This depends. If the queried website can return the result as XML or JSON (usually with a .xml or .json at the end of the URL), you can parse it easily with DOM for XML in Java, or download and use a JSON library to parse the JSON.
Otherwise you will receive the HTML of the page a user would see in a browser; you can then try to parse it as XML, but you will have a lot of work mapping all the fields in the HTML to get the list you want.
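As a hedged sketch of the JSON route, using the org.json library: Craigslist offers no such JSON endpoint, so the URL and the "results"/"title" field names below are made up purely for illustration.

    import org.json.JSONArray;
    import org.json.JSONObject;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;

    public class JsonResults {
        public static void main(String[] args) throws Exception {
            // Fetch the raw response body (hypothetical JSON endpoint)
            String body = new Scanner(
                    new URL("http://example.com/search.json?query=bike").openStream(), "UTF-8")
                    .useDelimiter("\\A").next();
            // Parse the JSON and collect titles into a List, as the question asks
            JSONArray results = new JSONObject(body).getJSONArray("results"); // assumed field
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < results.length(); i++) {
                titles.add(results.getJSONObject(i).getString("title")); // assumed field
            }
            titles.forEach(System.out::println);
        }
    }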
Here's what I want to do. I'm quite a beginner at this, so maybe it's a lame question, but I want to implement a GUI application in Java which gets data from sports livescore pages,
e.g.
http://www.futbol24.com/Live/
http://livescore.com/
and parse it (somehow) in my app... then I will be able to store it in, for example, a JTable, save full-time results in a database, play a sound after a goal is scored, and so on.
What is the best way to do this ?
It would be almost impossible to parse an HTML document from a live web page and reliably get specific information from it. Even if you did manage to work out exactly where in the document the data is, the page structure could change at any time. The scores might not even be in the HTML - they could be fetched by JavaScript in the page.
I suggest you find an RSS feed of the information you want. Then you'll only have a nice, small piece of XML to parse. That's what it's for.
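For example, here is a rough sketch of reading a feed with the JDK's built-in DOM parser; the feed URL is a placeholder, and I'm assuming the standard RSS <item>/<title> structure.

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class RssScores {
        public static void main(String[] args) throws Exception {
            // Placeholder URL - substitute a real feed of the scores you want
            Document feed = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse("http://example.com/livescores.rss");
            NodeList items = feed.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // Each RSS item has a <title>; a score feed might put the result there
                String title = item.getElementsByTagName("title").item(0).getTextContent();
                System.out.println(title);
            }
        }
    }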
I want to collect domain names (crawling). I have written a simple Java application that reads an HTML page and saves the code in a text file. Now I want to parse this text in order to collect all domain names without duplicates. But I need the domain names without "http://www.", just domainname.topleveldomain, or the possibility of domainname.subdomain.topleveldomain, or whatever number of subdomains (then the collected links need to be processed the same way, collecting the links inside them, until I reach a certain number of links, say 100).
I have asked about this in a previous post (https://stackoverflow.com/questions/11113568/simple-efficient-java-web-crawler-to-extract-hostnames) and searched. jsoup seems like a good solution, but I have not worked with jsoup before, so before going deep into it I just want to ask: does it achieve what I want to do? Any other suggestions for achieving my simple crawling in a simple way are welcome.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
So yes, you can connect to a website, extract its HTML, and parse it with jsoup.
The logic of extracting the top-level domain is "your part"; you will need to write that code yourself.
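As a hedged sketch of how that might look (the start page URL is a placeholder, and stripping "www." plus using a Set for deduplication is just one simple take on the domain logic):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import java.net.URI;
    import java.util.LinkedHashSet;
    import java.util.Set;

    public class DomainCollector {
        public static void main(String[] args) throws Exception {
            // Fetch and parse the start page (placeholder URL)
            Document doc = Jsoup.connect("http://example.com/").get();
            Set<String> domains = new LinkedHashSet<>(); // a Set drops duplicates
            for (Element link : doc.select("a[href]")) {
                String host;
                try {
                    // abs:href resolves relative links against the page URL
                    host = new URI(link.attr("abs:href")).getHost();
                } catch (Exception e) {
                    continue; // skip malformed URLs
                }
                if (host == null) continue;
                if (host.startsWith("www.")) host = host.substring(4); // drop "www."
                domains.add(host);
            }
            domains.forEach(System.out::println);
        }
    }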
Take a look at the docs for more options...
Use selector-syntax to find elements
Use DOM methods to navigate a document
I'm trying to make a simple webpage which obtains football league table data from
http://www.skysports.com/football/league/0,19540,11660,00.html
For example, I want to read in the points column and divide it by the number of games played to get an average points-per-game column that I will print onto my webpage.
How can I do this online?
I'm quite experienced at doing this with offline programs such as C/Matlab, but I don't know where to start with it online.
Thanks
I wouldn't suggest doing it client-side (in the browser). It will be easier to scrape on the server side (using Java, for example) following these steps (a rough Java sketch follows the list):
Grab the content of the webpage (skysports)
Use existing html markup with regex to locate the desired content part.
Strip/split html markup with regex to get records (tr) and fields (td).
Cast values and do your math.
Use results to generate your version of html or json or whatever.
Serve the generated content to your client.
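Here is my rough Java take on those steps (the answer offers C#; this is a hedged translation, not the author's code). The regexes assume plain <tr>/<td> markup, and the column positions for games played and points are guesses; inspect the real page before trusting any of it.

    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PointsPerGame {
        public static void main(String[] args) throws Exception {
            // Step 1: grab the page content (placeholder URL)
            String html = new Scanner(
                    new URL("http://example.com/league-table.html").openStream(), "UTF-8")
                    .useDelimiter("\\A").next();
            // Steps 2-3: locate rows, then split out cells, with regex
            Matcher row = Pattern.compile("<tr[^>]*>(.*?)</tr>", Pattern.DOTALL).matcher(html);
            Pattern cell = Pattern.compile("<td[^>]*>(.*?)</td>", Pattern.DOTALL);
            while (row.find()) {
                Matcher td = cell.matcher(row.group(1));
                List<String> cells = new ArrayList<>();
                while (td.find()) {
                    cells.add(td.group(1).replaceAll("<[^>]+>", "").trim()); // strip inner tags
                }
                if (cells.size() < 3) continue; // skip malformed rows
                // Step 4: cast values and do the math (column indexes are assumptions)
                try {
                    double played = Double.parseDouble(cells.get(1));
                    double points = Double.parseDouble(cells.get(2));
                    System.out.printf("%s: %.2f points per game%n", cells.get(0), points / played);
                } catch (NumberFormatException e) {
                    // non-numeric cell, e.g. a header row: skip it
                }
            }
            // Steps 5-6 (generating and serving your own HTML/JSON) depend on your stack
        }
    }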
In general, scraping is easy but not guaranteed for tomorrow, as the source HTML markup may change at any time (and without warning).
I can provide a basic sample in C# if you want. (Sorry, I haven't "Java'd" since 1997.)
You can use jQuery.get like this (though note that the browser's same-origin policy will block a cross-domain request like this unless the remote server explicitly allows it):
$.get('http://www.skysports.com/football/league/0,19540,11660,00.html', function(data) {
    // 'data' is the raw HTML of the page; do the parsing here
});
There are several programming languages capable of getting at this information; PHP would be the classic method, using curl or file_get_contents and regex parsing to extract the bits you want. You could do it with Yahoo Pipes as well if your web host does not allow remote URL retrieval.
If none of the Java brigade come back with something better, contact me and I'll do some rough code for you in PHP.
I'm trying to figure out what is necessary to perform what I believe to be a somewhat simple task, but it seems its execution is a bit advanced.
Can someone provide an example that might help me figure out the following goal?
Check various known .html files on a local server for a string
If the string is Que_for_board, then perform a parse of the other strings that will be in the file
Example: Release data, Author, Program, etc.
Else (if Que_for_board is not found), go to the next HTML file
Take the results and print them to a file
Is this as hard as it seems? I've looked into the HtmlCleaner parser, but I'm not sure I need to clean up the HTML into XML, and I'm finding it hard to find example query code that covers the next step in detail.
I would not describe this as a truly hard task, in that it's really a matter of using a number of techniques that are "out there", but I can see that it could be intimidating.
A technique I find useful is to decompose the overall task into small problems, and teach myself to think only about one problem at a time, having faith that I can assemble the overall solution eventually.
So here you have, perhaps:
Get a list of files from somewhere (where? a directory listing, a document?)
Open each file in the list in turn
Parse each HTML file
Find a specific string in the parsed file
Of these, parsing HTML files is potentially very easy, or not so easy. Can you trust the files to be well-formed, or were they written by a human? Humans simply don't write good HTML files, and browsers are very tolerant of missing </P> tags etc.
If these are very simple HTML files you can fake this with simple String searching, regexes and so on. Otherwise you need a proper parser, and maybe a clean-up first.
My first step would be to understand how to process a single HTML file.
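Here's a hedged sketch of that first step using jsoup (the question's Que_for_board marker is kept as-is; the file path and the .author/.release selectors are placeholders you would adapt once you've inspected the real files):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import java.io.File;

    public class BoardFileCheck {
        public static void main(String[] args) throws Exception {
            File file = new File("page.html"); // placeholder path
            // jsoup tolerates messy, human-written HTML, so no separate clean-up pass
            Document doc = Jsoup.parse(file, "UTF-8");
            if (!doc.text().contains("Que_for_board")) {
                return; // marker not found: move on to the next file
            }
            // Placeholder selectors; pick real ones after inspecting the files
            String author = doc.select(".author").text();
            String release = doc.select(".release").text();
            System.out.println("Author: " + author + "; Release: " + release);
        }
    }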