Parsing an HTML page in Java without using an external library

I know it's an old question that has been asked many times. Note: I cannot use external libraries.
Given a function that takes a label as its argument, it should return a list of all the tags that contain that label.
I thought of saving my HTML as a tree so that I can find the label and return a list of all the tags, but I am not able to code it in Java. How do I completely parse and store HTML as a tree structure and search it?
Please help.
Thanks
Ravi
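One way to do this without external libraries is the JDK's own (if dated) Swing HTML parser, which walks the document for you via callbacks, so you don't have to build the tree by hand. The class and method names below are my own illustration, not an established API:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LabelFinder {

    // Returns the names of all tags whose text content contains the label.
    public static List<String> findTagsContaining(String html, String label)
            throws Exception {
        List<String> matches = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private HTML.Tag current;  // tag most recently opened

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                current = t;
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current != null && new String(data).contains(label)) {
                    matches.add(current.toString());
                }
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return matches;
    }
}
```

This avoids writing a tokenizer yourself, though the Swing parser only understands HTML 3.2 and is lenient about malformed markup.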

Related

Java - use searchbar on given website

Let me just start by saying that this is a soft question.
I am rather new to application development, which is why I'm asking a question without presenting you with any actual code. I know the basics of Java coding, and I was wondering if anyone could enlighten me on the following topic:
Say I have an external website, Craigslist, or some other site that allows me to search through products/services/results manually by typing a query into a searchbox somewhere on the page. The trouble is that there is no API for this site for me to use.
However I do know that http://sfbay.craigslist.org/search/sss?query=QUERYHERE&sort=rel points me to a list of results, where QUERYHERE is replaced by what I'm looking for.
What I'm wondering here is: is it possible to store these results in an Array (or List or some form of Collection) in Java?
Is there perhaps some library or external tool that allows me to specify a query to search for, insert it into a search link, perform the search, and fill an Array with the results?
Or is what I am describing impossible without an API?
It depends. If the query website can return the results as XML or JSON (usually via a .xml or .json at the end of the URL), you can parse them easily with Java's DOM API for XML, or download and use a JSON library to parse the JSON.
Otherwise you will receive the HTML page that a user would see in a browser. You can then try to parse it as XML, but you will have a lot of work mapping all the fields in the HTML to get the list you want.
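To make the HTML route concrete, here is a rough sketch: fetch the results page over HTTP, then scrape anchor texts out of the raw markup. The URL pattern comes from the question; the regex and method names are illustrative assumptions, and a real page would need extraction rules tuned to its actual markup:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CraigslistSearch {

    // Downloads the raw HTML of the search results page for a query.
    public static String fetch(String query) throws Exception {
        String url = "http://sfbay.craigslist.org/search/sss?query="
                + URLEncoder.encode(query, "UTF-8") + "&sort=rel";
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    // Pulls the text of every anchor tag out of raw HTML into a List.
    public static List<String> extractLinkTexts(String html) {
        List<String> results = new ArrayList<>();
        Matcher m = Pattern.compile("<a[^>]*>([^<]+)</a>").matcher(html);
        while (m.find()) {
            results.add(m.group(1).trim());
        }
        return results;
    }
}
```

So yes, the results can live in a plain `List<String>`; the hard, brittle part is the extraction step, which is exactly why an XML/JSON endpoint or a proper parser is preferable to regex scraping.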

Internal-linking of texts out of .csv files (in java)

I have a .csv file with text, and am supposed to parse the data, and based on specific keywords, replace the words with the necessary html tags for linking the keywords to a website.
So far, I wrote a .csv parser and writer, that gets all the data from the columns required out of the first file, and prints those columns to a newly created (.csv) file (e.g. text id in one cell, text title in the next cell, and the actual text in the next cell).
Now I am still waiting to get a list of keywords, as well as the website hierarchy and the links to put in, but to be honest I have no idea how to continue working on this. Somehow I'll have to parse down the website hierarchy to where the text title is present, consider only elements beneath it, and link them to keywords in my text. How can this be done? Is there special software, or extensions, libraries, or packages for Java, to do something like this?
Any help would be appreciated, I'm running on a deadline here...
THX!
P.S.: I am coding all of it in java
I'm not sure, but it sounds like you want to create an href column in your output:
<a href="https://www.w3schools.com">Visit W3Schools</a>
You could do this most simply by concatenating the strings:
String makeHref(String title, String id, String link) {
    return "<a href=\"" + link + "\">" + title + "</a>";
}
before you write out the second csv. You'll need to escape the "s, of course.
It's also entirely possible that I didn't understand the question. You may want to try to be more specific if that's the case.
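On the escaping point: since the generated href cell contains both quotes and (potentially) commas, it needs CSV quoting before being written out. A minimal helper following the usual RFC 4180 convention (wrap in quotes, double embedded quotes); the method name is just an illustration:

```java
public class CsvUtil {

    // Wraps a cell in quotes and doubles any embedded quotes,
    // per the common RFC 4180 CSV convention.
    static String csvEscape(String cell) {
        return "\"" + cell.replace("\"", "\"\"") + "\"";
    }
}
```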

Replacing specific strings in HTML file

I need to translate some of the content of HTML pages. I have a lot of HTML documents as a list of files and a map with translations like this:
List<File> files
Map<String, String> translations
Only strings in specific tags (p, h1..h6, li) have to be translated. I want to end up with the same files like at the beginning but with replaced strings.
Two solutions that don't work:
Plain string replacement - because I don't want to translate strings inside comments or in JavaScript; another problem is that one string of original text can be a substring of another string of original text.
Parsing libraries like Jsoup - because they clean and fix the DOM structure, and I want to keep the HTML structure unmodified.
Any solutions?
You pretty much have to use a proper HTML parser (which fixes the DOM structure), because otherwise there's no way to tell where an element starts and where it ends. There are all sorts of special cases and different kinds of broken HTML, and if you want to handle them all, you are basically implementing a full HTML parser.
The only other way I can think of (and which is often used) is to use placeholders in the original files, such as <h1>${title}</h1> <p>${introduction}</p> etc, and find and replace them directly, but I guess that would require a lot of work to change the files if you don't already have them in this form.
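If the files can be converted to the placeholder form, the find-and-replace step itself is simple. A sketch of filling `${key}` markers from the translations map, leaving unknown keys untouched (class and method names are my own):

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderFiller {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    // Replaces ${key} markers with values from the translations map;
    // markers with no matching key are left as-is.
    public static String fill(String html, Map<String, String> translations) {
        Matcher m = PLACEHOLDER.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = translations.get(m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(
                    value != null ? value : m.group(0)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because the placeholders only ever appear where you put them, this sidesteps the comment/JavaScript problem entirely; the cost is the one-time conversion of the originals.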

Edit browser's text field using java

Is it possible to edit a browser's text field using Java? Currently I'm using Jsoup to gather some information about websites, so I'm looking for some more options. Could Jsoup do this? Thank you!
I don't see how Jsoup would help here. Jsoup is just a way to parse HTML. You could use it to write out some HTML containing a text field: an input tag with a value attribute on it. Then when you render the file in a browser, the page would have that value. But since you haven't given us very much information, I'm not sure whether this is exactly what you want to do or completely different from what you want to do.
This is probably the last thing you'd want to do (not the first), but you could set the values of a text field using Java's Robot class.
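For completeness, a sketch of the Robot approach: it synthesizes native key events into whatever UI element currently has focus, so you would first have to focus the browser's text field yourself. This is fragile by design (it requires a display and knows nothing about the page), and the class name here is illustrative:

```java
import java.awt.Robot;
import java.awt.event.KeyEvent;

public class FieldTyper {

    // Types a lowercase ASCII string into whatever element currently
    // has keyboard focus, one synthesized key press at a time.
    public static void type(String text) throws Exception {
        Robot robot = new Robot();
        for (char c : text.toCharArray()) {
            int keyCode = KeyEvent.getExtendedKeyCodeForChar(c);
            robot.keyPress(keyCode);
            robot.keyRelease(keyCode);
        }
    }
}
```

Uppercase characters and symbols would additionally need Shift handling, which is why tools like Selenium are usually a better fit for driving a real browser.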

Does JSoup achieve this?

I want to collect domain names (crawling). I have written a simple Java application that reads an HTML page and saves the code in a text file. Now I want to parse this text in order to collect all domain names without duplicates. But I need the domain names without "http://www.", just domainname.topleveldomain, or possibly domainname.subdomain.topleveldomain, or whatever number of subdomains (then the collected links need to be extracted the same way, collecting the links inside them until I reach a certain number of links, say 100).
I asked about this in a previous post, https://stackoverflow.com/questions/11113568/simple-efficient-java-web-crawler-to-extract-hostnames , and searched. Jsoup seems like a good solution, but I have not worked with it before, so before going deeply into it I just want to ask: does it achieve what I want to do? Any other suggestions for achieving my simple crawling in a simple way are welcome.
jsoup is a Java library for working with real-world HTML. It provides
a very convenient API for extracting and manipulating data, using the
best of DOM, CSS, and jquery-like methods.
So yes, you can connect to a website extract its html and parse it with jsoup.
The logic of extracting the top-level domain is your part: you will need to write that code yourself.
Take a look at the docs for more options...
Use selector-syntax to find elements
Use DOM methods to navigate a document
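To sketch "your part": jsoup would gather the links (something like `doc.select("a[href]")` with `link.attr("abs:href")`), and a plain-JDK helper can then normalize each link to a deduplicated host name. The class and method names below are my own illustration:

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;

public class DomainCollector {

    // Normalizes a link to its host name, dropping a leading "www.";
    // returns null for links with no parseable host.
    public static String hostOf(String link) {
        try {
            String host = new URI(link).getHost();
            if (host == null) {
                return null;
            }
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (Exception e) {
            return null;
        }
    }

    // Collects unique hosts from a batch of links, preserving order.
    public static Set<String> collect(Iterable<String> links) {
        Set<String> hosts = new LinkedHashSet<>();
        for (String link : links) {
            String host = hostOf(link);
            if (host != null) {
                hosts.add(host);
            }
        }
        return hosts;
    }
}
```

The `LinkedHashSet` gives you deduplication for free, and you can stop feeding it links once it reaches your limit of, say, 100.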
