Modifying HTML in Memory with JSoup


I was recently advised to use JSoup to parse and modify HTML documents.
But suppose I have an HTML document that I want to modify (to send it, store it somewhere else, etc.). How can I do that without changing the original document?
Say I have an HTML file like this:
<html>
<head></head>
<body>
<p></p>
<h2>Title: title</h2>
<p></p>
<p>Name: </p>
<p>Address: </p>
<p>Phone Number: </p>
</body>
</html>
I want to fill in the appropriate data for Name, Address, Phone Number and any other information I'd like, without modifying the original HTML file. How might I go about that using JSoup?

A possibly simpler solution is to modify your template to include placeholders, like this:
<html>
<head></head>
<body>
<p></p>
<h2>Title: title</h2>
<p></p>
<p>Name: <span id="name"></span></p>
<p>Address: <span id="address"></span></p>
<p>Phone Number: <span id="phone"></span></p>
</body>
</html>
Then load your document this way:
Document doc = Jsoup.parse("" +
"<html>\n" +
" <head></head>\n" +
" <body> \n" +
" <p></p>\n" +
" <h2>Title: title</h2>\n" +
" <p></p>\n" +
" <p>Name: <span id=\"name\"></span></p>\n" +
" <p>Address: <span id=\"address\"></span></p>\n" +
" <p>Phone Number: <span id=\"phone\"></span></p>\n" +
" </body>\n" +
"</html>");
doc.getElementById("name").text("Andrey");
doc.getElementById("address").text("Stackoverflow.com");
doc.getElementById("phone").text("secret!");
System.out.println(doc.html());
And this would give the form filled out.
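If the same template has to be filled in more than once, one option is to parse it once and work on a copy each time. The sketch below assumes jsoup is on the classpath; the class name TemplateFillSketch and the helper fill are made up for illustration. It relies on Document.clone(), which makes a deep copy, so the parsed template keeps its empty placeholders. Parsing a file does not write anything back to disk either way, since jsoup only changes the in-memory tree.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class TemplateFillSketch {
    // Hypothetical helper: fills the three placeholder spans in a copy of the template.
    static Document fill(Document template, String name, String address, String phone) {
        Document copy = template.clone(); // deep copy; the template itself is not modified
        copy.getElementById("name").text(name);
        copy.getElementById("address").text(address);
        copy.getElementById("phone").text(phone);
        return copy;
    }

    public static void main(String[] args) {
        Document template = Jsoup.parse(
            "<html><head></head><body>"
            + "<p>Name: <span id=\"name\"></span></p>"
            + "<p>Address: <span id=\"address\"></span></p>"
            + "<p>Phone Number: <span id=\"phone\"></span></p>"
            + "</body></html>");
        Document filled = fill(template, "Andrey", "Stackoverflow.com", "secret!");
        System.out.println(filled.html());   // the filled-in copy
        System.out.println(template.html()); // the template, placeholders still empty
    }
}
```

The helper assumes all three ids exist in the template; getElementById returns null for a missing id, so a real version would probably want a null check.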

@MarcoS had an excellent solution at https://stackoverflow.com/a/6594828/1861357 that uses a NodeTraversor to build a list of nodes to change. I modified his method only slightly: it replaces a node with the data in the node plus whatever information you would like to add.
To keep the HTML in memory I used a static StringBuilder.
First we read in the HTML file (its location is specified manually here, which can be changed), then we run a series of checks to decide which nodes to change and what data to add.
The one problem I didn't fix in MarcoS's solution is that it splits the text into individual words instead of looking at the whole line. I worked around this by joining multi-word labels with '-', because otherwise the string is placed directly after the matching word.
So a full implementation:
import java.util.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
import java.io.*;
public class memoryHTML
{
static String htmlLocation = "C:\\Users\\User\\";
static String fileName = "blah"; // Just for demonstration, easily modified.
static StringBuilder buildTmpHTML = new StringBuilder();
static StringBuilder buildHTML = new StringBuilder();
static String name = "John Doe";
static String address = "42 University Dr., Somewhere, Someplace";
static String phoneNumber = "(123) 456-7890";
public static void main(String[] args)
{
// You can send it the full path with the filename. I split them up because I used this for multiple files.
readHTML(htmlLocation, fileName);
modifyHTML();
System.out.println(buildHTML.toString());
// You need to clear the StringBuilder Object or it will remain in memory and build on each run.
buildTmpHTML.setLength(0);
buildHTML.setLength(0);
System.exit(0);
}
// Simply parse and build a StringBuilder for a temporary HTML file that will be modified in modifyHTML()
public static void readHTML(String directory, String fileName)
{
try
{
BufferedReader br = new BufferedReader(new FileReader(directory + fileName + ".html"));
String line;
while((line = br.readLine()) != null)
{
buildTmpHTML.append(line);
}
br.close();
}
catch (Exception e)
{
e.printStackTrace();
System.exit(1);
}
}
// Excellent method of parsing and modifying nodes in HTML files by @MarcoS at https://stackoverflow.com/a/6594828/1861357
// It has its small problems, but it does the trick.
public static void modifyHTML()
{
String htmld = buildTmpHTML.toString();
Document doc = Jsoup.parse(htmld);
final List<TextNode> nodesToChange = new ArrayList<TextNode>();
NodeTraversor nd = new NodeTraversor(new NodeVisitor()
{
@Override
public void tail(Node node, int depth)
{
if (node instanceof TextNode)
{
TextNode textNode = (TextNode) node;
nodesToChange.add(textNode);
}
}
@Override
public void head(Node node, int depth)
{
}
});
nd.traverse(doc.body());
for (TextNode textNode : nodesToChange)
{
Node newNode = buildElementForText(textNode);
textNode.replaceWith(newNode);
}
buildHTML.append(doc.html());
}
private static Node buildElementForText(TextNode textNode)
{
String text = textNode.getWholeText();
String[] words = text.trim().split(" ");
Set<String> units = new HashSet<String>();
for (String word : words)
units.add(word);
String newText = text;
for (String rpl : units)
{
if(rpl.contains("Name"))
newText = newText.replaceAll(rpl, "" + rpl + " " + name);
if(rpl.contains("Address") || rpl.contains("Residence"))
newText = newText.replaceAll(rpl, "" + rpl + " " + address);
if(rpl.contains("Phone-Number") || rpl.contains("PhoneNumber"))
newText = newText.replaceAll(rpl, "" + rpl + " " + phoneNumber);
}
return new DataNode(newText, textNode.baseUri());
}
}
And you'll get this HTML back (remember I changed "Phone Number" to "Phone-Number"):
<html>
<head></head>
<body>
<p></p>
<h2>Title: title</h2>
<p></p>
<p>Name: John Doe </p>
<p>Address: 42 University Dr., Somewhere, Someplace</p>
<p>Phone-Number: (123) 456-7890</p>
</body>
</html>

Related

Parsing html in Jsoup

I am trying to parse HTML tags using jsoup, which I am new to. Basically I need to parse the tags, get the text inside them, and apply the style named in the class attribute.
I am creating a SpannableStringBuilder so that I can create substrings, apply styles to them, and append them together with the text that has no style.
String str = "There are <span class='newStyle'> two </span> workers from the <span class='oldStyle'>Front of House</span>";
SpannableStringBuilder text = new SpannableStringBuilder();
if (value.contains("</span>")) {
Document document = Jsoup.parse(value);
Elements elements = document.getElementsByTag("span");
if (elements != null) {
int i = 0;
int start = 0;
for (Element ele : elements) {
String styleName = type + "." + ele.attr("class");
text.append(ele.text());
int style = context.getResources().getIdentifier(styleName, "style", context.getPackageName());
text.setSpan(new TextAppearanceSpan(context, style), start, text.length(), Spannable.SPAN_EXCLUSIVE_EXCLUSIVE);
text.append(ele.nextSibling().toString());
start = text.length();
i++;
}
}
return text;
}
I am not sure how to get the strings that are not between any tags, such as "There are" and "workers from the".
I need output such as:
- There are
- <span class='newStyle'> two </span>
- workers from the
- <span class='oldStyle'>Front of House</span>

You can get the text outside of the tags by calling childNodes(), which returns a List<Node>. Note that I'm selecting body because your HTML fragment doesn't have a parent element, and jsoup adds <html> and <body> automatically when it parses a fragment.
If a Node contains only text, it is of type TextNode and you can get the content using toString().
Otherwise you can cast it to Element and get the text using element.text().
String str = "There are <span class='newStyle'> two </span> workers from the <span class='oldStyle'>Front of House</span>";
Document doc = Jsoup.parse(str);
Element body = doc.selectFirst("body");
List<Node> childNodes = body.childNodes();
for (int i = 0; i < childNodes.size(); i++) {
Node node = body.childNodes().get(i);
if (node instanceof TextNode) {
System.out.println(i + " -> " + node.toString());
} else {
Element element = (Element) node;
System.out.println(i + " -> " + element.text());
}
}
output:
0 ->
There are
1 -> two
2 -> workers from the
3 -> Front of House
By the way: I don't know how to get rid of the first line break before There are.
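The leading line break most likely comes from toString(), which returns the node's pretty-printed outer HTML rather than its plain text. One possible way around it is to read text nodes through TextNode.text() instead. The sketch below assumes jsoup is on the classpath; the class name ChildTextSketch and the helper pieces are made up for illustration, and trimming each piece is a choice of this sketch, not something the question asked for.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class ChildTextSketch {
    // Hypothetical helper: returns the text of each direct child of <body>, in document order.
    static List<String> pieces(String html) {
        List<String> out = new ArrayList<>();
        for (Node node : Jsoup.parse(html).body().childNodes()) {
            if (node instanceof TextNode) {
                // text() gives the node's text, not its pretty-printed HTML
                String t = ((TextNode) node).text().trim();
                if (!t.isEmpty()) {
                    out.add(t);
                }
            } else if (node instanceof Element) {
                out.add(((Element) node).text());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "There are <span class='newStyle'> two </span> workers from the "
            + "<span class='oldStyle'>Front of House</span>";
        List<String> parts = pieces(str);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println(i + " -> " + parts.get(i));
        }
    }
}
```

If the original spacing matters (for example when rebuilding a styled string), TextNode.getWholeText() keeps the whitespace as written and the trim() call can be dropped.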

How to find the html element of a given text

Assume I have the following code to be parsed using JSoup
<body>
<div id="myDiv" class="simple" >
<p>
<img class="alignleft" src="myimage.jpg" alt="myimage" />
I just passed out of UC Berkeley
</p>
</div>
</body>
The question is: given just a keyword such as "Berkeley", is there a better way to find the element or XPath (or a list of them, if the keyword occurs more than once) in the HTML whose text contains this keyword?
I don't get to see the HTML beforehand; it will be available only at runtime.
My current implementation uses Java and Jsoup: it iterates through the children of body, gets the "ownText" and text of each child, and then drills down into their children to narrow down the HTML element. I feel this is very slow.

A simple, if not elegant, way could look like this:
import java.util.HashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;
import org.jsoup.select.Elements;
public class JsoupTest {
public static void main(String argv[]) {
String html = "<body> \n" +
" <div id=\"myDiv\" class=\"simple\" >\n" +
" <p>\n" +
" <img class=\"alignleft\" src=\"myimage.jpg\" alt=\"myimage\" />\n" +
" I just passed out of UC Berkeley\n" +
" </p>\n" +
" <ol>\n" +
" <li>Berkeley</li>\n" +
" <li>Berkeley</li>\n" +
" </ol>\n" +
" </div> \n" +
"</body>";
Elements eles = Jsoup.parse(html).getAllElements(); // get all elements which appear in your html
Set<String> set = new HashSet<>();
for(Element e : eles){
Tag t = e.tag();
set.add(t.getName()); // put the tag name in a set or list
}
set.remove("head"); set.remove("html"); set.remove("body"); set.remove("#root"); set.remove("img"); //remove some unimportant tags
for(String s : set){
System.out.println(s);
if(!Jsoup.parse(html).select(s+":contains(Berkeley)").isEmpty()){ // check if the tag contains your key word
System.out.println(Jsoup.parse(html).select(s+":contains(Berkeley)").get(0).toString());} // print it out or do something else
System.out.println("---------------------");
System.out.println();
}
}
}
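A shorter variant of the same idea is to let the selector engine do the narrowing. The sketch below assumes jsoup is on the classpath and uses its :containsOwn(text) pseudo-selector, which matches only elements whose own text (not their descendants' text) contains the keyword, ignoring case. The class name ContainsOwnSketch and the helper findByOwnText are made up for illustration, and the keyword is assumed to be a plain word without parentheses or quotes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ContainsOwnSketch {
    // Hypothetical helper: elements whose own text contains the keyword.
    static Elements findByOwnText(String html, String keyword) {
        Document doc = Jsoup.parse(html);
        return doc.select(":containsOwn(" + keyword + ")");
    }

    public static void main(String[] args) {
        String html = "<body>"
            + "<div id=\"myDiv\" class=\"simple\">"
            + "<p><img class=\"alignleft\" src=\"myimage.jpg\" alt=\"myimage\" />"
            + "I just passed out of UC Berkeley</p>"
            + "<ol><li>Berkeley</li><li>Berkeley</li></ol>"
            + "</div></body>";
        for (Element e : findByOwnText(html, "Berkeley")) {
            // cssSelector() builds a CSS path that identifies this element in the document
            System.out.println(e.cssSelector() + " -> " + e.ownText());
        }
    }
}
```

This parses the document once and avoids looping over tag names; cssSelector() returns a CSS path rather than an XPath, so it only partly covers the "element/XPath" part of the question.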

Try this XPath.
For the first element with a class:
'//*[contains(normalize-space(), "Berkeley")]/ancestor::*[@class]'
For the first element with an id:
'//*[contains(normalize-space(), "Berkeley")]/ancestor::*[@id]'
See the normalize-space function.

Update a tag name along with its value

I am trying to replace HTML tags with updated values. I have tried using JSoup but have not worked out a way yet.
The functionality:
if (webText.contains("a href")) {
// Parse it into jsoup
Document doc = Jsoup.parse(webText);
// Create an array to tackle every type individually as wrap can
// affect whole body types otherwises.
Element[] array = new Element[doc.select("a").size()];
for (int i = 0; i < doc.select("a").size(); i++) {
if (doc.select("a").get(i) != null) {
array[i] = doc.select("a").get(i);
}
}
for (int i = 0; i < array.length; i++) {
if (array[i].toString().contains("http")) {
Log.e("Link", array[i].toString());
Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(array[i].toString());
String url = null;
if (m.find()) {
url = m.group(1); // this variable should contain the link URL
Log.e("Link Value", url);
array[i] = array[i].wrap("<a href='"+url+"' class='link'></a>");
}
}
else {
Log.e("Favourite", array[i].toString());
Pattern p = Pattern.compile("href=\"(.*?)\"");
Matcher m = p.matcher(array[i].toString());
String url = null;
if (m.find()) {
url = m.group(1); // this variable should contain the link URL
Log.e("Favourite Value", url);
array[i] = array[i].wrap("<a href='"+url+"' class='favourite'></a>");
//array[i] = array[i].replaceWithreplaceWith("","");
}
}
}
Element element = doc.body();
Log.e("From element html *************** ", " " + element.html());
String currentHtml = wrapImgWithCenter(element.html());
Log.e("currentHtml", currentHtml);
listOfElements = currentHtml;
}
The line array[i] = array[i].wrap("<a href='"+url+"' class='favourite'></a>"); wraps the existing tags in the new element. But I do not want that to happen. I want to replace the tags completely with something like:
"<a href='"+url+"' class='favourite'>"+url+"</a>";
Input:
<html>
<head></head>
<body>
<p dir="ltr"><font color="#009a49">Frank Frank</font> <font color="#0033cc">http://yahoo.co.in</font></p>
<br />
<br />
</body>
</html>
Expected output:
<html>
<head></head>
<body>
<p dir="ltr"><font color="#009a49">Frank Frank</font> <font color="#0033cc">http://yahoo.co.in</font></p>
<br />
<br />
</body>
</html>
I have tried using replaceWith but was unsuccessful. You can still find it commented out in the source code above. Where am I going wrong? What should I do to update the tags?
P.S.: The input might be variable with some more or less tags.

You can use the replaceWith method of class Element. I've cleaned up your code a little: I removed the arrays and used the provided lists wherever possible. Moreover, you don't need a regex to get the href attribute (or any other attribute, for that matter) once you've parsed the HTML. Check it out and let me know if you need further assistance.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Tag;
import org.jsoup.select.Elements;
public class Main {
public static void main(String[] args) throws Exception {
String webText =
"<html>" +
"<head></head>" +
"<body>" +
"<p dir=\"ltr\">" +
"" +
"<font color=\"#009a49\">Frank Frank</font>" +
"" +
"<font color=\"#0033cc\">http://yahoo.co.in</font>" +
"</p>" +
"</body>" +
"</html>";
if (webText.contains("a href")) {
// Parse it into jsoup
Document doc = Jsoup.parse(webText);
Elements links = doc.select("a");
for (Element link : links) {
if (link.attr("href").contains("http")) {
System.out.println("Link: " + link.toString());
String url = link.attr("href");
if (url != null) {
System.out.println("Link Value: " + url);
Attributes attributes = new Attributes();
attributes.put("href", url);
attributes.put("class", "link");
link.replaceWith(new Element(Tag.valueOf("a"), "", attributes).insertChildren(0, link.childNodes()));
}
} else {
System.out.println("Favourite: " + link.toString());
String url = link.attr("href");
if (url != null) {
System.out.println("Favourite Value: " + url);
Attributes attributes = new Attributes();
attributes.put("href", url);
attributes.put("class", "favourite");
link.replaceWith(new Element(Tag.valueOf("a"), "", attributes).insertChildren(0, link.childNodes()));
}
}
}
Element element = doc.body();
System.out.println("From element html *************** "+ element.html());
}
}
}
Input
<p dir="ltr">
<font color="#009a49">Frank Frank</font>
<font color="#0033cc">http://yahoo.co.in</font>
</p>
Output
<p dir="ltr">
<font color="#009a49">Frank Frank</font>
<font color="#0033cc">http://yahoo.co.in</font>
</p>
Input
<p dir="ltr">
<font color="#009a49">Frank Frank</font>
<font color="#0033cc">http://yahoo.co.in</font>
</p>
Output
<p dir="ltr">
<font color="#009a49">Frank Frank</font>
<font color="#0033cc">http://yahoo.co.in</font>
</p>
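If building a new element is more than is needed, another option is to change each link in place. The sketch below assumes jsoup is on the classpath and that the goal is the one stated in the question: every link ends up with a class of "link" or "favourite" and with its URL as its text. The class name LinkRewriteSketch, the helper rewrite, and the sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkRewriteSketch {
    // Hypothetical helper: tags every link with a class and uses its URL as the link text.
    static String rewrite(String html) {
        Document doc = Jsoup.parse(html);
        for (Element link : doc.select("a[href]")) {
            String url = link.attr("href");
            link.addClass(url.contains("http") ? "link" : "favourite");
            link.text(url); // replaces the link's existing children with the URL as text
        }
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Made-up input; the <a> tags here are an assumption for the sake of the example.
        String html = "<p dir=\"ltr\"><a href=\"user/42\">Frank Frank</a> "
            + "<a href=\"http://yahoo.co.in\">Yahoo</a></p>";
        System.out.println(rewrite(html));
    }
}
```

Note that text(url) discards any nested markup inside the link, such as the font tags in the question; the replaceWith approach above keeps the children instead.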

Replace every word with tag

JAVASCRIPT or JAVA solution needed
The solution I am looking for could use java or javascript. I have the html code in a string so I could manipulate it before using it with java or afterwards with javascript.
problem
Anyway, I have to wrap each word with a tag. For example:
<html> ... >
Hello every one, cheers
< ... </html>
should be changed to
<html> ... >
<word>Hello</word> <word>every</word> <word>one</word>, <word>cheers</word>
< ... </html>
Why?
This will help me use javascript to select/highlight a word. It seems the only way to do it is to use the function highlightElementAtPoint which I added in the JAVASCRIPT hint: It simply finds the element of a certain x,y coordinate and highlights it. I figured that if every word is an element, it will be doable.
The idea is to use this approach to allow us to detect highlighted text in an android WebView even if that would mean to use a twisted highlighting method. Think a bit more and you will find many other applications for this.
JAVASCRIPT hint
I am using the following code to highlight a word; however, this will highlight the whole text belonging to a certain tag. When each word is a tag, this will work to some extent. If there is a substitute that will allow me to highlight a word at a certain position, it would also be a solution.
function highlightElementAtPoint(xOrdinate, yOrdinate) {
var theElement = document.elementFromPoint(xOrdinate, yOrdinate);
selectedElement = theElement;
theElement.style.backgroundColor = "yellow";
var theName = theElement.nodeName;
var theArray = document.getElementsByTagName(theName);
var theIndex = -1;
for (i = 0; i < theArray.length; i++) {
if (theArray[i] == theElement) {
theIndex = i;
}
}
window.androidselection.selected(theElement.innerHTML);
return theName + " " + theIndex;
}

Try something like this:
yourString = yourString.replace(" ", "</word> <word>");
yourString = yourString.replace("<html></word>", "<html>"); // remove the first closing word tag
This should work, though you may have to adjust it.

var tags = document.body.innerText.match(/\w+/g);
for(var i=0;i<tags.length;i++){
tags[i] = '<word>' + tags[i] + '</word>';
}
Or as @ThomasK said:
var tags = document.body.innerText;
tags = '<word>' + tags + '</word>';
tags = tags.replace(/\s/g,'</word><word>');
But keep in mind that .replace(" ",foo) only replaces the first space. To replace every occurrence you have to use .replace(/\s+/g,foo).
And as @ajax333221 said, the second way will include commas, dots and other symbols, so the first is the better solution.
JSFiddle example: http://jsfiddle.net/c6ftq/4/

inputStr = inputStr.replaceAll("(?<!</?)\\w++(?!\\s*>)","<word>$0</word>");
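To see what this one-liner does, here is a small sketch that uses only the Java standard library; the class name WordWrapSketch and the helper wrapWords are made up for illustration. The lookbehind (?<!</?) skips a word that directly follows "<" or "</", the possessive \w++ takes a whole run of word characters without giving any back, and the lookahead (?!\s*>) skips a run that is directly followed by ">".

```java
class WordWrapSketch {
    // Wraps each run of word characters in <word>...</word>, skipping simple tag names.
    static String wrapWords(String input) {
        return input.replaceAll("(?<!</?)\\w++(?!\\s*>)", "<word>$0</word>");
    }

    public static void main(String[] args) {
        System.out.println(wrapWords("<p>Hello every one, cheers</p>"));
        // prints: <p><word>Hello</word> <word>every</word> <word>one</word>, <word>cheers</word></p>
    }
}
```

The pattern only protects bare tags such as <p> or </p>. As far as I can tell, a tag with attributes (for example <div id="x">) would still have parts of it wrapped, so for real-world HTML a parser-based approach is probably safer.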

You can try the following code:
import java.util.StringTokenizer;
public class myTag
{
static String startWordTag = "<Word>";
static String endWordTag = "</Word>";
static String space = " ";
static String myText = "Hello how are you ";
public static void main ( String args[] )
{
StringTokenizer st = new StringTokenizer (myText," ");
StringBuffer sb = new StringBuffer();
while ( st.hasMoreTokens() )
{
sb.append(startWordTag);
sb.append(st.nextToken());
sb.append(endWordTag);
sb.append(space);
}
System.out.println ( "Result:" + sb.toString() );
}
}

Extracting contents from HTML represented as a String

I have a large HTML document in a String variable and I want to get the contents of one div. I cannot rely on a regular expression because the div can contain nested divs. So, suppose I have the following String:
String test = "<div><div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div></div>";
How can I get the following out of it with a simple Java program?
<div id="mainContent">foo bar<div>good best better</div> <div>test test</div></div>
My approach is something like this (it might be horrible; I am still fighting to correct it):
public static void main(String[] args) {
int count = 1;
int fl = 0;
String s = "<div><div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div></div>";
String tmp = s;
int len = s.length();
for (int i=0; i<len; i++){
int st = s.indexOf("div>");
if(st > -1) {
char c = s.charAt(st-1);
if(c == '/') {
count--;
} else {
count++;
}
s = s.substring(st+4);
System.out.println(s);
i = i + st;
System.out.println(c + " -- " + st + " -- " + count + " -- " + i);
if (count == 0) {
fl = i;
break;
}
}
}
System.out.println("final ind - " + fl);
s = tmp.substring(0, fl + 4);
System.out.println("final String - " + s);
}

I would recommend using JSoup to parse the HTML and find what you are looking for.
It fulfills the simple requirement for sure. You can do what you want in just a couple of lines of code!
jsoup is a Java library for working with real-world HTML. It provides
a very convenient API for extracting and manipulating data, using the
best of DOM, CSS, and jquery-like methods.
jsoup implements the WHATWG HTML5 specification, and parses HTML to
the same DOM as modern browsers do.
scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
jsoup is designed to deal with all varieties of HTML found in the
wild; from pristine and validating, to invalid tag-soup; jsoup will
create a sensible parse tree.
Using the selector syntax makes finding and extracting data extremely simple.
public static void main(final String[] args)
{
final String s = "<div><div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div></div>";
final Document d = Jsoup.parse(s);
final Elements e = d.select("#mainContent");
System.out.println(e.get(0));
}
outputs
<div id="mainContent">
foo bar
<div>
good best better
</div>
<div>
test test
</div>
</div>
Doesn't get much more simple than that!

I'm afraid the answer is: you don't. At least not with a "simple" program...
But there is hope: you can use an HTML parser library (like NekoHTML or HTMLParser, although the latter project seems to be dead) to parse the string and retrieve the part you need.
