Generate PDF from JSON object containing content from TinyMCE (html) - java

TL;DR
How do you create a PDF from a JSON object that contains a String written in HTML?
Example JSON:
{
dimensions: {
height: 297,
width: 210
},
boxes: [
{
dimensions: {
height: 10,
width: 190
},
position: {
x: 10,
y: 10
},
content: "<h1>Hello StackOverflow</h1>, I think you are <strong></strong>! I hope someone can answer this!"
}
]
}
Tech used in front-end: AngularJS 1.4.9, ui.tinymce, ment.io
Back-end: whatever works.
I want to be able to create templates for PDFs. The user writes some text in a textarea, uses some variable that will later be replaced with actual data, and when the user presses a button, a PDF should be returned with the finished product.
This should be very generic. So it would be able to be used in pretty much anything.
So, minimal example: The user writes a little text in TinyMCE like
<h1>Hello #[COMMUNITY]</h1>, I think you are <strong>great</strong>! I hope someone can answer this!
This text contains two variables that the user gets with the help of the ment.io plugin. The actual variables are supplied from the controller.
This text is written in an AngularJS version of TinyMCE which also has Ment.io on it which supplies a nice view of available variables.
When the user presses the Save button, a JSON object like the following is created, which is the template.
{
dimensions: {
height: 297,
width: 210
},
boxes: [
{
dimensions: {
height: 10,
width: 190
},
position: {
x: 10,
y: 10
},
content: "user input"
}
]
}
I have a directive in Angular that can generate any number of boxes really, in any size (generic-ho!). This part works great. Simply send in how big you want the 'page' (in mm, so the example says A4-paper size) in the first dimensions object as you see in the object. Then in the boxes you define how big they should be, and where on the 'paper' it should go. And then finally the content, which the user writes in a TinyMCE textarea.
Next step: The back-end replaces the variables with actual data. Then pass it on to the generator.
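The substitution step could be sketched roughly as follows. This is only an illustration: the class name TemplateFiller, the fill method and the assumption that every variable looks like #[NAME] (upper-case letters and underscores) are not from the original setup.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateFiller {
    // Assumed placeholder syntax: #[NAME], with NAME in upper case (hypothetical)
    private static final Pattern PLACEHOLDER = Pattern.compile("#\\[([A-Z_]+)\\]");

    /** Replaces every known placeholder in content with its value. */
    public static String fill(String content, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(content);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            // Unknown variables are left in place so they are easy to spot
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the data from being interpreted
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String template = "<h1>Hello #[COMMUNITY]</h1>";
        System.out.println(fill(template, Map.of("COMMUNITY", "StackOverflow")));
    }
}
```

If the replacement data can contain characters such as < or &, it would probably need to be HTML-escaped before it is inserted.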
Then we come to the tricky part: the actual generator. This should accept, preferably, JSON. The reason is that any project should be able to use it. The front-end and the PDF generator go hand in hand. They don't care what's in the middle. This means that the generator can be written in pretty much anything. I'm a Java developer though, so Java is preferable (hence the Java tag).
Solutions I've found are:
PDFbox, but the problem with using that is the content that TinyMCE produces. TinyMCE outputs HTML or XML. PDFBox does not handle this, at all. Which means I have to write my own HTML or XML parser to try and figure out where the user wants bold-text, and where she wants italics, headings, other font, etc. etc. And I really don't want that. I've been burned on that before. It is on the other hand great for placing the text in the correct places. Even if it is the raw text.
I've read that iText does HTML. But the AGPL license pretty much kills it.
I've also looked at Flying Saucer that takes XHTML and creates a PDF. But it seems to rely on iText.
The solution I'm looking at now is a convoluted way to use Apache FOP. FOP takes an XSL-FO object to work on. So the trouble here is to actually create that XSL-FO object dynamically. I've also read that the XSL-FO standard has been dropped, so I'm unsure how future-proof this approach will be. I've never worked with either FOP or XSLT, so the task seems daunting.
What I'm currently looking at is taking the output from TinyMCE and running it through something like JTidy to get XHTML. From the XHTML, create an XSLT file (in some magical way). Create an XSL-FO object from the XHTML and XSLT. And then generate the PDF from the XSL-FO file. Please tell me there is an easier way.
I can't have been the first to want to do something like this. Yet searching for answers seems to yield very few actual results.
So my question is basically this: How do you create a PDF from a JSON-object like the above, which contains HTML, and get the resulting text to look like it does when you write it in TinyMCE?
Keep in mind that the object can contain an unlimited number of boxes.

So. After some research and work I decided to actually go with PDFbox for the generation. I've also been very strict about what I accept as content input. Right now, I really just accept bold, italics and headings. So I look for <strong>, <em>, and <h[1-6]> tags.
To begin with, I updated my input JSON a bit, more wrapping really.
{
documents: [
{
pages: [
{
dimensions: {width: 210, height: 297},
boxes: [
{
dimensions: {width: 190, height: 40},
placement: {x: 10, y: 10},
content: "Hello <strong>StackOverflow</strong>!"
}
]
}
]
}
]
}
The reason is that I want to be able to put out lots and lots of documents in the same PDF. Think of a mass send-out of letters: each document is slightly different, but you still want them all in the same PDF. You could of course do all this with just the pages level, but if one document spans several pages, it's nicer to have them separated, I think.
My actual code is about 500 lines long, so I won't paste it all here, just the basic parts to be of help, and that's still around 150 lines.
Here goes:
public class Generator {
public static ByteArrayOutputStream generatePDF(final Bundle bundle) throws IOException {
final ByteArrayOutputStream output = new ByteArrayOutputStream();
final PDDocument pdf = new PDDocument();
for (final Document document : bundle.documents) {
for (final Page page : document.pages) {
pdf.addPage(generatePage(pdf, page));
}
}
pdf.save(output);
pdf.close();
return output;
}
private static PDPage generatePage(final PDDocument document, final Page page) throws IOException {
final PDRectangle rect = new PDRectangle(mmToPoints(page.dimensions.width), mmToPoints(page.dimensions.height));
final PDPage pdPage = new PDPage(rect);
final PDPageContentStream cs = new PDPageContentStream(document, pdPage);
for (final Box box : page.boxes) {
resetFont(cs); // Reset the font when starting new box so missing ending tags don't mess up the next box.
final String pc = processContent(box.content); // Make the content prettier. Eg. strip all <p>, replace </p> with \n, strip all <div> tags, etc.
lines(Arrays.asList(pc.split("\n")), box, cs);
}
cs.close();
return pdPage;
}
private static float mmToPoints(final float mm) {
// 1 inch == 72 points (standard DPI), 1 inch == 25.4 mm. So, mm to points means (mm / mmPerInch) * pointsPerInch
return (float) ((mm / 25.4) * 72);
}
private static void lines(final List<String> lines, final Box box, final PDPageContentStream cs) throws IOException {
if (lines.size() == 0) { return; }
cs.beginText();
cs.moveTextPositionByAmount(mmToPoints(box.placement.x), mmToPoints(box.placement.y));
// Now we begin the tricky part
for (int i = 0, length = lines.size(); i < length; ++i) {
final String line = lines.get(i);
final List<Word> wordList = new ArrayList<>();
final String[] splitArray = line.split(" ");
final float fontHeight = fontHeight(currentFont(), currentFontSize()); // Documented elsewhere
cs.appendRawCommands(fontHeight + " TL\n");
if (i == 0) { addNewLine(cs); } // PDFbox starts at the bottom, we start at the top. Add new line so we are inside the box
for (final String index : splitArray) {
final String word = index + " "; // We removed spaces when we split on them, add it to words now.
final StringBuilder wordBuilder = new StringBuilder();
boolean addWord = true;
for (int j = 0, wordLength = word.length(); j < wordLength; ++j) {
final char c = word.charAt(j);
if (c == '<') { // check for <strong> and those
final StringBuilder command = new StringBuilder();
if (addWord && wordBuilder.length() > 0) {
wordList.add(new Word(wordBuilder.toString(), currentFont(), currentFontSize()));
wordBuilder.setLength(0);
addWord = false;
}
for (; j < wordLength; ++j) {
final char c1 = word.charAt(j);
command.append(c1);
if (c1 == '>') {
if (j + 1 < wordLength) { addWord = true; }
break;
}
}
final boolean b = parseForFontChange(command.toString());
if (!b) { // If it wasn't a command, we want to append it to our text
wordBuilder.append(command.toString());
}
} else if (c == '&') { // check for html escaped entities
final int longestHTMLEntityName = 24 + 2; // &ClockwiseContourIntegral;
final StringBuilder escapedChar = new StringBuilder();
escapedChar.append(c);
int k = 1;
for (; k < longestHTMLEntityName && j + k < wordLength; ++k) {
final char c1 = word.charAt(j + k);
if (c1 == '<' || c1 == '>') { break; } // Can't be an escaped char.
escapedChar.append(c1);
if (c1 == ';') { break; } // End of char
}
if (escapedChar.indexOf(";") < 0) { k--; }
wordBuilder.append(StringEscapeUtils.unescapeHtml4(escapedChar.toString()));
j += k;
} else {
wordBuilder.append(c);
}
}
if (addWord) {
wordList.add(new Word(wordBuilder.toString(), currentFont(), currentFontSize()));
}
}
writeWords(wordList, box, cs);
if (i < length - 1) { addNewLine(cs); }
}
cs.endText();
}
public static void writeWords(final List<Word> words, final Box box, final PDPageContentStream cs) {
final float boxWidth = mmToPoints(box.dimensions.width);
float lineWidth = 0;
for (final Word word : words) {
lineWidth += word.width;
if (lineWidth > boxWidth) {
addNewLine(cs);
lineWidth = word.width;
}
if (lineWidth > boxWidth) { // Word longer than box width
lineWidth = 0;
final String string = word.string;
for (int i = 0, length = string.length(); i < length; ++i) {
final char c = string.charAt(i);
final float charWidth = calculateStringWidth(String.valueOf(c), word.font, word.fontSize);
lineWidth += charWidth;
if (lineWidth > boxWidth) {
addNewLine(cs);
lineWidth = charWidth;
}
drawChar(c, word.font, word.fontSize, cs);
}
} else {
drawWord(word, cs);
}
}
}
}
public class Word {
public final String string;
public final PDFont font;
public final float fontSize;
public final float width;
public final float height;
public Word(final String string, final PDFont font, final float fontSize) {
this.string = string;
this.font = font;
this.fontSize = fontSize;
this.width = calculateStringWidth(string, font, fontSize);
this.height = calculateStringHeight(string, font, fontSize);
}
}
I hope this helps someone else facing the same problem. The reason to have a Word class is if you want to split on words, rather than chars.
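The word-level wrapping that writeWords performs can be tried out without PDFBox. The following is a simplified sketch, not the code from the project: GreedyWrap and its width function are hypothetical stand-ins for Word.width, and it does not split words that are wider than the box.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class GreedyWrap {
    /**
     * Greedy wrapping: keep appending words to the current line until the
     * next word would exceed boxWidth, then start a new line.
     * Words are expected to carry their trailing space, as in the answer.
     */
    public static List<String> wrap(List<String> words, double boxWidth,
                                    ToDoubleFunction<String> width) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double lineWidth = 0;
        for (String word : words) {
            double w = width.applyAsDouble(word);
            // Break before the word if it no longer fits on a non-empty line
            if (lineWidth + w > boxWidth && current.length() > 0) {
                lines.add(current.toString());
                current.setLength(0);
                lineWidth = 0;
            }
            current.append(word);
            lineWidth += w;
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Pretend every character is one unit wide (assumption for the demo)
        System.out.println(wrap(List.of("aa ", "bb ", "cc "), 6, String::length));
    }
}
```

In the real generator the width function would be the font-dependent string width, and each line break corresponds to a call to addNewLine(cs).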
Lots of other posts describe how to write some of these helper methods, like calculateStringWidth etc., so they are not here.
See "How to Insert a Linefeed with PDFBox drawString" for newlines and fontHeight, and "How to generate multiple lines in PDF using Apache pdfbox" for string width.
In my case the parseForFontChange method changes the current font and font size. What's active is of course returned by the methods currentFont() and currentFontSize(). I use regexes like (?ui:(<strong>)) to check if a bold tag was in there. Use what suits you.
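A possible shape for that tag check is sketched below. It only classifies a tag; the real parseForFontChange also switches the current font and returns a boolean. The names FontTagSketch, Change and classify are made up for this illustration.

```java
import java.util.regex.Pattern;

public class FontTagSketch {
    public enum Change { BOLD_ON, BOLD_OFF, ITALIC_ON, ITALIC_OFF, HEADING_ON, HEADING_OFF, NONE }

    // (?ui:...) enables Unicode-aware, case-insensitive matching inside the group
    private static final Pattern BOLD_OPEN = Pattern.compile("(?ui:<strong>)");
    private static final Pattern BOLD_CLOSE = Pattern.compile("(?ui:</strong>)");
    private static final Pattern ITALIC_OPEN = Pattern.compile("(?ui:<em>)");
    private static final Pattern ITALIC_CLOSE = Pattern.compile("(?ui:</em>)");
    private static final Pattern HEADING_OPEN = Pattern.compile("(?ui:<h[1-6]>)");
    private static final Pattern HEADING_CLOSE = Pattern.compile("(?ui:</h[1-6]>)");

    /** Returns which font change, if any, the given tag text stands for. */
    public static Change classify(String command) {
        if (BOLD_OPEN.matcher(command).matches()) return Change.BOLD_ON;
        if (BOLD_CLOSE.matcher(command).matches()) return Change.BOLD_OFF;
        if (ITALIC_OPEN.matcher(command).matches()) return Change.ITALIC_ON;
        if (ITALIC_CLOSE.matcher(command).matches()) return Change.ITALIC_OFF;
        if (HEADING_OPEN.matcher(command).matches()) return Change.HEADING_ON;
        if (HEADING_CLOSE.matcher(command).matches()) return Change.HEADING_OFF;
        return Change.NONE; // not a supported tag; the caller keeps it as text
    }

    public static void main(String[] args) {
        System.out.println(classify("<STRONG>"));
        System.out.println(classify("<span>"));
    }
}
```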

Related

itext 7 pdf how to prevent text overflow on right side of the page

I am using itextpdf 7 (7.2.0) to create a pdf file. However even though the TOC part is rendered very well, in the content part the text overflows. Here is my code that generates the pdf:
public class Main {
public static void main(String[] args) throws IOException {
PdfWriter writer = new PdfWriter("fiftyfourthPdf.pdf");
PdfDocument pdf = new PdfDocument(writer);
Document document = new Document(pdf, PageSize.A4,false);
//document.setMargins(30,10,36,10);
// Create a PdfFont
PdfFont font = PdfFontFactory.createFont(StandardFonts.TIMES_ROMAN,"Cp1254");
document
.setTextAlignment(TextAlignment.JUSTIFIED)
.setFont(font)
.setFontSize(11);
PdfOutline outline = null;
java.util.List<AbstractMap.SimpleEntry<String, AbstractMap.SimpleEntry<String, Integer>>> toc = new ArrayList<>();
for(int i=0;i<5000;i++){
String line = "This is paragraph " + String.valueOf(i+1)+ " ";
line = line.concat(line).concat(line).concat(line).concat(line).concat(line);
Paragraph p = new Paragraph(line);
p.setKeepTogether(true);
document.add(p.setFont(font).setFontSize(10).setHorizontalAlignment(HorizontalAlignment.CENTER).setTextAlignment(TextAlignment.LEFT));
//PROCESS FOR TOC
String name = "para " + String.valueOf(i+1);
outline = createOutline(outline,pdf,line ,name );
AbstractMap.SimpleEntry<String, Integer> titlePage = new AbstractMap.SimpleEntry(line, pdf.getNumberOfPages());
p
.setFont(font)
.setFontSize(12)
//.setKeepWithNext(true)
.setDestination(name)
// Add the current page number to the table of contents list
.setNextRenderer(new UpdatePageRenderer(p));
toc.add(new AbstractMap.SimpleEntry(name, titlePage));
}
int contentPageNumber = pdf.getNumberOfPages();
for (int i = 1; i <= contentPageNumber; i++) {
// Write aligned text to the specified by parameters point
document.showTextAligned(new Paragraph(String.format("Sayfa %s / %s", i, contentPageNumber)).setFontSize(10),
559, 26, i, TextAlignment.RIGHT, VerticalAlignment.MIDDLE, 0);
}
//BEGINNING OF TOC
document.add(new AreaBreak());
Paragraph p = new Paragraph("Table of Contents")
.setFont(font)
.setDestination("toc");
document.add(p);
java.util.List<TabStop> tabStops = new ArrayList<>();
tabStops.add(new TabStop(580, TabAlignment.RIGHT, new DottedLine()));
for (AbstractMap.SimpleEntry<String, AbstractMap.SimpleEntry<String, Integer>> entry : toc) {
AbstractMap.SimpleEntry<String, Integer> text = entry.getValue();
p = new Paragraph()
.addTabStops(tabStops)
.add(text.getKey())
.add(new Tab())
.add(String.valueOf(text.getValue()))
.setAction(PdfAction.createGoTo(entry.getKey()));
document.add(p);
}
// Move the table of contents to the first page
int tocPageNumber = pdf.getNumberOfPages();
for (int i = 1; i <= tocPageNumber; i++) {
// Write aligned text to the specified by parameters point
document.showTextAligned(new Paragraph("\n footer text\n second line\nthird line").setFontColor(ColorConstants.RED).setFontSize(8),
300, 26, i, TextAlignment.CENTER, VerticalAlignment.MIDDLE, 0);
}
document.flush();
for(int z = 0; z< (tocPageNumber - contentPageNumber ); z++){
pdf.movePage(tocPageNumber,1);
pdf.getPage(1).setPageLabel(PageLabelNumberingStyle.UPPERCASE_LETTERS,
null, 1);
}
//pdf.movePage(tocPageNumber, 1);
// Add page labels
/*pdf.getPage(1).setPageLabel(PageLabelNumberingStyle.UPPERCASE_LETTERS,
null, 1);*/
pdf.getPage(tocPageNumber - contentPageNumber + 1).setPageLabel(PageLabelNumberingStyle.DECIMAL_ARABIC_NUMERALS,
null, 1);
document.close();
}
private static PdfOutline createOutline(PdfOutline outline, PdfDocument pdf, String title, String name) {
if (outline == null) {
outline = pdf.getOutlines(false);
outline = outline.addOutline(title);
outline.addDestination(PdfDestination.makeDestination(new PdfString(name)));
} else {
PdfOutline kid = outline.addOutline(title);
kid.addDestination(PdfDestination.makeDestination(new PdfString(name)));
}
return outline;
}
private static class UpdatePageRenderer extends ParagraphRenderer {
protected AbstractMap.SimpleEntry<String, Integer> entry;
public UpdatePageRenderer(Paragraph modelElement, AbstractMap.SimpleEntry<String, Integer> entry) {
super(modelElement);
this.entry = entry;
}
public UpdatePageRenderer(Paragraph modelElement) {
super(modelElement);
}
@Override
public LayoutResult layout(LayoutContext layoutContext) {
LayoutResult result = super.layout(layoutContext);
//entry.setValue(layoutContext.getArea().getPageNumber());
if (result.getStatus() != LayoutResult.FULL) {
if (null != result.getOverflowRenderer()) {
result.getOverflowRenderer().setProperty(
Property.LEADING,
result.getOverflowRenderer().getModelElement().getDefaultProperty(Property.LEADING));
} else {
// if overflow renderer is null, that could mean that the whole renderer will overflow
setProperty(
Property.LEADING,
getModelElement().getDefaultProperty(Property.LEADING));
}
}
return result;
}
@Override
// If not overriden, the default renderer will be used for the overflown part of the corresponding paragraph
public IRenderer getNextRenderer() {
return new UpdatePageRenderer((Paragraph) this.getModelElement());
}
}
}
Here are the screenshots of the TOC part and the content part:
[Screenshot: TOC]
[Screenshot: Content]
What am I missing? Thank you all for your help.
UPDATE
When I add the line below, it renders with no overflow, but the page margins of the TOC and the content part differ (the TOC margin is much larger than the content margin). See the attached picture:
document.setMargins(30,60,36,20);
Right Margin difference between TOC and content:
UPDATE 2 :
When I comment the line
document.setMargins(30,60,36,20);
and set the font size on line :
document.add(p.setFont(font).setFontSize(10).setHorizontalAlignment(HorizontalAlignment.CENTER).setTextAlignment(TextAlignment.LEFT));
to 12, then it renders fine. Why would the font size affect the page content and margins? Aren't there standard page margins and page setups? Am I unknowingly (I am a newbie to itextpdf) breaking some standard behavior?
TL; DR: either remove setFontSize in
p
.setFont(font)
.setFontSize(12)
//.setKeepWithNext(true)
.setDestination(name)
or change setFontSize(10) -> setFontSize(12) in
document.add(p.setFont(font).setFontSize(10).setHorizontalAlignment(HorizontalAlignment.CENTER).setTextAlignment(TextAlignment.LEFT));
Explanation: You are setting the Document to not immediately flush elements added to that document with the following line:
Document document = new Document(pdf, PageSize.A4,false);
Then you add a paragraph element with font size equal to 10 to the document with the following line:
document.add(p.setFont(font).setFontSize(10).setHorizontalAlignment(HorizontalAlignment.CENTER).setTextAlignment(TextAlignment.LEFT));
What happens is that the element is laid out (split into lines etc.), but not drawn on the page. Then you do .setFontSize(12), and this new font size is applied for drawing only. So iText calculated that X characters would fit into one line assuming the font size is 10, while in reality the font size is 12 and obviously fewer characters can fit into one line.
There is no sense in setting the font size two times to different values - just pick one value you want to see in the resultant document and set it once.
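To get a feeling for the size of the effect, here is a back-of-the-envelope sketch in plain Java. It does not use iText; the average glyph width of 0.5 em and the 523 pt usable line width (A4 width of 595 pt minus two 36 pt margins) are assumptions chosen only for illustration.

```java
public class FontSizeMismatch {
    // Hypothetical average glyph width as a fraction of the font size
    static final double AVG_GLYPH_EM = 0.5;

    /** Number of characters that fit into a line when laid out at fontSizePt. */
    public static int charsPerLine(double lineWidthPt, double fontSizePt) {
        return (int) Math.floor(lineWidthPt / (AVG_GLYPH_EM * fontSizePt));
    }

    /** Width the same characters occupy when drawn at fontSizePt. */
    public static double drawnWidth(int chars, double fontSizePt) {
        return chars * AVG_GLYPH_EM * fontSizePt;
    }

    public static void main(String[] args) {
        double usableWidth = 523;                  // assumed usable line width in points
        int fit = charsPerLine(usableWidth, 10);   // layout happens at 10 pt
        double drawn = drawnWidth(fit, 12);        // drawing happens at 12 pt
        System.out.println(fit + " chars laid out, drawn width " + drawn
                + " pt, overflow " + (drawn - usableWidth) + " pt");
    }
}
```

Under these assumptions a line laid out at 10 pt ends up about 20% wider when drawn at 12 pt, which matches the kind of right-side overflow described in the question.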

Converting line and column coordinate to a caret position for a JSON debugger

I am building a small Java utility (using Jackson) to catch errors in JSON files. One part of it is a text area into which you can paste some JSON content, and it will tell you the line and column where it found an error:
I extract the line and column from the error message as strings and print them in the interface for the user.
This is the JSON sample I'm working with, and there is an intentional error beside "age", where it's missing a colon:
{
"name": "mkyong.com",
"messages": ["msg 1", "msg 2", "msg 3"],
"age" 100
}
What I want to do is also highlight the problematic area in a cyan color, and for that purpose, I have this code for the button that validates what's inserted in the text area:
cmdValidate.addActionListener(new ActionListener() {
public void actionPerformed(ActionEvent e) {
functionsClass ops = new functionsClass();
String JSONcontent = JSONtextArea.getText();
Results obj = new Results();
ops.validate_JSON_text(JSONcontent, obj);
String result = obj.getResult();
String caret = obj.getCaret();
//String lineNum = obj.getLineNum();
//showStatus(result);
if(result==null) {
textAreaError.setText("JSON code is valid!");
} else {
textAreaError.setText(result);
Highlighter.HighlightPainter cyanPainter;
cyanPainter = new DefaultHighlighter.DefaultHighlightPainter(Color.cyan);
int caretPosition = Integer.parseInt(caret);
int lineNumber = 0;
try {
lineNumber = JSONtextArea.getLineOfOffset(caretPosition);
} catch (BadLocationException e2) {
// TODO Auto-generated catch block
e2.printStackTrace();
}
try {
JSONtextArea.getHighlighter().addHighlight(lineNumber, caretPosition + 1, cyanPainter);
} catch (BadLocationException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
}
}
});
}
The "addHighlight" method takes a start offset, an end offset and a painter, which wasn't immediately apparent to me; I thought I had to get the reference line based on the column number. After some split calls to extract the numbers, I assigned 11 (see screenshot) to the caret value, not realizing that addHighlight counts character positions from the beginning of the text and that this value is used as the end point of the range.
For reference, this is the class that does the work behind the scenes, and the error handling at the bottom is about extracting the line and column numbers. For the record, "x" is the error message that would generate out of an invalid file.
package parsingJSON;
import java.io.IOException;
import com.fasterxml.jackson.core.JsonParseException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
public class functionsClass extends JSONTextCompare {
public boolean validate_JSON_text(String JSONcontent, Results obj) {
boolean valid = false;
try {
ObjectMapper objMapper = new ObjectMapper();
JsonNode validation = objMapper.readTree(JSONcontent);
valid = true;
}
catch (JsonParseException jpe){
String x = jpe.getMessage();
printTextArea(x, obj);
//return part_3;
}
catch (IOException ioe) {
String x = ioe.getMessage();
printTextArea(x, obj);
//return part_3;
}
return valid;
}
public void printTextArea(String x, Results obj) {
// TODO Auto-generated method stub
System.out.println(x);
String err = x.substring(x.lastIndexOf("\n"));
String parts[] = err.split(";");
//String part 1 is the discarded leading edge that is the closing brackets of the JSON content
String part_2 = parts[1];
//split again to get rid of the closing square bracket
String parts2[] = part_2.split("]");
String part_3 = parts2[0];
//JSONTextCompare feedback = new JSONTextCompare();
//split the output to get the exact location of the error to communicate back and highlight it in the JSONTextCompare class
//first need to get the line number from the output
String[] parts_lineNum = part_3.split("line: ");
String[] parts_lineNum_final = parts_lineNum[1].split(", column:");
String lineNum = parts_lineNum_final[0];
String[] parts_caret = part_3.split("column: ");
String caret = parts_caret[1];
System.out.println(caret);
obj.setLineNum(lineNum);
obj.setCaret(caret);
obj.setResult(part_3);
System.out.println(part_3);
}
}
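The chain of split calls in printTextArea could probably be condensed with a regular expression. The sketch below assumes the message ends with a location of the form "line: N, column: M]", as the splitting code above implies; LocationParser and the sample message are made up for illustration. Jackson's exception also offers getLocation(), which may make parsing the message unnecessary.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LocationParser {
    private static final Pattern LOCATION =
            Pattern.compile("line:\\s*(\\d+),\\s*column:\\s*(\\d+)");

    /**
     * Returns {line, column} taken from the last "line: N, column: M"
     * occurrence in the message, or null if there is none.
     */
    public static int[] parse(String message) {
        Matcher m = LOCATION.matcher(message);
        int[] result = null;
        // Keep the last match, since the location sits at the end of the message
        while (m.find()) {
            result = new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample message shaped like the one the question parses (assumed format)
        String msg = "Unexpected character: was expecting a colon\n"
                + " at [Source: (String)\"{...}\"; line: 4, column: 11]";
        int[] loc = parse(msg);
        System.out.println("line " + loc[0] + ", column " + loc[1]);
    }
}
```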
Screenshot for what the interface currently looks like:
Long story short: how do I turn the coordinates Line 4, Col 11 into a caret value (say its value is 189, for the sake of argument) that I can use to get the highlighter to work properly? Some kind of custom parsing formula might work, but in general, is that even possible to do?
how do I turn the coordinates Line 4, Col 11 into a caret value (e.g. it's value 189,
Check out: Text Utilities for methods that might be helpful when working with text components. It has methods like:
centerLineInScrollPane
getColumnAtCaret
getLineAtCaret
getLines
gotoStartOfLine
gotoFirstWordOnLine
getWrappedLines
In particular the gotoStartOfLine() method contains code you can modify to get the offset of the specified row/column.
The basic code would be:
int line = 4;
int column = 11;
Element root = textArea.getDocument().getDefaultRootElement();
int offset = root.getElement( line - 1 ).getStartOffset() + column;
System.out.println(offset);
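The same conversion can be tried on a plain String, without a text component. This is a sketch under the assumption that both the line and the column are 1-based and that lines are separated by a single \n (which is how a Swing Document stores its text); LineColumnOffset is a hypothetical helper, not part of Text Utilities.

```java
public class LineColumnOffset {
    /** Converts a 1-based line and 1-based column into a 0-based offset. */
    public static int toOffset(String text, int line, int column) {
        int lineStart = 0;
        // Skip over (line - 1) newline characters to find the start of the line
        for (int current = 1; current < line; current++) {
            int newline = text.indexOf('\n', lineStart);
            if (newline < 0) {
                throw new IllegalArgumentException("Text has fewer than " + line + " lines");
            }
            lineStart = newline + 1;
        }
        return lineStart + column - 1;
    }

    public static void main(String[] args) {
        String json = "{\n  \"name\": \"x\",\n  \"age\" 100\n}";
        int offset = toOffset(json, 3, 9);
        System.out.println(offset + " -> '" + json.charAt(offset) + "'");
    }
}
```

Whether a parser reports the column of the offending character or of the character after it can vary, so the result may need to be shifted by one before it is passed to the highlighter.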
The way it works is essentially counting the number of characters in each line, up until the line in which the error is occurring, and adding the caretPosition to that sum of characters, which is what the Highlighter needs to apply the marking to the correct location.
I've added the code for the Validate button for context.
functionsClass ops = new functionsClass();
String JSONcontent = JSONtextArea.getText();
Results obj = new Results();
ops.validate_JSON_text(JSONcontent, obj);
String result = obj.getResult();
String caret = obj.getCaret();
String lineNum = obj.getLineNum();
//showStatus(result);
if(result==null) {
textAreaError.setText("JSON code is valid!");
} else {
textAreaError.setText(result);
Highlighter.HighlightPainter cyanPainter;
cyanPainter = new DefaultHighlighter.DefaultHighlightPainter(Color.cyan);
//the column number as per the location of the error
int caretPosition = Integer.parseInt(caret); //JSONtextArea.getCaretPosition();
//the line number as per the location of the error
int lineNumber = Integer.parseInt(lineNum);
//get the number of characters in the string up to the line in which the error is found
int totalChars = 0;
int counter = 0; //used to only go to the line above where the error is located
String[] lines = JSONcontent.split("\\r?\\n");
for (String line : lines) {
counter = counter + 1;
//as long as we're above the line of the error (lineNumber variable), keep counting characters
if (counter < lineNumber)
{
//add 1 for the newline character that ends each line in the text area
totalChars = totalChars + line.length() + 1;
}
//if we are at the line that contains the error, only add the caretPosition value to get the final position where the highlighting should go
if (counter == lineNumber)
{
totalChars = totalChars + caretPosition;
break;
}
}
//put down the highlighting in the area where the JSON file is having a problem
try {
JSONtextArea.getHighlighter().addHighlight(totalChars - 2, totalChars + 2, cyanPainter);
} catch (BadLocationException e1) {
// the computed range was outside the document; log it instead of silently ignoring it
e1.printStackTrace();
}
}
The contents of the JSON file are treated as a string, and that's why I'm also iterating through them in that fashion. There are certainly better ways to go through the lines of a string, and I'll add some reference topics on SO:
What is the easiest/best/most correct way to iterate through the characters of a string in Java? - Link
Check if a string contains \n - Link
Split Java String by New Line - Link
What is the best way to iterate over the lines of a Java String? - Link
Generally a combination of these led to this solution, and I am also not targeting it for use on very large JSON files.
A screenshot of the output, with the interface highlighting the same area that Notepad++ would complain about, if it could debug code:
I'll post the project on GitHub after I clean it up and comment it some, and will give a link to that later, but for now, hopefully this helps the next dev in a similar situation.

finding the number of open closed html tags in a string

I am trying to figure out the best way to find the number of valid HTML tags in a string.
The assumption is that a tag is valid only if it has both an opening and a closing tag.
This is an example of a test case:
INPUT
"html": "<html><head></head><body><div><div></div></div>"
Output
"validTags":3
If you need to parse HTML
Do not do it yourself. There is no need to reinvent the wheel. There is a plethora of libraries for parsing HTML. Use the proper tool for the proper job.
Concentrate your efforts on the rest of your project. Sure, you could implement your own function that parses a string, looks for < and >, and acts appropriately. But HTML might be slightly more complex than you imagine, or you might end up needing more HTML parsing than just counting tags.
Maybe in the future you'll want to count <br/> and <br /> as well. Or you'll want to find the depth of the HTML tree.
Maybe your homemade code doesn't account for all possible combinations of escaping characters, nested tags, etc. How many correct tags are there in the string:
<a><b><c><d e><f g="<h></h>"><i j="<k>" l="</k>"></i></f></e d></b></c></ a >
In a comment, user dbl linked to a similar question with links to libraries: How to validate HTML from java ?
If you want to count open-closed tag pairs as a learning project
Here is a proposed algorithm in pseudocode, as a recursive function:
function count_tags(s):
    tag, remainder = find_next_tag(s)
    if (no tag was found)
        return 0
    found, inside, after = find_closing_tag(tag, remainder)
    if (found)
        return 1 + count_tags(inside) + count_tags(after)
    else
        return count_tags(inside)
Examples
on the string hello <a>world<c></c></a><b></b>, we will get:
tag = "<a>"
remainder = "world<c></c></a><b></b>"
found = true
inside = "world<c></c>"
after = "<b></b>"
return 1 + count_tags("world<c></c>") + count_tags("<b></b>")
on the string <html><head></head>:
tag = "<html>"
remainder = "<head></head>"
found = false
inside = "<head></head>"
after = ""
return count_tags("<head></head>")
on the string <a><b></a></b>:
tag = "<a>"
remainder = "<b></a></b>"
found = true
inside = "<b>"
after = "</b>"
return 1 + count_tags("<b>") + count_tags("</b>")
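One way to turn this pseudocode into Java is sketched below, for learning purposes only. It makes assumptions the pseudocode leaves open: tags are found with simple regular expressions (so attribute values containing < or > will confuse it), and find_closing_tag picks the closing tag at the same nesting depth for that tag name. TagCounter and its method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagCounter {
    // An opening tag: <name> or <name attributes...>
    private static final Pattern OPEN_TAG =
            Pattern.compile("<([A-Za-z][A-Za-z0-9]*)(\\s[^>]*)?>");

    public static int countTags(String s) {
        Matcher open = OPEN_TAG.matcher(s);
        if (!open.find()) {
            return 0; // base case: no opening tag left
        }
        String name = open.group(1);
        String remainder = s.substring(open.end());
        int[] close = findClosingTag(name, remainder);
        if (close == null) {
            // Unclosed tag: it does not count, but its contents still might
            return countTags(remainder);
        }
        String inside = remainder.substring(0, close[0]);
        String after = remainder.substring(close[1]);
        return 1 + countTags(inside) + countTags(after);
    }

    /** Returns {start, end} of the matching closing tag in remainder, or null. */
    static int[] findClosingTag(String name, String remainder) {
        Pattern tag = Pattern.compile(
                "<(/?)" + Pattern.quote(name) + "(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);
        Matcher m = tag.matcher(remainder);
        int depth = 0;
        while (m.find()) {
            if (m.group(1).isEmpty()) {
                depth++;            // nested opening tag with the same name
            } else if (depth == 0) {
                return new int[] { m.start(), m.end() };
            } else {
                depth--;            // closes a nested tag with the same name
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(countTags("<html><head></head><body><div><div></div></div>"));
    }
}
```

For anything beyond an exercise, the earlier advice still applies: a real HTML parser handles the many cases this sketch ignores.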
I wrote a function that would do exactly this.
static int checkValidTags(String html,String[] openTags, String[] closeTags) {
//openTags and closeTags must have the same length;
//This function keeps track of all opening tags.
//and removes the opening and closing tags if the tag is closed correctly
//It can even detect when there are labels added to the tags.
HashMap<Character,Integer> open = new HashMap<>();
HashMap<Character,Integer> close = new HashMap<>();
//Use a start character, this is 1 because 0 would be a string terminator.
int startChar = 1;
for(int i = 0; i < openTags.length; i++) {
open.put((char)startChar, i);
close.put((char)(startChar+1), i);
html = html.replaceAll(openTags[i],""+ (char)startChar);
html = html.replaceAll(closeTags[i],""+(char)(startChar+1));
startChar+=2;
}
List<List<Integer>> startIndexes = new ArrayList<>();
int validLabels = 0;
for(int i = 0; i < openTags.length; i++) {
startIndexes.add(new ArrayList<>());
}
for(int i = 0; i < html.length(); i++) {
char c = html.charAt(i);
if(open.get(c)!=null) {
startIndexes.get(open.get(c)).add(0,i);
}
if(close.get(c)!=null&&!startIndexes.get(close.get(c)).isEmpty()) {
//Drop any tags that were opened after this one but never closed
for(int k = 0; k < startIndexes.size(); k++) {
if(!startIndexes.get(k).isEmpty()) {
int p = startIndexes.get(k).get(0);
if(p > startIndexes.get(close.get(c)).get(0)) {
startIndexes.get(k).remove(0);
}
}
}
startIndexes.get(close.get(c)).remove(0);
validLabels++;
}
}
return validLabels;
}
And to use it in your example you would do like this:
String html = "<html><head></head><body><div><div></div></div>";
int validTags = checkValidTags(html,new String[] {
//Add here all the tags you are looking for.
//Remove the trailing '>' so it can detect extra tags appended to it
"<html","<head","<body","<div"
}, new String[]{
"</html>","</head>","</body>","</div>"
});
System.out.println(validTags);
Output:
3

How to make it looking good sentence for line spacing

I am developing a project that tags words: I select (drag over) some words and click a special button.
I want the tags to surround the word (begin and end tags in red); please refer to the example picture.
But when the tags are inserted at the beginning and end of the text, empty gaps appear (as in the second picture).
When I drag over those gaps, there is no real whitespace there (no white space, no " ", no "&nbsp;"); it is just empty space!
I can't select that space!
Pic. Link here
here's my code below:
attribute:
static final Color TAG_COLOR = new Color(255, 50, 50);
static final Color PLAIN_TXT_COLOR = new Color(0, 0, 0);
public static SimpleAttributeSet plainAttr = new SimpleAttributeSet();
public static SimpleAttributeSet tagAttr = new SimpleAttributeSet();
StyleConstants.setAlignment(plainAttr, StyleConstants.ALIGN_JUSTIFIED);
StyleConstants.setForeground(plainAttr, PLAIN_TXT_COLOR);
StyleConstants.setFontSize(plainAttr, 11);
StyleConstants.setBold(plainAttr, false);
StyleConstants.setAlignment(tagAttr, StyleConstants.ALIGN_JUSTIFIED);
StyleConstants.setForeground(tagAttr, TAG_COLOR);
StyleConstants.setFontSize(tagAttr, 11);
StyleConstants.setBold(tagAttr, true);
tagging function:
public static void tag_functiont() {
String taggedName = "tagMark";
int start_sel = mainEditText.getSelectionStart();
int end_sel = mainEditText.getSelectionEnd();
String selected = mainEditText.getSelectedText();
StyledDocument doc = mainEditText.getStyledDocument();
if(selected == null || selected.isEmpty()) return;
try {
String bTag = "__B:"+taggedName+"__";
String eTag = "__E:"+ taggedName +"__";
doc.insertString(start_sel, bTag, tagAttr);
doc.insertString(start_sel+bTag.length()+selected.length(), eTag, tagAttr);
} catch (Exception e) {
e.printStackTrace();
}
}
I also tried every attribute option I could think of
(several fonts and every alignment: center, right, left, justified).
Could someone give me a piece of advice?
Solved
I had added textPane.setContentType("html/text"); in the main source, which was foolish of me.
It triggered <p> and <div> tags, which broke the paragraph layout.
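For anyone hitting the same thing, here is a minimal sketch of the same tagging logic run against a plain DefaultStyledDocument, so no HTML content type is involved. The class name TagInsertSketch and the helper tagPlainText are hypothetical, not from the question; the tag format is the one used above. It inserts the end tag first, which is a small change from the original code: the start offset then does not shift and no length arithmetic is needed.

```java
import java.awt.Color;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.StyleConstants;
import javax.swing.text.StyledDocument;

// Hypothetical sketch: wraps a selection [start, end) in begin/end tags.
final class TagInsertSketch {

    static String tagPlainText(String text, int start, int end, String name) {
        // Red, bold attributes for the tags, as in the question.
        SimpleAttributeSet tagAttr = new SimpleAttributeSet();
        StyleConstants.setForeground(tagAttr, new Color(255, 50, 50));
        StyleConstants.setBold(tagAttr, true);

        // A plain styled document: no HTML editor kit, so no <p>/<div> wrapping.
        StyledDocument doc = new DefaultStyledDocument();
        try {
            doc.insertString(0, text, null);
            if (start < end) {
                // End tag first, so 'start' is still valid for the second insert.
                doc.insertString(end, "__E:" + name + "__", tagAttr);
                doc.insertString(start, "__B:" + name + "__", tagAttr);
            }
            return doc.getText(0, doc.getLength());
        } catch (BadLocationException e) {
            throw new IllegalArgumentException("selection outside the text", e);
        }
    }

    public static void main(String[] args) {
        // Offsets 9..13 select "word".
        System.out.println(tagPlainText("tag this word please", 9, 13, "tagMark"));
    }
}
```

In a real editor the offsets would come from getSelectionStart() and getSelectionEnd() as in the question.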

How do I preserve line breaks when using jsoup to convert html to plain text?

I have the following code:
import org.jsoup.Jsoup;

public class NewClass {
public String noTags(String str){
return Jsoup.parse(str).text();
}
public static void main(String args[]) {
String strings="<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
"<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> googlez</p></BODY> </HTML> ";
NewClass text = new NewClass();
System.out.println((text.noTags(strings)));
}
}
And I have the result:
hello world yo googlez
But I want to break the line:
hello world
yo googlez
I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.
If there's a <br> in the markup I parse, how can I get a line break in my resulting output?
The real solution that preserves linebreaks should be like this:
public static String br2nl(String html) {
if(html==null)
return html;
Document document = Jsoup.parse(html);
document.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
document.select("br").append("\\n");
document.select("p").prepend("\\n\\n");
String s = document.html().replaceAll("\\\\n", "\n");
return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}
It satisfies the following requirements:
if the original HTML contains a newline (\n), it is preserved
if the original HTML contains br or p tags, they are translated to newlines (\n).
With
Jsoup.parse("A\nB").text();
you have output
"A B"
and not
A
B
For this I'm using:
descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");
Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
We're using this method here:
public static String clean(String bodyHtml,
String baseUri,
Whitelist whitelist,
Document.OutputSettings outputSettings)
By passing it Whitelist.none() we make sure that all HTML is removed.
By passing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.
On Jsoup v1.11.2, we can now use Element.wholeText().
String cleanString = Jsoup.parse(htmlString).wholeText();
user121196's answer still works. But wholeText() preserves the alignment of texts.
Try this by using jsoup:
public static String cleanPreserveLineBreaks(String bodyHtml) {
// get pretty printed html with preserved br and p tags
String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
// get plain text with preserved line breaks by disabled prettyPrint
return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}
For more complex HTML none of the above solutions worked quite right; I was able to successfully do the conversion while preserving line breaks with:
Document document = Jsoup.parse(myHtml);
String text = new HtmlToPlainText().getPlainText(document);
(version 1.10.3)
You can traverse a given element
public String convertNodeToText(Element element)
{
final StringBuilder buffer = new StringBuilder();
new NodeTraversor(new NodeVisitor() {
boolean isNewline = true;
@Override
public void head(Node node, int depth) {
if (node instanceof TextNode) {
TextNode textNode = (TextNode) node;
String text = textNode.text().replace('\u00A0', ' ').trim();
if(!text.isEmpty())
{
buffer.append(text);
isNewline = false;
}
} else if (node instanceof Element) {
Element element = (Element) node;
if (!isNewline)
{
if((element.isBlock() || element.tagName().equals("br")))
{
buffer.append("\n");
isNewline = true;
}
}
}
}
@Override
public void tail(Node node, int depth) {
}
}).traverse(element);
return buffer.toString();
}
And for your code
String result = convertNodeToText(Jsoup.parse(html));
Based on the other answers and the comments on this question it seems that most people coming here are really looking for a general solution that will provide a nicely formatted plain text representation of an HTML document. I know I was.
Fortunately jsoup already provides a pretty comprehensive example of how to achieve this: HtmlToPlainText.java
The example FormattingVisitor can easily be tweaked to your preference and deals with most block elements and line wrapping.
To avoid link rot, here is Jonathan Hedley's solution in full:
package org.jsoup.examples;
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;
import java.io.IOException;
/**
* HTML to plain-text. This example program demonstrates the use of jsoup to convert HTML input to lightly-formatted
* plain-text. That is divergent from the general goal of jsoup's .text() methods, which is to get clean data from a
* scrape.
* <p>
* Note that this is a fairly simplistic formatter -- for real world use you'll want to embrace and extend.
* </p>
* <p>
* To invoke from the command line, assuming you've downloaded the jsoup jar to your current directory:</p>
* <p><code>java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]</code></p>
* where <i>url</i> is the URL to fetch, and <i>selector</i> is an optional CSS selector.
*
* @author Jonathan Hedley, jonathan@hedley.net
*/
public class HtmlToPlainText {
private static final String userAgent = "Mozilla/5.0 (jsoup)";
private static final int timeout = 5 * 1000;
public static void main(String... args) throws IOException {
Validate.isTrue(args.length == 1 || args.length == 2, "usage: java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]");
final String url = args[0];
final String selector = args.length == 2 ? args[1] : null;
// fetch the specified URL and parse to a HTML DOM
Document doc = Jsoup.connect(url).userAgent(userAgent).timeout(timeout).get();
HtmlToPlainText formatter = new HtmlToPlainText();
if (selector != null) {
Elements elements = doc.select(selector); // get each element that matches the CSS selector
for (Element element : elements) {
String plainText = formatter.getPlainText(element); // format that element to plain text
System.out.println(plainText);
}
} else { // format the whole doc
String plainText = formatter.getPlainText(doc);
System.out.println(plainText);
}
}
/**
* Format an Element to plain-text
* @param element the root element to format
* @return formatted text
*/
public String getPlainText(Element element) {
FormattingVisitor formatter = new FormattingVisitor();
NodeTraversor traversor = new NodeTraversor(formatter);
traversor.traverse(element); // walk the DOM, and call .head() and .tail() for each node
return formatter.toString();
}
// the formatting rules, implemented in a depth-first DOM traverse
private class FormattingVisitor implements NodeVisitor {
private static final int maxWidth = 80;
private int width = 0;
private StringBuilder accum = new StringBuilder(); // holds the accumulated text
// hit when the node is first seen
public void head(Node node, int depth) {
String name = node.nodeName();
if (node instanceof TextNode)
append(((TextNode) node).text()); // TextNodes carry all user-readable text in the DOM.
else if (name.equals("li"))
append("\n * ");
else if (name.equals("dt"))
append(" ");
else if (StringUtil.in(name, "p", "h1", "h2", "h3", "h4", "h5", "tr"))
append("\n");
}
// hit when all of the node's children (if any) have been visited
public void tail(Node node, int depth) {
String name = node.nodeName();
if (StringUtil.in(name, "br", "dd", "dt", "p", "h1", "h2", "h3", "h4", "h5"))
append("\n");
else if (name.equals("a"))
append(String.format(" <%s>", node.absUrl("href")));
}
// appends text to the string builder with a simple word wrap method
private void append(String text) {
if (text.startsWith("\n"))
width = 0; // reset counter if starts with a newline. only from formats above, not in natural text
if (text.equals(" ") &&
(accum.length() == 0 || StringUtil.in(accum.substring(accum.length() - 1), " ", "\n")))
return; // don't accumulate long runs of empty spaces
if (text.length() + width > maxWidth) { // won't fit, needs to wrap
String words[] = text.split("\\s+");
for (int i = 0; i < words.length; i++) {
String word = words[i];
boolean last = i == words.length - 1;
if (!last) // insert a space if not the last word
word = word + " ";
if (word.length() + width > maxWidth) { // wrap and reset counter
accum.append("\n").append(word);
width = word.length();
} else {
accum.append(word);
width += word.length();
}
}
} else { // fits as is, without need to wrap text
accum.append(text);
width += text.length();
}
}
@Override
public String toString() {
return accum.toString();
}
}
}
text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = text.replaceAll("br2n", "\n");
works only if the HTML itself doesn't contain "br2n".
So,
text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "<pre>\n</pre>")).text();
works more reliably and is easier.
Try this:
public String noTags(String str){
Document d = Jsoup.parse(str);
TextNode tn = new TextNode(d.body().html(), "");
return tn.getWholeText();
}
Use textNodes() to get a list of the text nodes. Then concatenate them with \n as separator.
Here's some scala code I use for this, java port should be easy:
val rawTxt = doc.body().getElementsByTag("div").first.textNodes()
.asScala.mkString("<br />\n")
Try this by using jsoup:
doc.outputSettings(new OutputSettings().prettyPrint(false));
//select all <br> tags and append \n after that
doc.select("br").after("\\n");
//select all <p> tags and prepend \n before that
doc.select("p").before("\\n");
//get the HTML from the document, and retaining original new lines
String str = doc.html().replaceAll("\\\\n", "\n");
This is my version of translating HTML to text (a modified version of user121196's answer, actually).
It doesn't just preserve line breaks; it also tidies the text by removing excessive line breaks and unescaping HTML entities, so you get a much better result from your HTML (in my case, HTML received from mail).
It's originally written in Scala, but you can change it to Java easily:
def html2text( rawHtml : String ) : String = {
val htmlDoc = Jsoup.parseBodyFragment( rawHtml, "/" )
htmlDoc.select("br").append("\\nl")
htmlDoc.select("div").prepend("\\nl").append("\\nl")
htmlDoc.select("p").prepend("\\nl\\nl").append("\\nl\\nl")
org.jsoup.parser.Parser.unescapeEntities(
Jsoup.clean(
htmlDoc.html(),
"",
Whitelist.none(),
new org.jsoup.nodes.Document.OutputSettings().prettyPrint(true)
),false
).
replaceAll("\\\\nl", "\n").
replaceAll("\r","").
replaceAll("\n\\s+\n","\n").
replaceAll("\n\n+","\n\n").
trim()
}
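The trailing cleanup chain in that Scala snippet ports to Java almost verbatim. Below is a minimal sketch of only that normalization step, with the jsoup part left out; the class name NewlineCleanupSketch and the method tidy are hypothetical, and the input is assumed to already carry the literal \nl markers that the select/append calls above insert.

```java
// Hypothetical sketch: the post-processing step of html2text, without jsoup.
final class NewlineCleanupSketch {

    /** Turns literal "\nl" markers into newlines and collapses runs of blank lines. */
    static String tidy(String marked) {
        return marked
                .replace("\\nl", "\n")         // literal backslash + "nl" marker -> real newline
                .replace("\r", "")             // drop carriage returns
                .replaceAll("\n\\s+\n", "\n")  // whitespace-only lines collapse to one break
                .replaceAll("\n\n+", "\n\n")   // at most one empty line in a row
                .trim();
    }

    public static void main(String[] args) {
        // Two markers after "Hello", then a messy run of blank lines before "Bye".
        System.out.println(tidy("Hello\\nl\\nlWorld\r\n \n\n\nBye  "));
    }
}
```

String.replace is used for the two literal substitutions so that the backslash in the marker needs no regex escaping; the two replaceAll calls keep the original patterns.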
/**
* Recursive method to replace html br with java \n. The recursive method ensures that the linebreaker can never end up pre-existing in the text being replaced.
* @param html
* @param linebreakerString
* @return the html as String with proper java newlines instead of br
*/
public static String replaceBrWithNewLine(String html, String linebreakerString){
String result = "";
if(html.contains(linebreakerString)){
result = replaceBrWithNewLine(html, linebreakerString+"1");
} else {
result = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", linebreakerString)).text(); // replace any HTML line breaks with the placeholder
result = result.replaceAll(linebreakerString, "\n");
}
return result;
}
Used by calling with the html in question, containing the br, along with whatever string you wish to use as the temporary newline placeholder.
For example:
replaceBrWithNewLine(element.html(), "br2n")
The recursion ensures that the string you use as the newline/linebreaker placeholder is never actually present in the source HTML, as it keeps appending a "1" until the placeholder string is not found in the HTML. It won't have the formatting issue that the Jsoup.clean methods seem to encounter with special characters.
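The same placeholder idea can also be written as a loop. The sketch below is stdlib-only so it can run on its own: the class BrPlaceholderSketch and its methods are hypothetical, and the regex tag strip is a crude stand-in for Jsoup.parse(...).text(), not an equivalent (it does no entity decoding or whitespace normalization). It also swaps the final replaceAll for String.replace, so a placeholder containing regex metacharacters is treated literally.

```java
import java.util.regex.Matcher;

// Hypothetical sketch: iterative version of the collision-free placeholder trick.
final class BrPlaceholderSketch {

    /** Appends "1" to the seed until the result does not occur in the html. */
    static String freePlaceholder(String html, String seed) {
        String placeholder = seed;
        while (html.contains(placeholder)) {
            placeholder += "1";
        }
        return placeholder;
    }

    /** Replaces every br tag with a newline in the extracted text. */
    static String brToNewline(String html, String seed) {
        String placeholder = freePlaceholder(html, seed);
        // Mark each <br ...> with the placeholder (case-insensitive).
        String marked = html.replaceAll("(?i)<br[^>]*>", Matcher.quoteReplacement(placeholder));
        // STAND-IN for Jsoup.parse(marked).text(): strip whatever tags remain.
        String text = marked.replaceAll("<[^>]+>", "");
        // Literal replace, so the placeholder is never interpreted as a regex.
        return text.replace(placeholder, "\n");
    }

    public static void main(String[] args) {
        // The input already contains "br2n", so "br2n1" is chosen instead.
        System.out.println(brToNewline("<p>a<br>b br2n<BR/>c</p>", "br2n"));
    }
}
```

With jsoup on the classpath, the stand-in line would be replaced by the Jsoup.parse(marked).text() call from the answer above.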
Based on user121196's and Green Beret's answer with the selects and <pre>s, the only solution which works for me is:
org.jsoup.nodes.Element elementWithHtml = ....
elementWithHtml.select("br").append("<pre>\n</pre>");
elementWithHtml.select("p").prepend("<pre>\n\n</pre>");
elementWithHtml.text();
