I have 2 separate java files (Main & RSS). I would like to return the result from my RSS class to my Main class. Right now the results are displayed in console. How can I append the results to my JTextArea instead? Thanks!
In my Main class:
public void news()
{
news = new JPanel();
news.setLayout( null );
JTextArea textArea = new JTextArea();
textArea.setBackground(SystemColor.window);
textArea.setBounds(10, 11, 859, 512);
textArea.setWrapStyleWord(true);
news.add(textArea);
TextSamplerDemo reader = TextSamplerDemo.getInstance();
reader.writeNews();
}
In my RSS class:
public void writeNews(){
try{
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
URL u = new URL("http://rss.cnn.com/rss/cnn_world.rss");
Document doc = builder.parse(u.openStream());
NodeList nodes = doc.getElementsByTagName("item");
for(int i=0;i<nodes.getLength();i++){
Element element = (Element)nodes.item(i);
System.out.println("Title: " + getElementValue(element,"title"));
System.out.println("Link: " + getElementValue(element,"link"));
}
}
catch(Exception ex){
ex.printStackTrace();
}
}
If you modify your RSS.writeNews method to return the parsed RSS feed, the Main class can easily insert the data into the text area.
// In the RSS class
public String writeNews()
{
String result = "";
...
// Instead of printing to console, store text in a String variable
result += "Title: " + getElementValue(element,"title") + "\n";
result += "Link: " + getElementValue(element,"link") + "\n";
...
// Return result
return result;
}
// In the Main.news method
String rssNews = reader.writeNews();
textArea.append(rssNews);
Instead of initializing the text area in your method, initialize it globally (like your news var), then use
Main.textArea.setText(String text);
You could consider the Observer Design Pattern. This way, you don't have to share the JTextArea object between classes.
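A minimal sketch of that observer idea, under stated assumptions: RssReader, addListener and publish are hypothetical names standing in for the RSS class and its parsing loop, and a StringBuilder stands in for the JTextArea so the example runs without a GUI. The RSS side only notifies its listeners; it never sees the text area.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for the RSS class: it knows nothing about Swing,
// it only notifies whoever registered as a listener.
class RssReader {
    private final List<Consumer<String>> listeners = new ArrayList<>();

    // Register an observer that receives each formatted news entry.
    void addListener(Consumer<String> listener) {
        listeners.add(listener);
    }

    // In the real class this would be called from the loop over the <item> nodes.
    void publish(String title, String link) {
        String entry = "Title: " + title + "\nLink: " + link + "\n";
        for (Consumer<String> listener : listeners) {
            listener.accept(entry);
        }
    }
}

class ObserverDemo {
    public static void main(String[] args) {
        RssReader reader = new RssReader();
        StringBuilder sink = new StringBuilder(); // stands in for the JTextArea
        // In Main.news() this could be: reader.addListener(textArea::append);
        reader.addListener(sink::append);
        reader.publish("Example headline", "http://example.com/story");
        System.out.print(sink);
    }
}
```

If the feed is parsed on a background thread, the listener should hand the text to the event dispatch thread (for example via SwingUtilities.invokeLater) rather than touching the text area directly.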
I'm trying to add a href to Arraylist and this adds nicely to the Arraylist, but the link is broken. Everything after the question mark (?) in the URL is not included in the link.
Is there anything that I'm missing, code below:
private String processUpdate(Database dbCurrent) throws NotesException {
int intCountSuccessful = 0;
View vwLookup = dbCurrent.getView("DocsDistribution");
ArrayList<String> listArray = new ArrayList<String>();
Document doc = vwLookup.getFirstDocument();
while (doc != null) {
String paperDistro = doc.getItemValueString("DistroRecords");
if (paperDistro.equals("")) {
String ref = doc.getItemValueString("ref");
String unid = doc.getUniversalID();
// the link generated when adding to Arraylist is broken
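listArray.add("<a href=\"gandhi.w3schools.com/testbox.nsf/distro.xsp?documentId=\"+ unid + \"&action=openDocument\">" + ref + "</a>");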
}
Document tmppmDoc = vwLookup.getNextDocument(doc);
doc.recycle();
doc = tmppmDoc;
}
Collections.sort(listArray);
String listString = "";
for (String s : listArray) {
listString += s + ", \t";
}
return listString;
}
The quotes around the unid value are escaped incorrectly, so the concatenation ends up inside the string literal and your URL becomes gandhi.w3schools.com/testbox.nsf/distro.xsp?documentId="+ unid + "&action=openDocument.
It would be easier to read if you use String.format() and single quotes to generate the a tag:
listArray.add(String.format(
"<a href='gandhi.w3schools.com/testbox.nsf/distro.xsp?documentId=%s&action=openDocument'>%s</a>",
unid, ref));
I have used the following code to extract text from .odt files:
public class OpenOfficeParser {
StringBuffer TextBuffer;
public OpenOfficeParser() {}
//Process text elements recursively
public void processElement(Object o) {
if (o instanceof Element) {
Element e = (Element) o;
String elementName = e.getQualifiedName();
if (elementName.startsWith("text")) {
if (elementName.equals("text:tab")) // add tab for text:tab
TextBuffer.append("\\t");
else if (elementName.equals("text:s")) // add space for text:s
TextBuffer.append(" ");
else {
List children = e.getContent();
Iterator iterator = children.iterator();
while (iterator.hasNext()) {
Object child = iterator.next();
//If Child is a Text Node, then append the text
if (child instanceof Text) {
Text t = (Text) child;
TextBuffer.append(t.getValue());
}
else
processElement(child); // Recursively process the child element
}
}
if (elementName.equals("text:p"))
TextBuffer.append("\\n");
}
else {
List non_text_list = e.getContent();
Iterator it = non_text_list.iterator();
while (it.hasNext()) {
Object non_text_child = it.next();
processElement(non_text_child);
}
}
}
}
public String getText(String fileName) throws Exception {
TextBuffer = new StringBuffer();
//Unzip the openOffice Document
ZipFile zipFile = new ZipFile(fileName);
Enumeration entries = zipFile.entries();
ZipEntry entry;
while(entries.hasMoreElements()) {
entry = (ZipEntry) entries.nextElement();
if (entry.getName().equals("content.xml")) {
TextBuffer = new StringBuffer();
SAXBuilder sax = new SAXBuilder();
Document doc = sax.build(zipFile.getInputStream(entry));
Element rootElement = doc.getRootElement();
processElement(rootElement);
break;
}
}
System.out.println("The text extracted from the OpenOffice document = " + TextBuffer.toString());
return TextBuffer.toString();
}
}
Now my problem occurs when using the string returned from the getText() method.
I ran the program and extracted some text from a .odt, here is a piece of extracted text:
(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....
So I tried this
System.out.println( TextBuffer.toString().split("\\n"));
the output I received was:
substring: [Ljava.lang.String;@505bb829
I also tried this:
System.out.println( TextBuffer.toString().trim() );
but no changes in the printed string.
Why does this happen?
What can I do to parse that string correctly?
And if I wanted to store in array[i] each substring that ends with "\n\n", how could I do that?
edit:
Sorry I made a mistake with the example because I forgot that split() returns an array.
The problem is that it returns an array with a single element, so what I'm asking is why doing this:
System.out.println(Arrays.toString(TextBuffer.toString().split("\\n")));
has no effect on the string I wrote in the example.
Also this:
System.out.println( TextBuffer.toString().trim() );
has no effect on the original string; it just prints the original string.
To explain why I want to use split(): I want to parse that string and put each substring that ends with "\n" into its own array element. Here is an example:
My original string:
(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....
after parsing I would print each line of an array and the output should be:
line 1: (no hi virtual x oy)
line 2: house cat
line 3: open it
line 4: trying to
and so on.....
If I understood your question correctly I would do something like this
String str = "(no hi virtual x oy)\n\n house cat \n open it \n\n trying to....";
List<String> al = new ArrayList<String>(Arrays.asList(str.split("\\n")));
al.removeAll(Arrays.asList("", null)); // remove empty or null string
for (int i = 0; i< al.size(); i++) {
System.out.println("Line " + i + " : " + al.get(i).trim());
}
Output
Line 0 : (no hi virtual x oy)
Line 1 : house cat
Line 2 : open it
Line 3 : trying to....
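One caveat worth checking: the parser in the question calls TextBuffer.append("\\n"), which appends the two characters backslash and n rather than a line break. If the extracted text really contains that literal sequence, split("\\n") (a regex for a real newline) finds nothing to split on. The sketch below assumes that situation; SplitDemo and splitOnLiteralBackslashN are made-up names for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SplitDemo {
    // Splits on the two-character sequence backslash + 'n' and drops blank pieces.
    static List<String> splitOnLiteralBackslashN(String text) {
        List<String> lines = new ArrayList<>();
        // "\\\\n" in Java source is the regex \\n: an escaped backslash followed by n.
        for (String part : text.split("\\\\n")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // "\\n" in source is backslash + n, mirroring TextBuffer.append("\\n") in the question.
        String extracted = "(no hi virtual x oy)\\n\\n house cat \\n open it \\n\\n trying to....";
        // The regex for a real newline does not match, so the array has one element.
        System.out.println(extracted.split("\\n").length);
        System.out.println(splitOnLiteralBackslashN(extracted));
    }
}
```

The cleaner fix is probably in the parser itself: appending "\n" and "\t" (single backslash) would put real line breaks and tabs into the buffer, after which the split("\\n") answer above applies unchanged.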
My code is to add RSS feeds to a list - and the code originally was only to pull one feed from the first position in a list, and add this object to another list.
This was the original code:
public static List<Feed> getFeedsFromXml(String xml) {
Pattern feedPattern = Pattern.compile("<feed>\\s*<name>\\s*([^<]*)</name>\\s*<uri>\\s*([^<]*)</uri>\\s*</feed>");
Matcher feedMatch = feedPattern.matcher(xml);
while (feedMatch.find()) {
String feedName = feedMatch.group(1);
String feedURI = feedMatch.group(2);
feeds.add(new Feed(feedName, feedURI));
}
return feeds;
}
@POST
@Consumes(MediaType.APPLICATION_XML)
@Produces(MediaType.APPLICATION_XML)
public String addXmlFeed() throws IOException
{
int i = 0;
String stringXml = "<feed><name>SMH Top Headlines</name><uri>http://feeds.smh.com.au/rssheadlines/top.xml</uri></feed><feed><name>UTS Library News</name>";
getFeedsFromXml(stringXml);
Feed f = (Feed) feeds.get(0);
feedList.add(f);
String handler = "You have successfully added: \n";
String xmlStringReply = "" + f + "\n";
feedList.save(feedFile);
return handler + xmlStringReply;
}
Everything was going well, and then I decided to implement a for loop for handling the adding of more than one feed to the list, and I tried the following (only the code for the second method in question):
@POST
@Consumes(MediaType.APPLICATION_XML)
@Produces(MediaType.APPLICATION_XML)
public String addXmlFeed() throws IOException
{
int i = 0;
String stringXml = "<feed><name>SMH Top Headlines</name><uri>http://feeds.smh.com.au/rssheadlines/top.xml</uri></feed><feed><name>UTS Library News</name>";
getFeedsFromXml(stringXml);
for (Feed feed: feeds)
{
Feed f = (Feed) feeds.get(i++);
feedList.add(f);
String handler = "You have successfully added: \n";
String xmlStringReply = "" + f + "\n";
}
feedList.save(feedFile);
return handler + xmlStringReply;
}
Now I'm sure this is a basic problem, but now in the line:
return handler + xmlStringReply;
handler and xmlStringReply cannot be resolved to a variable as they are within the FOR LOOP.
Is there any easy way around this?
The scope of those 2 variables is limited to the for loop. To access them outside the loop, you need to increase their scope by declaring them before the loop:
String handler = "";
String xmlStringReply = "";
for (Feed f: feeds) {
feedList.add(f);
handler = "You have successfully added: \n";
xmlStringReply = "" + f + "\n";
}
feedList.save(feedFile);
return handler + xmlStringReply;
Also, your current code overwrites the value of your strings at each loop, whereas you probably meant to concatenate the values. In that case, you could use a StringBuilder instead of string concatenation:
StringBuilder xmlStringReply = new StringBuilder("You have successfully added: \n");
for (Feed f: feeds) {
feedList.add(f);
xmlStringReply.append(f + "\n");
}
feedList.save(feedFile);
return xmlStringReply.toString();
The question you need to answer is "what do I want to return if I add several feeds?".
Maybe you'd like to return "You have successfully added: feed1 feed2 feed3\n"
In that case, the code is:
StringBuilder response = new StringBuilder( "You have successfully added: ");
for (Feed feed: feeds)
{
feedList.add(feed);
response.append(feed.toString()).append(" ");
}
feedList.save(feedFile);
return response.toString();
By the way, your feed and f variables refer to the same object, so one of them is redundant!
Don't write :
int i = 0;
for (Feed feed: feeds)
{
Feed f = (Feed) feeds.get(i++);
feedList.add(f);
}
but
for (Feed feed: feeds)
{
feedList.add(feed);
}
You need to accumulate the result into a variable. I am using StringBuilder because it makes string concatenation efficient.
@POST
@Consumes(MediaType.APPLICATION_XML)
@Produces(MediaType.APPLICATION_XML)
public String addXmlFeed() throws IOException
{
String stringXml = "<feed><name>SMH Top Headlines</name><uri>http://feeds.smh.com.au/rssheadlines/top.xml</uri></feed><feed><name>UTS Library News</name>";
getFeedsFromXml(stringXml);
StringBuilder replyBuilder = new StringBuilder("You have successfully added: \n");
for (Feed feed : feeds)
{
feedList.add(feed);
String xmlStringReply = feed + "\n";
replyBuilder.append(xmlStringReply);
}
feedList.save(feedFile);
return replyBuilder.toString();
}
Because the variables are now declared inside the loop, they are out of scope at the return statement.
Besides the original error, which you can easily fix using the other suggestions, I would suggest that you not make feeds an instance variable. Your method getFeedsFromXml() already returns the list, so it would be better to define that variable inside the method and then call the method like this:
List<Feed> feeds = getFeedsFromXml(stringXml);
If that doesn't give you the desired behaviour, you should at least rename the method to something like loadFeedsFromXml(). Keeping feeds as an instance variable may result in threading issues.
Now, trying to improve on your looping,
StringBuilder xmlStringReply = new StringBuilder("You have successfully added: \n");
for (Feed feed: feeds) {
feedList.add(feed);
xmlStringReply.append(feed + "\n");
}
feedList.save(feedFile);
return xmlStringReply.toString();
Moreover, I see that your feedList is also an instance variable. This can again cause threading issues, as it does not appear to be immutable or stateless, and synchronising the methods would cost you performance. See if you can make it local to this method. A rule of thumb is to keep variable scope as narrow as possible.
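To make the scoping advice concrete, here is a small self-contained sketch. The Feed class below is a hypothetical minimal stand-in (the real one is not shown in the question), and describe is a made-up helper; the regex is the one from the question. The list is local to getFeedsFromXml and returned to the caller, and the StringBuilder is declared before the loop so it is still in scope at the return.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FeedScopeDemo {
    // Hypothetical minimal Feed type, only for this illustration.
    static final class Feed {
        final String name;
        final String uri;
        Feed(String name, String uri) { this.name = name; this.uri = uri; }
        @Override public String toString() { return name + " (" + uri + ")"; }
    }

    private static final Pattern FEED_PATTERN = Pattern.compile(
        "<feed>\\s*<name>\\s*([^<]*)</name>\\s*<uri>\\s*([^<]*)</uri>\\s*</feed>");

    // The list is created inside the method and returned, so there is no shared state.
    static List<Feed> getFeedsFromXml(String xml) {
        List<Feed> feeds = new ArrayList<>();
        Matcher m = FEED_PATTERN.matcher(xml);
        while (m.find()) {
            feeds.add(new Feed(m.group(1), m.group(2)));
        }
        return feeds;
    }

    // The builder is declared before the loop, so it is visible at the return.
    static String describe(List<Feed> feeds) {
        StringBuilder reply = new StringBuilder("You have successfully added: \n");
        for (Feed feed : feeds) {
            reply.append(feed).append('\n');
        }
        return reply.toString();
    }

    public static void main(String[] args) {
        String xml = "<feed><name>A</name><uri>http://example.com/a.xml</uri></feed>"
                   + "<feed><name>B</name><uri>http://example.com/b.xml</uri></feed>";
        System.out.print(describe(getFeedsFromXml(xml)));
    }
}
```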
A good rule of thumb is to view scope like this:
{ //This is a constructor
int i;
} // This is a destructor
Anything that is created or instantiated between the curly braces only lives inside them. The same applies whenever you're working with variables and loops:
for(int i = 0; i < 10; i++){
//some code here
} // after this curly i is no longer in scope or accessible.
I have a collection of raw text in a database table, and I need to replace some words in it using a set of substitute words.
I put all the terms to be replaced and their substitutes in a text file, as below:
min=admin
lelet=lambat
lemot=lambat
nii=nih
ntu=itu
and so on.
I have successfully set up a File and a Scanner to read the terms and their substitutes.
I loop over the whole dataset and save the raw text in a string.
In the same loop:
I loop over the term collection, save each row to a string named 'pattern', and split the pattern into two strings named 'term' and 'replacer'.
In this loop I create a new string whose value is the string from the dataset modified by replaceAll(term,replacer).
End of the loop over the term collection.
Then I insert the new string into another table in the database.
End of the loop over the dataset.
When I do it manually, as below:
replaceAll("min","admin")
it works, but coding it manually for almost 2000 terms is not practical.
Has anyone ever faced this kind of problem?
I really need help with this.
package sentimenrepo;
import javax.swing.*;
import java.sql.*;
import java.io.*;
//import java.util.HashMap;
import java.util.Scanner;
//import java.util.Map;
/**
*
* @author herman
*/
public class synonimReplaceV2 extends SwingWorker {
protected Object doInBackground() throws Exception {
new skripsisentimen.sentimenttwitter().setVisible(true);
Integer row = 0;
File synonimV2 = new File("synV2/catatan_kata_sinonim.txt");
String newTweet = "";
DB db = new DB();
Connection conn = db.dbConnect("jdbc:mysql://localhost:3306/tweet", "root", "");
try{
Statement select = conn.createStatement();
select.executeQuery("select * from synonimtweet");
ResultSet RS = select.getResultSet();
Scanner scSynV2 = new Scanner(synonimV2);
while(RS.next()){
row++;
String no = RS.getString("no");
String tweet = " "+ RS.getString("tweet");
String published = RS.getString("published");
String label = RS.getString("label");
clean2 cleanv2 = new clean2();
newTweet = cleanv2.cleanTweet(tweet);
try{
Statement insert = conn.createStatement();
insert.executeUpdate("INSERT INTO synonimtweet_v2(no,tweet,published,label) values('"
+no+"','"+newTweet+"','"+published+"','"+label+"')");
String current = skripsisentimen.sentimenttwitter.txtAreaResult.getText();
skripsisentimen.sentimenttwitter.txtAreaResult.setText(current+"\n"+row+"original : "+tweet+"\n"+newTweet+"\n______________________\n");
skripsisentimen.sentimenttwitter.lblStat.setText(row+" tweet read");
skripsisentimen.sentimenttwitter.txtAreaResult.setCaretPosition(skripsisentimen.sentimenttwitter.txtAreaResult.getText().length() - 1);
}catch(Exception e){
skripsisentimen.sentimenttwitter.lblStat.setText(e.getMessage());
}
}
}catch(Exception e){
skripsisentimen.sentimenttwitter.lblStat.setText(e.getMessage());
}
return row;
}
class clean2{
public clean2(){}
public String cleanTweet(String tweet){
File synonimV2 = new File("synV2/catatan_kata_sinonim.txt");
String pattern = "";
String term = "";
String replacer = "";
String newTweet="";
try{
Scanner scSynV2 = new Scanner(synonimV2);
while(scSynV2.hasNext()){
pattern = scSynV2.next();
term = pattern.split("=")[0];
replacer = pattern.split("=")[1];
newTweet = tweet.replace(term, replacer);
}
}catch(Exception e){
e.printStackTrace();
}
System.out.println(newTweet+"\n"+tweet);
return newTweet;
}
}
}
Update
I've just realized that the code actually works, but only for the first row in the database; the second and later rows are left unchanged. Here is the newest code I've built:
public class synonimReplaceV2 extends SwingWorker {
protected Object doInBackground() throws Exception {
new skripsisentimen.sentimenttwitter().setVisible(true);
Integer row = 0;
String newTweet = "";
DB db = new DB();
Connection conn = db.dbConnect("jdbc:mysql://localhost:3306/tweet", "root", "");
try{
Statement select = conn.createStatement();
select.executeQuery("select * from synonimtweet limit 2,10");
ResultSet RS = select.getResultSet();
FileReader readSyn = new FileReader("synV2/catatan_kata_sinonim.txt");
BufferedReader buffSyn = new BufferedReader(readSyn);
while(RS.next()){
row++;
String no = RS.getString("no");
String tweet = " "+ RS.getString("tweet");
String published = RS.getString("published");
String label = RS.getString("label");
String pattern = "";
while((pattern=buffSyn.readLine())!=null){
String patternTerm = pattern.split("=")[0];
String patternSubs = pattern.split("=")[1];
tweet = tweet.replaceAll("\\s"+patternTerm, patternSubs);
}
try{
Statement insert = conn.createStatement();
insert.executeUpdate("INSERT INTO synonimtweet_v2(no,tweet,published,label) values('"
+no+"','"+tweet+"','"+published+"','"+label+"')");
String current = skripsisentimen.sentimenttwitter.txtAreaResult.getText();
skripsisentimen.sentimenttwitter.txtAreaResult.setText(current+"\n"+row+"original : "+tweet+"\n"+newTweet+"\n______________________\n");
skripsisentimen.sentimenttwitter.lblStat.setText(row+" tweet read");
skripsisentimen.sentimenttwitter.txtAreaResult.setCaretPosition(skripsisentimen.sentimenttwitter.txtAreaResult.getText().length() - 1);
}catch(Exception e){
skripsisentimen.sentimenttwitter.lblStat.setText(e.getMessage());
}
}
}catch(Exception e){
skripsisentimen.sentimenttwitter.lblStat.setText(e.getMessage());
// System.out.println(e.getMessage());
}
Thread.sleep(100);
return row;
}
}
Opening the synonym file and iterating over 2,000 lines for every row in your ResultSet is a bit wasteful.
Load your synonyms into an in-memory Map once, keyed by unique misspelt term, then do a lookup on the map for every row in your result set, and replace as necessary.
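A minimal sketch of that loading step, assuming the file keeps the term=replacement format shown in the question. SynonymLoader and load are hypothetical names, and a StringReader stands in for the FileReader on synV2/catatan_kata_sinonim.txt so the example runs without the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

class SynonymLoader {
    // Reads "term=replacement" lines into a map; blank or malformed lines are skipped.
    static Map<String, String> load(Reader source) throws IOException {
        Map<String, String> synonyms = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int eq = line.indexOf('=');
                if (eq < 0) {
                    continue; // no '=' on this line
                }
                String term = line.substring(0, eq).trim();
                String replacement = line.substring(eq + 1).trim();
                if (term.isEmpty() || replacement.isEmpty()) {
                    continue; // one side is missing
                }
                synonyms.put(term, replacement);
            }
        }
        return synonyms;
    }

    public static void main(String[] args) throws IOException {
        // In the real program: load(new FileReader("synV2/catatan_kata_sinonim.txt"))
        Map<String, String> synonyms = load(new StringReader("min=admin\nlelet=lambat\nlemot=lambat\n"));
        System.out.println(synonyms);
    }
}
```

Calling load once before the while(RS.next()) loop means the file is read a single time, however many rows the query returns.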
Let us use both solutions to build a single solution for you:
First, you create a HashMap with all your keys:
public static HashMap<String, String> getMap() {
//your version would read from the file
HashMap<String,String> myMap=new HashMap<String,String>();
myMap.put("min", "admin");
myMap.put("lelet", "lambat");
myMap.put("lemot", "lambat");
myMap.put("nii", "nih");
myMap.put("ntu", "itu");
return(myMap);
}
Second, you create a pattern that contains all the keys in your hashmap:
public static String getPattern(HashMap<String,String> mapReplacement) {
String pattern="";
for (String s : mapReplacement.keySet()) {
if (!pattern.isEmpty()) {
pattern=pattern+"|";
}
pattern=pattern+s;
}
return(pattern);
}
Next, you can create a cleanTweet method that uses both structures you created:
public static String cleanTweet(String tweet, Pattern pattern,HashMap<String, String> myMap) {
String newTweet=tweet;
Matcher matcher = pattern.matcher(newTweet);
int start=0;
while (matcher.find()) {
String key=matcher.group();
String replacement=myMap.get(key);
if (replacement!=null) {
newTweet=newTweet.replace(key, replacement );
}
}
return(newTweet);
}
This might require some tweaking to perfect (I only tested a few cases), but the point is that you iterate over your keys a single time to build the pattern and then iterate only over your tweets.
I hope it helps.
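One possible tweak, offered as a sketch rather than a drop-in replacement: calling String.replace inside the find() loop replaces every occurrence each time, so a term that appears twice can be substituted inside its own replacement (min to admin to adadmin). The variant below swaps in Matcher.appendReplacement so each match is replaced exactly once, and adds \b word boundaries so that "min" inside "admin" is left alone. TweetCleaner and buildPattern are made-up names, and the word-boundary behaviour is my assumption about what is wanted.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TweetCleaner {
    // Builds \b(?:term1|term2|...)\b with every term quoted as a literal.
    static Pattern buildPattern(Map<String, String> synonyms) {
        StringBuilder alternation = new StringBuilder();
        for (String term : synonyms.keySet()) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            alternation.append(Pattern.quote(term));
        }
        return Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    // Replaces each matched term exactly once, in a single pass over the tweet.
    static String cleanTweet(String tweet, Pattern pattern, Map<String, String> synonyms) {
        Matcher matcher = pattern.matcher(tweet);
        StringBuffer out = new StringBuffer(); // StringBuffer keeps this Java 8 compatible
        while (matcher.find()) {
            String replacement = synonyms.get(matcher.group());
            if (replacement == null) {
                replacement = matcher.group(); // no mapping: keep the text as it was
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> synonyms = new LinkedHashMap<>();
        synonyms.put("min", "admin");
        synonyms.put("lemot", "lambat");
        synonyms.put("ntu", "itu");
        Pattern pattern = buildPattern(synonyms); // build once, reuse for every row
        System.out.println(cleanTweet("min lemot ntu admin", pattern, synonyms));
    }
}
```

In doInBackground, the map and the pattern would be created once before while(RS.next()), and each row would then call cleanTweet(tweet, pattern, synonyms) before the INSERT.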
I didn't try, but it seems to me that you've almost got it - just replace this line:
newTweet = tweet.replace(term, replacer);
with this:
tweet = tweet.replaceAll(term, replacer);
As you're not using newTweet any more, return tweet:
return tweet;
You should also delete the newTweet declaration.
Also, you shouldn't use Scanner to read lines. Use a BufferedReader wrapped around a FileReader and call readLine() instead.
Thanks, folks.
I've found out why the code was not working:
the reader for the text file containing the terms and their substitutes has to be re-created each time the program reads a row from the database, because after the first row it has already reached the end of the file.
The code now looks like this:
public class synonimReplaceV2 extends SwingWorker {
protected Object doInBackground() throws Exception {
new skripsisentimen.sentimenttwitter().setVisible(true);
Integer row = 0;
String newTweet = "";
DB db = new DB();
Connection conn = db.dbConnect("jdbc:mysql://localhost:3306/tweet", "root", "");
try{
Statement select = conn.createStatement();
select.executeQuery("select * from synonimtweet limit 2,10");
ResultSet RS = select.getResultSet();
while(RS.next()){
row++;
FileReader readSyn = new FileReader("synV2/catatan_kata_sinonim.txt");
BufferedReader buffSyn = new BufferedReader(readSyn);
String no = RS.getString("no");
String tweet = " "+ RS.getString("tweet");
String published = RS.getString("published");
String label = RS.getString("label");
String pattern = "";
while((pattern=buffSyn.readLine())!=null){
String patternTerm = pattern.split("=")[0];
String patternSubs = pattern.split("=")[1];
tweet = tweet.replaceAll("\\s"+patternTerm, patternSubs);
}
try{
Statement insert = conn.createStatement();
insert.executeUpdate("INSERT INTO synonimtweet_v2(no,tweet,published,label) values('"
+no+"','"+tweet+"','"+published+"','"+label+"')");
String current = skripsisentimen.sentimenttwitter.txtAreaResult.getText();
skripsisentimen.sentimenttwitter.txtAreaResult.setText(current+"\n"+row+"original : "+tweet+"\n"+newTweet+"\n______________________\n");
skripsisentimen.sentimenttwitter.lblStat.setText(row+" tweet read");
skripsisentimen.sentimenttwitter.txtAreaResult.setCaretPosition(skripsisentimen.sentimenttwitter.txtAreaResult.getText().length() - 1);
}catch(Exception e){
skripsisentimen.sentimenttwitter.lblStat.setText(e.getMessage());
}
}
}catch(Exception e){
skripsisentimen.sentimenttwitter.lblStat.setText(e.getMessage());
// System.out.println(e.getMessage());
}
Thread.sleep(100);
return row;
}
}
But I actually want to apply the code that rlinden posted above, and I can't figure out how to call the cleanTweet function.
I have the following code:
public class NewClass {
public String noTags(String str){
return Jsoup.parse(str).text();
}
public static void main(String args[]) {
String strings="<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">" +
"<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> googlez</p></BODY> </HTML> ";
NewClass text = new NewClass();
System.out.println((text.noTags(strings)));
}
}
And I have the result:
hello world yo googlez
But I want to break the line:
hello world
yo googlez
I have looked at jsoup's TextNode#getWholeText() but I can't figure out how to use it.
If there's a <br> in the markup I parse, how can I get a line break in my resulting output?
The real solution that preserves linebreaks should be like this:
public static String br2nl(String html) {
if(html==null)
return html;
Document document = Jsoup.parse(html);
document.outputSettings(new Document.OutputSettings().prettyPrint(false));//makes html() preserve linebreaks and spacing
document.select("br").append("\\n");
document.select("p").prepend("\\n\\n");
String s = document.html().replaceAll("\\\\n", "\n");
return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
}
It satisfies the following requirements:
if the original html contains newline(\n), it gets preserved
if the original html contains br or p tags, they get translated to newline(\n).
With
Jsoup.parse("A\nB").text();
you have output
"A B"
and not
A
B
For this I'm using:
descrizione = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = descrizione.replaceAll("br2n", "\n");
Jsoup.clean(unsafeString, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
We're using this method here:
public static String clean(String bodyHtml,
String baseUri,
Whitelist whitelist,
Document.OutputSettings outputSettings)
By passing it Whitelist.none() we make sure that all HTML is removed.
By passing new OutputSettings().prettyPrint(false) we make sure that the output is not reformatted and line breaks are preserved.
On Jsoup v1.11.2, we can now use Element.wholeText().
String cleanString = Jsoup.parse(htmlString).wholeText();
user121196's answer still works. But wholeText() preserves the alignment of texts.
Try this by using jsoup:
public static String cleanPreserveLineBreaks(String bodyHtml) {
// get pretty printed html with preserved br and p tags
String prettyPrintedBodyFragment = Jsoup.clean(bodyHtml, "", Whitelist.none().addTags("br", "p"), new OutputSettings().prettyPrint(true));
// get plain text with preserved line breaks by disabled prettyPrint
return Jsoup.clean(prettyPrintedBodyFragment, "", Whitelist.none(), new OutputSettings().prettyPrint(false));
}
For more complex HTML none of the above solutions worked quite right; I was able to successfully do the conversion while preserving line breaks with:
Document document = Jsoup.parse(myHtml);
String text = new HtmlToPlainText().getPlainText(document);
(version 1.10.3)
You can traverse a given element
public String convertNodeToText(Element element)
{
final StringBuilder buffer = new StringBuilder();
new NodeTraversor(new NodeVisitor() {
boolean isNewline = true;
@Override
public void head(Node node, int depth) {
if (node instanceof TextNode) {
TextNode textNode = (TextNode) node;
String text = textNode.text().replace('\u00A0', ' ').trim();
if(!text.isEmpty())
{
buffer.append(text);
isNewline = false;
}
} else if (node instanceof Element) {
Element element = (Element) node;
if (!isNewline)
{
if((element.isBlock() || element.tagName().equals("br")))
{
buffer.append("\n");
isNewline = true;
}
}
}
}
@Override
public void tail(Node node, int depth) {
}
}).traverse(element);
return buffer.toString();
}
And for your code
String result = convertNodeToText(Jsoup.parse(html));
Based on the other answers and the comments on this question it seems that most people coming here are really looking for a general solution that will provide a nicely formatted plain text representation of an HTML document. I know I was.
Fortunately, jsoup already provides a pretty comprehensive example of how to achieve this: HtmlToPlainText.java
The example FormattingVisitor can easily be tweaked to your preference and deals with most block elements and line wrapping.
To avoid link rot, here is Jonathan Hedley's solution in full:
package org.jsoup.examples;
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;
import java.io.IOException;
/**
* HTML to plain-text. This example program demonstrates the use of jsoup to convert HTML input to lightly-formatted
* plain-text. That is divergent from the general goal of jsoup's .text() methods, which is to get clean data from a
* scrape.
* <p>
* Note that this is a fairly simplistic formatter -- for real world use you'll want to embrace and extend.
* </p>
* <p>
* To invoke from the command line, assuming you've downloaded the jsoup jar to your current directory:</p>
* <p><code>java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]</code></p>
* where <i>url</i> is the URL to fetch, and <i>selector</i> is an optional CSS selector.
*
* @author Jonathan Hedley, jonathan@hedley.net
*/
public class HtmlToPlainText {
private static final String userAgent = "Mozilla/5.0 (jsoup)";
private static final int timeout = 5 * 1000;
public static void main(String... args) throws IOException {
Validate.isTrue(args.length == 1 || args.length == 2, "usage: java -cp jsoup.jar org.jsoup.examples.HtmlToPlainText url [selector]");
final String url = args[0];
final String selector = args.length == 2 ? args[1] : null;
// fetch the specified URL and parse to a HTML DOM
Document doc = Jsoup.connect(url).userAgent(userAgent).timeout(timeout).get();
HtmlToPlainText formatter = new HtmlToPlainText();
if (selector != null) {
Elements elements = doc.select(selector); // get each element that matches the CSS selector
for (Element element : elements) {
String plainText = formatter.getPlainText(element); // format that element to plain text
System.out.println(plainText);
}
} else { // format the whole doc
String plainText = formatter.getPlainText(doc);
System.out.println(plainText);
}
}
/**
* Format an Element to plain-text
* @param element the root element to format
* @return formatted text
*/
public String getPlainText(Element element) {
FormattingVisitor formatter = new FormattingVisitor();
NodeTraversor traversor = new NodeTraversor(formatter);
traversor.traverse(element); // walk the DOM, and call .head() and .tail() for each node
return formatter.toString();
}
// the formatting rules, implemented in a breadth-first DOM traverse
private class FormattingVisitor implements NodeVisitor {
private static final int maxWidth = 80;
private int width = 0;
private StringBuilder accum = new StringBuilder(); // holds the accumulated text
// hit when the node is first seen
public void head(Node node, int depth) {
String name = node.nodeName();
if (node instanceof TextNode)
append(((TextNode) node).text()); // TextNodes carry all user-readable text in the DOM.
else if (name.equals("li"))
append("\n * ");
else if (name.equals("dt"))
append(" ");
else if (StringUtil.in(name, "p", "h1", "h2", "h3", "h4", "h5", "tr"))
append("\n");
}
// hit when all of the node's children (if any) have been visited
public void tail(Node node, int depth) {
String name = node.nodeName();
if (StringUtil.in(name, "br", "dd", "dt", "p", "h1", "h2", "h3", "h4", "h5"))
append("\n");
else if (name.equals("a"))
append(String.format(" <%s>", node.absUrl("href")));
}
// appends text to the string builder with a simple word wrap method
private void append(String text) {
if (text.startsWith("\n"))
width = 0; // reset counter if starts with a newline. only from formats above, not in natural text
if (text.equals(" ") &&
(accum.length() == 0 || StringUtil.in(accum.substring(accum.length() - 1), " ", "\n")))
return; // don't accumulate long runs of empty spaces
if (text.length() + width > maxWidth) { // won't fit, needs to wrap
String words[] = text.split("\\s+");
for (int i = 0; i < words.length; i++) {
String word = words[i];
boolean last = i == words.length - 1;
if (!last) // insert a space if not the last word
word = word + " ";
if (word.length() + width > maxWidth) { // wrap and reset counter
accum.append("\n").append(word);
width = word.length();
} else {
accum.append(word);
width += word.length();
}
}
} else { // fits as is, without need to wrap text
accum.append(text);
width += text.length();
}
}
@Override
public String toString() {
return accum.toString();
}
}
}
text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "br2n")).text();
text = text.replaceAll("br2n", "\n");
works if the html itself doesn't contain "br2n"
So,
text = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", "<pre>\n</pre>")).text();
is more reliable and simpler.
Try this:
public String noTags(String str){
Document d = Jsoup.parse(str);
TextNode tn = new TextNode(d.body().html(), "");
return tn.getWholeText();
}
Use textNodes() to get a list of the text nodes. Then concatenate them with \n as separator.
Here's some scala code I use for this, java port should be easy:
val rawTxt = doc.body().getElementsByTag("div").first.textNodes()
.asScala.mkString("<br />\n")
Try this by using jsoup:
doc.outputSettings(new OutputSettings().prettyPrint(false));
//select all <br> tags and append \n after that
doc.select("br").after("\\n");
//select all <p> tags and prepend \n before that
doc.select("p").before("\\n");
//get the HTML from the document, and retaining original new lines
String str = doc.html().replaceAll("\\\\n", "\n");
This is my version of translating HTML to text (actually a modified version of user121196's answer).
It doesn't just preserve line breaks; it also formats the text and removes excessive line breaks and HTML escape symbols, so you get a much better result from your HTML (in my case I'm receiving it from mail).
It was originally written in Scala, but you can port it to Java easily:
def html2text( rawHtml : String ) : String = {
val htmlDoc = Jsoup.parseBodyFragment( rawHtml, "/" )
htmlDoc.select("br").append("\\nl")
htmlDoc.select("div").prepend("\\nl").append("\\nl")
htmlDoc.select("p").prepend("\\nl\\nl").append("\\nl\\nl")
org.jsoup.parser.Parser.unescapeEntities(
Jsoup.clean(
htmlDoc.html(),
"",
Whitelist.none(),
new org.jsoup.nodes.Document.OutputSettings().prettyPrint(true)
),false
).
replaceAll("\\\\nl", "\n").
replaceAll("\r","").
replaceAll("\n\\s+\n","\n").
replaceAll("\n\n+","\n\n").
trim()
}
/**
* Recursive method to replace html br with java \n. The recursive method ensures that the linebreaker can never end up pre-existing in the text being replaced.
* @param html
* @param linebreakerString
* @return the html as String with proper java newlines instead of br
*/
public static String replaceBrWithNewLine(String html, String linebreakerString){
String result = "";
if(html.contains(linebreakerString)){
result = replaceBrWithNewLine(html, linebreakerString+"1");
} else {
result = Jsoup.parse(html.replaceAll("(?i)<br[^>]*>", linebreakerString)).text(); // replace any HTML line breaks with the placeholder
result = result.replaceAll(linebreakerString, "\n");
}
return result;
}
Used by calling with the html in question, containing the br, along with whatever string you wish to use as the temporary newline placeholder.
For example:
replaceBrWithNewLine(element.html(), "br2n")
The recursion ensures that the string you use as the newline placeholder is never actually present in the source html, as it keeps appending a "1" until the placeholder string is not found in the html. It won't have the formatting issue that the Jsoup.clean methods seem to encounter with special characters.
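The placeholder search can also be written as a plain loop. The sketch below covers only the parts that need nothing beyond the JDK, so it runs without jsoup on the classpath; BrPlaceholderDemo, uniquePlaceholder and markLineBreaks are hypothetical names, and the jsoup step is left as a comment.

```java
import java.util.regex.Matcher;

class BrPlaceholderDemo {
    // Appends "1" to the candidate until it no longer occurs in the html.
    static String uniquePlaceholder(String html, String base) {
        String candidate = base;
        while (html.contains(candidate)) {
            candidate += "1";
        }
        return candidate;
    }

    // Replaces every <br>, <br/>, <BR class="x"> and so on with the placeholder.
    static String markLineBreaks(String html, String placeholder) {
        // quoteReplacement guards against '$' or '\' in the placeholder.
        return html.replaceAll("(?i)<br[^>]*>", Matcher.quoteReplacement(placeholder));
    }

    public static void main(String[] args) {
        String html = "<p>hello world</p><p><br>yo br2n googlez</p>";
        String placeholder = uniquePlaceholder(html, "br2n");
        String marked = markLineBreaks(html, placeholder);
        System.out.println(placeholder);
        System.out.println(marked);
        // With jsoup available, the remaining step would be:
        //   String text = Jsoup.parse(marked).text().replace(placeholder, "\n");
        // String.replace is used there so the placeholder is not treated as a regex.
    }
}
```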
Based on user121196's and Green Beret's answer with the selects and <pre>s, the only solution which works for me is:
org.jsoup.nodes.Element elementWithHtml = ....
elementWithHtml.select("br").append("<pre>\n</pre>");
elementWithHtml.select("p").prepend("<pre>\n\n</pre>");
elementWithHtml.text();