My first time posting!
The problem I'm having is that I'm using XPath and Tag-Soup to parse a webpage and read in the data. As these are news articles, they sometimes have links embedded in the content, and these are what is messing with my program.
The XPath I'm using is storyPath = "//html:article//html:p//text()"; where the page has a structure of:
<article ...>
<p>Some text from the story.</p>
<p>More of the story, which proves what a great story this is!</p>
<p>More of the story without links!</p>
</article>
My code relating to the xpath evaluation is this:
NodeList nL = XPathAPI.selectNodeList(doc, storyPath);
LinkedList<String> story = new LinkedList<String>();
for (int i = 0; i < nL.getLength(); i++) {
    Node n = nL.item(i);
    String tmp = n.toString();
    tmp = tmp.replace("[#text:", "");
    tmp = tmp.replace("]", "");
    tmp = tmp.replaceAll("’", "'");
    tmp = tmp.replaceAll("‘", "'");
    tmp = tmp.replaceAll("–", "-");
    tmp = tmp.replaceAll("¬", "");
    tmp = tmp.trim();
    story.add(tmp);
}
this.setStory(story);
...
private void setStory(LinkedList<String> story) {
    String tmp = "";
    for (String p : story) {
        tmp = tmp + p + "\n\n";
    }
    this.story = tmp.trim();
}
The output this gives me is
Some text from the story.
More of the story, which proves
what a great story this is
!
More of the story without links!
Does anyone have a way of eliminating this error? Am I taking a wrong approach somewhere? (I understand I could well be with the setStory code, but I don't see another way.)
And without the tmp.replace() calls, all the results appear like [#text: what a great story this is] etc.
EDIT:
I am still having trouble, though possibly of a different kind. What is killing me here is again a link, but the way the BBC structure their website, the link is on a separate line, so it is still read in with the same problem as described before (note that problem was fixed with the example given). The section of code on the BBC page is:
<p> Former Queens Park Rangers trainee Sterling, who
<a href="http://news.bbc.co.uk/sport1/hi/football/teams/l/liverpool/8541174.stm" >moved to the Merseyside club in February 2010 aged 15,</a>
had not started a senior match for the Reds before this season.
</p>
which appears in my output as:
Former Queens Park Rangers trainee Sterling, who
moved to the Merseyside club in February 2010 aged 15,
had not started a senior match for the Reds before this season.
For the problem in your edit, where newlines in the HTML source come through into your text output, you'll want to remove them before you print. Instead of System.out.print(text.trim()); do System.out.println(text.trim().replaceAll("[ \t\r\n]+", " "));
First find the paragraphs: storyPath = "//html:article//html:p"; then, for each paragraph, get out all the text with another XPath query, concatenate it without newlines, and put two newlines just at the end of the paragraph.
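A self-contained sketch of that two-step approach (using the JDK's javax.xml.xpath against a small namespace-free fragment for brevity; the question's real document needs the XPathAPI/html: prefix setup):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;

public class ParagraphJoin {
    // One query per paragraph: collapse internal line breaks,
    // then separate paragraphs with blank lines.
    public static String extract(String html) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes("UTF-8")));
        XPath xp = XPathFactory.newInstance().newXPath();
        NodeList ps = (NodeList) xp.evaluate("//article//p", doc, XPathConstants.NODESET);
        StringBuilder story = new StringBuilder();
        for (int i = 0; i < ps.getLength(); i++) {
            // Collapse whitespace inside the paragraph, including newlines
            // introduced by embedded links on their own source lines.
            String text = ps.item(i).getTextContent().trim().replaceAll("[ \\t\\r\\n]+", " ");
            story.append(text).append("\n\n");
        }
        return story.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        String html = "<article><p>One line\nsplit here.</p><p>Second.</p></article>";
        System.out.println(extract(html));
    }
}
```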
On another note, you shouldn't have to replaceAll("’", "'"). That is a sure sign that you are opening your file incorrectly. When you open your file you need to pass a Reader into Tag-Soup, and you should initialize the Reader like this: Reader r = new BufferedReader(new InputStreamReader(new FileInputStream("myfilename.html"), "Cp1252")); where you specify the correct character set for the file. A list of character sets is here: http://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html My guess is that it is Windows Latin-1 (Cp1252).
The [#text: thing is simply the toString() representation of a DOM Text node. The toString() method is intended to be used when you want a string representation of the node for debugging purposes. Instead of toString() use getTextContent() which returns the actual text.
If you don't want the link content to appear on separate lines, then you could remove the //text() from your XPath and just take the textContent of the element nodes directly (getTextContent() for an element returns the concatenation of all the descendant text nodes):
String storyPath = "//html:article//html:p";
NodeList nL = XPathAPI.selectNodeList(doc, storyPath);
LinkedList<String> story = new LinkedList<String>();
for (int i = 0; i < nL.getLength(); i++) {
    Node n = nL.item(i);
    story.add(n.getTextContent().trim());
}
The fact that you are having to manually fix up things like "’" suggests your HTML is actually encoded in UTF-8 but you're reading it using a single-byte character set such as Windows1252. Rather than try and fix it post-hoc you should instead work out how to read the data in the correct encoding in the first place.
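A minimal sketch of reading in the correct encoding (assuming the page really is UTF-8): wrap the raw stream in an InputStreamReader with an explicit charset before the parser ever sees it, so "’" arrives as one character instead of three mangled ones.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadWithCharset {
    // Read an entire stream using an explicit charset instead of the
    // platform default.
    public static String readAll(InputStream in) throws IOException {
        Reader r = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "’" (U+2019) is three bytes in UTF-8; decoded correctly it is one char.
        byte[] utf8 = "\u2019".getBytes(StandardCharsets.UTF_8);
        String s = readAll(new ByteArrayInputStream(utf8));
        System.out.println(s.length()); // 1
    }
}
```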
I came up with something like this, which didn't work out. I am trying to extract only the text that contains the keyword, not the entire text of the webpage just because the page contains that keyword somewhere.
String pconcat = "";
for (i = 0; i < urls.length; i++) {
    Document doc = Jsoup.connect(urls[i]).ignoreContentType(true).timeout(60 * 1000).get();
    for (int x = 0; x < keyWords.length; x++) {
        if (doc.body().text().toLowerCase().contains(keyWords[x].toLowerCase())) {
            Elements e = doc.select("body:contains(" + keyWords[x] + ")");
            for (Element element : e) {
                pconcat += element.text();
                System.out.println("pconcat" + pconcat);
            }
        }
    }
}
Consider example.com: if the keyword I look for is "documents", I need the output to be "This domain is established to be used for illustrative examples in documents." and nothing else.
You don't need to lowercase the body text in order to use the :contains selector; it is case insensitive:
elements that contains the specified text. The search is case
insensitive. The text may appear in the found element, or any of its
descendants.
select() is only going to return elements if it finds a match.
elements that match the query (empty if none match)
You don't need an if-statement to check for "documents", just use css selectors to select any element that matches then do something with the results.
Document doc = Jsoup
.connect(url)
.ignoreContentType(true)
.timeout(60*1000)
.get();
for (String keyword : keywords) {
String selector = String.format(
"p:contains(%s)",
keyword.toLowerCase());
String content = doc
.select(selector)
.text();
System.out.println(content);
}
Output
This domain is established to be used for illustrative examples in
documents. You may use this domain in examples without prior
coordination or asking for permission.
I am new to Java and am trying to write a program that gets the meaning of a given word from the Merriam-Webster API. The output is XML; I am using the DOM parser to print the list of all definitions. Normally the retrieved XML is as follows:
<?xml version="1.0" encoding="utf-8" ?>
<entry_list version="1.0">
<entry id="dictionary"><ew>dictionary</ew><subj>PU-1#PU-2#PU-3#CP-4</subj><hw>dic*tio*nary</hw><sound><wav>dictio04.wav</wav></sound><pr>ˈdik-shə-ˌner-ē, -ˌne-rē</pr><fl>noun</fl><in><il>plural</il> <if>dic*tio*nar*ies</if></in><et>Medieval Latin <it>dictionarium,</it> from Late Latin <it>diction-, dictio</it> word, from Latin, speaking</et><def><date>1526</date> <sn>1</sn> <dt>:a reference source in print or electronic form containing words usually alphabetically arranged along with information about their forms, <d_link>pronunciations</d_link>, functions, <d_link>etymologies</d_link>, meanings, and <d_link>syntactical</d_link> and idiomatic uses</dt> <sn>2</sn> <dt>:a reference book listing alphabetically terms or names important to a particular subject or activity along with discussion of their meanings and <d_link>applications</d_link></dt> <sn>3</sn> <dt>:a reference book listing alphabetically the words of one language and showing their meanings or translations in another language</dt> <sn>4</sn> <dt>:a <d_link>computerized</d_link> list (as of items of data or words) used for reference (as for information retrieval or word processing)</dt></def></entry>
</entry_list>
The list of definitions is enclosed inside <dt> tags.
The problem I am facing is that inside the <dt> tag there is another sub-tag, <d_link>. Whenever the DOM parser runs across this sub-tag, getNodeValue() treats it as the end of the <dt> tag.
My Code is as below:
import org.w3c.dom.*;
import javax.xml.parsers.*;

public class Dictionary5 {
    public static void main(String[] args) throws Exception {
        String head = "http://www.dictionaryapi.com/api/v1/references/collegiate/xml/";
        String word = "banal";
        String apiKey = "?key=xxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx"; // My API key for Merriam-Webster
        String finalURL = head.trim() + word.trim() + apiKey.trim();
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            DocumentBuilder b = f.newDocumentBuilder();
            Document doc = b.parse(finalURL);
            doc.getDocumentElement().normalize();
            NodeList items = doc.getElementsByTagName("entry");
            for (int i = 0; i < items.getLength(); i++) {
                Node n = items.item(i);
                if (n.getNodeType() != Node.ELEMENT_NODE)
                    continue;
                Element e = (Element) n;
                NodeList titleList = e.getElementsByTagName("dt");
                for (int j = 0; j < titleList.getLength(); j++) {
                    Node dt = titleList.item(j);
                    if (dt.getNodeType() != Node.ELEMENT_NODE)
                        continue;
                    Element titleElem = (Element) titleList.item(j);
                    Node titleNode = titleElem.getChildNodes().item(0);
                    System.out.println(titleNode.getNodeValue());
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The output is as following
:a reference source in print or electronic form containing words usually alphabetically arranged along with information about their forms,
:a reference book listing alphabetically terms or names important to a particular subject or activity along with discussion of their meanings and
:a reference book listing alphabetically the words of one language and showing their meanings or translations in another language
:a
As you can see, the first, second and fourth definitions end abruptly because the parser encounters the sub-tag <d_link>.
My Expected output is as following:
:a reference source in print or electronic form containing words usually alphabetically arranged along with information about their forms, pronunciations, functions, etymologies, meanings, and syntactical and idiomatic uses
:a reference book listing alphabetically terms or names important to a particular subject or activity along with discussion of their meanings and applications
:a reference book listing alphabetically the words of one language and showing their meanings or translations in another language
:a computerized list (as of items of data or words) used for reference (as for information retrieval or word processing)
Can someone please help me with this. Any help is highly appreciated. Thanks in advance.
In the DOM model, the content of the <dt> tag will be TEXT, d_link element, TEXT, d_link, ...
So you want to concatenate all the text nodes (and, it seems, also the content of the <d_link> tag). You are only reading the first one: titleElem.getChildNodes().item(0), so it is "abruptly" finished.
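A self-contained sketch of the fix, using a trimmed-down fragment of the entry XML: swap the item(0) read for getTextContent(), which concatenates every descendant text node, including the text inside <d_link>.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import java.io.ByteArrayInputStream;

public class DtText {
    // Return the full text of the first <dt> element, sub-tags included.
    public static String firstDt(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        Element dt = (Element) doc.getElementsByTagName("dt").item(0);
        // getChildNodes().item(0) would stop at the first text node (":a ");
        // getTextContent() concatenates all descendant text nodes instead.
        return dt.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<def><dt>:a <d_link>computerized</d_link> list</dt></def>";
        System.out.println(firstDt(xml)); // :a computerized list
    }
}
```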
I want to display my element in a TextView.
Code:
Document doc = Jsoup.parse(myURL);
Elements name = doc.getElementsByClass(".lNameHeader");
for (Element nametext : name) {
    String text = nametext.text();
    tabel1.setText(text);
}
but it displays nothing.
(the site i am parsing http://roosters.gepro-osi.nl/roosters/rooster.php?leerling=120777&type=Leerlingrooster&afdeling=12-13_OVERIG&tabblad=2&school=905)
From your previous question it shows that myURL is a String. In this case you are using the constructor Jsoup.parse(String html).
You need the one that takes a URL to make the connection:
Document doc = Jsoup.parse(new URL(myURL), 2000);
Elements name = doc.getElementsByClass("lNameHeader");
Also drop the leading . character from the class name. If you don't wish to specify a timeout you can simply use:
Document doc = Jsoup.connect(myURL).get();
Actually the class for it is:
lNameHeader
Note that the first character is not 1 (one) - it's l (lowercase letter L).
So it should be:
Elements name = doc.getElementsByClass("lNameHeader");
Note also that Jsoup's getElementsByClass method doesn't work like a CSS selector - so the . must be omitted.
I encountered strange behavior when I parsed an HTML page which contains a Unicode character. Here is the example: git://gist.github.com/2995626.git.
What I do is:
File layout = new File(html_file);
Document doc = Jsoup.parse(layout, "UTF-8");
System.out.println(doc.toString());
What I expected was the HTML triangle, but it is converted to "â–¼". Do you have any suggestions?
Thanks in advance.
Jsoup is perfectly capable of parsing HTML using UTF-8. Even more, it's already its default character encoding. Your problem is caused elsewhere. Based on the information provided so far, I can see two possible causes:
The HTML file was originally not saved using UTF-8 (or perhaps it's one step before; it's originally not been read using UTF-8).
The stdout (there where the System.out goes to) does not use UTF-8.
If you make sure that both are correctly set, then your problem should disappear. If not, then there's another possible cause which is not guessable based on the information provided so far in your question. At least, this blog should bring a lot of new insight: Unicode - How to get the characters right?
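To rule out the second cause, you can make the output stream's charset explicit instead of relying on the platform default. A small sketch (the helper buffers the bytes only so the result is checkable; the byte count demonstrates the encoding really is UTF-8):

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class Utf8Out {
    // Encode a string through a PrintStream with an explicit charset,
    // so the output bytes don't depend on the platform default.
    public static byte[] encode(String s) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buf, true, "UTF-8");
        out.print(s);
        out.flush();
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // "▼" (U+25BC) encodes to exactly three bytes in UTF-8.
        byte[] b = encode("\u25bc");
        System.out.println(b.length); // 3
    }
}
```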
It is a problem caused by Unicode escapes. Here is an example: try the code below. The result will show you why the code you wrote is not working.
public static void main(String[] argv) {
    String test = "Ch\u00e0o bu\u1ed5i s\u00e1ng";
    System.out.println(unicode2String(test));
}

/**
 * Convert a string containing literal \\uXXXX escapes into characters.
 */
public static String unicode2String(String unicode) {
    StringBuffer string = new StringBuffer();
    String[] hex = unicode.split("\\\\u");
    string.append(hex[0]);
    for (int i = 1; i < hex.length; i++) {
        // Parse each code point
        int data = Integer.parseInt(hex[i], 16);
        // and append it to the string
        string.append((char) data);
    }
    return string.toString();
}
Maybe your code should be as follows:
System.out.println(unicode2String(doc.toString()));
Is there a fast way to search for string in another string?
I have this kind of a file:
<br>
Comment EC00:
<br>
The EC00 is different from EC12 next week. The EC00 much wetter in the very end, which is not seen before.
<br>
<br>
<br>
Comment EC12:
<br>
The Ec12 of today is reliable. It starts cold, but temp are rising. From Sunday normal temp and wet, except for a strengthening high from SE in the very end.
<br>
I have deleted all the <br>'s, and I will be searching for a string like "Comment EC12:" to retrieve what comes after it:
The Ec12 of today is reliable. It starts cold, but temp are rising. From Sunday normal temp and wet, except for a strengthening high from SE in the very end.
Or maybe it would be a better idea to leave all the <br>'s in, so that I will at least know where to stop reading lines.
P.S. These comments might have multiple occurrences in the document.
EDIT:
I think that this solution would be OK for finding the occurrences; at least it's a good place to start.
This is the last version; it works very well for me because I know what in the HTML will be static and what will not. For those who would like to do something similar: you can rewrite the first two loops in the same way as the last one (instead of 'if', use a while loop going down the lines of the text file).
StringTokenizer parser = new StringTokenizer(weatherComments);
String commentLine = "";
String commentWord = "";
while (parser.hasMoreTokens()) {
if (parser.nextToken().equals("Comment")) {
String commentType = parser.nextToken();
if (commentType.equals(forecastZone + ":")) {
parser.nextToken(); //first occured <br>
commentWord = parser.nextToken();
while(!commentWord.equals("<br>")){
commentLine += commentWord + " ";
commentWord = parser.nextToken();
}
commentLine += "\n";
System.out.println(commentLine);
}
}
}
P.P.S.
Before downloading a lot of libraries to make your code look smaller or to make things easier to understand, first think about how to solve it yourself.
You can try to simply use indexOf():
String html = ...;
String search = "Comment EC12:";
int comment = html.indexOf(search);
if (comment != -1) {
int start = comment + search.length();
int end = start + ...;
String after = html.substring(start, end);
...
}
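Since the P.S. mentions the comments can occur more than once, the same indexOf() idea extends to a loop that restarts the search just past each match. A sketch (the count helper is hypothetical, just to make the loop shape concrete):

```java
public class FindAll {
    // Count occurrences of search in html by restarting indexOf()
    // just past each match.
    public static int count(String html, String search) {
        int count = 0;
        int from = 0;
        while (true) {
            int at = html.indexOf(search, from);
            if (at == -1) {
                break;
            }
            count++;
            from = at + search.length();
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "Comment EC12: a <br> Comment EC12: b";
        System.out.println(count(html, "Comment EC12:")); // 2
    }
}
```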
The problem is to find the end of the text. So it may be useful not to replace the <br> and split the HTML on the tags:
String html = ...;
String[] parts = html.split("\\p{Space}*<br>\\p{Space}*");
for (int i = 0; i < parts.length; i += 2) {
String search = parts[i];
String after = parts[i + 1];
System.out.println(search + "\n\t" + after);
}
The example will print the following:
Comment EC00:
The EC00 is different from EC12 next week. The EC00 much wetter in the very end, which is not seen before.
Comment EC12:
The Ec12 of today is reliable. It starts cold, but temp are rising. From Sunday normal temp and wet, except for a strengthening high from SE in the very end.
Firstly I would remove blank lines and <br> tags, then I would implement an algorithm like BNDM for searching, or better, use a library like StringSearch ("High-performance pattern matching algorithms in Java"): http://johannburkard.de/software/stringsearch/
Depending on what you want to achieve, this might be overkill, but I suggest you use finite state automaton string searching. You can have a look at an example at http://en.literateprograms.org/Finite_automaton_string_search_algorithm_%28Java%29.
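For reference, the automaton approach that link describes can be sketched in a few lines: build a per-character transition table from the pattern (restricted to ASCII here as a simplifying assumption, and assuming a non-empty pattern), then scan the text in a single pass.

```java
public class DfaSearch {
    // Return the index of the first occurrence of pattern in text,
    // or -1, using a precomputed DFA over the pattern.
    public static int search(String text, String pattern) {
        int m = pattern.length();
        int[][] dfa = new int[128][m];
        dfa[pattern.charAt(0)][0] = 1;
        for (int x = 0, j = 1; j < m; j++) {
            for (int c = 0; c < 128; c++) {
                dfa[c][j] = dfa[c][x]; // mismatch: fall back to restart state
            }
            dfa[pattern.charAt(j)][j] = j + 1; // match: advance
            x = dfa[pattern.charAt(j)][x];     // update restart state
        }
        int j = 0;
        for (int i = 0; i < text.length(); i++) {
            j = dfa[text.charAt(i)][j];
            if (j == m) {
                return i - m + 1; // full pattern matched
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("Comment EC12: hello", "EC12")); // 8
    }
}
```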