Extracting contents from HTML represented as a String

I have a large HTML document in a String variable and I want to get the contents of one div. I cannot rely on regular expressions because the div can contain nested divs. Suppose I have the following String:
String test = "<div><div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div></div>";
How can I get the following with a simple Java program?
<div id="mainContent">foo bar<div>good best better</div> <div>test test</div></div>
My approach so far is something like this (it may be horrible; I am still trying to get it right):
public static void main(String[] args) {
int count = 1;
int fl = 0;
String s = "<div><div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div></div>";
String tmp = s;
int len = s.length();
for (int i=0; i<len; i++){
int st = s.indexOf("div>");
if(st > -1) {
char c = s.charAt(st-1);
if(c == '/') {
count--;
} else {
count++;
}
s = s.substring(st+4);
System.out.println(s);
i = i + st;
System.out.println(c + " -- " + st + " -- " + count + " -- " + i);
if (count == 0) {
fl = i;
break;
}
}
}
System.out.println("final ind - " + fl);
s = tmp.substring(0, fl + 4);
System.out.println("final String - " + s);
}

I would recommend using JSoup to parse the HTML and find what you are looking for.
It fulfills the simple requirement for sure. You can do what you want in just a couple of lines of code!
jsoup is a Java library for working with real-world HTML. It provides
a very convenient API for extracting and manipulating data, using the
best of DOM, CSS, and jquery-like methods.
jsoup implements the WHATWG HTML5 specification, and parses HTML to
the same DOM as modern browsers do.
scrape and parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
jsoup is designed to deal with all varieties of HTML found in the
wild; from pristine and validating, to invalid tag-soup; jsoup will
create a sensible parse tree.
Using the selector syntax makes finding and extracting data extremely simple.
public static void main(final String[] args)
{
final String s = "<div><div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div></div>";
final Document d = Jsoup.parse(s);
final Elements e = d.select("#mainContent");
System.out.println(e.get(0));
}
outputs
<div id="mainContent">
foo bar
<div>
good best better
</div>
<div>
test test
</div>
</div>
Doesn't get much more simple than that!

I'm afraid the answer is: you don't. At least not with a "simple" program...
But there is hope: you can use an HTML parser library (like NekoHTML or HTMLParser, although the latter project seems to be dead) to parse the string and retrieve the part you need.
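If adding a library is not an option and the markup is as regular as the sample in the question, the nesting can be handled by counting depth from the target's opening tag. The following is only a sketch under those assumptions, not a general HTML parser: the names DivExtractor and extractDivById are made up for this illustration, and the scan will misfire on "<div" text inside comments, scripts, or attribute values, and on other tags whose names start with "div".

```java
public class DivExtractor {

    // Hypothetical helper: returns the outer HTML of the div whose id attribute
    // equals the given id, or null if it is not found or the divs are unbalanced.
    // Assumes well-formed markup, lowercase tags, and id="..." in double quotes.
    static String extractDivById(String html, String id) {
        int idPos = html.indexOf("id=\"" + id + "\"");
        if (idPos < 0) {
            return null;
        }
        // Walk back to the "<div" that owns this attribute.
        int start = html.lastIndexOf("<div", idPos);
        if (start < 0) {
            return null;
        }
        int depth = 0;
        int pos = start;
        while (pos < html.length()) {
            int open = html.indexOf("<div", pos);
            int close = html.indexOf("</div>", pos);
            if (close < 0) {
                return null; // unbalanced input
            }
            if (open >= 0 && open < close) {
                depth++;                 // entering a (possibly nested) div
                pos = open + "<div".length();
            } else {
                depth--;                 // leaving a div
                pos = close + "</div>".length();
                if (depth == 0) {
                    return html.substring(start, pos);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String s = "<div><div id=\"mainContent\">foo bar<div>good best better</div> "
                + "<div>test test</div></div><div>foo bar</div></div>";
        System.out.println(extractDivById(s, "mainContent"));
    }
}
```

For anything beyond trusted, regular input, a real parser remains the safer choice.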

Related

Alternative of Java String.split for better performance

When importing data from a CSV or tab-separated file, my code takes a long time to load the data. Is there a faster alternative? This is the code I use to split a line into an array of fields.
//Here - lineString = fileReader.readLine()
public static String [] splitAndGetFieldNames(String lineString ,String fileType)
{
if(lineString==null || lineString.trim().equals("")){
return null;
}
System.out.print("LINEEEE " + lineString);
String pattern = "(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))";
if(fileType.equals("tab"))
pattern = "\t" + pattern;
else
pattern = "," + pattern;
String fieldNames[] = lineString.split(pattern);
for(int i=0 ; i < fieldNames.length ; i++){
//logger.info("Split Fields::"+fieldNames[i]);
if (fieldNames[i].startsWith("\""))
fieldNames[i] = fieldNames[i].substring(1);
if (fieldNames[i].endsWith("\""))
fieldNames[i] = fieldNames[i].substring(0, fieldNames[i].length()-1);
fieldNames[i] = fieldNames[i].replaceAll("\"\"","\"").trim();
//logger.info("Split Fields after manipulation::"+fieldNames[i]);
}
return fieldNames;
}

Use a CSV parser like super-csv.
Univocity provides a benchmark of CSV parsers. It says that univocity-parsers
is fast, which is no surprise. You could give it a try.

I would recommend taking a look at the opencsv library, or trying CSVParser from Apache Commons.
In any case, reinventing the wheel is not the best idea. Using a third-party library will be less of a headache than writing it yourself :)
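If a library cannot be added, much of the cost in the question's code likely comes from the lookahead regex, which rescans the rest of the line at every candidate delimiter. A single pass over the characters avoids that. The sketch below is an illustration, not a full CSV implementation: DelimitedLineSplitter and splitLine are hypothetical names, and fields that span several lines are not handled.

```java
import java.util.ArrayList;
import java.util.List;

public class DelimitedLineSplitter {

    // Hypothetical helper: splits one line on the delimiter, honoring
    // double-quoted fields and the "" escape for a literal quote.
    // Like the question's code, it trims each field.
    static List<String> splitLine(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // "" inside quotes is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // delimiters inside quotes are data
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == delimiter) {
                fields.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString().trim()); // last field has no trailing delimiter
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitLine("name,\"Smith, John\",\"say \"\"hi\"\"\"", ','));
        System.out.println(splitLine("a\tb\tc", '\t'));
    }
}
```

Whether this is actually faster than a given library is something to measure on real data rather than assume.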

Display Stanford NER confidence score

I'm extracting named entities from news articles using the Stanford NER CRFClassifier. In order to implement active learning, I would like to know the confidence score of each class for every labelled entity.
Example of the desired display:
LOCATION(0.20) PERSON(0.10) ORGANIZATION(0.60) MISC(0.10)
Here is my code for extracting named-entities from a text :
AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(classifier_path);
String annnotatedText = classifier.classifyWithInlineXML(text);
Is there a workaround to get those values along with the annotations?

I found the answer myself. The CRFClassifier documentation says:
Probabilities assigned by the CRF can be interrogated using either the
printProbsDocument() or getCliqueTrees() methods.
The first method is not useful to me because it only prints the values to the console, and I want to access the data programmatically. So I read how that method is implemented and copied part of its behaviour, like this:
List<CoreLabel> classifiedLabels = classifier.classify(sentences);
CRFCliqueTree<String> cliqueTree = classifier.getCliqueTree(classifiedLabels);
for (int i = 0; i < cliqueTree.length(); i++) {
CoreLabel wi = classifiedLabels.get(i);
for (Iterator<String> iter = classifier.classIndex.iterator(); iter.hasNext();) {
String label = iter.next();
int index = classifier.classIndex.indexOf(label);
double prob = cliqueTree.prob(i, index);
System.out.println("\t" + label + "(" + prob + ")");
}
String tag = StringUtils.getNotNullString(wi.get(CoreAnnotations.AnswerAnnotation.class));
System.out.println("Class : " + tag);
}

In java trying to extract XMLNS using a Regexpression

I have been trying for a few hours to get this right, and I really can't seem to do it...
Given a string
"xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\""
what is the correct expression to "save" the http://www.openarchives.org/OAI/2.0/oai-identifier bit?
Thanks in advance, really having trouble getting this right.
String validXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
+ "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
+ "xmlns:mingo-identifier=\"http://www.google.com\" "
+ "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
+ "</feed>";
Pattern p = Pattern.compile(".*\\\"(.*)\\\".*");
Matcher m = p.matcher(validXML);
System.out.println(m.group(1));
This is not printing anything. Be aware that this attempt was just to get the string inside the quotes; I was going to worry about the other part once I got that working. Too bad I never got that far. Thanks

Regular Expressions are so expensive - don't use them when you don't need to!! There are a million other ways to parse a string.
String validXml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
+ "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
+ "xmlns:mingo-identifier=\"http://www.google.com\" "
+ "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
+ "</feed>";
String start = "xmlns:oai-identifier=\"";
String end = "\""; // closing quote of the attribute value
int location = validXml.indexOf(start);
String result;
if (location >= 0) { // indexOf returns -1 when the prefix is absent
result = validXml.substring(location + start.length());
int endIndex = result.indexOf(end);
if (endIndex >= 0) {
result = result.substring(0, endIndex);
}
else {
throw new Exception("Could not find end!");
}
}
else {
throw new Exception("Could not find start!");
}
System.out.println(result);

I think the problem might be that the first .* in your regular expression is too greedy and matches more characters than you'd like.
Try changing ".*\\\"(.*)\\\".*" to "xmlns.*=\"(.*)\".*" and see whether that works.
If it doesn't work at first, you can also try reinstating the quote escaping. Off the top of my head, I don't think the quotes need the extra escaping, but I'm not 100% sure.
Note also that this will only match a single namespace declaration, not each one in the validXML variable in your example. You'll have to split the string in order to use this on an arbitrary number of xmlns:.*= attributes.
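One further thing worth checking in the question's snippet: it calls m.group(1) without first calling find() or matches(), and Matcher throws an IllegalStateException when group() is called before a successful match. A minimal sketch that handles every declaration, assuming the attributes always use double quotes, could loop over find() with a pattern that stops at the closing quote. The names XmlnsExtractor and extractNamespaces are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlnsExtractor {

    // Matches xmlns:prefix="uri"; group 1 is the prefix, group 2 the URI.
    // [^"]* stops at the first closing quote, so a match cannot run past it.
    private static final Pattern XMLNS =
            Pattern.compile("xmlns:([\\w.-]+)=\"([^\"]*)\"");

    // Hypothetical helper: collects every prefix -> URI pair in document order.
    static Map<String, String> extractNamespaces(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = XMLNS.matcher(xml);
        while (m.find()) { // find() must succeed before group() may be called
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String validXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><feed "
                + "xmlns:oai-identifier=\"http://www.openarchives.org/OAI/2.0/oai-identifier\" "
                + "xmlns:mingo-identifier=\"http://www.google.com\" "
                + "xmlns:abeve-identifier=\"http://www.news.ycombinator.org/OAI/2.0/oai-identifier\">"
                + "</feed>";
        System.out.println(extractNamespaces(validXML).get("oai-identifier"));
    }
}
```

Using a negated character class instead of .* sidesteps the greediness question entirely.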

Since you are reading XML, you might be using DOM, so you can extract the namespace from the prefix name using lookupNamespaceURI() once you parse the document with the setNamespaceAware() option set to true:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(validXML)));
String namespace = doc.lookupNamespaceURI("oai-identifier");
It's simpler and you don't have to do any string parsing.

Replace every word with tag

JAVASCRIPT or JAVA solution needed
The solution I am looking for could use java or javascript. I have the html code in a string so I could manipulate it before using it with java or afterwards with javascript.
problem
Anyway, I have to wrap each word with a tag. For example:
<html> ... >
Hello every one, cheers
< ... </html>
should be changed to
<html> ... >
<word>Hello</word> <word>every</word> <word>one</word>, <word>cheers</word>
< ... </html>
Why?
This will help me use javascript to select/highlight a word. It seems the only way to do it is to use the function highlightElementAtPoint which I added in the JAVASCRIPT hint: It simply finds the element of a certain x,y coordinate and highlights it. I figured that if every word is an element, it will be doable.
The idea is to use this approach to allow us to detect highlighted text in an android WebView even if that would mean to use a twisted highlighting method. Think a bit more and you will find many other applications for this.
JAVASCRIPT hint
I am using the following code to highlight a word; however, this will highlight the whole text belonging to a certain tag. When each word is a tag, this will work to some extent. If there is a substitute that will allow me to highlight a word at a certain position, it would also be a solution.
function highlightElementAtPoint(xOrdinate, yOrdinate) {
var theElement = document.elementFromPoint(xOrdinate, yOrdinate);
selectedElement = theElement;
theElement.style.backgroundColor = "yellow";
var theName = theElement.nodeName;
var theArray = document.getElementsByTagName(theName);
var theIndex = -1;
for (i = 0; i < theArray.length; i++) {
if (theArray[i] == theElement) {
theIndex = i;
}
}
window.androidselection.selected(theElement.innerHTML);
return theName + " " + theIndex;
}

Try to use something like
yourStringHere = yourStringHere.replace(" ", "</word> <word>");
yourStringHere = yourStringHere.replace("<html></word>", "<html>"); // remove the first closing word tag
This should work, though you may have to adjust it.

var tags = document.body.innerText.match(/\w+/g);
for(var i=0;i<tags.length;i++){
tags[i] = '<word>' + tags[i] + '</word>';
}
Or, as @ThomasK said:
var tags = document.body.innerText;
tags = '<word>' + tags + '</word>';
tags = tags.replace(/\s/g,'</word><word>');
But keep in mind that .replace(" ", foo) only replaces the first space. To replace every occurrence, use .replace(/\s+/g, foo).
And as @ajax333221 said, the second way will include commas, dots and other symbols, so the first is the better solution.
JSFiddle example: http://jsfiddle.net/c6ftq/4/

inputStr = inputStr.replaceAll("(?<!</?)\\w++(?!\\s*>)","<word>$0</word>");
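A variation on the same idea, offered as a sketch rather than a drop-in replacement: match either a whole tag or a word in one pattern, and only wrap the words. Because tags are consumed as a unit, tag names and attribute values are left alone. The names WordWrapper and wrapWords are hypothetical, and the approach assumes no ">" inside attribute values; script or style content and entities such as &amp; would still need special handling.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordWrapper {

    // Either a tag (from '<' to the next '>') or a run of word characters.
    // The tag alternative comes first so tag contents are never seen as words.
    private static final Pattern TAG_OR_WORD = Pattern.compile("<[^>]*>|\\w+");

    // Hypothetical helper: wraps each word outside of tags in <word>...</word>.
    static String wrapWords(String html) {
        Matcher m = TAG_OR_WORD.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String token = m.group();
            String replacement = token.startsWith("<")
                    ? token                           // leave tags untouched
                    : "<word>" + token + "</word>";   // wrap plain words
            // quoteReplacement keeps '$' and '\' in the text from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(wrapWords("<p class=\"intro\">Hello every one, cheers</p>"));
    }
}
```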

You can try the following code:
import java.util.StringTokenizer;
public class myTag
{
static String startWordTag = "<Word>";
static String endWordTag = "</Word>";
static String space = " ";
static String myText = "Hello how are you ";
public static void main ( String args[] )
{
StringTokenizer st = new StringTokenizer (myText," ");
StringBuffer sb = new StringBuffer();
while ( st.hasMoreTokens() )
{
sb.append(startWordTag);
sb.append(st.nextToken());
sb.append(endWordTag);
sb.append(space);
}
System.out.println ( "Result:" + sb.toString() );
}
}

how to remove anchor tag and make it text

String k= <html>
<a target="_blank" href="http://www.taxmann.com/directtaxlaws/fileopencontainer.aspx?Page=CIRNO&amp;id=1999033000019320&path=/Notifications/DirectTaxLaws/HTMLFiles/S.O.193(E)30031999.htm&amp;aa=">number S.O.I93(E), dated the 30th March, 1999
</html>
I'm getting this HTML in a String and I want to remove the anchor tag so that the link is removed as well.
I just want to display it as text, not as a link.
How can I do this? I have tried many things without success. I'm building an Android app and I'm hitting this issue in a WebView.

Use jsoup and Jsoup.parse().

You can use the following example (I don't remember where I found it, but it works), which uses a replace method to modify the string before showing it:
k = replace ( k, "<a target=\"_blank\" href=", "");
String replace(String _text, String _searchStr, String _replacementStr) {
// String buffer to store str
StringBuffer sb = new StringBuffer();
// Search for search
int searchStringPos = _text.indexOf(_searchStr);
int startPos = 0;
int searchStringLength = _searchStr.length();
// Iterate to add string
while (searchStringPos != -1) {
sb.append(_text.substring(startPos, searchStringPos)).append(_replacementStr);
startPos = searchStringPos + searchStringLength;
searchStringPos = _text.indexOf(_searchStr, startPos);
}
// Create string
sb.append(_text.substring(startPos,_text.length()));
return sb.toString();
}
To replace the entire opening tag with an empty string:
k = replace ( k, "<a target=\"_blank\" href=\"http://www.taxmann.com/directtaxlaws/fileopencontainer.aspx?Page=CIRNO&id=1999033000019320&path=/Notifications/DirectTaxLaws/HTMLFiles/S.O.193(E)30031999.htm&aa=\">", "");
The slashes do not need to be escaped.
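When the exact href is not known in advance, a pattern-based variant may be more convenient. The sketch below removes any opening or closing anchor tag and keeps the text between them. It is an assumption-laden illustration: AnchorStripper and stripAnchors are made-up names, and the regex assumes there is no ">" inside attribute values.

```java
public class AnchorStripper {

    // Hypothetical helper: removes <a ...> and </a> tags, keeping the link text.
    // (?i) makes the match case-insensitive; requiring whitespace or '>' right
    // after the "a" keeps tags such as <abbr> from being matched.
    static String stripAnchors(String html) {
        return html.replaceAll("(?i)</?a(\\s[^>]*)?>", "");
    }

    public static void main(String[] args) {
        String k = "<html><a target=\"_blank\" href=\"http://www.example.com/page?x=1\">"
                + "some linked text</a></html>";
        System.out.println(stripAnchors(k));
    }
}
```

The example.com URL here is a placeholder, not the one from the question.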
