I am learning jsoup. I want to parse the script below:
<script>
_cUq="1lj9lodlnq";
</script>
Expected output after parsing: 1lj9lodlnq
Here is what I am trying:
String str = element.ownText().toString();
str = str.replace("\r","");
str = str.replace("\n","");
str = str.replace("<script>","");
str = str.replace("</script>","");
System.out.println(str);
if(str.contains("="))
split = str.split("=");
While debugging I can see that the script is stored in the element, but after the assignment str is the empty string "". Where am I going wrong?
You can extract the inner JavaScript with jsoup itself, which makes your code much easier to maintain. Use data() rather than ownText(): jsoup keeps the content of a script element as data, not as text, which is why ownText() returns an empty string. You can also strip the whitespace with a single regular expression instead of calling String.replace() once per character.
import org.jsoup.Jsoup;
import org.junit.Test;
import static org.hamcrest.core.Is.is;
import static org.junit.Assert.assertThat;
public class JSoupSO {
@Test
public void script() {
String s = "<script>\n" +
"_cUq=\"1lj9lodlnq\";\n" +
"</script>";
// let Jsoup parse the HTML
String innerJavascript = Jsoup.parse(s).data();
// remove all whitespaces
innerJavascript = innerJavascript.replaceAll("\\s", "");
assertThat(innerJavascript, is("_cUq=\"1lj9lodlnq\";"));
}
}
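The test above stops at _cUq="1lj9lodlnq"; while the question asks for just 1lj9lodlnq. A minimal sketch of that last step follows, assuming the script holds a single assignment of a double-quoted string. The names ScriptValueExtractor and quotedValue are made up for illustration; only java.util.regex is used, so the input is the string that data() would return.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptValueExtractor {
    // Matches an '=' followed by a double-quoted value and captures the text between the quotes.
    private static final Pattern ASSIGNMENT = Pattern.compile("=\\s*\"([^\"]*)\"");

    // Hypothetical helper: returns the first quoted value after an '=', if there is one.
    static Optional<String> quotedValue(String javascript) {
        Matcher m = ASSIGNMENT.matcher(javascript);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the result of Jsoup.parse(html).data()
        String innerJavascript = "\n_cUq=\"1lj9lodlnq\";\n";
        System.out.println(quotedValue(innerJavascript).orElse("(no value found)")); // prints 1lj9lodlnq
    }
}
```

Returning an Optional avoids the unchecked split("=") from the question, which would throw if the script did not contain an assignment.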
I am using the jsoup package to search Google for a specific title, such as a Facebook page title. The code below prints the titles and URLs of the results. From those titles I want to pick out the Facebook URL.
PROGRAM:
package googlesearch;
import java.io.IOException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class SearchRegexDiv {
private static String REGEX = ".?[facebook]";
public static void main(String[] args) throws IOException {
Pattern p = Pattern.compile(REGEX);
String google = "http://www.google.com/search?q=";
//String search = "stackoverflow";
String search = "hortonworks";
String charset = "UTF-8";
String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!
Elements links = Jsoup.connect(google + URLEncoder.encode(search, charset)).userAgent(userAgent).get().select(".g>.r>a");
for (Element link: links) {
String title = link.text();
String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");
if (!url.startsWith("http")) {
continue; // Ads/news/etc.
}
//.?facebook
if (title.matches(REGEX)) {
System.out.println("Done");
title.substring(title.lastIndexOf(" ") + 1); //split the String
//(example.substring(example.lastIndexOf(" ") + 1));
}
System.out.println("Title: " + title);
System.out.println("URL: " + url);
}
}
}
OUTPUT:
Title: Hortonworks - Facebook logo
URL: https://www.facebook.com/hortonworks/
From the output I get the list of URLs and titles in the above format.
I am trying to match titles containing the word Facebook, and I want to split the result into two strings like
String social_media = "facebook";
String org = "hortonworks";
Use this code to split your String on multiple characters.
Here is a demo that splits on both '/' and '.':
String word = "https://www.facebook.com/hortonworks/";
String [] array = word.split("[/.]");
for (String each1 : array)
System.out.println(each1);
Output (each token is printed on its own line; the empty token between the two slashes of "//" prints as a blank line):
https:

www
facebook
com
hortonworks
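Splitting on "[/.]" gives all the pieces, but you still have to know which index holds which part. As an alternative sketch, java.net.URI can separate the host from the path first. The names SocialUrlParts, siteName and firstPathSegment are hypothetical, and the code assumes the site name is the second-to-last host label, which does not hold for hosts such as example.co.uk.

```java
import java.net.URI;

class SocialUrlParts {
    // "www.facebook.com" -> "facebook" (assumption: the site name is the second-to-last label)
    static String siteName(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return "";
        }
        String[] labels = host.split("\\.");
        return labels.length >= 2 ? labels[labels.length - 2] : host;
    }

    // "/hortonworks/" -> "hortonworks" (first non-empty path segment)
    static String firstPathSegment(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return "";
        }
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                return segment;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String url = "https://www.facebook.com/hortonworks/";
        String socialMedia = siteName(url);
        String org = firstPathSegment(url);
        System.out.println(socialMedia); // prints facebook
        System.out.println(org);         // prints hortonworks
    }
}
```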
Using regex, I want to be able to get the text between multiple html tags.
HTML is used here only to represent the input. I am not concerned with the tags themselves; I just want to retrieve the content enclosed by correctly matched opening and closing tags.
For instance, the following:
Required Input:
<h1>Text 1</h1>
<h1><h2>Text 2</h2></h1>
<h1><h2>Text 3</h2>Xtra</h1>
<h1>Text 4<h1>extra</h1515></h1>
<h1><h1></h1></h1>
Required Output:
Text 1
Text 2
Text 3
None
None
Output Obtained:
Text 1
Text 2
Text 3
Text 4<h1>extra</h1515>
<h1></h1>
Regex I tried:
"<([\\S ]+)>([\\S ]+)</\\1>"
I am not getting the expected result.
My java code:
import java.io.*;
import java.util.*;
import java.text.*;
import java.math.*;
import java.util.regex.*;
public class Solution{
public static void main(String[] args){
Scanner in = new Scanner(System.in);
int testCases = Integer.parseInt(in.nextLine());
while(testCases>0){
String line = in.nextLine();
String tmp = line;
Pattern r = Pattern.compile("<([\\S ]+)>([\\S ]+)</\\1>", Pattern.MULTILINE);
Matcher m = r.matcher(line);
while(m.find()){
line = line.replaceAll(line, m.group(2));
m = r.matcher(line);
}
if(line != tmp)
System.out.println(line);
else
System.out.println("None");
testCases--;
}
}
}
As pointed out in the comments, that way lies nothing but pain. For what you're attempting to do, you would be far better off walking the DOM (Document Object Model) with something like jsoup.
I want to remove the content between <script></script> tags. I'm manually checking for the pattern and iterating with a while loop, but I'm getting a StringIndexOutOfBoundsException at this line:
String script = source.substring(startIndex,endIndex-startIndex);
Below is the complete method:
public static String getHtmlWithoutScript(String source) {
String START_PATTERN = "<script>";
String END_PATTERN = " </script>";
while (source.contains(START_PATTERN)) {
int startIndex=source.lastIndexOf(START_PATTERN);
int endIndex=source.indexOf(END_PATTERN,startIndex);
String script=source.substring(startIndex,endIndex);
source.replace(script,"");
}
return source;
}
Am I doing anything wrong here? I'm also getting endIndex=-1. Can anyone help me identify why my code is breaking?
String text = "<script>This is dummy text to remove </script> dont remove this";
StringBuilder sb = new StringBuilder(text);
String startTag = "<script>";
String endTag = "</script>";
//removing the text between script
sb.replace(text.indexOf(startTag) + startTag.length(), text.indexOf(endTag), "");
System.out.println(sb.toString());
If you want to remove the script tags too, add the following line:
String withoutTags = sb.toString().replace(startTag, "").replace(endTag, "");
UPDATE:
If you don't want to use StringBuilder, you can do this:
String text = "<script>This is dummy text to remove </script> dont remove this";
String startTag = "<script>";
String endTag = "</script>";
//removing the text between script
String textToRemove = text.substring(text.indexOf(startTag) + startTag.length(), text.indexOf(endTag));
text = text.replace(textToRemove, "");
System.out.println(text);
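Both snippets above handle a single script block and keep the tags. A possible loop version for several blocks is sketched below. It assumes the tags appear exactly as <script> and </script> (lowercase, no attributes), and the names ScriptStripper and removeScriptBlocks are invented here. Compared with the method in the question, it searches for "</script>" without a leading space (the likely reason endIndex was -1), and it mutates a StringBuilder, because String.replace returns a new string and leaves source unchanged.

```java
class ScriptStripper {
    private static final String START_TAG = "<script>";
    private static final String END_TAG = "</script>";

    // Removes every <script>...</script> block, tags included.
    static String removeScriptBlocks(String source) {
        StringBuilder sb = new StringBuilder(source);
        int start = sb.indexOf(START_TAG);
        while (start >= 0) {
            int end = sb.indexOf(END_TAG, start + START_TAG.length());
            if (end < 0) {
                break; // unclosed tag: leave the rest untouched instead of throwing
            }
            sb.delete(start, end + END_TAG.length());
            start = sb.indexOf(START_TAG, start); // continue searching from the deletion point
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "<script>This is dummy text to remove </script> dont remove this";
        System.out.println(removeScriptBlocks(text)); // prints " dont remove this" (with the leading space)
    }
}
```

The explicit check for end < 0 is what prevents the out-of-bounds substring call when a closing tag is missing.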
You can use a regex to remove the script tag content:
public String removeScriptContent(String html) {
if(html != null) {
String re = "<script>(.*)</script>";
Pattern pattern = Pattern.compile(re);
Matcher matcher = pattern.matcher(html);
if (matcher.find()) {
return html.replace(matcher.group(1), "");
}
}
return html; // no <script> block found (or html was null): return the input unchanged
}
You have to add these two imports:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
I know I'm probably late to the party, but I would like to give you a regex solution (actually tested).
What you have to note here is that regex quantifiers are greedy by default. So a pattern such as <script>(.*)</script> matches from the first <script> up to the last </script> that .* can reach: the last one on the line, or the last one in the whole input if the options let the dot match line breaks. Anything between two separate script blocks is swallowed as well.
To match each block accurately, you can use a "lazy" (reluctant) quantifier.
Search with lazy matching:
<script>(.*?)<\/script>
With that, each match stops at the first closing tag, so you get accurate results.
You can read more about lazy and greedy regex matching in this answer.
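To make the difference concrete, here is a small sketch that runs both patterns on the same input. The class name GreedyVsLazy, the helper firstGroup and the sample HTML are made up for the demonstration. Note that without Pattern.DOTALL the dot does not match line breaks, so a script body spanning several lines would need that flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Returns capture group 1 of the first match, or null if the pattern does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<script>a()</script><p>keep</p><script>b()</script>";

        // Greedy: runs to the LAST closing tag and swallows the paragraph in between.
        System.out.println(firstGroup("<script>(.*)</script>", html));
        // prints a()</script><p>keep</p><script>b()

        // Lazy: stops at the FIRST closing tag.
        System.out.println(firstGroup("<script>(.*?)</script>", html));
        // prints a()

        // Removing every block, tags included, with the lazy pattern.
        System.out.println(html.replaceAll("<script>.*?</script>", ""));
        // prints <p>keep</p>
    }
}
```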
This worked for me:
private static String removeScriptTags(String message) {
String scriptRegex = "<(/)?[ ]*script[^>]*>";
Pattern pattern2 = Pattern.compile(scriptRegex);
if(message != null) {
Matcher matcher2 = pattern2.matcher(message);
StringBuffer str = new StringBuffer(message.length());
while(matcher2.find()) {
matcher2.appendReplacement(str, Matcher.quoteReplacement(" "));
}
matcher2.appendTail(str);
message = str.toString();
}
return message;
}
Credit goes to nealvs: https://nealvs.wordpress.com/2010/06/01/removing-tags-from-a-string-in-java/
JAVASCRIPT or JAVA solution needed
The solution I am looking for could use java or javascript. I have the html code in a string so I could manipulate it before using it with java or afterwards with javascript.
problem
Anyway, I have to wrap each word with a tag. For example:
<html> ... >
Hello every one, cheers
< ... </html>
should be changed to
<html> ... >
<word>Hello</word> <word>every</word> <word>one</word>, <word>cheers</word>
< ... </html>
Why?
This will help me use javascript to select/highlight a word. It seems the only way to do it is to use the function highlightElementAtPoint which I added in the JAVASCRIPT hint: It simply finds the element of a certain x,y coordinate and highlights it. I figured that if every word is an element, it will be doable.
The idea is to use this approach to allow us to detect highlighted text in an android WebView even if that would mean to use a twisted highlighting method. Think a bit more and you will find many other applications for this.
JAVASCRIPT hint
I am using the following code to highlight a word; however, this will highlight the whole text belonging to a certain tag. When each word is a tag, this will work to some extent. If there is a substitute that will allow me to highlight a word at a certain position, it would also be a solution.
function highlightElementAtPoint(xOrdinate, yOrdinate) {
var theElement = document.elementFromPoint(xOrdinate, yOrdinate);
selectedElement = theElement;
theElement.style.backgroundColor = "yellow";
var theName = theElement.nodeName;
var theArray = document.getElementsByTagName(theName);
var theIndex = -1;
for (i = 0; i < theArray.length; i++) {
if (theArray[i] == theElement) {
theIndex = i;
}
}
window.androidselection.selected(theElement.innerHTML);
return theName + " " + theIndex;
}
Try something like this (strings are immutable, so assign the result of each replace back):
yourStringHere = yourStringHere.replace(" ", "</word> <word>");
yourStringHere = yourStringHere.replace("<html></word>", "<html>"); // remove the first closing word tag
This should work, though you may have to adjust it.
var tags = document.body.innerText.match(/\w+/g);
for(var i=0;i<tags.length;i++){
tags[i] = '<word>' + tags[i] + '</word>';
}
Or, as @ThomasK said:
var tags = document.body.innerText;
tags = '<word>' + tags + '</word>';
tags = tags.replace(/\s/g,'</word><word>');
But keep in mind that in JavaScript .replace(" ", foo) only replaces the first space. To replace every occurrence you have to use a global regex: .replace(/\s+/g, foo)
And as @ajax333221 said, the second way will include commas, dots and other symbols, so the first solution is the better one.
JSFiddle example: http://jsfiddle.net/c6ftq/4/
inputStr = inputStr.replaceAll("(?<!</?)\\w++(?!\\s*>)","<word>$0</word>");
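The one-liner above can be read as follows: the lookbehind (?<!</?) skips a word that directly follows < or </ (a tag name), \w++ takes a whole word possessively, and the lookahead (?!\s*>) skips a word that ends right before >. Below is a small sketch that wraps it in a method; the names WordWrapper and wrapWords are hypothetical. It assumes tags without attributes, since an attribute name such as href would be wrapped too.

```java
class WordWrapper {
    // Wraps each word outside of simple tags in <word>...</word>.
    // Assumption: tags carry no attributes; attribute names and values would also be wrapped.
    static String wrapWords(String html) {
        return html.replaceAll("(?<!</?)\\w++(?!\\s*>)", "<word>$0</word>");
    }

    public static void main(String[] args) {
        System.out.println(wrapWords("<p>Hello every one, cheers</p>"));
        // prints <p><word>Hello</word> <word>every</word> <word>one</word>, <word>cheers</word></p>
    }
}
```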
You can try the following code:
import java.util.StringTokenizer;
public class myTag
{
static String startWordTag = "<Word>";
static String endWordTag = "</Word>";
static String space = " ";
static String myText = "Hello how are you ";
public static void main ( String args[] )
{
StringTokenizer st = new StringTokenizer (myText," ");
StringBuffer sb = new StringBuffer();
while ( st.hasMoreTokens() )
{
sb.append(startWordTag);
sb.append(st.nextToken());
sb.append(endWordTag);
sb.append(space);
}
System.out.println ( "Result:" + sb.toString() );
}
}
I need help parsing an HTML string.
String str = "<div id=\"test\" > Amrit </div><div><a href=\"#bbbb\" > Amrit </a> </div><a href=\"#cccc\" ><a href=\"#dddd\" >";
String reg = ".*(<\\s*a\\s+href\\s*=\\s*\\\"(.+?)\"\\s*>).*";
str is my sample string and reg is the regex I use to parse all the anchor tags, especially the value of href. With this regex, only the last part of the string is shown.
Pattern MY_PATTERN = Pattern.compile(reg);
Matcher m = MY_PATTERN.matcher(str);
while (m.find()) {
for(int i=0; i<m.groupCount(); i++){
String s = m.group(i);
System.out.println("->" + s);
}
}
This is the code I wrote.
What is missing?
Also, what if I want to replace a particular character in a string? For example, if my URL changes from the form [string]_[string] to [string]-[string], how can I find the "_" and replace it with "-"?
Instead of parsing HTML with a regex (regexes are for regular languages, and HTML is not a regular language), use HtmlUnit:
http://htmlunit.sourceforge.net/
This may help: Options for HTML scraping?
It looks like you have one double escape too many.
This segment may fix it: "<\\s*a\\s+href\\s*=\\s*\"(.+?)\"\\s*>", but I can't comment on whether the entire regex works or not.
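As a rough sketch of how the shortened pattern could be used, the code below runs it with find() over the question's sample string. In the original pattern, the leading .* is greedy, so the single match captures only the last anchor, and the loop condition i < m.groupCount() stops before the last group, which is the one holding the href. The names HrefExtractor and hrefs are made up. The sketch is only meant for simple input like the sample; for real HTML a parser remains the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // No leading or trailing ".*": find() moves from one anchor tag to the next.
    private static final Pattern ANCHOR =
            Pattern.compile("<\\s*a\\s+href\\s*=\\s*\"(.+?)\"\\s*>");

    // Collects the href value (capture group 1) of every anchor tag the pattern matches.
    static List<String> hrefs(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "<div id=\"test\" > Amrit </div><div><a href=\"#bbbb\" > Amrit </a> </div>"
                + "<a href=\"#cccc\" ><a href=\"#dddd\" >";
        System.out.println(hrefs(str)); // prints [#bbbb, #cccc, #dddd]

        // For the second part of the question: replace every '_' with '-'.
        System.out.println("my_page_name".replace('_', '-')); // prints my-page-name
    }
}
```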
I would suggest using jsoup. It is much more flexible than a regex. Sample code is below.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ListLinks {
public static void main(String[] args) throws Exception {
String url = "http://www.umovietv.com/EntertainmentList.aspx";
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
for (Element link : links) {
print("%s", link.attr("abs:href"));
}
}
private static void print(String msg, Object... args) {
System.out.println(String.format(msg, args));
}
}
Refer to http://jsoup.org/ for more information.