Remove white space from text that I scraped from a website


I am trying to scrape a list of medicines from a website.
I am using jsoup to parse the HTML.
Here is my code:
URL url = new URL("http://www.medindia.net/drug-price/index.asp?alpha=a");
Document doc1 = Jsoup.parse(url, 0);
Elements rows = doc1.getElementsByAttributeValue("style", "padding-left:5px;border-right:1px solid #A5A5A5;");
for(Element row : rows){
String htm = row.text();
if(!(htm.equals("View Price")||htm.contains("Show Details"))) {
System.out.println(htm);
System.out.println();
}
}
Here is the output that I am getting:
P.S. This is not the complete output; I could not take a screenshot of all of it, so I am showing only part.
I need to know two things:
Question 1. Why am I getting an extra space in front of each drug name, and why am I getting an extra new line after some drug names?
Question 2. How do I resolve this issue?

A few things:
It's not the complete output because there's more than one page. I added a for loop that takes care of that.
You should trim the output using htm.trim().
You should skip the print when the text is empty (!htm.isEmpty()).
That website contains a character with code 160 (U+00A0, the no-break space that &nbsp; decodes to). It is not ASCII, and trim() does not remove it, so I replace it with a regular space first (with .replace).
Here's the fixed code:
for(char page='a'; page <= 'z'; page++) {
String urlString = String.format("http://www.medindia.net/drug-price/index.asp?alpha=%c", page);
URL url = new URL(urlString);
Document doc1 = Jsoup.parse(url, 0);
Elements rows = doc1.getElementsByAttributeValue("style", "padding-left:5px;border-right:1px solid #A5A5A5;");
for(Element row : rows){
String htm = row.text().replace((char) 160, ' ').trim();
if(!(htm.equals("View Price")||htm.contains("Show Details"))&& !htm.isEmpty())
{
System.out.println(htm.trim());
System.out.println();
}
}
}
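The cleanup step in the loop above can be isolated into a small helper. This is a minimal sketch using only the JDK; the class name WhitespaceCleaner and the sample cell text are made up for illustration and are not taken from the website.

```java
class WhitespaceCleaner {

    // Replaces no-break spaces (U+00A0) with plain spaces, then trims.
    // String.trim() only strips characters up to U+0020, so it leaves
    // U+00A0 in place unless it is replaced first.
    static String clean(String raw) {
        return raw.replace('\u00A0', ' ').trim();
    }

    public static void main(String[] args) {
        String cell = "\u00A0Abacavir\u00A0"; // hypothetical scraped cell text
        System.out.println("[" + cell.trim() + "]");  // no-break spaces remain
        System.out.println("[" + clean(cell) + "]");  // no-break spaces removed
    }
}
```

With jsoup, the same helper would be applied to the result of row.text() before the isEmpty() check.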

One suggestion:
Use trim() in the print statement: System.out.println(htm.trim());
UPDATE:
After a lot of effort I was able to parse all 80 medicines like this:
URL url = new URL("http://www.medindia.net/drug-price/index.asp?alpha=a");
Document doc1 = Jsoup.parse(url, 0);
Elements rows = doc1.select("td.ta13blue");
Elements rows1 = doc1.select("td.ta13black.tbold");
int cnt=0;
for(Element row : rows){
cnt++;
String htm = row.text().trim();
if(!(htm.equals("View Price")||htm.contains("Show Details") || htm.startsWith("Drug"))) {
System.out.println(cnt+" : "+htm);
System.out.println();
}
}
for(Element row1 : rows1){
cnt++;
String htm = row1.text().trim();
if(!(htm.equals("View Price")||htm.contains("Show Details") || htm.startsWith("Drug"))) {
System.out.println(cnt+" : "+htm);
System.out.println();
}
}

1) Selecting elements by their style attribute is fragile; any change to the page's inline CSS breaks the query.
2) Calling the variable rows when it actually holds a list of cells (fields) is even more misleading :)
3) If you open the page, you can see that the extra lines appear ONLY after "black names", that is, item names not wrapped in an anchor link.
Your problem is that the second field in those rows is neither "Show Details" nor "View Price", and it is not empty either. It is:
<td bgcolor="#FFFFDB" align="center"
style="padding-left:5px;border-right:1px solid #A5A5A5;">
</td>
It is a one-space string. Modify your code like this:
for(Element row : rows){
// Turn no-break spaces (U+00A0) into plain spaces so that trim() can remove them
String htm = row.text().replace('\u00A0', ' ').trim();
if(!
(htm.equals("View Price")
|| htm.contains("Show Details")
|| htm.isEmpty()) // after trimming, a whitespace-only cell is empty
) {
System.out.println(htm);
System.out.println();
}
}

Related

How can I print the contents of this HTML table using JSoup?

I will start off by stating that working with HTML and JSoup for that matter is very foreign to me so if this comes across as a stupid question, I apologize.
What I am trying to achieve with my code is to print the contents from the table on this link https://www.stormshield.one/pve/stats/daviddean/sch into my console in a format like this for each entry:
Wall Launcher
50
grade grade grade grade grade
15% ImpactKnockback
42% Reload Speed
15% Impact Knockback
42% Reload Speed
15% ImpactKnockback
42% Durability
My main issue is pretty much supplying the correct name for the table and the rows, once I can do that the formatting isn't really an issue for me.
This is the code I have tried to use to no avail:
public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect("https://www.stormshield.one/pve/stats/daviddean/sch").get();
for (Element table : doc.select("table schematics")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
}
}
}

You need to find your table element, then its head and its rows.
Be careful: the table you want is not always the first() element; I use first() only as an example.
Here is what you need to do:
Document doc = null;
try {
doc = Jsoup.connect("https://www.stormshield.one/pve/stats/daviddean/sch").get();
} catch (IOException e) {
e.printStackTrace();
}
Element table = doc.body().getElementsByTag("table").first();
Element thead = table.getElementsByTag("thead").first();
StringBuilder headBuilder = new StringBuilder();
for (Element th : thead.getElementsByTag("th")) {
headBuilder.append(th.text());
headBuilder.append(" ");
}
System.out.println(headBuilder.toString());
Element tbody = table.getElementsByTag("tbody").first();
for (Element tr : tbody.getElementsByTag("tr")) {
StringBuilder rowBuilder = new StringBuilder();
for (Element td : tr.getElementsByTag("td")) {
rowBuilder.append(td.text());
rowBuilder.append(" ");
}
System.out.println(rowBuilder.toString());
}

How to generate XPath query matching a specific element in Jsoup?

Hi, this is my web page:
<html>
<head>
</head>
<body>
<div> text div 1</div>
<div>
<span>text of first span </span>
<span>text of second span </span>
</div>
<div> text div 3 </div>
</body>
</html>
I'm using jsoup to parse it, then I walk through all elements inside the page and build their paths:
Document doc = Jsoup.parse(new File("C:\\Users\\HC\\Desktop\\dataset\\index.html"), "UTF-8");
Elements elements = doc.body().select("*");
ArrayList all = new ArrayList();
for (Element element : elements) {
if (!element.ownText().isEmpty()) {
StringBuilder path = new StringBuilder(element.nodeName());
String value = element.ownText();
Elements p_el = element.parents();
for (Element el : p_el) {
path.insert(0, el.nodeName() + '/');
}
all.add(path + " = " + value + "\n");
System.out.println(path +" = "+ value);
}
}
return all;
My code gives me this result:
html/body/div = text div 1
html/body/div/span = text of first span
html/body/div/span = text of second span
html/body/div = text div 3
In fact, I want to get a result like this:
html/body/div[1] = text div 1
html/body/div[2]/span[1] = text of first span
html/body/div[2]/span[2] = text of second span
html/body/div[3] = text div 3
Could anyone give me an idea of how to reach this result? Thanks in advance.

As asked, here is an idea.
I'm quite sure there are better ways to get the XPath for a given node, for example using XSLT as in the answer to "Generate/get xpath from XML node java".
Here is a possible solution based on your current attempt.
For each element (and each of its parents), check whether there is more than one sibling element with the same name.
Pseudo code: if ( count (el.select('../' + el.nodeName() ) > 1)
If true, count the preceding siblings with the same name and add 1.
count (el.select('preceding-sibling::' + el.nodeName() ) +1
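The sibling-counting idea above can be sketched as follows. To keep the example runnable without extra dependencies, it uses the JDK's built-in org.w3c.dom parser instead of jsoup, so it assumes the input is well-formed XML (the sample page in the question is). The class and method names (IndexedPaths, pathOf, collect) are hypothetical. With jsoup, the same loop could presumably walk previousElementSibling() and nextElementSibling() instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class IndexedPaths {

    // Counts sibling elements with the same name, before or after n.
    static int sameNameSiblings(Node n, boolean forward) {
        int count = 0;
        Node s = forward ? n.getNextSibling() : n.getPreviousSibling();
        while (s != null) {
            if (s instanceof Element && s.getNodeName().equals(n.getNodeName())) {
                count++;
            }
            s = forward ? s.getNextSibling() : s.getPreviousSibling();
        }
        return count;
    }

    // Builds a path such as html/body/div[2]/span[1]; an index is added
    // only when the element has at least one same-name sibling.
    static String pathOf(Element el) {
        StringBuilder path = new StringBuilder();
        for (Node n = el; n instanceof Element; n = n.getParentNode()) {
            int before = sameNameSiblings(n, false);
            int after = sameNameSiblings(n, true);
            String step = n.getNodeName();
            if (before + after > 0) {
                step += "[" + (before + 1) + "]"; // XPath positions are 1-based
            }
            path.insert(0, path.length() == 0 ? step : step + "/");
        }
        return path.toString();
    }

    // Concatenates the element's direct text nodes, ignoring descendants.
    static String ownText(Element el) {
        StringBuilder text = new StringBuilder();
        for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                text.append(c.getNodeValue());
            }
        }
        return text.toString().trim();
    }

    // Returns "path = text" for every element that has text of its own.
    static List<String> collect(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("input is not well-formed XML", e);
        }
        List<String> all = new ArrayList<>();
        NodeList elements = doc.getElementsByTagName("*"); // document order
        for (int i = 0; i < elements.getLength(); i++) {
            Element el = (Element) elements.item(i);
            String value = ownText(el);
            if (!value.isEmpty()) {
                all.add(pathOf(el) + " = " + value);
            }
        }
        return all;
    }

    public static void main(String[] args) {
        String page = "<html><head/><body><div> text div 1</div>"
                + "<div><span>text of first span </span><span>text of second span </span></div>"
                + "<div> text div 3 </div></body></html>";
        collect(page).forEach(System.out::println);
    }
}
```

The index counts only siblings with the same tag name, which is what an XPath step like div[2] means; counting all sibling nodes would give different numbers.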

This is my solution to this problem:
StringBuilder absPath=new StringBuilder();
Elements parents = htmlElement.parents();
for (int j = parents.size()-1; j >= 0; j--) {
Element element = parents.get(j);
absPath.append("/");
absPath.append(element.tagName());
absPath.append("[");
absPath.append(element.siblingIndex());
absPath.append("]");
}

This would be easier if you traversed the document from the root to the leaves instead of the other way round. This way you can easily group the elements by tag name and handle multiple occurrences accordingly. Here is a recursive approach:
private final List<String> path = new ArrayList<>();
private final List<String> all = new ArrayList<>();
public List<String> getAll() {
return Collections.unmodifiableList(all);
}
public void parse(Document doc) {
path.clear();
all.clear();
parse(doc.children());
}
private void parse(List<Element> elements) {
if (elements.isEmpty()) {
return;
}
Map<String, List<Element>> grouped = elements.stream().collect(Collectors.groupingBy(Element::tagName));
for (Map.Entry<String, List<Element>> entry : grouped.entrySet()) {
List<Element> list = entry.getValue();
String key = entry.getKey();
if (list.size() > 1) {
int index = 1;
// use paths with index
key += "[";
for (Element e : list) {
path.add(key + (index++) + "]");
handleElement(e);
path.remove(path.size() - 1);
}
} else {
// use paths without index
path.add(key);
handleElement(list.get(0));
path.remove(path.size() - 1);
}
}
}
private void handleElement(Element e) {
String value = e.ownText();
if (!value.isEmpty()) {
// add entry
all.add(path.stream().collect(Collectors.joining("/")) + " = " + value);
}
// process children of element
parse(e.children());
}

Here is the solution in Kotlin. It's correct, and it works. The other answers are wrong and caused me hours of lost work.
fun Element.xpath(): String = buildString {
val parents = parents()
for (j in (parents.size - 1) downTo 0) {
val parent = parents[j]
append("/*[")
append(parent.siblingIndex() + 1)
append(']')
}
append("/*[")
append(siblingIndex() + 1)
append(']')
}

JSoup parsing form with checkboxes and select input

I have a form which I have to read with jsoup. It contains several fields, including checkboxes and comboboxes (select inputs).
I am reading their values with the following code:
Element campaignForm = doc.getElementById("Campaign");
Elements allInputFields = campaignForm.getElementsByTag("input");
Elements allSelections = campaignForm.getElementsByTag("select");
Map<String, String> postData = new HashMap<String, String>();
for(Element selectField:allSelections){
postData.put(selectField.attr("name"), selectField.attr("value"));
}
for(Element inputField:allInputFields){
if(inputField.attr("type").equalsIgnoreCase("checkbox")){
postData.put(inputField.attr("name"), inputField.attr("checked").equalsIgnoreCase("checked")?"1":"0");
}else{
postData.put(inputField.attr("name"), inputField.attr("value"));
}
}
When I print the postData map, it gives correct values for text input fields, but for checkboxes and dropdowns (comboboxes) it does not work. Please let me know if there is a different way to handle checkboxes and select inputs in jsoup.
EDIT:
I got the checkboxes working with the help of a comment, but the select inputs are still not working.
Thanks in advance.

I got it working with the following code:
for(Element selectField:allSelections){
String nameField = selectField.attr("name");
String valueField = "";
Elements allOptions = selectField.getElementsByTag("option");
for(Element opt:allOptions){
if(opt.attr("selected").equalsIgnoreCase("selected")){
valueField = opt.attr("value");
break;
}
}
postData.put(nameField, valueField);
}
for(Element inputField:allInputFields){
if(inputField.attr("type").equalsIgnoreCase("checkbox")){
postData.put(inputField.attr("name"), inputField.attr("checked").equalsIgnoreCase("checked")?"1":"0");
}else{
postData.put(inputField.attr("name"), inputField.attr("value"));
}
}
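The approach above compares the checked and selected attributes against a literal value, which misses markup where the attribute is present but empty. A hedged sketch of a presence-based check follows. It uses the JDK's org.w3c.dom parser rather than jsoup so that it runs without dependencies, and therefore assumes well-formed XHTML; the class name FormValues and the sample form are invented for illustration. In jsoup the equivalent presence test would be hasAttr. Falling back to the first option when none is marked selected mirrors how browsers treat a single-choice select.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class FormValues {

    // Reads name/value pairs from the inputs and selects of one form.
    static Map<String, String> read(String xhtml) {
        Element form;
        try {
            form = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("markup is not well-formed", e);
        }
        Map<String, String> data = new LinkedHashMap<>();

        NodeList inputs = form.getElementsByTagName("input");
        for (int i = 0; i < inputs.getLength(); i++) {
            Element in = (Element) inputs.item(i);
            String value;
            if ("checkbox".equalsIgnoreCase(in.getAttribute("type"))) {
                // Presence of the attribute matters, not its value.
                value = in.hasAttribute("checked") ? "1" : "0";
            } else {
                value = in.getAttribute("value");
            }
            data.put(in.getAttribute("name"), value);
        }

        NodeList selects = form.getElementsByTagName("select");
        for (int i = 0; i < selects.getLength(); i++) {
            Element sel = (Element) selects.item(i);
            NodeList options = sel.getElementsByTagName("option");
            String value = "";
            for (int j = 0; j < options.getLength(); j++) {
                Element opt = (Element) options.item(j);
                if (j == 0) {
                    value = opt.getAttribute("value"); // default: first option
                }
                if (opt.hasAttribute("selected")) {
                    value = opt.getAttribute("value");
                    break;
                }
            }
            data.put(sel.getAttribute("name"), value);
        }
        return data;
    }

    public static void main(String[] args) {
        String form = "<form><input type=\"text\" name=\"title\" value=\"hi\"/>"
                + "<input type=\"checkbox\" name=\"active\" checked=\"\"/>"
                + "<select name=\"size\"><option value=\"s\"/>"
                + "<option value=\"m\" selected=\"selected\"/></select></form>";
        System.out.println(read(form));
    }
}
```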

using Jsoup to extract a table inside several divs

I am trying to use jsoup to access a table embedded inside multiple divs of an HTML page. The table is under the outer div with id "content-top". The inner divs leading to the table are: content-top -> center -> middle-right-col -> result.
Under the div result is the table round. This is the table that I want to access; I need to traverse its rows and print the data they contain. Below is the Java code I have been trying to use, but it yields no results:
Document doc = Jsoup.connect("http://www.calculator.com/#").data("express", "sin(x)").data("calculate","submit").post();
// give the application time to calculate result before retrieving result from results table
try {
Thread.sleep(10000);
}
catch(InterruptedException ex)
{
Thread.currentThread().interrupt();
}
Elements content = doc.select("div#result") ;
Element tables = content.get(0) ;
Elements table_rows = tables.select("tr") ;
Iterator iterRows = table_rows.iterator();
while (iterRows.hasNext()) {
Element tr = (Element)iterRows.next();
Elements table_data = tr.select("td");
Iterator iterData = table_data.iterator();
int tdCount = 0;
String f_x_value = null;
String result = null;
// process new line
while (iterData.hasNext()) {
Element td = (Element)iterData.next();
switch (tdCount++) {
case 1:
f_x_value = td.text();
f_x_value = td.select("a").text();
break;
case 2:
result = td.text();
result = td.select("a").text();
break;
}
}
System.out.println(f_x_value + " " + result ) ;
}
The above code crashes and does not do what I want. Can anyone please help me?

public static String do_conversion (String str)
{
char c;
String output = "{";
for(int i = 0; i < str.length(); i++)
{
c = str.charAt(i);
if(c=='e')
output += "{mathrm{e}}";
else if(c=='(')
output += '{';
else if(c==')')
output += '}';
else if(c=='+')
output += "{cplus}";
else if(c=='-')
output += "{cminus}";
else if(c=='*')
output += "{cdot}";
else if(c=='/')
output += "{cdivide}";
else output += c; // else copy the character normally
}
output += ", mathrm{d}x}";
return output;
}

The page doesn't directly give you a table in a div with id "result". It uses an Ajax call to a PHP file to do the processing. So what you need to do here is first build a JSON object like
{"expression":"sin(x)","intVar":"x","upperBound":"","lowerBound":"","simplifyExpressions":false,"latex":"\\displaystyle\\int\\limits^{}_{}{\\sin\\left(x\\right)\\, \\mathrm{d}x}"}
The expression key holds the expression that you want to evaluate, and latex is the MathJax expression. Post this to int.php, which expects two arguments: q, which is the above JSON, and v, which seems to be a constant value, 1380119311. I did not work out what it represents.
Now this will return a response like
<html>
<head></head>
<body>
<table class="round">
<tbody>
<tr class="">
<th>$f(x) =$</th>
<td>$\sin\left(x\right)$</td>
</tr>
<tr class="sep odd">
<th>$\displaystyle\int{f(x)}\, \mathrm{d}x =$</th>
<td>$-\cos\left(x\right)$</td>
</tr>
</tbody>
</table>
<!-- Finished in 155 ms -->
<p id="share"> <img src="layout/32x32xshare.png.pagespeed.ic.i3iroHP5fI.png" width="32" height="32" /> <a id="share-link" href="http://www.integral-calculator.com/#expr=sin%28x%29" onclick="window.prompt("To copy this link to the clipboard, press Ctrl+C, Enter.", $("share-link").href); return false;">Direct link to this calculation (for sharing)</a> </p>
</body>
</html>
The table in this response gives you the result, and the site uses MathJax to render it.
A sample program would be:
import java.io.IOException;
import org.apache.commons.lang3.StringEscapeUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class JsoupParser6 {
public static void main(String[] args) {
try {
// Integral
String url = "http://www.integral-calculator.com/int.php";
String q = "{\"expression\":\"sin(4x) * e^(-x)\",\"intVar\":\"x\",\"upperBound\":\"\",\"lowerBound\":\"\",\"simplifyExpressions\":false,\"latex\":\"\\\\displaystyle\\\\int\\\\limits^{}_{}{\\\\sin\\\\left(4x\\\\right){\\\\cdot}{\\\\mathrm{e}}^{-x}\\\\, \\\\mathrm{d}x}\"}";
Document integralDoc = Jsoup.connect(url).data("q", q).data("v", "1380119311").post();
System.out.println(integralDoc);
System.out.println("\n*******************************\n");
//Differential
url = "http://www.derivative-calculator.net/diff.php";
q = "{\"expression\":\"sin(x)\",\"diffVar\":\"x\",\"diffOrder\":1,\"simplifyExpressions\":false,\"showSteps\":false,\"latex\":\"\\\\dfrac{\\\\mathrm{d}}{\\\\mathrm{d}x}\\\\left(\\\\sin\\\\left(x\\\\right)\\\\right)\"}";
Document differentialDoc = Jsoup.connect(url).data("q", q).data("v", "1380119305").post();
System.out.println(differentialDoc);
System.out.println("\n*******************************\n");
//Calculus
url = "http://calculus-calculator.com/calculation/integrate.php";
Document calculusDoc = Jsoup.connect(url).data("expression", "sin(x)").data("intvar", "x").post();
String outStr = StringEscapeUtils.unescapeJava(calculusDoc.toString());
Document formattedOutPut = Jsoup.parse(outStr);
formattedOutPut.body().html(formattedOutPut.select("div.isteps").toString());
System.out.println(formattedOutPut);
} catch (IOException e) {
e.printStackTrace();
}
}
}
Update based on comment:
The unescape works perfectly well. In MathJax you can right-click a rendered result and view its TeX commands. If you go to your site http://calculus-calculator.com/, try the sin(x) equation there, and right-click the result to view the TeX commands,
then you can see that the commands are exactly the ones we get after unescaping. The demo site is just not rendering them; that may be a limitation of the demo site, that's all.

Multiple Line String to separated new strings for every line

I have the following code. I am using the jsoup library to retrieve the URLs from a website; after that, I am checking if the URLs contain the keyword I want, and list them in another string. My problem is that I am not able to retrieve only one URL.
Have a look at my code:
// Get the webpage and parse it.
org.jsoup.nodes.Document doc = Jsoup.connect("http://www.examplepage").get();
// Get the anchors with href attribute.
// Or, you can use doc.select("a") to get all the anchors.
org.jsoup.select.Elements links = doc.select("a[href]");
// Iterate over all the links and process them.
for (org.jsoup.nodes.Element link : links) {
String scrapedlinks += link.attr("abs:href")+"\n" ;
String scrapedlinks3 ="";
}
String[] links2 = links.split("\n");
for (String newlink : hulklinks ) {
if (newlink("mysearchterm")) {
scrapedlinks3 +=newlink ;
String[] scrapedlines = scrapedlinks3.split("\n" );
}
}

I think it will be easier if you store your URLs directly in an ArrayList:
ArrayList<String> urls = new ArrayList<String>();
for (org.jsoup.nodes.Element link : links) {
urls.add(link.attr("abs:href"));
}
After this you can easily access them with
urls.get(i);
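Once the URLs are in a list, the keyword check from the question no longer needs any string splitting. A minimal sketch, assuming the links have already been collected as plain strings; the class name LinkFilter and the example URLs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LinkFilter {

    // Returns the URLs that contain the keyword, in their original order.
    static List<String> containing(List<String> urls, String keyword) {
        List<String> matches = new ArrayList<>();
        for (String url : urls) {
            if (url.contains(keyword)) {
                matches.add(url);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical URLs standing in for the scraped abs:href values
        List<String> urls = Arrays.asList(
                "http://a.example/news",
                "http://a.example/about",
                "http://b.example/news/1");
        for (String url : containing(urls, "news")) {
            System.out.println(url);
        }
    }
}
```

Each match is then a separate string, and a single one can be picked with get(index).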
