I don't need to handle linked stylesheets or hover styles.
I want to automatically convert files like this
<html>
<body>
<style>
body{background:#FFC}
p{background:red}
body, p{font-weight:bold}
</style>
<p>...</p>
</body>
</html>
to files like this
<html>
<body style="background:#FFC;font-weight:bold">
<p style="background:red;font-weight:bold">...</p>
</body>
</html>
I would be even more interested if there was an HTML parser that would do this.
The reason I want to do this is so I can display emails that use global style sheets without their style sheets messing up the rest of my web page. I would also like to send the resulting HTML to a web-based rich text editor for the reply and the original message.
Here is a solution in Java, made with the jsoup library: http://jsoup.org/download
import java.io.IOException;
import java.util.StringTokenizer;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class AutomaticCssInliner {
/**
* Made by Grekz, http://grekz.wordpress.com
*/
public static void main(String[] args) throws IOException {
final String style = "style";
final String html = "<html>" + "<body> <style>"
+ "body{background:#FFC} \n p{background:red}"
+ "body, p{font-weight:bold} </style>"
+ "<p>...</p> </body> </html>";
// Document doc = Jsoup.connect("http://mypage.com/inlineme.php").get();
Document doc = Jsoup.parse(html);
Elements els = doc.select(style);// to get all the style elements
for (Element e : els) {
String styleRules = e.getAllElements().get(0).data().replaceAll(
"\n", "").trim(), delims = "{}";
StringTokenizer st = new StringTokenizer(styleRules, delims);
while (st.countTokens() > 1) {
String selector = st.nextToken(), properties = st.nextToken();
Elements selectedElements = doc.select(selector);
for (Element selElem : selectedElements) {
String oldProperties = selElem.attr(style);
selElem.attr(style,
oldProperties.length() > 0 ? concatenateProperties(
oldProperties, properties) : properties);
}
}
e.remove();
}
System.out.println(doc);// now we have the result html without the
// styles tags, and the inline css in each
// element
}
private static String concatenateProperties(String oldProp, String newProp) {
oldProp = oldProp.trim();
if (!newProp.endsWith(";"))
newProp += ";";
return newProp + oldProp; // The existing (old) properties should take precedence.
}
}
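Note that plain concatenation can leave the same property twice in one style attribute (for example background:#FFC;background:red). As a minimal sketch, assuming simple declarations whose values contain no ';', the merge could be done per property so that the existing inline value wins. StyleMerger and its methods are illustrative names, not part of jsoup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class StyleMerger {

    // Parses "a:b;c:d" into an ordered property -> value map.
    // Assumption: values contain no ';' (so no data: URLs and the like).
    static Map<String, String> parse(String declarations) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String declaration : declarations.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // skip empty or malformed fragments
            }
            String name = declaration.substring(0, colon).trim().toLowerCase();
            String value = declaration.substring(colon + 1).trim();
            if (!name.isEmpty() && !value.isEmpty()) {
                result.put(name, value);
            }
        }
        return result;
    }

    // Stylesheet properties go in first; existing inline properties overwrite them.
    static String merge(String inlineProps, String sheetProps) {
        Map<String, String> merged = parse(sheetProps);
        merged.putAll(parse(inlineProps));
        StringJoiner joiner = new StringJoiner(";");
        merged.forEach((name, value) -> joiner.add(name + ":" + value));
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(merge("background:red", "background:#FFC;font-weight:bold"));
    }
}
```

Something like StyleMerger.merge(oldProperties, properties) could then stand in for concatenateProperties in the loop above.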
Using jsoup + cssparser:
private static final String STYLE_ATTR = "style";
private static final String CLASS_ATTR = "class";
public String inlineStyles(String html, File cssFile, boolean removeClasses) throws IOException {
Document document = Jsoup.parse(html);
CSSOMParser parser = new CSSOMParser(new SACParserCSS3());
InputSource source = new InputSource(new FileReader(cssFile));
CSSStyleSheet stylesheet = parser.parseStyleSheet(source, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
Map<Element, Map<String, String>> allElementsStyles = new HashMap<>();
for (int ruleIndex = 0; ruleIndex < ruleList.getLength(); ruleIndex++) {
CSSRule item = ruleList.item(ruleIndex);
if (item instanceof CSSStyleRule) {
CSSStyleRule styleRule = (CSSStyleRule) item;
String cssSelector = styleRule.getSelectorText();
Elements elements = document.select(cssSelector);
for (Element element : elements) {
Map<String, String> elementStyles = allElementsStyles.computeIfAbsent(element, k -> new LinkedHashMap<>());
CSSStyleDeclaration style = styleRule.getStyle();
for (int propertyIndex = 0; propertyIndex < style.getLength(); propertyIndex++) {
String propertyName = style.item(propertyIndex);
String propertyValue = style.getPropertyValue(propertyName);
elementStyles.put(propertyName, propertyValue);
}
}
}
}
for (Map.Entry<Element, Map<String, String>> elementEntry : allElementsStyles.entrySet()) {
Element element = elementEntry.getKey();
StringBuilder builder = new StringBuilder();
for (Map.Entry<String, String> styleEntry : elementEntry.getValue().entrySet()) {
builder.append(styleEntry.getKey()).append(":").append(styleEntry.getValue()).append(";");
}
builder.append(element.attr(STYLE_ATTR));
element.attr(STYLE_ATTR, builder.toString());
if (removeClasses) {
element.removeAttr(CLASS_ATTR);
}
}
return document.html();
}
After hours of trying different manual java code solutions and not being satisfied with results (responsive media query handling issues mostly), I stumbled upon https://github.com/mdedetrich/java-premailer-wrapper which works great as a java solution. Note that you might actually be better off running your own "premailer" server. While there is a public api to premailer, I wanted to have my own instance running that I can hit as hard as I want:
https://github.com/TrackIF/premailer-server
Easy to run on ec2 with just a few clicks: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/create_deploy_Ruby_sinatra.html
git clone https://github.com/Enalmada/premailer-server
cd premailer-server
eb init (choose latest ruby)
eb create premailer-server
eb deploy
curl --data "html=<your html>" http://your.eb.url
I haven't tried this, but it looks like you can use something like CSS parser to get a DOM tree corresponding to your CSS. So you can do something like:
Obtain cssDOM
Obtain htmlDOM (JAXP)
Iterate over each cssDOM element and use xpath to locate and insert the correct style in your htmlDOM.
Convert htmlDOM to string.
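The steps above can be sketched with the JDK alone (JAXP for the DOM, javax.xml.xpath for the lookup). This is only a sketch under two loud assumptions: the input is well-formed XML (real email HTML often is not), and the selector-to-XPath rules are written by hand here instead of coming from a CSS parser. XPathInliner is a hypothetical name.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathInliner {

    // Appends the declarations to the style attribute of every element matched by the XPath.
    static void apply(Document doc, String xpath, String declarations) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            String existing = element.getAttribute("style"); // "" when absent
            element.setAttribute("style",
                    existing.isEmpty() ? declarations : existing + ";" + declarations);
        }
    }

    static String inline(String xhtml) {
        try {
            // Obtain htmlDOM
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Hand-written stand-in for the rules a CSS parser would produce
            apply(doc, "//body", "background:#FFC");
            apply(doc, "//p", "background:red");
            apply(doc, "//body | //p", "font-weight:bold");
            // Convert htmlDOM to string
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not inline styles", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(inline("<html><body><p>...</p></body></html>"));
    }
}
```

Translating arbitrary CSS selectors into XPath is the hard part and is not attempted here.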
I can't yet comment but I wrote a gist that attempted to enhance the accepted answer to handle the cascading part of cascading stylesheets.
It doesn't work perfectly but it is almost there.
https://gist.github.com/moodysalem/69e2966834a1f79492a9
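For the cascading part, rules need to be applied in order of selector specificity, with source order breaking ties. The following is a rough sketch of a specificity count, not the gist's code. It assumes a single selector (no comma-separated list) and does not handle pseudo-elements or :not(); Specificity is a hypothetical helper name.

```java
import java.util.Arrays;

final class Specificity {

    // Returns {ids, classes, types} for a selector such as "div#main p.note".
    // Attribute selectors and pseudo-classes count as classes.
    static int[] of(String selector) {
        int ids = 0, classes = 0, types = 0;
        boolean compoundStart = true; // true at the start and right after a combinator
        for (int i = 0; i < selector.length(); i++) {
            char c = selector.charAt(i);
            if (c == '#') {
                ids++;
                compoundStart = false;
            } else if (c == '.' || c == ':') {
                classes++;
                compoundStart = false;
            } else if (c == '[') {
                classes++;
                compoundStart = false;
                int close = selector.indexOf(']', i);
                if (close < 0) {
                    break; // malformed selector: stop counting
                }
                i = close; // skip the attribute selector's content
            } else if (c == ' ' || c == '>' || c == '+' || c == '~') {
                compoundStart = true;
            } else if (c == '*') {
                compoundStart = false; // the universal selector adds nothing
            } else if (compoundStart && Character.isLetter(c)) {
                types++;
                compoundStart = false;
            }
        }
        return new int[] {ids, classes, types};
    }

    // Negative when a is less specific than b, positive when more, 0 when equal.
    static int compare(String a, String b) {
        return Arrays.compare(of(a), of(b));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(of("div#main p.note")));
    }
}
```

Sorting the rules with a stable sort on this comparison and applying them in ascending order lets more specific rules overwrite less specific ones.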
You can use HtmlUnit and Jsoup. You render the html page in the browser using HtmlUnit. Then you get the computed styles going through the elements thanks to HtmlUnit. Jsoup is just here to format the html output.
Here is a simple implementation:
public final class CssInliner {
private static final Logger log = Logger.getLogger(CssInliner.class);
private CssInliner() {
}
public static CssInliner make() {
return new CssInliner();
}
/**
* Main method
*
* @param html html to inline
*
* @return inlined html
*/
public String inline(String html) throws IOException {
try (WebClient webClient = new WebClient()) {
HtmlPage htmlPage = getHtmlPage(webClient, html);
Window window = webClient.getCurrentWindow().getScriptableObject();
for (HtmlElement htmlElement : htmlPage.getHtmlElementDescendants()) {
applyComputedStyle(window, htmlElement);
}
return outputCleanHtml(htmlPage);
}
}
/**
* Output the HtmlUnit page as clean HTML. Remove the old global style tag
* that we do not need anymore, in order to simplify testing of the
* output.
*
* @param htmlPage
*
* @return
*/
private String outputCleanHtml(HtmlPage htmlPage) {
Document doc = Jsoup.parse(htmlPage.getDocumentElement().asXml());
Element globalStyleTag = doc.selectFirst("html style");
if (globalStyleTag != null) {
globalStyleTag.remove();
}
doc.outputSettings().syntax(Syntax.html);
return doc.html();
}
/**
* Modify the html elements by adding a style attribute to each element
*
* @param window
* @param htmlElement
*/
private void applyComputedStyle(Window window, HtmlElement htmlElement) {
HTMLElement pj = htmlElement.getScriptableObject();
ComputedCSSStyleDeclaration cssStyleDeclaration = window.getComputedStyle(pj, null);
Map<String, StyleElement> map = getStringStyleElementMap(cssStyleDeclaration);
// apply style element to html
if (!map.isEmpty()) {
htmlElement.writeStyleToElement(map);
}
}
private Map<String, StyleElement> getStringStyleElementMap(ComputedCSSStyleDeclaration cssStyleDeclaration) {
Map<String, StyleElement> map = new HashMap<>();
for (Definition definition : Definition.values()) {
String style = cssStyleDeclaration.getStyleAttribute(definition, false);
if (StringUtils.isNotBlank(style)) {
map.put(definition.getAttributeName(),
new StyleElement(definition.getAttributeName(),
style,
"",
SelectorSpecificity.DEFAULT_STYLE_ATTRIBUTE));
}
}
return map;
}
private HtmlPage getHtmlPage(WebClient webClient, String html) throws IOException {
URL url = new URL("http://tinubuinliner/" + Math.random());
StringWebResponse stringWebResponse = new StringWebResponse(html, url);
return HTMLParser.parseHtml(stringWebResponse, webClient.getCurrentWindow());
}
}
For a solution to this you're probably best using a battle hardened tool like the one from Mailchimp.
They've opened up their CSS inlining tool in their API, see here: http://apidocs.mailchimp.com/api/1.3/inlinecss.func.php
Much more useful than a web form.
There's also an open source Ruby tool here: https://github.com/alexdunae/premailer/
Premailer also exposes an API and web form, see http://premailer.dialect.ca - it's sponsored by Campaign Monitor who are one of the other big players in the email space.
I'm guessing you could integrate Premailer into your Java app via JRuby, although I have no experience with this.
The CSSBox + jStyleParser libraries can do the job as already answered here.
http://www.mailchimp.com/labs/inlinecss.php
Use that link above. It will save hours of your time and is made especially for email templates. It's a free tool by mailchimp
This kind of thing is often required for e-commerce applications where the bank/whatever doesn't allow linked CSS, e.g. WorldPay.
The big challenge isn't so much the conversion as the lack of inheritance. You have to explicitly set inherited properties on all descendant tags. Testing is vital as certain browsers will cause more grief than others. You will need to add a lot more inline code than you need for a linked stylesheet, for example in a linked stylesheet all you need is p { color:red }, but inline you have to explicitly set the color on all paragraphs.
From my experience, it's very much a manual process that requires a light touch and a lot of tweaking and cross-browser testing to get right.
I took the first two answers and adapted them to make use of the capabilities of the CSS parser library:
public String inline(String html, String styles) throws IOException {
Document document = Jsoup.parse(html);
CSSRuleList ruleList = getCssRules(styles);
for (int i = 0; i < ruleList.getLength(); i++) {
CSSRule rule = ruleList.item(i);
if (rule instanceof CSSStyleRule) {
CSSStyleRule styleRule = (CSSStyleRule) rule;
String selector = styleRule.getSelectorText();
Elements elements = document.select(selector);
for (final Element element : elements) {
applyRuleToElement(element, styleRule);
}
}
}
removeClasses(document);
return document.html();
}
private CSSRuleList getCssRules(String styles) throws IOException {
CSSOMParser parser = new CSSOMParser(new SACParserCSS3());
CSSStyleSheet styleSheet = parser.parseStyleSheet(new InputSource(new StringReader(styles)), null, null);
CSSRuleList list = styleSheet.getCssRules();
return list;
}
private void applyRuleToElement(Element element, CSSStyleRule rule){
String elementStyleString = element.attr("style");
CSSStyleDeclarationImpl elementStyleDeclaration = new CSSStyleDeclarationImpl();
elementStyleDeclaration.setCssText(elementStyleString);
CSSStyleDeclarationImpl ruleStyleDeclaration = (CSSStyleDeclarationImpl)rule.getStyle();
for(Property p : ruleStyleDeclaration.getProperties()){
elementStyleDeclaration.addProperty(p);
}
String cssText = elementStyleDeclaration.getCssText();
element.attr("style", cssText);
}
private void removeClasses(Document document){
Elements elements = document.getElementsByAttribute("class");
elements.removeAttr("class");
}
Maybe it's possible to improve it further by using a CSS parser like https://github.com/phax/ph-css?
Related
Very new to JSoup, trying to retrieve a changeable value that is stored within an <a> tag, specifically from the following website and HTML.
Snapshot of HTML
The results after "constituency/" are changeable and depend on the user's input. I am able to retrieve the h2 tags themselves but not the information within. At the moment the best return I can get is just the tags, using the method below.
The desired return would be something that I can substring down into
Dublin Bay South
The actual return is
<well.col-md-4.h2></well.col-md-4.h2>
private String jSoupTDRequest(String aLine1, String aLine3) throws IOException {
String constit = "";
String h2 = "h2";
String url = "https://www.whoismytd.com/search?utf8=✓&form-input="+aLine1+"%2C+"+aLine3+"+Ireland";
//Switch to try catch if time
Document doc = Jsoup.connect(url)
.timeout(6000).get();
//Scrape elements from relevant section
Elements body = doc.select("well.col-md-4.h2");
Element e = new Element("well.col-md-4.h2");
constit = e.toString();
return constit;
I am extremely new to JSoup and scraping in general. Would appreciate any input from someone who knows what they're doing or any alternate ways to try and get the desired result
Change your scraping elements from relevant section code as follows:
Select the very first <div class="well"> element first.
Element tdsDiv = doc.select("div.well").first();
Select the very first <a> link element next. This link points to the constituency.
Element constLink = tdsDiv.select("a").first();
Get the constituency name by grabbing this link's text content.
constit = constLink.text();
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;
import java.io.IOException;
@DisplayName("JSoup, how to return data from a dynamic <a href> tag")
class JsoupQuestionTest {
private static final String URL = "https://www.whoismytd.com/search?utf8=%E2%9C%93&form-input=Kildare%20Street%2C%20Dublin%2C%20Ireland";
@Test
void findSomeText() throws IOException {
String expected = "Dublin Bay South";
Document document = Jsoup.connect(URL).get();
String actual = document.getElementsByAttributeValue("href", "/constituency/dublin-bay-south").text();
Assertions.assertEquals(expected, actual);
}
}
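The test above hard-codes an already-encoded URL, while the method in the question concatenates raw user input and a literal check mark into the query string. A small sketch of encoding the parameters with the JDK, assuming the same two address lines as in the question (SearchUrl is a hypothetical helper name):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class SearchUrl {

    // Builds the search URL from the two address lines used in the question.
    static String build(String line1, String line3) {
        String address = line1 + ", " + line3 + " Ireland";
        return "https://www.whoismytd.com/search"
                + "?utf8=" + URLEncoder.encode("\u2713", StandardCharsets.UTF_8) // the check mark
                + "&form-input=" + URLEncoder.encode(address, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(build("Kildare Street", "Dublin"));
    }
}
```

URLEncoder produces form encoding, so spaces become "+" rather than "%20"; that matches the "%2C+" separators the question already uses.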
So I've been browsing around the source code / documentation for POI (specifically XWPF) and I can't seem to find anything that relates to editing a hyperlink in a .docx. I only see functionality to get the information for the currently set hyperlink. My goal is to change the hyperlink in a .docx to link to "http://yahoo.com" from "http://google.com" as an example. Any help would be greatly appreciated. Thanks!
I found a way to edit the url of the link in a "indirect way" (copy the previous hyperlink, modify the url, delete the previous hyperlink and add the new one in the paragraph).
Code is shown below:
private void editLinksOfParagraph(XWPFParagraph paragraph, XWPFDocument document) {
for (int rIndex = 0; rIndex < paragraph.getRuns().size(); rIndex++) {
XWPFRun run = paragraph.getRuns().get(rIndex);
if (run instanceof XWPFHyperlinkRun) {
// get the url of the link to edit it
XWPFHyperlink link = ((XWPFHyperlinkRun) run).getHyperlink(document);
String linkURL = link.getURL();
//get the xml representation of the hyperlink that includes all the information
XmlObject xmlObject = run.getCTR().copy();
linkURL += "-edited-link"; //edited url of the link, f.e add a '-edited-link' suffix
//remove the previous link from the paragraph
paragraph.removeRun(rIndex);
//add the new hyperlinked with updated url in the paragraph, in place of the previous deleted
XWPFHyperlinkRun hyperlinkRun = paragraph.insertNewHyperlinkRun(rIndex, linkURL);
hyperlinkRun.getCTR().set(xmlObject);
}
}
}
This requirement needs knowledge about how hyperlinks referring to an external reference get stored in Microsoft Word documents and how this gets represented in XWPF of Apache POI.
The XWPFHyperlinkRun is the representation of a linked text run in a IRunBody. This text run, or even multiple text runs, is/are wrapped with a XML object of type CTHyperlink. This contains a relation ID which points to a relation in the package relations part. This package relation contains the URI which is the hyperlink's target.
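To make that indirection concrete: in the document XML the hyperlink element only carries a relation Id (an r:id attribute), and the URL itself lives in the relations part (word/_rels/document.xml.rels). The sketch below looks up a target by Id using only the JDK's DOM parser. The sample XML, including the Id rId4, is made up for illustration, and RelsLookup is a hypothetical name; this is not how Apache POI does it internally.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class RelsLookup {

    // Illustrative content of a relations part; the Id and Target are made up.
    static final String RELS =
            "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">"
            + "<Relationship Id=\"rId4\""
            + " Type=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink\""
            + " Target=\"http://google.com\" TargetMode=\"External\"/>"
            + "</Relationships>";

    // Returns the Target of the relationship with the given Id, or null if none matches.
    static String targetOf(String relsXml, String relationId) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(relsXml)));
            NodeList rels = doc.getElementsByTagName("Relationship");
            for (int i = 0; i < rels.getLength(); i++) {
                Element rel = (Element) rels.item(i);
                if (relationId.equals(rel.getAttribute("Id"))) {
                    return rel.getAttribute("Target");
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read relations XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(targetOf(RELS, "rId4"));
    }
}
```

Changing a link target therefore means changing the Target of that relationship, not anything in the text runs.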
Currently (Apache POI 5.2.2) XWPFHyperlinkRun provides access to an XWPFHyperlink, but this is very rudimentary. It only has getters for the Id and the URI. It provides neither access to its XWPFHyperlinkRun and its IRunBody nor a setter for the target URI in the package relations part. It does not even have internal access to its package relations part.
So, using only Apache POI classes, the only possibility currently is to delete the old XWPFHyperlinkRun and create a new one pointing to the new URI. But as the text runs also contain the text formatting, deleting them also deletes the text formatting. It would have to be copied from the old XWPFHyperlinkRun to the new one before deleting the old one. That's inconvenient.
So the rudimentary XWPFHyperlink should be extended to provide a setter for the target URI in the package relations part. A new class XWPFHyperlinkExtended could look like so:
import org.apache.poi.xwpf.usermodel.*;
import org.apache.poi.openxml4j.opc.PackageRelationship;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
/**
* Extended XWPF hyperlink class
* Provides access to its Id, URI, XWPFHyperlinkRun, IRunBody.
* Provides setting target URI in PackageRelationship.
*/
public class XWPFHyperlinkExtended {
private String id;
private String uri;
private XWPFHyperlinkRun hyperlinkRun;
private IRunBody runBody;
private PackageRelationship rel;
public XWPFHyperlinkExtended(XWPFHyperlinkRun hyperlinkRun, PackageRelationship rel) {
this.id = rel.getId();
this.uri = rel.getTargetURI().toString();
this.hyperlinkRun = hyperlinkRun;
this.runBody = hyperlinkRun.getParent();
this.rel = rel;
}
public String getId() {
return this.id;
}
public String getURI() {
return this.uri;
}
public IRunBody getIRunBody() {
return this.runBody;
}
public XWPFHyperlinkRun getHyperlinkRun() {
return this.hyperlinkRun;
}
/**
* Provides setting target URI in PackageRelationship.
* The old PackageRelationship gets removed.
* A new PackageRelationship gets added using the same Id.
*/
public void setTargetURI(String uri) {
this.runBody.getPart().getPackagePart().removeRelationship(this.getId());
this.uri = uri;
PackageRelationship rel = this.runBody.getPart().getPackagePart().addExternalRelationship(uri, XWPFRelation.HYPERLINK.getRelation(), this.getId());
this.rel = rel;
}
}
It does not extend XWPFHyperlink as this is so rudimentary it's not worth it. Furthermore after setTargetURI the String uri needs to be updated. But there is no setter in XWPFHyperlink and the field is only accessible from inside the package.
The new XWPFHyperlinkExtended can be got from XWPFHyperlinkRun like so:
/**
* If this HyperlinkRun refers to an external reference hyperlink,
* return the XWPFHyperlinkExtended object for it.
* May return null if no PackageRelationship found.
*/
/*modifiers*/ XWPFHyperlinkExtended getHyperlink(XWPFHyperlinkRun hyperlinkRun) {
try {
for (org.apache.poi.openxml4j.opc.PackageRelationship rel : hyperlinkRun.getParent().getPart().getPackagePart().getRelationshipsByType(XWPFRelation.HYPERLINK.getRelation())) {
if (rel.getId().equals(hyperlinkRun.getHyperlinkId())) {
return new XWPFHyperlinkExtended(hyperlinkRun, rel);
}
}
} catch (org.apache.poi.openxml4j.exceptions.InvalidFormatException ifex) {
// do nothing, simply do not return something
}
return null;
}
Once we have that XWPFHyperlinkExtended we can set a new target URI using its method setTargetURI.
A further problem results from the fact that the XML object of type CTHyperlink can wrap multiple text runs, not only one. Then multiple XWPFHyperlinkRun objects are in one CTHyperlink and point to one target URI. For example, this could look like:
... [this is a link to example.com]->https://example.com ...
This results in 6 XWPFHyperlinkRuns in one CTHyperlink linking to https://example.com.
This leads to problems when link text needs to be changed when the link target changes. The text of all the 6 text runs is the link text. So which text run shall be changed?
The best I have found is a method which sets the text of the first text run in the CTHyperlink.
/**
* Sets the text of the first text run in the CTHyperlink of this XWPFHyperlinkRun.
* Tries solving the problem when a CTHyperlink contains multiple text runs.
* Then the String value is set in first text run only. All other text runs are set empty.
*/
/*modifiers*/ void setTextInFirstRun(XWPFHyperlinkRun hyperlinkRun, String value) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTHyperlink ctHyperlink = hyperlinkRun.getCTHyperlink();
for (int r = 0; r < ctHyperlink.getRList().size(); r++) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR ctR = ctHyperlink.getRList().get(r);
for (int t = 0; t < ctR.getTList().size(); t++) {
org.openxmlformats.schemas.wordprocessingml.x2006.main.CTText ctText = ctR.getTList().get(t);
if (r == 0 && t == 0) {
ctText.setStringValue(value);
} else {
ctText.setStringValue("");
}
}
}
}
There the String value is set in first text run only. All other text runs are set empty. The text formatting of the first text run remains.
This works, but it needs some more steps to get the text formatting right:
try (var fis = new FileInputStream(fileName);
var doc = new XWPFDocument(fis)) {
var pList = doc.getParagraphs();
for (var p : pList) {
var runs = p.getRuns();
for (int i = 0; i < runs.size(); i++) {
var r = runs.get(i);
if (r instanceof XWPFHyperlinkRun) {
var run = (XWPFHyperlinkRun) r;
var link = run.getHyperlink(doc);
// To get text: link for checking
System.out.println(run.getText(0) + ": " + link.getURL());
// how i replace it
var run1 = p.insertNewHyperlinkRun(i, "http://google.com");
run1.setText(run.getText(0));
// remove the old link
p.removeRun(i + 1);
}
}
}
try (var fos = new FileOutputStream(outFileName)) {
doc.write(fos);
}
}
I'm using these libraries:
implementation 'org.apache.poi:poi:5.2.2'
implementation 'org.apache.poi:poi-ooxml:5.2.2'
I am trying to scrape the Top Stories section in google news for all the titles. In order to only get the titles in the Top Stories section, I must narrow into this tag:
<div class="section top-stories-section" id=":2r">..</div>
This is the code I use (in Eclipse):
public static void main(String[] args) throws IOException {
// fetches & parses HTML
String url = "http://news.google.com";
Document document = Jsoup.connect(url).get();
// Extract data
Element topStories = document.getElementById(":2r");
Elements titles = topStories.select("span.titletext");
// Output data
for (Element title : titles) {
System.out.println("Title: " + title.text());
}
}
I always seem to be getting a NullPointerException. It doesn't work either, when I try to reach the Top Stories like this:
Element topStories = document.select("#:2r").first();
Am I missing something? Shouldn't this be working? I am relatively new to this, please help and thank you!
Judging from the error message (and actually looking at the page) that div tag doesn't contain an id attribute. Instead you could select based on the CSS class
Element topStories = document.select("div.section.top-stories-section").first();
Hello everybody. I want to get the data from
http://sansoyunlari.hurriyet.com.tr/SayisalLoto/SayisalLotoSonuclari.aspx using jsoup. I can get the results, but only the latest ones. There is a drop-down list on the website that contains dates; how can I reach the other dates? By the way, I will move this code to Android later; for now it is written in NetBeans. I will put a drop-down list in my Android program that gets the dates from this address, along with the results.
This is the Java code I have written so far:
public static void main(String[] args) {
String adres = "http://sansoyunlari.hurriyet.com.tr/SayisalLoto/SayisalLotoSonuclari.aspx";
ArrayList sayi = new ArrayList<>();
sayi.add("six");
sayi.add("five");
sayi.add("four");
sayi.add("three");
sayi.add("two");
sayi.add("one");
//Sayısal Loto
try {
Document doc = Jsoup.connect(adres).get();
Elements sonuclar = doc.select("div.hurriyet2010_so_sanstopu_no_bg");
//the 1st one has to be fetched manually in the first approach, because its id is coded as "resut"
Elements sonuclar1 = doc.select("span#_ctl0_ContentPlaceHolder1_lblresut"+sayi.get(sayi.size()-1));
Element numaralar = sonuclar1.first();
System.out.println(numaralar.text());
//approach 1 for getting the numbers
for (int i = sonuclar.size();i>1;i--)
{
sonuclar1 = doc.select("span#_ctl0_ContentPlaceHolder1_lblresult"+sayi.get(i-2));
Element numaralar1 = sonuclar1.first();
System.out.println(numaralar1.text());
}
//approach 2 for getting the numbers
// for(Element el : sonuclar)
// {
// System.out.println(el.text());
// }
//for the number of winners and the prize amount
for(int i = 0;i<4;i++)
{
int b = 6 -i;
System.out.println("Number of people who matched " + b + ":");
sonuclar = doc.select("span#_ctl0_ContentPlaceHolder1_lblluckycount"+sayi.get(i));
Element el = sonuclar.first();
System.out.println(el.text());
System.out.println("Prize per person:");
sonuclar = doc.select("span#_ctl0_ContentPlaceHolder1_lblluckyamount"+sayi.get(i));
el = sonuclar.first();
System.out.println(el.text());
}
}
catch(Exception e){
}
}
To get the select item you should do:
Element select = doc.select("#_ctl0_ContentPlaceHolder1_ddlSayisalLotoDates").first();
Now the children of this element are the "option" items you want:
for (Element e : select.children()) {
String date = e.text();
}
edit
I looked at the html source. In order to get the right page you need to do a post request at the URL "http://sansoyunlari.hurriyet.com.tr/SayisalLoto/SayisalLotoSonuclari.aspx" with following params:
__EVENTARGUMENT = empty
__EVENTTARGET = _ctl0$ContentPlaceHolder1$ddlSayisalLotoDates
__EVENTVALIDATION = a random value that you get from the html page
__LASTFOCUS = empty
__VIEWSTATE = another random value
_ctl0:ContentPlaceHolder1:ddlSayisalLotoDates = The ID of the date you want to search (i.e. 884 for 19 Ekim 2013)
txtSearch = can be empty
As you can see, it's quite annoying scraping an ASP.NET webpage..
Use an application like Fiddler (or another one) to find the params you need to post (hidden inputs, session cookies, your selected input). Probably you're missing some of them.
Hope it helps.
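As a rough sketch of what the POST body could look like, the snippet below form-encodes the parameters listed above with the JDK only. The values for __VIEWSTATE and __EVENTVALIDATION have to be read from the hidden inputs of a previously fetched page; here they are plain method arguments, and AspNetForm is a hypothetical helper name.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class AspNetForm {

    // Collects the form fields listed in the answer, in the same order.
    static Map<String, String> fields(String viewState, String eventValidation, String dateId) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("__EVENTARGUMENT", "");
        fields.put("__EVENTTARGET", "_ctl0$ContentPlaceHolder1$ddlSayisalLotoDates");
        fields.put("__EVENTVALIDATION", eventValidation);
        fields.put("__LASTFOCUS", "");
        fields.put("__VIEWSTATE", viewState);
        fields.put("_ctl0:ContentPlaceHolder1:ddlSayisalLotoDates", dateId);
        fields.put("txtSearch", "");
        return fields;
    }

    // Encodes the fields as an application/x-www-form-urlencoded body.
    static String encode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // "VS" and "EV" are placeholders for the values scraped from the page.
        System.out.println(encode(fields("VS", "EV", "884")));
    }
}
```

With jsoup, the same map can be passed along via Jsoup.connect(url).data(map).post(), which takes care of the encoding itself.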
I have this html code that I need to parse
<a class="sushi-restaurant" href="/greatSushi">Best Sushi in town</a>
I know there's an example for jsoup that you can get all links in a page,e.g.
Elements links = doc.select("a[href]");
for (Element link : links) {
print(" * a: <%s> (%s)", link.attr("abs:href"),
trim(link.text(), 35));
}
but I need a piece of code that can return me the href for that specific class.
Thanks guys
You can select elements by class. This example finds elements with the class sushi-restaurant, then gets the absolute URL of the first result.
Make sure that when you parse the HTML, you specify the base URL (where the document was fetched from) to allow jsoup to determine what the absolute URL of a link is.
public static void main(String[] args) {
String html = "<a class=\"sushi-restaurant\" href=\"/greatSushi\">Best Sushi in town</a>";
Document doc = Jsoup.parse(html, "http://example.com/");
// find all <a class="sushi-restaurant">...
Elements links = doc.select("a.sushi-restaurant");
Element link = links.first();
// 'abs:' makes "/greatSushi" = "http://example.com/greatSushi":
String url = link.attr("abs:href");
System.out.println("url = " + url);
}
Shorter version:
String url = doc.select("a.sushi-restaurant").first().attr("abs:href");
Hope this helps!
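To see why the base URL matters, the resolution step itself can be tried with java.net.URI from the JDK. This is only an illustration of how a relative href combines with a base; it is not jsoup's implementation, and AbsoluteUrl is a hypothetical name.

```java
import java.net.URI;

final class AbsoluteUrl {

    // Resolves a possibly relative href against the URL the document came from.
    static String resolve(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // prints http://example.com/greatSushi
        System.out.println(resolve("http://example.com/", "/greatSushi"));
    }
}
```

Without a base there is nothing to resolve "/greatSushi" against, which is why the base URL has to be supplied when parsing from a string.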
Elements links = doc.select("a");
for (Element link : links) {
String attribute = link.attr("class");
// compare against the class from the question
if (attribute.equalsIgnoreCase("sushi-restaurant")) {
System.out.println(link.attr("abs:href")); // absolute URL of the link
}
}