I have this JavaScript source code from a website.
<script>"#context": "http://schema.org/","#type": "Product","name": "Shower head","image": "https://example.com/jpeg.png","description": "Hello stackoverflow","url": "link.com","offers": {"#type": "Offer","priceCurrency": "USD","price": "10.00","itemCondition": "http://schema.org/NewCondition","availability": "http://schema.org/InStock","url": "MyUrl.com","availableAtOrFrom": {"#type": "Place","name": "Geneva, NY","geo": {"#type": "GeoCoordinates","latitude": "42.8361","longitude": "-76.9874"}},"seller": {"#type": "Person","name": "Edward"}}}</script>
And I'm trying to use this Jsoup code to extract the seller name, "name": "Edward", from the last part:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupCrawler {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://example.com").userAgent("Mozilla/17.0").get();
            // "script.name" selects <script> elements with class="name", which matches nothing here
            Elements temp = doc.select("script.name");
            int i = 0;
            for (Element nameList : temp) {
                i++;
                System.out.println(i + " " + nameList.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Can somebody help me with this, or is it impossible?
JSoup interprets HTML. The contents of the <script> element are JavaScript (here, JSON data), so JSoup cannot parse what is inside the <script> element.
It looks as if the content of the <script> element is formatted as JSON. So you could use JSoup to get to the content of the <script> element, and then feed this string into a JSON-parsing library. Look here if you want to dive into that: How to parse JSON in Java
If this is a one-off and you can trust that the contents of the <script> element do not change too much, you may also use regular expressions to get to the desired part. However, I would recommend using a JSON library.
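For the snippet above, a minimal sketch combining both steps could look like this. It assumes the org.json library is on the classpath and that the JSON-LD block is the first <script> on the page; on a real page you would usually target it with doc.select("script[type=application/ld+json]") instead.

import org.json.JSONObject;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SellerNameExtractor {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com").get();
        // data() returns the raw content of the script element, unlike text()
        String json = doc.select("script").first().data();
        JSONObject product = new JSONObject(json);
        String sellerName = product.getJSONObject("offers")
                                   .getJSONObject("seller")
                                   .getString("name");
        System.out.println(sellerName); // prints "Edward"
    }
}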
I have a program I am writing that takes user input to connect to a site, downloads its HTML into a text file, and retrieves data from a table twice a day. I understand the code will not be one-size-fits-all for any page (I will likely "hardwire" the URL into the code once I get it working). My issue presently is that my Jsoup parser isn't properly reading in the tabular data. I'm not sure if my element selectors are too generic? The table looks like it is in standard table/tr/td format, but my rows array populates with size 0. If someone could help me debug my parser and possibly provide some suggestions on where to look for making it grab data silently twice a day, I'd really appreciate it! No runtime/compile errors, just incorrect output.
Source site: https://www.cnbc.com/us-markets/
Source code for table (snippet):
<table class="BasicTable-table"><thead class="BasicTable-tableHeading BasicTable-tableHeadingSortable"><tr><th class="BasicTable-textData"><span>SYMBOL <span class="icon-sort undefined"></span></span></th><th class="BasicTable-numData"><span>PRICE <span class="icon-sort undefined"></span></span></th><th class="BasicTable-numData">
My code:
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class StockScraper {
    public static void main(String[] args) {
        Scanner input = new Scanner(System.in);
        System.out.println("Enter the complete url (including http://) of the site you would like to parse:");
        String html = input.nextLine();
        try {
            Document doc = Jsoup.connect(html).get();
            System.out.printf("Title: %s%n", doc.title());
            System.out.println("Writing html contents to 'html.txt'...");
            // Save html contents to text file
            PrintWriter outputfile = new PrintWriter("html.txt");
            outputfile.print(doc.outerHtml());
            outputfile.close();
            // Select stock data you want to retrieve
            System.out.println("Enter the name of the stock you want to check");
            String name = input.nextLine();
            // Pull data from CNBC Markets
            Element table = doc.select("table").get(0);
            Elements rows = table.select("tr");
            System.out.println(rows.size());
            for (int i = 1; i < rows.size(); i++) {
                Element row = rows.get(i);
                // select cells from the current row, not from the whole rows list
                Elements col = row.select("td");
                if (col.get(0).text().equals(name)) {
                    System.out.println("I worked!");
                    System.out.println(col.get(1).text());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
The problem here is that this site is a dynamic page that loads content after the browser initially downloads the page. Jsoup is not going to be adequate to scrape pages like this. A couple of options you have:
1) Use a tool that simulates a browser and makes all the necessary API calls. A couple of options are Selenium WebDriver or HtmlUnit.
2) Figure out the API calls you are interested in on this site, and call those APIs directly to get a JSON document you can parse (see the sketch after the sample response below). You can see API details by opening developer tools in your browser and looking at the Network tab. For this site an example would be the following, which includes the stock quote for .DJI:
https://quote.cnbc.com/quote-html-webservice/quote.htm?noform=1&partnerId=2&fund=1&exthrs=0&output=json&symbolType=issue&symbols=599362|579435|593933|49020635|49031016|5093160|617254|601065&requestMethod=extended
Returns:
ExtendedQuoteResult: {
xmlns: "http://quote.cnbc.com/services/MultiQuote/2006",
ExtendedQuote: [{
QuickQuote: {
symbol: ".DJI",
code: "0",
curmktstatus: "REG_MKT",
FundamentalData: {
yrlodate: "2020-03-23",
yrloprice: "18213.65",
yrhidate: "2020-02-12",
yrhiprice: "29568.57"
},
mappedSymbol: {
xsi:nil: "true"
},
source: "Exchange",
cnbcId: "599362",
prev_prev_closing: "21413.44",
high: "22783.45",
low: "21693.63",
provider: "CNBC Quote Cache",
streamable: "0",
last_time: "2020-04-06T17:16:28.000-0400",
countryCode: "US",
previous_day_closing: "21052.53",
altName: "Dow Industrials",
reg_last_time: "2020-04-06T17:16:28.000-0400",
last_time_msec: "1586207788000",
altSymbol: ".DJI",
change_pct: "7.73",
providerSymbol: ".DJI",
assetSubType: "Index",
comments: "RIC",
last: "22679.99",
issue_id: "599362",
cacheServed: "false",
responseTime: "Mon Apr 06 19:12:09 EDT 2020",
change: "1627.46",
timeZone: "EDT",
onAirName: "Dow Industrials",
symbolType: "issue",
assetType: "INDEX",
volume: "614200990",
fullVolume: "614200990",
realTime: "true",
name: "Dow Jones Industrial Average",
quoteDesc: { },
exchange: "Dow Jones Global Indexes",
shortName: "DJIA",
cachedTime: "Mon Apr 06 19:12:09 EDT 2020",
currencyCode: "USD",
open: "21693.63"
}
}
...
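Here is a minimal sketch of option 2. The endpoint, the single symbol id 599362 for .DJI, and the response structure are taken from the example above and may change at any time; org.json is assumed for parsing.

import org.json.JSONObject;
import org.jsoup.Jsoup;

public class QuoteFetcher {
    public static void main(String[] args) throws Exception {
        String api = "https://quote.cnbc.com/quote-html-webservice/quote.htm"
                + "?noform=1&partnerId=2&output=json&symbolType=issue"
                + "&symbols=599362&requestMethod=extended";
        // ignoreContentType lets Jsoup fetch a non-HTML body; body() returns the raw JSON
        String json = Jsoup.connect(api).ignoreContentType(true).execute().body();
        JSONObject quote = new JSONObject(json)
                .getJSONObject("ExtendedQuoteResult")
                .getJSONArray("ExtendedQuote")
                .getJSONObject(0)
                .getJSONObject("QuickQuote");
        System.out.println(quote.getString("name") + ": " + quote.getString("last"));
    }
}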
I am trying to parse HTML that comes to me as a giant String. When I get to Line 13, NodeChild page = it.parent(),
I am able to find the key that I am looking for, but the data comes to me like This Is Value One In My KeyThis is Value Two in my KeyThis is Value Three In My Key and so on. I see a recurring trend where the separator between two values is always UppercaseUppercase (withoutSpaces).
I would like to put it into an ArrayList one way or another. Is there a method that I am missing from the docs that can do this automatically? Is there a better way to parse this?
class htmlParsingStuff {
    // Parser here is presumably org.ccil.cowan.tagsoup.Parser, which lets XmlSlurper read non-well-formed HTML
    private def slurper = new XmlSlurper(new Parser())

    private void slurpItUp(String rawHTMLString) {
        ArrayList urlList = []
        def htmlParser = slurper.parseText(rawHTMLString)
        htmlParser.depthFirst().findAll {
            // Loop through all of the HTML tags to get to the key that I am looking for
            // EDIT: I see that I am able to iterate through the parent object, I just need a way to figure out how to get into that object
            boolean trigger = it.text() == 'someKey'
            if (trigger) {
                // I found the key that I am looking for
                NodeChild page = it.parent()
                page = page.replace('someKey', '')
                LazyMap row = ["page": page, "type": "Some Type"]
                urlList.add(row)
            }
        }
    }
}
I can't provide you with working code since I don't know your specific HTML.
But: don't use XmlSlurper for parsing HTML. HTML is usually not well-formed, and therefore XmlSlurper is not the right tool for the job.
For HTML use a library like JSoup. You will find it much easier to use, especially if you have some jQuery knowledge. Since you didn't post your HTML snippet, I made up my own example:
@Grab(group='org.jsoup', module='jsoup', version='1.10.1')
import org.jsoup.Jsoup
def html = """
<html>
<body>
<table>
<tr><td>Key 1</td></tr>
<tr><td>Key 2</td></tr>
<tr><td>Key 3</td></tr>
<tr><td>Key 4</td></tr>
<tr><td>Key 5</td></tr>
</table>
</body>
</html>"""
def doc = Jsoup.parse(html)
def elements = doc.select('td')
def result = elements.collect {it.text()}
// contains ['Key 1', 'Key 2', 'Key 3', 'Key 4', 'Key 5']
To manipulate the document you would use

import org.jsoup.nodes.Element
import org.jsoup.parser.Tag

def doc = Jsoup.parse(html)
def elements = doc.select('td')
elements.each { oldElement ->
    def newElement = new Element(Tag.valueOf('td'), '')
    newElement.text('Another key')
    oldElement.replaceWith(newElement)
}
println doc.outerHtml()
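As for the UppercaseUppercase separator described in the question: once you have the concatenated text out of the document, one way to break it apart is a zero-width lookaround split. A minimal Java sketch with a made-up sample string:

import java.util.Arrays;

public class SeparatorSplit {
    public static void main(String[] args) {
        // Split wherever a lowercase letter is immediately followed by an uppercase
        // letter; the lookbehind/lookahead consume no characters, so values stay intact.
        String joined = "This Is Value One In My KeyThis Is Value Two In My Key";
        String[] values = joined.split("(?<=[a-z])(?=[A-Z])");
        System.out.println(Arrays.asList(values));
        // [This Is Value One In My Key, This Is Value Two In My Key]
    }
}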
I am working with the Play framework (2.4) for Java. I want to pass a JSONObject to JavaScript used inside one of the Play view templates.
On the Java side I prepare the JSONObject, like so:
(Keep in mind that this is a test vehicle.)
public static Result showBusinesses(){
List<Item> list = new ArrayList<Item>();
Item r = new Item();
r.id = "23234";
r.name = "Joes hardware";
Item s = new Item();
s.id = "23254";
s.name = "Martys collision";
list.add(r);
list.add(s);
return ok(views.html.wheel.render(getJSONObject(list)));
}
public static JSONObject getJSONObject(List<Item> list){
JSONObject jsonObject = new JSONObject();
try{
for (int i = 0; i < list.size(); i++) {
jsonObject.put(list.get(i).id, list.get(i).name);
}
    } catch (JSONException e) {
        // swallowed in this test vehicle; log it in real code
    }
return jsonObject;
}
In my Play template, I accept the JSONObject parameter:
@(item : org.json.JSONObject)
@import helper._
@import helper.twitterBootstrap._
@import play.api.libs.json.Json
...
So far, so good.
Until I attempt to use the object in my JavaScript:
If I place my object, @item, anywhere in the template besides inside the JavaScript, I get this:
{"23254":"Martys Pancakes","23234":"Joes place"};
which looks like a properly formed var to me.
I am inserting the JSONObject into the JavaScript like this:
<script type="text/javascript">
businesses = @item;
and I expect that to translate like this:
businesses = {
"23332" : "Joe's hardware",
"56755" : "Marty's collision"
};
And yet the object does not behave as expected. I suspect that I am not passing the parameter to the javascript in the correct way.
Can anyone enlighten me? Thanks.
I found the answer to my own question. It ended up being fairly simple. First of all, you don't need to mess with JSON. You pass a standard Java List to the Play template. Then you iterate through that list inside the Javascript variable curly braces. Here is the template code:
@(businesses: List[Business])
@import helper._
@import helper.twitterBootstrap._
...
<script type="text/javascript">
places = {
@for((item, index) <- businesses.zipWithIndex) {
    @if(index != businesses.size - 1) {
        "@item.id" : "@Html(item.name)",
    } else {
        "@item.id" : "@Html(item.name)"
    }
}
};
I used the built-in zipWithIndex because I needed commas separating every line but the last. The @Html() was needed so that special characters in the names (such as apostrophes) are rendered as-is rather than HTML-escaped. Once the JavaScript runs, you end up with your variable:
places = {
"345" : "Joe's Hardware",
"564" : "Jan's Party Store"
}
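For completeness, the controller side then just passes the plain Java list straight through. A minimal sketch; the Business class with a two-argument constructor is hypothetical, so adapt it to your model:

import java.util.ArrayList;
import java.util.List;
import play.mvc.Controller;
import play.mvc.Result;

public class Application extends Controller {
    public static Result showBusinesses() {
        List<Business> businesses = new ArrayList<>();
        businesses.add(new Business("345", "Joe's Hardware"));
        businesses.add(new Business("564", "Jan's Party Store"));
        // hand the list to the template; no JSON conversion needed
        return ok(views.html.wheel.render(businesses));
    }
}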
I am creating an app in Java that will take all the information from a public website and load it into the app for people to read, using jsoup. I was trying the same kind of function with Facebook, but it wasn't working the same way. Does anyone have a good idea about how I should go about this?
Thanks,
Calland
public String scrapeEvents(String... args) throws Exception {
Document doc = Jsoup.connect("http://www.facebook.com/cedarstreettimes?fref=ts").get();
Elements elements = doc.select("div._wk");
String s = elements.toString();
return s;
}
Edit: I found this link of information, but I'm a little confused about how to use it to get only the content that the specific user posts on their wall. http://developers.facebook.com/docs/getting-started/graphapi/
I had a look at the source of that page -- the thing that is tripping up the parse is that all the real content is wrapped in comments, like this:
<code class="hidden_elem" id="u_0_42"><!-- <div class="fbTimelineSection ...> --></code>
There is JS on the page that lifts that data into the real DOM, but as jsoup doesn't execute JS it stays as comments. So before extracting the content, we need to emulate that JS and "un-hide" those elements. Here's an example to get you started:
String url = "https://www.facebook.com/cedarstreettimes?fref=ts";
String ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.33 (KHTML, like Gecko) Chrome/27.0.1438.7 Safari/537.33";
Document doc = Jsoup.connect(url).userAgent(ua).timeout(10*1000).get();
// move the hidden commented out html into the DOM proper:
Elements hiddenElements = doc.select("code.hidden_elem");
for (Element hidden: hiddenElements) {
for (Node child: hidden.childNodesCopy()) {
if (child instanceof Comment) {
hidden.append(((Comment) child).getData()); // comment data parsed as html
}
}
}
Elements articles = doc.select("div[role=article]");
for (Element article: articles) {
if (article.select("span.userContent").size() > 0) {
String text = article.select("span.userContent").text();
String imgUrl = article.select("div.photo img").attr("abs:src");
System.out.println(String.format("%s\n%s\n\n", text,imgUrl));
}
}
That example pulls out the article text and any photo that is associated with it.
(It's probably better to use the FB API than this method; I wanted to show how you can emulate little bits of JS to make a scrape work properly.)
So I've tried searching and searching on how to do this, but I keep seeing a lot of complicated answers for what I need. I basically am using the Flurry Analytics API to return some XML from an HTTP request, and this is what it returns.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<eventMetrics type="Event" startDate="2011-2-28" eventName="Tip Calculated" endDate="2011-3-1" version="1.0" generatedDate="3/1/11 11:32 AM">
<day uniqueUsers="1" totalSessions="24" totalCount="3" date="2011-02-28"/>
<day uniqueUsers="0" totalSessions="0" totalCount="0" date="2011-03-01"/>
<parameters/>
</eventMetrics>
All I want to get is that totalCount number, which is 3, as an int or String in Java. I've looked at the different DOM and SAX methods, and they seem to grab the text between the tags rather than values inside the tag itself. Is there some way I can just grab totalCount from within the tag?
Thanks,
Update
I found this URL - http://www.androidpeople.com/android-xml-parsing-tutorial-%E2%80%93-using-domparser/
That helped me, considering it was for Android. But I thank everyone who responded for helping me out. I checked out every answer, and each helped a little bit in understanding what's going on. However, now I can't seem to grab the XML from my URL because it requires an HTTP POST first to then get the XML. When it goes to grab the XML from my URL, it just says file not found.
Update 2
I got it all sorted out reading it in now and getting the xml from Flurry Analytics (for reference if anyone stumbles upon this question)
HTTP request for XML file
totalCount is what we call an attribute. If you're using the org.w3c.dom API, you call getAttribute("totalCount") on the appropriate element.
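For instance, a minimal org.w3c.dom sketch, assuming the XML above has been saved to a local file named metrics.xml (a made-up name):

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class TotalCountDom {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("metrics.xml");
        // take the first <day> element and read its totalCount attribute
        Element day = (Element) doc.getElementsByTagName("day").item(0);
        int totalCount = Integer.parseInt(day.getAttribute("totalCount"));
        System.out.println(totalCount); // 3
    }
}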
If you are using a SAX handler, override the startElement callback method to access the attributes:
public void startElement (String uri, String name, String qName, Attributes atts)
{
    if ("day".equals(qName)) {
        String total = atts.getValue("totalCount");
    }
}
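A minimal driver to wire that callback up might look like this, again assuming a local metrics.xml:

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TotalCountSax {
    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse("metrics.xml", new DefaultHandler() {
            @Override
            public void startElement(String uri, String name, String qName, Attributes atts) {
                if ("day".equals(qName)) {
                    // attributes are available right in the callback
                    System.out.println("totalCount = " + atts.getValue("totalCount"));
                }
            }
        });
    }
}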
A JDOM example. Note the use of SAXBuilder to load the document.
URL httpSource = new URL("some url string");
Document document = new SAXBuilder().build(httpSource);
Iterator<?> elements = document.getDescendants(new DayFilter());
while (elements.hasNext()) {
    Element e = (Element) elements.next();
    // do something more useful with it than this
    String total = e.getAttributeValue("totalCount");
}

class DayFilter implements Filter {
    public boolean matches(Object obj) {
        // match the <day> elements that carry the totalCount attribute
        return obj instanceof Element && ((Element) obj).getName().equals("day");
    }
}
I think that the simplest way is to use XPath; below is an example based on vtd-xml.
import com.ximpleware.*;

public class test {
    public static void main(String[] args) throws Exception {
        // select the totalCount attribute of the <day> elements
        String xpathExpr = "/eventMetrics/day/@totalCount";
        VTDGen vg = new VTDGen();
        if (vg.parseHttpUrl("http://localhost/test.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot();
            ap.selectXPath(xpathExpr);
            ap.bind(vn);
            System.out.println("total count " + (int) ap.evalXPathToNumber());
        }
    }
}