Get a part of a webpage using JSOUP

I am trying to programmatically search Google for a word's meaning and save that meaning to a file on my computer. I have successfully requested the page and received the response as a Document (org.jsoup.nodes.Document), but I do not know how to extract only the word meaning from it. The screenshot shows the part of the data that I need.
The response HTML is so large that I can't work out which element holds the data I want. Please help. Here is what I have done so far:
public class Search {
private static Pattern patternDomainName;
private Matcher matcher;
private static final String DOMAIN_NAME_PATTERN
= "([a-zA-Z0-9]([a-zA-Z0-9\\-]{0,61}[a-zA-Z0-9])?\\.)+[a-zA-Z]{2,6}";
static {
patternDomainName = Pattern.compile(DOMAIN_NAME_PATTERN);
}
public static void main(String[] args) {
Search obj = new Search();
Set<String> result = obj.getDataFromGoogle("debug%20meaning");
for(String temp : result){
System.out.println(temp);
}
System.out.println(result.size());
}
public String getDomainName(String url){
String domainName = "";
matcher = patternDomainName.matcher(url);
if (matcher.find()) {
domainName = matcher.group(0).toLowerCase().trim();
}
return domainName;
}
private Set<String> getDataFromGoogle(String query) {
Set<String> result = new HashSet<String>();
String request = "https://www.google.com/search?q=" + query + "&num=20";
System.out.println("Sending request..." + request);
try {
// need http protocol, set this as a Google bot agent :)
Document doc = Jsoup
.connect(request)
.userAgent(
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
.timeout(5000).get();
/********** Here comes my data fetching logic *****************
* Don't know where to find my desired data in such a big HTML
*/
/*
String sc = doc.html().replaceAll("\\n", "");
System.out.println(doc.html());
*/
} catch (IOException e) {
e.printStackTrace();
}
return result;
}
}

The Google Dictionary API is deprecated!
Instead of scraping the Google search results page, which is what you are doing now, you can get the same data from the http://google-dictionary.so8848.com/ service, which should be easier to scrape.

Related

How to extract the text from the web site?

I'm working on code that parses a weather site.
I found a CSS class on the site that holds the data I need. How do I extract the date from it as a string, for example "Tue, Oct 12"?
public class Pars {
private static Document getPage() throws IOException {
String url = "https://www.gismeteo.by/weather-mogilev-4251/3-day/";
Document page = Jsoup.parse(new URL(url), 3000);
return page;
}
public static void main(String[] args) throws IOException {
Document page = getPage();
Element Nameday = page.select("div [class=date date-2]").first();
String date = Nameday.select("div [class=date date-2").text();
System.out.println(Nameday);
}
}
The code is meant to parse the weather site. On the page I found the class that contains only the date and day of the week I need, but the program crashes with an error when I try to convert the selected element to a string.

The problem is the class selector. To match a div that has both classes, it should be: div.date.date-2
Working code example:
public class Pars {
private static Document getPage() throws IOException {
String url = "https://www.gismeteo.by/weather-mogilev-4251/3-day/";
return Jsoup.parse(new URL(url), 3000);
}
public static void main(String[] args) throws IOException {
Document page = getPage();
Element dateDiv = page.select("div.date.date-2").first();
if(dateDiv != null) {
String date = dateDiv.text();
System.out.println(date);
}
}
}
Here is an answer to your problem: Jsoup select div having multiple classes
In future, please make sure your question is more detailed and well structured. Here is the "asking questions" guideline: https://stackoverflow.com/help/how-to-ask

Extract values from xml file using Java

My response contains the XML below, and I want to retrieve bEntityID="328" from it:
<?xml version="1.0" encoding="UTF-8"?>
<ns2:aResponse xmlns:ns2="http://www.***.com/F1/F2/F3/2011-09-11">
<createBEntityResponse bEntityID="328" />
</ns2:aResponse>
I am trying this, but I get null:
System.out.println("bEntitytID="+XmlPath.with(response.asString())
.getInt("aResponse.createBEntityResponse.bEntityID"));
Any suggestions for getting bEntityID from this response?

I don't recommend using string splitting or regex to get element values, but if you really need a quick fix, try the code below:
public class xmlValue {
public static void main(String[] args) {
String xml = "<ns2:aResponse xmlns:ns2=\"http://www.***.com/F1/F2/F3/2011-09-11\">\n" +
" <createBEntityResponse bEntityID=\"328\" />\n" +
"</ns2:aResponse>";
System.out.println(getTagValue(xml,"createBEntityResponse bEntityID"));
}
public static String getTagValue(String xml, String tagName){
// Split on the literal "tag attribute" text passed in, rather than a hard-coded string
String[] s = xml.split(java.util.regex.Pattern.quote(tagName));
String [] valuesBetweenQuotes = s[1].split("\"");
return valuesBetweenQuotes[1];
}
}
Output: 328
Note: a better solution is to use an XML parser.
This will fetch the first tag value:
public static String getTagValue(String xml, String tagName){
return xml.split("<"+tagName+">")[1].split("</"+tagName+">")[0];
}
Another way is to use jsoup:
Document doc = Jsoup.parse(xml, "", Parser.xmlParser()); //parse the whole xml doc
for (Element e : doc.select("tagName")) {
System.out.println(e); //select the specific tag and prints
}
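The note above recommends an XML parser over string splitting. As a minimal sketch of that idea, the JDK's built-in DOM parser can read the attribute directly. The class name XmlAttributeReader and the helper readAttribute are made up for this illustration, and the namespace URI is a placeholder rather than the one from the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, not part of any library
public class XmlAttributeReader {

    // Returns the value of the given attribute on the first element with the
    // given tag name, or null if the document has no such element.
    public static String readAttribute(String xml, String tagName, String attribute) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            if (nodes.getLength() == 0) {
                return null;
            }
            return ((Element) nodes.item(0)).getAttribute(attribute);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder namespace URI, assumed for the example
        String xml = "<ns2:aResponse xmlns:ns2=\"http://example.com/ns\">"
                + "<createBEntityResponse bEntityID=\"328\" />"
                + "</ns2:aResponse>";
        System.out.println(readAttribute(xml, "createBEntityResponse", "bEntityID")); // prints 328
    }
}
```

Unlike the split-based version, this does not depend on the exact spacing or attribute order in the response.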

I think the best way is to deserialize the XML into a POJO and then read the value:
entityResponse.getEntityId();

I tried with the same XML file and was able to get the value of bEntityID with the following code. Hope it helps.
@Test
public void xmlPathTests() {
try {
File xmlExample = new File(System.getProperty("user.dir"), "src/test/resources/Data1.xml");
String xmlContent = FileUtils.readFileToString(xmlExample);
XmlPath xmlPath = new XmlPath(xmlContent).setRoot("aResponse");
System.out.println(" Entity ::" + xmlPath.getInt("createBEntityResponse.@bEntityID"));
assertEquals(328, xmlPath.getInt("createBEntityResponse.@bEntityID"));
} catch (Exception e) {
e.printStackTrace();
}
}

Getting versionCode and VersionName from Google Play

I am looking for a way to get an app's versionCode and versionName from Google Play, given its package name, from a Java application running on a PC.
I have seen https://androidquery.appspot.com/, but it no longer works. https://code.google.com/archive/p/android-market-api/ also started causing problems and then stopped working, and it requires a device ID.
Can you help me with a simple solution or API for this?
I need both versionCode and versionName. The versionName is relatively easy to get by parsing the HTML of the app's Google Play page; the versionCode is the important part.

There is no official Google Play API for this. The Play Store uses an internal protobuf API which is undocumented and not open. IMHO, you could:
use an open source library that reverse engineers the API
scrape APK download sites that have already extracted this information (most likely via the same protobuf Google Play API)
Note that there is a Google Play Developer API, but it does not let you list your APKs, versions, or apps. It is essentially used to manage app distribution, reviews, edits, etc.
Google play internal API
play-store-api Java library
This library uses the Google Play Store protobuf API (an undocumented, closed API) and requires an email/password to generate a token that can be reused to call the API:
GplaySearch googlePlayInstance = new GplaySearch();
DetailsResponse response = googlePlayInstance.getDetailResponse("user@gmail.com",
"password", "com.facebook.katana");
AppDetails appDetails = response.getDocV2().getDetails().getAppDetails();
System.out.println("version name : " + appDetails.getVersionString());
System.out.println("version code : " + appDetails.getVersionCode());
using this method:
public DetailsResponse getDetailResponse(String email,
String password,
String packageName) throws IOException, ApiBuilderException {
// A device definition is required to log in
// See resources for a list of available devices
Properties properties = new Properties();
try {
properties.load(getClass().getClassLoader().getSystemResourceAsStream("device-honami" +
".properties"));
} catch (IOException e) {
System.out.println("device-honami.properties not found");
return null;
}
PropertiesDeviceInfoProvider deviceInfoProvider = new PropertiesDeviceInfoProvider();
deviceInfoProvider.setProperties(properties);
deviceInfoProvider.setLocaleString(Locale.ENGLISH.toString());
// Provide valid google account info
PlayStoreApiBuilder builder = new PlayStoreApiBuilder()
.setDeviceInfoProvider(deviceInfoProvider)
.setHttpClient(new OkHttpClientAdapter())
.setEmail(email)
.setPassword(password);
GooglePlayAPI api = builder.build();
// We are logged in now
// Save and reuse the generated auth token and gsf id,
// unless you want to get banned for frequent relogins
api.getToken();
api.getGsfId();
// API wrapper instance is ready
return api.details(packageName);
}
device-honami.properties is a device property file that is required to identify device characteristics. Some sample device.properties files are available here
The OkHttpClientAdapter can be found here
Dependencies used to run this example :
allprojects {
repositories {
...
maven { url 'https://jitpack.io' }
}
}
dependencies {
compile 'com.github.yeriomin:play-store-api:0.19'
compile 'com.squareup.okhttp3:okhttp:3.8.1'
}
Scrape third-party APK download sites
http://apk-dl.com
You could get the version name and version code from http://apk-dl.com (unofficial, of course) by scraping the page for the required package name with jsoup:
String packageName = "com.facebook.katana";
Document doc = Jsoup.connect("http://apk-dl.com/" + packageName).get();
Elements data = doc.select(".file-list .mdl-menu__item");
if (data.size() > 0) {
System.out.println("full text : " + data.get(0).text());
Pattern pattern = Pattern.compile("(.*)\\s+\\((\\d+)\\)");
Matcher matcher = pattern.matcher(data.get(0).text());
if (matcher.find()) {
System.out.println("version name : " + matcher.group(1));
System.out.println("version code : " + matcher.group(2));
}
}
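The regular expression in the snippet above can be tried in isolation, without any network call. The sketch below assumes the scraped text has the form "versionName (versionCode)"; the sample string "1.2.3 (45)" is made up, and the class name VersionTextParser is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only
public class VersionTextParser {

    // Same pattern as in the snippet above: free text, whitespace, digits in parentheses
    private static final Pattern VERSION = Pattern.compile("(.*)\\s+\\((\\d+)\\)");

    // Returns {versionName, versionCode}, or null when the text does not match
    public static String[] parse(String text) {
        Matcher matcher = VERSION.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("1.2.3 (45)"); // made-up sample text
        System.out.println("version name : " + parts[0]); // prints version name : 1.2.3
        System.out.println("version code : " + parts[1]); // prints version code : 45
    }
}
```

Returning null for non-matching text keeps the caller responsible for the case where the site changes its layout and the expected text is no longer there.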
https://apkpure.com
Another possibility is scraping https://apkpure.com:
String packageName = "com.facebook.katana";
Elements data = Jsoup.connect("https://apkpure.com/search?q=" + packageName)
.userAgent("Mozilla")
.get().select(".search-dl .search-title a");
if (data.size() > 0) {
Elements data2 = Jsoup.connect("https://apkpure.com" + data.attr("href"))
.userAgent("Mozilla")
.get().select(".faq_cat dd p");
if (data2.size() > 0) {
System.out.println(data2.get(0).text());
Pattern pattern = Pattern.compile("Version:\\s+(.*)\\s+\\((\\d+)\\)");
Matcher matcher = pattern.matcher(data2.get(0).text());
if (matcher.find()) {
System.out.println("version name : " + matcher.group(1));
System.out.println("version code : " + matcher.group(2));
}
}
}
https://api-apk.evozi.com
Also, https://api-apk.evozi.com has an internal JSON API, but:
sometimes it doesn't work (it returns "Ops, APK Downloader got access denied when trying to download"), mostly for less popular apps
it has a mechanism in place against scraping bots (a random token generated in JS with a random variable name)
FWIW, the following returns the version name and code from https://api-apk.evozi.com:
String packageName = "com.facebook.katana";
String data = Jsoup.connect("https://apps.evozi.com/apk-downloader")
.userAgent("Mozilla")
.execute().body();
String token = "";
String time = "";
Pattern varPattern = Pattern.compile("dedbadfbadc:\\s+(\\w+),");
Pattern timePattern = Pattern.compile("t:\\s+(\\w+),");
Matcher varMatch = varPattern.matcher(data);
Matcher timeMatch = timePattern.matcher(data);
if (varMatch.find()) {
Pattern tokenPattern = Pattern.compile("\\s*var\\s*" + varMatch.group(1) + "\\s*=\\s*'(.*)'.*");
Matcher tokenMatch = tokenPattern.matcher(data);
if (tokenMatch.find()) {
token = tokenMatch.group(1);
}
}
if (timeMatch.find()) {
time = timeMatch.group(1);
}
HttpClient httpclient = HttpClients.createDefault();
HttpPost httppost = new HttpPost("https://api-apk.evozi.com/download");
List<NameValuePair> params = new ArrayList<NameValuePair>();
params.add(new BasicNameValuePair("t", time));
params.add(new BasicNameValuePair("afedcfdcbdedcafe", packageName));
params.add(new BasicNameValuePair("dedbadfbadc", token));
params.add(new BasicNameValuePair("fetch", "false"));
httppost.setEntity(new UrlEncodedFormEntity(params, "UTF-8"));
HttpResponse response = httpclient.execute(httppost);
JsonElement element = new JsonParser().parse(EntityUtils.toString(response.getEntity()));
JsonObject result = element.getAsJsonObject();
if (result.has("version") && result.has("version_code")) {
System.out.println("version name : " + result.get("version").getAsString());
System.out.println("version code : " + result.get("version_code").getAsInt());
} else {
System.out.println(result);
}
Implementation
You could implement this on a backend of your own that communicates directly with your Java application. That way you can maintain the retrieval of the version code/name in one place if one of the above methods fails.
If you are only interested in your own apps, a cleaner solution would be:
to set up a backend which stores the current version name / version code of all your apps
to have all developers/publishers in your company share a publish task (a Gradle task) which uses the Google Play Developer API to publish the APK and also calls your backend to store the version code / version name entry when the app is published. The main goal is to automate the whole publication, including storage of the app metadata on your side.

Apart from using jsoup, we can alternatively use pattern matching to get the app version from the Play Store.
To match the current markup from the Google Play Store, i.e.
<div class="BgcNfc">Current Version</div><span class="htlgb"><div><span class="htlgb">X.X.X</span></div>
we first match the above node sequence and then extract the version value from it. Below is a code snippet that does this:
private String getAppVersion(String patternString, String inputString) {
try{
//Create a pattern
Pattern pattern = Pattern.compile(patternString);
if (null == pattern) {
return null;
}
//Match the pattern string in provided string
Matcher matcher = pattern.matcher(inputString);
if (null != matcher && matcher.find()) {
return matcher.group(1);
}
}catch (PatternSyntaxException ex) {
ex.printStackTrace();
}
return null;
}
// new URL(...) and openConnection() throw checked exceptions, so declare them
private String getPlayStoreAppVersion(String appUrlString) throws IOException {
final String currentVersion_PatternSeq = "<div[^>]*?>Current\\sVersion</div><span[^>]*?>(.*?)><div[^>]*?>(.*?)><span[^>]*?>(.*?)</span>";
final String appVersion_PatternSeq = "htlgb\">([^<]*)</s";
String playStoreAppVersion = null;
BufferedReader inReader = null;
URLConnection uc = null;
StringBuilder urlData = new StringBuilder();
final URL url = new URL(appUrlString);
uc = url.openConnection();
if(uc == null) {
return null;
}
uc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; WindowsNT 5.1; en-US; rv1.8.1.6) Gecko/20070725 Firefox/2.0.0.6");
inReader = new BufferedReader(new InputStreamReader(uc.getInputStream()));
if (null != inReader) {
String str = "";
while ((str = inReader.readLine()) != null) {
urlData.append(str);
}
}
// Get the current version pattern sequence
String versionString = getAppVersion (currentVersion_PatternSeq, urlData.toString());
if(null == versionString){
return null;
}else{
// get version from "htlgb">X.X.X</span>
playStoreAppVersion = getAppVersion (appVersion_PatternSeq, versionString);
}
return playStoreAppVersion;
}
This is how I solved it. Hope that helps.

I found jsoup too slow and inefficient for this; a short, easy alternative uses pattern matching:
public class PlayStoreVersionChecker {
public String playStoreVersion = "0.0.0";
private final String appId; // package name of the app to look up
OkHttpClient client = new OkHttpClient();
public PlayStoreVersionChecker(String appId) {
this.appId = appId;
}
private String execute(String url) throws IOException {
okhttp3.Request request = new Request.Builder()
.url(url)
.build();
Response response = client.newCall(request).execute();
return response.body().string();
}
public String getPlayStoreVersion() {
try {
String html = execute("https://play.google.com/store/apps/details?id=" + appId + "&hl=en");
Pattern blockPattern = Pattern.compile("Current Version.*([0-9]+\\.[0-9]+\\.[0-9]+)</span>");
Matcher blockMatch = blockPattern.matcher(html);
if(blockMatch.find()) {
Pattern versionPattern = Pattern.compile("[0-9]+\\.[0-9]+\\.[0-9]+");
Matcher versionMatch = versionPattern.matcher(blockMatch.group(0));
if(versionMatch.find()) {
playStoreVersion = versionMatch.group(0);
}
}
} catch (IOException e) {
e.printStackTrace();
}
return playStoreVersion;
}
}

public class Store {
private Document document;
private final static String baseURL = "https://play.google.com/store/apps/details?id=";
public static void main(String[] args) {
}
public Store(String packageName) {
try {
document = Jsoup.connect(baseURL + packageName).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:64.0) Gecko/20100101 Firefox/64.0").get();
} catch (IOException ex) {
ex.printStackTrace();
}
}
public String getTitle() {
return document.select("h1.AHFaub > span").text();
}
public String getDeveloper() {
return document.selectFirst("span.UAO9ie > a").text();
}
public String getCategory() {
Elements elements = document.select("span.UAO9ie > a");
for (Element element : elements) {
if (element.hasAttr("itemprop")) {
return element.text();
}
}
return null;
}
public String getIcon() {
return document.select("div.xSyT2c > img").attr("src");
}
public String getBigIcon() {
return document.select("div.xSyT2c > img").attr("srcset").replace(" 2x", "");
}
public List<String> getScreenshots() {
List<String> screenshots = new ArrayList<>();
Elements img = document.select("div.u3EI9e").select("button.Q4vdJd").select("img");
for (Element src : img) {
if (src.hasAttr("data-src")) {
screenshots.add(src.attr("data-src"));
} else {
screenshots.add(src.attr("src"));
}
}
return screenshots;
}
public List<String> getBigScreenshots() {
List<String> screenshots = new ArrayList<>();
Elements img = document.select("div.u3EI9e").select("button.Q4vdJd").select("img");
for (Element src : img) {
if (src.hasAttr("data-src")) {
screenshots.add(src.attr("data-srcset").replace(" 2x", ""));
} else {
screenshots.add(src.attr("srcset").replace(" 2x", ""));
}
}
return screenshots;
}
public String getDescription() {
return document.select("div.DWPxHb > span").text();
}
public String getRatings() {
return document.select("div.BHMmbe").text();
}
}
Imports
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
This script will return
Category (Personalization for example)
Developer Name
App Icon
App Name
Screenshots (Thumbnail and Full preview)
Description
You can also check the full source code here

How to remove status 203 error in parsing Google-trends html response?

I am getting Google Trends data from the HTML response returned by a URL, and I parse that response with the jsoup library. It only works 3-4 times, though; after that, requests start failing with a status 203 error. Each day I can run this code 3-4 times before I get this exception. What should I do?
My code is:
public class HTMLParser {
private static HashMap<String, HashMap<String, String>> hostcokkies = new HashMap<String, HashMap<String,String>>();
public static ArrayList<HotTrends> getYouTubeTrendings()
{
Document document;
ArrayList<HotTrends> list = new ArrayList<HotTrends>();
HotTrends trends = null;
try {
document = Jsoup.connect("http://www.google.com/trends/fetchComponent?geo=IN&date=today+12-m&gprop=youtube&cmpt=q&cid=TOP_QUERIES_0_0").get();
Elements links = document.select("a[href]");
for(Element link : links){
trends = new HotTrends();
trends.setWord(link.text());
list.add(trends);
}
}
catch(Exception e)
{
e.printStackTrace();
}
return list;
}
public static void main(String args[])
{
ArrayList<HotTrends> hotTrends = new ArrayList<HotTrends>();
hotTrends = HTMLParser.getYouTubeTrendings();
for(HotTrends trends : hotTrends)
{
System.out.println(trends.getWord());
}
}
}

I had the same problem. Just change your IP address, either manually or using a program (I use Hotspot Shield, http://www.hotspotshield.com/). It works for me.

List HTML tags from a String

I have a String, and I want to list all the HTML tags present in it. Is there a library available for this?
Any information would be very helpful.

You can use the below code to extract only the HTML tags from your String.
package com.overflow.stack;
/**
*
* @author sarath_sivan
*/
public class ExtractHtmlTags {
public static void getHtmlTags(String html) {
// Scan left to right and print each "<...>" span in order
int beginIndex = html.indexOf("<");
while (beginIndex != -1) {
int endIndex = html.indexOf(">", beginIndex + 1);
if (endIndex == -1) {
break; // unterminated tag: stop instead of looping forever
}
System.out.println(html.substring(beginIndex, endIndex + 1));
beginIndex = html.indexOf("<", endIndex + 1);
}
}
public static void main(String[] args) {
String html = "<html><body><h2>List HTML tags from a String</h2>hello<br /></body></html>";
ExtractHtmlTags.getHtmlTags(html);
}
}
But, I don't understand what you are trying to do with the extracted HTML tags. Good luck!
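If only the distinct tag names are needed rather than every raw tag, a variation with java.util.regex might look like the sketch below. The class name TagNameLister is made up for this example, and a regex like this is only a rough approximation: it does not handle comments, scripts, or a "<" inside attribute values the way a real HTML parser would.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only
public class TagNameLister {

    // Captures the name of an opening or closing tag, e.g. "h2" in "<h2>" or "</h2>"
    private static final Pattern TAG = Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)");

    // Returns the distinct tag names in order of first appearance, lowercased
    public static Set<String> tagNames(String html) {
        Set<String> names = new LinkedHashSet<>();
        Matcher matcher = TAG.matcher(html);
        while (matcher.find()) {
            names.add(matcher.group(1).toLowerCase());
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<html><body><h2>List HTML tags from a String</h2>hello<br /></body></html>";
        System.out.println(tagNames(html)); // prints [html, body, h2, br]
    }
}
```

A LinkedHashSet is used so that duplicates are dropped while the order of first appearance is kept.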

You can try http://jsoup.org/
I'm not sure it can list the tags directly, but you can get the list by iterating over the DOM.

The parser from HtmlUnit can take a String and return a structured result:
http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/html/HTMLParser.html

page = Nokogiri::HTML(open('http://yoursite.com'))
page.css("*").map{|x| x.name}.flatten.uniq
