I have the following code. I am using the jsoup library to retrieve the URLs from a website; after that, I check whether each URL contains the keyword I want and collect the matching ones in another string. My problem is that I am not able to retrieve only one URL.
Have a look at my code:
// Get the webpage and parse it.
org.jsoup.nodes.Document doc = Jsoup.connect("http://www.examplepage").get();
// Get the anchors with an href attribute.
// Or, you can use doc.select("a") to get all the anchors.
org.jsoup.select.Elements links = doc.select("a[href]");
// Iterate over all the links and collect their absolute URLs, one per line.
String scrapedlinks = "";
String scrapedlinks3 = "";
for (org.jsoup.nodes.Element link : links) {
    scrapedlinks += link.attr("abs:href") + "\n";
}
// Split the collected URLs and keep the ones containing the search term.
String[] links2 = scrapedlinks.split("\n");
for (String newlink : links2) {
    if (newlink.contains("mysearchterm")) {
        scrapedlinks3 += newlink;
        String[] scrapedlines = scrapedlinks3.split("\n");
    }
}
I think it will be easier if you store your URLs directly in an ArrayList:
ArrayList<String> urls = new ArrayList<String>();
for (org.jsoup.nodes.Element link : links)
    urls.add(link.attr("abs:href"));
After this you can easily access them with
urls.get(i);
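A minimal sketch of how such a list could then be filtered for the keyword, assuming the URLs have already been collected as above. The class name UrlFilter, the method filterByKeyword and the sample URLs are hypothetical, and jsoup is left out so that the snippet runs on its own:

```java
import java.util.ArrayList;
import java.util.List;

class UrlFilter {
    // Returns only the URLs that contain the given keyword (case-sensitive).
    static List<String> filterByKeyword(List<String> urls, String keyword) {
        List<String> matches = new ArrayList<>();
        for (String url : urls) {
            if (url.contains(keyword)) {
                matches.add(url);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in for the list built from link.attr("abs:href").
        List<String> urls = new ArrayList<>();
        urls.add("http://www.examplepage/news/mysearchterm-1");
        urls.add("http://www.examplepage/about");
        urls.add("http://www.examplepage/mysearchterm/archive");

        List<String> matches = filterByKeyword(urls, "mysearchterm");
        for (String match : matches) {
            System.out.println(match);
        }
        // A single URL is then available by index, e.g. matches.get(0).
    }
}
```

Keeping the matches in a list rather than in one concatenated string avoids the split step entirely, so each URL can be read back individually.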
I am a complete beginner to web scraping. I have followed a couple of tutorials online, but I can't seem to get it to work with Premier League results.
Here is the exact link I've tried scraping from: https://www.premierleague.com/results
My goal is to read all the home teams and away teams as well as their results (1-1 etc.). If anyone could help I would really appreciate it! Below is the code I've tried so far:
First attempt
String element = doc.select("div.fixtures__matches-list span.competitionLabel1").first().text();
Second attempt
Elements elements = doc.select("div.fixtures__matches-list");
Elements matches = doc.getElementsByClass("matchList");
Element ULElement = matches.get(0);
Elements childElements = ULElement.children();
for (Element e : childElements) {
String first = e.select("ul.matchList").select("li.matchFixtureContainer data-home").text();
System.out.println(e.text());
}
Third attempt
Elements test = doc.getElementsByClass("fixtures");
Element firstE = test.get(0);
System.out.println(firstE.text());
for (Element e : test) {
System.out.println(e.text());
}
Fourth attempt
Elements names = doc.select("data-home");
for (Element name : names) {
System.out.println(name.text());
}
Fifth attempt
String webUrl = "https://www.premierleague.com/results";
Document doc = null;
try {
doc = Jsoup.connect(webUrl).timeout(6000).get();
}
catch(IOException e) {
e.printStackTrace();
}
Elements body = doc.select("div.tabbedContent");
for (Element e : body) {
String data = e.select("div.col-12 section.fixtures div.fixtures__matches-list ul.matchList").text();
}
I really can't figure it out.
Hello people of the internet,
We're having the following problem with the Stanford NLP API:
We have a String that we want to transform into a list of sentences.
First, we used String sentenceString = Sentence.listToString(sentence); but listToString does not return the original text because of the tokenization. Now we tried to use listToOriginalTextString in the following way:
private static List<String> getSentences(String text) {
Reader reader = new StringReader(text);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
List<String> sentenceList = new ArrayList<String>();
for (List<HasWord> sentence : dp) {
String sentenceString = Sentence.listToOriginalTextString(sentence);
sentenceList.add(sentenceString.toString());
}
return sentenceList;
}
This does not work. Apparently we have to set an attribute "invertible" to true, but we don't know how to. How can we do this?
In general, how do you use listToOriginalTextString properly? What preparations do you need?
sincerely,
Khayet
If I understand correctly, you want to get the mapping of tokens to the original input text after tokenization. You can do it like this:
//split via PTBTokenizer (PTBLexer)
List<CoreLabel> tokens = PTBTokenizer.coreLabelFactory().getTokenizer(new StringReader(text)).tokenize();
//do the processing using stanford sentence splitter (WordToSentenceProcessor)
WordToSentenceProcessor processor = new WordToSentenceProcessor();
List<List<CoreLabel>> splitSentences = processor.process(tokens);
//for each sentence
for (List<CoreLabel> s : splitSentences) {
//for each word
for (CoreLabel token : s) {
//here you can get the token value and position like;
//token.value(), token.beginPosition(), token.endPosition()
}
}
String sentenceStr = sentence.get(CoreAnnotations.TextAnnotation.class);
This gives you the original text. An example from the JSONOutputter.java file:
l2.set("id", sentence.get(CoreAnnotations.SentenceIDAnnotation.class));
l2.set("index", sentence.get(CoreAnnotations.SentenceIndexAnnotation.class));
l2.set("sentenceOriginal",sentence.get(CoreAnnotations.TextAnnotation.class));
l2.set("line", sentence.get(CoreAnnotations.LineNumberAnnotation.class));
I am using Appium and I want to print the names of the elements in a list.
I am using the following code:
List<WebElement> list = getDriver().findElementsByXPath(getLocator(Locators.MY_ITEM));
List<String> strings = new ArrayList<>();
for (WebElement object : list) {
String text = object.getText();
logger.info(text);
if (!text.isEmpty())
strings.add(text);
}
But the text I get is always empty.
What is the suggested approach here?
Note: each element is of type UIACollectionCell on iOS, and on Android the locator is //android.widget.TextView[@text='%s']
From what I understand, you should be getting the text from the text attribute. Replace:
String text = object.getText();
with:
String text = object.getAttribute("text");
I have a form which I have to read with jsoup. It contains several fields, including checkboxes and comboboxes (select inputs).
I am reading their values with the following code:
Element campaignForm = doc.getElementById("Campaign");
Elements allInputFields = campaignForm.getElementsByTag("input");
Elements allSelections = campaignForm.getElementsByTag("select");
Map<String, String> postData = new HashMap<String, String>();
for(Element selectField:allSelections){
postData.put(selectField.attr("name"), selectField.attr("value"));
}
for(Element inputField:allInputFields){
if(inputField.attr("type").equalsIgnoreCase("checkbox")){
postData.put(inputField.attr("name"), inputField.attr("checked").equalsIgnoreCase("checked")?"1":"0");
}else{
postData.put(inputField.attr("name"), inputField.attr("value"));
}
}
When I print the postData map, it gives correct values for text input fields, but for checkboxes and dropdowns (comboboxes) it is not working. Please let me know if there is a different way to handle checkboxes and select inputs in jsoup.
EDIT:
I got the checkboxes working with the help of a comment, but the select inputs are still not working.
Thanks in advance.
I got it working with the following code:
for(Element selectField:allSelections){
    String nameField = selectField.attr("name");
    String valueField = "";
    // Use the value of the option that is marked as selected.
    Elements allOptions = selectField.getElementsByTag("option");
    for(Element opt:allOptions){
        if(opt.attr("selected").equalsIgnoreCase("selected")){
            valueField = opt.attr("value");
            break;
        }
    }
    postData.put(nameField, valueField);
}
for(Element inputField:allInputFields){
    if(inputField.attr("type").equalsIgnoreCase("checkbox")){
        postData.put(inputField.attr("name"), inputField.attr("checked").equalsIgnoreCase("checked")?"1":"0");
    }else{
        postData.put(inputField.attr("name"), inputField.attr("value"));
    }
}
I am trying to scrape a list of medicines from a website.
I am using jsoup to parse the HTML.
Here is my code:
URL url = new URL("http://www.medindia.net/drug-price/index.asp?alpha=a");
Document doc1 = Jsoup.parse(url, 0);
Elements rows = doc1.getElementsByAttributeValue("style", "padding-left:5px;border-right:1px solid #A5A5A5;");
for(Element row : rows){
String htm = row.text();
if(!(htm.equals("View Price")||htm.contains("Show Details"))) {
System.out.println(htm);
System.out.println();
}
}
Here is the output that I am getting:
P.S. This is not the complete output. I couldn't take a screenshot of all of it, so I have shown only part of it.
I need to know two things:
Question 1. Why am I getting an extra space in front of each drug name, and why am I getting an extra new line after some drug names?
Question 2. How do I resolve this issue?
A few things:
It's not the complete output because there's more than one page. I added a for loop that fixes that for you.
You should probably trim the output using htm.trim().
You should probably make sure not to print empty lines (!htm.isEmpty()).
That website contains an unusual character with code 160, which is a non-breaking space (U+00A0) rather than an ASCII character, so trim() does not remove it. I added a small fix that solves the problem (with .replace).
Here's the fixed code:
for(char page='a'; page <= 'z'; page++) {
String urlString = String.format("http://www.medindia.net/drug-price/index.asp?alpha=%c", page);
URL url = new URL(urlString);
Document doc1 = Jsoup.parse(url, 0);
Elements rows = doc1.getElementsByAttributeValue("style", "padding-left:5px;border-right:1px solid #A5A5A5;");
for(Element row : rows){
String htm = row.text().replace((char) 160, ' ').trim();
if(!(htm.equals("View Price")||htm.contains("Show Details"))&& !htm.isEmpty())
{
System.out.println(htm.trim());
System.out.println();
}
}
}
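To illustrate why the replace step matters, here is a small sketch that can be run on its own. The character with code 160 is the non-breaking space (U+00A0), which String.trim() leaves in place because it only strips characters up to U+0020. The class name NbspCleaner, the method clean and the sample text are hypothetical:

```java
class NbspCleaner {
    // Replaces non-breaking spaces (U+00A0) with ordinary spaces, then trims.
    static String clean(String text) {
        return text.replace('\u00A0', ' ').trim();
    }

    public static void main(String[] args) {
        // Hypothetical cell text with a leading and a trailing non-breaking space.
        String raw = "\u00A0Abacavir\u00A0";

        // trim() alone leaves U+00A0 in place, so the length stays at 10.
        System.out.println(raw.trim().length());
        // After the replacement, only the 8 letters of the name remain.
        System.out.println(clean(raw).length());
    }
}
```

A cell that contains nothing but a non-breaking space becomes an empty string after clean, which is why the isEmpty() check is enough to skip it.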
Do one thing:
Use the trim function in the print statement: System.out.println(htm.trim());
UPDATED:
After a lot of effort I was able to parse all 80 medicines like this:
URL url = new URL("http://www.medindia.net/drug-price/index.asp?alpha=a");
Document doc1 = Jsoup.parse(url, 0);
Elements rows = doc1.select("td.ta13blue");
Elements rows1 = doc1.select("td.ta13black.tbold");
int cnt=0;
for(Element row : rows){
cnt++;
String htm = row.text().trim();
if(!(htm.equals("View Price")||htm.contains("Show Details") || htm.startsWith("Drug"))) {
System.out.println(cnt+" : "+htm);
System.out.println();
}
}
for(Element row1 : rows1){
cnt++;
String htm = row1.text().trim();
if(!(htm.equals("View Price")||htm.contains("Show Details") || htm.startsWith("Drug"))) {
System.out.println(cnt+" : "+htm);
System.out.println();
}
}
1) Selecting elements by their style attribute is quite dangerous;
2) Calling ROWS what is actually a list of FIELDS is even more dangerous :)
3) If you open the page, you can see that the extra lines are added ONLY after "black names", i.e. names of items that are not wrapped in an anchor link.
Your problem is that the second field in those rows is neither "Show Details" nor "View Price", and it is not empty either. It is:
<td bgcolor="#FFFFDB" align="center"
style="padding-left:5px;border-right:1px solid #A5A5A5;">
</td>
It is a one-space string. Modify your code like this:
for(Element row : rows){
    String htm = row.text().trim(); // <-- this one
    if(!
        (htm.equals("View Price")
        || htm.contains("Show Details")
        || htm.isEmpty()) // <-- and this one: after trim() a one-space string is empty
    ) {
        System.out.println(htm);
        System.out.println();
    }
}