I'm trying to create a simple project where the user inputs a URL and I fetch the relevant information (author, title, etc.) for a citation. The problem is that the Java URL library doesn't seem to fetch the entire page source. For example, I'll use the link https://www.cia.gov/library/publications/the-world-factbook/geos/jo.html as a reference. Here's the code I'm using:
import java.net.*;
import java.io.*;
import java.util.ArrayList;

public class URLTester
{
    private static URL url;

    public URLTester(URL u)
    {
        url = u;
    }

    // Reads every line of the page at 'url' into a list.
    public static ArrayList<String> getContents() throws Exception
    {
        BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
        String inputLine;
        ArrayList<String> arr = new ArrayList<String>();
        while ((inputLine = in.readLine()) != null)
        {
            arr.add(inputLine);
        }
        in.close();
        return arr;
    }

    public static void main(String[] args) throws Exception
    {
        url = new URL("https://www.cia.gov/library/publications/the-world-factbook/geos/jo.html");
        ArrayList<String> contents = getContents();
        for (int i = 0; i < contents.size(); i++)
        {
            System.out.println(contents.get(i));
        }
    }
}
This fetches what appears to be a shortened version of the page source for the target. When I use 'View Page Source' on the site, a much more complete version comes up, including information such as the date and the author of the article. I can't paste the source here because it would exceed the character limit. How can I get the entire page source instead of a shortened version?
The problem is due to exceeding the console's character limit.
The default limit in Eclipse is 80,000 characters.
To change the preference, go to Window -> Preferences.
Then find Run/Debug in the left menu.
Open it and choose Console.
Uncheck "Limit console output" or increase the limit as you want.
I am using Java code to remove HTML tags from a text file. But my requirement is that I want to access an Excel file using Java and remove the HTML tags from each row of a particular column. How can I access an Excel file using Java, and how can I integrate my tag-removing Java code into that...
import java.io.*;

public class Html2TextWithRegExp {

    private Html2TextWithRegExp() {}

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        BufferedReader br = new BufferedReader(new FileReader("java-new.txt"));
        String line;
        while ((line = br.readLine()) != null) {
            sb.append(line).append('\n'); // keep line breaks so words on adjacent lines don't run together
        }
        br.close();
        // Strip anything that looks like a tag (see the answer below for why this is fragile).
        String nohtml = sb.toString().replaceAll("\\<.*?>", "");
        System.out.println(nohtml);
        try (PrintWriter out = new PrintWriter("nohtml.txt")) {
            out.println(nohtml);
        }
    }
}
You can use jsoup. Then you can do this:
String noHTML = Jsoup.parse(sb.toString()).text();
Don't use regexes; HTML is not a regular language and you're unlikely to be able to deal with all the special cases that are bound to crop up.
I recently used this very method to clean up a bunch of forum posts that I was using for a machine-learning task, and it worked perfectly.
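For the Excel part of the question, here is a minimal sketch assuming Apache POI (4.x) and jsoup are on the classpath; input.xlsx, output.xlsx, and the column index 0 are placeholders, not names from the question:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.CellType;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
import org.jsoup.Jsoup;

public class ExcelHtmlCleaner {
    public static void main(String[] args) throws Exception {
        int column = 0; // placeholder: the column whose cells contain HTML
        try (FileInputStream in = new FileInputStream("input.xlsx");
             Workbook wb = WorkbookFactory.create(in)) {
            Sheet sheet = wb.getSheetAt(0);
            for (Row row : sheet) {
                Cell cell = row.getCell(column);
                if (cell != null && cell.getCellType() == CellType.STRING) {
                    // strip the tags with jsoup instead of a regex
                    cell.setCellValue(Jsoup.parse(cell.getStringCellValue()).text());
                }
            }
            try (FileOutputStream out = new FileOutputStream("output.xlsx")) {
                wb.write(out);
            }
        }
    }
}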
This is my code:
public double myMethod(String name)
{
    double result = 0.0;
    String path = "/Users/T/Desktop/Training/MyFolder/";
    int maxColumn = 0;
    BufferedReader br = null;
    ArrayList<Integer> findMaxColumn = new ArrayList<Integer>();
    String line = "";
    try
    {
        br = new BufferedReader(new FileReader(path + name));
        while ((line = br.readLine()) != null)
        {
            findMaxColumn.add(line.split(",").length); // count the columns on each line
        }
        maxColumn = getMaxNumber(findMaxColumn);
        CSVReader reader = new CSVReader(new FileReader(path + name));
        List<String[]> myData = reader.readAll();
        for (int i = 0; i < maxColumn; i++)
        {
            for (String[] lineData : myData)
            {
                String value = lineData[i];
The problem is: I have a CSV file (generated from another method and stored in MyFolder), and when I run this code I get an error, "ArrayIndexOutOfBoundsException: 1", at String value = lineData[i]. But if I open my CSV file and click the save button (or make some change to a value, e.g. 0 to 1) and close it before I run this code, then it runs fine. That's weird! Could anyone explain why I have to open the CSV file and make some change (just clicking save, or changing one value) to avoid the problem, and how to fix it?
Check that the encoding used when the file is saved is the same encoding you use when you read the file. It may well be that the file is saved as, for example, UTF-8 and you are reading it as if it were UTF-16.
This fits what you describe (if you open and save the file before reading it, then it works) as well as the "ArrayIndexOutOfBoundsException: 1", which shows that the split method did not find any separator (the comma) and thus returned a single string.
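If encoding is indeed the culprit, you can take the platform default out of the equation by naming the charset explicitly when you read. A minimal sketch (UTF-8 is only an example; use whatever encoding the generating method actually writes):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Drop-in replacement for: new BufferedReader(new FileReader(path + name))
BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream(path + name), StandardCharsets.UTF_8));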
It would also help to use a debugger to check what is in the array just before it is added to findMaxColumn after splitting. It is easier to debug if you use a local variable to store the result of the split before adding it:
while ((line = br.readLine()) != null)
{
    String[] splitResult = line.split(","); // easier to examine in the debugger
    findMaxColumn.add(splitResult.length);
}
How can I get the latest tweet from HTML content, either through regex or without any external libraries? I am happy to use external libraries; I would just prefer not to. I just want to know how it would be possible. I have written the HTML download part in Java, and if anyone wants I will post it here.
So I'll do a bit of pseudocode, so that I'm not only targeting Java developers. This is how my program looks so far:
1.) Load site("www.twitter.com/user123")
2.) Get initial string and write it to variable -> buffer
3.) Loop start
4.) Append string -> buffer
5.) If there is no more -> break
6.) Print buffer
Obviously the variable buffer will now hold the raw HTML content. How can I sort through this to get the tweet? I have found a way, but it is too inconsistent. What I managed was to find the string which held the tweets and extract the content surrounded by that code. However, there were too many changes in that section: some content inside it changes, like the font size. I could write multiple if statements, but is there a neater solution?
Let me just start off by saying that jsoup is an amazing lightweight HTML parsing library. You can use things like CSS selectors and whatnot. If you ever decide to use a library, jsoup will make your life a lot easier.
You can just query for the element with the class TweetTextSize, then get the text content. This will give you all the text, hashtags, and links (the downside being that pictures are also given as links).
Otherwise, you'll need to manually traverse the DOM. For example, use regex to find the beginning of the first TweetTextSize, and then just keep all text which is not between a < and a >.
Unfortunately, this second solution is volatile and may not work in the future, and you'll end up with a big glob of code which is overly complex and hard to debug.
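A minimal jsoup sketch of the first approach (the TweetTextSize class name is taken from above and may break whenever Twitter changes its markup; user123 is the placeholder user from the question):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LatestTweet {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the profile page in one step.
        Document doc = Jsoup.connect("https://twitter.com/user123").get();
        // CSS selector for the tweet text elements; first() assumes the
        // page lists tweets newest-first.
        Element tweet = doc.select("p.TweetTextSize").first();
        if (tweet != null) {
            System.out.println(tweet.text());
        }
    }
}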
A simple answer if you want a regex and not a sophisticated third-party library:
<p[^>]+js-tweet-text[^>]*>(.*?)</p>
Try the above on the "view-source" of https://twitter.com/a
Thanks.
EDIT:
Source Code:
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TweetSucker {

    public static void main(String[] args) throws Exception {
        URLConnection urlConnection = new URL("https://twitter.com/a").openConnection();
        InputStream inputStream = urlConnection.getInputStream();

        // Read the whole response into memory.
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int len;
        while ((len = inputStream.read(buffer)) != -1) {
            byteArrayOutputStream.write(buffer, 0, len);
        }
        inputStream.close();

        // Decode the bytes. Note: getContentEncoding() reports the
        // Content-Encoding header (e.g. gzip), not the charset, so the
        // charset has to come from the Content-Type header instead.
        String charset = "UTF-8"; // fallback
        String contentType = urlConnection.getContentType();
        if (contentType != null) {
            Matcher charsetMatcher = Pattern.compile("charset=([^;\\s]+)").matcher(contentType);
            if (charsetMatcher.find()) {
                charset = charsetMatcher.group(1);
            }
        }
        String htmlContent = new String(byteArrayOutputStream.toByteArray(), charset);

        // Pull each tweet's <p> body out of the raw HTML.
        // Non-greedy (.*?) so a single match cannot span two tweets.
        Pattern TWEET_PATTERN = Pattern.compile("(<p[^>]+js-tweet-text[^>]*>(.*?)</p>)", Pattern.CASE_INSENSITIVE);
        Matcher matcher = TWEET_PATTERN.matcher(htmlContent);
        while (matcher.find()) {
            System.out.println("Tweet Found: " + matcher.group(2));
        }
    }
}
I know that you don't want any libraries, but if you want something really quick, this is working code in C#:
using (IE browser = new IE())
{
    browser.GoTo("https://twitter.com/user");
    List tweets = browser.List(Find.ById("stream-items-id"));
    if (tweets != null)
    {
        foreach (var tweet in tweets.ListItems)
        {
            var tweetText = tweet.Paras.FirstOrDefault();
            if (tweetText != null)
            {
                MessageBox.Show(tweetText.Text);
            }
        }
    }
}
This program uses a library called WatiN. If you use Visual Studio, go to the Tools menu, select "NuGet Package Manager", then "Manage NuGet Packages for Solution", then select "Browse" and type "WatiN" in the search box. After you find the library, hit "Install". Once it is installed, you just add a reference in your code and then a using statement:
using WatiN.Core;
You can just copy and paste the code I wrote above into a button handler and it'll work; you only need to change the twitter.com/XXXXXX user name to list that user's tweets. Modify the code accordingly to meet your needs.
To start, no, this is not a homework assignment. I am fresh out of high school and am trying to do some personal projects before college. I've been trying to populate an ArrayList with elements from a document. The document looks like:
item1
item2
item3
...
itemN
After failing many times on my own, I tried different solutions from this website. Most recently, this one got me the closest to what I desire:
public static void main(String[] args) throws IOException {
    List<String> names = new ArrayList<String>();
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new FileReader("/Users/MyName/Desktop/names.txt"));
        String line = null;
        while ((line = reader.readLine()) != null) {
            names.add(line);
        }
    } finally {
        if (reader != null) { // avoid a NullPointerException if the file failed to open
            reader.close();
        }
    }
    for (int i = 0; i < names.size(); i++) {
        System.out.println(names.get(i));
    }
    //String[] array = names.toArray(new String[0]); // not necessary that it is in an array
}
The only problem is that this returns something rather ugly in the console:
{\rtf1\ansi\ansicpg1252\cocoartf1347\cocoasubrtf570
{\fonttbl\f0\froman\fcharset0 Times-Roman;}
{\colortbl;\red255\green255\blue255;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\pard\tx577\tx1155\tx1733\tx2311\tx2889\tx3467\tx4045\tx4623\tx5201\tx5779\tx6357\tx6935\tx7513\tx8091\tx8669\tx9247\tx9825\tx10403\tx10981\tx11559\tx12137\tx12715\tx13293\tx13871\tx14449\tx15027\tx15605\tx16183\tx16761\tx17339\tx17917\tx18495\tx19072\tx19650\tx20228\tx20806\tx21384\tx21962\tx22540\tx23118\tx23696\tx24274\tx24852\tx25430\tx26008\tx26586\tx27164\tx27742\tx28320\tx28898\tx29476\tx30054\tx30632\tx31210\tx31788\tx32366\tx32944\tx33522\tx34100\tx34678\tx35256\tx35834\tx36412\tx36990\tx37567\tx38145\tx38723\tx39301\tx39879\tx40457\tx41035\tx41613\tx42191\tx42769\tx43347\tx43925\tx44503\tx45081\tx45659\tx46237\tx46815\tx47393\tx47971\tx48549\tx49127\tx49705\tx50283\tx50861\tx51439\tx52017\tx52595\tx53173\tx53751\tx54329\tx54907\tx55485\tx56062\tx56640\tx57218\tx57796\li577\fi-578
\f0\fs24 \cf0 \CocoaLigature0 item1\
item2\
item3\
...
itemN\
}
How can I get it to read from the file without including all of the backslashes and formatting info?
You just need to actually save the file as a plain-text file; it is an RTF file at the moment. Open the file in the Pages application, go to File... Export To... Plain Text... and save it as a new file.
Looks like your names.txt file got saved as RTF (Rich Text Format). Make sure you convert it to plain text.
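Once the file really is plain text, the code from the question will print exactly item1 through itemN. As an aside, a try-with-resources version avoids the explicit finally block entirely (same path as in the question):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class NameLoader {
    public static void main(String[] args) throws IOException {
        List<String> names = new ArrayList<String>();
        // The reader is closed automatically, even if readLine() throws.
        try (BufferedReader reader = new BufferedReader(
                new FileReader("/Users/MyName/Desktop/names.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                names.add(line);
            }
        }
        for (String name : names) {
            System.out.println(name);
        }
    }
}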
I am new to writing code, and I am trying to write code to scrape a specific website. The issue is that this website has a page for accepting the conditions of use and privacy policy, which can be seen at http://cpdocket.cp.cuyahogacounty.us/
I need to bypass this page somehow, and I have no idea how. I am writing my code in Java, and so far I have working code that scrapes the source of any website. This code is:
import java.net.URL;
import java.net.URLConnection;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;

// Scraper takes a string URL as input and returns the source code of the website.
public class Scraper {

    private static String url; // the input website to be scraped

    // constructor
    public Scraper(String url) {
        this.url = url;
    }

    // scrapeWebsite scrapes the input URL. For now it returns a string; ideally
    // the string should be saved so it can be parsed by another method.
    public static String scrapeWebsite() throws IOException {
        URL urlconnect = new URL(url); // creates the URL from the variable
        URLConnection connection = urlconnect.openConnection(); // connects to the created URL
        BufferedReader in = new BufferedReader(new InputStreamReader(
                connection.getInputStream(), "UTF-8")); // reader over the response stream
        String inputLine;
        StringBuilder a = new StringBuilder();
        // append to the string builder as long as there is input
        while ((inputLine = in.readLine()) != null)
            a.append(inputLine);
        in.close();
        return a.toString();
    }
}
Any suggestions on how to go about doing this would be greatly appreciated.
I am rewriting the code based off of some Ruby code. The code is:
def initializeSession()
  ## SETUP # POST headers
  post_header = Hash.new()
  post_header['Host'] = 'cpdocket.cp.cuyahogacounty.us'
  post_header['User-Agent'] = 'Mozilla/5.0 (Windows NT 5.1; rv:20.0) Gecko/20100101 Firefox/20.0'
  post_header['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
  post_header['Accept-Language'] = 'en-US,en;q=0.5'
  post_header['Accept-Encoding'] = 'gzip, deflate'
  post_header['X-Requested-With'] = 'XMLHttpRequest'
  post_header['X-MicrosoftAjax'] = 'Delta=true'
  post_header['Cache-Control'] = 'no-cache'
  post_header['Content-Type'] = 'application/x-www-form-urlencoded; charset=utf-8'
  post_header['Referer'] = 'http://cpdocket.cp.cuyahogacounty.us/Search.aspx' # may have to alter this per request
  # post_header['Content-Length'] = '12197'
  post_header['Connection'] = 'keep-alive'
  post_header['Pragma'] = 'no-cache'

  # STEP # set up simulated browser and make first request
  @browser = SimBrowser.new()
  # logname = 'log.txt'
  # s = Scribe.new(logname)
  session_cookie = 'ASP.NET_SessionId'
  url = 'http://cpdocket.cp.cuyahogacounty.us/'
  @browser.http_get(url)
  # puts @browser.get_body() # debug
  puts 'DEBUG: session cookie: ' + @browser.get_cookie_var(session_cookie)
  @log.slog('DEBUG: home page response code: expected 200, actual ' + @browser.get_response().code)
  # s.flog('### HOME PAGE RESPONSE')
  # s.flog(@browser.get_body()) # debug

  # STEP # send our acceptance of the terms of service
  data = {
    'ctl00$SheetContentPlaceHolder$btnYes' => 'Yes',
    '__EVENTARGUMENT' => '',
    '__EVENTTARGET' => '',
    '__EVENTVALIDATION' => '/wEWBwKc78CQCQLn3/HqCQLZw/fZCgLipuudAQK42duKDQL33NjnAwKn6+K4CIM3TSmrbrsn2xBRJf2DRwg01Vsbdk+oJV9lhG/in+xD',
    '__VIEWSTATE' => '/wEPDwUKLTI4MzA1ODM0OA9kFgJmD2QWAgIDD2QWDgIDD2QWAgIBD2QWCAIBDxYCHgRUZXh0BQ9BbmRyZWEgRi4gUm9jY29kAgMPFgIfAAUfQ3V5YWhvZ2EgQ291bnR5IENsZXJrIG9mIENvdXJ0c2QCBQ8PFgIeB1Zpc2libGVoZGQCBw8PFgIfAWhkZAIHDw9kFgIeB29uY2xpY2sFGmphdmFzY3JpcHQ6d2luZG93LnByaW50KCk7ZAILDw9kFgIfAgUiamF2YXNjcmlwdDpvbkNsaWNrPXdpbmRvdy5jbG9zZSgpO2QCDw8PZBYCHwIFRmRpc3BsYXlQb3B1cCgnaF9EaXNjbGFpbWVyLmFzcHgnLCdteVdpbmRvdycsMzcwLDIyMCwnbm8nKTtyZXR1cm4gZmFsc2VkAhMPZBYCZg8PFgIeC05hdmlnYXRlVXJsBRMvVE9TLmFzcHg/aXNwcmludD1ZZGQCFQ8PZBYCHwIFRWRpc3BsYXlQb3B1cCgnaF9RdWVzdGlvbnMuYXNweCcsJ215V2luZG93JywzNzAsMzcwLCdubycpO3JldHVybiBmYWxzZWQCFw8WAh8ABQYxLjAuNTRkZEnXSWiVLEPsDmlc7dX4lH/53vU1P1SLMCBNASGt4T3B'
  }
  # post_header['Referer'] = url
  @browser.http_post(url, data, post_header)
  @log.slog('DEBUG: accept terms response code: expected 200, actual ' + @browser.get_response().code)
  @log.flog('### TOS ACCEPTANCE RESPONSE')
  # @log.flog(@browser.get_body()) # debug
end
Can this be done in Java as well?
If you don't understand how to do this, the best way to learn is to go through the process manually while watching what happens with Firebug (on Firefox) or the equivalent developer tools for IE, Chrome, or Safari.
You must duplicate in your code whatever happens in the protocol when the user accepts the terms & conditions manually.
You must also be aware that the UI presented to the user may not be sent directly as HTML; it may be constructed dynamically by JavaScript that would normally run in the browser. If you are not prepared to fully emulate a browser to the point of maintaining a DOM and executing JavaScript, then this may not be possible.
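Yes, this can be done in plain Java. A minimal sketch of the same flow using HttpURLConnection and a CookieManager to keep the ASP.NET session cookie across requests; the form field values (especially __VIEWSTATE and __EVENTVALIDATION) are page-specific and have to be scraped from the live page rather than hard-coded:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class TermsAccepter {
    public static void main(String[] args) throws Exception {
        // Store cookies (including ASP.NET_SessionId) across both requests.
        CookieHandler.setDefault(new CookieManager());

        String url = "http://cpdocket.cp.cuyahogacounty.us/";

        // STEP 1: GET the home page to obtain a session cookie.
        HttpURLConnection get = (HttpURLConnection) new URL(url).openConnection();
        get.setRequestProperty("User-Agent", "Mozilla/5.0");
        read(get); // also scrape __VIEWSTATE / __EVENTVALIDATION from this HTML

        // STEP 2: POST the acceptance of the terms of service.
        Map<String, String> data = new LinkedHashMap<String, String>();
        data.put("ctl00$SheetContentPlaceHolder$btnYes", "Yes");
        data.put("__EVENTARGUMENT", "");
        data.put("__EVENTTARGET", "");
        data.put("__EVENTVALIDATION", "...scraped from the page..."); // placeholder
        data.put("__VIEWSTATE", "...scraped from the page...");       // placeholder

        StringBuilder form = new StringBuilder();
        for (Map.Entry<String, String> e : data.entrySet()) {
            if (form.length() > 0) form.append('&');
            form.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }

        HttpURLConnection post = (HttpURLConnection) new URL(url).openConnection();
        post.setRequestMethod("POST");
        post.setDoOutput(true);
        post.setRequestProperty("User-Agent", "Mozilla/5.0");
        post.setRequestProperty("Content-Type", "application/x-www-form-urlencoded; charset=utf-8");
        post.setRequestProperty("Referer", url);
        try (OutputStream out = post.getOutputStream()) {
            out.write(form.toString().getBytes("UTF-8"));
        }
        System.out.println("accept terms response code: expected 200, actual " + post.getResponseCode());
        // Subsequent requests in this JVM reuse the accepted session via the cookie manager.
    }

    // Reads and returns a response body.
    private static String read(HttpURLConnection conn) throws Exception {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) sb.append(line).append('\n');
        }
        return sb.toString();
    }
}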