Creating a database from various web pages? [closed]


It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 9 years ago.
Is there a way, using Java or Python, to gather a large amount of information about many different colleges from a website such as College Board?
I want to know how to do things like this but I've never really programmed outside of default libraries. I have no idea how to start my approach.
Example:
I input a large list of colleges from a page that looks somewhat like this:
https://bigfuture.collegeboard.org/print-college-search-results
The code then finds the page for each college such as
https://bigfuture.collegeboard.org/college-university-search/alaska-bible-college?searchType=college&q=AlaskaBibleCollege
and then gathers information from the page such as tuition, size, etc.
and then stores it in a class that I can use for analysis and stuff
Is something like this even possible? I remember seeing a similar program in The Social Network. How would I go about this?

So, short answer, yes. It's perfectly possible, but you need to learn a bunch of stuff first:
1) The basics of the DOM model (HTML) so you can parse the page
2) The general idea of how servers and databases work (and how to interface with them in Python, which is what I use, or Java)
3) Sort of a subsection of 2: learn how to retrieve HTML documents from a server so you can then parse them
Then, once you do that, this is the procedure a program would have to go through:
1) You need to come up with a list of pages that you want to search. If you want to search an entire website, you need to narrow that down. You can easily limit your program to just search certain types of forums, which all have the same format on College Board. You'll also want to add a part of the program that builds a list of the web pages it finds links to. For instance, if College Board has a page with a bunch of links to different pages with statistics, you'll want your program to scan that page to find the links to the pages with those statistics.
2) You need to find the ID, location, or some identifying marker of the HTML tag that contains the information you want. If you want to get REALLY FANCY (and I mean REALLY fancy) you can try to use some algorithms to parse the text and try to get information (maybe trying to parse admission statistics and stuff from the text on the forums)
3) You then need to store that information in a database that you then index and create an interface to search (if you want this whole thing to be online, I suggest the python framework Django for making it a web application). For the database type, it would make sense to use Sqlite 3 (I)
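Steps 1 and 2 can be sketched with nothing but the Java standard library. This is only a rough illustration: the sample HTML, the "/profile" link fragment, and the id "tuition" are made up, and a real HTML parser such as jsoup would cope with messy markup far better than the regular expressions used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScrapeSketch {
    // Matches href="..." values only; single-quoted or unquoted attributes are ignored.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // Step 1: collect the links on an index page whose URL contains a given fragment.
    static List<String> findLinks(String html, String mustContain) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            if (url.contains(mustContain)) {
                links.add(url);
            }
        }
        return links;
    }

    // Step 2: read the text directly inside the element that carries a known id.
    static Optional<String> textOfId(String html, String id) {
        Pattern p = Pattern.compile("id=\"" + Pattern.quote(id) + "\"[^>]*>([^<]*)<");
        Matcher m = p.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical index page and profile page.
        String index = "<a href=\"/profile?id=1\">A</a> <a href=\"/about\">About</a>";
        String page = "<html><body><span id=\"tuition\"> $9,300 </span></body></html>";
        System.out.println(findLinks(index, "/profile"));
        System.out.println(textOfId(page, "tuition").orElse("not found"));
    }
}
```

The marker passed to textOfId is exactly the part that has to be worked out again for every page format.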
So yes, it's perfectly possible, but here's the bad news:
1) As someone already commented, you'll need to figure out step 2 for each individual web page format you handle. (By web page format, I mean different pages with different layouts. The Stack Overflow homepage is different from this page, but all of the question pages follow the same format.)
2) Not only will you need to repeat step 2 for each new website, but if the website does a redesign, you'll have to redo it again as well.
3) By the time you finish the program you may have easily gathered the info on your own.
Alternative and Less Cool Option
Instead of going through all the trouble of searching the web page for specific information, you can just extract all of the page's text, then try to find keywords relating to colleges within that text.
BUT WAIT, THERE'S SOMETHING THAT DOES THIS ALREADY! It's called Google :). That's basically how Google works, so... yeah.

Of course there is "a way". But there is no easy way.
You need to write a bunch of code that extracts the stuff you are interested in from the HTML. Then you need to write code to turn that information into a form that matches your database schema ... and do the database updates.
There are tools that help with parts of the problem, e.g. web crawler frameworks for fetching the pages, jsoup for parsing HTML, JavaScript engines if the pages are "dynamic", etc. But I'm not aware of anything that does the whole job.

What you're asking about here is called scraping and in general
it's quite tricky to do right. You have to worry about a bunch of
things:
The data is formatted for display, not programmatic consumption.
It may be messy, inconsistent, or incomplete.
There may be dynamic content, which means you might have to run a
JavaScript VM or something just to get the final state of the page.
The format could change, often.
So I'd say the first thing you should do is see if you can access the
data some other way before you resort to scraping. If you poke around
in the source for those pages, you might find a webservice feeding data
to the display layer in XML or JSON. That would be a much better place
to start.

Ok everyone thanks for the help. Here's how I ended up doing it. It took me a little while but thankfully collegeboard uses very simple addresses.
Basically there are 3972 colleges and each has a unique, text-only page with an address like:
https://bigfuture.collegeboard.org/print-college-profile?id=9
where the id runs from 1 to 3972.
Using a library called HtmlUnit I was able to access all of these pages, convert them into strings and then gather the info using indexOf.
It is still going to take about 16 hours to process all of them, but I've got a hundred or so saved.
Maybe I lucked out with the print page but I got what I needed and thanks for the help!
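For anyone wanting to try the same approach, here is a minimal sketch of the indexOf part. The URL pattern is the one given above; the "Tuition:" label and the sample page text are assumptions for illustration, since the real wording on the print pages may differ, and fetching the page itself (done with HtmlUnit above) is left out.

```java
class PrintProfileSketch {
    // URL pattern from the post; ids run from 1 to 3972.
    static final String BASE = "https://bigfuture.collegeboard.org/print-college-profile?id=";

    static String profileUrl(int id) {
        return BASE + id;
    }

    // Returns the text between startMarker and the next endMarker, or null if either is missing.
    static String between(String page, String startMarker, String endMarker) {
        int start = page.indexOf(startMarker);
        if (start < 0) {
            return null;
        }
        start += startMarker.length();
        int end = page.indexOf(endMarker, start);
        if (end < 0) {
            return null;
        }
        return page.substring(start, end).trim();
    }

    public static void main(String[] args) {
        // Hypothetical page text standing in for a downloaded profile page.
        String page = "Name: Example College\nTuition: $12,000\nUndergraduates: 1,500\n";
        System.out.println(profileUrl(9));
        System.out.println(between(page, "Tuition:", "\n"));
    }
}
```

Returning null when a marker is missing makes it easy to spot profiles that do not follow the expected layout.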

Related

How do you grab a certain string of text through a link in java?

Is it possible to grab a certain piece of text from a website in Java? For example, given https://weather.com/weather/today/l/41.93,-88.25?par=google&temp=f , how would I be able to read the temperature it displays from Java?

The practical answer to your question is: You don't wanna do that.
Let me try to answer it, at which point you'll realize why you don't want to:
How do I programmatically parse a website?
It's complicated. Just about every browser has an option to right click and 'view source'. Presumably the number(s) you want are in here; you can parse this text to find them. It's NOT easy though. You'll probably be tempted to use something like a regular expression or a simple 'find me this exact string of text' trick to find what you need. It may work. But generally that means the day that this site changes the style or just does some basic updates, your code ceases to work.
You'll need to put it in your agenda to check, every day if you have to, whether your code still works. That's 5 minutes out of your day, every day, for the rest of the life of this project. That sounds incredibly expensive, which is why you don't want this.
If you must, there are ways to tighten up your parsing code. If you use libraries like jsoup, that helps a bit. If you run the entire site through a 'browser emulator', you can deal with JavaScript making Ajax requests and the like. (These days websites are like little programs, and to truly observe programmatically what the site shows to human eyes, you need to run that program. If you're very lucky, you can inspect the 'source code' of the little program and that's all you need, but you're not always that lucky.)
But, as I said, that just helps a bit. The day will come when The Weather Channel changes their site and breaks your code. They won't announce it. It is not considered immoral or technically dubious to do so. Maybe you can cut the check in your agenda down from daily to once a week, but it'll be a permanent maintenance burden. You DO NOT WANT THIS.
Okay, forget that. How does this really work?
Sites that are designed to let you read this stuff have an API. They'll document it someplace. This is a 'website' made specifically for code. It has no formatting, and a well defined specification. Send it this specific simple string, and this specific simple answer comes out, and the site has tooling to let you know when they change it (for example, an 'API version') - all luxuries the site meant for human consumption will not have.
You're in luck. The Weather Channel has an API.
What you really want is to read all that, figure out how that API works, and use that.
The API will not break when The Weather Channel decides that today is a good day to slightly change the shade of the background image.

Organizing many texts by swapping instances

I am planning to develop an adventure-like game.
For that I am going to have a lot of instances of classes with different texts (basically strings).
I don't want to hardcode this many texts, so I am looking for a better way to do it.
The guy in this video ( https://www.youtube.com/watch?v=8CDePunJlck ) uses JSON: he writes a text file for each class instance manually and parses them automatically into instances. That goes in the right direction.
I'm looking for more information on that, so what is this procedure called?
It's said in the video that this also works with databases?
Is there a way to design slightly more complex things this way?
E.g. I would like to output different texts if a local or global variable is over a threshold, etc. Can I do this without hardcoding and without writing a separate class for each of my proposed instances?
Thank you!

Your question is quite broad, and it is hard to give a definitive answer. Here are some thoughts - hope you find it helpful.
You are right that you don't want to hardcode strings. The alternative to this is storing strings as external resources, and loading them into your game at start. There are numerous ways the resource can be organized; the choice depends on your programming platform, game architecture etc. For example, you can use simple name-value approach:
AREA_1_DESCRIPTION: You stand next to a small white house.
ITEM_22_DESTRUCTION: The nasty snake disappears with a loud "Bang!"
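In Java, a name-value file in this shape can be read with java.util.Properties, which accepts a colon as the separator between key and value. A minimal sketch follows; the class name TextCatalogSketch is made up, and the text is inlined as a string only to keep the example self-contained (a game would read it from a resource file).

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class TextCatalogSketch {
    // Parses "KEY: text" lines into a lookup table.
    static Properties load(String source) {
        Properties catalog = new Properties();
        try {
            catalog.load(new StringReader(source));
        } catch (IOException e) {
            // Should not happen for an in-memory string; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return catalog;
    }

    public static void main(String[] args) {
        String source = String.join("\n",
            "AREA_1_DESCRIPTION: You stand next to a small white house.",
            "ITEM_22_DESTRUCTION: The nasty snake disappears with a loud \"Bang!\"");
        Properties catalog = load(source);
        System.out.println(catalog.getProperty("AREA_1_DESCRIPTION"));
        System.out.println(catalog.getProperty("ITEM_22_DESTRUCTION"));
    }
}
```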
Using JSON or XML will give you more structured storage, which can be of great help, since you can organize your texts so that it is easier to use them in the code:
<item id="375" name="Great Sword">
<short_description>A Great Sword of Darkness</short_description>
<long_description>The sword has almost black blade with some unknown runes engraved</long_description>
</item>
If your programming system can access a database, then you can do something similar and store texts in the tables; this, however, might make it more difficult to edit texts later. If you want to go this way, I would still recommend using XML or JSON to store the texts, and making the game import texts in DB on the first run.
You probably will also need some sort of simple template-handling engine to be able to re-use some strings. You can start with creating your version of Java String.format() method. Your method might take as a first argument an ID of a string in your string catalog, and use some simple placeholders for the parameters. Suppose you have the following entry in your catalog:
FIRE_GEM_ACTION: "The Fire Gem touches %% and in %% seconds it turns into ashes."
Then you can write a method that will do something like this:
int delaySeconds = 5;
String message = MyTemplateProcessor.process(FIRE_GEM_ACTION, "old map", delaySeconds);
The function will take the string from the catalog, search for the occurrences of the placeholders (%%) and replace them sequentially with the parameters, so in the message you will get: The Fire Gem touches old map and in 5 seconds it turns into ashes.
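A minimal sketch of such a method could look like this. The in-memory catalog and the use of a plain string as the ID are assumptions made to keep the example self-contained; in a real game the catalog would be loaded from your text files.

```java
import java.util.Map;

class MyTemplateProcessor {
    // Hypothetical in-memory catalog; normally loaded from an external file.
    static final Map<String, String> CATALOG = Map.of(
        "FIRE_GEM_ACTION", "The Fire Gem touches %% and in %% seconds it turns into ashes.");

    // Replaces each %% placeholder, left to right, with the next parameter.
    static String process(String id, Object... params) {
        String template = CATALOG.get(id);
        if (template == null) {
            throw new IllegalArgumentException("Unknown text id: " + id);
        }
        StringBuilder out = new StringBuilder();
        int from = 0;   // where the unprocessed part of the template starts
        int next = 0;   // index of the next parameter to insert
        int at;
        while (next < params.length && (at = template.indexOf("%%", from)) >= 0) {
            out.append(template, from, at).append(params[next++]);
            from = at + 2;
        }
        // Append whatever follows the last replaced placeholder.
        return out.append(template.substring(from)).toString();
    }

    public static void main(String[] args) {
        int delaySeconds = 5;
        System.out.println(process("FIRE_GEM_ACTION", "old map", delaySeconds));
    }
}
```

If fewer parameters than placeholders are passed, the leftover %% markers stay in the output, which makes a missing argument easy to spot during testing.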
In general, I would recommend having a look at some systems specially designed for creating adventure games. Inform 7 is a good starting place: http://inform7.com/learn/

Outputting a data driven generated graphic which can be modified by the user

I'm trying to develop a system whereby clients can input a series of plant related data which can then be queried against a database to find a suitable list of plants.
These plants then need to be displayed in a graphic output, putting tall plants at the back and small plants at the front of a flower bed. The algorithm to do this I have set in my mind already, but my question to you is what would be the best software to use that:
1) Allows a user to enter in data
2) Queries a database to return suitable results
3) Outputs the data into a systemised graphic (simple rectangle with dots representing plants)
and the final step is an "if possible" and something I've not yet completely considered:
4) Allow users to move these dots using their mouse to reposition if wanted
--
I know PHP can produce graphic outputs, and I assume you could probably mix this in with a bit of jQuery which would allow the user to move the dots. Would this work well or could other software (such as Java or __) produce a better result?
Thanks and apologies if this is in the wrong section of Stack!

Your question is a bit vague. To answer it directly: any general-purpose programming language these days is able to do what you want, with the right libraries, be it C/C++, Java, PHP+JavaScript, Python, Ruby, or countless others.
With Java in particular, you'll probably want to use the Swing toolkit for the GUI.
If you know PHP+JavaScript exclusively, it's probably best for your project to stick to what you know. If, however, you see this more as a learning opportunity than a project that needs to be done NOW, you could take time to learn a new language in the process.
As to what language to learn, each person has a different opinion, obviously, but generally speaking, a higher-level language is faster to prototype in.
EDIT
If you need this for a website, however, you'll need to use something web based. That is, you'll necessarily have two programs, one that runs server-side and another that runs in the client (browser). On the server side, you could very well use PHP, JSP (JavaServer Pages), Python or Ruby. On the client side, you'll be limited to JavaScript+DOM (maybe HTML5), a Java applet, or something Flash-based.
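Whichever language you pick, the "tall plants at the back" placement from the question mostly comes down to a sort. Here is a rough Java sketch under assumed names: the Plant class, the heights in centimetres, and the fixed number of plants per row are all invented for illustration and are not the asker's actual algorithm.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class BedLayoutSketch {
    // Hypothetical plant record as it might come back from the database query.
    static final class Plant {
        final String name;
        final int heightCm;

        Plant(String name, int heightCm) {
            this.name = name;
            this.heightCm = heightCm;
        }
    }

    // Returns rows of plant names; row 0 is the back of the bed and holds the tallest plants.
    static List<List<String>> layout(List<Plant> plants, int plantsPerRow) {
        List<Plant> sorted = new ArrayList<>(plants);
        sorted.sort(Comparator.comparingInt((Plant p) -> p.heightCm).reversed());
        List<List<String>> rows = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            if (i % plantsPerRow == 0) {
                rows.add(new ArrayList<>());   // start a new row further towards the front
            }
            rows.get(rows.size() - 1).add(sorted.get(i).name);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Plant> plants = List.of(
            new Plant("Daisy", 30), new Plant("Sunflower", 200), new Plant("Lavender", 60));
        System.out.println(layout(plants, 2));
    }
}
```

The resulting row and column positions are what the drawing code (Swing on the desktop, or JavaScript in the browser) would turn into dots.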

Is creating a unique HTML file for each article a good practice?

Sorry for the poor topic name, I could not think of anything better ;)
I am working on a news broadcast website project, and the stakeholder asked me to create a unique HTML file for each article and save it on disk instead of using a DBMS like MySQL, so that users can access the file directly, no computation is needed, and there won't be any bottleneck.
And I did so.
My question is: is this (what he asked me) a good and popular practice in programming?
What are the pros and cons?
Thank you all, and sorry for my poor English writing :P

If you have a template and can generate these pages automatically, it can be a good practice. Like you say, it saves your server from having to generate the page on every request. It only needs to pass the plain page through.
And if you need to change the layout, or need to edit an article, you can just regenerate the page.
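As a rough illustration of "generate from a template, regenerate on change", here is a minimal Java sketch. The template text, the {{title}} and {{body}} placeholder names, and the slug-based file name are assumptions for this example; it also does no HTML escaping, which real code would need.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class StaticPageSketch {
    // Hypothetical page template shared by all articles.
    static final String TEMPLATE =
        "<html><head><title>{{title}}</title></head><body><h1>{{title}}</h1>{{body}}</body></html>";

    // Fills the template with one article's title and body.
    static String render(String title, String bodyHtml) {
        return TEMPLATE.replace("{{title}}", title).replace("{{body}}", bodyHtml);
    }

    // Writes (or overwrites) the static file for one article.
    // Call it again after the article is edited or the template changes.
    static Path publish(Path dir, String slug, String title, String bodyHtml) {
        try {
            Files.createDirectories(dir);
            return Files.writeString(dir.resolve(slug + ".html"), render(title, bodyHtml));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path out = publish(Path.of("articles"), "hello", "Hello", "<p>First article.</p>");
        System.out.println("Wrote " + out);
    }
}
```

Because the article text still lives somewhere other than the generated file, a layout change only requires calling publish again for every article.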
It is quite common, although lots of pages always have some dynamic content, like a date, user info or other session or time specific data. In this case you cannot cache the entire page. Of course you can combine both. Have dynamic index pages and front page, and only cache the actual articles themselves. But I read in your question that that is what you've done now.
Pros:
Faster retrieval of pages
Less load on your webserver
Less load on your database server
Cons:
Need to do some extra work to update the cache when an article is modified
Cannot have any dynamic content in the page
There probably isn't a problem at all. Most webservers are able to serve large numbers of dynamic pages (premature optimization is the root of all evil).
There are other ways to speed things up that don't have the above cons. You could cache query results in Memcache and/or use APC cache to speed up your PHP code and decrease disk I/O.
But there are web hosting companies dedicated entirely to serving static content. That static content can be served from memory too, making it even faster than APC-cached dynamic content, so if you really really really need the performance, yes, this is the way to go. But I seriously doubt you do.

Static pages are good for small websites. If you have the chance, go for it but if you need complex operations, dynamic page structure should be the way to go.
For an article site, I'd go with dynamic pages since the concept is dynamic (You'll need to update the site, add new articles, maybe add new features like commenting, user activity etc).
It is easier to add/delete/edit an article directly from an admin panel; with static pages, you'd have to find your way through the HTML code.
The list would go on and on...

Without a half-decent templating system, you'd have to store the full article AND the page layout and styles in the one file.
This means it'd be difficult to update the look and feel across all the published articles, and if you wanted to query the articles and return a list (such as those from a specific author or in a specific category), you'd be a bit stuck too.

If you think of it as a replacement for your database: No, that's not good practice. You lose a lot of information, and editing pages later will be harder, as will setting up indexed search functions, ...
If you think of it as a caching solution: Then yes, this is good practice and also a common technique. But think about how to do the caching and when to replace the files with new versions, and only do it if you have few write accesses and a lot of read accesses to your pages (which is typical for an article site ^^)

Definitely not a common practice, and I would not do it this way, especially not for the sake of avoiding a bottleneck: you won't have any bottleneck there, nor any performance problem. How many unique visitors is your site likely to be getting? Hundreds of thousands?
In fact, reading from the disk is more likely to be a problem. DB operations can be optimized, cached in memory, etc.; the DB server performs various optimizations. On the other hand, you read the file each time (or handle the caching yourself).
The usual and preferred way to do it is:
store and load content from DB
have a template (header + footer) for the page, and only insert the content
have an admin panel with an editor (as rich as possible) where you can modify the content of the article

I started out asking myself why a stakeholder might be asking you to implement a system this way. Why would he / she care, as long as your system meets the requirements? There are two possible answers to this:
The stakeholder is a bit of a control freak; e.g. an ex-techie who likes to interfere with what his developers do.
The stakeholder has had a bad experience in the past; e.g. with a previous system where the content was "locked into" a database with an unwieldy front end that made life hell for the users.
From this standpoint, how would you address the problem? My take is that you need to get to the bottom of why the stakeholder is asking for this. Does he have some genuine concern? Can you address that concern in the system design?
The bottom line is that "is this best practice" is not the overriding criterion here. Arguably, "what the customer wants" or "what the customer needs" are more important.
What I think you need to do is:
Find out what the stakeholder's real concern is.
Discuss with him / her (and other stakeholders) the design options that will address those concerns. Present them with the alternatives and an honest assessment of their implications, and involve them in the decision making.

Parsing IBM 3270 data in java

I was wondering if anyone had experience retrieving data with the 3270 protocol. My understanding so far is:
Connection
I need to connect to an SNA server using telnet, issue a command and then some data will be returned. I'm not sure how this connection is made since I've read that a standard telnet connection won't work. I've also read that IBM have a library to help but not got as far as finding out any more about it.
Parsing
I had assumed that the data being returned would be a string of 1920 characters, since the 3278 screen was 80x24 chars. I would simply need to parse these chars into the appropriate fields. The more I read about the 3270 protocol, the less this seems to be the case. I read in the documentation provided with a trial of the Jagacy 3270 Java library that attributes were marked in the protocol with the char 'A' before the attribute, and my understanding is that there are more chars denoting other factors, such as whether fields are editable.
I'm reasonably sure my thinking has been too simplistic. Take an example like a screen containing a list of items - pressing a special key on one of the 24 visible rows drills down into more detailed information regarding that row.
Also it's been suggested to me that print commands can be issued. This has some positive implications - if the format of the string returned is not 1920 since it contains these characters such as 'A' denoting how users interact with the terminal, printing would eradicate these. Also it would stop having to page through lots of data. The flip side is I wouldn't know how to retrieve the data from the print command back to Java.
So..
I currently don't have access to the SNA server but have some screen shots of what the terminal will look like once I get a connection and was therefore going to start work on parsing. With so many assumptions and not a lot of idea on what the data will look like I feel really stumped. Does anyone have any knowledge of these systems that might help me back on track?

You've picked a ripper of a problem there. 3270 is a very complex protocol indeed. I wouldn't bother about trying to implement it, it's a fool's errand, and I'm speaking from painful personal experience. Try to find a TN3270 (Telnet 3270) client API.
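Once a client library has done the protocol work, the parsing side can be close to what you first imagined. The sketch below assumes (and this is only an assumption, since each library exposes the screen differently) that the client hands you the current screen as one flat 1920-character string, 24 rows of 80 columns, with attribute bytes already rendered as blanks. The class and method names are made up.

```java
class ScreenSketch {
    static final int ROWS = 24;
    static final int COLS = 80;

    // Extracts `length` characters starting at the 1-based (row, col) position of a flat 24x80 screen.
    static String field(String screen, int row, int col, int length) {
        if (screen.length() != ROWS * COLS) {
            throw new IllegalArgumentException("Expected " + (ROWS * COLS) + " characters, got " + screen.length());
        }
        int start = (row - 1) * COLS + (col - 1);
        return screen.substring(start, start + length).trim();
    }

    public static void main(String[] args) {
        // Hypothetical screen: blanks everywhere except a value at row 2, column 5.
        String screen = " ".repeat(84) + "ACCT 12345" + " ".repeat(ROWS * COLS - 94);
        System.out.println(field(screen, 2, 5, 10));
    }
}
```

The row, column and length of each field would come from your screenshots, so they can be written down before you have access to the server.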

This might not specifically answer your question, but...
If you are using Rational Developer for z/OS, your Java code should be able to use the integrated HATS product to deal with the 3270 stream. It might not fit your project, but I thought I would mention it: if all you are trying to do is some simple screen scraping, it makes things very easy.
