In this step you will download real data from the site (IMDB, Amazon or eBay) and put it into your database.
You will use a simple Java program called WebCrawler that downloads web pages, extracts the data using sed, and inserts the data into the database.
The sed stream editor processes the HTML code and outputs SQL insert statements containing the data.
You should get familiar with WebCrawler's Java code, and you are encouraged to change it to suit your needs.
You should also study the syntax of sed scripts and write your own scripts to extract your data.
1. Download WebCrawler files: details;
2. Learn how sed extraction works: details;
3. Learn how the WebCrawler program works: details;
4. Choose what data you want to extract; consult your schema;
5. Write your sed scripts for extracting the data (look at the HTML code and think about how a particular piece of text or URL can be extracted). Feel free to modify the example scripts;
6. Test your scripts locally;
7. Run the WebCrawler program with the new scripts in order to populate the data table with the real data extracted from the web;
8. Use the fetched data to fill the tables you created at step 1 using SQL scripts;
Time available: 3 weeks
The results will be graded in deliverable 1 (23.04).
Please note that all the files should be put on lsir-cis-pcX.epfl.ch, not on your Sun machine. The easiest way to do this is to run the Mozilla browser from your ssh terminal: mozilla &.
Create a new directory (let's call it <DIR>) and put the WebCrawler program: WebCrawler.class, source code: WebCrawler.java,
... and scripts:
for IMDB: imdb_list.sed, imdb_movie.sed, imdb_actor.sed;
for eBay: ebay_list_old.sed, ebay_dvd.sed, ebay_seller.sed;
WARNING: The eBay site structure has been changed. The proper script is here: ebay_list.sed.
for Amazon:
WARNING: The site www.absolutefreebies.com (which you could use instead of www.amazon.com) is down at the moment, which means the old scripts do not work: amazon_list.sed, amazon_dvd.sed, amazon_dvd2.sed. THIS link can be used instead.
The currently working scripts (the same as the previous ones, only with the links fixed) are: amazon2_list.sed, amazon2_dvd.sed, amazon2_dvd2.sed.
Also change the starting page in the WebCrawler code, for example:
stmt.executeUpdate("insert into worklist(script,url) values('amazon2_list.sed','http://www.humorlinks.com/cgi-bin/amazon/amazon_products_feed.cgi?mode=dvd&node=163296&locale=us')");
// Please note that we crawl this link instead of www.amazon.com itself, simply because its HTML code is easier to process.
To play with the sed editor, download also the sample webpages:
for IMDB: imdb_list.html, imdb_movie.html, imdb_actor.html;
for eBay: ebay_list.html, ebay_dvd.html, ebay_seller.html;
for Amazon: amazon_list.html, amazon_dvd.html.
We will use sed to extract particular pieces of information from HTML code. The scripts use regular expressions to process text strings. The scripts are written in such a way that they produce SQL insert statements as output.
To try sed, run the following command from the console: sed -n -f scriptfile.sed inputfile.html
For example: sed -n -f imdb_list.sed imdb_list.html (you can try all scripts)
Then look at the html source code and compare to the output.
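To make the idea of "one matching line in, one insert statement out" concrete, here is a minimal Java sketch of what a script such as imdb_list.sed does conceptually. It is not the course's code: the link pattern, the sample HTML line and the http://www.imdb.com prefix are assumptions made for illustration. Only the worklist(script,url) columns are taken from the insert statement shown earlier on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineExtractor {
    // Hypothetical pattern: a whole line that contains a link to a movie page.
    // It plays the role of the left-hand side of a sed s/.../.../p command.
    private static final Pattern MOVIE_LINK =
            Pattern.compile(".*<a href=\"(/title/[a-z0-9]+/)\">.*");

    /** Returns one insert statement per input line that contains a movie link. */
    static List<String> toInserts(List<String> htmlLines) {
        List<String> out = new ArrayList<>();
        for (String line : htmlLines) {
            Matcher m = MOVIE_LINK.matcher(line);
            if (m.matches()) {
                // Like sed -n with the p flag: only matching lines produce output.
                String url = "http://www.imdb.com" + m.group(1);
                out.add("insert into worklist(script,url) values('imdb_movie.sed','"
                        + url + "')");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up page content, not a real IMDB page.
        List<String> page = Arrays.asList(
                "<html><body>",
                "<td>1.</td><td><a href=\"/title/tt0000001/\">Some Movie</a></td>",
                "</body></html>");
        for (String stmt : toInserts(page)) {
            System.out.println(stmt);
        }
    }
}
```

Lines that do not match produce nothing, which is the behaviour you get from the -n option combined with the p flag.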
SED documentation and faq can be found at:
http://www.cornerstonemag.com/sed/
Alternative documentation: intro, regexp's, part1, part2.
Hint 1: The %CURRENTID% keyword inside any sed script is replaced (by WebCrawler) with the current item id, taken from the worklist table. See the function ExecuteSED in the WebCrawler class.
For example, while parsing an IMDB movie page, %CURRENTID% is replaced with the current movie id, which was stored earlier in the worklist table.
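The substitution itself can be pictured with the following small Java sketch. It is an assumption about how such a replacement could be done, not a copy of ExecuteSED. The class name CurrentIdTemplate, the sample script line and the data(id,title) columns are hypothetical.

```java
class CurrentIdTemplate {
    /** Replaces every literal %CURRENTID% in the script text with the given id. */
    static String fill(String scriptText, String currentId) {
        // String.replace treats both arguments literally (no regex), so the
        // backslashes of the sed script are left untouched.
        return scriptText.replace("%CURRENTID%", currentId);
    }

    public static void main(String[] args) {
        // Hypothetical one-line script; the data(id,title) columns are assumed.
        String script =
                "s/.*<h1>\\(.*\\)<\\/h1>.*/insert into data(id,title) values(%CURRENTID%,'\\1')/p";
        System.out.println(fill(script, "42"));
    }
}
```

The sketch uses replace rather than replaceAll so that neither the keyword nor the id is interpreted as a regular expression.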
Hint 2: sed works on lines of text, and it is relatively easy to extract a piece of text from each line using the s/.../.../pg construction.
If the piece of text to extract occurs several times within one line (probably a long one), the following construction can be used:
:loop
h
s/.* \(something\) .*/\1/p
g
s/\(.*\) something \(.*\)/\1 \2/g
t loop
It is used in ebay_list.sed for example. Note that execution of this script can take quite a long time (up to 1 min).
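If the loop is hard to follow, it may help to compare it with a rough Java analogue that collects every occurrence of a pattern on one long line. This is only an illustration of the goal, not a translation of ebay_list.sed. The pattern item=(\d+) and the sample line are hypothetical. Note one difference: because the leading .* in the sed loop is greedy, sed emits the last occurrence first, whereas find() below walks the line from left to right.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllMatchesOnLine {
    // Hypothetical pattern standing in for "something" in the sed loop.
    private static final Pattern ITEM = Pattern.compile("item=(\\d+)");

    /** Returns the captured group of every occurrence on the line, left to right. */
    static List<String> itemIds(String line) {
        List<String> ids = new ArrayList<>();
        Matcher m = ITEM.matcher(line);
        // Each call to find() resumes after the previous match, which is what
        // the sed loop emulates by deleting one occurrence per iteration.
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up line with two links on it.
        String line = "<a href=\"?item=11\">first</a> <a href=\"?item=22\">second</a>";
        System.out.println(itemIds(line));
    }
}
```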
1. Run pointbase console: go to SUNWappserver/pointbase/tools/serveroption and run "./startconsole.sh".
2. Connect to your database or create a new one and name it "project".
3. Create tables data and worklist: open "createtables.sql" and execute the queries.
4. You can find some sample data in imdb.sql, ebay.sql and amazon.sql.
5. Close the database.
6. Put pointbase.ini in your <DIR> and correct the path to the databases using the command: kedit pointbase.ini.
7. Run WebCrawler from <DIR>, using the proper classpath and replacing %site% with "imdb", "ebay" or "amazon":
java -classpath /opt/j2ee/1/SUNWappserver/pointbase/lib/pbembedded.jar:. WebCrawler %site%
By default the program works with the database named "project" and uses no password. You can (or may have to) change this in the code.
Warning! This operation can take several minutes.
8. If you change the code, compile the java code using: javac WebCrawler.java
9. Open pointbase console and run select query: select * from data.
The WebCrawler program does the following:
-stores the page (indicated by the URL) in a temp file (called temp);
-executes sed with the given script to extract the particular data in the form of SQL insert statements. What data is extracted and the syntax of the insert statements are specified in the script files (*.sed);
-executes the insert statements, filling the data table with the extracted data and the worklist table with the links (URLs) to be visited (if needed);
-iterates over the worklist table, visiting the stored URLs and processing them.
For example, IMDB crawling works as follows:
-download the best 250 movies page (starting page);
-process the HTML file with the imdb_list.sed script. It outputs SQL insert statement(s) that add the title of the movie, its rank, etc. to the data table.
It also outputs SQL statement(s) that add the URL and id of the movie, along with a sed script name, to the worklist table;
-iterate over the worklist table row by row, visiting the links and processing each with the sed script name stored for it.
Look at the WebCrawler Java code to learn more.
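The control flow described above can be sketched in a few lines of Java. This is a simplified model under stated assumptions, not the actual WebCrawler code. An in-memory queue stands in for the worklist table, a list for the data table, a map of made-up pages for the downloads, and a trivial "link:" token convention for the sed scripts. The script names list.sed and item.sed are hypothetical, and the seen set that skips repeated URLs is an addition of this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WorklistSketch {
    /** One worklist row: which script to apply to which URL. */
    static final class Task {
        final String script;
        final String url;
        Task(String script, String url) { this.script = script; this.url = url; }
    }

    /** Processes the worklist until it is empty and returns the collected data rows. */
    static List<String> crawl(Task start, Map<String, String> pages) {
        Deque<Task> worklist = new ArrayDeque<>(); // stands in for the worklist table
        List<String> data = new ArrayList<>();     // stands in for the data table
        Set<String> seen = new HashSet<>();        // assumption: skip repeated URLs
        worklist.add(start);
        while (!worklist.isEmpty()) {
            Task t = worklist.poll();
            if (!seen.add(t.url)) continue;
            String html = pages.getOrDefault(t.url, ""); // stands in for the download
            if (t.script.equals("list.sed")) {
                // A list page only yields new worklist rows.
                for (String token : html.split("\\s+")) {
                    if (token.startsWith("link:")) {
                        worklist.add(new Task("item.sed", token.substring(5)));
                    }
                }
            } else {
                // An item page yields a data row.
                data.add(html.trim());
            }
        }
        return data;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("list", "link:a link:b link:a");
        pages.put("a", "Movie A");
        pages.put("b", "Movie B");
        System.out.println(crawl(new Task("list.sed", "list"), pages));
    }
}
```

The point of the sketch is that the starting page only seeds the worklist; every later page is handled by whichever script name was stored next to its URL.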
-WebCrawler program;
-Website (imdb, ebay or amazon) assigned on the first step;
-Database schema developed on the first step.