This tool is a PHP class that takes a URI (e.g. http://www.weather.com/) and parses the tables contained on that page into cells. Since most web pages use tables to format their data, this tool lets you easily retrieve that table-formatted information from web pages without building complex expressions.
This script handles nested tables to any depth and is quite robust. In testing I've not been able to fool or confuse it (so long as I'm using valid HTML, of course). It removes unnecessary items such as HTML markup, scripts, and CSS styles, leaving plain text, link paths (e.g. <a href=___>), and source paths (e.g. <img src=___>) for you to use and manipulate. It will even extract alt text!
Included with the tool is a simple page browser (browser.php) that helps you quickly locate the cell(s) containing the information you need. Using this tool and the browser together will greatly reduce initial development time for a web scraping application, and will make future updates a breeze when the target site changes its layout (as it surely will).
How to Use
Add the following code to your page:
<?
Done! This will print the content of the 3rd cell of row 5 from table number 2 on the page http://www.ravis.org/code/webcell/. Easy as pie (which is the point). Of course, we didn't bother to activate the optional cache, which is highly recommended (see below), but this should give you an idea of how easy WebCell can be to use.
Obviously the GetCell function requires the proper numbers. The easiest way to get these is to fire up browser.php and enter the URI that you're feeding to WebCell. You'll see a page with various tables and cells. Find the cell that contains the information you want and look at the set of numbers in its top left corner. Those are the numbers you feed to GetCell, which will in turn return the content of that cell.
If the layout of the target site changes in the future, simply reload the page in browser.php, find the new cell numbers, make the change in your code, and you're all fixed up. For another working example, see browser.php.
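Since the cell numbers are the part most likely to change, it can help to keep them in one place. What follows is only a minimal sketch, assuming GetCell takes its arguments in (table, row, column) order and returns the cell text. The names fetch_cell and $wanted are hypothetical and not part of WebCell.

```php
<?php
// Coordinates as read from browser.php: table 2, row 5, cell 3.
// Hypothetical variable; update it when the target site changes its layout.
$wanted = array('table' => 2, 'row' => 5, 'cell' => 3);

// Hypothetical helper, not part of WebCell: looks up one cell on any object
// that offers GetCell(table, row, column).
function fetch_cell($page, $where)
{
    return $page->GetCell($where['table'], $where['row'], $where['cell']);
}

// Usage, assuming the WebCell class has already been included:
// $page = new WebCell('http://www.ravis.org/code/webcell/');
// print fetch_cell($page, $wanted);
```

With this arrangement, a layout change on the target site means editing only the $wanted line.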
In exchange for using this script you must register it by sending me the grand sum of some information! Simply drop me a quick note with your site name and address, and what you're using WebCell to do for your visitors. I would also greatly appreciate a link back to this page (http://www.ravis.org/code/webcell/). The link isn't required, but it is appreciated.
IMPORTANT: You should not attempt to bypass the functions listed below and access the variables directly, as the tables are not parsed until their data is needed (to increase efficiency and decrease CPU cycles used). Hence if you try to access table data directly, it may simply not be there (causing all sorts of problems with your resulting script).
WebCell(URI)
EnableCache(cache_directory, cache_timeout)
UserAgent(user_agent_string)
DisplayTimes(true/false)
GetPage(force_reload)
GetCell(table,row,column)
PrintTable(table)
GetTable(table)
NumberOfCells(table,row)
NumberOfRows(table)
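The counting functions can be combined with GetCell to walk a whole table. This is a minimal sketch, assuming NumberOfRows and NumberOfCells return integer counts and that row and cell numbering starts at 1 (check the numbers shown in browser.php and pass 0 as $first if it starts there instead). The helper table_to_array is hypothetical and not part of WebCell.

```php
<?php
// Hypothetical helper, not part of WebCell: copies every cell of one table
// into a nested array, going only through the accessor functions.
// $first is the assumed number of the first row and first cell.
function table_to_array($page, $table, $first = 1)
{
    $rows = array();
    $row_count = $page->NumberOfRows($table);
    for ($r = 0; $r < $row_count; $r++) {
        $row = $first + $r;
        $cells = array();
        $cell_count = $page->NumberOfCells($table, $row);
        for ($c = 0; $c < $cell_count; $c++) {
            $cells[] = $page->GetCell($table, $row, $first + $c);
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Usage, assuming the WebCell class has already been included. The cache
// directory and timeout are placeholders; the unit of cache_timeout is not
// stated in the function list.
// $page = new WebCell('http://www.ravis.org/code/webcell/');
// $page->EnableCache('/tmp/webcell-cache', 3600);
// print_r(table_to_array($page, 2));
```

Because the sketch never touches the object's variables, it stays compatible with the on-demand parsing described in the note above the function list.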
As always, remember that you may be dealing with copyrighted material, and it's up to you to get permission from the source before using their content. This is standard; everyone should know this by now...
This script is copyright (c) 2002 by Travis Richardson (http://www.ravis.org/). You have my permission to use it for personal or commercial purposes free of charge, as well as make whatever modifications you would like, so long as this copyright notice remains intact. Please do not redistribute the script yourself - it's freely available on my web site for all to use, and you are more than welcome to link to it from your site if you wish. :-)