This tool is a PHP class that takes a URI (e.g. http://www.weather.com/) and parses the tables contained on that page into cells. Since most web pages use tables to lay out their data, this tool lets you retrieve that table-formatted information from web pages without building complex regular expressions.

This script handles tables nested to any depth and is quite robust. In testing I've not been able to fool or confuse it (so long as I'm using valid HTML, of course). It removes unnecessary items such as HTML markup, scripts, and CSS styles, leaving plain text, link targets (e.g. <a href=___>), and source paths (e.g. <img src=___>) for you to use and manipulate. It will even extract alt text!

Included with the tool is a simple page browser (browser.php) that helps you quickly and easily locate the cell(s) that contain the information you need to work with.

Using this tool and the browser together will greatly reduce the initial development time of a web scraping application, and will make future updates a breeze when the target site changes its layout (as it surely will).

How to Use
Functions
Legal Reminder
Copyright
Download Version 1.01
Registration and Links

How to Use

Add the following code to your page:

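<?php
    // Load the WebCell class (WebCell.php must be reachable from this script).
    include("WebCell.php");
    // Point WebCell at the page whose tables you want to read.
    $wcObject = new WebCell("http://www.someserver.com/path/to/file.html");
    // Print the content of table 2, row 5, cell 3.
    print $wcObject->GetCell(2,5,3);
?>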

Done! This will print the content of the 3rd cell of row 5 from table number 2 on the page at the URI you passed to WebCell. Easy as pie (which is the point).

Of course, we didn't bother to activate the optional cache, which is highly recommended (see below), but this should give you an idea of how easy WebCell is to use.

Obviously the GetCell function requires the proper numbers. The easiest way to get these is to fire up browser.php and enter the URI that you're feeding to WebCell. You'll see a page with various tables and cells. Find the cell that contains the information you want and look at the set of numbers in its top left corner. Those are the numbers you feed to GetCell, which will in turn return the content of that cell. If the layout of the target site changes in the future, simply reload the page in browser.php, find the new cell numbers, and update your code.

For another working example see browser.php.

Registration and Links

In exchange for using this script, you must register it by sending me the grand sum of some information! Simply drop me a quick note with your site name and address, and what you're using WebCell to do for your visitors. I would also greatly appreciate a link back to this page (http://www.ravis.org/code/webcell/). The link isn't required, but it is appreciated.

Functions

IMPORTANT: You should not attempt to bypass these functions and access the variables directly, as the tables are not parsed until their data is needed (to increase efficiency and decrease CPU cycles used). Hence if you try and access table data directly it may simply not be there (causing all sorts of problems with your resulting script).

WebCell(URI)

Object creator - takes the URI to retrieve as its only argument. Standard format: 'http://server:port/path'. The default port is 80.

Currently WebCell does not handle redirects, so make sure you have the real path, and if it ends in a directory (/testing/), make sure it has a trailing slash. SSL is not supported, so connecting to secure https:// servers won't work either. Both of these are doable; I just don't have the time to implement them right now.
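
Since a bad URI fails quietly, it can help to check it before creating the object. The following is only a sketch: webcell_uri_problems() is a hypothetical helper, not part of WebCell, and its trailing-slash test is a guess (a last path segment with no dot and no trailing slash is assumed to be a directory).

```php
<?php
// Hypothetical helper (NOT part of WebCell): lists likely problems with a
// URI before it is handed to the WebCell constructor.
function webcell_uri_problems($uri)
{
    $problems = array();
    $parts = parse_url($uri);

    // Expect at least a scheme and a host, as in http://server:port/path
    if ($parts === false || !isset($parts['scheme']) || !isset($parts['host'])) {
        $problems[] = "not in the form http://server:port/path";
        return $problems;
    }

    // SSL is not supported, so only plain http will work.
    if (strtolower($parts['scheme']) !== 'http') {
        $problems[] = "scheme is '" . $parts['scheme'] . "', but only http is supported";
    }

    // Heuristic (an assumption): an empty path, or a last segment with no
    // dot and no trailing slash, is probably a directory missing its slash.
    $path = isset($parts['path']) ? $parts['path'] : '';
    if ($path === '' || (substr($path, -1) !== '/' && strpos(basename($path), '.') === false)) {
        $problems[] = "path may be a directory without a trailing slash";
    }

    return $problems;
}

// Example: report anything suspicious about a placeholder URI.
foreach (webcell_uri_problems("https://www.someserver.com/testing") as $problem) {
    print "Warning: $problem\n";
}
```

An empty result means the URI passed these checks. It does not mean the server won't redirect, which still has to be checked by hand.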

EnableCache(cache_directory, cache_timeout)

Sets up the local cache. Using a local cache is highly recommended and greatly improves performance in almost all situations*.

Caching requires a directory that the script can write to (cache_directory) and the number of seconds that the cached version will remain in effect (cache_timeout - default 3600 (1 hour)).

If you're having problems with the cache (indicated by constantly reloading from the live site instead of the cached copy) it's most likely because the script doesn't have write permissions on the directory you're passing it. Most PHP scripts will run as the web server user, so make sure that user can write to that directory.

* The only exception being where you always want a fresh page loaded on every request (when the target content is updated according to your request, for example). If you don't need this, enable the cache.
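
The permission problem described above can be caught before the cache is enabled. This is a minimal sketch using PHP's standard is_dir() and is_writable(); cache_dir_problem() is a hypothetical helper, not part of WebCell, and the system temp directory is only a placeholder for your own cache directory.

```php
<?php
// Hypothetical helper (NOT part of WebCell): returns a message explaining
// why a cache directory is unusable, or an empty string if it looks fine.
function cache_dir_problem($dir)
{
    if (!is_dir($dir)) {
        return "'$dir' does not exist or is not a directory";
    }
    if (!is_writable($dir)) {
        return "'$dir' is not writable by the user PHP runs as";
    }
    return "";
}

// Placeholder directory: substitute the directory you plan to cache in.
$cacheDir = sys_get_temp_dir();

$problem = cache_dir_problem($cacheDir);
if ($problem !== "") {
    print "Cache would not work: $problem\n";
} else {
    print "Cache directory looks usable: $cacheDir\n";
    // With a real WebCell object you would now call, for example:
    // $wcObject->EnableCache($cacheDir, 3600);
}
```

Run the check from a page served by the web server rather than from the command line, since the two often run as different users.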

UserAgent(user_agent_string)

Sets the user agent that WebCell will report to the web server. Default is "WebCell {version}". If you want the script to appear (for example) as Internet Explorer on Windows 98, you could change this by calling:

$wcObject->UserAgent("Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)");

Calling UserAgent() without any arguments will return the current setting.

DisplayTimes(true/false)

Prints simple debug information about the amount of time taken to load and process pages. Useful for locating bottlenecks in the process. If you want to see how much better performance is with caching enabled, turn this on and reload your page a few times. By default it's off.

GetPage(force_reload)

Retrieves the page from either the live URI or a locally cached copy (if caching is enabled). This is actually optional. If you don't call it directly, it will be called the first time you try to retrieve data.

If you have cache enabled but want to force a reload of the live page for some reason, you can call GetPage(true) which will ignore any cached pages and load directly from the live site.

GetCell(table,row,column)

Retrieves the content of the specified cell.

PrintTable(table)

Prints a nicely formatted table of the data retrieved from the URI. Useful for finding the information you need when building your web scraping app. See browser.php for examples.

GetTable(table)

Returns the entire specified table as a 2D array. Access the data using $ArrayName[row][column]. This is really no faster or slower than using the built-in GetCell function. It's just here in case you want to use it.
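
As a rough illustration of working with a 2D array of this shape, the sketch below turns each row into one tab-separated line. table_to_lines() is a hypothetical helper, and the sample data is made up to stand in for the return value of GetTable(). It uses foreach so that it does not depend on whether the real indexes start at 0 or 1.

```php
<?php
// Hypothetical helper (NOT part of WebCell): converts a 2D array shaped
// like $ArrayName[row][column] into one tab-separated line per row.
function table_to_lines($table)
{
    $lines = array();
    foreach ($table as $row) {
        // Join the cells of this row, whatever their index numbers are.
        $lines[] = implode("\t", $row);
    }
    return $lines;
}

// Made-up sample data standing in for $wcObject->GetTable(2).
$sample = array(
    array("City", "High", "Low"),
    array("Springfield", "21", "9"),
);

print implode("\n", table_to_lines($sample)) . "\n";
```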

NumberOfCells(table,row)

Returns the number of cells in the specified row of the specified table.

NumberOfRows(table)

Returns the number of rows in the specified table.

Legal Reminder

As always, remember that you may be dealing with copyrighted material, and it's up to you to get permission from the source before using their content. This is standard, everyone should know this by now...

Copyright

This script is copyright (c) 2002 by Travis Richardson (http://www.ravis.org/). You have my permission to use it for personal or commercial purposes free of charge, as well as make whatever modifications you would like, so long as this copyright notice remains intact.

Please do not redistribute the script yourself - it's freely available on my web site for all to use, and you are more than welcome to link to it from your site if you wish. :-)

 

 
