Script Summary: A webscraping script that converts selections on a web page into a template that can be used automatically to extract the (changing) values of the selected parts.
Webscraper / Xidelscript
This script creates templates that you can use to "scrape" data from the web page.
You simply select the values you are interested in on the web page, and the script automatically converts these selections into a corresponding template.
A selection is highlighted with a pink border, and you can move the mouse over it to change the following extended properties:
|xxx :=||In the first edit box, you can assign a variable name to the selection.|
|:= xxx||In the second edit box, you can add an XPath expression that should be read from the selection (e.g. @title).|
|X||The X button removes this selection.|
|follow link||This button marks the selection as a link which should be followed.|
|read repetition||This button lets you select a repetition of this selection, e.g. if you want to read all rows of a table. See below.|
|optional||Marks this selection as optional, so that the template extracts other existing selections, even if this one does not exist.|
|match children||The selection will only be accepted as a match if it has the same content as it has now. (This is not useful for values with changing text, but it helps if you want to find the URL of a link with a certain text.)|
Generally, you do not want to extract a single value, but a list of values like a table. In this case, you can select the first value, click the "read repetition" button and select the second value.
All following values are then detected automatically: the script finds the common ancestor of the first and second selection and tests each sibling below it for a similar value. The first sibling, which contains the first selection, is marked blue; all other matches are marked green.
If some repetitions are not marked green, you have either selected the second value incorrectly (e.g. it often happens that you accidentally also select a space in the parent element of the value you want to select) or the webpage is too complicated for this script.
All values within the first/blue marked sibling will also be read from the green ones, i.e. you do not have to select repetitions for them. (see the screenshot)
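The repetition detection described above can be sketched roughly as follows. This is only an illustration in Python, not the script's actual code: the function names (`path_to`, `descend`, `find_repetitions`) are made up for this example, and "contains a similar value" is simplified to "has an element at the same chain of tag names".

```python
import xml.etree.ElementTree as ET

def path_to(root, target):
    """Return the elements from root down to target (inclusive), or None."""
    if root is target:
        return [root]
    for child in root:
        sub = path_to(child, target)
        if sub is not None:
            return [root] + sub
    return None

def descend(element, tags):
    """Follow a chain of tag names, taking the first matching child each time."""
    for tag in tags:
        element = element.find(tag)
        if element is None:
            return None
    return element

def find_repetitions(root, first, second):
    """Return (blue_sibling, matches) for two selected elements.

    Assumption: both selections are distinct and not nested in each other.
    """
    p1, p2 = path_to(root, first), path_to(root, second)
    if p1 is None or p2 is None:
        raise ValueError("selection is not part of the document")
    depth = 0
    while depth < min(len(p1), len(p2)) and p1[depth] is p2[depth]:
        depth += 1
    if depth == 0 or depth >= len(p1) or depth >= len(p2):
        raise ValueError("selections must be distinct and not nested")
    ancestor = p1[depth - 1]      # deepest element containing both selections
    blue = p1[depth]              # sibling that contains the first selection
    rel_tags = [e.tag for e in p1[depth + 1:]]
    matches = []
    for sibling in ancestor:
        if sibling.tag != blue.tag:
            continue
        value = descend(sibling, rel_tags)
        if value is not None:     # simplified similarity test
            matches.append(value)
    return blue, matches

PAGE = """<table>
<tr><th>Name</th><th>Price</th></tr>
<tr><td>Apple</td><td>1.20</td></tr>
<tr><td>Pear</td><td>0.90</td></tr>
<tr><td>Plum</td><td>2.10</td></tr>
</table>"""

root = ET.fromstring(PAGE)
rows = root.findall("tr")
first = rows[1].find("td")    # user selects "Apple"
second = rows[2].find("td")   # then "read repetition" and selects "Pear"
blue, values = find_repetitions(root, first, second)
print([v.text for v in values])
```

In this example the header row is skipped because it has no td element, which mirrors how a row without a similar value would not be marked green.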
Matching / General options
If you click on the "Show options" checkbox, you can control which parts of the webpage should be included in the template.
The script generally works by creating a skeletal structure of the webpage, which can be matched against the actual webpage to find the selected values.
This is a little bit tricky: if the skeletal structure is too rough, it may match values other than the intended ones; if it is too dense, it will not match anything (e.g. if it includes session ids and they have changed).
In the default configuration, it only considers id/class/name attributes for the template, and from these attributes it excludes those whose values look temporary or contain numbers (many pages give an autogenerated numerical id to each element, which must not be included).
It will also ignore tbody-tags, since Firefox inserts these elements, even if they do not exist on the actual webpage.
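As a rough illustration of these default rules, the following Python sketch builds such a skeletal structure from a small page. It is not the script's implementation: the names `skeleton` and `skeleton_attributes` are invented here, and the "looks temporary" heuristic is not specified above, so only the "contains numbers" check is modelled.

```python
import re
import xml.etree.ElementTree as ET

KEPT_ATTRIBUTES = {"id", "class", "name"}   # attributes considered by default
HAS_DIGIT = re.compile(r"\d")

def skeleton_attributes(attributes):
    """Keep id/class/name attributes whose values contain no digits."""
    return {
        key: value
        for key, value in attributes.items()
        if key in KEPT_ATTRIBUTES and not HAS_DIGIT.search(value)
    }

def skeleton(element):
    """Return a nested (tag, attributes, children) tuple without tbody wrappers."""
    children = []
    for child in element:
        if child.tag == "tbody":
            # Skip the tbody element itself, but keep its rows.
            children.extend(skeleton(grandchild) for grandchild in child)
        else:
            children.append(skeleton(child))
    return (element.tag, skeleton_attributes(element.attrib), children)

PAGE = (
    '<table id="results" class="list" style="color:red"><tbody>'
    '<tr id="row_4711" class="entry"><td name="price">1.20</td></tr>'
    '</tbody></table>'
)
structure = skeleton(ET.fromstring(PAGE))
print(structure)
```

Here the style attribute and the numbered id "row_4711" are dropped, while the stable id "results" and the class and name attributes survive.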
If you use this script with Xidel/VideLibri, you can create multipage templates by clicking the "multipage template" checkbox. (You can also create these templates without Xidel/VideLibri, but no other program can use this kind of template.)
If this checkbox is selected, all visited pages and all singlepage templates will be logged in the script window, and at the bottom you will find the final multipage template.
If you are lucky, you can just copy it to Xidel/VideLibri, but usually the urls will contain things like session ids and cannot be used directly. Then you need to extract the changing values on the first page into a variable and replace each value with $variable; in the urls (i.e. $foobar; is replaced by the value of foobar). You can also change the post data, give XPath conditions that must be satisfied for this singlepage template to be used, or give a list of values and apply the singlepage template to each of them.
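To make the $variable; replacement concrete, here is a small Python sketch of the substitution. It only mimics the behaviour described above and is not how Xidel/VideLibri implement it; the function name `expand`, the example url, the variable name `sid`, and the set of characters allowed in a variable name are all assumptions.

```python
import re

# Assumption: a variable name consists of letters, digits and underscores.
VARIABLE = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*);")

def expand(url_template, variables):
    """Replace every $name; in the url with the value of the variable name."""
    def replace(match):
        name = match.group(1)
        # Unknown variables are left untouched in this sketch.
        return variables.get(name, match.group(0))
    return VARIABLE.sub(replace, url_template)

# "sid" stands for a session id extracted on the first page.
url = expand("http://example.org/list?session=$sid;&page=2", {"sid": "a1b2c3"})
print(url)
```

The sketch inserts the value as is; whether a value needs url encoding depends on the page and is not handled here.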
You can move the script interface around by clicking on its caption bar, or on the <</>> buttons.
Some webpages do not allow selections or add an event handler that loads another page, if something is selected/clicked. In these cases you can activate the caret mode (F7 in Firefox) and select the values with the keyboard. Then click on "add selection" in the option box.
You can test whether the current template actually works by clicking "remote test", which sends the entire webpage with the template to my Xidel-CGI-service. (Regarding privacy: it will not store anything, except for the standard logging SourceForge does on all its servers. Sometimes the SourceForge server has caching problems and you get a 500 error; then you need to click the button again until it works.)