If you are parsing XML, using the DOMParser is very easy and straightforward.
GM_xmlhttpRequest({
  method: 'GET',
  url: url,
  onload: function(responseDetails) {
    var responseXML = new DOMParser().parseFromString(responseDetails.responseText, 'text/xml');
  }
});
However, this code only works because XML is strictly formatted. If you want to parse an HTML page so that you can use XPath on it, this code will likely give you an "XML not well-formatted" error. Luckily, Seniltai has found and posted another solution: create a new document and fill it with the response HTML:
function getDOC(url, callback) {
  GM_xmlhttpRequest({
    method: 'GET',
    url: url,
    onload: function (responseDetails) {
      var dt = document.implementation.createDocumentType("html",
            "-//W3C//DTD HTML 4.01 Transitional//EN", "http://www.w3.org/TR/html4/loose.dtd"),
          doc = document.implementation.createDocument('', '', dt),
          html = doc.createElement('html');
      html.innerHTML = responseDetails.responseText;
      doc.appendChild(html);
      callback(doc);
    }
  });
}
getDOC('http://example.com/', function(doc) { alert(doc.documentElement.innerHTML) });
Once you do this, you can use evaluate and getElementsByTagName on doc:
getDOC('http://example.com/', function(doc) {
  // The result type 1 is XPathResult.NUMBER_TYPE
  alert(doc.evaluate('count(.//a)', doc, null, 1, null).numberValue);
  alert(doc.getElementsByTagName('a').length);
});
Excuse me, but what's the point of using DOMParser at all? Your HTML is parsed as soon as you set it as an element's innerHTML.
GM_xmlhttpRequest({
  method: 'GET',
  url: 'http://userscripts.org',
  onload: function(responseDetails) {
    var holder = document.createElement('div');
    holder.innerHTML = responseDetails.responseText.split(/<body[^>]*>((?:.|\n)*)<\/body>/i)[1];
    alert(document.evaluate('count(.//a)', holder, null, 1, null).numberValue);
    alert(holder.getElementsByTagName('a').length);
  }
});
You seem to do three times more work than necessary.
You're absolutely right about the DOMParser being useless for parsing HTML. My old method was only temporary until I found something better, which I did just recently.
Perhaps you meant
/<body[^>]*>((?:.|\n|\r)*)<\/body>/i
on your regular expression? You seem to deserve a -1 here.
As w35l3y once explained to me, using a regular expression for this is a very bad idea, and it's best to just leave the response alone unless you really need a completely valid document.
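To make the trade-off concrete, here is a small sketch of the body-extraction regular expression discussed above, written with [\s\S] so that both \n and \r are covered. The function name extractBody and the sample strings are made up for illustration; the second sample shows one way the approach can go wrong.

```javascript
// Hypothetical helper: returns the text between <body ...> and </body>,
// or null when the pattern does not match.
function extractBody(html) {
  // [\s\S] matches any character, including \r and \n
  var match = /<body[^>]*>([\s\S]*)<\/body>/i.exec(html);
  return match ? match[1] : null;
}

// A simple page with Windows-style line endings: extraction works.
var simple = '<html><head></head>\r\n<body class="x">\r\n<a href="#">link</a>\r\n</body></html>';

// A page where the text "</body>" also appears inside a script string:
// the greedy match runs past the real closing tag.
var tricky = '<body><p>text</p></body><script>var s = "</body>";</script>';

var simpleBody = extractBody(simple);
var trickyBody = extractBody(tricky); // not just '<p>text</p>'
```

Cases like the second one are why leaving the response alone, or handing the whole string to innerHTML, tends to be safer than slicing it with a pattern.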
The DOMParser is of very little use for parsing HTML (I think it can only parse strict, maybe transitional, XHTML without running into errors). The method using document.implementation.createDocument works best, imo. To create a fully valid document, you need to parse the content of both the head and body elements, create new elements with the respective parsed content as their innerHTML, and then append those to the html element created in the example. Another option is the library John Resig wrote, which I mention in the guide (I would probably lean towards using it if you absolutely need a valid document).
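As a rough illustration of the head/body steps described above, here is a minimal sketch. The helper names sectionContent and fillDocument are hypothetical, and the sections are pulled out with a regular expression, which carries the fragility warned about earlier in this thread. The doc argument is assumed to be an empty document, such as one returned by document.implementation.createDocument.

```javascript
// Hypothetical helper: returns the inner markup of the first <tag>...</tag>
// section, or '' when the section is missing.
function sectionContent(html, tag) {
  // (?:\s[^>]*)? allows attributes but keeps "head" from matching "<header>"
  var re = new RegExp('<' + tag + '(?:\\s[^>]*)?>([\\s\\S]*?)<\\/' + tag + '>', 'i');
  var match = re.exec(html);
  return match ? match[1] : '';
}

// Hypothetical helper: builds html > head + body inside an empty document.
function fillDocument(doc, html) {
  var root = doc.createElement('html');
  ['head', 'body'].forEach(function (tag) {
    var el = doc.createElement(tag);
    el.innerHTML = sectionContent(html, tag);
    root.appendChild(el);
  });
  doc.appendChild(root);
  return doc;
}

// Possible use inside an onload handler (assumes dt was created as in the guide):
//   var doc = document.implementation.createDocument('', '', dt);
//   fillDocument(doc, responseDetails.responseText);
```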
Also take a look at this commit http://github.com/Tim-Smart/greasemonkey/commit... and the few commits before it.
Probably something that shouldn't be in Greasemonkey Core, but it was cool to play around with.
@sizzlemctwizzle
Hmm, I don't think your hack is working in
Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6.
Also, John Resig's method causes an uncaught exception in
Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8
Hack works again.
I tried using the second method to create a doc object from responseText. It works fine in Firefox (3.6.10) for me, but blows up in Chrome (7.0.517.0).
Can anyone confirm this, or is it just me?
I was looking for a way to do this but the above method gave me some problems as I explained in this post. I finally found out that using
document.implementation.createHTMLDocument('');
to create a document solved all my issues :)
Example code:
var text = '<html><head><title>page title</title></head><body><img src="image.png"/></body></html>';
var doc = document.implementation.createHTMLDocument('');
doc.documentElement.innerHTML = text;
[doc.evaluate('.//title', doc, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null).singleNodeValue, doc.querySelector('img')];
Note: this method isn't available in Firefox 3.x, so the previous method should be used if(!document.implementation.createHTMLDocument)
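The fallback mentioned in the note could be folded into a single helper, roughly like the sketch below. The name htmlToDocument is hypothetical, and impl is assumed to be document.implementation (it is passed in as a parameter so that the function does not depend on a global). The fallback branch is the createDocument approach from the guide.

```javascript
// Hypothetical helper: builds a document from an HTML string, preferring
// createHTMLDocument and falling back to createDocument where it is missing.
function htmlToDocument(text, impl) {
  var doc;
  if (typeof impl.createHTMLDocument === 'function') {
    doc = impl.createHTMLDocument('');
    doc.documentElement.innerHTML = text;
  } else {
    // Fallback for browsers without createHTMLDocument (e.g. Firefox 3.x)
    var dt = impl.createDocumentType('html',
        '-//W3C//DTD HTML 4.01 Transitional//EN',
        'http://www.w3.org/TR/html4/loose.dtd');
    doc = impl.createDocument('', '', dt);
    var html = doc.createElement('html');
    html.innerHTML = text;
    doc.appendChild(html);
  }
  return doc;
}

// Possible use inside an onload handler:
//   var doc = htmlToDocument(responseDetails.responseText, document.implementation);
```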
Thanks for the guide, works for me!
In Scriptish there is GM_safeHTMLParser, but it strips out some HTML tags, so using document.implementation is the alternative.