Parsing Response HTML

Written by sizzlemctwizzle — Last update Sep 29, 2010

If you are parsing XML, using DOMParser is easy and straightforward.

GM_xmlhttpRequest({
    method: 'GET',
    url: url,
    onload: function (responseDetails) {
        var responseXML = new DOMParser().parseFromString(responseDetails.responseText,
            'text/xml');
    }
});
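Note that parseFromString does not throw when the input is not well-formed. Browsers such as Firefox instead return a document that contains a parsererror element. A minimal sketch of a check for that case follows; the helper names isParseError and parseXML are made up for this example and are not part of any API.

```javascript
// Hypothetical helper: true if the parsed document is an error document,
// i.e. it contains a <parsererror> element.
function isParseError(xmlDoc) {
    return xmlDoc.getElementsByTagName('parsererror').length > 0;
}

// Hypothetical helper: parse text as XML and return null when parsing failed.
function parseXML(text) {
    var xmlDoc = new DOMParser().parseFromString(text, 'text/xml');
    return isParseError(xmlDoc) ? null : xmlDoc;
}
```

Inside onload, parseXML(responseDetails.responseText) would then give you either a usable document or null.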

However, the DOMParser approach only works because XML is strictly formatted. If you want to parse an HTML page so that you can use XPath on it, parsing it as 'text/xml' will likely fail with a "not well-formed" XML parsing error, because real-world HTML is rarely well-formed XML. Luckily, Seniltai has found and posted another solution. Create a new document and fill it with the response HTML:

function getDOC(url, callback) {
    GM_xmlhttpRequest({
        method: 'GET',
        url: url,
        onload: function (responseDetails) {
            // Build an empty document with an HTML 4.01 Transitional doctype.
            var dt = document.implementation.createDocumentType("html",
                    "-//W3C//DTD HTML 4.01 Transitional//EN", "http://www.w3.org/TR/html4/loose.dtd"),
                doc = document.implementation.createDocument('', '', dt),
                html = doc.createElement('html');

            // Parse the response text into the root element, then attach it to the document.
            html.innerHTML = responseDetails.responseText;
            doc.appendChild(html);
            callback(doc);
        }
    });
}

getDOC('http://example.com/', function (doc) { alert(doc.documentElement.innerHTML); });
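In more recent browsers, DOMParser also accepts 'text/html' as the content type, which avoids building the document by hand. The following is a sketch under that assumption and is not part of the solution described above; getHTMLDoc is a hypothetical name, and it is worth checking that your target browser supports 'text/html' before relying on it.

```javascript
// Hypothetical variant of getDOC: lets DOMParser build the HTML document.
// Assumes the browser's DOMParser supports the 'text/html' content type.
function getHTMLDoc(url, callback) {
    GM_xmlhttpRequest({
        method: 'GET',
        url: url,
        onload: function (responseDetails) {
            var doc = new DOMParser().parseFromString(responseDetails.responseText,
                'text/html');
            callback(doc);
        }
    });
}
```

It is called the same way as getDOC: pass a URL and a callback that receives the parsed document.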

Once you have the document, you can use evaluate and getElementsByTagName on doc:

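getDOC('http://example.com/', function (doc) {
    // XPathResult.NUMBER_TYPE has the value 1, so count() comes back as a number.
    alert(doc.evaluate('count(.//a)', doc, null, XPathResult.NUMBER_TYPE, null).numberValue);
    alert(doc.getElementsByTagName('a').length);
});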
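The fourth argument to evaluate selects the result type; the number type (value 1) returns a single number. To get the matching nodes themselves rather than a count, you can ask for an ordered node snapshot instead. Here is a minimal sketch; xpathNodes is a hypothetical helper name, and the numeric constant is written out so the function does not depend on XPathResult being in scope.

```javascript
// Hypothetical helper: returns the nodes matched by an XPath expression as an array.
function xpathNodes(doc, expression) {
    // 7 is the value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
    var ORDERED_NODE_SNAPSHOT_TYPE = 7;
    var result = doc.evaluate(expression, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
    var nodes = [];
    for (var i = 0; i < result.snapshotLength; i++) {
        nodes.push(result.snapshotItem(i));
    }
    return nodes;
}

// Example use inside a getDOC callback:
//   var links = xpathNodes(doc, './/a[@href]');
//   alert(links.length);
```

A snapshot is a static list, so it stays valid even if you modify the document while looping over the returned array.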