{"id":1759,"date":"2011-08-24T23:44:23","date_gmt":"2011-08-24T15:44:23","guid":{"rendered":"http:\/\/www.mudone.com\/?p=1759"},"modified":"2011-08-29T09:26:02","modified_gmt":"2011-08-29T01:26:02","slug":"headless-html-rendering-engines","status":"publish","type":"post","link":"https:\/\/www.mudone.com\/?p=1759","title":{"rendered":"Headless HTML rendering engines"},"content":{"rendered":"<p><a href=\"http:\/\/www.holovaty.com\/writing\/headless-html-rendering-engine\/\">http:\/\/www.holovaty.com\/writing\/headless-html-rendering-engine\/<\/a><br \/>\nThe comments on this article discuss a variety of solutions.<\/p>\n<blockquote><p>Request: Headless HTML rendering engine?<\/p>\n<p>Written by Adrian Holovaty on May 2, 2008<\/p>\n<p>Warning: Seriously geeky request ahead!<\/p>\n<p>I&#8217;m looking for a way to render arbitrary Web pages &#8212; including CSS and JavaScript &#8212; and access the resulting DOM tree programmatically, i.e., in an automated\/headless fashion. I want to be able to ask the following questions of the resulting DOM tree:<\/p>\n<p>For a given element, what font family, size, and color is the text?<br \/>\nHow tall and wide (in pixels) is a given<\/p>\n<div>,<br \/>\n, etc.? What are the x\/y coordinates of a given element (from the upper-left corner of the page, or lower-left, or wherever)? For a given element, what is its text content?<\/div>\n<p>The rendering must be state-of-the-art, handling advanced CSS that Firefox, Safari and IE handle. It should work on Linux. Bonus points if there&#8217;s a Python API for this magical DOM tree.<\/p>\n<p>This is all stuff that standard in-page JavaScript could accomplish, but the catch with me is that I need to be able to do it in a completely automated way, on arbitrary pages, on a headless server.<\/p>\n<p>I know Gecko and Webkit provide this, but I&#8217;m not sure where to start with them. 
The docs and articles I&#8217;ve read seem to be focused more on embedding the full browser window in a GUI application than embedding the rendering engine itself and manipulating the resulting pages.<\/p>\n<p>Help! If you have any clues, I&#8217;d be grateful if you left a comment or got in touch with me.<br \/>\nComments<br \/>\nPosted by Andrew Sutherland on May 2, 2008 at 2:45 a.m.:<\/p>\n<p>PyXPCOM (http:\/\/developer.mozilla.org\/en\/docs\/PyXPCOM) should handle the Python part of the Gecko equation.<\/p>\n<p>I myself am no specific help on the gecko side of things, but I think the following post\/thread on the PyXPCOM mailing list may be of assistance:<\/p>\n<p>http:\/\/aspn.activestate.com\/ASPN\/Mail\/Message\/pyxpcom\/3619998<br \/>\nPosted by Rene Dudfield on May 2, 2008 at 3:19 a.m.:<\/p>\n<p>You can set up a headless X server, then run firefox, or whatever browser with a standard build.<br \/>\nPosted by Michael Twomey on May 2, 2008 at 4:46 a.m.:<\/p>\n<p>If you want an example of using webkit to do headless stuff you could look at webkit2png which is a tool for taking screenshots of websites from the command line. It uses webkit and pyobjc, so you&#8217;ll need a mac. It doesn&#8217;t do any DOM stuff that I can see but it might be a useful starting point for writing an automated tool.<br \/>\nPosted by Justin Mason on May 2, 2008 at 5:01 a.m.:<\/p>\n<p>http:\/\/khtml2png.sourceforge.net\/ might be useful, if you&#8217;re doing this on a *NIX platform. Looks like it&#8217;s well-maintained, too, since the most recent release was only a couple of weeks ago.<br \/>\nPosted by G\u00e1bor Farkas on May 2, 2008 at 5:10 a.m.:<\/p>\n<p>in case of firefox, there are 2 issues:<\/p>\n<p>1. run it somehow in a headless mode: for this, try Xvfb. it starts a headless X server. then you can run firefox in it.<\/p>\n<p>2. communicate with the firefox instance. 
there is PyXPCOM, as others already mentioned, which could make it work.<br \/>\nPosted by Jason on May 2, 2008 at 7:04 a.m.:<\/p>\n<p>If you want to muck in C++ code you could look at RenderTreeAsText in Webkit. For actually setting up the rendering engine, there&#8217;s some relatively simple high-level apis in the wx and qt ports that seem pretty readable; the kind of api you&#8217;d use for those neat &#8220;write a web browser in 5 lines of code&#8221; demos. See WebFrame in particular. Disclaimer: I&#8217;ve never written anything with webkit, but it might be fun to learn.<br \/>\nPosted by anonymous on May 2, 2008 at 8:15 a.m.:<\/p>\n<p>What about Selenium? or Watir?<br \/>\nPosted by anonymous on May 2, 2008 at 8:50 a.m.:<\/p>\n<p>I haven&#8217;t tried this (but am planning to), so I don&#8217;t know if it really meets your needs, but HTMLUnit is a Java-based headless browser (designed for testing).<br \/>\nPosted by anonymous on May 2, 2008 at 10:15 a.m.:<\/p>\n<p>Attributes such as pixel width, height, font etc will either be determined by CSS, or they will be agent (and user setup) specific.<\/p>\n<p>The pixel width of a div of width 50% will depend on the size of the viewport &#8211; which of course would be anything. Do you intend to &#8216;fake&#8217; the settings of a user agent? If so, then a simple calculation would get the pixel width (as you would know your viewport dimensions).<\/p>\n<p>I really would consider seeing how far you can get by simply manipulating the dom and parsing the css (both of which are easily achieved with the python libraries urllib, lxml \/ beautifulsoup and cssutils).<\/p>\n<p>I know, I know; None of this helps with javascript dependent attributes.<\/p>\n<p>RC<br \/>\nPosted by alan taylor on May 2, 2008 at 10:36 a.m.:<\/p>\n<p>Have you looked at JSSh? 
Not sure if it fits the bill, but it just might &#8211; it&#8217;s a &#8220;Mozilla C++ extension module that allows other programs (such as telnet) to establish JavaScript shell connections to a running Mozilla process via TCP\/IP&#8221; I know it can return some parts of the DOM, but not sure how much detailed info you can get back from it. http:\/\/www.croczilla.com\/jssh<br \/>\nPosted by Matthew Marshall on May 2, 2008 at 10:42 a.m.:<\/p>\n<p>I&#8217;ve played with doing this a little. The best I came up with was using PyKDE and khtml. I&#8217;m pretty sure it requires an X server, but if nothing else you could use a vnc server.<\/p>\n<p>MWM<br \/>\nPosted by Kumar McMillan on May 2, 2008 at 11:40 a.m.:<\/p>\n<p>There are probably several ways to do it, but the first that comes to mind is using the Python driver for Selenium RC &#8230;<\/p>\n<p>from selenium import selenium<\/p>\n<p># with the selenium-rc (Java) proxy server running at localhost:4444 &#8230;<\/p>\n<p>selenium = selenium(&#8220;localhost&#8221;, 4444, &#8220;*firefox&#8221;, &#8220;http:\/\/thewebsite.com&#8221;)<\/p>\n<p>selenium.open(&#8220;\/&#8221;)<\/p>\n<p>selenium.wait_for_page_to_load(&#8216;30000&#8217;)<\/p>\n<p>selenium.get_html_source() # this includes any JavaScript DOM manipulations, of course<\/p>\n<p>selenium.get_element_position_left(&#8220;xpath=\/\/div[1]&#8221;)<\/p>\n<p>selenium.get_element_position_top(&#8220;xpath=\/\/div[1]&#8221;)<\/p>\n<p>selenium.get_element_height(&#8220;xpath=\/\/table[1]&#8221;)<\/p>\n<p>selenium.capture_screenshot(&#8216;\/tmp\/site.png&#8217;)<\/p>\n<p>&#8230; but I&#8217;m not sure how you get the font\/text info. Selenium RC is designed to run headless and also has a &#8220;grid&#8221; implementation so you can throw more hardware at it. 
Scaling up to the grid is very transparent &#8212; same code as above, more or less.<\/p>\n<p>Links:<\/p>\n<p>http:\/\/selenium-rc.openqa.org\/<\/p>\n<p>http:\/\/selenium-rc.openqa.org\/python.html<\/p>\n<p>http:\/\/selenium-grid.openqa.org\/<br \/>\nPosted by anonymous on May 2, 2008 at 12:02 p.m.:<\/p>\n<p>seconding the jssh suggestion http:\/\/www.urbanhonking.com\/ideasfordozens\/archives\/2008\/03\/automating_fire.html<br \/>\nPosted by Ryan Shaw on May 2, 2008 at 12:26 p.m.:<\/p>\n<p>You might want to check out Crowbar:<\/p>\n<p>Crowbar is a web scraping environment based on the use of a server-side headless mozilla-based browser. Its purpose is to allow running javascript scrapers against a DOM to automate web sites scraping but avoiding all the syntax normalization issues.<br \/>\nPosted by mikeal on May 2, 2008 at 1:49 p.m.:<\/p>\n<p>I would go with windmill over Selenium if you&#8217;re going down that road. We have far more comprehensive javascript support, you can use execJS to get back the result of any arbitrary js.<\/p>\n<p>http:\/\/windmill.osafoundation.org<\/p>\n<p>And jssh is great, but MozRepl is jssh on crack.<\/p>\n<p>http:\/\/hyperstruct.net\/projects\/mozrepl<\/p>\n<p>The whole interface is much much nicer and I&#8217;m in the middle of a Python JavaScript bridge using MozRepl that I&#8217;ll be sure to send you a link to once it&#8217;s public.<br \/>\nPosted by Henning on May 2, 2008 at 2:29 p.m.:<\/p>\n<p>Qt 4.4 is available on all platforms and contains a WebKit port. Fortunately the newest PyQt snapshots also contain support for WebKit. Because Qt can render every widget to a pixmap, it should be fairly easy. 
To run Qt headless you could use xvfb.<\/p>\n<p>To access the DOM you can query with Javascript.<\/p>\n<p>The following is _not_ tested:<\/p>\n<p>from PyQt4.QtCore import *<\/p>\n<p>from PyQt4.QtGui import *<\/p>\n<p>from PyQt4.QtWebKit import *<\/p>\n<p>import sys<\/p>\n<p>app = QApplication(sys.argv)<\/p>\n<p>browser = QWebView()<\/p>\n<p>browser.show()<\/p>\n<p>browser.resize(800,600)<\/p>\n<p>#browser.setHtml(&#8220;Hello, world&#8221;)<\/p>\n<p>browser.load(QUrl(&#8220;http:\/\/www.djangoproject.com&#8221;))<\/p>\n<p>pm = QPixmap.grabWidget(browser)<\/p>\n<p>pm.save(&#8220;website.jpg&#8221;)<\/p>\n<p>body = browser.page().mainFrame().evaluateJavaScript(&#8220;document.getElementsByTagName(&#8216;body&#8217;)[0]&#8221;)<br \/>\nPosted by anonymous on May 2, 2008 at 6:08 p.m.:<\/p>\n<p>HTMLUnit is a very good headless browser implementation. It supports different browsers and JavaScript (using Rhino I think). And finally, is under active development.<\/p>\n<p>http:\/\/htmlunit.sourceforge.net\/<\/p>\n<p>Unfortunately, it&#8217;s a Java library but you could use jpython to access it.<br \/>\nPosted by anonymous on May 2, 2008 at 6:12 p.m.:<\/p>\n<p>I looked at a few open source projects to do headless rendering.<\/p>\n<p>It&#8217;s tempting to use firefox\/gecko but the learning curve is steep,<\/p>\n<p>it&#8217;s 2 mln lines of netscape legacy C++ code.<\/p>\n<p>But if you figure it out you&#8217;ll have a fine tool.<\/p>\n<p>What is working for me now is lobo renderer (from cobra browser) (in java).<\/p>\n<p>It&#8217;s not the best rendering engine, but it&#8217;s decent, and easy to program.<\/p>\n<p>You can get rendered blocks and dom objects, and answer all the questions<\/p>\n<p>as to block location, color, text etc.<\/p>\n<p>It can be made to work on linux completely headless without an x server,<\/p>\n<p>the way I have it working is it takes in a url or html, and saves to another<\/p>\n<p>textual file format. 
What&#8217;s important is to encapsulate your choice<\/p>\n<p>of rendering engine, because it will change.<\/p>\n<p>Email me at dmitrim at yahoo dot com if you need help.<br \/>\nPosted by Phil on May 2, 2008 at 7:31 p.m.:<\/p>\n<p>Personally I&#8217;d try it with MozRepl and an X virtual framebuffer: http:\/\/emacspeak.blogspot.com\/2007\/06\/firebox-put-fox-in-box.html<br \/>\nPosted by Daniel on May 2, 2008 at 7:46 p.m.:<\/p>\n<p>As suggested above, run firefox on a virtual X server. Use a firefox extension (mozrepl or jssh) to get automated control over the browser.<\/p>\n<p>I set up a system doing exactly this (for taking screenshots) last summer. In the end it barely took any code, just a fair amount of faffing with config files. Happy to give more details if it&#8217;s helpful: (my first name) at ohuiginn.net<br \/>\nPosted by rex on May 3, 2008 at 8:44 a.m.:<\/p>\n<p>I went through trying to work out a way to do this ages ago.<\/p>\n<p>Not sure if you&#8217;re feeling the same, Adrian, but what bothered me (purely from a principle level) was that I really wanted to be able to do this on my server _without_ having to run a headless X server, or an instance of firefox or whatever.. i wanted a library that was able to do it.. and give back my responses without having the unnecessary overhead of a browser, x server etc running (i know very little about it&#8230; but i can&#8217;t help but feel that these are unnecessary elements in the equation).<\/p>\n<p>Surely there is a way to do what you&#8217;re asking without having a program running that is designed to actually render the pictures on a screen&#8230; *shrug*<br \/>\nPosted by anonymous on May 5, 2008 at 4:55 a.m.:<\/p>\n<p>rex: Rendering HTML nowadays is a heavy complex task. So there is no light library, unfortunately. It sounds like using PyQt is the smartest approach because it does not load a full application but only a rendering engine you can fully control. 
Having a dummy X-server on Unix seems to be a necessary evil.<br \/>\nPosted by Eric Moritz on May 5, 2008 at 3:50 p.m.:<\/p>\n<p>I was thinking of this very issue a while back:<\/p>\n<p>http:\/\/eric.themoritzfamily.com\/2008\/02\/08\/python-interface-mozilla-dom\/<\/p>\n<p>I came across this guy&#8217;s post:<\/p>\n<p>http:\/\/ejohn.org\/blog\/bringing-the-browser-to-the-server\/<\/p>\n<p>He&#8217;s using Rhino and some custom javascript to emulate the browser&#8217;s window object.<br \/>\nPosted by John Herren on May 13, 2008 at 1:20 a.m.:<\/p>\n<p>Rhino ftw<\/p><\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p>http:\/\/www.holovaty.com\/writing\/headless-html-rendering [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1759","post","type-post","status-publish","format-standard","hentry","category-passed-times"],"_links":{"self":[{"href":"https:\/\/www.mudone.com\/index.php?rest_route=\/wp\/v2\/posts\/1759","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mudone.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mudone.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mudone.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mudone.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1759"}],"version-history":[{"count":2,"href":"https:\/\/www.mudone.com\/index.php?rest_route=\/wp\/v2\/posts\/1759\/revisions"}],"predecessor-version":[{"id":1761,"href":"https:\/\/www.mudone.com\/index.php?rest_route=\/wp\/v2\/posts\/1759\/revisions\/1761"}],"wp:attachment":[{"href":"https:\/\/www.mudone.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1759"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mudone.c
om\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=1759"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mudone.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1759"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}