November 19, 2013

HTML output plugin for Crawljax

I've just created an HTML output plugin for Crawljax, check it out! :-)

It's an OnNewStatePlugin, so it's called whenever the DOM changes in the browser, which lets you save every state to a file. The plugin makes sure each state is stored only once; equality is tested by comparing MD5 hashes. Filenames are generated from URLs, and SaveHTML creates the same directories as they appear in the URL. Directories and files are created inside the Crawljax output directory, so you should specify that directory in the Crawljax configuration. This way you can store a mirror of a site, but links are not rewritten, so they won't work in all cases. If Crawljax meets different DOM states on the same URL, SaveHTML adds a counter to the end of the filename.
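To illustrate the "store each state only once" idea, here is a minimal Java sketch. It is not the plugin's actual code: the class name StateDeduplicator, the method names, and the exact counter format ("_2", "_3", ...) are assumptions made for this example, and it assumes duplicates are detected across the whole crawl rather than per URL.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Crawljax or SaveHTML.
final class StateDeduplicator {
    // MD5 hashes of every DOM that has already been written out.
    private final Set<String> seenHashes = new HashSet<>();
    // How many distinct DOM states were stored under each base file name.
    private final Map<String, Integer> statesPerFile = new HashMap<>();

    /** Returns the MD5 digest of the DOM string as lowercase hex. */
    static String md5Hex(String dom) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(dom.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to support MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /**
     * Returns the file name to write this state to, or null if an
     * identical DOM has already been stored.
     */
    String fileNameFor(String baseName, String dom) {
        if (!seenHashes.add(md5Hex(dom))) {
            return null; // same hash seen before: skip this state
        }
        int count = statesPerFile.merge(baseName, 1, Integer::sum);
        // Assumed counter format: the first state keeps the plain name,
        // later states on the same URL get "_2", "_3", ... appended.
        return count == 1 ? baseName : baseName + "_" + count;
    }
}
```

With this sketch, calling fileNameFor("index.html", dom) twice with the same DOM returns "index.html" and then null, while a different DOM for the same base name returns "index.html_2".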

It uses my URL2File class to generate the file and directory names. A '.html' suffix is added if the URL doesn't end with '.html' or '.htm'. URLs ending with '/' get an 'index.html' suffix. Special characters are replaced with '_'.
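The naming rules above can be sketched roughly as follows. This is not the real URL2File class: the name UrlToPathSketch, the choice to use the host as the top-level directory, the handling of query strings, and the exact set of characters treated as "special" are all assumptions for the sake of the example.

```java
import java.net.URI;

// Hypothetical stand-in for illustration; the real URL2File may differ.
final class UrlToPathSketch {
    /** Maps a URL to a relative file path following the rules described above. */
    static String toRelativePath(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost() == null ? "" : uri.getHost();

        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/"; // "http://example.com" is treated like "http://example.com/"
        }
        // Assumption: the query string is kept as part of the file name.
        if (uri.getRawQuery() != null) {
            path += "?" + uri.getRawQuery();
        }

        if (path.endsWith("/")) {
            path += "index.html";
        } else if (!path.endsWith(".html") && !path.endsWith(".htm")) {
            path += ".html";
        }

        // Assumption: anything other than letters, digits, '.', '-', '_'
        // and the '/' directory separator counts as a special character.
        // Note: this sketch does not guard against ".." path segments.
        String safe = path.replaceAll("[^A-Za-z0-9._/-]", "_");
        return host + safe;
    }
}
```

For example, toRelativePath("http://example.com/docs/") gives "example.com/docs/index.html", and toRelativePath("http://example.com/page?id=1") gives "example.com/page_id_1.html". Resolving the result against the Crawljax output directory would then give the final location on disk.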

You can find it on GitHub.