I've just created an HTML output for Crawljax, check it out! :-)
It's an OnNewStatePlugin, so it's called whenever the DOM changes in the browser, which lets you save every state to a file. The plugin ensures that each state is stored only once; equality testing is based on MD5 hashes. Filenames are generated from URLs, and SaveHTML recreates the directory structure found in the URL. Directories and files are created within the Crawljax output directory, so you should specify that in the Crawljax configuration. With this, you can store a mirror of a site, but links are not rewritten, so they won't work in all cases. If Crawljax encounters different DOM states at the same URL, SaveHTML appends a counter to the end of the filename.
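The dedup idea can be sketched in plain Java. This is only a minimal illustration of hashing the DOM and remembering the hashes already seen; the class and method names here are made up, not the plugin's real ones:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

// Illustration of the dedup idea: remember the MD5 hash of every DOM
// seen so far, and only treat a state as new when its hash is unseen.
public class StateDeduper {
    private final Set<String> seenHashes = new HashSet<>();

    // Hex-encoded MD5 of the DOM string.
    public static String md5Hex(String dom) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(dom.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /** Returns true the first time a given DOM is seen, false afterwards. */
    public boolean isNewState(String dom) {
        return seenHashes.add(md5Hex(dom));
    }
}
```

A plugin would call something like `isNewState(dom)` on each new-state event and write the DOM to file only when it returns true.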
It uses my URL2File class to generate the file and directory names. A '.html' suffix is added if the URL doesn't end with '.html' or '.htm'. URLs ending in '/' get an 'index.html' suffix. Special characters are replaced with '_'.
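The naming rules above can be sketched like this. This is my own illustration of the described mapping, not the actual URL2File code:

```java
// Sketch of the URL-to-filename rules: trailing '/' becomes index.html,
// missing .html/.htm suffixes are added, unsafe characters become '_'.
public class UrlToFileSketch {
    public static String toRelativePath(String url) {
        // strip the scheme ("http://", "https://", ...)
        String path = url.replaceFirst("^[a-zA-Z]+://", "");
        if (path.endsWith("/")) {
            // a trailing '/' means a directory: store it as index.html
            path = path + "index.html";
        } else if (!path.endsWith(".html") && !path.endsWith(".htm")) {
            // otherwise make sure the file has an .html suffix
            path = path + ".html";
        }
        // replace characters that are unsafe in filenames with '_'
        return path.replaceAll("[^a-zA-Z0-9./_-]", "_");
    }
}
```

For example, `http://example.com/docs/` maps to `example.com/docs/index.html`, and a query string like `?id=1` ends up with its special characters replaced by underscores.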
You can find it on GitHub.
November 19, 2013
February 22, 2013
January 30, 2013
What's new since 1.0
- rebuilt with the latest JDK (7u11)
- GUI bugs caused by JavaFX changes fixed
- Google Scholar crawl patterns fixed
- Springerlink crawl patterns fixed
- publication databases added:
January 25, 2013
This is the first project I published on the world wide web. It was my thesis project at university, and I created a SourceForge account for the program in order to use SVN. A year later, in the first weeks of 2013, I received a letter from Softpedia informing me that the team had included my program in the Softpedia database.
This is a Java tool that can search multiple publication databases (such as Google Scholar, CiteSeerX, ACM, SpringerLink). You type an author's name and PubSearch grabs the basic information of her/his publications. It can transitively crawl the "cited-by" lists, so a researcher can use the tool to calculate her/his impact factor. It uses a proxy list to reach those sites, to avoid being banned because of the heavy network traffic. The program uses definition files to crawl the databases; you can edit these with any simple text editor or add your own definition. You can also export publication data in citation formats.
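The transitive "cited-by" crawl is essentially a breadth-first walk over the citation graph. Here's a rough sketch of that idea; the actual per-database fetching is abstracted into a precomputed map, and all names are hypothetical, not PubSearch's real ones:

```java
import java.util.*;

// Sketch of the transitive cited-by crawl: starting from an author's
// publications, follow cited-by links breadth-first, visiting each
// publication at most once.
public class CitedByCrawl {
    public static Set<String> crawl(List<String> seeds,
                                    Map<String, List<String>> citedBy) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(seeds);
        while (!queue.isEmpty()) {
            String pub = queue.poll();
            if (!visited.add(pub)) continue;   // skip publications seen before
            // enqueue every publication that cites this one
            for (String citing : citedBy.getOrDefault(pub, List.of())) {
                queue.add(citing);
            }
        }
        return visited;
    }
}
```

In the real tool the `citedBy` lookup would be a network request driven by the database definition files, but the traversal logic is the same.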
- search in the following databases:
- ACM Digital Library
- Computer Science Bibliography Collection
- Google Scholar
- you can edit/add publication database definitions
- automatic proxy list downloading
- crawl cited by publications transitively (where possible)
- publication data stored in a MySQL database
- export results table in CSV or citation format
- export individual publication data in citation format
- you can edit/add citation format templates
- Hungarian and English GUI
on my TODO list :-)
So I decided to create an English version of my blog. I won't translate all of my posts, only the important ones, especially project plans and new release announcements. You'll find all of my published projects here, with download links, and I think manuals will also be available under this domain.
But what made me create this English blog? My thesis project PubSearch was included by Softpedia yesterday, and this boosted my motivation to keep developing the application. The download counter on the site tells me people are interested in my work, so I think I should inform them about new releases, bug fixes (there are some things to fix in that published version!), and my future plans for the project. And maybe I'll come up with new project ideas!
These days I'm quite busy, but I'll upload some posts ASAP. Stay tuned! :-)