February 22, 2013

PubSearch 2 - The plan

I spent months writing wishlists and brainstorming, and I think I have finally managed to plan the program's next version, which will be a complete rewrite.

There are two reasons for rebuilding it from scratch:

1. I want to build in a lot of new options, such as selecting which crawlers to use and managing a proxy list. These new features wouldn't fit in the current window, so I have to redesign and rebuild the whole GUI. I will also try out JavaFX Scene Builder and FXML.

2. I want PubSearch to be more universal. The websites of publication databases are constantly changing, and although the definition files of PubSearch 1.x can easily be updated, some features of those sites cannot be reached by the built-in uniform algorithm of PubSearch 1.x.

Chain of ideas leading to the final plan
  1. First I thought of extending the definition files with a script that would describe, step by step, how PubSearch should crawl the site. The problem with this idea is that I would have to invent a language for it and, of course, write a parser. In addition, PubSearch would lose control of the crawler threads.
  2. I refined the "script" into simplified Java code with its own parser, using the Reflection API for function and constructor calls. But this way only advanced users or developers could extend PubSearch.
  3. I said Java code... why not valid Java code? I would call the compiler and the class loader.
  4. But why should PubSearch compile anything? Crawlers will be compiled JAR files, and PubSearch will talk to them through an interface.
  5. I thought about the interface and wrote down what data PubSearch needs. I ended up back at the content of the definition files, so there would be a parser interface, and classes implementing it could be loaded.
  6. But this does not make PubSearch any more universal yet. We need a connection point at a higher level: the crawler algorithm should be pushed out to external libraries. So instead of parser interfaces we need crawler interfaces, with the implementing classes in the libraries. The disadvantage is that PubSearch then loses control of threads and downloads too. But here is why plan 6 is genius: the PubSearch 1.x algorithm can be turned into a PubSearch 2 crawler, so it will be backward compatible.
So the user-friendly way of extending PubSearch can survive.
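
To make steps 4 to 6 a bit more concrete, here is a minimal sketch of how crawlers packaged as JAR files could be discovered through an interface. It is only an illustration, not the actual PubSearch 2 code: the `Crawler` interface, its methods and the `CrawlerLoader` class are names I made up here, and it assumes each crawler JAR registers its implementation in a `META-INF/services` file so that the standard `java.util.ServiceLoader` can find it.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical contract that every crawler library would implement. */
interface Crawler {
    /** Name shown in the GUI when the user selects which crawlers to use. */
    String name();

    /** Runs a search and returns the titles found (simplified result type). */
    List<String> search(String query);
}

/** Hypothetical helper that finds Crawler implementations in external JAR files. */
final class CrawlerLoader {
    private CrawlerLoader() {}

    static List<Crawler> load(List<Path> jars) {
        URL[] urls = new URL[jars.size()];
        for (int i = 0; i < jars.size(); i++) {
            try {
                urls[i] = jars.get(i).toUri().toURL();
            } catch (MalformedURLException e) {
                throw new IllegalArgumentException("Bad JAR path: " + jars.get(i), e);
            }
        }
        // The parent loader shares the Crawler interface with the plug-in classes.
        ClassLoader loader = new URLClassLoader(urls, Crawler.class.getClassLoader());
        List<Crawler> found = new ArrayList<>();
        // ServiceLoader reads the META-INF/services entries inside the JARs.
        for (Crawler crawler : ServiceLoader.load(Crawler.class, loader)) {
            found.add(crawler);
        }
        return found;
    }
}
```

With no JARs given, `CrawlerLoader.load` simply returns an empty list. The old 1.x algorithm would then be just one more `Crawler` implementation among the others.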

Control of threads cannot be saved, but control of downloads can, by providing a downloader to the crawler libraries through an interface.
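
A rough sketch of that idea follows, again with made-up names (`Downloader`, `Crawler`, `DefinitionCrawler`) and a much simplified "definition" consisting of a search URL template and one regular expression. The crawler never opens a connection itself. It only asks the downloader it was given, so PubSearch could still decide about things like proxies behind that interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical: implemented by PubSearch and handed to every crawler. */
interface Downloader {
    String fetch(String url);
}

/** Hypothetical crawler contract that receives the downloader as a parameter. */
interface Crawler {
    List<String> search(String query, Downloader downloader);
}

/** Sketch of a 1.x-style, definition-driven algorithm wrapped as a crawler. */
final class DefinitionCrawler implements Crawler {
    private final String searchUrlTemplate; // assumed format: contains one %s for the query
    private final Pattern titlePattern;     // assumed: group 1 captures a result title

    DefinitionCrawler(String searchUrlTemplate, String titleRegex) {
        this.searchUrlTemplate = searchUrlTemplate;
        this.titlePattern = Pattern.compile(titleRegex);
    }

    @Override
    public List<String> search(String query, Downloader downloader) {
        // The query is assumed to be URL-safe already; real code would encode it.
        String page = downloader.fetch(String.format(searchUrlTemplate, query));
        List<String> titles = new ArrayList<>();
        Matcher matcher = titlePattern.matcher(page);
        while (matcher.find()) {
            titles.add(matcher.group(1));
        }
        return titles;
    }
}
```

For testing, the downloader can be replaced by a fake that returns a fixed page, for example `url -> "<h3>Alpha</h3><h3>Beta</h3>"` together with the regex `<h3>(.*?)</h3>`, so a crawler can be tried out without touching the network.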

And here's the final plan, "plan 6.2". Figure:

