pSearch – a peer to peer, distributed search engine
So I haven’t been posting very much for the last while, and this is mainly because I’ve been very busy.
I always have several projects on the go, and I don’t have enough time to devote to all of these things at once, so usually the least interesting project gets placed on the back burner.
That is what happened to this blog.
Now, I’ve spent a great deal of time on one project in particular, and have produced some solid design documents as well as a bunch of source code. So… without further ado:
This is my distributed, peer-to-peer search engine.
Attached to this post you’ll find a couple of architecture documents: a PDF with a visual diagram of how this engine is supposed to work, and another PDF with a long-winded, half-written description of why and how I expect this to run conceptually.
I’m not a writer; I’m mostly a technical person. However, I am actively updating and modifying this project, so expect updates as it goes.
The first document is the “pSearch – Document”
In this document I attempt to explain the strategy and reasoning behind this project and what I hope it will accomplish. The document is incomplete, but I encourage you to read it anyway.
The second document is the “pSearch – Drawing”
In this document I have detailed the major aspects of the distributed search. Hopefully it’s easy to follow; I don’t expect this diagram to change very much.
And I have a LOT of source code that I still have to organize – much of it will be posted here; some of it is too embarrassing to share.
So, without trudging through my documentation in too much detail (it’s posted above, feel free to dig in), a simplified “how does this work” seems appropriate.
Each peer will accept connections from the internet. Each search request is forwarded to other peers listed in its database.
While this happens, a second task searches the peer’s own internal database. On a private home machine, this internal crawler holds a small collection of sites and keywords gathered from several configurable data collection points (such as your browser cache or installed programs), which automatically includes a lot of data specific to you. A public internet site would index its own pages (this isn’t mandatory, but it is preferred).
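To make the internal database idea concrete, here’s a minimal sketch of a local keyword index. All the names (`LocalIndex`, the example URLs) are my own placeholders, not anything from the actual source code – a real crawler would populate this from the configurable collection points mentioned above.

```python
from collections import defaultdict


class LocalIndex:
    """Maps keywords to the local resources (URLs, files) that mention them."""

    def __init__(self):
        # keyword -> set of resource identifiers
        self.entries = defaultdict(set)

    def add(self, resource, keywords):
        for kw in keywords:
            self.entries[kw.lower()].add(resource)

    def search(self, terms):
        # Return every resource matching any of the query terms.
        results = set()
        for term in terms:
            results |= self.entries.get(term.lower(), set())
        return results


# Example: entries a home peer might collect from its browser cache.
idx = LocalIndex()
idx.add("https://example.org/engine-repair", ["car", "engine", "repair"])
idx.add("file:///home/me/notes.txt", ["engine", "search"])
```

A real implementation would persist this to disk and score matches, but the keyword-to-resource mapping is the core of it.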
After that, it’s a simple case of matching the keyword and publishing the results to the connected client.
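The two tasks described above – forwarding to peers while searching the internal database – could be sketched like this. The `StubPeer` class and the function names here are hypothetical stand-ins for real network connections, just to show the shape of the concurrency.

```python
import concurrent.futures


def handle_search(terms, search_local, peers):
    """Run the local lookup and the peer forwarding as concurrent tasks,
    then merge everything into one result set for the connected client."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        local_task = pool.submit(search_local, terms)
        peer_tasks = [pool.submit(peer.query, terms) for peer in peers]
        results = set(local_task.result())
        for task in peer_tasks:
            results |= set(task.result())
    return results


class StubPeer:
    """Stand-in for a remote peer reachable over the network."""

    def __init__(self, answers):
        self.answers = answers  # term -> list of result URLs

    def query(self, terms):
        return [url for term in terms for url in self.answers.get(term, [])]
```

In the real engine, results would stream back to the client as they arrive rather than being collected in one batch.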
Peers that respond quickly and with a lot of results are flagged as “experts” on that set of terms. This way, when you search for a similar set of terms again, the “expert” peers are consulted first.
As a result, common search terms will be answered by peers that have a lot of information on them. For example, a site that indexes movies (like IMDb) would respond with many results for movie titles and film information, but would have little to offer when a query asks something specific about cars.
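One way the “expert” flagging could work is a simple reputation table, scored so that many results and a fast response both count in a peer’s favor. This is my own illustrative sketch of the idea, not the project’s actual scoring scheme:

```python
from collections import defaultdict


class ExpertTable:
    """Tracks, per (peer, term) pair, how useful that peer has been."""

    def __init__(self):
        self.scores = defaultdict(float)

    def record(self, peer, terms, result_count, response_secs):
        # Many results and a fast response both raise the score.
        gain = result_count / (1.0 + response_secs)
        for term in terms:
            self.scores[(peer, term.lower())] += gain

    def rank(self, terms, peers):
        # Order peers so the highest-scoring "experts" are consulted first.
        def total(peer):
            return sum(self.scores[(peer, t.lower())] for t in terms)

        return sorted(peers, key=total, reverse=True)
```

With this in place, a query for movie-related terms would naturally route to the IMDb-style peer before the car-oriented one.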
Expect more as development continues. I encourage anyone to read and comment on my designs.