GNU mifluz - description

First of all, GNU mifluz is at the alpha stage.
The purpose of GNU mifluz is to provide a C++ library to build and query a full text inverted index. The index is dynamically updatable, scalable (up to 1 TB indexes), uses a controlled amount of memory, shares index files and memory cache among processes or threads, and compresses index files to 50% of the raw data. The structure of the index is configurable at runtime and allows inclusion of relevance ranking information. The query functions do not need to load all the occurrences of a searched term. They consume very few resources, and many searches can be run in parallel.
Implementing a library that manages an inverted index is an easy task when the number of words and documents is small. It becomes much harder when dealing with a large number of words and documents. GNU mifluz has been designed with the following upper limits in mind: 500 million documents, 50 billion words, 20 million document updates per day.
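To give a feel for these limits, the short sketch below turns them into two derived figures: the average number of words per document and the sustained update rate per second. It assumes the word limit means 50 billion word occurrences and that updates are spread evenly over the day. The constant and function names are hypothetical and are not part of mifluz.

```cpp
#include <cstdint>

// Design limits quoted in the text (the names are illustrative only).
constexpr std::uint64_t kDocuments     = 500000000ULL;    // 500 million documents
constexpr std::uint64_t kWords         = 50000000000ULL;  // 50 billion words (assumed: occurrences)
constexpr std::uint64_t kUpdatesPerDay = 20000000ULL;     // 20 million document updates per day
constexpr std::uint64_t kSecondsPerDay = 24ULL * 60ULL * 60ULL;

// Average number of words per document implied by the limits.
constexpr std::uint64_t words_per_document() {
    return kWords / kDocuments;
}

// Sustained update rate, rounded down, assuming an even spread over the day.
constexpr std::uint64_t updates_per_second() {
    return kUpdatesPerDay / kSecondsPerDay;
}
```

Under these assumptions the limits correspond to about 100 words per document and roughly 231 document updates per second, sustained around the clock.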
GNU mifluz has two main characteristics: it is very simple (one might say stupid :-) and its index uses 50% of the size of the indexed text. It is simple because it provides only a few basic functions. It does not contain document parsers (HTML, PDF, etc.). It does not contain a full text query parser. It does not provide result display functions or other user-friendly features. It only provides functions to store word occurrences and retrieve them. Using 50% of the size of the indexed text is rather atypical: most well-known full text indexing systems use only 30%. The advantage GNU mifluz has over most full text indexing systems is that it is fully dynamic (update, delete, insert), uses only a controlled amount of memory while resolving a query, has higher upper limits and has a simple storage scheme. Consuming more disk space allows all this.
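To make "store word occurrences and retrieve them" concrete, here is a minimal in-memory sketch of an inverted index. It is not the mifluz API: the names TinyIndex, Occurrence, insert, remove_document and walk are hypothetical, and a std::map stands in for the on-disk, compressed storage a real index would use. The walk function takes a callback that can stop early, which illustrates how a query can avoid visiting every occurrence of a term.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration only; this is not the mifluz API.
struct Occurrence {
    std::uint32_t document;  // document identifier
    std::uint32_t position;  // word offset inside the document
};

class TinyIndex {
public:
    // Store one occurrence of `word`.
    void insert(const std::string& word, std::uint32_t document,
                std::uint32_t position) {
        postings_[word].push_back(Occurrence{document, position});
    }

    // Delete every occurrence that belongs to `document`.
    // Returns the number of occurrences removed.
    std::size_t remove_document(std::uint32_t document) {
        std::size_t removed = 0;
        for (auto it = postings_.begin(); it != postings_.end();) {
            std::vector<Occurrence>& list = it->second;
            const std::size_t before = list.size();
            list.erase(std::remove_if(list.begin(), list.end(),
                                      [document](const Occurrence& occ) {
                                          return occ.document == document;
                                      }),
                       list.end());
            removed += before - list.size();
            // Drop words that no longer occur anywhere.
            if (list.empty()) {
                it = postings_.erase(it);
            } else {
                ++it;
            }
        }
        return removed;
    }

    // Visit the occurrences of `word` in insertion order and stop as soon
    // as `visit` returns false. Returns the number of occurrences visited.
    std::size_t walk(const std::string& word,
                     const std::function<bool(const Occurrence&)>& visit) const {
        const auto it = postings_.find(word);
        if (it == postings_.end()) {
            return 0;
        }
        std::size_t visited = 0;
        for (const Occurrence& occ : it->second) {
            ++visited;
            if (!visit(occ)) {
                break;
            }
        }
        return visited;
    }

private:
    // Word -> list of occurrences (the "postings" of the word).
    std::map<std::string, std::vector<Occurrence>> postings_;
};
```

A caller would insert one occurrence per word while scanning a document, then call walk with a callback that returns false once it has collected enough results. Parsing documents, parsing queries and displaying results are all left to the caller, which mirrors the division of labor described above.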