discodb is comprised of a low-level data structure implemented in C, and a high-level discodb.DiscoDB class which exposes a dict-like interface for using the low-level data structure from Python. In contrast to Python’s builtin dict object, DiscoDB can handle tens of millions of key-value pairs without consuming gigabytes of memory.
In addition to basic key-value mappings, DiscoDB supports evaluation of Boolean queries expressed in Conjunctive Normal Form (see discodb.Q below). All queries are evaluated lazily using iterators, so you can handle gigabytes of data in Python with ease.
DiscoDBs are persistent, which means that once created in memory, they can be easily compressed, serialized and written to a file. The benefit of this is that after they have been persisted, instantiating them from disk and key lookups are lightning-fast operations, thanks to memory mapping.
DiscoDBs are also immutable, which means that once they are created, they cannot be modified. A benefit of immutability is that the full key-space is known when DiscoDB is built, which makes it possible to use perfect hashing for fast O(1) key lookups. Specifically, DiscoDB relies on the CMPH library for building minimal perfect hash functions.
DiscoDB compresses values first by replacing duplicate entries with references to a singe unique entry and then by compressing unique entries with a fast compression algorithm based on Huffman Coding. The main benefit of this approach is that each value can be random accessed efficiently while achieving reasonable compression ratios, thanks to statistics collected from all the data. This means that you can have lots of redundancy, e.g. common prefixes, in your values without having to worry about space consumption.
The format of a DiscoDB file essentially looks like this:
The benefits of these properties are realized when you need repeated, random-access to data, especially when the dataset is too large to fit in memory at once. DiscoDBs are a key component in Disco’s builtin distributed indexing system, Discodex
DiscoDB does not depend on Disco in any way, although it is a core component in Discodex. You can use it without Disco as a general-purpose scalable, immutable datastructure for Python. To install only DiscoDB without rest of Disco, clone Disco, and run:
make install-discodb
or if you just want to build it locally:
cd contrib/discodb
python setup.py build
DiscoDB requires CMPH library v0.9 or newer (libcmph-dev in Debian).
Here is a simple example that builds a simple discodb and queries it:
from discodb import DiscoDB, Q
data = {'mammals': ['cow', 'dog', 'cat', 'whale'],
'pets': ['dog', 'cat', 'goldfish'],
'aquatic': ['goldfish', 'whale']}
db = DiscoDB(data) # create an immutable discodb object
print list(db.keys()) # => mammals, aquatic, pets
print list(db['pets']) # => dog, cat, goldfish
print list(db.query(Q.parse('mammals & aquatic'))) # => whale
print list(db.query(Q.parse('pets & ~aquatic'))) # => dog, cat
print list(db.query(Q.parse('pets | aquatic'))) # => dog, cat, whale, goldfish
db.dump(file('animals.db', 'w')) # dump discodb to a file
For more detailed information on querying, see discodb.query.
DiscoDB stores only one copy of each unique value. Thus using the same value multiple times is very cheap. Applications should utilize this feature to maximize space and time efficiency.
DiscoDB sorts key-value pairs internally, so no pre-sorting is needed. You can request DiscoDB to remove duplicate key-value pairs automatically by setting unique_items=True in the DiscoDB constructor.
DiscoDB does not compress values which are smaller than four bytes. DiscoDB may also decide to disable compression if it is not likely to be beneficial. This doesn’t affect in DiscoDB behavior in any way. You can disable compression explicitly, e.g. if your values are already compressed, by setting disable_compression=True in the DiscoDB constructor.
Value lists can be empty, in which case DiscoDB becomes an efficient set data structure.
Keys and values can be arbitrary byte sequences or strings (see size limitations below). For instance, you can save a serialized DiscoDB, as produced by discodb.DiscoDB.dumps(), in another DiscoDB to build a tree of DiscoDBs!
If any of the keys contains duplicate values, the whole DiscoDB is tagged to contain multisets. In the multiset mode discodb.DiscoDB.query() is not available, as it is not clear currently how duplicate values should be handled in queries. However, looking up a single key works as usual, so applications can freely utilize the multiset feature (which is very efficient, as noted above) if complex queries are not needed.
In many cases it makes sense to have several (distributed) small or medium-size DiscoDBs with millions of keys at most, as created by Discodex, instead of having a single DiscoDB with hundreds of millions of keys and values, although it is technically possible. Run your own benchmarks to find the optimal size for your application.
DiscoDB(iter[, flags]) -> new DiscoDB from k, v[s] in iter.
self[key] if key in self, else default.
an inquiry over the items of self.
an inquiry over the keys of self.
an inquiry over the (q, values) of self in the expansion of the query.
See discodb.query for more information.
first element of self[key] or else default.
an inquiry over the values of self whose keys satisfy the query.
The query can be either a Q object, or a string. If a string, it is transformed into a Q via Q.parse().
an inquiry over the unique values of self.
an inquiry over the values of self.
Contains a conjunction of clauses.
Parse a string into a Q object.
The parsed string is subject to the following conditions:
- * characters will be replaced with +
- literal terms cannot contain the characters &|~+()
- leading and trailing spaces will be stripped from terms
- the characters &|~ signify the corresponding logical operations
- parentheses make operator precedence explicit
- + characters introduce a dereferencing operation
The string will be parsed into a logical combination of the terms, which will be stored internally in conjunctive normal form.
Dereferencing is performed when the Q object is actually used to query a discodb.DiscoDB.
In queries (discodb.DiscoDB.query()), the dereferencing operation replaces an expression with the union of the values which result from querying with the expression.
Thus in the case of:
discodb = DiscoDB({'A': ['B', 'C'], 'B': 'D'})
The following queries would be equivalent:
discodb.query(Q.parse('+A'))
discodb.query(Q.parse('*A'))
discodb.query(Q.parse('B | C'))
and would result in an iterator containing only the value 'D'.
In metaqueries (discodb.DiscoDB.metaquery()), the dereferencing operation replaces an expression with each value which results from querying with the expression, and yields a (query, values) pair for each item in the expansion.
Thus, for the discodb above:
discodb.metaquery('A')
would return an iterator roughly equivalent to [(Q.parse(‘A’), [‘B’, ‘C’])], while:
discodb.metaquery('*A')
would return an iterator roughly equivalent to [(Q.parse(‘B’), [‘D’]), (Q.parse(‘C’), [])].
Consider any object as a proposition and return a well-behaved Q.
Like itertools.groupby, but iterates over (k, vs) instead of (k, k-vs).
The result can be used to construct a discodb.DiscoDB, iff kviter is in sorted order.