Discodex uses mapreduce jobs to build and query indices...
ichunk parser == func.discodb_reader (iteritems)

parser:    data -> records               \
                                          | kvgenerator   \
demuxer:   record -> k, v ...            /                 |
                                                           | indexer
                                                           |
balancer:  (k, ) ... -> (p, (k, )) ...   \                /
                                          | ichunkbuilder
ichunker:  (p, (k, v)) ... -> ichunks    /
Parsers are essentially the map_reader function for the discodex.mapreduce.Indexer.
A parser takes a chunk of a dataset and produces zero or more records (see discodex.mapreduce.demuxers).
For example, one parser maps raw URLs to (key, value) pairs:
raw://a:b,c:d,e:f yields [(a, b), (c, d), (e, f)]
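The raw URL example above can be sketched as a small generator. This is only an illustration of the mapping, not Discodex's own parser: the name `parse_raw_url` and the way the `raw://` prefix and empty items are handled are assumptions.

```python
def parse_raw_url(url):
    """Yield (key, value) pairs from a URL such as raw://a:b,c:d,e:f.

    Hypothetical helper for illustration; not part of Discodex.
    """
    prefix = "raw://"
    if not url.startswith(prefix):
        raise ValueError("expected a raw:// URL, got %r" % (url,))
    body = url[len(prefix):]
    for item in body.split(","):
        if not item:
            continue  # assumption: silently skip empty items
        # Split on the first ':' only, so values may themselves contain ':'.
        key, _, value = item.partition(":")
        yield key, value


pairs = list(parse_raw_url("raw://a:b,c:d,e:f"))
```

Writing the parser as a generator matches the "zero or more records" contract: an empty URL body simply yields nothing.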
Demuxers are essentially the map function for the discodex.mapreduce.Indexer.
A demuxer takes a record (see discodex.mapreduce.parsers) and produces zero or more (key, value) pairs to be stored in the index.
One demuxer produces (‘fieldname:value’, record) pairs for a record.
It can be used to produce an inverted index.
Another demuxer produces (‘fieldname:value’, id) pairs for a record.
It can be used to produce an inverted index when records contain a field named ‘id’.
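The two inverted-index demuxers described above might look roughly like the following. This is a sketch under stated assumptions: a record is modelled as a plain dict of field name to value, and the function names `demux_inverted` and `demux_inverted_id` are hypothetical, not Discodex identifiers.

```python
def demux_inverted(record):
    """Yield ('fieldname:value', record) pairs for a dict-like record."""
    for field, value in record.items():
        yield "%s:%s" % (field, value), record


def demux_inverted_id(record):
    """Yield ('fieldname:value', id) pairs; requires an 'id' field."""
    record_id = record["id"]  # raises KeyError if the record has no 'id'
    for field, value in record.items():
        yield "%s:%s" % (field, value), record_id


record = {"id": "42", "color": "red"}
by_record = list(demux_inverted(record))
by_id = list(demux_inverted_id(record))
```

Looking up the key `color:red` in an index built from `by_id` would return the ids of matching records rather than the records themselves, which keeps the index smaller.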
Another demuxer unpacks the kvsdict to produce all (‘k:v’, kvsdict) pairs.
If a key has no values, a (‘k’, kvsdict) pair is produced instead.
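A minimal sketch of the kvsdict unpacking rule follows. It assumes a kvsdict is a dict mapping each key to a list of values (possibly empty); that shape and the name `demux_kvsdict` are assumptions for illustration.

```python
def demux_kvsdict(kvsdict):
    """Yield ('k:v', kvsdict) for every value, or ('k', kvsdict) if none."""
    for key, values in kvsdict.items():
        if values:
            for value in values:
                yield "%s:%s" % (key, value), kvsdict
        else:
            # A key without values still gets one entry, under the bare key.
            yield key, kvsdict


kvs = {"tag": ["x", "y"], "flag": []}
keys = [k for k, _ in demux_kvsdict(kvs)]
```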
Another demuxer produces (fieldname, value) pairs for a record.
It can be used to produce an index of all the possible values of each named field.
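To show how (fieldname, value) pairs lead to an index of the possible values of each field, here is a small in-memory sketch. The dict-based record, the names `demux_fields` and `build_value_index`, and the use of a set per field are all assumptions; a real index would be built by the Indexer job rather than in a local loop.

```python
from collections import defaultdict


def demux_fields(record):
    """Yield (fieldname, value) pairs for a dict-like record."""
    for field, value in record.items():
        yield field, value


def build_value_index(records):
    """Collect the distinct values seen for each field name."""
    index = defaultdict(set)
    for record in records:
        for field, value in demux_fields(record):
            index[field].add(value)
    return index


records = [
    {"color": "red", "size": "L"},
    {"color": "blue", "size": "L"},
]
index = build_value_index(records)
```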
Balancers are essentially the partition function for the discodex.mapreduce.Indexer.
The balancer is called for every (key, value) pair (see discodex.mapreduce.demuxers) and returns an integer indicating which partition it belongs in.
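As an illustration of the contract described above, a hash-based balancer could be sketched as follows. The signature (a (key, value) pair plus a partition count) and the name `hash_balancer` are assumptions, not the Discodex interface. The sketch uses `zlib.crc32` rather than the built-in `hash`, because string hashes in Python 3 vary between processes by default, and a balancer has to send the same key to the same partition everywhere.

```python
import zlib


def hash_balancer(kv, nr_partitions):
    """Return the partition number, in [0, nr_partitions), for a pair."""
    key, _value = kv  # only the key decides the partition
    checksum = zlib.crc32(str(key).encode("utf-8"))
    return checksum % nr_partitions


p1 = hash_balancer(("color:red", "42"), 4)
p2 = hash_balancer(("color:red", "99"), 4)
```

Because only the key is hashed, all values for a given key land in the same partition.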
Metakeyers are essentially the map function for the discodex.mapreduce.MetaIndexer.
The metakeyer is called for every key in the index and produces zero or more (metakey, value) pairs.
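One way to picture a metakeyer is a function that emits every prefix of an index key as a metakey, so that the metaindex can answer prefix queries. The sketch below is hypothetical: the name `prefix_metakeyer`, the one-argument signature, and the choice of the original key as the value are assumptions rather than Discodex's definition.

```python
def prefix_metakeyer(key):
    """Yield (prefix, key) pairs for every non-empty prefix of key."""
    for end in range(1, len(key) + 1):
        yield key[:end], key


metapairs = list(prefix_metakeyer("abc"))
```

An empty key yields no pairs, which fits the "zero or more" wording above.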