Discodex uses mapreduce jobs to build and query indices...
                ichunk parser == func.discodb_reader (iteritems)
parser:         data -> records                 \
                                                | kvgenerator    \
demuxer:        record -> k, v ...              /                |
                                                                 | indexer
                                                                 |
balancer:       (k, ) ... -> (p, (k, )) ...     \                /
                                                | ichunkbuilder
ichunker:       (p, (k, v)) ... -> ichunks      /
A discodex mapreduce job used to build an index from a dataset.
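The stages in the diagram above can be pictured as ordinary Python generators chained together. The following is a minimal sketch only: it does not use Disco or discodex, every function name is hypothetical, plain dicts stand in for ichunks, and the checksum-based balancer is an arbitrary choice made so that the result is deterministic.

```python
from collections import defaultdict


def parser(data):
    """data -> records: one record (a list of fields) per non-empty line."""
    for line in data.splitlines():
        if line.strip():
            yield line.split()


def demuxer(record):
    """record -> (k, v) ...: key each field by its position, value is the record."""
    for position, field in enumerate(record):
        yield "%d:%s" % (position, field), tuple(record)


def balancer(key, nr_partitions):
    """key -> p: toy deterministic choice (byte sum of the key modulo partitions)."""
    return sum(key.encode("utf-8")) % nr_partitions


def ichunker(partitioned_pairs):
    """(p, (k, v)) ... -> ichunks: here an 'ichunk' is just a dict of key -> values."""
    ichunks = defaultdict(lambda: defaultdict(list))
    for partition, (key, value) in partitioned_pairs:
        ichunks[partition][key].append(value)
    return {p: dict(kvs) for p, kvs in ichunks.items()}


def build_index(data, nr_partitions=3):
    """Chain the four stages in the order shown in the diagram."""
    pairs = (kv for record in parser(data) for kv in demuxer(record))
    partitioned = ((balancer(k, nr_partitions), (k, v)) for k, v in pairs)
    return ichunker(partitioned)
```

Calling `build_index("a b\nc d\n")` groups the four generated keys into three partition dicts. In a real job each partition would presumably be written out as a separate ichunk rather than held in memory.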
Convenient containers for holding bags of [named] attributes.
Parsers are essentially the map_reader function for the discodex.mapreduce.Indexer.
A parser takes a chunk of a dataset and produces zero or more records (see discodex.mapreduce.demuxers).
Splits lines of input by commas and creates discodex.mapreduce.Record objects.
Splits lines of input by whitespace and uses the fields as keys for a discodb.DiscoDB object.
Like recordparse() except fields are named by the column they appear in.
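To make the difference between unnamed and column-named fields concrete, here is a small sketch. `SimpleRecord`, `comma_parse`, and `enumerated_parse` are hypothetical names, and `SimpleRecord` is only a stand-in; the real discodex.mapreduce.Record may store its fields differently.

```python
class SimpleRecord:
    """Hypothetical stand-in for a record: parallel lists of fields and names."""

    def __init__(self, fields, fieldnames=None):
        self.fields = list(fields)
        if fieldnames is None:
            # Unnamed fields get an empty name in this sketch.
            fieldnames = [""] * len(self.fields)
        self.fieldnames = list(fieldnames)


def comma_parse(lines):
    """Split each non-empty line on commas and yield one record per line."""
    for line in lines:
        line = line.strip()
        if line:
            yield SimpleRecord(field.strip() for field in line.split(","))


def enumerated_parse(lines):
    """Split each line on whitespace and name each field by its column number."""
    for line in lines:
        fields = line.split()
        if fields:
            yield SimpleRecord(fields, [str(n) for n in range(len(fields))])
```

Both functions accept any iterable of strings, so a list of lines or an open text file can be passed in.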
Reads (key, value) pairs directly from netstr input.
Returns the iterable.
Maps raw URLs to (key, value) pairs.
e.g. raw://a:b,c:d,e:f yields [(a, b), (c, d), (e, f)]
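The mapping in the example above can be sketched in a few lines of Python. `parse_raw_url` is a hypothetical helper written for illustration, not the discodex function, and splitting each item on its first colon is an assumption about how keys and values are delimited.

```python
def parse_raw_url(url):
    """Turn 'raw://k1:v1,k2:v2' into a list of (key, value) string pairs."""
    prefix = "raw://"
    if not url.startswith(prefix):
        raise ValueError("expected a raw:// URL, got %r" % url)
    pairs = []
    for item in url[len(prefix):].split(","):
        if not item:
            continue  # tolerate an empty body or a trailing comma
        # Assumption: the first ':' separates the key from the value.
        key, _, value = item.partition(":")
        pairs.append((key, value))
    return pairs
```

With this sketch, `parse_raw_url("raw://a:b,c:d,e:f")` returns `[("a", "b"), ("c", "d"), ("e", "f")]`.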
Splits lines of input by whitespace and creates discodex.mapreduce.Record objects.
Splits lines of input by whitespace and uses the words as keys for the value fname.
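A word-to-filename parser of this kind might look like the sketch below. The function name and its explicit `fname` argument are assumptions for illustration; how the real parser learns the file name is not stated here.

```python
def words_to_filename(lines, fname):
    """Yield a (word, fname) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, fname
```

Feeding the resulting pairs into an index gives a simple lookup from each word to the files that contain it.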
Demuxers are essentially the map function for the discodex.mapreduce.Indexer.
A demuxer takes a record (see discodex.mapreduce.parsers) and produces zero or more (key, value) pairs to be stored in the index.
Produce (‘fieldname:value’, record) pairs for a record.
Can be used to produce an inverted index.
Produce (‘fieldname:value’, id) pairs for a record.
Can be used to produce an inverted index when records contain a field named ‘id’.
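The two inverted-index demuxers described above can be approximated as follows. This is a sketch under assumptions: records are modelled as plain dicts of field name to value, all function names are hypothetical, and the id variant here also emits a pair for the 'id' field itself, which the real demuxer may or may not do.

```python
from collections import defaultdict


def field_value_to_record(record):
    """Yield ('fieldname:value', record) for every field of the record."""
    for name, value in record.items():
        yield "%s:%s" % (name, value), record


def field_value_to_id(record):
    """Yield ('fieldname:value', id) pairs; requires a field named 'id'."""
    record_id = record["id"]
    for name, value in record.items():
        yield "%s:%s" % (name, value), record_id


def invert(records, demuxer):
    """Collect the demuxer output into a dict of key -> list of values."""
    index = defaultdict(list)
    for record in records:
        for key, value in demuxer(record):
            index[key].append(value)
    return dict(index)
```

Looking up a key such as `"color:red"` in the dict returned by `invert(records, field_value_to_id)` then gives the ids of all records whose `color` field is `red`.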
Unpacks the kvsdict to produce all (‘k:v’, kvsdict) pairs.
If a key has no values, a (‘k’, kvsdict) pair is produced instead.
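The unpacking rule, including the special case for keys without values, can be written out as a short sketch. `unpack_kvs` is a hypothetical name, and the kvsdict is assumed to be a plain dict mapping each key to a list of values.

```python
def unpack_kvs(kvsdict):
    """Yield ('k:v', kvsdict) per value, or ('k', kvsdict) for a key with none."""
    for key, values in kvsdict.items():
        if values:
            for value in values:
                yield "%s:%s" % (key, value), kvsdict
        else:
            # A key with no values is still indexed, under the bare key.
            yield key, kvsdict
```

For a kvsdict such as `{"tag": ["x", "y"], "flag": []}` this yields the keys `tag:x`, `tag:y`, and `flag`, each paired with the whole dict.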
Produce (fieldname, value) pairs for a record.
Can be used to produce an index of all the possible values of each named field.
Yields the kvrecord itself.
Balancers are essentially the partition function for the discodex.mapreduce.Indexer.
The balancer is called for every (key, value) pair (see discodex.mapreduce.demuxers) and returns an integer indicating which partition it belongs in.
Randomly chooses a partition.
Cycles through the partitions in a round robin fashion.
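Both balancing strategies are easy to sketch in plain Python. The `(key, nr_partitions)` signature and the function names below are assumptions made for illustration; the actual discodex balancers may take additional arguments.

```python
import random
from itertools import count


def random_balancer(key, nr_partitions):
    """Ignore the key and pick a partition uniformly at random."""
    return random.randrange(nr_partitions)


def make_round_robin_balancer():
    """Return a balancer that hands out partitions 0, 1, 2, ... in a cycle."""
    calls = count()

    def balancer(key, nr_partitions):
        # Each call advances the shared counter by one.
        return next(calls) % nr_partitions

    return balancer
```

Neither strategy looks at the key, so pairs with the same key can land in different partitions. That spreads load evenly, but it means a query has to consult every partition.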