discodex.mapreduce – mapreduce jobs and utilities

Discodex uses mapreduce jobs to build and query indices...

ichunk parser == func.discodb_reader (iteritems)


parser:  data -> records       \
                                | kvgenerator            \
demuxer: record -> k, v ...    /                          |
                                                          | indexer
                                                          |
balancer: (k, ) ... -> (p, (k, )) ...   \                /
                                         | ichunkbuilder
ichunker: (p, (k, v)) ... -> ichunks    /

class discodex.mapreduce.Indexer(master, name, dataset)

A discodex mapreduce job used to build an index from a dataset.
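The data flow in the diagram above can be illustrated with a small standalone sketch. It does not use Disco or discodex: every function name here (sketch_parser, sketch_demuxer, sketch_balancer, build_chunks) is hypothetical, and the choices of "first field is the key" and of a byte-sum balancer are assumptions made only to keep the example deterministic.

```python
from collections import defaultdict


def sketch_parser(lines):
    """parser: data -> records (here, lists of whitespace-separated fields)."""
    for line in lines:
        yield line.split()


def sketch_demuxer(record):
    """demuxer: record -> (k, v) pairs; the first field is the key (an assumption)."""
    if not record:
        return
    key = record[0]
    for value in record[1:]:
        yield key, value


def sketch_balancer(key, partitions):
    """balancer: key -> partition number (a deterministic stand-in)."""
    return sum(key.encode("utf-8")) % partitions


def build_chunks(lines, partitions=2):
    """ichunker: group (p, (k, v)) pairs into one key -> values mapping per partition."""
    chunks = defaultdict(lambda: defaultdict(list))
    for record in sketch_parser(lines):
        for key, value in sketch_demuxer(record):
            chunks[sketch_balancer(key, partitions)][key].append(value)
    return {p: dict(kvs) for p, kvs in chunks.items()}


# Blank lines produce empty records, which the demuxer skips.
chunks = build_chunks(["a 1 2", "b 3", "", "a 4"])
```

Each stage only consumes the output of the previous one, which is why parsers, demuxers and balancers can be swapped independently.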

class discodex.mapreduce.Record(*fields, **namedfields)

Convenient containers for holding bags of [named] attributes.
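To make the (*fields, **namedfields) signature concrete, here is a minimal stand-in container. SketchRecord is a hypothetical class written for illustration; it is not the discodex implementation, and the attribute-style access to named fields is an assumption about how such a container might be used.

```python
class SketchRecord:
    """Hypothetical stand-in for a Record-like bag of [named] attributes."""

    def __init__(self, *fields, **namedfields):
        self.fields = fields            # positional, unnamed fields
        self.namedfields = namedfields  # fields addressable by name

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so it is used
        # here to expose named fields as attributes.
        try:
            return self.__dict__["namedfields"][name]
        except KeyError:
            raise AttributeError(name) from None


rec = SketchRecord("10.0.0.1", "GET", status="200", path="/index")
```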

discodex.mapreduce.parsers – builtin parsers

Parsers are essentially the map_reader function for the discodex.mapreduce.Indexer.

A parser takes a chunk of a dataset and produces zero or more records (see discodex.mapreduce.demuxers).

discodex.mapreduce.parsers.csvrecordparse(iterable, size, fname, params)

Splits lines of input by commas and creates discodex.mapreduce.Record objects.

discodex.mapreduce.parsers.discodbparse(iterable, size, fname, params)

Splits lines of input by whitespace and uses the fields as keys for a discodb.DiscoDB object.

discodex.mapreduce.parsers.enumfieldparse(iterable, size, fname, params)

Like recordparse() except fields are named by the column they appear in.
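A rough sketch of column-based naming follows. It keeps the (iterable, size, fname, params) signature but ignores size, fname and params, and it yields plain dicts rather than discodex.mapreduce.Record objects. Using the column index as a string for the field name is an assumption, not something the description above confirms.

```python
def sketch_enumfieldparse(iterable, size, fname, params):
    """Yield one dict per line, keyed by the (stringified) column index."""
    for line in iterable:
        yield {str(column): field for column, field in enumerate(line.split())}


records = list(
    sketch_enumfieldparse(["alice 30", "bob 25 nyc"], None, "people.txt", None)
)
```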

discodex.mapreduce.parsers.netstrparse(fd, size, fname, params)

Reads (key, value) pairs directly from netstr input.

discodex.mapreduce.parsers.noparse(iterable, size, fname, params)

Returns the iterable.

discodex.mapreduce.parsers.rawparse(iterable, size, fname, params)

Maps raw URLs to (key, value) pairs.

e.g. raw://a:b,c:d,e:f yields [(a, b), (c, d), (e, f)]
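The mapping in the example above can be reproduced with a few lines of standalone Python. sketch_rawparse is a hypothetical helper that takes the URL string directly instead of the (iterable, size, fname, params) arguments; how the real parser treats malformed items is not documented here, so the handling of empty items and missing colons is an assumption.

```python
def sketch_rawparse(url):
    """Turn 'raw://k1:v1,k2:v2' into a list of (key, value) pairs."""
    prefix = "raw://"
    if not url.startswith(prefix):
        raise ValueError("not a raw URL: %r" % url)
    pairs = []
    for item in url[len(prefix):].split(","):
        if not item:
            continue  # assumption: silently skip empty items
        # partition() leaves value empty when the item has no colon
        key, _, value = item.partition(":")
        pairs.append((key, value))
    return pairs


pairs = sketch_rawparse("raw://a:b,c:d,e:f")
```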

discodex.mapreduce.parsers.recordparse(iterable, size, fname, params)

Splits lines of input by whitespace and creates discodex.mapreduce.Record objects.

discodex.mapreduce.parsers.wordparse(iterable, size, fname, params)

Splits lines of input by whitespace and uses the words as keys for the value fname.
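As an illustration of that behaviour, the following sketch pairs every word with the name of the file it came from. sketch_wordparse is a hypothetical stand-in, not the discodex function; it ignores size and params.

```python
def sketch_wordparse(iterable, size, fname, params):
    """Yield (word, fname) for every whitespace-separated word in the input."""
    for line in iterable:
        for word in line.split():
            yield word, fname


pairs = list(sketch_wordparse(["hello world", "hello"], None, "doc1.txt", None))
```

Feeding such pairs into an index gives a simple word-to-file lookup.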

discodex.mapreduce.demuxers – builtin demuxers

Demuxers are essentially the map function for the discodex.mapreduce.Indexer.

A demuxer takes a record (see discodex.mapreduce.parsers) and produces zero or more (key, value) pairs to be stored in the index.

discodex.mapreduce.demuxers.inverteddemux(record, params)

Produce ('fieldname:value', record) pairs for a record.

Can be used to produce an inverted index.
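A minimal sketch of this kind of demuxer, assuming for illustration that a record is a plain dict of named fields (the real function operates on discodex records, and sketch_inverteddemux is a hypothetical name):

```python
def sketch_inverteddemux(record, params=None):
    """Yield ('fieldname:value', record) for each named field of the record."""
    for fieldname, value in record.items():
        yield "%s:%s" % (fieldname, value), record


record = {"color": "red", "shape": "square"}
pairs = list(sketch_inverteddemux(record))
```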

discodex.mapreduce.demuxers.invertediddemux(record, params)

Produce ('fieldname:value', id) pairs for a record.

Can be used to produce an inverted index when records contain a field named ‘id’.
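The sketch below shows how such pairs accumulate into an inverted index that maps each 'fieldname:value' key to the ids of matching records. Records are modelled as plain dicts, sketch_invertediddemux is a hypothetical name, and skipping the 'id' field itself is an assumption rather than documented behaviour.

```python
from collections import defaultdict


def sketch_invertediddemux(record, params=None):
    """Yield ('fieldname:value', id) for each field other than 'id'."""
    for fieldname, value in record.items():
        if fieldname != "id":  # assumption: do not index the id under itself
            yield "%s:%s" % (fieldname, value), record["id"]


records = [
    {"id": "1", "color": "red", "shape": "square"},
    {"id": "2", "color": "red", "shape": "circle"},
]

# Collect the ids seen under each key.
index = defaultdict(set)
for record in records:
    for key, record_id in sketch_invertediddemux(record):
        index[key].add(record_id)
```

Storing ids instead of whole records keeps the index small when the records can be fetched elsewhere by id.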

discodex.mapreduce.demuxers.itemdemux(kvsdict, params)

Unpacks the kvsdict to produce all ('k:v', kvsdict) pairs.

If a key has no values, a ('k', kvsdict) pair is produced instead.
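The two cases can be sketched as follows, assuming for illustration that kvsdict is an ordinary dict mapping each key to a list of values. sketch_itemdemux is a hypothetical stand-in, not the discodex function.

```python
def sketch_itemdemux(kvsdict, params=None):
    """Yield ('k:v', kvsdict) per value, or ('k', kvsdict) for keys without values."""
    for key, values in kvsdict.items():
        if values:
            for value in values:
                yield "%s:%s" % (key, value), kvsdict
        else:
            yield key, kvsdict


kvsdict = {"tag": ["a", "b"], "flag": []}
keys = [key for key, _ in sketch_itemdemux(kvsdict)]
```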

discodex.mapreduce.demuxers.namedfielddemux(record, params)

Produce (fieldname, value) pairs for a record.

Can be used to produce an index of all the possible values of each namedfield.
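For example, collecting the emitted pairs per field name gives the set of values each field takes. This is a sketch under the assumption that a record is a plain dict of named fields; sketch_namedfielddemux is a hypothetical name.

```python
def sketch_namedfielddemux(record, params=None):
    """Yield (fieldname, value) for each named field of the record."""
    for fieldname, value in record.items():
        yield fieldname, value


records = [
    {"color": "red", "shape": "square"},
    {"color": "blue", "shape": "square"},
]

# Accumulate the distinct values seen for each field name.
values_by_field = {}
for record in records:
    for fieldname, value in sketch_namedfielddemux(record):
        values_by_field.setdefault(fieldname, set()).add(value)
```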

discodex.mapreduce.demuxers.nodemux(kvrecord, params)

Yields the kvrecord itself.

discodex.mapreduce.balancers – builtin balancers

Balancers are essentially the partition function for the discodex.mapreduce.Indexer.

The balancer is called for every (key, value) pair (see discodex.mapreduce.demuxers) and returns an integer indicating which partition it belongs in.

discodex.mapreduce.balancers.nchunksbalance(key, partitions, params)

Randomly chooses a partition.

discodex.mapreduce.balancers.roundrobinbalance(key, partitions, params)

Cycles through the partitions in a round robin fashion.
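Both strategies can be sketched with the same (key, partitions, params) call shape. The names below are hypothetical, and the closure-based counter is just one way to keep round-robin state; it is not taken from the discodex source.

```python
import itertools
import random


def make_roundrobin_balancer():
    """Return a balancer that cycles through partitions 0, 1, ..., n-1, 0, ..."""
    counter = itertools.count()

    def balance(key, partitions, params=None):
        # The key is ignored; only the call order decides the partition.
        return next(counter) % partitions

    return balance


def sketch_randombalance(key, partitions, params=None):
    """Return a uniformly random partition number in [0, partitions)."""
    return random.randrange(partitions)


balance = make_roundrobin_balancer()
assigned = [balance(key, 3) for key in "abcde"]
```

Neither strategy looks at the key, so pairs that share a key may end up in different partitions; a key-based balancer would be needed to avoid that.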