Table Of Contents

Previous topic

Discodex

Next topic

discodex – Discodex command line utility

This Page

Tutorial

This tutorial will walk you through the steps necessary to get Discodex up and running on your local machine. It will also walk you through the basic concepts of building and querying indices with Discodex. To install Discodex on a server or using a remote Disco master, you should only need to change a few discodex.settings.

Installation

Install Discodex

To run Discodex, the Python package and discodex command line utility must be installed. See discodex for information on installing the command line utility. You can install the Python package either using a symlink, or by running:

make install-discodex

Install Django

Discodex uses the Python web framework, Django, to handle requests coming to the HTTP server and map them to Disco jobs. The Django server acts as a Disco client for you, so that you can use Discodex from thin clients, such as web applications.

Follow the instructions here. To use Discodex out of the box, you must also have lighttpd installed, as well as flup.

If you want to understand why these other projects are used, read this.

Note

If you prefer, you can use Discodex as a library, in which case you can configure your web server however you like.

Using Discodex

Start Disco

If you haven’t already started a Disco master, you will need to do so now. Discodex requires that a Disco master be running at DISCODEX_DISCO_MASTER, so that it can submit jobs to it.

Usually, you can start Disco simply by running:

disco start

For more information on starting Disco, see Setting up Disco.

Start Discodex

Discodex runs its own HTTP server which translates HTTP requests into Disco jobs. In order to use Discodex, you will need to start the server:

discodex start

Build an Index

Discodex makes it easy to build indices from data, assuming you know how you want to create keys and values from your data. The default parser for Discodex, is rawparse. It simply takes the string attached to raw:// URLs, and decodes them in a special way to produce keys and values:

discodex index raw://hello:world

If you check the disco status web page, you can still see the job Discodex executed to build the index. If you see the ‘green light’ next to the job, you’ve successfully built your first index! The job will remain there until you read it for the first time. Officially, it won’t become an index until you read it using discodex get or some other command (such as discodex clone). You can confirm that you don’t see it yet when you do:

discodex list

Using the name of the job returned from the discodex index command, let’s go ahead and make it official:

discodex get <INDEX>

You should see the tag object stored on DDFS printed out. You should also see the index name now when you do:

discodex list

Let’s copy the index to a more human-readable name:

discodex clone <INDEX> toyindex

Once more, let’s see whats available:

discodex list

Notice the prefix. This is the prefix stored in the settings DISCODEX_INDEX_PREFIX. Generally speaking, you can ignore this prefix and just use the name you gave it. The reason it exists is to provide Discodex with its own namespace in Disco Distributed Filesystem, where the indices are stored.

Let’s try seeing the keys stored in the index:

discodex keys toyindex

And the values:

discodex values toyindex

Let’s also try querying it:

discodex query toyindex hello

If you have ddfs installed, you can try:

ddfs ls
ddfs ls discodex

Notice how the indices are just tags stored on DDFS.

Now that we’ve created our first index and queried it, let’s clean up our mess:

discodex list | xargs -n 1 discodex delete

You could have also done:

ddfs ls discodex: | xargs ddfs rm

Warning

Be careful, these commands will delete all your indices!

If you ran the queries against Discodex, you should still see the query jobs Discodex ran on the Disco web interface. If you want Discodex to cleanup after itself automatically, touch the file stored in the DISCODEX_PURGE_FILE setting. If you don’t know what file that is, just run:

discodex -v

If the purge file exists, Discodex will purge query jobs after they complete. If you ever need to know why a query job fails, its a good idea to turn off purging. If you have disco installed, you can clean up any remaining jobs using:

disco jobs | xargs disco purge

Warning

Be careful, this command will purge all of your Disco jobs!

Querying the index

Let’s build a slightly more complicated index and try querying it:

discodex index raw://hello:world,hello:there,hi:world,hi:mom
discodex clone <index> rawindex

Go ahead and try the following queries:

discodex query rawindex hello
discodex query rawindex hi
discodex query rawindex hello hi
discodex query rawindex hello,hi

Discodex queries the underlying discodb objects using conjunctive normal form. In queries from the command line, you can use spaces to separate clauses, and commas to separate literals.

Index the docs

Let’s try indexing some real files now. We can use the Disco documentation:

find $DISCO_HOME/doc -name \*.rst | xargs discodex index --parser wordparse

Note

Any text files will work, just make sure to pass absolute paths.

Let’s name the index:

discodex clone <INDEX> words

If you indexed the docs as above, you can now see which files contain the word discodex:

discodex query words discodex

We can also see which files contain the words discodex and build:

discodex query words discodex build

Congratulations, you’ve built a basic search engine! Remember, Discodex scales automatically with the size of your cluster, so don’t be afraid to try it out with millions or billions of keys and values!