Hello, World Tutorial (Python)

Introduction

This Tutorial is for webmaster/programmers. By practicing simple tasks at the command line, you will learn the basics of how to:

  • Sign up for IndexDen
  • Create an index and populate it with searchable text
  • Send search queries to the index and understand the results
  • Tweak the search results and influence the order in which they appear (with scoring functions and variables)
  • Delete documents

Before You Start

To run the tutorial, you will need:

  • Python (version 2.6 or greater) installed on your machine (IndexDen can be used with other languages, but in this Tutorial we use Python. If you don't already have Python, you can get it at python.org).
  • Internet access (You must be able to view indexden.com in a Web browser).
  • Knowledge of your computer's command prompt (You will need to open a console window and issue commands at the prompt in order to perform the steps in the tutorial).

System requirements:

  • The Tutorial has been tested on Linux, Mac OS X, and Windows.

The following would be helpful before you start, but are not required:

  • Working knowledge of a programming language (You will be able to complete the Tutorial just by following along with our example code, but in order to build a real application using what you've learned, you will have to do some real programming through our HTTP API or, more likely, one of our client libraries).
  • Read the FAQ to familiarize yourself with the vocabulary and general architecture of IndexDen (The Tutorial will go more smoothly if you already know what kinds of things you can do with IndexDen, what an index is, and so on. Each step in the tutorial will provide links to any relevant FAQ sections, so, if you prefer, you can learn each concept at the moment you need to know it rather than reading all the concepts ahead of time).

About The Example App

The Tutorial shows the process for setting up an index and performing some example queries on a fictitious web site that includes a forum where members discuss video games.

Step 1: Sign Up for IndexDen's Free Plan

Background Concepts: See Sign-Up, Pricing, and Billing in the FAQ.

  1. Open your browser and go to http://www.indexden.com/pricing and choose one of the plans.
  2. Type your email address.
  3. Go to http://indexden.com/dashboard.

Result:
The IndexDen dashboard appears, showing your new account:

indexden dashboard

Step 2: Create an Empty Index

Background Concepts: See What is an index? in the FAQ.

  1. Click new index.
    my dashboard
  2. In Index Name, type test_index.
  3. Click Create Index.
    my dashboard
  4. You now have an empty index that is ready to be populated with content. my dashboard

Step 3: Download the Client Library

Background Concepts: See What languages do you support? in the FAQ.

  1. Click client documentation or go to http://indexden.com/documentation.
    docs and support
  2. Click Python client library. This takes you to http://indexden.com/documentation/python-client.
  3. In the Download area, in Stable Version, click one of the links: zip for Windows or Mac, tgz for Unix/Linux.
  4. Extract the compressed library to a convenient directory on your computer.
  5. Open a console window and change ( cd ) to the directory where the client library is installed.

Result:
If you run the command to view the contents of the directory ( ls on Unix or Mac OS X; dir on Windows), you should see the file indextank_client.py

command

Step 4: Instantiate the Client

  1. While still in the directory where the client library is installed, run the Python interpreter.
    If you don't know how to use the Python interpreter, see http://docs.python.org/tutorial/interpreter.html.
     C:\> python

    The Python prompt appears:
     >>>
  2. Import the IndexDen client library to the Python interpreter by typing the following command at the Python prompt.
     >>> import indextank.client as itc

    itc is a name you give to the imported client so you can refer to it later.
  3. Instantiate the client by calling ApiClient()
     >>> api_client = itc.ApiClient('YOUR_API_URL')

    For YOUR_API_URL , substitute the URL from Private URL in your Dashboard (refer to the screen shot at the beginning of the Tutorial if you forgot where to find this).

Step 5: Set Up the Index

Background Concepts: See How do I get my data to you? in the FAQ

  1. Get a handle to your test index.
     >>> test_index = api_client.get_index('test_index')

    Here we call the get_index() method in the client library and pass it the name you assigned when you created the index.
  2. >>> test_index.add_document('post1', {'text':'I love Bioshock'})
    >>> test_index.add_document('post2', {'text':'Need cheats for Bioshock'})
    >>> test_index.add_document('post3', {'text':'I love Tetris'})
    						
    Here we call the add_document() method in the client library three times to index three posts in the video gamer forum.
    post1, post2, and post3 are unique alphanumeric IDs you give to the documents. If the doc ID is not unique, you will overwrite the existing document with the same ID, so watch out for typos during this Tutorial!
    The method parameters are name:value pairs that build up the index for one document. In this example, there is a single name:value pair for each forum post.
    text is a field name, and it has a special meaning to IndexDen: in search queries, IndexDen defaults to searching text if the query does not specify a different field. You'll learn more about this later, in Use Fields to Divide Document Text.

Result:
test_index now contains:

Doc ID Field Value
post1 text I love Bioshock
post2 text Need cheats for Bioshock
post3 text I love Tetris

Background Concepts: See What types of queries work with IndexDen? in the FAQ

Background Concepts: See the discussion of field names in What is an index? in the FAQ
So far, we have worked with

Step 6: Search Documents Using the Index

  1. Suppose you're interested in a particular game, and you want to find all the posts that contain Bioshock:
     >>> test_index.search('Bioshock')

    The output should look like this (you can ignore facets for now). The search term was found in post1 and post2:
    {'matches': 2,
     'facets': {},
     'search_time': '0.070',
     'results': [{'docid': 'post2'},{'docid': 'post1'}]}
    						
  2. Suppose you want to find only the true enthusiasts on the forum. You can search for posts that contain Bioshock: and love.
     >>> test_index.search('love Bioshock')

    The output should look like this. The two search terms were found together only in post1:
    {'matches': 1,
     'facets': {},
     'search_time': '0.005',
     'results': [{'docid': 'post1'}]}
    						
  3. You can also use the query operators OR and NOT. Let's try OR, which would come in handy if you play more than one game and you want to find posts that mention any of your favorites.
     >>> test_index.search('Bioshock OR Tetris')

    The output should look like this:
    {'matches': 3,
     'facets': {},
     'search_time': '0.007',
     'results': [{'docid': 'post3'},{'docid': 'post2'},{'docid': 'post1'}]}
    						
  4. To ask IndexDen to return more than just the doc ID, add the argument fetch_fields.
     >>> test_index.search('love', fetch_fields=['text'])['results']

    Here, text means we would like the full text of the document where the search term was found. This would be useful, for example, to construct an output page that provides complete result text for the reader to look at. The output should look like this:
    [{'text': 'I love Tetris', 'docid': 'post3'},
     {'text': 'I love Bioshock', 'docid': 'post1'}]
    						
  5. To show portions of the result text with the search term highlighted, use snippet_fields.
     >>> test_index.search('love', snippet_fields=['text'])['results']

    The output should look like this:
    [{'snippet_text': 'I love Tetris', 'docid': 'post3'},
     {'snippet_text': 'I love Bioshock', 'docid': 'post1'}]
    						

Step 7: Use Fields to Divide Document Text

simple document index entries that contain only a single field, text, containing the complete text of the document. Let's redefine the documents now and add some more fields to enable more targeted searching.

  1. Set up two fields for each document: the original text field plus a new field, game, that contains the name of the video game that is the subject of the forum post.
    >>> test_index.add_document('post1', {'text':'I love Bioshock', 'game':'Bioshock'})
    >>> test_index.add_document('post2', {'text':'Need cheats for Bioshock', 'game':'Bioshock'})
    >>> test_index.add_document('post3', {'text':'I love Tetris', 'game':'Tetris'})
    						

    Here we call add_document() with the same document IDs as before, so IndexDen will overwrite the existing entries in your test index
    Result:
    test_index now contains:
    Doc ID Field Value
    post1 text I love Bioshock
    game Bioshock
    post2 text Need cheats for Bioshock
    game Bioshock
    post3 text I love Tetris
    game Tetris
  2. Now you can search within a particular field. Let's use fetch_fields again to get some user-friendly output.
     >>> test_index.search('game:Tetris', fetch_fields=['text'])['results']

    Note that we can return the contents of a field even if it is not being searched. The output should look like this:
     [{'text': 'I love Tetris', 'docid': 'post3'}]

Step 8: Customize Result Ranking with Scoring Functions

Background Concepts: See What are scoring functions? in the FAQ.

A scoring function is a mathematical formula that you can reference in a query to influence the ranking of search results. Scoring functions are named with integers starting at 0 and going up to 5. Function 0 is the default and will be applied if no other is specified; it starts out with an initial definition of -age, which sorts query results from most recently indexed to least recently indexed (newest to oldest).

Function 0 uses the timestamp field which IndexDen provides for each document. The time is recorded as the number of seconds since epoch. IndexDen automatically sets each document's timestamp to the current time when the document is indexed, but you can override this timestamp. To make this scoring function tutorial easier to follow, that's what we are going to do.

  1. Assign timestamps to some new posts in the index.
    >>> test_index.add_document('newest',{'text': 'New release: Fable III is out','timestamp':1286673129})
    >>> test_index.add_document('not_so_new',{'text': 'New release: GTA III just arrived!','timestamp':1003626729})
    >>> test_index.add_document('oldest',{'text': 'New release: This new game Tetris is awesome!','timestamp':455332329})
    						
  2. Search using the default scoring function.
     >>> test_index.search('New release')

    The output should look like this. The default scoring function has sorted the documents from newest to oldest:
    {'matches': 3,
     'facets': {},
     'search_time': '0.002',
     'results': [{'docid': 'newest'},{'docid': 'not_so_new'},{'docid': 'oldest'}]}
    						
  3. Redefine function 0 to sort in the opposite order, by removing the negative sign from the calculation.
     >>> test_index.add_function(0,'age')
  4. Search again.
     >>> test_index.search('New release')

    The output should look like this. The oldest document is now first:
    {'matches': 3,
     'facets': {},
     'search_time': '0.005',
     'results': [{'docid': 'oldest'},{'docid': 'not_so_new'},{'docid': 'newest'}]}
    						
  5. Let's try creating another scoring function, function 1, using a different IndexDen built-in score called relevance. Relevance is calculated using a proprietary algorithm, and indicates which documents best match a query. First, add some test documents that will more clearly illustrate the effect of the relevance score.
    >>> test_index.add_document('post4', {'text': 'When is Duke Nukem Forever coming out? I need my Duke.'})
    >>> test_index.add_document('post5', {'text': 'Duke Nukem is my favorite game. Duke Nukem rules. Duke Nukem is awesome. Here are my favorite Duke Nukem links.'})
    >>> test_index.add_document('post6', {'text': 'People who love Duke Nukem also love our great product!'})
    						
  6. Now define function 1.
     >>> test_index.add_function(1,'relevance')
  7. Search using the new scoring function.
     >>> test_index.search('duke', scoring_function=1, fetch_fields=['text'])['results']

    The output should look like this. The most relevant document is now first:
    [{'docid': 'post5',
    'text': 'Duke Nukem is my favorite game. Duke Nukem rules. Duke Nukem is awesome. Here are my favorite Duke Nukem links.'},
    {'docid': 'post4', 'text': 'When is Duke Nukem Forever coming out? I need my Duke.'},
    {'docid': 'post6', 'text': 'People who love Duke Nukem also love our great product!'}]
    						

Step 9: Add Document Variables To Your Scoring Functions

Background Concepts: See What is a scoring function? in the FAQ.

In addition to textual information, each document can have up to three (3) document variables to store any numeric data you would like. Each variable is referred to by number, starting with variable 0. Document variables provide additional useful information to create more subtle and effective scoring functions.

For example, assume that in the video game forum, members can vote for posts that they like. The forum application keeps track of the number of votes. These vote totals can be used to push the more popular posts up higher in search results.

Let's also assume that the forum software assigns a spam score by examining each new post for evidence that it is from a legitimate forum member and contains relevant content, and then assigning a confidence value from 0 (almost certainly spam) to 1 (high confidence that the post is legitimate).

  1. Assign the total votes to document variable 0 and the spam score to document variable 1.
    >>> test_index.add_document('post4', {'text': 'When is Duke Nukem Forever coming out? I need my Duke.'}, variables={0:10, 1:1.0})
    >>> test_index.add_document('post5', {'text': 'Duke Nukem is my favorite game. Duke Nukem rules. Duke Nukem is awesome. Here are my favorite Duke Nukem links.'}, variables={0:1000, 1:0.9})
    >>> test_index.add_document('post6', {'text': 'People who love Duke Nukem also love our great product!'}, variables={0:1, 1:0.05})
    						
  2. Use the document variables in a scoring function.
     >>> test_index.add_function(2, 'relevance * log(doc.var[0]) * doc.var[1]')
  3. Run a query using the scoring function.
     >>> test_index.search('duke', scoring_function=2, fetch_fields=['text'])['results']

    The output should look like this:
    [{'docid': 'post5', 'text': 'Duke Nukem is my favorite game. Duke Nukem rules. Duke Nukem is awesome. Here are my favorite Duke Nukem links.'},
     {'docid': 'post4', 'text': 'When is Duke Nukem Forever coming out? I need my Duke.'},
     {'docid': 'post6', 'text': 'People who love Duke Nukem also love our great product!'}]
    						
  4. When more readers vote for a post, update the vote total in variable 0.
     >>> test_index.update_variables('post4',{0:1000000})
  5. Now run the query again with the same scoring function.
     >>> test_index.search('duke', scoring_function=2, fetch_fields=['text'])['results']

    The output should show the new most-popular post first:
    [{'docid': 'post4', 'text': 'When is Duke Nukem Forever coming out? I need my Duke.'},
     {'docid': 'post5', 'text': 'Duke Nukem is my favorite game. Duke Nukem rules. Duke Nukem is awesome. Here are my favorite Duke Nukem links.'},
     {'docid': 'post6', 'text': 'People who love Duke Nukem also love our great prod]
    						

Learn More: Scoring Functions

Step 10: Delete a Document

If you're 100% confident something should not be in the index, it makes sense to remove it.

  1. Take out that spam document.
     >>> test_index.delete_document('post6')
  2. Search again to confirm the deletion.
     >>> test_index.search('duke', scoring_function=2, fetch_fields=['text'])['results']

    The output should show only two results:
    [{'docid': 'post4', 'text': 'When is Duke Nukem Forever coming out? I need my Duke.'},
     {'docid': 'post5', 'text': 'Duke Nukem is my favorite game. Duke Nukem rules. Duke Nukem is awesome. Here are my favorite Duke Nukem links.'}]
    						

Step 11: Use Variables to Refine Queries

You can pass variables with a query and use them as input to a scoring function. This is useful, for example, to customize results for a particular user. Suppose we're dealing with the search on the forum site. It makes sense to index the poster's gamerscore to use it as part of the matching process.

  1. Add a document variable. This will be compared to the query variable later. Here variable 0 holds the gamerscore of the person who made the post:
    >>> test_index.add_document('post1', {'text':'I love Bioshock'}, variables={0:115})
    >>> test_index.add_document('post2', {'text':'Need cheats for Bioshock'}, variables={0:2600})
    >>> test_index.add_document('post3', {'text':'I love Tetris'}, variables={0:19500})
    						
  2. Suppose we want to boost posts from forum members with a gamerscore closer to that of the searcher. Let's define a scoring function to do that.
     >>> test_index.add_function(1, 'relevance / max(1, abs(query.var[0] - doc.var[0]))')

    This scoring function prioritizes gamerscores close to the searcher's own (query.var[0]). The absolute (abs) function is there to provide symmetry for gamerscores above and below the searcher's. The max function ensures that we never divide by 0, and evens out all gamerscore differences less than 1 (in case we use floats for the gamerscores, instead of integers).
  3. Suppose John is the searcher, with a gamerscore of 25. In the query, set variable 0 to the gamerscore.
     >>> test_index.search('bioshock', scoring_function=1, fetch_fields=['text'], variables={0: 25})['results']

    The output should look like this:
    [{'docid': 'post1', 'text': 'I love Bioshock.'},
     {'docid': 'post2', 'text': 'Need cheats for Bioshock.'}]
    						
  4. For Isabelle, gamerscore 15,000:
     >>> test_index.search('love', scoring_function=1, fetch_fields=['text'], variables={0: 15000})['results']

    The output should look like this:
    [{'docid': 'post3', 'text': 'I love Tetris.'},
     {'docid': 'post1', 'text': 'I love Bioshock.'}]
    						

Next Steps

Now that you have learned some of the basic functionality of IndexDen, you are ready to go more in-depth:

  • Client library documentation will tell you more about the specific capabilities and syntax of the client in your programming language.
  • Scoring functions goes into much more detail about formulas, operators, variables, and functions.

Enjoy using IndexDen to improve the quality of search on your website.


Real Time Web Analytics