Empress Technical News – January 2009
Database Text SearchIndex – Fast Text Data Retrieval – Part 1
Empress Ultra Embedded 10.20 offers many additional features to application developers. One of those features is a Text Search Index capability.
Empress Text Search Index
Empress Ultra Embedded V10.20 Text Search Index capability empowers application developers to implement an efficient search for database records using keywords/tokens/phrases. The most typical usage would be to associate those keywords with particular character/text based attribute in a database table.
The search index capability is developed as an additional set of C calls that are used in conjunction with Empress C/C++ Kernel Level API – mr Routines. The search index is a user maintained index – an index not maintained via Empress database engine calls.
Application would supply the list (array) of tokens/keywords/phrases on insertion in the Empress database in order to create a text search index.
Lets take an example of storing song titles in an Empress TEXT data type attribute. For the famous Beatles song “I Want To Hold Your Hand” the list of tokens/keywords/phrases could look like as follows:
In the above example the list contains seven tokens. Tokens can contain a single word or multiple words. Words do not necessarily have to be contained in the original song title. This could be beneficial when, for example, using either token Mc or Mac to get a song title “Old Macdonald”.
Once created an Empress text search index enables very powerful searches using inserted tokens/keywords.
In the presence of hundreds of thousands of song titles a search for the song title with a token Hold could result in the result set containing the following titles:
You Really Got A Hold on Me (Beatles)
The following example provides the actual program code (select_text.c) using Empress C/C++ Kernel Level API – mr Routines to show how to retrieve records from the table songs using the search index on the same table. The search index C API calls are in bold. The example performs the retrieval of all the records from the song table that contain token “Hold”. When translated into SQL the example does something like:
SELECT id, title FROM songs WHERE title HAS “Hold”
#include < mscc.h>
global_shared_func msbool mssio_printf (char* format, ...)
mropen (DATABASE, L" songs" , 'r')
while (mrget (retrieve_desc))
When executing the above mr program the performance could be several orders of magnitude better than when performing the same search using an SQL query, e.g.:
SELECT id, title FROM songs WHERE title LIKE “%Hold%”
SELECT id, title FROM songs WHERE title SMATCH “*Hold*”
example a search for a keyword " Embedded" was performed against all titles stored in Empress
database containing Wikipedia data. The table pages contained
7,649,051 titles and related information such as id
id, title FROM pages WHERE title SMATCH '*Embedded*'
The Empress text search index was about 500 times faster than the SQL query!!