So, how do you actually search for text using the HPCC? Well, the easiest method is to use an inverted index. An "inverted" index is really an index that looks a lot like the one you find in the back of a book: a list of words with the pages they appear on. For the purposes of searching the Bible we replace page numbers with verse references. Also, because I might eventually want to do more complex searches (such as finding two words one after the other), I will also store the position of each word within the verse.
As is my normal practice in a new ECL attribute, I start by defining the file I aim to work upon and the format I am trying to get it into:
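Here is that definition, excerpted from the full listing at the end of this article:

SHARED I := $.File_KJV.Txt;
SHARED R := RECORD
  UNSIGNED1 BookNum := 0;
  UNSIGNED1 Chapter := 0;
  UNSIGNED1 Verse := 0;
  UNSIGNED1 Word_Pos := 0;
  STRING Word := '';
END;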
It may surprise some people that I am actually naming the file I am going to act upon in the inversion module; surely it would be cleaner to have an inversion function and pass the file in? The issue is that we tend to think in terms of encapsulating code, not data. My Inversion module exists to build the inversion; the ingredients of that inversion are a secret to be encapsulated within the module. If over time I expand my inversion to include a dozen different translations then I don't want any code outside of the inversion to have to change[1].
Now, I need to tokenize my verses (turn them into streams of words in the correct format). The ECL standard library has some functions that would make that extremely easy if it weren't for all of that annoying punctuation. One has to be very careful handling messy data; if you just 'work around it' whenever you need to then you will probably grow multiple conflicting ways of dealing with the mess over time. Worse, as you find new bits of mess you will have to adapt your code to the new mess - something you will remember to do in most of the places you need to.
My solution is to pull the cleaning routines outside of the implementation modules and make them common and available across the whole application. That way there is one (and only one) standard for a cleaned verse. For the purposes of simplicity I have defined a verse as clean if it contains only letters, numbers, spaces and apostrophes.
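The real Clean attribute lives outside this module so the whole application can share it. As a minimal sketch of what it might look like (the attribute name Clean is the only thing the rest of the code relies upon; the body here is just one plausible implementation using the standard library Filter function):

IMPORT Std;
// Keep only letters, digits, spaces and apostrophes; strip everything else
EXPORT Clean(STRING S) := Std.Str.Filter(S,
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 \'');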
Given that I can now clean a verse trivially; tokenizing my verse into the inversion format becomes extremely easy:
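Here is the tokenizing code from the full listing; TakeWord is the TRANSFORM and NORMALIZE drives it:

R TakeWord(I le,UNSIGNED1 C) := TRANSFORM
  SELF.Word_Pos := C;
  SELF.Word := ToUpperCase(GetNthWord($.Clean(le.verse_Text),C));
  SELF := le;
END;
N := NORMALIZE(I,WordCount($.Clean(LEFT.verse_Text)),TakeWord(LEFT,COUNTER));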
The new function here is NORMALIZE; it causes the TRANSFORM to be called multiple times for each input record, putting out a new record each time. In this case it is called once for each word in the verse; the predefined value COUNTER gives the word position and allows the TRANSFORM to select the correct word.
If you wanted; you could view the first few records of your inversion:
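Something along these lines would do it (a hypothetical one-liner run from outside the module; CHOOSEN simply limits how many records come back):

OUTPUT(CHOOSEN($.Inversion.Records,10));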
This would yield:
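I have not reproduced the original output here; but assuming the file starts at Genesis 1:1 ('In the beginning God created the heaven and the earth'), the first few rows would look something like:

booknum  chapter  verse  word_pos  word
1        1        1      1         IN
1        1        1      2         THE
1        1        1      3         BEGINNING
1        1        1      4         GOD
1        1        1      5         CREATED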
Of course, if one really had a compulsive need, the list could also be used to produce a word count application:
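A sketch of such a word count, assuming a cross-tabulation TABLE (the names WordCounts and Cnt are mine):

// One row per distinct word, with its total number of occurrences
WordCounts := TABLE($.Inversion.Records,
                    {Word, UNSIGNED Cnt := COUNT(GROUP)},
                    Word, MERGE);
// The twenty most frequent words, most frequent first
OUTPUT(TOPN(WordCounts,20,-Cnt));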
What we need is an index that we can build and then use from Roxie; fortunately declaring that is extremely simple:
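From the full listing:

EXPORT Key := INDEX(Records,{STRING20 Wrd := Word,BookNum,Chapter,Verse,Word_Pos},'~key::kjv_inversion');
EXPORT Bld := BUILDINDEX(Key,OVERWRITE);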
Which declares the index and the build method; to actually BUILD the index the 'Bld' definition has to be executed from a builder window.
Unfortunately, the piece of code that actually performs the search is not extremely simple. In fact it is probably the single nastiest piece of ECL there is: the GRAPH statement. However, once mastered (and trust me; that process hurts) it is also one of the most powerful ECL statements there is. Put simply[2], the ECL GRAPH statement allows you to write the kind of dynamic query optimizer that is at the heart of tools such as Oracle and SQL Server. You can build a business on the GRAPH statement (we do).
I suppose that for the squeamish I should also show you the simple way to do this. If you want to find a verse with two particular words in it; the code is easy:
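A sketch of that simple version, using the words 'LOVE' and 'GOD' (upper case, because that is how the inversion stores its words); the JOIN is a half-keyed join against the index:

// Every word-occurrence record for verses containing LOVE
I1 := $.Inversion.Key(Wrd='LOVE');
// Of those verses, keep the ones that also contain GOD
I2 := JOIN(I1,$.Inversion.Key,
           RIGHT.Wrd='GOD'
           AND LEFT.BookNum=RIGHT.BookNum
           AND LEFT.Chapter=RIGHT.Chapter
           AND LEFT.Verse=RIGHT.Verse);
OUTPUT(I2);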
All we need to do is get a list of the verses containing 'Love' and then join that list back against the inversion, keeping only those verses that also contain the word 'God'. We could easily parameterize the above to accept any two words; but what if we want to find a verse with three words? Well - we can add another JOIN and recompile - and we might even be able to jiggle things so that it still works if only two words are put in. But eventually chaining all those joins together is going to begin to hurt.
Enter the GRAPH statement. The GRAPH statement allows you to create a Roxie execution graph at QUERY time out of a collection of components constructed at COMPILE time. So, first bit of confusion to get over: GRAPH does not help you process graphs; GRAPH creates an execution graph which it then executes.
The heart of a GRAPH statement is a function. In your GRAPH statement you get to pick how many times the function is called; at each call the function must return a dataset. The function will be passed the 'zero' dataset; and also all of the datasets that previous calls to the function have returned. Here is the function body to construct an 'N-way' search against our inversion:
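Here it is, excerpted from the full listing:

R GraphBody(SET OF DATASET(R) I,UNSIGNED C) := FUNCTION
  RETURN DEDUP(
    MAP( C = 1 => PROJECT(Key(Wrd=I[0][1].Word),TRANSFORM(R,SELF := LEFT)),
         JOIN(I[C-1],Key,
              RIGHT.Wrd=I[0][C].Word
              AND LEFT.BookNum=RIGHT.BookNum
              AND LEFT.Chapter=RIGHT.Chapter
              AND LEFT.Verse=RIGHT.Verse,TRANSFORM(R,SELF := LEFT)) )
    ,BookNum,Chapter,Verse);
END;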
I'll talk later about how we get a 'zero' dataset; for now just trust me that we somehow manage to turn a search string into a dataset with one row for each searched word. In the function this is ALWAYS the zeroeth element of the set of datasets - so wherever you read I[0] in the above, I am digging back to my zeroeth dataset[3].
Now, my first call of this GraphBody is going to be special because I don't have any intermediate results yet. If you look at my simple 'two word' example my first dataset (I1) is constructed by a single index fetch. Similarly if you follow the GraphBody logic for the first call to the body (C = 1) all I do is a single index fetch. HOWEVER - I have to wrap this in a PROJECT as all of the datasets returned from this function have to have the same type.
The second time GraphBody is called there will be two datasets defined; the omnipresent zeroeth dataset and the first dataset (which is the result from the first function call). The third time it is called I will have the zeroeth dataset and also datasets 1 & 2 from the first and second calls respectively. In order to 'cascade my joins' all I need to do is use the data from the 'N-1'th dataset and join it to the data for the Nth word - where I get the Nth word from the zeroeth dataset!
The DEDUP is there because my inversion preserves word position, which means the same word might have multiple entries for the same verse; for now I simply want to ignore that complexity.
That's it; if you understood the above the ECL only gets easier from here on. In order to tie up a few loose ends we first need to construct our zeroeth dataset. Fortunately we are back in the world of the simple:
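From the full listing:

D := DATASET([{S}],{STRING T});
R SearchBlock(D le,UNSIGNED C) := TRANSFORM
  SELF.Word := GetNthWord(le.T,C);
END;
N := NORMALIZE(D,WordCount(S),SearchBlock(LEFT,COUNTER));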
The NORMALIZE may look familiar; it should: it is the same technique we used to get data into the inversion, so it should not be surprising that we use it to get a search term into a form that matches the inversion.
Finally we invoke the GRAPH function itself (which takes the zeroeth dataset and returns the Nth dataset; the second parameter to GRAPH defines how many times the function is called):
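From the full listing:

G := GRAPH(N,COUNT(N),GraphBody(ROWSET(LEFT),COUNTER));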
In true ECL style all of this nastiness is wrapped up inside a function (you may notice the EXPORT Search that we defined earlier) and the results are returned in the common layout (which for now is defined in the KJV module):
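From the full listing (the middle of the function is the code we have just walked through):

EXPORT Search(STRING S) := FUNCTION
  // ... the zeroeth dataset, GraphBody and the GRAPH call shown above ...
  RETURN PROJECT(G,TRANSFORM($.File_KJV.Layout_Reference,SELF := LEFT));
END;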
For those who have followed both parts: we have now ingested a raw datafile and converted it into a common format, and we have now made that text searchable using an inversion. The inversion returns a list of Bible references; these can be turned back into a list of Bible verses using the GetBatch export from the KJV module.
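A hypothetical invocation (I am assuming GetBatch accepts the reference dataset that Search returns; its exact signature lives in the KJV module from Part I):

// Verses containing both LOVE and GOD, rendered as full text
OUTPUT($.File_KJV.GetBatch($.Inversion.Search('LOVE GOD')));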
This returns the full text of every verse containing all of the search words.
The code above (which is copied below in 'cut & pasteable' form) can run in Roxie or on the ECL Agent. Of course at the moment it can only be called from ECL (albeit very simple ECL). In Part III we will turn this search service into something that can be driven from a SOAPCall or via the Web.
IMPORT Std.Str AS *;

EXPORT Inversion := MODULE

  // The file being inverted is named here, inside the module, quite deliberately
  SHARED I := $.File_KJV.Txt;

  // One record per word occurrence: the verse reference plus the word's position
  SHARED R := RECORD
    UNSIGNED1 BookNum := 0;
    UNSIGNED1 Chapter := 0;
    UNSIGNED1 Verse := 0;
    UNSIGNED1 Word_Pos := 0;
    STRING Word := '';
  END;

  // Pull the Cth word out of a cleaned, upper-cased verse
  R TakeWord(I le,UNSIGNED1 C) := TRANSFORM
    SELF.Word_Pos := C;
    SELF.Word := ToUpperCase(GetNthWord($.Clean(le.verse_Text),C));
    SELF := le;
  END;

  // One output record for every word of every verse
  N := NORMALIZE(I,WordCount($.Clean(LEFT.verse_Text)),TakeWord(LEFT,COUNTER));

  EXPORT Records := N;
  EXPORT Key := INDEX(Records,{STRING20 Wrd := Word,BookNum,Chapter,Verse,Word_Pos},'~key::kjv_inversion');
  EXPORT Bld := BUILDINDEX(Key,OVERWRITE);

  EXPORT Search(STRING S) := FUNCTION
    // The 'zeroeth' dataset: one row per word of the search string
    // (NB: the inversion stores its words upper-cased, so S should be upper case too)
    D := DATASET([{S}],{STRING T});
    R SearchBlock(D le,UNSIGNED C) := TRANSFORM
      SELF.Word := GetNthWord(le.T,C);
    END;
    N := NORMALIZE(D,WordCount(S),SearchBlock(LEFT,COUNTER));
    // Called once per search word; call 1 is a simple index fetch,
    // every later call joins the previous result back against the index
    R GraphBody(SET OF DATASET(R) I,UNSIGNED C) := FUNCTION
      RETURN DEDUP(
        MAP( C = 1 => PROJECT(Key(Wrd=I[0][1].Word),TRANSFORM(R,SELF := LEFT)),
             JOIN(I[C-1],Key,
                  RIGHT.Wrd=I[0][C].Word
                  AND LEFT.BookNum=RIGHT.BookNum
                  AND LEFT.Chapter=RIGHT.Chapter
                  AND LEFT.Verse=RIGHT.Verse,TRANSFORM(R,SELF := LEFT)) )
        ,BookNum,Chapter,Verse);
    END;
    G := GRAPH(N,COUNT(N),GraphBody(ROWSET(LEFT),COUNTER));
    RETURN PROJECT(G,TRANSFORM($.File_KJV.Layout_Reference,SELF := LEFT));
  END;

END;