27/03/2017

[Java] Trie data structure for text input prediction

A Trie is a very useful data structure with a confusing name, which is luckily easier to use than to pronounce.

I have seen it mainly used in input text suggestion or String analysis. The structure is a Tree where each node can have a label and multiple children; nodes with a label represent a full piece of data (eg a word) while nodes without label are used only for navigation or substring evaluation for example.

The list of children can be implemented with different data structures, I propose a version using a HashMap for faster searching: given an input String s, starting from the first character keep calling get(char) until we reach a word to suggest. Each get is O(1) so we are only paying O(M) (worst case is we're going to predict the longest string present).

Using a List would require us to scan the list until we find the entry for the next character which I think due to big-O magic can be assumed to still be O(1) since the list is at most as big as the alphabet we are using and therefore we can call it constant size, but still sounds like an unnecessary operation to me.

LinkedHashMap would not help much unless we associate some meaning with the insertion order eg: most often used words first.

TreeMap would slow the process down but guarantee to retrieve words in the desired order (eg alphabetical).

Also I propose two suggestion methods based on depth-first and breadth-first traversals. The difference being that the first method will scan greedily until we find the first word we can suggest, regardless of the length, while the second method will always return the shortest word we can suggest.

Finally, I provide a getSuggestions method that will return all possible suggestions for a given partial word, in ascending length order.

You can check the code out on my Gist.

No comments:

Post a Comment

With great power comes great responsibility