Natural Language Querying

Corey Montella - 14 Jun 2016

As if Eve wasn’t enough work already, we decided to experiment with a natural language (NL) interface. While we recognized at the outset that natural language processing (NLP) is an unsolved problem, we couldn’t resist its siren song. I mean, what could be better for non-programmers than the ability to write a program in plain English (or their native language of choice)?

Despite the technical challenge, we thought we had a sufficiently constrained version of the problem that perhaps we had a shot at solving it. Typical NLP tasks include machine translation, article summarization, sentiment analysis, part-of-speech tagging, coreference resolution, and named entity recognition. Today’s best approaches train a classifier using a large corpus of annotated training text. Depending on the text and task, performance ranges from impressive (POS tagging, coreference resolution) to “needs work” (article summarization, sentiment analysis). This would be considered a statistics-based approach.

Our NLP task was very specific: to turn a plain English query into a formal query recognized by Eve. Thus, we leveraged a number of assumptions that simplified the problem enough to make it tractable (or at least we hoped).

We only had to deal with a single class of text: questions no longer than a sentence. We didn’t have to deal with harder tasks like understanding a whole news article.
Users could only ask about entities in the Eve database, so named entity recognition was just a lookup in the Eve database.
People typically ask questions in a limited number of ways, often starting with keywords like who, what, when, and where.
Questions are usually of a similar construction, i.e. people don’t usually invert questions, unless they’re Yoda.
Again, since you can only ask questions about data in the database, any ambiguities could be enumerated and presented to the user.

Thus, we hoped that even if we couldn’t answer every question, we could handle the vast majority with a set of simple heuristics, and then cover the very long tail of questions we can’t handle with case-specific rules (heuristics) that we build over time.

Method

Let’s take a look at an example query and its output. The English query "What is Corey's age?" would be transformed into the following formal query:

(query
 (select "eavs" :entity "corey" :attribute "age" :value corey|age)
 (project! "corey's age" :age corey|age))

How does this work? As mentioned before, we used a heuristic approach rather than a statistical machine learning approach. While learning approaches have proven to be very effective, they require a large amount of tagged and classified training data, which can be expensive to obtain. For our problem, these data would have been mappings from plain English queries to a formal queries. We just didn’t have these data, so a machine learning approach would require a large up-front cost to obtain them.

An alternative route is to come up with a set of hand-coded rules, called heuristics, based on our expert knowledge of the English language; we use English enough to be able to recognize (and exploit) structure and patterns. While statistics based approaches predominant, heuristic approaches have their own advantages:

A heuristic approach is lightweight and can easily be stored on the client. We don’t need training data, which also means we don’t have to host a model on a server or ship it to the client.
A heuristic approach can be improved simply by adding a new rule. Statistical approaches require re-training a new model with with new data to improve performance, which can take an inordinately long time. Then we would have to ship the new model to the client, which is another cost we don’t have to pay.
The heuristic approach is easy to reason about, since the rules are generated and encoded by humans. Statistical models use millions of automatically generated weights to arrive at decisions. It’s very hard to reason about how or why the classifier behaved the way it did beyond “Well, it made that decision because the weights told it to.”
Heuristic approaches are blazingly fast. Our parser did its job in ~1 ms in Chrome. The best performance we could get out of a comparable statistical parser was 20ms, and that wasn’t even in the browser.

Let’s circle back to the original question: how did it work? Well if you really want to know, then take a look at the source here. I’ll go through the process step-by-step with the example query "What is the average salary per department?"

Tokens

First we break the string at whitespace boundaries to create tokens. To facilitate matching, we normalize the words by depluralizing and lowercasing them. Then we tag each word with a part of speech, which is just referenced in a small dictionary. The POS of a word helps us determine how it relates to other words. Most words in a sentence will be glue, like determiners and prepositions. Nouns are usually entities. Plural nouns are usually collections. Adjectives are usually either attributes or the value of some attribute. With these a priori classifications, we can start constructing the query graph before we even look up the word in the Eve database, which is the next step.

------------------------------------
   WORD        | NORM WORD  | POS  |
------------------------------------
root        | root       | NOUN |
What        | what       | GLUE |
is          | is         | VERB |
the         | the        | GLUE |
average     | average    | NOUN |
salary      | salary     | NOUN |
per         | per        | GLUE |
department? | department | NOUN |

Nodes

Next, we classify each token in the context of the Eve database. First we n-gram the tokens and match each one against the database. This determines the class of a word, whether it’s an entity, attribute, value, collection, or function. In our example, average and per are functions, salary is an attribute, and department is a collection.

--------------------------------------------------------
   WORD        | NORM WORD  | POS  | PROPERTIES
--------------------------------------------------------
root        | root       | NOUN | (IMPLICIT|ROOT)
What        | what       | GLUE |
is          | is         | VERB |
the         | the        | GLUE |
average     | average    | NOUN | (FUNCTION)
salary      | salary     | NOUN | (ATTRIBUTE)
per         | per        | GLUE | (GROUPING|FUNCTION)
department? | department | NOUN | (COLLECTION)

Tree

Next, we form a tree with the tagged and classified tokens. If the tokens represent objects in the Eve database, the tree represents relationships between the objects. Unmatched glue words like “is” and “the” are omitted from the tree. Relationships between nouns are found through a truncated graph search; we store links between entities and collections in a special links table in Eve, and then try to find a path between each word. For example, in this query we want to find the relationship between salary and department. When we look them up in the links table, we find there is no direct link. So we try to find a 1-degree link, which is found through the employee collection (employees have salaries and belong to a department).

|*0: root  (IMPLICIT|ROOT)
|* 6: per [group] (GROUPING|FUNCTION)
|*  0: root  (IMPLICIT|ROOT)
|*   4: average [average] (FUNCTION)
|*    0: average  (IMPLICIT|ARGUMENT|OUTPUT|OUTPUT)
|*     0: output0 [output0] (IMPLICIT|OUTPUT)
|*    0: value  (IMPLICIT|ARGUMENT|INPUT|ATTRIBUTE)
|*     5: salary [employee|salary] (ATTRIBUTE)
|*   0: department [department] (IMPLICIT|ATTRIBUTE)
|*  0: collection  (IMPLICIT|ARGUMENT|INPUT|COLLECTION|ATTRIBUTE)
|*   7: department [department] (COLLECTION)

Notice the implicit nodes in the tree. These come from functions, which are added to the tree as a subtree. For instance, the average function takes two children: an input and an output. The per function takes two inputs as well: a body and a collection over which to group the body. By adding these subtrees, we can ensure that the query plan is fully specified before we try to turn it into a formal query.

Formal Query

Once we have the fully formed tree, converting it to a formal query is straightforward, since the tree structure was designed to directly map to the formal query grammar. The formal query for our example is

(query
 (select "tags" :entity department :collection "department")
 (query
  (select "eavs" :entity employee :attribute "salary" :value employee|salary)
  (select "eavs" :entity employee :attribute "department" :value department)
  (select "tags" :entity employee :collection "employee")
  (select "tags" :entity department :collection "department")
  (select "average" :average output0 :value employee|salary))
 (project! :average output0 :department department))

Which you can see is almost a direct translation of the tree. One major difference is that we inline the children of functions as arguments. Also, the grouping function per is implicitly represented by the subquery.

And that’s it. The client sends this code to the Eve server, and the server returns a result. In this case:

--------------------------------
| AVERAGE SALARY | DEPARTMENT  |
--------------------------------
| 8.5            | engineering |     
| 10             | operations  |     
--------------------------------

Results

This looked pretty great to us at the time. It appeared we had a system that could generate a relatively complex formal query from a simple English query. But how does this generalize? Well, again, we didn’t have any data on what kind of questions people would ask. The best we could do was try and grow the space of acceptable queries as large as we could, and then release to the public, updating the rules as we find queries that break the system.

So we did exactly that; in February we released WikiEve to the public and the queries started rolling in. Almost immediately people started breaking all the assumptions we made. Of course we knew things would break once we released WikiEve to the wild. Most of the problems seemed very fixable until two:

First, some users had trouble formulating their query, even in English. This surprised us, because these same users have no problem formulating vague questions for Google searches. When it came to specific searches over their own data, some users expressed that they didn’t even know where to begin.

Second, we encountered a query that really gave us pause: "Which planets are uninhabitable?" On the surface, this seems okay. We had a database of planets, and only one of them was tagged "habitable". If the user had expressed his query as "Which planets are not habitable?" then our system would have worked. But this alternative query is completely valid, so we should be able to support it. But look at the word “uninhabitable”? This word is nefarious. First consider just the prefix “in-“, which commonly negates the root word. “Insecure”, “inaccurate”, “inanimate”, “indifferent”, “incoherent”. But “inhabitable” does not mean “not habitable”. It still means “habitable”. If I were to write a general heuristic to handle “in-“ based on its common meaning, it would classify “inhabitable” as “not habitable”. Then “uninhabitable”, which contains a double negative prefix (i.e. “un-“, “in-“), would be classified as “habitable”. Oh my…

Lessons Learned

So where do we go from here? The first problem might be okay; we intentionally didn’t explain how the system worked in the hope that users would figure it out on their own. But I think it pointed to a deeper issue, namely that the text interface needed to be augmented with another interface to guide the formulation of an English query. A key difference between Google and our brand of search is that Google is very fuzzy, well we… well, we had no fuzz. If a specific English query failed, we just returned no results. Certain users definitely displayed Google-like behavior, such as trying multiple variations on a query until one worked. But we didn’t offer any tooling which told the user why the query was wrong and how to fix it. Adding this tooling could certainly help the process.

The second issue was pretty much a show stopper though. The best solutions we came up with to solve the problem in a general way involved statistics, machine learning, and training data. This put us back at square one, since we had no training data.

In the end, we all agreed that we could spend an arbitrary amount of time developing a system that meets our requirements. But first of all, it looks like a NL interface might not be the UI panacea we thought it was. Second, we’re not a NLP/AI/ML company; we’re building a programming language, and the NL interface was only created to support that goal.

So we decided to abandon the effort, at least for now. As the language stabilizes and the company grows, we will eventually find enough time/resources to revisit this problem. Or, in the near parsing text might be as easy as a Google API call. Either way, we still feel that a natural language interface has a place in Eve’s future.