"We do really badly on the query ca chp" a coworker complained in one email.
"Ca chp?" I thought. "What the heck does that mean?"
It turned out to be pretty simple: "ca" was short for California and "chp" was short for California Highway Patrol. Obviously, my coworker knew what he meant by the query ca chp, but I didn't, and our search engine definitely didn't. After seeing many customer complaints of this sort, we took it as further confirmation that, to truly improve the relevance of our search engine, we had to move past simple keyword matching and into understanding the intent of your query.
So when you search for crossroads mall in OKC, we take this to mean crossroads mall in Oklahoma City. When you search for Julia child bio, we'll also look for Julia child biography to give you better results. But of course, the same word can mean something different in another context. Hence, when you search for nw university we'll search for northwestern university, but if you search for nw co-ed soccer we'll search for northwest co-ed soccer instead.
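To make the idea of context-dependent expansion concrete, here is a minimal sketch in Python. It is not our production logic: the EXPANSIONS table, its "default" key, and the expand_query function are hypothetical names invented for illustration, and the rule of looking only at the next word is a deliberate simplification.

```python
# Hypothetical expansion table (an assumption for illustration only).
# Each abbreviation has a default expansion plus optional overrides
# keyed by the word that follows it in the query.
EXPANSIONS = {
    "okc": {"default": "oklahoma city"},
    "bio": {"default": "biography"},
    "nw": {"default": "northwest", "university": "northwestern"},
}


def expand_query(query):
    """Return the query with known abbreviations expanded.

    The expansion chosen for a token depends on the next token,
    falling back to the abbreviation's default expansion.
    """
    tokens = query.lower().split()
    expanded = []
    for i, token in enumerate(tokens):
        rule = EXPANSIONS.get(token)
        if rule is None:
            # Not a known abbreviation: keep the word as typed.
            expanded.append(token)
            continue
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        expanded.append(rule.get(next_token, rule["default"]))
    return " ".join(expanded)


print(expand_query("nw university"))
print(expand_query("nw co-ed soccer"))
```

In this sketch "nw" becomes "northwestern" only when followed by "university", and "northwest" otherwise. A real engine would likely search for both the original and the expanded query, as in the Julia child bio example, rather than replace one with the other.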
Intelligent "stop word" retention
Another area that fell under the "Do what I mean, not what I say!" category was "stop words".
What are "stop words", you ask?
Well, in search engine parlance they are words that oftentimes may not contain much "meaning" in the query – words such as (a, the, in, etc.) and hence it may not be crucial as to whether they are found on the desired results page or not. For example if the query was the aurora borealis, you probably wouldn't be too concerned as to whether the word "the" was found on the top page returned or not, since "the" doesn't contain much meaning here. Hence, it may be perfectly acceptable to drop it from the query when retrieving pages.
However, if your query was The Office (the title of a popular television show) it would be absolutely ridiculous to drop the word "the" since the query would essentially change meaning – and we received a lot of emails about how we were doing just that. In fact, previously we were routinely dropping all stop words, and we knew this needed dramatic improvement.
In our recent release we've overhauled our logic. If you search for something where the "stop words" carry crucial meaning, we can sense that and realize that "the" in The Office is crucial, or that the "A" in Avenue A is crucial, whereas if you query for something like the aurora borealis, we realize that the word "the" isn't as crucial as the other query words.
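As a rough illustration of the difference between dropping every stop word and retaining the ones that matter, here is a small Python sketch. It assumes a hand-written list of protected phrases as a stand-in for whatever signal decides that a stop word is crucial; STOP_WORDS, PROTECTED_PHRASES, and strip_stop_words are hypothetical names, not a description of our actual logic.

```python
# Illustrative stop word list (an assumption, far from complete).
STOP_WORDS = {"a", "an", "the", "in", "of"}

# Hypothetical list of queries whose stop words carry meaning.
# A real system would need a much richer signal than a fixed list.
PROTECTED_PHRASES = {"the office", "avenue a"}


def strip_stop_words(query):
    """Drop stop words unless the whole query is a protected phrase."""
    normalized = " ".join(query.lower().split())
    if normalized in PROTECTED_PHRASES:
        # The stop word is part of the name, so keep the query intact.
        return normalized
    kept = [t for t in normalized.split() if t not in STOP_WORDS]
    # Never return an empty query: if every word was a stop word,
    # fall back to the normalized original.
    return " ".join(kept) if kept else normalized


print(strip_stop_words("The Office"))
print(strip_stop_words("the aurora borealis"))
```

Here The Office and Avenue A pass through untouched, while the aurora borealis loses its leading "the". The old behavior we received emails about corresponds to skipping the protected-phrase check entirely.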