Semantic Search for intergator

When I wrote my last blog post 16 months ago, I had not imagined that I would be part of the intergator team for so long. Nevertheless, following my internship I was offered not only a student job but, after that, also the opportunity to write my diploma thesis here.

In my thesis I wanted to expand on what had been the focus of my internship assignment: how can enterprise search be further enhanced at a semantic level? Let’s look at a few scenarios. One person who searches for the term “bank” expects results about the financial institution, while another is looking for answers about a river bank. When one of my colleagues enters “books”* into our intergator enterprise search bar, he expects to find hits for “Documentation Reader” or “Docreader”; someone running the same query in a general context would want to find something about sets of written sheets of paper bound together…

Several problems contribute to these examples: disambiguation, domain-, company- and department-specific vocabulary, and searches across multiple languages. A much-discussed approach at the moment is to project a word’s meaning onto a multidimensional vector space. This projection is learned by artificial neural networks that look at each word together with its local context in large amounts of text.
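To make the idea of a vector space a little more tangible, here is a minimal sketch in Python. It is not the model from my thesis: the three-dimensional vectors are hand-made, hypothetical values chosen only for illustration, whereas trained models typically use hundreds of dimensions. The sketch shows how the closeness of two words can be measured as the cosine of the angle between their vectors.

```python
import math

# Hand-made toy vectors (illustrative values only, NOT trained).
# A real model would learn these from large amounts of text.
TOY_VECTORS = {
    "bank":  [0.9, 0.1, 0.3],
    "money": [0.8, 0.0, 0.2],
    "river": [0.1, 0.9, 0.2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

bank_money = cosine_similarity(TOY_VECTORS["bank"], TOY_VECTORS["money"])
bank_river = cosine_similarity(TOY_VECTORS["bank"], TOY_VECTORS["river"])
print(f"bank~money: {bank_money:.2f}, bank~river: {bank_river:.2f}")
```

With these made-up numbers, “bank” lies closer to “money” than to “river”. A model trained on texts about landscapes could just as well produce the opposite picture.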

Now, what precisely can be achieved with these word vectors? Let’s consider the example query “cycle”. First of all, the query’s information goal is ambiguous: is it about some kind of periodically recurring event? Or a series of books? Or perhaps a bicycle? Not to mention the possibility that the information seeker has a different word class and meaning in mind altogether. Luckily, for this example the meanings align quite well. What matters for information retrieval is this: if the word vectors have been trained on texts that contain these relations, the related terms can be found starting from the search term.
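How a search term could lead to related terms can be sketched as a nearest-neighbour lookup. Again, this is only an illustration under assumptions: the vectors are made-up values, and `expand_query` is a hypothetical helper name, not part of intergator. The sketch ranks all other known words by cosine similarity to the query term and returns the closest ones as expansion candidates.

```python
import math

# Made-up toy vectors; a trained model would supply these.
VECTORS = {
    "cycle":   [0.7, 0.6, 0.5],
    "bicycle": [0.9, 0.2, 0.3],
    "period":  [0.2, 0.9, 0.4],
    "series":  [0.3, 0.5, 0.9],
    "invoice": [-0.8, 0.1, -0.3],
}

def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def expand_query(term, vectors, k=3):
    """Return the k known terms whose vectors are closest to the query term."""
    if term not in vectors:
        return []  # unknown term: nothing to expand with
    query_vec = vectors[term]
    scored = [
        (cosine(query_vec, vec), other)
        for other, vec in vectors.items()
        if other != term
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]

expanded = expand_query("cycle", VECTORS)
print(expanded)
```

In this toy setup the three senses mentioned above all end up among the neighbours of “cycle”, while the unrelated “invoice” does not. Which neighbours appear in practice depends entirely on the texts the vectors were trained on.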

A model reflects the contents of its corpus. In an enterprise context in particular, this can be an advantage. To return to my example: when my colleagues enter “books” into intergator’s search bar, they are normally not interested in arbitrary paper volumes.

In the course of my thesis I was given the chance to investigate more closely the technical foundations and parameters that influence the inner structure of these vector space models. I enjoyed discussing obstacles and new discoveries with my colleagues, who are just as fascinated as I am by the semantic powers of our machine learning algorithms.

Unfortunately, now that my thesis is finished my path is leading me away from the beautiful city of Dresden. A huge thanks to my boss and colleagues for this exciting time!

* former internal name of intergator’s Documentation Reader component