Redesign of Mongo Text Search solution for better performance & accurate results.

Sujit Udhane
5 min read · Sep 18, 2020

This is the second part of the series. After publishing the first part, many readers expressed curiosity about what lies ahead in the second part (for a moment I slipped into a fictitious world and started believing I was writing a novel like Sherlock Holmes. I giggled).

If you wish to read the first part, here is the link: https://medium.com/@sujit.udhane/a-powerful-search-solution-powered-by-mongodb-cc38cf5114c1

The search dataset exposed in the first line of fire:

It’s quite obvious that if every query runs directly against the DB, database performance is at stake, which ultimately means a poor experience (the limited pool of DB connections stays busy, keeping other queries on hold).

What’s more: if many users search for the keyword Sujit around the same time, each query runs against the DB (I know the DB sometimes returns a cached result, but let’s assume it doesn’t come to your rescue) and returns the same result set.

Does it make sense to compute the same search term (keyword) again (and again, and again…)? Everyone will scream: No, it doesn’t (echoing many times). Beyond recomputation, the first design had several other problems:

i) The API served result data as soon as the results filled the defined chunk size, or once all documents had been scanned. But the result-count query (and ultimately the count API) took too long, so showing the result count on the UI was painfully slow.

ii) Results were always displayed in the same order. To achieve pagination, we were using the Mongo skip operation (a grave mistake): it provided correct pagination, but the scan of the dataset always restarted from the first record, while ideally it should resume from where the last scan stopped (a sketch of the usual fix appears after this list).

iii) To widen the matching possibilities, queries combining AND and OR operations performed slowly, e.g. (Sujit+Udhane)+(Java|NodeJs). The query first found all documents containing the four keywords Sujit/Udhane/Java/NodeJS, and then applied a Mongo regex expression on top.

iv) In a few scenarios, the quality of search results was unacceptable due to broken queries. E.g., an input such as “Sujit Udhane”+(Java|NodeJs) returned results containing Sujit & Java, Sujit & NodeJs, Udhane & Java, or Udhane & NodeJs, while it was expected to return strictly the documents containing the phrase “Sujit Udhane” and either Java or NodeJs.

v) We were not storing the outcome of search results anywhere, which prevented any further filtering of result sets based on additional criteria. Reaching the right filtered match was a daunting task for the end user. Isn’t that frustrating?
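Regarding problem ii): here is a minimal sketch of the usual fix, replacing skip-based pagination with _id-range pagination. The pymongo setup and the documents collection name are assumptions for illustration, not our actual code:

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient()                    # assumed local instance
docs = client["search"]["documents"]      # hypothetical collection name

PAGE_SIZE = 50

def page_with_skip(page_no):
    # What we did originally: skip() still walks past every skipped
    # record, so each page restarts the scan from record one.
    return list(docs.find().sort("_id", ASCENDING)
                    .skip(page_no * PAGE_SIZE).limit(PAGE_SIZE))

def page_after(last_id=None):
    # Range-based pagination: resume from the last _id already seen,
    # so the scan continues where the previous page left off.
    query = {"_id": {"$gt": last_id}} if last_id is not None else {}
    return list(docs.find(query).sort("_id", ASCENDING).limit(PAGE_SIZE))
```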

There was no option other than going back to the boardroom and redesigning the system.

New High-Level Design

We added an index on the search term in the Master Search DataSet collection.
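A minimal sketch of that index, assuming pymongo and a collection named master_search with a term field (both names are illustrative):

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient()                        # assumed local instance
master = client["search"]["master_search"]    # hypothetical collection name

# Unique index on the search term, so each term maps to exactly one
# precomputed result document and lookups stay index-backed.
master.create_index([("term", ASCENDING)], unique=True)
```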

Notable changes -

i) All input search keywords are looked up first in the Master Search DataSet

  • An additional layer is introduced, putting the search dataset in the second line of fire.
  • Search queries on the data collection are limited. Many queries got eliminated, as precomputed search results for critical/business-important search terms are served from the Master Search DataSet (a lookup sketch follows this list).
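A minimal sketch of that lookup-first flow. It reuses the master collection from the index sketch above; the result_ids field and the text index assumed on the data collection are illustrative assumptions:

```python
from pymongo import MongoClient

client = MongoClient()                      # assumed local instance
db = client["search"]                       # hypothetical database name
master, docs = db["master_search"], db["documents"]

def lookup_term(term):
    # First line of defence: the precomputed Master Search DataSet.
    hit = master.find_one({"term": term})
    if hit is not None:
        return set(hit["result_ids"])

    # On a miss, fall back to the data collection (a text index on it
    # is assumed), then cache the ids for the next caller.
    ids = {d["_id"] for d in docs.find({"$text": {"$search": term}},
                                       {"_id": 1})}
    master.update_one({"term": term},
                      {"$set": {"result_ids": list(ids)}},
                      upsert=True)
    return ids
```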

ii) Slow query performance. With subtle use of other Mongo operators like Union/Intersection/Subtraction (Reduce), we were able to overcome this challenge (a sketch follows this list):

  • (Sujit|Udhane) means the Union of the two search terms; the result is [3, 12, 56, 245, 708, 856, 912]
  • (Sujit+Udhane) means the Intersection of the two search terms; the result is [708]
  • (Sujit|Udhane)-Java means the Union of the search terms Sujit & Udhane, followed by the Subtraction (Reduce) of the Java result set from that Union; the result is [3, 245, 708, 856, 912]
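A minimal sketch of that expression evaluation in plain Python set terms. The per-term id sets below are back-filled assumptions chosen to reproduce the example results above, not real data:

```python
# Hypothetical per-term result sets (assumptions consistent with the
# example outputs above).
sujit  = {3, 12, 56, 708}
udhane = {245, 708, 856, 912}
java   = {12, 56, 400}

print(sorted(sujit | udhane))           # (Sujit|Udhane)      -> [3, 12, 56, 245, 708, 856, 912]
print(sorted(sujit & udhane))           # (Sujit+Udhane)      -> [708]
print(sorted((sujit | udhane) - java))  # (Sujit|Udhane)-Java -> [3, 245, 708, 856, 912]
```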

iii) What if some, or all, of the search terms are missing?

  • In async mode, start computing all the missing search terms, fetch their results, dump them into the Master Search DataSet, and then start the expression evaluation (a sketch follows below). (A few times we observed this to be a little slower than direct expression evaluation against the DB, but it is still better, as we offer predictable execution time for providing the search results.)
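A minimal sketch of that miss-handling, reusing master and lookup_term from the lookup sketch above; the thread pool is an illustrative choice for the async computation:

```python
from concurrent.futures import ThreadPoolExecutor

def resolve_terms(terms):
    # Reuses master and lookup_term from the lookup sketch above.
    cached, missing = {}, []
    for term in terms:
        hit = master.find_one({"term": term})
        if hit is not None:
            cached[term] = set(hit["result_ids"])
        else:
            missing.append(term)

    # Compute the missing terms concurrently; lookup_term also writes
    # each result back into the Master Search DataSet for next time.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for term, ids in zip(missing, pool.map(lookup_term, missing)):
            cached[term] = ids

    return cached  # expression evaluation starts once every set exists
```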

iv) How did the broken-query problem get resolved?

  • This was largely solved by change #3. However, we had one more complexity due to synonyms (especially those containing whitespace).
  • While doing the text-search operation, we temporarily excluded search terms containing whitespace; such terms were computed separately, and their results were added back to the original search term (for which they are synonyms). The solution ensured the user gets exactly what he/she asked for, with the synonym results following at a small latency (a phrase-search sketch follows below).
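A minimal sketch of computing a whitespace-containing synonym as an exact phrase and merging it back. It reuses docs and lookup_term from the lookup sketch above; Mongo treats a double-quoted $search string as a phrase match:

```python
def lookup_phrase(phrase):
    # Quoting the phrase inside $search makes the text index match it
    # as a whole ("Sujit Udhane"), not as two independent keywords.
    cursor = docs.find({"$text": {"$search": f'"{phrase}"'}}, {"_id": 1})
    return {d["_id"] for d in cursor}

def lookup_with_synonyms(term, synonyms):
    # Reuses docs and lookup_term from the lookup sketch above.
    ids = lookup_term(term)
    for syn in synonyms:
        # Whitespace synonyms are computed separately and merged back
        # into the original term's result set.
        ids |= lookup_phrase(syn) if " " in syn else lookup_term(syn)
    return ids
```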

v) As we started storing the search terms in the Result DataSet, we were able to achieve a further level of filtering, and a few sorting options as well (a sketch follows below). Now users can hunt down their target within an acceptable time frame.
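A minimal sketch of filtering within a stored result set. The result_set collection, its expression key, and the city/name fields are all illustrative assumptions; client and docs come from the lookup sketch above:

```python
from pymongo import ASCENDING

result_set = client["search"]["result_set"]   # hypothetical collection name

def filter_results(expression, extra_filter, sort_field):
    stored = result_set.find_one({"expression": expression})
    # Further filtering/sorting runs only over the stored ids, never
    # over the whole data collection.
    query = {"_id": {"$in": stored["result_ids"]}, **extra_filter}
    return list(docs.find(query).sort(sort_field, ASCENDING))

# e.g. filter_results('(Sujit|Udhane)-Java', {"city": "Pune"}, "name")
```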

vi) Showing a search-result count is now a cakewalk: it is simply the size of the result array of any expression evaluation.

vii) We track the usage of search terms and periodically clean up those that have little or no usage (an eviction sketch follows below).
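A minimal sketch of that bookkeeping, reusing the master collection from the sketches above; the usage and last_used field names and the thresholds are illustrative assumptions:

```python
import datetime

def touch(term):
    # Bump the usage counter on every lookup of a term.
    master.update_one({"term": term},
                      {"$inc": {"usage": 1},
                       "$set": {"last_used": datetime.datetime.utcnow()}})

def evict(min_usage=5, max_idle_days=30):
    # Drop terms that are rarely used or have gone stale.
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=max_idle_days)
    master.delete_many({"$or": [{"usage": {"$lt": min_usage}},
                                {"last_used": {"$lt": cutoff}}]})
```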

Performance

i) For queries with all search terms/keywords available in the Master Search DataSet, for any kind of search expression (any number of AND/OR/Subtraction operations), the API returns results in less than 1 second.

ii) For queries with fewer than 5 terms/keywords missing from the Master Search DataSet, for any kind of search expression (any number of AND/OR/Subtraction operations), the API returns results in less than 10 seconds.

iii) For queries with all search terms/keywords missing from the Master Search DataSet, for any kind of search expression (any number of AND/OR/Subtraction operations), the API returns results in less than 150 seconds (in ~95% of scenarios).

The improvements below can be introduced to bolster the above solution further:

  1. A distributed cache to store the Master Search DataSet, supporting operations like Union/Intersection/Subtraction. Solutions like Apache Ignite or Hazelcast can help here. This way the database server itself moves into the second line of fire, and throughput improves further.
  2. Better eviction strategy for search terms from Master Search DataSet.
  3. Better strategy to clean up Result DataSet.

I am super excited to share another update with you. The Atlas Enterprise Solution came out with a search solution supporting search operators in MongoDB 4.2, but as a commercial offering. We got it done for FREE :) . Such news from Atlas’s end helped us reaffirm our faith in ourselves: yes, we do have design thinking.

The Atlas solution may be far better than ours in terms of performance and scalability.

If you found this article useful, please clap. You can also leave a constructive comment below.


Sujit Udhane

I am a Lead Platform Architect working in Pune, India. I have 20+ years of experience in technology, with the last 10+ years as an Architect.