MongoDB's Atlas Search on people names (fuzzy search) produces poor results

Searching for people's names in a database may be one of the most common use cases there is for fuzzy searches. However, I'm struggling mightily to get consistently reliable results from Atlas Search (MongoDB v7) once all the potential "gotchas" of people's names are considered:

  • Simple Typos: "Jack Smtih"
  • Nicknames: Robert=Bob, Richard=Dick, Joseph=Joe
  • Variations: Cathy vs Kathy vs Katie, Clark vs Clarke
  • Accidental first & last name reversals: "Stallone Sylvester"
  • Compound names & hyphens: "Carrie-Anne Moss", "Philip Seymour Hoffman". Are the parts separated by spaces or hyphens? Is the "middle" name part of a compound first name or of the last name (they are separate fields)?
  • Accented characters
  • Appended suffixes: "Dale Earnhardt Jr"
  • etc.
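Several of these gotchas can be reduced before any search runs. The following is a minimal sketch, assuming the search term is normalized in the NodeJS app first. The `NICKNAMES` and `SUFFIXES` tables and the `normalizeName` and `nameKey` helpers are hypothetical names for illustration, not part of Atlas Search, and a real nickname table would be far larger.

```javascript
// Hypothetical lookup tables for illustration only.
const NICKNAMES = new Map([["bob", "robert"], ["dick", "richard"], ["joe", "joseph"]]);
const SUFFIXES = new Set(["jr", "sr", "ii", "iii", "iv"]);

// Turn a raw name into lowercase, accent-free tokens without suffixes.
function normalizeName(raw) {
  return raw
    .normalize("NFD")                 // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining marks
    .toLowerCase()
    .replace(/[-.,]/g, " ")           // treat hyphens, periods and commas as separators
    .split(/\s+/)
    .filter((t) => t && !SUFFIXES.has(t))
    .map((t) => NICKNAMES.get(t) ?? t); // map known nicknames to a canonical form
}

// Order-insensitive key, so reversed first and last names compare equal.
function nameKey(raw) {
  return normalizeName(raw).sort().join(" ");
}
```

With this sketch, `nameKey("Stallone Sylvester")` and `nameKey("Sylvester Stallone")` produce the same string. Typos and spelling variations such as Clark vs Clarke are not addressed by it.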

The MongoDB paid support hasn't been all that helpful, suggesting "hire a paid consultant" and "use compound searches and custom analyzers". The issue is that some results are waaaay off. Firstly I have ~200K names saved as:

{ ... "profile": {"first_name": "Mary", "last_name": "Smith"} }

The default MongoDB fuzzy text search uses the Damerau-Levenshtein distance algorithm, but caps the number of single-character edits at 2, rendering it pretty useless?! I likely would have been happy enough with concatenating first & last names and doing a D-L distance search against the search term with a max of 10 or 15 edits.
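To make the edit limit concrete, here is a minimal sketch of the restricted Damerau-Levenshtein distance (the "optimal string alignment" variant) in plain JavaScript. It is only an illustration of the metric, not a claim about how Atlas Search computes it internally, and `osaDistance` is a hypothetical helper name.

```javascript
// Restricted Damerau-Levenshtein (optimal string alignment) distance:
// insertions, deletions, substitutions and adjacent transpositions each cost 1.
function osaDistance(a, b) {
  const d = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 0; i <= a.length; i++) d[i][0] = i; // delete every character of a
  for (let j = 0; j <= b.length; j++) d[0][j] = j; // insert every character of b
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution or match
      );
      // adjacent transposition, e.g. "ti" <-> "it"
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
      }
    }
  }
  return d[a.length][b.length];
}
```

Under this metric "smtih" vs "smith", "steve" vs "steven" and "clark" vs "clarke" are each a single edit, so a limit of 2 covers typical typos in one word. A nickname pair such as "bob" vs "robert" needs more than 2 edits and would not match by distance alone.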

On top of that, it doesn't appear like I can concatenate the first_name & last_name fields to do the search, since $search must be the first item in the pipeline. The plain fuzzy search I tried first simply produced bad results. For example, a search for "Steve Clark" would give the "Steve Clark" record the highest score. However, the "Steven Clarke" record was at the bottom of the list well below names like "Clark Hobart", or "Jasmine Steve"?!
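One way around the concatenation limit might be to store the combined name on the document ahead of time rather than computing it at query time. The sketch below assumes a hypothetical `profile.full_name` field and builds an aggregation-pipeline update for backfilling it. The field would still have to be added to the search index mapping, and application writes would have to keep it in sync.

```javascript
// Builds an update pipeline that writes "first last" into a hypothetical
// profile.full_name field. $ifNull guards against missing name parts,
// because $concat returns null if any argument is null.
function fullNameBackfill() {
  return [
    {
      $set: {
        "profile.full_name": {
          $trim: {
            input: {
              $concat: [
                { $ifNull: ["$profile.first_name", ""] },
                " ",
                { $ifNull: ["$profile.last_name", ""] },
              ],
            },
          },
        },
      },
    },
  ];
}

// Usage, assuming `users` is a collection handle from the official NodeJS driver:
//   await users.updateMany({}, fullNameBackfill());
```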

Then, as Mongo Support suggested, I delved deep into compound operators and custom analyzers. My index looked like this monstrosity:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "profile": {
        "fields": {
          "first_name": [
            {
              "multi": {
                "edge": {"analyzer": "custom.edge", "searchAnalyzer": "lucene.simple", "type": "string"},
                "english": {"analyzer": "lucene.english", "searchAnalyzer": "lucene.english", "type": "string"},
                "exact": {"analyzer": "custom.keyword_exact", "searchAnalyzer": "custom.keyword_exact", "type": "string"},
                "lowercased": {"analyzer": "custom.keyword_lower", "searchAnalyzer": "custom.keyword_lower", "type": "string"},
                "shingled": {"analyzer": "custom.shingled", "searchAnalyzer": "custom.shingled", "type": "string"}
              },
              "type": "string"
            },
            {"type": "autocomplete"},
            {"type": "token"}
          ],
          "last_name": [
            {
              "multi": {
                "edge": {"analyzer": "custom.edge", "searchAnalyzer": "lucene.simple", "type": "string"},
                "english": {"analyzer": "lucene.english", "searchAnalyzer": "lucene.english", "type": "string"},
                "exact": {"analyzer": "custom.keyword_exact", "searchAnalyzer": "custom.keyword_exact", "type": "string"},
                "lowercased": {"analyzer": "custom.keyword_lower", "searchAnalyzer": "custom.keyword_lower", "type": "string"},
                "shingled": {"analyzer": "custom.shingled", "searchAnalyzer": "custom.shingled", "type": "string"}
              },
              "type": "string"
            },
            {"type": "autocomplete"},
            {"type": "token"}
          ]
        },
        "type": "document"
      },
      "type": [
        {"type": "token"},
        {"type": "stringFacet"}
      ]
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "custom.edge",
      "tokenFilters": [
        {"maxGram": 10, "minGram": 1, "termNotInBounds": "include", "type": "edgeGram"},
        {"type": "lowercase"}
      ],
      "tokenizer": {"type": "standard"}
    },
    {
      "charFilters": [],
      "name": "custom.keyword_lower",
      "tokenFilters": [
        {"type": "lowercase"}
      ],
      "tokenizer": {"type": "keyword"}
    },
    {
      "charFilters": [],
      "name": "custom.keyword_exact",
      "tokenFilters": [],
      "tokenizer": {"type": "keyword"}
    },
    {
      "charFilters": [],
      "name": "custom.shingled",
      "tokenFilters": [
        {"type": "lowercase"},
        {"maxShingleSize": 3, "minShingleSize": 2, "type": "shingle"}
      ],
      "tokenizer": {"type": "standard"}
    }
  ]
}

My resulting search then checks both the first_name and last_name fields for the term if a single word is submitted, or matches the first word against first_name and the second against last_name if 2 "words" are submitted (forget reversed or compound names for now). The 2 "word" example:

[
  {
    "$search": {
      "index": "idx_user_search_adv",
      "compound": {
        "should": [
          {
            "text": {
              "query": "Steve",
              "path": {"value": "profile.first_name", "multi": "exact"},
              "score": {"boost": {"value": 6}}
            }
          },
          {
            "wildcard": {
              "query": "Steve*",
              "path": {"value": "profile.first_name", "multi": "lowercased"},
              "allowAnalyzedField": true,
              "score": {"boost": {"value": 5}}
            }
          },
          {
            "text": {
              "query": "Steve",
              "path": {"value": "profile.first_name", "multi": "english"},
              "fuzzy": {"prefixLength": 1, "maxExpansions": 10},
              "score": {"boost": {"value": 0.7}}
            }
          },
          {
            "phrase": {
              "query": "Steve",
              "path": {"value": "profile.first_name", "multi": "edge"},
              "slop": 100
            }
          },
          {
            "text": {
              "query": "Clark",
              "path": {"value": "profile.last_name", "multi": "exact"},
              "score": {"boost": {"value": 6}}
            }
          },
          {
            "wildcard": {
              "query": "Clark*",
              "path": {"value": "profile.last_name", "multi": "lowercased"},
              "allowAnalyzedField": true,
              "score": {"boost": {"value": 5}}
            }
          },
          {
            "text": {
              "query": "Clark",
              "path": {"value": "profile.last_name", "multi": "english"},
              "fuzzy": {"prefixLength": 1, "maxExpansions": 10},
              "score": {"boost": {"value": 0.7}}
            }
          },
          {
            "phrase": {
              "query": "Clark",
              "path": {"value": "profile.last_name", "multi": "edge"},
              "slop": 100
            }
          }
        ]
      },
      "scoreDetails": true
    }
  },
  {
    "$project": {
      "first_name": "$profile.first_name",
      "last_name": "$profile.last_name",
      "scoreDetails": {"$meta": "searchScore"}
    }
  },
  {
    "$sort": {"scoreDetails": -1}
  }
]
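The one-word vs two-word branching described above can be sketched as a small builder in the NodeJS app. This is a simplified illustration, not the full query: it keeps only an exact clause and a fuzzy clause per word against the `lowercased` multi, the `maxEdits: 1` setting is an assumption of mine, and `clausesFor` and `buildNameSearch` are hypothetical helper names. Treating the last word as the last name when more than two words are submitted is also an assumption.

```javascript
const FIRST = "profile.first_name";
const LAST = "profile.last_name";

// Two clauses for one word against one field: a boosted exact match
// and an unboosted fuzzy match limited to a single edit.
function clausesFor(token, field) {
  const path = { value: field, multi: "lowercased" };
  return [
    { text: { query: token, path, score: { boost: { value: 6 } } } },
    { text: { query: token, path, fuzzy: { maxEdits: 1, prefixLength: 1 } } },
  ];
}

// Returns a $search stage, or null when the term is empty.
function buildNameSearch(term) {
  const tokens = term.trim().split(/\s+/).filter(Boolean);
  if (tokens.length === 0) return null;
  const should =
    tokens.length === 1
      ? [...clausesFor(tokens[0], FIRST), ...clausesFor(tokens[0], LAST)]
      : [...clausesFor(tokens[0], FIRST), ...clausesFor(tokens[tokens.length - 1], LAST)];
  return { $search: { index: "idx_user_search_adv", compound: { should } } };
}
```

Because every clause sits under `should`, a record that matches only one of the two words can still appear in the results. Moving the per-field groups under `must` would be one way to require both.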

The result this produces is all the first_name "Steve" records first. Then all the last_name "Clark" records, then finally "Steven Clark", then the lowercase "steve" records?!, the "steve" variations like "Stevie", etc. Basically just a big spider's web mess trying to balance out all the lowercase, edgeGram, fuzzy score boosting.
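One detail that may help untangle the ordering: the pipeline above sets "scoreDetails": true but then projects the plain searchScore under the name scoreDetails, so the per-clause breakdown is never returned. A minimal sketch of stages that return both the score and the breakdown follows, assuming the `searchScoreDetails` metadata keyword is available on the cluster. `scoreDebugStages` is a hypothetical helper name.

```javascript
// Stages to append after a $search stage that has "scoreDetails": true.
// "score" carries the numeric relevance score; "details" carries the
// per-clause explanation of how that score was computed.
function scoreDebugStages() {
  return [
    {
      $project: {
        first_name: "$profile.first_name",
        last_name: "$profile.last_name",
        score: { $meta: "searchScore" },
        details: { $meta: "searchScoreDetails" },
      },
    },
    { $sort: { score: -1 } },
  ];
}
```

Comparing the details of the "Steve Clark" and "Steven Clarke" records side by side should show which clauses contribute to each score and by how much.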

I thought going in that, hey, since I'm using MongoDB on Atlas already for my app, let's leverage Atlas Search fuzzy search instead of complicating things by adding Solr or ElasticSearch. Now it feels like this whole "name search" implementation using Atlas Search might be a bust. So:

A - It seems someone would have worked out performing Fuzzy Name Searches using Atlas Search. Can you share your Index & $search strategies? Can anyone spot any blatant errors in my search above?

B - Is there a simpler way to accomplish this from my NodeJS-based app (and MongoDB database)? If that ultimately means a different third-party tool, package, intermediary, I'd welcome any suggestions.

