We use SharePoint Server 2007 to allow employees to search network file shares, but it seems that underscores in filenames are not treated as word separators when indexing the files.

As a result, a search for chocolate will:

  • match "chocolate milkshake.doc"
  • but not match "chocolate_cake.doc"

(Of course, this is a simplified example; in practice the content of the second file might include the word "chocolate" and match on that instead of the filename. But the problem itself is real enough: a common scenario in a corporate environment is that a user knows part of the name of the file they are looking for and expects to see matching filenames at the top of the search results. And using underscores in filenames is a widely used convention within our company.)

Underscores are not treated as word separators in the file content either, although this is less of a concern for us. The root cause of this problem is possibly related to the behaviour of the word breakers that SharePoint uses (i.e. the language-specific DLLs that implement the IWordBreaker interface), although I haven't confirmed this yet.
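To make the symptom concrete, here is a minimal Python sketch of the behaviour described above. It is not SharePoint's word breaker, only an analogy: the regex class `\w` happens to treat the underscore as a word character, much as the indexer appears to. The function names `tokenize` and `matches` are made up for this illustration.

```python
import re

def tokenize(text, underscore_breaks):
    """Split text into lowercase tokens.

    With underscore_breaks=False the pattern \\w+ keeps underscores inside
    a token, mimicking the behaviour described in the question.
    With underscore_breaks=True the underscore is excluded from tokens.
    """
    pattern = r"[^\W_]+" if underscore_breaks else r"\w+"
    return [token.lower() for token in re.findall(pattern, text)]

def matches(query, filename, underscore_breaks):
    """Whole-word match: the query must equal one of the tokens."""
    return query.lower() in tokenize(filename, underscore_breaks)

print(matches("chocolate", "chocolate milkshake.doc", False))  # True
print(matches("chocolate", "chocolate_cake.doc", False))       # False
print(matches("chocolate", "chocolate_cake.doc", True))        # True
```

With the underscore kept inside the token, the index only contains "chocolate_cake", so a whole-word query for "chocolate" has nothing to match.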

Does anyone know of a workaround for this issue? I have tested with Search Server 2008 Express too (which is based on the same technology), and it is also affected. I do not know whether the problem is fixed in SharePoint 2010 or not.

Todd Owen

2 Answers

I don't think underscores are treated as delimiters, and there's a bit of traffic on social.technet that seems to confirm this. Since that appears to be the case, you'll need a partial/wildcard search to match 'chocolate' against 'chocolate_cake.doc', which the core results web part won't do. However, there's a CodePlex web part for 2007 that does just that.

FYI, the 2010 version of this same web part notes that SharePoint 2010 adds wildcard searches, provided the user types the asterisk.
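As a rough illustration of why a wildcard helps, the sketch below uses Python's `fnmatch` module to compare a query against indexed tokens. It is only an analogy for the idea and not how either web part is implemented. The token list assumes the indexer stored "chocolate_cake" as a single word, and `wildcard_match` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

def wildcard_match(query, token):
    """Case-insensitive match where '*' in the query matches any run of characters."""
    return fnmatchcase(token.lower(), query.lower())

# Assumed index contents for "chocolate_cake.doc" when underscores do not break words.
tokens = ["chocolate_cake", "doc"]

print(any(wildcard_match("chocolate", t) for t in tokens))   # False
print(any(wildcard_match("chocolate*", t) for t in tokens))  # True
```

Without the asterisk the query has to equal a whole token, which is why the user must type it explicitly.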

vinny

I have confirmed that the word breaker determines the treatment of underscores for both document content and filenames. Word breakers are configured on a per-language basis in the registry.

Word breakers are implemented as ActiveX controls, and in theory it should be possible to write your own (the Microsoft Platform SDK for Windows XP includes an example, "lrsample"), but I don't have the tools at hand to do that. Most of the word breakers that Microsoft supplies seem to treat underscores as part of a word, but I did find one that breaks on underscores: version 2 of the word breaker for Simplified Chinese (chsbrkr.dll - 1,677,824 bytes). Note that this behaviour differs from version 3 of the Simplified Chinese word breaker, which is the one supplied with Search Server 2008 Express, and probably SharePoint 2007 too.

So to get the search behaviour that I want, I have configured SharePoint Search to use this word breaker:

  1. Copy the DLL to C:\Program Files\Microsoft Office Servers\12.0\Bin\chsbrkr2.dll
  2. Use regedit to browse to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\LanguageResources\Default
  3. For the relevant languages, in my case "English (United Kingdom)" and "English (United States)", modify the following registry values (your path may be different):
       • "WBDLLPathOverride" = "C:\PROGRA~1\MI54E7~1\12.0\Bin\ChsBrkr2.dll"
       • "WBreakerClass" = "{9717fc70-c1bc-11d0-9692-00a0c908146e}"
  4. Restart the "Office SharePoint Server Search" service (can be done via the command line by running net stop osearch followed by net start osearch).
  5. Go to the search administration page and initiate a full crawl.
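If you need to repeat the registry step on several servers, something like the following Python sketch could generate a .reg file instead of editing by hand. It rests on assumptions not stated above: that each language is a subkey directly under the "Default" key, and that both values are plain strings (REG_SZ). The names `build_reg_file` and `reg_escape` are made up for this example. Check the output against your own registry and export a backup of the key before importing anything.

```python
BASE_KEY = (r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0"
            r"\Search\Setup\ContentIndexCommon\LanguageResources\Default")

def reg_escape(value):
    """Escape backslashes and double quotes as .reg string values require."""
    return value.replace("\\", "\\\\").replace('"', '\\"')

def build_reg_file(languages, dll_path, class_id):
    """Return .reg file text that sets the two word breaker values per language.

    Assumes each language name is a subkey of BASE_KEY (not confirmed above).
    """
    lines = ["Windows Registry Editor Version 5.00", ""]
    for language in languages:
        lines.append("[{}\\{}]".format(BASE_KEY, language))
        lines.append('"WBDLLPathOverride"="{}"'.format(reg_escape(dll_path)))
        lines.append('"WBreakerClass"="{}"'.format(reg_escape(class_id)))
        lines.append("")  # blank line between sections
    return "\r\n".join(lines)

reg_text = build_reg_file(
    ["English (United Kingdom)", "English (United States)"],
    r"C:\PROGRA~1\MI54E7~1\12.0\Bin\ChsBrkr2.dll",  # your path may be different
    "{9717fc70-c1bc-11d0-9692-00a0c908146e}",
)
print(reg_text)
```

The script only prints the text; saving it, importing it, restarting the service and starting the full crawl remain manual steps as listed above.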

Apart from treating underscores as a word break, I'm not sure whether there are any other significant differences between chsbrkr.dll and the default English word breaker, but so far it hasn't caused any problems for me. It would be great if there were a way to apply the custom word breaker to specific managed properties (Path, in this case), but I don't know if this is possible. There is a promisingly named column in the MSSManagedProperties table of the database called "WordBreakerOverride", but I don't know what its purpose is.

NOTE: In SharePoint 2010, managed properties apparently have an additional setting called SplitStringCharacters, which may well make this workaround obsolete.

Todd Owen