If an Admin enables multilingual document handling for the Vault, all documents include the standard document field Language. When users search, Vault respects a document’s language by applying language-specific rules for word separators, stop words (for example, ignoring “a” and “the” in English), and word stemming. Vault also attempts to identify a new document’s language automatically from the first 100 characters of its source file and, if possible, auto-populates the Language field.
Vault may not recognize non-standard special characters in a document. Unrecognized characters can interfere with Vault’s language detection mechanism and cause the detected language to be inaccurate.
See a list of supported languages for document handling.
About Document Language
Each document in a multilingual Vault includes the Language field. For PDF and text-based files like HTML or CSV, Vault attempts to assign a language automatically upon import, based on the language it detects in the source file. Otherwise, Vault uses the current user’s language as the default for the document. Users can edit this field. By default, the Language field is required, but Admins can make it optional.
In some situations, Vault does not attempt to detect a language automatically:
- The document has a Microsoft 365 source file, such as DOC, DOCX, or PPT.
- The source file has fewer than 100 characters.
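The rules above can be summarized as a small decision function. The following Python sketch is illustrative only and is not Vault's implementation: the function names, the `detect` callback, and the extension set are hypothetical, and the extension set holds only the three examples named in the text.

```python
from typing import Callable, Optional

# Hypothetical set: the text names DOC, DOCX, and PPT only as examples
# of Microsoft 365 source files, so this list is not exhaustive.
OFFICE_EXTENSIONS = {"doc", "docx", "ppt"}

# From the text: source files with fewer than 100 characters are skipped.
MIN_CHARS = 100


def should_attempt_detection(extension: str, text: str) -> bool:
    """Return True when automatic language detection would be attempted."""
    if extension.lower().lstrip(".") in OFFICE_EXTENSIONS:
        return False
    return len(text) >= MIN_CHARS


def default_language(
    extension: str,
    text: str,
    user_language: str,
    detect: Callable[[str], Optional[str]],
) -> str:
    """Pick the initial Language value for a new document.

    `detect` is a stand-in for any language detector; it receives the
    first MIN_CHARS characters and returns a language code or None.
    """
    if should_attempt_detection(extension, text):
        detected = detect(text[:MIN_CHARS])
        if detected:
            return detected
    # Fall back to the current user's language when detection is
    # skipped or yields no result.
    return user_language
```

For example, with a stand-in detector that always answers "fr", a 120-character PDF would default to "fr", while a DOCX of the same length would default to the current user's language.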
Users can choose to filter their Library or document tab results by document language using the Filters panel.
Export to CSV & TXT
When multilingual document handling is enabled, users see Export to TXT or Export as Text instead of Export to CSV throughout the Vault. Vault exports files as TXT to prevent corruption when files that contain multibyte characters are opened and re-saved in Excel.
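To illustrate the kind of corruption this avoids, the following Python sketch simulates one common failure mode, assumed here for illustration: a tool decodes UTF-8 bytes with a single-byte legacy encoding and then saves the result. The function name and the choice of Latin-1 are hypothetical, and the sketch does not describe how Excel or Vault behave internally.

```python
def simulate_misread(text: str, legacy_encoding: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode it as a single-byte encoding.

    This mimics a tool that assumes a legacy encoding when opening a
    UTF-8 file. Each byte of a multibyte character becomes its own
    (wrong) character.
    """
    return text.encode("utf-8").decode(legacy_encoding)


original = "Étiquette, 日本語"
misread = simulate_misread(original)

# Re-saving the misread text as UTF-8 makes the damage permanent:
# the file now encodes the wrong characters rather than the originals.
resaved = misread.encode("utf-8")
```

Plain ASCII text survives this round trip unchanged, which is why the problem tends to surface only in multilingual content.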