Working with Stopword Files - Full-Text Retrieval (FTR) - Help

Stopword files are supported with Directa, but not SmartPlant Foundation.

A stopword file contains words that are used so frequently in a collection that they provide no search value (such as and, the, of, to, and for). To reduce indexing overhead, you can exclude these words from the index.

The optional stopword file helps make searches on your collections more effective. Choose the words in the stopword file carefully: include a word only if it is known to have no search value in most contexts. For example, the word a should not be included, because the letter a can carry meaning in some contexts. It could designate an Appendix A or a Section A. If a is included in the stopword file, these entries cannot be found by a search.
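The effect of the Appendix A example can be sketched in a few lines of Python. This is only an illustration of the idea, not FTR's indexing code: the function name index_terms, the sample sentence, and the whitespace-only word splitting are all assumptions made for the sketch.

```python
# Hypothetical illustration of stopword filtering; not FTR's actual code.
STOPWORDS = {"and", "the", "of", "to", "for"}


def index_terms(text, stopwords):
    """Return the lowercase terms that would be indexed, skipping stopwords."""
    return [word for word in text.lower().split() if word not in stopwords]


sentence = "See Appendix A for the list of valves"

# With the basic stopword list, "a" is indexed and can be searched.
terms_ok = index_terms(sentence, STOPWORDS)

# Adding "a" to the stopword list drops it from the index entirely.
terms_lossy = index_terms(sentence, STOPWORDS | {"a"})

print(terms_ok)     # ['see', 'appendix', 'a', 'list', 'valves']
print(terms_lossy)  # ['see', 'appendix', 'list', 'valves']
```

In the second case, a search for Appendix A has nothing to match the a against, because the term was never indexed.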

Currently, three stopword files are delivered with FTR: fultext.stp, cc_all.stp, and cc_join.stp. Each contains frequently used words. Because only one stopword file can be associated with a collection, choose the file carefully and consider the type of documents the collection will contain. The three stopword files provided produce different results. For more information about FTR-supplied stopword files, see FTR-Supplied Stopword Files.

FTR uses a character class rule base to determine where a word begins and ends. Character classes can be redefined in the stopword file in a number of ways. By default, any punctuation mark, tab, or space begins or ends a word. Words that contain numerical digits are also broken apart. The fultext.stp stopword file uses this default character class. The cc_all.stp stopword file defines all punctuation and digits in the same class as letters. This means that word breaks occur only on spaces or tabs. The cc_join.stp stopword file defines digits and letters to be the same and specifies that any punctuation mark that is between two characters is part of the word.
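The three word-break behaviors described above can be approximated with regular expressions. The sketch below is a rough Python model, not the FTR rule base or the stopword file syntax. The function names and the sample text are hypothetical, it handles ASCII letters and digits only, and it assumes that "broken apart" means a break at each boundary between letters and digits.

```python
import re

# Rough models of the word-break rules; not FTR's character class syntax.


def split_default(text):
    """fultext.stp-style: punctuation, tabs, and spaces break words,
    and runs of letters are separated from runs of digits (assumed)."""
    return re.findall(r"[A-Za-z]+|[0-9]+", text)


def split_cc_all(text):
    """cc_all.stp-style: word breaks occur only on spaces or tabs."""
    return re.findall(r"[^ \t]+", text)


def split_cc_join(text):
    """cc_join.stp-style: letters and digits share a class, and a single
    punctuation mark between two such characters stays in the word."""
    return re.findall(r"[A-Za-z0-9]+(?:[^\sA-Za-z0-9][A-Za-z0-9]+)*", text)


sample = "Valve P-101A, rev.2 (see A.3)."

print(split_default(sample))
# ['Valve', 'P', '101', 'A', 'rev', '2', 'see', 'A', '3']
print(split_cc_all(sample))
# ['Valve', 'P-101A,', 'rev.2', '(see', 'A.3).']
print(split_cc_join(sample))
# ['Valve', 'P-101A', 'rev.2', 'see', 'A.3']
```

The same tag number, P-101A, is indexed as three short terms, as one term with a trailing comma, or as one clean term, depending on the scheme. This is why the type of documents in the collection should guide the choice of stopword file.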

Modifications to character classes affect search performance and index size. They do not affect search results when terms are specified as in the original document.

For more information on character classes, see Character Classes.

See Also

fultext.stp
cc_all.stp
cc_join.stp
Differences between Stopword Files