The following chart shows the effects of the different character classes available in the delivered stopword files. The input character string is followed by the terms that will appear in the collection after indexing has occurred with each stopword file in place.
Stopword files are supported in Directa, but not in SmartPlant Foundation.
String | fultext.stp | cc_all.stp | cc_join.stp |
---|---|---|---|
(in parens) | IN PARENS | (IN PARENS) | IN PARENS |
ABC123 | ABC 123 | ABC123 | ABC123 |
It is x12-328. | IT IS X 12 328 | IT IS X12-328. | IT IS X12-328 |
Cherokee, N.C. | CHEROKEE N C | CHEROKEE, N.C. | CHEROKEE N.C |
08/03/94 | 08 03 94 | 08/03/94 | 08/03/94 |
401(k) | 401 K | 401(K) | 401(K |
Hi there! | HI THERE | HI THERE! | HI THERE |
$1,000,000.00 | 1,000,000.00 | $1,000,000.00 | 1,000,000.00 |
abc-123 | ABC 123 | ABC-123 | ABC-123 |
Subject: My test. | SUBJECT MY TEST | SUBJECT: MY TEST. | SUBJECT MY TEST |
/usr/tmp/dog.txt | USR TMP DOG TXT | /USR/TMP/DOG.TXT | USR/TMP/DOG.TXT |
x==1; | X 1 | X==1; | X |
The fultext.stp stopword file breaks every string except a number wherever a punctuation character occurs. It also breaks alphanumeric terms at the boundary between letters and digits. The cc_all.stp file breaks terms only where a space occurs, so all punctuation, including leading and trailing punctuation, becomes part of the term. The cc_join.stp file also breaks terms on spaces, but it drops punctuation at the beginning and end of each term and keeps punctuation inside the term.
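The three behaviors described above can be approximated in a few lines of Python. This is only an illustrative sketch, not how the product reads stopword files: the function names, the use of `string.punctuation` as the punctuation set, and the regular expression for numbers are all assumptions made for this example. It reproduces most rows of the chart, but not every one (for example, it yields `X==1` rather than `X` for `x==1;` under the cc_join.stp rules).

```python
import re
import string

# Assumption: ASCII punctuation stands in for the punctuation character class.
PUNCT = string.punctuation

# Assumption: a "number" is a run of digits joined by commas or periods.
# Anything else is split into runs of letters and runs of digits.
_FULTEXT_TERM = re.compile(r"\d+(?:[.,]\d+)+|[A-Za-z]+|\d+")


def terms_like_fultext(text):
    """Break on punctuation and on letter/digit boundaries; keep numbers whole."""
    return _FULTEXT_TERM.findall(text.upper())


def terms_like_cc_all(text):
    """Break only on whitespace; punctuation stays part of the term."""
    return text.upper().split()


def terms_like_cc_join(text):
    """Break on whitespace, then drop leading and trailing punctuation."""
    stripped = (term.strip(PUNCT) for term in text.upper().split())
    return [term for term in stripped if term]


if __name__ == "__main__":
    for sample in ["(in parens)", "abc-123", "/usr/tmp/dog.txt"]:
        print(sample)
        print("  fultext-like:", " ".join(terms_like_fultext(sample)))
        print("  cc_all-like: ", " ".join(terms_like_cc_all(sample)))
        print("  cc_join-like:", " ".join(terms_like_cc_join(sample)))
```

For the input `/usr/tmp/dog.txt`, the sketch returns `USR TMP DOG TXT`, `/USR/TMP/DOG.TXT`, and `USR/TMP/DOG.TXT` respectively, matching the corresponding row of the chart.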
You can use the delivered stopword files as a starting point for almost any character class definition of your own.