Words maybe and non

I was having fun chatting to my teenage nephew recently about the different operating systems and free software, and ended up revisiting Ken Church's classic Unix™ for Poets, which somehow led me to the burning question: if one were to sort the letters of every word in /usr/share/dict/words, which contains every term from Webster's Second International Dictionary (1934), what portion of the results are actual words defined in that same input list? I know there are bound to be a few, like "now" for example, but how many really? 🤔

# Count all the words (235886 on the latest macOS).
wc -w /usr/share/dict/words

[Image: Kunsthaus Tacheles "How Long is Now" text mural]

That is a tall stack. There are probably fancier awk(1) or Perl one-liner solutions that avoid the while loop, and which may even run faster, but for the sake of readability:

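#!/bin/sh
# Sample ./filter script: reads words on stdin, one per line.
# IFS= and -r keep read(1) from trimming whitespace or eating backslashes.
while IFS= read -r word; do
  printf '%s\n' "${word}" |
    # Break up each word into line separated letters.
    fold -w 1 |
    # Ignore case when sorting.
    sort -f |
    # Pull letters back into words.
    tr -d '\n' |
    # Look for exact matches in the dictionary. Choosing grep -F (the current
    # spelling of fgrep(1)) here, because it is supposed to be "quicker" when
    # not using regular expressions; -x only accepts whole-line matches.
    grep -F -x -f - /usr/share/dict/words
done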
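
For comparison, here is a rough sketch of what a single awk(1) pass might look like instead of the while loop above. It is not what I ran for the numbers in this post; the function name sorted_words and the helper before are made up for illustration. It assumes the wordlist fits in memory, reads the file twice (once to remember every entry, once to test each word), and breaks case-insensitive ties by plain string comparison, which may not order mixed-case letters exactly the way sort -f does.

```shell
# Hypothetical helper: print the sorted-letter form of every word in the
# list "$1" whenever that form is itself an entry in the same list.
sorted_words() {
  awk '
    # Case-insensitive "comes before" test; ties fall back to a plain
    # string comparison of the raw characters.
    function before(a, b,    la, lb) {
      la = tolower(a); lb = tolower(b)
      return la < lb || (la == lb && a < b)
    }
    # First pass over the file: remember every dictionary entry.
    NR == FNR { dict[$0] = 1; next }
    # Second pass: insertion-sort the letters, then look the result up.
    {
      n = length($0)
      for (i = 1; i <= n; i++) chars[i] = substr($0, i, 1)
      for (i = 2; i <= n; i++) {
        c = chars[i]
        for (j = i - 1; j >= 1 && before(c, chars[j]); j--)
          chars[j + 1] = chars[j]
        chars[j + 1] = c
      }
      key = ""
      for (i = 1; i <= n; i++) key = key chars[i]
      if (key in dict) print key
    }
  ' "$1" "$1"
}
```

Something like sorted_words /usr/share/dict/words | wc -l would then stand in for the ./filter pipeline. The point of the design is that each lookup is a hash probe rather than a fresh grep over the whole dictionary.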

As a matter of principle, I am on a very basic machine, so let me also time(1) the process for good measure:

time cat /usr/share/dict/words | ./filter | wc -l

Okay, well, that makes my laptop hurt and seems to be taking forever; one to save for later, maybe. But there are more wordlists under /usr/share/dict, such as connectives (on the Mac at least), a far smaller set of 150. What if I try those instead? And the answer is a mind-blowing 41 unique entries plus a duplicate one, which is… 42 of course!

Here is the full list in order of appearance: a, in, is, eh, for, it, as, his, no, be, at, by, i, hist, not, aer, or, an, all, him, been, how, no, fi, os, pu, ist, amy, do, first, any, my, now, em, most, how, now (own), begin, ady, know, aery, go.
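
To tell total hits apart from distinct ones, a small helper along these lines could be bolted onto the end of the pipeline. This is only a sketch: the name summarize is hypothetical, and it simply expects one match per line on stdin.

```shell
# Hypothetical helper: read matches on stdin, one per line, and print
# "<total> <distinct>".
summarize() {
  sort | awk '
    # Input is sorted, so a line differing from the previous one is new.
    { total++; if (NR == 1 || $0 != prev) distinct++; prev = $0 }
    # Adding 0 forces a numeric 0 when there was no input at all.
    END { print total + 0, distinct + 0 }
  '
}
```

Assuming the script above is saved as an executable ./filter, usage might look like ./filter < /usr/share/dict/connectives | summarize.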
