1. Throughout this column we have used the simple definition
that words are separated by white space. Many real documents,
such as those in HTML or RTF, contain formatting commands.
How could you deal with such commands? Are there any other
tasks that you might need to perform?
2. On a machine with ample main memory, how could you use
the C++ STL sets or maps to solve the searching problem in Section 13.8? How much memory does it consume
compared to McIlroy's structure?
3. How much speedup can you achieve by incorporating the special-purpose malloc of Solution 9.2 into the
hashing program of Section 15.1?
4. When a hash table is large and the hash function scatters the data well, almost every list in the table has only a
few elements. If either of these conditions is violated, though, the time spent searching down the list can be
substantial. When a new string is not found in the hash table in Section 15.1, it is placed at the front of the list. To
simulate hashing trouble, set NHASH to 1 and experiment with this and other list strategies, such as appending to
the end of the list, or moving the most recently found element to the front of the list.
5. When we looked at the output of the word frequency programs in Section 15.1, it was most interesting to print
the words in decreasing frequency. How would you modify the C and C++ programs to accomplish this task? How
would you print only the M most common words, where M is a constant such as 10 or 1000?
6. Given a new input string, how would you search a suffix array to find the longest match in the stored text? How
would you build a GUI interface for this task?
7. Our program for finding duplicated strings was very fast for ``typical'' inputs, but it can be very slow (greater
than quadratic) for some inputs. Time the program on such an input. Do such inputs ever arise in practice?
8. How would you modify the program for finding duplicated strings to find the longest string that occurs more
than M times?
9. Given two input texts, find the longest string that occurs in both.
10. Show how to reduce the number of pointers in the duplication program by pointing only to suffixes that start on
word boundaries. What effect does this have on the output produced by the program?
11. Implement a program to generate letter-level Markov text.
12. How would you use the tools and techniques of Section 15.1 to generate (order-0 or non-Markov) random text?
13. The program for generating word-level Markov text is at this book's web site. Try it on some of your
documents.
14. How could you use hashing to speed up the Markov program?
15. Shannon's quote in Section 15.3 describes the algorithm he used to construct Markov text; implement that
algorithm in a program. It gives a good approximation to the Markov frequencies, but not the exact form; explain
why not. Implement a program that scans the entire string from scratch to generate each word (and thereby uses the
true frequencies).
16. How would you use the techniques of this column to assemble a word list for a dictionary (the problem that
Doug McIlroy faced in Section 13.8)? How would you build a spelling checker without using a dictionary? How
would you build a grammar checker without using rules of grammar?
17. Investigate how techniques related to k-gram analysis are used in applications such as speech recognition and
data compression.
No comments:
Post a Comment