Lab 17: Sentence Similarity
File processing patterns, tokenization, and similarity scoring
After this lab, you will be able to:
- Read structured data from a file into an array
- Tokenize strings using split and normalize text with toLowerCase
- Build a unique vocabulary from multiple sentences
- Compute a basic similarity score using word frequency vectors
What You’re Building
You will read pairs of sentences from a file, tokenize them into words, build a shared vocabulary, and compute how similar the two sentences are. The similarity score is based on word frequency overlap — a simplified version of techniques used in search engines and natural language processing. The program reads input from a file and prints the similarity score for each pair.
Concepts and Misconceptions
| Concept | Common Mistake | What the Test Catches |
|---|---|---|
| File-to-array | Hardcoding array size instead of reading the count from the file header or using a two-pass approach | ArrayIndexOutOfBoundsException or empty slots |
| split tokenization | Splitting on " " (single space) instead of "\\s+" (any run of whitespace), so tab-separated words stay joined and repeated spaces produce empty tokens | Word count mismatch |
| Case normalization | Comparing "The" and "the" as different words | Vocabulary larger than expected, score wrong |
| Frequency counting | Using == to compare strings instead of .equals() | Words never match, frequency always zero |
Checkpoints
Checkpoint 1: Read Sentences from a File into an Array
What to do: Open the input file with Scanner. The first line contains the number of sentence pairs. Read each subsequent line into a String[] array. Each pair of sentences occupies two consecutive lines.
What the test checks: The array contains the correct number of sentences, and each sentence matches the file content exactly (after trimming).
Debugging tip: If you get an ArrayIndexOutOfBoundsException, check that your array size matches the number of lines you need to read (pairs times two). If the first sentence is missing, you may have forgotten to consume the first line (the count) before reading sentences.
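The reading step described above could be sketched roughly as follows. This is only one possible shape, not the reference solution: the class name SentenceReader, the method readSentences, and the fallback file name sentences.txt are all hypothetical, and the sketch assumes the first line holds nothing but the pair count.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

// Hypothetical helper class; the lab does not prescribe these names.
class SentenceReader {

    // Consumes the count line first, then reads two sentences per pair.
    static String[] readSentences(Scanner in) {
        int pairs = Integer.parseInt(in.nextLine().trim());
        String[] sentences = new String[pairs * 2]; // pairs times two lines
        for (int i = 0; i < sentences.length; i++) {
            sentences[i] = in.nextLine().trim();
        }
        return sentences;
    }

    public static void main(String[] args) throws FileNotFoundException {
        // "sentences.txt" is a placeholder, not the lab's actual input file.
        String path = args.length > 0 ? args[0] : "sentences.txt";
        try (Scanner in = new Scanner(new File(path))) {
            String[] sentences = readSentences(in);
            for (String sentence : sentences) {
                System.out.println(sentence);
            }
        }
    }
}
```

Reading the count with nextLine() and parsing it, rather than calling nextInt(), avoids leaving the rest of the header line in the Scanner, which is a common cause of an empty first sentence.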
Checkpoint 2: Build a Unique Vocabulary
What to do: For a given pair of sentences, split both on "\\s+" and convert every token to lowercase. Collect all unique words into a String[] vocabulary. A word appears in the vocabulary only once, even if it occurs in both sentences.
What the test checks: The vocabulary array contains exactly the expected set of unique words (order does not matter, but duplicates are not allowed).
Debugging tip: If your vocabulary has duplicates, check your “already exists” logic. Loop through the vocabulary array before adding a word — if it is already present, skip it. Use .equals() for comparison, not ==. If words that should match are treated as distinct, make sure you called toLowerCase() before comparing.
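As a rough illustration of the "already exists" check described above, here is a minimal array-only sketch. The names VocabularyBuilder, contains, and buildVocabulary are hypothetical, and the sketch assumes tokens are separated only by whitespace (punctuation is left attached to words).

```java
import java.util.Arrays;

// Hypothetical helper class; the lab does not prescribe these names.
class VocabularyBuilder {

    // Linear search over the first `count` filled slots, using equals().
    static boolean contains(String[] words, int count, String target) {
        for (int i = 0; i < count; i++) {
            if (words[i].equals(target)) {
                return true;
            }
        }
        return false;
    }

    // Collects each distinct lowercase token from both sentences exactly once.
    static String[] buildVocabulary(String first, String second) {
        String[] a = first.trim().toLowerCase().split("\\s+");
        String[] b = second.trim().toLowerCase().split("\\s+");
        // Worst case: every token is unique, so this size is always enough.
        String[] vocabulary = new String[a.length + b.length];
        int count = 0;
        for (String word : a) {
            if (!word.isEmpty() && !contains(vocabulary, count, word)) {
                vocabulary[count++] = word;
            }
        }
        for (String word : b) {
            if (!word.isEmpty() && !contains(vocabulary, count, word)) {
                vocabulary[count++] = word;
            }
        }
        // Trim the unused slots so the result has no null entries.
        return Arrays.copyOf(vocabulary, count);
    }

    public static void main(String[] args) {
        String[] vocabulary = buildVocabulary("The cat sat", "the dog  sat");
        System.out.println(Arrays.toString(vocabulary)); // prints [the, cat, sat, dog]
    }
}
```

The isEmpty() check guards against the single empty token that split returns for a blank sentence.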
Checkpoint 3: Compute Word Frequency and Similarity Score
What to do: For each sentence, build a frequency vector: an int[] where each index corresponds to a word in the vocabulary and the value is how many times that word appears in the sentence. Compute the similarity using the dot product of the two vectors divided by the product of their magnitudes (cosine similarity). Print the score rounded to four decimal places.
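Written out as a formula, the score described above looks like this, where a and b are the two frequency vectors and n is the vocabulary size (these symbol names are chosen here for illustration and do not come from the lab):

```latex
\[
\text{similarity}(a, b)
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```

Because the frequencies are never negative, the result lies between 0 and 1 whenever both sentences contain at least one word.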
What the test checks: The similarity score matches the expected value within a tolerance of 0.0001.
Debugging tip: If your score is always 0.0, your frequency vectors are probably all zeros — check that your word-matching logic uses .equals() and that you are iterating over the vocabulary correctly. If your score is greater than 1.0, check your magnitude calculation: magnitude is the square root of the sum of squares, not the sum itself. Use Math.sqrt and Math.pow or manual multiplication.
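The frequency and similarity steps might be sketched as below. This is an illustration under assumptions rather than the expected solution: the names CosineSimilarity, frequencyVector, and cosine are hypothetical, and returning 0.0 when either vector is all zeros is a choice made here to avoid dividing by zero, not something the lab specifies.

```java
// Hypothetical helper class; the lab does not prescribe these names.
class CosineSimilarity {

    // Counts how often each vocabulary word occurs in the sentence.
    static int[] frequencyVector(String sentence, String[] vocabulary) {
        String[] tokens = sentence.trim().toLowerCase().split("\\s+");
        int[] frequency = new int[vocabulary.length];
        for (String token : tokens) {
            for (int i = 0; i < vocabulary.length; i++) {
                if (vocabulary[i].equals(token)) {
                    frequency[i]++;
                }
            }
        }
        return frequency;
    }

    // Dot product divided by the product of the two magnitudes.
    static double cosine(int[] a, int[] b) {
        double dot = 0.0;
        double sumSquaresA = 0.0;
        double sumSquaresB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            sumSquaresA += a[i] * a[i];
            sumSquaresB += b[i] * b[i];
        }
        if (sumSquaresA == 0.0 || sumSquaresB == 0.0) {
            return 0.0; // assumption: an empty vector is similar to nothing
        }
        return dot / (Math.sqrt(sumSquaresA) * Math.sqrt(sumSquaresB));
    }

    public static void main(String[] args) {
        String[] vocabulary = {"the", "cat", "sat", "dog"};
        int[] first = frequencyVector("The cat sat", vocabulary);  // {1, 1, 1, 0}
        int[] second = frequencyVector("the dog sat", vocabulary); // {1, 0, 1, 1}
        // Dot product is 2 and each magnitude is sqrt(3), so the score is 2/3.
        System.out.printf("%.4f%n", cosine(first, second));
    }
}
```

Accumulating the sums as double values keeps the final division in floating point, which sidesteps an accidental integer division.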
How to Debug
- Print the vocabulary. Before computing frequencies, print the vocabulary array. Verify it contains every unique word from both sentences and nothing more.
- Print the frequency vectors. Display each sentence’s frequency vector alongside the vocabulary. You can quickly spot mismatches: a word that appears in the sentence but has a frequency of 0 means your matching logic is broken.
- Test with trivial cases. Two identical sentences should produce a score of 1.0. Two sentences with no shared words should produce 0.0. Use these as sanity checks.
Scoring
| Component | Points | Criteria |
|---|---|---|
| Checkpoints | 3 | 1 pt each. Binary: the checkpoint test passes or it does not. |
| Autograder | 5 | Correctness across all test cases. Partial credit by proportion of tests passed. |
| Timeliness | 2 | Full credit if submitted by the due date. 0 if late. |
| Total | 10 | |