Lab 17: Sentence Similarity
10 points · Week 7 · Due May 15

File processing patterns, tokenization, and similarity scoring

Topics: file-processing, split, tokenization
Starter code: clone from GitHub
Scoring: 3 pts checkpoints, 5 pts autograder, 2 pts timeliness
Prerequisites: lesson-3-2

After this lab, you will be able to:

  • Read structured data from a file into an array
  • Tokenize strings using split and normalize text with toLowerCase
  • Build a unique vocabulary from multiple sentences
  • Compute a basic similarity score using word frequency vectors

What You’re Building

You will read pairs of sentences from a file, tokenize them into words, build a shared vocabulary, and compute how similar the two sentences are. The similarity score is based on word frequency overlap — a simplified version of techniques used in search engines and natural language processing. The program reads input from a file and prints the similarity score for each pair.


Concepts and Misconceptions

  • File-to-array
    Common mistake: hardcoding the array size instead of reading the count from the file header (or using a two-pass approach).
    What the test catches: ArrayIndexOutOfBoundsException or empty slots.

  • split tokenization
    Common mistake: splitting on " " (a single space) instead of "\\s+" (any whitespace), which misses words separated by tabs or multiple spaces.
    What the test catches: word count mismatch.

  • Case normalization
    Common mistake: comparing "The" and "the" as different words.
    What the test catches: vocabulary larger than expected, wrong score.

  • Frequency counting
    Common mistake: using == to compare strings instead of .equals().
    What the test catches: words never match, so every frequency is zero.
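The two most common pitfalls above are easy to demonstrate in isolation. This small sketch (class name is illustrative) shows why splitting on a single space mishandles repeated whitespace, and why == is the wrong way to compare string content:

```java
public class SplitPitfalls {
    public static void main(String[] args) {
        String line = "a  b"; // two spaces between the words

        // Splitting on a single space produces an empty token in the middle.
        String[] bad = line.split(" ");     // ["a", "", "b"] -> length 3
        // Splitting on "\\s+" treats any run of whitespace as one delimiter.
        String[] good = line.split("\\s+"); // ["a", "b"]     -> length 2
        System.out.println(bad.length + " vs " + good.length); // prints 3 vs 2

        // == compares references; .equals() compares content.
        String a = "the";
        String b = new String("the"); // same content, different object
        System.out.println(a == b);        // prints false
        System.out.println(a.equals(b));   // prints true
    }
}
```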

Checkpoints

Checkpoint 1: Read Sentences from a File into an Array

What to do: Open the input file with Scanner. The first line contains the number of sentence pairs. Read each subsequent line into a String[] array. Each pair of sentences occupies two consecutive lines.

What the test checks: The array contains the correct number of sentences, and each sentence matches the file content exactly (after trimming).

Debugging tip: If you get an ArrayIndexOutOfBoundsException, check that your array size matches the number of lines you need to read (pairs times two). If the first sentence is missing, you may have forgotten to consume the first line (the count) before reading sentences.
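One way to structure this checkpoint is sketched below. The class and method names are illustrative, not required by the lab; the main method feeds the Scanner a string to keep the sketch self-contained, whereas your program would pass `new Scanner(new File(...))` instead.

```java
import java.util.Scanner;

public class SentenceReader {
    // Read sentence pairs from input whose first line is the pair count.
    // Each pair occupies the next two lines.
    public static String[] readSentences(Scanner in) {
        int pairs = Integer.parseInt(in.nextLine().trim()); // consume the count first
        String[] sentences = new String[pairs * 2];          // size = pairs times two
        for (int i = 0; i < sentences.length; i++) {
            sentences[i] = in.nextLine().trim();
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Simulated file content: one pair of sentences.
        String fileText = "1\nthe cat sat\nthe dog sat\n";
        String[] sentences = readSentences(new Scanner(fileText));
        for (String s : sentences) {
            System.out.println(s);
        }
    }
}
```

Note that the count line is consumed with nextLine() before the loop starts; skipping that step is exactly the "first sentence missing" symptom described above.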

Checkpoint 2: Build a Unique Vocabulary

What to do: For a given pair of sentences, split both on "\\s+" and convert every token to lowercase. Collect all unique words into a String[] vocabulary. A word appears in the vocabulary only once, even if it occurs in both sentences.

What the test checks: The vocabulary array contains exactly the expected set of unique words (order does not matter, but duplicates are not allowed).

Debugging tip: If your vocabulary has duplicates, check your “already exists” logic. Loop through the vocabulary array before adding a word — if it is already present, skip it. Use .equals() for comparison, not ==. If words that should match are treated as distinct, make sure you called toLowerCase() before comparing.
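The steps above can be sketched as follows. Names are illustrative; the key moves are lowercasing before comparison, checking membership with .equals(), and shrinking the over-allocated array to the actual unique-word count at the end.

```java
import java.util.Arrays;

public class VocabBuilder {
    // Linear scan for membership. Uses .equals(), never == (which compares references).
    static boolean contains(String[] vocab, int count, String word) {
        for (int i = 0; i < count; i++) {
            if (vocab[i].equals(word)) return true;
        }
        return false;
    }

    // Build the unique, lowercased vocabulary shared by two sentences.
    public static String[] buildVocabulary(String s1, String s2) {
        String[] tokens1 = s1.toLowerCase().split("\\s+");
        String[] tokens2 = s2.toLowerCase().split("\\s+");
        // Upper bound: every token is unique. Shrunk below.
        String[] vocab = new String[tokens1.length + tokens2.length];
        int count = 0;
        for (String[] tokens : new String[][] { tokens1, tokens2 }) {
            for (String word : tokens) {
                if (!contains(vocab, count, word)) {
                    vocab[count++] = word;
                }
            }
        }
        return Arrays.copyOf(vocab, count); // trim unused slots
    }

    public static void main(String[] args) {
        String[] vocab = buildVocabulary("The cat sat", "the dog sat");
        System.out.println(Arrays.toString(vocab)); // prints [the, cat, sat, dog]
    }
}
```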

Checkpoint 3: Compute Word Frequency and Similarity Score

What to do: For each sentence, build a frequency vector: an int[] where each index corresponds to a word in the vocabulary and the value is how many times that word appears in the sentence. Compute the similarity using the dot product of the two vectors divided by the product of their magnitudes (cosine similarity). Print the score rounded to four decimal places.

What the test checks: The similarity score matches the expected value within a tolerance of 0.0001.

Debugging tip: If your score is always 0.0, your frequency vectors are probably all zeros — check that your word-matching logic uses .equals() and that you are iterating over the vocabulary correctly. If your score is greater than 1.0, check your magnitude calculation: magnitude is the square root of the sum of squares, not the sum itself. Use Math.sqrt and Math.pow or manual multiplication.
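A minimal sketch of the frequency-vector and cosine-similarity logic, assuming the vocabulary from Checkpoint 2 (names are illustrative):

```java
public class SimilarityScore {
    // One count per vocabulary word: how often it appears in the sentence.
    static int[] frequencyVector(String[] vocab, String sentence) {
        String[] tokens = sentence.toLowerCase().split("\\s+");
        int[] freq = new int[vocab.length];
        for (int i = 0; i < vocab.length; i++) {
            for (String token : tokens) {
                if (vocab[i].equals(token)) freq[i]++; // .equals(), not ==
            }
        }
        return freq;
    }

    // Cosine similarity: dot product over the product of magnitudes.
    static double cosineSimilarity(int[] a, int[] b) {
        double dot = 0, sumSqA = 0, sumSqB = 0;
        for (int i = 0; i < a.length; i++) {
            dot    += a[i] * b[i];
            sumSqA += a[i] * a[i]; // sum of squares first...
            sumSqB += b[i] * b[i];
        }
        // ...then the square root. Skipping Math.sqrt is the classic "score > 1.0" bug.
        return dot / (Math.sqrt(sumSqA) * Math.sqrt(sumSqB));
    }

    public static void main(String[] args) {
        String[] vocab = { "the", "cat", "sat", "dog" };
        int[] f1 = frequencyVector(vocab, "the cat sat"); // [1, 1, 1, 0]
        int[] f2 = frequencyVector(vocab, "the dog sat"); // [1, 0, 1, 1]
        System.out.printf("%.4f%n", cosineSimilarity(f1, f2)); // prints 0.6667
    }
}
```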


How to Debug

  1. Print the vocabulary. Before computing frequencies, print the vocabulary array. Verify it contains every unique word from both sentences and nothing more.

  2. Print the frequency vectors. Display each sentence’s frequency vector alongside the vocabulary. You can quickly spot mismatches — a word that appears in the sentence but has a frequency of 0 means your matching logic is broken.

  3. Test with trivial cases. Two identical sentences should produce a score of 1.0. Two sentences with no shared words should produce 0.0. Use these as sanity checks.
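The trivial cases in step 3 can be checked with a throwaway sketch like the one below. It uses an ArrayList just to keep this debugging aid short; the lab itself asks for arrays, and the names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class SanityChecks {
    // Compact cosine similarity for sanity testing only.
    static double similarity(String s1, String s2) {
        String[] t1 = s1.toLowerCase().split("\\s+");
        String[] t2 = s2.toLowerCase().split("\\s+");
        List<String> vocab = new ArrayList<>();
        for (String w : t1) if (!vocab.contains(w)) vocab.add(w);
        for (String w : t2) if (!vocab.contains(w)) vocab.add(w);
        double dot = 0, sumSq1 = 0, sumSq2 = 0;
        for (String w : vocab) {
            int c1 = 0, c2 = 0;
            for (String x : t1) if (x.equals(w)) c1++;
            for (String x : t2) if (x.equals(w)) c2++;
            dot += c1 * c2; sumSq1 += c1 * c1; sumSq2 += c2 * c2;
        }
        return dot / (Math.sqrt(sumSq1) * Math.sqrt(sumSq2));
    }

    public static void main(String[] args) {
        // Identical sentences: every count pair matches, so the score is 1.0.
        System.out.println(similarity("the cat sat", "the cat sat")); // prints 1.0
        // No shared words: the dot product is zero, so the score is 0.0.
        System.out.println(similarity("red blue", "green yellow"));   // prints 0.0
    }
}
```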


Scoring

  • Checkpoints (3 pts): 1 pt each. Binary: the checkpoint test passes or it does not.
  • Autograder (5 pts): correctness across all test cases. Partial credit by proportion of tests passed.
  • Timeliness (2 pts): full credit if submitted by the due date; 0 if late.
  • Total: 10 pts