c-foundations Lesson 1 25 min read

How Do Regular Expressions Help Find Text Patterns?

grep, find, and the art of searching — matching patterns across files and directories

Reading: Linux Text: Ch. 17 §2 (find), Ch. 19 §1–9 (grep & regex), Ch. 20 §2 (cut/paste)

After this lesson, you will be able to:

  • Use grep with flags (-n, -i, -c, -v, -r, -l, -w) to search for text patterns inside files
  • Write basic regular expressions using metacharacters (., ^, $, *, [...], [^...]) to match text patterns
  • Use extended regular expressions (-E) with +, ?, |, and grouping to build complex pattern matches
  • Use find to locate files by name, type, size, or other attributes in a directory tree
  • Combine find with -exec and grep to search for text patterns across specific file types
  • Use sed for basic find-and-replace operations on text streams
  • Distinguish between grep (searches file contents) and find (searches for files by attributes)

Finding a Needle in a Haystack

Imagine you’re working on a C project with a dozen source files. You wrote a function called calculate_total last week, but you can’t remember which file it’s in. Or maybe you just got a compiler error mentioning line 47 of some file, and you need to find every printf call to figure out what went wrong.

You could open each file and scroll through it manually. Or you could let Unix do the searching for you in under a second.

Unix gives you two tools for this: grep searches inside files for text patterns, and find searches for files by name or attributes. Together with regular expressions — a mini-language for describing text patterns — they turn searching from a chore into a one-liner.

We’ll use a small C project as our running example throughout this lesson. Imagine a directory with these files:

project/
├── main.c          # entry point, calls calculate_total()
├── utils.c         # helper functions
├── utils.h         # header for utils.c
├── data.txt        # sample input data
└── README.md       # project notes with TODOs

Every search example below operates on this project. By the end, you’ll be able to find anything in any project — fast.


grep: Searching Inside Files

The name grep stands for Global Regular Expression Print — it reads lines from a file (or stream) and prints every line that matches a pattern. It’s one of the most-used Unix commands. Developers reach for it dozens of times a day.

Here’s the simplest form — find every line in main.c that mentions printf:

grep 'printf' main.c

This prints each line containing the literal text printf. Nothing more, nothing less.

Useful grep Flags

Bare grep is handy, but flags make it powerful. Here are the ones you’ll use constantly:

grep -n 'printf' main.c          # Show line numbers with each match
grep -i 'error' main.c           # Case-insensitive (matches ERROR, Error, error)
grep -c 'printf' main.c          # Count matching lines (just the number)
grep -v '#include' main.c        # Invert — show lines that do NOT match
grep -r 'TODO' .                 # Recursive — search all files in current dir and below
grep -l 'calculate_total' *.c    # List only filenames that contain a match
grep -w 'int' main.c             # Whole word only — won't match "printf" or "pointer"

Each flag does one thing. Combine them for precision: grep -rn 'TODO' . recursively finds every TODO and shows the filename plus line number — a quick way to see what’s left to do in a project.
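Here is a minimal sketch you can paste into a terminal to watch several of these flags work. The contents of main.c below are made up for illustration (the lesson never shows the real file), and everything happens in a throwaway scratch directory.

```shell
# Hypothetical stand-in for the lesson's main.c, created in a scratch directory.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
cat > main.c <<'EOF'
#include <stdio.h>
#include "utils.h"

int main(void) {
    int total = calculate_total(3, 4);
    printf("total = %d\n", total);
    printf("done\n");
    return 0;
}
EOF

grep -n 'printf' main.c                      # matching lines, prefixed with line numbers
printf_lines=$(grep -c 'printf' main.c)      # how many lines contain printf
non_includes=$(grep -vc '#include' main.c)   # -v and -c combined: lines WITHOUT #include
int_words=$(grep -wc 'int' main.c)           # whole-word int only; printf lines don't count
echo "$printf_lines $non_includes $int_words"

# Clean up the scratch directory.
cd "$OLDPWD" && rm -rf "$tmp"
```

Notice that -w reports only the two lines where int stands alone as a word, even though the two printf lines also contain the letters i-n-t.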

From Java: In CSCD 210, you used String.contains() or String.matches() with java.util.regex.Pattern to search text. grep is the command-line equivalent — but instead of operating on a single String object, it operates on entire files and streams. The regex syntax is similar because both descend from the same mathematical foundations (formal language theory).

grep in Pipelines

Remember pipes from Lesson 1.5? grep shines as a filter in the middle of a pipeline:

ls -la | grep '\.c$'             # Show only entries ending in .c (\. is a literal dot, $ means end of line)
history | grep 'gcc'             # Find your past compilation commands
cat /etc/passwd | grep "$USER"   # Find your account info in the system user list

Each of these takes the output of one command and uses grep to keep only the lines you care about.
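To try the filter idea without depending on what's on your machine, here is a small sketch that feeds grep made-up input through printf. The filenames and code lines are invented for the demo.

```shell
# printf emits one made-up filename per line, standing in for a real command's output.
utils_count=$(printf '%s\n' main.c utils.c utils.h data.txt README.md | grep -c 'utils')
echo "$utils_count"

# Filters can be chained: keep lines mentioning printf, then drop the fprintf ones.
plain=$(printf '%s\n' 'printf("a");' 'fprintf(stderr, "b");' 'puts("c");' \
        | grep 'printf' | grep -v 'fprintf')
echo "$plain"
```

Each grep in the chain sees only what the previous stage let through, so you can narrow results one step at a time.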

Quick Check: What does grep -rn 'malloc' . do?

Recursively (-r) searches all files starting from the current directory (.) for lines containing malloc, and shows the filename and line number (-n) for each match. You’d use this to find every place in a project that allocates memory.

Check Your Understanding
You want to find all lines in main.c that do not contain #include. Which command does this?
A grep '#include' main.c
B grep -v '#include' main.c
C grep -c '#include' main.c
D grep -n '#include' main.c
Answer: B. The -v flag inverts the match — it prints every line that does not contain the pattern. Option A shows lines that do match. Option C counts matches. Option D shows matches with line numbers.

Regular Expression Basics

So far we’ve been searching for literal text — the exact string printf or TODO. But what if you want to find lines that start with int, or lines that contain any digit, or lines that end with a semicolon? You need regular expressions (regex for short).

A regular expression is a pattern that describes a set of strings. Plain text in a regex matches itself literally, but special characters called metacharacters let you describe more flexible patterns.

Metacharacter   Meaning                         Example   Matches
.               Any single character            h.t       hat, hit, hot, h9t
^               Start of line                   ^int      Lines starting with “int”
$               End of line                     ;$        Lines ending with a semicolon
*               Zero or more of previous        ab*c      ac, abc, abbc, abbbc
[...]           Any one character in the set    [aeiou]   Any vowel
[^...]          Any character NOT in the set    [^0-9]    Any non-digit
\               Escape a metacharacter          \.        A literal period (not “any character”)

Watch out: Regex * and shell glob * are NOT the same thing! In shell globbing (like ls *.c), the * means “any sequence of characters.” In regex, * means “zero or more of the previous character” — so .* (dot-star) is the regex for “any sequence of characters.” This trips up almost everyone at first. Always quote your regex patterns to prevent the shell from interpreting them.
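A quick experiment makes the difference concrete. This sketch uses four made-up names (ac, abc, abbc, abxc) and tests the same text, ab*c, once as a shell glob and once as a regex.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
touch ac abc abbc abxc        # four empty files with made-up names

# Unquoted, so the SHELL expands it: "ab", then any characters, then "c".
glob_hits=$(echo ab*c)

# Quoted, so GREP interprets it: "a", then zero or more "b", then "c".
regex_hits=$(printf '%s\n' ac abc abbc abxc | grep '^ab*c$' | xargs)

echo "glob:  $glob_hits"
echo "regex: $regex_hits"

cd "$OLDPWD" && rm -rf "$tmp"
```

The glob accepts abxc but rejects ac, while the regex does the opposite: same characters, two different languages.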

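Let’s use these metacharacters on our project. Find likely function definitions in main.c: lines that start with a lowercase letter and contain a (. This is a heuristic, so it also catches prototypes and anything else that begins in the first column and contains a parenthesis: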

grep -n '^[a-z].*(' main.c

This reads: “lines starting with (^) a lowercase letter ([a-z]), followed by anything (.*), followed by a (.” The -n flag shows line numbers so you can jump right to each function.

Find all #include directives:

grep -n '^#include' main.c

The ^ anchors the match to the start of the line, so this won’t match a comment that happens to mention #include.

Find lines ending with a semicolon (statements, not function headers or comments):

grep -n ';$' main.c

The trick: Always quote regex patterns with single quotes to prevent shell expansion. Write grep '^int' file.c, not grep ^int file.c. Without quotes, the shell might interpret ^, $, *, or [ before grep ever sees them.
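The three searches above can be tried on a tiny file. The file demo.c and its contents are invented for this sketch; note the comment on its second line, which mentions #include without starting with it.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
cat > demo.c <<'EOF'
#include <stdio.h>
/* remember to #include "utils.h" later */
int add(int a, int b) {
    return a + b;
}
EOF

anywhere=$(grep -c '#include' demo.c)       # counts the comment line too
at_start=$(grep -c '^#include' demo.c)      # only the real directive
semis=$(grep -n ';$' demo.c)                # lines whose last character is ;
funcs=$(grep -n '^[a-z].*(' demo.c)         # lowercase first letter, then a ( somewhere
printf '%s\n' "$anywhere" "$at_start" "$semis" "$funcs"

cd "$OLDPWD" && rm -rf "$tmp"
```

Dropping the ^ doubles the #include count here, which is exactly the false match the anchor is there to prevent.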

Quick Check: What does the pattern [^0-9] match?

Any single character that is NOT a digit. The ^ inside square brackets means “not” — it’s different from ^ at the start of a pattern (which means “start of line”). So [^0-9] matches letters, spaces, punctuation, and anything else that isn’t 0 through 9.
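One subtlety worth testing: because grep prints whole lines, [^0-9] selects lines that contain at least one non-digit, which is not the same as lines with no digits. A small sketch with three made-up lines shows the difference.

```shell
# Three made-up lines: digits only, letters only, and a mix.
sample=$(printf '%s\n' '12345' 'hello' 'abc123')

has_non_digit=$(printf '%s\n' "$sample" | grep '[^0-9]' | xargs)    # at least one non-digit
all_non_digit=$(printf '%s\n' "$sample" | grep '^[^0-9]*$' | xargs) # nothing but non-digits
no_digit_v=$(printf '%s\n' "$sample" | grep -v '[0-9]' | xargs)     # same idea using -v

echo "$has_non_digit"
echo "$all_non_digit"
echo "$no_digit_v"
```

If you want lines that contain no digits at all, anchor the class at both ends or invert a digit search with -v.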

Check Your Understanding
What does the regex pattern ^int match when used with grep?
A Lines that start with the text "int"
B Any line containing "int" anywhere
C Lines that end with "int"
D The literal text "^int" in a file
Answer: A. The ^ metacharacter anchors the pattern to the start of a line. Without it, grep 'int' would match "int" anywhere on the line (option B). The $ anchor is for end-of-line (option C). Option D would require escaping the caret: grep '\^int'.

Extended Regular Expressions

Basic regex gives you ., *, ^, $, and character classes. That handles a lot, but sometimes you need more. The -E flag enables extended regular expressions, which add four more metacharacters:

Extended   Meaning                   Example
+          One or more of previous   [0-9]+ matches one or more digits
?          Zero or one of previous   colou?r matches “color” or “colour”
|          Alternation (OR)          int|double matches “int” or “double”
(...)      Grouping                  ^(int|void) matches lines starting with “int” or “void”

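Let’s use these on our project. Find every line in main.c that mentions int or double, a rough first pass at spotting declarations (it also matches any line that merely contains those letters, such as a printf call):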

grep -E 'int|double' main.c

Find lines containing one or more digits (useful for spotting magic numbers):

grep -En '[0-9]+' main.c

Find function return types — lines that start with int or void:

grep -En '^(int|void)' main.c
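The parentheses in that last pattern matter. As a rough illustration on four made-up lines of C, compare the grouped pattern with the same pattern minus the parentheses.

```shell
# Four made-up lines of C for the comparison.
sample=$(printf '%s\n' \
  'int main(void) {' \
  'void helper(void);' \
  'static void log_it(void) {' \
  '    return 0;')

grouped=$(printf '%s\n' "$sample" | grep -Ec '^(int|void)')   # STARTS with int or void
ungrouped=$(printf '%s\n' "$sample" | grep -Ec '^int|void')   # starts with int, OR has void anywhere

echo "grouped: $grouped, ungrouped: $ungrouped"
```

Without grouping, the ^ applies only to the first alternative, so the static void line slips in.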

Key insight: The difference between * and + matters more than you’d think. [0-9]* matches zero or more digits, and because an empty match counts, it succeeds on every single line, even lines with no digits at all. [0-9]+ matches one or more digits, so it selects only lines that actually contain a number. Confusing the two is one of the most common regex mistakes beginners make.
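You can verify this by counting. The sketch below runs both patterns over four made-up lines, only two of which contain a digit.

```shell
# Four made-up lines; only the first and last contain digits.
sample=$(printf '%s\n' 'int x = 42;' 'return total;' '// no numbers here' 'buf[16]')

star_lines=$(printf '%s\n' "$sample" | grep -c '[0-9]*')    # zero digits is allowed: every line
plus_lines=$(printf '%s\n' "$sample" | grep -Ec '[0-9]+')   # at least one digit required
bare_lines=$(printf '%s\n' "$sample" | grep -c '[0-9]')     # no quantifier: also "has a digit"

echo "star: $star_lines, plus: $plus_lines, bare: $bare_lines"
```

The last line is a handy aside: when all you need is "this line has a digit somewhere," the bare class [0-9] is enough and doesn't require -E.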

Check Your Understanding
What is the difference between grep -E '[0-9]+' data.txt and grep '[0-9]*' data.txt?
A They are identical — + and * mean the same thing
B The first requires the -E flag and matches fewer lines; the second matches only lines with no digits
C The first matches one or more digits; the second matches zero or more digits (which matches every line)
D The first is extended regex and won't work; the second is the correct basic regex form
Answer: C. + means "one or more of the previous," so [0-9]+ only matches lines containing at least one digit. * means "zero or more of the previous," so [0-9]* can match zero digits, which succeeds on every line (zero is a valid count). The + quantifier requires -E (extended regex), but it does work correctly with that flag.

Quick Check: How would you find lines containing either "printf" or "fprintf" in one command?

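grep -E 'f?printf' main.c: the ? makes the leading f optional, so the pattern matches both printf and fprintf. Alternatively, grep -E 'printf|fprintf' main.c uses alternation. Both require -E for extended regex. One caveat: because fprintf contains printf, plain grep 'printf' main.c already prints the same lines (along with any sprintf or snprintf lines). Add -w, as in grep -Ew 'f?printf' main.c, to restrict matches to exactly those two names.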


find: Searching for Files

grep searches inside files. But what if you don’t know which file to search? That’s where find comes in. It walks a directory tree and prints files matching criteria like name, type, or size.

Back to our project. Find all C source files anywhere in the directory tree:

find . -name '*.c'

This starts at the current directory (.) and prints every file whose name matches *.c. Note that the * here is a glob pattern (handled by find, not the shell) — you must quote it to prevent the shell from expanding it first.

Here are the most useful find options:

find . -name '*.c' -type f        # Only regular files (not directories named "something.c")
find . -type d                    # Only directories
find . -name '*.o' -delete        # Find and delete all object files
find . -empty                     # Empty files and directories
find . -size +1M                  # Files larger than 1 megabyte
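To see what these tests select, here is a sketch that builds a small made-up tree in a scratch directory. It includes a directory deliberately named weird.c so you can see what -type f filters out.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
mkdir -p src/old docs weird.c                 # weird.c is a DIRECTORY, on purpose
printf 'int main(void) { return 0; }\n' > src/main.c
printf 'int helper(void) { return 1; }\n' > src/old/utils.c
: > docs/empty.txt                            # zero-byte file

# sort makes the order predictable; xargs joins the results onto one line.
by_name=$(find . -name '*.c' | LC_ALL=C sort | xargs)
files_only=$(find . -name '*.c' -type f | LC_ALL=C sort | xargs)
empties=$(find . -empty | LC_ALL=C sort | xargs)

echo "by name:    $by_name"
echo "files only: $files_only"
echo "empty:      $empties"

cd "$OLDPWD" && rm -rf "$tmp"
```

The same preview habit is worth using with -delete: run the find without -delete first, check the list, then add the flag.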

Combining find and grep

The real power comes from combining the two tools. Find all .c files in your project that call malloc:

find . -name '*.c' -exec grep -l 'malloc' {} \;

This works in two steps: find locates every .c file, then -exec runs grep -l 'malloc' on each one. The {} is a placeholder for each filename find discovers, and \; marks the end of the -exec command.
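Here is a sketch of that command on a made-up tree. It also shows a variant the lesson doesn't cover: ending -exec with + instead of \; hands many filenames to a single grep rather than starting one grep per file.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
mkdir -p src/lib
printf 'char *p = malloc(16);\n' > src/alloc.c      # a .c file that calls malloc
printf 'int x = 0;\n'            > src/lib/plain.c  # a .c file that does not
printf 'notes about malloc\n'    > src/notes.txt    # mentions malloc, but not a .c file

one_each=$(find . -name '*.c' -exec grep -l 'malloc' {} \;)   # one grep process per file
batched=$(find . -name '*.c' -exec grep -l 'malloc' {} +)     # one grep for the whole batch

echo "$one_each"
echo "$batched"

cd "$OLDPWD" && rm -rf "$tmp"
```

Both forms report the same file here. On large trees the + form usually starts far fewer processes.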

Key insight: grep and find serve different purposes. grep searches file contents for text patterns. find searches the directory tree for files matching name, type, or attribute criteria. Use find to locate files, then grep to search inside them.

VS Code comparison: grep is like VS Code’s “Find in Files” (Ctrl+Shift+F). find is like the file explorer’s search bar. The difference: the command-line versions are scriptable, composable with pipes, and work on any system — even a remote server with no GUI.

Quick Check: Why do you need to quote '*.c' in find . -name '*.c'?

Without quotes, the shell expands *.c into a list of .c files in the current directory before find even runs. If there are files matching *.c right here, find would receive those literal filenames instead of the pattern. Quoting ensures find gets the raw pattern *.c and can match it against files in every subdirectory.
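You can watch the expansion happen by putting echo in front of the command. This sketch uses a made-up layout with one .c file in the current directory and another in a subdirectory.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
mkdir sub
touch main.c sub/utils.c

# echo reveals the arguments find would actually receive.
unquoted_args=$(echo find . -name *.c)      # deliberately unquoted: the shell expands *.c
quoted_args=$(echo find . -name '*.c')      # the pattern reaches find untouched

unquoted=$(find . -name *.c | LC_ALL=C sort | xargs)     # really runs: find . -name main.c
quoted=$(find . -name '*.c' | LC_ALL=C sort | xargs)

echo "$unquoted_args  ->  $unquoted"
echo "$quoted_args  ->  $quoted"

cd "$OLDPWD" && rm -rf "$tmp"
```

The unquoted version silently misses sub/utils.c. With two or more .c files in the current directory it would fail with an error instead, because find would receive extra arguments it can't interpret.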

Check Your Understanding
You want to find all .c files in your project that call malloc. Which command does this?
A grep -r '*.c' malloc
B find . -name 'malloc' -type f
C grep -r 'malloc' *.c
D find . -name '*.c' -exec grep -l 'malloc' {} \;
Answer: D. This combines both tools: find locates all .c files recursively, then -exec grep -l checks each one for "malloc" and prints only matching filenames. Option A has the arguments backwards. Option B searches for files named "malloc", not files containing it. Option C looks close but only searches .c files in the current directory — the shell expands *.c before grep runs, so subdirectories are missed even with -r.

A Brief Look at sed

The lecture notes introduce one more text tool: sed (stream editor). While grep finds lines matching a pattern, sed can change them. Its most common use is find-and-replace:

sed 's/TODO/DONE/' README.md

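This substitutes the first TODO on each line with DONE and prints the result. It doesn’t modify the file; it prints the transformed output to stdout. To save the result, redirect with > to a different file (redirecting onto README.md itself would truncate it before sed reads it).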

Add g for “global” to replace every occurrence on each line, not just the first:

sed 's/TODO/DONE/g' README.md

sed is powerful but deep — we’re only scratching the surface here. For now, know that it exists and that s/old/new/g is the pattern you’ll reach for most often.
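A short sketch ties these pieces together. The README.md contents and the output name README.done.md are invented for the demo.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' 'TODO: write tests, TODO: write docs' 'no tasks here' > README.md

first_only=$(sed 's/TODO/DONE/' README.md | head -n 1)    # only the first TODO on the line
every_one=$(sed 's/TODO/DONE/g' README.md | head -n 1)    # g: every TODO on the line

# Save the result to a NEW file; never redirect onto the file being read.
sed 's/TODO/DONE/g' README.md > README.done.md
remaining=$(grep -c 'TODO' README.md)                     # the original is untouched

echo "$first_only"
echo "$every_one"
echo "lines still containing TODO in README.md: $remaining"

cd "$OLDPWD" && rm -rf "$tmp"
```

Comparing the first two outputs side by side is the quickest way to remember what g does.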

Quick Check: What does sed 's/printf/fprintf/g' main.c do?

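It reads main.c, replaces every occurrence of printf with fprintf on every line, and prints the result to stdout. Because the match is purely textual, the printf inside an existing fprintf is replaced too, turning it into ffprintf. The original main.c is unchanged; you’d need to redirect the output (> newfile.c) or use sed -i to edit in place.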

Check Your Understanding
What does the g at the end of sed 's/TODO/DONE/g' file.txt do?
A Makes the search case-insensitive (like grep -i)
B Writes the changes directly to the file
C Replaces ALL occurrences on each line, not just the first
D Searches all files in the directory (global search)
Answer: C. Without g, sed only replaces the first match on each line. With g (global), it replaces every occurrence. For example, if a line has three TODOs, s/TODO/DONE/ fixes only the first one, while s/TODO/DONE/g fixes all three.

Quick Reference

Task                                 Command
Find text in a file                  grep 'pattern' file
Find text recursively in .c files    grep -rn 'pattern' --include='*.c' .
Find files by name                   find . -name '*.c'
Find files then search inside them   find . -name '*.c' -exec grep -l 'pattern' {} \;
Count matches                        grep -c 'pattern' file
Show matches with line numbers       grep -n 'pattern' file
Find and replace (preview)           sed 's/old/new/g' file

Try It Yourself

These exercises build on each other. Try them in a terminal:

Exercise 1: In your home directory, use find to locate all .txt files. Then use grep to find which of those files contains the word “hello”. Try combining them: find . -name '*.txt' -exec grep -l 'hello' {} \;

Exercise 2: Create a small .c file with a few printf calls, some comments, and a couple of #include directives. Then:

  1. Find all lines starting with # using grep '^#' yourfile.c
  2. Find all lines ending with ; using grep ';$' yourfile.c
  3. Find lines containing any digit using grep -E '[0-9]+' yourfile.c

Exercise 3: Use grep -rn 'TODO' . in any project directory to find all TODO comments. How many are there? (grep -rc 'TODO' . gives counts per file.)


Pattern Matching Is Everywhere

Regular expressions aren’t just a Unix thing. The patterns you learned with grep carry over to Python (re module), JavaScript (RegExp), and Java (java.util.regex). Most text editors and mainstream programming languages support regex search. The syntax varies slightly, but the core metacharacters (., *, ^, $, [...]) are nearly universal.

When you start writing C programs next week, grep becomes essential for debugging. Searching for where a variable is declared, finding every function that calls printf, or locating a bug across a multi-file project: grep -rn handles all of these in a single command, and it works on remote servers where there’s no GUI.

Next up: processes. Every command you’ve been running — grep, find, sed — creates a process, a running instance of a program with its own PID and memory space. In Lesson 2.2, you’ll learn to see processes, manage them, and understand how Unix runs multiple programs simultaneously. That’s the foundation you’ll need when we start writing our own C programs and watching them run.

Why this matters: The grep, find, and sed trio is the bread and butter of Unix text processing. Professional developers and sysadmins use these tools every day: searching logs for errors, finding configuration files, batch-renaming variables. Mastering them now means you’ll be productive on any Unix system for the rest of your career.