How Do Regular Expressions Help Find Text Patterns?
grep, find, and the art of searching — matching patterns across files and directories
After this lesson, you will be able to:
- Use grep with flags (-n, -i, -c, -v, -r, -l, -w) to search for text patterns inside files
- Write basic regular expressions using metacharacters (., ^, $, *, [...], [^...]) to match text patterns
- Use extended regular expressions (-E) with +, ?, |, and grouping to build complex pattern matches
- Use find to locate files by name, type, size, or other attributes in a directory tree
- Combine find with -exec and grep to search for text patterns across specific file types
- Use sed for basic find-and-replace operations on text streams
- Distinguish between grep (searches file contents) and find (searches for files by attributes)
Finding a Needle in a Haystack
Imagine you’re working on a C project with a dozen source files. You wrote a function called calculate_total last week, but you can’t remember which file it’s in. Or maybe you just got a compiler error mentioning line 47 of some file, and you need to find every printf call to figure out what went wrong.
You could open each file and scroll through it manually. Or you could let Unix do the searching for you in under a second.
Unix gives you two tools for this: grep searches inside files for text patterns, and find searches for files by name or attributes. Together with regular expressions — a mini-language for describing text patterns — they turn searching from a chore into a one-liner.
We’ll use a small C project as our running example throughout this lesson. Imagine a directory with these files:
project/
├── main.c # entry point, calls calculate_total()
├── utils.c # helper functions
├── utils.h # header for utils.c
├── data.txt # sample input data
└── README.md # project notes with TODOs
Every search example below operates on this project. By the end, you’ll be able to find anything in any project — fast.
grep: Searching Inside Files
The name grep stands for Global Regular Expression Print — it reads lines from a file (or stream) and prints every line that matches a pattern. It’s one of the most-used Unix commands. Developers reach for it dozens of times a day.
Here’s the simplest form — find every line in main.c that mentions printf:
grep 'printf' main.c
This prints each line containing the literal text printf. Nothing more, nothing less.
Useful grep Flags
Bare grep is handy, but flags make it powerful. Here are the ones you’ll use constantly:
grep -n 'printf' main.c # Show line numbers with each match
grep -i 'error' main.c # Case-insensitive (matches ERROR, Error, error)
grep -c 'printf' main.c # Count matching lines (just the number)
grep -v '#include' main.c # Invert — show lines that do NOT match
grep -r 'TODO' . # Recursive — search all files in current dir and below
grep -l 'calculate_total' *.c # List only filenames that contain a match
grep -w 'int' main.c # Whole word only — won't match "printf" or "pointer"
Each flag does one thing. Combine them for precision: grep -rn 'TODO' . recursively finds every TODO and shows the filename plus line number — a quick way to see what’s left to do in a project.
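To see the flags side by side, here is a minimal sketch that runs them against a throwaway file. The contents of main.c below are a hypothetical stand-in, not the real project file, and the variable names are only for illustration.

```shell
#!/bin/sh
# Work in a scratch directory so no real project files are touched.
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Hypothetical stand-in for main.c from the running example.
cat > main.c <<'EOF'
#include <stdio.h>
#include "utils.h"

int main(void) {
    int total = calculate_total(3, 4);
    printf("total = %d\n", total);
    /* TODO: report an Error code on failure */
    printf("done\n");
    return 0;
}
EOF

# -n: print each matching line with its line number.
grep -n 'printf' main.c

# -c: print only the number of matching lines.
printf_count=$(grep -c 'printf' main.c)

# -i with -c: count lines matching 'error' in any letter case.
error_lines=$(grep -ci 'error' main.c)

# -w: 'int' must stand alone as a word, so "printf" lines are skipped.
word_int=$(grep -cw 'int' main.c)
plain_int=$(grep -c 'int' main.c)

# -l: print only the names of files that contain a match.
files_with_call=$(grep -l 'calculate_total' ./*.c)

echo "printf lines: $printf_count, error lines: $error_lines"
echo "word int: $word_int, substring int: $plain_int"
echo "files calling calculate_total: $files_with_call"
```

Comparing the two int counts is a quick way to convince yourself what -w changes: the substring search also picks up every printf line.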
From Java: In CSCD 210, you used String.contains() or String.matches() with java.util.regex.Pattern to search text. grep is the command-line equivalent, but instead of operating on a single String object, it operates on entire files and streams. The regex syntax is similar because both descend from the same mathematical foundations (formal language theory).
grep in Pipelines
Remember pipes from Lesson 1.5? grep shines as a filter in the middle of a pipeline:
ls -la | grep '\.c$' # Show only entries whose names end in .c (escaped dot, anchored at end of line)
history | grep 'gcc' # Find your past compilation commands
cat /etc/passwd | grep "$USER" # Find your account info in the system user list
Each of these takes the output of one command and uses grep to keep only the lines you care about.
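A small sketch of grep as a pipeline filter, using made-up file names in a scratch directory. It also shows why the dot in a file extension should be escaped and anchored: an unescaped . matches any character.

```shell
#!/bin/sh
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Hypothetical file names: only two of them are C sources.
touch main.c utils.c utils.h docs.txt abc

# Unescaped and unanchored: '.' matches any character, so "docs.txt"
# (o followed by c) and "abc" (b followed by c) slip through as well.
loose=$(ls | grep -c '.c')

# Escaped dot plus end-of-line anchor: only names that end in ".c".
strict=$(ls | grep -c '\.c$')

echo "loose pattern matched $loose names, strict pattern matched $strict names"

# The filter in normal use: print the surviving lines instead of counting.
ls | grep '\.c$'
```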
Quick Check: What does grep -rn 'malloc' . do?
Recursively (-r) searches all files starting from the current directory (.) for lines containing malloc, and shows the filename and line number (-n) for each match. You’d use this to find every place in a project that allocates memory.
Quick Check: You want to see only the lines in main.c that do not contain #include. Which flag does this?
The -v flag inverts the match: grep -v '#include' main.c prints every line that does not contain the pattern. Without -v, grep shows the lines that do match; -c counts matching lines; -n shows matches with line numbers.
Regular Expression Basics
So far we’ve been searching for literal text — the exact string printf or TODO. But what if you want to find lines that start with int, or lines that contain any digit, or lines that end with a semicolon? You need regular expressions (regex for short).
A regular expression is a pattern that describes a set of strings. Plain text in a regex matches itself literally, but special characters called metacharacters let you describe more flexible patterns.
| Metacharacter | Meaning | Example | Matches |
|---|---|---|---|
| . | Any single character | h.t | hat, hit, hot, h9t |
| ^ | Start of line | ^int | Lines starting with “int” |
| $ | End of line | ;$ | Lines ending with a semicolon |
| * | Zero or more of previous | ab*c | ac, abc, abbc, abbbc |
| [...] | Any one character in the set | [aeiou] | Any vowel |
| [^...] | Any character NOT in the set | [^0-9] | Any non-digit |
| \ | Escape a metacharacter | \. | A literal period (not “any character”) |
Watch out: Regex * and shell glob * are NOT the same thing! In shell globbing (like ls *.c), the * means “any sequence of characters.” In regex, * means “zero or more of the previous character,” so .* (dot-star) is the regex for “any sequence of characters.” This trips up almost everyone at first. Always quote your regex patterns to prevent the shell from interpreting them.
Let’s use these metacharacters on our project. Find all function definitions in main.c — they’re lines that start with a lowercase letter and contain a (:
grep -n '^[a-z].*(' main.c
This reads: “lines starting with (^) a lowercase letter ([a-z]), followed by anything (.*), followed by a (.” The -n flag shows line numbers so you can jump right to each function.
Find all #include directives:
grep -n '^#include' main.c
The ^ anchors the match to the start of the line, so this won’t match a comment that happens to mention #include.
Find lines ending with a semicolon (statements, not function headers or comments):
grep -n ';$' main.c
The trick: Always quote regex patterns with single quotes to prevent shell expansion. Write grep '^int' file.c, not grep ^int file.c. Without quotes, the shell might interpret ^, $, *, or [ before grep ever sees them.
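The anchors and character classes above can be tried on a throwaway file. This is a sketch under the assumption of a tiny hypothetical sample.c; your real main.c will give different counts.

```shell
#!/bin/sh
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Hypothetical sample; line 2 is a comment that merely mentions #include.
cat > sample.c <<'EOF'
#include <stdio.h>
// remember to #include "utils.h"
int add(int a, int b) {
    return a + b;
}
double ratio = 0.5;
EOF

# Anchored with ^: only real directives at the start of a line.
includes=$(grep -c '^#include' sample.c)

# Unanchored: the comment on line 2 counts too.
any_include=$(grep -c '#include' sample.c)

# $ anchors to the end of the line.
statements=$(grep -c ';$' sample.c)

# Lowercase first character, then anything, then a literal "(".
starts_lower=$(grep -c '^[a-z].*(' sample.c)

# \. is a literal period rather than "any character".
literal_dot=$(grep -c '0\.5' sample.c)

echo "directives=$includes mentions=$any_include statements=$statements"
echo "function-like lines=$starts_lower lines with 0.5=$literal_dot"
```

Every pattern is in single quotes, so the shell passes ^, $, *, and [ through to grep untouched.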
Quick Check: What does the pattern [^0-9] match?
Any single character that is NOT a digit. The ^ inside square brackets means “not” — it’s different from ^ at the start of a pattern (which means “start of line”). So [^0-9] matches letters, spaces, punctuation, and anything else that isn’t 0 through 9.
Quick Check: What does the pattern ^int match when used with grep?
Lines that start with int. The ^ metacharacter anchors the pattern to the start of a line. Without it, grep 'int' would match "int" anywhere on the line. The $ anchor is for end-of-line. Matching a literal caret would require escaping it: grep '\^int'.
Extended Regular Expressions
Basic regex gives you ., *, ^, $, and character classes. That handles a lot, but sometimes you need more. The -E flag enables extended regular expressions, which add four more metacharacters:
| Extended | Meaning | Example |
|---|---|---|
| + | One or more of previous | [0-9]+ matches one or more digits |
| ? | Zero or one of previous | colou?r matches “color” or “colour” |
| \| | Alternation (OR) | int\|double matches “int” or “double” |
| (...) | Grouping | ^(int\|void) matches lines starting with “int” or “void” |
Let’s use these on our project. Find every line in main.c that mentions int or double, which is a quick way to spot declarations (add -w to skip longer words such as printf):
grep -E 'int|double' main.c
Find lines containing one or more digits (useful for spotting magic numbers):
grep -En '[0-9]+' main.c
Find function return types — lines that start with int or void:
grep -En '^(int|void)' main.c
Key insight: The difference between * and + matters more than you’d think. [0-9]* matches zero or more digits, which means it matches every single line (an empty match of zero digits succeeds anywhere). [0-9]+ matches one or more digits, so it selects only lines that actually contain a number. Confusing the two is one of the most common regex mistakes beginners make.
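Counting matches makes the difference between * and + concrete. The sketch below assumes a hypothetical four-line data.txt in which only two lines contain digits.

```shell
#!/bin/sh
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Hypothetical contents for data.txt: two lines with digits, two without.
printf '%s\n' 'apples 12' 'no numbers here' 'order 7 shipped' 'done' > data.txt

# tr strips the padding some wc implementations add around the number.
total_lines=$(wc -l < data.txt | tr -d ' ')

# Zero digits still counts as a match, so every line is selected.
star_lines=$(grep -c '[0-9]*' data.txt)

# At least one digit is required, so only two lines are selected.
plus_lines=$(grep -Ec '[0-9]+' data.txt)

echo "lines in file:      $total_lines"
echo "matched by [0-9]*:  $star_lines"
echo "matched by [0-9]+:  $plus_lines"
```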
Quick Check: What is the difference between grep -E '[0-9]+' data.txt and grep '[0-9]*' data.txt?
+ means "one or more of the previous," so [0-9]+ only matches lines containing at least one digit. * means "zero or more of the previous," so [0-9]* can match zero digits, which succeeds on every line (zero is a valid count). The + quantifier requires -E (extended regex) and works as described with that flag.
Quick Check: How would you find lines containing either "printf" or "fprintf" in one command?
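grep -E 'f?printf' main.c: the ? makes the leading f optional, matching both printf and fprintf. Alternatively, grep -E 'printf|fprintf' main.c uses alternation. Both patterns need -E for extended regex. Note that plain grep 'printf' main.c also finds both, because printf is a substring of fprintf; add -w when whole words must be told apart.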
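A short sketch comparing the optional-character and alternation forms, plus the effect of -w. The file io.c and its four lines are hypothetical, chosen so that one line (snprintf) contains printf only as part of a longer word.

```shell
#!/bin/sh
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Hypothetical source lines using three printf-family calls and one puts.
cat > io.c <<'EOF'
printf("hello\n");
fprintf(stderr, "oops\n");
snprintf(buf, sizeof buf, "%d", n);
puts("plain");
EOF

# ? makes the leading f optional.
optional=$(grep -Ec 'f?printf' io.c)

# | tries either alternative.
alternation=$(grep -Ec 'printf|fprintf' io.c)

# No -E needed: a plain substring search already hits all three calls.
plain=$(grep -c 'printf' io.c)

# -w requires a whole word, which rules out the match inside "snprintf".
whole_words=$(grep -Ewc 'f?printf' io.c)

echo "f?printf: $optional, printf|fprintf: $alternation, printf: $plain"
echo "whole-word f?printf: $whole_words"
```

The first three counts agree because all three patterns are satisfied by any line containing the substring printf; only -w separates printf and fprintf from snprintf.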
find: Searching for Files
grep searches inside files. But what if you don’t know which file to search? That’s where find comes in. It walks a directory tree and prints files matching criteria like name, type, or size.
Back to our project. Find all C source files anywhere in the directory tree:
find . -name '*.c'
This starts at the current directory (.) and prints every file whose name matches *.c. Note that the * here is a glob pattern (handled by find, not the shell) — you must quote it to prevent the shell from expanding it first.
Here are the most useful find options:
find . -name '*.c' -type f # Only regular files (not directories named "something.c")
find . -type d # Only directories
find . -name '*.o' -delete # Find and delete all object files
find . -empty # Empty files and directories
find . -size +1M # Files larger than 1 megabyte
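The options above can be tried safely on a scratch tree. This sketch assumes a hypothetical layout (including a directory whose name ends in .c) and relies on -empty and -delete, which GNU and BSD find provide but POSIX does not require.

```shell
#!/bin/sh
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Hypothetical layout: sources at two depths, a stray object file,
# an empty file, and a directory whose name happens to end in ".c".
mkdir -p project/src project/old.c
echo 'int main(void) { return 0; }' > project/main.c
echo 'int helper(void) { return 1; }' > project/src/utils.c
echo 'binary junk' > project/src/utils.o
: > project/empty.log

# -name alone also matches the directory project/old.c.
c_names=$(find project -name '*.c' | wc -l | tr -d ' ')

# Adding -type f keeps regular files only.
c_files=$(find project -name '*.c' -type f | wc -l | tr -d ' ')

dirs=$(find project -type d | wc -l | tr -d ' ')
empty_files=$(find project -type f -empty)

echo "names ending in .c: $c_names, regular .c files: $c_files"
echo "directories: $dirs"
echo "empty files: $empty_files"

# Preview with -print before trusting -delete; deletion cannot be undone.
find project -name '*.o' -type f -print
find project -name '*.o' -type f -delete
objects_left=$(find project -name '*.o' | wc -l | tr -d ' ')
echo "object files left: $objects_left"
```

Running the same expression with -print first is a cheap habit that shows exactly what -delete is about to remove.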
Combining find and grep
The real power comes from combining the two tools. Find all .c files in your project that call malloc:
find . -name '*.c' -exec grep -l 'malloc' {} \;
This works in two steps: find locates every .c file, then -exec runs grep -l 'malloc' on each one. The {} is a placeholder for each filename find discovers, and \; marks the end of the -exec command.
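Here is a sketch of the find plus grep combination on a hypothetical tree. It also shows the POSIX variant that ends -exec with + instead of \;, which hands many filenames to one grep process rather than starting one grep per file; for this task both forms list the same files.

```shell
#!/bin/sh
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Hypothetical tree: two nested sources call malloc, main.c does not,
# and notes.txt mentions malloc but is not a .c file.
mkdir -p src/lib
echo 'int main(void) { return 0; }'  > main.c
echo 'char *p = malloc(16);'         > src/buffer.c
echo 'void *q = malloc(n);'          > src/lib/pool.c
echo 'notes: malloc is used in src/' > notes.txt

# One grep process per file; {} is replaced by each path in turn.
per_file=$(find . -name '*.c' -type f -exec grep -l 'malloc' {} \; | sort)

# "{} +" batches the paths into as few grep invocations as possible.
batched=$(find . -name '*.c' -type f -exec grep -l 'malloc' {} + | sort)

echo "$per_file"
```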
Key insight: grep and find serve different purposes. grep searches file contents for text patterns. find searches the directory tree for files matching name, type, or attribute criteria. Use find to locate files, then grep to search inside them.
VS Code comparison: grep is like VS Code’s “Find in Files” (Ctrl+Shift+F). find is like the file explorer’s search bar. The difference: the command-line versions are scriptable, composable with pipes, and work on any system, even a remote server with no GUI.
Quick Check: Why do you need to quote '*.c' in find . -name '*.c'?
Without quotes, the shell expands *.c into a list of .c files in the current directory before find even runs. If there are files matching *.c right here, find would receive those literal filenames instead of the pattern. Quoting ensures find gets the raw pattern *.c and can match it against files in every subdirectory.
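The effect of the missing quotes is easy to reproduce. This sketch assumes a hypothetical directory with exactly one .c file at the top level and another one level down.

```shell
#!/bin/sh
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Hypothetical layout: one .c file here, one in a subdirectory.
mkdir sub
touch main.c sub/utils.c

# Quoted: find receives the pattern *.c and applies it at every depth.
quoted=$(find . -name '*.c' | sort)

# Unquoted: the shell rewrites this to "find . -name main.c" before find
# starts, so sub/utils.c can never match.
unquoted=$(find . -name *.c | sort)

printf 'quoted:\n%s\n' "$quoted"
printf 'unquoted:\n%s\n' "$unquoted"
```

With two or more .c files in the current directory the unquoted form usually fails outright, because find then sees extra filenames where it expects an expression.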
Quick Check: You want to list all .c files in your project that call malloc. Which command does this?
find . -name '*.c' -exec grep -l 'malloc' {} \; does it: find locates all .c files recursively, then -exec grep -l checks each one for "malloc" and prints only matching filenames. Two near misses are worth recognizing. A command like find . -name 'malloc' searches for files named "malloc", not files containing it. A command like grep -rl 'malloc' *.c looks close but only searches .c files in the current directory: the shell expands *.c before grep runs, so subdirectories are missed even with -r.
A Brief Look at sed
The lecture notes introduce one more text tool: sed (stream editor). While grep finds lines matching a pattern, sed can change them. Its most common use is find-and-replace:
sed 's/TODO/DONE/' README.md
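This substitutes the first TODO on each line with DONE and prints the result. It doesn’t modify the file; it prints the transformed output to stdout. To save the result, redirect with > to a different file, because redirecting onto the input file truncates it before sed can read it.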
Add g for “global” to replace every occurrence on each line, not just the first:
sed 's/TODO/DONE/g' README.md
sed is powerful but deep — we’re only scratching the surface here. For now, know that it exists and that s/old/new/g is the pattern you’ll reach for most often.
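A minimal sketch of the two substitution forms, using a hypothetical one-line README.md with two TODO markers. The output file name README.done.md is made up for the example.

```shell
#!/bin/sh
workdir=$(mktemp -d) || exit 1
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Hypothetical README line with two TODO markers on it.
echo 'TODO: write tests; TODO: update docs' > README.md

# Without g: only the first match on the line is replaced.
first_only=$(sed 's/TODO/DONE/' README.md)

# With g: every match on the line is replaced.
every_match=$(sed 's/TODO/DONE/g' README.md)

echo "$first_only"
echo "$every_match"

# The input file is untouched; redirect to a different file to keep the result.
sed 's/TODO/DONE/g' README.md > README.done.md
original=$(cat README.md)
saved=$(cat README.done.md)
echo "original still reads: $original"
```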
Quick Check: What does sed 's/printf/fprintf/g' main.c do?
It reads main.c, replaces every occurrence of printf with fprintf on every line, and prints the result to stdout. The original main.c is unchanged — you’d need to redirect the output (> newfile.c) or use sed -i to edit in place.
Quick Check: What does the g at the end of sed 's/TODO/DONE/g' file.txt do?
Without g, sed only replaces the first match on each line. With g (global), it replaces every occurrence. For example, if a line has three TODOs, s/TODO/DONE/ fixes only the first one, while s/TODO/DONE/g fixes all three.
Quick Reference
| Task | Command |
|---|---|
| Find text in a file | grep 'pattern' file |
| Find text recursively in .c files | grep -rn 'pattern' --include='*.c' . |
| Find files by name | find . -name '*.c' |
| Find files then search inside them | find . -name '*.c' -exec grep -l 'pattern' {} \; |
| Count matches | grep -c 'pattern' file |
| Show matches with line numbers | grep -n 'pattern' file |
| Find and replace (preview) | sed 's/old/new/g' file |
Try It Yourself
These exercises build on each other. Try them in a terminal:
Exercise 1: In your home directory, use find to locate all .txt files. Then use grep to find which of those files contains the word “hello”. Try combining them: find . -name '*.txt' -exec grep -l 'hello' {} \;
Exercise 2: Create a small .c file with a few printf calls, some comments, and a couple of #include directives. Then:
- Find all lines starting with # using grep '^#' yourfile.c
- Find all lines ending with ; using grep ';$' yourfile.c
- Find lines containing any digit using grep -E '[0-9]+' yourfile.c
Exercise 3: Use grep -rn 'TODO' . in any project directory to find all TODO comments. How many are there? (grep -rc 'TODO' . gives counts per file.)
Pattern Matching Is Everywhere
Regular expressions aren’t just a Unix thing. The patterns you learned with grep carry over to Python (re module), JavaScript (RegExp), and Java (java.util.regex). Most serious text editors support regex search, and most mainstream programming languages have a regex library. The syntax varies slightly, but the core metacharacters (., *, ^, $, [...]) are nearly universal.
When you start writing C programs next week, grep becomes essential for debugging. Whether you are searching for where a variable is declared, finding every function that calls printf, or locating a bug across a multi-file project, grep -rn is often quicker to reach for than an IDE search, and it works on remote servers where there’s no GUI.
Next up: processes. Every command you’ve been running — grep, find, sed — creates a process, a running instance of a program with its own PID and memory space. In Lesson 2.2, you’ll learn to see processes, manage them, and understand how Unix runs multiple programs simultaneously. That’s the foundation you’ll need when we start writing our own C programs and watching them run.
Why this matters: The grep → find → sed trio is the bread and butter of Unix text processing. Professional developers and sysadmins use these tools every day: searching logs for errors, finding configuration files, batch-renaming variables. Mastering them now means you’ll be productive on any Unix system for the rest of your career.