ctf Lesson 32 30 min read

NCL: Open Source Intelligence

OSINT methodology, Google dorks, email headers, and digital footprints

Open Source Intelligence (OSINT) Deep Dive

Open Source Intelligence means finding information using publicly available sources. No hacking. No unauthorized access. Just knowing where to look and how to connect the dots. In NCL, OSINT challenges give you a username, email address, image, or domain and ask you to find specific information about it. In the real world, OSINT is how investigators, journalists, law enforcement, and security teams gather intelligence before an engagement ever starts.

This page goes beyond the basics covered in the NCL OSINT overview. It covers the OSINT mindset, whois lookups, Google dorking, the Wayback Machine, image metadata, email header analysis, and social media investigation. Each section is a concrete skill with commands you can run.

Prerequisites: You should be comfortable with basic shell commands from Week 1.


1. The OSINT Mindset

Everything leaves a digital footprint. Every photo has metadata embedded by the camera. Every domain has registration records in a public database. Every website has historical snapshots archived by bots. Every person who has ever posted online has some trace of that post still accessible. The skill is not in using any single tool — it is in knowing the right sequence of lookups and recognizing when one piece of information leads to another.

The OSINT cycle

  1. Define the target — What exactly are you looking for? (a person’s location, a domain’s owner, an image’s origin)
  2. Collect — Use tools to gather raw data (whois, dig, exiftool, Google, archives)
  3. Analyze — Cross-reference findings. A username on one platform may appear on another. A timestamp in EXIF may match a social media post.
  4. Report — In competition, this means entering the answer. In practice, this means documenting what you found and how you found it.

The single most important habit: follow the thread. Every piece of information you find can lead to the next. A domain registration shows a registrant email. That email appears on a GitHub profile. That profile has a repo with commit metadata. That metadata contains a real name and timezone. One data point becomes four.


2. whois: Domain Registration Records

Every domain name is registered through a registrar (like Namecheap, GoDaddy, or Cloudflare). The registration record is stored in a public database queried via the whois protocol. This record contains who registered the domain, when, and sometimes their contact information.

whois example.com

Key fields

Domain Name: EXAMPLE.COM
Registry Domain ID: 2336799_DOMAIN_COM-VRSN
Registrar: RESERVED-Internet Assigned Numbers Authority
Registrar URL: http://res-dom.iana.org
Updated Date: 2023-08-14T07:01:38Z
Creation Date: 1995-08-14T04:00:00Z
Registry Expiry Date: 2024-08-13T04:00:00Z
Registrar WHOIS Server: whois.iana.org
Name Server: A.IANA-SERVERS.NET
Name Server: B.IANA-SERVERS.NET

Field | What It Tells You
Registrar | The company the domain was purchased through
Creation Date | When the domain was first registered
Expiry Date | When registration lapses (expired domains can be re-registered)
Updated Date | Last time the record was modified
Name Server | Which DNS servers handle this domain
Registrant Name/Email | Who registered it (often redacted)

Extracting specific fields

# Who is the registrar?
whois example.com | grep -i "registrar:"

# When was it created?
whois example.com | grep -i "creation date"

# What nameservers does it use?
whois example.com | grep -i "name server"

# Is WHOIS privacy enabled?
whois example.com | grep -i "privacy\|redacted\|withheld"

Use grep -i (case-insensitive) because different registrars format field names differently.
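When you need several fields at once, it can be easier to parse the whole record than to run grep repeatedly. The following is a minimal sketch, assuming you have saved whois output as text; the function name `parse_whois` and the sample record are illustrative, not part of any whois library. Keys are lowercased for the same reason grep -i is used above.

```python
from collections import defaultdict


def parse_whois(text):
    """Collect 'Key: value' lines from whois output into a dict of lists.

    Keys are lowercased because registrars capitalize field names differently.
    Repeated fields (such as Name Server) keep every value.
    """
    fields = defaultdict(list)
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if sep and key and value:
            fields[key].append(value)
    return dict(fields)


# Illustrative sample; in practice, pass the saved output of `whois <domain>`.
SAMPLE = """\
Domain Name: EXAMPLE.COM
Creation Date: 1995-08-14T04:00:00Z
Name Server: A.IANA-SERVERS.NET
Name Server: B.IANA-SERVERS.NET
"""

record = parse_whois(SAMPLE)
print(record["creation date"])
print(record["name server"])
```

Splitting on the first colon only keeps values such as timestamps and URLs intact.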

WHOIS privacy

Many registrars offer WHOIS privacy, sometimes as a paid add-on and often for free or by default, which replaces the registrant’s personal information with proxy details. When you see “Registrant Name: REDACTED FOR PRIVACY” or a proxy email like proxy@domainprivacy.com, the owner’s identity is hidden behind that service. In NCL, you work with what is available: privacy redaction itself can be a valid answer (“WHOIS privacy is enabled”).

Checkpoint: whois shows a creation date of 2002-03-15 and an expiry date of 2025-03-15. The registrant is "Domains By Proxy, LLC". What do you know?

The domain has been registered for over 20 years (a long-lived domain, likely legitimate or well-established). WHOIS privacy is enabled — “Domains By Proxy, LLC” is GoDaddy’s privacy service, not the actual owner. The registrar is likely GoDaddy. You cannot determine the real registrant from whois alone, but other OSINT methods (reverse DNS, certificate transparency logs, web archive history) may reveal it.
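The two observations in this checkpoint (registration span and privacy proxy) can be checked mechanically. This is a rough sketch under stated assumptions: the helper names and the `PRIVACY_HINTS` keyword list are hypothetical, and keyword matching is only a heuristic for spotting proxy registrants.

```python
from datetime import date

# Hypothetical keyword list; extend it as you meet other privacy services.
PRIVACY_HINTS = ("privacy", "redacted", "withheld", "proxy")


def registration_years(creation, expiry):
    """Whole years between two ISO dates (YYYY-MM-DD)."""
    start, end = date.fromisoformat(creation), date.fromisoformat(expiry)
    years = end.year - start.year
    # Subtract one if the anniversary has not been reached yet.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


def looks_private(registrant):
    """Heuristic: does the registrant string look like a privacy service?"""
    lowered = registrant.lower()
    return any(hint in lowered for hint in PRIVACY_HINTS)


print(registration_years("2002-03-15", "2025-03-15"))  # 23
print(looks_private("Domains By Proxy, LLC"))          # True
```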


3. Google Dorking

Google indexes billions of pages and lets you search them with advanced operators. Google dorking (also called Google hacking) uses these operators to find specific types of content that are publicly accessible but not intended to be easily discoverable. This is legal — you are searching Google’s index, not accessing anything unauthorized.

Operators

Operator | Purpose | Example
site: | Only results from a specific domain | site:ewu.edu "computer science"
filetype: | Only files of a specific type | filetype:pdf site:ewu.edu syllabus
inurl: | URL must contain this string | inurl:admin site:example.com
intitle: | Page title must contain this string | intitle:"index of" /backup
intext: | Page body must contain this string | intext:"password" filetype:txt
"exact phrase" | Match this exact string | "default password" filetype:pdf
- | Exclude results containing this term | site:example.com -inurl:blog
cache: | Formerly showed Google’s cached copy of a page; Google retired this operator in 2024, so use the Wayback Machine instead | cache:example.com

Practical examples

Find directory listings (web servers that list files instead of serving a page):

intitle:"index of" site:example.com

If the server has directory listing enabled, you see every file in that directory — including backups, configs, and logs that should not be public.

Find exposed configuration files:

filetype:env site:example.com
filetype:conf inurl:backup
filetype:sql "INSERT INTO"

.env files contain API keys and database credentials. SQL files contain database dumps. Configuration files contain server paths and credentials.

Find login pages and admin panels:

inurl:admin site:example.com
inurl:login site:example.com
intitle:"dashboard" site:example.com

Find documents with sensitive content:

filetype:pdf "confidential" site:example.com
filetype:xlsx "password" site:example.com

Combine operators for precision:

site:example.com filetype:pdf -inurl:public "internal use only"

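This searches example.com for PDFs that contain “internal use only” and whose URL does not contain the string “public” anywhere (which excludes the /public directory, but also any other URL with that word in it).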
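If you build many combined queries, a small helper keeps the operators consistent. The sketch below is illustrative only: `build_dork` and its parameter names are invented for this example, and it covers just the operators used in the combined query above.

```python
from urllib.parse import quote_plus


def build_dork(site=None, filetype=None, phrase=None, exclude_inurl=()):
    """Assemble a search string from a few common operators."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    # Each excluded term becomes its own -inurl: operator.
    parts.extend(f"-inurl:{term}" for term in exclude_inurl)
    if phrase:
        parts.append(f'"{phrase}"')
    return " ".join(parts)


query = build_dork(
    site="example.com",
    filetype="pdf",
    phrase="internal use only",
    exclude_inurl=["public"],
)
print(query)
# URL-encode the query so it can be pasted into a browser address bar.
print("https://www.google.com/search?q=" + quote_plus(query))
```

The first line printed is the same query shown above; the second is a search URL with the colons, quotes, and spaces encoded.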

Checkpoint: You search "intitle:'index of' /backup site:target.com" and find a directory listing with database.sql.gz. What do you do?

Download it: the file is publicly accessible on a web server with directory listing enabled. database.sql.gz is a compressed SQL database dump that likely contains table structures, user data, and possibly credentials. Decompress with gunzip database.sql.gz and search with grep -i "password\|admin\|secret" database.sql. In a competition, this is fair game. In the real world, you would report it to the site owner as a vulnerability.


4. The Wayback Machine

The Internet Archive’s Wayback Machine (web.archive.org) captures snapshots of websites over time. As of 2025, it holds over 800 billion web pages going back to 1996. This matters for OSINT because:

  • Pages that have been deleted still exist in the archive
  • Old versions of a site may reveal information removed from the current version
  • Contact information, employee lists, and technical details change over time

Using it

https://web.archive.org/web/2023*/example.com

This shows all snapshots of example.com from 2023. Click any date to view the archived version.

From the command line

# Check if a URL has been archived
curl -s "https://archive.org/wayback/available?url=example.com" | python3 -m json.tool

{
    "url": "example.com",
    "archived_snapshots": {
        "closest": {
            "status": "200",
            "available": true,
            "url": "http://web.archive.org/web/20231015120000/http://example.com/",
            "timestamp": "20231015120000"
        }
    }
}

# Fetch the snapshot closest to 2023 (-L follows the redirect to the exact timestamp)
curl -sL "https://web.archive.org/web/2023/http://example.com/about.html"
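To use the availability response in a script rather than reading it by eye, you can parse the JSON and convert the 14-digit timestamp. This is a minimal sketch: `closest_snapshot` is a hypothetical helper, and the sample string mirrors the response shape shown above rather than a live result.

```python
import json
from datetime import datetime


def closest_snapshot(response_text):
    """Return (snapshot_url, capture_time) from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    # Wayback timestamps are YYYYMMDDhhmmss.
    captured = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    return closest["url"], captured


# Illustrative sample shaped like the API response above.
SAMPLE = '''{"url": "example.com", "archived_snapshots": {"closest": {
  "status": "200", "available": true,
  "url": "http://web.archive.org/web/20231015120000/http://example.com/",
  "timestamp": "20231015120000"}}}'''

snapshot_url, captured = closest_snapshot(SAMPLE)
print(snapshot_url)
print(captured.isoformat())  # 2023-10-15T12:00:00

# A URL with no snapshots comes back with an empty archived_snapshots object.
print(closest_snapshot('{"url": "nope.invalid", "archived_snapshots": {}}'))  # None
```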

What to look for in old snapshots

  • Removed employee pages: names, emails, titles that are no longer on the current site
  • Old contact forms: may reveal internal email addresses
  • Technology stack changes: what software did they use before? Old versions may still be running on subdomains
  • Deleted blog posts or announcements: may contain sensitive project details
  • robots.txt history: what directories used to be listed (and may still exist)?

Checkpoint: The current version of target.com/about shows no employee information. How would you find old employee names?

Search the Wayback Machine: https://web.archive.org/web/2020*/target.com/about. Browse snapshots from different years. Companies frequently redesign their sites and remove old employee listings, but the archived versions preserve them. Also check target.com/team, target.com/staff, and target.com/people — old pages may use different URL patterns.


5. Image Metadata (EXIF)

Digital cameras and phones embed metadata into every photo they take. This metadata is stored in EXIF (Exchangeable Image File Format) fields inside the image file. It is invisible when viewing the photo but trivially extractable.

What EXIF contains

Field | What It Reveals
Make / Model | Camera or phone manufacturer and model
CreateDate | When the photo was taken
GPSLatitude / GPSLongitude | Where the photo was taken (if GPS was enabled)
Software | Editing software used (Photoshop, Lightroom, etc.)
ImageSize | Resolution in pixels
ExposureTime | Shutter speed
FNumber | Aperture setting
ISO | Sensor sensitivity
LensModel | Specific lens used
OwnerName | Sometimes set by the camera owner

Extracting metadata with exiftool

# Dump all metadata
exiftool photo.jpg

ExifTool Version Number         : 12.70
File Name                       : photo.jpg
Camera Model Name               : iPhone 14 Pro
Create Date                     : 2024:06:15 14:23:07
GPS Latitude                    : 47 deg 39' 23.40" N
GPS Longitude                   : 117 deg 25' 12.80" W
Software                        : 17.5.1
Image Size                      : 4032x3024

From this single photo, you now know: it was taken with an iPhone 14 Pro running iOS 17.5.1, on June 15, 2024 at 2:23 PM, at GPS coordinates 47.6565, -117.4202 (Spokane, Washington).

Targeted extraction

# Specific fields (-s3 outputs value only, no label)
exiftool -s3 -GPSPosition photo.jpg      # 47 deg 39' 23.40" N, 117 deg 25' 12.80" W
exiftool -s3 -CreateDate photo.jpg        # 2024:06:15 14:23:07
exiftool -s3 -Make photo.jpg              # Apple
exiftool -s3 -Model photo.jpg             # iPhone 14 Pro

GPS coordinate conversion

EXIF stores GPS coordinates as degrees, minutes, and seconds (DMS), with the hemisphere (N/S, E/W) recorded separately. To convert to decimal degrees:

Decimal = Degrees + (Minutes / 60) + (Seconds / 3600)

Latitude:  47 + 39/60 + 23.4/3600 = 47.6565
Longitude: 117 + 25/60 + 12.8/3600 = 117.4202 → -117.4202 (West is negative)

Paste 47.6565, -117.4202 into Google Maps to see the location.
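The conversion above is easy to script. The sketch below assumes the coordinate text looks like exiftool's default output (for example `47 deg 39' 23.40" N`); the function names and the regular expression are written for this lesson and are not part of exiftool.

```python
import re

# Matches strings such as: 47 deg 39' 23.40" N
DMS_PATTERN = re.compile(r"(\d+) deg (\d+)' ([\d.]+)\" ([NSEW])")


def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert DMS to decimal degrees; South and West are negative."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if ref.upper() in ("S", "W") else value


def parse_dms(text):
    """Parse one exiftool-style DMS string into decimal degrees."""
    match = DMS_PATTERN.search(text)
    if match is None:
        raise ValueError(f"not a DMS coordinate: {text!r}")
    d, m, s, ref = match.groups()
    return dms_to_decimal(int(d), int(m), float(s), ref)


lat = parse_dms('47 deg 39\' 23.40" N')
lon = parse_dms('117 deg 25\' 12.80" W')
print(f"{lat:.4f}, {lon:.4f}")  # 47.6565, -117.4202
```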

Metadata beyond images

exiftool works on PDFs, Office documents, audio files, and video files:

exiftool document.pdf      # Author, creation date, software
exiftool report.docx       # Author, company, revision count
exiftool recording.mp3     # Title, artist, duration, software

Checkpoint: You extract GPS coordinates from a suspect's photo. The coordinates are 47 deg 39' 23.4" N, 117 deg 25' 12.8" W. Where is this, and what else does the metadata tell you?

The coordinates convert to 47.6565, -117.4202 — this is in Spokane, Washington (near the EWU campus area). Combined with the timestamp, you know exactly where and when the photo was taken. Combined with the device model, you know what phone the person uses. This is why privacy-conscious people strip EXIF data before sharing photos.


6. Email Header Analysis

Every email contains headers — metadata added by each server that handles the message as it travels from sender to recipient. These headers record the path the email took, the originating IP address, and whether authentication checks passed. Reading them reveals whether an email is legitimate or spoofed.

Viewing email headers

In Gmail: open the email, click the three dots, “Show original.” In Outlook: open the email, File > Properties > Internet Headers. This gives you the raw headers.

Reading headers

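Read the Received: headers bottom to top. Each server adds its own Received: line above the existing ones, so the bottom line is the earliest hop and each line above it is a later relay. Lines recorded before the message reached a server you trust can be forged by the sender, so give the most weight to the hops added by the recipient’s own mail servers.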

Received: from mail-yw1-f169.google.com (mail-yw1-f169.google.com [209.85.128.169])
        by mx.recipient.com with ESMTPS id abc123
        for <user@recipient.com>; Mon, 15 Jan 2024 10:30:45 -0800

Received: by mail-yw1-f169.google.com with SMTP id abc456
        for <user@recipient.com>; Mon, 15 Jan 2024 10:30:44 -0800

From: Alice <alice@gmail.com>
To: user@recipient.com
Subject: Meeting tomorrow
Date: Mon, 15 Jan 2024 10:30:42 -0800
Message-ID: <unique-id@mail.gmail.com>
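Python's standard `email` module can pull the Received: chain out of a raw message and list it in travel order. This is a small sketch using the sample headers above; `relay_path` and `IP_IN_BRACKETS` are names made up for this example.

```python
import re
from email import message_from_string

# Relay IPs usually appear in square brackets inside a Received: line.
IP_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")

RAW = """\
Received: from mail-yw1-f169.google.com (mail-yw1-f169.google.com [209.85.128.169])
        by mx.recipient.com with ESMTPS id abc123
        for <user@recipient.com>; Mon, 15 Jan 2024 10:30:45 -0800
Received: by mail-yw1-f169.google.com with SMTP id abc456
        for <user@recipient.com>; Mon, 15 Jan 2024 10:30:44 -0800
From: Alice <alice@gmail.com>
To: user@recipient.com
Subject: Meeting tomorrow

(body omitted)
"""


def relay_path(raw_message):
    """Return Received: headers oldest-first, with folded lines joined."""
    msg = message_from_string(raw_message)
    hops = msg.get_all("Received", [])
    # Headers are stored newest-first, so reverse them to follow the message.
    return [" ".join(hop.split()) for hop in reversed(hops)]


path = relay_path(RAW)
for number, hop in enumerate(path, start=1):
    print(number, hop, IP_IN_BRACKETS.findall(hop))
```

Hop 1 is the bottom header (the earliest server); the last hop is the recipient's server, which also records the IP it accepted the message from.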

Key header fields

Header | Purpose | What to Look For
From: | Who the email claims to be from (easily spoofable) | Does it match the actual sender?
Return-Path: | Envelope sender where bounces go; this is the address SPF checks | A domain unrelated to From is a warning sign, though mailing lists and bulk senders legitimately differ
Received: | Server hop (each server adds one) | Read bottom-to-top for the message path
X-Originating-IP: | IP of the original sender (non-standard; only some providers add it) | Geolocation hints at the true origin
Message-ID: | Unique identifier | Domain after @ usually matches the sending server
Authentication-Results: | SPF, DKIM, DMARC results | fail = likely spoofed; pass = authenticated for that domain, which may still be a lookalike

Detecting spoofed emails

A spoofed email has a forged From: header. The authentication headers reveal the truth:

Authentication-Results: mx.google.com;
       spf=fail (google.com: domain of attacker@evil.com does not
       designate 203.0.113.45 as permitted sender);
       dkim=fail;
       dmarc=fail

  • SPF fail: the sending IP is not authorized to send mail for the envelope sender’s domain
  • DKIM fail: the cryptographic signature does not verify
  • DMARC fail: neither SPF nor DKIM passed for a domain aligned with the From: address; the domain’s published policy (none, quarantine, or reject) then tells the receiver what to do with the message

All three failing is strong evidence of a spoofed or phishing email.
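When triaging many messages, the three verdicts can be pulled out of the Authentication-Results value with a regular expression. A rough sketch follows, assuming a header shaped like the example above; `auth_verdicts` is a hypothetical helper, the IP in the sample is a documentation address, and real headers can carry extra properties or several DKIM results that this simple pattern does not distinguish.

```python
import re

# Captures pairs such as spf=fail, dkim=pass, dmarc=none.
VERDICT = re.compile(r"\b(spf|dkim|dmarc)=(\w+)", re.IGNORECASE)


def auth_verdicts(header_value):
    """Map each mechanism to its result; a later duplicate overwrites an earlier one."""
    return {
        mechanism.lower(): result.lower()
        for mechanism, result in VERDICT.findall(header_value)
    }


# Illustrative header value (everything after "Authentication-Results:").
HEADER = """mx.google.com;
       spf=fail (google.com: domain of attacker@evil.com does not
       designate 203.0.113.45 as permitted sender);
       dkim=fail;
       dmarc=fail"""

verdicts = auth_verdicts(HEADER)
print(verdicts)
print(all(result == "fail" for result in verdicts.values()))  # True
```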

Checkpoint: An email claims to be from security@bankofamerica.com. The Received headers show it originated from IP 185.234.72.15, and SPF, DKIM, and DMARC all show "fail". Is this legitimate?

No. Three authentication failures mean the email did not come from Bank of America’s mail servers. The originating IP (185.234.72.15) is not authorized to send mail for bankofamerica.com. This is a phishing email with a spoofed From header. The From field in email is trivially forgeable — the authentication headers are what reveal the truth.


7. Social Media Intelligence

Public social media profiles are one of the richest OSINT sources. People voluntarily post their location, workplace, daily routines, travel plans, relationships, and technical interests. In competitions, OSINT challenges often provide a username and ask you to find information across platforms.

Username enumeration

The same username often appears across multiple platforms. Given one, check for the others:

  • GitHub: https://github.com/<username>
  • Twitter/X: https://twitter.com/<username>
  • Reddit: https://reddit.com/u/<username>
  • LinkedIn: Search by name
  • Instagram: https://instagram.com/<username>
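  • Stack Overflow: Search by username

# Check whether a profile URL resolves (200 usually means it exists, 404 that it does not)
# -L follows redirects; some platforms block or rate-limit scripted requests, so confirm in a browser
curl -o /dev/null -s -L -w "%{http_code}\n" https://github.com/targetuser
curl -o /dev/null -s -L -w "%{http_code}\n" https://reddit.com/u/targetuser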
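The same check can be scripted across several platforms. The sketch below is an assumption-laden example: the `PROFILE_PATTERNS` table simply reuses the URL shapes listed above, the User-Agent string is made up, and `http_status` is defined but not called here because some sites block automated requests or answer 200 for missing profiles.

```python
import urllib.error
import urllib.request
from urllib.parse import quote

# URL shapes taken from the list above; add or adjust platforms as needed.
PROFILE_PATTERNS = {
    "GitHub": "https://github.com/{u}",
    "Twitter/X": "https://twitter.com/{u}",
    "Reddit": "https://reddit.com/u/{u}",
    "Instagram": "https://instagram.com/{u}",
}


def candidate_profiles(username):
    """Build one candidate profile URL per platform for a username."""
    safe = quote(username, safe="")
    return {site: pattern.format(u=safe) for site, pattern in PROFILE_PATTERNS.items()}


def http_status(url, timeout=10):
    """Return the final HTTP status code for a URL (redirects are followed)."""
    request = urllib.request.Request(url, headers={"User-Agent": "osint-lesson-sketch"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx/5xx; the code is still the answer we want.
        return error.code


for site, url in candidate_profiles("targetuser").items():
    print(f"{site}: {url}")
```

Calling `http_status(url)` on each candidate gives the same information as the curl commands; treat anything other than a clear 200 or 404 as inconclusive.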

What people reveal accidentally

  • Location: Check-ins, geotagged photos, “just moved to…” posts
  • Employer/role: LinkedIn profiles, “excited to start at…” posts
  • Technical stack: GitHub repos, Stack Overflow questions, blog posts
  • Schedule: “Working late again”, “flight delayed”, regular posting times reveal timezone
  • Relationships: Tagged photos, mutual follows, group memberships
  • Email: GitHub commit metadata often contains the committer’s email

GitHub as an OSINT goldmine

# View public repos (may reveal projects, employers, skills)
curl https://api.github.com/users/targetuser/repos

# View commit author emails for one repo (replace <repo> with a name from the repo list)
curl -s https://api.github.com/repos/targetuser/<repo>/commits | grep '"email"'

Git commits embed the author’s name and email address. Even if someone uses a pseudonym on their profile, their commits may contain their real email from their git config. Users who have enabled GitHub’s email privacy setting show a users.noreply.github.com address instead.
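Once you have the JSON from a repository's commits endpoint, the author identities can be de-duplicated in a few lines. This is a sketch only: `commit_identities` is a hypothetical helper, and the sample data is invented to match the nested `commit.author` shape that the GitHub REST API returns for commit listings.

```python
import json

NOREPLY_SUFFIX = "@users.noreply.github.com"


def commit_identities(commits_json):
    """Return sorted unique (name, email) pairs, skipping GitHub noreply addresses."""
    seen = set()
    for entry in json.loads(commits_json):
        author = entry.get("commit", {}).get("author") or {}
        name, email = author.get("name", ""), author.get("email", "")
        if email and not email.endswith(NOREPLY_SUFFIX):
            seen.add((name, email))
    return sorted(seen)


# Invented sample shaped like a commits listing (most fields omitted).
SAMPLE = json.dumps([
    {"sha": "aaa111", "commit": {"author": {"name": "Jordan Example", "email": "jordan@example.org"}}},
    {"sha": "bbb222", "commit": {"author": {"name": "Jordan Example", "email": "jordan@example.org"}}},
    {"sha": "ccc333", "commit": {"author": {"name": "cyb3rhunter42", "email": "12345+cyb3rhunter42@users.noreply.github.com"}}},
])

identities = commit_identities(SAMPLE)
print(identities)  # [('Jordan Example', 'jordan@example.org')]
```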

Ethics boundary

OSINT uses only public information. Accessing private accounts, guessing passwords, social engineering people into revealing information, or using unauthorized tools to bypass privacy settings is not OSINT — it is unauthorized access. In NCL and professional practice, the line is clear: if it requires authentication or deception to access, it is out of scope.

Checkpoint: You are given the username "cyb3rhunter42" and asked to find the person's real name. Their GitHub profile shows no name, but they have 12 public repos. How do you proceed?

Check their Git commit metadata: curl https://api.github.com/repos/cyb3rhunter42/<repo>/commits — the commit author field contains name and email, which are set in the committer’s local git config and often contain their real identity. Also check other platforms with the same username (Reddit, Twitter, Stack Overflow) — one of them may use a real name. Finally, check if the email address from Git commits appears in any other public contexts.


8. OSINT Toolbox Reference

A summary of every tool and technique covered, organized by target type:

Given a domain

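Step | Command / Tool | What You Get
WHOIS | whois example.com | Registrar, dates, nameservers, registrant
DNS | dig example.com MX (repeat for A, NS, TXT; many servers refuse ANY queries) | A, MX, NS, TXT records
Resolve a host | dig +short @dns.google www.example.com | IP address of a known or guessed hostname (dig does not enumerate subdomains; use certificate transparency for that)
Web archive | web.archive.org/web/*/example.com | Historical snapshots
Certificate transparency | crt.sh/?q=example.com | Certificates logged for the domain (reveals subdomains)
robots.txt | curl http://example.com/robots.txt | Paths the site asks crawlers to skip (often hidden directories)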
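Certificate transparency results are the most practical source of subdomains in this table. The sketch below assumes you have saved crt.sh's JSON output (for example from crt.sh/?q=example.com&output=json) and that each entry carries a `name_value` field holding one or more hostnames separated by newlines; the helper name and the sample data are illustrative.

```python
import json


def subdomains_from_crtsh(payload, domain):
    """Collect unique hostnames under a domain from crt.sh JSON output."""
    names = set()
    for cert in json.loads(payload):
        for name in cert.get("name_value", "").splitlines():
            name = name.strip().lower()
            # Wildcard certificates are listed as *.sub.example.com.
            if name.startswith("*."):
                name = name[2:]
            if name == domain or name.endswith("." + domain):
                names.add(name)
    return sorted(names)


# Invented sample shaped like crt.sh output (other fields omitted).
SAMPLE = json.dumps([
    {"name_value": "www.example.com\n*.dev.example.com"},
    {"name_value": "example.com"},
    {"name_value": "mail.example.com"},
    {"name_value": "www.example.com"},
])

print(subdomains_from_crtsh(SAMPLE, "example.com"))
```

Each hostname found this way can then be resolved with dig or looked up in the Wayback Machine.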

Given an image

Step | Command / Tool | What You Get
Metadata | exiftool photo.jpg | Camera, timestamp, GPS, software
Reverse image search | Google Images, TinEye | Where else this image appears
GPS lookup | Google Maps with decimal coordinates | Physical location

Given an email address

Step | Command / Tool | What You Get
Domain WHOIS | whois <domain> | Organization info
Email headers | “Show original” in email client | Originating IP, auth results
PGP key servers | gpg --search-keys <email> | Public key, fingerprint, associated identities
Username search | Check the part before @ on social platforms | Cross-platform presence
Have I Been Pwned | haveibeenpwned.com | Whether the email appeared in data breaches

Given a username

Step | Action | What You Get
Platform check | Visit github.com, twitter.com, reddit.com with the username | Cross-platform profiles
GitHub commits | Check commit author metadata | Real name and email
Google search | "username" site:reddit.com | Historical posts and comments
Wayback Machine | Archived versions of profiles | Deleted posts or old identities

Resources

Practice: TryHackMe OSINT rooms · picoCTF (OSINT category) · OSINT Framework (tool directory) · Trace Labs (real-world OSINT for missing persons)

Reference: CyberChef (encoding/decoding) · crt.sh (certificate transparency) · Have I Been Pwned · Google Advanced Search

Video: John Hammond — OSINT CTF walkthroughs · The OSINT Curious Project · David Bombal — Google dorking