Web Reconnaissance

You scanned a target with nmap and found port 80 open — a web server. Now what? Web recon is the process of examining what a web server reveals about itself: through its HTTP responses, its publicly accessible files, its source code, and its configuration. Most of this information is handed to you willingly by the server. You just have to ask the right questions.

This page covers HTTP fundamentals, curl (the command-line browser), response headers, robots.txt, source code inspection, and cookies. Each section gives you the exact commands to run and what the output means.

Prerequisites: You should be comfortable with piping, grep, and file operations from Weeks 1-2, and with the basics of port scanning from the Network Scanning lesson.

1. HTTP: How the Web Works

Every time you load a web page, your browser sends an HTTP request and the server sends back an HTTP response. This exchange has a strict structure.

Request

GET /index.html HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0
Accept: text/html

Method: GET (retrieve data), POST (send data), PUT (update), DELETE (remove)
Path: /index.html — the resource being requested
Headers: Key-value pairs with metadata (who you are, what you accept, cookies)
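
To make the structure concrete, here is a minimal sketch that assembles the same request by hand. The function name build_request and the User-Agent value are made up for illustration; the only protocol facts assumed are that each line ends in CRLF and that a blank line ends the headers.

```shell
# Print a raw HTTP/1.1 GET request for a host and path.
# HTTP uses CRLF (\r\n) line endings; a blank line ends the headers.
build_request() {
  host=$1
  path=$2
  printf 'GET %s HTTP/1.1\r\n' "$path"
  printf 'Host: %s\r\n' "$host"
  printf 'User-Agent: recon-sketch\r\n'   # hypothetical client name
  printf 'Accept: text/html\r\n'
  printf 'Connection: close\r\n'
  printf '\r\n'
}

# Show the request with the CRs stripped so it prints cleanly.
build_request example.com /index.html | tr -d '\r'
```

If netcat happens to be installed, piping the unstripped output into nc example.com 80 should send it as a real request, though curl is the more practical tool for that.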

Response

HTTP/1.1 200 OK
Server: nginx/1.18.0
Content-Type: text/html; charset=UTF-8
Set-Cookie: session=abc123; Path=/

<html>
<body>Welcome to the site</body>
</html>

Status code: 200 — the result of the request (more on these below)
Headers: Server metadata (software, cookies, caching rules)
Body: The actual content (HTML, JSON, an image, etc.)
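
The three parts can be pulled apart mechanically, because the first line is always the status line and the first blank line always separates headers from body. The sketch below works on a saved copy of the response above; the function names are illustrative, not standard tools.

```shell
# A saved response, matching the example above (CRs already stripped).
response='HTTP/1.1 200 OK
Server: nginx/1.18.0
Content-Type: text/html; charset=UTF-8
Set-Cookie: session=abc123; Path=/

<html>
<body>Welcome to the site</body>
</html>'

# Status code: second field of the first line.
status_code() { head -n 1 | awk '{print $2}'; }

# Headers: lines 2 onward, up to the first blank line.
headers() { awk 'NR > 1 && /^$/ {exit} NR > 1 {print}'; }

# Body: everything after the first blank line.
body() { awk 'seen {print} /^$/ {seen = 1}'; }

printf '%s\n' "$response" | status_code
printf '%s\n' "$response" | headers
printf '%s\n' "$response" | body
```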

This request-response cycle is the entire web. Every page load, every API call, every form submission follows this pattern.

2. HTTP Status Codes

The status code is a three-digit number that tells you what happened. In CTF, these codes reveal whether hidden pages exist, whether authentication is required, and whether the server is misconfigured.

Code	Name	What It Means
200	OK	Request succeeded. The page exists and was returned.
301	Moved Permanently	The resource moved to a new URL (check the `Location` header).
302	Found	Temporary redirect. Often used after login.
400	Bad Request	Your request was malformed.
401	Unauthorized	Authentication required. The server wants credentials.
403	Forbidden	The server understood the request but refuses to fulfil it, whether or not you are logged in. The page usually exists but you cannot access it.
404	Not Found	The resource does not exist at this URL.
405	Method Not Allowed	The method (GET, POST, etc.) is not supported for this URL.
500	Internal Server Error	The server crashed while processing your request.
503	Service Unavailable	The server is overloaded or down for maintenance.

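Key distinction: 401 vs 403. A 401 means “who are you?” (credentials are missing or invalid). A 403 means “I understood the request, and you’re not allowed,” and sending credentials again will not help. Both usually indicate that the resource exists, which is itself useful information, although some servers return them for whole path patterns regardless of what is behind them.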

Checkpoint: You request /admin and get a 403. You request /secret and get a 404. What do you know?

/admin most likely exists but you lack permission to access it, so it is worth investigating further (try different credentials, check for authentication bypass). /secret does not exist at that URL, so move on. A 403 is more interesting than a 404 in recon because it suggests something is there.
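
When you are checking many paths, it helps to turn the table into a quick lookup. This is a rough sketch: the function name classify_status and the wording of each hint are shorthand invented here, not standard terminology.

```shell
# Map a status code to a recon hint.
classify_status() {
  case "$1" in
    200)      echo "exists: readable" ;;
    301|302)  echo "redirect: check the Location header" ;;
    401)      echo "likely exists: needs credentials" ;;
    403)      echo "likely exists: access denied" ;;
    404)      echo "not found" ;;
    405)      echo "exists: try another method" ;;
    5??)      echo "server error: note what triggered it" ;;
    *)        echo "other: look it up" ;;
  esac
}

classify_status 403   # /admin from the checkpoint
classify_status 404   # /secret from the checkpoint
```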

3. curl: The Command-Line Browser

curl sends HTTP requests from the terminal. It speaks the same protocol a browser does, without rendering pages or running JavaScript, and it lets you control every detail and see the raw request and response.

Fetch a page

curl http://example.com

<!doctype html>
<html>
<head>
    <title>Example Domain</title>
</head>
<body>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples.</p>
</body>
</html>

Headers only (HEAD request)

curl -I http://example.com

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 1256
Server: EOS (vny/044F)

The -I flag sends a HEAD request: the server returns headers but not the body. This is a fast way to check what software is running without downloading the entire page.

Verbose mode (see everything)

curl -v http://example.com

> GET / HTTP/1.1
> Host: example.com
> User-Agent: curl/8.1.2
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/html; charset=UTF-8
< Server: EOS (vny/044F)
< Content-Length: 1256
<
<!doctype html>
...

Lines starting with > are what you sent. Lines starting with < are what the server sent back. This is the most useful curl flag for recon — you see the complete conversation.
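
Because the two directions are marked so consistently, they are easy to separate with grep. The sketch below runs on an abbreviated copy of the output above; sent and received are hypothetical helper names.

```shell
# Sample `curl -v` output, abbreviated from the example above.
verbose='> GET / HTTP/1.1
> Host: example.com
> User-Agent: curl/8.1.2
>
< HTTP/1.1 200 OK
< Server: EOS (vny/044F)
<
<!doctype html>'

# Request lines: start with "> " (marker stripped in the output).
sent() { grep '^> ' | sed 's/^> //'; }

# Response header lines: start with "< ".
received() { grep '^< ' | sed 's/^< //'; }

printf '%s\n' "$verbose" | sent
printf '%s\n' "$verbose" | received
```

Against a live server, curl writes these marked lines to stderr, so something like curl -sv http://example.com 2>&1 >/dev/null | received should feed them into the same helper. Real output also carries trailing carriage returns, which tr -d '\r' removes.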

Follow redirects

curl -L http://example.com

Without -L, curl stops at a 301/302 redirect and shows the redirect response. With -L, it follows the chain to the final destination. Useful when a site redirects HTTP to HTTPS or login pages redirect after authentication.
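
Combining -I and -L prints the headers of every hop, which lets you see the whole chain rather than just the final page. The sketch below summarizes such output; the sample chain and the helper name redirect_hops are made up for illustration.

```shell
# Sample output of `curl -sIL` for a two-hop chain (hypothetical paths).
chain='HTTP/1.1 301 Moved Permanently
Location: https://target.ncl.game/

HTTP/1.1 302 Found
Location: /login

HTTP/1.1 200 OK
Content-Type: text/html'

# Print each status code, followed by its redirect target if it has one.
redirect_hops() {
  tr -d '\r' | awk '
    function emit() {
      if (code != "") print (loc == "" ? code : (code " -> " loc))
    }
    /^HTTP\// { emit(); code = $2; loc = "" }
    tolower($1) == "location:" { loc = $2 }
    END { emit() }'
}

printf '%s\n' "$chain" | redirect_hops
```

Intermediate hops are worth reading: a redirect to a login page, for example, tells you where authentication lives.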

Save output to a file

curl -o page.html http://example.com

Checkpoint: You run curl -I on a target and see "Server: Apache/2.4.49". What should you do next?

Search for known vulnerabilities: searchsploit Apache 2.4.49 or search the NVD. Apache 2.4.49 is affected by CVE-2021-41773, a path traversal vulnerability that can let you read files outside the web root when the server's access rules are permissive. Knowing the exact version turns a header into an attack vector.
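
Going from header to search terms can be scripted. This is a minimal sketch that only handles the common "Name/Version (extra)" shape of the Server header; server_query is a hypothetical helper name.

```shell
# Turn a Server header into a "product version" search string.
server_query() {
  tr -d '\r' |
    grep -i '^server:' |
    head -n 1 |
    sed -e 's/^[^:]*:[[:space:]]*//' -e 's/[[:space:]].*$//' -e 's|/| |'
}

printf 'HTTP/1.1 200 OK\nServer: Apache/2.4.49 (Unix)\n' | server_query
```

The result can be passed on as separate words, for example searchsploit $(curl -sI http://target.ncl.game | server_query), left unquoted on purpose so the product and version become two search terms.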

4. HTTP Response Headers

Response headers are metadata the server sends alongside the page content. Many servers reveal far more than they should.

Headers that reveal the tech stack

curl -I http://target.ncl.game

HTTP/1.1 200 OK
Server: Apache/2.4.52 (Ubuntu)
X-Powered-By: PHP/8.1.2
X-Generator: WordPress 6.4.2
Set-Cookie: PHPSESSID=abc123def456; path=/
Content-Type: text/html; charset=UTF-8

From this single response, you now know:

Web server: Apache 2.4.52 on Ubuntu
Language: PHP 8.1.2
Application: WordPress 6.4.2
Session management: PHP sessions (the PHPSESSID cookie)

Each of these has a version you can search for vulnerabilities.
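
On a busy response it helps to filter down to just the headers that name software. The list of header names below is a starting point based on the example above, not an exhaustive set, and fingerprint is a made-up helper name.

```shell
# Keep only the headers that commonly name software and versions.
fingerprint() {
  tr -d '\r' | grep -iE '^(server|x-powered-by|x-generator|set-cookie):'
}

headers='HTTP/1.1 200 OK
Server: Apache/2.4.52 (Ubuntu)
X-Powered-By: PHP/8.1.2
X-Generator: WordPress 6.4.2
Set-Cookie: PHPSESSID=abc123def456; path=/
Content-Type: text/html; charset=UTF-8'

printf '%s\n' "$headers" | fingerprint
```

Against a live target the same filter would sit at the end of curl -sI http://target.ncl.game.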

Security headers (or their absence)

Header	Purpose	Missing = Risk
`Strict-Transport-Security`	Forces HTTPS connections	Downgrade attacks possible
`Content-Security-Policy`	Controls which scripts/resources can load	XSS more likely
`X-Frame-Options`	Prevents page from being embedded in iframes	Clickjacking possible
`X-Content-Type-Options`	Prevents MIME sniffing	Browser may misinterpret files
`X-XSS-Protection`	Legacy XSS filter (deprecated but still seen)	—

In competitions, the absence of security headers is as informative as their presence. A site that never sends Strict-Transport-Security leaves browsers free to connect over plain HTTP, where traffic can be intercepted or downgraded.
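
Checking for absent headers is a simple loop. The sketch below covers the four non-deprecated headers from the table; missing_security_headers is an illustrative name and the sample response is invented.

```shell
# Report which of the security headers from the table are absent.
missing_security_headers() {
  headers=$(tr -d '\r')
  for name in Strict-Transport-Security Content-Security-Policy \
              X-Frame-Options X-Content-Type-Options; do
    printf '%s\n' "$headers" | grep -qi "^$name:" || echo "missing: $name"
  done
}

printf 'HTTP/1.1 200 OK\nServer: nginx\nX-Frame-Options: DENY\n' |
  missing_security_headers
```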

5. robots.txt

robots.txt is a file in a website’s root directory that tells search engine crawlers which pages to skip when indexing. It is a suggestion to well-behaved bots, not access control. Anyone can read it.

curl http://target.ncl.game/robots.txt

User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /api/v1/debug/
Disallow: /old-site/
Sitemap: http://target.ncl.game/sitemap.xml

This file just gave you a list of leads. A Disallow entry does not prove the path exists, but it is a strong hint:

/admin/ likely exists (probably a login panel)
/backup/ likely exists (might contain database dumps or old configs)
/api/v1/debug/ likely exists (debug endpoints often leak internal state)
/old-site/ likely exists (old code is often unpatched)
A sitemap is advertised (it lists the pages the site wants indexed)

In CTF, robots.txt is one of the first things to check. Developers use it to hide directories from Google, but in doing so, they create a list of everything interesting on the server.
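
Turning the Disallow lines into a list of URLs to visit takes one awk filter. This is a sketch under simple assumptions: one path per Disallow line, and wildcard patterns passed through untouched. The name disallowed_urls is hypothetical.

```shell
# Turn Disallow lines into full URLs to visit. Skips empty Disallow values.
disallowed_urls() {
  base=$1
  tr -d '\r' |
    awk -v base="$base" '
      tolower($1) == "disallow:" && $2 != "" { print base $2 }'
}

robots='User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /api/v1/debug/
Disallow: /old-site/
Sitemap: http://target.ncl.game/sitemap.xml'

printf '%s\n' "$robots" | disallowed_urls http://target.ncl.game
```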

# Always check these common paths too
curl http://target.ncl.game/sitemap.xml
curl http://target.ncl.game/.well-known/security.txt
curl http://target.ncl.game/.git/HEAD

The last one (/.git/HEAD) checks whether the site’s Git repository was accidentally deployed. If it returns ref: refs/heads/main, the .git directory is being served, and the source code and commit history can usually be reconstructed from it.
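
Some servers answer every path with a 200 and an HTML page, so it is safer to check the content than the status code. The sketch below accepts the two shapes a HEAD file normally has, a branch reference or a bare 40-character commit hash; looks_like_git_head is an illustrative name.

```shell
# Succeed only if the first line looks like a real .git/HEAD file.
looks_like_git_head() {
  tr -d '\r' | head -n 1 | grep -Eq '^(ref: refs/heads/.+|[0-9a-f]{40})$'
}

printf 'ref: refs/heads/main\n' | looks_like_git_head && echo "git metadata exposed"
printf '<html>Not Found</html>\n' | looks_like_git_head || echo "not a git HEAD file"
```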

Checkpoint: robots.txt lists "Disallow: /backup/db-dump.sql". What should you do?

Request it directly: curl http://target.ncl.game/backup/db-dump.sql -o dump.sql. robots.txt is not access control — it is a polite request to search engines. The file is still publicly accessible unless the server has separate authentication or access rules. A SQL dump could contain usernames, password hashes, and application data.

6. View Source

HTML source code is delivered to your browser in plaintext. It can contain information the rendered page does not show — developer comments, hidden form fields, JavaScript with hardcoded API keys, and debug endpoints.

What to look for

HTML comments often contain notes from developers:

<!-- TODO: remove before production -->
<!-- admin password: changeme123 -->
<!-- API endpoint: /api/v2/internal/users -->

Hidden form fields may contain session tokens, user roles, or flags:

<input type="hidden" name="role" value="admin">
<input type="hidden" name="debug" value="true">
<input type="hidden" name="flag" value="SKY-ABCD-1234">

JavaScript files frequently contain API endpoints, tokens, and logic:

<script>
  const API_KEY = "EXAMPLE_KEY_not_real_abc123";  // hardcoded!
  const DEBUG_ENDPOINT = "/api/debug?token=EXAMPLE";
  fetch('/api/users', { headers: { 'Authorization': 'Bearer ' + API_KEY }});
</script>

How to search effectively

# Download the page source
curl -o page.html http://target.ncl.game

# Search for comments
grep '<!--' page.html

# Search for hidden fields
grep 'type="hidden"' page.html

# Search for JavaScript files
grep '<script' page.html

# Search for keywords across all downloaded files in the current directory
grep -ri "password\|secret\|flag\|api_key\|token" .

For JavaScript files referenced by the page, download and search them separately:

curl -o app.js http://target.ncl.game/js/app.js
grep -i "key\|secret\|password\|endpoint" app.js
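
To find out which JavaScript files to download in the first place, you can list the src attributes from the saved page. This sketch assumes double-quoted attributes that sit on one line, which real HTML does not guarantee; script_sources and the sample file names are made up.

```shell
# List the src attribute of every <script> tag on a saved page.
script_sources() {
  grep -o '<script[^>]*src="[^"]*"' | sed 's/.*src="//; s/"$//'
}

page='<html><head>
<script src="/js/app.js"></script>
<script src="/js/vendor.min.js" defer></script>
<script>console.log("inline")</script>
</head></html>'

printf '%s\n' "$page" | script_sources
```

Inline scripts have no src and are skipped here, so they still need to be read in the page itself.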

Checkpoint: You view the source of a login page and find a hidden field: <input type="hidden" name="isAdmin" value="false">. What might you try?

Intercept the form submission and change the value to "true". Hidden fields are hidden from the user’s view but are sent as normal form data. If the server trusts client-submitted values without validation, changing isAdmin to true could grant admin access. This is a common web vulnerability called parameter tampering.
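
You do not need a proxy to try this: the form body is just text you can rewrite before sending. The sketch below edits one field of a URL-encoded body. The username field and the helper name set_field are hypothetical, and the field name and value are assumed to contain no characters special to sed.

```shell
# Rewrite one field in a URL-encoded form body.
set_field() {
  body=$1 name=$2 value=$3
  printf '%s\n' "$body" | sed -E "s/(^|&)$name=[^&]*/\1$name=$value/"
}

set_field 'username=guest&isAdmin=false' isAdmin true
```

The edited body could then be submitted with curl -d, which sends it as a POST, for example curl -d "$(set_field 'username=guest&isAdmin=false' isAdmin true)" http://target.ncl.game/login.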

7. Cookies and Sessions

HTTP is stateless — the server does not inherently remember who you are between requests. Cookies solve this by having the server store a small piece of data in your browser, which the browser sends back with every subsequent request.

How sessions work

You log in (send username and password)
The server creates a session — a record of your identity stored server-side
The server sends back a session cookie — a random token that maps to your session
Your browser sends this cookie with every future request
The server reads the cookie, looks up the session, and knows who you are

# See the cookies a server sets
curl -v http://target.ncl.game/login 2>&1 | grep -i set-cookie

< Set-Cookie: PHPSESSID=7f3a9b2c1d4e5f6a; path=/; HttpOnly
< Set-Cookie: user_role=guest; path=/

Attribute	Meaning	Security Impact
`HttpOnly`	JavaScript cannot access this cookie	Protects against XSS cookie theft
`Secure`	Only sent over HTTPS	Prevents interception on HTTP
`SameSite`	Controls cross-site sending	Protects against CSRF attacks
`Path`	Cookie sent only for this URL path	Limits scope
`Expires`/`Max-Age`	When the cookie is deleted	Session vs persistent cookie
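
The attribute check in the table can be automated for every cookie a server sets. This is a rough sketch that looks only for HttpOnly, Secure, and SameSite; audit_cookies and the sample cookie values are invented for illustration.

```shell
# For each Set-Cookie line, list which protective attributes are absent.
audit_cookies() {
  tr -d '\r' | grep -i 'set-cookie:' | while IFS= read -r line; do
    name=$(printf '%s\n' "$line" |
      sed -e 's/^.*[Ss]et-[Cc]ookie:[[:space:]]*//' -e 's/=.*$//')
    missing=""
    for attr in HttpOnly Secure SameSite; do
      printf '%s\n' "$line" | grep -qi "; *$attr" || missing="$missing $attr"
    done
    echo "$name: missing${missing:- nothing}"
  done
}

printf '%s\n' \
  'Set-Cookie: PHPSESSID=abc; path=/; HttpOnly; Secure' \
  'Set-Cookie: user_pref=dark; path=/' |
  audit_cookies
```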

Why cookies matter for security

If you steal someone’s session cookie, you become that user, no password needed. This is called session hijacking. If a cookie lacks HttpOnly, JavaScript can read it, so an XSS payload can steal it. If it lacks Secure, it can be sent over unencrypted HTTP, where anyone on the network path can sniff it.

# Send a request with a stolen/modified cookie
curl -b "PHPSESSID=stolen_value_here" http://target.ncl.game/dashboard
curl -b "user_role=admin" http://target.ncl.game/dashboard
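
To replay the cookies a server gave you, the Set-Cookie lines first have to be reduced to the name=value pairs that -b accepts. A minimal sketch, with cookie_string as a hypothetical helper name:

```shell
# Reduce Set-Cookie lines to "name=value; name=value".
# Attributes such as path and HttpOnly are dropped.
cookie_string() {
  tr -d '\r' |
    grep -i 'set-cookie:' |
    sed -e 's/^.*[Ss]et-[Cc]ookie:[[:space:]]*//' -e 's/;.*$//' |
    paste -s -d ';' - |
    sed 's/;/; /g'
}

printf '%s\n' \
  '< Set-Cookie: PHPSESSID=7f3a9b2c1d4e5f6a; path=/; HttpOnly' \
  '< Set-Cookie: user_role=guest; path=/' |
  cookie_string
```

Editing a value in that string before passing it to curl -b is how you would test whether the server trusts a cookie such as user_role.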

Checkpoint: A server sets two cookies: PHPSESSID with HttpOnly and Secure flags, and user_pref with no flags. Which cookie is vulnerable and to what?

user_pref is vulnerable. Without HttpOnly, JavaScript can read it (document.cookie), making it accessible via XSS. Without Secure, it is sent over HTTP connections, making it interceptable via network sniffing. PHPSESSID is better protected — JavaScript cannot access it and it is only sent over HTTPS.

8. Putting It All Together: A Web Recon Checklist

When you find a web server during a competition, run through this sequence:

# 1. Headers — identify the tech stack
curl -I http://target.ncl.game

# 2. robots.txt — find hidden directories
curl http://target.ncl.game/robots.txt

# 3. Sitemap — discover all pages
curl http://target.ncl.game/sitemap.xml

# 4. Source code — search for comments, hidden fields, scripts
curl -o index.html http://target.ncl.game
grep -i '<!--\|hidden\|password\|flag\|secret\|api' index.html

# 5. Common hidden paths
curl -o /dev/null -s -w "%{http_code}\n" http://target.ncl.game/.git/HEAD
curl -o /dev/null -s -w "%{http_code}\n" http://target.ncl.game/.env
curl -o /dev/null -s -w "%{http_code}\n" http://target.ncl.game/wp-login.php
curl -o /dev/null -s -w "%{http_code}\n" http://target.ncl.game/admin/

# 6. Cookies — check security attributes
curl -v http://target.ncl.game 2>&1 | grep -i "set-cookie"

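Step 5 uses -w with %{http_code} to print only the status code; -s hides the progress meter and -o /dev/null discards the body. A 200 or 403 usually means the path exists. A 404 means it does not. Some servers return 200 for every path, so confirm a hit by looking at the content.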
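
Step 5 can be wrapped in a loop that labels each status code with its path. In this sketch the request is kept in its own function, fetch_status, so it can be swapped or stubbed; both function names are made up for illustration.

```shell
# Print the status code for one URL. curl reports 000 if the request fails.
fetch_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Print "CODE PATH" for each path under a base URL.
probe_paths() {
  base=$1
  shift
  for path in "$@"; do
    printf '%s %s\n' "$(fetch_status "$base$path")" "$path"
  done
}

# Example (makes real requests, so it is left commented out):
# probe_paths http://target.ncl.game /.git/HEAD /.env /wp-login.php /admin/
```

Sorting the output groups the paths by status code, which makes the interesting ones stand out.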

Resources

Practice: TryHackMe — Web Fundamentals (search “web”) · PortSwigger Web Security Academy (free, excellent) · OverTheWire — Bandit (curl and HTTP)

Reference: MDN HTTP Reference · curl Manual · HTTP Status Codes

Video: LiveOverflow — Web hacking · John Hammond — web CTF