NCL: Web Reconnaissance
HTTP headers, robots.txt, source inspection, and curl
You scanned a target with nmap and found port 80 open — a web server. Now what? Web recon is the process of examining what a web server reveals about itself: through its HTTP responses, its publicly accessible files, its source code, and its configuration. Most of this information is handed to you willingly by the server. You just have to ask the right questions.
This page covers HTTP fundamentals, curl (the command-line browser), response headers, robots.txt, source code inspection, and cookies. Each section gives you the exact commands to run and what the output means.
Prerequisites: You should be comfortable with piping, grep, and file operations from Weeks 1-2, and with the basics of port scanning from the Network Scanning lesson.
1. HTTP: How the Web Works
Every time you load a web page, your browser sends an HTTP request and the server sends back an HTTP response. This exchange has a strict structure.
Request
GET /index.html HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0
Accept: text/html
- Method: GET (retrieve data), POST (send data), PUT (update), DELETE (remove)
- Path: /index.html, the resource being requested
- Headers: Key-value pairs with metadata (who you are, what you accept, cookies)
Response
HTTP/1.1 200 OK
Server: nginx/1.18.0
Content-Type: text/html; charset=UTF-8
Set-Cookie: session=abc123; Path=/

<html>
<body>Welcome to the site</body>
</html>
- Status code: 200, the result of the request (more on these below)
- Headers: Server metadata (software, cookies, caching rules)
- Body: The actual content (HTML, JSON, an image, etc.)
This request-response cycle is the entire web. Every page load, every API call, every form submission follows this pattern.
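To make the three parts of a response concrete, here is a minimal sketch that splits a saved response into status code, headers, and body. The function names (status_code, response_headers, response_body) are made up for this illustration, and the sample is the response shown above with a blank line separating headers from body.

```shell
# Hypothetical sample: a raw HTTP response saved as text.
# A blank line separates the headers from the body.
response='HTTP/1.1 200 OK
Server: nginx/1.18.0
Content-Type: text/html; charset=UTF-8
Set-Cookie: session=abc123; Path=/

<html>
<body>Welcome to the site</body>
</html>'

# Status code: the second field of the first line.
status_code() {
  printf '%s\n' "$1" | awk 'NR == 1 { print $2 }'
}

# Headers: every line after the status line, up to the first blank line.
response_headers() {
  printf '%s\n' "$1" | tr -d '\r' |
    awk 'NR > 1 && $0 == "" { exit } NR > 1 { print }'
}

# Body: everything after the first blank line.
response_body() {
  printf '%s\n' "$1" | tr -d '\r' |
    awk 'in_body { print } $0 == "" { in_body = 1 }'
}

status_code "$response"
response_headers "$response"
response_body "$response"
```

The tr -d '\r' step is there because real responses end each header line with a carriage return plus newline, which otherwise hides the blank separator line from awk.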
2. HTTP Status Codes
The status code is a three-digit number that tells you what happened. In CTF, these codes reveal whether hidden pages exist, whether authentication is required, and whether the server is misconfigured.
| Code | Name | What It Means |
|---|---|---|
| 200 | OK | Request succeeded. The page exists and was returned. |
| 301 | Moved Permanently | The resource moved to a new URL (check the Location header). |
| 302 | Found | Temporary redirect. Often used after login. |
| 400 | Bad Request | Your request was malformed. |
| 401 | Unauthorized | Authentication required. The server wants credentials. |
| 403 | Forbidden | The server understood the request but refuses to authorize it, whether or not you are logged in. The page usually exists but you cannot access it. |
| 404 | Not Found | The resource does not exist at this URL. |
| 405 | Method Not Allowed | The method (GET, POST, etc.) is not supported for this URL. |
| 500 | Internal Server Error | The server hit an unexpected error while processing your request. |
| 503 | Service Unavailable | The server is overloaded or down for maintenance. |
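Key distinction: 401 vs 403. A 401 means “who are you?” (credentials are missing or invalid). A 403 means “you’re not allowed,” and sending the same credentials again will not change the answer. Both usually indicate the resource exists, which is itself useful information, though some servers return them for an entire path prefix regardless of what is behind it.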
Checkpoint: You request /admin and get a 403. You request /secret and get a 404. What do you know?
/admin almost certainly exists, but the server refuses to let you in, so it is worth investigating further (try different credentials, check for authentication bypass). /secret does not exist, so move on. A 403 is more interesting than a 404 in recon because it suggests something is there.
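When you are probing many paths, it can help to turn each status code into a short note. The sketch below is one way to do that; the function name recon_note and the wording of each note are this example's own summary of the table above, not an official definition, and the four probe results are invented sample data.

```shell
# Map a status code to a short recon note.
recon_note() {
  case "$1" in
    200)     echo "exists: read it" ;;
    301|302) echo "redirect: check the Location header" ;;
    401)     echo "protected: needs credentials" ;;
    403)     echo "protected: access refused, worth a closer look" ;;
    404)     echo "not found: move on" ;;
    405)     echo "exists: try another method" ;;
    5??)     echo "server error: note what input caused it" ;;
    *)       echo "unclassified: check the reference" ;;
  esac
}

# Hypothetical results from probing four paths (status code, then path).
while read -r code path; do
  printf '%s %-10s %s\n' "$code" "$path" "$(recon_note "$code")"
done <<'EOF'
403 /admin
404 /secret
302 /login
500 /search
EOF
```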
3. curl: The Command-Line Browser
curl sends HTTP requests from the terminal. It makes the same kind of requests a browser does, but it does not render the page or run JavaScript, and you control every detail and can see the raw request and response.
Fetch a page
curl http://example.com
<!doctype html>
<html>
<head>
<title>Example Domain</title>
</head>
<body>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples.</p>
</body>
</html>
Headers only (HEAD request)
curl -I http://example.com
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Length: 1256
Server: EOS (vny/044F)
The -I flag sends a HEAD request: the server returns headers but not the body. It is a fast way to check what software is running without downloading the entire page.
Verbose mode (see everything)
curl -v http://example.com
> GET / HTTP/1.1
> Host: example.com
> User-Agent: curl/8.1.2
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/html; charset=UTF-8
< Server: EOS (vny/044F)
< Content-Length: 1256
<
<!doctype html>
...
Lines starting with > are what you sent. Lines starting with < are what the server sent back. This is the most useful curl flag for recon — you see the complete conversation.
Follow redirects
curl -L http://example.com
Without -L, curl stops at a 301/302 redirect and shows the redirect response. With -L, it follows the chain to the final destination. This is useful when a site redirects HTTP to HTTPS or a login page redirects after authentication.
Save output to a file
curl -o page.html http://example.com
Checkpoint: You run curl -I on a target and see "Server: Apache/2.4.49". What should you do next?
Search for known vulnerabilities: searchsploit Apache 2.4.49 or search the NVD. Apache 2.4.49 has CVE-2021-41773, a path traversal vulnerability that can let you read files outside the web root when the server's configuration does not deny access to them. Knowing the exact version turns a header into an attack vector.
4. HTTP Response Headers
Response headers are metadata the server sends alongside the page content. Many servers reveal far more than they should.
Headers that reveal the tech stack
curl -I http://target.ncl.game
HTTP/1.1 200 OK
Server: Apache/2.4.52 (Ubuntu)
X-Powered-By: PHP/8.1.2
X-Generator: WordPress 6.4.2
Set-Cookie: PHPSESSID=abc123def456; path=/
Content-Type: text/html; charset=UTF-8
From this single response, you now know:
- Web server: Apache 2.4.52 on Ubuntu
- Language: PHP 8.1.2
- Application: WordPress 6.4.2
- Session management: PHP sessions (the PHPSESSID cookie)
Each version number is something you can look up in a vulnerability database.
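Pulling those values out by hand gets tedious, so here is a small sketch that extracts one header from saved curl -I output and splits the Server value into product and version. The function name header_value and the variables headers, server, product, and version are assumptions made for this example; the sample headers mirror the response above, including the carriage returns that real HTTP responses carry.

```shell
# Hypothetical saved output of `curl -s -I`, CRLF line endings included.
headers=$(printf '%s\r\n' \
  'HTTP/1.1 200 OK' \
  'Server: Apache/2.4.52 (Ubuntu)' \
  'X-Powered-By: PHP/8.1.2' \
  'X-Generator: WordPress 6.4.2' \
  'Content-Type: text/html; charset=UTF-8')

# Print the value of one header; the name match is case-insensitive.
header_value() {
  printf '%s\n' "$1" | tr -d '\r' | awk -v name="$2" '{
    i = index($0, ":")
    if (i > 0 && tolower(substr($0, 1, i - 1)) == tolower(name)) {
      value = substr($0, i + 1)
      sub(/^[ \t]+/, "", value)   # trim leading whitespace
      print value
      exit
    }
  }'
}

server=$(header_value "$headers" "server")
product=${server%%/*}                          # text before the first "/"
version=${server#*/}; version=${version%% *}   # text after "/", up to a space

# Print the lookup you would run next; nothing is executed here.
echo "searchsploit $product $version"
```

Keep in mind that a server can send any string it likes in these headers, so treat the result as a lead to verify rather than a fact.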
Security headers (or their absence)
| Header | Purpose | Missing = Risk |
|---|---|---|
| Strict-Transport-Security | Forces HTTPS connections | Downgrade attacks possible |
| Content-Security-Policy | Controls which scripts/resources can load | XSS more likely |
| X-Frame-Options | Prevents page from being embedded in iframes | Clickjacking possible |
| X-Content-Type-Options | Prevents MIME sniffing | Browser may misinterpret files |
| X-XSS-Protection | Legacy XSS filter (deprecated but still seen) | — |
In competitions, the absence of security headers is as informative as their presence. A server missing Strict-Transport-Security might accept HTTP connections, allowing interception.
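Checking for absent headers is easy to automate. The following is a rough sketch, assuming you have the response headers saved in a variable; the function name missing_security_headers and the sample headers are made up for illustration, and the deprecated X-XSS-Protection header is deliberately left out of the check.

```shell
# Print each security header from the table that is absent from the
# saved headers passed as the first argument.
missing_security_headers() {
  for name in Strict-Transport-Security Content-Security-Policy \
              X-Frame-Options X-Content-Type-Options; do
    # Header names are case-insensitive, hence grep -i.
    printf '%s\n' "$1" | grep -qi "^$name:" || echo "$name"
  done
}

# Hypothetical saved headers: only X-Frame-Options is set.
headers='HTTP/1.1 200 OK
Server: Apache/2.4.52 (Ubuntu)
x-frame-options: DENY
Content-Type: text/html; charset=UTF-8'

missing_security_headers "$headers"
```

For this sample, the function reports the three headers that are not set and skips X-Frame-Options even though the server sent it in lowercase.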
5. robots.txt
robots.txt is a file in a website’s root directory that tells search engine crawlers which pages to skip when indexing. It is a suggestion to well-behaved bots, not access control. Anyone can read it.
curl http://target.ncl.game/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /api/v1/debug/
Disallow: /old-site/
Sitemap: http://target.ncl.game/sitemap.xml
This file just told you:
- /admin/ exists (probably a login panel)
- /backup/ exists (might contain database dumps or old configs)
- /api/v1/debug/ exists (debug endpoints often leak internal state)
- /old-site/ exists (old code is often unpatched)
- A sitemap exists (lists every page on the site)
In CTF, robots.txt is one of the first things to check. Developers use it to hide directories from Google, but in doing so, they create a list of everything interesting on the server.
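A robots.txt file with dozens of entries is easier to work through as a list of full URLs. This sketch shows one way to build that list from a saved copy of the file; the function name disallowed_urls is an assumption for this example, and the input is the sample file from above.

```shell
# Read robots.txt on stdin and print one full URL per Disallow path.
# The base URL is the first argument; a trailing slash on it is dropped.
disallowed_urls() {
  base=${1%/}
  tr -d '\r' | awk -v base="$base" '
    # Skip empty Disallow lines and the catch-all "/".
    tolower($1) == "disallow:" && $2 != "" && $2 != "/" { print base $2 }
  ' | sort -u
}

disallowed_urls "http://target.ncl.game" <<'EOF'
User-agent: *
Disallow: /admin/
Disallow: /backup/
Disallow: /api/v1/debug/
Disallow: /old-site/
Sitemap: http://target.ncl.game/sitemap.xml
EOF
```

Each printed URL is a candidate to request with curl -I and check the status code.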
# Always check these common paths too
curl http://target.ncl.game/sitemap.xml
curl http://target.ncl.game/.well-known/security.txt
curl http://target.ncl.game/.git/HEAD
The last one (/.git/HEAD) checks whether the site’s Git repository was accidentally deployed. If it returns ref: refs/heads/main, the .git directory is exposed, and the source code and commit history can usually be recovered from it.
Checkpoint: robots.txt lists "Disallow: /backup/db-dump.sql". What should you do?
Request it directly: curl http://target.ncl.game/backup/db-dump.sql -o dump.sql. robots.txt is not access control — it is a polite request to search engines. The file is still publicly accessible unless the server has separate authentication or access rules. A SQL dump could contain usernames, password hashes, and application data.
6. View Source
HTML source code is delivered to your browser in plaintext. It can contain information the rendered page does not show — developer comments, hidden form fields, JavaScript with hardcoded API keys, and debug endpoints.
What to look for
HTML comments often contain notes from developers:
<!-- TODO: remove before production -->
<!-- admin password: changeme123 -->
<!-- API endpoint: /api/v2/internal/users -->
Hidden form fields may contain session tokens, user roles, or flags:
<input type="hidden" name="role" value="admin">
<input type="hidden" name="debug" value="true">
<input type="hidden" name="flag" value="SKY-ABCD-1234">
JavaScript files frequently contain API endpoints, tokens, and logic:
<script>
const API_KEY = "EXAMPLE_KEY_not_real_abc123"; // hardcoded!
const DEBUG_ENDPOINT = "/api/debug?token=EXAMPLE";
fetch('/api/users', { headers: { 'Authorization': 'Bearer ' + API_KEY }});
</script>
How to search effectively
# Download the page source
curl -o page.html http://target.ncl.game
# Search for comments
grep '<!--' page.html
# Search for hidden fields
grep 'type="hidden"' page.html
# Search for script tags (inline code and external JavaScript files)
grep '<script' page.html
# Search for keywords across all downloaded files in the current directory
grep -ri "password\|secret\|flag\|api_key\|token" .
For JavaScript files referenced by the page, download and search them separately:
curl -o app.js http://target.ncl.game/js/app.js
grep -i "key\|secret\|password\|endpoint" app.js
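To find which JavaScript files a page references in the first place, you can pull the src attribute out of each script tag. This is a minimal sketch, not a full HTML parser: the function name script_sources is invented for this example, and it assumes lowercase tags with double-quoted attributes on a single line. The sample page is made up.

```shell
# Read HTML on stdin and print the src value of every <script> tag.
# Assumes lowercase tags and double-quoted attributes.
script_sources() {
  grep -oE '<script[^>]*src="[^"]+"' |
    sed -E 's/.*src="([^"]+)"/\1/' |
    sort -u
}

# Hypothetical page with two external scripts and one inline script.
script_sources <<'EOF'
<html>
<head>
<script src="/js/app.js"></script>
<script type="module" src="/js/admin.min.js"></script>
<script>const DEBUG = true;</script>
</head>
</html>
EOF
```

Inline scripts have no src attribute, so they are skipped here; they are already part of the page you downloaded.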
Checkpoint: You view the source of a login page and find a hidden field: <input type="hidden" name="isAdmin" value="false">. What might you try?
Intercept the form submission and change the value to "true". Hidden fields are hidden from the user’s view but are sent as normal form data. If the server trusts client-submitted values without validation, changing isAdmin to true could grant admin access. This is a common web vulnerability called parameter tampering.
7. Cookies and Sessions
HTTP is stateless — the server does not inherently remember who you are between requests. Cookies solve this by having the server store a small piece of data in your browser, which the browser sends back with every subsequent request.
How sessions work
- You log in (send username and password)
- The server creates a session — a record of your identity stored server-side
- The server sends back a session cookie — a random token that maps to your session
- Your browser sends this cookie with every future request
- The server reads the cookie, looks up the session, and knows who you are
# See the cookies a server sets
curl -v http://target.ncl.game/login 2>&1 | grep -i set-cookie
< Set-Cookie: PHPSESSID=7f3a9b2c1d4e5f6a; path=/; HttpOnly
< Set-Cookie: user_role=guest; path=/
Cookie attributes
| Attribute | Meaning | Security Impact |
|---|---|---|
| HttpOnly | JavaScript cannot access this cookie | Protects against XSS cookie theft |
| Secure | Only sent over HTTPS | Prevents interception on HTTP |
| SameSite | Controls cross-site sending | Protects against CSRF attacks |
| Path | Cookie sent only for this URL path | Limits scope |
| Expires/Max-Age | When the cookie is deleted | Session vs persistent cookie |
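The attribute check can be scripted against the Set-Cookie lines that curl -v prints. The sketch below is one possible approach; the function name audit_cookies and its output wording are assumptions for this example, and it only looks at the three security-related attributes from the table.

```shell
# Read Set-Cookie lines on stdin (for example from `curl -v ... 2>&1`)
# and print each cookie name with the security attributes it lacks.
audit_cookies() {
  tr -d '\r' | grep -i 'set-cookie:' | while IFS= read -r line; do
    cookie=${line#*: }      # drop everything up to and including "Set-Cookie: "
    name=${cookie%%=*}      # the cookie name is the text before the first "="
    missing=""
    for attr in HttpOnly Secure SameSite; do
      printf '%s\n' "$cookie" | grep -qi "; *$attr" || missing="$missing $attr"
    done
    echo "$name: missing${missing:- nothing}"
  done
}

# Sample input: the two cookies shown in the curl output above.
audit_cookies <<'EOF'
< Set-Cookie: PHPSESSID=7f3a9b2c1d4e5f6a; path=/; HttpOnly
< Set-Cookie: user_role=guest; path=/
EOF
```

For the sample input, PHPSESSID is reported as missing Secure and SameSite, and user_role as missing all three.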
Why cookies matter for security
If you steal someone’s session cookie, you become that user, with no password needed. This is called session hijacking. If a cookie lacks HttpOnly, JavaScript can read it, so an XSS attack can steal it. If it lacks Secure, the browser will also send it over unencrypted HTTP, where network sniffing can capture it.
# Send a request with a stolen/modified cookie
curl -b "PHPSESSID=stolen_value_here" http://target.ncl.game/dashboard
curl -b "user_role=admin" http://target.ncl.game/dashboard
Checkpoint: A server sets two cookies: PHPSESSID with HttpOnly and Secure flags, and user_pref with no flags. Which cookie is vulnerable and to what?
user_pref is vulnerable. Without HttpOnly, JavaScript can read it (document.cookie), making it accessible via XSS. Without Secure, it is sent over HTTP connections, making it interceptable via network sniffing. PHPSESSID is better protected — JavaScript cannot access it and it is only sent over HTTPS.
8. Putting It All Together: A Web Recon Checklist
When you find a web server during a competition, run through this sequence:
# 1. Headers — identify the tech stack
curl -I http://target.ncl.game
# 2. robots.txt — find hidden directories
curl http://target.ncl.game/robots.txt
# 3. Sitemap — discover all pages
curl http://target.ncl.game/sitemap.xml
# 4. Source code — search for comments, hidden fields, scripts
curl -o index.html http://target.ncl.game
grep -i '<!--\|hidden\|<script\|password\|flag\|secret\|api' index.html
# 5. Common hidden paths
curl -o /dev/null -s -w "%{http_code}\n" http://target.ncl.game/.git/HEAD
curl -o /dev/null -s -w "%{http_code}\n" http://target.ncl.game/.env
curl -o /dev/null -s -w "%{http_code}\n" http://target.ncl.game/wp-login.php
curl -o /dev/null -s -w "%{http_code}\n" http://target.ncl.game/admin/
# 6. Cookies — check security attributes
curl -v http://target.ncl.game 2>&1 | grep -i "set-cookie"
Step 5 uses -o /dev/null -s to discard the body and silence progress output, and -w with the %{http_code} variable to print only the status code. A 200 or 403 usually means the path exists. A 404 means it does not.
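Step 5 repeats the same command for every path, so it is a natural candidate for a loop. Here is a hedged sketch of that idea: the function name probe_paths is made up, the 5-second --max-time timeout is an arbitrary choice for this example, and the call is left commented out so nothing is sent until you pick a target you are allowed to test.

```shell
# Print "<status> <path>" for each path given after the base URL.
# Assumes curl is installed; a failed connection shows up as code 000.
probe_paths() {
  base=${1%/}   # drop a trailing slash so paths can start with "/"
  shift
  for path in "$@"; do
    code=$(curl -o /dev/null -s --max-time 5 -w '%{http_code}' "$base$path") || true
    printf '%s %s\n' "$code" "$path"
  done
}

# Example call:
# probe_paths http://target.ncl.game /.git/HEAD /.env /wp-login.php /admin/
```

Printing the path next to each code keeps the results readable when you probe more than a handful of paths.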
Resources
Practice: TryHackMe — Web Fundamentals (search “web”) · PortSwigger Web Security Academy (free, excellent) · OverTheWire — Bandit (curl and HTTP)
Reference: MDN HTTP Reference · curl Manual · HTTP Status Codes