Web Application Black-Box Testing

July 6, 2023

Black box testing

In this article we will outline various black-box testing techniques and their usefulness in performing offensive tests against web applications. We will also examine the ways in which payloads are designed and how to detect and distinguish behaviours generated by applications.

Purpose of our testing

When working in a black-box testing environment, we do not have access to the source code. This means we cannot see how the function we are testing actually works or how it handles our input data. A black-box test is therefore performed to build an understanding of how the different aspects of the function behave.
When we perform a black box test, we usually want a result that answers the questions we have about the function being tested – such as:

  • What is the purpose of this function?
  • What does the function need from us to work?
  • What inputs does the function expect?
  • What inputs are allowed?
  • What is the data type of our input data (such as int, string or array)?
  • Which technologies may be related to this function (such as database or API)?
  • What are the possible security concerns?

The image below provides a basic visual overview of a black box workflow:

To get an understanding of a feature we are testing, we collect all the data it generates from our input. The longer our test runs, the better we can understand how the feature works. In the end, we hopefully have a clear picture of what we are dealing with.


There are many black-box testing techniques and they all essentially share one primary goal: to improve our understanding of the function and/or trigger an unexpected behaviour. These behaviours may include:

  • Code coverage
  • Process time
  • Error messages
  • Input validations
  • Crashes

State transition

This technique helps us analyze how a system behaves under different input conditions. It’s a useful testing technique to start with as it helps the tester better understand the workflow of the function in a specific use case. The tester can then adapt future testing techniques accordingly to provide a better result.
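As a hedged sketch, the idea can be modelled in Python: the transition table, state names and event names below are invented for illustration, standing in for the workflow a tester maps out by observation.

```python
# State-transition testing sketch: model the observed workflow of a
# hypothetical login feature as a state machine, then feed it input
# sequences to see which paths are accepted and which are undefined.

TRANSITIONS = {
    ("start", "open_form"): "form",
    ("form", "submit_valid"): "logged_in",
    ("form", "submit_invalid"): "form",
    ("logged_in", "logout"): "start",
}

def run_sequence(inputs, state="start"):
    """Apply each input in order; return the final state, or 'error'
    on an undefined transition (a candidate for unexpected behaviour)."""
    for event in inputs:
        key = (state, event)
        if key not in TRANSITIONS:
            return "error"
        state = TRANSITIONS[key]
    return state

# A valid path reaches the logged-in state...
print(run_sequence(["open_form", "submit_valid"]))  # logged_in
# ...while submitting before opening the form is undefined.
print(run_sequence(["submit_valid"]))               # error
```

Undefined transitions are exactly the use cases worth probing further with the techniques that follow.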


Fuzzing

Fuzzing is usually used in combination with automation tools to detect unexpected behaviours, crashes or outputs from a function. By leveraging automation, this well-known technique can rapidly test a wide variety of payloads to discover unusual behaviours.

Generation and mutation-based fuzzing

There are multiple types of fuzzing and two of the most interesting are generation-based and mutation-based fuzzing.
A generation-based fuzzer generates a fresh input (payload) from scratch for each new test attempt. A mutation-based fuzzer, by contrast, generates inputs by modifying a previously used input. A mutation-based fuzzer therefore relies on the quality of that previous input, and on the behaviour of the tested target, to guide its mutations. The images below give a basic idea of how these two fuzzing methods can work:
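A minimal mutation-based fuzzer can be sketched in a few lines of Python. Everything here is illustrative: the seed input, the set of "special" characters and the three mutation strategies are assumptions, not the behaviour of any real tool.

```python
import random

# Toy mutation-based fuzzer: start from a known-good seed input and
# repeatedly apply small random mutations (character replacement,
# insertion, truncation). A generation-based fuzzer would instead build
# each input from scratch, e.g. from a grammar of the expected format.

SPECIALS = "\"'`\\<>{}%$"

def mutate(payload, rng):
    choice = rng.randrange(3)
    if choice == 0 and payload:                        # replace one character
        pos = rng.randrange(len(payload))
        return payload[:pos] + rng.choice(SPECIALS) + payload[pos + 1:]
    if choice == 1:                                    # insert a character
        pos = rng.randrange(len(payload) + 1)
        return payload[:pos] + rng.choice(SPECIALS) + payload[pos:]
    return payload[: rng.randrange(len(payload) + 1)] # truncate

def fuzz(seed, rounds=5, rng=None):
    rng = rng or random.Random(1)   # fixed seed for reproducible runs
    current = seed
    for _ in range(rounds):
        current = mutate(current, rng) or seed         # never mutate to empty
        yield current

for candidate in fuzz("id=1234"):
    print(candidate)
```

In a real campaign, each candidate would be sent to the target and mutations that trigger interesting behaviour would be kept as new seeds.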


Error guessing

Error guessing relies on the tester’s creativity and experience to produce test cases that might trigger errors in the system. By leveraging the tester’s expertise and ability to think outside the box, this technique can produce major breakthroughs and results that other methods would struggle to achieve.

Regression testing

This technique is used to check whether a function still works as expected after any change is made to the code. The function is tested before and after the system change and the results are compared in order to spot any unexpected behaviour.
For instance, imagine a function works flawlessly within a web application that uses an old version of a particular content management system (CMS), but starts behaving strangely when the CMS is updated to the latest version. This indicates the presence of a potential security issue that did not exist before.
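The before/after comparison can be sketched as follows; `render_v1` and `render_v2` are hypothetical stand-ins for the same function before and after a CMS upgrade.

```python
# Regression-testing sketch: snapshot the function's output for a fixed
# set of inputs before a change, re-run the same inputs afterwards, and
# report every input whose output differs.

def render_v1(name):
    # old version escapes angle brackets
    return "Hello, " + name.replace("<", "&lt;")

def render_v2(name):
    # hypothetical upgrade that silently stopped escaping
    return "Hello, " + name

def regression_diff(inputs, before, after):
    """Return the inputs whose output changed between versions."""
    return [i for i in inputs if before(i) != after(i)]

cases = ["alice", "<script>"]
print(regression_diff(cases, render_v1, render_v2))  # ['<script>']
```

The changed case is exactly the kind of result that flags a potential new security issue after an update.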

Payload creation

A payload can produce useful results whether it is randomly generated, manually crafted or even consists of a single character.

The advantage of creating your own payloads is that they are likely to be unique. This is useful in achieving our goal of triggering an unexpected behaviour from a function being tested but without exploiting any vulnerability.

A simple example is trying to find an XSS (Cross-Site Scripting) vulnerability.
Let’s assume you use a payload like "><script>alert(1)</script>. If that payload would have worked and exploited an XSS vulnerability, then the payload "><h1>hello</h1> should be enough to prove a potential XSS without actually exploiting it.
In doing so you can manually confirm the existence of a security bug without risking being filtered, WAF-blocked, CSP-blocked, etc.
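The probe idea can be sketched in Python; the two response bodies below are simulated stand-ins for a vulnerable server that echoes raw input and a safe server that encodes it.

```python
from html import escape

# Non-exploiting XSS probe: inject a harmless marker such as
# "><h1>hello</h1> and check whether the response reflects it unescaped.
# In practice `body` would come from an HTTP response.

PROBE = '"><h1>hello</h1>'

def reflected_unescaped(body, probe=PROBE):
    """True if the probe appears verbatim (unencoded) in the response."""
    return probe in body

vulnerable_body = '<input value="' + PROBE + '">'               # echoes raw input
safe_body = '<input value="' + escape(PROBE, quote=True) + '">' # HTML-encodes it

print(reflected_unescaped(vulnerable_body))  # True  -> likely XSS
print(reflected_unescaped(safe_body))        # False -> input was encoded
```

A verbatim, unencoded reflection is strong evidence of a bug without ever executing script in a victim context.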
An effective fuzzing payload might look like: <z>"z'z`%}})z${{z\. This payload does not exploit any specific bug but contains syntax that will hopefully trigger unexpected behaviour in the application being tested. The image below describes the structure of the payload and the purpose of each character:

Payload structure explained

The purpose of this payload is to trigger general errors, quote outbreaks, template engine errors, XSS detection, format injection and exposure of invalid input messages from the target.
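A rough sketch of how responses to this payload might be triaged: the error-indicator patterns below are illustrative examples of strings worth matching, not a complete list.

```python
import re

# After sending the polyglot payload <z>"z'z`%}})z${{z\ into an input,
# scan the response text for signals that one of its parts triggered
# something: server errors, template-engine messages, or the raw
# payload reflected back unescaped.

PAYLOAD = '<z>"z\'z`%}})z${{z\\'

ERROR_PATTERNS = [
    r"syntax\s*error", r"unexpected token", r"traceback",
    r"template", r"sqlstate", r"stack trace",
]

def classify_response(body):
    """Return the list of indicators found in a response body."""
    hits = [p for p in ERROR_PATTERNS if re.search(p, body, re.I)]
    if PAYLOAD in body:
        hits.append("raw reflection")
    return hits

print(classify_response("500 Internal Error: SyntaxError: unexpected token"))
print(classify_response("Thanks for your feedback!"))  # []
```

Any non-empty result marks an input section worth investigating manually.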

References to use for payloads

By adjusting our payload to the technology used in the application, we have a better chance of triggering unexpected behaviour in our target.
However, designing a payload for each specific context is very time-consuming. A compromise is to create a polyglot payload, which accommodates the structure of most technologies and can test applications in multiple contexts.
For example, the most common syntax errors are an odd number of quotation marks, square brackets, parentheses or backslashes. Armed with this knowledge, we can craft a payload with high accuracy for our general tests.

Structural variations

The structure of a payload is its most important part. Imagine that a payload triggers a server error in a web application (unexpected behaviour) and the cause centers on a backslash character.
You tested two payloads of identical length (number of characters), but only one triggered the unexpected behaviour:

Payload 1: 'z\
Payload 2: \z'

But which one triggered the unexpected output? Let’s consider theories that could explain the behaviour for each payload.

  1. First payload: It is very likely that the payload broke out of an input section by escaping the closing quote with the backslash character. This would result in a syntax error, as the backend is left with an odd number of quotes.
  2. Second payload: The backend tries to resolve \z as an escape sequence. A valid escape sequence like \n or \t would not have triggered the error; because the payload contains the invalid sequence \z, the server fails to resolve it, resulting in an escape-sequence error.

Bear in mind that if only one payload is used, there is a 50% chance that you fail to detect the unexpected behaviour. Testing multiple structural variations increases your chances of making breakthroughs.
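Generating all structural variations of a small character set is straightforward; the character set below is just the quote/backslash example from above, with a filler character standing in for ordinary input.

```python
from itertools import permutations

# Structural-variation sketch: produce every ordering of a small set of
# "interesting" characters plus a filler, so that both the
# backslash-before-quote and quote-before-backslash variants (and every
# other ordering) get tested.

def structural_variants(chars=("'", "\\"), filler="z"):
    pieces = list(chars) + [filler]
    return sorted({"".join(p) for p in permutations(pieces)})

for variant in structural_variants():
    print(variant)
```

Both payloads from the example above appear in the output, so neither ordering is missed.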

Recognizing code patterns and behaviours

When a payload triggers an unexpected behaviour in an application it prompts the question: where is the behaviour triggered and how?

Reflected output

When our test payload is reflected back in the response, we can more easily determine where the behaviour is triggered. A reflection of an input can come from several different sources, such as DOM JavaScript or a backend server. The methods below can be helpful in determining which source(s) your input is reflected from.

DOM JavaScript
The fastest way to detect if an input is reflected by client-side JavaScript is simply to make a request to the application without having client-side JavaScript enabled. This way you can see if your input is reflected even when JavaScript is not running on the client side – and therefore whether client-side JavaScript is reflecting the input.

Keep in mind that client-side JavaScript is also often used to make cross-origin requests in order to retrieve data from other servers. It is common to include user input in the outgoing request, and sometimes your input is modified by this process before being reflected.
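The comparison can be sketched as follows; the raw and rendered bodies are simulated here, whereas in practice the raw body would come from a request made with JavaScript disabled and the rendered one from a normal browser session.

```python
# Reflection-source sketch: compare the raw (no-JavaScript) response
# with the DOM rendered after client-side scripts run. If the marker
# only appears in the rendered DOM, client-side JavaScript reflected it;
# if it appears in the raw body, the backend did.

MARKER = "zqx9test"  # arbitrary unique marker unlikely to occur naturally

def reflection_source(raw_body, rendered_body, marker=MARKER):
    if marker in raw_body:
        return "backend"
    if marker in rendered_body:
        return "client-side JavaScript"
    return "not reflected"

print(reflection_source("<p>hi</p>", "<p>hi zqx9test</p>"))      # client-side JavaScript
print(reflection_source("<p>zqx9test</p>", "<p>zqx9test</p>"))   # backend
```

Using a unique marker string rather than a real payload keeps this classification step free of filtering side effects.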

Backend server
To detect if an input comes directly from the backend, we can use the same technique we used to check whether a reflection came from client-side JavaScript. When our input data is reflected without client-side JavaScript running, it is managed and reflected from a backend server. The only question is: which one?
This depends heavily on the application we are testing; the reflection may come from the application’s backend, a CDN, a WAF or another proxy server related to the application.
We can answer this question by analysing the HTTP response headers and/or response text to see if they hint at which source reflected our input.
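A hedged sketch of this header analysis: the header names below are common real-world examples, but the mapping is illustrative and by no means exhaustive.

```python
# Header-analysis sketch: inspect response headers for hints about
# which server in the chain produced the response that reflected our
# input. Matching is case-insensitive, as HTTP header names are.

HINTS = {
    "cf-ray": "Cloudflare (CDN/WAF)",
    "x-cache": "caching proxy/CDN",
    "via": "intermediate proxy",
    "server": "origin or front-end server banner",
}

def header_hints(headers):
    """Return the subset of known hint headers present in a response."""
    lowered = {k.lower() for k in headers}
    return {name: desc for name, desc in HINTS.items() if name in lowered}

resp_headers = {"Server": "nginx", "CF-RAY": "7a1b2c3d", "Content-Type": "text/html"}
print(header_hints(resp_headers))  # flags cf-ray and server
```

The response text itself (error page branding, banner strings) can be mined the same way when headers are stripped.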

Trace errors

A trace error tends to expose the lines of code and functions involved when the error is triggered. This information lets us collect code snippets, third-party components or CVEs that might be used to exploit security flaws in the functionality.


Change of content

Changes of content can be among the easiest or most difficult behaviours to confirm, whether they are unexpected or not. If they are large changes in the application’s response, such as replacing response text with new text that has not been seen before, then they are likely unexpected behaviours that are easy to confirm.
The challenge starts when small changes appear in the response. This is because most modern web applications are built on highly dynamic content that changes regularly. In this scenario, we need a deep understanding of how the application works to be able to detect a small change that is in fact an unexpected behaviour triggered by our payload.
The more dynamic junk content we can filter out, the more efficiently we can analyse the response content. We can do that by first extracting and focusing on the HTML nodes whose attributes and attribute values appear to represent normal behaviour. What remains is to check the general response text, such as individual words, and the response headers.

The next step is to collect all unique words and compare the words and headers to normal response behaviour, in order to identify words that do not match a normal response.
Burp Suite has built-in functionality – the ‘Compare’ tab – that is effective at comparing these responses. A Burp extension called Diff Last Response, which compares the current and previous response in order to detect differences, is also useful here.
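Outside Burp, the same junk-filtering-then-diffing approach can be sketched with the standard library; the "junk" patterns below (a CSRF token and a timestamp) are hypothetical examples of dynamic content to normalise away.

```python
import difflib
import re

# Response-diffing sketch: replace known-dynamic fragments with a fixed
# placeholder before diffing, so only meaningful changes between the
# baseline response and the payload response remain.

JUNK_PATTERNS = [
    r'name="csrf" value="[0-9a-f]+"',   # hypothetical per-request token
    r"\d{2}:\d{2}:\d{2}",               # timestamps
]

def normalize(body):
    for pattern in JUNK_PATTERNS:
        body = re.sub(pattern, "<DYNAMIC>", body)
    return body

def response_diff(baseline, candidate):
    """Return only the added/removed lines after normalisation."""
    a = normalize(baseline).splitlines()
    b = normalize(candidate).splitlines()
    diff = difflib.unified_diff(a, b, lineterm="")
    return [l for l in diff
            if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]

base = 'ok\n<input name="csrf" value="deadbeef">\ntime 10:00:01'
cand = 'ok\n<input name="csrf" value="cafef00d">\nTemplate error: unexpected token\ntime 10:00:02'
print(response_diff(base, cand))  # ['+Template error: unexpected token']
```

With the token and timestamp normalised away, the one genuinely new line stands out immediately.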


Response time

Servers can produce varying delays in responses. If a payload happens to trigger a delay that is significantly longer or shorter than the average delay, it may be due to an error triggered in the application that we are yet to detect.
To confirm a response-time behaviour, we usually have to stop all other background scanning and testing of the application. This way, any time delay can be attributed to our payload rather than to the application being overloaded or to another payload triggering unexpected behaviour in a different process.
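The timing comparison can be sketched like this; `send` simulates a request with an artificial delay, standing in for a real HTTP call made while all other scans are paused.

```python
import statistics
import time

# Response-time sketch: establish a baseline of round-trip times with a
# benign input, then flag a payload whose timing deviates from that
# baseline by more than a chosen factor.

def send(payload):
    """Simulated request: a hypothetical trigger causes a long delay."""
    time.sleep(0.30 if "sleep-trigger" in payload else 0.01)

def timed(payload):
    t0 = time.perf_counter()
    send(payload)
    return time.perf_counter() - t0

def is_timing_anomaly(payload, baseline_samples=5, factor=5.0):
    baseline = [timed("benign") for _ in range(baseline_samples)]
    return timed(payload) > factor * statistics.mean(baseline)

print(is_timing_anomaly("sleep-trigger"))  # True
print(is_timing_anomaly("hello"))          # False
```

Sampling a baseline immediately before each probe, rather than reusing an old one, helps absorb normal server-load drift.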


Payload demonstration in action

The payload created earlier (see ‘Payload creation’) was randomly inserted into user input sections in the OWASP Juice Shop and DVWA applications to provide a basic understanding of what the payload could accomplish in a real-world scenario. Below you can see the unexpected behaviours that the payload triggered:

(Screenshots: an e_dockerImage crash and a trace error triggered by the payload)

As you can see, we generated many hits just by randomly inserting the payload in different input sections. While these applications are obviously vulnerable at their core, this is still a useful way to test and demonstrate a payload’s accuracy and build an understanding of its effectiveness at triggering certain behaviours.