BDD 101: Unit, Integration, and End-to-End Tests

There are many types of software tests. BDD practices can be incorporated into all aspects of testing, but BDD frameworks are not meant to handle all test types. Behavior scenarios are inherently functional tests – they verify that the product under test works correctly. While instrumentation for performance metrics could be added, BDD frameworks are not intended for performance testing. This post focuses on how BDD automation fits into the Testing Pyramid. Please read BDD 101: Manual Testing for manual test considerations. (Check the Automation Panda BDD page for the full table of contents.)

The Testing Pyramid

The Testing Pyramid is a functional test development approach that divides tests into three layers: unit, integration, and end-to-end.

Unit tests are white-box tests that verify individual “units” of code, such as functions, methods, and classes. They should be written in the same language as the product under test, and they should be stored in the same repository. They often run as part of the build to indicate immediate success or failure.

Integration tests are black-box tests that verify integration points between system components work correctly. The product under test should be active and deployed to a test environment. Service tests are often integration-level tests.

End-to-end tests are black-box tests that test execution paths through a system. They could be seen as multi-step integration tests. Web UI tests are often end-to-end-level tests.

Below is a visual representation of the Testing Pyramid:

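[Figure: The Testing Pyramid, with unit tests at the base, integration tests in the middle, and end-to-end tests at the top.]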

From bottom to top, the tests increase in complexity: unit tests are the simplest and run very fast, while end-to-end tests require lots of setup, logic, and execution time. Ideally, there should be more tests at the bottom and fewer tests at the top. Test coverage is easier to implement and isolate at lower levels, so fewer high-investment, more-fragile tests need to be written at the top. Pushing tests down the pyramid can also mean wider coverage with less execution time. Each layer of testing mitigates risk at its own optimal return on investment.

Behavior-Driven Unit Testing

BDD test frameworks are not meant for writing unit tests. Unit tests are meant to be low-level, program-y tests for individual functions and methods. Writing Gherkin for unit tests is doable, but it is overkill. It is much better to use established unit test frameworks like JUnit, NUnit, and pytest.

Nevertheless, behavior-driven practices still apply to unit tests. Each unit test should focus on one main thing: a single call, an individual variation, or a specific input combination. In other words, one behavior. Furthermore, in the software process, feature-level behavior specs draw a clear dividing line between unit and above-unit tests. The developer of a feature is often responsible for its unit tests, while, for accountability, a separate engineer is responsible for its integration and end-to-end tests. Behavior specs carry a gentleman’s agreement that unit tests will be completed separately.
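As a rough illustration of "one unit test, one behavior," here is a minimal Python sketch. The `shipping_cost` function, its threshold, and its prices are made up for this example. The tests use plain `assert` statements and `test_` prefixed names, which pytest would pick up when they live in a test module, but nothing here depends on pytest being installed.

```python
def shipping_cost(order_total: float) -> float:
    """Hypothetical unit under test: flat shipping below a free-shipping threshold."""
    if order_total < 0:
        raise ValueError("order total cannot be negative")
    return 0.0 if order_total >= 50.0 else 4.99


# Each test covers exactly one behavior and is named after that behavior.
def test_order_below_threshold_is_charged_flat_shipping():
    assert shipping_cost(49.99) == 4.99


def test_order_at_threshold_ships_free():
    assert shipping_cost(50.0) == 0.0


def test_negative_order_total_is_rejected():
    try:
        shipping_cost(-1.0)
    except ValueError:
        return  # the expected behavior
    raise AssertionError("expected ValueError for a negative total")
```

When one of these tests fails, its name alone says which behavior broke, which is the same benefit a well-titled Gherkin scenario gives at higher levels.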

Integration and End-to-End Testing

BDD test frameworks shine at the integration and end-to-end testing levels. Behavior specs expressively and concisely capture test case intent. Steps can be written at either integration or end-to-end levels. Service tests can be written as behavior specs, as in Karate. End-to-end tests are essentially multi-step integration tests. Note how a seemingly basic web interaction is truly a large end-to-end test:

Given a user is logged into the social media site
When the user writes a new post
Then the user's home feed displays the new post
And all friends' home feeds display the new post

Making a simple social media post involves web UI interaction, backend service calls, and database updates all in real time. That’s a full pathway through the system. The automated step definitions may choose to cover these layers implicitly or explicitly, but they are nevertheless covered.
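To make the idea of step definitions concrete, here is a toy sketch in Python. It is not the API of any real BDD framework: the `step` decorator, the `STEPS` registry, `run_scenario`, and `FakeSocialSite` are all hypothetical, and the in-memory fake stands in for the web UI, backend services, and database that real step definitions would drive.

```python
from typing import Callable, Dict

# Hypothetical registry mapping step text to the function that automates it.
STEPS: Dict[str, Callable[[dict], None]] = {}


def step(text: str):
    """Register a function as the definition for one line of step text."""
    def register(func):
        STEPS[text] = func
        return func
    return register


class FakeSocialSite:
    """In-memory stand-in for the UI, services, and database layers."""
    def __init__(self):
        self.friends = {"alice": ["bob", "carol"]}
        self.feeds = {"alice": [], "bob": [], "carol": []}

    def publish(self, author: str, text: str) -> None:
        # Store the post and fan it out to the author's and friends' feeds.
        for user in [author, *self.friends[author]]:
            self.feeds[user].append(text)


@step("a user is logged into the social media site")
def given_logged_in(ctx):
    ctx["site"] = FakeSocialSite()
    ctx["user"] = "alice"


@step("the user writes a new post")
def when_writes_post(ctx):
    ctx["post"] = "Hello, world!"
    ctx["site"].publish(ctx["user"], ctx["post"])


@step("the user's home feed displays the new post")
def then_own_feed_shows_post(ctx):
    assert ctx["post"] in ctx["site"].feeds[ctx["user"]]


@step("all friends' home feeds display the new post")
def then_friends_feeds_show_post(ctx):
    site = ctx["site"]
    for friend in site.friends[ctx["user"]]:
        assert ctx["post"] in site.feeds[friend]


SCENARIO = """
Given a user is logged into the social media site
When the user writes a new post
Then the user's home feed displays the new post
And all friends' home feeds display the new post
"""


def run_scenario(text: str) -> dict:
    """Run each step in order, sharing one context dict between steps."""
    ctx: dict = {}
    for line in text.strip().splitlines():
        _keyword, _, step_text = line.strip().partition(" ")
        STEPS[step_text](ctx)
    return ctx
```

Calling `run_scenario(SCENARIO)` executes the four steps in order. In a real project, the body of `when_writes_post` is where the choice between covering layers implicitly (through the UI only) or explicitly (with direct service or database checks) would be made.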

Lengthy End-to-End Tests

Terms often mean different things to different people. When many people say “end-to-end tests,” what they really mean are lengthy procedure-driven tests: tests that cover multiple behaviors in sequence. That makes BDD purists shudder because it goes against the cardinal rule of BDD: one scenario, one behavior. BDD frameworks can certainly handle lengthy end-to-end tests, but careful consideration should be given to whether and how to do it.

There are five main ways to handle lengthy end-to-end scenarios in BDD:

Don’t bother. If BDD is done right, then every individual behavior would already be comprehensively covered by scenarios. Each scenario should cover all equivalence classes of inputs and outputs. Thus, lengthy end-to-end scenarios would primarily be duplicate test coverage. Rather than waste the development effort, skip lengthy end-to-end scenario automation as a small test risk, and compensate with manual and exploratory testing.

Combine existing scenarios into new ones. Each When-Then pair represents an individual behavior. Steps from existing scenarios could be smashed together with very little refactoring. This violates good Gherkin rules and could result in very lengthy scenarios, but it would be the most pragmatic way to reuse steps for large end-to-end scenarios. Most BDD frameworks don’t enforce step type order, and if they do, steps could be re-typed to work. (This approach is the most pragmatic but least pure.)

Embed assertions in Given and When steps. This strategy avoids duplicate When-Then pairs and ensures validations are still performed. Each step along the way is validated for correctness with explicit Gherkin text. However, it may require a number of new steps.

Treat the sequence of behaviors as a unique, separate behavior. This is the best way to think about lengthy end-to-end scenarios because it reinforces behavior-driven thinking. A lengthy scenario adds value only if it can be justified as a uniquely separate behavior. The scenario should then be written to highlight this uniqueness. Otherwise, it’s not a scenario worth having. These scenarios will often be very declarative and high-level.

Ditch the BDD framework and write them purely in the automation programming language. Gherkin is meant for collaboration about behaviors, while lengthy end-to-end tests are meant exclusively for intense QA work. Business roles will write behavior specs but will never write end-to-end tests. Forcing behavior specification on lengthy end-to-end scenarios can inhibit their development. A better practice could be coexistence: acceptance tests could be written with Gherkin, while lengthy end-to-end tests could be written in raw programming. Automation for both test sets could still share the same automation code base – they could share the same support modules and even step definition methods.
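The coexistence idea from the last option can be sketched as follows. This is only an illustration under assumed names: `FakeShop`, the support functions, and the step function are hypothetical, and the "step definition" is a plain function that a BDD framework would normally bind to Gherkin text.

```python
class FakeShop:
    """Hypothetical in-memory stand-in for the product under test."""
    def __init__(self):
        self.logged_in_user = None
        self.cart = []
        self.orders = []


# Shared support layer: used by step definitions and raw tests alike.
def log_in(shop: FakeShop, username: str) -> None:
    shop.logged_in_user = username


def add_to_cart(shop: FakeShop, item: str) -> None:
    shop.cart.append(item)


def check_out(shop: FakeShop) -> dict:
    order = {"user": shop.logged_in_user, "items": list(shop.cart)}
    shop.orders.append(order)
    shop.cart.clear()
    return order


# Step definition body: what a framework would bind to a Gherkin When step.
def when_the_user_adds_an_item_to_the_cart(context: dict, item: str) -> None:
    add_to_cart(context["shop"], item)


# Lengthy end-to-end test written in raw code, validating after every action.
def test_lengthy_purchase_flow():
    shop = FakeShop()

    log_in(shop, "alice")
    assert shop.logged_in_user == "alice"

    add_to_cart(shop, "book")
    add_to_cart(shop, "pen")
    assert shop.cart == ["book", "pen"]

    order = check_out(shop)
    assert order == {"user": "alice", "items": ["book", "pen"]}
    assert shop.cart == []
```

Because both styles call the same support functions, a fix to `add_to_cart` benefits the Gherkin acceptance tests and the raw end-to-end test at once.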

Hi Andy, I’m an automation tester. Can you advise me about this case of my company:
According to your pyramid, unit tests should be written, but currently my company’s devs do not write unit tests, so I cannot run them automatically. We are trying to write end-to-end tests only through the GUI, which is fragile, as you said. I also find it fragile because I encounter timeout errors many times and have to spend a lot of time maintaining the tests. I think we are doing it wrong. Should I try to convince my company to write unit tests, and how? Are there any other ways?

Hi Hung! You are absolutely right in your convictions. I’m sorry to hear that you’re stuck in that rut with your company. I also know it’s really hard to convince people who are stuck in their ways.

Here are some more reasons to push for unit tests:
– Strong unit tests would probably lead to more stable UI tests because more bugs would be found at the lower level.
– Unit tests run much more quickly than integration and end-to-end tests.
– Unit tests are simpler to maintain.
– Writing unit tests forces developers to think about their code more critically.

Here’s a crazy idea: Could you volunteer to write unit tests? It may be worthwhile for you to stop writing end-to-end tests for now and build up a battery of unit tests. Run code coverage tools as you develop them so you can track the increase in coverage (and make yourself look like a champion to managers). Once coverage is good enough, you can teach developers how to write good unit tests and then hold them accountable for writing new tests for new features. Finally, you can return to the higher levels of the pyramid after unit tests are set in place.

The timeout errors are another problem to address. Make sure your timeouts are tuned well – not too short, but not too long. Also make sure that timeouts aren’t performance bugs. You may also want to revisit the framework’s design for handling timeouts – there should be some sort of constants or other central authority for controlling timeouts in different circumstances.
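One possible shape for that "central authority" is sketched below in Python. The `Timeouts` class, its values, and the `wait_until` helper are assumptions made for illustration. Real values should be tuned against the actual test environment.

```python
import time
from typing import Callable


class Timeouts:
    """Single source of truth for timeouts, in seconds (placeholder values)."""
    PAGE_LOAD = 30.0
    ELEMENT_VISIBLE = 10.0
    POLL_INTERVAL = 0.05


def wait_until(
    condition: Callable[[], bool],
    timeout: float,
    interval: float = Timeouts.POLL_INTERVAL,
) -> None:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

A test would then write something like `wait_until(element_is_visible, Timeouts.ELEMENT_VISIBLE)`, where `element_is_visible` is whatever check the framework provides. Polling returns as soon as the condition holds, so a generous timeout does not slow down passing tests, and changing a value in `Timeouts` adjusts every test that uses it.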

Hi Andy,
I do have one question about this article: Is it about lengthy Gherkin *scenarios* or about lengthy *testing* via BDD?

For example I think of a feature file with multiple scenarios:
First the user creates an account for an online shop, entering a user name, an address, a password and a date of birth.
Then the user logs in with this username/password and performs various tests, like adding/removing items to/from the shopping cart, finding the address as shipping address, not being charged shipping costs if the item costs exceed some threshold, and so on.

Those could each be rather small scenarios, each of them testing one behavior…or would this violate the “one scenario, one behavior” maxim, because the scenario about placing an order would depend on the ability to log in in the first place?
Or to put it differently: Should each scenario in a feature file be able to run in isolation?

The content under “Lengthy End-to-End Tests” specifically addresses lengthy test cases. I’ve seen situations where testers will write a test case with 50+ steps! My visceral reaction is, “Don’t do that!” However, many people really struggle with decomposing a long test procedure into smaller, isolated behaviors. Or, they fail to adequately capture the intention of combining a few behaviors together in a unique way.

To answer your questions directly: Make sure each *desired* behavior has its own dedicated scenario. There are often more behaviors in a product than can be written into Gherkin scenarios. So, in your example, the main desired behaviors appear to be (1) account creation, (2) login, (3) browsing items, (4) handling shopping cart items, and (5) checkout. Each of these five has multiple behaviors (for example, login could be successful or unsuccessful). Then, it seems like there could be an end-to-end scenario where the user logs in, adds a few items to the cart, removes one, goes to checkout, changes the address, and pays with a card. The end-to-end scenario could be written in Gherkin; my article provides a few recommendations for how.

Should each scenario in a feature file be able to run in isolation? Absolutely yes: test case independence is vital for scalable test automation. Consider the use case of filtering tests to run by tag – not all tests will be run, and not necessarily in order. Each scenario should be independent of others.
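Here is a small Python sketch of what independence looks like in practice. Everything in it is hypothetical (the dict-based shop, the scenario functions, and the `run` helper that mimics tag filtering). The point is that each scenario builds its own preconditions instead of relying on another scenario having run first.

```python
import random


def new_shop() -> dict:
    """Fresh, empty state for every scenario."""
    return {"accounts": {}, "session": None, "cart": []}


def create_account(shop: dict, user: str, password: str) -> None:
    shop["accounts"][user] = password


def log_in(shop: dict, user: str, password: str) -> None:
    if shop["accounts"].get(user) != password:
        raise ValueError("invalid credentials")
    shop["session"] = user


def scenario_login_succeeds():
    shop = new_shop()
    create_account(shop, "alice", "s3cret")  # setup, not a prior scenario
    log_in(shop, "alice", "s3cret")
    assert shop["session"] == "alice"


def scenario_add_item_to_cart():
    shop = new_shop()
    create_account(shop, "alice", "s3cret")  # same setup, repeated on purpose
    log_in(shop, "alice", "s3cret")
    shop["cart"].append("book")
    assert shop["cart"] == ["book"]


# Each entry pairs a set of tags with a scenario function.
SCENARIOS = [
    ({"login", "smoke"}, scenario_login_succeeds),
    ({"cart"}, scenario_add_item_to_cart),
]


def run(tag=None, shuffle=False) -> list:
    """Run scenarios matching `tag` (all if None), optionally in random order."""
    selected = [fn for tags, fn in SCENARIOS if tag is None or tag in tags]
    if shuffle:
        random.shuffle(selected)
    for fn in selected:
        fn()
    return [fn.__name__ for fn in selected]
```

Running `run(tag="cart")` executes only the cart scenario, and it still passes because login is part of its own setup. If the cart scenario depended on the login scenario having run first, filtering or shuffling would break it.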

Hi Andy, many thanks for sharing! Can you advise me about this case on my team: we need to check validation of a field. The max length of this field must be less than 100 characters, and if the user inputs more than 100 characters, a dialog message should be displayed. We use the Gherkin language, but I don’t know how to create a scenario to describe this case.
Thanks once more for this great blog,
Giang

Given the “name-of-the-page” page is displayed
When the user enters a 100-character string for “name-of-the-field”
Then an error message is displayed indicating that the input for “name-of-the-field” is too long
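Underneath a scenario like this, it helps to check the values on both sides of the boundary. The Python sketch below assumes the rule as stated in the question (valid input is shorter than 100 characters, so 100 is the first invalid length). The `validation_error` function and its message are hypothetical stand-ins for the real field validation.

```python
from typing import Optional

# Assumption taken from the question: valid input has fewer than 100 characters.
MAX_LENGTH_EXCLUSIVE = 100


def validation_error(value: str) -> Optional[str]:
    """Return an error message if the value is too long, otherwise None."""
    if len(value) >= MAX_LENGTH_EXCLUSIVE:
        return "Input is too long"
    return None


# Boundary cases: (input length, whether an error is expected).
BOUNDARY_CASES = [
    (99, False),   # longest valid input
    (100, True),   # first invalid length
    (101, True),   # just past the boundary
]


def check_boundaries() -> None:
    for length, expect_error in BOUNDARY_CASES:
        error = validation_error("x" * length)
        assert (error is not None) == expect_error, (
            f"unexpected result for length {length}"
        )
```

If the real requirement turns out to allow exactly 100 characters, only `MAX_LENGTH_EXCLUSIVE` and the case table need to change. In Gherkin, the same table of lengths could drive a Scenario Outline.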

Great example. Would the positive version of this test be an appropriate use of BDD/Gherkin?

Example:

Given the “name-of-the-page” page is displayed
When the user enters a string of 99 or fewer characters for “name-of-the-field”
And the user clicks the Save button
Then the form “name-of-the-form” is saved

How would I handle situations with long multi-page forms where selections made on an earlier page could affect the next page or the next several (think TurboTax-style wizards)? I can’t figure out how to go about writing concise end-to-end Gherkin for this type of scenario.

Great article, Andy! Can you advise how I could implement BDD in an optimized way in my project? We have JBehave scenarios that cover the client side (Android) and the server side. If I have to write a JBehave test to validate a scenario that processes information on the server side, should I write a test using Appium or an API test? I can write an API test using Karate, but that will not cover the user behavior, as a user will always use the Android app to perform a scenario. But if I write an Appium test, the execution will be slower. What would be the right approach?

Hi Biswajit! Without knowing your project, it’s hard for me to give the right answer. Here’s what I would ask you to consider:

Consider the risk. This behavior has a mobile-app-UI component and a backend-API component. What would be the risk of skipping the UI component? If the UI doesn’t do much other than fire off an API call, the risk is probably low, and it would probably be okay to do an API-only test. However, if the UI is doing some more complicated stuff (like number crunching or transformations or a multi-step data entry), then it might be worthwhile to bite the bullet and make it an Appium test.

Consider the Testing Pyramid. Do you already have good unit test coverage in this area of the mobile app? Do you have robust API tests already in place? Do you have other Appium-level tests that cover similar behaviors?

Consider the possibilities. How many ways could the API be called? How many equivalence classes of inputs can it have? Perhaps you could write one basic Appium-level happy path test to verify end-to-end behavior together with a few API tests to squeeze out API input edge cases.