How to Check Broken Links in a Website Using Selenium WebDriver

Introduction

In this tutorial we will see how to check for broken links in a website using Selenium WebDriver. Broken links are links that are not reachable or simply do not work. Common causes are: the website's URL is no longer available; a web page was moved without a redirect being added; the URL structure of the website was changed; or the server is down or returns an error.

For a valid URL or link, the server returns a status code in the 2xx range. For an invalid link, it returns a status code in either the 4xx or the 5xx range.

4xx codes indicate client errors, whereas 5xx codes indicate server errors.
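
The status code ranges described above can be captured in a small helper. The following is only a minimal sketch and is not part of the project built later in this tutorial: the class name StatusCodeClassifier, its Category enum and the isBroken method are hypothetical names chosen for illustration. It treats 4xx and 5xx codes as broken and everything else as not broken.

```java
/**
 * Illustrative helper (hypothetical, not part of the tutorial's project)
 * that maps an HTTP status code to a coarse category.
 */
final class StatusCodeClassifier {

    enum Category { INFORMATIONAL, SUCCESS, REDIRECT, CLIENT_ERROR, SERVER_ERROR, UNKNOWN }

    private StatusCodeClassifier() {
    }

    static Category classify(int statusCode) {
        // Codes outside 100..599 (for example -1) are not standard HTTP status codes.
        if (statusCode < 100 || statusCode > 599) {
            return Category.UNKNOWN;
        }
        // The first digit of the code identifies its class.
        switch (statusCode / 100) {
            case 1: return Category.INFORMATIONAL;
            case 2: return Category.SUCCESS;
            case 3: return Category.REDIRECT;
            case 4: return Category.CLIENT_ERROR;
            default: return Category.SERVER_ERROR; // only 5 remains after the range check
        }
    }

    /** A link counts as broken when the server answers with a 4xx or 5xx code. */
    static boolean isBroken(int statusCode) {
        Category category = classify(statusCode);
        return category == Category.CLIENT_ERROR || category == Category.SERVER_ERROR;
    }
}
```

For example, StatusCodeClassifier.isBroken(404) is true, while StatusCodeClassifier.isBroken(200) is false.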

Why do we need to check for broken links?

Your website should not contain broken links that land your users on an error page or a "page not found" page, because such links degrade the quality of your website in the eyes of your users.

Search engines measure links as a vote for a website’s quality. Links to your website and links within your website can affect ranks in search results. Therefore, it’s best practice to either remove or update broken links.

Cleaning up broken links can improve user experience, and make content within your website easier for visitors and search engines to discover.

Manually checking for broken links in a website is tedious and time consuming, as a website may contain a large number of links.

An automation script that performs the check using Selenium is a more viable solution.

Prerequisites

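Java 1.8 or later (the build script below sets source and target compatibility to 12, so lower these values if you use an older JDK), Selenium 2.53.0, ChromeDriver, Gradle 6.1.1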

Build Script

We create a Gradle-based project (selenium-broken-links-finder) in Eclipse and update the build.gradle script as follows:

plugins {
    id 'java-library'
}

sourceCompatibility = 12
targetCompatibility = 12

repositories {
    jcenter()    
}

dependencies {
	implementation('org.seleniumhq.selenium:selenium-server-standalone:2.53.0')
}

Java Code to find Broken Links

Now we will write Java code to find the broken links in a website. I am using my own website, Roy Tutorials, as the example. You may use any website URL to check for broken links.

package com.roytuts.selenium.broken.links.finder;

import java.io.IOException;
import java.net.URL;
import java.util.Iterator;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

import javax.net.ssl.HttpsURLConnection;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class BrokenLinkChecker {

	private static final Logger LOG = Logger.getLogger(BrokenLinkChecker.class.getName());

	private static WebDriver driver = null;

	public static void main(String[] args) {
		System.setProperty("webdriver.chrome.driver", "C:/chromedriver.exe");

		String homePage = "https://roytuts.com";
		String url = "";
		HttpsURLConnection https = null;

		driver = new ChromeDriver();
		driver.manage().window().maximize();

		driver.get(homePage);

		List<WebElement> links = driver.findElements(By.tagName("a"));

		Iterator<WebElement> it = links.iterator();

		while (it.hasNext()) {
			url = it.next().getAttribute("href");

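			// Skip anchors that have no usable href value
			if (url == null || url.isEmpty()) {
				LOG.log(Level.WARNING, "Anchor has no href value, skipping it.");
				continue;
			}

			// Skip links that point outside the site under test
			if (!url.startsWith(homePage)) {
				LOG.log(Level.INFO, url + " belongs to another domain, skipping it.");
				continue;
			}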

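			try {
				// The URL starts with the https home page, so the connection is an HTTPS one
				https = (HttpsURLConnection) (new URL(url).openConnection());
				// Ask for the headers only; the page body is not needed
				https.setRequestMethod("HEAD");
				https.connect();

				int respCode = https.getResponseCode();

				// 4xx and 5xx status codes indicate a broken link
				if (respCode >= 400) {
					LOG.log(Level.SEVERE, url + " is a broken link");
				} else {
					LOG.info(url + " is a valid link");
				}
			} catch (IOException e) {
				// Report the failing URL through the logger together with the exception
				LOG.log(Level.SEVERE, url + " could not be checked", e);
			}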
		}

		driver.quit();
	}

}

Let’s see what we have written in the above Java class.

First we set the webdriver.chrome.driver system property to the disk location where we put the ChromeDriver executable after downloading it. We then start Chrome, maximize the window and open the home page.

Next we find all the links on the home page by the anchor tag name, a.

We iterate through the list of anchor elements we found.
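
A page often links to the same URL more than once (for example in a menu and again in a footer), so a loop over every anchor may request the same URL repeatedly. The sketch below shows one way to drop duplicates before checking. It assumes the href values have already been read from the anchor elements into a list of strings; the class name LinkCollector and the method distinctLinks are hypothetical and do not appear in the class above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative helper (hypothetical) that prepares href values for checking. */
final class LinkCollector {

    private LinkCollector() {
    }

    /**
     * Returns the non-empty href values without duplicates,
     * in the order in which they first appeared on the page.
     */
    static List<String> distinctLinks(List<String> hrefs) {
        // LinkedHashSet removes duplicates while keeping insertion order.
        Set<String> seen = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (href == null) {
                continue; // anchor without an href attribute
            }
            String trimmed = href.trim();
            if (!trimmed.isEmpty()) {
                seen.add(trimmed);
            }
        }
        return new ArrayList<>(seen);
    }
}
```

Each URL in the returned list then needs to be requested only once.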

We get the href value from each anchor tag and skip the anchor if the value is null or empty. We also check whether the link belongs to the home page's domain or to a third-party domain; links to other domains are skipped.
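
The domain check in the class above is a simple prefix comparison. As a variation, the host names can be compared instead, which avoids accepting a URL such as https://roytuts.com.example.org just because it starts with the same text. The following is a minimal sketch under that assumption; DomainFilter and isSameHost are hypothetical names. Note that it compares hosts only, so it accepts both http and https links on the same host, and it treats www.roytuts.com and roytuts.com as different hosts.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Illustrative helper (hypothetical) that compares the host of a link with the home page host. */
final class DomainFilter {

    private DomainFilter() {
    }

    static boolean isSameHost(String homePage, String link) {
        if (homePage == null || link == null) {
            return false;
        }
        try {
            String homeHost = new URI(homePage).getHost();
            String linkHost = new URI(link).getHost();
            // getHost() returns null for links such as mailto: addresses, which have no host.
            // Host names are case-insensitive, so the case is ignored when comparing.
            return homeHost != null && homeHost.equalsIgnoreCase(linkHost);
        } catch (URISyntaxException e) {
            // A value that cannot be parsed as a URI is not treated as an internal link.
            return false;
        }
    }
}
```

For example, isSameHost("https://roytuts.com", "https://roytuts.com/about") returns true, while a mailto: link or a link to another host returns false.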

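We establish the connection over the HTTPS protocol using the HttpsURLConnection class, which sends the request and receives the response for each URL or link. The cast to HttpsURLConnection is safe here because only links that start with the https home page URL are checked. To check plain http links as well, use its parent class HttpURLConnection instead.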

We set the request method to HEAD instead of GET so that the server returns only the header information and not the page body.
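
The request step can also be written so that it works for both http and https links, because HttpsURLConnection is a subclass of HttpURLConnection. The following is a minimal sketch rather than the code used in this tutorial: HeadStatusFetcher, fetchStatus and the timeout parameter are hypothetical names, and the timeout is an addition so that an unresponsive server does not block the run indefinitely.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Illustrative helper (hypothetical) that sends a HEAD request and returns the status code. */
final class HeadStatusFetcher {

    private HeadStatusFetcher() {
    }

    /**
     * Sends a HEAD request to an http or https link and returns the HTTP status code.
     * The caller is expected to pass only http or https links; other schemes such as
     * mailto: do not produce an HttpURLConnection and would fail at the cast.
     */
    static int fetchStatus(String link, int timeoutMillis) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(link).openConnection();
        try {
            connection.setRequestMethod("HEAD");       // headers only, no response body
            connection.setConnectTimeout(timeoutMillis); // limit the time spent connecting
            connection.setReadTimeout(timeoutMillis);    // limit the time spent waiting for the reply
            return connection.getResponseCode();       // sends the request and reads the status line
        } finally {
            connection.disconnect();
        }
    }
}
```

A call such as HeadStatusFetcher.fetchStatus("https://roytuts.com", 5000) returns the status code for the home page. Some servers do not support HEAD and answer with 405 (Method Not Allowed); in that case repeating the request with GET is a possible fallback.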

Finally we read the response code and report the link as broken if the code is 400 or higher, and as valid otherwise.

Thanks for reading.

Roy Tutorials