Sort large File using Java

Here we will see how the huge volume of data in a file can be sorted easily.

You can use tutorial Generate File with Random Content in Java to generate a large text file.

For a smaller file whose contents fit into the memory, sorting the file programmatically can often be as simple as reading the contents of the file into memory and then writing the sorted data back into a file. But a file that is gigabytes in size (for example, 10GB in size), reading all the data into memory as required by traditional programmatic sorting may not be possible for most of the contexts.

In this situation, the file to be sorted is read in chunks, each chunk of which is sorted independently and each written to its own temporary file. Afterwards, one big file is left with several files – each with its contents sorted – that must be merged back together. This can be accomplished by reading lines from each file into memory and merging the lines together in sorted order, writing to the final file as the algorithm moves along through each temporary file.

Create the class FileSplitSortMergeUtil and put the below file utility methods.

The following code example will show you how to split the big file into several temporary files with sorted contents.

/**
 * split the large file into several sorted temp files
 *
 * @param fileName
 * @param tempDirectory
 * @return List<File>
 * @throws IOException
 */
public static List<File> splitAndSortTempFiles(final String fileName, final String tempDirectory,
		final int noOfSplits, final StringComparator cmp) throws IOException {
	List<File> files = new ArrayList<>();
	RandomAccessFile raf = new RandomAccessFile(fileName, "r");
	long sourceSize = raf.length();
	long bytesPerSplit = sourceSize / noOfSplits;
	long remainingBytes = sourceSize % noOfSplits;
	int maxReadBufferSize = 8 * 1024; // 8KB
	int fileCounter = 1;
	for (int i = 1; i <= noOfSplits; i++) {
		File dir = new File(tempDirectory);
		if (dir.exists()) {
			dir.delete();
		}
		dir.mkdir();
		File file = new File(tempDirectory + "/temp-file-" + fileCounter + ".txt");
		BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(file));
		if (bytesPerSplit > maxReadBufferSize) {
			long numReads = bytesPerSplit / maxReadBufferSize;
			long numRemainingRead = bytesPerSplit % maxReadBufferSize;
			for (int j = 0; j < numReads; j++) {
				readWrite(raf, bos, maxReadBufferSize);
			}
			if (numRemainingRead > 0) {
				readWrite(raf, bos, numRemainingRead);
			}
		} else {
			readWrite(raf, bos, bytesPerSplit);
		}
		file = sortFileContent(file, cmp);
		files.add(file);
		fileCounter++;
		bos.close();
	}
	if (remainingBytes > 0) {
		File file = new File(tempDirectory + "/temp-file-" + fileCounter + ".txt");
		BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(file));
		readWrite(raf, bos, remainingBytes);
		file = sortFileContent(file, cmp);
		files.add(file);
		bos.close();
	}
	return files;
}
private static void readWrite(RandomAccessFile raf, BufferedOutputStream bos, long numBytes) throws IOException {
	byte[] buf = new byte[(int) numBytes];
	int val = raf.read(buf);
	if (val != -1) {
		bos.write(buf);
		bos.flush();
	}
}
/**
 * Sort file content
 *
 * @param file
 * @return file
 * @throws IOException
 */
private static File sortFileContent(File file, StringComparator cmp) throws IOException {
	List<String> lines = new ArrayList<>();
	try (Stream<String> ln = Files.lines(file.toPath())) {
		lines = ln.collect(Collectors.toList());
	}
	Collections.sort(lines, cmp);
	try (BufferedWriter bw = Files.newBufferedWriter(file.toPath())) {
		for (String line : lines) {
			bw.write(line);
			bw.write("\r\n");
		}
	}
	return file;
}

The following code example finally merges the temporary files into the final output file.

/**
 * Merge all sorted files into a big file in a sorted manner
 *
 * @param files
 * @param outputFile
 * @param cmp
 * @throws IOException
 */
public static void mergeSortedFiles(final List<File> files, final String outputFile, final StringComparator cmp)
		throws IOException {
	List<BufferedReader> brReaders = new ArrayList<>();
	TreeMap<String, BufferedReader> map = new TreeMap<>(cmp);
	File f = new File(outputFile);
	if (f.exists()) {
		f.delete();
	}
	BufferedWriter bw = new BufferedWriter(new FileWriter(outputFile, true));
	try {
		for (File file : files) {
			BufferedReader br = new BufferedReader(new FileReader(file));
			brReaders.add(br);
			String line = br.readLine();
			map.put(line, br);
		}
		while (!map.isEmpty()) {
			Map.Entry<String, BufferedReader> nextToGo = map.pollFirstEntry();
			bw.write(nextToGo.getKey());
			bw.write("\r\n");
			String line = nextToGo.getValue().readLine();
			if (line != null) {
				map.put(line, nextToGo.getValue());
			}
		}
	} finally {
		if (brReaders != null) {
			for (BufferedReader br : brReaders) {
				br.close();
			}
			File dir = files.get(0).getParentFile();
			for (File file : files) {
				file.delete();
			}
			if (dir.exists()) {
				dir.delete();
			}
		}
		if (bw != null) {
			bw.close();
		}
	}
}

Create below StringComparator class that sorts the contents of the file

package com.roytuts.file;
import java.util.Comparator;
public class StringComparator implements Comparator<String> {
	@Override
	public int compare(String s1, String s2) {
		return s1.compareToIgnoreCase(s2);
	}
}

Now create the main class to test the example

package com.roytuts.file;
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
public class TestSortLargeFile {
	public static void main(String[] args) throws IOException {
		Path fullPath = new File("./data", "large.txt").toPath();
		List<File> files = FileSplitSortMergeUtil.splitAndSortTempFiles(fullPath.toAbsolutePath().toString(), "D:/temp", 25,
				new StringComparator());
		FileSplitSortMergeUtil.mergeSortedFiles(files, "D:/LargeFile.txt", new StringComparator());
	}
}

You can run the above main class and verify the output.
Thanks for reading.

1 thought on “Sort large File using Java

Leave a Reply

Your email address will not be published. Required fields are marked *