Recovering Redacted Text From a Document

With the impending release of the final cache of documents related to the JFK assasination I thought it would be interesting to take a look into the topic of information leakage, or the unintentional revealing of information in a secure system. In this case the leaking of information through a redaction in a document.

For a properly redacted document the possibility is almost 0 that you will retrieve any information but according to a paper by Daniel Lopresti and A. Lawrence Spitz of Lehigh University, it may be possible to take an educated guess at a redaction not done properly.

This could be a few cases:

* The redaction method is not sufficient to cover the original text. This can mean that even though the color of the ink (usually black) matches, the original text would still be visible through the redaction.

* The redaction only partially obscures the original text. If you were to just draw a redaction through the center of some text, the tops (ascenders) and bottoms (descenders) of the letters may still be visible. These may give clues as to some of the letters being redacted.

* Analysis of the redaction size combined with knowledge of the font, from the unredacted portions of the document. If you can guess at the amount of letters redacted, even if its not a single word you can start to take a guess as to what the word or phrase is.

If any clues are to be found it may be possible to combine such clues with some Natural Language Processing and a dictionary to reveal possible word combinations for a redaction.

Although this sounded like a very interesting project to me, in the end, I decided I would rather leave the redacted documents to spies. I still felt there were some fun things to learn however and so I decided to create a prank library instead to do some fake analysis of a redacted document instead.

I wanted to learn and refresh myself of the HTML Canvas object and functions and so I decided I’d like my algorithm to scan a Canvas object, automatically finding any redaction and then fill in the redaction with song lyrics while matching the font size.

I started with a very simple HTML page with a canvas object.

<!DOCTYPE html>
<html>
<head>
<title>Deredactyl</title>
<style type="text/css">
#canvas{
border: 1px solid silver;
}
</style>
</head>
<body>
<h1>Drag a redacted image into the area below.</h1>
<p>Click a redaction to try and recover the text.</p>
<canvas id="canvas" width="600" height="800"></canvas>
<script type="text/javascript" src="deredactyl.js"></script>
</body>

</html>

I grabbed a redacted document off the internet and then started to create the JavaScript.

var canvas = document.getElementById('canvas');
var ctx = canvas.getContext('2d');
var img = new Image();

img.addEventListener('load', function() {
canvas.width = img.width;
canvas.height = img.height;
ctx.drawImage(img, 0, 0);
}, false);

img.src = "test.png";

This was the first revision, I wanted to make sure I could at the very least get an image loaded onto a Canvas.

From there it was all about scanning the document looking for large areas of redacted text.

I started at the top and worked my way through using ctx.getImageData(x,y,1,1) to get the RGB data for a given pixel.

It was a complete failure, crashing my browser several times. It turns out this method was just too slow. I removed console output, I even started incrementing by 2,3,5,10 at a time and it was still too slow, forcing me to rethink my algorithm.

I ended up opting for a user click even to find a specific area for a redaction. The new algorithm was this:

* A user clicks a coordinate in a canvas (hopefully in a redaction)
* Determine the left and right bounds by searching in either direction starting from the point of the click until the edge of the document is found or a pixel is found that is lighter than our threshold.
* Determine the upper and lower bounds with the same technique
* Calcuate the width and height of the area
* If the width is greater than 10 pixels and the height greater than 3, fill in the redaction. (This was just actually an arbitrary guess, I could have improved this by seeing if the size met the minimum ratio for a monospace font letter – about 1/1.7)

Here is some of the relevant code.

canvas.addEventListener('mousedown', selectRedaction);

function selectRedaction(event) {
var x = event.layerX;
var y = event.layerY;
var bounds = getRedactionBoundaries(x, y);

if (bounds.width > 10 && bounds.height > 3) {
clearTextAnalysisCalled = false;
for (var q = 0; q < (bounds.width / bounds.height); q++) { analyzeText(bounds); } } } function getRedactionBoundaries(x, y) { var bounds = { left: findLeft(x, y), right: findRight(x, y), top: findTop(x, y), bottom: findBottom(x, y), } bounds.height = bounds.bottom - bounds.top; bounds.width = bounds.right - bounds.left; return bounds } function findLeft(x, y) { for (var search = x; search > 0; search--) {
var pixel = ctx.getImageData(search, y, 1, 1);
var isDark = pixel.data[0] + pixel.data[1] + pixel.data[2] <= sensitivity;
if (!isDark)
return search
}
}

function findRight(x, y) {
for (var search = x; search < img.width; search++) {
var pixel = ctx.getImageData(search, y, 1, 1);
var isDark = pixel.data[0] + pixel.data[1] + pixel.data[2] <= sensitivity; if (!isDark) return search } } function findTop(x, y) { for (var search = y; search > 0; search--) {
var pixel = ctx.getImageData(x, search, 1, 1);
var isDark = pixel.data[0] + pixel.data[1] + pixel.data[2] <= sensitivity;
if (!isDark)
return search
}
}

function findBottom(x, y) {
for (var search = y; search < img.height; search++) {
var pixel = ctx.getImageData(x, search, 1, 1);
var isDark = pixel.data[0] + pixel.data[1] + pixel.data[2] <= sensitivity;
if (!isDark)
return search
}
}

After determining the space I decided I wanted to add some visual effects to make it seem like the code was actually doing some analysis. I really wanted to create a ‘snow like’ effect so again I looped over the height and width of the redaction and again it was way too slow.

I tried just randomly putting up dots on the redaction but this was actually too fast. Of course there was no easy way to slow this down, so I ended up adding a setTimeout calling the function again and again until a limit was reached. This again was too slow.

Finally, after much annoyance with what should be a trivial feature, I decided to spawn several instances of the function each of which would continue for a pre-determined amount of time before calling the next function, a wipe effect.

Of course, I created a race condition in the process. To fix this condition I created a global flag that would be set by the first function to finish, thereby collapsing the x amount of threads into a single one.

The wipe function was comparatively easy, just going from left to right, filling the redaction with white.

Finally, with the smoke and mirrors portion done, I could fill in the text.

I needed a couple pieces of information, one was the size of the font, which I determined by the height of the redaction. The second was the length of text that could fit in the space. For this I took the raw width and divided it by the rough width of a font that would fit in the space given the height

var fontSize = bounds.height;
var textLength = Math.floor((bounds.width / fontSize) * 1.7);

Now with this length I could start to populate words. I am nothing if not a sucker for the classics so I chose the lyrics to Rick Astley’s Never Gonna Give You Up.

To determine which words to use I iterated over the lyrics pulling words whose length would fit into the remaining space. I’d repeat this for one loop and finally give up if I couldn’t fill the space exactly.

function scrapeText(length) {
var words = "";
var attempts = 0;

while (words.length < length && attempts < sourceWords.length) {
var randomWord = sourceWords[wordCounter];
wordCounter++;

if (wordCounter == sourceWords.length)
wordCounter = 0;

if ((words.length + randomWord.length) < length) { words = words + " " + randomWord; attempts = 0; } else { attempts++ } } return words; } 

Once I was certain this was all working I added one last feature – which was more of a re-learning exercise – and that was to drag and drop files on to the canvas. It turned out to be much easier than I had remembered.

 canvas.addEventListener("dragover", function(evt) { evt.preventDefault(); }, false); canvas.addEventListener("drop", function(evt) { var files = evt.dataTransfer.files; if (files.length > 0) {
var file = files[0];

if (typeof FileReader !== "undefined" && file.type.indexOf("image") != -1) {
var reader = new FileReader();

reader.onload = function(evt) {
img.src = evt.target.result;
};

reader.readAsDataURL(file);
}
}

evt.preventDefault();
}, false);

Finally, not wanting to ruin the joke I took the plain text of the lyrics and fed that portion through a JavaScript obfuscator, nothing someone couldn’t reverse if they really wanted, but just enough to fool people on the first glance.

All in all it took roughly an hour to get everything working and to my surprise I was even able to Rick Roll a couple friends although I am sure they were shaking their heads.

 

 

All source code from this article available at: https://github.com/AnErrantProgrammer/deredactyl

Sources:
* Daniel Lopresti and A. Lawrence Spitz – Information Leakage Through Document Redaction: Attacks and Countermeasures (www.cse.lehigh.edu/~lopresti/Publications/2005/spie05a.pdf)

Leave a Reply

Your email address will not be published. Required fields are marked *