A long-time client has asked us to help screen their work machines for pornography. They're worried about liability if sensitive materials were found. Their main concerns (for obvious reasons) are video, audio, and image files. If possible, they'd also like to scan text-based documents for inappropriate content. They have a hierarchy of not-for-work content starting with blatantly illegal (I don't have to list details), moving down to obviously offensive, and also including things that may be offensive to some - think lingerie ads, joke cards featuring butt cracks, and anything related to Howie Mandel.

My questions are:

Is this ethical? I think it is since every employee legally agrees that their work machine belongs to the company and is subject to search. The screenings are not to occur on personal machines brought to work.

Is it feasible? I've done a lot of image processing/indexing but this seems like a whole new world of complexity.

Any references to successful techniques for discovering porn?

Is it appropriate for me to archive the results when something is discovered?

Create a script that posts all images it finds on 4chan; if other members answer "MOAR!", you know it's porn. If the script gets banned, it's probably CP.
– user281377, Mar 3 '11 at 7:29

You'd have to think there's umpteen million commercial products available for this already.
– GrandmasterB, Mar 3 '11 at 7:32

Honest question: is this actually a likely problem? Porn on the work computer? I mean … who does that? Furthermore, how do they intend to handle accidental porn content? My GF actually had a virus on her work PC recently which redirected arbitrary Google queries to porn sites, and every so often I will accidentally type “python.com” [NSFW!] instead of “python.org” … What’s more, if this is actually a problem, I think this betrays a more fundamental trust and/or professionalism problem in the company. Address that instead of searching the computers.
– Konrad Rudolph, Mar 3 '11 at 11:14

@Anonymous While you're at it, create a GUI in Visual Basic to see if you can track an IP address: youtube.com/watch?v=hkDD03yeLnU. Seriously though, this is way too awesome a technique to put on some second-rate TV show script.
– Evan Plaice, Mar 19 '11 at 3:07

This is an obvious neural network task. First you need a large training set of images selected by experts in your company…

A more effective solution is to announce that you will be checking everyone's machine for porn NEXT week/month/whatever, then write a simple app that just exercises the disk. I guarantee that the machines will have been cleaned by then.

If you do find a couple of images in a browser cache, then perhaps they hit a bad link or a dodgy popup - remember the teacher fired over whitehouse.com? If you fire or discipline them for this, there is going to be a backlash from workers/union. How would your company function if every click had to be submitted to legal for approval before your workers could research a question or check a price online?

If you find a stack of porn on a machine how are you going to prove it was put there by that employee? Do you have the sort of security and audit systems that would stand up in court? Do you use (or even know of) an OS where a system admin couldn't put them there and make it look like the user's files?

Plus in my experience the most common locations for porn stashes are on the laptops of CxOs and senior VPs.

It's much better to just arrange for the files to vanish ahead of time.

This kind of monitoring is painful for both employees and IT staff. Once material is on an employee's machine, there is no reliable way to detect it; you need to stop it from getting onto the machine in the first place.
The best-known practice for this is controlling which sites/domains can be visited. Such lists must be available somewhere on the net. Beyond that, you can also track how many images and videos each employee has downloaded and where they came from.
There is also a chance the material arrives from somewhere other than the web, such as an external hard drive. You could run a once-a-month random scan of each system, picking some of the videos and images and checking them manually. I'm not sure exactly how that would be organized, but fully automated checking of images and videos is certainly out of scope and will certainly be error-prone.
Actually, I am not much in favor of restricting employees from doing personal stuff; you should trust your employees on this. If your employees are busy enough in the office, they won't have time for it. The bigger worries are: is the employee not doing his/her work right? Or has s/he installed some cracked or hacked software?

I agree that Developers - and other creative folks - shouldn't have machines that are locked down. However - and trust me when I say this - when you have 200+ employees processing workflow documents you do not want to give those guys anything that can distract them, including a browser. Yes, 90% of folks are hard working and won't be distracted, but that means you'll have 20+ gobshites taking the piss and being unproductive.
– Binary Worrier, Mar 3 '11 at 8:04

Those 10% will be unproductive anyway. If not browsing websites, then playing games, reading, goofing off, sitting around being bored, etc.
– jwenting, Mar 3 '11 at 8:25

People either get their work done or they don't. They're easier to spot when you have 200 doing similar tasks that can be measured.
– JeffO, Mar 3 '11 at 9:50

In the US, there are legal issues involved with porn on company computers, and there are really serious legal issues involved with child porn. It's safest to have a no-porn policy and take steps to keep it off.
– David Thornley, Mar 3 '11 at 17:18

There are a number of products in the marketplace that perform "content filtering" of various forms. (A Google search on some obvious terms throws up some obvious candidates.) It is probably a better idea to use one of these products than building a lot of scanning / filtering software from scratch. Another option is to just watch at the borders; e.g. by monitoring external emails and web traffic. Again there are products that do this kind of thing.

While there is no doubt that it is ethical for a company to scan its computers for "bad stuff", this does not mean that there aren't issues.

First issue:

Determining what is and what is not "objectionable content" is subjective.

Software for detecting images and videos containing (let us say) "depictions of the naked body" is (AFAIK) likely to be unreliable, resulting in false positives and false negatives.

So ... this means that someone in your customer's organization needs to review the "hits". That costs money.

Second issue: There can be an innocent explanation. The file could have been downloaded by accident, or it could have been planted by a vindictive co-worker. If there is an innocent explanation, the customer's organization needs to be careful what they do / say. (OK this is not really your issue, but you might cop some of the backwash.)

Third issue: Notwithstanding that the company has a right to monitor for objectionable material, a lot of employees will find this distasteful. And if they go too far, this will impact employee morale. Some employees will "walk". Others may take protest action ... e.g. by trying to create lots of false positives. (Again, not really your issue, but ...)

Fourth issue: People can hide objectionable material by encrypting it, by putting it on portable or removable media, etc. People can fake the metadata to make it look like someone else is responsible.

In France, there is the notion of private copy exception: you are not allowed to copy copyrighted material, but copyright holders cannot claim anything if your copy is privately used.
– mouviciel, Mar 3 '11 at 15:01

If the employees agreed that their work machine belongs to the company and is subject to search, then yes, this is legal. For proof, archival of the files would most likely be necessary.

As for how to actually find the material, you could:

First and foremost, scan file names for a certain set of words (porn, lesbians, etc.)

Scan text documents for the same set of words

For images, you could find the average color of the image, and if that color happens to be within a range that most would refer to as 'flesh' colored, then flag the image (someone double checking these flagged images will most likely be necessary). Wouldn't want to report someone for an image that ends up being a family photo from the beach.
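The average-color idea in that paragraph can be sketched in a few lines of plain Python. The RGB thresholds below are illustrative guesses rather than tuned values, and the whole heuristic is deliberately crude, which is exactly why flagged images need a human double-check:

```python
# Rough sketch of the average-color "flesh tone" heuristic.
# Thresholds are illustrative guesses, not tuned or validated values.

def average_color(pixels):
    """Average (R, G, B) over an iterable of RGB tuples."""
    n = 0
    totals = [0, 0, 0]
    for r, g, b in pixels:
        totals[0] += r
        totals[1] += g
        totals[2] += b
        n += 1
    return tuple(t / n for t in totals)

def looks_fleshy(rgb):
    """Crude check: reddish overall, with R > G > B.

    This matches only a narrow band of skin tones and will produce
    both false positives (beaches, wood paneling) and false negatives.
    """
    r, g, b = rgb
    return r > 95 and g > 40 and b > 20 and r > g > b and (r - b) > 15

def flag_image(pixels):
    return looks_fleshy(average_color(pixels))

# Synthetic examples: a skin-like tone and a grass-green tone
print(flag_image([(224, 172, 105)] * 100))  # True
print(flag_image([(34, 139, 34)] * 100))    # False
```

In practice you would decode the file with an imaging library and feed its pixel data in; the point of the sketch is only how simple, and how fallible, the color test itself is.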

If you scan the files as they're entering the computer (e.g. have the program loaded on every work machine and log flagged cases to a central database), then I don't think it would be too obtrusive (other than the blatant distrust the employer clearly has for their employees).

With the video files, I'm not 100% sure. Possibly a similar approach as with the image scanning (choose random frames and scan for a certain level of 'flesh' color).

Scanning audio files seems like it would get into speech recognition, which is a whole 'nother can of worms. Scanning the file name, however, would be easy and could be done as with the documents, images, and video.

Depends on the implementation and reasonable expectations of the employees. For example, if your software scans any machine connected to the network, then there's an additional requirement that infra needs to prevent unauthorized machines from plugging in. (Maybe that should be obvious, but it's frequently overlooked on networks I've seen.)

Is it feasible? I've done a lot of image processing/indexing but this seems like a whole new world of complexity.

Is it feasible to drug test every employee? Maybe so, but I question its worth. I would randomize it. Let employees know their machines may be scanned for inappropriate content at any time.

Any references to successful techniques for discovering porn?

I'm not touching this one. I don't think I could keep my sense of humor in check. But watch out for The Scunthorpe Problem when searching text.
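On the Scunthorpe Problem specifically: a naive substring search flags innocent words that merely contain a blocked term. Matching only at word boundaries avoids the worst of it. A minimal sketch, with a placeholder word list:

```python
import re

# Word-boundary matching avoids the Scunthorpe Problem: a plain
# substring search would flag "Scunthorpe" because it contains a
# blocked word. The blocked list here is a tiny placeholder.
BLOCKED = ["porn", "cunt", "xxx"]
PATTERN = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, BLOCKED)),
                     re.IGNORECASE)

def flag_text(text):
    """Return blocked words found as whole words only."""
    return PATTERN.findall(text)

print(flag_text("Scunthorpe United match report"))  # []
print(flag_text("free PORN downloads"))             # ['PORN']
```

Word boundaries fix only this one failure mode; deliberate obfuscation (spacing, leetspeak) still sails straight past a list like this.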

Is it appropriate for me to archive the results when something is discovered?

This one concerns me the most, and I would ask a lawyer. I suspect if you find illegal content you may technically be legally obliged to disclose it. That's bad, particularly if the user was exposed by no real fault of his own. You(r client) will need real legal advice on how to handle this. Get HR and the lawyers involved.

From a purely technical standpoint: This sounds like an object category recognition problem. I've never done anything like that, but from what I've read, state of the art category recognition systems work like this:

First you search for a large number of interest points (e.g. using a Harris Corner Detector, extremal points of LoG/DoG filters in scale space; some authors even suggest picking random points)

Then you apply a feature transform to each point (something like SIFT, SURF, GLOH or many others)

Combine all the features you found into a histogram (Bag-Of-Features)

Use standard machine learning algorithms (like support vector machines) to learn the distinction between object categories using a large number of training images.
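The bag-of-features step in that pipeline can be illustrated with stdlib-only Python: each local descriptor (e.g. a SIFT vector) is assigned to its nearest "visual word" in a codebook, and the image is summarized as a word-count histogram that then feeds the classifier. Real systems learn the codebook with k-means over many training images; the tiny 2-D vectors below are toys standing in for 128-D descriptors:

```python
# Stdlib-only sketch of the bag-of-features (vector quantization) step.

def nearest_word(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda i: sqdist(descriptor, codebook[i]))

def bag_of_features(descriptors, codebook):
    """Normalized histogram of visual-word occurrences for one image."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_word(d, codebook)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]

# Toy codebook of three "visual words" and four toy descriptors
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.1), (0.2, 0.1), (0.0, 0.9)]
print(bag_of_features(descriptors, codebook))  # [0.5, 0.25, 0.25]
```

The resulting fixed-length histogram is what the SVM (or another standard classifier) actually trains on, regardless of how many interest points each image produced.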

You could add a function to filter out images in internet cache locations that came from a list of known-clean sites, i.e. Gravatar and other sites you don't want false positives from. You could also filter out things like the desktop wallpaper; if someone is displaying porn on the desktop, you'd think people would notice without your audit.

Such things never work reliably.
You can use a blocklist to block domains either on name or on being included on some list (a common practice).
But those lists are never complete, and blocking on name based on criteria can lead to many false positives.
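Name-based domain blocking itself is trivial; the hard part is the list. A minimal sketch, where a host is blocked if it or any parent domain appears on the blocklist (the entries here are placeholders, not a real filtering feed):

```python
# Minimal name-based domain blocking: a host is blocked if it, or any
# parent domain, is on the blocklist. Entries below are placeholders;
# a real deployment would load a published filtering list.

BLOCKLIST = {"badsite.example", "adult.example"}

def is_blocked(host):
    parts = host.lower().strip(".").split(".")
    # Check "a.b.c", then "b.c", then "c" against the blocklist.
    return any(".".join(parts[i:]) in BLOCKLIST
               for i in range(len(parts)))

print(is_blocked("www.badsite.example"))  # True
print(is_blocked("python.org"))           # False
```

This shows why false positives creep in: one bad entry on the list takes every subdomain down with it, and the list is always incomplete in the other direction.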

You can block on words appearing in the text of sites, but again this can lead to false positives (and gets very slow as you need to parse every single bit of data that passes through your network in order to detect "naughty bits").

You can block pictures (and maybe sites containing them) that show more than a certain percentage of skin tones.
But again it leads to many false positives. A university medical department blocking a medical encyclopedia with images of limbs and torsos showing wounds and skin conditions is a well-known example of that.
And of course it'd be racist, as it'd only block certain skin tones: if you block colours matching Caucasian skin, there's always porn using black actors, for example.

Best just trust your employees, and have policies in place for when that trust is broken.

Image and content analysis to determine the differences between a tasteful photograph of a person, a swimsuit photograph, a nude photograph, and a depiction of pornography is, as far as I know, nowhere near sophisticated enough to do in software alone.

Fortunately, crowdsourcing should be useful here, as @ammoQ suggested in a comment. However, I don't believe members of 4chan or any other forum would appreciate the vast number of non-pornographic images, such as generic web graphics for buttons, frames, advertisements, etc., being posted.

My recommendation would be to look into existing crowdsourcing solutions, such as Amazon Mechanical Turk. (However the terms of service may explicitly prohibit the involvement of pornographic content, so be advised you might have to find another solution or roll your own.)

To make crowdsourcing feasible, your software should be prepared to do some or all of the following:

Store information that links the content with the computer it came from

Identify exact duplicates across the entire inventory and remove them (but origin information is retained)

Downsample images to some dimension, perhaps 320x200, which is sufficient to identify the content of the image without retaining unnecessary detail and wasting storage space/bandwidth

Create still images of video content at some regular interval and apply the same downsampling rule
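The deduplication step above can be sketched with nothing but the standard library: collapse identical files by content hash so each unique item is reviewed once, while every origin (machine, path) is kept for the eventual report. Downsampling would then be done per unique file with an imaging library; only the hashing/origin bookkeeping is shown here:

```python
import hashlib
from collections import defaultdict

# Sketch of exact-duplicate removal with origin tracking: one review
# item per unique content, all (machine, path) origins preserved.

def dedup(files):
    """files: iterable of (machine, path, content_bytes).
    Returns {sha256_hex: [(machine, path), ...]}."""
    index = defaultdict(list)
    for machine, path, content in files:
        digest = hashlib.sha256(content).hexdigest()
        index[digest].append((machine, path))
    return dict(index)

files = [
    ("pc-01", "C:/tmp/a.jpg", b"same bytes"),
    ("pc-02", "D:/x/b.jpg",   b"same bytes"),
    ("pc-03", "C:/y/c.jpg",   b"other bytes"),
]
index = dedup(files)
print(len(index))  # 2 unique files covering 3 origins
```

Hashing catches only byte-identical copies; a re-encoded or resized duplicate gets its own review item, which is acceptable here since the goal is reducing volume, not perfect clustering.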

Finally, the database of reduced images that represent the original image and video content is checked by users (or a designated team if you have the resources) according to your company's code of conduct. The program or interface might show a single image at a time, or a screen of thumbnails--whatever you deem best to obtain accurate information.

The identity of the computer from which images came should absolutely be secret and unknown to the persons evaluating the data. Additionally it should be randomized and each image probably checked more than once to remove bias.

The same technique could be used for text, but first the content could be scored by keyword rankings which remove the bulk of text from crowdsource review. Classifying a long document will of course be more time consuming than classifying an image.
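The keyword pre-scoring for text could look like the sketch below: rank each document by weighted keyword hits so only the highest-scoring ones go to crowdsource review. The words, weights, and threshold are all placeholders, not a vetted list:

```python
import re

# Placeholder keyword weights; a real deployment would tune these
# against the company's own code-of-conduct categories.
WEIGHTS = {"porn": 5, "xxx": 5, "nude": 2, "lingerie": 1}

def score(text):
    """Sum of weights over whole-word, case-insensitive matches."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(WEIGHTS.get(w, 0) for w in words)

def needs_review(text, threshold=5):
    return score(text) >= threshold

print(score("quarterly xxx report"))     # 5
print(needs_review("lingerie ad copy"))  # False (score 1)
```

A threshold like this only decides what humans look at first; with long documents it is cheap enough to run over everything, which is exactly why it removes the bulk of text from manual review.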