Content Filtering Deep Dive Whitepaper [ClearOS Documentation]

The purpose of this document is to provide a real-world, in-depth discussion of the Content Filtering system of ClearOS. This article is meant to augment the existing manual page by delving into use case scenarios and making suggestions on implementation. This guide also discusses deeper customizations that can help you tailor your content filter to your needs.

Before we begin, you should install the Content Filter module and familiarize yourself with the controls of the app by referencing the existing guide for the Content Filter module.

Filtering at different levels

One of the great strengths of ClearOS is that it can filter various types of content (HTML, SMTP, others) using a variety of different services and technologies. This document focuses particularly on the web content filter engine provided by the Content Filter module, but it is important to note that no single technology provides the whole solution on ClearOS. Instead, ClearOS combines different services into a holistic filtering solution.

Here is a list of modules that are included in ClearOS that can help you filter various types of content coming across your firewall.

Content Filter
Web Proxy
Web Access Control
Gateway Antivirus
Gateway Antiphishing
Protocol Filter
Intrusion Prevention
Incoming Firewall
Egress Firewall
Custom Firewall

Filtering: It's All About the Process

To implement filtration, you need to understand the process of what is going on and decide where you are going to insert your filtration into that process. This is important because you can block some things easily using elementary methods, while others require inspecting the traffic at a specific layer. When you are surfing the internet, the principal protocols are HTTP, HTTPS, and DNS. You can filter on all of these if you know what these protocols require and how they work.

Protocol	Transparent Filter	Enforceable Filtration	IP-based authentication	User authentication
DNS	Yes	Yes $$	Gateway Management	No
HTTP	Yes	Yes	Network Map	Yes, Account Manager
HTTPS	No*	Yes	Network Map	Yes, Account Manager
Others	Maybe	Maybe	Maybe, Custom FW	Maybe

* You can filter on HTTPS if you compromise the security context of the workstation with an implicit trust of ClearOS as a server. This is called SSL-Bump on Squid and is included with Redwood.

$$ While you can force DNS to not work, lookups of websites by IP address alone may still work if you don't block in HTTP and HTTPS as well. A holistic approach is recommended.

DNS

DNS is a great way to filter because you can force DNS lookups to happen exclusively under your control (unless your workstations are using some sort of VPN out of your network). You can manually make poisoned DNS entries for sites like Facebook and for ad servers. For example, you can send ads on YouTube down a hole so you get ad-less YouTube, and you can completely block any DNS resolution for Facebook if that is just not something you want to allow on your network. DNS is just part of the process, however: users who decide to use IP addresses or a hosts file instead can do an end-run around this blocking method.

ClearOS' DNS server is a caching DNS server. As such, you can populate entries on that server that are totally invalid. When a user queries the server for the hostname, they will get the poisoned address instead of the real one. You can poison individual hostnames, or you can easily blacklist whole domains by directing the DNS lookups for ad site domains to bogus network IPs or the loopback address of the client workstation. You can even redirect lookups to the root domain on the server, where you boldly state that they are not allowed to surf that site (they will get a certificate error if you do this). Since the DNSMasq daemon processes all .conf files in /etc/dnsmasq.d/, simply create a file called:

/etc/dnsmasq.d/poison.conf

In it, create listings similar to this:

server=/doubleclick.net/127.0.0.1
server=/pointroll.com/127.0.0.1
server=/facebook.com/127.0.0.1

You can use any RFC 1918 network address or the loopback address of 127.0.0.1. It is best to use the network address (the first address) of your own subnet so that the response is quicker, as this address is known to be invalid for the machines that share that subnet mask. For example, if you use 192.168.4.1/255.255.255.0 for your ClearOS server, then you can use 192.168.4.0 for the bogus DNS IP. The packet will instantly fail and will not route.
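The address arithmetic above can be sketched in a few lines. This is only an illustration, assuming Python's standard ipaddress module is available; the helper names bogus_dns_target and poison_lines are hypothetical and are not part of ClearOS or DNSMasq. It derives the network address from the server's IP and subnet mask, then prints lines in the same server=/domain/ip form used in poison.conf.

```python
import ipaddress


def bogus_dns_target(server_ip, netmask):
    """Return the network (first) address of the subnet the server sits on."""
    interface = ipaddress.ip_interface(f"{server_ip}/{netmask}")
    return str(interface.network.network_address)


def poison_lines(domains, target):
    """Build lines in the same server=/domain/ip form shown for poison.conf."""
    return [f"server=/{domain}/{target}" for domain in domains]


# Values taken from the example in the text; adjust them for your own network.
target = bogus_dns_target("192.168.4.1", "255.255.255.0")
lines = poison_lines(["doubleclick.net", "pointroll.com", "facebook.com"], target)
print("\n".join(lines))
```

The printed lines could then be pasted into /etc/dnsmasq.d/poison.conf by hand; the sketch deliberately does not write to the file itself.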

Another option in the realm of DNS filtering is Gateway.Management. This Marketplace app filters content before it ever leaves the network and is a great alternative to using the Content Filter or Web Proxy. We say alternative because getting it to work well with the traditional content filter and web proxy requires a bit of knowledge about how these work. For example, if you are using the proxy server, all traffic will originate from the proxy. Therefore, Gateway.Management cannot distinguish between users and originators of traffic. If you wish to use these together, it is possible, but the Proxy needs to be configured in Transparent mode and will not filter HTTPS. Nevertheless, Gateway.Management can be a superior solution in many cases, especially with the Don't Talk to Strangers feature.

HTTP

HTTP traffic is easy to intercept, easy to distinguish and filter, and its filtration mechanisms are hard to get around. By default, ClearOS in transparent mode will tell the running firewall to grab port 80 out of the stream and push it into the proxy and, if it is running, into the Content Filter. Users of the Network Map, the Directory Server, or the AD connector can have policies that provide different filtration for different users. If you want user-based authentication, you MUST use non-transparent mode. There are many tools, including WPAD, GPO and PAC files, which can make this seamless on your network and far more enforceable.

HTTPS

HTTPS traffic is easy to intercept but difficult to distinguish and filter, and its filtration mechanisms can be hard to get around. By default, ClearOS in transparent mode will NOT filter on HTTPS. The best way to filter on HTTPS is to use non-transparent mode with proxy settings configured. If this is not an option, you can filter at the DNS layer (see above). Users of the Network Map, the Directory Server, or the AD connector can have policies that provide different filtration for different users. User authentication is the typical choice because it is usually integrated and you may want the users to show in the reports. You MUST use non-transparent mode, or else you will need to use SSL Bump/Redwood, in order to filter on HTTPS. There are many tools, including WPAD, GPO and PAC files, which can make this seamless on your network and far more enforceable.

If you want to block information on Facebook, you will typically use user authentication and non-transparent mode. This will let you block all of Facebook or allow all of it. In this mode, you will not be able to determine the content of the thread and dynamically block individual parts; you can only block parts based on the URL. This is true for all HTTPS traffic, and this is a good thing because it means that even if your box is compromised, a hacker would not be able to see the HTTPS traffic flowing through the system! For environments where there are legal protections for expectations of privacy (e.g. banking, PCI, HIPAA) this is a very, very good thing. If you want to delve into the packet and do analysis using the content filter's contextual filter, you will need to use Redwood or SSL Bump under ClearOS, which may require some command line configuration and a lot of certificate distribution.

OTHER

Other protocols might be filterable and viewable to you as an administrator.

Authentication and User Tracking

Authentication for the Proxy and Content Filter can be done, but it requires non-transparent mode. The reason is that the browser is unaware that it needs to authenticate when using transparent mode. Additionally, transparent filtration is only really capable of classifying traffic based on IP address.

Typical settings include configuring your proxy in non-transparent mode. Whether you are using ClearOS' OpenLDAP directory server or using Active Directory Connector, you can then pass authentication on to the Directory Server. The user must be part of the web_proxy_plugin group for this to work. In ClearOS, this is simply a checkbox on the user's profile that says they can use the Web Proxy. With AD Connector, you must make this group and assign users.

Browsers that support the NTLM protocol will transparently authenticate directory users if challenged. This means that users who have previously logged into the domain with their user credentials will have already obtained the keys for the web_proxy_plugin group and will authenticate behind the scenes. From there, you can define groups in the Content Filter to allocate different policies on a first match basis.

Transparent Authentication Revisited

Some users will place a default block and on the block page may implement a password or authentication mechanism through a custom process but these are highly customized solutions based on what is available in the marketplace and are beyond the traditional turnkey approach of the default marketplace app. That being said, it is possible to authenticate in transparent mode using a captive portal approach like this one.

Content Processing in the Content Filter module

The Content Filter module uses various mechanisms to block or pass content to your users. Some or all of these may be in effect on a particular piece of content. For example, it is possible for a piece of content to be blocked by multiple stages.

URL Filtering - URL Filtering classifies and allows for blocking or allowing sites based on classification. This sub-module is only effective with a current Content Filter subscription.
Context Filter - Context Filter looks at the words on a webpage and then blocks or allows that content based on the context of the topic. This means that pages NOT listed in URL Filtering or even dynamic content sites can get blocked if the topic of discussion exceeds that of the filter.
File extensions - This sub-filter can block content based on file extension. This means that you can block content like executable files, scripts or other items which can contain malicious code or which compromises your security policies. This is also useful during times where known viruses are riding as a payload on certain file types.
MIME Types - Like file extensions, the content filter can stop certain file types based on their web content type, or MIME type. This can stop certain types of audio or video content.
Manual listing - You can also list sites manually so that they are blocked or allowed.
Web Proxy Bypass - It needs to be mentioned that another way to 'whitelist' a site is by specifying that site in the Web Proxy module (not in the Content Filter module). The reason for doing this may be that even when you whitelist a site, it fails to operate properly. Some content on the web cannot properly be interpreted by the content filter and fails to work. This may be true with proprietary protocols or content which falls outside open specifications. If your web authentication or site does not work even after whitelisting it, try putting the address in the 'Web Proxy Bypass'.

Context Filter

The Open Source technology that ClearOS uses provides awesome filtering based on topics and subjects that are deemed inappropriate. It does this using a concept called weighted phrases. For example, the word 'breast' can be used appropriately and inappropriately. The content filter works by looking at the words near and around the subject matter which may prove inappropriate. This allows you to look up your favorite chicken marsala recipe while at the same time blocking pornography.

Another aspect of this contextual filter is that content which is deemed generally okay can still be processed under this filter. This means that normal sites do NOT have to be wholly blocked but can be blocked when dynamic conversation passes safe limits. To implicitly exempt a site from URL filtering but still maintain context filtering simply add the site to the 'grey list' section.

Manipulating Weighted Phrases

If the default word definitions included with ClearOS do NOT work sufficiently for your situation, you can manipulate those word and phrase lists by editing the lists located in '/etc/dansguardian-av/lists/phraselists'. Words can be specified by themselves or with other words. The engine will look near multiple word selections to determine whether those word pairings are proximal or contextual. A weight is then assigned. Positive numbers (bad words/phrases) raise a webpage's score toward the phrase limit. Negative numbers (good words/phrases) lower a webpage's score. If the weighted phrase limit is exceeded, the page will be blocked with the message 'Weighted Phrase Limit Exceeded'.

For example, in the /etc/dansguardian-av/lists/phraselists/goodphrases directory there is a list called weighted_general. On that list are the following phrases:

<breast cancer><-50>
<breast>,<cancer>,<treatment><-50>
<breast>,<medical><-30>
< chicken>,<breast><-50>
< turkey>,<breast><-50>

All of these entries are weighted DOWN by specifying a negative number in accordance with a relative and arbitrary weight. These entries do more than merely specify the word 'breast'; they indicate innocuous uses of the term. Other such terms may exist in the lists but may not be weighted as high or low as you like, or perhaps are missing altogether. Instead of adding duplicate entries, consider modifying existing sets. A useful command to determine whether a word already exists can be executed from the command line:

cd /etc/dansguardian-av/lists/phraselists/
grep -R breast *

In this case, we are looking for the word 'breast' in all of the phraselists. The results show the lists in which the word appears and the associated weights.
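To make the weighting idea in this section concrete, here is a deliberately simplified sketch. It is not how the content filter engine is implemented: it assumes an entry matches when all of its phrases appear anywhere in the page text, and it ignores the proximity analysis the real engine performs. The names ENTRIES, page_score and is_blocked, the positive weight of 40, and the limit of 30 are all hypothetical values chosen for illustration.

```python
# Hypothetical entries: (phrases that must all appear, weight).
# Negative weights mark innocuous combinations; positive weights mark risky terms.
ENTRIES = [
    (("breast", "cancer", "treatment"), -50),
    (("chicken", "breast"), -50),
    (("breast",), 40),  # assumed "bad" weight, for illustration only
]


def page_score(text, entries):
    """Sum the weights of every entry whose phrases all occur in the text."""
    lowered = text.lower()
    return sum(
        weight
        for phrases, weight in entries
        if all(phrase in lowered for phrase in phrases)
    )


def is_blocked(text, entries, limit):
    """A page is blocked when its score exceeds the weighted phrase limit."""
    return page_score(text, entries) > limit


print(page_score("Grilled chicken breast recipe", ENTRIES))
print(is_blocked("breast", ENTRIES, limit=30))
```

In this toy model the recipe page scores below the limit because the good phrase pairing offsets the risky word, while a page containing only the risky word exceeds the limit and is blocked.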

Manual Site Classification

There are three Manual Site Classifications:

Blacklists
Gray Sites
Exception Sites

You can set these by clicking 'Configure Policy' next to the policy that you are defining.

Blacklists are sites that you specify that should not be visited. The engine uses the entries as wildcards. For example, a setting of 'example.com' will blacklist all sites that have 'example.com' in the url. A setting of 'www.example.com' will NOT blacklist 'example.com' because it is MORE specific and does not match.

Gray lists are sites that are normally allowed but that have the context filter still applied. This is useful for sites like wikipedia.org which may have plenty of articles that are useful but have some articles which violate what your users should see.

Exception Sites are useful to 'whitelist' places that you deem completely safe for your users. Exception Sites override 'blacklisted' sites.
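The two rules stated above (blacklist entries act as wildcards, and Exception Sites override blacklisted sites) can be sketched as follows. This is an illustration under the assumption that a wildcard match behaves like a plain substring test on the URL; the function name blocked_by_lists is hypothetical, and gray sites are left out because they only affect whether the context filter still runs.

```python
def blocked_by_lists(url, blacklist, exceptions):
    """Return True when the URL is blacklisted and not covered by an exception."""
    # Exception Sites win over blacklist entries.
    if any(entry in url for entry in exceptions):
        return False
    # A blacklist entry matches any URL that contains it.
    return any(entry in url for entry in blacklist)


# 'example.com' matches every URL containing it, including www.example.com.
print(blocked_by_lists("http://www.example.com/page", ["example.com"], []))
# 'www.example.com' is more specific and does not match plain example.com.
print(blocked_by_lists("http://example.com/", ["www.example.com"], []))
```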

Computer, User and Group Considerations

The content filter is capable of distinguishing different users and applying differing policies based on who or where the access is originating. In order to work this way, the content filter MUST derive user information. There are two pieces of information that the content filter can use to make this determination:

Username
IP address

Username

If the content filter can derive the username, it can decide which policy to apply based on that username and the group membership of that user. The order of the content filter groups is important. The default policy is the top filter group and it is the one that gets applied both FIRST and LAST: FIRST if the username is NOT specified, and LAST if the username was specified but didn't match. The policy is a first match, first apply policy. If a user belongs to multiple groups, the first policy they match is the one applied.
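The first-match behavior described above can be sketched like this. It is an illustration only: the function pick_policy, the policy names, and the group names are hypothetical, and the real content filter reads its groups from its own configuration rather than from Python structures.

```python
def pick_policy(username, ordered_policies, memberships, default="default"):
    """Return the first policy whose group contains the user, else the default."""
    # No username supplied: the default policy applies immediately.
    if username is None:
        return default
    groups = memberships.get(username, set())
    # Policies are checked in their configured order; the first match wins.
    for policy_name, group in ordered_policies:
        if group in groups:
            return policy_name
    # Username supplied but no group matched: fall back to the default.
    return default


# Hypothetical policies in priority order, and hypothetical group memberships.
POLICIES = [("staff_policy", "staff"), ("student_policy", "students")]
MEMBERS = {"alice": {"students", "staff"}, "bob": {"students"}, "carol": set()}

print(pick_policy("alice", POLICIES, MEMBERS))
```

Because alice belongs to both groups, the policy listed first is the one she receives; reordering POLICIES would change the result.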

In order for the proxy server to receive a username, the browser must supply it. The only way that this can occur is if the proxy server is specified in the browser settings. There are two methods for applying configuration settings for use with User Authentication:

Manual Settings in the browser
Automatic Detection in the browser facilitated by WPAD.

For testing purposes, manual settings can suffice. In the browser you will specify the IP address or hostname of the proxy server and also the port number. If you are using the content filter, this will be port 8080.

For large-scale environments, or to ease administration, we recommend using WPAD as a way to promulgate your configuration to your various browsers. This guide will show you how to configure WPAD on ClearOS to distribute your proxy settings.

Active Directory

If you are using the Active Directory Connector, you will need to add the 'web_proxy_plugin' group and other settings and considerations. Please refer to this guide.

Additionally, a bug exists in the open source software which ClearOS uses for content filtering which prohibits caseless usernames. Because the Active Directory Connector renders all users caseless, any users specified in Active Directory which use upper or mixed case will NOT be properly identified in the Content Filter. The workaround is to rename all AD users to lower case.

IP address

A module add-on for filtering by IP address is being written but functionality exists today for filtering users by IP addresses. When used in conjunction with DHCP IP address reservations or with static addresses, this can be a powerful way to filter users in different groups. Moreover, this works WITHOUT user authentication so it ALSO works in transparent mode.

First, modify the following in /etc/dansguardian-av/dansguardian.conf:

authplugin = '/etc/dansguardian-av/authplugins/proxy-ntlm.conf'
authplugin = '/etc/dansguardian-av/authplugins/proxy-basic.conf'
#authplugin = '/etc/dansguardian-av/authplugins/ident.conf'
#authplugin = '/etc/dansguardian-av/authplugins/ip.conf'

-TO-

authplugin = '/etc/dansguardian-av/authplugins/proxy-ntlm.conf'
authplugin = '/etc/dansguardian-av/authplugins/proxy-basic.conf'
#authplugin = '/etc/dansguardian-av/authplugins/ident.conf'
authplugin = '/etc/dansguardian-av/authplugins/ip.conf'

Since Dansguardian processes authentication in ORDER, if you want it to process IP authentication before user authentication you will need to change the order to this:

authplugin = '/etc/dansguardian-av/authplugins/ip.conf'
authplugin = '/etc/dansguardian-av/authplugins/proxy-ntlm.conf'
authplugin = '/etc/dansguardian-av/authplugins/proxy-basic.conf'
#authplugin = '/etc/dansguardian-av/authplugins/ident.conf'

Next, modify the file /etc/dansguardian-av/lists/authplugins/ipgroups. You will need to specify the IP addresses and the groups to which they belong.

You can match by IP address, subnet, or range:

Straight IP matching:
- 192.168.0.1 = filter1
Subnet matching:
- 192.168.1.0/255.255.255.0 = filter1
Range matching:
- 192.168.1.0-192.168.1.255 = filter1

You MUST specify the groups with the syntax of 'filter1', 'filter2', 'filter3' and so forth. The content filter will NOT process your filters by any other names.
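To check which filter group a given workstation would land in before reloading the service, the three matching styles above can be modeled in a short sketch. This is not Dansguardian's own parser: it assumes each rule is a bare line of the form "address = filterN", that the first matching rule wins, and the helper names parse_rule and group_for are hypothetical. It uses only Python's standard ipaddress module.

```python
import ipaddress


def parse_rule(line):
    """Turn 'spec = filterN' into (lowest address, highest address, group)."""
    spec, group = (part.strip() for part in line.split("="))
    if "-" in spec:
        # Range matching: low-high
        low, high = (ipaddress.ip_address(p.strip()) for p in spec.split("-"))
    elif "/" in spec:
        # Subnet matching: network/netmask
        network = ipaddress.ip_network(spec, strict=False)
        low, high = network[0], network[-1]
    else:
        # Straight IP matching
        low = high = ipaddress.ip_address(spec)
    return low, high, group


def group_for(client_ip, rules):
    """Return the group of the first rule covering the client, or None."""
    address = ipaddress.ip_address(client_ip)
    for low, high, group in rules:
        if low <= address <= high:
            return group
    return None


# Example rules in the three styles; addresses and groups are illustrative.
RULES = [parse_rule(line) for line in (
    "192.168.0.1 = filter1",
    "192.168.1.0/255.255.255.0 = filter2",
    "192.168.2.0-192.168.2.255 = filter3",
)]

print(group_for("192.168.1.42", RULES))
```

A client that matches no rule returns None here, which corresponds to falling through to whatever the content filter applies when IP identification fails.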

Once you have loaded these parameters, you can activate them without restarting or disrupting content filtration service by executing the following command:

service dansguardian-av reload
