National
Telecommunications and Information Administration
Docket No. 020514121-2121-01
COMMENTS
OF THE ELECTRONIC PRIVACY INFORMATION CENTER
AND
THE AMERICAN CIVIL LIBERTIES UNION
ON
THE EFFECTIVENESS OF "INTERNET PROTECTION MEASURES"
In response to NTIA's
"Request for Comment on the Effectiveness of Internet Protection Measures
and Safety Policies," 67 Fed. Reg.
37396(May
29, 2002), the Electronic Privacy Information Center ("EPIC") and the American
Civil Liberties Union ("ACLU") submit these comments to address the demonstrated,
inherent flaws in content blocking systems and other so-called "Internet
protection measures."
In
furtherance of its proceeding "to evaluate whether currently available
Internet blocking or filtering technology protection measures and Internet
safety policies adequately address the needs of educational institutions,"
as required by the Children's
Internet Protection Act ("CIPA"), Pub. L. No. 106-554, 114 Stat. 2763,
2763A-336 (2000), NTIA seeks comments on, inter alia, the following
questions:
-
How do technology protection products block orfilter
prohibited content?
-
Do these methods successfully block or filter prohibited online content?[1]
These
questions recently received extensive consideration by a three-judge federal
court panel during the litigation of the constitutional challenge to CIPA's
requirement that libraries install blocking software in order to qualify
for participation in the E-Rate program.[2]The
court showed — and we agree — that blocking methods currently in use block
access to a large amount of online content permissible under CIPA (i.e.
they “overblock”), while failing to block access to a large amount of online
content prohibited by CIPA (i.e., they “underblock”).Because
current blocking methods overblock permissible speech and underblock prohibited
speech, they fail to “successfully block or filter prohibited online content.”In
the remainder of this comment, we will use the court’s decision to illustrate
why this is so.
The
court noted that, "following an intensive period of discovery . . . the
court conducted an eight-day trial at which [the court] heard 20 witnesses,
and received numerous depositions, stipulations and documents.The
principal focus of the trial was on the capacity of currently available
filtering software."[3]This
examination resulted in extensive findings of fact on the nature of blocking
software, its operation, and the limits of the technology.These
inherent technological limitations, the court held, make it impossible
for a public library to comply with CIPA without violating the First Amendment.[4](A
copy of the court's decision is being filed herewith
for inclusion in NTIA's record.)
The
court's findings of fact, as summarized below, were based primarily on
depositions and testimony concerning the content blocking provided by four
tools: SurfControl's Cyber Patrol, N2H2's Bess/i2100, Secure Computing's
SmartFilter and Websense's Enterprise.The
court's findings contain general information concerning the blocking methods
used by these companies, and assess the broader implications of "the sources
of error that are at once inherent in those methods and unavoidable given
the current architecture of the Internet and the current state of the art
in automated classification systems."[5]
I.
How Technology "Protection" Products Work.
Conceptually,
blocking programs function in a straightforward manner."When
an Internet user requests access to a certain Web site or page, either
by entering a domain name or IP address into a Web browser, or by clicking
on a link, the filtering software checks that domain name or IP address
against a previously compiled 'control list.'"[6]If
the control list responds that the address is restricted, then the user
will not be allowed to access it.
As
the blocking software companies review individual Web sites or pages, they
place the address (URL) into content categories within the control list.For
example, SurfControl uses 40 different content categories such as Adult/Sexually
Explicit; Education; Real Estate; and Violence.The
administrator of the blocking software then has the ability to restrict
access to specific categories, and all of the web addresses included therein.[7]
Therefore,
when gathering the URLs to place onto the control lists, blocking software
companies go through two distinct phases."First,
they must collect or 'harvest' the relevant URLs from the vast number of
sites that exist on the Web.Second,
they must sort through the URLs they have collected to determine under
which of the company's self-defined categories (if any), they should be
classified."[8][9]The
methods used in both of these phases are, however, flawed and thus unsuccessful
at controlling access to restricted materials while improperly blocking
access to a vast amount of valuable content.
II.The
Harvesting Phase is Flawed.
The
harvesting phase introduces flaws into content blocking because it only
considers a small proportion of relevant URLs.An
effective control list should include the full universe of currently available
web addresses.However, "filtering
companies, given their limited resources, do not attempt to index or classify
all of the billions of pages that exist on the Web.Instead,
the set of pages that they attempt to examine and classify is restricted
to a small portion of the Web."[10]SurfControl,
N2H2 and Secure Computing maintain control lists with only 200,000 to 600,000
web addresses,[11]
a miniscule number given that the number of pages that can be accessed
by standard search engines has been estimated at 2 billion pages.[12]
The
control lists contain so few web addresses because the automated and manual
search methods blocking software companies use to find Web pages are imperfect.The
bulk of the web addresses are gathered through automated methods such as
"entering certain key words into search engines [and] following links from
a variety of online directories (e.g., generalized directories like Yahoo
or various specialized directories, such as those that provide links to
sexually explicit content)."[13]These
are then supplemented by manually "reviewing lists of newly-registered
domain names; buying or licensing lists of URLs from third parties; 'mining'
access logs maintained by their customers; and reviewing other submissions
from customers and the public."[14]
A.
Keyword Searching at Commercial Search
Engines
Only Searches a Small Proportion
of
All Web Addresses.
As
the court found, "the first method, entering certain keywords into commercial
search engines, suffers from several limitations.[T]he
Web pages that may be 'harvested' through this method are limited to those
pages that search engines have already identified.However,
. . . a substantial portion of the Web is not even theoretically indexable."[15]
In
addition to the 2 billion web pages that blocking software companies could
reach through keyword searching, there are at least a similar number of
pages which cannot be reached through commercial search engines.The
court found that "the size of the un-indexable Web, or the Deep Web, while
impossible to determine precisely, is estimated to be two to ten times
that of the publicly indexable Web."[16]Thus,
keyword searching, based upon the estimates credited by the court, could
at best reach a very small percent of all Web pages.As
the court found, "no currently available method or combination of methods
for collecting URLs can collect the addresses of all URLs on the Web."[17]
B.
Keyword Searching at Commercial Search
Engines
Doesn't Identify Visual Depictions.
The
second drawback of relying upon automated methods is that they use text
as a proxy for finding visual content.The
commercial search engines are only able to search text, not images.As
the court noted, "[t]his is of critical importance, because CIPA, by its
own terms, covers only 'visual depictions.'"[18]The
court found that
[i]mage
recognition technology is immature, ineffective, and unlikely to improve
substantially in the near future..
. .Due to the reliance on automated
text analysis and the absence of image recognition technology, a Web page
with sexually explicit images and no text cannot be harvested using a search
engine. This problem is complicated by the fact that Web site publishers
may use image files rather than text to represent words, i.e., they may
use a file that computers understand to be a picture, like a photograph
of a printed word, rather than regular text, making automated review of
their textual content impossible.For
example, if the Playboy Web site displays its name using a logo rather
than regular text, a search engine would not see or recognize the Playboy
name in that logo.[19]
For
these reasons, control lists are both insufficient and inaccurate as a
result of flaws in the harvesting process.By
accessing less than 20 percent of Web addresses, and returning only those
that use descriptive text, harvesting produces incomplete control lists.Such
incomplete lists lead to the phenomenon of underblocking, where blocking
programs fail to block Web sites containing content that should be blocked
according to the program's stated criteria.
III.The
Categorization Phase Introduces Flaws.
Flaws
in the process of categorizing harvested Web sites lead to the phenomenon
of overblocking, where content that should not be blocked is miscategorized
as content that should be blocked.An
expert witness for the government admitted that six to fifteen percent
of the Web pages blocked on library computer terminals are wrongly blocked.The
court found, however, that these already substantial estimates "greatly
understate the actual rates of overblocking that occurs, and therefore
cannot be considered as anything more than minimum estimates of the rates
of overblocking that happens in all filtering programs."[20]Therefore,
at least fifteen percent of the Web pages blocked in libraries using
blocking software are wrongly blocked.The
court further found that the thousands of overblocked Web pages identified
by plaintiffs' experts were only a small fraction of those that are actually
overblocked: "[W]e conclude that many times the number of pages that [plaintiffs]
identified are erroneously blocked by one or more of the filtering programs
that [were] tested."[21]
The
court cited many specific examples of "overblocking," including Web pages
containing information about religious organizations (e.g.,
the Web site of the Knights of Columbus
Council 4828, a Catholic men's group associated with St. Patrick's Church
in Fallon, Nevada), governmental
entities and specific political candidates (e.g., the Web site for Kelley
Ross, a Libertarian candidate for the California State Assembly), health
issues (e.g., the Web site of the Willis-Knighton Cancer Center, a cancer
treatment facility), education and careers (e.g., several Web sites with
information on home schooling, and a site for aspiring dentists), travel
and sports (e.g., the Web sites of a North Carolina bed & breakfast
and a fly-fishing outfitter in Alberta).[22]The
cause of such overblocking can be traced to the following flaws in the
categorization process used by blocking software companies.
A.The
Category Definitions are Inaccurate.
The
categories for content that blocking companies use are inconsistent with
those categories identified in CIPA, and as a result, they block content
permissible under CIPA. After blocking software companies use the above-described
methods to compile control lists, they allocate the URLs to different content
categories.For example,Websense
uses the following headings: Abortion Advocacy; Advocacy Groups; Adult
Material; Business & Economy; Drugs; Education; Entertainment; Gambling;
Games; Government; Health; Illegal/Questionable; Information Technology;
Internet Communication; Job Search; Militancy/ Extremist; News & Media;
Productivity Management; Bandwidth Management; Racism/Hate; Religion; Shopping;
Society & Lifestyle; Special Events; Sports; Tasteless; Travel; Vehicles;
Violence; and Weapons.[23]
The
blocking software companies do not define their categories according to
legislative or common law definitions of prohibited materials.Obviously,
none of Websense's above categories matches the prohibited categories of
obscenity, child pornography and materials harmful to minors referenced
in CIPA.The "Adult" category is
defined as including "full or partial nudity of individuals."However,
it also includes unprohibited content such as Web pages that contain "light
adult humor and literature" and "sexually explicit language."
In
addition, blocking software users are not generally allowed to either define
categories for themselves or have access to the control lists and their
categorizations.As the court noted,
"the specific methods that filtering software companies use to.
. . categorize control lists are, like the lists themselves, proprietary
information."[24]Therefore,
blocking software is, at the most, effective at blocking the categories
as defined by the blocking companies, not the desired categories of a particular
user, or those mandated by law.
By
using category definitions, which cannot be meaningfully customized, that
are much broader than the categories of content prohibited by CIPA, blocking
software will necessarily overblock a large amount of content permissible
under CIPA — even when the blocking companies’ definitions are accurately
applied. Of course, those definitions are not always accurately applied,
as discussed below; content matching a category definition is often not
placed in the appropriate category, while content not matching a category
definition will often be categorized as such by mistake. These inaccurate
categorizations lead to further overblocking and underblocking.
B.
Automated Allocations Inaccurately Categorize URLs.
Both
of the methods that blocking software companies currently use to automatically
put a URL into one or more content categories result in overblocking of
content permissible under CIPA, as well as underblocking of content prohibited
under CIPA.Blocking software companies
currently use two textual methods to automatically categorize addresses:
simple key word searching, and statistical algorithms.The
court found that "simple key-word-based filters are subject to the obvious
limitation that no string of words can identify all sites that contain
sexually explicit content, and most strings of words are likely to appear
in Web sites that are not properly classified as containing sexually explicit
content."[25]The
court similarly found that the use of statistical algorithms yields flawed
results:
Notwithstanding
their "artificial intelligence" description, automated text classification
systems are unable to grasp many distinctions between types of content
that would be obvious to a human. And of critical importance, no presently
conceivable technology can make the judgments necessary to determine whether
a visual depiction fits the legal definitions of obscenity, child pornography,
or harmful to minors.[26]
C.
Human Allocations are Unable to
Efficiently
and Accurately Categorize URLs.
Use
of human review — as opposed to automated blocking — also results in the
overblocking of content permissible under CIPA, as well as underblocking
of content prohibited under CIPA. Most blocking software companies engage
in some percentage of human review, in addition to their reliance upon
automated methods.As the court
noted, "[h]uman review of Web pages has the advantage of allowing more
nuanced, if not more accurate, interpretations than automated classification
systems are capable of making."[27]However,
human review is unable to categorize many Web pages without incurring a
significant rate of error.
The
enormity of the task of human review of billions of Web pages introduces
its own sources of error.With limited
resources, any attempt at human review of billions of existing web pages,
and the approximately 1.5 million new Web pages created every day will
introduce human error.The court
found that "errors are likely to result from boredom or lack of attentiveness,
overzealousness, or a desire to err on the side of caution by screening
out material that might be offensive to some customers, even if it does
not fit within any of the company's category definitions."[28]
The
demands of human review also introduce two procedural errors.Often
times, to cope with the demands of review, categorical determinations will
be made not upon the basis of the content of each specific Web page, but
based upon the site's home page.[29]Thus,
if the home page of a Web site appears to be objectionable, all of the
Web pages associated with that site will be categorized similarly.These
broad determinations lead to over-blocking of many useful Web pages based
upon a cursory review.In other cases,
this practice allows justifiably prohibited content to remain unblocked.For
example, a site which hosts the Web pages of thousands of individuals (or
which charges for access) will not be reviewed thoroughly, and thus some
of its content may be available for viewing notwithstanding the "inappropriate"
nature of that particular material.
D.
All Allocation Methods are Inaccurate Because
They
are Static While Web Content is Dynamic.
Blocking
software companies do not have the resources or technological means to
continually monitor Web page content, although that content changes daily.As
the court found:
Most
filtering software companies do not engage in subsequent reviews of categorized
sites or pages on a scheduled basis.Priority
is placed on reviewing and categorizing new sites and pages, rather than
on re-reviewing already categorized sites and pages. Typically, a filtering
software vendor's previous categorization of a Web site is not re-reviewed
for accuracy when new pages are added to the Web site.To
the extent the Web site was previously categorized as a whole, the new
pages added to the site usually share the categorization assigned by the
blocking product vendor.[30]
The
court further noted that, "in addition to the content on Web sites or pages
changing rapidly, Web sites themselves may disappear and be replaced by
sites with entirely different content.If
an IP address associated with a particular Web site is blocked under a
particular category and the Web site goes out of existence, then the IP
address likely would be reassigned to a different Web site."[31]Likewise,
"[t]hrough 'virtual hosting' services, hundreds of thousands of Web sites
with distinct domain names may share a single numeric IP address."When
blocking software companies block the IP addresses of such services, they
"necessarily block a substantial amount of content without reviewing it,
and will likely overblock a substantial amount of content."[32]
Thus,
the categorization of URLs within control lists is inaccurate due to inherent
flaws in the methods employed.Because
the blocking software companies unilaterally determine the categories that
can be blocked, and use inaccurate human and automated allocation methods,
categorization produces highly unreliable results.When
coupled with the insufficiency of the control lists, those results compound
the inaccuracy of blocking systems.
IV.
Less Restrictive, More Effective Means are Available.
Alternatives
to Internet blocking software adequately address the needs of educational
institutions seeking to comply with CIPA, as the court found.Many
of the practices and policies considered by the court to be less restrictive
in the library environment are either currently used or would be applicable
to an educational environment.Schools
can "adopt Internet use policies that make clear to patrons that the [school's]
Internet terminals may not be used to access illegal content."[33]Other
techniques recognized by the court, such as detecting violations of use
policies through direct observation, review of Internet use logs, and subsequent
disciplinary measures are all easily adaptable to an educational environment.These
methods would avoid the documented flaws of blocking software, while more
effectively accomplishing the desired ends.
As
the court concluded after an exhaustive examination of content blocking
technology, "we find that it is currently impossible, given the Internet's
size, rate of growth, rate of change, and architecture, and given the state
of the art of automated classification systems, to develop a filter that
neither underblocks nor overblocks a substantial amount of speech."[34]The
court stressed that these problems are inherent to the blocking process,
and not the result of deficiencies in particular programs:
The
more effective a filter is at blocking Web sites in a given category, the
more the filter will necessarily overblock. Any filter that is reasonably
effective in preventing users from accessing sexually explicit content
on the Web will necessarily block substantial amounts of non-sexually
explicit speech.[35]
NTIA
should adopt the court's findings and conclude that currently available
Internet blocking software does not “successfully block or filter” the
online content that CIPA prohibits, and that use of such software is antithetical
to the educational mission of the nation's schools.
Respectfully
submitted,
David
L. Sobel
Electronic
Privacy Information Center
Laura
W. Murphy, Director
ACLU
Washington National Office
Greg
Pemberton
EPIC
Legal Intern
August 27, 2002