CS262A Project Proposal

Approximate Text Addressing and Spam Filtering on P2P Systems

Feng Zhou (zf@cs), Li Zhuang (zl@cs)
10/10/2002 11:32AM

Motivation

Peer-to-peer overlay infrastructures such as Tapestry, Chord, Pastry and CAN have demonstrated the benefits of scalable, wide-area lookup services for Internet applications. These services are built on top of distributed hash table (DHT) algorithms, which efficiently map a global name to the participating node that offers an object with that name. We propose to build an additional layer on top of DHTs, called the Approximate Text Addressing (ATA) service. ATA maps all approximately identical texts to a single object somewhere in the peer-to-peer network. The observation behind ATA is that text documents about a given real-world object or topic are often copied verbatim, or with few modifications, across a wide-area network. Examples include junk emails received by different people, news articles on different sites generated from the same company press release, and plagiarized scientific papers. Many applications can be built if we can identify these similarities: an anti-spam system that identifies junk emails, an automatic news service that groups similar reports and lets users discuss a specific news topic globally, or a copy-detection system that finds plagiarized articles or even plagiarized computer programs.

Systems have been built to do approximate text copy detection. However, most of them are centralized systems with limited scalability. Moreover, the nature of these applications means that the more people they serve, the more useful they are to each user. Building these applications in a peer-to-peer way is therefore desirable, because of potentially better scalability and availability. It may also reduce network traffic, because DHTs often exploit locality and cache data. To make these applications truly scalable, an efficient approximate text addressing service is vital. Developing such a general-purpose service is the main part of our project. We will also build an application on top of it, a scalable collaborative junk email filtering service, to demonstrate ATA's usefulness and to discover potential problems.

Approximate Text Addressing in DHTs

DHT-based p2p infrastructures provide name-based routing and directory services. ATA, in contrast, provides routing and a directory based on approximate text: it locates an object according to a text document, so that approximately identical documents map to the same object. With the help of a DHT, the intuitive process is:

Document --(1)--> global document ID --(2)--> object/record

The second mapping is done by the DHT, while the first is done by the additional layer. However, because we are providing approximate text addressing, we have to find a way to map all similar documents to a single document ID. Although we can normalize the format of a document by removing extra spaces or mark-up, this cannot handle small literal changes such as word reordering and substitution, so a standard hash function is not suitable. In fact, the document IDs need not be exactly identical, because some DHTs provide proximity routing: if we can map similar text documents to similar summary codes, proximity routing will deliver them to the same node. There is at least one existing summary code with this property, the Nilsimsa code [1]. The problem with Nilsimsa, however, is that the codes of similar documents can differ at any position within the 256 bits, whereas proximity routing in DHTs can only tolerate IDs that differ near the end.
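
As a concrete illustration, the Python sketch below compares two digests by Hamming distance. The digest values are toy placeholders, not real Nilsimsa output; a real 256-bit code would be a 64-character hex string.

    # Sketch: comparing two similarity digests by Hamming distance.
    # The digests here are toy placeholders standing in for 256-bit codes.

    def hamming_distance(hex_a, hex_b):
        """Number of bit positions at which two hex digests differ."""
        return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

    digest_original = "3f29ab5c"   # toy digest of the original document
    digest_modified = "3f39ab4c"   # toy digest of a slightly edited copy

    # Similar documents yield a small Hamming distance, but the differing
    # bits can fall anywhere in the code, not only near the end, so plain
    # proximity routing cannot merge them directly.
    print(hamming_distance(digest_original, digest_modified))  # -> 2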

We will work on ways to solve this problem. First, some variant of the Nilsimsa code may meet the requirements of proximity routing. Another possibility is to transform the stream of text into the frequency domain using the DFT or another transformation; the intuition is that small modifications mostly affect the high-frequency components, so we could encode the components such that similar documents have similar codes that differ only near the end.

The most promising approach seems to be block text fingerprints. This technique has been used in text watermarking and copy/plagiarism detection [2]. The basic idea is to generate a small number of checksums of fixed-length blocks, say 10 of them, i.e. fingerprints, for each document. If all or most of the fingerprints of two documents match, the documents are probably identical or highly similar. Using text fingerprints, the process for fetching the data record corresponding to a given document (communicating with the object works the same way) is as follows (a Python sketch follows the list):

  1. Calculate the document's fingerprints.
  2. Search the DHT for document IDs matching each fingerprint.
  3. The document ID that appears in more than a certain fraction of the search results is the one we are looking for.
  4. Fetch the record by that document ID through the DHT.
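
To make these steps concrete, here is a minimal Python sketch of steps 1-3. It assumes a hypothetical dht_get(key) lookup call, uses a naive contiguous block split instead of the anchor-based fingerprint selection of [2], and the block size, fingerprint count and vote threshold are illustrative values only.

    import hashlib
    from collections import Counter

    BLOCK_SIZE = 64        # bytes per fingerprinted block (assumed value)
    NUM_FINGERPRINTS = 10  # fingerprints per document (as suggested above)
    MATCH_THRESHOLD = 0.5  # fraction of fingerprints that must agree (assumed)

    def fingerprints(text):
        """Step 1: checksums of fixed-length blocks of the document."""
        data = text.encode("utf-8")
        blocks = [data[i:i + BLOCK_SIZE]
                  for i in range(0, len(data), BLOCK_SIZE)][:NUM_FINGERPRINTS]
        return [hashlib.sha1(b).hexdigest() for b in blocks]

    def lookup_document_id(text, dht_get):
        """Steps 2-3: query the DHT for each fingerprint and pick the
        document ID that enough of the results agree on."""
        fps = fingerprints(text)
        votes = Counter()
        for fp in fps:
            doc_id = dht_get(fp)           # hypothetical DHT lookup by fingerprint
            if doc_id is not None:
                votes[doc_id] += 1
        if votes:
            doc_id, count = votes.most_common(1)[0]
            if count >= MATCH_THRESHOLD * len(fps):
                return doc_id              # step 4 then fetches the record by this ID
        return None                        # no sufficiently similar document is known

What fraction of matching fingerprints to require is one of the parameters we expect to tune.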

Junk Email Filtering with ATA

The goal is to build a service that helps people identify junk emails, or SPAM. Only people can really identify SPAM! So we want to build a system that lets users collaborate with each other to fight SPAM. Junk emails are identified by users through their mail user agent. Each email is then mapped to a record in the p2p network using ATA. An incoming email can easily be classified as either useful mail or SPAM by looking at the record, which contains information such as how many people have claimed it to be a junk email.
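
As an illustration, the sketch below shows how a mail user agent might consult such a record. Here ata_lookup is a hypothetical ATA call, and the record field name and report threshold are assumptions made for the example.

    # Sketch: classifying an incoming message using the shared ATA record.
    # `ata_lookup` is a hypothetical ATA call returning the record shared by
    # all texts approximately identical to the message (or None).

    SPAM_REPORT_THRESHOLD = 10   # assumed number of user reports to call it SPAM

    def is_spam(message_text, ata_lookup):
        record = ata_lookup(message_text)
        if record is None:
            return False                 # nothing similar has been reported yet
        return record["spam_reports"] >= SPAM_REPORT_THRESHOLD

    # When a user tags a message as junk in the mail user agent, the client
    # would increment `spam_reports` in the same shared record through ATA.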

We are building the system with ATA/DHT because the effectiveness of the service grows with the number of users. Thanks to the scalability and availability of DHTs, a global-scale anti-spam network can be built this way, without a central bottleneck or single point of failure. Another potential advantage of the peer-to-peer approach is reduced network traffic and latency. SPAM tends to exhibit locality, so a node in Europe probably does not need to query a node in North America about a given junk email, because records of popular junk emails will in most cases be cached nearby.

Evaluation Approach

We are going to do both correctness evaluation and performance evaluation. We will evaluate ATA itself and the junk email filter together, in the context of SPAM filtering. Simulation will be our main method.

For correctness evaluation, there are several variables:

  1. Dataset. We can use real emails tagged by us as SPAM or normal, or synthesized emails with a given SPAM distribution. An important open question is what fraction of real-world email is SPAM.
  2. User behavior. Input will be synthesized based on several variables, such as whether users tag SPAM in time, whether they mistakenly tag normal emails as SPAM, and whether malicious users exist (see the sketch below).
We can also do direct measurement if we have a running system.
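
For example, user behavior could be synthesized with a simple model like the following sketch; all probabilities are illustrative parameters, not measurements.

    # Sketch: synthesizing user tagging behavior for the correctness
    # simulation.  All probabilities are illustrative assumptions.

    import random

    TAG_MISS_PROB = 0.2        # user fails to tag a SPAM message in time
    FALSE_TAG_PROB = 0.01      # user tags a normal message as SPAM
    MALICIOUS_FRACTION = 0.05  # fraction of users who report adversarially

    def make_user():
        """Decide whether a simulated user is malicious."""
        return random.random() < MALICIOUS_FRACTION   # True = malicious

    def simulated_tag(is_actually_spam, user_is_malicious):
        """Return True if the simulated user reports the message as SPAM."""
        if user_is_malicious:
            return not is_actually_spam              # malicious users invert the truth
        if is_actually_spam:
            return random.random() >= TAG_MISS_PROB  # may miss or tag too late
        return random.random() < FALSE_TAG_PROB      # occasional false report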

For performance evaluation, we need to add a few more variables: the cache and replication policy employed by nodes in the network, and the frequency of nodes joining and leaving. The metrics to measure include at least throughput scalability, network traffic, and query latency.

References

[1] Nilsimsa code. Available at: http://lexx.shinn.net/cmeclax/nilsimsa.html

[2] U. Manber, "Finding Similar Files in a Large File System," USENIX Winter 1994 Technical Conference, San Francisco, January 1994, pp. 1-10.