CS262A Reading Summary 22

Join Processing in Database Systems with Large Main Memories

L. D. Shapiro
Summary by Feng Zhou
10/24/2002

3 key features,
  1. 3 existing algorithms for join of large relations are described and their performance evaluated. The algorithms are: sort-merge, simple hash and GRACE hash. It is pointed out that hash-based join can potentially be much faster because they are O(n) instead O(n log n). However, they tend to require more memory than sort-merge methods.
  2. A hybrid algorithm based on simple hashing and GRACE hashing is presented. It also partitions the relation so that they can fit into main memory. In one pass, it places first partition of the relation into memory and put other partition into temporary relations on the disk or in virtual memory. In this way, for small relations that fits into memory, it does not incur any additional I/O because the relation is processed as one partition. For larger relations, the first partition is only read once, while GRACE hashing reads all partitions twice. Thus the new algorithm is better than both simple and GRACE. It is shown to perform much better than sort-merge if main memory is large enough.
  3. The interaction between the join algorithm and the VM system is discussed. It is pointed out that LRU can be counter-producted for relational operations. If the C part (data besides the hot sets) of the join operation is put in the VM. The LRU replacement policy can result in worst possible performance. No new method is proposed to solve the problem.

1 flaw:

One assumption behind partition the relation with the hash values is the values are near-uniformly distributed. I believe this is often not true in real world databases. Another assumption is there are enough possible values for the field in the relation, to make partition possible. For example, partition a relation by the "gender" field into 3 partitions are not possible. This can be done with partition overflow, although it may incur additional costs.