CS262A Reading
Summary 22
Join Processing in Database Systems with Large Main Memories
L. D. Shapiro
Summary by Feng Zhou
10/24/2002
3 key features,
-
3 existing algorithms for join of large relations are described and their performance evaluated.
The algorithms are: sort-merge, simple hash and GRACE hash. It is pointed out that hash-based
join can potentially be much faster because they are O(n) instead O(n log n). However, they
tend to require more memory than sort-merge methods.
-
A hybrid algorithm based on simple hashing and GRACE hashing is presented. It also partitions
the relation so that they can fit into main memory. In one pass, it places first partition of
the relation into memory and put other partition into temporary relations on the disk or in
virtual memory. In this way, for small relations that fits into memory, it does not incur any
additional I/O because the relation is processed as one partition. For larger relations,
the first partition is only read once, while GRACE hashing reads all partitions twice. Thus the
new algorithm is better than both simple and GRACE. It is shown to perform much better than
sort-merge if main memory is large enough.
-
The interaction between the join algorithm and the VM system is discussed. It is pointed out
that LRU can be counter-producted for relational operations. If the C part (data besides the
hot sets) of the join operation is put in the VM. The LRU replacement policy can result in
worst possible performance. No new method is proposed to solve the problem.
1 flaw:
One assumption behind partition the relation with the hash values is the values are near-uniformly
distributed. I believe this is often not true in real world databases. Another assumption is
there are enough possible values for the field in the relation, to make partition possible.
For example, partition a relation by the "gender" field into 3 partitions are not possible.
This can be done with partition overflow, although it may incur additional costs.