11 Comments

  1. A minor typo: it’s “Conscious”, not “Concious” 🙂

  2. Minor but important! Fixed and thanks mk.

  3. Hello Mr Dennehy,
    While searching for additional information on the research paper written for the SPIRE 2005 conference by Nikolas Askitis and Justin Zobel, I found your blog.

    I am a student in this research group: http://www.uni-weimar.de/cms/medien/webis/home.html. We are currently evaluating options for holding large dictionaries in main memory for our retrieval applications.

    I would like to ask whether you would allow us to test your implementation against ours and do some further testing with your code.

    Please let me know if you agree and whether you would like to get in touch to discuss this a little.

    Thanks a lot, best regards
    Hagen Tönnies


  4. I think the performance gains mainly come from encoding the integers in the contents instead of creating a new Entry object for each integer value. I would like to see a comparison of HashSet and this implementation (without integer values) under Java 6; my bet is they would be pretty close.

  5. OK, I also tried to implement a simple String set using the same tricks, but the standard Java 6 HashSet implementation consistently gave better results. The problem is that a Set uses String references, and comparisons are basically free because the hash values of strings are cached. So this is not very useful if you have the strings stored in memory as well, but it could be an option for compact storage of character arrays. I also think your benchmark application could be flawed.

  6. Hi! I was surfing and found your blog post… nice! I love your blog. 🙂 Cheers! Sandra. R.

  7. The double-encoded HTML in the code samples is an unreadable mess. Please fix.

  8. Author

    It was an artifact of the move from wordpress.com to wordpress.org. Should be fixed now.

  9. It should be possible to make CCHashTable even more cache conscious by storing strings in byte arrays instead of char arrays; UTF-8 encoded, mostly-ASCII text is then likely to take half the space. The idea is simple: the less data you fetch, the more cache hits you get.

    And you don’t seem to cache the key’s hash code in the slot bytes along with the string length. Doing so would speed up string comparison and probably improve overall hash table performance.

  10. In the function addString():

    for (k = 0; k > 16) & 0xffff);

    It looks like you’re null-terminating every string inserted into the
    CChash, but the strings are length-encoded, so I don’t think you
    have to do this. From what I gather, only the 2D char array needs to
    be nulled. You should try removing the nulls here; it will save
    space and may increase performance too. To make this work, the two
    bytes used to encode the “id” can be stored after the
    length-encode: [length-encode][id][string]
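
The layout suggested in the last two comments (UTF-8 bytes instead of chars, and `[length-encode][id][string]` with no terminator) can be sketched roughly as below. This is a minimal illustration and not the post's CCHashTable: the class name `SlotSketch`, its methods, and the fixed two-byte big-endian length field are all assumptions made for the example, and it models a single slot rather than a full hash table.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical single-slot sketch; names and field widths are assumptions. */
final class SlotSketch {
    // Packed entries, each laid out as [length:2][id:2][UTF-8 bytes], no null terminator.
    private byte[] slot = new byte[0];

    /** Appends one entry to the end of the slot. */
    void add(String key, int id) {
        byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 0xffff || id < 0 || id > 0xffff) {
            throw new IllegalArgumentException("length and id must fit in two bytes");
        }
        int old = slot.length;
        slot = Arrays.copyOf(slot, old + 4 + utf8.length);
        slot[old]     = (byte) (utf8.length >>> 8);
        slot[old + 1] = (byte) utf8.length;
        slot[old + 2] = (byte) (id >>> 8);
        slot[old + 3] = (byte) id;
        System.arraycopy(utf8, 0, slot, old + 4, utf8.length);
    }

    /** Returns the id stored for the key, or -1 if the key is absent. */
    int find(String key) {
        byte[] utf8 = key.getBytes(StandardCharsets.UTF_8);
        int pos = 0;
        while (pos < slot.length) {
            int len = ((slot[pos] & 0xff) << 8) | (slot[pos + 1] & 0xff);
            // Compare lengths first; the length prefix also tells us where the next entry starts.
            if (len == utf8.length && matches(utf8, pos + 4)) {
                return ((slot[pos + 2] & 0xff) << 8) | (slot[pos + 3] & 0xff);
            }
            pos += 4 + len;
        }
        return -1;
    }

    private boolean matches(byte[] utf8, int start) {
        for (int i = 0; i < utf8.length; i++) {
            if (slot[start + i] != utf8[i]) {
                return false;
            }
        }
        return true;
    }

    /** Total bytes used by the packed entries. */
    int sizeInBytes() {
        return slot.length;
    }
}
```

Because every entry starts with its length, the scan can skip straight to the next entry, so a terminator byte adds nothing. Caching a hash code per entry, as also suggested above, would add a further fixed-width field after the id; it is left out here to keep the sketch short.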
