|
What's the best compression algorithm for URIs that takes into the account the URI syntax. I remember that the WAP specifications had a compression algorithms with tokens for "http://", "www", ".com" etc. I'm looking for something that achieves better results than simply gzipping each URI. |
|
One of the best paper I've ever read about this topic is:
The paper describes how to use AVL tree to achieve ~50% compression. A quote from the paper:
There is an experimental AVL Tree implementation in the Andy's scratch area on the Jena repository. See also "how to compress small strings" on StackOverflow. |
|
Some kind of difference encoding will be quite good. AVL's aren't necesarily very compact (it depends on the ratio of object stored to pointer size). But if there is advanced knowldge of the URI's to be stored, then a prefix/namespace based compression (i.e. setting the compression tokens based on a preset dictionary) can achieve higher compression. A mixture of compressing by prefix and by a general mechanism for URI's that don't match supplied prefixes (or expand the prefix set dynamically) would be interesting. |
|
Take a look at this poster paper: Fernández, J. D., Gutierrez, C., and Martínez-Prieto, M. A. 2010. RDF compression: basic approaches. In Proceedings of the 19th international Conference on World Wide Web (Raleigh, North Carolina, USA, April 26 - 30, 2010). WWW '10. ACM, New York, NY, 1091-1092. DOI= http://doi.acm.org/10.1145/1772690.1772819 Available at http://www.dcc.uchile.cl/~cgutierr/papers/www2010.pdf |
|
This paper is also relevant: URL Forwarding and Compression in Adaptive Web Caching. B. Scott Michel, Konstantinos Nikoloudakis, Peter Reiher, Lixia Zhang. INFOCOM 2000. They build trees from URLs based on scheme/domain/path tokens. Hashes are used to encode the tree. They achieve ~70% compression. |
|
There is a W3C Member Submission about a compression format for RDF. Not particularly focused on compression URIs in particular but may be worthwhile looking at: Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutierrez, Axel Polleres. Binary RDF Representation for Publication and Exchange (HDT). W3C Member Submission 30 March 2011. |

