6
1

What's the best compression algorithm for URIs that takes into the account the URI syntax. I remember that the WAP specifications had a compression algorithms with tokens for "http://", "www", ".com" etc. I'm looking for something that achieves better results than simply gzipping each URI.

asked 29 Jun '10, 09:34

Ian%20Davis's gravatar image

Ian Davis
2.3k1414
accept rate: 13%

edited 27 Jan '12, 13:05

Signified's gravatar image

Signified ♦
23.5k1623


One of the best paper I've ever read about this topic is:

The paper describes how to use AVL tree to achieve ~50% compression. A quote from the paper:

"Like the delta-encoding, our method compresses the URLs by only keeping the differences of URLs tails. To find the URLs different, the sorting of URLs is required. However, a new URL is on the fly discovered and it is impractical to sort out the URLs list every time a new URL is discovered. Instead of sorting all URLs, we incrementally construct an AVL tree of URLs.

[...]

Each node in the tree contains five fields: RefID is a URL identification used to reference to its predecessor, CommonPrefix is a number of common characters referenced to its predecessor, diffURL is the tail of the URL that does not common to its predecessor, Lchild and Rchild are the pointers to the node’s left subtree and right subtree respectively."

There is an experimental AVL Tree implementation in the Andy's scratch area on the Jena repository.

See also "how to compress small strings" on StackOverflow.

link

answered 29 Jun '10, 11:30

castagna's gravatar image

castagna
1.4k1415
accept rate: 27%

edited 01 Jul '10, 07:04

Some kind of difference encoding will be quite good. AVL's aren't necesarily very compact (it depends on the ratio of object stored to pointer size).

But if there is advanced knowldge of the URI's to be stored, then a prefix/namespace based compression (i.e. setting the compression tokens based on a preset dictionary) can achieve higher compression.

A mixture of compressing by prefix and by a general mechanism for URI's that don't match supplied prefixes (or expand the prefix set dynamically) would be interesting.

link

answered 03 Jul '10, 16:37

AndyS's gravatar image

AndyS ♦
13.0k37
accept rate: 32%

Take a look at this poster paper:

Fernández, J. D., Gutierrez, C., and Martínez-Prieto, M. A. 2010. RDF compression: basic approaches. In Proceedings of the 19th international Conference on World Wide Web (Raleigh, North Carolina, USA, April 26 - 30, 2010). WWW '10. ACM, New York, NY, 1091-1092. DOI= http://doi.acm.org/10.1145/1772690.1772819

Available at http://www.dcc.uchile.cl/~cgutierr/papers/www2010.pdf

link

answered 29 Jun '10, 11:10

Antoine%20Zimmermann's gravatar image

Antoine Zimm... ♦
10.0k514
accept rate: 32%

This paper is also relevant:

URL Forwarding and Compression in Adaptive Web Caching. B. Scott Michel, Konstantinos Nikoloudakis, Peter Reiher, Lixia Zhang. INFOCOM 2000.

They build trees from URLs based on scheme/domain/path tokens. Hashes are used to encode the tree.

They achieve ~70% compression.

link

answered 27 Jan '12, 13:04

Signified's gravatar image

Signified ♦
23.5k1623
accept rate: 37%

edited 27 Jan '12, 13:05

There is a W3C Member Submission about a compression format for RDF. Not particularly focused on compression URIs in particular but may be worthwhile looking at:

Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutierrez, Axel Polleres. Binary RDF Representation for Publication and Exchange (HDT). W3C Member Submission 30 March 2011.

link

answered 30 Jan '12, 11:55

Antoine%20Zimmermann's gravatar image

Antoine Zimm... ♦
10.0k514
accept rate: 32%

Your answer
toggle preview

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

Markdown Basics

  • *italic* or _italic_
  • **bold** or __bold__
  • link:[text](http://url.com/ "Title")
  • image?![alt text](/path/img.jpg "Title")
  • numbered list: 1. Foo 2. Bar
  • to add a line break simply add two spaces to where you would like the new line to be.
  • basic HTML tags are also supported

Tags:

×78
×3
×3
×2

Asked: 29 Jun '10, 09:34

Seen: 4,817 times

Last updated: 30 Jan '12, 11:55