XML Compression

Last update: May 17, 2013

Return Collaborators Publications Bibliography

XML Data Compression

Under Construction

Description

Development of algorithms for queryable and updateable XML compression using parallel algorithms and its applications.

Past projects:

1) SXSAQCT and XSAQCT: XML Queryable Compressors Our previous work on two novel queryable XML compressors; a schema-based compressor, SXSAQCT, and a schema-free compressor, XSAQCT. Our experiments (reported here and at Balisage) show that our approach may be successfully competing with other known queryable compressors.

2) Axechop (designed by G. Leighton, collaboration with J. Diamond)

AXECHOP uses grammar compression to further improve the compression of XML data files.

AXECHOP compression strategy is:

   1. Separate structure from content
   2. Apply MPM algorithm to the structure string
   3. Compress data value lists
   4. Write compressed data to file

The AXECHOP compressed file format has three sections:

    * the encoded grammar used to derive the structure string
    * the contents of the element and attribute tables
    * the compressed contents of data lists for individual elements and attributes, along with those of the comment list, CDATA list, processing instruction list, and DOCTYPE list.

AXECHOP decompression strategy is:

   1. Reverse the effects of the MPM to obtain the original structure string and data lists
   2. The original XML file can be reconstructed by the decoder by conducting a single pass through the reconstituted structure string.

3) Treechop (designed by G. Leighton, collaboration with J. Diamond)

TREECHOP supports querying of compressed XML data without requiring prior decompression.

TREECHOP compression strategy is:

   1. Generate the document tree for the input XML document
   2. Assign a codeword to each tree node
   3. Write Node Encodings to Compression Stream

Format of Encoding Stream
The content of the prologue container is written to the stream, followed by the encodings for all tree nodes in depth-first order. Finally, the content of the epilogue container is written to the stream. As content is added to the stream, it is compressed using gzip.

TREECHOP decompression strategy is:

   1. Since tree node encoding is written to the encoding stream in a depth-first order, it is possible for the decoder to regenerate the original XML document incrementally.
   2. A code table is used to keep track of the codewords processed so far and the data type and text value for the tree node associated with each of these codewords.

Querying of Compressed Data

   1. Querying of the compressed data is done by registering one or more filters with the decoder. Each filter has two components: a query string specified using XPath-like syntax, and a handler class which is responsible for processing each query match as it is returned by the decoder. For example, a filter used to find the names of all students in the example XML document would have the query string 'select /class/students/student'.
   2. As the decoder processes the encoding stream, it will attempt to calculate the codeword that satisfies the filter's query predicate. Once the codeword has been discovered, each occurrence of this codeword in the stream results in a notification being sent to the handler associated with the filter, along with the data contained in the next encountered leaf node.
   3. Handlers offer a flexible way to handle query results. A handler could be configured to perform a simple action (such as outputting the query results to console), or some more complex operation such as inserting query results into a database or generating a new XML document containing the query results.

Return Collaborators Publications Bibliography