Tuesday 6 January 2015

Working with GZIP and compressed data

Abstract

We all know what it means to zip a file with zip or gzip.  But using zipped files in Java is not quite as straightforward as you would like to think, especially if you are not working directly with files but rather with compressing streaming data.  We'll go through:
  • how to convert a String into a compressed / zipped byte array and vice versa
  • how to create utility functions for reading and writing files without having to know in advance whether the file or stream is gzip'd or not.

The basics

So why would you want to zip anything?  Quite simply because it is a great way to cut down the amount of data that you have to ship across a network or store to disk, and therefore to speed up the operation.  A typical text file or message can be reduced by a factor of 10 or more, depending on the nature of your document.  Of course you have to factor in the cost of zipping and unzipping, but when you have a large amount of data these costs are unlikely to be significant.
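To get a feel for the numbers, here is a minimal sketch that gzips a repetitive, log-like message in memory and prints the sizes. The class name CompressionRatioDemo and the sample message are made up for illustration, and the ratio you see depends entirely on how repetitive your own data is.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the original post.
final class CompressionRatioDemo {
    private CompressionRatioDemo() {}

    /** Gzips the given bytes in memory and returns the compressed form. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        } catch (IOException e) {
            // Not expected for in-memory streams
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /** Builds a repetitive sample message (invented for this demo). */
    static byte[] sampleMessage(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2015-01-06 12:00:00 INFO order id=").append(i)
              .append(" status=FILLED\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleMessage(1000);
        byte[] zipped = gzip(raw);
        System.out.printf("raw=%d bytes, gzipped=%d bytes, ratio=%.1f%n",
                raw.length, zipped.length, (double) raw.length / zipped.length);
    }
}
```

Very short or already-compressed inputs can come out larger than they went in, so it is worth measuring with data that looks like yours.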

Does Java support this?

Yes, Java supports reading and writing gzip files in the java.util.zip package. It also supports zip files, as well as inflating and deflating data in the format of the popular ZLIB compression library.
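The rest of this post concentrates on deflate and gzip, so here is a brief sketch of the zip side of java.util.zip for comparison. Unlike gzip, a zip file is an archive of named entries. The class name ZipArchiveDemo and its two helpers are hypothetical; ZipOutputStream, ZipInputStream and ZipEntry are the standard JDK classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of the original post.
final class ZipArchiveDemo {
    private ZipArchiveDemo() {}

    /** Creates an in-memory zip archive holding a single text entry. */
    static byte[] zip(String entryName, String text) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(text.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bos.toByteArray();
    }

    /** Reads every entry of an in-memory zip archive as UTF-8 text. */
    static Map<String, String> unzip(byte[] archive) {
        Map<String, String> entries = new LinkedHashMap<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(archive))) {
            byte[] buffer = new byte[8192];
            for (ZipEntry entry; (entry = zis.getNextEntry()) != null; ) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                // read() returns -1 at the end of the current entry
                for (int len; (len = zis.read(buffer)) != -1; ) {
                    content.write(buffer, 0, len);
                }
                entries.put(entry.getName(),
                        new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return entries;
    }
}
```

For example, ZipArchiveDemo.unzip(ZipArchiveDemo.zip("hello.txt", "hello")) gives back a map with the single key "hello.txt".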

How do I compress/uncompress a Java String?

Here are two methods that use Java's built-in Deflater compressor, as well as a method using GZIP:

1. Using the DeflaterOutputStream is the easiest way


import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

enum StringCompressor {
        ;
        public static byte[] compress(String text) {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try {
                OutputStream out = new DeflaterOutputStream(baos);
                out.write(text.getBytes(StandardCharsets.UTF_8));
                // close() finishes the deflate stream and flushes the remaining bytes
                out.close();
            } catch (IOException e) {
                throw new AssertionError(e);
            }
            return baos.toByteArray();
        }

        public static String decompress(byte[] bytes) {
            InputStream in = new InflaterInputStream(new ByteArrayInputStream(bytes));
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try {
                byte[] buffer = new byte[8192];
                int len;
                // read() returns -1 at end of stream
                while ((len = in.read(buffer)) != -1)
                    baos.write(buffer, 0, len);
                in.close();
                return new String(baos.toByteArray(), StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new AssertionError(e);
            }
        }
    }

2. If you want to use the Deflater / Inflater directly


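import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

enum StringCompressor2 {
        ;
        public static byte[] compress(String text) {
            Deflater compresser = new Deflater();
            compresser.setInput(text.getBytes(StandardCharsets.UTF_8));
            compresser.finish();
            // Drain the output in a loop: short or incompressible input can
            // produce more bytes than the original text, so a single
            // fixed-size buffer is not safe.
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!compresser.finished()) {
                int len = compresser.deflate(buffer);
                baos.write(buffer, 0, len);
            }
            compresser.end();
            return baos.toByteArray();
        }

        public static String decompress(byte[] bytes) throws DataFormatException {
            Inflater decompresser = new Inflater();
            decompresser.setInput(bytes, 0, bytes.length);
            // The expanded size is not known in advance, so inflate in chunks
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            try {
                while (!decompresser.finished()) {
                    int len = decompresser.inflate(buffer);
                    if (len == 0 && (decompresser.needsInput() || decompresser.needsDictionary()))
                        throw new DataFormatException("truncated or unsupported input");
                    baos.write(buffer, 0, len);
                }
            } finally {
                decompresser.end();
            }

            // Decode the bytes into a String
            return new String(baos.toByteArray(), StandardCharsets.UTF_8);
        }
    }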

3. Here's how to do it using GZIP

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.*;

enum StringGZipper {
        ;
        public static String ungzip(byte[] bytes) throws IOException {
            InputStreamReader isr = new InputStreamReader(new GZIPInputStream(new ByteArrayInputStream(bytes)), StandardCharsets.UTF_8);
            StringWriter sw = new StringWriter();
            char[] chars = new char[1024];
            // read() returns -1 at end of stream, which ends the loop
            for (int len; (len = isr.read(chars)) > 0; ) {
                sw.write(chars, 0, len);
            }
            return sw.toString();
        }

        public static byte[] gzip(String s) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            GZIPOutputStream gzip = new GZIPOutputStream(bos);
            OutputStreamWriter osw = new OutputStreamWriter(gzip, StandardCharsets.UTF_8);
            osw.write(s);
            // close() flushes the writer and writes the gzip trailer
            osw.close();
            return bos.toByteArray();
        }
    }
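The DeflaterOutputStream version (method 1 above) can also be written with try-with-resources, which has been available since Java 7 and closes the streams even when an exception is thrown. The following is a sketch of that variant rather than a drop-in replacement: the class name TryWithResourcesCompressor is made up, and it wraps failures in UncheckedIOException (Java 8) instead of AssertionError.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical variant of StringCompressor, not part of the original post.
final class TryWithResourcesCompressor {
    private TryWithResourcesCompressor() {}

    static byte[] compress(String text) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        // Leaving the try block closes the stream, which finishes the deflate data
        try (OutputStream out = new DeflaterOutputStream(baos)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return baos.toByteArray();
    }

    static String decompress(byte[] bytes) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(bytes))) {
            byte[] buffer = new byte[8192];
            int len;
            while ((len = in.read(buffer)) != -1) {
                baos.write(buffer, 0, len);
            }
        } catch (IOException e) {
            // Corrupt input surfaces here as a ZipException
            throw new UncheckedIOException(e);
        }
        return new String(baos.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A round trip such as TryWithResourcesCompressor.decompress(TryWithResourcesCompressor.compress(text)) should return the original text.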

How to decode a stream of bytes to allow for both GZip and normal streams:

The code below will turn a stream of bytes into a String (dump) without having to know in advance whether the stream was zipped or not.

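// The original snippet shows only the method body; this signature is assumed.
// length is the number of valid bytes in the buffer.
static String bytesToString(byte[] bytes, int length) throws IOException {
        String dump;
        if (isGZIPStream(bytes)) {
            InputStreamReader isr = new InputStreamReader(new GZIPInputStream(new ByteArrayInputStream(bytes, 0, length)), StandardCharsets.UTF_8);
            StringWriter sw = new StringWriter();
            char[] chars = new char[1024];
            for (int len; (len = isr.read(chars)) > 0; ) {
                sw.write(chars, 0, len);
            }
            dump = sw.toString();
        } else {
            dump = new String(bytes, 0, length, StandardCharsets.UTF_8);
        }
        return dump;
}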

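This is the implementation of the isGZIPStream method.  It compares the first two bytes with GZIP_MAGIC (0x8b1f): every gzip stream starts with the bytes 0x1f 0x8b, and the constant holds them in little-endian order.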

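public static boolean isGZIPStream(byte[] bytes) {
        // Guard against null and arrays too short to hold the two magic bytes
        return bytes != null && bytes.length >= 2
         && bytes[0] == (byte) GZIPInputStream.GZIP_MAGIC 
         && bytes[1] == (byte) (GZIPInputStream.GZIP_MAGIC >>> 8);
}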
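To see the magic number check at work, the sketch below runs the same two-byte test against gzip output, raw deflate (zlib) output and plain text. The class GzipMagicDemo and its helper names are hypothetical and exist only for this demonstration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the original post.
final class GzipMagicDemo {
    private GzipMagicDemo() {}

    /** Same two-byte test as isGZIPStream, with a length guard. */
    static boolean startsWithGzipMagic(byte[] bytes) {
        return bytes != null && bytes.length >= 2
                && bytes[0] == (byte) GZIPInputStream.GZIP_MAGIC
                && bytes[1] == (byte) (GZIPInputStream.GZIP_MAGIC >>> 8);
    }

    /** Compresses s as gzip when gzip is true, otherwise as zlib/deflate. */
    static byte[] compress(String s, boolean gzip) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (OutputStream out = gzip
                ? new GZIPOutputStream(bos)
                : new DeflaterOutputStream(bos)) {
            out.write(s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        // GZIP_MAGIC is 0x8b1f, so this prints 8b1f
        System.out.println(Integer.toHexString(GZIPInputStream.GZIP_MAGIC));
        System.out.println(startsWithGzipMagic(compress("hello", true)));   // gzip
        System.out.println(startsWithGzipMagic(compress("hello", false)));  // zlib
        System.out.println(startsWithGzipMagic("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The check is a heuristic: uncompressed binary data that happens to begin with 0x1f 0x8b would be mistaken for gzip, which is rarely a concern for text but worth knowing.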

This is a simple way to read a file without knowing whether it was zipped or not.  It relies on the .gz extension: if the plain file does not exist, it looks for the same name with .gz appended.

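static Stream<String> getStream(String dir, @NotNull String fileName) 
  throws IOException {
        // Prefer the plain file; fall back to the same name with a .gz suffix
        File file = new File(dir, fileName);
        InputStream in;
        if (file.exists()) {
            in = new FileInputStream(file);
        } else {
            file = new File(dir, fileName + ".gz");
            in = new GZIPInputStream(new FileInputStream(file));
        }

        // Decode explicitly as UTF-8 rather than the platform default charset
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        // Closing the returned Stream closes the underlying file
        return reader.lines().onClose(() -> {
            try {
                reader.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
}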
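The extension-based approach can be combined with the magic number check so that reading does not depend on the file name at all. The sketch below peeks at the first two bytes of any InputStream with a PushbackInputStream, and also covers the writing side by picking gzip when the target name ends in .gz. The class GzipAwareFiles and its method names are assumptions made for this example, not an existing API.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical utility class, not part of the original post.
final class GzipAwareFiles {
    private GzipAwareFiles() {}

    /** Returns a stream of plain bytes whether or not in is gzipped. */
    static InputStream maybeGunzip(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 2);
        byte[] sig = new byte[2];
        int n = 0;
        while (n < 2) {
            int r = pb.read(sig, n, 2 - n);
            if (r < 0) break;      // fewer than two bytes available
            n += r;
        }
        pb.unread(sig, 0, n);      // put the peeked bytes back
        boolean gzipped = n == 2
                && sig[0] == (byte) GZIPInputStream.GZIP_MAGIC
                && sig[1] == (byte) (GZIPInputStream.GZIP_MAGIC >>> 8);
        return gzipped ? new GZIPInputStream(pb) : pb;
    }

    /** Opens a UTF-8 reader, detecting gzip from the content. */
    static BufferedReader newReader(Path path) throws IOException {
        return new BufferedReader(new InputStreamReader(
                maybeGunzip(Files.newInputStream(path)), StandardCharsets.UTF_8));
    }

    /** Opens a UTF-8 writer, gzipping when the name ends in .gz. */
    static BufferedWriter newWriter(Path path) throws IOException {
        OutputStream out = Files.newOutputStream(path);
        if (path.toString().endsWith(".gz")) {
            out = new GZIPOutputStream(out);
        }
        return new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
    }
}
```

Both newReader and newWriter return resources that the caller should close, ideally in a try-with-resources block; closing the writer is what writes the gzip trailer.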

4 comments:

  1. Could you please add a disclaimer that you're using Java 6 syntax, especially for file handling? The code would look differently if modern Java features were used and I'd expect a blog post from 2015 to at least acknowledge that...

  2. Good point, thanks for highlighting. Obviously the main point of the post is to do with gzip. I'll look to update the post.

    Replies
    1. Daniel, is there some specific reason to use an enum without constants? I saw this many times in openhft

    2. Just an easy way to ensure a Singleton.
