No 7zip Allowed: Extracting Oracle's Gzipped Java Tarball On Windows to Create an Isolated, Zero Footprint Java Install for CIS CAT Pro
I had a project to package the CIS CAT Pro benchmark auditing tool for Windows and Linux. The unique Windows challenges I experienced are applicable any time you need to extract Java for Windows, or any gzipped or tar archive on Windows, without using 7zip. CIS CAT Pro requires Java and I wanted to create a zero footprint Java install that could be cleanly wiped out by deleting a folder. This allows the automation to be more readily used on production systems because it won’t force a Java install, nor compete with an existing version of Java. (I find it ironic that CIS CAT requires Java - and then frequently flags the copy of Java it is using as a problem.)
7zip has had a fair share of security vulnerabilities - consequently installing or using it can set off more than a few security bells where I work - so the solution had to be 7zip-free.
While it is more than a little frustrating that the non-installer edition of Java for Windows is only provided by Oracle as a gzipped tarball, this method will work fine for anything else that is only provided for Windows as a gzipped tarball.
Applying Infrastructure as Code: Principles of Minimalism
The term Infrastructure as Code is thrown around a lot, yet rarely with specifics on how adopting the approach should inform coding choices. Sometimes people think of it exclusively in terms of desired state configuration management platforms like Chef, Puppet or Ansible. I have heard others reference it purely as setting up things other than actual end nodes - like what Terraform does.
When I think of Infrastructure as Code it is all inclusive - whether imperative or declarative or whether OS oriented or hypervisor oriented. To me a pure definition of Infrastructure as Code means every last scrap of your configuration can be checked into source control and that the hard core disciplines of traditional development are applied (e.g. structured code, lots of testing, etc.).
Put another way, if the bottom of the stack imperative code (that always underlies declarative systems) stays as the quick and dirty admin coding of the past - it would be the Achilles heel of the rest of the stack!
If you’re lucky enough to live in a pure PaaS or FaaS (Serverless) world - then this imperative level probably does not exist (but then you would also not have made it to this point in this article ;) )
For a long time I have noticed a repeating theme: minimizing code around the most pragmatic implementation made it more flexible than I anticipated when engineering it. Recently I found that exact idea asserted in the book FIRE: How Fast, Inexpensive, Restrained, and Elegant Methods Ignite Innovation. I see it time and again that picking rudimentary implementations frequently increases their scope by reducing the assumptions. This aspect of coding is somewhat unique to developing automation for a broad audience because business applications rely on bringing all dependencies with them (at least the ones with well written installation code do ;) )
In working with operating system provisioning and software deployment automation, I frequently deal with bootstrap automation - a system that does not have extras and may be in a build environment where it cannot easily get to extras. Additionally, I frequently have to go from freshly booted OS to complete working software stack in one set of orchestration. Dealing with these constraints automatically causes me to reduce the external dependencies I take on anything that does not ship on the box. It’s why I code in PowerShell and Bash - usually the shipped version of these languages is sufficient for anything I want to do. Reducing dependencies not only means I can get to the real work of configuration faster, it also means I don’t soil the system with a bunch of installations that have nothing to do with the final software stack that will run on it. In addition, in the Windows world we constantly deal with the fact that exe and msi based installers frequently require special handling like reboots - what a painful situation to be in simply because you need a given utility to automate an installation.
The first phase in a minimalistic approach is to ask “Is there anything on the machine that can already do this task?”
.NET (and therefore PowerShell) has an assembly for standard .zip extraction (System.IO.Compression.FileSystem). At first blush the System.IO.Compression namespace seems to make some attempt to handle Linux archive technology, since it includes a GZipStream class - but it is not complete: it has nothing for tar, so it cannot unpack a gzipped tarball on its own.
In fact, using system.io.compression.filesystem is another exercise in minimalism - it has the following benefits over using Windows Explorer’s unzip capability which you find in many code samples:
- It works on non-GUI OS variants such as Server Core, Containers and Nano Server (Windows Explorer calls do not)
- It has fantastic version reach, working on PowerShell 2 through 6 (Server 2008 R2 through 2019) - the *-Archive cmdlets only arrived with PowerShell 5. The assembly itself does require .NET Framework 4.5 or later.
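As a minimal sketch of that approach (not the exact code used in this project), the following zips a throwaway folder and extracts it again with the ZipFile class. The scratch folder and file names are made up for illustration, and it assumes .NET Framework 4.5 or later (or PowerShell Core) is present:

```powershell
# Load the assembly that provides [System.IO.Compression.ZipFile]
Add-Type -AssemblyName 'System.IO.Compression.FileSystem'

# Hypothetical scratch locations, used only for this demonstration
$work   = Join-Path ([IO.Path]::GetTempPath()) ("zipdemo-" + [Guid]::NewGuid().ToString('N'))
$source = Join-Path $work 'source'
$dest   = Join-Path $work 'extracted'
$zip    = Join-Path $work 'demo.zip'
New-Item -ItemType Directory -Force -Path $source | Out-Null
Set-Content -Path (Join-Path $source 'hello.txt') -Value 'hello from the archive'

# Create the archive, then extract it - no Windows Explorer (shell) calls involved
[System.IO.Compression.ZipFile]::CreateFromDirectory($source, $zip)
[System.IO.Compression.ZipFile]::ExtractToDirectory($zip, $dest)

# List what was extracted
Get-ChildItem -Path $dest | Select-Object -ExpandProperty Name
```

Cleaning up afterwards is just a matter of deleting the scratch folder.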
Other Options Investigated:
- Very recent versions of Windows ship with tar - but the Microsoft compiled version of tar is not available for download for any previous version of Windows.
- A third-party binary built with the same options as the tar that ships with Windows is not available either.
- Most other 3rd party tars rely on heavy runtime libraries like Cygwin - a bit of overkill to untar one file :(
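The "is it already on the machine?" question can be asked in code. This is only an illustrative sketch, not part of the solution described here: it checks whether a tar.exe is on the path so a script could decide whether it needs to bring its own extractor. The message text is made up for the example.

```powershell
# Look for a tar executable on the PATH; $tar stays $null when none is found
$tar = Get-Command -Name 'tar.exe' -CommandType Application -ErrorAction SilentlyContinue |
    Select-Object -First 1

if ($tar) {
    $message = "Built-in tar found at $($tar.Path)"
} else {
    $message = 'No tar.exe on this machine - a bundled extractor is needed'
}
Write-Host $message
```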
Another IaC principle I apply is that software and utilities needed only for installation or only for a temporary purpose should not be fully installed and integrated (even if removed later) if at all possible. This is a slightly higher scoped “minimalism” than what type of code and utilities are used to perform the installation. In this case it affects both the installation automation and the overall idea of putting CIS CAT on a system. The reason for putting CIS CAT on any system does not have to do with what the software stack on that system is designed to do for customers - so effort should be made to minimize any impact it would have on the target system. In the case of CIS CAT we have a special concern in that it might be the only reason Java needs to be put on a given system - so it should be isolated and easy to clean off. This level of minimalism, then, informs us that the design of making the CIS CAT and Java install self-isolated and easy to clean off applies to both Linux and Windows.
Here is a summary of the benefits of using the tarball rather than the installer edition:
- By not fully installing Java we don’t change the configuration of the machine in ways that don’t always back out cleanly (e.g. the system path)
- By not fully installing Java we don’t create challenges for local applications that are using Java (by upgrading it or removing older versions)
- We can pick a version of Java that only concerns itself with compatibility with the exact app we are using (CIS-CAT in this case)
- It is very easy to clean up when the purpose is temporary.
- We can support an easy to clean off install of CIS CAT for Windows and Linux for CIS CAT 3 & 4 (4 total editions). CIS CAT has a “dissolvable bundle” - however, you can’t pick the Java version and it is only for Windows and only CIS CAT v3.
I should mention that I tried tartool.exe - which depends on the assembly we will end up using - unfortunately, tartool was insisting that I install .NET 3.5 / 2.0. Not only do I not want this old version of .NET on my system - but for many versions of Windows this particular optional OS feature must be retrieved from Microsoft and it frequently fails to deploy.
Since this is primarily for instances in Amazon, Amazon’s Corretto Java was tried (which does come as a Zip). However, it was incompatible with at least some of the CIS CAT tests.
I finally settled on calling the assembly ICSharpCode.SharpZipLib.dll directly from PowerShell to untar the Oracle edition.
The following code downloads and extracts SharpZipLib and then uses it to extract Java. Look closely because the lines to acquire SharpZipLib include a little, but surprisingly helpful secret - .nupkg files are really just .zip files. This means any .nupkg file you find on nuget.org or chocolatey.org can be minimalized by downloading them, extracting them and using their contents. In fact, the Universal OpenSSH Installer I created takes advantage of exactly this fact to be usable for non-chocolatey installs!
Another point of IaC minimalism - it turns out SharpZipLib is now available with the “Install-Package” command - however, below I have chosen a direct .zip download for these reasons:
- many times I have to automate for off the shelf configurations of Windows, and before PowerShell 5, there was no package management.
- my use of SharpZipLib will also not leave any residue on the system - clean up with a simple delete - this is not how package management works.
- If my usage of package management is the first on the given machine (very frequent with deployment automation) I have to use several commands and switches for all the underlying pieces and parts (package provider, package source) to be automatically configured and used. This further soils the system with configurations that are not easy to return to a pristine state.
- the location of the extracted assembly can move around when using package management and I don’t want to have to probe to find it.
#This code needs PowerShell 3 or later (Invoke-WebRequest is not available in PowerShell 2)
#Acquire and unzip the nupkg file containing the assembly
Invoke-WebRequest -Uri 'https://github.com/icsharpcode/SharpZipLib/releases/download/v1.1.0/SharpZipLib.1.1.0.nupkg' -OutFile "$PWD\SharpZipLib.1.1.0.nupkg"
Add-Type -AssemblyName "System.IO.Compression.FileSystem"
#A .nupkg is just a .zip, so the standard .NET zip class can extract it into the current folder
[System.IO.Compression.ZipFile]::ExtractToDirectory("$PWD\SharpZipLib.1.1.0.nupkg", "$PWD")
Write-Host "Untarring Java..."
#Using the net45 version because that is the most likely to be preinstalled for my case, but check other folders under "lib" for builds targeting other .NET versions
Add-Type -Path "$PWD\lib\net45\ICSharpCode.SharpZipLib.dll"
#Automating the download of Java is intense, here are some ideas: https://stackoverflow.com/questions/24430141/downloading-jdk-using-powershell
$gzippedtarball = [IO.File]::OpenRead("$PWD\jre-8u212-windows-x64.tar.gz")
$inStream = New-Object -TypeName ICSharpCode.SharpZipLib.GZip.GZipInputStream $gzippedtarball
$tarIn = New-Object -TypeName ICSharpCode.SharpZipLib.Tar.TarInputStream $inStream
$archive = [ICSharpCode.SharpZipLib.Tar.TarArchive]::CreateInputTarArchive($tarIn)
#Extract everything into the current folder, then close the archive to release the file handle
$archive.ExtractContents("$PWD")
$archive.Close()
#Set JRE Home and add the JRE Bin folder to the path of the current process (the next two lines could also be written to script to allow quick setup of the isolated version from other scripts)
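The two lines that last comment refers to are not shown above. A hedged sketch of what they could look like follows; the variable name JRE_HOME and the folder name jre1.8.0_212 are assumptions on my part, so check what the tarball actually unpacked to:

```powershell
# Assumption: the tarball unpacked into a folder named jre1.8.0_212 under the current directory
$env:JRE_HOME = Join-Path $PWD.Path 'jre1.8.0_212'
# Prepend the JRE bin folder for this process only; the machine and user PATH are left untouched
$env:Path = (Join-Path $env:JRE_HOME 'bin') + [IO.Path]::PathSeparator + $env:Path
```

Because only the environment of the current process changes, deleting the folder is all the cleanup that is needed.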
Why Not Just Re-Zip It Darwin?
After all that, you may wonder “Why not just untar and rezip the Java archive?” The reasons I would not do that are rooted in hard experience:
- The more human procedures there are for a re-release - the more likely there will be procrastination in taking new versions - and in this case, this is Java!
- The more human procedures there are for a re-release, the more likely that steps will be missed.
- Whenever a file bundle is extracted, there is always the risk that someone decides to clean up what is included in the new bundle (yep, seen it too many times).
- When dealing with installers, I always prefer to use the vendor file directly - this way I am using a known preparation and it may even have vendor checksums that can be checked.
- If I ensure my solution can deal with the original file, then it could also potentially download that file automatically (not always the best choice for other reasons).
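Checking the vendor file against a published checksum can be done with .NET classes alone, which keeps with the no-extra-dependencies theme. This is a minimal sketch: Test-FileSha256 is a hypothetical helper name, and the expected hash would come from wherever the vendor publishes it.

```powershell
function Test-FileSha256 {
    param(
        [string]$Path,          # file to verify
        [string]$ExpectedHash   # SHA-256 as a hex string, any letter case
    )
    # Resolve first: .NET resolves relative paths against the process directory, not $PWD
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $sha = [System.Security.Cryptography.SHA256]::Create()
    $stream = [IO.File]::OpenRead($fullPath)
    try {
        $bytes = $sha.ComputeHash($stream)
    } finally {
        $stream.Close()
    }
    # BitConverter yields "AA-BB-..."; strip the dashes to get plain hex
    $actual = [BitConverter]::ToString($bytes).Replace('-', '')
    return ($actual -ieq $ExpectedHash)
}

# Hypothetical usage - substitute the vendor's published value:
# Test-FileSha256 -Path "$PWD\jre-8u212-windows-x64.tar.gz" -ExpectedHash '<published SHA-256>'
```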
Summary of Infrastructure as Code Principles Followed in This Solution
- Be minimalistic in automation dependencies.
- Anytime you are installing something for management or administration or troubleshooting, try to not use package managers so that these items and their dependencies will not shift the mix of software used in the main software stack.
- Use facilities outside of your chosen language in order to increase the range of “supported targets”. For instance, in this solution we used “system.io.compression.filesystem” for the broadest possible regular unzip support - non-GUI Windows and PowerShell 2 through 6. Another example is favoring schtasks.exe over the PowerShell ScheduledTask cmdlets.
- Minimize soiling of the system using isolated, file only (portable) installs whenever possible - this makes them easy to clean up.
- If you need support for PowerShell versions that do not have built-in package management, download the .nupkgs and handle them manually.
- Downloading and extracting .nupkgs prevents any scripts inside them from running and soiling the configuration.
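To illustrate those last two points, a .nupkg can be opened as a plain zip and inspected without executing anything inside it. The sketch below builds a small stand-in package locally so it runs offline; every file and folder name in it is hypothetical, and a real package would simply be downloaded instead.

```powershell
Add-Type -AssemblyName 'System.IO.Compression.FileSystem'

# Build a stand-in package so the sketch runs offline (all names here are hypothetical)
$work  = Join-Path ([IO.Path]::GetTempPath()) ("nupkgdemo-" + [Guid]::NewGuid().ToString('N'))
$stage = Join-Path $work 'stage'
New-Item -ItemType Directory -Force -Path (Join-Path $stage 'lib/net45') | Out-Null
New-Item -ItemType Directory -Force -Path (Join-Path $stage 'tools') | Out-Null
Set-Content -Path (Join-Path $stage 'lib/net45/Example.dll') -Value 'placeholder'
Set-Content -Path (Join-Path $stage 'tools/install.ps1') -Value 'Write-Host "never executed"'
$nupkg = Join-Path $work 'Example.1.0.0.nupkg'
[System.IO.Compression.ZipFile]::CreateFromDirectory($stage, $nupkg)

# Open the .nupkg as a zip and list its entries; nothing inside it is run
$zip = [System.IO.Compression.ZipFile]::OpenRead($nupkg)
try {
    # Normalize separators, since older .NET Framework versions wrote backslashes
    $entries = @($zip.Entries | ForEach-Object { $_.FullName.Replace('\', '/') })
} finally {
    $zip.Dispose()
}
$entries | Sort-Object
```

Listing the entries first is also a quick way to see which folders exist under "lib" before picking an assembly to load.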