The User Side
To see how it works on the User Side, please go and explore the Demo
To see how it works on the User Side, please go and explore the Demo
After there was a working gwreports module and it was nearing release to the world, the question of how to demonstrate it came up. It needed a substantial set of data that wasn't proprietary, confidential or otherwise encumbered. After striking out with a google search for "big gobs of data" (hey. sometimes you get lucky,) the idea surfaced to look for weather data. That lead rather quickly to the Global Historical Climatology Network data base. This was a perfect find.The data has very familiar qualities for anyone that has spent time around legacy business data locked in older technologies. The data is in fixed width text files with the instructions for interpreting it expressed in Fortran code. But, too, this is not long forgotten data. Quite the opposite, this is a current and growing collection, invaluable to ongoing climate research. The only thing more one might want is maybe a "big gobs of data" meta tag?This made for a nearly perfect fit for a system that is supposed to be a tool to make data more accessible. Hopefully, this document describing the transformation from a set of text files into the interactive reports of the gwreports demo will be of value as a case study for those facing similar problems in real world business situations.
A Plan
As this was not an open-ended project plan, some base decisions had to be made to limit the scope. After a quick review of the available data, here was the plan outline:
Focus on precipitation and related meta-data
Define tables based on existing data model
Convert data into loadable form
Load data into MySQL
Build some reports
OK, it was a bit loose, but it was a plan.
In this plan, these were the files of interest:
v2.country.codes - countries and numerical country code
v2.prcp.inv - meta data describing observation stations
v2.prcp - raw precipitation data
v2.prcp_adj - adjusted precipitation data
The v2.prcp.readme indicates the adjusted data file both eliminates some data points and adds some others. This lead to another design decision, merge these two sets for the demo.
For reference, here are brief snippets from the three major files. (v2.prcp and v2.prcp_adj are identical in structure.)
v2.country.codes
v2.prcp.inv
v2.prcp
Defining tables
The tables below mimic the file layout of the existing data. The full station code is actually three fields, one of which is the country code. The country code will be important to accessing the data, while the full station code is a key to the precipitation data. They both were needed separately.
Not beautiful, but it represents the existing data.
From Files to Database Tables
There are lots of options for pulling data out of text files. And there are several options for getting data into some MySQL tables. The "quick" part of the initial planning was still a major consideration. This led to a choice of a few lines of Awk code to transform the data as needed, followed by a MySQL LOAD DATA to pull things into the database.
Awk is a terse but powerful tool. The economy of effort for tasks like this more than offsets the initial investment in learning the language. In example, here is the complete program to convert the v2.country.codes file into a file ready to use with LOAD DATA:
That's it. When it was saved as countryload.awk the command line to perform the conversion looked like this:
The resulting file was then loaded into the database with the following MySQL command:
One down, two (or three) to go.
Just a note on the REPLACE option. Sometimes, the data files as supplied end with repeated lines. I don't know why, and I'm not too concerned with the reason that is probably buried in some Fortran code. I've used Fortran before, but in this case it doesn't really matter, since someone else pulls the data, we just want to use it. By using the REPLACE option we sidestep any errors on the repeating lines, make it so we automatically merge the two v2.prcp files, and make is so we can set up an automatic refresh with nothing more than rerunning the process. Hmmmm, love that "quick" constraint.
Loading the stations was just a tiny bit more complicated:
And the precipitation data was similar, just with more columns:
We need to use this last bit twice, once for v2.prcp, and again for v2.prcp_adj, and LOAD DATA in that order to capture the adjusted records.
Got data, now what?
Populating the tables is just the first step, as the whole point of this exercise was to create a demo, not of precipitation data, but of gwreports. There were three main reports that were obvious:
A list of Countries
A list of Stations in a Country
Precipitation records for a Station
Starting with the Country list, the report SQL looks like this:
Add a parameter definition for cname as Like Text and we have a report ready to run. We can move on to the Stations by Country list:
Note here we use the MySQL CASE construct to convert the coded elevation value of -999 (meaning the elevation is not available) into something more meaningful for humans.
We can make this a bit fancier by adding another section with country information like this:
Make that section Multirow=No, and reorder the sections to put it on top. Define a ccode parameter (Text, length of) and we've got another one down.
The Precipitation listing has more CASE magic, and lots of repetition, but is fairly simple. We also convert the tenths of millimeters into millimeters, again for human readability:
Of course we need a scode parameter, text, length of 11. And, just for fun, add a Station information section:
From Dumb Reports to a System
The reports "work" at this point. You can look up a country code, plug that into the stations report to look up a station, plug that into the precipitation listing. Uggh! How about we go to the country list, click test, click on the country_name column header, and enter the following as the Extended format:
OK, we really needed to look up the report id to put after the rid= part, but that really is all there is to linking the country list to the station list. Repeat that concept on the stations listing, adding this to the station_code column:
Voila! No more memorizing or cutting and pasting. Just click through from one report to another. Now, ready for the kicker?
Elapsed Time
The time from setting out to find "big gobs of data" to having a system of three interlinked reports was one afternoon for one person.Yes, really. "Quick" was a requirement, remember? OK, there were several more tweaks to be made, and after a little bit of playing, a few more reports suggested themselves. There was more effort expended to polish the demo, test it under different circumstances, and find and fix a tiny bug in the process that gave everyone very dry Decembers. But building the proof of concept with live data fit between lunch and dinner. Time for pizza and beer!
Wrapping Up
Hopefully, when you put this write-up together with the manual and a little bit of quality experimentation time, you will be able to envision more practical uses for this power. The purpose behind gwreports was to make data accessible, and we hope this demonstration proves success in that goal.
Last updated