Using Geocoding APIs to Add Address Information from a Zip Code to a CSV File


Today I was talking to a musician importing thousands (many, many thousands) of contacts into a newly set up Mautic install.

The issue was that the CSV of contacts to import only contained Country Code and Zip columns, making it difficult to target people in a particular area when something is about to happen there.

So we set about trying to expand the location information in the file using free online geocoding APIs.

The intention was to come up with a single script file that could be dropped into the same folder as the CSV, which loops through each of the contacts, geocodes their address from the Country and Zip, and then writes them back into a new CSV, complete with City/State information, ready to import into Mautic.

We looked into a few APIs, but pretty much settled on the first two we found. Both have their pros and cons.

  • Google Maps Geocoding API - Good for global addresses, but limited to 2,500 lookups per day on the free tier (and 50 requests per second).
  • OpenStreetMap Nominatim Search - Only seems to know US Zip codes; no daily quota, but only 1 request allowed per second.

You can find the Google quota documentation here, and the Nominatim ones here.

I will start by saying that the code is not a one-stop drop-in for anyone to use with any CSV file. You may need to adapt it to use the information that your file contains. It is made up of various snippets from around the web. It's hacky, it's not neat, but it gets the job done.

What it needs to do:

  1. Convert the CSV to an array and assign the column headers as the array keys (a sketch of this step follows the list).
  2. Loop through the rows in batches.
  3. Build the address string for the API from the available information.
  4. Look up the location and assign the results to pre-determined array keys (abiding by any usage limits).
  5. Put the result into a new file, writing the headers on the first batch and appending the rows without them on every batch after that.
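For step 1, a minimal sketch looks something like this (it assumes the 'mailing_list.csv' filename used later in the post and a 'zip' column name, which may differ in your file):

<?php
// Read the CSV, take the first row as headers, and key every other row by them.
// Assumes each row has the same number of columns as the header row.
$lines   = file('mailing_list.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$rows    = array_map('str_getcsv', $lines);
$headers = array_shift($rows);
$contacts = array();
foreach ($rows as $row) {
    $contacts[] = array_combine($headers, $row);
}
// Each contact can now be read by column name, e.g. $contacts[0]['zip'].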

The first script we came up with tried to process the whole file at once and dump the results into a file in one go. Testing it with a small number of rows made it clear that it was going to take some time for a large file to process, and that this approach wasn't (at all) good for error feedback. Using Google initially, I basically used up my daily quota just getting the script to work.

It was clear that to get any sizeable file processed without running into timeout or memory issues, the script was going to need to work through smaller batches, kept alive by some AJAX on the frontend.

Part 1: The Google API

I'll add the original file here, which uses the Google API; however, I wouldn't recommend using it unless you have a small file and are prepared to run into a few issues!

If you want it to work on a large file (within the limits of the API usage), I would suggest taking a look at the second script and changing its address string and geocode function to work in the same way as this one.

To get it to work, you need to edit the first 8 lines so that they match the name of the file that you are reading, the headers of the Country and Zip columns, and the file that you want to save the results to.
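Those configurable lines amount to something like this (the variable names here are illustrative, not the exact ones in the script):

<?php
// Illustrative configuration block; adjust to match your own file and columns.
$inputFile     = 'mailing_list.csv';            // the CSV being read
$outputFile    = 'mailing_list_geocoded.csv';   // where the results are written
$countryColumn = 'country';                     // header of the country code column
$zipColumn     = 'zip';                         // header of the zip/postcode column
$apiKey        = 'YOUR_GOOGLE_API_KEY';         // Google Geocoding API key, if you are using one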

You need to be patient and let it continue processing (unless it's running for a crazy amount of time, in which case you should put it out of its misery).

 

Part 2: OpenStreetMap Nominatim API

This worked out to be slightly easier to use when it came to getting the results, since the address came back as a single array rather than a collection of components with the name on a second level of the array, as Google's API does. The only issue was that some results returned some fields and others didn't, and I couldn't find any specific documentation about what gets returned.

Nominatim address:

array(6) {
  ["city"]=>
  string(10) "Gardendale"
  ["county"]=>
  string(16) "Jefferson County"
  ["state"]=>
  string(7) "Alabama"
  ["postcode"]=>
  string(5) "35071"
  ["country"]=>
  string(24) "United States of America"
  ["country_code"]=>
  string(2) "us"
}

Whereas Google returned segments of an address which need to be looped over like this:

["address_components"]=>
      array(4) {
        [0]=>
        object(stdClass)#10 (3) {
          ["long_name"]=>
          string(5) "32792"
          ["short_name"]=>
          string(5) "32792"
          ["types"]=>
          array(1) {
            [0]=>
            string(11) "postal_code"
          }
        }
        [1]=>
        object(stdClass)#11 (3) {
          ["long_name"]=>
          string(11) "Winter Park"
          ["short_name"]=>
          string(11) "Winter Park"
          ["types"]=>
          array(2) {
            [0]=>
            string(8) "locality"
            [1]=>
            string(9) "political"
          }
        }

So the trade-off was that Nominatim was easier to use but only really works for the US when doing a lookup by Zip code.
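For reference, pulling the city and state out of Google's response means walking that address_components array and checking the types; a rough sketch looks like this (the function name and the zip/country inputs are illustrative, the endpoint and component types are Google's):

<?php
// Rough sketch: geocode a zip/country pair with Google and pick out city/state.
function googleGeocode($zip, $countryCode, $apiKey) {
    $url = 'https://maps.googleapis.com/maps/api/geocode/json?address='
         . urlencode($zip . ',' . $countryCode) . '&key=' . urlencode($apiKey);
    $response = json_decode(file_get_contents($url)); // stdClass objects, as in the dump above
    $city = $state = '';
    if (!empty($response->results)) {
        foreach ($response->results[0]->address_components as $component) {
            if (in_array('locality', $component->types)) {
                $city = $component->long_name;          // e.g. "Winter Park"
            }
            if (in_array('administrative_area_level_1', $component->types)) {
                $state = $component->long_name;         // e.g. "Florida"
            }
        }
    }
    return array('city' => $city, 'state' => $state);
}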

So, for the following script, we started with a Bootstrap CDN shell of a page to get it going in one file.

It initially checks to see if there is a file called 'mailing_list.csv'; if not, it won't go any further unless you add that file or change the code to modify the input filename.

If the file exists then it shows a start button.

Clicking the button sends an AJAX request to the same script with POST data telling it to start processing the file, the line it should start at (initially 1) and the number of rows to process at one time (set to 20).
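On the PHP side of that request, the top of the script just needs a small dispatch along these lines (the POST field names and the processChunk() function are illustrative placeholders for the real ones):

<?php
// Illustrative dispatch: if this is the AJAX POST, process one batch and return JSON.
if (!empty($_POST['process'])) {
    $start = isset($_POST['start']) ? (int) $_POST['start'] : 1;  // row to start at
    $limit = isset($_POST['limit']) ? (int) $_POST['limit'] : 20; // rows per batch
    header('Content-Type: application/json');
    echo json_encode(processChunk($start, $limit)); // stand-in for the batch processing described below
    exit;
}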

The process function opens the file, assigns the first row of information as the row headers and then maps that as the keys for all the other rows in the file. It then counts the total number of rows and cuts out the segment of the array that we want to process. If the batch we want to process reaches past the last row that exists, we flag it as the last time the loop should run.
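That part boils down to something like this (a sketch with illustrative names, using the same header-keyed rows as before):

<?php
// Illustrative helper: read the whole CSV, key rows by the header row, and slice
// out the batch to process, flagging whether it is the last one.
function readChunk($inputFile, $start, $limit) {
    $lines   = file($inputFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $rows    = array_map('str_getcsv', $lines);
    $headers = array_shift($rows);
    $rows    = array_map(function ($row) use ($headers) {
        return array_combine($headers, $row);
    }, $rows);

    $total = count($rows);
    $chunk = array_slice($rows, $start - 1, $limit); // rows are 1-indexed in the UI
    $last  = ($start - 1 + $limit) >= $total;        // nothing left after this batch

    return array($headers, $chunk, $last);
}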

Once we have the rows we want to process, we start looping through them as we did with the Google API script and grab the information from the Nominatim API.

At the end of each loop, we put PHP to sleep for 1 second to make sure that we don't hit the API limits and get ourselves blocked.
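The lookup itself is a structured Nominatim search by postcode and country code, followed by the one-second sleep; roughly (the wrapper function is a sketch, the query parameters are Nominatim's):

<?php
// Sketch: look up one zip/country pair via Nominatim, then pause to stay under
// the 1 request/second limit. Nominatim's usage policy also asks for a
// descriptive User-Agent, so one is set on the request.
function nominatimGeocode($zip, $countryCode) {
    $url = 'https://nominatim.openstreetmap.org/search?format=json&addressdetails=1'
         . '&postalcode=' . urlencode($zip)
         . '&countrycodes=' . urlencode($countryCode);
    $context = stream_context_create(array('http' => array(
        'header' => "User-Agent: csv-geocoder-example/1.0\r\n",
    )));
    $results = json_decode(file_get_contents($url, false, $context), true);
    sleep(1);

    if (empty($results[0]['address'])) {
        return array('city' => '', 'state' => '');
    }
    $address = $results[0]['address']; // the array shown earlier in the post
    return array(
        'city'  => isset($address['city']) ? $address['city'] : '',
        'state' => isset($address['state']) ? $address['state'] : '',
    );
}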

When we have the 20 rows of information, there's a quick check to see if we are starting at line one. If we are, then we save the headers to the file as well as the information and use the 'w' mode to write to the file (essentially wiping any previous information). If the start number is greater than one, then we use 'a' to append the new data to the end of the existing file without writing the headers again.
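In code, that just means choosing the fopen mode from the start row; roughly ($outputFile, $headers and $processedRows stand in for the script's own variables):

<?php
// Sketch: write the processed batch out, adding the header row only on the first pass.
$mode   = ($start === 1) ? 'w' : 'a';    // 'w' wipes the file, 'a' appends to it
$handle = fopen($outputFile, $mode);
if ($start === 1) {
    fputcsv($handle, $headers);          // header row, first batch only
}
foreach ($processedRows as $row) {
    fputcsv($handle, $row);
}
fclose($handle);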

Once the file is written, we send back a JSON-encoded array to the browser letting it know that the information was saved and whether it should continue or whether we have processed everything and it should stop. We also return the number of the row we started on and the number of rows we processed.
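That response is just a small JSON payload, along these lines (the key names are illustrative):

<?php
// Sketch of the response sent back to the browser after each batch.
echo json_encode(array(
    'saved'     => true,                  // the batch was written to the output file
    'next'      => !$last,                // false once the whole file has been processed
    'start'     => $start,                // row this batch started at
    'processed' => count($processedRows), // rows handled in this batch
));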

When the message is received by the AJAX function, it knows whether it should continue. If it should, it changes the start number to the next row that should be processed and passes the chunk size again. This lets us give a little feedback in the browser that things are progressing, without having to continually monitor the console and network tabs to see what's going on.

When we hit the end of the file, the process function returns a 'next' value of false and the loop stops, telling the user that the process has (hopefully) finished!

The script:

Final Thoughts

There are many improvements that could be made, but as with many one-off things, if they work then there's no need.

First off would be error reporting, so that the script could keep going while reporting any issues at the same time.

Another would be error recovery, picking up from the last row that was written if anything did halt the script unexpectedly.

For a file that isn't primarily US-based, both methods could be combined: try Nominatim first, then fall back to Google if no results are returned (possibly adding other APIs into the mix as well).
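That fallback would only take a few lines, something like this (using the two lookup functions sketched earlier; the names are illustrative):

<?php
// Sketch of the suggested fallback: try Nominatim first, then Google if it
// returned nothing useful.
$location = nominatimGeocode($zip, $countryCode);
if ($location['city'] === '' && $location['state'] === '') {
    $location = googleGeocode($zip, $countryCode, $apiKey);
}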

This data is only needed at the start so that Mautic can use it straight away. As tracked users hit the site and landing pages, known contacts will get their locations updated in the system anyway, provided you are using one of the IP-to-geolocation data options.

Let me know if the scripts helped you out for any obscure reasons or if you made any improvements!