About this Website

The Dictionary of New Zealand Biography by Guy Hardy Scholefield is available on the NZ History website as two PDFs (Volume 1, Volume 2). Surprisingly, the dictionary does not include an index of names, and while a text search is possible (thanks to OCR), quickly locating specific names is difficult.

I decided to index the Dictionary so that names could be matched against searches on my 🌳 Ancestor Search Helper site.

This project was also a way to try out various AI tools in my workflow.

My understanding is that the Dictionary is released under the Creative Commons Attribution-NonCommercial New Zealand Licence (according to the NZ History website footer).

I offer this indexed version of Scholefield's Dictionary of New Zealand Biography as a freely available resource, and I believe that this is consistent with the spirit of the Creative Commons licence.

Indexing Notes

I initially expected the indexing to be straightforward, but I hit an early hurdle. Simply copying and pasting the text of the PDFs appeared to work, but on closer inspection many entries were garbled because the original OCR did not separate the columns of text correctly.

A fresh OCR was performed on the PDF files with WebAssembly PDF Viewer and Editor.

This produced plain TXT files, each covering 50 PDF pages, which could then be combined into one file per volume.

The new OCR output was largely satisfactory, apart from a few pages where the scanner was slanted and the columns again became confused. These pages were corrected manually.

Cleanup began by removing lines containing a single capitalised word (eg, 'MURTON', the running head at the top of each page) and the printed page numbers.
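As a rough illustration of this step, here is a minimal Python sketch (the actual script was PHP and is not reproduced here). It assumes a running head is a line made up of one all-caps word and a printed page number is a line made up only of digits; the function name `strip_page_furniture` and both patterns are my own assumptions.

```python
import re

# Assumption: a running head is a single all-caps word on its own line.
RUNNING_HEAD = re.compile(r"^[A-Z][A-Z'\-]+$")
# Assumption: a printed page number is a line containing only digits.
PAGE_NUMBER = re.compile(r"^\d{1,4}$")


def strip_page_furniture(lines):
    """Drop running heads and printed page numbers, keep everything else."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if RUNNING_HEAD.match(stripped) or PAGE_NUMBER.match(stripped):
            continue
        kept.append(line)
    return kept


# Made-up sample lines, for illustration only.
sample = [
    "MURTON",
    "MURTON, SAMPLE NAME (sample entry text)",
    "123",
    "More sample entry text.",
]
cleaned = strip_page_furniture(sample)
```

A filter like this would also discard a one-word entry heading that sits alone on a line, so the output still needs checking by eye.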

The next step was to index the entries by name and place them into a CSV (spreadsheet) file.

A simple PHP script was written with the aid of GPT. Each entry was identified by finding capitalised words at the beginning of a line (eg, 'SAVAGE, MICHAEL JOSEPH'):

  • The surname and forenames were distinguished by the comma (SURNAME, FORENAME) and used in the primary index columns.
  • The end of the name was identified by brackets or lowercase letters.
  • Allowances were made for the lowercase c used in, eg, 'McDONNELL'.
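The rules above can be approximated with a single regular expression. This is a hedged Python sketch rather than the PHP script itself; the names `TOKEN`, `HEADING` and `parse_heading` are hypothetical, and the pattern only handles the plain 'SURNAME, FORENAMES' shape.

```python
import re

# A name token is all caps, optionally with a 'Mc' prefix (eg 'McDONNELL').
TOKEN = r"(?:Mc)?[A-Z][A-Z'\-]*"

# SURNAME, FORENAME [FORENAME ...] at the start of a line. The name ends
# where a bracket, a lowercase word, or the end of the line follows.
HEADING = re.compile(
    rf"^({TOKEN}),\s+({TOKEN}(?:\s+{TOKEN})*)(?=\s*\(|\s+[a-z]|\s*$)"
)


def parse_heading(line):
    """Return surname and forenames if the line opens an entry, else None."""
    match = HEADING.match(line)
    if match is None:
        return None
    return {"surname": match.group(1), "forenames": match.group(2)}


# Illustrative lines only; the text after each name is made up.
savage = parse_heading("SAVAGE, MICHAEL JOSEPH (1872-1940) was born in")
mcdonnell = parse_heading("McDONNELL, THOMAS was a soldier")
body_line = parse_heading("He was educated at a local school.")
```

One-word names and headings with aliases do not fit this pattern and return None, which is consistent with those entries needing manual attention.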

A fair amount of manual editing was done for Māori chiefs, who were often recorded with one-word names, and/or multiple aliases.

Aliases were indexed into a separate 'Also Known As' column.

The PDF page numbers were inserted into the scanned TXT file by the OCR software. These were recognised with PHP's preg_match function and saved against each entry, before the page number line was removed from the text of the biography.
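The page-tracking idea can be sketched as follows, in Python rather than the original PHP. The marker format `=== Page 12 ===` is purely hypothetical (the real OCR output may look quite different), as are the names `PAGE_MARKER`, `looks_like_heading` and `attach_pages`.

```python
import re

# Hypothetical marker format; the real OCR software may emit something else.
PAGE_MARKER = re.compile(r"^=== Page (\d+) ===$")


def looks_like_heading(line):
    """Crude stand-in for entry detection: an all-caps word, then a comma."""
    return re.match(r"^[A-Z]{2,},", line) is not None


def attach_pages(lines):
    """Group lines into entries and record the page each entry starts on."""
    current_page = None
    entries = []
    for line in lines:
        marker = PAGE_MARKER.match(line.strip())
        if marker:
            current_page = int(marker.group(1))
            continue  # the marker line itself is dropped from the text
        if looks_like_heading(line):
            entries.append({"page": current_page, "lines": [line]})
        elif entries:
            entries[-1]["lines"].append(line)
    return entries


# Made-up sample text, for illustration only.
sample = [
    "=== Page 12 ===",
    "ADAMS, JOHN was born.",
    "He farmed.",
    "=== Page 13 ===",
    "He died.",
    "BAKER, MARY was born.",
]
entries = attach_pages(sample)
```

An entry that runs over a page break keeps the page it started on, which seems the most useful number for locating it in the PDF.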

Further cleanup included:

  • Re-combining hyphen-ated words split across lines
  • Removing the original linebreaks to make the text of each biography into continuous text, while preserving the original paragraphs
  • Fixing common OCR errors via search-and-replace, such as 'tlle' for 'the' and ''/Vanganui' for 'Wanganui'
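The three cleanup steps listed above might look something like this in Python. It is a sketch under the assumption that paragraphs are separated by blank lines in the OCR text; `clean_biography` and `OCR_FIXES` are hypothetical names, and the fix list holds only the two examples mentioned.

```python
import re

# (pattern, replacement) pairs for known OCR misreadings.
OCR_FIXES = [
    (r"\btlle\b", "the"),
    (re.escape("'/Vanganui"), "Wanganui"),
]


def clean_biography(raw):
    """Rejoin split words, unwrap lines within paragraphs, fix OCR errors."""
    # 1. Rejoin words hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # 2. Treat blank lines as paragraph breaks; collapse other whitespace.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    text = "\n\n".join(p for p in paragraphs if p)
    # 3. Apply the search-and-replace fixes.
    for pattern, replacement in OCR_FIXES:
        text = re.sub(pattern, replacement, text)
    return text


# Made-up sample text, for illustration only.
raw = (
    "He settled in Welling-\nton and joined tlle\ncouncil.\n\n"
    "Later he moved to '/Vanganui."
)
cleaned_text = clean_biography(raw)
```

Step 1 cannot tell a line-break hyphen from a genuine one, so a compound such as 'well-known' split across lines would lose its hyphen.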

Inspection of the resulting CSV revealed that hyphenated names and very long names which spanned multiple lines had often not been detected correctly, requiring manual fixes.

Lastly, each entry was given a unique 'handle' for URL purposes (eg, william-bayly-2).
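One plausible way to generate such handles is shown below as a Python sketch. The forename-then-surname order and the numeric suffix are guesses based on the 'william-bayly-2' example, and `make_handle` is a hypothetical name.

```python
import re
import unicodedata


def make_handle(forenames, surname, used):
    """Build a lowercase URL handle, adding -2, -3, ... for duplicates."""
    name = f"{forenames} {surname}"
    # Strip macrons and other accents so the handle is plain ASCII.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace every run of non-alphanumeric characters with one hyphen.
    base = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    handle = base
    counter = 1
    while handle in used:
        counter += 1
        handle = f"{base}-{counter}"
    used.add(handle)
    return handle


used_handles = set()
first = make_handle("WILLIAM", "BAYLY", used_handles)
second = make_handle("WILLIAM", "BAYLY", used_handles)
third = make_handle("THOMAS", "McDONNELL", used_handles)
```

Keeping the set of handles already issued is what makes each one unique: a second person with the same name receives the '-2' suffix.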

The final output is a reasonably clean CSV file, from which the content of this site is retrieved.

Closing notes

The majority of entries have at least a date of death and often a date of birth, but the date formatting is too inconsistent to reliably index with an algorithm.

The text of the Introduction was straightforward to format from plain text to HTML, with the exception of the long table of sources in the Bibliography section, which was difficult to transcribe accurately without a lot of manual correction.

Eventually I was able to use the Claude 3 'Opus' model, which transcribed screenshots of the Bibliography pages directly to HTML with excellent accuracy.

About Me

I'm Luke Howison, a web developer based in Lower Hutt. I'm building a suite of free digital research tools.