Instructions for updating the series of POTY shortlist articles

( Page 1 2 3 4 5 6 )
Creative Commons Licence [Some Rights Reserved]   Text © Copyright January 2024, Oliver Dixon; licensed for re-use under a Creative Commons Licence.


Input data

There are two datasets as input:

1. The raw data for the current year.

2. A historical cumulative summary of previous years.

Raw data for the current year

The original data for the current year can be accessed at

Original data for current year

This comes in the form of a CSV file and contains a list of all the thumbnails referenced in all the discussion for the given year which contain the words POTY or GOTY. Note that this only applies to the current year. So for a complete year’s record, it is preferable to save the file at the very beginning of the following year.

If this date is missed, it will be necessary to download the complete archive

Original data archive

This is of course a much bigger file, and it will be necessary to discard much more of the data, only retaining that relevant to the year being processed.

Create a new Excel workbook and name it for the relevant year. Open one or other of the raw data files mentioned above and save in spreadsheet format in a worksheet which should be named original_data. The workbook must be saved in .xlsm (macro-enabled Excel) format.

The column headings in the worksheet are reasonably self-explanatory

A forum_id
B topic_id
C topic_title
D topic_time
E post_id
F seq_id
G type
H gridimage_id
I user_id
J real_name
K title


Cleaning the raw data

To clean the original data, it should by copied to another worksheet in the same workbook and named clean_data. Each worksheet covering a single year is pretty large (more than 2000 rows).

By examining column C topic_title, it should be easy to identify all the rows which are not the weekly shortlist – ie final selection, monthly winners, POTY musings, POTY hall of fame etc. All these rows should be deleted, as should any rows clearly belonging to a different year. The header row should also be deleted.

The wording in the remaining rows of column C topic_title which serves to identify the week is very variable, a fact that militates against the possibility of automating this part of the process. It should, however, be a straightforward matter to inspect the data visually with the purpose of identifying blocks of data for 52 (or occasionally 53) separate weeks. Note that in some years, the year is divided into 53 weeks – ie Week 52 part 1 and Week 52, part 2. Where this occurs these two weeks should be renamed weeks 52 and 53.

Be aware that occasionally the data for weeks are listed out of order. The blocks of rows for any such week should be moved to preserve the correct sequence. Be aware also that weeks can get embedded in the wrong year. There has been at least one instance in the past where the week has been titled incorrectly. This should be rectified manually.

Accessing the historic data

Yearly listings of the number of selected images for each contributor are stored in .csv format in Geograph Media (GM) (this is a depository for files used in Geograph collections). This can be accessed at

Alltime summary to 2024 (v3)External link

Clicking the above link allows you to download a spreadsheet showing the number of POTY selections for every user by year. Downloading may generate sundry security warning messages which can be ignored. The downloaded file can be viewed in MS Excel. This worksheet should be inserted into the current workbook and renamed crosstable.

Note that this input file is not needed if the sole purpose is to generate the AGF.

( Page 1 2 3 4 5 6 )
You are not logged in | login | register