better go game storage

One of the nicest things about researching cultural evolution in the game of Go is that game records are easy to digitize and the community quickly settled on a single sortage format, the SGF file format.

Originally developed by Anders Kierulf as part of his dissertation.

Allows one to record not just the game,but branching alternatives. It's a flexible, reliable text-based format. Every go-playing program uses sgfs as the universal output.

Most important is that it's plaintext, so is open-source.

There's a few things that are undesireable about it.

There's no good programs for manipulating large databases of games.

From a technical perspective, the format has some limitations for beginners.

Scraping is kind of a pain because SGF uses three different brackets to store different classes ( ) for branches ; for nodes, and and [ ] for tags. It's somewhat complicated.

The games are not especially readible, and the lettering convention is not the same as other conventions.

There's no way to record the position of the board with respect to each player.

The most important is that data management has moved on to better solutions since the late 1980s. Heirarchical datasets are ubiquitous and techniques for doing all these things.

When I wrote my own dissertation using GoGoD, I used a bespoke parser for SGFs that moved into a flat dataframe in R. It was buggy and overdesigned, but pushed me to develop my skills better.

The most important problem was that I wasn't thinking heirarchically, I wanted it to be 2-dimensional array when it was fundamentally not.

The near-universal standard is JSON. If we could get Go games into JSON, we can then do whatever we want with them using industry-standard methods, including what I wanted to do in my dissertation with far less work.

Redesigning the Go Game

What does an SGF want? It's a tree structure. All Go Games look like this

( root )

Where the ( ) represents a branch. The stuff inside can include branches out from root, so

( root ( branch1 ) ( branch2 )

Each branch is exactly the same object, so you can recurse this as you please

( root ( branch1 ( branch11 ) ( branch 12 ) ) ( branch2 )

each branch is made up entirely of a collection of nodes, each starting iwth ;. So a branch might look like

;tag1[data]tag2[data]tag3[data];tag4[data]

where the firstnode has 3 key/value pairs.

The restriction that each tag have only two characters leads to cryptic data storage.

(
  ;FF[4]
  AP[CGoban:3]
  BR[1p]
  CA[UTF-8]
  DT[2016-12-29 19:01:34]
  GM[1]
  RU[Japanese]
  KM[6.5]
  TM[0]
  OT[3x20 byo-yomi]
  PB[Pan Tingyu]
  PC[Tygem]
  PW[AlphaGo Master]
  RE[W+Resign]ST[2]SZ[19]WR[9p]
  ;B[qd]
  ;W[pp]
)

Here's the JSON

{
  "nodes": [
    {
      "FF": ["4"],
      "DT": ["2016-12-29 19:01:34"],
      "GM": ["1"],
      "RU": ["Japanese"],
      "KM": ["6.5"],
      "OT": ["3x20 byo-yomi"],
      "PB": ["Pan Tingyu"],
      "PW": ["AlphaGo Master"],
      "RE": ["W+Resign"],
    },
    {
      "B": ["qd"]
    },
    {
      "W": ["pp"]
    }
}
nodes:
- board_size: 19
  black_setup_stones: ac, af, dc
  white_setup_stones: fi, cb
  player_black: bret beheim
  player_white: SmartGo
- move_number: 1
  move : aa
  color: black
- move_number: 2
  move : bb
  color: white
branches:
- nodes:
  - B: cc
  - W: dd
  - B: ad
  - W: bd
- nodes:
  - B: hh
  - W: hg

While this is more verbose, it is vastly more flexible. - explicit numbering of moves in the game, which also serves as a check against corruptions in the game record - no longer requires every key to have at most two characters. - parsing using standard JSON tools, on any platform - explict naming of nodes of the branch as an object, along with a branches object, so each can be addressed directly - forces escaping of { } and [ ] inside comments - no longer uses ( ) which is more common to encounter inside comments - keys can be any length, not just two - somewhat more readible - encourages new tags not originally though of (e.g. what is top of the board with respect to black, white players

The Go community is not going to change to a more universal format anytime soon. However, tool can be useful if you want to use the Go games for analysis.

Golden age of AI research, having beginners having to deal with cryptic SGF format is annoying. I wrote this tool so people can get to JSON automatically, then proceed from there!

kaya features to add: - expand sgf tags into more readible format by a dictionary, as part of parse sgf - storing databases of games...how exactly? individual json files, yes, then whats the next step? - verbose moves, all the stuff in moves table as seperate tags...do away with coordinates? - simplify unit test inputs, rename normal games as "real" games and normal tests as simple tests pointed at redbean games - do the same thing for CHESS

how the parsing works

I can probably make a better version of this where the parenthesis don't need to be turned into curly races

split, which breaks one string of text into multiple pieces; split_sgf takes an sgf_string and returns an object with $nodes and $branches, each of which is also the same kind of object split_branch takes a branch string and returns vector of nodes, split_node takes a node string and breaks it up into a list of tag strings split_tag takes the tag string and breaks it up into a key and value

from these workhorse functions, we have the parse functions: parse_tag takes a vector of tag strings (from split_node), applies split_tag, and turns it into a named list parse_node takes a branch (from split_sgf), applies split_node, and then runs parse_tag on each node, creating an unnamed lits of nodes parse_sgf takes an sgf_file, applies split_sgf, and then runs parse_node on each output

(?<!\)[|(?<!\)](?<!\)[|(?<!\)] this identifies the locations of three patterns, [, ] and [], with a lookbehind to ignore escapes we simply split the string at each location and we're good!

(;FF[4]GM[1]SZ[19];B[aa];W[bb] (;B[cc];W[dd];B[ad];W[bd]) (;B[hh];W[hg]) )

each branch needs the same operations on it: 1. identify if it is a tip or not 2. store nodes output$nodes, store sub-branches as entries in the output$branches list 3. parse all branches in the same way

output$nodes ";FF[4]GM[1]SZ[19];B[aa];W[bb]"

output$branches output$branches[[1]] ";B[cc];W[dd];B[ad];W[bd]" output$branches[[2]] ";B[hh];W[hg]"

we absolutely must escape everything in the comments... but we cannot process a game that has an unmatched [ in the comments! even a [ will fail, I believe!

this is the remaining problem, I dunno how you can parse this correctly... string <- "PB[br[et\]asdf]PW[paul]" check <- gregexpr(comment_pattern, string, perl = TRUE) y <- regmatches(string, check)[[1]] expect_true(y[1] == "[br[et\]asdf]")

split_node

in other words, splits do the operation of one to many strings

parse does the operation of creating the new logical structure on individual string

the split operation removes whatever the seperator was, if any

split_node removes nothing,

split_branch removes ;

split_sgf removes the ( )

parse sgf is awkward in how it passes around the sgf_string

and it needs to be able to handle MULTIPLE GAMES

seperated at the top level by () ()

also needs to be able to discard all stuff OUTSIDE

the ( ) ( )

parse tags

parse tags expects individua tag strings, e.g. "PB[bret]" and returns a named list it will also work for setup stone tags, e.g. "AB[aa][bb]"

so, it's looking for square brackets to break up

it will ignore escaped square brackets, but without escaping it will will improperly break up nested square brackets

it is smart enough to ignore escaped

output is named list of length 1, where names are object keys



babeheim/kaya documentation built on Oct. 10, 2022, 12:13 p.m.