Until recently, I’d thought of Crescat Graffiti as an art/anthropology project, and it never occurred to me to treat it as a data set. But that’s what I’m doing now, as part of putting together a guest post for a science magazine. I love a good data set, but in the process of making this one I’m finding myself racked with indecision about how much metadata to capture. Sure, I could just type out all the words and make a pretty word cloud or a word-count list, but I find myself wondering what other avenues might be fruitful:
- Differentiating walls from whiteboards from study carrels
- Date, down to the month? The day?
- Breaking the text down to the sentence level?
- Graffiti-post level? You could do things like count the average number of words in a piece of graffiti
- Writing implement?
I could do any or all of these, but I wonder how many of them are going to yield any actually interesting results.
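To make the trade-off a little more concrete, here is a rough sketch in Python of what capturing a few of those fields per piece might buy you. Everything in it is an assumption made for illustration: the `GraffitiPiece` record, its field names, and the three sample pieces are invented placeholders, not entries from the real collection.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class GraffitiPiece:
    """One piece of graffiti plus some candidate metadata (hypothetical fields)."""
    text: str
    surface: str                     # e.g. "wall", "whiteboard", "carrel"
    month: Optional[str] = None      # e.g. "2000-01"; None if not recorded
    implement: Optional[str] = None  # e.g. "pen", "marker"; None if not recorded


def word_counts(pieces):
    """Word-count list across all pieces: lowercased, edge punctuation stripped."""
    counts = Counter()
    for piece in pieces:
        for raw in piece.text.split():
            word = raw.strip(".,!?;:\"'()").lower()
            if word:
                counts[word] += 1
    return counts


def average_words(pieces):
    """Average number of whitespace-separated words per piece."""
    if not pieces:
        return 0.0
    return sum(len(p.text.split()) for p in pieces) / len(pieces)


def average_words_by_surface(pieces):
    """The same average, grouped by the kind of surface the piece was written on."""
    grouped = defaultdict(list)
    for piece in pieces:
        grouped[piece.surface].append(piece)
    return {surface: average_words(group) for surface, group in grouped.items()}


# Invented placeholder pieces, only here to exercise the functions.
SAMPLE = [
    GraffitiPiece("I miss sleep", "wall", implement="pen"),
    GraffitiPiece("Sleep is for the weak", "whiteboard", implement="marker"),
    GraffitiPiece("Why am I here?", "carrel", implement="pencil"),
]

if __name__ == "__main__":
    print(word_counts(SAMPLE).most_common(3))
    print(average_words(SAMPLE))
    print(average_words_by_surface(SAMPLE))
```

The word-count list only needs the text, but the per-surface average is impossible unless the surface was recorded when the piece was typed up, which is the argument for capturing more metadata than seems necessary at first.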
Then there’s the question of format. I was initially thinking of a public Google Docs spreadsheet, but I prefer doing my data crunching using XSLT, which suggests XML. (And I find it easier to go from XML to spreadsheet than the other way around.) That said, if I’m going the XML route, I’m doing my own schema. No doubt there’s a way to encode all of the things I want using TEI, but I’d prefer to have at least an ounce of sanity left by the time I’m done with this.
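As a rough illustration of the XML-to-spreadsheet direction, here is a minimal sketch that uses Python's standard library in place of XSLT. The element and attribute names (`graffiti`, `piece`, `sentence`, `surface`, `month`, `implement`) are made up for the example and are not a finished schema, and the sample content is placeholder text.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical home-grown schema: one <piece> per graffiti post,
# metadata as attributes, one <sentence> child per sentence.
SAMPLE_XML = """\
<graffiti>
  <piece surface="wall" month="2000-01" implement="pen">
    <sentence>I miss sleep.</sentence>
  </piece>
  <piece surface="whiteboard" month="2000-02" implement="marker">
    <sentence>Sleep is for the weak.</sentence>
    <sentence>So is coffee.</sentence>
  </piece>
</graffiti>
"""

FIELDS = ["surface", "month", "implement", "sentences", "text"]


def xml_to_rows(xml_text):
    """Flatten each <piece> into one spreadsheet-style row (a dict)."""
    root = ET.fromstring(xml_text)
    rows = []
    for piece in root.iter("piece"):
        sentences = [(s.text or "").strip() for s in piece.findall("sentence")]
        rows.append({
            "surface": piece.get("surface", ""),
            "month": piece.get("month", ""),
            "implement": piece.get("implement", ""),
            "sentences": len(sentences),
            "text": " ".join(sentences),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(xml_to_rows(SAMPLE_XML)))
```

Going this way is a straightforward flattening: the sentence structure collapses into a count and a joined string. Rebuilding the nested XML from a flat spreadsheet would mean re-deriving that structure, which is one reason the XML-first route looks easier.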
The data set will be free for anyone to use, so if you’ve got preferences (other than using TEI) or suggestions for what metadata I should include, do leave a comment.