The Simmering Pot: From Data Source to Data Sauce

In my last post, I tried to situate my exploration of a data set within media archaeology by foregrounding the gaps contained within the data set and how those point back to the discursive context of urban space. It was for this reason, that I am exporting the data set from NYC Open Data and loading it into a new database – to simulate a restoration of what is removed or omitted. The following is a more specific and technical description of my process that includes the problems encountered, patterns discovered, and the decisions made.

Extracting the data is the first step of data storage process known as the extract, transform, and load (ETL). I am adopting this term, because moving and transforming a source data set is always substantial and therefore ought to be documented. Certainly even the contents of NYC Open Data is the product of data extracted from operational systems, transformed to meet certain operational goals in presenting the data on a website, and finally loaded on to the Open Data platform.

The source data set contains the following attributes and formats:

Incident Address Display (Text, e.g. “1 AUDUBON AVENUE”)
Borough (Text, e.g. “MANHATTAN”)
Community Board (Text, e.g. “12 MANHATTAN”)
Police Precinct (Text, e.g. “Precinct 33”)
City Council District (Number, e.g. “10”)
Created Date (Date Format, i.e. “MM/DD/YYYY HH:MM:SS AM -0500/-0400”)
Status (Text, i.e. “Open” or “Closed”)
Resolution Action (Text, “Cleaning crew dispatched. Property cleaned.”)
Closed Date (Date Format, i.e. “MM/DD/YYYY HH:MM:SS AM -0500/-0400”)
X Coordinate (Number, e.g. “1,000,966”)
Y Coordinate (Number, e.g. “244,894”)

The mysterious X, Y coordinate system within the data set was the most challenging transformation I decided to make. I could have geocoded the addresses – converting them into geographic coordinates (longitude/latitude), but that would have been a transformation that subtracted meaning from the coordinate system that the City uses to locate graffiti. Through a process of system research and trial-and-error, I discovered that the City was using the State Plane Coordinate System [1]. I used the Earth Point State Plane conversion tool to test this out and locations began to map in New York City under the following parameters: 3104 New York Long Island with measurements in US Survey Feet [2].

There are three reasons why I made this transformation from State Plane to the Geographic Coordinate System. First, it is relatively easier to map data points using the Geographic Coordinate System. Mapping software like Google Maps understands it and processes it much more readily. Second, there is a lot more data and examples of how to query the Geographic Coordinate System than the State Plane System. Third, any geospatial data I add in the future will be easier to compare and contrast when the base coordinate system is the same. That it is not the source data set’s coordinate system may be important, but I hope to acknowledge that by maintaining X, Y coordinates as an attribute of my data model.

Meaningful patterns emerged in the process of transforming the coordinate system. From the points I mapped, clusters and paths became evident. Those that were clustered together or were along a path often shared the same “Created Date” or “Closed Date”. Could this possibly be evidence of the path a SCOUT investigator or a clean up crew made as they moved through the city? It was pretty exciting to see!

Loading the data into a new database was the next challenge. This is where I started to think about database design along the lines of requirement analysis and logical and conceptual design as discussed in class.

The source data set that I am working with is not final. Every month the City inserts new records and updates previous records. Records from over a year ago are dropped from the data set. The text is alive. The database should be able to handle changes to the data set, retaining removed records and recording the changes that took place for values in updated records.

Documentation such as photos and notes on records need to be connected, but not be confused for the record. Maintaining a distinction between the record and research documentation will help in the design of queries to answer questions such as “What photos do we have of this graffiti before it was removed?”

Maintaining the separation of the record from research documentation brings attention to the conception of graffiti as a database-able entity. The conceptual design of my database excludes a specific entity “graffiti”. The graffiti itself, on the building wall is impossible to capture with certainty and so ought to be considered a non entity. The graffiti that the SCOUT logs may not be the same one that a cleanup crew removes. After all, about 8% of the closed cases ended with the cleanup crew not finding the graffiti.

Data Set Structure — The tables defined within the relational database into which the source data is imported.

The entities I am proposing in my conceptual design are location oriented. I chose to structure the data model upon location as the primary entity because it is here that places are defined and actions coordinated. It is defined by geographic coordinates, a street address, and the borough name. I was considering separating street addresses from geographic coordinates, but it appears that every street address is assigned only one geographic coordinate (something I failed to consider in my earlier decision to translate the X,Y coordinate system instead of geocoding the street address).

Multiple records can be associated with one location, so I am decomposing the city’s record into a location entity and a “graffito” entity. I’m using the word “graffito” because it is the plural of graffiti, but also because the archaeological use of the word captures a sense of legal authority in the marking. The graffito here is the City’s record of a graffiti mark and not the graffiti. The City is inscribing on the walls of geospace and hard drives a legal copy of the illegal trace. Would the graffiti be illegal if the city did not have a record of it?

Documentation of the graffiti then is guided by the official record, so photos and descriptions in analysis of the “graffito” will mean that these need to be stored in the database as separate entities related to the location by way of the official record under consideration. The documentation is concerned with changes over time in the record on the graffiti. The separate documentation entity is connected to the graffito through an intermediate entity I’m calling the “graff” – a play on the word’s etymology derived from both “stylus” and “grave”. This is where the urban researcher inserts documentation on the graffito, but also forms an essential part of exhuming the body of layered traces left by the combination of the City’s data set and the researcher’s records.

I am loading the data into a custom created database where I can analyze, visualize, and append data for an argument to be finally imported into the Urban Research Tool. That transformation will require its own ETL process.

References
[1] “State Plane Coordinate System.” Wikipedia, n.d. http://en.wikipedia.org/wiki/State_Plane_Coordinate_System.
[2] “State Plane Coordinate System”, n.d. http://www.earthpoint.us/StatePlane.aspx.

One comment

Shannon says:

November 27, 2011 at 5:15 am

Astounding, Christo! Your awareness of linguistic politics — the significance of pluralizing “graffito,” the linking of writing and burying/exhuming in the “graff” — offers such a beautiful and appropriate conceptual framework for your custom-designed database itself. Equally interesting are the implications of using various coordinate systems — how awareness of the methodology by which place is designed offers insight into how the data are generated.

Leave a Reply Cancel reply