Data Science Series #1: Predictions Don’t Always Work
While exploring open art datasets during my time at the University of Illinois at Chicago, I became interested in the possibility of predicting whether an artwork would be cataloged or not. Cataloging, as Merriam-Webster defines it, is the action of “classify[ing] (something, such as books or information) descriptively.” For art historians and for those who work at museums and galleries, this activity is of the utmost importance.
The Museums & Galleries of NSW explains that “activities such as research, interpretation, conservation, risk management, exhibition development and publications are dependent on detailed and up-to-date collection information.” My hope, then, was that trends surfaced by a predictive algorithm could tell fascinating stories about an institution’s preferences in exhibiting art, and what better institution to pick than the Museum of Modern Art (MoMA) in New York City?
MoMA holds one of the largest art collections in the United States and currently has almost 200,000 artworks in its collection. Following the lead of the Tate and Cooper Hewitt museums, the MoMA has released a portion of its collection online as a GitHub repository. This online dataset contains 131,151 works with 15,222 artists referenced. For the purpose of this predictive question, we treat “Not Curator Approved” as not cataloged. So, let’s start data sciencing!!
What Does The Data Even Look Like?
Well, the MoMA has 29 descriptive features for each artwork, and you can see some of them in the table snippet below. From this snippet, we can also get a wider understanding of the art represented in the MoMA, like how some pieces have multiple artists attributed or how some artists only have a few of their works cataloged. Most importantly, we can see which attributes pack multiple values into a single cell.
From here, we, as data scientists, must clean and prepare the data so that it can be fed into our algorithms. In my case, I replaced the words in the “CreditLine,” “Department,” and “Classification” columns with numerical data where possible. Where that was not possible, for example with the list-valued cells, I either counted the observations within the list or created new binary columns to represent them. For example, the “NumArtists” column refers to the number of artists who worked on a given piece, while the “color” column flags pieces that have color as part of their medium. A table snippet of the result is shown below.
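The cleaning steps above can be sketched with pandas. This is a minimal illustration on made-up rows; the column names and string patterns are assumptions based on the article, so check them against the actual MoMA CSV before reusing.

```python
import pandas as pd

# Made-up rows standing in for the MoMA table; column names are
# assumptions based on the article, not verified against the CSV.
df = pd.DataFrame({
    "Artist": ["Pablo Picasso", "Otto Wagner, Jože Plečnik", "Paul Klee"],
    "Medium": ["Oil on canvas", "Ink and colored pencil on paper", "Watercolor"],
    "Department": ["Painting & Sculpture", "Architecture & Design", "Drawings"],
})

# Count how many artists appear in the (comma-separated) Artist cell.
df["NumArtists"] = df["Artist"].str.split(",").str.len()

# Binary flag: does "color" appear anywhere in the medium description?
df["color"] = df["Medium"].str.contains("color", case=False).astype(int)

# Replace a word-valued column with integer codes.
df["DepartmentCode"] = df["Department"].astype("category").cat.codes

print(df[["NumArtists", "color", "DepartmentCode"]])
```

The substring match for “color” is deliberately crude (it also matches “Watercolor”); a real pass over the medium text would need a more careful rule.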
Assumptions, Algorithms, … and Assessments?
The assumptions about a dataset fuel which algorithms are used on it, and my assumption was that each feature of an artwork was in some way independent of the others. Independence is the idea that the probability of one feature occurring does not affect the probability of another occurring. For example, the artist being Pablo Picasso could be independent of the artwork being colored. In hindsight, however, this assumption seems obviously flawed: many artists focus on particular crafts, just as the departments of the MoMA focus on particular styles.
In any case, I tried out two different predictive algorithms: logistic regression and Naive Bayes. Both of these algorithms assume some type of independence. Logistic regression is a discriminative model that works by finding the decision boundary between classes (e.g., not cataloged and cataloged), while Naive Bayes applies Bayes’ theorem under the simplifying assumption that all attributes are independent of one another.
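Fitting and scoring both models can be sketched with scikit-learn. The data below is a random toy stand-in (not the MoMA features), so the accuracies it prints have nothing to do with the numbers reported later; it only shows the shape of the comparison.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Toy stand-in data: X would be the cleaned MoMA feature table and
# y the 0/1 "cataloged" label in the actual experiment.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

accs = {}
for model in (LogisticRegression(max_iter=1000), GaussianNB()):
    # Average accuracy over 5 cross-validation folds.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    accs[type(model).__name__] = scores.mean()
print(accs)
```

Averaging over cross-validation folds, as `cross_val_score` does here, is one common way to arrive at a single accuracy figure per model.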
From here, I took the average accuracy of both algorithms: logistic regression reached 64% and Naive Bayes 39%. (I stared at this for a minute, then two. Recomposed myself and asked: why?) Why did it not reach an accuracy above 80%? What could I learn from this?
As said earlier, many of the attributes are not independent of each other. Still, some attributes like “CreditLine” and “Department” reveal interesting relationships: for instance, none of the artworks in the Fluxus Collection (one of the “Department” values) have been cataloged yet. If we step back and look at individual departments instead of the MoMA as a whole, this kind of prediction might be better suited, since each department may have different priorities or metrics for deciding which works need complete cataloging. That variation makes it harder to determine which attributes are actually independent, if there is any independence at all.
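A quick sanity check along these lines is to compute the cataloging rate per department. This is a sketch on invented rows; the “Cataloged” Y/N column and the department labels are assumptions about the dataset’s layout, not verified values.

```python
import pandas as pd

# Invented rows mimicking the described pattern: the Fluxus Collection
# rows are all uncataloged. Column names are assumptions.
df = pd.DataFrame({
    "Department": ["Fluxus Collection", "Fluxus Collection",
                   "Painting & Sculpture", "Painting & Sculpture"],
    "Cataloged": ["N", "N", "Y", "N"],
})

# Share of cataloged works within each department.
rate = (
    df.assign(IsCataloged=df["Cataloged"].eq("Y"))
      .groupby("Department")["IsCataloged"]
      .mean()
)
print(rate)
```

A table like this, computed on the full dataset, would show at a glance which departments drag the class balance in one direction and might deserve their own model.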
Featured Image Credit: Matheus Viana