The New AI
Analyzing Artworks¶
Accomplishing this in a statistically meaningful way requires a
substantial volume of art.
Fortunately, in today’s day and age, several museums maintain APIs.
These allow us to programmatically access not just the artwork itself, but metadata about the piece.
One such museum is the Art Institute of Chicago, whose collection contains over 120,000 pieces.
The Institute maintains a feature-rich API, allowing for bothmetadata
andimage acquisition.
Let’s take a tour of these APIs, and see if we can fuse them into our overarching color science process.
import ast
import binascii
from itertools import repeat
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
import requests
import scipy.cluster
import scipy.stats as ss
import sqlite3
from sqlalchemy import create_engine
First¶
Let’s write a small function to query the Art Institute of Chicago’s
Artworks API,
and test with a random integer.
def get_art_attributes(_id):
url = (
f"https://api.artic.edu/api/v1/artworks/{_id}?"
f"fields=id,image_id,date_end,place_of_origin,artwork_type_title"
)
r = requests.get(url)
data = json.loads(r.text)['data']
return data
get_art_attributes(897)
{'id': 897, 'date_end': 1852, 'place_of_origin': 'France', 'artwork_type_title': 'Painting', 'image_id': '5ae91cbf-66c5-cf9b-f355-629e458cb063'}
Success!¶
We now have several useful
attributes.
Notice theimage_id
, which we can use to retrieve the artwork itself.
Let’s write a small function to do just that, then test it with theimage_id
value we got previously.
def get_art_image(image_id):
im = None
url = (
f"https://www.artic.edu/iiif/2/"
f"{image_id}"
f"/full/843,/0/default.jpg"
)
try:
im = Image.open(requests.get(url, stream=True).raw)
except:
pass
return im.resize((300, 300))
get_art_image(get_art_attributes(893)['image_id'])
Excellent!¶
Now we have three very important data points:
time, place, and the painting itself.
Let’s run this through our color clustering algorithm that we developed a couple weeks ago.
def get_color_stats(im=None, ar=None):
if im and not ar:
ar = np.asarray(im)
shape = ar.shape
ar = ar.reshape(np.prod(shape[:2]), shape[2]).astype(float)
codes, dist = scipy.cluster.vq.kmeans(ar, 4)
vectors, distance = scipy.cluster.vq.vq(ar, codes)
counts, bins = np.histogram(vectors, len(codes))
colors = dict(zip(ss.rankdata(-counts), codes.tolist()))
colors = {int(k): {'rgb': v} for k, v in colors.items()}
for i, v in enumerate(colors):
colors[v]['count'] = counts[i]
for v in colors.values():
v['rgb'] = [round(n) for n in v['rgb']]
v['hex'] = f"#{binascii.hexlify(bytearray(int(c) for c in v['rgb'])).decode('ascii')}"
v['r'] = v['rgb'][0]
v['g'] = v['rgb'][1]
v['b'] = v['rgb'][2]
df = pd.DataFrame.from_dict(
data=colors, orient='index').reset_index().rename(
columns={'index': 'rank'}).sort_values(by='rank')
return df, ar
df, ar = get_color_stats(get_art_image(get_art_attributes(893)['image_id']))
df
rank | rgb | count | hex | r | g | b | |
---|---|---|---|---|---|---|---|
1 | 1 | [32, 23, 16] | 36387 | #201710 | 32 | 23 | 16 |
0 | 2 | [200, 191, 136] | 25097 | #c8bf88 | 200 | 191 | 136 |
3 | 3 | [140, 133, 79] | 17072 | #8c854f | 140 | 133 | 79 |
2 | 4 | [75, 66, 27] | 11444 | #4b421b | 75 | 66 | 27 |
We now have…¶
…the four most dominant
colors
and theircounts.
Let’s visualize the RGB pixels from the painting in 3D.
def plot_rgb(ar,s=0.1):
X = np.hsplit(ar, np.array([1, 2]))[0].flatten().tolist()
Y = np.hsplit(ar, np.array([1, 2]))[1].flatten().tolist()
Z = np.hsplit(ar, np.array([1, 2]))[2].flatten().tolist()
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection = '3d')
ax.scatter(X, Y, Z, s=s, c=ar / 255.0)
plt.show()
plot_rgb(ar)
Huzzah!¶
We’ve now combined the
API-based data retrieval
process with our color extracting algorithm.
But how do we execute this at scale?
With the never-ending usefulness ofSQLite.
Let’s write a few small functions that will:
- Create an in-memory sqlite dB
- Load a pandas dataframe into the sqlite dB
- Unify our previous functions into a workflow
def create_mem_sqlite():
engine = create_engine('sqlite://', echo=False)
return engine
def load_to_sqlite(df, name, engine):
df.to_sql(name, con=engine)
def workflow(_id):
data = {}
try:
data = get_art_attributes(_id)
if 'image_id' in data:
im = get_art_image(data['image_id'])
if im:
df, ar = get_color_stats(im)
data['colors'] = str(df['hex'].to_list())
data['counts'] = str(df['count'].to_list())
except KeyError:
pass
return data
Now let’s use our new functions…¶
to
drive the process
, and toview
the results in our SQLite dB
engine = create_mem_sqlite()
paintings = {}
for _id in range(890, 900):
data = workflow(_id)
if data:
paintings[_id] = data
paintings = {k: v for k, v in paintings.items() if v}
df = pd.DataFrame.from_dict(paintings, orient='index')
load_to_sqlite(df, 'paintings', engine)
pd.read_sql_query("SELECT * FROM paintings", engine).head(5)
index | id | date_end | place_of_origin | artwork_type_title | image_id | colors | counts | |
---|---|---|---|---|---|---|---|---|
0 | 890 | 890 | 1856 | France | Painting | e0d8a305-15b0-bdcd-1e83-06d8594a2f7e | [‘#31250f’, ‘#b2baa2’, ‘#81896e’, ‘#614d1e’] | [35462, 21664, 16916, 15958] |
1 | 891 | 891 | 1865 | France | Painting | f4d85da1-5c80-3c7b-38cc-bf324d6ce670 | [‘#2b2316’, ‘#7b7551’, ‘#5c5b43’, ‘#3f3622’] | [29658, 23832, 18438, 18072] |
2 | 893 | 893 | 1855 | France | Painting | 3527f037-a9b2-9253-1a92-dcd281b54340 | [‘#211710’, ‘#c8c089’, ‘#8d854f’, ‘#4c431c’] | [36579, 24979, 17068, 11374] |
3 | 894 | 894 | 1865 | France | Painting | f5731565-80bd-6d4c-8790-d0c252d92bd4 | [‘#493d19’, ‘#b6c2bd’, ‘#6c602c’, ‘#8f9e98’] | [34497, 23738, 20884, 10881] |
4 | 895 | 895 | 1885 | Germany | Painting | fa96ef54-c3b1-8f4d-390a-219f7bc64c4a | [‘#5e6652’, ‘#878566’, ‘#aeab8d’, ‘#3b332b’] | [39559, 23829, 16540, 10072] |
Lastly…¶
Let’s create a unified
3D plot
for all the pixels in all of our paintings.
ar = np.empty(shape=[1, 3])
for image_id in [i for i in pd.read_sql_query("SELECT image_id FROM paintings", engine)['image_id'].to_list() if i]:
im = get_art_image(image_id)
im = im.resize((300, 300))
_ar = np.asarray(im)
shape = _ar.shape
_ar = _ar.reshape(np.prod(shape[:2]), shape[2]).astype(float)
ar = np.append(ar, _ar , axis=0)
plot_rgb(ar)
Take a close look at this plot¶
What strikes you about it?
It’s crammed to the gills with data points.
Remember that our ultimate goal is to analyze
dominant colors
acrossmultiple artworks.
So it seems we have two options:
- Add all pixels from
all paintings
into one 3D space, then find dominant color clusters.- For
each painting
, extract dominant pixel cluster center and count. Add these to 3D space, weigh by count, and find dominant color clusters within that space.
Let’s compare the results of these two methods and talk through the pros and cons of each.¶
option1_df, option1_ar = get_color_stats(ar=ar)
option2_raw = pd.read_sql_query("SELECT * FROM paintings", engine)
option2_raw['colors'] = option2_raw['colors'].apply(lambda x: ast.literal_eval(x))
option2_raw['colors'] = option2_raw['colors'].apply(lambda x: [list(int(h.replace('#', '')[i:i+2], 16) for i in (0, 2, 4)) for h in x])
option2_raw['counts'] = option2_raw['counts'].apply(lambda x: ast.literal_eval(x))
my_array = []
for c, n in zip(option2_raw['colors'].to_list(), option2_raw['counts'].to_list()):
for ci, ni in zip(c, n):
my_array.extend(repeat(ci, ni))
my_array = np.array([[float(number) for number in group] for group in my_array])
option2_df, option2_ar = get_color_stats(ar=my_array)
print(option1_df)
print('*'*60)
print(option2_df)
figure, axis = plt.subplots(1, 2, figsize=(10,5))
axis[0].bar(
x=option1_df['hex'],
height=option1_df['count'],
color=option1_df['hex'].tolist())
axis[0].set_title("Option 1: All Pixels in one Space")
axis[1].bar(
x=option1_df['hex'],
height=option2_df['count'],
color=option2_df['hex'].tolist())
axis[1].set_title("Option 2: Weighted Dominant Colors from Each Painting")
plt.show()
rank rgb count hex r g b 3 1 [45, 33, 15] 212886 #2d210f 45 33 15 1 2 [83, 65, 25] 183114 #534119 83 65 25 0 3 [175, 177, 152] 163586 #afb198 175 177 152 2 4 [116, 114, 83] 160415 #747253 116 114 83 ************************************************************ rank rgb count hex r g b 0 1 [116, 115, 84] 199462 #747354 116 115 84 2 2 [45, 33, 14] 197276 #2d210e 45 33 14 3 3 [79, 62, 25] 180488 #4f3e19 79 62 25 1 4 [179, 181, 155] 142774 #b3b59b 179 181 155
Our results¶
Option 1
considers all pixels from multiple artworks in the color clustering algorithm.
Option 2
considers only the dominant colors from each painting, and weighs them by the number of original pixels in each cluster.
Interesting…
The cluster centers barely changed, but the associated counts did, resulting in different dominant clusters.
Theexplanation
for these results is that in option 1, pixels near the borders of clusters made their way into different clusters.
At first glance, this seems to be the result of randomness, suggesting that option 2 is superior.
However, option 2 is ultimately a summary of a summary – in statistical terms, more degrees of freedom – and therefore, option 1 is superior.
Tylor Mondloch¶
Tylor is a
data scientist
at aBig 4 Consulting Firm.
His day-to-day includes buildingstatistical models
forcybersecurity
contracts.
He was born inSouth Dakota
and now resides inBillings, Montana.
Stephen Zhu¶
Stephen is a
data scientist
working with ahydro electricity company.
In his spare time, he lovesrock climbing
and modeling thefinancial market.
He was born inHangzhou, China,
and now resides inVancouver, Canada.
Nigel Joseph¶
Nigel is an
analytics manager
at a multinationalpharmaceutical company.
He currently providesbusiness insights
andforecasting
expertise for newly launched products.
In his spare time, Nigel likes to playboard games
andgolf
(poorly).
tags:¶
georgia tech
programming
computer vision
color science
machine learning
data science
K-Means
Clustering
Art Institute of Chicago