An Exploration of Color Recognition through Clustering
Analyzing Artworks
Accomplishing this in a statistically meaningful way requires a substantial volume of art. Fortunately, several museums now maintain public APIs that allow us to programmatically access not just the artwork itself, but also metadata about each piece. One such museum is the Art Institute of Chicago, whose collection contains over 120,000 pieces. The Institute maintains a feature-rich API that supports both metadata and image acquisition. Let’s take a tour of these APIs and see if we can fuse them into our overarching color science process.
```python
import ast
import binascii
from itertools import repeat
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
import requests
import scipy.cluster
import scipy.stats as ss
import sqlite3
from sqlalchemy import create_engine
```
Let’s write a small function to query the Art Institute of Chicago’s Artworks API, then test it with an arbitrary artwork ID.
```python
def get_art_attributes(_id):
    url = (
        f"https://api.artic.edu/api/v1/artworks/{_id}?"
        f"fields=id,image_id,date_end,place_of_origin,artwork_type_title"
    )
    r = requests.get(url)
    data = json.loads(r.text)['data']
    return data

get_art_attributes(897)
```
```
{'id': 897,
 'date_end': 1852,
 'place_of_origin': 'France',
 'artwork_type_title': 'Painting',
 'image_id': '5ae91cbf-66c5-cf9b-f355-629e458cb063'}
```

Success! We now have several useful attributes. Notice the image_id, which we can use to retrieve the artwork itself. Let’s write a small function to do just that, then test it with an image_id fetched the same way.
```python
def get_art_image(image_id):
    im = None
    url = (
        f"https://www.artic.edu/iiif/2/"
        f"{image_id}"
        f"/full/843,/0/default.jpg"
    )
    try:
        im = Image.open(requests.get(url, stream=True).raw)
    except:
        pass
    # Return None when the image could not be fetched
    return im.resize((300, 300)) if im else None

get_art_image(get_art_attributes(893)['image_id'])
```
(The retrieved painting is displayed, resized to 300 × 300 pixels.)

Excellent! Now we have three very important data points: time, place, and the painting itself. Let’s run this through the color clustering algorithm that we developed a couple of weeks ago.
```python
def get_color_stats(im=None, ar=None):
    # Accept either a PIL image or a pre-flattened pixel array
    if im and not ar:
        ar = np.asarray(im)
        shape = ar.shape
        ar = ar.reshape(np.prod(shape[:2]), shape[2]).astype(float)

    # k-means with four clusters, then assign every pixel to its nearest center
    codes, dist = scipy.cluster.vq.kmeans(ar, 4)
    vectors, distance = scipy.cluster.vq.vq(ar, codes)
    counts, bins = np.histogram(vectors, len(codes))

    # Rank the cluster centers by pixel count (1 = most common)
    colors = dict(zip(ss.rankdata(-counts), codes.tolist()))
    colors = {int(k): {'rgb': v} for k, v in colors.items()}
    for i, v in enumerate(colors):
        colors[v]['count'] = counts[i]
    for v in colors.values():
        v['rgb'] = [round(n) for n in v['rgb']]
        v['hex'] = f"#{binascii.hexlify(bytearray(int(c) for c in v['rgb'])).decode('ascii')}"
        v['r'] = v['rgb'][0]
        v['g'] = v['rgb'][1]
        v['b'] = v['rgb'][2]

    df = pd.DataFrame.from_dict(
        data=colors, orient='index').reset_index().rename(
        columns={'index': 'rank'}).sort_values(by='rank')
    return df, ar

df, ar = get_color_stats(get_art_image(get_art_attributes(893)['image_id']))
df
```
Rank | RGB | Count | Hex | r | g | b |
---|---|---|---|---|---|---|
1 | [32, 23, 16] | 36405 | #201710 | 32 | 23 | 16 |
2 | [200, 192, 137] | 24989 | #c8c089 | 200 | 192 | 137 |
3 | [140, 133, 79] | 17134 | #8c854f | 140 | 133 | 79 |
4 | [75, 66, 27] | 11472 | #4b421b | 75 | 66 | 27 |
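As a quick aside on the Hex column: the binascii-based encoding inside get_color_stats is equivalent to plain f-string hex formatting. A minimal sanity check, using the top-ranked color from the table above:

```python
import binascii

# Top-ranked cluster center from the table above
rgb = [32, 23, 16]

# The encoding used inside get_color_stats
hex_binascii = f"#{binascii.hexlify(bytearray(int(c) for c in rgb)).decode('ascii')}"

# Equivalent f-string formatting: two lowercase hex digits per channel
hex_fstring = "#" + "".join(f"{c:02x}" for c in rgb)

assert hex_binascii == hex_fstring == "#201710"
```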
```python
def plot_rgb(ar, s=0.1):
    # Split the N x 3 pixel array into R, G, and B coordinate lists
    X = np.hsplit(ar, np.array([1, 2]))[0].flatten().tolist()
    Y = np.hsplit(ar, np.array([1, 2]))[1].flatten().tolist()
    Z = np.hsplit(ar, np.array([1, 2]))[2].flatten().tolist()
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(X, Y, Z, s=s, c=ar / 255.0)
    plt.show()

plot_rgb(ar)
```

(The painting’s pixels plotted as a 3D scatter in RGB space.)

Huzzah! We’ve now combined the API-based data retrieval process with our color-extraction algorithm. But how do we execute this at scale? With the never-ending usefulness of SQLite. Let’s write a few small functions that will:

1. Create an in-memory SQLite DB
2. Load a pandas DataFrame into the SQLite DB
3. Unify our previous functions into a workflow
```python
def create_mem_sqlite():
    engine = create_engine('sqlite://', echo=False)
    return engine

def load_to_sqlite(df, name, engine):
    df.to_sql(name, con=engine)

def workflow(_id):
    data = {}
    try:
        data = get_art_attributes(_id)
        if 'image_id' in data:
            im = get_art_image(data['image_id'])
            if im:
                df, ar = get_color_stats(im)
                data['colors'] = str(df['hex'].to_list())
                data['counts'] = str(df['count'].to_list())
    except KeyError:
        pass
    return data
```

Now let’s use our new functions to drive the process and to view the results in our SQLite DB.
```python
engine = create_mem_sqlite()

paintings = {}
for _id in range(890, 900):
    data = workflow(_id)
    if data:
        paintings[_id] = data

# Drop empty results, load the rest into SQLite, and peek at the table
paintings = {k: v for k, v in paintings.items() if v}
df = pd.DataFrame.from_dict(paintings, orient='index')
load_to_sqlite(df, 'paintings', engine)
pd.read_sql_query("SELECT * FROM paintings", engine).head(5)
```
Index | ID | Date End | Place of Origin | Artwork Type Title | Image ID | Colors | Counts |
---|---|---|---|---|---|---|---|
0 | 890 | 1856 | France | Painting | e0d8a305-15b0-bdcd-1e83-06d8594a2f7e | [‘#30240f’, ‘#b2baa2’, ‘#82896d’, ‘#5f4b1d’] | [34565, 21714, 17089, 16632] |
1 | 891 | 1865 | France | Painting | f4d85da1-5c80-3c7b-38cc-bf324d6ce670 | [‘#2b2316’, ‘#7a7451’, ‘#5b5b42’, ‘#3f3623’] | [30047, 24408, 17908, 17637] |
2 | 893 | 1855 | France | Painting | 3527f037-a9b2-9253-1a92-dcd281b54340 | [‘#201710’, ‘#c8bf88’, ‘#8c854f’, ‘#4b421b’] | [36420, 25082, 17065, 11433] |
3 | 894 | 1865 | France | Painting | f5731565-80bd-6d4c-8790-d0c252d92bd4 | [‘#493d19’, ‘#b7c2bd’, ‘#6d612d’, ‘#909f9a’] | [35077, 23334, 20413, 11176] |
4 | 895 | 1885 | Germany | Painting | fa96ef54-c3b1-8f4d-390a-219f7bc64c4a | [‘#5e6652’, ‘#878566’, ‘#aeab8d’, ‘#3b332b’] | [39549, 23843, 16540, 10068] |
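Since the results now live in SQLite, we can also slice them with plain SQL. A minimal sketch, reusing the engine from above (and assuming the stored column is named place_of_origin, matching the API field), that counts paintings per place of origin:

```python
# Count how many retrieved paintings come from each place of origin
pd.read_sql_query(
    """
    SELECT place_of_origin, COUNT(*) AS n_paintings
    FROM paintings
    GROUP BY place_of_origin
    ORDER BY n_paintings DESC
    """,
    engine,
)
```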
```python
ar = np.empty(shape=[1, 3])
for image_id in [i for i in pd.read_sql_query("SELECT image_id FROM paintings", engine)['image_id'].to_list() if i]:
    im = get_art_image(image_id)
    im = im.resize((300, 300))
    _ar = np.asarray(im)
    shape = _ar.shape
    _ar = _ar.reshape(np.prod(shape[:2]), shape[2]).astype(float)
    ar = np.append(ar, _ar, axis=0)

plot_rgb(ar)
```

(Pixels from every retrieved painting plotted together in RGB space.)
Take a close look at this plot.
What strikes you about it? It’s crammed to the gills with data points.
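To put a rough number on that: each 300 × 300 image contributes 90,000 pixel rows to ar, so even a handful of paintings piles up hundreds of thousands of points. A quick check:

```python
# Each 300x300 image adds 300 * 300 = 90,000 rows of (r, g, b)
print(ar.shape)
print(f"~{ar.shape[0] // (300 * 300)} images worth of pixels")
```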
Remember that our ultimate goal is to analyze dominant colors across multiple artworks. So it seems we have two options:
- Add all pixels from all paintings into one 3D space, then find dominant color clusters.
- For each painting, extract the dominant cluster centers and their pixel counts. Add these centers to the 3D space, weight each by its count (see the sketch after this list), and find dominant color clusters within that space.
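The weighting step in option 2 is worth a small illustration. scipy’s kmeans has no per-sample weight argument, so a cluster center counted N times is simply repeated N times before re-clustering. A toy sketch with made-up centers and counts:

```python
from itertools import repeat

import numpy as np

# Hypothetical dominant colors from one painting: cluster centers and pixel counts
centers = [[200, 30, 30], [20, 20, 180]]
counts = [3, 2]

weighted = []
for center, count in zip(centers, counts):
    # Repeat each center 'count' times so k-means effectively sees it with that weight
    weighted.extend(repeat(center, count))

weighted = np.array(weighted, dtype=float)
print(weighted.shape)  # (5, 3): three copies of the first center, two of the second
```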
Let’s compare the results of these two methods and talk through the pros and cons of each.
```python
# Option 1: cluster every pixel from every painting at once
option1_df, option1_ar = get_color_stats(ar=ar)

# Option 2: rebuild a weighted pixel set from each painting's stored dominant colors
option2_raw = pd.read_sql_query("SELECT * FROM paintings", engine)
option2_raw['colors'] = option2_raw['colors'].apply(lambda x: ast.literal_eval(x))
option2_raw['colors'] = option2_raw['colors'].apply(
    lambda x: [list(int(h.replace('#', '')[i:i+2], 16) for i in (0, 2, 4)) for h in x])
option2_raw['counts'] = option2_raw['counts'].apply(lambda x: ast.literal_eval(x))

my_array = []
for c, n in zip(option2_raw['colors'].to_list(), option2_raw['counts'].to_list()):
    for ci, ni in zip(c, n):
        my_array.extend(repeat(ci, ni))
my_array = np.array([[float(number) for number in group] for group in my_array])

option2_df, option2_ar = get_color_stats(ar=my_array)

print(option1_df)
print('*' * 60)
print(option2_df)

figure, axis = plt.subplots(1, 2, figsize=(10, 5))
axis[0].bar(x=option1_df['hex'], height=option1_df['count'], color=option1_df['hex'].tolist())
axis[0].set_title("Option 1: All Pixels in one Space")
axis[1].bar(x=option1_df['hex'], height=option2_df['count'], color=option1_df['hex'].tolist())
axis[1].set_title("Option 2: Weighted Dominant Colors from Each Painting")
plt.show()
```
Rank | RGB | Count | Hex | R | G | B |
---|---|---|---|---|---|---|
1 | [44, 33, 15] | 212322 | #2c210f | 44 | 33 | 15 |
2 | [83, 65, 25] | 183604 | #534119 | 83 | 65 | 25 |
3 | [175, 177, 152] | 163510 | #afb198 | 175 | 177 | 152 |
4 | [116, 114, 83] | 160565 | #747253 | 116 | 114 | 83 |
Our results
Option 1 considers all pixels from multiple artworks in the color clustering algorithm.
Option 2 considers only the dominant colors from each painting, and weights them by the number of original pixels in each cluster.
Interesting…
The cluster centers barely changed, but the associated counts did, resulting in different dominant clusters. The explanation is that in option 1, pixels near the boundaries between clusters were absorbed into different clusters than in option 2. At first glance this looks like mere randomness, which would suggest that option 2 is the better choice. However, option 2 is ultimately a summary of a summary; in statistical terms, it throws away degrees of freedom before the final clustering step. For that reason, option 1 is superior.
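To back up the claim that the cluster centers barely moved, we can measure it directly. A minimal sketch, assuming option1_df and option2_df from the comparison above are still in scope, that pairs each option 1 center with its nearest option 2 center:

```python
from scipy.spatial.distance import cdist

# RGB cluster centers produced by each option
centers1 = option1_df[['r', 'g', 'b']].to_numpy(dtype=float)
centers2 = option2_df[['r', 'g', 'b']].to_numpy(dtype=float)

# Euclidean distance from every option 1 center to every option 2 center
distances = cdist(centers1, centers2)

# Distance to the closest counterpart; small values mean the centers barely changed
print(distances.min(axis=1).round(1))
```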