
An Exploration of Color Recognition through Clustering

Analyzing Artworks

 

Analyzing color across artworks in a statistically meaningful way requires a substantial volume of art. Fortunately, several museums now maintain public APIs that let us programmatically access not just the artwork itself, but metadata about each piece. One such museum is the Art Institute of Chicago, whose collection contains over 120,000 pieces. The Institute maintains a feature-rich API that allows for both metadata and image acquisition. Let's take a tour of these APIs and see if we can fuse them into our overarching color science process.
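As an aside, the same API also exposes a paginated listing of the whole collection, which would come in handy if we ever want more than a handful of hand-picked IDs. Here is a minimal sketch, assuming the documented page, limit, and fields query parameters; the helper name list_artwork_ids is ours, not part of the API.

import requests

def list_artwork_ids(page=1, limit=100):
    # Hypothetical helper: pull one page of artwork IDs from the paginated
    # /artworks listing endpoint (page/limit/fields are documented parameters).
    url = "https://api.artic.edu/api/v1/artworks"
    params = {"page": page, "limit": limit, "fields": "id"}
    r = requests.get(url, params=params)
    r.raise_for_status()
    return [item["id"] for item in r.json()["data"]]

# e.g. list_artwork_ids(page=1, limit=10) returns the first ten artwork IDs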

In [1]:
import ast
import binascii
from itertools import repeat
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
import requests
import scipy.cluster
import scipy.stats as ss
import sqlite3
from sqlalchemy import create_engine

Let's write a small function to query the Art Institute of Chicago's Artworks API, then test it with an arbitrary artwork ID.

In [2]:

def get_art_attributes(_id):
    # Request a limited set of fields for a single artwork by its numeric ID.
    url = (
        f"https://api.artic.edu/api/v1/artworks/{_id}?"
        f"fields=id,image_id,date_end,place_of_origin,artwork_type_title"
    )
    r = requests.get(url)
    data = json.loads(r.text)['data']
    return data

get_art_attributes(897)
Out[2]:
{'id': 897,
'date_end': 1852,
'place_of_origin': 'France',
'artwork_type_title': 'Painting',
'image_id': '5ae91cbf-66c5-cf9b-f355-629e458cb063'}
Success! We now have several useful attributes. Notice the image_id, which we can use to retrieve the artwork itself. Let's write a small function to do just that, then test it with the image_id value we got previously.

In [3]:

def get_art_image(image_id):
    # Fetch the image from the IIIF endpoint and return a 300x300 thumbnail,
    # or None if the download fails.
    im = None
    url = (
        f"https://www.artic.edu/iiif/2/"
        f"{image_id}"
        f"/full/843,/0/default.jpg"
    )
    try:
        im = Image.open(requests.get(url, stream=True).raw)
    except Exception:
        pass
    return im.resize((300, 300)) if im is not None else None

get_art_image(get_art_attributes(893)['image_id'])

Out[3]: [the retrieved painting, resized to 300x300]

Excellent! Now we have three very important data points: time, place, and the painting itself. Let's run this through the color clustering algorithm that we developed a couple of weeks ago.

In [4]:
def get_color_stats(im=None, ar=None):
    # Accept either a PIL image or an already-flattened Nx3 pixel array.
    if im and not ar:
        ar = np.asarray(im)
        shape = ar.shape
        ar = ar.reshape(np.prod(shape[:2]), shape[2]).astype(float)
    # Cluster the pixels into 4 colors with k-means, then rank clusters by pixel count.
    codes, dist = scipy.cluster.vq.kmeans(ar, 4)
    vectors, distance = scipy.cluster.vq.vq(ar, codes)
    counts, bins = np.histogram(vectors, len(codes))
    colors = dict(zip(ss.rankdata(-counts), codes.tolist()))
    colors = {int(k): {'rgb': v} for k, v in colors.items()}
    for i, v in enumerate(colors):
        colors[v]['count'] = counts[i]
    # Round each cluster center to integer RGB and derive its hex code.
    for v in colors.values():
        v['rgb'] = [round(n) for n in v['rgb']]
        v['hex'] = f"#{binascii.hexlify(bytearray(int(c) for c in v['rgb'])).decode('ascii')}"
        v['r'] = v['rgb'][0]
        v['g'] = v['rgb'][1]
        v['b'] = v['rgb'][2]
    df = pd.DataFrame.from_dict(
        data=colors, orient='index').reset_index().rename(
        columns={'index': 'rank'}).sort_values(by='rank')
    return df, ar

df, ar = get_color_stats(get_art_image(get_art_attributes(893)['image_id']))
df
Out[4]:
Rank RGB Count Hex r g b
1 [32, 23, 16] 36405 #201710 32 23 16
2 [200, 192, 137] 24989 #c8c089 200 192 137
3 [140, 133, 79] 17134 #8c854f 140 133 79
4 [75, 66, 27] 11472 #4b421b 75 66 27

We now have…the four most dominant colors and their counts. Let’s visualize the RGB pixels from the painting in 3D.
In [5]:
def plot_rgb(ar, s=0.1):
    # Split the Nx3 pixel array into R, G, and B coordinate lists,
    # then scatter each pixel in 3D, colored by its own RGB value.
    X = np.hsplit(ar, np.array([1, 2]))[0].flatten().tolist()
    Y = np.hsplit(ar, np.array([1, 2]))[1].flatten().tolist()
    Z = np.hsplit(ar, np.array([1, 2]))[2].flatten().tolist()
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(X, Y, Z, s=s, c=ar / 255.0)
    plt.show()

plot_rgb(ar)
[color plot: 3D scatter of the painting's RGB pixels]

Huzzah! We've now combined the API-based data retrieval process with our color extraction algorithm. But how do we execute this at scale? With the never-ending usefulness of SQLite. Let's write a few small functions that will:

  1. Create an in-memory SQLite DB
  2. Load a pandas DataFrame into the SQLite DB
  3. Unify our previous functions into a workflow
In [6]:
def create_mem_sqlite():
    # Create an in-memory SQLite database via SQLAlchemy.
    engine = create_engine('sqlite://', echo=False)
    return engine

def load_to_sqlite(df, name, engine):
    # Write a DataFrame to a table in the SQLite database.
    df.to_sql(name, con=engine)

def workflow(_id):
    # Fetch metadata and image for one artwork, then attach its dominant
    # colors and pixel counts (stored as strings for easy SQLite storage).
    data = {}
    try:
        data = get_art_attributes(_id)
        if 'image_id' in data:
            im = get_art_image(data['image_id'])
            if im:
                df, ar = get_color_stats(im)
                data['colors'] = str(df['hex'].to_list())
                data['counts'] = str(df['count'].to_list())
    except KeyError:
        pass
    return data
Now let's use our new functions to drive the process and to view the results in our SQLite DB.
In [7]:
engine = create_mem_sqlite()
paintings = {}
# Run the workflow over a small range of artwork IDs.
for _id in range(890, 900):
    data = workflow(_id)
    if data:
        paintings[_id] = data
paintings = {k: v for k, v in paintings.items() if v}
df = pd.DataFrame.from_dict(paintings, orient='index')
load_to_sqlite(df, 'paintings', engine)
pd.read_sql_query("SELECT * FROM paintings", engine).head(5)
Out[7]:
   id   date_end  place_of_origin  artwork_type_title  image_id                              colors                                        counts
0  890  1856      France           Painting            e0d8a305-15b0-bdcd-1e83-06d8594a2f7e  ['#30240f', '#b2baa2', '#82896d', '#5f4b1d']  [34565, 21714, 17089, 16632]
1  891  1865      France           Painting            f4d85da1-5c80-3c7b-38cc-bf324d6ce670  ['#2b2316', '#7a7451', '#5b5b42', '#3f3623']  [30047, 24408, 17908, 17637]
2  893  1855      France           Painting            3527f037-a9b2-9253-1a92-dcd281b54340  ['#201710', '#c8bf88', '#8c854f', '#4b421b']  [36420, 25082, 17065, 11433]
3  894  1865      France           Painting            f5731565-80bd-6d4c-8790-d0c252d92bd4  ['#493d19', '#b7c2bd', '#6d612d', '#909f9a']  [35077, 23334, 20413, 11176]
4  895  1885      Germany          Painting            fa96ef54-c3b1-8f4d-390a-219f7bc64c4a  ['#5e6652', '#878566', '#aeab8d', '#3b332b']  [39549, 23843, 16540, 10068]
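Since everything now lives in a queryable table, we can also slice the results by time and place directly in SQL. Here is a quick sketch (these queries are illustrative and not part of the original run), using the same engine and paintings table from above:

# Illustrative follow-up queries against the in-memory 'paintings' table.
french = pd.read_sql_query(
    "SELECT id, date_end, colors FROM paintings WHERE place_of_origin = 'France'",
    engine)

# SQLite integer division lets us bucket the works by decade.
by_decade = pd.read_sql_query(
    "SELECT (date_end / 10) * 10 AS decade, COUNT(*) AS n "
    "FROM paintings GROUP BY decade ORDER BY decade",
    engine)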
Lastly… let's create a unified 3D plot for all the pixels in all of our paintings.

In [8]:
# Stack the pixels from every stored painting into one Nx3 array.
ar = np.empty(shape=[1, 3])
image_ids = [i for i in pd.read_sql_query("SELECT image_id FROM paintings", engine)['image_id'].to_list() if i]
for image_id in image_ids:
    im = get_art_image(image_id)
    im = im.resize((300, 300))
    _ar = np.asarray(im)
    shape = _ar.shape
    _ar = _ar.reshape(np.prod(shape[:2]), shape[2]).astype(float)
    ar = np.append(ar, _ar, axis=0)
plot_rgb(ar)
[color plot: 3D scatter of all pixels from all retrieved paintings]

Take a close look at this plot.

What strikes you about it? It’s crammed to the gills with data points.

Remember that our ultimate goal is to analyze dominant colors across multiple artworks. So it seems we have two options:

  1. Add all pixels from all paintings into one 3D space, then find dominant color clusters.
  2. For each painting, extract the dominant cluster centers and their counts. Add these to 3D space, weight them by count, and find dominant color clusters within that space.

Let’s compare the results of these two methods and talk through the pros and cons of each.

In [9]:
# Option 1: cluster every raw pixel from every painting at once.
option1_df, option1_ar = get_color_stats(ar=ar)

# Option 2: rebuild a weighted pixel array from each painting's dominant
# colors (each color repeated by its pixel count), then cluster that.
option2_raw = pd.read_sql_query("SELECT * FROM paintings", engine)
option2_raw['colors'] = option2_raw['colors'].apply(lambda x: ast.literal_eval(x))
option2_raw['colors'] = option2_raw['colors'].apply(
    lambda x: [list(int(h.replace('#', '')[i:i+2], 16) for i in (0, 2, 4)) for h in x])
option2_raw['counts'] = option2_raw['counts'].apply(lambda x: ast.literal_eval(x))

my_array = []
for c, n in zip(option2_raw['colors'].to_list(), option2_raw['counts'].to_list()):
    for ci, ni in zip(c, n):
        my_array.extend(repeat(ci, ni))
my_array = np.array([[float(number) for number in group] for group in my_array])
option2_df, option2_ar = get_color_stats(ar=my_array)

print(option1_df)
print('*' * 60)
print(option2_df)

figure, axis = plt.subplots(1, 2, figsize=(10, 5))
axis[0].bar(
    x=option1_df['hex'],
    height=option1_df['count'],
    color=option1_df['hex'].tolist())
axis[0].set_title("Option 1: All Pixels in One Space")
axis[1].bar(
    x=option2_df['hex'],
    height=option2_df['count'],
    color=option2_df['hex'].tolist())
axis[1].set_title("Option 2: Weighted Dominant Colors from Each Painting")
plt.show()
Rank RGB Count Hex R G B
1 [44, 33, 15] 212322 #2c210f 44 33 15
2 [83, 65, 25] 183604 #534119 83 65 25
3 [175, 177, 152] 163510 #afb198 175 177 152
4 [116, 114, 83] 160565 #747253 116 114 83

Our results

Option 1 considers all pixels from multiple artworks in the color clustering algorithm.

Option 2 considers only the dominant colors from each painting, and weights them by the number of original pixels in each cluster.

Interesting…

The cluster centers barely changed, but the associated counts did, resulting in different dominant clusters. The explanation is that in option 1, pixels near the borders of clusters can end up assigned to neighboring clusters, shifting the counts. At first glance this looks like noise, suggesting that option 2 is superior. However, option 2 is ultimately a summary of a summary: each painting is first reduced to four cluster centers, and information (and degrees of freedom) is lost in that reduction. Option 1, which works from the raw pixels, is therefore superior.
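To make the boundary-pixel effect concrete, here is a small toy illustration (not from the original analysis, using synthetic 1-D "pixel" values rather than real paintings): two synthetic paintings are clustered once from their raw pixels (option 1) and once from weighted per-painting summaries (option 2). The counts assigned to each cluster can shift between the two approaches even though the cluster centers stay close.

import numpy as np
import scipy.cluster.vq as vq

rng = np.random.default_rng(0)
# Two synthetic "paintings", each a mix of a darker and a lighter blob of pixels.
paintings = [
    np.concatenate([rng.normal(40, 15, 4000), rng.normal(110, 15, 2000)]),
    np.concatenate([rng.normal(70, 15, 3000), rng.normal(140, 15, 3000)]),
]

# Option 1: pool every raw pixel value, then cluster.
pooled = np.concatenate(paintings).reshape(-1, 1)
codes1, _ = vq.kmeans(pooled, 2)
labels1, _ = vq.vq(pooled, codes1)
print("option 1 centers:", np.sort(codes1.ravel()).round(1), "counts:", np.bincount(labels1))

# Option 2: summarize each painting first (two dominant values plus counts),
# repeat each summary value by its count, then cluster the weighted summaries.
summary_values = []
for p in paintings:
    obs = p.reshape(-1, 1)
    codes, _ = vq.kmeans(obs, 2)
    labels, _ = vq.vq(obs, codes)
    for center, count in zip(codes.ravel(), np.bincount(labels)):
        summary_values.extend([center] * int(count))
summary = np.array(summary_values).reshape(-1, 1)
codes2, _ = vq.kmeans(summary, 2)
labels2, _ = vq.vq(summary, codes2)
print("option 2 centers:", np.sort(codes2.ravel()).round(1), "counts:", np.bincount(labels2))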