April 6, 2023 In Data Science, News By Analytics Avengers

The New AI

Art & Intelligence¶

The New AI: An Exploration of Color Recognition through Clustering¶

`Tylor Mondloch, Nigel Joseph, Stephen Zhu`¶

`9 February 2023`¶

Analyzing Artworks¶

Accomplishing this in a statistically meaningful way requires a substantial volume of art.
Fortunately, in today’s day and age, several museums maintain APIs.
These allow us to programmatically access not just the artwork itself, but metadata about the piece.
One such museum is the Art Institute of Chicago, whose collection contains over 120,000 pieces.
The Institute maintains a feature-rich API, allowing for both metadata and image acquisition.
Let’s take a tour of these APIs, and see if we can fuse them into our overarching color science process.

In [1]:

import ast
import binascii
from itertools import repeat
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
import requests
import scipy.cluster
import scipy.stats as ss
import sqlite3
from sqlalchemy import create_engine

First¶

Let’s write a small function to query the Art Institute of Chicago’s Artworks API, and test with a random integer.

In [2]:

def get_art_attributes(_id):
    url = (
        f"https://api.artic.edu/api/v1/artworks/{_id}?"
        f"fields=id,image_id,date_end,place_of_origin,artwork_type_title"
    )
    r = requests.get(url)
    data = json.loads(r.text)['data']
    return data

get_art_attributes(897)

Out[2]:

{'id': 897,
 'date_end': 1852,
 'place_of_origin': 'France',
 'artwork_type_title': 'Painting',
 'image_id': '5ae91cbf-66c5-cf9b-f355-629e458cb063'}

Success!¶

We now have several useful attributes.
Notice the image_id, which we can use to retrieve the artwork itself.
Let’s write a small function to do just that, then test it with the image_id value we got previously.

In [3]:

def get_art_image(image_id):
    im = None
    url = (
        f"https://www.artic.edu/iiif/2/"
        f"{image_id}"
        f"/full/843,/0/default.jpg"
    )
    try:
        im = Image.open(requests.get(url, stream=True).raw)
    except:
        pass
    return im.resize((300, 300))

get_art_image(get_art_attributes(893)['image_id'])

Out[3]:

Excellent!¶

Now we have three very important data points: time, place, and the painting itself.
Let’s run this through our color clustering algorithm that we developed a couple weeks ago.

In [4]:

def get_color_stats(im=None, ar=None):

    if im and not ar:
        ar = np.asarray(im)
        shape = ar.shape
        ar = ar.reshape(np.prod(shape[:2]), shape[2]).astype(float)

    codes, dist = scipy.cluster.vq.kmeans(ar, 4)
    vectors, distance = scipy.cluster.vq.vq(ar, codes)
    counts, bins = np.histogram(vectors, len(codes))
    colors = dict(zip(ss.rankdata(-counts), codes.tolist()))
    colors = {int(k): {'rgb': v} for k, v in colors.items()}

    for i, v in enumerate(colors):
        colors[v]['count'] = counts[i]

    for v in colors.values():
        v['rgb'] = [round(n) for n in v['rgb']]
        v['hex'] = f"#{binascii.hexlify(bytearray(int(c) for c in v['rgb'])).decode('ascii')}"
        v['r'] = v['rgb'][0]
        v['g'] = v['rgb'][1]
        v['b'] = v['rgb'][2]

    df = pd.DataFrame.from_dict(
        data=colors, orient='index').reset_index().rename(
        columns={'index': 'rank'}).sort_values(by='rank')

    return df, ar

df, ar = get_color_stats(get_art_image(get_art_attributes(893)['image_id']))
df

Out[4]:

	rank	rgb	count	hex	r	g	b
1	1	[32, 23, 16]	36387	#201710	32	23	16
0	2	[200, 191, 136]	25097	#c8bf88	200	191	136
3	3	[140, 133, 79]	17072	#8c854f	140	133	79
2	4	[75, 66, 27]	11444	#4b421b	75	66	27

We now have…¶

…the four most dominant colors and their counts.
Let’s visualize the RGB pixels from the painting in 3D.

In [5]:

def plot_rgb(ar,s=0.1):
    X = np.hsplit(ar, np.array([1, 2]))[0].flatten().tolist()
    Y = np.hsplit(ar, np.array([1, 2]))[1].flatten().tolist()
    Z = np.hsplit(ar, np.array([1, 2]))[2].flatten().tolist()

    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111, projection = '3d')
    ax.scatter(X, Y, Z, s=s, c=ar / 255.0)
    plt.show()

plot_rgb(ar)

Huzzah!¶

We’ve now combined the API-based data retrieval process with our color extracting algorithm.
But how do we execute this at scale?
With the never-ending usefulness of SQLite.
Let’s write a few small functions that will:

Create an in-memory sqlite dB

Load a pandas dataframe into the sqlite dB

Unify our previous functions into a workflow

In [6]:

def create_mem_sqlite():
    engine = create_engine('sqlite://', echo=False)
    return engine

def load_to_sqlite(df, name, engine):
    df.to_sql(name, con=engine)

def workflow(_id):
    data = {}
    try:
        data = get_art_attributes(_id)
        if 'image_id' in data:
            im = get_art_image(data['image_id'])
            if im:
                df, ar = get_color_stats(im)
                data['colors'] = str(df['hex'].to_list())
                data['counts'] = str(df['count'].to_list())
                
    except KeyError:
        pass

    return data

Now let’s use our new functions…¶

to drive the process, and to view the results in our SQLite dB

In [7]:

engine = create_mem_sqlite()

paintings = {}

for _id in range(890, 900):
    data = workflow(_id)
    if data:
        paintings[_id] = data

paintings = {k: v for k, v in paintings.items() if v}

df = pd.DataFrame.from_dict(paintings, orient='index')

load_to_sqlite(df, 'paintings', engine)

pd.read_sql_query("SELECT * FROM paintings", engine).head(5)

Out[7]:

	index	id	date_end	place_of_origin	artwork_type_title	image_id	colors	counts
0	890	890	1856	France	Painting	e0d8a305-15b0-bdcd-1e83-06d8594a2f7e	[‘#31250f’, ‘#b2baa2’, ‘#81896e’, ‘#614d1e’]	[35462, 21664, 16916, 15958]
1	891	891	1865	France	Painting	f4d85da1-5c80-3c7b-38cc-bf324d6ce670	[‘#2b2316’, ‘#7b7551’, ‘#5c5b43’, ‘#3f3622’]	[29658, 23832, 18438, 18072]
2	893	893	1855	France	Painting	3527f037-a9b2-9253-1a92-dcd281b54340	[‘#211710’, ‘#c8c089’, ‘#8d854f’, ‘#4c431c’]	[36579, 24979, 17068, 11374]
3	894	894	1865	France	Painting	f5731565-80bd-6d4c-8790-d0c252d92bd4	[‘#493d19’, ‘#b6c2bd’, ‘#6c602c’, ‘#8f9e98’]	[34497, 23738, 20884, 10881]
4	895	895	1885	Germany	Painting	fa96ef54-c3b1-8f4d-390a-219f7bc64c4a	[‘#5e6652’, ‘#878566’, ‘#aeab8d’, ‘#3b332b’]	[39559, 23829, 16540, 10072]

Lastly…¶

Let’s create a unified 3D plot for all the pixels in all of our paintings.

In [8]:

ar = np.empty(shape=[1, 3])

for image_id in [i for i in pd.read_sql_query("SELECT image_id FROM paintings", engine)['image_id'].to_list() if i]:
    im = get_art_image(image_id)
    im = im.resize((300, 300))
    _ar = np.asarray(im)
    shape = _ar.shape
    _ar = _ar.reshape(np.prod(shape[:2]), shape[2]).astype(float)
    ar = np.append(ar, _ar , axis=0)

plot_rgb(ar)

Take a close look at this plot¶

What strikes you about it?

It’s crammed to the gills with data points.

Remember that our ultimate goal is to analyze dominant colors across multiple artworks.
So it seems we have two options:

Add all pixels from all paintings into one 3D space, then find dominant color clusters.

For each painting, extract dominant pixel cluster center and count. Add these to 3D space, weigh by count, and find dominant color clusters within that space.

Let’s compare the results of these two methods and talk through the pros and cons of each.¶

In [ ]:

option1_df, option1_ar = get_color_stats(ar=ar)

option2_raw = pd.read_sql_query("SELECT * FROM paintings", engine)
option2_raw['colors'] = option2_raw['colors'].apply(lambda x: ast.literal_eval(x))
option2_raw['colors'] = option2_raw['colors'].apply(lambda x: [list(int(h.replace('#', '')[i:i+2], 16) for i in (0, 2, 4)) for h in x])
option2_raw['counts'] = option2_raw['counts'].apply(lambda x: ast.literal_eval(x))

my_array = []
for c, n in zip(option2_raw['colors'].to_list(), option2_raw['counts'].to_list()):
    for ci, ni in zip(c, n):
        my_array.extend(repeat(ci, ni))

my_array = np.array([[float(number) for number in group] for group in my_array])

option2_df, option2_ar = get_color_stats(ar=my_array)

print(option1_df)
print('*'*60)
print(option2_df)

In [10]:

figure, axis = plt.subplots(1, 2, figsize=(10,5))

axis[0].bar(
    x=option1_df['hex'],
    height=option1_df['count'],
    color=option1_df['hex'].tolist())
axis[0].set_title("Option 1: All Pixels in one Space")

axis[1].bar(
    x=option1_df['hex'],
    height=option2_df['count'],
    color=option2_df['hex'].tolist())
axis[1].set_title("Option 2: Weighted Dominant Colors from Each Painting")

plt.show()

   rank              rgb   count      hex    r    g    b
3     1     [45, 33, 15]  212886  #2d210f   45   33   15
1     2     [83, 65, 25]  183114  #534119   83   65   25
0     3  [175, 177, 152]  163586  #afb198  175  177  152
2     4   [116, 114, 83]  160415  #747253  116  114   83
************************************************************
   rank              rgb   count      hex    r    g    b
0     1   [116, 115, 84]  199462  #747354  116  115   84
2     2     [45, 33, 14]  197276  #2d210e   45   33   14
3     3     [79, 62, 25]  180488  #4f3e19   79   62   25
1     4  [179, 181, 155]  142774  #b3b59b  179  181  155

Our results¶

Option 1 considers all pixels from multiple artworks in the color clustering algorithm.
Option 2 considers only the dominant colors from each painting, and weighs them by the number of original pixels in each cluster.

Interesting…

The cluster centers barely changed, but the associated counts did, resulting in different dominant clusters.
The explanation for these results is that in option 1, pixels near the borders of clusters made their way into different clusters.
At first glance, this seems to be the result of randomness, suggesting that option 2 is superior.
However, option 2 is ultimately a summary of a summary – in statistical terms, more degrees of freedom – and therefore, option 1 is superior.

Tylor Mondloch¶

Tylor is a data scientist at a Big 4 Consulting Firm.
His day-to-day includes building statistical models for cybersecurity contracts.
He was born in South Dakota and now resides in Billings, Montana.

Stephen Zhu¶
Stephen is a data scientist working with a hydro electricity company.
In his spare time, he loves rock climbing and modeling the financial market.
He was born in Hangzhou, China, and now resides in Vancouver, Canada.

Nigel Joseph¶
Nigel is an analytics manager at a multinational pharmaceutical company.
He currently provides business insights and forecasting expertise for newly launched products.
In his spare time, Nigel likes to play board games and golf (poorly).

tags:¶

georgia tech
programming
computer vision
color science
machine learning
data science
K-Means
Clustering
Art Institute of Chicago

Tags:

art institute of chicago clustering color science computer vision data science georgia tech k-means machine learning programming

Art Scene Disrupting the Market Through Inclusion

The New AI

The New AI

Art & Intelligence¶

The New AI: An Exploration of Color Recognition through Clustering¶

`Tylor Mondloch, Nigel Joseph, Stephen Zhu`¶

`9 February 2023`¶

Analyzing Artworks¶

First¶

Success!¶

Excellent!¶

We now have…¶

Huzzah!¶

Now let’s use our new functions…¶

Lastly…¶

Take a close look at this plot¶

Let’s compare the results of these two methods and talk through the pros and cons of each.¶

Our results¶

Tylor Mondloch¶

Stephen Zhu¶

Nigel Joseph¶

tags:¶

Related

Tags:

Subscribe

Art Scene Disrupting the Market Through Inclusion

The New AI

The New AI

Art & Intelligence¶

The New AI: An Exploration of Color Recognition through Clustering¶

Tylor Mondloch, Nigel Joseph, Stephen Zhu¶

9 February 2023¶

Analyzing Artworks¶

First¶

Success!¶

Excellent!¶

We now have…¶

Huzzah!¶

Now let’s use our new functions…¶

Lastly…¶

Take a close look at this plot¶

Let’s compare the results of these two methods and talk through the pros and cons of each.¶

Our results¶

Tylor Mondloch¶

Stephen Zhu¶

Nigel Joseph¶

tags:¶

Related

Tags:

`Tylor Mondloch, Nigel Joseph, Stephen Zhu`¶

`9 February 2023`¶