An Exploration of Color Recognition through Clustering
Analyzing Artworks
Accomplishing this in a statistically meaningful way requires a substantial volume of art. Fortunately, several museums now maintain public APIs that allow us to programmatically access not just the artwork itself, but also metadata about each piece. One such museum is the Art Institute of Chicago, whose collection contains over 120,000 pieces. The Institute maintains a feature-rich API that supports both metadata and image acquisition. Let’s take a tour of these APIs and see if we can fuse them into our overarching color science process.
```python
import ast
import binascii
from itertools import repeat
import json
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
import requests
import scipy.cluster
import scipy.stats as ss
import sqlite3
from sqlalchemy import create_engine
```
Let’s write a small function to query the Art Institute of Chicago’s Artworks API, then test it with an arbitrary artwork ID.
```python
def get_art_attributes(_id):
    url = (
        f"https://api.artic.edu/api/v1/artworks/{_id}?"
        f"fields=id,image_id,date_end,place_of_origin,artwork_type_title"
    )
    r = requests.get(url)
    data = json.loads(r.text)['data']
    return data

get_art_attributes(897)
```
```
{'id': 897,
 'date_end': 1852,
 'place_of_origin': 'France',
 'artwork_type_title': 'Painting',
 'image_id': '5ae91cbf-66c5-cf9b-f355-629e458cb063'}
```

Success! We now have several useful attributes. Notice the image_id, which we can use to retrieve the artwork itself. Let’s write a small function to do just that, then test it with an image_id fetched the same way.
```python
def get_art_image(image_id):
    im = None
    url = (
        f"https://www.artic.edu/iiif/2/"
        f"{image_id}"
        f"/full/843,/0/default.jpg"
    )
    try:
        im = Image.open(requests.get(url, stream=True).raw)
    except:
        pass
    # Return None when the image could not be fetched
    return im.resize((300, 300)) if im else None

get_art_image(get_art_attributes(893)['image_id'])
```
(The retrieved painting is displayed, resized to 300 × 300 pixels.)

Excellent! Now we have three very important data points: time, place, and the painting itself. Let’s run this through the color clustering algorithm that we developed a couple of weeks ago.
```python
def get_color_stats(im=None, ar=None):
    # Accept either a PIL image or a pre-flattened pixel array
    if im and not ar:
        ar = np.asarray(im)
        shape = ar.shape
        ar = ar.reshape(np.prod(shape[:2]), shape[2]).astype(float)

    # k-means with four clusters, then assign every pixel to its nearest center
    codes, dist = scipy.cluster.vq.kmeans(ar, 4)
    vectors, distance = scipy.cluster.vq.vq(ar, codes)
    counts, bins = np.histogram(vectors, len(codes))

    # Rank the cluster centers by pixel count (1 = most common)
    colors = dict(zip(ss.rankdata(-counts), codes.tolist()))
    colors = {int(k): {'rgb': v} for k, v in colors.items()}
    for i, v in enumerate(colors):
        colors[v]['count'] = counts[i]
    for v in colors.values():
        v['rgb'] = [round(n) for n in v['rgb']]
        v['hex'] = f"#{binascii.hexlify(bytearray(int(c) for c in v['rgb'])).decode('ascii')}"
        v['r'] = v['rgb'][0]
        v['g'] = v['rgb'][1]
        v['b'] = v['rgb'][2]

    df = pd.DataFrame.from_dict(
        data=colors, orient='index').reset_index().rename(
        columns={'index': 'rank'}).sort_values(by='rank')
    return df, ar

df, ar = get_color_stats(get_art_image(get_art_attributes(893)['image_id']))
df
```
Rank | RGB | Count | Hex | r | g | b |
---|---|---|---|---|---|---|
1 | [32, 23, 16] | 36405 | #201710 | 32 | 23 | 16 |
2 | [200, 192, 137] | 24989 | #c8c089 | 200 | 192 | 137 |
3 | [140, 133, 79] | 17134 | #8c854f | 140 | 133 | 79 |
4 | [75, 66, 27] | 11472 | #4b421b | 75 | 66 | 27 |
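As a quick aside on the Hex column: the binascii-based encoding inside get_color_stats is equivalent to plain f-string hex formatting. A minimal sanity check, using the top-ranked color from the table above:

```python
import binascii

# Top-ranked cluster center from the table above
rgb = [32, 23, 16]

# The encoding used inside get_color_stats
hex_binascii = f"#{binascii.hexlify(bytearray(int(c) for c in rgb)).decode('ascii')}"

# Equivalent f-string formatting: two lowercase hex digits per channel
hex_fstring = "#" + "".join(f"{c:02x}" for c in rgb)

assert hex_binascii == hex_fstring == "#201710"
```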
```python
def plot_rgb(ar, s=0.1):
    # Split the N x 3 pixel array into R, G, and B coordinate lists
    X = np.hsplit(ar, np.array([1, 2]))[0].flatten().tolist()
    Y = np.hsplit(ar, np.array([1, 2]))[1].flatten().tolist()
    Z = np.hsplit(ar, np.array([1, 2]))[2].flatten().tolist()
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(X, Y, Z, s=s, c=ar / 255.0)
    plt.show()

plot_rgb(ar)
```

(The painting’s pixels plotted as a 3D scatter in RGB space.)

Huzzah! We’ve now combined the API-based data retrieval process with our color-extraction algorithm. But how do we execute this at scale? With the never-ending usefulness of SQLite. Let’s write a few small functions that will:

1. Create an in-memory SQLite DB
2. Load a pandas DataFrame into the SQLite DB
3. Unify our previous functions into a workflow
```python
def create_mem_sqlite():
    engine = create_engine('sqlite://', echo=False)
    return engine

def load_to_sqlite(df, name, engine):
    df.to_sql(name, con=engine)

def workflow(_id):
    data = {}
    try:
        data = get_art_attributes(_id)
        if 'image_id' in data:
            im = get_art_image(data['image_id'])
            if im:
                df, ar = get_color_stats(im)
                data['colors'] = str(df['hex'].to_list())
                data['counts'] = str(df['count'].to_list())
    except KeyError:
        pass
    return data
```

Now let’s use our new functions to drive the process and to view the results in our SQLite DB.
```python
engine = create_mem_sqlite()

paintings = {}
for _id in range(890, 900):
    data = workflow(_id)
    if data:
        paintings[_id] = data

# Drop empty results, load the rest into SQLite, and peek at the table
paintings = {k: v for k, v in paintings.items() if v}
df = pd.DataFrame.from_dict(paintings, orient='index')
load_to_sqlite(df, 'paintings', engine)
pd.read_sql_query("SELECT * FROM paintings", engine).head(5)
```
Index | ID | Date End | Place of Origin | Artwork Type Title | Image ID | Colors | Counts |
---|---|---|---|---|---|---|---|
0 | 890 | 1856 | France | Painting | e0d8a305-15b0-bdcd-1e83-06d8594a2f7e | [‘#30240f’, ‘#b2baa2’, ‘#82896d’, ‘#5f4b1d’] | [34565, 21714, 17089, 16632] |
1 | 891 | 1865 | France | Painting | f4d85da1-5c80-3c7b-38cc-bf324d6ce670 | [‘#2b2316’, ‘#7a7451’, ‘#5b5b42’, ‘#3f3623’] | [30047, 24408, 17908, 17637] |
2 | 893 | 1855 | France | Painting | 3527f037-a9b2-9253-1a92-dcd281b54340 | [‘#201710’, ‘#c8bf88’, ‘#8c854f’, ‘#4b421b’] | [36420, 25082, 17065, 11433] |
3 | 894 | 1865 | France | Painting | f5731565-80bd-6d4c-8790-d0c252d92bd4 | [‘#493d19’, ‘#b7c2bd’, ‘#6d612d’, ‘#909f9a’] | [35077, 23334, 20413, 11176] |
4 | 895 | 1885 | Germany | Painting | fa96ef54-c3b1-8f4d-390a-219f7bc64c4a | [‘#5e6652’, ‘#878566’, ‘#aeab8d’, ‘#3b332b’] | [39549, 23843, 16540, 10068] |
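Since the results now live in SQLite, we can also slice them with plain SQL. A minimal sketch, reusing the engine from above (and assuming the stored column is named place_of_origin, matching the API field), that counts paintings per place of origin:

```python
# Count how many retrieved paintings come from each place of origin
pd.read_sql_query(
    """
    SELECT place_of_origin, COUNT(*) AS n_paintings
    FROM paintings
    GROUP BY place_of_origin
    ORDER BY n_paintings DESC
    """,
    engine,
)
```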
```python
ar = np.empty(shape=[1, 3])
for image_id in [i for i in pd.read_sql_query("SELECT image_id FROM paintings", engine)['image_id'].to_list() if i]:
    im = get_art_image(image_id)
    im = im.resize((300, 300))
    _ar = np.asarray(im)
    shape = _ar.shape
    _ar = _ar.reshape(np.prod(shape[:2]), shape[2]).astype(float)
    ar = np.append(ar, _ar, axis=0)

plot_rgb(ar)
```

(Pixels from every retrieved painting plotted together in RGB space.)
Take a close look at this plot.
What strikes you about it? It’s crammed to the gills with data points.
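To put a rough number on that: each 300 × 300 image contributes 90,000 pixel rows to ar, so even a handful of paintings piles up hundreds of thousands of points. A quick check:

```python
# Each 300x300 image adds 300 * 300 = 90,000 rows of (r, g, b)
print(ar.shape)
print(f"~{ar.shape[0] // (300 * 300)} images worth of pixels")
```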
Remember that our ultimate goal is to analyze dominant colors across multiple artworks. So it seems we have two options:
- Add all pixels from all paintings into one 3D space, then find dominant color clusters.
- For each painting, extract the dominant cluster centers and their pixel counts. Add these centers to the 3D space, weight each by its count (see the sketch after this list), and find dominant color clusters within that space.
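The weighting step in option 2 is worth a small illustration. scipy’s kmeans has no per-sample weight argument, so a cluster center counted N times is simply repeated N times before re-clustering. A toy sketch with made-up centers and counts:

```python
from itertools import repeat

import numpy as np

# Hypothetical dominant colors from one painting: cluster centers and pixel counts
centers = [[200, 30, 30], [20, 20, 180]]
counts = [3, 2]

weighted = []
for center, count in zip(centers, counts):
    # Repeat each center 'count' times so k-means effectively sees it with that weight
    weighted.extend(repeat(center, count))

weighted = np.array(weighted, dtype=float)
print(weighted.shape)  # (5, 3): three copies of the first center, two of the second
```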
Let’s compare the results of these two methods and talk through the pros and cons of each.
```python
# Option 1: cluster every pixel from every painting at once
option1_df, option1_ar = get_color_stats(ar=ar)

# Option 2: rebuild a weighted pixel set from each painting's stored dominant colors
option2_raw = pd.read_sql_query("SELECT * FROM paintings", engine)
option2_raw['colors'] = option2_raw['colors'].apply(lambda x: ast.literal_eval(x))
option2_raw['colors'] = option2_raw['colors'].apply(
    lambda x: [list(int(h.replace('#', '')[i:i+2], 16) for i in (0, 2, 4)) for h in x])
option2_raw['counts'] = option2_raw['counts'].apply(lambda x: ast.literal_eval(x))

my_array = []
for c, n in zip(option2_raw['colors'].to_list(), option2_raw['counts'].to_list()):
    for ci, ni in zip(c, n):
        my_array.extend(repeat(ci, ni))
my_array = np.array([[float(number) for number in group] for group in my_array])

option2_df, option2_ar = get_color_stats(ar=my_array)

print(option1_df)
print('*' * 60)
print(option2_df)

figure, axis = plt.subplots(1, 2, figsize=(10, 5))
axis[0].bar(x=option1_df['hex'], height=option1_df['count'], color=option1_df['hex'].tolist())
axis[0].set_title("Option 1: All Pixels in one Space")
axis[1].bar(x=option1_df['hex'], height=option2_df['count'], color=option1_df['hex'].tolist())
axis[1].set_title("Option 2: Weighted Dominant Colors from Each Painting")
plt.show()
```
Rank | RGB | Count | Hex | R | G | B |
---|---|---|---|---|---|---|
1 | [44, 33, 15] | 212322 | #2c210f | 44 | 33 | 15 |
2 | [83, 65, 25] | 183604 | #534119 | 83 | 65 | 25 |
3 | [175, 177, 152] | 163510 | #afb198 | 175 | 177 | 152 |
4 | [116, 114, 83] | 160565 | #747253 | 116 | 114 | 83 |
Our results
Option 1 considers all pixels from multiple artworks in the color clustering algorithm.
Option 2 considers only the dominant colors from each painting, and weights them by the number of original pixels in each cluster.
Interesting…
The cluster centers barely changed, but the associated counts did, resulting in different dominant clusters. The explanation is that in option 1, pixels near the boundaries between clusters were absorbed into different clusters than in option 2. At first glance this looks like mere randomness, which would suggest that option 2 is the better choice. However, option 2 is ultimately a summary of a summary; in statistical terms, it throws away degrees of freedom before the final clustering step. For that reason, option 1 is superior.
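To back up the claim that the cluster centers barely moved, we can measure it directly. A minimal sketch, assuming option1_df and option2_df from the comparison above are still in scope, that pairs each option 1 center with its nearest option 2 center:

```python
from scipy.spatial.distance import cdist

# RGB cluster centers produced by each option
centers1 = option1_df[['r', 'g', 'b']].to_numpy(dtype=float)
centers2 = option2_df[['r', 'g', 'b']].to_numpy(dtype=float)

# Euclidean distance from every option 1 center to every option 2 center
distances = cdist(centers1, centers2)

# Distance to the closest counterpart; small values mean the centers barely changed
print(distances.min(axis=1).round(1))
```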