Extracting the dominant colors of an image using K-Means Clustering
Havana Image by Greta Chapin–McGill from Budget Collector dataset
Extracting, Transforming, and Loading Airtable Data in Python
Our team decided to use the dataset provided by Budget Collector. The dataset is hosted in Airtable, a spreadsheet-database hybrid that allows data to be organized in a structured manner. To access the data in Airtable from Python, we needed an API key to connect to the dataset. With the API key, the base ID, and the name of the table, we were able to import the data into Python and convert it into a pandas DataFrame. We then needed to access the images in order to process them and extract their dominant colors. We accessed the images through the image links obtained from the Airtable data, using the Python Imaging Library together with the HTTP requests library. Retrieving the links this way ensured that each link was still active when we accessed it: a security feature introduced in Airtable in November 2022 makes publicly accessible attachment links expire, which helps increase attachment security.
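The loading steps described above can be sketched roughly as follows. This is a minimal illustration rather than our exact code: it assumes the Airtable REST endpoint with a bearer token, and the helper names (fetch_records, records_to_frame, load_image) and the field names "Title" and "Image" are hypothetical placeholders.

```python
from io import BytesIO
from urllib.parse import quote

import pandas as pd
import requests
from PIL import Image

API_URL = "https://api.airtable.com/v0"


def fetch_records(api_key, base_id, table_name):
    """Download every record of one table, following Airtable's pagination."""
    headers = {"Authorization": f"Bearer {api_key}"}
    url = f"{API_URL}/{base_id}/{quote(table_name, safe='')}"
    records, params = [], {}
    while True:
        response = requests.get(url, headers=headers, params=params, timeout=30)
        response.raise_for_status()
        payload = response.json()
        records.extend(payload["records"])
        # Airtable returns an "offset" token while more pages remain.
        if "offset" not in payload:
            return records
        params = {"offset": payload["offset"]}


def records_to_frame(records):
    """Flatten Airtable records into a DataFrame with one row per record."""
    return pd.DataFrame([{"id": r["id"], **r["fields"]} for r in records])


def load_image(url):
    """Fetch an attachment URL and return it as an RGB PIL image."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return Image.open(BytesIO(response.content)).convert("RGB")


# Offline example with hand-written records (field names are made up).
sample = [
    {"id": "rec1", "fields": {"Title": "Havana",
                              "Image": [{"url": "https://example.com/a.jpg"}]}},
    {"id": "rec2", "fields": {"Title": "Untitled"}},
]
frame = records_to_frame(sample)
```

Because attachment links expire, a sketch like this would fetch the records and download the images in the same run instead of storing the URLs for later use.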
Using K-Means Clustering Algorithm to get the Dominant Colors
We then used the NumPy library to convert each image into an array and reshaped the array into a format that could be used to train a machine learning model on the features of the image. In our case, we wanted to run a clustering algorithm on the images to find clusters of similar colors, which we then term the dominant colors. The algorithm we chose is K-Means clustering, as implemented in the sklearn library. To run K-Means, we first needed to select the number of clusters. We chose five clusters; however, we will explore methods for finding the optimum number of clusters, e.g., the elbow method. With the number of clusters agreed on, we ran the clustering algorithm on a subset of the images and obtained the cluster centers for the five clusters. We then converted the cluster centers into integers, which represent the colors. This gave us the top 5 dominant colors. We displayed the colors on an axis using the matplotlib library and converted the array back into an image that uses only the dominant colors, i.e., the clustered image.
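As an illustration of the clustering step, here is a minimal sketch under a few assumptions: the image is already an RGB array of shape height x width x 3, the function names dominant_colors and inertia_by_k are our own placeholders, and the colors are ordered by how many pixels fall into each cluster. The second helper collects the within-cluster sum of squares (inertia) for several cluster counts, which is the quantity one would plot for the elbow method.

```python
import numpy as np
from sklearn.cluster import KMeans


def dominant_colors(image_array, n_colors=5, seed=0):
    """Return (palette, shares, clustered_image) for an H x W x 3 RGB array."""
    image_array = np.asarray(image_array)[..., :3]     # drop alpha if present
    height, width = image_array.shape[:2]
    pixels = image_array.reshape(-1, 3).astype(float)  # one row per pixel

    model = KMeans(n_clusters=n_colors, n_init=10, random_state=seed).fit(pixels)

    # Cluster centers are floats; round them to valid 0-255 color values.
    centers = np.rint(model.cluster_centers_).astype(np.uint8)
    counts = np.bincount(model.labels_, minlength=n_colors)
    order = np.argsort(counts)[::-1]                   # most common color first

    # Replace every pixel by the center of its cluster.
    clustered = centers[model.labels_].reshape(height, width, 3)
    return centers[order], counts[order] / counts.sum(), clustered


def inertia_by_k(image_array, ks=range(1, 9), seed=0):
    """Inertia for each candidate number of clusters (for an elbow plot)."""
    pixels = np.asarray(image_array)[..., :3].reshape(-1, 3).astype(float)
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(pixels).inertia_
        for k in ks
    }


# Tiny synthetic image: two thirds red, one third blue.
demo = np.zeros((6, 6, 3), dtype=np.uint8)
demo[:4] = (255, 0, 0)
demo[4:] = (0, 0, 255)
palette, shares, clustered = dominant_colors(demo, n_colors=2)
```

The palette can be shown with matplotlib, for example by passing palette[np.newaxis] to imshow, and the clustered array can be displayed the same way to compare it with the original image.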
Next Steps & Challenges
The next step in our analysis is to run the K-Means algorithm on the entire dataset to get the dominant colors for each image. We will then create columns in our dataset, one for each of the 5 dominant colors of an image. Every pixel of an image has red, green, and blue (RGB) values, each between 0 and 255. This means that there are 256^3, or 16,777,216, possible colors. After extracting the 5 dominant colors of each image, we anticipate the challenge that the exact RGB values will differ from image to image, even for colors that look alike. We have researched several ways to calculate the similarity between colors, starting with the Euclidean distance. What we found is that, although the method is simple, humans perceive color very differently from how it is represented in the RGB space. Everyone's devices may also be different, so we need to transform the RGB values into a more uniform color space, like CIELAB. We then hope to use the device-independent values to measure color differences with a metric such as Lab Delta E. Our goal remains to determine whether there is any correlation between the dominant colors of an image and the period and region in which it was painted. Additionally, we will use this information to create an interactive visualization that will allow a user to travel through time and see how the color choices used to create art have changed.
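To make the planned color comparison concrete, here is a rough sketch of one way it could work. It assumes the pixel values are sRGB with a D65 white point, and it uses the simplest Delta E variant (CIE76, the Euclidean distance in CIELAB); we have not yet settled on a variant, and a more refined formula such as CIEDE2000 may suit the project better. The function names are placeholders.

```python
import numpy as np

# Linear sRGB -> CIE XYZ matrix and reference white, both for D65.
_RGB_TO_XYZ = np.array([
    [0.4124564, 0.3575761, 0.1804375],
    [0.2126729, 0.7151522, 0.0721750],
    [0.0193339, 0.1191920, 0.9503041],
])
_WHITE_D65 = np.array([0.95047, 1.0, 1.08883])


def rgb_to_lab(rgb):
    """Convert 0-255 sRGB values (shape (..., 3)) to CIELAB."""
    rgb = np.asarray(rgb, dtype=float) / 255.0
    # Undo the sRGB gamma curve to get linear light.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Convert to XYZ and normalize by the reference white.
    xyz = linear @ _RGB_TO_XYZ.T / _WHITE_D65
    delta = 6 / 29
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4 / 29)
    lightness = 116 * f[..., 1] - 16
    a = 500 * (f[..., 0] - f[..., 1])
    b = 200 * (f[..., 1] - f[..., 2])
    return np.stack([lightness, a, b], axis=-1)


def delta_e_cie76(rgb1, rgb2):
    """CIE76 color difference: Euclidean distance between two colors in CIELAB."""
    return np.linalg.norm(rgb_to_lab(rgb1) - rgb_to_lab(rgb2), axis=-1)


# Compare the plain RGB distance with the Lab-based distance for one pair.
color_a, color_b = (200, 30, 30), (180, 60, 40)
rgb_distance = np.linalg.norm(np.subtract(color_a, color_b, dtype=float))
lab_distance = delta_e_cie76(color_a, color_b)
```

Both functions also accept arrays of colors, so a whole palette of dominant colors can be converted and compared in one call.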