Scrolly clusters

By Chau Nguyen







Hello and welcome!

Thank you for checking out my project!


I will give an overview of 2 density-based clustering algorithms: DBSCAN and HDBSCAN.


Throughout this scrollyboard, you will find demonstrations of how the algorithms work, guided mostly by self-updating visuals.


This is a scrolling page, so please scroll along!

Unlabeled data


Imagine you have a set of objects you would like to study

(or if you prefer not to use your imagination, they'll appear in the corner of your eye soon enough)


You want to put these data points into groups so that you can effectively study each group.


However, you don't know what the groups are,


how many categories you can put your objects in,


and you cannot examine each item individually to characterize them.

Unsupervised machine learning


Clustering is a subset of unsupervised ML algorithms developed to solve your very problem!


Cluster analysis can be thought of as a classification technique.


Because the labels are derived from the data itself, the process is unsupervised: there are no pre-existing outcomes (labels) to compare the clusters against.


Popular clustering algorithms include:

1) K-means: a prototype-based clustering technique that creates a one-level partitioning of the data objects.

2) Hierarchical clustering.

3) Density-based clustering, which is the focus of this tutorial.
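If you'd like to see what these families look like in code, here is a minimal sketch using scikit-learn; the toy data and parameter values are placeholders I picked for illustration, not anything used later in this post.

```python
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

# Toy 2-D data, just for illustration.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

kmeans_labels  = KMeans(n_clusters=4, random_state=0).fit_predict(X)   # prototype-based
hier_labels    = AgglomerativeClustering(n_clusters=4).fit_predict(X)  # hierarchical
density_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)        # density-based
```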

Density-based clustering



By now, the unlabeled points on the left side of the page should have finished rendering


Let's take a look at them!


What do you know about them?
Close to nothing, right?


If I gave you a box of crayons and asked you to color the clusters of points, would you be able to?
Probably!

Before discussing these synthetic clusters, let's briefly talk about one of the pioneers of density-based clustering algorithms: DBSCAN.

Introducing DBSCAN


Density-based spatial clustering of applications with noise

Identifies clusters by locating points in regions of high density that are separated from one another by regions of lower density.


How:

1) Finds core points - objects with at least a minimum number of neighbors within a certain radius.

2) Grows areas around each core point that satisfy the minimum density implied by the radius and neighbor count above.

Two main tuning parameters:

1) epsilon (eps): the radius around a point within which other points are considered to be in the same neighborhood.

2) min_samples: the number of neighbors within that radius required for a point to be considered a core point. This count includes the point itself.
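To make these two knobs concrete, here is a minimal sketch of how they map onto scikit-learn's DBSCAN; the random points and the values 0.1 and 5 are placeholders for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder 2-D points; later in this post the inputs are (x, y) coordinates.
X = np.random.RandomState(0).uniform(size=(300, 2))

# eps: neighborhood radius, in the same units as the coordinates.
# min_samples: neighbors (the point itself included) required for a core point.
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)

print(set(labels))  # the label -1 marks points DBSCAN set aside as noise
```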




Synthetic data

You have had a while to look at the synthetic data points being generated, so it's time for a


Pop quiz!!!


How many clusters do you think there are?


How many clusters did sklearn create?


To find out and get access to the rest of the article, click here to subscribe

*daily cat facts are included with the subscription. This is a feature, not a bug, so please stop emailing me about this.

.
.
.
.
.
.
anyhoo





Synthetic data

In this first example, we will examine 2123 synthetic data points I generated using sklearn's built-in make_blobs function.


Unlike with real-world datasets, where the clusters are truly unlabeled (you will interact with one of those as you scroll down),


I created these data points myself, so I know exactly how many cluster labels there are.

20


You don't have to take my word for it though. Let's put some colors on them!

20 clusters huh 🤔

There are 20 distinct clusters on screen. What was your guess before? Not quite 20, was it?


On the left of the canvas, there is a blob of blue and red points that sklearn assigned 2 different labels.

We can identify a similar green-yellow blob on the right as well.

When creating the data, I only wanted 2 features for each datapoint so they would map out nicely on an XY plane (more on that later).

One last thing to note is that I based the X & Y scale on GPS coordinates.
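For reference, a dataset along these lines could be generated with something like the sketch below. The sample count, number of centers, and feature count come from the description above; cluster_std, center_box, and random_state are placeholders rather than the exact values behind the figures.

```python
from sklearn.datasets import make_blobs

X, true_labels = make_blobs(
    n_samples=2123,            # the number of points on screen
    centers=20,                # the 20 cluster labels mentioned above
    n_features=2,              # two features so they map onto an XY plane
    cluster_std=0.01,          # placeholder spread
    center_box=(38.8, 39.0),   # placeholder box on a GPS-like scale
    random_state=42,           # placeholder seed
)
```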


DBSCAN

Next, let's see how DBSCAN deals with this synthetic dataset.

For this first simulation, I did an eye test for each distinct blob of points I could identify.

I settled on a radius of 0.5 km - because the points supposedly reflect coordinates that fit within the bounding box of a city I know, I thought 0.5 km was a reasonable number.

And core points must have at least 10 neighbors within the 0.5km radius.

The updated colors show that DBSCAN picked up 17 clusters using these parameters.

It found 9 points that couldn't be assigned a cluster number and marked them as noise.

The main 17 clusters DBSCAN identified were in line with how I would have colored the blobs had I been handed a box of crayons.
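For completeness, cluster and noise counts like these can be read straight off DBSCAN's label array. A minimal sketch follows; the make_blobs call is only a stand-in for the actual synthetic points, so it won't reproduce the 17-cluster, 9-noise-point split exactly.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Stand-in for the 2123 synthetic points (placeholder parameters).
X, _ = make_blobs(n_samples=2123, centers=20, cluster_std=0.2, random_state=0)

labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)  # the "eye test" parameters

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 is the noise label
n_noise = int(np.sum(labels == -1))
print(n_clusters, "clusters,", n_noise, "noise points")
```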

DBSCAN... eye test?

I hope you didn't immediately close this tab when I said "eye test".

Why would I want to do an eye test on fake unlabeled data (not really unlabeled) to give the model the parameters that would probably allow it to return the results I wanted?

Wouldn't that defeat the whole point of applying unsupervised learning algorithms to unlabeled data, letting the model tell us about patterns in the data that we can't see with our own eyes?

Yes, that was the whole point of the exercise.

I wanted to show that even though the dataset was unlabeled, somebody still had domain expertise over it. In this case, I did spend a shameful amount of time adjusting random_state and cluster_std in sklearn to make the blobs look just right.

Now that a domain expert on this synthetic dataset (me!) has checked DBSCAN's clustering results and confirmed that it performs at least as well as an eye test, let's throw more things at it!

DBSCAN part 2

In the second simulation, I set a more liberal set of parameters: for a point to be a core point and start defining its neighborhood (cluster), it only needs to have 5 neighbors within a 0.2 km radius.

Using the updated requirements, DBSCAN found 29 clusters but also classified a lot more points as noise.

🤔

So lowering the neighbor requirement and shrinking the radius simultaneously helped identify small clusters in denser areas as legitimate clusters while disqualifying more outer points on the edge of each blob.

🤔

DBSCAN can disadvantage outlying points and count them as noise if they're not located close enough to so-called "core points"?

🤔

🤔

🤔

But



What if



Your data



Looks like this



🤔

(I really hope that transformation worked on your browser or it would've been really awkward)

Clustering data with varying levels of sparsity

Perhaps you are more familiar with maps where items are spread out in a fashion similar to this, as opposed to the defined blobs in the synthetic example above.

Maps like this one are what you'd expect to see in reality, whether at the neighborhood, city, county, state, or country level, or even globally.

A simple explanation is that density varies from one area to another at any given level of aggregation.

We saw how well DBSCAN does when given a reasonable set of parameters to work with, but it struggles when it has to choose between points outside of the detected edge and smaller neighborhood clusters.

Introducing Hierarchical DBSCAN

HDBSCAN was developed as an extension to DBSCAN which incorporates an additional hierarchical component into the original algorithm.

Instead of using a single radius cut-off, HDBSCAN builds a hierarchy of clusterings and prunes the resulting tree to keep the most stable clusters.


HDBSCAN starts off similarly to DBSCAN: it requires a minimum number of points, min_cluster_size, to form a cluster.


Let's jump right into an example using this new set of points.

I will set min_cluster_size to 15. Let's see what happens.
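As a rough sketch, that call could look like the following with the hdbscan package (recent versions of scikit-learn also ship a similar sklearn.cluster.HDBSCAN); the random points here only stand in for the ones on screen.

```python
import numpy as np
import hdbscan  # pip install hdbscan

# Placeholder points standing in for the real (x, y) coordinates.
X = np.random.RandomState(0).uniform(size=(2123, 2))

clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(X)

print(len(set(labels)) - (1 if -1 in labels else 0), "clusters found")
```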

HDBSCAN delivers!

By allowing for varying levels of density, HDBSCAN splits the 2123 points (yes, it's the same number of points as before) into 47 clusters.


Do you notice anything familiar about these clusters after transformation?


That looks like Washington, DC!


Okay, let's draw this out.

A map of liquor license holders in DC

A liquor license holder is an entity such as a restaurant, liquor store, grocery store, brewery, or catering business that is allowed to sell alcoholic beverages.

The underlying map divides the city into 46 neighborhood clusters, whose "cluster boundaries were established in the early 2000s based on the professional judgment of the staff of the Office of Planning as reasonably descriptive units of the City for planning purposes."

Both datasets came from OpenDC

But why?

GPS coordinates of liquor license holders sure seem like a random thing to cluster, yet there are many factors that can qualify or disqualify a business or individual from obtaining a liquor license. Here is why I chose them anyway:

1) GPS coordinates are 2-dimensional, and latitude & longitude are well defined, easily understood, and easily visualized;

2) The distance between 2 points on a sphere can be measured precisely using the Haversine formula (a quick sketch of it follows this list);

3) We can visualize the output of clustering algorithms on a familiar map;

4) Using the neighborhood clusters defined by the Office of Planning as a stand-in for human judgment/domain expertise, we can visualize and compare the outputs of these clustering algorithms;

5) The algorithms can only "see" as far as the longitude and latitude of each establishment. They are blind to underlying factors like property value, the criminal history of the applicant, or covid violations.
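Since point 2 leans on the Haversine formula, here is a minimal sketch of it; the sample coordinates are only rough approximations of two DC neighborhoods.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometers between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Roughly Georgetown to Navy Yard (approximate coordinates, illustration only).
print(haversine_km(38.905, -77.063, 38.876, -77.003))
```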

Did you notice a neighborhood's borders changing color on your map? Please click on it!

Congratulations, silent protagonist! You have unlocked the full map!

Feel free to drag your mouse pointer around to see other neighborhoods, double-click to reset the view, and click on other neighborhoods to zoom into their bounding box!

You can also hover your pointer on any point to see the name of the establishment and the class of their license.

I don't want to drag this post out for too long, so there will only be 3 more simulations after this.

You're almost at the end. It'll be worth it.

DBSCAN revisited

Using the first set of DBSCAN parameters from earlier (the "eye test" parameters), the clustering performance is much worse on the 2123 real data points compared to the 2123 synthetic data points.

That is because DBSCAN does not perform density estimation in between points - all neighbors within the 0.5 km radius of a core point are considered to be in the same cluster.

Essentially, this means that DBSCAN considers liquor license holders in Georgetown and Navy Yard to be in the same cluster.
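For what it's worth, one common way to hand a kilometer-scale radius to DBSCAN on latitude/longitude data is the haversine metric, which expects coordinates in radians. The sketch below shows that idea under those assumptions and is not necessarily how the maps in this post were produced.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0

# Placeholder (latitude, longitude) pairs in degrees, roughly spanning DC.
coords_deg = np.random.RandomState(0).uniform(
    [38.80, -77.12], [39.00, -76.90], size=(500, 2)
)

labels = DBSCAN(
    eps=0.5 / EARTH_RADIUS_KM,   # a 0.5 km radius expressed in radians
    min_samples=10,
    metric="haversine",          # great-circle distances; inputs must be in radians
).fit_predict(np.radians(coords_deg))
```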

DBSCAN revisited, part 2

Setting a less strict definition of a cluster by dialing epsilon down to a radius of 0.2km and minimum samples of 5 somewhat improves DBSCAN's clustering performance. It now picks up 55 clusters, some of which do seem to be placed in different neighborhoods.

However, the West End - Dupont Circle - Shaw - Chinatown areas, where restaurants are highly concentrated, are still jumbled together into one very large cluster by DBSCAN.

HDBSCAN - Do we really need this?

In this last simulation, I set a very loose definition of a cluster for HDBSCAN, requiring min_cluster_size to be only 5.

To make the algorithm even more aggressive, I set min_samples to 2 - the default is set to be equal to min_cluster_size.

The lower min_samples, the more aggressive the algorithm is in trying to put points into clusters instead of disregarding them as noise.

In this case, HDBSCAN found 198 clusters of liquor license holders in Washington DC, which seems a little too high.
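And a quick sketch of that more aggressive configuration, again with the hdbscan package and placeholder points standing in for the license coordinates:

```python
import numpy as np
import hdbscan

X = np.random.RandomState(0).uniform(size=(2123, 2))  # placeholder coordinates

# min_samples defaults to min_cluster_size; lowering it makes HDBSCAN more
# willing to assign borderline points to clusters instead of calling them noise.
labels = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=2).fit_predict(X)
```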

Conclusion

I hope this scrollyboard was helpful in demonstrating how density-based clustering works.

I also hope it gave you some new ideas on ways to experiment with unsupervised learning and unlabeled data.

Inspirations & References

1) Vince Egalla, whose confidence convinced me that I could pull off learning D3 and JavaScript in 2 weeks, and who offered me so much help along the way.

2) The best tutorial on HDBSCAN on the internet by the authors of the package themselves.

3) Baron Watt's ScrollyTelling Example using D3 and Waypoints

4) D3 Zoom to Bounding Box

5) Moving Circles in D3

6) Github issue on converting geopandas with lists into GeoJSON

And many more. This is a running list and will keep being updated.

Thanks and shoutouts and whatnot

I'd like to thank all the wonderful folks in my cohort, specifically:

Vince Egalla, Sahithi Adari, Merykokeb Belay for all their help and encouragement and hype and tech support and "career-support-Wednesdays."

Proud founder of Tombs Day™ Madeline Kinnaird and the rest of Tombs Gang.

The incredibly brilliant, bright, selfless, idealistic, and passionate future leaders: Alex Adams, Gloria Li, and Matt Ring.

And everyone else. We've got really good folks in our cohort - thank you for making this semester great.

Final words

You've scrolled to the very end! You like me! You really like me!

This was my first time setting up a Github page, using JavaScript, CSS, HTML, and D3, so I'm really thankful that you stuck with me to this very last div.

Thank you for sitting through whatever that was.


If you'd like to know more about me or my projects, here are some links:

Go back to my Github

Follow me on Twitter where I mostly complain about Cal football (Go Bears!)

Connect with me on LinkedIn and hire me