We started by pulling data from various sources in which tags had been applied to startups, either by the databases or by the startups themselves. This gave us tags for most of the 160k startups, and a total of 300,000 different tags. Some tags were far too general, like “software”, while others were too specific and rarely appeared on their own. For example, tags like music video, music streaming, music chart, music entertainment, music, music label, music venues, and independent music (there are hundreds more like these) are closely related, so we cluster them together as music and audio. Other broad tags were too trivial to be useful, e.g. advertising, apps, and shopping. In addition, some tags did not describe a company’s actual business, and some differed only in singular vs. plural form, like designer vs. designers.
To overcome these limitations, we used a process called ‘hierarchical clustering’ to group tags with similar characteristics into clusters. To compute the similarity between tags, we used keyword information from Wikipedia. Specifically, we compared the content of the Wikipedia articles corresponding to each pair of tags. If our algorithm found that two articles shared frequently occurring keywords (terms that are specific to and representative of a given industry), the tags were considered similar, and thus the dendrogram below was born:
The massive dendrogram above is a lot to take in, so below we have zoomed in on how the aforementioned music & audio cluster was merged and which clusters are its nearest ‘companions’.
The outcome of the whole clustering and data cleansing process was an organized set of 45 different tag clusters. Considering we currently have thousands of different tags in our database, this is a substantial narrowing, and it has made searching for startups significantly easier. For example, you can search industries ranging from automotive to wellness & fitness. You can also filter your searches by valuation, last funding size, headquarters, founding date, and many more features, all on Funderbeam Data.
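For readers who want to see the idea in code, the similarity-plus-clustering step described above can be roughly sketched in Python. This is only an illustration under several assumptions: the `ARTICLES` texts are made-up stand-ins for real Wikipedia articles, TF-IDF weighting with cosine similarity is one plausible way to score "shared keywords", and average linkage with a `min_similarity` cut-off is an illustrative choice. The function names are hypothetical.

```python
import math
from collections import Counter

# Toy stand-ins for the Wikipedia article text behind each tag (made up for illustration).
ARTICLES = {
    "music streaming": "music streaming audio songs artists playlists listeners",
    "music label": "music label records artists albums songs audio",
    "music venues": "music venues concerts artists live audio stage",
    "car sharing": "car vehicles sharing drivers rental automotive",
    "electric vehicles": "electric vehicles car battery automotive drivers",
}


def tfidf_vectors(docs):
    """Map each tag to a sparse TF-IDF vector (dict of term -> weight)."""
    tokens = {tag: text.lower().split() for tag, text in docs.items()}
    n_docs = len(tokens)
    # Number of articles each term appears in.
    doc_freq = Counter(term for words in tokens.values() for term in set(words))
    vectors = {}
    for tag, words in tokens.items():
        counts = Counter(words)
        vectors[tag] = {
            # Smoothed IDF: rarer terms get a higher weight.
            term: (count / len(words))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_tags(docs, min_similarity=0.2):
    """Average-linkage agglomerative clustering of tags.

    Repeatedly merges the two most similar clusters and stops once the best
    remaining pair falls below min_similarity (an assumed cut-off).
    Returns the final clusters and the ordered list of merges.
    """
    vectors = tfidf_vectors(docs)
    clusters = [(tag,) for tag in docs]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [
                    cosine(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        if score < min_similarity:
            break
        merges.append((clusters[i], clusters[j], score))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges


if __name__ == "__main__":
    clusters, merges = cluster_tags(ARTICLES)
    for left, right, score in merges:
        print(f"{score:.2f}  {left} + {right}")
    print(clusters)
```

The ordered list of merges is the information a dendrogram draws, and the cut-off decides how many clusters remain at the end. At real scale you would likely reach for a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` rather than the quadratic loop shown here.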
We have been using this method for classification since we started clustering, and it has brought us a long way. However, we have also seen some limitations to the method, which we will discuss in the next post in this series, along with how we are improving it using neural networks.
We wanted to offer some insight for our more technically savvy followers, so we decided to produce a set of articles explaining how we gather and sort information, specifically how we tag startups and group them in a logical way.
The other articles in this series:
– Neural networks and how they help our data structure
Please leave a comment if you have any input, and make sure to follow us so you get notified about our next post!
Have data needs?
Funderbeam has updated data on more than 160k startups and 20k investors. We are working together with a range of startups, accelerators, VCs and more to provide data services. If you would like to learn more, get in touch with Nick, our head of data via: Nicholas.Vandrey@Funderbeam.com or connect on Twitter: @nsvandrey.