Training and tuning your algorithm for optimal performance

In the final post about our clustering system and neural networks, we’ll explain how we’re training our algorithm, how we measure its effectiveness, and how we plan to make it even better.

Training the model involves two steps: forward propagation and backpropagation. First, we run forward propagation, which is essentially just testing the algorithm: we feed it inputs whose true results we already know and examine the outputs it produces. If the predictions differ from the true outputs, we continue with a process called backpropagation. During backpropagation, the weights on the links between neurons in the different layers of the network are adjusted according to how much each one contributed to the error, and then we run the algorithm again. If discrepancies between the algorithm’s output and the true output remain, we (or actually another algorithm) adjust the weights again, and this process goes on until the difference between the real and predicted output is sufficiently small.
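To make the loop described above more concrete, here is a minimal sketch in Python. It is not our production code: it assumes a single sigmoid neuron rather than a multi-layer network, a toy data set (the logical OR function), and made-up values for the learning rate and stopping tolerance. The names `train`, `predict` and `OR_SAMPLES` are purely illustrative. In a real network the same idea applies, with the chain rule carrying the error backwards through every layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Forward propagation for one sample: weighted sum, then sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(samples, learning_rate=1.0, tolerance=0.1, max_epochs=10_000):
    """Repeat forward pass + weight update until the average loss is small.

    All hyperparameter values here are assumptions chosen for the toy data.
    """
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    loss = float("inf")
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        loss = 0.0
        for features, target in samples:
            # Forward propagation: compare the prediction with the known answer.
            prediction = predict(weights, bias, features)
            clipped = min(max(prediction, 1e-12), 1.0 - 1e-12)  # avoid log(0)
            loss -= target * math.log(clipped) + (1 - target) * math.log(1 - clipped)
            # Backward step: for sigmoid + cross-entropy the error signal
            # is simply (prediction - target).
            error = prediction - target
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        loss /= len(samples)
        if loss < tolerance:
            break  # predictions are close enough to the true outputs
        # Adjust the weights against the gradient, then go around again.
        weights = [w - learning_rate * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias, loss, epoch


# Toy training data (an assumption for illustration): the logical OR function.
OR_SAMPLES = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

weights, bias, loss, epochs = train(OR_SAMPLES)
for features, target in OR_SAMPLES:
    print(features, target, round(predict(weights, bias, features), 3))
```

The stopping rule is the important part: training ends not when the error is zero, but when it falls below a tolerance we choose ourselves.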

In addition, we constantly evaluate our model using precision and recall graphs, an example of which can be seen below for the tag big data & artificial intelligence.

[Figure: precision and recall graph for the tag big data & artificial intelligence]

The red line (precision) corresponds to the values on the x-axis and indicates what percentage of the companies the algorithm assigned to the cluster actually belong there. The green line (recall) corresponds to the values on the y-axis and indicates what fraction of the companies that truly belong to the cluster the algorithm managed to pick up, regardless of how many incorrect picks it made along the way. Let’s use some basic math as an example. Say we have 100 big data & artificial intelligence startups in a database of 1,000 startups in total. We run the data through our neural network, and it predicts that 56 startups belong to the big data & artificial intelligence cluster. Of these 56 startups, 50 actually belong to the cluster. As 50 of the 56 companies that were assigned to the cluster were assigned correctly, our precision is just under 90% (50 / 56 ≈ 89.3%). However, the total number of startups that actually belong to the cluster is 100. As the network only picked up on 50 of those, the recall is 50%. We could broadly assign every company in the database to the cluster, thereby achieving a recall of 100%, but our precision would drop to 10% (100 out of 1,000 startups being assigned correctly).
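The arithmetic in this example can be written out in a few lines of Python. This is only a sketch; the function name `precision_recall` and its parameter names are our own choices for illustration.

```python
def precision_recall(true_positives, predicted_positives, actual_positives):
    """Return (precision, recall) from three raw counts.

    true_positives:      companies assigned to the cluster that really belong
    predicted_positives: all companies the model assigned to the cluster
    actual_positives:    all companies that really belong to the cluster
    """
    precision = true_positives / predicted_positives
    recall = true_positives / actual_positives
    return precision, recall


# The worked example: 56 assigned, 50 of them correct, 100 truly in the cluster.
precision, recall = precision_recall(50, 56, 100)
print(f"precision {precision:.1%}, recall {recall:.1%}")  # precision 89.3%, recall 50.0%

# Assigning all 1,000 companies to the cluster catches every relevant one.
all_precision, all_recall = precision_recall(100, 1000, 100)
print(f"precision {all_precision:.1%}, recall {all_recall:.1%}")  # precision 10.0%, recall 100.0%
```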

At the other extreme, we could narrow down the parameters of the model so that only 10 companies are assigned to the cluster, all 10 of which actually belong. This way we would achieve a precision of 100%, but a recall of only 10% (10 out of 100 relevant startups being selected). This is the trade-off between high precision and high recall. A well-trained model generalizes the relationship between inputs and outputs and achieves both high precision and high recall without being pushed to either extreme. In a perfect model, precision and recall would both be 100%, but this rarely happens in practice. What we’re currently striving for is around 90% precision with a recall level of 50%, which is exactly what the graph for big data & AI gives us.
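One common way to see this trade-off is to sweep a decision threshold over the scores a model produces. The sketch below assumes the model outputs a confidence score per company and that we assign a company to the cluster when the score reaches the threshold; the scores and labels in `SCORED` are invented for illustration and are not taken from our data.

```python
def precision_recall_at(threshold, scored):
    """Precision and recall when every score >= threshold is assigned to the cluster."""
    selected = [is_member for score, is_member in scored if score >= threshold]
    true_positives = sum(selected)
    actual_positives = sum(is_member for _, is_member in scored)
    # Precision is undefined when nothing is selected; 1.0 is the convention
    # chosen for this sketch.
    precision = true_positives / len(selected) if selected else 1.0
    recall = true_positives / actual_positives
    return precision, recall


# Hypothetical (score, truly_belongs_to_cluster) pairs, made up for this example.
SCORED = [
    (0.95, True), (0.90, True), (0.85, True), (0.70, False), (0.65, True),
    (0.40, True), (0.30, False), (0.20, False), (0.10, True), (0.05, False),
]

for threshold in (0.8, 0.5, 0.0):
    precision, recall = precision_recall_at(threshold, SCORED)
    print(f"threshold {threshold:.1f}: precision {precision:.0%}, recall {recall:.0%}")
```

With these made-up numbers, a strict threshold of 0.8 gives perfect precision but only half the recall, while a threshold of 0.0 assigns everything and gives full recall at the cost of precision.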
Regardless of how useful and sophisticated our neural network may be, a machine learning algorithm that can replace human intuition has yet to be developed. With new data feeding the network every day, keeping it running efficiently and effectively will require a human eye to make sure new companies are assigned to the correct industries. However, as more information is made available to the algorithm over time, it will learn how to assign new companies and new tags to industries with less and less human oversight.

Despite the initial results of our neural network being both promising and exciting, it remains under heavy development. While this post concludes our series on the development of the neural network, we’re definitely going to show off the results of the final implementation in the near future. On top of that, we’re excited to share more development stories as we strive to improve our data platform.

The other articles in this series:
How do you organize +160,000 startups into meaningful clusters?
Neural networks and how they help our data structure

Have data needs?

Funderbeam has up-to-date data on more than 160k startups and 20k investors. We are working together with a range of startups, accelerators, VCs and more to provide data services. If you would like to learn more, get in touch with Nick, our head of data, via Nicholas.Vandrey@Funderbeam.com or connect on Twitter: @nsvandrey.