Graphics by Dan Horne for Golem

Ok, so here’s the thing: machine learning is taking over the world.

Like, not in the literal, skynet’ish, space-oddysey’ish, matrix’ish way. At least for now.

How then? Well, in the ordinary, economic’ish way. Data is pouring, algorithms are trained, money is being made.

And since Golem is the future of computing, we would like to be able to run ML tasks on it.

There is, however, a problem. Ordinary machine learning algorithms are, at least at this moment, not trivially parallelizable. They can be run in parallel, but for a large part of them, it is only possible with high-bandwidth, low-latency connections.¹

But, there are algorithms, which can be easily parallelized, and which are crucial in the modern machine learning workflow, even if hidden behind the scenes. I’m talking about hyperparameter optimization stuff.

“What’s that?”, you may ask. Short answer: they are algorithms for optimizing algorithms. Let’s say, we have a toy neural network, with only one layer, for which we want to set a number of hidden neurons. If we set it too high, the network will overfit, if we set it too low — it will not learn as much as it could². Since the network’s training algorithm — gradient descent — doesn’t change the structure of the network, it just optimizes parameters (weights) of the model, we have to find it in some other way. How then?

Well, we can guess. If you are an experienced machine learning engineer or researcher, the guess will probably we quite good. But, that’s obviously not the best way to do it.

Another way is to just check all possible configurations (in some range, since there is an infinite amount of them). This will produce the best possible answer, but will take too long to be practical.

We can combine these two approaches by guessing a number of configurations, based on our intuition, then try them all and select the best.

So, how do we guess them? We can base it on our intuition, we can select them randomly… or we can use one of many hyperparameter optimization algorithms that exist to do this job for us.³

Since “trying out” every configuration in practice means training the whole network with this parameter, and then checking the result, the whole procedure can be easily parallelized.

So, this is our first, proof-of-concept machine learning algorithm working on Golem. We implemented it over the summer, so you can see how it works here. And here⁴ is the full description of the ML task, with experiments justifying various implementation choices.

If you have any comments or doubts, send them to and we would be happy to address them.

[1] Better algorithms are developed all the time, but we have yet to see a real distributed gradient descent. However, the prospects are looking good, for example see Lian et al.

[2] Yes, we currently control the overfitting by using other techniques, but this is only a toy example.

[3] In fact, these algorithms usually perform better than random search or even human experts. See Bergstra et al. or Snoek et al.

[4] It is one of our many research resources we made publicly available, to ensure the company and our process of developing Golem are more transparent than ever. We described it here.