<h1>Good practices with numpy random number generators</h1>
<p>Albert Thomas, 2020-04-07, <a href="https://albertcthomas.github.io/good-practices-random-number-generators">https://albertcthomas.github.io/good-practices-random-number-generators</a></p>
<p>Unless you are working on a problem where you can afford a true Random Number Generator (RNG), which is basically never for most of us, implementing something random means relying on a pseudo Random Number Generator. I want to share here what I have learnt about good practices with pseudo RNGs, especially the ones available in <a href="https://numpy.org/">numpy</a>. I assume some familiarity with numpy and that numpy 1.17 or greater is used. The reason is that great new features were introduced in the <a href="https://numpy.org/doc/1.18/reference/random/index.html">random</a> module in version 1.17. As <code class="language-plaintext highlighter-rouge">numpy</code> is usually imported as <code class="language-plaintext highlighter-rouge">np</code>, I will sometimes use <code class="language-plaintext highlighter-rouge">np</code> instead of <code class="language-plaintext highlighter-rouge">numpy</code>. Finally, as I will not talk about true RNGs, RNG will always mean pseudo RNG in the rest of this blog post.</p>
<h3 id="the-main-messages">The main messages</h3>
<ol>
<li>Avoid using the global numpy RNG. This means that you should avoid using <a href="https://numpy.org/doc/1.18/reference/random/generated/numpy.random.seed.html"><code class="language-plaintext highlighter-rouge">np.random.seed</code></a> and <code class="language-plaintext highlighter-rouge">np.random.*</code> functions, such as <code class="language-plaintext highlighter-rouge">np.random.random</code>, to generate random values.</li>
<li>Create a new RNG and pass it around using the <a href="https://numpy.org/doc/1.18/reference/random/generator.html#numpy.random.default_rng"><code class="language-plaintext highlighter-rouge">np.random.default_rng</code></a> function.</li>
<li>Be careful with parallel computations and rely on <a href="https://numpy.org/doc/1.18/reference/random/parallel.html">numpy strategies for reproducible parallel number generation</a>.</li>
</ol>
<p>Note that with numpy versions before 1.17 the way to create a new RNG is to use <a href="https://numpy.org/doc/1.18/reference/random/legacy.html#numpy.random.RandomState"><code class="language-plaintext highlighter-rouge">np.random.RandomState</code></a>, which is based on the popular Mersenne Twister 19937 algorithm. This is also how the global numpy RNG is created. <code class="language-plaintext highlighter-rouge">RandomState</code> is still available in version 1.17 and later, but it is now recommended to use <code class="language-plaintext highlighter-rouge">default_rng</code>, which returns a generator based on the statistically better <a href="https://www.pcg-random.org">PCG64</a> algorithm.</p>
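<p>As a rough illustration of the two ways of creating an RNG mentioned above, the sketch below builds a legacy <code class="language-plaintext highlighter-rouge">RandomState</code> and a new-style generator and checks which algorithm backs each of them. The seed value 2020 is an arbitrary choice made for this example.</p>

```python
import numpy as np

seed = 2020  # arbitrary seed chosen for this illustration

# Legacy API: a RandomState instance, backed by the Mersenne Twister
legacy_rng = np.random.RandomState(seed)
legacy_algorithm = legacy_rng.get_state()[0]

# New API (numpy >= 1.17): a Generator, backed by PCG64 by default
rng = np.random.default_rng(seed)
new_algorithm = type(rng.bit_generator).__name__

print(legacy_algorithm, new_algorithm)
```

<p>The two objects also expose different methods, so code written for one cannot always be used unchanged with the other.</p>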
<h2 id="random-number-generation-with-numpy">Random number generation with numpy</h2>
<p>When you import <code class="language-plaintext highlighter-rouge">numpy</code> in your Python script an RNG is created behind the scenes. This RNG is the one used when you generate a new random value using a function such as <code class="language-plaintext highlighter-rouge">np.random.random</code>. I will here refer to this RNG as the global numpy RNG.</p>
<p>Although not recommended, it is common practice to reset the seed of this global RNG at the beginning of a script using the <code class="language-plaintext highlighter-rouge">np.random.seed</code> function. Fixing the seed at the beginning ensures that the script is reproducible: the same values and results will be produced each time you run it. However convenient, using the global numpy RNG is considered bad practice. A simple reason is that global variables can lead to undesired side effects. For instance, one might use <code class="language-plaintext highlighter-rouge">np.random.random</code> without knowing that the seed of the global RNG was set somewhere else in the codebase. Quoting the <a href="https://numpy.org/neps/nep-0019-rng-policy.html">NumPy Enhancement Proposal (NEP) 19</a> by Robert Kern about the numpy RNG policy:</p>
<blockquote>
<p>The implicit global RandomState behind the <code class="language-plaintext highlighter-rouge">np.random.*</code> convenience functions can cause problems, especially when threads or other forms of concurrency are involved. Global state is always problematic. We categorically recommend avoiding using the convenience functions when reproducibility is involved. […] The preferred best practice for getting reproducible pseudorandom numbers is to instantiate a generator object with a seed and pass it around.</p>
</blockquote>
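<p>The side effect described above can be reproduced in a few lines. This is only a sketch: <code class="language-plaintext highlighter-rouge">third_party_helper</code> is a hypothetical function standing in for any code in your codebase that resets the global seed without telling you.</p>

```python
import numpy as np


def third_party_helper():
    # Hypothetical function buried in the codebase that resets the global seed
    np.random.seed(0)
    return np.random.random()


# Global RNG: the helper silently changes what the caller draws next
np.random.seed(42)
expected = np.random.random()
np.random.seed(42)
third_party_helper()
observed = np.random.random()
print(expected == observed)  # the caller no longer gets the value it expected

# Local RNG: the caller's stream is not affected by the helper
rng = np.random.default_rng(42)
expected_local = rng.random()
rng = np.random.default_rng(42)
third_party_helper()
observed_local = rng.random()
print(expected_local == observed_local)
```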
<p>In short:</p>
<ul>
<li>Instead of calling <code class="language-plaintext highlighter-rouge">np.random.seed</code>, which reseeds the already created global numpy RNG, and then using <code class="language-plaintext highlighter-rouge">np.random.*</code> functions, you should create a new RNG.</li>
<li>You should create one RNG at the beginning of your script (with a seed if you want reproducibility) and use this RNG in the rest of your script.</li>
</ul>
<p>To create a new RNG you can use the <code class="language-plaintext highlighter-rouge">default_rng</code> function as illustrated in the <a href="https://numpy.org/doc/1.18/reference/random/index.html#introduction">introduction of the random module documentation</a>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">seed</span> <span class="o">=</span> <span class="mi">12345</span>
<span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">default_rng</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span> <span class="c1"># can be called without a seed
</span><span class="n">rng</span><span class="p">.</span><span class="n">random</span><span class="p">()</span>
</code></pre></div></div>
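<p>To make the "create one RNG and use it everywhere" advice more concrete, here is a minimal sketch of a script organized that way. The functions <code class="language-plaintext highlighter-rouge">simulate_data</code>, <code class="language-plaintext highlighter-rouge">add_noise</code> and <code class="language-plaintext highlighter-rouge">main</code> are hypothetical names invented for this example.</p>

```python
import numpy as np


def simulate_data(rng, n_samples=100):
    # Hypothetical helper: draws two standard normal features per sample
    return rng.normal(size=(n_samples, 2))


def add_noise(data, rng, scale=0.1):
    # Hypothetical helper: perturbs the data using the RNG it is given
    return data + scale * rng.normal(size=data.shape)


def main(seed=None):
    # The only place in the script where an RNG is created
    rng = np.random.default_rng(seed)
    data = simulate_data(rng)
    return add_noise(data, rng)


# Two runs with the same seed give exactly the same result
run_1 = main(seed=12345)
run_2 = main(seed=12345)
print(np.array_equal(run_1, run_2))
```

<p>None of the helpers touches the global numpy RNG, so their output depends only on the RNG that is passed to them.</p>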
<p>The reason for seeding your RNG only once is that you can lose some of the randomness and the independence of the generated random numbers by reseeding the RNG multiple times. Furthermore, obtaining a good seed can be time consuming. Once you have a good seed to instantiate your generator you might as well use it. With a good RNG such as the one returned by <code class="language-plaintext highlighter-rouge">default_rng</code> you are assured of good randomness (and independence) of the generated numbers. It might be more dangerous to use different seeds: how do you know that the streams of random numbers obtained with two different seeds are not correlated, or I should say less independent than the ones created from the same seed? That being said, <a href="https://github.com/numpy/numpy/issues/15322#issuecomment-573890207">as explained by Robert Kern</a>, with the RNGs and seeding strategies introduced in numpy 1.17, it could be considered safe enough to recreate new RNGs from the system entropy, e.g. using <code class="language-plaintext highlighter-rouge">default_rng(None)</code> multiple times. However, as explained later, be careful when running jobs in parallel and relying on <code class="language-plaintext highlighter-rouge">default_rng(None)</code>.</p>
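<p>If you let numpy draw the seed from system entropy but still want to be able to rerun an experiment, one possible approach (a sketch, not something prescribed by the discussion above) is to create the <code class="language-plaintext highlighter-rouge">SeedSequence</code> yourself and record its <code class="language-plaintext highlighter-rouge">entropy</code> attribute.</p>

```python
import numpy as np

# Draw fresh entropy from the operating system
seed_seq = np.random.SeedSequence()
entropy = seed_seq.entropy  # a large integer that can be logged or saved
rng = np.random.default_rng(seed_seq)
values = rng.random(3)

# Later: rebuild the same RNG from the recorded entropy
rng_replay = np.random.default_rng(np.random.SeedSequence(entropy))
values_replay = rng_replay.random(3)
print(np.array_equal(values, values_replay))
```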
<h2 id="passing-a-numpy-rng-around">Passing a numpy RNG around</h2>
<p>As you write functions that you will use on their own as well as in a more complex script, it is convenient to be able to pass either a seed or your already created RNG. The function <code class="language-plaintext highlighter-rouge">default_rng</code> allows you to do this very easily. As written above, this function can be used to create a new RNG from your chosen seed, if you pass a seed to it, or from system entropy when passing <code class="language-plaintext highlighter-rouge">None</code>, but you can also pass an already created RNG. In this case the returned RNG is the one that you passed.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="k">def</span> <span class="nf">stochastic_function</span><span class="p">(</span><span class="n">seed</span><span class="p">,</span> <span class="n">high</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
    <span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">default_rng</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">rng</span><span class="p">.</span><span class="n">integers</span><span class="p">(</span><span class="n">high</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
</code></pre></div></div>
<p>You can either pass an <code class="language-plaintext highlighter-rouge">int</code> seed or your already created RNG to <code class="language-plaintext highlighter-rouge">stochastic_function</code>. To be perfectly exact, the <code class="language-plaintext highlighter-rouge">default_rng</code> function returns the exact same RNG passed to it only for certain kinds of RNGs, such as the ones created with <code class="language-plaintext highlighter-rouge">default_rng</code> itself. You can refer to the <a href="https://numpy.org/doc/1.18/reference/random/generator.html#numpy.random.default_rng"><code class="language-plaintext highlighter-rouge">default_rng</code> documentation</a> for more details on the arguments that you can pass to this function.</p>
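<p>The difference between passing an <code class="language-plaintext highlighter-rouge">int</code> seed and passing an existing RNG can be checked with a short sketch. The value of <code class="language-plaintext highlighter-rouge">high</code> is raised to 10**9 in the second half only to make it obvious that the two draws differ; this is a choice made for the example.</p>

```python
import numpy as np


def stochastic_function(seed, high=10):
    rng = np.random.default_rng(seed)
    return rng.integers(high, size=5)


rng = np.random.default_rng(12345)

# Passing a Generator created by default_rng returns that very object
same_object = np.random.default_rng(rng) is rng
print(same_object)

# int seed: a new RNG is built on each call, so the outputs are identical
from_int_1 = stochastic_function(12345)
from_int_2 = stochastic_function(12345)

# existing RNG: its state advances, so consecutive calls give different outputs
from_rng_1 = stochastic_function(rng, high=10**9)
from_rng_2 = stochastic_function(rng, high=10**9)
```

<p>In other words, an <code class="language-plaintext highlighter-rouge">int</code> seed makes each call reproducible on its own, whereas a shared RNG makes the whole sequence of calls reproducible.</p>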
<p>Before knowing about <code class="language-plaintext highlighter-rouge">default_rng</code>, and before numpy 1.17, I was using the scikit-learn function <a href="https://scikit-learn.org/stable/modules/generated/sklearn.utils.check_random_state.html"><code class="language-plaintext highlighter-rouge">check_random_state</code></a>, which is of course heavily used in the scikit-learn codebase. While writing this post I discovered that this function is now available in <a href="https://github.com/scipy/scipy/blob/master/scipy/_lib/_util.py#L171">scipy</a>. A look at the docstring and/or the source code of this function will give you a good idea of what it does. The differences with <code class="language-plaintext highlighter-rouge">default_rng</code> are that <code class="language-plaintext highlighter-rouge">check_random_state</code> currently relies on <code class="language-plaintext highlighter-rouge">np.random.RandomState</code> and that when <code class="language-plaintext highlighter-rouge">None</code> is passed to <code class="language-plaintext highlighter-rouge">check_random_state</code> the function returns the already existing global numpy RNG. The latter can be convenient because if you fix the seed of the global RNG earlier in your script using <code class="language-plaintext highlighter-rouge">np.random.seed</code>, <code class="language-plaintext highlighter-rouge">check_random_state</code> returns the generator that you seeded. However, as explained above, this is not the recommended practice and you should be aware of the risks and the side effects.</p>
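<p>To give an idea of the behavior described above, here is a simplified sketch of such a function. It is not the actual scikit-learn or scipy source code: the name <code class="language-plaintext highlighter-rouge">check_random_state_sketch</code> is made up, and accessing the global RNG through the private <code class="language-plaintext highlighter-rouge">np.random.mtrand._rand</code> attribute is an assumption about numpy internals that may not hold in every version.</p>

```python
import numbers

import numpy as np


def check_random_state_sketch(seed):
    """Simplified sketch, not the actual scikit-learn or scipy code."""
    if seed is None:
        # Assumption: the global RandomState lives in this private attribute
        return np.random.mtrand._rand
    if isinstance(seed, numbers.Integral):
        return np.random.RandomState(seed)
    if isinstance(seed, np.random.RandomState):
        return seed
    raise ValueError(f"{seed!r} cannot be used to seed a RandomState instance")


# Passing None gives back the global RNG, including the seed set on it earlier
np.random.seed(0)
from_global = check_random_state_sketch(None).random_sample()
np.random.seed(0)
direct = np.random.random_sample()
print(from_global == direct)
```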
<h2 id="parallel-processing">Parallel processing</h2>
<p>You must be careful when using RNGs in conjunction with parallel processing. I usually use the <a href="https://joblib.readthedocs.io/en/latest/">joblib</a> library to parallelize my code and I will therefore mainly talk about it. However most of the discussion is not specific to joblib.</p>
<p>Let’s consider the context of Monte Carlo simulation: you have a random function returning random outputs and you want to generate these random outputs many times, for instance to compute an empirical mean. If the function is expensive to compute, an easy way to speed up the computation is to resort to parallel processing. Depending on the parallel processing library or backend that you use, different behaviors can be observed. For instance, if you do not set the seed yourself, forked Python processes may end up using the same random seed, generated for instance from system entropy, and thus produce the exact same outputs, which is a waste of computational resources. I learnt this the hard way during my PhD. A very nice example illustrating this and other behaviors that one can obtain with joblib is available <a href="https://joblib.readthedocs.io/en/latest/auto_examples/parallel_random_state.html">here</a>.</p>
<p>If you fix the seed at the beginning of your main script for reproducibility and then pass the same RNG to each process to be run in parallel, most of the time this will not give you what you want as this RNG will be deep copied. The same results will thus be produced by each process. One of the solutions is to create as many RNGs as parallel processes with a different seed for each of these RNGs. The issue now is that you cannot choose the seeds as easily as you would think. When you choose two different seeds to instantiate two different RNGs how do you know that the numbers produced by these RNGs will appear as statistically independent? The design of independent RNGs for parallel processes has been an important research question. You can for instance refer to the paper <a href="https://www.sciencedirect.com/science/article/pii/S0378475416300829">Random numbers for parallel computers: Requirements and methods, with emphasis on GPUs</a> by L’Ecuyer et al. (2017) for a good summary of the different methods on this topic.</p>
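<p>The copying problem described above can be simulated without any parallel backend. In the sketch below a pickle round-trip is used as a stand-in for sending the same RNG to several worker processes; this is an assumption made to keep the example self-contained, and the exact mechanism depends on the library and backend you use.</p>

```python
import pickle

import numpy as np


def stochastic_function(rng, high=10):
    return rng.integers(high, size=5)


rng = np.random.default_rng(98765)

# Stand-in for 3 worker processes: each one receives its own copy of the RNG
worker_rngs = [pickle.loads(pickle.dumps(rng)) for _ in range(3)]
outputs = [stochastic_function(worker_rng) for worker_rng in worker_rngs]

# Every copy starts from the same state, so every "worker" returns the same values
all_identical = all(np.array_equal(outputs[0], out) for out in outputs)
print(all_identical)
```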
<p>Since numpy 1.17, it is very easy to instantiate independent RNGs. Depending on the RNG that you use, different strategies are available as documented in the <a href="https://numpy.org/doc/1.18/reference/random/parallel.html">Parallel random number generation section</a> of the numpy documentation. One of the strategies is to use <code class="language-plaintext highlighter-rouge">SeedSequence</code>, an algorithm that makes sure that a poor user-provided seed still results in a good initial state for the RNG. Additionally, it ensures that two close seeds will result in two very different initial states for the RNG that are, with very high probability, independent of each other. You can refer to the documentation of <a href="https://numpy.org/doc/1.18/reference/random/parallel.html#seedsequence-spawning">SeedSequence Spawning</a> for an example on how to generate independent RNGs from a user-provided seed. I here show how you can do this from an already existing RNG and apply it to the <a href="https://joblib.readthedocs.io/en/latest/auto_examples/parallel_random_state.html#fixing-the-random-state-to-obtain-deterministic-results">joblib example</a> mentioned above.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">joblib</span> <span class="kn">import</span> <span class="n">Parallel</span><span class="p">,</span> <span class="n">delayed</span>
<span class="k">def</span> <span class="nf">stochastic_function</span><span class="p">(</span><span class="n">seed</span><span class="p">,</span> <span class="n">high</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
    <span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">default_rng</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">rng</span><span class="p">.</span><span class="n">integers</span><span class="p">(</span><span class="n">high</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">seed</span> <span class="o">=</span> <span class="mi">98765</span>
<span class="c1"># create the RNG that you want to pass around
</span><span class="n">rng</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">default_rng</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
<span class="c1"># get the SeedSequence of the passed RNG
</span><span class="n">ss</span> <span class="o">=</span> <span class="n">rng</span><span class="p">.</span><span class="n">bit_generator</span><span class="p">.</span><span class="n">_seed_seq</span>
<span class="c1"># create 5 initial independent states
</span><span class="n">child_states</span> <span class="o">=</span> <span class="n">ss</span><span class="p">.</span><span class="n">spawn</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
<span class="c1"># use 2 processes to run the stochastic_function 5 times with joblib
</span><span class="n">random_vector</span> <span class="o">=</span> <span class="n">Parallel</span><span class="p">(</span><span class="n">n_jobs</span><span class="o">=</span><span class="mi">2</span><span class="p">)(</span><span class="n">delayed</span><span class="p">(</span>
    <span class="n">stochastic_function</span><span class="p">)(</span><span class="n">random_state</span><span class="p">)</span> <span class="k">for</span> <span class="n">random_state</span> <span class="ow">in</span> <span class="n">child_states</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">random_vector</span><span class="p">)</span>
<span class="c1"># rerun to check that we obtain the same outputs
</span><span class="n">random_vector</span> <span class="o">=</span> <span class="n">Parallel</span><span class="p">(</span><span class="n">n_jobs</span><span class="o">=</span><span class="mi">2</span><span class="p">)(</span><span class="n">delayed</span><span class="p">(</span>
    <span class="n">stochastic_function</span><span class="p">)(</span><span class="n">random_state</span><span class="p">)</span> <span class="k">for</span> <span class="n">random_state</span> <span class="ow">in</span> <span class="n">child_states</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">random_vector</span><span class="p">)</span>
</code></pre></div></div>
<p>By using a fixed seed you always get the same results and by using <code class="language-plaintext highlighter-rouge">SeedSequence.spawn</code> you have an independent RNG for each of the iterations. Note that I used the convenient <code class="language-plaintext highlighter-rouge">default_rng</code> function in <code class="language-plaintext highlighter-rouge">stochastic_function</code>. You can also see that the <code class="language-plaintext highlighter-rouge">SeedSequence</code> of the existing RNG is a private attribute. Accessing the <code class="language-plaintext highlighter-rouge">SeedSequence</code> might become easier in future versions of numpy (more information <a href="https://github.com/numpy/numpy/issues/15322#issuecomment-626400433">here</a>).</p>
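<p>To my knowledge, numpy 1.25 and later expose this through public API: the bit generator has a <code class="language-plaintext highlighter-rouge">seed_seq</code> attribute and generators have a <code class="language-plaintext highlighter-rouge">spawn</code> method. The sketch below uses them when they are available and falls back to the private attribute otherwise. The helper name <code class="language-plaintext highlighter-rouge">spawn_generators</code> is hypothetical, and the version numbers should be checked against the release notes of the numpy version you use.</p>

```python
import numpy as np


def spawn_generators(rng, n_children):
    """Hypothetical helper: return n_children independent child RNGs of rng."""
    if hasattr(rng, "spawn"):
        # Public API, assumed to be available from numpy 1.25
        return rng.spawn(n_children)
    bit_generator = rng.bit_generator
    seed_seq = getattr(bit_generator, "seed_seq", None)
    if seed_seq is None:
        # Older numpy: fall back to the private attribute used in this post
        seed_seq = bit_generator._seed_seq
    return [np.random.default_rng(child) for child in seed_seq.spawn(n_children)]


# Two parents created from the same seed spawn the same children
children_a = spawn_generators(np.random.default_rng(98765), 3)
children_b = spawn_generators(np.random.default_rng(98765), 3)
draws_a = [child.random(8) for child in children_a]
draws_b = [child.random(8) for child in children_b]
print(all(np.array_equal(a, b) for a, b in zip(draws_a, draws_b)))
```

<p>The child RNGs can then be passed to <code class="language-plaintext highlighter-rouge">stochastic_function</code> in place of the spawned <code class="language-plaintext highlighter-rouge">SeedSequence</code> objects, since <code class="language-plaintext highlighter-rouge">default_rng</code> accepts both.</p>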
<h2 id="resources">Resources</h2>
<h3 id="numpy-rngs">Numpy RNGs</h3>
<ul>
<li><a href="https://numpy.org/doc/1.18/reference/random/index.html">The documentation of the numpy random module</a> is the best place to find information and where I found most of the information that I share here.</li>
<li><a href="https://numpy.org/neps/nep-0019-rng-policy.html">The NumPy Enhancement Proposal (NEP) 19 on the Random Number Generator Policy</a>, which led to the changes introduced in numpy 1.17.</li>
<li>A <a href="https://github.com/numpy/numpy/issues/15322">recent numpy issue</a> about the <code class="language-plaintext highlighter-rouge">check_random_state</code> function and RNG good practices, especially <a href="https://github.com/numpy/numpy/issues/15322#issuecomment-573890207">this comment</a> by Robert Kern.</li>
<li><a href="https://scikit-learn.org/stable/faq.html#how-do-i-set-a-random-state-for-an-entire-execution">How do I set a random_state for an entire execution?</a> from the scikit-learn FAQ.</li>
</ul>
<h3 id="rngs-in-general">RNGs in general</h3>
<ul>
<li><a href="https://www.sciencedirect.com/science/article/pii/S0378475416300829">Random numbers for parallel computers: Requirements and methods, with emphasis on GPUs</a> by L’Ecuyer et al. (2017)</li>
<li>To know more about the default RNG used in numpy, named PCG, I recommend the <a href="https://www.pcg-random.org/paper.html">PCG paper</a> which also contains lots of useful information about RNGs in general. The <a href="https://www.pcg-random.org">pcg-random.org website</a> is also full of interesting information about RNGs.</li>
</ul>