Blog blog("Baptiste Wicht"); (Posts about Algorithm)https://baptiste-wicht.com/enWed, 20 Sep 2023 17:44:01 GMTNikola (getnikola.com)http://blogs.law.harvard.edu/tech/rssRelated posts on a Nikola websitehttps://baptiste-wicht.com/posts/2014/04/related-posts-nikola-website.htmlBaptiste Wicht<p>The one thing I missed in Nikola was <strong>Related Posts generation</strong>. I solved this during <a href="http://baptiste-wicht.com/posts/2014/03/migrated-from-wordpress-to-nikola.html">the migration from WordPress to Nikola</a> by using simple algorithms to generate related posts for each blog post and then displaying them in a simple widget. </p>
<p>For example, you can see the related posts of this post on the left, just under my Google+ badge. </p>
<p>Here is the workflow that is used:
* A simple C++ tool generates a list of related posts in HTML for each post
* The generated HTML code is included in the Mako template using Python</p>
<p>In this article, I'll show how the related posts are generated and how to include them in your template.</p>
<h2>Related Post Generation</h2>
<p>It is important to note that the content of the files must be cleaned up before it is used:
* First, all HTML that may be present in the Markdown files is removed. I remove only the HTML tags, not their content. For instance, in <em>&lt;strong&gt;test&lt;/strong&gt;</em>, <em>test</em> would be counted, but not <em>strong</em>. The only exception is the content of preformatted parts (typically some code or console output), which is completely removed.
* The Markdown syntax must also be cleaned up: for instance, parentheses and square brackets are removed, but not their content. The same goes for the Markdown syntax for bold, italics, ...
* Finally, I also remove punctuation. </p>
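<p>The cleanup steps above can be sketched roughly as follows. This is only an illustration and not the code of the actual generator: the function name <em>tokenize</em> is hypothetical, only ASCII letters and digits are kept, words are lowercased, and preformatted blocks are not handled.</p>

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: splits a post into lowercase words.
// HTML tags are skipped (their inner text is kept) and every character
// that is not an ASCII letter or digit acts as a separator, which drops
// punctuation and Markdown symbols such as brackets and asterisks.
std::vector<std::string> tokenize(const std::string& input) {
    std::vector<std::string> words;
    std::string current;
    bool in_tag = false;

    for (unsigned char c : input) {
        if (c == '<') {
            in_tag = true; // simplification: every '<' opens a tag
        }

        const bool keep = !in_tag && std::isalnum(c);
        if (keep) {
            current += static_cast<char>(std::tolower(c));
        } else if (!current.empty()) {
            words.push_back(current);
            current.clear();
        }

        if (c == '>') {
            in_tag = false;
        }
    }

    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}
```

<p>With this sketch, the input <em>&lt;strong&gt;test&lt;/strong&gt;, [ok]!</em> yields the two words <em>test</em> and <em>ok</em>.</p>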
<p>My related posts algorithm is very simple. </p>
<p>First, I compute the Term Frequency (TF) of each word in each post. The number of times a word <em>w</em> is present in a document <em>d</em> is represented by <em>tf(w,d)</em>. I decided to give more weight to words in the title and the tags, but that is just a matter of choice. </p>
<p>After that, I compute the Inverse Document Frequency (IDF) of each word. This measure makes it possible to filter out words like: a, the, and, has, is, ... These words are not really representative of the content of a blog post. The formula for idf is very simple: <em>idf(w) = log(N / (1 + n(w)))</em>, where <em>N</em> is the total number of posts and <em>n(w)</em> is the number of posts where the word is present. It is a measure of the rarity of a word across the complete set of posts. </p>
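<p>As a minimal sketch of this step, the term frequencies of one post could be accumulated in a hash map. The names <em>term_frequency</em> and <em>title_weight</em>, as well as the default weight of 3, are assumptions made for the example and not the values used by the real tool.</p>

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

using TermCounts = std::unordered_map<std::string, double>;

// Counts every word of the body once and every word of the title (or tags)
// title_weight times. The weight is an illustrative choice.
TermCounts term_frequency(const std::vector<std::string>& body_words,
                          const std::vector<std::string>& title_words,
                          double title_weight = 3.0) {
    assert(title_weight >= 0.0);

    TermCounts tf;
    for (const auto& word : body_words) {
        tf[word] += 1.0;
    }
    for (const auto& word : title_words) {
        tf[word] += title_weight;
    }
    return tf;
}
```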
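<p>The idf formula could be implemented along these lines. This is a sketch under the assumption that each post is already available as a list of cleaned words; the function name is hypothetical.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Computes idf(w) = log(N / (1 + n(w))) for every word of the post set,
// where N is the number of posts and n(w) the number of posts containing w.
std::unordered_map<std::string, double>
inverse_document_frequency(const std::vector<std::vector<std::string>>& posts) {
    assert(!posts.empty());

    // n(w): each post counts at most once per word
    std::unordered_map<std::string, std::size_t> post_count;
    for (const auto& words : posts) {
        const std::unordered_set<std::string> unique(words.begin(), words.end());
        for (const auto& word : unique) {
            ++post_count[word];
        }
    }

    const double total = static_cast<double>(posts.size());
    std::unordered_map<std::string, double> idf;
    for (const auto& entry : post_count) {
        idf[entry.first] = std::log(total / (1.0 + static_cast<double>(entry.second)));
    }
    return idf;
}
```

<p>Note that, because of the <em>1 +</em> in the denominator, a word that appears in every post gets a slightly negative idf, which pushes it even further down.</p>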
<p>Once we have the two values, we can easily compute the TF-IDF vectors of each blog post. The TF-IDF for a word is simply: <em>tf_idf(w,d) = tf(w, d) * idf(w)</em>. </p>
<p>Finally, we can derive the matrix of Cosine similarities between the TF-IDF vectors. The idea of the algorithm is simple: each document is represented by a vector, and the angle between two vectors indicates how related two posts are. The formula for the Cosine similarity is also simple: <em>cs(d1, d2) = dot(d1, d2) / (||d1|| * ||d2||)</em>, where <em>d1</em> and <em>d2</em> are two TF-IDF vectors. Once the cosine similarities between all pairs of documents are computed, we can just take the N most similar documents as the "Related Posts" for each blog post. </p>
<p>With this list, the C++ program simply generates an HTML file that will be included in each post by the Nikola template. This process is <strong>very fast</strong>. I have around 200 posts on this blog and the generation takes about 1 second. </p>
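<p>A possible sketch of the cosine similarity on sparse TF-IDF vectors is shown here. The representation as a map from word to weight and the names used are assumptions for the example; returning 0 for an empty vector is also a choice made here to avoid a division by zero.</p>

```cpp
#include <cmath>
#include <string>
#include <unordered_map>

// Sparse TF-IDF vector: only the words present in the post are stored
using SparseVector = std::unordered_map<std::string, double>;

// Euclidean norm ||v||
double vector_norm(const SparseVector& v) {
    double sum = 0.0;
    for (const auto& entry : v) {
        sum += entry.second * entry.second;
    }
    return std::sqrt(sum);
}

// cs(d1, d2) = dot(d1, d2) / (||d1|| * ||d2||)
double cosine_similarity(const SparseVector& d1, const SparseVector& d2) {
    double dot = 0.0;
    for (const auto& entry : d1) {
        const auto it = d2.find(entry.first);
        if (it != d2.end()) {
            dot += entry.second * it->second; // only shared words contribute
        }
    }

    const double denominator = vector_norm(d1) * vector_norm(d2);
    return denominator == 0.0 ? 0.0 : dot / denominator;
}
```

<p>Two posts with exactly the same weights have a similarity of 1, and two posts without any word in common have a similarity of 0.</p>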
<h2>Include in template</h2>
<p>Once the HTML files are generated, they are included into the website by altering the template and adding their content directly into the web page. Here is the code I use in <em>base.tmpl</em>.</p>
<div class="code"><pre class="code literal-block"><span class="cp">%</span><span class="k">if</span> <span class="n">post</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">post</span><span class="o">.</span><span class="n">source_link</span><span class="p">()</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'/stories/'</span><span class="p">):</span>
<span class="x"> <div class="left-sidebar-widget"></span>
<span class="x"> <h3>Related posts</h3></span>
<span class="x"> <div class="left-sidebar-widget-content"></span>
<span class="x"> </span><span class="cp"><%</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="n">related_dir</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">getcwd</span><span class="p">()</span>
<span class="n">related_path</span> <span class="o">=</span> <span class="n">related_dir</span> <span class="o">+</span> <span class="n">post</span><span class="o">.</span><span class="n">source_link</span><span class="p">()</span> <span class="o">+</span> <span class="s2">".related.html"</span>
<span class="k">try</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">related_path</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="n">related_text</span> <span class="o">=</span> <span class="n">f</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="k">except</span> <span class="ne">IOError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
    <span class="n">related_text</span> <span class="o">=</span> <span class="s2">"Not generated"</span>
<span class="cp">%></span>
<span class="x"> </span><span class="cp">${</span><span class="n">related_text</span><span class="cp">}</span>
<span class="x"> </div></span>
<span class="x"> </div></span>
<span class="cp">%</span><span class="k">endif</span>
</pre></div>
<p>You could also display it in <em>post.tmpl</em> as a simple list. </p>
<p>There is a limitation with this code: it only works if the source file has the same name as the slug, otherwise the file is not found. If someone has a solution to get the path to the source file and not the slug version, I'd be glad to have it ;)</p>
<h2>Conclusion</h2>
<p>The code for the generator is available on the <a href="https://github.com/wichtounet/wichtounet.github.io/tree/master/src/related">GitHub repository of my website</a>. </p>
<p>I wrote it in C++ because I don't like Python much, because I'm not good at it, and because it would have taken me a lot more time to integrate it into Nikola. If I have time and I'm motivated enough, I'll try to integrate it into Nikola. </p>
<p>I hope that could be useful for some people. </p>AlgorithmC++NikolaPythonThe sitehttps://baptiste-wicht.com/posts/2014/04/related-posts-nikola-website.htmlSat, 05 Apr 2014 14:16:45 GMTInteger Linear Time Sorting Algorithmshttps://baptiste-wicht.com/posts/2012/11/integer-linear-time-sorting-algorithms.htmlBaptiste Wicht<p><strong>Update</strong>: The code is now more C++</p>
<p>Most of the sorting algorithms in common use are comparison sorts. It means that the elements of the collection being sorted are compared with each other to determine their order. Any comparison sort needs Ω(n log n) comparisons in the worst case. That is why there is no comparison-based sorting algorithm better than O(n log n).</p>
<p>On the other hand, there are also sorting algorithms that perform better. This is the family of integer sorting algorithms. These algorithms use properties of integers to sort them without comparing them. They can only be used to sort integers. Nevertheless, values of another type can be sorted as well if each value can be mapped to a unique integer in a way that preserves the ordering. All these algorithms use extra space. There are several of these algorithms. In this article, we will see three of them and I will present an implementation in C++. At the end of the article, I will compare them to <em>std::sort</em>.</p>
<p>In the article, I will use <em>n</em> as the size of the array to sort and <em>m</em> as the max number that is permitted in the array.</p>
<h3>Bin Sort</h3>
<p>Bin Sort, or Bucket Sort, is a very simple algorithm that partitions all the input numbers into a number of buckets. Then, all the buckets are written back in order into the array, resulting in a sorted array. I decided to implement the simplest case of Bin Sort where each number goes in its own bucket, so there are <em>m</em> buckets.</p>
<p>The implementation is pretty straightforward:</p>
<div class="code"><pre class="code literal-block"><span class="kt">void</span><span class="w"> </span><span class="nf">binsort</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="o">>&</span><span class="w"> </span><span class="n">A</span><span class="p">){</span>
<span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="o">>></span><span class="w"> </span><span class="n">B</span><span class="p">(</span><span class="n">MAX</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span>
<span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">SIZE</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">){</span>
<span class="w"> </span><span class="n">B</span><span class="p">[</span><span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">]].</span><span class="n">push_back</span><span class="p">(</span><span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">current</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span>
<span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">MAX</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">){</span>
<span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="k">auto</span><span class="w"> </span><span class="n">item</span><span class="w"> </span><span class="o">:</span><span class="w"> </span><span class="n">B</span><span class="p">[</span><span class="n">i</span><span class="p">]){</span>
<span class="w"> </span><span class="n">A</span><span class="p">[</span><span class="n">current</span><span class="o">++</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">item</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</pre></div>
<p>B is the array of buckets. Each bucket is implemented as a std::vector. The algorithm starts by filling each bucket with the numbers from the input array. Then, it writes them back in order into the array.</p>
<p>This algorithm works in <em>O(n + m)</em> and requires <em>O(n + m)</em> extra memory, since the <em>m</em> buckets hold <em>n</em> elements in total. This makes it a very limited algorithm: if you don't know the maximum number and have to use the maximum value of the array type, you will have to allocate, for instance, 2^32 buckets. That won't be practical.</p>
<h3>Counting Sort</h3>
<p>An interesting fact about binsort is that each bucket contains only copies of the same number. The size of the bucket would be enough. That is exactly what Counting Sort does. It counts the number of times an element is present instead of storing the elements themselves. I will present two versions. The first one is a version using a secondary array and then copying again into the input array and the second one is an in-place sort.</p>
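<p>When the maximum is not known in advance, it can be computed from the data with one extra pass. Here is a small self-contained sketch of this variant. The name <em>binsort_auto</em> is hypothetical, and the approach is only reasonable when the largest value is small enough to allocate one bucket per value.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Bin sort with one bucket per value, where the number of buckets is
// derived from the largest element instead of a global constant.
void binsort_auto(std::vector<std::size_t>& values) {
    if (values.empty()) {
        return;
    }

    const std::size_t max = *std::max_element(values.begin(), values.end());
    std::vector<std::vector<std::size_t>> buckets(max + 1);

    // Distribute each number into its own bucket
    for (std::size_t value : values) {
        buckets[value].push_back(value);
    }

    // Write the buckets back in increasing order
    std::size_t current = 0;
    for (const auto& bucket : buckets) {
        for (std::size_t value : bucket) {
            values[current++] = value;
        }
    }
}
```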
<div class="code"><pre class="code literal-block"><span class="kt">void</span><span class="w"> </span><span class="nf">counting_sort</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="o">>&</span><span class="w"> </span><span class="n">A</span><span class="p">){</span>
<span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="o">></span><span class="w"> </span><span class="n">B</span><span class="p">(</span><span class="n">SIZE</span><span class="p">);</span>
<span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="o">></span><span class="w"> </span><span class="n">C</span><span class="p">(</span><span class="n">MAX</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">SIZE</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">){</span>
<span class="w"> </span><span class="o">++</span><span class="n">C</span><span class="p">[</span><span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">]];</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">MAX</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">){</span>
<span class="w"> </span><span class="n">C</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">C</span><span class="p">[</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="p">];</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="kt">long</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SIZE</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="o">--</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">B</span><span class="p">[</span><span class="n">C</span><span class="p">[</span><span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">]]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="w"> </span><span class="o">--</span><span class="n">C</span><span class="p">[</span><span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">]];</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">SIZE</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">){</span>
<span class="w"> </span><span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">B</span><span class="p">[</span><span class="n">i</span><span class="p">];</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</pre></div>
<p>The algorithm is also simple. It starts by counting the number of elements in each bucket. Then, it aggregates the counts by summing them to obtain the position of each element in the final sorted array. Then, all the elements are copied into the temporary array. Finally, the temporary array is copied into the final array. This algorithm works in <em>O(m + n)</em> and requires <em>O(m + n)</em> extra memory. This version is presented only because it is present in the literature. We can do much better by avoiding the temporary array and optimizing it a bit:</p>
<div class="code"><pre class="code literal-block"><span class="kt">void</span><span class="w"> </span><span class="nf">in_place_counting_sort</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="o">>&</span><span class="w"> </span><span class="n">A</span><span class="p">){</span>
<span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="o">></span><span class="w"> </span><span class="n">C</span><span class="p">(</span><span class="n">MAX</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="mi">1</span><span class="p">);</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">SIZE</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">){</span>
<span class="w"> </span><span class="o">++</span><span class="n">C</span><span class="p">[</span><span class="n">A</span><span class="p">[</span><span class="n">i</span><span class="p">]];</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">current</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">MAX</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">i</span><span class="p">){</span>
<span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">C</span><span class="p">[</span><span class="n">i</span><span class="p">];</span><span class="w"> </span><span class="o">++</span><span class="n">j</span><span class="p">){</span>
<span class="w"> </span><span class="n">A</span><span class="p">[</span><span class="n">current</span><span class="o">++</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</pre></div>
<p>The temporary array is removed and the elements are directly written in the sorted array. The counts are not used directly as position, so there is no need to sum them. This version still works in <em>O(m + n)</em> but requires only <em>O(m)</em> extra memory. It is much faster than the previous version.</p>
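<p>One case where the first version, with the summed counts, remains useful is when the numbers are keys attached to other data: rebuilding the values from the counts is then not possible, and the summed counts give a stable order. The following is a sketch of that idea under stated assumptions; the <em>Record</em> type and the function name are made up for the example, and every key must be lower than or equal to <em>max_key</em>.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record: an integer key with some attached payload
using Record = std::pair<std::size_t, std::string>;

// Stable counting sort of records by key, using summed counts as positions
std::vector<Record> counting_sort_by_key(const std::vector<Record>& input,
                                         std::size_t max_key) {
    std::vector<std::size_t> counts(max_key + 1, 0);
    for (const auto& record : input) {
        assert(record.first <= max_key);
        ++counts[record.first];
    }

    // After this loop, counts[k] is the number of records with key <= k
    for (std::size_t k = 1; k <= max_key; ++k) {
        counts[k] += counts[k - 1];
    }

    // Walk backwards so that records with equal keys keep their order
    std::vector<Record> output(input.size());
    for (std::size_t i = input.size(); i-- > 0;) {
        output[--counts[input[i].first]] = input[i];
    }
    return output;
}
```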
<h3>Radix Sort</h3>
<p>The last version that I will discuss here is a Radix Sort. This algorithm sorts the numbers digit after digit in a specific radix. It is a form of bucket sort, where there is one bucket per digit value. Like Counting Sort, only the counts are necessary. For example, if you use radix sort in base 10, it will first sort all the numbers by their least significant digit, then by the next one, and so on. It can work in any base and that is its strength. With a well chosen base, it can be very powerful. Here, we will focus on radixes of the form 2^r. These radixes have a good property: we can use shifts and masks to perform division and modulo, making the algorithm much faster.</p>
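<p>To make the shift and mask trick concrete, here is a small sketch that extracts one digit of a number in radix 2^r in both ways. The two function names are hypothetical and only serve the comparison.</p>

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Digit number `index` (0 = least significant) of `value` in radix 2^r,
// computed with a shift (division by 2^(index * r)) and a mask (modulo 2^r).
std::size_t digit_shift_mask(std::size_t value, std::size_t index, std::size_t r) {
    const std::size_t bits = sizeof(std::size_t) * CHAR_BIT;
    assert(r > 0 && r < bits && (index + 1) * r <= bits);

    const std::size_t mask = (static_cast<std::size_t>(1) << r) - 1;
    return (value >> (index * r)) & mask;
}

// Same digit computed with plain division and modulo, for comparison
std::size_t digit_div_mod(std::size_t value, std::size_t index, std::size_t r) {
    const std::size_t radix = static_cast<std::size_t>(1) << r;
    for (std::size_t i = 0; i < index; ++i) {
        value /= radix;
    }
    return value % radix;
}
```

<p>For instance, with <em>r = 16</em>, the value 0x12345678 has the digits 0x5678 (index 0) and 0x1234 (index 1).</p>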
<p>The implementation is a bit more complex than the other implementations:</p>
<div class="code"><pre class="code literal-block"><span class="k">static</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">digits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">2</span><span class="p">;</span><span class="w"> </span><span class="c1">//Digits</span>
<span class="k">static</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">16</span><span class="p">;</span><span class="w"> </span><span class="c1">//Bits</span>
<span class="k">static</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">radix</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="o"><<</span><span class="w"> </span><span class="n">r</span><span class="p">;</span><span class="w"> </span><span class="c1">//Bins</span>
<span class="k">static</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">mask</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">radix</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span>
<span class="kt">void</span><span class="w"> </span><span class="nf">radix_sort</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="o">>&</span><span class="w"> </span><span class="n">A</span><span class="p">){</span>
<span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="o">></span><span class="w"> </span><span class="n">B</span><span class="p">(</span><span class="n">SIZE</span><span class="p">);</span>
<span class="w"> </span><span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="o">></span><span class="w"> </span><span class="n">cnt</span><span class="p">(</span><span class="n">radix</span><span class="p">);</span>
<span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="n">shift</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">digits</span><span class="p">;</span><span class="w"> </span><span class="n">i</span><span class="o">++</span><span class="p">,</span><span class="w"> </span><span class="n">shift</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">r</span><span class="p">){</span>
<span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">radix</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">j</span><span class="p">){</span>
<span class="w"> </span><span class="n">cnt</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">SIZE</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">j</span><span class="p">){</span>
<span class="w"> </span><span class="o">++</span><span class="n">cnt</span><span class="p">[(</span><span class="n">A</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="n">shift</span><span class="p">)</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">mask</span><span class="p">];</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">radix</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">j</span><span class="p">){</span>
<span class="w"> </span><span class="n">cnt</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o">+=</span><span class="w"> </span><span class="n">cnt</span><span class="p">[</span><span class="n">j</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="p">];</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="kt">long</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">SIZE</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="mi">1</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="o">--</span><span class="n">j</span><span class="p">){</span>
<span class="w"> </span><span class="n">B</span><span class="p">[</span><span class="o">--</span><span class="n">cnt</span><span class="p">[(</span><span class="n">A</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o">>></span><span class="w"> </span><span class="n">shift</span><span class="p">)</span><span class="w"> </span><span class="o">&</span><span class="w"> </span><span class="n">mask</span><span class="p">]]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">A</span><span class="p">[</span><span class="n">j</span><span class="p">];</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="k">for</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="kt">size_t</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="mi">0</span><span class="p">;</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">SIZE</span><span class="p">;</span><span class="w"> </span><span class="o">++</span><span class="n">j</span><span class="p">){</span>
<span class="w"> </span><span class="n">A</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">B</span><span class="p">[</span><span class="n">j</span><span class="p">];</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</pre></div>
<p><em>r</em> indicates the power of two used as the radix (2^r). The mask is used to compute the modulo faster: since the radix is a power of two, a bitwise AND with the mask replaces the modulo by the radix. The algorithm repeats the same steps for each digit. Here <em>digits</em> equals 2, so a 32-bit value is sorted in two passes and 2^32 distinct values are supported. The steps are very similar to counting sort: the occurrences of each digit value are counted, then the counts are summed to give the position of each number. Finally, the numbers are put in order in the temporary array and copied back into A.</p>
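<p>To make the shift and mask step more concrete, here is a minimal standalone sketch. It assumes <em>r</em> = 16, which is what two digits covering 32 bits implies, and the helper name <code>digit_of</code> is only there for illustration; it is not part of the implementation above.</p>

```cpp
#include <cstddef>
#include <cstdint>

// Assumed parameters: 16 bits per digit, so two digits cover a 32-bit value.
constexpr std::size_t r = 16;
constexpr std::size_t radix = std::size_t(1) << r; // 2^r possible digit values
constexpr std::size_t mask = radix - 1;            // r low bits set to 1

// Hypothetical helper: returns the i-th base-2^r digit of value
// (digit 0 is the least significant one). Only valid for i < 2 here.
constexpr std::size_t digit_of(std::uint32_t value, std::size_t i){
    // Shifting drops the lower digits, the mask keeps exactly one digit.
    // This is the same as (value / radix^i) % radix, without any division.
    return (value >> (i * r)) & mask;
}
```

<p>With these assumptions, <code>digit_of(0x12345678, 0)</code> gives <code>0x5678</code> and <code>digit_of(0x12345678, 1)</code> gives <code>0x1234</code>: the first pass orders the numbers by their low 16 bits and the second pass by their high 16 bits.</p>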
<p>This algorithm works in <em>O(digits (n + radix))</em> and requires <em>O(n + radix)</em> extra memory. A very good thing is that the algorithm does not require space based on the maximum value, only on the radix.</p>
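<p>As a rough illustration of the memory argument, the following sketch compares the number of extra elements in both situations. The function names and the sizes are assumptions chosen for the example (n = 5000000, m = 10n, radix = 2^16, values in [0, m)); they are not taken from the implementation.</p>

```cpp
#include <cstddef>

// Extra elements used by the radix sort above: the temporary array (n values)
// plus one counter per possible digit value (radix counters).
constexpr std::size_t radix_sort_extra(std::size_t n, std::size_t radix){
    return n + radix;
}

// Counters used by a sort that indexes its counting array directly by value,
// assuming the values lie in [0, m).
constexpr std::size_t value_indexed_counters(std::size_t m){
    return m;
}

// Assumed example sizes, not measurements.
constexpr std::size_t example_n = 5000000;
constexpr std::size_t by_radix = radix_sort_extra(example_n, std::size_t(1) << 16); // 5,065,536 elements
constexpr std::size_t by_value = value_indexed_counters(10 * example_n);            // 50,000,000 counters
```

<p>The first number stays the same if the values get larger, while the second one grows with m.</p>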
<h3>Results</h3>
<p>It's time to compare the different implementations in terms of runtime. For each size, each version is tested 25 times on different random arrays. The arrays are the same for each algorithm. The number reported is the total time necessary to sort the 25 arrays. The benchmark has been compiled with GCC 4.7.</p>
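<p>The benchmark code itself is not shown here, so the following is only a sketch of how such a measurement could be set up with <code>std::chrono</code>. The names <code>make_inputs</code> and <code>bench_ms</code>, the fixed seed and the uniform distribution over [0, m) are all assumptions of this sketch.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Builds `runs` random arrays of n values drawn uniformly from [0, m).
// The fixed seed makes every algorithm see exactly the same arrays.
std::vector<std::vector<std::size_t>> make_inputs(std::size_t runs, std::size_t n, std::size_t m, unsigned seed = 42){
    std::mt19937_64 engine(seed);
    std::uniform_int_distribution<std::size_t> dist(0, m - 1);
    std::vector<std::vector<std::size_t>> inputs(runs, std::vector<std::size_t>(n));
    for(auto& array : inputs){
        for(auto& value : array){
            value = dist(engine);
        }
    }
    return inputs;
}

// Returns the total time, in milliseconds, needed to sort every array.
// The inputs are taken by value: the copy is made before the timer starts,
// so the caller can reuse the same unsorted arrays for the next algorithm.
template<typename Sort>
long long bench_ms(std::vector<std::vector<std::size_t>> inputs, Sort sort){
    auto start = std::chrono::steady_clock::now();
    for(auto& array : inputs){
        sort(array);
    }
    auto stop = std::chrono::steady_clock::now();
    // Sanity check, done outside of the timed region.
    for(const auto& array : inputs){
        assert(std::is_sorted(array.begin(), array.end()));
    }
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

<p>For instance, <code>bench_ms(make_inputs(25, 100000, 1000000), [](std::vector&lt;std::size_t&gt;& a){ std::sort(a.begin(), a.end()); })</code> would time <em>std::sort</em> on 25 arrays with m = 10n.</p>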
<p>The first test is made with very few duplicates (m = 10n).</p>
<div id="graph_0" style="width: 600px; height: 400px;"></div>
<p><input id="button_graph_0" type="button" value="Logarithmic scale">
<script type="text/javascript">function draw_graph_0(){var graph=new google.visualization.ColumnChart(document.getElementById('graph_0'));var data=google.visualization.arrayToDataTable([['x','std::sort','counting_sort','in_place_counting_sort','bin_sort','radix_sort'],['100000',171,182,105,945,89],['500000',993,2229,970,6435,461],['1000000',2175,4812,2046,14096,1068],['5000000',11791,27050,10202,81255,6148],]);var options={title:"m = 10n",animation:{duration:1200,easing:"in"},width:'600px',height:'400px',hAxis:{title:"n"},vAxis:{title:"ms",viewWindow:{min:0}}};graph.draw(data,options);var button=document.getElementById('button_graph_0');button.onclick=function(){if(options.vAxis.logScale){button.value="Logarithmic Scale";}else{button.value="Normal scale";}options.vAxis.logScale=!options.vAxis.logScale;graph.draw(data,options);};}</script>
</p>
<p>Radix Sort turns out to be the fastest in this case, <strong>twice as fast as <em>std::sort</em></strong>. In-place counting sort has almost the same performance as <em>std::sort</em>. The others perform worse.</p>
<p>The second test is made with few duplicates (m ~= n).</p>
<div id="graph_1" style="width: 600px; height: 400px;"></div>
<p><input id="button_graph_1" type="button" value="Logarithmic scale">
<script type="text/javascript">function draw_graph_1(){var graph=new google.visualization.ColumnChart(document.getElementById('graph_1'));var data=google.visualization.arrayToDataTable([['x','std::sort','counting_sort','in_place_counting_sort','bin_sort','radix_sort'],['100000',186,73,37,309,90],['500000',991,611,189,3126,455],['1000000',2235,2171,547,7978,1038],['5000000',12184,18470,4516,49056,5791],]);var options={title:"m ~= n",animation:{duration:1200,easing:"in"},width:'600px',height:'400px',hAxis:{title:"n"},vAxis:{title:"ms",viewWindow:{min:0}}};graph.draw(data,options);var button=document.getElementById('button_graph_1');button.onclick=function(){if(options.vAxis.logScale){button.value="Logarithmic Scale";}else{button.value="Normal scale";}options.vAxis.logScale=!options.vAxis.logScale;graph.draw(data,options);};}</script>
</p>
<p>The numbers are impressive. In-place <strong>counting sort is roughly 3 to 5 times faster than <em>std::sort</em></strong> and <strong>radix sort is twice as fast as <em>std::sort</em></strong>! Bin Sort does not perform very well, and counting sort, even if generally faster than <em>std::sort</em>, does not scale very well: at n = 5000000 it becomes slower.</p>
<p>Let's test with more duplicates (m = n / 2).</p>
<div id="graph_2" style="width: 600px; height: 400px;"></div>
<p><input id="button_graph_2" type="button" value="Logarithmic scale">
<script type="text/javascript">function draw_graph_2(){var graph=new google.visualization.ColumnChart(document.getElementById('graph_2'));var data=google.visualization.arrayToDataTable([['x','std::sort','counting_sort','in_place_counting_sort','bin_sort','radix_sort'],['100000',178,65,25,262,90],['500000',979,450,143,2332,461],['1000000',2171,1480,321,6240,1041],['5000000',11978,16205,3453,41709,5890],]);var options={title:"m = n / 2",animation:{duration:1200,easing:"in"},width:'600px',height:'400px',hAxis:{title:"n"},vAxis:{title:"ms",viewWindow:{min:0}}};graph.draw(data,options);var button=document.getElementById('button_graph_2');button.onclick=function(){if(options.vAxis.logScale){button.value="Logarithmic Scale";}else{button.value="Normal scale";}options.vAxis.logScale=!options.vAxis.logScale;graph.draw(data,options);};}</script>
</p>
<p>The performance of <em>std::sort</em> and radix sort does not change a lot, but the other sorts perform better. In-place counting sort is still the leader, by a wider margin.</p>
<p>Finally, with a lot of duplicates (m = n / 10).</p>
<div id="graph_3" style="width: 600px; height: 400px;"></div>
<p><input id="button_graph_3" type="button" value="Logarithmic scale">
<script type="text/javascript">function draw_graph_3(){var graph=new google.visualization.ColumnChart(document.getElementById('graph_3'));var data=google.visualization.arrayToDataTable([['x','std::sort','counting_sort','in_place_counting_sort','bin_sort','radix_sort'],['100000',161,46,12,144,74],['500000',918,322,76,1023,449],['1000000',2062,824,167,2721,1041],['5000000',10789,8534,1030,24026,5686],]);var options={title:"m = n / 10",animation:{duration:1200,easing:"in"},width:'600px',height:'400px',hAxis:{title:"n"},vAxis:{title:"ms",viewWindow:{min:0}}};graph.draw(data,options);var button=document.getElementById('button_graph_3');button.onclick=function(){if(options.vAxis.logScale){button.value="Logarithmic Scale";}else{button.value="Normal scale";}options.vAxis.logScale=!options.vAxis.logScale;graph.draw(data,options);};}</script>
</p>
<p>Again, the performance of <em>std::sort</em> and radix sort is stable, but in-place counting sort is now <strong>ten times faster than <em>std::sort</em></strong>!</p>
<h3>Conclusion</h3>
<p>To conclude, we have seen that these algorithms can outperform <em>std::sort</em> by a large factor (10 times for in-place counting sort when m << n). If you have to sort integers, you should consider these two cases:</p>
<ul>
<li>m > n or m is unknown: use radix sort, which is about twice as fast as <em>std::sort</em>.</li>
<li>m << n: use in-place counting sort, which can be much faster than <em>std::sort</em>.</li>
</ul>
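<p>The two cases above could be turned into a small selection helper. This is only a sketch: the enum, the function names and the exact cut-off (m <= n) are assumptions. The cut-off is merely consistent with the measurements above, where in-place counting sort is already ahead of radix sort at m ~= n and behind it at m = 10n.</p>

```cpp
#include <cstddef>

// Hypothetical names for the two recommended algorithms.
enum class IntegerSort { radix_sort, in_place_counting_sort };

// m is unknown: radix sort, whose cost does not depend on the maximum value.
constexpr IntegerSort choose_sort(std::size_t /*n*/){
    return IntegerSort::radix_sort;
}

// m is known: in-place counting sort as long as m does not exceed n
// (assumed cut-off), radix sort otherwise.
constexpr IntegerSort choose_sort(std::size_t n, std::size_t m){
    return m <= n ? IntegerSort::in_place_counting_sort : IntegerSort::radix_sort;
}
```

<p>For n = 1000000, this picks radix sort with m = 10000000 and in-place counting sort with m = 100000. The best cut-off on a given machine should be measured rather than taken from this sketch.</p>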
<p>I hope you found this article interesting. The implementation can be found on GitHub: https://github.com/wichtounet/articles/tree/master/src/linear_sorting</p>
<script type="text/javascript">function draw_visualization(){draw_graph_0();draw_graph_1();draw_graph_2();draw_graph_3();}google.setOnLoadCallback(draw_visualization);</script>
Algorithm, Benchmarks, C++, Performances
https://baptiste-wicht.com/posts/2012/11/integer-linear-time-sorting-algorithms.html
Wed, 07 Nov 2012 08:02:46 GMT
Algorithms books Reviews
https://baptiste-wicht.com/posts/2012/08/algorithms-books-reviews.html
Baptiste Wicht
<p>To be sure to be well prepared for an interview, I decided to read several <strong>algorithms books</strong>. I also picked books that would teach me more about data structures. Here are the books I read:</p>
<ol>
<li>Data Structures & Algorithm Analysis in C++, Third Edition, by Clifford A. Shaffer</li>
<li>Algorithms in a Nutshell, by George T. Heineman, Gary Pollice and Stanley Selkow</li>
<li>Algorithms, Fourth Edition, by Robert Sedgewick and Kevin Wayne</li>
<li>Introduction to Algorithms, by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein. I have to say that I have read most of it but not all of it, because some chapters were not relevant to me at the time. I will certainly read them later.</li>
</ol>
<p>As some of my comments are about the presentation of the books, it has to be noted that I read the first three books on my Kindle.</p>
<p>In this post, you will find my point of view about all these books.</p>
<h4>Data Structures & Algorithm Analysis in C++</h4>
<p>This book is really great. It contains a lot of data structures and algorithms. Each of them is very clearly presented. It is not hard to understand the data structures and the algorithms.</p>
<p>Each data structure is first presented as an ADT (Abstract Data Type) and then several possible implementations are presented. Each implementation is precisely defined and analyzed to find its sweet spots and worst cases. Other implementations are also mentioned, with enough references to know where to start with them.</p>
<p>I have found that some other books about algorithms spend too many words on a single topic. This is not the case with this book: each interesting point is clearly and succinctly explained.</p>
<p>About the presentation, the code is well presented and the content of the book is very well written. A good thing would have been to add a summary of the most important facts about each algorithm and data structure. If you want to know these facts, you have to read several pages (but the facts are always there).</p>
<p>The book contains a very good explanation of the complexity analysis of algorithms. It also contains a very interesting chapter about the limits of computation, where it covers the P, NP, NP-Complete and NP-Hard complexity classes.</p>
<p>This book contains a large number of exercises and projects that can be used to improve your algorithmic skills even further. Moreover, there are very good references at the end of each chapter if you want more documentation about a specific subject.</p>
<p>I had some difficulty reading it on my Kindle. Indeed, it's impossible to switch chapters directly with the Kindle button. If you want quick access to the next chapter, you have to use the table of contents.</p>
<h4>Algorithms in a Nutshell</h4>
<p>This book is much shorter than the previous one. Even if it could be a good book for beginners, I didn't like it a lot. The explanations are sometimes a bit messy and it could cover more data structures (even if I know that this is not the subject of the book). The analyses of the different algorithms are a bit short too. Even if it is to be expected for a book that short, you should know that it has no exercises.</p>
<p>However, this book also has several good points. Each algorithm is very well presented in a single panel. The complexity of each algorithm is given directly alongside its code. It helps to quickly find an algorithm and its main properties.</p>
<p>Another thing that I found good is that the authors included empirical benchmarks as well as complexity analysis. The chapters about path finding in AI and computational geometry were very interesting, especially because these topics are not widely covered in other books.</p>
<p>It also has very good references for each chapter.</p>
<p>This book was perfect to read on the Kindle: the navigation was very easy.</p>
<h4>Algorithms</h4>
<p>This is a good book, but it suffers from several drawbacks compared to the other books. Let's start with the good points: it covers a lot of data structures and algorithms, it has very good explanations about complexity classes and it has a lot of exercises. I also liked the chapter about string algorithms a lot, a subject that was lacking in the previous books.</p>
<p>Most of the time, the explanations are good, but sometimes I found them quite hard to understand. Moreover, some parts of the code are also hard to follow. The authors included Java runs of some of the programs. In my opinion, this is quite useless: empirical benchmarks could have been useful, but not single runs of a program. Some of the diagrams were also hard to read, but that's perhaps a consequence of the Kindle.</p>
<p>One thing that disappointed me a bit is that the authors don't use Big Oh notation. Even if we have enough information to easily get the Big Oh equivalent, I don't understand why a book about algorithms doesn't use this notation.</p>
<p>Just like the first book, there is no simple view that gathers all the information about a given algorithm. Another thing that bothered me is that the authors take time to describe an API around the algorithms and data structures, and to describe the Java API. Again, in my opinion only, this takes up too large a portion of the book.</p>
<p>Again, this book was perfect to read on the Kindle: the navigation was very easy.</p>
<h4>Introduction to Algorithms</h4>
<p>This book is by a wide margin the most complete I have read about algorithms and data structures. It has very complete explanations about complexity analysis: Big Oh, Big Theta, small Oh. For each data structure and algorithm, the complexity analysis is very detailed and very well explained. The pieces of code are written in very clear pseudocode.</p>
<p>As I said before, the complexity analyses are very complete and sometimes very complex. This can be either an advantage or a disadvantage, depending on what you expect from the book. For example, the analysis is made using several notations: Big Oh, Big Theta or even small Oh. Sometimes it is a bit hard to follow, but it provides a very good basis for complexity analysis in general.</p>
<p>The book was also the one with the best explanations about linear time sorting algorithms. In the other books, I found it difficult to understand sorts like counting sort or bucket sort, but in this book the explanations are very clear. It also includes multithreaded algorithm analysis, number-theoretic algorithms, polynomials and a very complete chapter about linear programming.</p>
<p>The book contains a huge number of exercises for each chapter and subchapter.</p>
<p>This book will not only help you find the best suited algorithm for a given problem, it will also help you understand how to write your own algorithm for a problem or how to analyze an existing solution in depth.</p>
<h4>Algorithms Book Wrap-up</h4>
<p>As I read all these algorithms books in order, it's possible that my review is a bit subjective when it comes to comparisons with the other books.</p>
<p>If you plan to work in C++ and need more knowledge in algorithms and C++, I advise you to read <strong>Data Structures & Algorithm Analysis in C++</strong>, which is really awesome. If you want a very deep knowledge of algorithm analysis and algorithms in general and have a good mathematical basis, you should really take a deep look at <strong>Introduction to Algorithms</strong>. If you want a short introduction to algorithms and don't care about the implementation language, you can read <strong>Algorithms in a Nutshell</strong>. <strong>Algorithms</strong> is like a master key: it will give you good starting knowledge about algorithm analysis and a broad range of algorithms and data structures.</p>
Algorithm, Books, C++, Conception, Java, Performances, Programming
https://baptiste-wicht.com/posts/2012/08/algorithms-books-reviews.html
Fri, 24 Aug 2012 06:52:04 GMT
Find closest pair of point with Plane Sweep Algorithm in O(n ln n)
https://baptiste-wicht.com/posts/2010/04/closest-pair-of-point-plane-sweep-algorithm.html
Baptiste Wicht
<div><p>Finding the closest pair of points in a given collection of points is a standard problem in computational geometry. In this article I'll explain an efficient algorithm using plane sweep, compare it to the naive implementation and discuss its complexity.</p>
<p class="more"><a href="https://baptiste-wicht.com/posts/2010/04/closest-pair-of-point-plane-sweep-algorithm.html">Read more…</a></p></div>
Algorithm, Benchmarks, Conception, Java, Performances
https://baptiste-wicht.com/posts/2010/04/closest-pair-of-point-plane-sweep-algorithm.html
Tue, 27 Apr 2010 14:08:10 GMT