ನಮಸ್ಕಾರ (Namaskāra)! This is not a post on fintech, or even technology for that matter. This is the story of a product of tenacity, selflessness, and passion; a product that will transcend and outlive most technology we know of. This is the story of a massive dictionary that will become the window to a language spoken by tens of millions of people for generations to come, a resource its author has donated to posterity. This is the story of Krishna, Alar, his Kannada-English dictionary, and its accidental discovery and open sourcing at an unlikely place, a stock brokerage, Zerodha. This post is also a personal note, something I have not attempted in a long time.

I have been running Olam, an English-Malayalam and Malayalam-Malayalam dictionary, since 2010. It was built out of the frustration of not having an easily accessible online Malayalam dictionary, and out of frustration at dictionary websites that insulted the reader’s intelligence with poor usability, terrible ad-ridden spamminess, and no respect for language. Olam’s website has stayed exactly the same for 10 years. It has an input box that responds to dictionary lookups in under 50 ms, exactly as it did in 2010. It is actively used by millions of Malayalam speakers.

The first version of the Olam corpus was seeded with unattributed word lists I scraped together from random parts of the web, and several thousand entries I entered myself. Since then, the English-Malayalam dictionary has been expanding slowly with crowdsourced entries. The entire Olam corpus is open source (licensed under ODbL), or open data, rather.

While the English-Malayalam corpus is crowdsourced, the Malayalam-Malayalam corpus (now known as the Datuk Corpus) was created out of the mammoth digitisation project the late “Datuk” K. Joseph undertook in the late 90s, when he single-handedly digitised an out-of-copyright Malayalam-Malayalam dictionary along with many other books and posted them online, at the expense of copious amounts of time out of his retirement. He was a Malayali settled in Malaysia, a prominent and active social worker and educator. The Malaysian government conferred the title “Datuk” upon him in recognition of his exemplary services in the country, and the title ended up being his nickname too. I do not know the origin of the dictionary Datuk digitised, but it is poignant to think that the original author’s work lives on after a century.

I discovered the RTF file Datuk had posted a decade prior on an inactive Yahoo groups page around the time I was working on Olam. Needless to say, I was stumped by the scope of this project, and immediately started working on integrating it into Olam. It took more than two years of on-and-off work to convert the text from the original ASCII input to Unicode, and to clean, structure, and correct close to 200,000 entries. The dataset was named Datuk Corpus, and was published on Olam in 2013. I wrote to the Swathanthra Malayalam Computing (SMC) mailing list announcing it, and we launched it with some fanfare at the SMC conference held in Thrissur, Kerala, that year. Datuk’s story was covered by the press, and his work was now open and available to everyone.

Shortly thereafter, I was connected to Datuk by an old friend of his I had met at the conference, and we spoke briefly on the phone. He had seen the news clip of the dictionary’s release, and was thrilled to know that his work was now accessible as he had originally intended. He found it amusing that a random stranger had somehow unearthed a relic he had lost to the annals of internet history. Life is absurd like that, shaped in infinite ways by tiny, random events.

Datuk passed away in January 2019. Your work’s utility will span generations. The data you created will proliferate and continue to be useful to humanity in ways we never imagined. I consider it a privilege to have been able to speak to you just that one time.

Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. I shudder to think of a world without Wikipedia. The open data movement shares strong parallels with the Free and Open Source Software (FOSS) movement. The gist is that certain knowledge should be freely available to everyone with no restrictions and with one goal: the collective advancement of humanity. I consider dictionaries to be at the top of that list. The stepping stone to language, the underpinning of civilisation. Dictionaries should be open, free, and easily accessible to everyone, everywhere.