Open Tamil Texts for Machine Processing January 18th, 2020

Submitted by tamiladmin on

Date:  Saturday, January 18th, 2020
Time: 9:30 am - 11:30 am (EST - local time); 8:00 pm - 10:00 pm (IST - India/Sri Lankan Time)

Location: The BRIDGE Boardroom: IC 111

University of Toronto Scarborough Campus (UTSC)
Instructional Centre (IC Building), ground floor
1095 Military Trail
Toronto, Ontario M1C 1A4

Hosted by:  Digital Tamil Studies project at UTSC Library.

High-quality open Tamil language texts are difficult to source and often require substantial cleaning and processing before they can be used for digital scholarship and application development. The Digital Tamil Studies project, based at the UTSC Library, has been developing partnerships to create better quality Tamil text data for machine processing. This subject is intertwined with many other activities, such as text analysis, natural language processing, the development of multilingual digital repositories, and the Digital Tamil Studies community writ large. Please join us for a roundtable with Tamil computing practitioners and users discussing projects and developments in this area.

Presentations:

  • Current State of Open Tamil Datasets - Ravi Annaswamy (Information Architect)
  • Python Libraries for Tamil Computing - Muthu Annamalai (Software Engineer)
  • Linked and Structured Tamil Data for Machine Learning - Saatviga Sudhahar (Machine Learning Scientist)
  • Tamil Computing Needs for Libraries - Natkeeran L. Kanthan (Software Developer)

Presentations will be followed by discussions. 

Contact:
Kirsta Stapelfeldt - kirsta.stapelfeldt@utoronto.ca
Natkeeran L. Kanthan - nat.ledchumykanthan@utoronto.ca

Event Notes

The in-person and virtual “Open Tamil Texts for Machine Processing” event was convened by the UTSC Library’s Digital Scholarship Unit as part of the Digital Tamil Studies project on Saturday, January 18th, 2020. The event was attended by twenty-five Tamil computing developers and language experts. It explored open Tamil data as the foundation for Tamil computing development, and how developers, scholars, the public, and institutions can each play a role in building open Tamil datasets.

There was a general consensus that the lack of open-access Tamil datasets is the main roadblock faced by Tamil developers. The first presentation, “Current State of Open Tamil Datasets”, given by Vallipuram Suganthan, explored this issue at length. The presentation listed 32 Tamil datasets. It highlighted that Indian federal and state institutions often prohibit the sharing and republishing of publicly funded data, making it unusable for most applications. Further, it was pointed out in discussion that the lack of a copyright statement, even for data that may be in the public domain, prevents some developers from using that data. Making data available in a format such as plain text or CSV through platforms such as Kaggle and GitHub was also identified as an important step to facilitate use.

The second presentation, “Tamil Natural Language Processing via Open-Tamil Python Library”, was given by Muthu Annamalai. This session explored how the Open-Tamil Python library and the tamilpesu.us web UI provide a toolchain for many common text processing functions. The library incorporates many open source Tamil computing works and can be extended by developers. It currently supports font conversion, stemming, spell checking, santhi checking, and Tamil text-to-speech, among various other functions. Some of these functions need further development to improve their quality; however, the library provides an excellent open source toolset for Tamil computing. The discussions briefly explored how open texts can be loaded via this library to facilitate exploration and development, especially textual analysis. Muthu followed up with his notes for the event, which can be accessed here: https://ezhillang.blog/2020/01/25/open-தமிழ்-திட்டம்-ஒரு-பார்வை/
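As an illustration of the kind of low-level text processing such a toolchain builds on, the following is a minimal sketch of splitting Tamil text into letters, where each base character is grouped with the vowel signs or pulli that follow it. It uses only the Python standard library and is not the Open-Tamil API; the function name split_letters is hypothetical, and the approach is a simplification that does not treat conjuncts such as க்ஷ as single letters.

```python
import unicodedata


def split_letters(text):
    """Group each base character with the combining marks that follow it.

    Simplified sketch: any character whose Unicode category starts with "M"
    (vowel signs, pulli) is attached to the preceding character.
    """
    letters = []
    for ch in unicodedata.normalize("NFC", text):
        if letters and unicodedata.category(ch).startswith("M"):
            letters[-1] += ch  # attach the mark to the previous letter
        else:
            letters.append(ch)  # start a new letter
    return letters


print(split_letters("தமிழ்"))  # ['த', 'மி', 'ழ்']
```

Counting letters this way, rather than counting code points, is usually the first step before tasks such as stemming or spell checking, since a single Tamil letter is often stored as two or more code points.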

The third presentation, “Linked and Structured Tamil Data for Machine Learning”, by Saatviga Sudhahar, provided an overview of Linked Data and how it can be used for machine learning. She described the technical setup needed for large-scale textual data analysis. She also shared her experience in helping coordinate Tamil Natural Language Processing (NLP) projects, especially the challenges with respect to data and to evaluating and benchmarking software.

The fourth presentation, “Tamil Computing Needs for Digital Libraries”, by Natkeeran Ledchumykanthan, outlined major technical needs for Tamil digital libraries, with a focus on accessibility. Foremost, digital collections and digital archive software need multilingual capability, including support for multilingual metadata and UI. To reach visually impaired users, Tamil screen readers and production-quality Tamil speech synthesis software are required. To reach hearing-impaired users, production-quality speech-to-text software is needed.

In discussion, attendees pointed out that the Tamil computing and digital scholarship community needs to find ways to engage the public in building open datasets. Mobile applications for crowdsourcing data, communication campaigns, and training are all needed, and some initiatives are already underway. Competitions and gamification may encourage wider participation. Tamil language experts noted that they would like more training so that they can use and contribute to these initiatives. Some concrete next steps follow:

Next Steps

Tamil Datasets Catalogue
Create a richly described catalogue for Tamil computing and scholarship applications based on this spreadsheet.
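A minimal sketch of what one catalogue record could look like, and how the catalogue could be exported as CSV, is shown below. The field names, the DatasetRecord class, and the placeholder entries are all hypothetical and are not taken from the spreadsheet; the licence field reflects the point raised in discussion that a missing copyright statement blocks reuse.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class DatasetRecord:
    """One hypothetical catalogue entry; field names are assumptions."""
    name: str
    description: str
    data_format: str  # e.g. "txt" or "csv"
    license: str      # empty string means no licence statement was found
    url: str


def to_csv(records):
    """Serialize catalogue records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(DatasetRecord)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buf.getvalue()


def missing_license(records):
    """Return the names of records that carry no licence statement."""
    return [r.name for r in records if not r.license.strip()]


# Placeholder entries for illustration only, not real datasets.
records = [
    DatasetRecord("Placeholder corpus A", "Example entry", "txt",
                  "CC0-1.0", "https://example.org/a"),
    DatasetRecord("Placeholder corpus B", "Example entry", "csv",
                  "", "https://example.org/b"),
]
print(to_csv(records))
print(missing_license(records))
```

Keeping the licence as an explicit field makes it easy to list the datasets that still need a rights statement before developers can use them.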

Create Machine Accessible Formats of Existing Data
Projects such as Free Tamil Books and Project Madurai have a significant amount of text data. However, the data is not in plain text format. Encourage those projects to release their resources in plain text formats on platforms such as Kaggle and GitHub.
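As a rough sketch of such a conversion, and assuming a source text is available as HTML (an assumption for illustration; the formats used by these projects are not stated here), the Python standard library can extract the visible text. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style content."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # end of a block element ends the line

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


print(html_to_text("<p>தமிழ் &amp; text</p><p>second paragraph</p>"))
```

The resulting string can be written to a UTF-8 .txt file. Real sources would likely need additional cleaning, for example removing navigation text or converting legacy font encodings to Unicode.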

Open Tamil Dictionary API
A structured dictionary with a public domain licence (CC0) and an API is critical for many Tamil computing applications. The Wikidata Lexicographical Data project is a promising platform. We need to populate Wikidata with public domain data, possibly including data from Tamil Wiktionary.

The University of Chicago has digitized many public domain dictionaries; the Digital Tamil Studies project should advocate for the application of open licences for these key works.

UTSC and the Tamil computing community, along with the University of Chicago (UoC), could collaborate on structuring that data and contributing it to a common platform such as Wikidata.

Mobile Platforms for Data Collection
There are plans to develop apps to collect open-access datasets. Digital Tamil Studies could collaborate in that initiative (e.g., by helping spread the word and by cataloguing or hosting the datasets).

Volunteer Training/Mobilization/Communication
Tamil language experts noted that many aspects of this domain are new to them. They would like to receive training and more information in many of these areas to help them with their study and research. Text analysis was noted as a possible event theme.

Event Recording

Audio Recording