The Association for Computational Linguistics (ACL) Test-of-Time Paper Award recognises up to four papers (two papers from 25 years earlier and two papers from a decade earlier) for their long-lasting impact in the area of Natural Language Processing and Computational Linguistics. Recently, ACL announced the winners of the 2022 Test-of-Time Paper Awards.
Machine Transliteration (1997)
Authors: Kevin Knight, Jonathan Graehl
The paper was presented at the 35th Annual Meeting of the Association for Computational Linguistics and the 8th Conference of the European Chapter of the Association for Computational Linguistics. It addresses the challenge of translating names and technical terms across languages with different alphabets and sound inventories. Such names are often transliterated, that is, replaced with approximate phonetic equivalents.
Pioneering work on back-transliteration
Essentially, transliteration is the process of transferring a word from the alphabet of one language to that of another. Unlike translation, it conveys only how the word is pronounced, by rendering it in a familiar alphabet.
The paper takes English-Japanese transliteration as its example. Japanese has a special phonetic alphabet called “katakana”, which is used primarily to write foreign names and loanwords, the paper notes.
The paper points out that transliteration is not easy to automate, and that back-transliteration (here, from katakana to English) is harder still. Katakana phrases are the largest source of text phrases that do not appear in bilingual dictionaries, the paper added. At the time, work in this area was quite limited (Yamron et al., 1994; Arbabi et al., 1994).
After initial experiments along these lines, the researchers built a generative model of the transliteration process. According to the researchers, it proceeds in the following stages:
- An English phrase is written.
- A translator pronounces it in English.
- The pronunciation is modified to fit the Japanese sound inventory.
- The sounds are converted into katakana.
- The katakana is written.
This divides the problem to be solved into five sub-problems.
Techniques already existed for coordinating solutions to such sub-problems and for using generative models in the reverse direction (relying on probabilities and Bayes’ Rule).
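To illustrate what using a generative model "in the reverse direction" can look like, here is a minimal sketch in Python. It collapses the intermediate stages into a single lookup table, and every word, romanised spelling and probability in it is an invented placeholder rather than data or code from the paper.

```python
# Toy noisy-channel back-transliteration. Every number and spelling below is
# an invented placeholder, not data from the paper.

# P(w): prior probability of each candidate English word (hypothetical).
PRIOR = {"golf": 0.6, "gulf": 0.3, "goal": 0.1}

# P(k | w): probability that English word w is rendered as the romanised
# katakana string k (hypothetical).
CHANNEL = {
    "golf": {"gorufu": 0.8, "garufu": 0.2},
    "gulf": {"garufu": 0.7, "gorufu": 0.3},
    "goal": {"gooru": 1.0},
}


def back_transliterate(observed):
    """Return (word, posterior) pairs, most probable first.

    Bayes' Rule: P(w | k) = P(w) * P(k | w) / P(k). P(k) is the same for
    every candidate, so it is recovered by normalising the joint scores.
    """
    joint = {w: PRIOR[w] * CHANNEL[w].get(observed, 0.0) for w in PRIOR}
    total = sum(joint.values())
    if total == 0.0:
        return []  # no candidate can generate the observed string
    ranked = [(w, p / total) for w, p in joint.items() if p > 0.0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(back_transliterate("gorufu"))
```

The forward model only describes how English becomes katakana; the prior over English words is what lets the decoder choose between candidates that sound alike.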
Three Generative, Lexicalised Models for Statistical Parsing (1997)
Author: Michael Collins
In this paper, the author proposed a new statistical parsing model: a generative model of lexicalised context-free grammar. The model was then extended to include a probabilistic treatment of both sub-categorisation and wh-movement. Results on Wall Street Journal text show that the parser performs at 88.1/87.5% constituent precision/recall, an average improvement of 2.3% over (Collins 96).
The paper introduced three new parsing models. As per the author, “Model 1 is essentially a generative version of the model described in (Collins 96).” In Model 2, the author extends the parser to make the complement/adjunct distinction by adding probabilities over subcategorisation frames for head-words. Model 3, derived from the analysis given in Generalized Phrase Structure Grammar (Gazdar et al. 95), gives a probabilistic treatment of wh-movement.
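For readers unfamiliar with the constituent precision/recall figures quoted above, the sketch below shows one simple way such scores can be computed from labelled brackets. The function name, the bracket format and the example trees are assumptions made for illustration. Standard evaluation tools apply further conventions, such as punctuation handling, that are not modelled here.

```python
def bracket_scores(gold, predicted):
    """Labelled constituent precision and recall.

    Each constituent is a (label, start, end) tuple. A predicted constituent
    counts as correct only if the same tuple appears in the gold parse.
    """
    gold_set, pred_set = set(gold), set(predicted)
    matched = len(gold_set & pred_set)
    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical brackets for a five-word sentence (not from the paper).
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
predicted = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5), ("NP", 4, 5)]

if __name__ == "__main__":
    p, r = bracket_scores(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Precision penalises spurious brackets in the parser output, while recall penalises gold brackets the parser missed, which is why the two numbers are reported as a pair.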
Open Language Learning for Information Extraction (2012)
OLLIE – a major improvement over existing Open IE systems
As per the paper, Open Information Extraction (Open IE) systems extract relational tuples from text without requiring a pre-specified vocabulary, by identifying relation phrases and their associated arguments in arbitrary sentences. But state-of-the-art Open IE systems, such as REVERB and WOE, suffer from two major weaknesses, adds the paper:
- They extract only relations that are mediated by verbs
- They ignore context
The authors said that OLLIE (Open Language Learning for Information Extraction) is an improved Open IE system that addresses both limitations. OLLIE achieves high yield by extracting relations mediated by nouns, adjectives and more, not just verbs. A context-analysis step increases precision by including contextual information from the sentence in the extractions. OLLIE obtains 2.7 times the area under the precision-yield curve (AUC) compared to REVERB and 1.9 times the AUC of WOEparse.
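The AUC comparison above can be made concrete with a small sketch. It assumes each system is summarised as a list of (yield, precision) points and integrates with the trapezoidal rule. The points below are invented for illustration and are not the paper's measurements, and the paper's exact integration procedure may differ.

```python
def area_under_curve(points):
    """Trapezoidal area under a precision-yield curve.

    `points` is a list of (yield, precision) pairs; they are sorted by yield
    before integrating.
    """
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical curves: yield = number of correct extractions so far.
system_a = [(0, 1.0), (100, 0.9), (200, 0.7)]
system_b = [(0, 1.0), (100, 0.6)]

if __name__ == "__main__":
    ratio = area_under_curve(system_a) / area_under_curve(system_b)
    print(f"system_a has {ratio:.2f} times the AUC of system_b")
```

A system can raise its AUC either by staying precise for longer or by producing more correct extractions overall, so the ratio rewards both precision and yield.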
Dedicated to the late Dr Stephen Soderland
While accepting the award, Mausam, one of the paper's authors, dedicated it to the late Dr Stephen Soderland. He added, “It was work that was being done at the University of Washington at that time – Open Information Extraction. It started in 2007 with the first system called TextRunner (Open IE 1.0). OLLIE is kind of the third generation of Open IE. It used dependency parse-based extraction, extracting information from nouns, etc. It could additionally extract attribution information.”
Midge: Generating Image Descriptions From Computer Vision Detections (2012)
Authors: Margaret Mitchell, Jesse Dodge, Amit Goyal, Kota Yamaguchi, Karl Stratos, Xufeng Han, Alyssa Mensch, Alex Berg, Tamara Berg, Hal Daumé III
“We did this work just as ‘language and vision’ research was blossoming” – Hal Daumé III
This paper was a leap forward in building a generation system that composes humanlike descriptions of images from computer vision detections. “By leveraging syntactically informed word co-occurrence statistics, the generator filters and constrains the noisy detections output from a vision system to generate syntactic trees that detail what the computer vision system sees.” The results indicated that the system outperformed the state-of-the-art systems of its time, generating some of the most natural image descriptions produced up to that point.
After ACL's announcement, Daumé said in a University of Maryland news update that the work was done at a time when ‘language and vision’ research was blossoming. He added that, given the advances in vision and language technology since the paper came out, it was humbling to see how far the field has evolved in the past decade.