NEWS & EVENTS
Members of our staff have a deep level of expertise on issues related to multilingual information processing. We've collected the published articles and conference presentations we have given on a wide variety of topics and made them available below.
Basis Products | Chinese Language Issues | Data Quality | Digital Forensics | E-Discovery | Entity Extraction | Middle Eastern Language Issues | Name Resolution | Search | Unicode
Basis Technology Products
Guided Tour of the Rosette Linguistics Platform
A presentation outlining the full suite of the Rosette Linguistics Platform (RLP) and its components. Learn its capabilities as well as how it's being used, how it interacts, and how it can be tuned and customized to fit your needs.
Ken Glidden's presentation at Basis Technology’s Government Users Conference on June 8, 2009.
Tutorial: Arabic Editor and GeoScope®
This tutorial offers hands-on training with Basis Technology’s Arabic Desktop Suite, an integrated collection of productivity-boosting applications designed for analysts, linguists, and translators.
Tina Lieu and Youssef Fayed: Hands-on Tutorial at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Tutorial: Transliteration Assistant & Knowledge Center
This tutorial teaches you to prepare reports using standardized transliterations, to automatically translate lists of names, and to exploit online reference materials.
Tina Lieu and Youssef Fayed: Hands-on Tutorial at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Basis Technology’s Just Right Solutions:
Bigger than a Component, Smaller than a Stovepipe
CTO Benson Margulies speaks about how Basis Technology is moving to solve larger pieces of the problems facing government users by delivering modules of functionality — entity extraction, entity translation, name matching, and geospatial fusion — either pre-assembled into desktop applications or as enterprise software.
Benson Margulies’ presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Multilingual Deep Web Search
This presentation introduces BrightPlanet’s product, technology, and unique placement in the ‘deep web’ space, and presents the close relationship with Basis Technology and its Rosette® Linguistics Platform.
Duncan Witte and Dirk Koechner’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Building Applications with the Rosette® Linguistics Platform
The presentation reviews the capabilities of RLP, the applications for which it can be used, and the techniques it employs. It also focuses on how RLP can be integrated and used in existing systems, and how it can be tuned for each system’s requirements.
Steve Cohen’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
What Language is That? Using the Rosette® Language Identifier
This presentation gives an overview of the Rosette Language Identifier (RLI) and the techniques RLI uses to automatically identify the language and encoding of a block of text. It also explains how language and encoding identification is an essential stage in the process of working with unstructured multilingual text.
Nobuo Otsuka’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Introduction to Basis Technology Transliteration Assistant
This presentation showcases Basis Technology’s Transliteration Assistant (XA), a Microsoft Word and Excel plug-in which enables translators to quickly and consistently produce accurate transliterations of Arabic names.
Melissa Lucius’ presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Introduction to Basis Technology’s Arabic Editor
This presentation provides a broad overview of Basis Technology’s Arabic Editor and Linguist’s Workbench, a powerful and flexible text editing and analysis system. Arabic Editor is best known for providing a simple method for entering and editing Arabic text using a standard “QWERTY” keyboard.
Mary Galvin’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Chinese Language Issues
Processing the Mosaic of Chinese Dialects
This presentation explores the taxonomy of modern Chinese and illustrates the difficulties of processing its varieties through case studies of a dialect, Wu Chinese (spoken in the Shanghai area), and a Mandarin variant, Sichuanese (as spoken in Chengdu, the capital of Sichuan province).
Benjamin Swanson's presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
The Web as a Corpus for Chinese Natural Language Processing
This presentation discusses how Basis Technology made the process work, the problems Basis overcame (or avoided), and how it all turned out, both as a problem of Chinese linguistics and as a challenge of downloading, filtering, and processing terabytes of raw web pages from the Internet.
John O’Neil, Ph.D.’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Chinese Language Analysis: Solving the Chinese Puzzle
This presentation surveys the problems associated with automatic processing of Chinese. It reviews the various Chinese character sets and encoding systems; input methods and transliteration; and the solutions offered by Basis Technology’s Chinese Language Analyzer and Named Entity Extractor.
Joe Ho’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Large Corpus Construction for Chinese Lexicon Development
The World Wide Web provides an important source of natural language data in many languages. However, it doesn't include annotation about linguistic structure, so it's necessary to use very large corpora to infer it. We developed a system for continuous, automatic acquisition of a Chinese lexicon. An up-to-date lexicon is needed for many applications, but Chinese is written without spaces between words, so determining word boundaries is the primary problem. We discuss our experience with using the Chinese Web for lexicon construction, focusing on both low-level details and problems we experienced during our initial proof-of-concept experiments, and on algorithmic issues.
Thomas Emerson’s presentation from the 29th Internationalization & Unicode Conference, San Francisco, CA, March 7 - 9, 2006.
Data Quality
Exploiting GeoNames in Practical Applications
This presentation explains how NGA’s data is presently exploited by the Arabic Desktop Suite and outlines future directions.
Tina Lieu’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
N-Gram vs. Morphological Analysis - Whitepaper
There are two common ways to segment words: N-Gram and Morphological Analysis. Learn the differences between the two by reading this short whitepaper.
Written by Steven Cohen, VP of Products at Basis Technology - June 29, 2006.
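As a rough illustration of the contrast the whitepaper draws, the sketch below segments a short Japanese string two ways. It is not taken from the whitepaper: the function names and the three-word lexicon are invented for this example, and the dictionary longest-match routine is only a simplified stand-in for real morphological analysis.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return the overlapping character n-grams of text, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Toy lexicon, invented for this illustration only.
TOY_LEXICON = {"東京", "東京都", "京都"}


def longest_match_segment(text: str, lexicon: set[str]) -> list[str]:
    """Greedy longest-match segmentation against a lexicon.

    A simplified stand-in for morphological analysis: at each position,
    take the longest lexicon entry that matches, falling back to a single
    character when nothing matches.
    """
    tokens = []
    max_len = max(map(len, lexicon), default=1)
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


bigrams = char_ngrams("東京都", 2)
words = longest_match_segment("東京都", TOY_LEXICON)
print(bigrams)
print(words)
```

In this toy case the bigram index contains 京都 (Kyoto) even though the text is 東京都 (Tokyo Metropolis), so a search for Kyoto would match it; the lexicon-driven segmentation keeps 東京都 as one word and avoids that false hit.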
Designing Large-Scale Multilingual Systems
Foreign language documents pose challenges for the entire document-management pipeline: identifying the format, extracting text, indexing, search, retrieval, and display. While commonly used technologies work much better than they did a few years ago, there are still many ways to build systems that fail to handle foreign text. This presentation provides an overview of the problem and points out some of the more important issues and traps.
Benson Margulies’ presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Digital Forensics
Extracting Text from Arabic PDF
This presentation presents a solution to the problem of extracting Arabic text from PDFs through modifications of the open source software PDFBox (www.pdfbox.org). It starts by looking at the basics of PDF structure, then looks at how Arabic is stored in PDF and how to get it out using a custom-modified PDFBox.
Brian Carrier's presentation at Basis Technology’s Government Users Conference on June 9, 2009.
Drive Analysis in a Flash
This presentation describes a new media analysis and exploitation technique based on the statistical sampling of drive sectors. Using this approach it is possible to make highly accurate statements about the contents of a 1TB disk with less than 10 seconds of analysis, and with a margin of error of less than 1%. Making these statements requires a number of new advances in recognition technology and fast database lookups, which are reviewed.
Simson Garfinkel's presentation at Basis Technology’s Government Users Conference on June 9, 2009.
Digital Forensics
R&D Initiatives at Basis Technology
As criminal and counter-terror investigations cross national and language boundaries, the challenges include not only finding the right documents and evidence among terabytes of data spread across thousands of hard drives, but also searching for keywords or names in different languages, and then interpreting search results in languages unfamiliar to the investigator.
This presentation reviews Basis Technology's digital forensics initiatives as they connect to the broader text analytic and name matching solutions.
Brian Carrier's presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Multilingual Keyword Search Comes to Digital Forensics
Searching hard drives containing text in foreign languages presents technical complexities of which most investigators are unaware: multiple encoding schemes, orthographic variations, spelling variations, and online “chat” dialects. This presentation introduces the Odyssey Digital Forensics system, which has been specifically designed to address these linguistic issues.
Brian Carrier’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Cross Drive Analysis: A New Approach to Media Exploitation
This presentation describes correlation techniques for the analysis of large volumes of digital data, and presents results from ten years of research on real-world drives.
Simson Garfinkel’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Crash Course in Digital Forensics
This presentation provides an overview of key topics in digital forensics, including the investigation process; analysis techniques and tools; and some examples. It also provides information on new forensics products being developed at Basis Technology and how linguistic analysis techniques will be incorporated into these products.
Brian Carrier’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
E-Discovery
Multi-language E-Discovery:
Three Critical Steps for Litigating in a Global Economy - Whitepaper
Managing the legal risk associated with the ever-expanding global economy is a challenge facing many companies. When litigation spans national boundaries, lawyers can be flooded with thousands of documents in languages other than English – all of it potential evidence that needs to be evaluated. Companies need the ability to identify, process and review multilingual documents for use in a courtroom – discovery documents. This whitepaper examines how global companies are using multi-language text processing and entity extraction as part of their next generation of e-discovery solutions.
May 2009
Entity Extraction
Demystifying Entity Extraction Quality
This presentation surveys the types of measurements used for entity extraction quality, and discusses techniques to better extract the data you're looking for when general language models don't fit your needs.
Charlotte Shabarekh’s presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Middle Eastern Language Issues
The World of Arabic Nicknames
In Arab culture, the number of nicknames for a person may seem endless. You often see them in chat, emails, or in oral communication. Dealing with multiple nicknames is a tricky problem for fields such as compliance, intelligence gathering and name resolution, since they could be used as aliases. This presentation describes different types of Arabic nicknames and how they are used.
Bushra Zawaydeh’s presentation at Basis Technology’s Government Users Conference on June 9, 2009.
One Language, Many Dialects: An Analysis of Arabic Dialects
This presentation discusses the similarities of many linguistic structures that define an Arabic dialect as well as the differences that draw non-geographical boundaries, and then shows how this affects Arabic search.
Zina Saadi’s presentation at Basis Technology’s Government Users Conference on June 9, 2009.
The Names of Afghanistan – Understanding Pushto and Dari Names
This presentation introduces naming practices in Afghanistan, following a primer on Pushto and Dari, the two major languages spoken in Afghanistan. It explores the linguistic attributes of Pushto and Dari names such as their influence by Arabic names, spelling variations, and morphology.
Bushra Zawaydeh's presentation at Basis Technology’s Government Users Conference on June 9, 2009.
You say “Jamāl”; he writes “Djamel”: Influences on Western Transliteration of Arabic Names
This presentation reviews examples of names influenced by formal languages and languages spoken in the region, as well as how these languages influence the orthography of the names in the Latin alphabet.
Zina Saadi’s presentation at Basis Technology’s Government Users Conference on June 8, 2009.
Next Generation of Arabic Search: Linguistically Intelligent Retrieval
This presentation demonstrates how a search engine with knowledge of the linguistic components of Arabic – the roots, lemmas and stems – can greatly boost the relevancy of search results.
Zina Saadi’s presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Next Generations of Arabic Search Technologies (in Arabic)
The rapid growth of Arabic content on the Internet has created the need for a new generation of text search with advanced techniques for handling the complexities of the Arabic language. This presentation shows how a search engine can use the linguistic components of Arabic (roots, stems, and lemmas) to greatly improve the relevance of search results.
A Linguistic Profile of the Persian Language and Dialects
This presentation is a brief history of the Persian language, its speakers, and its dialects. It compares Persian to other Arabic script languages such as Arabic, Pashto, and Urdu. It also delves into linguistic aspects of the language that are important to natural language processing and analysis applications, such as orthography, typography rules, phonology, and spelling variants.
Bushra Zawaydeh’s presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
A Profile of Arabic Script Languages
This presentation explores the history of the script in various Arabic script languages, the structure and characteristics of the Arabic alphabet, the alphabet used, the phonological structure, the borrowings, and the differences between Arabic and these languages.
Bushra Zawaydeh’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Arabic, Farsi and Urdu Text Normalization for Natural Language Processing
This presentation suggests a multi-level normalization for handling various Arabic script orthographic variations that appear in current news corpora.
Zina Saadi’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Decoding Arabic Chat
This presentation decodes the representation of Arabic sounds in the Romanized shorthand commonly used in chatrooms and blogs by presenting findings from field analyses of Egyptian, Gulf, Iraqi, and Levantine online dialects.
Bushra Zawaydeh’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
What’s in a Persian Name?
This presentation begins with the basics of Persian phonology and name morphology, and delves into the rich influences of other languages; cultural naming preferences (such as the decline of Arabic-based names after the fall of the Shah in Iran); historical roots; and regional customs.
Zina Saadi’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Orthographic Variations in Arabic Corpora
This presentation discusses the different kinds of Arabic orthographic issues that Basis Technology’s Arabic linguists have encountered and handled while building various software solutions for Arabic text analysis.
Bushra Zawaydeh’s presentation at Basis Technology's Government Users Conference in Washington, D.C. on June 14, 2006.
Behind the Name: Etymology of Arabic Names
This presentation gives some samples of various linguistic rules that contributed to the evolution of certain famous Arabic names. It samples different types of names as well as the influence of various foreign languages; regional and social impacts; and language evolution.
Zina Saadi’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
Tailoring UAX #29 Word Breaking for Arabic Text
Thomas Emerson's presentation at the 28th Internationalization & Unicode Conference in Orlando, FL on Sept. 8, 2005.
Name Resolution
New Developments in Name Analysis
When searching documents or analyzing text, often the most critical pieces of information are the names of people, places, and organizations. But how can you be sure that one name is the same as another, especially if it's written in a different language or appears as some variant form? What if you want to translate that name into a language you can recognize and process? This presentation gives an overview of name relationships and demonstrates the use of Basis Technology's products to navigate those relationships.
David Murgatroyd’s presentation at Basis Technology’s Government Users Conference on June 8, 2009.
Making Your Name Search CrossLingual
Finding the few emails among thousands that mention a specific person or concept may provide a needed missing link, but what if the emails are in a language you don’t speak? This language barrier can be bridged by making a search system cross-lingual. Doing so involves trading off properties like implementation ease, accuracy, and speed. This presentation will explore some specific options to tackle these trade-offs as well as other challenges of enabling cross-lingual search.
David Murgatroyd’s presentation at Basis Technology’s Government Users Conference on June 8, 2009.
Global Anti-Money Laundering Compliance: Challenges and Solutions
This presentation gives an overview of how a financial organization can comply with international sanctions programs when its customer names are in one language and the sanctions list is in another. These challenges are highlighted along with practical solutions for handling customer names in multiple languages, including Basis Technology's Rosette Name Indexer and Rosette Name Translator.
Steve Kearns’ presentation at Basis Technology’s Government Users Conference on June 9, 2009.
Everything You've Ever Wanted to do With Names
This presentation will explore challenges of multilingual name resolution, retrieval, and translation. It will also demonstrate Basis Technology’s products which enable rapid identification of names in multiple languages and automatic, high-accuracy translation of those names into English.
David Murgatroyd’s presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Linguistic Considerations of Identity Resolution
This presentation will consider metrics and data for evaluating identity resolution and retrieval systems. It also explores the linguistic challenges these systems face.
David Murgatroyd’s presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Building Applications with Rosette Name Indexer and Rosette Name Translator
This presentation will demonstrate how to rapidly construct an application which extracts names from foreign language documents, indexes those names, and automatically generates a high-quality translation into English according to the applicable agency transliteration standard. Real-world examples are presented in Arabic, Chinese, Korean, Pashto, Farsi (Persian), and Russian, for a total of six scripts and nine languages.
Benson Margulies' presentation at Basis Technology’s Government Users Conference in College Park, MD on May 20, 2008.
Different Script, Same Name: Tools for Matching and Translation
This presentation explains how to build multilingual name search and translation capabilities into your application by leveraging innovative products.
David Murgatroyd’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Search
Beyond Keyword Search – by Susan Feldman, IDC
Day Two Keynote Address at Basis Technology’s Government Users Conference on June 9, 2009.
Building a Multilingual Search Engine with Apache Lucene
Read how you can build a global-ready search server with Apache Lucene or Solr and the Rosette Linguistics Platform.
February 2009
Lucene and Solr for the Rest of the World
Lucene is a popular open-source search engine library, used by a variety of commercial and non-commercial web sites. However, its built-in support for non-English languages is very limited, creating a significant barrier to sophisticated processing of data in certain languages. The Rosette Linguistics Platform (RLP) helps overcome this barrier for a number of linguistically challenging languages such as Japanese and Arabic. This presentation explores how RLP integrates with Lucene and what benefits it brings.
Teruhiko Kurosaka's presentation at Basis Technology’s Government Users Conference on June 8, 2009.
Adding Linguistics to a Lucene-based Application
This presentation surveys the challenges and solutions to integrating complex linguistics into this popular open-source application.
Chris Milner, Ph.D., and Steve Cohen’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Unicode
Unicode 5.0 Essentials
This presentation begins with a look at how Unicode, established in 1991, has changed the way computers process text, with particular emphasis on Arabic, Chinese, Japanese, and Korean. For the non-programmer, this presentation briefly presents foundational concepts of encodings, characters, glyphs, code points, and the design principles behind Unicode.
Tina Lieu’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 7, 2007.
Hewlett Packard Breaks the Printer Barrier of Global Operations:
Basis Technology reviewed HP’s International Print Solution. Hewlett-Packard introduced technology to help companies overcome a key barrier to global operations — how to print documents correctly everywhere despite differences in language and script. Read our review.
Written by: Benson Margulies, CTO, Basis Technology
Understanding Unicode 5.0
This presentation provides a gentle introduction to the basic concepts of the Unicode 5.0 standard, including characters, encodings, transcoding, byte ordering, and the common UTF-8 and UTF-16 transformation formats. Also covered is practical information about support for Unicode in popular operating systems, computer languages, and protocols.
Ken Glidden’s presentation at Basis Technology’s Government Users Conference in Washington, D.C. on June 14, 2006.
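For readers new to the concepts named in this abstract, the following minimal sketch shows one character under UTF-8 and under UTF-16 in both byte orders, followed by a transcoding step. It is not drawn from the presentation; the `describe` helper is an illustrative name, and the codec names are those of Python's standard library.

```python
def describe(ch: str) -> dict[str, str]:
    """Show a single character's code point and its encoded bytes (hex)."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": ch.encode("utf-8").hex(" "),
        "utf-16-be": ch.encode("utf-16-be").hex(" "),  # big-endian byte order
        "utf-16-le": ch.encode("utf-16-le").hex(" "),  # little-endian byte order
    }


# Same characters, different byte sequences depending on the encoding.
for ch in ("中", "ع"):
    print(ch, describe(ch))

# Transcoding: decode bytes from one encoding, re-encode in another.
utf16_bytes = "中".encode("utf-16-be")
utf8_bytes = utf16_bytes.decode("utf-16-be").encode("utf-8")
print(utf8_bytes.hex(" "))
```

The code point stays the same in every case; only the byte representation changes, which is why a consumer must know (or detect) the encoding and byte order before interpreting the data.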
Big Dots, Little Dots, and Circled Dots: How Unicode can help (and hurt) the process of converting documents to information.
Basis Technology CTO Benson Margulies’ keynote address from the 25th International Unicode Conference, Washington, D.C., March 2004.







