A Guide to the Filipino Language Corpus – UP Monolinggwal na Diksyunaryong Filipino

The design of the UP Filipino Language Corpus (UP-FLC) has two primary objectives. First, it serves as a guide to the collection of data in the form of texts which will constitute the proposed corpus of the Filipino language. At present, there is no established system for collecting and compiling texts in the National Language for the purpose of language research. Researchers usually collect data using their own methods and preferences, according to their own needs. The collected data is often unusable for other research, either because the selection criteria are inflexible or because the purpose of the original research renders the data inapplicable to other objectives.

Second, the UP-FLC will serve as a model of a Filipino vocabulary resource that will complement the building of a dictionary for the national language. One of the more recent methods in lexicography is the use of large corpora in making general descriptive dictionaries, monolingual and bilingual alike. Compared with traditional methods, a corpus makes it easier to describe meanings and provide examples in dictionary entries, and grounds both in actual usage. Making a dictionary for the National Language on this basis is realistic and practical, because Filipino serves as the national lingua franca and its usage and development therefore rest on its continued and repeated use in communication among Filipinos.

Categories in the corpus:

The codes on the right indicate the type of text collected, and the corresponding descriptions are on the left. The design is largely based on the International Corpus of English (ICE), which was started by Sidney Greenbaum (Nelson 1996), and was modified to suit the collection of data in Filipino:

Description	Code
Written Texts (40%)	W
Unpublished	W1
Academic writings: Professional writings; Student essays; Examination Scripts (Essays); Blogs	W1A
Correspondence: Letters, Memo	W1B
Published	W2
Academic works: Humanities; Social Sciences; Science; Technology	W2A
Non-academic Writing: Featured articles	W2B
News Reporting: News (e.g. showbiz, sports)	W2C
Instructional Writing: Manual; Instructions; Regulations; Pamphlets; Tech/Voc	W2D
Persuasive Writing: Press Editorials	W2E
Creative Writing: Novels & Stories; Creative Essays	W2F
Spoken Texts (60%)	S
Dialogue	S1
Private: Direct conversations; Video Call, Skype	S1A
Public: Class Discussions; Broadcast discussions; Broadcast interviews; Political speeches; Conversations in public arenas	S1B
Monologue	S2
Unscripted: Spontaneous Commentaries; Unscripted speeches; Speeches in public demonstrations	S2A
Scripted: Broadcast News; Broadcast Talks; Non-broadcast Talks	S2B

General explanation of the categories

The UP-FLC is divided into two major categories: written texts and spoken texts. “W” indicates texts from written sources, while “S” indicates spoken texts. Further divisions within each category are marked by Hindu-Arabic numerals (1, 2, 3, etc.), followed by capital letters (A, B, C, etc.) as needed.
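
The coding scheme can be read mechanically, which may help when labelling collected texts. The following Python sketch is an illustration only and is not part of the UP-FLC design: the function name parse_code, the labels "written" and "spoken", and the decision to reject a letter that is not preceded by a numeral are all assumptions made for the example.

```python
import re

# Hypothetical labels for the two major categories described above.
MEDIUM = {"W": "written", "S": "spoken"}

# A code is a medium letter, then optionally a numeral, then optionally
# a capital letter (for example W, W2, or W2C).
CODE_PATTERN = re.compile(r"([WS])(\d+)?([A-Z])?")


def parse_code(code):
    """Split a category code such as 'W2C' into its three levels."""
    match = CODE_PATTERN.fullmatch(code)
    # Assumption: a letter suffix without a numeral (e.g. 'WA') is invalid.
    if match is None or (match.group(3) and not match.group(2)):
        raise ValueError(f"not a valid category code: {code!r}")
    medium, division, subcategory = match.groups()
    return {
        "medium": MEDIUM[medium],
        "division": int(division) if division else None,
        "subcategory": subcategory,
    }
```

Under these assumptions, parse_code("W2C") yields a written text in division 2, subcategory C, while parse_code("S") yields only the spoken medium with no further division.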

The definition of “text” here is based on Atkins, Clear and Ostler (1991), which is used for corpus-building purposes. Aside from the usual understanding of the word “text” to mean a written work, transcriptions of spoken language are also included. Unlike earlier language corpora that are based solely on written texts, the UP-FLC gives importance to spoken language use (see Nelson 2006 for a good discussion of this point) and therefore allots a larger share of the corpus to spoken texts. For the first stage of the UP-FLC, 60% of the collected texts were from spoken sources while 40% were from written works. It is expected, however, that these percentages will continue to be adjusted in keeping with changes in and the development of the language.
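
As a worked illustration of the 60/40 split, the short Python sketch below divides a planned number of texts between spoken and written sources. The function name target_counts and the total of 500 texts in the usage note are hypothetical; the guide does not state the size of the corpus, and counting by number of texts rather than by number of words is also an assumption.

```python
# Shares reported for the first stage of the UP-FLC, in percent.
SHARES = {"S": 60, "W": 40}


def target_counts(total_texts):
    """Split a planned number of texts between spoken and written sources."""
    # Integer arithmetic avoids floating-point rounding surprises.
    spoken = total_texts * SHARES["S"] // 100
    # Give the remainder to written texts so the counts sum to the total.
    return {"S": spoken, "W": total_texts - spoken}
```

For a hypothetical total of 500 texts, target_counts(500) gives 300 spoken and 200 written texts. If the percentages are later revised, only the SHARES values would need to change.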