A Guide to the Filipino Language Corpus

The design of the UP Filipino Language Corpus (UP-FLC) has two primary objectives. First, it serves as a guide to the collection of the texts that will constitute the proposed corpus of the Filipino language. At present, there is no established system for collecting and compiling texts of the National Language for the purpose of language research. Researchers usually follow their own methods and preferences, collecting data according to their own needs. The collected data is often unusable for other research, either because the selection criteria are inflexible or because the purpose of the original research makes the data inapplicable to other objectives.

Second, the UP-FLC will serve as a model of a Filipino vocabulary resource that will complement the building of a dictionary for the national language. One of the more recent methods in lexicography is the use of large corpora to compile general descriptive dictionaries, monolingual and bilingual alike. Unlike in traditional methods, describing meanings and providing examples in dictionary entries becomes easier with corpus data and is based on actual usage. Making a dictionary for the National Language in this way is therefore realistic and practical, because Filipino serves as the national lingua franca and its usage and development rest on its continued and repeated use in communication among Filipinos.

Categories in the corpus:

The codes on the right indicate the type of text collected, with the corresponding descriptions on the left. The design is largely based on the International Corpus of English (ICE), which was started by Sidney Greenbaum (Nelson 1996), and was modified to suit the collection of data in Filipino:

Description                             Code
Written Texts (40%)                     W
  Unpublished                           W1
    Academic writings                   W1A
      Professional writings
      Student essays
      Examination Scripts (Essays)
      Blogs
    Correspondence                      W1B
      Letters, Memo
  Published                             W2
    Academic works                      W2A
      Humanities
      Social Sciences
      Science
      Technology
    Non-academic Writing                W2B
      Featured articles
    News Reporting                      W2C
      News (e.g. showbiz, sports)
    Instructional Writing               W2D
      Manual Instructions
      Regulations
      Pamphlets
      Tech/Voc
    Persuasive Writing                  W2E
      Press Editorials
    Creative Writing                    W2F
      Novels & Stories
      Creative Essays
Spoken Texts (60%)                      S
  Dialogue                              S1
    Private                             S1A
      Direct conversations
      Video Call, Skype
    Public                              S1B
      Class Discussions
      Broadcast discussions
      Broadcast interviews
      Political speeches
      Conversations in public arenas
  Monologue                             S2
    Unscripted                          S2A
      Spontaneous Commentaries
      Unscripted speeches
      Speeches in public demonstrations
    Scripted                            S2B
      Broadcast News
      Broadcast Talks
      Non-broadcast Talks

General explanation of the categories

The UP-FLC is divided into two major categories: written and spoken texts. The code “W” indicates text from written sources, while “S” is used for spoken texts. Further divisions within the categories are marked by Hindu-Arabic numerals (1, 2, 3, etc.), followed by capital letters (A, B, C, etc.) as needed.
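
The coding scheme described above can be illustrated with a short Python sketch. It is not part of the UP-FLC itself: the function name parse_code and the field names it returns are hypothetical, and the pattern assumes that a capital-letter subdivision only appears after a numeral, as in the category listing.

```python
import re

# Assumed shape of a category code: a medium letter (W or S), optionally
# followed by a numeral division, optionally followed by one capital letter.
CODE_PATTERN = re.compile(
    r"(?P<medium>[WS])(?:(?P<division>[0-9]+)(?P<subdivision>[A-Z])?)?"
)

MEDIUM_NAMES = {"W": "written", "S": "spoken"}


def parse_code(code):
    """Split a category code such as 'W2A' into its parts (hypothetical helper)."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a valid category code: {code!r}")
    division = match.group("division")
    return {
        "medium": MEDIUM_NAMES[match.group("medium")],
        "division": int(division) if division is not None else None,
        "subdivision": match.group("subdivision"),
    }
```

For example, parse_code("W2A") gives the medium "written", division 2 and subdivision "A", while parse_code("S") gives only the medium "spoken", with the other two fields left empty.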

The definition of “text” here follows Atkins, Clear and Ostler (1991), whose definition is used for corpus-building purposes. Beyond the usual sense of a written work, “text” here also includes transcriptions of spoken language use. Unlike other earlier language corpora that are based solely on written texts, the UP-FLC puts importance on spoken language use (see Nelson 2006 for a good discussion of this point) and therefore allots a larger share of the corpus to spoken texts. For the first stage of the UP-FLC, 60% of the collected texts were from spoken sources, while 40% were from written works. It is expected, however, that the percentages will continue to be adjusted in keeping with changes and developments in the language.
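
The spoken/written proportions can be checked against a collection with a few lines of Python. This is only a sketch under stated assumptions: each text is assumed to carry one category code, the share is counted per text (the guide does not say whether the percentages refer to texts or to words), and the function name medium_shares and the sample list are hypothetical.

```python
from collections import Counter


def medium_shares(codes):
    """Return the percentage of texts per medium, keyed by 'W' or 'S'."""
    # The first character of a category code identifies the medium.
    counts = Counter(code[0] for code in codes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no category codes given")
    return {medium: 100 * n / total for medium, n in counts.items()}


# Hypothetical sample: three spoken texts and two written texts.
sample = ["S1A", "S1B", "S2A", "W1A", "W2C"]
shares = medium_shares(sample)
```

With this sample, shares maps "S" to 60.0 and "W" to 40.0, which matches the first-stage split described above.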