A Guide to the Filipino Language Corpus
The design of the UP Filipino Language Corpus (UP-FLC) has two primary objectives. First, it serves as a guide for collecting the texts that will constitute the proposed corpus of the Filipino language. At present, there is no established system for collecting and compiling texts in the National Language for the purpose of language research. Researchers usually collect data using their own methods and preferences, according to their own needs. The collected data is often unusable for other research, either because the collection criteria are inflexible or because the purpose of the original study renders the data inapplicable to other objectives.
Second, the UP-FLC will serve as a model Filipino vocabulary resource that will complement the building of a dictionary for the national language. One of the latest methods in lexicography is the use of large corpora to make general descriptive dictionaries, monolingual and bilingual alike. Compared with traditional methods, corpus data makes it easier to describe meanings and provide examples in dictionary entries, and it grounds those descriptions in actual usage. A corpus-based dictionary of the National Language is therefore realistic and practical: Filipino serves as the national lingua franca, so its usage and development rest on its continued and repeated use in communication among Filipinos.
Categories in the corpus:
The codes on the right indicate the type of text collected; the corresponding descriptions are on the left. The design is largely based on the International Corpus of English (ICE), which was started by Sidney Greenbaum (Nelson 1996), and was modified to make it applicable to the collection of data in Filipino:
Description | Code |
---|---|
Written Texts (40%) | W |
Unpublished | W1 |
Academic writings: Professional writings; Student essays; Examination Scripts (Essays); Blogs | W1A |
Correspondence: Letters, Memo | W1B |
Published | W2 |
Academic works: Humanities; Social Sciences; Science; Technology | W2A |
Non-academic Writing: Featured articles | W2B |
News Reporting: News (e.g. showbiz, sports) | W2C |
Instructional Writing: Manual; Instructions; Regulations; Pamphlets; Tech/Voc | W2D |
Persuasive Writing: Press Editorials | W2E |
Creative Writing: Novels & Stories; Creative Essays | W2F |
Spoken Texts (60%) | S |
Dialogue | S1 |
Private: Direct conversations; Video Call, Skype | S1A |
Public: Class Discussions; Broadcast discussions; Broadcast interviews; Political speeches; Conversations in public arenas | S1B |
Monologue | S2 |
Unscripted: Spontaneous Commentaries; Unscripted speeches; Speeches in public demonstrations | S2A |
Scripted: Broadcast News; Broadcast Talks; Non-broadcast Talks | S2B |
General explanation of the categories
The UP-FLC is divided into two major categories: written and spoken texts. “W” indicates text from written sources, while “S” is used for spoken texts. Further divisions within the categories are marked by Hindu-Arabic numerals (1, 2, 3, etc.), followed by capital letters (A, B, C, etc.) as needed. For example, W2A denotes academic works (A) among the published (2) written texts (W).
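The coding scheme described above can be read mechanically. The following is a minimal Python sketch, not part of the UP-FLC itself: the names `CategoryCode` and `parse_code` are hypothetical, and the pattern assumes a single-digit numeral and a single capital letter, which matches the codes listed in the table.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape of a code: a mode letter (W or S), an optional
# single-digit division numeral, and an optional subcategory letter.
CODE_PATTERN = re.compile(r"^([WS])(?:([0-9])([A-Z])?)?$")

MODES = {"W": "written", "S": "spoken"}


class CategoryCode(NamedTuple):
    """Hypothetical container for the parts of a category code."""
    mode: str                    # "written" or "spoken"
    division: Optional[int]      # e.g. 1 or 2; None for a bare "W" or "S"
    subcategory: Optional[str]   # e.g. "A"; None if absent


def parse_code(code: str) -> CategoryCode:
    """Split a code such as 'W2A' into mode, division, and subcategory."""
    match = CODE_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a valid category code: {code!r}")
    letter, numeral, sub = match.groups()
    division = int(numeral) if numeral else None
    return CategoryCode(MODES[letter], division, sub)
```

Under these assumptions, `parse_code("W2A")` yields the mode `"written"`, division `2`, and subcategory `"A"`, while a bare `"S"` yields the mode `"spoken"` with no division or subcategory.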
The definition of “text” used here follows Atkins, Clear and Ostler (1991), who define it for corpus-building purposes. Aside from the usual understanding of “text” as a written work, it also includes transcriptions of spoken language. Unlike earlier language corpora that are based solely on written texts, the UP-FLC gives importance to spoken language use (see Nelson 2006 for a good discussion of this point) and has therefore allotted a larger share to spoken texts. For the first stage of the UP-FLC, 60% of the collected texts were from spoken sources, while 40% were from written works. It is expected, however, that these percentages will continue to be adjusted in keeping with changes in and the development of the language.
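As a rough illustration of the 60/40 split, the sketch below computes the share of spoken and written texts in a collection from the first letter of each category code. It is an assumption that shares are counted per text rather than per word, since the guide does not say which; the names `TARGET_SHARE` and `mode_shares` are hypothetical.

```python
from collections import Counter

# Stage-one proportions stated in the guide: 60% spoken, 40% written.
TARGET_SHARE = {"S": 0.60, "W": 0.40}


def mode_shares(codes):
    """Return the share of texts per mode, keyed by 'S' and 'W'.

    Assumption: each code counts as one text; the guide does not
    specify whether percentages are by text count or word count.
    """
    counts = Counter(code[0].upper() for code in codes)
    total = sum(counts[mode] for mode in TARGET_SHARE)
    if total == 0:
        raise ValueError("no texts with a W or S code were given")
    return {mode: counts[mode] / total for mode in TARGET_SHARE}
```

Comparing the result against `TARGET_SHARE` shows how far a collection in progress is from the stated proportions.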