IIUM Repository

Exploration of Sindhi corpus through statistical analysis on the basis of reality

Sodhar, Irum Naz and Sulaiman, Suriani and Buller, Abdul Hafeez (2023) Exploration of Sindhi corpus through statistical analysis on the basis of reality. Indian Journal of Science and Technology, 16 (12). pp. 924-931. ISSN 0974-6846 E-ISSN 0974-5645

Abstract

The Sindhi language is given more importance in Sindh’s educational institutions than other regional languages, and the majority of the population uses it in mobile applications, letters, text messages and other text conversations. Because communication over computer systems and mobile phones is growing significantly, research is needed to analyze the Sindhi corpus. This research study focuses on the Sindhi alphabet and performs different tasks on the corpus. Methods: Data were collected from available resources, and a corpus was created in Sindhi and English. Twenty letter patterns, three dot alignments and six symbols are used to form the letters. After collection, the data were explored and analyzed through different tasks. Findings: The corpus of Sindhi text is being built because of its importance for language, linguistics and other developments in NLP. The study statistically analyzes the Sindhi-English corpus on the basis of reality, finding two shortest words (۽ and ۾) and three longest words (انگلینڊ, پاڪستان and ڳالھیون). The letter 'آ' is used as a single letter in the Sindhi alphabet; the least frequently occurring letter is a consonant and the most frequently occurring is the vowel 'ئ'. Novelty: Text analysis is an important area in data mining and other research, and this study statistically analyzes the Sindhi-English corpus on the basis of reality. The authors explore the orthography and composition of Sindhi corpora and recommend that Romanized language data be used for Sindhi as well. Preprocessing is not easy because of the lack of resources, and the character conversion model has generated two languages.
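
The kind of statistics the abstract reports (shortest and longest words, most and least frequent letters) can be illustrated with a minimal Python sketch. This is not the authors' method or code: the function name `corpus_stats`, the whitespace tokenization and the use of Unicode letter categories are all assumptions made for illustration, and the sample string is a made-up sequence of words quoted in the abstract, not corpus data.

```python
from collections import Counter
import unicodedata


def corpus_stats(text):
    """Return simple word-length and letter-frequency statistics.

    Hypothetical helper for illustration only; assumes whitespace-separated
    words and at least one letter in the text.
    """
    words = text.split()
    if not words:
        raise ValueError("text contains no words")

    # Count only characters whose Unicode category is a letter (L*).
    # Signs such as ۽ and ۾ (category So) and combining marks are skipped.
    letters = [
        ch
        for word in words
        for ch in word
        if unicodedata.category(ch).startswith("L")
    ]
    letter_counts = Counter(letters)
    if not letter_counts:
        raise ValueError("text contains no letters")

    # Word length is measured in Unicode code points.
    min_len = min(len(w) for w in words)
    max_len = max(len(w) for w in words)

    return {
        "shortest": sorted({w for w in words if len(w) == min_len}),
        "longest": sorted({w for w in words if len(w) == max_len}),
        "most_common_letter": letter_counts.most_common(1)[0],
        # On ties, min() returns the first letter encountered in the text.
        "least_common_letter": min(letter_counts.items(), key=lambda kv: kv[1]),
    }


# Made-up sample built from words quoted in the abstract (not corpus data).
sample = "پاڪستان ۽ انگلینڊ ۾ ڳالھیون"
for name, value in corpus_stats(sample).items():
    print(name, value)
```

Measuring length in code points is a simplification; a study of Sindhi orthography might instead count letters by shape or grapheme, which this sketch does not attempt.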

Item Type: Article (Journal)
Uncontrolled Keywords: Sindhi; Language exploration; Corpus; Statistical Analysis; pattern of letters; Text conversation.
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Kulliyyahs/Centres/Divisions/Institutes: Kulliyyah of Information and Communication Technology > Department of Computer Science
Depositing User: Dr. Suriani Sulaiman
Date Deposited: 29 Dec 2023 15:13
Last Modified: 29 Dec 2023 15:13
URI: http://irep.iium.edu.my/id/eprint/109433
