Menu

LiHiSTO: a comprehensive list of Hindi stopwords

Abstract:

AbstractA preliminary preprocessing step in text analytics is the removal of words with no semantic meaning, otherwise known as stopwords. English stopwords are very easily accessible and created due to the broad usability of the English language. However, a standard list of Hindi stopwords is still missing. This paper proposes an exhaustive list of generic Hindi stopwords and a Python package for easy distribution and usage. The methodology uses a dual mechanism for creating a list of Hindi stopwords. First, the famous English stopwords are collected and translated into meaningful Hindi words (group 1). Second, unique Hindi stopwords from multiple sources are fetched (group 2). Finally, the respective Hindi stopwords from groups 1 and 2 are combined, which resulted in a significantly large set of 820 Hindi stopwords. Additionally, the list of Hindi stopwords is made openly available for use at the Python Package Index (PyPI) repository as a Python package, which is named LiHiSTO. With the help of illustrative implementations, it is shown that LiHiSTO provides abstract and easy access to the list of stopwords for users to perform Hindi text analytics.