Commit ea7e3d2

Update stopwords for Arabic, Viet and most of Indic languages
1 parent 99784ca commit ea7e3d2

3 files changed

Lines changed: 894 additions & 1165 deletions

ac_dc/languages_id.py

Lines changed: 2 additions & 2 deletions
@@ -32,7 +32,7 @@
 {
     "lang": "Assamese",
     "dataset_id": "as",
-    "stopwords_id": None,
+    "stopwords_id": "as",
     "flagged_words_id": None,
     "fasttext_id": "as",
     "sentencepiece_id": "as",
@@ -95,7 +95,7 @@
 {
     "lang": "Gujarati",
     "dataset_id": "gu",
-    "stopwords_id": None,
+    "stopwords_id": "gu",
     "flagged_words_id": None,
     "fasttext_id": "gu",
     "sentencepiece_id": "gu",

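The two hunks above switch `stopwords_id` from `None` to a language code for Assamese and Gujarati. As a minimal sketch of what that field might gate, assuming entries are looked up by `dataset_id` (the list name `langs_id_excerpt` and the helper `stopwords_id_for` are hypothetical and do not come from the repository):

```python
from typing import Optional

# Hypothetical excerpt that mirrors the two entries as they stand after this commit.
langs_id_excerpt = [
    {
        "lang": "Assamese",
        "dataset_id": "as",
        "stopwords_id": "as",
        "flagged_words_id": None,
        "fasttext_id": "as",
        "sentencepiece_id": "as",
    },
    {
        "lang": "Gujarati",
        "dataset_id": "gu",
        "stopwords_id": "gu",
        "flagged_words_id": None,
        "fasttext_id": "gu",
        "sentencepiece_id": "gu",
    },
]


def stopwords_id_for(dataset_id: str, table=langs_id_excerpt) -> Optional[str]:
    """Return the stopwords_id registered for a dataset id, or None if absent."""
    for entry in table:
        if entry["dataset_id"] == dataset_id:
            return entry["stopwords_id"]
    return None  # Unknown dataset id: no stopword list to apply.


print(stopwords_id_for("as"))
print(stopwords_id_for("gu"))
```

Under this assumption, a `None` value would mean that any stopword-based check is skipped for the language, so the change enables that check for both languages.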
ac_dc/parameters_filtering.py

Lines changed: 1 addition & 1 deletion
@@ -251,7 +251,7 @@
     "cond_check_lang_id": True,
     "lang_id_min_cutoff": 0.8,
     "cond_check_perplexity": True,
-    "perplexity_max_cutoff": 150000,
+    "perplexity_max_cutoff": 2500,
 }
 
 parameters_filtering_en = {
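
The hunk above lowers `perplexity_max_cutoff` from 150000 to 2500 in a parameter dict whose name is not visible in the diff. A minimal sketch of how such a cutoff might be applied, assuming a document is kept when its perplexity is at or below the cutoff (the names `params_excerpt` and `passes_perplexity` are hypothetical, and the inclusive comparison is an assumption):

```python
# Hypothetical excerpt holding only the keys visible in the hunk, with the new value.
params_excerpt = {
    "cond_check_lang_id": True,
    "lang_id_min_cutoff": 0.8,
    "cond_check_perplexity": True,
    "perplexity_max_cutoff": 2500,
}


def passes_perplexity(perplexity: float, params: dict) -> bool:
    """Return True if the document survives the perplexity check."""
    if not params["cond_check_perplexity"]:
        return True  # Check disabled: every document passes.
    # Assumption: the cutoff is inclusive.
    return perplexity <= params["perplexity_max_cutoff"]


print(passes_perplexity(1200.0, params_excerpt))
print(passes_perplexity(40000.0, params_excerpt))
```

With the old value of 150000, a document scoring 40000 would have passed this sketch; with 2500 it is rejected, so the new setting is much stricter.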
