quadrismegistus/prosodic

Prosodic

Prosodic is a metrical-phonological parser written in Python. It currently parses English and Finnish text, and adding further languages is easy with a pronunciation dictionary or a custom Python function. Prosodic was built by Ryan Heuser, Josh Falk, and Arto Anttila. Josh also maintains another repository, in which he has rewritten the part of this project that does phonetic transcription for English and Finnish. Sam Bowman has also contributed to the codebase, adding several new metrical constraints.

Prosodic 3.x features a DataFrame-first architecture with vectorized numpy constraint evaluation, GPU-accelerated harmonic bounding, and a Maximum Entropy weight learner for training constraint weights from annotated data. See CLAUDE.md for full architecture docs.
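
For readers unfamiliar with the term, harmonic bounding has a standard definition in constraint-based analysis: a candidate parse is bounded when some other candidate has no more violations on every constraint and strictly fewer on at least one. The sketch below illustrates that definition in plain Python. The function names and violation counts are illustrative assumptions, and this is not Prosodic's vectorized or GPU implementation.

```python
from typing import Dict, List

Violations = Dict[str, int]  # constraint name -> violation count


def bounds(a: Violations, b: Violations) -> bool:
    """True if candidate `a` harmonically bounds candidate `b`."""
    constraints = set(a) | set(b)
    no_worse = all(a.get(c, 0) <= b.get(c, 0) for c in constraints)
    strictly_better = any(a.get(c, 0) < b.get(c, 0) for c in constraints)
    return no_worse and strictly_better


def unbounded(candidates: List[Violations]) -> List[int]:
    """Indices of candidates that no other candidate bounds."""
    return [
        i
        for i, cand in enumerate(candidates)
        if not any(
            bounds(other, cand)
            for j, other in enumerate(candidates)
            if j != i
        )
    ]


# Hypothetical violation profiles for three parses of one line.
candidates = [
    {"w_stress": 0, "s_unstress": 1},  # not bounded by either other parse
    {"w_stress": 1, "s_unstress": 1},  # bounded by the first parse
    {"w_stress": 1, "s_unstress": 0},  # trades one violation for another
]
print(unbounded(candidates))  # [0, 2]
```

Bounded candidates can never win under any non-negative weighting of the constraints, which is why they can be discarded before ranking.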

Supports Python>=3.9.

Try the web app at prosodic.app. Features:

  • Parse poetry and prose with constraint-satisfaction metrical analysis
  • Five tabs: Parse (results table with sortable columns), Line View (all scansions for a single line), Meter (constraint config), MaxEnt (weight learning), Settings (syntax/language)
  • Data export (CSV/TSV/JSON) with per-constraint violation counts and unbounded averages
  • Prose handling: auto-fallback to linepart parsing for long lines, with optional syntax-based sub-splitting via spaCy
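
As background on the MaxEnt tab: in the standard Maximum Entropy formulation, each candidate parse receives a penalty equal to the weighted sum of its constraint violations, and its probability is proportional to exp(-penalty). The sketch below shows only that scoring step, with made-up weights and violation counts. It does not reproduce Prosodic's weight learner or its API, and `maxent_probs` is a hypothetical name.

```python
import math
from typing import Dict, List


def maxent_probs(
    candidates: List[Dict[str, int]], weights: Dict[str, float]
) -> List[float]:
    """Probability of each candidate under a MaxEnt grammar."""
    # Penalty = weighted sum of violation counts.
    penalties = [
        sum(weights.get(name, 0.0) * count for name, count in cand.items())
        for cand in candidates
    ]
    # Softmax over negative penalties; subtracting the minimum
    # keeps the exponentials numerically stable.
    low = min(penalties)
    scores = [math.exp(-(p - low)) for p in penalties]
    total = sum(scores)
    return [s / total for s in scores]


weights = {"w_stress": 2.0, "s_unstress": 1.0}  # hypothetical weights
candidates = [
    {"w_stress": 0, "s_unstress": 1},  # penalty 1.0
    {"w_stress": 1, "s_unstress": 0},  # penalty 2.0
]
probs = maxent_probs(candidates, weights)
print([round(p, 3) for p in probs])  # [0.731, 0.269]
```

Weight learning then amounts to choosing the weights that make the annotated parses as probable as possible.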

Performance

Shakespeare sonnets (2155 lines, Apple M1). Run python -m prosodic.profiling to regenerate.

Step                                          v2       v3      Speedup
Init (tokenize + pronunciations + entities)   5.29s    1.80s   3x
Parse (CPU)                                   72.97s   5.0s    15x
Parse (GPU)                                   72.97s   1.3s    57x
End-to-end (CPU)                              78.3s    6.8s    12x
End-to-end (GPU)                              78.3s    3.1s    26x
DF-only (no entities, GPU)                    78.3s    1.8s    42x
Syntax (dep parse)                            160.2s   2.7s    58x

Install

1. Install python package

Install from PyPI:

pip install prosodic

2. Install espeak

Install espeak, free text-to-speech (TTS) software, to ‘sound out’ unknown words.

  • Mac: brew install espeak. (First install homebrew if not already installed.)

  • Linux: apt-get install espeak libespeak1 libespeak-dev

  • Windows: Download and install from GitHub

Usage

Web app

Prosodic has a web app GUI. After installing, run:

prosodic web                    # production mode
prosodic web --dev              # auto-reload on Python/Svelte changes
prosodic web --host 0.0.0.0     # expose to network

Then navigate to http://127.0.0.1:8181/, or visit the live demo at prosodic.app.

Python

Read texts

# import prosodic
import prosodic

# load a text
sonnet = prosodic.Text("""
Those hours, that with gentle work did frame
The lovely gaze where every eye doth dwell,
Will play the tyrants to the very same
And that unfair which fairly doth excel;
For never-resting time leads summer on
To hideous winter, and confounds him there;
Sap checked with frost, and lusty leaves quite gone,
Beauty o’er-snowed and bareness every where:
Then were not summer’s distillation left,
A liquid prisoner pent in walls of glass,
Beauty’s effect with beauty were bereft,
Nor it, nor no remembrance what it was:
But flowers distill’d, though they with winter meet,
Leese but their show; their substance still lives sweet.
""")

# can also load by filename
shaksonnets = prosodic.Text(fn='corpora/corppoetry_en/en.shakespeare.txt')

Stanzas, lines, words, syllables, phonemes

Texts in prosodic are organized into a tree structure. The .children of a Text object is a list of Stanza objects, whose .parent attributes point back to the Text. In turn, each stanza's .children is a list of Line objects, whose .parent attributes point back to the stanza, and so on down the tree.
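
The parent/child linkage described above can be pictured with a minimal sketch. The `Node` class below is a hypothetical stand-in, not Prosodic's entity class; it only shows how `.children` and `.parent` mirror each other and how a walk down the tree reaches every level.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    kind: str  # e.g. "Text", "Stanza", "Line"
    txt: str = ""
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, child: "Node") -> "Node":
        """Attach a child and set its back-pointer."""
        child.parent = self
        self.children.append(child)
        return child

    def walk(self) -> Iterator["Node"]:
        """Yield this node, then all descendants, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


text = Node("Text")
stanza = text.add(Node("Stanza"))
line = stanza.add(Node("Line", "Those hours, that with gentle work did frame"))

assert line.parent is stanza and stanza.parent is text
print([node.kind for node in text.walk()])  # ['Text', 'Stanza', 'Line']
```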

# Take a peek at this tree structure 
# and the features particular entities have
sonnet.show(maxlines=30, incl_phons=True)
Text()
|   Stanza(num=1)
|       Line(num=1, txt='Those hours, that with gentle work did frame')
|           WordToken(num=1, txt='Those', sent_num=1, sentpart_num=1)
|               WordType(num=1, txt='Those', lang='en', num_forms=1)
|                   WordForm(num=1, txt='Those')
|                       Syllable(ipa='ðoʊz', num=1, txt='Those', is_stressed=False, is_heavy=True)
|                           Phoneme(num=1, txt='ð', syl=-1, son=-1, cons=1, cont=1, delrel=-1, lat=-1, nas=-1, strid=0, voi=1, sg=-1, cg=-1, ant=1, cor=1, distr=1, lab=-1, hi=-1, lo=-1, back=-1, round=-1, velaric=-1, tense=0, long=-1, hitone=0, hireg=0)
|                           Phoneme(num=3, txt='o', syl=1, son=1, cons=-1, cont=1, delrel=-1, lat=-1, nas=-1, strid=0, voi=1, sg=-1, cg=-1, ant=0, cor=-1, distr=0, lab=-1, hi=-1, lo=-1, back=1, round=1, velaric=-1, tense=1, long=-1, hitone=0, hireg=0)
|                           Phoneme(num=3, txt='ʊ', syl=1, son=1, cons=-1, cont=1, delrel=-1, lat=-1, nas=-1, strid=0, voi=1, sg=-1, cg=-1, ant=0, cor=-1, distr=0, lab=-1, hi=1, lo=-1, back=1, round=1, velaric=-1, tense=-1, long=-1, hitone=0, hireg=0)
|                           Phoneme(num=4, txt='z', syl=-1, son=-1, cons=1, cont=1, delrel=-1, lat=-1, nas=-1, strid=0, voi=1, sg=-1, cg=-1, ant=1, cor=1, distr=-1, lab=-1, hi=-1, lo=-1, back=-1, round=-1, velaric=-1, tense=0, long=-1, hitone=0, hireg=0)
|           WordToken(num=2, txt=' hours', sent_num=1, sentpart_num=1)
|               WordType(num=1, txt='hours', lang='en', num_forms=2)
|                   WordForm(num=1, txt='hours')
|                       Syllable(ipa="'aʊ", num=1, txt='ho', is_stressed=True, is_heavy=True, is_strong=True, is_weak=False)
|                           Phoneme(num=2, txt='a', syl=1, son=1, cons=-1, cont=1, delrel=-1, lat=-1, nas=-1, strid=0, voi=1, sg=-1, cg=-1, ant=0, cor=-1, distr=0, lab=-1, hi=-1, lo=1, back=-1, round=-1, velaric=-1, tense=1, long=-1, hitone=0, hireg=0)
|                           Phoneme(num=3, txt='ʊ', syl=1, son=1, cons=-1, cont=1, delrel=-1, lat=-1, nas=-1, strid=0, voi=1, sg=-1, cg=-1, ant=0, cor=-1, distr=0, lab=-1, hi=1, lo=-1, back=1, round=1, velaric=-1, tense=-1, long=-1, hitone=0, hireg=0)
|                       Syllable(ipa='ɛːz', num=2, txt='urs', is_stressed=False, is_heavy=True, is_strong=False, is_weak=True)
|                           Phoneme(num=2, txt='ɛː', syl=1, son=1, cons=-1, cont=1, delrel=-1, lat=-1, nas=-1, strid=0, voi=1, sg=-1, cg=-1, ant=0, cor=-1, distr=0, lab=-1, hi=-1, lo=-1, back=-1, round=-1, velaric=-1, tense=-1, long=1, hitone=0, hireg=0)
|                           Phoneme(num=4, txt='z', syl=-1, son=-1, cons=1, cont=1, delrel=-1, lat=-1, nas=-1, strid=0, voi=1, sg=-1, cg=-1, ant=1, cor=1, distr=-1, lab=-1, hi=-1, lo=-1, back=-1, round=-1, velaric=-1, tense=0, long=-1, hitone=0, hireg=0)
|                   WordForm(num=2, txt='hours')
|                       Syllable(ipa="'aʊrz", num=1, txt='hours', is_stressed=True, is_heavy=True)
|                           Phoneme(num=2, txt='a', syl=1, son=1, cons=-1, cont=1, delrel=-1, lat=-1, nas=-1, strid=0, voi=1, sg=-1, cg=-1, ant=0, cor=-1, distr=0, lab=-1, hi=-1, lo=1, back=-1, round=-1, velaric=-1, tense=1, long=-1, hitone=0, hireg=0)
|                           Phoneme(num=3, txt='ʊ', syl=1, son=1, cons=-1, cont=1, delrel=-1, lat=-1, nas=-1, strid=0, voi=1, sg=-1, cg=-1, ant=0, cor=-1, distr=0, lab=-1, hi=1, lo=-1, back=1, round=1, velaric=-1, tense=-1, long=-1, hitone=0, hireg=0)
|                           Phoneme(num=4, txt='r', syl=-1, son=1, cons=1, cont=1, delrel=0, lat=-1, nas=-1, strid=0, voi=1, sg=-1, cg=-1, ant=1, cor=1, distr=-1, lab=-1, hi=0, lo=0, back=0, round=-1, velaric=-1, tense=0, long=-1, hitone=0, hireg=0)
|                           Phoneme(num=4, txt='z', syl=-1, son=-1, cons=1, cont=1, delrel=-1, lat=-1, nas=-1, strid=0, voi=1, sg=-1, cg=-1, ant=1, cor=1, distr=-1, lab=-1, hi=-1, lo=-1, back=-1, round=-1, velaric=-1, tense=0, long=-1, hitone=0, hireg=0)
|           WordToken(num=3, txt=',', sent_num=1, sentpart_num=1)
|               WordType(num=1, txt=',', lang='en', num_forms=0, is_punc=True)
|           WordToken(num=4, txt=' that', sent_num=1, sentpart_num=1)
|               WordType(num=1, txt='that', lang='en', num_forms=3)
# take a peek at it in dataframe form
sonnet.df   # by-syllable dataframe representation
sonnet      # ...which will also be shown when text object displayed (in a notebook)
word_num_forms syll_is_stressed syll_is_heavy syll_is_strong syll_is_weak word_is_punc
stanza_num line_num line_txt sent_num sentpart_num wordtoken_num wordtoken_txt word_lang wordform_num syll_num syll_txt syll_ipa
1 1 Those hours, that with gentle work did frame 1 1 1 Those en 1 1 Those ðoʊz 1 0 1
2 hours en 1 1 ho 'aʊ 2 1 1 1 0
2 urs ɛːz 2 0 1 0 1
2 1 hours 'aʊrz 2 1 1
3 , en 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14 Leese but their show; their substance still lives sweet. 1 1 7 substance en 1 2 tance stəns 1 0 1 0 1
8 still en 1 1 still 'stɪl 1 1 1
9 lives en 1 1 lives 'lɪvz 1 1 1
10 sweet en 1 1 sweet 'swiːt 1 1 1
11 . en 0 0 0 1

195 rows × 6 columns

# you can loop over this directly if you want
for stanza in shaksonnets.stanzas:
    for line in stanza:
        for wordtoken in line:
            for wordtype in wordtoken:
                for wordform in wordtype:
                    for syllable in wordform:
                        for phoneme in syllable:
                            # ...
                            pass
# or directly access components
print(f'''
Shakespeare's sonnets have:
  * {len(shaksonnets.stanzas):,} "stanzas"        (in this text, each one a sonnet)
  * {len(shaksonnets.lines):,} lines
  * {len(shaksonnets.wordtokens):,} wordtokens    (including punctuation)
  * {len(shaksonnets.wordtypes):,} wordtypes     (each token has one wordtype object)
  * {len(shaksonnets.wordforms):,} wordforms     (a word + IPA pronunciation; no punctuation)
  * {len(shaksonnets.syllables):,} syllables
  * {len(shaksonnets.phonemes):,} phonemes
''')
Shakespeare's sonnets have:
  * 154 "stanzas"        (in this text, each one a sonnet)
  * 2,155 lines
  * 20,317 wordtokens    (including punctuation)
  * 20,317 wordtypes     (each token has one wordtype object)
  * 17,601 wordforms     (a word + IPA pronunciation; no punctuation)
  * 21,915 syllables
  * 63,614 phonemes
# access lines

# text.line{num} will return text.lines[num-1]
assert sonnet.line1 is sonnet.lines[0]
assert sonnet.line10 is sonnet.lines[9]

# show the line
sonnet.line1
word_num_forms syll_is_stressed syll_is_heavy syll_is_strong syll_is_weak word_is_punc
line_num line_txt sent_num sentpart_num wordtoken_num wordtoken_txt word_lang wordform_num syll_num syll_txt syll_ipa
1 Those hours, that with gentle work did frame 1 1 1 Those en 1 1 Those ðoʊz 1 0 1
2 hours en 1 1 ho 'aʊ 2 1 1 1 0
2 urs ɛːz 2 0 1 0 1
2 1 hours 'aʊrz 2 1 1
3 , en 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ...
6 gentle en 1 2 tle təl 1 0 1 0 1
7 work en 1 1 work 'wɛːk 1 1 1
8 did en 1 1 did dɪd 2 0 1
2 1 did 'dɪd 2 1 1
9 frame en 1 1 frame 'freɪm 1 1 1

15 rows × 6 columns
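
A shorthand like `text.line{num}` can be built on Python's `__getattr__` hook, which is only called when normal attribute lookup fails. The sketch below is one way to do it, using a hypothetical `MiniText` class; Prosodic's own implementation may differ.

```python
import re
from typing import List


class MiniText:
    """Hypothetical container exposing 1-based `line{num}` attributes."""

    def __init__(self, lines: List[str]):
        self.lines = lines

    def __getattr__(self, name: str) -> str:
        # Only reached when `name` is not a regular attribute.
        match = re.fullmatch(r"line(\d+)", name)
        if match:
            num = int(match.group(1))
            if 1 <= num <= len(self.lines):
                return self.lines[num - 1]  # line1 -> lines[0]
        raise AttributeError(name)


t = MiniText([
    "Those hours, that with gentle work did frame",
    "The lovely gaze where every eye doth dwell,",
])
assert t.line1 is t.lines[0]
assert t.line2 is t.lines[1]
```

Raising `AttributeError` for anything else keeps `hasattr` and ordinary error messages working as expected.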

# build lines directly
line_from_richardIII = prosodic.Line('A horse, a horse, my kingdom for a horse!')
line_from_richardIII
tokenizing @ 2023-12-15 14:14:17,991
⎿ 0 seconds @ 2023-12-15 14:14:17,992
word_num_forms syll_is_stressed syll_is_heavy word_is_punc syll_is_strong syll_is_weak
line_txt sent_num sentpart_num wordtoken_num wordtoken_txt word_lang wordform_num syll_num syll_txt syll_ipa
A horse, a horse, my kingdom for a horse! 1 1 1 A en 1 1 A 1 0 1
2 horse en 1 1 horse 'hɔːrs 1 1 1
3 , en 0 0 0 1
4 a en 1 1 a 1 0 1
5 horse en 1 1 horse 'hɔːrs 1 1 1
... ... ... ... ... ... ... ... ... ... ... ... ...
8 kingdom en 1 2 dom dəm 1 0 1 0 1
9 for en 1 1 for fɔːr 1 0 1
10 a en 1 1 a 1 0 1
11 horse en 1 1 horse 'hɔːrs 1 1 1
12 ! en 0 0 0 1

13 rows × 6 columns

Phrasal stress (syntax)

Prosodic can optionally compute phrasal stress from dependency parsing (Liberman & Prince 1977), using spaCy. This adds a phrasal_stress column to the syllable DataFrame, indicating each word's syntactic prominence (0 = sentence root, more negative = more deeply embedded).

# Install spaCy (optional dependency)
pip install "prosodic[syntax]"   # quotes keep shells like zsh from expanding the brackets
python -m spacy download en_core_web_sm
# Enable with syntax=True
t = prosodic.Text("Shall I compare thee to a summers day", syntax=True)

# Phrasal stress values per word
df = t._syll_df
# drop repeated syllable rows per word first, then select the columns to show
df.drop_duplicates('word_num')[['word_txt', 'phrasal_stress']]
#   word_txt  phrasal_stress
#      Shall              -1
#          I              -1
#    compare               0   # ROOT (most prominent)
#       thee              -1
#         to              -1
#          a              -3
#    summers              -2
#        day              -1

Two metrical constraints use phrasal stress (both inert when syntax=False):

  • w_prom: penalizes phrasally prominent words (root/direct dependents) on weak metrical positions
  • s_demoted: penalizes deeply embedded words on strong metrical positions
from prosodic.parsing.meter import Meter
m = Meter(constraints=['w_stress', 's_unstress', 'w_peak', 'w_prom', 's_demoted'])
t.parse(meter=m)

Metrical parsing

Parsing lines
# parse with default options
plausible_parses = line_from_richardIII.parse()
plausible_parses
parse_score parse_is_bounded meterpos_num_slots *w_peak *w_stress *s_unstress *unres_across *unres_within
line_txt parse_rank parse_txt parse_meter parse_stress
A horse, a horse, my kingdom for a horse! 1 a HORSE a HORSE my KING dom FOR a HORSE -+-+-+-+-+ -+-+-+---+ 1.0 0.0 10 0 0 1 0 0
# see best parse
line_from_richardIII.best_parse
A horse a horse my kingdom for a horse
⎿ Parse(rank=1, meter='-+-+-+-+-+', stress='-+-+-+---+', score=1, is_bounded=0)
# parse with different options
diff_parses = line_from_richardIII.parse(constraints=('w_peak','s_unstress'))
diff_parses
parse_score parse_is_bounded meterpos_num_slots *w_peak *s_unstress
line_txt parse_rank parse_txt parse_meter parse_stress
A horse, a horse, my kingdom for a horse! 1 a HORSE a HORSE my KING dom FOR a HORSE -+-+-+-+-+ -+-+-+---+ 1.0 0.0 10 0 1
2 a HORSE a HORSE my KING dom FOR a.horse -+-+-+-+-- -+-+-+---+ 1.0 0.0 12 0 1
3 a HORSE a HORSE my KING dom.for A horse -+-+-+--+- -+-+-+---+ 1.0 0.0 12 0 1
4 a HORSE a HORSE my KING dom.for A.HORSE -+-+-+--++ -+-+-+---+ 1.0 0.0 14 0 1
5 a HORSE a HORSE my KING.DOM for.a HORSE -+-+-++--+ -+-+-+---+ 1.0 0.0 14 0 1
6 a HORSE a HORSE my KING dom FOR.A horse -+-+-+-++- -+-+-+---+ 2.0 0.0 12 0 2
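
The `*w_stress` and `*s_unstress` columns can be read off by lining up `parse_meter` against `parse_stress` one syllable at a time: a stressed syllable in a weak position counts against `w_stress`, and an unstressed syllable in a strong position counts against `s_unstress`. The helper below is an illustrative sketch of that reading, assuming one character per syllable in both strings. `count_violations` is a hypothetical name, not part of Prosodic, and the library evaluates its constraints in its own way.

```python
from typing import Dict


def count_violations(meter: str, stress: str) -> Dict[str, int]:
    """Count stress/meter mismatches, one character per syllable.

    '+' marks a strong position (meter) or a stressed syllable (stress);
    '-' marks a weak position or an unstressed syllable.
    """
    if len(meter) != len(stress):
        raise ValueError("meter and stress must cover the same syllables")
    pairs = list(zip(meter, stress))
    return {
        "w_stress": sum(m == "-" and s == "+" for m, s in pairs),
        "s_unstress": sum(m == "+" and s == "-" for m, s in pairs),
    }


# Best parse of the Richard III line shown above.
print(count_violations("-+-+-+-+-+", "-+-+-+---+"))
# {'w_stress': 0, 's_unstress': 1}
```

The single `s_unstress` violation comes from "for", an unstressed syllable placed in a strong position. In the rows shown here, `parse_score` matches the sum of the violation columns, which is consistent with equal weights, though that is an inference from this output rather than something stated in the text.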
Parsing texts
# small texts
sonnet.parse()
parsing 14 lines [5x] @ 2023-12-15 14:17:43,563
│ stanza 01, line 14: LEESE but.their SHOW their SUBS tance STILL lives SWEET: 100%|██████████| 14/14 [00:00<00:00, 45.78it/s]
⎿ 0.3 seconds @ 2023-12-15 14:17:43,873
parse_score parse_is_bounded meterpos_num_slots *w_peak *w_stress *s_unstress *unres_across *unres_within
stanza_num line_num line_txt parse_rank parse_txt parse_meter parse_stress
1 1 Those hours, that with gentle work did frame 1 those HO urs THAT with GEN tle WORK did FRAME -+-+-+-+-+ -+-+-+-+-+ 0.0 0.0 10 0 0 0 0 0
2 those HOURS that.with GEN tle WORK did FRAME -+--+-+-+ -+--+-+-+ 0.0 0.0 11 0 0 0 0 0
3 those HOURS that.with GEN tle WORK did FRAME -+--+-+-+ -+--+-+-+ 0.0 0.0 11 0 0 0 0 0
2 The lovely gaze where every eye doth dwell, 1 the LO vely GAZE where E very EYE doth DWELL -+-+-+-+-+ -+-+-+-+-+ 0.0 0.0 10 0 0 0 0 0
2 the LO vely GAZE where E ve.ry EYE doth DWELL -+-+-+--+-+ -+-+-+--+-+ 1.0 0.0 13 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
13 But flowers distill'd, though they with winter meet, 1 but FLO wers DIS.TILL'D though THEY with WIN ter MEET -+-++-+-+-+ -+--+-+-+-+ 2.0 0.0 13 0 0 1 0 1
2 but FLO wers.dis TILL'D though THEY with WIN ter MEET -+--+-+-+-+ -+--+-+-+-+ 2.0 0.0 13 0 0 0 2 0
3 but FLO.WERS dis TILL'D though THEY with WIN ter MEET -++-+-+-+-+ -+--+-+-+-+ 2.0 0.0 13 0 0 1 0 1
4 but FLO wers DIS till'd THOUGH they.with WIN ter MEET -+-+-+--+-+ -+--+---+-+ 4.0 0.0 13 1 1 2 0 0
14 Leese but their show; their substance still lives sweet. 1 LEESE but.their SHOW their SUBS tance STILL lives SWEET +--+-+-+-+ +--+-+-+++ 1.0 0.0 12 0 1 0 0 0

37 rows × 8 columns

# and big texts
shaksonnets.parse()
parsing 2155 lines [5x] @ 2023-12-15 14:17:52,124
│ stanza 154, line 14: love's FI re HEATS.WA ter WA ter COOLS not LOVE: 100%|██████████| 2155/2155 [00:56<00:00, 38.03it/s]
⎿ 57.4 seconds @ 2023-12-15 14:18:49,496
parse_score parse_is_bounded meterpos_num_slots *w_peak *w_stress *s_unstress *unres_across *unres_within
stanza_num line_num line_txt parse_rank parse_txt parse_meter parse_stress
1 1 FROM fairest creatures we desire increase, 1 from FAI rest CREA tures WE de SIRE in CREASE -+-+-+-+-+ -+-+-+-+-+ 0.0 0.0 10 0 0 0 0 0
2 from FAI rest CREA tures WE de SI re IN crease -+-+-+-+-+- -+-+-+-+-++ 1.0 0.0 11 0 1 0 0 0
3 from FAI rest CREA tures WE de SI re IN.CREASE -+-+-+-+-++ -+-+-+-+-++ 1.0 0.0 13 0 0 0 0 1
4 from FAI rest CREA tures WE de SI re.in CREASE -+-+-+-+--+ -+-+-+-+--+ 2.0 0.0 13 0 0 0 2 0
2 That thereby beauty's rose might never die, 1 that THE reby BEA uty's ROSE might NE ver DIE -+-+-+-+-+ -+++-+-+-+ 1.0 0.0 10 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
154 14 Love's fire heats water, water cools not love. 2 love's FI re HEATS wa.ter WA ter COOLS not LOVE -+-+--+-+-+ ++-++-+-+-+ 4.0 0.0 13 1 2 0 0 1
3 love's FI.RE heats WA ter WA ter COOLS not LOVE -++-+-+-+-+ ++-++-+-+-+ 4.0 0.0 13 0 2 1 0 1
4 LOVE'S fire HEATS wa.ter WA ter COOLS not LOVE +-+--+-+-+ ++++-+-+-+ 4.0 0.0 12 1 2 0 0 1
5 LOVE'S.FI re HEATS.WA ter WA ter COOLS not LOVE ++-++-+-+-+ ++-++-+-+-+ 4.0 0.0 15 0 0 0 4 0
6 love's FI re HEATS wa TER wa TER cools NOT love -+-+-+-+-+- ++-++-+-+++ 9.0 0.0 11 2 5 2 0 0

7277 rows × 8 columns
