.. _Jupyter notebook: https://github.com/MontrealCorpusTools/PolyglotDB/tree/master/examples/tutorial/tutorial_3_query.ipynb

.. _full version of the script: https://github.com/MontrealCorpusTools/PolyglotDB/tree/master/examples/tutorial/tutorial_3.py

.. _related ISCAN tutorial: https://iscan.readthedocs.io/en/latest/tutorials_iscan.html#examining-analysing-the-data

.. _expected output: https://github.com/MontrealCorpusTools/PolyglotDB/tree/master/examples/tutorial/results/tutorial_3_subset_output.csv

.. _tutorial scripts: https://github.com/MontrealCorpusTools/PolyglotDB/tree/main/examples/tutorial

.. _tutorial_query:

***********************************
Tutorial 3: Getting information out
***********************************

The main objective of this tutorial is to export a CSV file using a query on an imported (:ref:`tutorial_first_steps`) and enriched (:ref:`tutorial_enrichment`) corpus.

.. note::

   The following Python scripts are presented in step-by-step blocks to guide you through the process.
   However, you are expected to run the entire Python script as a single unit when using PolyglotDB.
   The complete Python script is available in the `tutorial scripts`_.
   If you prefer running the steps in blocks, this tutorial is also available as a `Jupyter notebook`_.

As in the other tutorials, the import statements and the corpus name (as it is stored in pgdb) must be set for the code in this tutorial to be runnable:

.. code-block:: python

    from polyglotdb import CorpusContext

    corpus_name = 'tutorial-subset'
    export_path = './results/tutorial_3_subset_output.csv'

Creating an initial query
=========================

The first step in generating a CSV file is to create a query that selects just the linguistic objects ("annotations") of a particular type (e.g. words, syllables) that are of interest to our study.

For this example, we will query for all *syllables* that are:

- Stressed (defined here as having a ``stress`` value equal to ``'1'``),
- At the beginning of the word,
- In words that are at the end of utterances.

.. code-block:: python

    with CorpusContext(corpus_name) as c:
        q = c.query_graph(c.syllable)
        q = q.filter(c.syllable.stress == '1')  # Stressed syllables...
        q = q.filter(c.syllable.begin == c.syllable.word.begin)  # That are at the beginning of words...
        q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end)  # that are at the end of utterances.

Next, we want to specify the particular information to extract for each syllable found.

.. code-block:: python

    # duplicated from above
    with CorpusContext(corpus_name) as c:
        q = c.query_graph(c.syllable)
        q = q.filter(c.syllable.stress == '1')  # Stressed syllables...
        q = q.filter(c.syllable.begin == c.syllable.word.begin)  # That are at the beginning of words...
        q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end)  # that are at the end of utterances.

        q = q.columns(c.syllable.label.column_name('syllable'),
                      c.syllable.duration.column_name('syllable_duration'),
                      c.syllable.word.label.column_name('word'),
                      c.syllable.word.begin.column_name('word_begin'),
                      c.syllable.word.end.column_name('word_end'),
                      c.syllable.word.num_syllables.column_name('word_num_syllables'),
                      c.syllable.word.stress_pattern.column_name('word_stress_pattern'),
                      c.syllable.word.utterance.speech_rate.column_name('utterance_speech_rate'),
                      c.syllable.speaker.name.column_name('speaker'),
                      c.syllable.discourse.name.column_name('file'),
                      )

With the above, we extract information of interest about the syllable, the word it is in, the utterance it is in, the speaker, and the sound file (``discourse`` in PolyglotDB's API).

To test out the query, we can ``limit`` the results (for readability) and print them:

.. code-block:: python

    # duplicated from above
    with CorpusContext(corpus_name) as c:
        q = c.query_graph(c.syllable)
        q = q.filter(c.syllable.stress == '1')  # Stressed syllables...
        q = q.filter(c.syllable.begin == c.syllable.word.begin)  # That are at the beginning of words...
        q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end)  # that are at the end of utterances.

        q = q.columns(c.syllable.label.column_name('syllable'),
                      c.syllable.duration.column_name('syllable_duration'),
                      c.syllable.word.label.column_name('word'),
                      c.syllable.word.begin.column_name('word_begin'),
                      c.syllable.word.end.column_name('word_end'),
                      c.syllable.word.num_syllables.column_name('word_num_syllables'),
                      c.syllable.word.stress_pattern.column_name('word_stress_pattern'),
                      c.syllable.word.utterance.speech_rate.column_name('utterance_speech_rate'),
                      c.syllable.speaker.name.column_name('speaker'),
                      c.syllable.discourse.name.column_name('file'),
                      )

        q = q.limit(10)
        # Optional: Use order_by to enforce ordering on the output for easier comparison with the sample output.
        q = q.order_by(c.syllable.label)
        results = q.all()
        print(results)

This shows the first ten rows that would be exported to a CSV file.

.. _tutorial_export:

Exporting a CSV file
====================

Once the query is constructed with filters and columns, exporting to a CSV file is a single method call on the query object.
For completeness, the full code for the query and export is given below.

.. code-block:: python

    with CorpusContext(corpus_name) as c:
        q = c.query_graph(c.syllable)
        q = q.filter(c.syllable.stress == '1')
        q = q.filter(c.syllable.begin == c.syllable.word.begin)
        q = q.filter(c.syllable.word.end == c.syllable.word.utterance.end)

        q = q.columns(c.syllable.label.column_name('syllable'),
                      c.syllable.duration.column_name('syllable_duration'),
                      c.syllable.word.label.column_name('word'),
                      c.syllable.word.begin.column_name('word_begin'),
                      c.syllable.word.end.column_name('word_end'),
                      c.syllable.word.num_syllables.column_name('word_num_syllables'),
                      c.syllable.word.stress_pattern.column_name('word_stress_pattern'),
                      c.syllable.word.utterance.speech_rate.column_name('utterance_speech_rate'),
                      c.syllable.speaker.name.column_name('speaker'),
                      c.syllable.discourse.name.column_name('file'),
                      )

        q = q.order_by(c.syllable.label)
        q.to_csv(export_path)

The generated CSV file is then ready to open in R or other programs for data analysis. You can see a `full version of the script`_, as well as the `expected output`_ when run on the 'LibriSpeech-subset' corpus.

Next steps
==========

See :ref:`tutorial_formants` and :ref:`tutorial_pitch` for practical examples of interesting linguistic analysis that can be performed on enriched corpora using Python and R.

You can also see the `related ISCAN tutorial`_ for R code on visualizing and analyzing the exported results.
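
Before moving on to R, the exported file can also be sanity-checked from Python. The following is a minimal sketch using only the standard library. It assumes the CSV file is comma-delimited and has a header row containing the names passed to ``column_name`` in the query (here ``speaker`` and ``syllable_duration``). The helper ``mean_duration_by_speaker`` is a hypothetical name introduced for illustration and is not part of PolyglotDB.

.. code-block:: python

    import csv
    from collections import defaultdict
    from statistics import mean


    def mean_duration_by_speaker(csv_file):
        """Return {speaker: mean syllable duration} from an open CSV file object.

        Assumes a header row with 'speaker' and 'syllable_duration' columns.
        """
        durations = defaultdict(list)
        for row in csv.DictReader(csv_file):
            try:
                durations[row['speaker']].append(float(row['syllable_duration']))
            except (KeyError, TypeError, ValueError):
                # Skip rows with a missing column or a non-numeric duration.
                continue
        return {speaker: mean(values) for speaker, values in durations.items()}

One way to use it is to open the exported file with ``open(export_path, newline='')`` and pass the file object to the function. Taking a file object rather than a path keeps the helper easy to try on a small in-memory sample.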