HARUNIWA2

Welcome to HARUNIWA2: pipeline for parsing Japanese

What is HARUNIWA2?

HARUNIWA2 provides a full pipeline for parsing Japanese. Its central components are:

- an interface to the tabled output of either:
  - MeCab (Kudo, Yamamoto and Matsumoto, 2004) with the UniDic model (Den, Nakamura, Ogiso, and Ogura, 2008), or
  - Comainu, the COrpus based Morphological Analyzer with INtegrated User dictionary (小澤俊介, 内元清貴, 伝康晴, 2014)
- a model for the Berkeley Parser (Petrov and Klein, 2007) trained on data from the Kainoki Treebank.

The parser pipeline is described in the following paper:

Horn, Stephen Wright, Butler, Alastair, and Yoshimoto, Kei. 2017. Keyaki Treebank segmentation and part-of-speech labelling. 『言語処理学会第23回年次大会発表論文集』 Proceedings of the Twenty Third Annual Meeting of the Association of Natural Language Processing, pages 414-417. The Association for Natural Language Processing.

All components are distributed freely.

Acknowledgements

This software is developed as part of the project Development of and Linguistic Research with a Parsed Corpus of Japanese of the National Institute for Japanese Language and Linguistics.

Feedback

Feedback is extremely welcome. Please email: ajb129 __AT__ hotmail __DOT__ com.

BCCWJ_to_tnt(1)

NAME

BCCWJ_to_tnt - convert M-XML to NPCMJ tags

SYNOPSIS

BCCWJ_to_tnt [OPTIONS]

DESCRIPTION

Filter that takes files in the M-XML (morphology-based XML) format of the BCCWJ (http://www.ninjal.ac.jp/corpus_center/bccwj), which contain both the document structure information and the results of dual POS analysis, and returns the POS analysis with NPCMJ tags.

OPTIONS

--tree)	show intermediate tree from tnt_collapse
--raw)	without tnt_collapse
--example)	show an example
*)	show this help message

EXAMPLE

$ cat << EOF | BCCWJ_to_tnt
> <paragraph>
> <sentence>
> <LUW l_lemma="両耳受聴" l_lForm="リョウミミジュチョウ" l_pos="名詞-普通名詞-一般">
> <SUW lemma="両耳" lForm="リョウミミ" pos="名詞-普通名詞-一般" pron="リョーミミ">両耳</SUW>
> <SUW lemma="受聴" lForm="ジュチョウ" pos="名詞-普通名詞-一般" pron="ジュチョー">受聴</SUW>
> </LUW>
> <LUW l_lemma="によって" l_lForm="ニヨッテ" l_pos="助詞-格助詞">
> <SUW lemma="に" lForm="ニ" pos="助詞-格助詞" pron="ニ">に</SUW>
> <SUW lemma="因る" lForm="ヨル" pos="動詞-一般" pron="ヨッ" cType="五段-ラ行" cForm="連用形-促音便">よっ</SUW>
> <SUW lemma="て" lForm="テ" pos="助詞-接続助詞" pron="テ">て</SUW>
> </LUW>
> <LUW l_lemma="得る" l_lForm="エル" l_pos="動詞-一般" l_cType="下一段-ア行" l_cForm="終止形-一般">
> <SUW lemma="得る" lForm="エル" pos="動詞-非自立可能" pron="ウル" cType="下一段-ア行" cForm="終止形-一般">得る</SUW>
> </LUW>
> <LUW l_lemma="情報" l_lForm="ジョウホウ" l_pos="名詞-普通名詞-一般">
> <SUW lemma="情報" lForm="ジョウホウ" pos="名詞-普通名詞-一般" pron="ジョーホー">情報</SUW>
> </LUW>
> <LUW l_lemma="に" l_lForm="ニ" l_pos="助詞-格助詞">
> <SUW lemma="に" lForm="ニ" pos="助詞-格助詞" pron="ニ">に</SUW>
> </LUW>
> <LUW l_lemma="は" l_lForm="ハ" l_pos="助詞-係助詞">
> <SUW lemma="は" lForm="ハ" pos="助詞-係助詞" pron="ワ">は</SUW>
> </LUW>
> <LUW l_lemma="パワースペクトル情報" l_lForm="パワースペクトルジョウホウ" l_pos="名詞-普通名詞-一般">
> <SUW lemma="パワー-power" lForm="パワー" pos="名詞-普通名詞-一般" pron="パワー">パワー</SUW>
> <SUW lemma="スペクトル-spectre" lForm="スペクトル" pos="名詞-普通名詞-一般" pron="スペクトル">スペクトル</SUW>
> <SUW lemma="情報" lForm="ジョウホウ" pos="名詞-普通名詞-一般" pron="ジョーホー">情報</SUW>
> </LUW>
> <LUW l_lemma="と" l_lForm="ト" l_pos="助詞-格助詞">
> <SUW lemma="と" lForm="ト" pos="助詞-格助詞" pron="ト">と</SUW>
> </LUW>
> <LUW l_lemma="両耳間位相差" l_lForm="リョウジカンイソウサ" l_pos="名詞-普通名詞-一般">
> <SUW lemma="両耳" lForm="リョウジ" pos="名詞-普通名詞-一般" pron="リョージ">両耳</SUW>
> <SUW lemma="間" lForm="カン" pos="接尾辞-名詞的-副詞可能" pron="カン">間</SUW>
> <SUW lemma="位相" lForm="イソウ" pos="名詞-普通名詞-一般" pron="イソー">位相</SUW>
> <SUW lemma="差" lForm="サ" pos="名詞-普通名詞-一般" pron="サ">差</SUW>
> </LUW>
> <LUW l_lemma="が" l_lForm="ガ" l_pos="助詞-格助詞">
> <SUW lemma="が" lForm="ガ" pos="助詞-格助詞" pron="ガ">が</SUW>
> </LUW>
> <LUW l_lemma="有る" l_lForm="アル" l_pos="動詞-一般" l_cType="五段-ラ行" l_cForm="連用形-一般">
> <SUW lemma="有る" lForm="アル" pos="動詞-非自立可能" pron="アリ" cType="五段-ラ行" cForm="連用形-一般">あり</SUW>
> </LUW>
> <LUW l_lemma="ます" l_lForm="マス" l_pos="助動詞" l_cType="助動詞-マス" l_cForm="終止形-一般">
> <SUW lemma="ます" lForm="マス" pos="助動詞" pron="マス" cType="助動詞-マス" cForm="終止形-一般">ます</SUW>
> </LUW>
> </sentence>
> </paragraph>
> EOF
-| 両耳受聴	N
-| によって	P-ROLE
-| 得る	VB0
-| 情報	N
-| に	P-ROLE
-| は	P-OPTR
-| パワースペクトル情報	N
-| と	P-ROLE
-| 両耳間位相差	N
-| が	P-ROLE
-| あり	VB;{有る}
-| ます	AX
-| EOS
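
The input side of this conversion can be explored without the tool itself. The following Python sketch is not the BCCWJ_to_tnt implementation; it only shows how the LUW (long unit word) layer of the M-XML sample above can be read with the standard library by joining the SUW text under each LUW. The helper name `luw_rows` is hypothetical, and the mapping from UniDic POS labels to NPCMJ tags is left out because it is not documented here.

```python
import xml.etree.ElementTree as ET

# Two LUWs copied from the example input above.
SAMPLE = """<paragraph>
<sentence>
<LUW l_lemma="両耳受聴" l_lForm="リョウミミジュチョウ" l_pos="名詞-普通名詞-一般">
<SUW lemma="両耳" lForm="リョウミミ" pos="名詞-普通名詞-一般" pron="リョーミミ">両耳</SUW>
<SUW lemma="受聴" lForm="ジュチョウ" pos="名詞-普通名詞-一般" pron="ジュチョー">受聴</SUW>
</LUW>
<LUW l_lemma="によって" l_lForm="ニヨッテ" l_pos="助詞-格助詞">
<SUW lemma="に" lForm="ニ" pos="助詞-格助詞" pron="ニ">に</SUW>
<SUW lemma="因る" lForm="ヨル" pos="動詞-一般" pron="ヨッ" cType="五段-ラ行" cForm="連用形-促音便">よっ</SUW>
<SUW lemma="て" lForm="テ" pos="助詞-接続助詞" pron="テ">て</SUW>
</LUW>
</sentence>
</paragraph>"""


def luw_rows(xml_text):
    """Return (surface, l_pos) pairs, one per LUW (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for luw in root.iter("LUW"):
        # The LUW surface form is the concatenation of its SUW texts.
        surface = "".join(suw.text or "" for suw in luw.iter("SUW"))
        rows.append((surface, luw.get("l_pos")))
    return rows


for surface, l_pos in luw_rows(SAMPLE):
    print(f"{surface}\t{l_pos}")
```

The printed POS labels are still the UniDic ones from `l_pos`; producing tags such as N or P-ROLE is the part that BCCWJ_to_tnt adds.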

BCCWJ_to_unidic(1)

NAME

BCCWJ_to_unidic - convert M-XML to UNIDIC mecab analysis

SYNOPSIS

BCCWJ_to_unidic [OPTIONS]

DESCRIPTION

Filter that takes files in the M-XML (morphology-based XML) format of the BCCWJ, which contain both the document structure information and the results of dual POS analysis, and returns the POS analysis in UNIDIC mecab format.

OPTIONS

--example)	show an example
*)	show this help message

EXAMPLE

cat << EOF | BCCWJ_to_unidic
<paragraph>
<sentence>
<LUW l_lemma="両耳受聴" l_lForm="リョウミミジュチョウ" l_pos="名詞-普通名詞-一般">
<SUW lemma="両耳" lForm="リョウミミ" pos="名詞-普通名詞-一般" pron="リョーミミ">両耳</SUW>
<SUW lemma="受聴" lForm="ジュチョウ" pos="名詞-普通名詞-一般" pron="ジュチョー">受聴</SUW>
</LUW>
<LUW l_lemma="によって" l_lForm="ニヨッテ" l_pos="助詞-格助詞">
<SUW lemma="に" lForm="ニ" pos="助詞-格助詞" pron="ニ">に</SUW>
<SUW lemma="因る" lForm="ヨル" pos="動詞-一般" pron="ヨッ" cType="五段-ラ行" cForm="連用形-促音便">よっ</SUW>
<SUW lemma="て" lForm="テ" pos="助詞-接続助詞" pron="テ">て</SUW>
</LUW>
<LUW l_lemma="得る" l_lForm="エル" l_pos="動詞-一般" l_cType="下一段-ア行" l_cForm="終止形-一般">
<SUW lemma="得る" lForm="エル" pos="動詞-非自立可能" pron="ウル" cType="下一段-ア行" cForm="終止形-一般">得る</SUW>
</LUW>
<LUW l_lemma="情報" l_lForm="ジョウホウ" l_pos="名詞-普通名詞-一般">
<SUW lemma="情報" lForm="ジョウホウ" pos="名詞-普通名詞-一般" pron="ジョーホー">情報</SUW>
</LUW>
<LUW l_lemma="に" l_lForm="ニ" l_pos="助詞-格助詞">
<SUW lemma="に" lForm="ニ" pos="助詞-格助詞" pron="ニ">に</SUW>
</LUW>
<LUW l_lemma="は" l_lForm="ハ" l_pos="助詞-係助詞">
<SUW lemma="は" lForm="ハ" pos="助詞-係助詞" pron="ワ">は</SUW>
</LUW>
<LUW l_lemma="パワースペクトル情報" l_lForm="パワースペクトルジョウホウ" l_pos="名詞-普通名詞-一般">
<SUW lemma="パワー-power" lForm="パワー" pos="名詞-普通名詞-一般" pron="パワー">パワー</SUW>
<SUW lemma="スペクトル-spectre" lForm="スペクトル" pos="名詞-普通名詞-一般" pron="スペクトル">スペクトル</SUW>
<SUW lemma="情報" lForm="ジョウホウ" pos="名詞-普通名詞-一般" pron="ジョーホー">情報</SUW>
</LUW>
<LUW l_lemma="と" l_lForm="ト" l_pos="助詞-格助詞">
<SUW lemma="と" lForm="ト" pos="助詞-格助詞" pron="ト">と</SUW>
</LUW>
<LUW l_lemma="両耳間位相差" l_lForm="リョウジカンイソウサ" l_pos="名詞-普通名詞-一般">
<SUW lemma="両耳" lForm="リョウジ" pos="名詞-普通名詞-一般" pron="リョージ">両耳</SUW>
<SUW lemma="間" lForm="カン" pos="接尾辞-名詞的-副詞可能" pron="カン">間</SUW>
<SUW lemma="位相" lForm="イソウ" pos="名詞-普通名詞-一般" pron="イソー">位相</SUW>
<SUW lemma="差" lForm="サ" pos="名詞-普通名詞-一般" pron="サ">差</SUW>
</LUW>
<LUW l_lemma="が" l_lForm="ガ" l_pos="助詞-格助詞">
<SUW lemma="が" lForm="ガ" pos="助詞-格助詞" pron="ガ">が</SUW>
</LUW>
<LUW l_lemma="有る" l_lForm="アル" l_pos="動詞-一般" l_cType="五段-ラ行" l_cForm="連用形-一般">
<SUW lemma="有る" lForm="アル" pos="動詞-非自立可能" pron="アリ" cType="五段-ラ行" cForm="連用形-一般">あり</SUW>
</LUW>
<LUW l_lemma="ます" l_lForm="マス" l_pos="助動詞" l_cType="助動詞-マス" l_cForm="終止形-一般">
<SUW lemma="ます" lForm="マス" pos="助動詞" pron="マス" cType="助動詞-マス" cForm="終止形-一般">ます</SUW>
</LUW>
</sentence>
</paragraph>
EOF
両耳	リョーミミ	リョウミミ	両耳	名詞-普通名詞-一般		
受聴	ジュチョー	ジュチョウ	受聴	名詞-普通名詞-一般		
に	ニ	ニ	に	助詞-格助詞		
よっ	ヨッ	ヨル	因る	動詞-一般	五段-ラ行	連用形-促音便
て	テ	テ	て	助詞-接続助詞		
得る	ウル	エル	得る	動詞-非自立可能	下一段-ア行	終止形-一般
情報	ジョーホー	ジョウホウ	情報	名詞-普通名詞-一般		
に	ニ	ニ	に	助詞-格助詞		
は	ワ	ハ	は	助詞-係助詞		
パワー	パワー	パワー	パワー-power	名詞-普通名詞-一般		
スペクトル	スペクトル	スペクトル	スペクトル-spectre	名詞-普通名詞-一般		
情報	ジョーホー	ジョウホウ	情報	名詞-普通名詞-一般		
と	ト	ト	と	助詞-格助詞		
両耳	リョージ	リョウジ	両耳	名詞-普通名詞-一般		
間	カン	カン	間	接尾辞-名詞的-副詞可能		
位相	イソー	イソウ	位相	名詞-普通名詞-一般		
差	サ	サ	差	名詞-普通名詞-一般		
が	ガ	ガ	が	助詞-格助詞		
あり	アリ	アル	有る	動詞-非自立可能	五段-ラ行	連用形-一般
ます	マス	マス	ます	助動詞	助動詞-マス	終止形-一般
EOS
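
Judging from the example output above, each SUW becomes one tab-separated row with the columns surface form, pron, lForm, lemma, pos, cType and cForm, with empty cells where an attribute is absent. The Python sketch below reproduces that layout under this assumption; the column order is inferred from the example rather than from a specification, and `suw_lines` is a hypothetical name, not part of HARUNIWA2.

```python
import xml.etree.ElementTree as ET

# Attribute order inferred from the example output (an assumption).
FIELDS = ("pron", "lForm", "lemma", "pos", "cType", "cForm")

SAMPLE = """<sentence>
<LUW l_lemma="によって" l_lForm="ニヨッテ" l_pos="助詞-格助詞">
<SUW lemma="に" lForm="ニ" pos="助詞-格助詞" pron="ニ">に</SUW>
<SUW lemma="因る" lForm="ヨル" pos="動詞-一般" pron="ヨッ" cType="五段-ラ行" cForm="連用形-促音便">よっ</SUW>
</LUW>
</sentence>"""


def suw_lines(xml_text):
    """Flatten every SUW of every sentence into one tab-separated row."""
    root = ET.fromstring(xml_text)
    lines = []
    for sentence in root.iter("sentence"):
        for suw in sentence.iter("SUW"):
            # Missing attributes (e.g. cType on particles) become empty cells.
            cols = [suw.text or ""] + [suw.get(name, "") for name in FIELDS]
            lines.append("\t".join(cols))
        lines.append("EOS")
    return lines


print("\n".join(suw_lines(SAMPLE)))
```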

Comainu_to_BCCWJ(1)

NAME

Comainu_to_BCCWJ - transform Comainu tabled analysis

SYNOPSIS

Comainu_to_BCCWJ [OPTIONS]

DESCRIPTION

Filter that takes Comainu tabled analysis (http://comainu.org) as input and returns the data in the M-XML (morphology-based XML) format of the BCCWJ.

OPTIONS

--example)	show an example
*)	show this help message

EXAMPLE

$ cat << EOF | Comainu_to_BCCWJ
> B	両耳	リョーミミ	リョウミミ	両耳	名詞-普通名詞-一般			名詞-普通名詞-一般	*	*	リョウミミジュチョウ	両耳受聴	両耳受聴
> 	受聴	ジュチョー	ジュチョウ	受聴	名詞-普通名詞-一般			*	*	*	*	*	*
> 	に	ニ	ニ	に	助詞-格助詞			助詞-格助詞	*	*	ニヨッテ	によって	によって
> 	よっ	ヨッ	ヨル	因る	動詞-一般	五段-ラ行	連用形-促音便	*	*	*	*	*	*
> 	て	テ	テ	て	助詞-接続助詞			*	*	*	*	*	*
> 	得る	ウル	エル	得る	動詞-非自立可能	下一段-ア行	終止形-一般	動詞-一般	下一段-ア行	終止形-一般	エル	得る	得る
> 	情報	ジョーホー	ジョウホウ	情報	名詞-普通名詞-一般			名詞-普通名詞-一般	*	*	ジョウホウ	情報	情報
> 	に	ニ	ニ	に	助詞-格助詞			助詞-格助詞	*	*	ニ	に	に
> 	は	ワ	ハ	は	助詞-係助詞			助詞-係助詞	*	*	ハ	は	は
> 	パワー	パワー	パワー	パワー-power	名詞-普通名詞-一般			名詞-普通名詞-一般	*	*	パワースペクトルジョウホウ	パワースペクトル情報	パワースペクトル情報
> 	スペクトル	スペクトル	スペクトル	スペクトル-spectre	名詞-普通名詞-一般			*	*	*	*	*	*
> 	情報	ジョーホー	ジョウホウ	情報	名詞-普通名詞-一般			*	*	*	*	*	*
> 	と	ト	ト	と	助詞-格助詞			助詞-格助詞	*	*	ト	と	と
> 	両耳	リョージ	リョウジ	両耳	名詞-普通名詞-一般			名詞-普通名詞-一般	*	*	リョウジカンイソウサ	両耳間位相差	両耳間位相差
> 	間	カン	カン	間	接尾辞-名詞的-副詞可能			*	*	*	*	*	*
> 	位相	イソー	イソウ	位相	名詞-普通名詞-一般			*	*	*	*	*	*
> 	差	サ	サ	差	名詞-普通名詞-一般			*	*	*	*	*	*
> 	が	ガ	ガ	が	助詞-格助詞			助詞-格助詞	*	*	ガ	が	が
> 	あり	アリ	アル	有る	動詞-非自立可能	五段-ラ行	連用形-一般	動詞-一般	五段-ラ行	連用形-一般	アル	有る	あり
> 	ます	マス	マス	ます	助動詞	助動詞-マス	終止形-一般	助動詞	助動詞-マス	終止形-一般	マス	ます	ます
> EOS
> EOF
-| <paragraph>
-| <sentence>
-| <LUW l_lemma="両耳受聴" l_lForm="リョウミミジュチョウ" l_pos="名詞-普通名詞-一般">
-| <SUW lemma="両耳" lForm="リョウミミ" pos="名詞-普通名詞-一般" pron="リョーミミ">両耳</SUW>
-| <SUW lemma="受聴" lForm="ジュチョウ" pos="名詞-普通名詞-一般" pron="ジュチョー">受聴</SUW>
-| </LUW>
-| <LUW l_lemma="によって" l_lForm="ニヨッテ" l_pos="助詞-格助詞">
-| <SUW lemma="に" lForm="ニ" pos="助詞-格助詞" pron="ニ">に</SUW>
-| <SUW lemma="因る" lForm="ヨル" pos="動詞-一般" pron="ヨッ" cType="五段-ラ行" cForm="連用形-促音便">よっ</SUW>
-| <SUW lemma="て" lForm="テ" pos="助詞-接続助詞" pron="テ">て</SUW>
-| </LUW>
-| <LUW l_lemma="得る" l_lForm="エル" l_pos="動詞-一般" l_cType="下一段-ア行" l_cForm="終止形-一般">
-| <SUW lemma="得る" lForm="エル" pos="動詞-非自立可能" pron="ウル" cType="下一段-ア行" cForm="終止形-一般">得る</SUW>
-| </LUW>
-| <LUW l_lemma="情報" l_lForm="ジョウホウ" l_pos="名詞-普通名詞-一般">
-| <SUW lemma="情報" lForm="ジョウホウ" pos="名詞-普通名詞-一般" pron="ジョーホー">情報</SUW>
-| </LUW>
-| <LUW l_lemma="に" l_lForm="ニ" l_pos="助詞-格助詞">
-| <SUW lemma="に" lForm="ニ" pos="助詞-格助詞" pron="ニ">に</SUW>
-| </LUW>
-| <LUW l_lemma="は" l_lForm="ハ" l_pos="助詞-係助詞">
-| <SUW lemma="は" lForm="ハ" pos="助詞-係助詞" pron="ワ">は</SUW>
-| </LUW>
-| <LUW l_lemma="パワースペクトル情報" l_lForm="パワースペクトルジョウホウ" l_pos="名詞-普通名詞-一般">
-| <SUW lemma="パワー-power" lForm="パワー" pos="名詞-普通名詞-一般" pron="パワー">パワー</SUW>
-| <SUW lemma="スペクトル-spectre" lForm="スペクトル" pos="名詞-普通名詞-一般" pron="スペクトル">スペクトル</SUW>
-| <SUW lemma="情報" lForm="ジョウホウ" pos="名詞-普通名詞-一般" pron="ジョーホー">情報</SUW>
-| </LUW>
-| <LUW l_lemma="と" l_lForm="ト" l_pos="助詞-格助詞">
-| <SUW lemma="と" lForm="ト" pos="助詞-格助詞" pron="ト">と</SUW>
-| </LUW>
-| <LUW l_lemma="両耳間位相差" l_lForm="リョウジカンイソウサ" l_pos="名詞-普通名詞-一般">
-| <SUW lemma="両耳" lForm="リョウジ" pos="名詞-普通名詞-一般" pron="リョージ">両耳</SUW>
-| <SUW lemma="間" lForm="カン" pos="接尾辞-名詞的-副詞可能" pron="カン">間</SUW>
-| <SUW lemma="位相" lForm="イソウ" pos="名詞-普通名詞-一般" pron="イソー">位相</SUW>
-| <SUW lemma="差" lForm="サ" pos="名詞-普通名詞-一般" pron="サ">差</SUW>
-| </LUW>
-| <LUW l_lemma="が" l_lForm="ガ" l_pos="助詞-格助詞">
-| <SUW lemma="が" lForm="ガ" pos="助詞-格助詞" pron="ガ">が</SUW>
-| </LUW>
-| <LUW l_lemma="有る" l_lForm="アル" l_pos="動詞-一般" l_cType="五段-ラ行" l_cForm="連用形-一般">
-| <SUW lemma="有る" lForm="アル" pos="動詞-非自立可能" pron="アリ" cType="五段-ラ行" cForm="連用形-一般">あり</SUW>
-| </LUW>
-| <LUW l_lemma="ます" l_lForm="マス" l_pos="助動詞" l_cType="助動詞-マス" l_cForm="終止形-一般">
-| <SUW lemma="ます" lForm="マス" pos="助動詞" pron="マス" cType="助動詞-マス" cForm="終止形-一般">ます</SUW>
-| </LUW>
-| </sentence>
-| </paragraph>
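
In the example input, a row whose ninth column (the LUW part of speech) is not `*` appears to open a new long unit word, and rows with `*` there continue the current one. The Python sketch below groups rows on that assumption. It is an illustration of the grouping only, not the Comainu_to_BCCWJ code; the column positions are read off the example, and `group_luws` is a hypothetical name.

```python
def group_luws(lines):
    """Group Comainu rows into LUWs, assuming column 9 marks a LUW start."""
    luws = []
    for line in lines:
        if line == "EOS":
            continue
        cols = line.split("\t")
        if cols[8] != "*":
            # Columns 9, 12 and 13 hold the LUW pos, lForm and lemma.
            luws.append({"l_pos": cols[8], "l_lForm": cols[11],
                         "l_lemma": cols[12], "suws": []})
        luws[-1]["suws"].append(cols[1])  # column 2 is the SUW surface form
    return luws


# Three rows from the example, written with explicit tab joins.
ROWS = [
    "\t".join(["", "に", "ニ", "ニ", "に", "助詞-格助詞", "", "",
               "助詞-格助詞", "*", "*", "ニヨッテ", "によって", "によって"]),
    "\t".join(["", "よっ", "ヨッ", "ヨル", "因る", "動詞-一般", "五段-ラ行",
               "連用形-促音便", "*", "*", "*", "*", "*", "*"]),
    "\t".join(["", "て", "テ", "テ", "て", "助詞-接続助詞", "", "",
               "*", "*", "*", "*", "*", "*"]),
    "EOS",
]

for luw in group_luws(ROWS):
    print(luw["l_lemma"], luw["l_pos"], luw["suws"])
```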

csearch_to_top(1)

NAME

csearch_to_top - change treebank format

SYNOPSIS

csearch_to_top [OPTIONS]

DESCRIPTION

Filter to transform CorpusSearch treebank data into Penn Treebank format.

OPTIONS

--keep)	keep ID
--example)	show an example
*)	show this help message

EXAMPLE

$ cat << EOF | csearch_to_top | munge-trees -p
> ( (IP-MAT (PP (NP (N 授業))
>               (P が))
>           (NP-SBJ *が*)
>           (VB 終わる)
>           (PU 。))
>   (ID 7_textbook_kisonihongo;page_13;AT1-7;JP))
> EOF
-| (TOP (IP-MAT (PP (NP (N 授業))
-|                  (P が))
-|              (NP-SBJ *が*)
-|              (VB 終わる)
-|              (PU 。)))
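
The example suggests that the default transformation drops the ID node and replaces the unlabelled wrapper with a TOP node. The Python sketch below mimics that on a single bracketed tree. It is only an approximation based on the example: the function names are hypothetical, and the behaviour of `--keep` is not modelled.

```python
import re


def read_tree(text):
    """Parse one bracketed tree into nested lists of strings."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return items

    return node()


def write_tree(tree):
    """Serialize nested lists back to a one-line bracketed tree."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(write_tree(item) for item in tree) + ")"


def to_top(text):
    """Drop (ID ...) children of the wrapper and relabel the wrapper as TOP."""
    wrapper = read_tree(text)
    kept = [child for child in wrapper
            if not (isinstance(child, list) and child and child[0] == "ID")]
    return write_tree(["TOP"] + kept)


print(to_top("( (IP-MAT (PP (NP (N 授業)) (P が)) (NP-SBJ *が*) (VB 終わる) (PU 。)) "
             "(ID 7_textbook_kisonihongo;page_13;AT1-7;JP))"))
```

The output is a single line; in the example above, the indentation comes from piping through munge-trees -p.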

CSJ_to_BCCWJ(1)

NAME

CSJ_to_BCCWJ - convert CSJ XML into BCCWJ M-XML

SYNOPSIS

CSJ_to_BCCWJ

DESCRIPTION

Filter that takes CSJ XML content and returns BCCWJ M-XML (morphology-based XML) format.

haruniwa2(1)

NAME

haruniwa2 - parse input

SYNOPSIS

haruniwa2 [OPTIONS]

DESCRIPTION

Parse input with the HARUNIWA grammar model for Japanese using the Berkeley Parser.

Input should be in TnT format where each line contains one word token and one word class tag separated by a single tab character. EOS indicates end-of-sentence.
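
As a rough check of input before parsing, the TnT layout described above can be validated with a few lines of Python. This is a minimal sketch, not part of HARUNIWA2: `read_tnt` is a hypothetical helper, and treating blank lines as ignorable is an assumption.

```python
def read_tnt(lines):
    """Split TnT lines into sentences of (word, tag) pairs.

    Raises ValueError for any line that is neither EOS nor word<TAB>tag.
    """
    sentences, current = [], []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue  # assumption: blank lines carry no information
        if line == "EOS":
            sentences.append(current)
            current = []
            continue
        parts = line.split("\t")
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"line {number}: expected word<TAB>tag, got {line!r}")
        current.append((parts[0], parts[1]))
    if current:
        raise ValueError("input ended without a closing EOS")
    return sentences


print(read_tnt(["授業\tN", "が\tP", "終わる\tVB", "。\tPU", "EOS"]))
```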

OPTIONS

-i|--id)	must receive an argument giving ID basename
--raw)	output parse without any modification (default is to post-process with parse_decorate and parse_finish)
--top|-top)	output with TOP as root node
--basic|-basic)	output without root node wrapping
--rank|-rank)	output prepared for reranking
[0-9]*)	number of most probable parse trees to output (default is 1)
--example)	show examples
-*)	show this help message
*)	location of grammar model

EXAMPLES

$ cat << EOF | haruniwa2 | munge-trees -p
> すもも	N
> も	P
> もも	N
> も	P
> もも	N
> の	P
> うち	N
> 。	PU
> EOS
> にわ	N
> に	P
> は	P
> に	NUM
> わ	CL
> の	P
> にわとり	N
> が	P
> いる	VB
> 。	PU
> EOS
> EOF
-| ( (IP-MAT (PP-SBJ (NP (PP (NP (N すもも))
-|                           (P-OPTR も))
-|                       (N もも))
-|                   (P-OPTR も))
-|           (NP-OB1 (PP (NP (N もも))
-|                       (P-ROLE の))
-|                   (N うち))
-|           (PU 。))
-|   (ID 1_ex1576158326;JP))
-| ( (IP-MAT (PP (NP (N にわ))
-|               (P-ROLE に)
-|               (P-OPTR は))
-|           (PP-SBJ (NP (PP (NP (NUMCLP (NUM に)
-|                                       (CL わ)))
-|                           (P-ROLE の))
-|                       (N にわとり))
-|                   (P-ROLE が))
-|           (VB いる)
-|           (PU 。))
-|   (ID 2_ex1576158326;JP))

$ cat << EOF | haruniwa2 3
> ゴスタック	N
> は	P
> ドッシュ	N
> を	P
> ディスティム	VB
> し	VB0
> ます	AX
> 。	PU
> EOS
> EOF
-| ( (IP-MAT (PP-SBJ (NP (N ゴスタック)) (P-OPTR は)) (PP-OB1 (NP (N ドッシュ)) (P-ROLE を)) (VB ディスティム) (VB0 し) (AX ます) (PU 。)) (ID 1_ex1576158330;JP))
-| ( (IP-MAT (PP-SBJ (NP (N ゴスタック)) (P-OPTR は)) (PP-CZZ (NP (N ドッシュ)) (P-ROLE を)) (VB ディスティム) (VB0 し) (AX ます) (PU 。)) (ID 2_ex1576158330;JP))
-| ( (IP-MAT (NP-SBJ *pro*) (PP-OB1 (NP (PP (NP (N ゴスタック)) (P-OPTR は)) (N ドッシュ)) (P-ROLE を)) (VB ディスティム) (VB0 し) (AX ます) (PU 。)) (ID 3_ex1576158330;JP))

haruniwa2_scaffold(1)

NAME

haruniwa2_scaffold - modify NPCMJ word class information

SYNOPSIS

haruniwa2_scaffold [OPTIONS]

DESCRIPTION

Filter to build parse structure, merge and demerge word segments, and disambiguate word class information.

OPTIONS

--example)	show an example
*)	show this help message

EXAMPLE

$ cat << EOF | haruniwa2_scaffold
> ( (IP-MAT (N 大学) (P-ROLE まで) (WADV どう) (VB2;遣る やっ) (P-CONN て) (VB2;行く 行き) (AX ます) (P-FINAL か) (PU 。)) (ID example;JP))
> EOF
-| ( (IP-MAT (NP (N 大学)) (P-ROLE まで) (ADVP (WADV どうやって)) (VB;行く 行き) (AX ます) (P-FINAL か) (PU 。)) (ID example;JP))

inline_to_tnt(1)

NAME

inline_to_tnt - convert inline to TnT format

SYNOPSIS

inline_to_tnt [OPTIONS]

DESCRIPTION

Filter to convert tagged information in inline format to TnT format. In TnT format, each line contains one word token and one word class tag separated by a single tab character. EOS indicates end-of-sentence.

OPTIONS

--divider|--div)	set divider
--example)	show an example
*)	show this help message

EXAMPLE

$ cat << EOF | inline_to_tnt
> 花子_NPR は_P 赤い_ADJI コート_N を_P 着_VB た_AXD 。_PU
> EOF
-| 花子	NPR
-| は	P
-| 赤い	ADJI
-| コート	N
-| を	P
-| 着	VB
-| た	AXD
-| 。	PU
-| EOS
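
The conversion shown above is simple enough to sketch in Python. The version below assumes one sentence per input line, whitespace between items, and a split at the last occurrence of the divider; these are guesses from the example, and `inline_to_tnt_lines` is a hypothetical name rather than the tool's code.

```python
def inline_to_tnt_lines(text, divider="_"):
    """Turn inline word_TAG items into TnT lines, one EOS per input line."""
    out = []
    for sentence in text.splitlines():
        items = sentence.split()
        if not items:
            continue
        for item in items:
            # Split at the last divider so words containing it stay intact.
            word, tag = item.rsplit(divider, 1)
            out.append(f"{word}\t{tag}")
        out.append("EOS")
    return out


print("\n".join(inline_to_tnt_lines("花子_NPR は_P 赤い_ADJI コート_N を_P 着_VB た_AXD 。_PU")))
```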

parse_binarize(1)

NAME

parse_binarize - modify treebank data

SYNOPSIS

parse_binarize [OPTIONS]

DESCRIPTION

Binarize treebank data.

OPTIONS

--left)	make tree output binary branching for a left-headed language (e.g., English)
--keepconj)	do not add coordination at the IP level
--example)	show examples
*)	show this help message

EXAMPLES

$ cat << EOF | parse_binarize | munge-trees -p
> ( (IP-MAT (PP-SBJ (NP (N ゴスタック))
>                   (P-OPTR は))
>           (PP-OB1 (NP (N ドッシュ))
>                   (P-ROLE を))
>           (VB ディスティム)
>           (VB0 し)
>           (AX ます)
>           (PU 。))
>   (ID 20_BUFFALO;JP))
> EOF
-| ( (IP-MAT (IML (PP-SBJ (NP (N ゴスタック))
-|                        (P-OPTR は))
-|                (IML (PP-OB1 (NP (N ドッシュ))
-|                             (P-ROLE を))
-|                     (IML (IML (VB ディスティム)
-|                               (VB0 し))
-|                          (AX ます))))
-|           (PU 。))
-|   (ID 20_BUFFALO;JP))

$ cat << EOF | parse_binarize --left | munge-trees -p
> ( (IP-MAT (NP-SBJ (PRO I))
>           (VBD went)
>           (PP (P on)
>               (NP (D a)
>                   (N trip)))
>           (PP (P to)
>               (NP (NPR Kyoto)))
>           (NP-TMP (ADJ last)
>                   (N week))
>           (. .))
>   (ID 41_textbook_djg_basic;page_116;AT2-11;EN))
> EOF
-| ( (IP-MAT (NP-SBJ (PRO I))
-|           (IML (IML (IML (IML (VBD went)
-|                               (PP (P on)
-|                                   (NP (D a)
-|                                       (N trip))))
-|                          (PP (P to)
-|                              (NP (NPR Kyoto))))
-|                     (NP-TMP (ADJ last)
-|                             (N week)))
-|                (. .)))
-|   (ID 41_textbook_djg_basic;page_116;AT2-11;EN))

parse_decorate(1)

NAME

parse_decorate - filter to change tree

OPTIONS

--script)	send tsurgeon script to stdout
--keep)	neither decorate nodes nor reposition functional information to remove stars (other consequences depend on presence of other flags)
--frame2)	keep frame information
--frame*|--sense)	keep frame information
--pruneframe)	remove frame information
--comment)	keep comments
--essence|-e)	retain only essential aspects of parse
--luw)	make long unit words
--removeluw)	remove long unit words
--example)	show an example
*)	show this help message

EXAMPLE

$ cat << EOF | parse_decorate | munge-trees -p
> ( (IP-MAT (NP-SBJ;{SPEAKER_28} *pro*)
>           (PP (NP (NPR O王))
>               (P を))
>           (VB 追お)
>           (MD う))
>   (ID 28_misc_BUFFALO;JP))
> EOF
-| ( (IP-MAT (NP-SBJ;{SPEAKER_28} *pro*)
-|           (PP (NP (NPR O王))
-|               (P-ROLE を))
-|           (VB 追お)
-|           (MD う))
-|   (ID 28_misc_BUFFALO;JP))

parse_undecorate(1)

NAME

parse_undecorate - remove node decorations

OPTIONS

--script)	send tsurgeon script to stdout
--extra)	make extra changes, notably to remove SORT information
--essence|-e)	retain only essential aspects of parse
--example)	show an example
*)	show this help message

EXAMPLE

$ cat << EOF | parse_undecorate | munge-trees -p
> ( (IP-MAT (PP (NP (IP-REL (NP-SBJ *T*)
>                           (PP (NP (N 両耳受聴))
>                               (P-ROLE によって))
>                           (VB 得る))
>                   (N 情報))
>               (P-ROLE に)
>               (P-OPTR は))
>           (PP-SBJ (NP (CONJP (NP (N パワースペクトル情報))
>                              (P-CONN と))
>                       (NP (N 両耳間位相差)))
>                   (P-ROLE が))
>           (VB あり)
>           (AX ます))
>   (ID example;JP))
> EOF
-| ( (IP-MAT (PP (NP (IP-REL (NP-SBJ *T*)
-|                           (PP (NP (N 両耳受聴))
-|                               (P によって))
-|                           (VB 得る))
-|                   (N 情報))
-|               (P に)
-|               (P は))
-|           (PP (NP (CONJP (NP (N パワースペクトル情報))
-|                          (P と))
-|                   (NP (N 両耳間位相差)))
-|               (P が))
-|           (NP-SBJ *が*)
-|           (VB あり)
-|           (AX ます))
-|   (ID example;JP))

tnt_clean(1)

NAME

tnt_clean - clean TnT input

OPTIONS

--full)	full parts-of-speech information
--pron)	pronunciation information
--sense)	sense information
--example)	show an example
*)	show this help message

EXAMPLE

$ cat << EOF | tnt_clean
> 花子;ハナコ	NPR
> は;ワ	P
> 赤い;アカイ	ADJI
> コート;コート	N
> を;オ	P
> 着;キ	VB;着る.0
> た;タ	AXD
> 。	PU
> EOS
> EOF
-| 花子	NPR
-| は	P
-| 赤い	ADJI
-| コート	N
-| を	P
-| 着	VB
-| た	AXD
-| 。	PU
-| EOS
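
In the example, the default behaviour looks like dropping everything from the first `;` onward in both the word column and the tag column. The Python sketch below does exactly that and nothing more. It is an assumption drawn from this one example, `clean_line` is a hypothetical name, and the `--full`, `--pron` and `--sense` options are not modelled.

```python
def clean_line(line):
    """Strip ';'-introduced annotations from the word and tag of a TnT line."""
    if line == "EOS" or "\t" not in line:
        return line
    word, tag = line.split("\t", 1)
    # Keep a bare ";" token as it is instead of reducing it to nothing.
    clean_word = word.split(";", 1)[0] or word
    clean_tag = tag.split(";", 1)[0]
    return f"{clean_word}\t{clean_tag}"


for line in ["花子;ハナコ\tNPR", "着;キ\tVB;着る.0", "。\tPU", "EOS"]:
    print(clean_line(line))
```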

tnt_collapse(1)

NAME

tnt_collapse - modify NPCMJ word class and segmentation information

OPTIONS

--tree)	keep tree structure
--example)	show an example
*)	show this help message

EXAMPLE

$ cat << EOF | tnt_collapse
> 大学	N
> まで	P-ROLE
> どう	WADV
> やっ	VB2;遣る
> て	P-CONN
> 行き	VB2;行く
> ます	AX
> か	P-FINAL
> 。	PU
> EOS
> EOF
-| 大学	N
-| まで	P-ROLE
-| どうやって	WADV
-| 行き	VB;行く
-| ます	AX
-| か	P-FINAL
-| 。	PU
-| EOS

tnt_to_flat_parse(1)

NAME

tnt_to_flat_parse - make basic parse from tnt analysis

OPTIONS

--example)	show an example
*)	show this help message

EXAMPLE

$ cat << EOF | tnt_to_flat_parse
> 授業	N
> が	P
> 終わる	VB
> 。	PU
> EOS
> EOF
-| ( (IP (N 授業) (P が) (VB 終わる) (PU 。)) (ID example;JP))
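
The flat parse in the example places every (tag word) pair directly under a single IP node and adds an ID node. A minimal Python sketch of that layout follows; the name `flat_parse` and the `ident` parameter are hypothetical, and the `example;JP` ID is simply copied from the output above.

```python
def flat_parse(pairs, ident="example"):
    """Build a flat IP tree from (word, tag) pairs for one sentence."""
    body = " ".join(f"({tag} {word})" for word, tag in pairs)
    return f"( (IP {body}) (ID {ident};JP))"


print(flat_parse([("授業", "N"), ("が", "P"), ("終わる", "VB"), ("。", "PU")]))
```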

tnt_to_inline(1)

NAME

tnt_to_inline - convert TnT to inline format

OPTIONS

--sep|-s)	specify separator, e.g., --sep "/"
--example)	show an example
*)	show this help message

EXAMPLE

$ cat << EOF | tnt_to_inline
> 花子	NPR
> は	P
> 赤い	ADJI
> コート	N
> を	P
> 着	VB
> た	AXD
> 。	PU
> EOS
> EOF
-| 花子_NPR は_P 赤い_ADJI コート_N を_P 着_VB た_AXD 。_PU
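
The reverse direction can be sketched in the same way. The Python version below joins each word and tag with the separator and emits one line per sentence at each EOS. It is an illustration based on the example only, and `tnt_to_inline_lines` is a hypothetical name.

```python
def tnt_to_inline_lines(lines, sep="_"):
    """Join TnT word/tag lines into one inline line per sentence."""
    out, current = [], []
    for line in lines:
        if line == "EOS":
            out.append(" ".join(current))
            current = []
        elif line:
            word, tag = line.split("\t")
            current.append(f"{word}{sep}{tag}")
    return out


print("\n".join(tnt_to_inline_lines(["花子\tNPR", "は\tP", "赤い\tADJI", "EOS"])))
```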

BCCWJ_to_tnt(1)	convert M-XML to NPCMJ tags
BCCWJ_to_unidic(1)	convert M-XML to UNIDIC mecab analysis
Comainu_to_BCCWJ(1)	transform Comainu tabled analysis
csearch_to_top(1)	change treebank format
CSJ_to_BCCWJ(1)	convert CSJ XML into BCCWJ M-XML
haruniwa2(1)	parse input
haruniwa2_scaffold(1)	modify NPCMJ word class information
inline_to_tnt(1)	convert inline to TnT format
parse_binarize(1)	modify treebank data
parse_decorate(1)	filter to change tree
parse_undecorate(1)	remove node decorations
tnt_clean(1)	clean TnT input
tnt_collapse(1)	modify NPCMJ word class and segmentation information
tnt_to_flat_parse(1)	make basic parse from tnt analysis
tnt_to_inline(1)	convert TnT to inline format
