| Title: | Convert Chinese Characters into Hanyu Pinyin |
|---|---|
| Description: | Convert Chinese characters into Hanyu Pinyin (the official romanization system for Standard Chinese) with support for tones, toneless output, initials, URL slugs, and valid R variable names. The package was inspired by the now-orphaned CRAN package 'pinyin' (archived in April 2026 after the maintainer became unreachable). 'hanyupinyin' is a ground-up rewrite using the authoritative Unicode Unihan database, a vectorized engine, and modern R practices. Dictionary data are derived from the Unicode Unihan Database (Unicode Consortium, 2025) <https://www.unicode.org/reports/tr38/>. |
| Authors: | Haoran Cui [aut, cre] |
| Maintainer: | Haoran Cui <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.3 |
| Built: | 2026-05-22 02:12:52 UTC |
| Source: | https://github.com/cuihr17/hanyupinyin |
Allows users to extend the built-in phrase table with their own
multi-character phrases and readings. The function automatically detects the
input format and stores both a numeric-tone version and a tone-mark version
internally, so the phrase works correctly with all settings of the tone
argument in to_pinyin().
    add_phrase(phrase, reading)
| Argument | Description |
|---|---|
| phrase | A Chinese character string of at least two characters. |
| reading | The corresponding Pinyin reading. Syllables should be separated by spaces. |
The separator used in reading is independent of the sep
argument to to_pinyin(). The latter controls only the output format.
Invisibly returns NULL.
    # Numeric input -- marks are derived automatically
    add_phrase("\u884c\u957f", "hang2 zhang3")
    to_pinyin("\u94f6\u884c\u884c\u957f", polyphone = TRUE)
    to_pinyin("\u94f6\u884c\u884c\u957f", polyphone = TRUE, tone = "marks")

    # Tone-mark input -- numeric tones are derived automatically
    add_phrase("\u548c\u5e73", "h\u00e9 p\u00edng")
    to_pinyin("\u548c\u5e73", polyphone = TRUE, tone = "marks")

    # Underscore separators are also accepted
    add_phrase("\u6d4b\u8bd5", "ce4_shi4")
    to_pinyin("\u6d4b\u8bd5", polyphone = TRUE)
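The derivation of a tone-mark reading from a numeric-tone reading, as described for add_phrase(), can be illustrated with a small base-R sketch. This is not the package's actual implementation: `numeric_to_marks()` and `reading_to_marks()` are hypothetical helper names, and the placement rule (mark `a` or `e` if present, otherwise the `o` of `ou`, otherwise the last vowel) is the standard Hanyu Pinyin convention, assumed here rather than confirmed from the package source.

```r
# Illustrative sketch only; these helpers are NOT part of 'hanyupinyin'.
plain_vowels <- c("a", "e", "i", "o", "u", "\u00fc")
marked_vowels <- rbind(
  c("\u0101", "\u00e1", "\u01ce", "\u00e0"),  # a, tones 1-4
  c("\u0113", "\u00e9", "\u011b", "\u00e8"),  # e
  c("\u012b", "\u00ed", "\u01d0", "\u00ec"),  # i
  c("\u014d", "\u00f3", "\u01d2", "\u00f2"),  # o
  c("\u016b", "\u00fa", "\u01d4", "\u00f9"),  # u
  c("\u01d6", "\u01d8", "\u01da", "\u01dc")   # u with diaeresis
)

# Convert one syllable such as "zhang3" to its tone-mark form.
numeric_to_marks <- function(syllable) {
  has_digit <- grepl("[0-9]$", syllable)
  tone <- if (has_digit) as.integer(substring(syllable, nchar(syllable))) else 5L
  base <- sub("[0-9]$", "", syllable)
  base <- gsub("v", "\u00fc", base, fixed = TRUE)  # assume "v" stands for u-diaeresis
  if (!tone %in% 1:4) return(base)                 # neutral tone carries no mark

  chars <- strsplit(base, "")[[1]]
  if ("a" %in% chars) {
    pos <- match("a", chars)
  } else if ("e" %in% chars) {
    pos <- match("e", chars)
  } else if (grepl("ou", base, fixed = TRUE)) {
    pos <- match("o", chars)
  } else {
    vowel_pos <- which(chars %in% plain_vowels)
    if (length(vowel_pos) == 0L) return(base)
    pos <- max(vowel_pos)                          # last vowel takes the mark
  }
  chars[pos] <- marked_vowels[match(chars[pos], plain_vowels), tone]
  paste(chars, collapse = "")
}

# Convert a whole reading; spaces and underscores both separate syllables.
reading_to_marks <- function(reading) {
  syllables <- strsplit(reading, "[ _]+")[[1]]
  paste(vapply(syllables, numeric_to_marks, character(1)), collapse = " ")
}

reading_to_marks("hang2 zhang3")
reading_to_marks("ce4_shi4")
```

The first call produces "háng zhǎng", the same tone-mark form that the list_phrases() documentation gives for this reading.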
Returns all user-defined phrases added via add_phrase() in the current
R session, together with their internally-stored numeric-tone and tone-mark
readings.
    list_phrases()
A data frame with three columns:
The Chinese character phrase.
The reading with numeric tones (e.g. "hang2 zhang3").
The reading with diacritic tone marks (e.g. "háng zhǎng").
    list_phrases()
Converts a character vector of Chinese strings into Pinyin romanization.
The function is fully vectorized and uses the Unicode Unihan database
(kMandarin) as its authoritative source.
    to_pinyin(x, sep = "_", tone = TRUE, polyphone = FALSE, other_replace = NULL)
| Argument | Description |
|---|---|
| x | A character vector. |
| sep | Separator inserted between syllables in the output. Default is "_". |
| tone | If |
| polyphone | If |
| other_replace | How to handle non-Chinese characters. |
A character vector of the same length as x.
    to_pinyin("\u6625\u7720\u4e0d\u89c9\u6653")
    to_pinyin("Hello \u4e16\u754c", sep = " ", other_replace = "?")
    to_pinyin("\u94f6\u884c\u884c\u957f", polyphone = TRUE)
    to_pinyin("\u6625\u7720\u4e0d\u89c9\u6653", tone = "marks")
Returns only the first letter of each syllable.
    to_pinyin_initials(x, polyphone = FALSE, other_replace = NULL)
| Argument | Description |
|---|---|
| x | A character vector. |
| polyphone | If |
| other_replace | How to handle non-Chinese characters. |
A character vector of the same length as x.
    to_pinyin_initials("\u4e2d\u534e\u4eba\u6c11\u5171\u548c\u56fd")
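The idea of keeping only the first letter of each syllable can be sketched in base R on an already-romanized string. `pinyin_initials()` is a hypothetical helper, not a package function, and the sketch makes no claim about the case or separator that to_pinyin_initials() itself produces.

```r
# Illustrative sketch only; not part of 'hanyupinyin'.
# Takes separator-delimited pinyin and keeps the first letter of each syllable.
pinyin_initials <- function(pinyin, sep = "_") {
  parts <- strsplit(pinyin, sep, fixed = TRUE)
  vapply(
    parts,
    function(s) paste(substr(s, 1, 1), collapse = ""),
    character(1)
  )
}

pinyin_initials(c("zhong_hua_ren_min_gong_he_guo", "shi_jie"))
# "zhrmghg" "sj"
```

Because strsplit() and vapply() both work element-wise, the helper returns one result per input string, mirroring the "same length as x" contract documented above.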
A convenience wrapper around to_pinyin() with tone = "marks".
    to_pinyin_marks(x, sep = "_", polyphone = FALSE, other_replace = NULL)
| Argument | Description |
|---|---|
| x | A character vector. |
| sep | Separator between syllables. Default is "_". |
| polyphone | If |
| other_replace | How to handle non-Chinese characters. |
A character vector of the same length as x.
    to_pinyin_marks("\u6625\u7720\u4e0d\u89c9\u6653")
    to_pinyin_marks("Hello \u4e16\u754c", sep = " ")
A convenience wrapper around to_pinyin() with tone = FALSE.
    to_pinyin_toneless(x, sep = "_", polyphone = FALSE, other_replace = NULL)
| Argument | Description |
|---|---|
| x | A character vector. |
| sep | Separator between syllables. Default is "_". |
| polyphone | If |
| other_replace | How to handle non-Chinese characters. |
A character vector of the same length as x.
    to_pinyin_toneless("\u6625\u7720\u4e0d\u89c9\u6653")
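For readers who already hold pinyin strings (for example from an earlier to_pinyin() call), toneless output can be approximated in base R by mapping marked vowels back to plain ones and dropping tone digits. `strip_tones()` is a hypothetical helper and only a sketch; how the package itself treats the letter ü in toneless output is not stated here, so the sketch simply keeps it.

```r
# Illustrative sketch only; not part of 'hanyupinyin'.
strip_tones <- function(x) {
  marked <- paste0(
    "\u0101\u00e1\u01ce\u00e0", "\u0113\u00e9\u011b\u00e8",
    "\u012b\u00ed\u01d0\u00ec", "\u014d\u00f3\u01d2\u00f2",
    "\u016b\u00fa\u01d4\u00f9", "\u01d6\u01d8\u01da\u01dc"
  )
  plain <- paste0(
    "aaaa", "eeee", "iiii", "oooo", "uuuu",
    "\u00fc\u00fc\u00fc\u00fc"
  )
  # Replace each tone-marked vowel with its unmarked counterpart.
  x <- chartr(marked, plain, x)
  # Drop numeric tone suffixes; this removes every digit 1-5, so it is
  # only appropriate for strings that contain nothing but pinyin.
  gsub("[1-5]", "", x)
}

strip_tones(c("hang2_zhang3", "h\u00e1ng zh\u01ceng"))
```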
Create URL-Friendly Slug from Chinese Text
    to_slug(x, polyphone = FALSE, other_replace = NULL)
| Argument | Description |
|---|---|
| x | A character vector. |
| polyphone | If |
| other_replace | How to handle non-Chinese characters. |
A character vector of URL-friendly slug strings.
    to_slug("2026\u5e74\u62a5\u544a")
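What "URL-friendly" typically means can be shown with a generic base-R slug routine applied to already-romanized text. `slugify()` is a hypothetical helper, and the choice of lower case with hyphen separators is an assumption for illustration; the exact format returned by to_slug() is not specified in this section.

```r
# Illustrative sketch only; not part of 'hanyupinyin'.
slugify <- function(x) {
  x <- tolower(x)
  # Collapse every run of characters other than a-z and 0-9 into one hyphen.
  x <- gsub("[^a-z0-9]+", "-", x)
  # Trim hyphens left over at either end.
  gsub("^-+|-+$", "", x)
}

slugify(c("2026 nian bao gao", "  Hello, Shi Jie!  "))
# "2026-nian-bao-gao" "hello-shi-jie"
```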
Useful when cleaning imported data (e.g. from SAS or Excel) where column labels are in Chinese.
    to_varname(x, unique = TRUE, abbrev = NULL, polyphone = FALSE, other_replace = NULL)
| Argument | Description |
|---|---|
| x | A character vector. |
| unique | If |
| abbrev | If not |
| polyphone | If |
| other_replace | How to handle non-Chinese characters. |
A character vector of valid R variable names.
    to_varname(c("\u59d3\u540d", "\u5e74\u9f84", "\u6027\u522b"))
    to_varname("\u4e2d\u534e\u4eba\u6c11\u5171\u548c\u56fd", abbrev = 4)
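The two requirements behind "valid R variable names", syntactic validity and uniqueness, can be seen with base R's make.names() on romanized labels. Whether to_varname() relies on make.names() internally is not stated here, so this is only an illustration of the constraints, using made-up column labels.

```r
# Illustrative sketch using base R only; the labels are made-up examples.
raw <- c("xing_ming", "nian_ling", "xing_ming", "2026_nian")

# make.names() prefixes names that start with a digit and, with
# unique = TRUE, appends a numeric suffix to repeated names.
clean <- make.names(raw, unique = TRUE)
clean
# "xing_ming" "nian_ling" "xing_ming.1" "X2026_nian"
```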
A data frame containing Chinese characters and their Hanyu Pinyin readings
extracted from the Unicode Unihan Database (kMandarin field, Version 17.0).
    unihan_pinyin
A data frame with 44348 rows and 4 variables:
The Chinese character.
Pinyin with tone marks (e.g. qiū). Multiple readings are
space-separated.
Pinyin with numeric tones (e.g. qiu1). Multiple
readings are space-separated.
Toneless Pinyin (e.g. qiu). Multiple readings are
space-separated.
Unicode Consortium, Unihan Database, https://www.unicode.org/reports/tr38/
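Because multiple readings are stored space-separated in a single cell, analyses that need one row per reading have to split them first. The sketch below does this in base R on a toy data frame; the column names `char` and `pinyin_num` and the toy values are assumptions for illustration and may not match the actual columns of unihan_pinyin.

```r
# Illustrative sketch on toy data; column names are hypothetical.
toy <- data.frame(
  char = c("\u884c", "\u79cb"),
  pinyin_num = c("xing2 hang2", "qiu1"),
  stringsAsFactors = FALSE
)

# Split each cell on spaces, then repeat the character once per reading.
readings <- strsplit(toy$pinyin_num, " ", fixed = TRUE)
long <- data.frame(
  char = rep(toy$char, lengths(readings)),
  reading = unlist(readings),
  stringsAsFactors = FALSE
)

nrow(long)
# 3
```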