utf8-validator
UTF-8 is the dominant text encoding on the modern web: backwards-compatible with ASCII, able to represent every Unicode code point, and variable-width (1-4 bytes per code point). Invalid UTF-8 sequences cause "mojibake" (garbled text) when interpreted with a different encoding, or outright errors in strict parsers (Python, Rust, JSON parsers). The ZTools UTF-8 Validator checks input bytes (paste hex or upload binary) for valid UTF-8 sequencing, flags invalid bytes, detects byte-order marks, and identifies likely alternative encodings (Latin-1, Windows-1252, GB2312).
Use cases
Debug "mojibake" in scraped text
Text shows "Ã©" instead of "é". Validator reveals the bytes are valid UTF-8 but were decoded as Latin-1.
Validate user uploads
CSV upload with mixed encodings. Validator detects which rows have invalid UTF-8.
Check for BOM (Byte Order Mark)
Some tools add a UTF-8 BOM (0xEF 0xBB 0xBF). Most parsers handle it; some choke. Validator flags it.
Verify a binary file is text
Random binary won't parse as valid UTF-8. Validator confirms or refutes.
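The upload and binary-file checks above boil down to two questions: do the bytes decode as UTF-8, and do they start with a BOM? A minimal sketch in Python (the helper name check_utf8 is hypothetical, not part of the tool):

```python
def check_utf8(data: bytes) -> dict:
    """Report UTF-8 validity and BOM presence for a byte string (illustrative sketch)."""
    has_bom = data.startswith(b"\xef\xbb\xbf")  # UTF-8 BOM: EF BB BF
    try:
        data.decode("utf-8")  # strict decode: raises on any invalid sequence
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {"valid": valid, "bom": has_bom}

print(check_utf8(b"\xef\xbb\xbfhello"))  # {'valid': True, 'bom': True}
print(check_utf8(b"\xe9"))               # {'valid': False, 'bom': False}
```

Random binary almost always fails the strict decode within a few bytes, which is why validity is a good "is this text?" heuristic.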
How it works
- Paste text or hex bytes → Auto-detect: if the input contains only hex characters (and whitespace), treat it as bytes; otherwise treat it as already-decoded text.
- Validate → Walk the bytes. UTF-8 rules: ASCII (0xxxxxxx), 2-byte (110xxxxx 10xxxxxx), 3-byte (1110xxxx 10xxxxxx 10xxxxxx), 4-byte (11110xxx followed by three continuation bytes). Every continuation byte starts with 10.
- Flag issues → Invalid lead bytes, missing continuation bytes, overlong encodings, encoded surrogates (U+D800-U+DFFF).
- Display → Valid: ✓ plus the decoded text. Invalid: the byte position plus a suggested encoding (often Latin-1 / Windows-1252).
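The byte walk in the steps above can be sketched directly. This is an illustrative stand-alone checker (the function name first_invalid is an assumption, not the tool's actual code):

```python
def first_invalid(data: bytes):
    """Return the index of the first invalid byte, or None if data is valid UTF-8.

    A minimal walker over the rules above: lead-byte patterns, 10xxxxxx
    continuation bytes, plus the strict-UTF-8 exclusions (overlong forms,
    surrogates, code points above U+10FFFF).
    """
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII: 0xxxxxxx
            i += 1
            continue
        # need = continuation bytes expected; lo..hi = valid range for the
        # first continuation byte, tightened to rule out bad sequences
        if 0xC2 <= b <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF  # 2-byte lead (C0/C1 would be overlong)
        elif b == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF  # exclude overlong 3-byte forms
        elif b == 0xED:
            need, lo, hi = 2, 0x80, 0x9F  # exclude surrogates U+D800-U+DFFF
        elif 0xE1 <= b <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF  # other 3-byte leads
        elif b == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF  # exclude overlong 4-byte forms
        elif 0xF1 <= b <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F  # exclude code points above U+10FFFF
        else:
            return i                      # orphan continuation byte or invalid lead
        if i + need >= n or not lo <= data[i + 1] <= hi:
            return i                      # truncated or out-of-range sequence
        if any(not 0x80 <= c <= 0xBF for c in data[i + 2 : i + need + 1]):
            return i
        i += need + 1
    return None

print(first_invalid("héllo ✓".encode("utf-8")))  # None
print(first_invalid(b"hi\xe9"))                  # 2
```

Tracking the index, rather than just raising, is what lets a validator report "byte position" as described in the display step.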
Examples
Input: Hex "C3 A9" ("é" in UTF-8)
Output: Valid. Decodes to "é".
Input: Hex "E9" ("é" in Latin-1)
Output: Invalid UTF-8 (0xE9 is a 3-byte lead with no continuation bytes). Likely Latin-1 / Windows-1252.
Input: Hex "EF BB BF E2 9C 93"
Output: Valid UTF-8 with BOM (EF BB BF) prefix. Decodes to "✓".
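The three examples can be reproduced with any strict UTF-8 decoder; a quick Python sketch:

```python
# "C3 A9" is valid UTF-8:
print(bytes.fromhex("C3 A9").decode("utf-8"))             # é

# "E9" alone is not valid UTF-8, but it is "é" in Latin-1:
try:
    bytes.fromhex("E9").decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e.reason)
print(bytes.fromhex("E9").decode("latin-1"))              # é

# "EF BB BF E2 9C 93": the utf-8-sig codec strips the BOM while decoding:
print(bytes.fromhex("EF BB BF E2 9C 93").decode("utf-8-sig"))  # ✓
```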
Frequently asked questions
What's mojibake?
Garbled text caused by encoding/decoding mismatches. "Ã©" is "é" encoded as UTF-8 but decoded as Latin-1 (and often re-encoded as UTF-8 from there). The validator helps trace the chain.
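That chain is easy to demonstrate, and as long as no bytes were lost it is also reversible:

```python
s = "é"
garbled = s.encode("utf-8").decode("latin-1")     # wrong decoder: mojibake
print(garbled)                                    # Ã©

# Reversing the chain recovers the original:
print(garbled.encode("latin-1").decode("utf-8"))  # é
```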
BOM: should I keep it?
For UTF-8, a BOM is optional and most modern parsers handle it. Some don't (Python's csv module reads it into the first field unless the file is opened with encoding='utf-8-sig'). Strip it if unsure.
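Python makes the difference visible: plain utf-8 keeps the BOM as U+FEFF, while the utf-8-sig codec strips it:

```python
raw = b"\xef\xbb\xbfname,age\nada,36\n"   # CSV bytes with a UTF-8 BOM
print(repr(raw.decode("utf-8")[0]))        # '\ufeff': BOM leaks into the first field
print(raw.decode("utf-8-sig")[:4])         # name
```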
Overlong encoding?
Early UTF-8 decoders accepted code points encoded with more bytes than necessary (e.g. "/" 0x2F as C0 AF). Modern strict UTF-8 (RFC 3629) forbids overlong forms because they are a security risk (classic path-traversal filter bypasses). The validator flags them.
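Strict decoders reject the overlong form outright; only the one-byte encoding of "/" is valid:

```python
# Overlong "/" (0x2F) encoded as two bytes C0 AF; a strict decoder rejects it.
try:
    b"\xc0\xaf".decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)

print(b"\x2f".decode("utf-8"))  # /  (the only valid encoding of U+002F)
```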
Privacy?
Everything runs in your browser; no data is sent to a server.
Tips
- For "mojibake" debugging, the chain is usually: source encoding → mis-decoded as X → re-encoded as Y. The validator helps identify each step.
- Always strip the BOM when concatenating UTF-8 files: multiple BOMs in one file confuse parsers.
- For user uploads, validate before processing. Reject invalid UTF-8 at the boundary, not deep in the pipeline.
- For round-trip safety, emit strict UTF-8 as the current spec (RFC 3629) requires: no overlong forms, no encoded surrogates.
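The validate-at-the-boundary tip can be a one-function sketch (the name accept_upload is hypothetical); Python's default strict decode does the work, and the exception carries the offending byte offset:

```python
def accept_upload(data: bytes) -> str:
    """Reject invalid UTF-8 at the boundary (errors='strict' is the decode default)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as e:
        # e.start is the offset of the first bad byte; report it to the client
        raise ValueError(f"upload is not UTF-8 (bad byte at offset {e.start})") from e

print(accept_upload(b"hello"))  # hello
```

Rejecting here, with a byte offset in the error, is far easier to debug than a decode failure deep in the pipeline.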
Try it now
The full utf8-validator runs in your browser at https://ztools.zaions.com/utf8-validator. No signup, no upload, no data leaves your device.
Last updated: 2026-05-06 · Author: Ahsan Mahmood