shrimple 🇵🇱 🏳️‍⚧️

shrimple mind. shrimple problems. complex solutions. she/her

Replace `chardet` Python library immediately

Posted on Saturday, March 14, 2026, by Shrimple

Generative AI.

Replace with charset_normalizer.

There is news about the chardet Python library. Its maintainer relicensed the LGPL project in release 7.0.0, claiming that the full AI rewrite counts as a sufficient clean-room reimplementation. https://github.com/chardet/chardet/issues/327
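
To see whether an environment already has the rewritten release installed, one option is to look at the installed version. This is only a sketch: it assumes the 7.0.0 boundary mentioned above, the helper name `chardet_is_rewrite` is hypothetical, and it relies on nothing beyond the standard library's `importlib.metadata`.

```python
from importlib import metadata


def chardet_is_rewrite(installed=None):
    """Return True for chardet >= 7, False for older, None if not installed.

    `chardet_is_rewrite` is a hypothetical helper. Pass a version string to
    check it directly, or pass nothing to query the current environment.
    """
    if installed is None:
        try:
            installed = metadata.version("chardet")
        except metadata.PackageNotFoundError:
            # chardet is not installed at all.
            return None
    # Only the leading release number matters here: "6.0.0.post1" -> 6.
    major = int(installed.split(".")[0])
    return major >= 7
```

A `None` result means chardet is absent, which is different from an old version being present, so callers should test for it explicitly.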

Here you can read about the backwards compatibility of charset_normalizer: https://charset-normalizer.readthedocs.io/en/latest/user/getstarted.html#backward-compatibility

You can fall back to chardet when replacing the import, if you dare to face the implications (that’s a full AI rewrite done in a two-week pass), by importing it in an `except ModuleNotFoundError` block.
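
A minimal sketch of that fallback, assuming both libraries expose a chardet-style `detect` function at module level (charset_normalizer documents one for backward compatibility). The helper name `load_detect` and the candidate order are illustrative choices, not something either project prescribes.

```python
import importlib


def load_detect(candidates=("charset_normalizer", "chardet")):
    """Return (module_name, detect) from the first importable candidate.

    `load_detect` is a hypothetical helper; the only assumption about the
    libraries is that each exposes a module-level `detect` function.
    """
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ModuleNotFoundError:
            # Not installed: try the next candidate instead of failing.
            continue
        return name, module.detect
    raise ModuleNotFoundError(f"none of {candidates!r} is installed")
```

Calling `name, detect = load_detect()` once at startup and logging `name` makes it visible which detector actually ended up in use, which matters if the fallback is something you would rather not run silently.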

python-charset-normalizer is already present in Ubuntu repositories.

Among other alternatives, there’s faust-cchardet.

You can stop reading here.


How the guy describes it (the entire PR description seems to be partially written by a language model): https://github.com/chardet/chardet/pull/322

This PR is for a ground-up, MIT-licensed rewrite of chardet. It maintains API compatibility with chardet 5.x and 6.x, but with 27x improvements to detection speed, and highly accurate support for even more encodings. It fixes numerous longstanding issues (like poor accuracy on short strings and poor performance), and is just all-around better than previous versions of chardet in every possible respect. It’s even faster, more memory efficient, and more accurate than charset-normalizer, which is something I’m particularly proud of. The test data has also been moved to a separate repo to help prevent any licensing issues having it here might have lead to.

(The emphasis on the charset-normalizer sentence above is mine, so that you notice he already compares against the alternative: he is proud not of being supposedly more accurate than chardet 6.x, but of being more accurate than charset-normalizer, which he has included in the benchmark suite residing in the repo. The reason might be that the alternative had already been superior for a long time; I have not researched that.)

Below you can read the entire commit subject history, verbatim (it might disappear soon if cloning an LGPL project somehow becomes an issue for whoever pays him):

2025-12-18 Remove Python 3.9 support, add Python 3.14 support (#311) 
2026-01-05 Ignore .claude 
2026-01-06 Add github codespace support (#312) 
2026-02-20 Pass max_bytes parameter to UniversalDetector in detect() and detect_all() (#314) 
2026-02-21 Create CLAUDE.md and update AGENTS.md 
2026-02-21 Fix SJISDistributionAnalysis discarding valid second-byte range >= 0x80 (#315) 
2026-02-21 Add test for distribution analysis get_order() coverage of valid characters 
2026-02-21 Pre-release fixes: bump to 6.0.0, fix get_charset crash, cleanup 
2026-02-21 Add --encoding-era CLI flag and improve heuristic selection 
2026-02-21 Add LEGACY_REGIONAL encoding era and reclassify misplaced encodings 
2026-02-21 Update documentation for 6.0.0 release 
2026-02-21 Fix pyright type errors in chardetect.py and test.py 
2026-02-21 Add .readthedocs.yaml to fix RTD builds 
2026-02-21 docs: update copyright to 2015-2026 chardet contributors 
2026-02-21 docs: fix copyright start year and remove first-person reference 
2026-02-21 docs: modernize usage examples and reorganize table of contents 
2026-02-22 Update version to 6.0.0.post1 
2026-02-22 Bump version to 6.0.1.dev0 for future dev 
2026-02-23 Potential fix for code scanning alert no. 2: Workflow does not contain permissions (#320) 
2026-02-23 Refactor create_language_model to cache unfiltered bigrams 
2026-02-23 Add Urdu and Cornish probers that were missing 
2026-02-23 Fix remove_xml_tags treating bare > as tag closer 
2026-02-24 Fix CP932 misclassified as single byte 
2026-02-24 Retrain with latest version of create_language_model 
2026-02-24 Update AGENTS.md now that training is cached 
2026-02-24 Add accuracy investigation design doc 
2026-02-24 Add accuracy investigation implementation plan 
2026-02-25 Fix 20 regressions in XML/markup file detection 
2026-02-25 Improve detection accuracy with non-ASCII bigram analysis 
2026-02-25 Add fallback encoding when no prober meets threshold 
2026-02-26 Remove binary file from ASCII test data 
2026-03-01 Wipe it clean 
2026-02-25 Add chardet rewrite design document 
2026-02-25 Add chardet rewrite implementation plan 
2026-02-25 chore: scaffold project with uv, ruff, pytest, pre-commit 
2026-02-25 feat: add EncodingEra IntFlag enum 
2026-02-25 feat: add encoding registry with era classifications 
2026-02-25 feat: add DetectionResult type for pipeline stages 
2026-02-25 feat: add Stage 0 binary content detection 
2026-02-25 feat: add Stage 1a BOM detection 
2026-02-25 feat: add Stage 1b ASCII detection 
2026-02-25 feat: add Stage 1c UTF-8 structural validation 
2026-02-25 feat: add Stage 1d markup charset extraction 
2026-02-25 feat: add Stage 2a byte validity filtering 
2026-02-25 feat: add Stage 2b multi-byte structural probing 
2026-02-25 feat: add bigram model format, loading, and training pipeline 
2026-02-25 feat: add Stage 3 statistical scoring with parallel support 
2026-02-25 feat: add pipeline orchestrator connecting all stages 
2026-02-25 feat: add detect() and detect_all() public API 
2026-02-25 feat: add UniversalDetector streaming API 
2026-02-25 feat: add chardetect CLI 
2026-02-25 feat: add accuracy test suite against chardet test data 
2026-02-25 feat: add benchmark suite and performance regression tests 
2026-02-25 fix: move BOM before binary detection, add UTF-16/32 null-byte detection 
2026-02-25 feat: add encoding equivalence classes to accuracy test 
2026-02-25 feat: add escape-sequence encoding detection for ISO-2022 and HZ-GB-2312 
2026-02-25 feat: improve accuracy with high-byte bigram weighting and equivalence classes 
2026-02-25 feat: retrain models with CulturaX data and 5000 samples per language 
2026-02-25 fix: restore missing encoding models and clean up training script 
2026-02-25 Retrain with max-samples set to 15000 
2026-02-25 Bump up minimum overall accuracy to 79% 
2026-02-25 Go nuts with ruff rules 
2026-02-25 docs: add accuracy improvements design and diagnostic scripts 
2026-02-25 docs: add accuracy improvements implementation plan 
2026-02-25 test: replace bidirectional equivalences with directional superset classes 
2026-02-25 feat: add windows-1252 fallback and binary confidence to pipeline 
2026-02-25 feat: gate CJK multi-byte candidates on structural evidence 
2026-02-25 perf: cache structural scores to avoid duplicate computation 
2026-02-25 feat: add era-based tiebreaking for close statistical scores 
2026-02-25 Revert "feat: add era-based tiebreaking for close statistical scores" 
2026-02-25 refactor: update diagnostic scripts to use directional equivalences 
2026-02-25 feat: add --encodings filter to train.py for targeted retraining 
2026-02-25 feat: retrain koi8-r/cp866 on Russian-only to improve Cyrillic discrimination 
2026-02-26 feat: retrain Western European models with expanded language coverage 
2026-02-26 refactor: extract shared equivalence logic into chardet.equivalences 
2026-02-26 refactor: improve train.py and add deserialize_models tests 
2026-02-26 docs: add design for test parallelism, perf, and accuracy improvements 
2026-02-26 docs: add implementation plan for test parallelism, perf, and accuracy 
2026-02-26 build: add pytest-xdist for parallel test execution 
2026-02-26 refactor: extract collect_test_files into shared scripts/utils.py 
2026-02-26 test: parametrize accuracy tests per-file for xdist parallelism 
2026-02-26 refactor: consolidate compare scripts into single compare_detectors.py 
2026-02-26 feat: add detection profiling script 
2026-02-26 perf: remove ProcessPoolExecutor from statistical scoring 
2026-02-26 perf: use flat bytearray lookup tables for bigram scoring 
2026-02-26 feat: add encoding equivalences for iso-8859-1, cp037/cp500, gb2312, and Central European encodings 
2026-02-26 feat: add Phase 2 encoding equivalences for Hebrew, Greek, Baltic, Cyrillic, and Western European 
2026-02-26 feat: unify Western European encodings into a single bidirectional equivalence group 
2026-02-26 style: fix ruff FURB110 lint warnings in scripts 
2026-02-26 fix: remove invalid encoding equivalences and add memory/startup benchmarking 
2026-02-26 Add initial performance report for rewrite 
2026-02-26 refactor: unify all detectors to subprocesses and extract benchmark script 
2026-02-26 docs: update performance report with all 7 detectors 
2026-02-26 feat: download test data from chardet/test-data and exclude from builds 
2026-02-26 docs: add design for accuracy improvements to 95% 
2026-02-26 fix: clone test data at collection time when missing 
2026-02-26 docs: add implementation plan for accuracy improvements to 95% 
2026-02-26 feat: add is_equivalent_detection() for base-letter equivalence checking 
2026-02-26 fix: use normalized encoding names in decode and add symbol diff test 
2026-02-26 feat: integrate is_equivalent_detection into test harness and diagnosis script 
2026-02-26 fix: tighten CJK multi-byte gating to reduce false positives 
2026-02-26 feat: add currency/euro symbol equivalence to decoded-output comparison 
2026-02-26 fix: demote iso-8859-10 when no distinguishing bytes present 
2026-02-26 fix: preserve confidence ordering in iso-8859-10 demotion 
2026-02-26 feat: add post-decode mess detection to penalize wrong encodings 
2026-02-26 fix: address mess detection review issues 
2026-02-26 feat: train per-language-encoding bigram models with language detection 
2026-02-26 fix: restore score_bigrams backward compat and add encoding index 
2026-02-26 perf: tune detection pipeline for 95% accuracy target 
2026-02-26 docs: update performance report with 95.4% accuracy 
2026-02-26 fix: address code review issues and remove mess detection 
2026-02-26 perf: optimize scoring loop, binary detection, and registry caching 
2026-02-26 fix: address code review feedback 
2026-02-27 docs: update performance report with 95.0% accuracy 
2026-02-27 docs: add mypyc optimization design plan 
2026-02-27 docs: add mypyc optimization implementation plan 
2026-02-27 refactor: prepare models module for mypyc compilation 
2026-02-27 refactor: prepare structural module for mypyc compilation 
2026-02-27 refactor: prepare validity module for mypyc compilation 
2026-02-27 refactor: prepare statistical module for mypyc compilation 
2026-02-27 build: add hatch-mypyc configuration for optional compilation 
2026-02-27 fix: resolve mypyc compilation type errors and add .so to gitignore 
2026-02-27 docs: update performance report with mypyc benchmark results 
2026-02-27 refactor: switch mypyc config from exclude to include list 
2026-02-27 refactor: split benchmark.py into benchmark_time.py and benchmark_memory.py 
2026-02-27 fix: apply decoded-output equivalence in compare_detectors.py and update performance report 
2026-02-27 docs: add accuracy improvements v2 design plan 
2026-02-27 docs: add accuracy improvements v2 implementation plan 
2026-02-27 feat: add stdout flushing to train.py for tee compatibility 
2026-02-27 feat: write training metadata YAML alongside models.bin 
2026-02-27 feat: bump max_samples default to 25000 for better model quality 
2026-02-27 feat: add KOI8-T promotion heuristic for Tajik text detection 
2026-02-27 feat: add lead byte diversity check to CJK false-positive gating 
2026-02-27 feat: retrain all 233 bigram models at 25K samples 
2026-02-27 Revert "feat: retrain all 233 bigram models at 25K samples" 
2026-02-27 revert: restore max_samples default to 15000 
2026-02-27 docs: add parallel training design and implementation plan 
2026-02-27 feat(train): add --build-workers CLI argument 
2026-02-27 refactor(train): extract _build_one_model function 
2026-02-27 feat(train): wire ProcessPoolExecutor into model build loop 
2026-02-27 fix(train): use _load_cached_articles in workers to prevent hanging processes 
2026-02-27 feat: retrain models with parallel training pipeline 
2026-02-27 docs: add CLAUDE.md with project overview and conventions 
2026-02-27 fix: suppress FA100 for mypyc-compiled modules and fix lint warnings 
2026-02-27 refactor: address code review findings 
2026-02-27 fix: match chardet 6.0.0 public API signatures for backward compatibility 
2026-02-27 feat: implement should_rename_legacy, ignore_threshold, and lang_filter DeprecationWarning 
2026-02-27 docs: update performance comparison with expanded ISO-to-Windows superset equivalences 
2026-02-27 fix: add DeprecationWarning for unused chunk_size param and suppress FBT lint 
2026-02-27 refactor: address code review findings across codebase 
2026-02-27 fix: harden model loading and validate max_bytes parameter 
2026-02-27 fix: address staff-level code review findings across codebase 
2026-02-28 docs: update performance benchmarks with verified measurements 
2026-02-28 feat: add --pure flag to benchmark scripts to detect stale mypyc .so files 
2026-02-28 docs: add confusion group resolution design for accuracy improvement 
2026-02-28 docs: add confusion group resolution implementation plan 
2026-02-28 feat: add confusion group computation for similar encodings 
2026-02-28 feat: add distinguishing byte maps with Unicode categories 
2026-02-28 feat: add binary serialization for confusion group data 
2026-02-28 feat: generate confusion.bin during model training 
2026-02-28 feat: add lazy loading of confusion.bin at runtime 
2026-02-28 feat: add Unicode category voting resolution strategy 
2026-02-28 feat: add distinguishing-bigram re-scoring resolution strategy 
2026-02-28 feat: integrate confusion group resolution into detection pipeline 
2026-02-28 feat: add env var override for confusion strategy experimentation 
2026-02-28 feat: regenerate models with full training data (15k samples) 
2026-02-28 fix: resolve lint issues in confusion module and tests 
2026-02-28 perf: add utf1632, utf8, and escape modules to mypyc compilation 
2026-02-28 style: apply pyupgrade --py310-plus modernizations 
2026-02-28 fix: replace dot product with cosine similarity in bigram scoring 
2026-02-28 fix: score all candidates in single pool to fix normalization bug 
2026-02-28 fix: use raw cosine scores as confidence instead of max-normalization 
2026-02-28 fix: use weighted category voting instead of binary votes 
2026-02-28 fix: run chardet-rewrite in isolated venv for fair comparison 
2026-02-28 refactor: replace --use-encoding-era bool flag with --encoding-era choice 
2026-02-28 docs: add language detection accuracy tracking design 
2026-02-28 docs: add language detection accuracy implementation plan 
2026-02-28 feat: add detected_language to benchmark_time.py JSON output 
2026-02-28 feat: track language detection accuracy in compare_detectors.py 
2026-02-28 feat: add language accuracy reporting to compare_detectors.py output 
2026-02-28 fix: use shared denominator in per-encoding language accuracy table 
2026-02-28 feat: add language mismatch warnings to test_accuracy.py 
2026-02-28 feat: infer language from single-language encodings 
2026-02-28 feat: normalize ISO 639-1 language codes in scripts and tests 
2026-02-28 docs: update performance comparison with latest benchmark results 
2026-02-28 docs: add language detection accuracy section to performance doc 
2026-02-28 docs: add design for statistical language inference in _fill_language 
2026-02-28 feat: use statistical bigram scoring for language on multi-language encodings 
2026-02-28 docs: add design for UTF-8 language models with universal fallback 
2026-02-28 docs: add implementation plan for UTF-8 language models 
2026-02-28 feat: train UTF-8 byte bigram models for all 48 languages 
2026-02-28 feat: add Tier 3 decode-to-UTF-8 language fallback in _fill_language 
2026-02-28 perf: cap language scoring data to 2KB and fix mypyc BigramProfile bug 
2026-02-28 docs: note previous mypyc numbers were from stale build 
2026-02-28 fix: add BigramProfile.from_weighted_freq() factory classmethod 
2026-02-28 fix: add len(data) to structural analysis cache key for safety 
2026-02-28 refactor: remove score_bigrams wrapper and dead models parameter 
2026-02-28 refactor: deduplicate _resolve_rename and _validate_max_bytes 
2026-02-28 refactor: move build-time confusion functions and script tests to scripts/ 
2026-02-28 refactor: minor cleanups from code review 
2026-02-28 build: switch from pre-commit to prek for git hooks 
2026-02-28 docs: add thread-safety design for PipelineContext + cache locking 
2026-02-28 feat: add PipelineContext dataclass for per-run pipeline state 
2026-02-28 refactor: thread PipelineContext through structural.py, remove module-level cache 
2026-02-28 refactor: create PipelineContext in orchestrator, thread through call chain 
2026-02-28 fix: add double-checked locking to model caches for thread safety 
2026-02-28 fix: add double-checked locking to confusion and registry caches 
2026-02-28 test: add thread-safety integration test for concurrent detect() 
2026-02-28 fix: remove inner re-check in mypyc-compiled cache locks 
2026-02-28 test: add cold-cache race, high-concurrency, and GIL diagnostic tests 
2026-02-28 ci: add GitHub Actions workflow with free-threaded Python testing 
2026-02-28 ci: add Python 3.14 and 3.14t to CI matrix 
2026-02-28 ci: add coverage configuration 
2026-02-28 ci: run tests under coverage and upload to Codecov 
2026-02-28 docs: add design for performance doc update with thread safety 
2026-02-28 docs: add implementation plan for performance doc update 
2026-02-28 docs: update performance doc with thread safety results 
2026-02-28 ci: add ty type checking to prek and xfail known accuracy gaps 
2026-02-28 fix: UTF-32 BOM alignment, UTF-8 binary false positives, and CJK gating 
2026-02-28 docs: add design for docstrings and type annotations 
2026-02-28 docs: add implementation plan for docstrings and type annotations 
2026-02-28 chore: enforce docstring and return-type ruff rules for src/chardet 
2026-02-28 docs: add Sphinx reST docstrings to all src/chardet/ modules 
2026-03-01 refactor: code review fixes — DRY constants, cleaner types, test hygiene
2026-03-01 Update performance report 
2026-03-01 Remove unnecessary TYPE_CHECKING guards 
2026-03-01 Remove unused strategy param from resolve_confusion_groups 
2026-03-01 analyser -> analyzer 
2026-03-01 Commit thread safety implementation plan for posterity 
2026-03-01 Add documentation setup design doc 
2026-03-01 Add documentation setup implementation plan 
2026-03-01 Add docs dependency group (Sphinx, Furo, sphinx-copybutton) 
2026-03-01 Add Sphinx configuration and ReadTheDocs config 
2026-03-01 Add docs landing page (index.rst) 
2026-03-01 Add usage documentation page 
2026-03-01 Add supported encodings page (generated from registry) 
2026-03-01 Add 'How It Works' documentation page 
2026-03-01 Add performance benchmarks documentation page 
2026-03-01 Add FAQ documentation page 
2026-03-01 Add API reference page (autodoc from source docstrings) 
2026-03-01 Fix copyright year and clarify FAQ version narrative 
2026-03-01 Add documentation build commands to CLAUDE.md 
2026-03-01 Update copyright date to reflect rewrite 
2026-03-01 Unify UniversalDetector with detect()/detect_all() pipeline 
2026-03-01 Add release workflow for PyPI publishing on tag push 
2026-03-01 feat: add UTF-7 detection via escape sequence matching 
2026-03-01 fix: tighten UTF-7 validation to reject false positives 
2026-03-01 style: fix ruff lint violations in escape.py 
2026-03-01 feat: add cp273 (EBCDIC German) to registry and training config 
2026-03-01 feat: add hp-roman8 to registry and training config 
2026-03-01 fix: accept RFC 2152 implicit termination in UTF-7 detection 
2026-03-01 feat: train bigram models for cp273 and hp-roman8 
2026-03-01 test: add end-to-end detection tests for UTF-7, cp273, hp-roman8 
2026-03-01 docs: update supported encoding count to 84 
2026-03-01 docs: add design and implementation plan for cp273, hp-roman8, UTF-7 
2026-03-01 perf: skip irrelevant escape checks when only '+' is present 
2026-03-01 fix: reject UTF-7 detection when data contains bytes > 0x7F 
2026-03-01 docs: update README examples with accurate output comments 
2026-03-01 docs: add English ASCII and UTF-8 examples to README 
2026-03-01 feat: train English bigram models for improved language detection 
2026-03-01 docs: add design for language metadata in encoding registry 
2026-03-01 docs: add implementation plan for registry language metadata 
2026-03-01 feat: add languages field to EncodingInfo and utf-7 to registry 
2026-03-01 test: add tests for registry languages field and utf-7 entry 
2026-03-01 refactor: derive _SINGLE_LANG_MAP from registry instead of hardcoding 
2026-03-01 refactor: derive ENCODING_LANG_MAP from registry instead of hardcoding 
2026-03-01 refactor: standardize escape.py language strings to ISO 639-1 codes 
2026-03-01 test: use ISO 639-1 code in pipeline types test for consistency 
2026-03-01 docs: add design for full Python character encoding coverage 
2026-03-01 docs: add implementation plan for full Python encoding coverage 
2026-03-01 refactor: flip encoding supersets and add missing aliases in registry 
2026-03-01 refactor: update structural analyzer dispatch keys for renamed encodings 
2026-03-01 feat: resolve registry aliases in model index, remove gb2312 hack 
2026-03-01 feat: differentiate ISO-2022-JP branches in escape detector 
2026-03-01 feat: update equivalences for flipped supersets and ISO-2022-JP branches 
2026-03-01 fix: regenerate confusion.bin and update tests for renamed encodings 
2026-03-01 revert: remove cp037 from cp500 equivalences (out of scope) 
2026-03-01 fix: correct EBCDIC superset: cp1140 replaces cp037, not cp500 
2026-03-01 docs: correct EBCDIC section in design doc 
2026-03-01 refactor: convert REGISTRY from tuple to MappingProxyType dict 
2026-03-01 refactor: address code review findings across the codebase 
2026-03-01 docs: update benchmark numbers and revamp README comparison table 
2026-03-01 docs: update supported encoding count to 99 across all docs 
2026-03-01 fix: score_best_language early-exit bug and move training-only dict 
2026-03-01 docs: report speeds as files/second instead of total time 
2026-03-01 Update README some more 
2026-03-01 refactor: default encoding_era to ALL and simplify should_rename_legacy 
2026-03-01 Remove plans to keep PR less noisy 
2026-03-01 build: derive version from git tags via hatch-vcs 
2026-03-02 docs: update CLAUDE.md for hatch-vcs versioning and current architecture 
2026-03-02 docs: update docs for encoding_era=ALL default, fix stale numbers, add missing pipeline stages 
2026-03-02 docs: fix stale encoding_era example in README 
2026-03-02 ci: restrict GITHUB_TOKEN to read-only permissions 
2026-03-02 fix: address PR review findings across error handling, types, docs, and tests 
2026-03-02 docs: fix stale confidence, language count, and stage count in docs 
2026-03-02 fix: resolve all ruff and ty lint failures 
2026-03-02 chore: update devcontainer to install prek hooks and add Ruff extension 
2026-03-02 Add authors and maintainers back to pyproject.toml 
2026-03-02 fix: test harness None-None support, UTF-7 false positives, and WHATWG replacement era reclassification 
2026-03-02 fix: handle binary files systematically via collect_test_files() and equivalences 
2026-03-02 docs: update performance benchmarks for 2179 test files (incl. binary) 
2026-03-02 fix: CLI defaults to ALL encoding era, remove --legacy flag 
2026-03-02 docs: comprehensive documentation update 
2026-03-02 Slight tweak to index.rst 
2026-03-02 docs: hide inline toctree on landing page, keep sidebar nav 
2026-03-02 Tweak FAQs 
2026-03-02 Tweak README 
2026-03-02 Update tag trigger for PyPI releases 
2026-03-02 Make readthedocs get tags 
2026-03-02 Make index.rst example match one from README 
2026-03-02 docs: fix code examples and benchmark numbers to match actual output 
2026-03-02 tests: add streaming parity test for detect vs UniversalDetector (GH-296) 
2026-03-02 fix: use post_checkout to unshallow clone for ReadTheDocs builds 
2026-03-02 Add slug for codecov 
2026-03-02 Add codecov badge 
2026-03-02 docs: add keywords and changelog URL to project metadata 
2026-03-02 fix: update cibuildwheel action from v2 to v3 
2026-03-02 fix: use cibuildwheel@v3.3 (no rolling major version tag exists) 
2026-03-02 fix: install uv and pass mypyc env var into cibuildwheel builds 
2026-03-02 Add 7.0.0 release date to changelog 
2026-03-02 docs: clarify that mypyc wheels are prebuilt on PyPI 
2026-03-02 docs: add mypyc vs pure Python speed stats across all docs 
2026-03-02 docs: add cross-Python-version timing benchmarks (7.0.0rc4) 
2026-03-02 docs: add import timing to cross-version benchmark tables 
2026-03-02 docs: extend changelog to cover all PyPI releases (1.0-2.1.1) 
2026-03-02 docs: add test data coverage design plan 
2026-03-02 docs: add test data coverage implementation plan 
2026-03-03 Make tests/data get ignored if it is a symlink too 
2026-03-03 Allow symlink tests/data to simplify updating test-data repo 
2026-03-03 Remove known failure entries for deleted duplicate test files 
2026-03-03 Add known failure entries for new DOS codepage test files 
2026-03-03 docs: add test coverage gap analysis and improvement design 
2026-03-03 docs: add test coverage gap implementation plan 
2026-03-03 refactor: remove dead code branches and add pragma for invariant assertion 
2026-03-03 Add tests for uncovered branches across pipeline modules 
2026-03-03 Add CLI, confusion, models, and script utility tests 
2026-03-03 Reach 100% test coverage across all chardet modules 
2026-03-03 Update test infrastructure for ISO 639-1 language codes 
2026-03-03 Update language equivalences for mutual intelligibility 
2026-03-04 Update CLAUDE.md to stop a couple stupid things 
2026-03-04 Fix _SINGLE_LANG_MAP missing aliases 
2026-03-04 Retrain models and remove 24 resolved xfails 
2026-03-04 Fix false UTF-7 detection of SHA-1 git hashes (#324) 
2026-03-04 Add separate lint job back 
2026-03-04 Bump coverage requirements up to 95% since we have 100% 
2026-03-04 Fix precommit hook failures 
2026-03-04 Use package name in cache filenames and enrich display labels 
2026-03-04 Remove plans 
2026-03-04 feat: add helpers for venv-less version/tag resolution and cache checking 
2026-03-04 fix: use project_root parameter instead of pip_args[0] in _resolve_version_without_venv 
2026-03-04 feat: skip venv creation when full cache exists for detector 
2026-03-04 fix: remove unused cached_specs and add version mismatch diagnostic 
2026-03-04 docs: update benchmark numbers for expanded test suite (2,510 files) 
2026-03-06 Update changelog with missing 7.0.1 info 
2026-03-06 fix: prevent false UTF-7 detection of ASCII with ++ or +word (#332) (#335) 
2026-03-06 fix: eliminate 0.5s startup cost by computing model norms during loading (#333) 
2026-03-06 feat: add 1st detect and max columns to compare_detectors output 
2026-03-06 fix: correct import time measurement when --pure is used 
2026-03-06 refactor: replace unwieldy tuple returns with _TimingResult dataclass 
2026-03-06 fix: display startup times in milliseconds to avoid rounding to 0.000s 
2026-03-06 Fix issue where language was not returned by charset_normalizer 
2026-03-06 fix: generalize warmup comment to apply to all detector types 
2026-03-06 feat: add --no-memory flag and "time to 1st result" column to compare_detectors 
2026-03-06 perf: use struct.iter_unpack for bulk model parsing 
2026-03-06 fix: detect truncated model data with iter_unpack and extract parser 
2026-03-06 fix: normalize language names to ISO 639-1 for cross-detector comparison 
2026-03-06 docs: update benchmark numbers for chardet 7.0.2 
2026-03-06 test: add coverage for struct.error and UnicodeDecodeError paths 
2026-03-06 Update incorrect date in LICENSE 
2026-03-06 fix: restore _MODEL_NORMS in mock_models_bin fixture to prevent test slowdown 
2026-03-07 feat: pin test-data cloning to chardet release version tags 
2026-03-07 test: add unit tests for get_data_dir cache logic 
2026-03-07 docs: add test-data tagging step to versioning notes in CLAUDE.md 
2026-03-07 fix: remove v prefix from test-data tags to match release tag format 
2026-03-07 Fix xpasses 
2026-03-08 Fix undocumented encoding name changes (#338) 
2026-03-09 docs: add 7.0.2 changelog entry 
2026-03-09 Update benchmark numbers and fix subprocess import error 
2026-03-10 Improve ISO-2022-JP family escape sequence detection 
2026-03-10 docs: add design spec for switching to Python codec canonical names 
2026-03-10 docs: add implementation plan for switching to Python codec canonical names 
2026-03-10 fix: patch chardet.__version__ directly in test_utils 
2026-03-10 fix: build mypyc wheel explicitly in compare_detectors --mypyc 
2026-03-10 refactor: replace _LEGACY_NAMES with _COMPAT_NAMES (codec name keys) 
2026-03-10 feat: add compat_names and prefer_superset parameters to detect() 
2026-03-10 feat: add compat_names and prefer_superset to UniversalDetector 
2026-03-10 refactor: rename apply_legacy_rename to apply_preferred_superset 
2026-03-10 refactor: switch pipeline and equivalences to Python codec names 
2026-03-10 test: update all assertions for Python codec name internals 
2026-03-10 refactor: update registry EncodingName and train.py to use Python codec names 
2026-03-10 refactor: remove redundant EncodingInfo.python_codec field 
2026-03-11 refactor: clean up post-codec-name-switch and document new API parameters 
2026-03-11 refactor: replace manual caches with functools.cache, consolidate normalization 
2026-03-11 docs: update accuracy to 98.2%, use prefer_superset for accuracy eval 
2026-03-11 test: replace old benchmarks with three-layer performance regression suite 
2026-03-11 test: tune benchmark thresholds based on measured baselines 
2026-03-11 docs: add -n auto to pytest commands, fix markdown spacing 
2026-03-11 refactor: deduplicate code and derive inverse dicts from single sources 
2026-03-11 fix: add chardet.universaldetector stub for 6.x backward compatibility 
2026-03-11 feat: detect PEP 263 encoding declarations (# -*- coding: ... -*-) 
2026-03-11 chore: remove docs/plans/ design and plan files 
2026-03-11 docs: add threaded benchmark design spec 
2026-03-11 docs: address spec review feedback for threaded benchmark 
2026-03-11 docs: add threaded benchmark implementation plan 
2026-03-11 feat: add --threads option to benchmark_time.py for concurrent detection 
2026-03-11 feat: add --threads passthrough to compare_detectors.py 
2026-03-11 fix: only include threads in timing cache keys, not memory cache keys 
2026-03-11 fix: add --threads validation and docstring updates in compare_detectors.py 
2026-03-11 Remove plans that got thrown in other directory 
2026-03-11 docs: update thread scaling table with GIL vs free-threaded benchmarks 
2026-03-11 fix: adjust benchmark speedup threshold for pure Python vs mypyc 
2026-03-11 test: achieve 100% test coverage 
2026-03-11 refactor: use pathlib.Path instead of str for filesystem paths in scripts 
2026-03-11 perf: add early-exit check in PEP 263 detection for non-Python data 
2026-03-11 docs: expand changelog with contributor attribution, PR links, and charade history