Hack The Tech - Latest News related to Computer and Technology

Wednesday, March 9, 2022

UnicodeDecodeError - Python decoding of json string created in PHP

This is a weird one. I've shot trouble just about as far as I can, so I'm calling on the cavalry. (That's you.)

I think it's an encoding issue.

Here are the facts:

I have a PHP script, triggered by cron, which captures a "records state" from my client's research. It pulls data from the database and saves it to the server as one massive JSON string. Later, I pull the file down and run various processes on that data string. I do this using Python.

The PHP script writes the data to the server like this:

$completeString = '{"test1":[{"id":"80","data":"The Philosophical Transactions and Collections to the End of the Year 1700 Vol. III. In Two Parts Anatomical Medical and Chymical Philological and Miscellaneous Papers The Philosophical Transactions of the Royal Society"},{"id":"81","data":"The Philosophical Transactions From the Year 1700 to the Year 1720 Vol. IV Containing The Mathematical Papers The Physiological Papers. The Philosophical Transactions of the Royal Society"}],"test2":[{"id":"122","data":"BMV","page":"","notes":""},{"id":"122","data":" Book of Hours","page":"","notes":""},{"id":"122","data":" Christianity","page":"","notes":""},{"id":"122","data":"Medieval","page":"","notes":""},{"id":"122","data":" Renaissance","page":"","notes":""},{"id":"122","data":" Vellum","page":"","notes":""}],"test3":[{"etcetcetc":"etc"}]}';

$indexFile = fopen(__DIR__ . "/records_2022-03-08-03-45-AM.json", "wb") or die("Unable to open file!");
fwrite($indexFile,$completeString);
fclose($indexFile);

There are no problems with this so far, and the PHP script happily writes a 9MB string whilst the server is on downtime.

The PHP script removes any non-UTF-8 characters from the data before adding them to the string.

In the morning, the user pulls the file (records_2022-03-08-03-45-AM.json) from the server via FTP.

They then run the Python script (from the macOS Terminal) to read the file, which consists of a single line. The germane code from the Python script is this:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys
import io
import json
with io.open(thefilePath, 'rb') as f:
    for line in f:
        sourceData = json.loads(line.decode("utf-8"))

The error result is this:

Traceback (most recent call last):
  File "pythonScript.py", line 75, in <module>
    sourceData = json.loads(line.decode("utf-8"))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 2547886: invalid continuation byte
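
For diagnosis, it may help to list every invalid byte sequence in the file rather than stopping at the first one. The following is a minimal sketch, not part of the original script: `find_invalid_utf8` and its `limit` parameter are hypothetical names, and it relies only on the `start` and `end` attributes that `UnicodeDecodeError` carries.

```python
def find_invalid_utf8(data, limit=20):
    """Return up to `limit` (byte_offset, offending_bytes) pairs found in `data`."""
    problems = []
    pos = 0
    while pos < len(data) and len(problems) < limit:
        try:
            # Decode the remainder; success means nothing else is invalid.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into the slice being decoded.
            start = pos + err.start
            end = pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume just after the offending bytes
    return problems
```

Reading the downloaded file with `open(path, "rb").read()` and passing the result to this function would show whether 0xc3 at position 2547886 is an isolated case or one of many.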

The character at position 2547886 is a space. Not a "0xc3". Not the "Ã…". It is just a space.
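
One thing worth checking here: the position in the error message is a byte offset into the data passed to `decode()`, not a character index. If any multi-byte characters occur earlier in the file, a text editor that counts characters will land on a different spot. The sketch below (with the hypothetical helper name `describe_offset`) looks at the raw bytes around a given offset and reports which character index that byte offset corresponds to.

```python
def describe_offset(data, offset, radius=8):
    """Summarize the raw bytes surrounding `offset` in `data`."""
    lo = max(0, offset - radius)
    hi = min(len(data), offset + radius + 1)
    window = data[lo:hi]
    return {
        # The byte the codec complained about, as an integer.
        "byte": data[offset],
        # Hex dump of the neighbourhood (bytes.hex with a separator needs Python 3.8+).
        "hex": window.hex(" "),
        # The same neighbourhood as text, with undecodable bytes replaced by U+FFFD.
        "text": window.decode("utf-8", errors="replace"),
        # Character index that corresponds to this byte offset.
        "char_index": len(data[:offset].decode("utf-8", errors="replace")),
    }
```

Calling `describe_offset(raw, 2547886)` on the bytes of the downloaded file would show whether the byte at that offset really is 0x20 or whether the space was found by counting characters instead of bytes.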

The kicker is this.

If the file is unchanged, but is resaved to the Mac, when the script is then re-run it works as expected. Perfectly.
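
Since resaving makes the problem disappear, it may be worth confirming whether the resaved copy is really byte-for-byte identical to the downloaded one; some editors re-encode or normalize content when saving. This is a rough sketch under that assumption, and `summarize` and `first_difference` are hypothetical helper names.

```python
import hashlib


def summarize(data):
    """Return (size_in_bytes, sha256_hex) for a bytes object."""
    return len(data), hashlib.sha256(data).hexdigest()


def first_difference(a, b):
    """Return the first byte offset where `a` and `b` differ, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One input is a prefix of the other; they diverge where the shorter ends.
        return min(len(a), len(b))
    return None
```

Reading both copies in binary mode (for example the downloaded `records_2022-03-08-03-45-AM.json` and a resaved copy under a different name) and passing their contents to these functions would show whether the files match and, if not, where they first diverge.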

I researched this one, but I think it's just far enough outside of my ken that I need a push in the right direction.

What are your thoughts? I've tried to be concise, but please let me know if I can add any further information.

Python version is 3.9.10. PHP is rather old: 5.4.45 (I have no control over this factor). Server is CentOS 6.



source https://stackoverflow.com/questions/71399770/unicodedecodeerror-python-decoding-of-json-string-created-in-php
