UnicodeDecodeError in zeppelin 0.7.1


UnicodeDecodeError in zeppelin 0.7.1

Meethu Mathew
Hi,

I just migrated from Zeppelin 0.7.0 to Zeppelin 0.7.1, and I am seeing this error while creating an RDD in PySpark:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I was able to create the RDD without any error after adding use_unicode=False, as follows:
sc.textFile("file.csv", use_unicode=False)

But it fails when I try to stem the text. I get a similar error when applying stemming through the Python interpreter:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
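For context: with use_unicode=False the RDD holds Python 2 byte strings, and the 'ascii' error above is what Python 2 raises when such a byte string is implicitly coerced to unicode. A minimal sketch of the same failure outside Zeppelin, assuming UTF-8 data:

    # sketch (Python 2): combining unicode with a non-ASCII byte string
    # triggers an implicit ascii decode and the same class of error
    b = 'caf\xc3\xa9'   # UTF-8 bytes for "café"
    u = u' ' + b        # raises UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 ...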

All of this code works in version 0.7.0, and neither the dataset nor the code has changed. Has the encoding handling changed in the new version of Zeppelin?

Regards, 
Meethu Mathew


Re: UnicodeDecodeError in zeppelin 0.7.1

moon
Administrator
Hi,

As far as I know, 0.7.1 didn't change anything related to encodings.
One difference is that the official 0.7.1 artifact was built with JDK 8, while 0.7.0 was built with JDK 7 (we'll use JDK 7 to build the upcoming 0.7.2 binary). But I'm not sure that could change PySpark or Spark encoding behavior.

Do you have exactly the same interpreter settings in 0.7.1 and 0.7.0?
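
If they match, one more quick check (a sketch; run it in a %pyspark paragraph on both versions and compare the output) is to print the encodings each interpreter actually sees:

    # sketch: these values should match between 0.7.0 and 0.7.1
    import sys, locale
    print(sys.version)                      # exact Python version
    print(sys.getdefaultencoding())         # codec behind implicit str<->unicode conversion
    print(locale.getpreferredencoding())    # locale-derived default encoding
    print(sys.stdout.encoding)              # stdout encoding, may be None under Zeppelin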

Thanks,
moon


Re: UnicodeDecodeError in zeppelin 0.7.1

Felix Cheung
And are they running with the same Python version? What is the Python version?


Re: UnicodeDecodeError in zeppelin 0.7.1

Meethu Mathew
Hi,

Thanks for the responses.

@moon soo Lee: The interpreter settings are the same in 0.7.0 and 0.7.1.

@Felix Cheung: The Python version is the same.

The code is as follows:

PYSPARK

def textPreProcessor(text):
    for w in text.split():
        regex = re.compile('[%s]' % re.escape(string.punctuation))
        no_punctuation = unicode(regex.sub(' ', w), 'utf8')
        tokens = word_tokenize(no_punctuation)
        lowercased = [t.lower() for t in tokens]
        no_stopwords = [w for w in lowercased if not w in stopwordsX]
        stemmed = [stemmerX.stem(w) for w in no_stopwords]
        return [w for w in stemmed if w]

  • docs = sc.textFile(hdfs_path + training_data, use_unicode=False).repartition(96)
  • docs.map(lambda features: sentimentObject.textPreProcessor(features.split(delimiter)[text_colum])).count()

Error:
  • UnicodeDecodeError: 'utf8' codec can't decode byte 0x9b in position 17: invalid start byte
  • The same error occurs when use_unicode=False is not used.
  • The error changes to 'ascii' codec can't decode byte 0x97 in position 3: ordinal not in range(128) when no_punctuation = regex.sub(' ', w) is used instead of no_punctuation = unicode(regex.sub(' ', w), 'utf8').
Note: In version 0.7.0 this code ran fine without use_unicode=False and without the unicode(regex.sub(' ', w), 'utf8') wrapper.
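
For reference, one variant I could try (a sketch, assuming the data is UTF-8 with a few corrupt bytes; 'replace' substitutes U+FFFD instead of raising) is to decode once when the RDD is built:

    # sketch: decode each raw byte-string line once, up front (Python 2)
    docs = sc.textFile(hdfs_path + training_data, use_unicode=False) \
             .map(lambda line: line.decode('utf-8', 'replace')) \
             .repartition(96)

With that, textPreProcessor would receive unicode objects and the unicode(...) wrapper inside it could be dropped.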

PYTHON

def textPreProcessor(text_column):
    processed_text = []
    for text in text_column:
        for w in text.split():
            regex = re.compile('[%s]' % re.escape(string.punctuation))  # regex for punctuation
            no_punctuation = unicode(regex.sub(' ', w), 'utf8')
            tokens = word_tokenize(no_punctuation)
            lowercased = [t.lower() for t in tokens]
            no_stopwords = [w for w in lowercased if not w in stopwordsX]
            stemmed = [stemmerX.stem(w) for w in no_stopwords]
            processed_text.append([w for w in stemmed if w])
    return processed_text
  • new_training = pd.read_csv(training_data, header=None, delimiter=delimiter, error_bad_lines=False, usecols=[label_column, text_column], names=['label', 'msg']).dropna()
  • new_training['processed_msg'] = textPreProcessor(new_training['msg'])
This Python code works and produces results. In version 0.7.0, I got the same output without using the unicode function.
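
As an aside, pandas decodes according to its encoding argument, which may be why this path behaves differently; a sketch, assuming the file is UTF-8:

    # sketch: make the decode explicit on the pandas side as well
    new_training = pd.read_csv(training_data, header=None, delimiter=delimiter,
                               encoding='utf-8', error_bad_lines=False,
                               usecols=[label_column, text_column],
                               names=['label', 'msg']).dropna()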

Hope the problem is clear now.

Regards, 
Meethu Mathew



Re: UnicodeDecodeError in zeppelin 0.7.1

Meethu Mathew
Hi All,

I am getting this error in Zeppelin 0.7.2 as well, with the following code. I had reported the same error in 0.7.1 (see my earlier mail in this thread).

def textPreProcessor(text):
    for w in text.split():
        regex = re.compile('[%s]' % re.escape(string.punctuation))
        no_punctuation = unicode(regex.sub(' ', w), 'utf8')
        tokens = word_tokenize(no_punctuation)
        lowercased = [t.lower() for t in tokens]
        no_stopwords = [w for w in lowercased if not w in stopwordsX]
        stemmed = [stemmerX.stem(w) for w in no_stopwords]
        return [w for w in stemmed if w]

  • docs = sc.textFile(hdfs_path + training_data, use_unicode=False).repartition(96)
  • docs.map(lambda features: sentimentObject.textPreProcessor(features.split(delimiter)[text_colum])).count()
Note: In version 0.7.0 this code ran fine without use_unicode=False and without the unicode(regex.sub(' ', w), 'utf8') wrapper.
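
To narrow it down, a small diagnostic might help (a sketch; with use_unicode=False the RDD holds byte strings, so this isolates the rows that are not valid UTF-8):

    # sketch: find the raw byte-string lines that fail to decode as UTF-8
    def is_bad(line):
        try:
            line.decode('utf-8')
            return False
        except UnicodeDecodeError:
            return True

    bad = sc.textFile(hdfs_path + training_data, use_unicode=False).filter(is_bad)
    print(bad.take(5))   # inspect a few offending rows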


Please help me fix this issue.

Regards, 
Meethu Mathew

