💻 Activity 3.2: Profanity & Word Count Detector

Task 1: Detect & Replace Profanity

The growth of public forums has created a need for automated filters that remove profanity and other inappropriate content from the web. We have provided you with two articles from a newsgroup dataset. Your task is to find the profanity and replace it using string methods.

Since the selected articles contain no profane content, we will treat the word "philosopher" as profane.

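from sklearn.datasets import fetch_20newsgroups

# Download (or load from cache) the training split of the sci.med newsgroup
newsgroups_train = fetch_20newsgroups(subset='train', categories=['sci.med'])
# Use the first two posts as the articles for this activity
Article_1 = newsgroups_train.data[0]
Article_2 = newsgroups_train.data[1]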
print(Article_1)
From: nyeda@cnsvax.uwec.edu (David Nye)
Subject: Re: Post Polio Syndrome Information Needed Please !!!
Organization: University of Wisconsin Eau Claire
Lines: 21

[reply to keith@actrix.gen.nz (Keith Stewart)]
 
>My wife has become interested through an acquaintance in Post-Polio
>Syndrome This apparently is not recognised in New Zealand and different
>symptons ( eg chest complaints) are treated separately. Does anone have
>any information on it
 
It would help if you (and anyone else asking for medical information on
some subject) could ask specific questions, as no one is likely to type
in a textbook chapter covering all aspects of the subject.  If you are
looking for a comprehensive review, ask your local hospital librarian.
Most are happy to help with a request of this sort.
 
Briefly, this is a condition in which patients who have significant
residual weakness from childhood polio notice progression of the
weakness as they get older.  One theory is that the remaining motor
neurons have to work harder and so die sooner.
 
David Nye (nyeda@cnsvax.uwec.edu).  Midelfort Clinic, Eau Claire WI
This is patently absurd; but whoever wishes to become a philosopher
must learn not to be frightened by absurdities. -- Bertrand Russell
print(Article_2)
From: koreth@spud.Hyperion.COM (Steven Grimm)
Subject: Re: Opinions on Allergy (Hay Fever) shots?
Organization: Hyperion, Mountain View, CA, USA
Lines: 7
NNTP-Posting-Host: spud.hyperion.com

I had allergy shots for about four years starting as a sophomore in high
school.  Before that, I used to get bloody noses, nighttime asthma attacks,
and eyes so itchy I couldn't get to sleep.  After about 6 months on the
shots, most of those symptoms were gone, and they haven't come back.  I
stopped getting the shots (due more to laziness than planning) in college.
My allergies got a little worse after that, but are still nowhere near as
bad as they used to be.  So yes, the shots do work.
  1. Determine whether each article contains the profane word.

# Article 1
...
"philosopher" in Article_1
True
# Article 2
...
"philosopher" in Article_2
False
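
The `in` operator used above is a case-sensitive substring test, so it would miss "Philosopher" and would also match inside a longer word such as "philosophers". Below is a minimal sketch of a stricter check. The helper `contains_word` and the sample string are hypothetical, introduced only for illustration, and are not part of the activity's expected solution.

```python
import re

def contains_word(text, word):
    """Return True if `word` occurs in `text` as a whole word, ignoring case.

    Hypothetical helper for illustration only.
    """
    # \b marks a word boundary; re.escape protects any special characters in `word`
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Made-up sample text, not taken from the dataset
sample = "Whoever wishes to become a Philosopher must not fear absurdities."

print("philosopher" in sample)                 # case-sensitive substring test
print("philosopher" in sample.lower())         # case-insensitive substring test
print(contains_word(sample, "philosopher"))    # whole-word, case-insensitive test
print(contains_word("two philosophers", "philosopher"))  # plural is not a whole-word match
```

For this activity the plain `in` test is sufficient; the regular-expression version only matters if capitalization or word boundaries need to be respected.
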
  2. Replace the profane word with ****.

# Replace 
...
Article_1 = Article_1.replace("philosopher", '****')
# check both articles visually
print(Article_1)
print(Article_2)
From: nyeda@cnsvax.uwec.edu (David Nye)
Subject: Re: Post Polio Syndrome Information Needed Please !!!
Organization: University of Wisconsin Eau Claire
Lines: 21

[reply to keith@actrix.gen.nz (Keith Stewart)]
 
>My wife has become interested through an acquaintance in Post-Polio
>Syndrome This apparently is not recognised in New Zealand and different
>symptons ( eg chest complaints) are treated separately. Does anone have
>any information on it
 
It would help if you (and anyone else asking for medical information on
some subject) could ask specific questions, as no one is likely to type
in a textbook chapter covering all aspects of the subject.  If you are
looking for a comprehensive review, ask your local hospital librarian.
Most are happy to help with a request of this sort.
 
Briefly, this is a condition in which patients who have significant
residual weakness from childhood polio notice progression of the
weakness as they get older.  One theory is that the remaining motor
neurons have to work harder and so die sooner.
 
David Nye (nyeda@cnsvax.uwec.edu).  Midelfort Clinic, Eau Claire WI
This is patently absurd; but whoever wishes to become a ****
must learn not to be frightened by absurdities. -- Bertrand Russell

From: koreth@spud.Hyperion.COM (Steven Grimm)
Subject: Re: Opinions on Allergy (Hay Fever) shots?
Organization: Hyperion, Mountain View, CA, USA
Lines: 7
NNTP-Posting-Host: spud.hyperion.com

I had allergy shots for about four years starting as a sophomore in high
school.  Before that, I used to get bloody noses, nighttime asthma attacks,
and eyes so itchy I couldn't get to sleep.  After about 6 months on the
shots, most of those symptoms were gone, and they haven't come back.  I
stopped getting the shots (due more to laziness than planning) in college.
My allergies got a little worse after that, but are still nowhere near as
bad as they used to be.  So yes, the shots do work.
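
`str.replace` swaps every exact, case-sensitive occurrence of the substring, which is all this task requires. As a sketch of how the same idea might extend to several banned words and mixed capitalization, the example below uses `re.sub`. The function `censor`, its parameters, and the sample sentence are assumptions made for illustration.

```python
import re

def censor(text, banned_words, mask="****"):
    """Replace each whole-word, case-insensitive match of a banned word with `mask`.

    Hypothetical helper for illustration only.
    """
    for word in banned_words:
        pattern = r"\b" + re.escape(word) + r"\b"
        text = re.sub(pattern, mask, text, flags=re.IGNORECASE)
    return text

# Made-up sample text, not taken from the dataset
sample = "A Philosopher met another philosopher."

print(sample.replace("philosopher", "****"))   # only the lowercase occurrence changes
print(censor(sample, ["philosopher"]))         # both occurrences are masked
```

Strings are immutable, so both approaches return a new string; the result has to be assigned back to a variable, as the solution does with `Article_1`.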

Task 2: Evaluate Word Limit

Some forums may like to impose a word limit on posts.

Use what you have learned about string methods to:

  1. count the number of words, and

  2. determine if the number of words in each article is greater than the word limit of 200.

...
# Count words by splitting each article on single spaces
print(f"Article 1 has {len(Article_1.split(' '))} words")
print(f"Article 2 has {len(Article_2.split(' '))} words")
Article 1 has 180 words
Article 2 has 107 words
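
The second part of the task, comparing each count with the limit, can be sketched as follows. The helpers `word_count` and `exceeds_limit` and the sample strings are hypothetical. This sketch uses `str.split()` with no argument, which splits on any run of whitespace (spaces, tabs, newlines), whereas `split(' ')` splits on single spaces only, so the two can give different counts on multi-line text.

```python
WORD_LIMIT = 200  # word limit stated in the task

def word_count(text):
    """Count whitespace-separated words (hypothetical helper)."""
    return len(text.split())

def exceeds_limit(text, limit=WORD_LIMIT):
    """Return True if the text has more words than `limit` (hypothetical helper)."""
    return word_count(text) > limit

# Made-up sample posts, not taken from the dataset
short_post = "So yes,  the shots  do\nwork."
long_post = "word " * 250

print(f"Short post: {word_count(short_post)} words, over limit: {exceeds_limit(short_post)}")
print(f"Long post: {word_count(long_post)} words, over limit: {exceeds_limit(long_post)}")

# split(' ') keeps empty strings between double spaces and does not split on newlines
print(len(short_post.split(' ')))
```

With the actual articles, the same comparison would be `exceeds_limit(Article_1)` and `exceeds_limit(Article_2)`.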