Machine Learning Models for Fake News


Finding Fake News

  • Using the Kaggle Dataset ->

Link to Dataset

  • Preprocess Data
  • Cleaning Data
  • Reviewing Data

Steps to Clean the Text

  • Remove external links

  • Remove punctuation, numbers, and other non-alphabetic characters

  • Remove identifiers that name news sources, such as New York Times, Reuters, Fox News, etc.

  • Use a stemmer to match the same word across different tenses or plural forms
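To make the last step concrete, here is a minimal sketch of what stemming does, assuming the PorterStemmer from nltk that this post imports. The word pairs are made up for illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Word pairs that differ only in tense or number (illustrative examples)
pairs = [("certified", "certify"), ("elections", "election"), ("senators", "senator")]

for first, second in pairs:
    # Both words in a pair should reduce to one shared stem
    print(first, second, "->", stemmer.stem(first), stemmer.stem(second))
```

The stems are not always dictionary words. What matters for a model is that related forms end up as the same token.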

Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string

from nltk.stem.porter import PorterStemmer

from nltk.corpus import stopwords
from wordcloud import WordCloud
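
The rest of the post works with two DataFrames named real and fake, but the loading step is not shown. A minimal sketch, assuming the Kaggle download contains one CSV file per class. The helper load_news and the file names True.csv and Fake.csv are assumptions, so adjust them to match your copy of the dataset.

```python
import pandas as pd

def load_news(real_path="True.csv", fake_path="Fake.csv"):
    """Read the real and fake articles into two separate DataFrames.

    The default file names are assumptions, not taken from the post.
    """
    real = pd.read_csv(real_path)
    fake = pd.read_csv(fake_path)
    return real, fake
```

Calling real, fake = load_news() and then fake.head() and real.head() gives a quick preview of each dataset.
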

First rows of the fake news dataset:

  | title | text | subject | date
0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | 31-Dec-17
1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | 31-Dec-17
2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | 30-Dec-17
3 | Trump Is So Obsessed He Even Has Obama’s Na... | On Christmas day, Donald Trump announced that ... | News | 29-Dec-17
4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | 25-Dec-17

First rows of the real news dataset:

  | title | text | subject | date
0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | 31-Dec-17
1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | 29-Dec-17
2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | 31-Dec-17
3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | 30-Dec-17
4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | 29-Dec-17

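stemmer = PorterStemmer()
# Build the stopword set once; needs nltk.download('stopwords') on first use
stop_words = set(stopwords.words('english'))

def clean(text):
    if "(Reuters)" in text:  # real news sometimes contains this identifier
        text = text.split("(Reuters)", 1)[1]  # keep everything after the first occurrence
    text = re.sub(r'@\S*', '', text)  # remove @mentions
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove links
    # Drop path-like tokens and stopwords (case-sensitive, so a capitalized "The" is kept)
    text = " ".join([wd for wd in text.split() if "\\" not in wd and "/" not in wd and wd not in stop_words])
    text = "".join([c for c in text if c not in string.punctuation])
    text = "".join([c for c in text if not c.isdigit()])
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # keep only letters and whitespace
    text = text.lower()
    text = " ".join([stemmer.stem(wd) for wd in text.split()])
    return text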
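
Two regular expression details are easy to get wrong in a cleaner like this one: `[^s]` is not the same as `\S`, and the range `A-z` is wider than `A-Z`. The sketch below uses only the standard library, and the sample strings are made up for illustration.

```python
import re

sample = "Ping @bob about the vote"

# [^s] means "any character except the letter s", so the match can run
# far past the handle (here, all the way to the end of the string)
too_greedy = re.sub(r'@[^s]*', '', sample)

# \S means "any non-whitespace character", so only the handle is removed
handle_only = re.sub(r'@\S+', '', sample)

noisy = "a_b^c"

# The range A-z also covers [ \ ] ^ _ ` which sit between Z and a in ASCII
wide_range = re.sub(r'[^a-zA-z\s]', '', noisy)

# A-Z restricts the class to letters
letters_only = re.sub(r'[^a-zA-Z\s]', '', noisy)

print(repr(too_greedy), repr(handle_only), wide_range, letters_only)
```

In the first pair, the `[^s]` version deletes the rest of the sentence while the `\S` version removes only the handle. In the second pair, the `A-z` version leaves the underscore and caret in place.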

alabama offici thursday certifi democrat doug jone winner state us senat race state judg deni challeng republican roy moor whose campaign derail accus sexual misconduct teenag girl jone vacant seat vote percentag point elect offici said that made first democrat quarter centuri win senat seat alabama the seat previous held republican jeff session tap us presid donald trump attorney gener a state canvass board compos alabama secretari state john merril governor kay ivey attorney gener steve marshal certifi elect result seat jone narrow republican major senat seat in statement jone call victori a new chapter pledg work parti moor declin conced defeat even trump urg so he stood claim fraudul elect statement releas certif said regret media outlet report an alabama judg deni moor request block certif result dec elect decis shortli canvass board met moor challeng alleg potenti voter fraud deni chanc victori hi file wednesday montgomeri circuit court sought halt meet schedul ratifi jone win thursday moor could ask recount addit possibl court challeng merril said interview fox news channel he would complet paperwork within time period show money challeng merril said weve notifi yet intent that merril said regard claim voter fraud merril told cnn case report weve adjud those we continu that said republican lawmak washington distanc moor call drop race sever women accus sexual assault misconduct date back teenag earli s moor deni wrongdo reuter abl independ verifi alleg

Let's Clean Our Text

  • Pass “text” into our clean() function and watch the magic happen!
  • This is just for practice!
# Add a new column ("isfake") to both datasets: 0 for real, 1 for fake
real["isfake"] = 0

fake["isfake"] = 1

# Combine the two datasets using concat in pandas
allnews = pd.concat([real, fake])

# Run clean() on every row of the "text" column.
# This can take some time, depending on how large the dataset is.

allnews['text'] = allnews['text'].apply(clean)

# Save the combined DataFrame, which holds both fake and real news plus the "isfake" label column (0 = real, 1 = fake)
allnews.to_csv("output.csv", index=False)
# Read our newly created CSV file into the DataFrame
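
The read step is not shown here, so the following is a minimal sketch of the full save-and-reload round trip under stated assumptions. It uses tiny stand-in frames instead of the Kaggle data, writes to a temporary folder instead of output.csv, and the names reloaded, n_missing, and label_counts are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in frames; "" mimics an article that cleaning emptied completely
real = pd.DataFrame({"text": ["senat vote", ""]})
fake = pd.DataFrame({"text": ["wire report"]})
real["isfake"] = 0
fake["isfake"] = 1
allnews = pd.concat([real, fake], ignore_index=True)

# Write to a temporary folder so the sketch does not overwrite a real output.csv
path = os.path.join(tempfile.mkdtemp(), "output.csv")
allnews.to_csv(path, index=False)

reloaded = pd.read_csv(path)

# Empty strings are written as empty fields, which read_csv parses as NaN by default
n_missing = int(reloaded["text"].isna().sum())
reloaded["text"] = reloaded["text"].fillna("")

# Quick check that both classes survived the round trip
label_counts = reloaded["isfake"].value_counts().to_dict()
print(n_missing, label_counts)
```

Checking for missing text after reloading is worth doing, because any article that cleaning reduces to an empty string comes back as NaN and can break later text processing.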

  | title | text | subject | date | isfake
0 | As U.S. budget fight looms, Republicans flip t... | the head conserv republican faction us congres... | politicsNews | 31-Dec-17 | 0
1 | U.S. military to accept transgender recruits o... | transgend peopl allow first time enlist us mil... | politicsNews | 29-Dec-17 | 0
2 | Senior U.S. Republican senator: 'Let Mr. Muell... | the special counsel investig link russia presi... | politicsNews | 31-Dec-17 | 0
3 | FBI Russia probe helped by Australian diplomat... | trump campaign advis georg papadopoulo told au... | politicsNews | 30-Dec-17 | 0
4 | Trump wants Postal Service to charge 'much mor... | presid donald trump call us postal servic frid... | politicsNews | 29-Dec-17 | 0
... | ... | ... | ... | ... | ...
70 | ELECTION FRAUD: If It Happened in Michigan, Wi... | st centuri wire say on recent episod the sunda... | Middle-east | 15-Mar-16 | 1
71 | Patrick Henningsen LIVE with guest Ray McGover... | join patrick everi week wiretv news view analy... | US_News | 1-Dec-16 | 1
72 | Boiler Room EP #85.5 – Who’s Watching The ... | tune altern current radio network acr anoth li... | US_News | 30-Nov-16 | 1
73 | Washington Post attempts to smear Ron Paul Ins... | st centuri wire say as wire report saturday th... | US_News | 28-Nov-16 | 1
74 | Episode #162 – SUNDAY WIRE: ‘The Revolutio... | episod sunday wire show resum novemb host patr... | US_News | 27-Nov-16 | 1

75 rows × 5 columns

Work in progress…