- 6th Jul 2024
- 19:29
The aim is to demonstrate the ability to apply different types of data visualisation to the selected data and to adapt these visualisations to that data. The visualisations and techniques used are expected to show an understanding of the nature of the data being analysed. Variety is also important, as it demonstrates an ability to hold and focus the viewer’s attention.
- In respect of temporal data, simple time/date/Time Zone (TZ) information is sufficient. For comparing trends, it may be necessary to convert this data to a Coordinated Universal Time (UTC) value. Information on this conversion can be found on multiple sites via Google or other search engines.
- In respect of the spatial data, this can be kept at country/state level. The temporal TZ value can be used to approximate the longitude of the data source and give some indication of location. This latter method should only be used if country/state information is missing.
- Reliability scoring is an evolving subject, but there are “tools” in play that can assess the credibility of a feed. If such a score is available and usable, then a colour progression with a traffic light (red/yellow/green) visualisation could help bolster any claims made based on the data content.
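The three points above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names (`to_utc`, `approx_longitude`, `traffic_light`) are hypothetical, the reliability score is assumed to lie in the range 0 to 1, and the colour thresholds are arbitrary assumptions. The longitude estimate uses the rule of thumb of 15 degrees per hour of UTC offset, which is only a rough guide because political time zones do not follow meridians exactly.

```python
from datetime import datetime, timezone


def to_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp that carries a UTC offset and convert it to UTC."""
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        raise ValueError("timestamp has no UTC offset, so it cannot be converted reliably")
    return parsed.astimezone(timezone.utc)


def approx_longitude(timestamp: str) -> float:
    """Estimate longitude in degrees from the UTC offset (15 degrees per hour).

    Only a fallback for records with no country/state information.
    """
    offset = datetime.fromisoformat(timestamp).utcoffset()
    if offset is None:
        raise ValueError("timestamp has no UTC offset")
    return offset.total_seconds() / 3600 * 15.0


def traffic_light(score: float) -> str:
    """Map a reliability score in [0, 1] to a traffic-light colour.

    The 0.4 and 0.7 thresholds are assumptions chosen for illustration.
    """
    if score < 0.4:
        return "red"
    if score < 0.7:
        return "yellow"
    return "green"


local_stamp = "2022-04-22 11:39:13+05:30"  # hypothetical local timestamp
print(to_utc(local_stamp).isoformat())
print(approx_longitude(local_stamp))
print(traffic_light(0.85))
```

Timestamps that are already in UTC, such as those in the `DateTime` column of the notebook output, pass through `to_utc` unchanged and give a longitude estimate of zero, so the estimate is only useful when the original local offset has been preserved.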
Sentiment and frequency of opinions should also be considered in the visualisations. Word clouds can be used to represent the frequency of sentiments, opinions, or phrases, while bar charts, radar plots (spider graphs), and sunbursts/treemaps (if a hierarchy can be applied to the data) can be used to compare these frequencies across categories.
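Both kinds of chart start from simple counts. The sketch below, which uses only the standard library, shows one way to derive them; the sample records and the helper names `word_frequencies` and `sentiment_counts` are hypothetical and stand in for the cleaned tweet tokens and sentiment labels.

```python
from collections import Counter

# Hypothetical sample records: (cleaned tokens, sentiment label)
records = [
    (["vaccine", "mandate", "controversy"], "Negative"),
    (["favorite", "player", "vaccine"], "Positive"),
    (["vaccine", "mandate", "debate"], "Negative"),
]


def word_frequencies(rows):
    """Count how often each token appears across all records (word cloud input)."""
    counts = Counter()
    for tokens, _label in rows:
        counts.update(tokens)
    return counts


def sentiment_counts(rows):
    """Count records per sentiment label (bar chart input)."""
    return Counter(label for _tokens, label in rows)


freqs = word_frequencies(records)
by_sentiment = sentiment_counts(records)
print(freqs.most_common(2))
print(dict(by_sentiment))
```

A frequency mapping of this shape can be passed to `WordCloud.generate_from_frequencies`, as the notebook cells further down do, and the per-label counts can be plotted directly as a bar chart.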
Outputs: The CMA is intended to integrate the learning from the AMLNN and DV modules.
The output from this CMA from the DV perspective is a dashboard built in D3.js, Highcharts, or (if so desired) Python. The dashboard should be laid out so that the viewer is not overwhelmed but drawn into the analysis presented. As with UTC, there are many websites that show various dashboard designs, and it is advisable to visit these sites to see what is currently considered best practice.
"cells": [
"cell_type": "code",
"execution_count": 1,
"outputs": [
"output_type": "stream",
"name": "stdout",
"text": [
"[nltk_data] Downloading package stopwords to /root/nltk_data...\n",
"[nltk_data] Package stopwords is already up-to-date!\n",
"[nltk_data] Downloading package punkt to /root/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import seaborn as sns;sns.set(style=\"white\")\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"import re\n",
"from nltk.corpus import stopwords\n",
"import string\n",
"import nltk\n",
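"import warnings\n",
"from wordcloud import WordCloud, STOPWORDS\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",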
"cell_type": "code",
"execution_count": 2,
"outputs": [
"output_type": "execute_result",
"data": {
"text/plain": [
" DateTime \\\n",
"0 2022-04-22 06:09:13+00:00 \n",
"1 2022-04-22 04:44:32+00:00 \n",
"2 2022-04-22 04:37:54+00:00 \n",
"3 2022-04-22 03:25:55+00:00 \n",
"4 2022-04-22 03:21:10+00:00 \n",
" Text Followers \\\n",
"0 Djokovic was not my N1 favorite, but after his... 52 \n",
"1 @Surtilala24 Djokovic's issue can still be und... 7353 \n",
"2 @siddtalks Already they blew up on Djokovic va... 270 \n",
"3 Vaccine mandates & indefensible war are ch... 45 \n",
"4 @OmarssAlejandro @anna12345marko @StuYork13 @M... 7268 \n",
" Retweet Count Likes Location Sentiment \\\n",
"0 1 1 Venezuela 1 \n",
"1 0 0 India -1 \n",
"2 0 1 India -1 \n",
"3 0 0 Canada -1 \n",
"4 0 0 USA -1 \n",
" Cleaned_Text \n",
"0 ['n1', 'favorite', 'position', 'regarding', 'c... \n",
"1 ['issue', 'still', 'understood', 'refused', 't... \n",
"2 ['already', 'blew', 'controversy', 'allowing',... \n",
"3 ['mandate', 'indefensible', 'war', 'chalk', 'c... \n",
"4 ['nobody', 'disputing', 'reason', 'acting', '“... "
"source": [
"data = pd.read_csv(\"/content/CleanedTweets.csv\")\n",
"data = data.drop(\"Unnamed: 0\", axis=1)\n",
"cell_type": "markdown",
"source": [
"> Sentiment Distribution:"
"cell_type": "code",
"source": [
"data[\"Sentiment\"]= data[\"Sentiment\"].replace({-1:\"Negative\", 1:\"Positive\", 0:\"Neutral\"})\n",
"sns.countplot(y = 'Sentiment' , data = data, palette=\"Set1\")\n",
"plt.title('Sentiment Ratio')"
"execution_count": 3,
"outputs": [
"output_type": "execute_result",
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Sentiment Ratio')"
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1080x288 with 1 Axes>"
"cell_type": "code",
"source": [
"data[\"Text\"] = data[\"Text\"].astype(\"str\")\n",
"data[\"Length\"] = data[\"Text\"].str.len()\n",
"f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={\"height_ratios\": (.2, .80)})\n",
"f.set_size_inches(15, 5)\n",
"sns.boxplot(x=data[\"Length\"], ax=ax_box)\n",
"sns.histplot(data[\"Length\"], kde=True, stat=\"density\", ax=ax_hist)\n",
"ticks = ax_box.set_xticklabels(ax_box.get_xticklabels())\n",
"plt.title(\"Length Distribution of Tweets\")"
"execution_count": 6,
"outputs": [
"output_type": "execute_result",
"data": {
"text/plain": [
"Text(0.5, 1.0, 'Length Distribution of Tweets')"
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 1080x360 with 2 Axes>"
"source": [
"stopwords = set(STOPWORDS)\n",
" \n",
"vectorizer = CountVectorizer(ngram_range=(2, 2))\n",
"bag_of_words = vectorizer.fit_transform(data[\"Cleaned_Text\"])\n",
"sum_words = bag_of_words.sum(axis=0) \n",
"words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]\n",
"words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)\n",
"bigram_words_dict = dict(words_freq)\n",
"wordCloud = WordCloud(stopwords = stopwords,\n",
" background_color = 'white',\n",
" width = 800,\n",
" height = 800).generate_from_frequencies(bigram_words_dict)\n",
"plt.figure(figsize = (8, 8))\n",
"plt.imshow(wordCloud, interpolation='bilinear')\n",
"plt.tight_layout(pad = 0)\n",
"cell_type": "markdown",
"source": [
"> Trigrams Wordcloud:"
"cell_type": "code",
"source": [
"stopwords = set(STOPWORDS)\n",
" \n",
"vectorizer = CountVectorizer(ngram_range=(3, 3))\n",
"bag_of_words = vectorizer.fit_transform(data[\"Cleaned_Text\"])\n",
"sum_words = bag_of_words.sum(axis=0) \n",
"words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]\n",
"words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)\n",
"trigram_words_dict = dict(words_freq)\n",
"wordCloud = WordCloud(stopwords = stopwords,\n",
" background_color = 'white',\n",
" width = 800,\n",
" height = 800).generate_from_frequencies(trigram_words_dict)\n",
"plt.figure(figsize = (8, 8))\n",
"plt.imshow(wordCloud, interpolation='bilinear')\n",
"plt.tight_layout(pad = 0)\n",
"execution_count": 12,
"outputs": [
"output_type": "display_data",
"data": {
"text/plain": [
"<Figure size 576x576 with 1 Axes>"
"cell_type": "code",
"source": [
