Digital Transformation, Big Data and Research Landscape in Digital Communication

The digitization of communication technology has led to intense interaction between humans and digital technology. This activity produces a large number of digital data traces, commonly referred to as Big Data. The availability of Big Data as a digital data source, in turn, opens opportunities for communication scientists to use that data to identify patterns and trends in human activity through social research. Conducting research with such digital data requires an understanding of the basic concept of Big Data, appropriate tools, adequate access to the data, and an appropriate research method. This paper aims to describe the potential of Big Data for communication research, to outline the appropriate tools, techniques and methods, and to identify potential research directions in the digital realm. Some limitations and critical issues related to research validity, population and sample, and ethics in digital media research methods are also discussed.


Introduction
The digital revolution has brought substantial change to many aspects of the technology used in human communication (Kaul, 2012). The change is especially characterized by the proliferation of new inventions in information and communication technology that are applied in various aspects of human life. One of the most remarkable of these inventions is the internet. The technology allows a computer to connect to vast computer networks in various parts of the world (Abbate, 2017), which stimulates both textual and virtual interaction among people and in turn makes global human communication more effective and efficient (Matusitz, 2007).
The increasingly widespread use of the internet and its applications has dramatically changed the world we live in today and turned it into an interconnected cyberworld (Tsou, 2011). This has encouraged more people to take part in various online communication activities. Sending and receiving e-mail, visiting websites, watching videos on YouTube, chatting and updating status on social media, writing online reviews, and browsing product information are some examples of popular online activities in daily life.
According to Manyika, these online activities produce a large number of digital traces and data (Lewis, Zamith, & Hermida, 2013), which are recorded and stored permanently on the servers of the providers that enable such online activities and services (Power, 2014). A simple example of how digital traces are produced by daily human activity is interaction on social media platforms. Every day, social media users across the world communicate with each other by updating their status (responding to current social issues, complaining about or commending product performance, expressing feelings such as happiness, sadness, lament or anger, responding to other people's status, arguing about social issues, giving likes, etc.). Popular social media platforms widely used across the world include Facebook, Instagram, Twitter and Google+. Social media produce petabytes of data on a daily basis (Housley, Williams, Williams, & Edwards, 2013). The amount of data is even larger when other digital data produced by humans in areas such as health, education, the economy and politics are taken into account.
In information technology studies, such data is referred to as "big data". The term refers to data sets that are so large that they cannot be processed using personal computers or the simple software commonly used by most computer users (Eynon, 2013).
The availability of such a large amount of digital data traces has attracted the attention of various parties such as business organizations (Zhao, Fan, & Hu, 2014), governments (Eynon, 2013) and researchers from many disciplines (Boyd & Crawford, 2012). They seek access to the data and analyze large digital datasets to understand general trends and patterns of human behavior (Tsou, 2015).
In communication research, big data has emerged as a new approach to collecting and analyzing digital data on the internet (Boyd & Crawford, 2012). The phenomenon provides communication scientists with many opportunities to explore and find patterns of human communication in a more practical and concise manner (Mahrt & Scharkow, 2013). As a consequence, initial studies related to big data in the field of communication began to appear sporadically (Burgess, Bruns, & Hjorth, 2013).
One of the earliest studies to deal with the importance of big data in communication research was conducted by Papacharissi & De Fatima Oliveira (2012), who applied computational discourse analysis to news on Twitter in Egypt related to the resignation of Hosni Mubarak as president in January-February 2011. Burgess & Bruns (2012) suggested the use of the term "big data" in communication and media studies and explained how the Twitter API (Application Programming Interface) and Twitter archives could be used to gather digital data traces from Twitter. Lewis et al. (2013) combined computational and manual methods in content analysis of massive digital datasets.
Despite its importance and potential, the use of big data as a source of digital data in communication studies is still very rare in Indonesia. This is unfortunate given how fast internet and digital technology have developed in the country. Based on data from APJII (the Indonesian Internet Service Providers Association), Indonesia had 132.7 million internet users (around 52% of the total population), with an internet penetration rate of 34.9% (2016). The data also indicate that social media, entertainment, news, education and commerce are the content most frequently accessed by internet users in Indonesia. Consistent with these findings, other data show approximately 79 million active social media users in Indonesia (wearesocial, 2016), with Facebook, Twitter and Instagram being the most popular social media platforms among Indonesians.
These statistics on internet usage and the availability of digital data show that Indonesia is one of the countries whose citizens are highly active on the internet. As a consequence, Indonesia produces huge digital footprints that communication scientists can use as sources of digital data for predicting attitudes and patterns of human behavior in certain aspects of life. It is therefore necessary to master digital methods for conducting internet-based research (data retrieval) (Hutchinson, 2016).
This paper aims to describe in general terms the concept of big data and the potential use of digital data traces for communication research in Indonesia in the context of popular social media (Facebook, Twitter and Instagram). It also briefly explains the tools and methods used to retrieve, extract and analyze data and to analyze sentiment, outlines the directions and tendencies of research in communication studies, and addresses challenges and weaknesses.

The conceptualization of "big data"
Digitalization has brought radical changes to the media environment and has offered scientists across disciplines new methods of data collection (Burgess et al., 2013). The change is driven especially by the availability of a large number of digital data traces left by humans as a result of their interaction with digital technology. In computer science, such data is known as "big data".
Basically, big data can be defined as a large amount of data that cannot be stored, managed and processed adequately using a standard computer alone (Kaisler, Armour, Espinosa, & Money, 2013). A more comprehensive definition is proposed by Zou (2015), who suggests that "big data" refers to a very large (structured and unstructured) data set resulting from human interaction with digital technology in communication, movement and behavior. According to Duan & Xiong (2015), there are three main issues in the discussion of big data: volume, speed and variation. Volume relates to the substantial quantity of data provided by certain sources. Speed relates to real-time applications and the speed of data processing. Variation refers to the diversity of data formats and of unstructured data such as text, graphics, series data, and the like.
Beyond these issues, it is also useful to have a general map of where digital data can be obtained. In general, big data sources can be found wherever human interaction with digital technology occurs. Tsou (2015) identifies several examples of big data sources in everyday life: conversations on social media platforms; electronic medical records in hospitals, health centers or insurance companies; records of business transactions such as credit card records and online shopping transactions; movement and traffic data from GPS systems; and scientific research data such as earthquake records, weather records and census records. Similarly, Edwards, Housley, Williams, Sloan, & Williams (2013) list several human activities with the potential to generate digital data traces, including retail transactions, telephone communications, financial expenditures and insurance claims, population censuses, general household surveys, police-recorded crime, crime victim surveys and labor market surveys.

"The big data" as a source of data in communication research
Although digital data traces can be found in various forms of human interaction with technology, not all types of digital sources and data can be used in social disciplines such as communication studies. Given that the digital data usually takes the form of words, sentences, conversations, numbers and emoticons, there are at least two domains in which communication studies can be conducted: social media-based interactions and web-based interactions.
In the context of social media-based interactions, a variety of popular social media (social networking sites, microblogs and blogs) can be studied (Edwards et al., 2013). Facebook, Instagram and YouTube are common examples of popular social networking sites that can serve as digital data sources for communication studies, and Twitter is a potential data source among microblogging platforms. In the case of blogs, communication scholars can use the various public blogs available on the internet. Communication scientists can also use web-based interactions as a source of large digital data for study purposes. In this category, general news sites (online news portals) and online review sites can serve as the main sources of digital data traces for communication studies.

Material and Methodology
Unlike the dominant methods in social research, which generally use questionnaires or qualitative techniques (interviews, focus group discussions, etc.) as data collection tools, big data studies in communication science require completely different tools. One technique commonly used in big data studies is text mining. According to Feldman & Sanger (2007), text mining may be defined as "a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools". Text mining is also regarded as a computerized method for uncovering knowledge by filtering information from unstructured text documents, combining several techniques including data mining, machine learning, natural language processing (NLP), information retrieval and knowledge management (de Fortuny, Smedt, Martens, & Daelemans, 2012). In social sciences such as communication, the text mining approach is useful when a researcher wants to find valuable information in conversations or texts available on social media platforms.
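As an illustration of the kind of processing that text mining involves, the following minimal Python sketch tokenizes a few short texts and counts term frequencies after removing common stop words. The sample texts, the stop-word list and the function name `term_frequencies` are hypothetical and chosen only for illustration; real text mining tools combine many more steps than this.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "is", "and", "of", "to", "in", "this"}

def term_frequencies(documents):
    """Count how often each non-stop-word term occurs across all documents."""
    counts = Counter()
    for doc in documents:
        # Lowercase the text and keep only alphabetic tokens.
        tokens = re.findall(r"[a-z]+", doc.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

# Hypothetical sample texts standing in for retrieved posts.
docs = [
    "The campaign is heating up",
    "Campaign news and debate",
    "Debate of the candidates",
]
freq = term_frequencies(docs)
print(freq.most_common(3))
```

Counting terms is only a first descriptive step; it shows which words dominate a collection before any interpretation is attempted.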
To collect data in a big data study, researchers usually need adequate tools capable of capturing, processing and analyzing data from a particular server. This need can be met by data analytics software.
Basically, data analytics software is software used to retrieve (collect), process and analyze digital data contained in various big data sources. Like a search engine (e.g., Google), the software is able to index data in the online environment. It is also able to retrieve data, process it and then display the results.
The way the software works requires researchers to customize the definition of what data is to be retrieved from the big data source. The search is customized using key words that are related (included) or not related (excluded) to a particular research theme. The more accurately the included and excluded key words reflect the specific research theme defined by the researcher, the higher the precision of the results returned by the software.
For example, a study of tweets containing SARA (Suku, Agama, Ras, and Antar-golongan; that is, ethnicity, religion, race, and inter-group relations) issues during the 2017 Jakarta gubernatorial election campaign would retrieve data from Twitter by specifying key words related to the theme, such as the names of all candidates participating in the election, "campaign", "indigenous", "non-indigenous", "Chinese", "Islam", "Christians", "Muslims", "non-Muslims", "infidel", etc. In addition to the key words, the period of data retrieval is an important factor that researchers should address carefully. In this example, the retrieval period would be set to the duration of the 2017 Jakarta election campaign.
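The combination of included key words, excluded key words and a retrieval period can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular analytics product: the sample posts, the key word lists, the function names and the start and end dates are all hypothetical placeholders (the dates in particular are not the official campaign dates).

```python
from datetime import date

# Hypothetical sample posts; in practice these would come from a platform
# API or archive rather than being typed in by hand.
posts = [
    {"text": "Campaign rally draws big crowd in Jakarta", "date": date(2017, 1, 15)},
    {"text": "Best campaign shoes discount today", "date": date(2017, 1, 20)},
    {"text": "Debate about religion in the campaign", "date": date(2016, 6, 1)},
    {"text": "Traffic jam downtown again", "date": date(2017, 2, 1)},
]

def matches(post, include, exclude, start, end):
    """True if the post is inside the period, contains at least one
    included key word and contains none of the excluded key words."""
    text = post["text"].lower()
    in_period = start <= post["date"] <= end
    has_included = any(k in text for k in include)
    has_excluded = any(k in text for k in exclude)
    return in_period and has_included and not has_excluded

def filter_posts(posts, include, exclude, start, end):
    """Return only the posts that satisfy the retrieval definition."""
    return [p for p in posts if matches(p, include, exclude, start, end)]

selected = filter_posts(
    posts,
    include=["campaign", "election"],   # hypothetical included key words
    exclude=["discount"],               # hypothetical excluded key word
    start=date(2016, 10, 1),            # placeholder period start
    end=date(2017, 4, 30),              # placeholder period end
)
for p in selected:
    print(p["date"], p["text"])
```

Note that the sketch uses simple substring matching, so a short key word can also match inside longer words; a real study would need to decide how strictly key words should be matched.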
In addition to text mining, communication research on big data can also be conducted using sentiment analysis. Sentiment analysis techniques are useful when a study seeks to reveal the tendencies and attitudes (emotions) in conversations or posts in the online environment, including social media (W. Duan, Cao, Yu, & Levy, 2013).
Li & Wu (2010) describe sentiment analysis as a computing application concerned with the emotional polarity of communicators. Its main purpose is to determine the emotions contained in a text that someone has posted or written about a particular topic in the online environment. Liu (2010) states that sentiment analysis is the computational study of the opinions, sentiments and emotions expressed in text in the online world.
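One simple way to operationalize emotional polarity is a lexicon-based count, sketched below in Python. This is only one of several possible approaches and is not the method of the studies cited above; the word lists are tiny hypothetical English examples, and the function name `polarity` is an assumption made for illustration.

```python
import re

# Hypothetical mini-lexicons; real studies use much larger, validated
# word lists in the language of the posts being analyzed.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "poor", "sad", "hate", "terrible"}

def polarity(text):
    """Classify a text as positive, negative or neutral by counting
    how many tokens appear in each lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for post in ["I love this great phone",
             "Terrible service, very bad",
             "The phone arrived today"]:
    print(post, "->", polarity(post))
```

A count of this kind ignores negation, sarcasm and context, which is one reason automated sentiment results need to be checked against human judgment.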
As the number of social media and internet users continues to grow in Indonesia, the availability of big data is also becoming increasingly significant. This gives communication scholars in Indonesia an opportunity to develop the new field of study. Text mining and sentiment analysis are therefore very important tools for communication scholars in analyzing data available online, especially data related to communication research themes.

Political Communication
In political communication, various topical research themes may be raised in response to the abundance of human-generated digital data. Examples include predicting the winner of elections (local, national, legislative and executive) from levels of popularity and sentiment on social media and online news portals; comparing the official social media accounts of political actors and institutions (political parties, presidents, houses of representatives, etc.); reviewing and comparing the thoughts and statements of political actors as expressed online; and assessing which political issues are popular, interesting and of main concern to the general public.

Marketing Communication
Research themes in marketing communication have received considerable attention because of the availability of data produced by human interaction with digital technology. Major themes that can be studied in this context include measuring consumer attitudes and sentiments towards particular products; detecting the weaknesses of a product by examining consumer complaints about it; measuring and comparing the popularity of products and services in the market; comparing and assessing the interactivity of social media accounts for certain products; assessing the popularity of and public trust in freelancer services; and testing the link between popularity in cyberspace (social media) and public acceptance and sales levels.

Other (Intercultural, Organizational and Interpersonal) Fields in Communication Study
In addition to the two fields discussed above, other fields of communication also offer new research directions. Examples include comparing the cross-cultural use of adjectives (emotions) across different accounts (intercultural communication); measuring the use of social media as an official medium of internal and external communication (organizational communication); and assessing the depth of interpersonal communication on social media platforms and testing the link between online closeness and real-world closeness (interpersonal communication).

Content analysis
In terms of research technique, one method in communication science that faces new challenges and developments is content analysis (Lewis et al., 2013), a technique widely used in communication studies. Classical content analysis is generally applied to describe the contents, message characteristics and content development (trends) of written texts such as newspapers, advertisements, and important documents such as rules and regulations (Eriyanto, 2011).
The increasingly intense human interaction in the online environment has also increased the availability of digital data in the form of written text. This gives communication scholars a great opportunity to use various types of data as study material for content analysis. In addition to quantitative content analysis, qualitative techniques can also be used in various research directions.
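A quantitative content analysis of digital text can be approximated computationally by assigning texts to categories from a codebook and counting the results. The Python sketch below is a simplified illustration under stated assumptions: the codebook categories, their key words, the sample texts and the function names are all hypothetical, and key word matching is a rough stand-in for the judgment of trained human coders.

```python
from collections import Counter

# Hypothetical codebook mapping each category to indicative key words.
CODEBOOK = {
    "economy": ["price", "jobs", "tax"],
    "health": ["hospital", "vaccine", "clinic"],
}

def code_text(text):
    """Return every codebook category whose key words occur in the text."""
    lowered = text.lower()
    return [category for category, keywords in CODEBOOK.items()
            if any(k in lowered for k in keywords)]

def category_counts(texts):
    """Count how many texts were assigned to each category."""
    counts = Counter()
    for text in texts:
        counts.update(code_text(text))
    return counts

# Hypothetical sample texts standing in for online news items or posts.
texts = [
    "Fuel price rises again",
    "New hospital opens, creating jobs",
    "Weather is nice",
]
counts = category_counts(texts)
print(dict(counts))
```

A text may fall into more than one category or into none; how such cases are treated is a coding decision that should be documented in the codebook.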

Challenges and Limitations
On the one hand, the digital transformation provides an opportunity to use the latest research models in communication science; on the other hand, research using digital data is also vulnerable in its implementation. Several important methodological issues should be addressed when conducting digital data-based research.

Validity issue
Validity is one of the main issues that arise when data/text mining or sentiment analysis is used in communication research (Boyd & Crawford, 2012). The issue can arise during data/text mining, so the validity of data collected from internet sources needs to be tested. The software used to collect and analyze the data still has considerable error. Most developers of these tools claim an accuracy level of only about 70%. This indicates that a large amount of data may not be retrieved and analyzed, which can be a serious problem when it comes to generalizing results. By contrast, a "conventional" survey generally tolerates a sampling error of up to 5%. This gap can affect the precision of the study.
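One common way to test the validity of automated output is to compare it with a manually coded subset of the same texts. The Python sketch below illustrates this under stated assumptions: the two label lists are invented for the example, the function names are hypothetical, and the margin of error uses the usual normal approximation for a proportion, which is only rough for a sample this small.

```python
import math

def accuracy(tool_labels, manual_labels):
    """Share of items on which the tool agrees with the human coder."""
    if len(tool_labels) != len(manual_labels) or not manual_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    agree = sum(t == m for t, m in zip(tool_labels, manual_labels))
    return agree / len(manual_labels)

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n items."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical labels for ten posts: human coding versus tool output.
manual = ["pos", "neg", "neg", "pos", "neu", "pos", "neg", "neu", "pos", "neg"]
tool = ["pos", "neg", "pos", "pos", "neu", "neg", "neg", "pos", "pos", "neg"]

acc = accuracy(tool, manual)
moe = margin_of_error(acc, len(manual))
print(f"accuracy = {acc:.2f}, margin of error = {moe:.2f}")
```

With only ten checked items the estimate is very imprecise, which illustrates why a validation subset needs to be reasonably large before an accuracy figure is reported.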

Population and sample
The issue of population and sample arises in digital data-based communication research because the population of posts and conversations related to a given issue on the internet is generally very large and its exact size is difficult to determine. Consequently, it is difficult to ascertain the number of samples to be drawn using random sampling techniques (Mahrt & Scharkow, 2013). As a result, generalizations of the results based on the rules of statistics and social research become doubtful.
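For orientation, the sample size that conventional survey practice would call for can be computed with the standard textbook formula for estimating a proportion under simple random sampling, with an optional finite population correction. The Python sketch below is not taken from the studies cited here; the function name and default values are assumptions, and the formula presupposes a sampling frame from which posts can be drawn at random, which is precisely what is hard to obtain for online data.

```python
import math

def sample_size(margin=0.05, p=0.5, z=1.96, population=None):
    """Required sample size for estimating a proportion.

    margin: tolerated sampling error (0.05 = 5%)
    p: assumed proportion (0.5 is the most conservative choice)
    z: z-score for the confidence level (1.96 for about 95%)
    population: population size, or None if unknown or very large
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    if population is not None:
        # Finite population correction shrinks the requirement slightly.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

print(sample_size())                  # unknown or very large population
print(sample_size(population=10000))  # hypothetical population of 10,000 posts
```

The required number changes little once the population is large, so the practical difficulty lies less in the arithmetic than in drawing a genuinely random sample from a population whose boundaries are unknown.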

Ethical issue
Ethics is another factor that can cause serious problems in research on digital data in cyberspace. The issue is especially prominent because of the use of public and semi-public data. There is no standard procedure or ethical standard governing the use of these types of data in scientific research (Mahrt & Scharkow, 2013).

Conclusion
Human interaction with the internet and the digital world has resulted in the availability of a large amount of digital data that can be processed in communication research. Consequently, it is possible to start developing new ways of conducting research in the field of communication.
In Indonesia, a country with a relatively high level of internet usage, communication scholars should take this opportunity.
Communication scholars in Indonesia should also pay close attention to the issue and begin to develop new methods to face the challenges arising from the use of digital data. Such efforts should strictly follow research method procedures and an adequate research design. Studies must also pay close attention to the issues described above (population and sample, validity and ethics) in order to obtain useful results through the right process.
The limitations of research that result from the limitations of the tools used to retrieve data should also be a main concern of communication scholars, and additional techniques should be developed to optimize the results of digital media research.