Natural Language Processing with Java and LingPipe Cookbook
Breck Baldwin and Krishna Dayanidhi
Getting data from the Twitter API
We use the popular twitter4j package to invoke the Twitter Search API, search for tweets, and save them to disk. As of Version 1.1, the Twitter API requires authentication, so we will need to get authentication tokens and save them in the twitter4j.properties file before we get started.
Getting ready
If you don't have a Twitter account, go to twitter.com/signup and create one. You will also need to sign in at dev.twitter.com to enable your account for development. Once you have a Twitter login, we'll be on our way to creating the Twitter OAuth credentials. Be prepared for this process to differ from what we present here; in any case, we supply example results in the data directory. Let's now create the Twitter OAuth credentials:
- Log in to dev.twitter.com.
- Find the little pull-down menu next to your icon on the top bar.
- Choose My Applications.
- Click on Create a new application.
- Fill in the form and click on Create a Twitter application.
- The next page contains the OAuth settings.
- Click on the Create my access token link.
- You will need to copy Consumer key and Consumer secret.
- You will also need to copy Access token and Access token secret.
- These values should go into the twitter4j.properties file in the appropriate locations (a programmatic alternative is sketched after this list). The properties are as follows:

debug=false
oauth.consumerKey=ehUOExampleEwQLQpPQ
oauth.consumerSecret=aTHUGTBgExampleaW3yLvwdJYlhWY74
oauth.accessToken=1934528880-fiMQBJCBExamplegK6otBG3XXazLv
oauth.accessTokenSecret=y0XExampleGEHdhCQGcn46F8Vx2E
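If you prefer not to keep credentials in a file, twitter4j also lets you supply them in code. The following is a minimal sketch using twitter4j's ConfigurationBuilder; the placeholder strings are our own and must be replaced with your values:

import twitter4j.Twitter;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

// Sketch of an alternative to twitter4j.properties: supply the four
// OAuth values programmatically. The placeholder strings are hypothetical.
ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setDebugEnabled(false)
  .setOAuthConsumerKey("YOUR_CONSUMER_KEY")
  .setOAuthConsumerSecret("YOUR_CONSUMER_SECRET")
  .setOAuthAccessToken("YOUR_ACCESS_TOKEN")
  .setOAuthAccessTokenSecret("YOUR_ACCESS_TOKEN_SECRET");
Twitter twitter = new TwitterFactory(cb.build()).getInstance();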
How to do it...
Now, we're ready to access Twitter and get some search data using the following steps:
- Go to the directory of this chapter and run the following command:
java -cp lingpipe-cookbook.1.0.jar:lib/twitter4j-core-4.0.1.jar:lib/opencsv-2.4.jar:lib/lingpipe-4.1.0.jar com.lingpipe.cookbook.chapter1.TwitterSearch
- The code displays the output file (in this case, a default value); supplying a path as an argument will write to that file instead. Then, type in your query at the prompt:
Writing output to data/twitterSearch.csv
Enter Twitter Query:disney
- The code then queries Twitter and reports every 100 tweets found (output truncated):
Tweets Accumulated: 100
Tweets Accumulated: 200
…
Tweets Accumulated: 1500
writing to disk 1500 tweets at data/twitterSearch.csv
This program searches Twitter for the query term and writes the output (limited to 1500 tweets) to the .csv file that you specified on the command line, or to a default file.
How it works...
The code uses the twitter4j library to instantiate TwitterFactory and searches Twitter using the user-entered query. The start of main() at src/com/lingpipe/cookbook/chapter1/TwitterSearch.java is:
String outFilePath = args.length > 0 ? args[0] : "data/twitterSearch.csv";
File outFile = new File(outFilePath);
System.out.println("Writing output to " + outFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
System.out.print("Enter Twitter Query:");
String queryString = reader.readLine();
The preceding code gets the output file, supplying a default if none is provided, and reads the query from standard input.
The following code sets up the query according to the vision of the twitter4j developers; for more information on this process, read their Javadoc, but it should be fairly straightforward. In order to make our result set more unique, notice that when we create the query string, we filter out retweets using the -filter:retweets option. This is only somewhat effective; see the Eliminate near duplicates with the Jaccard distance recipe later in this chapter for a more complete solution:
Twitter twitter = new TwitterFactory().getInstance();
Query query = new Query(queryString + " -filter:retweets");
query.setLang("en"); // English
query.setCount(TWEETS_PER_PAGE);
query.setResultType(Query.RECENT);
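The Query object supports further restrictions beyond language, page size, and result type. As a hedged illustration (setSince() and setUntil() are standard twitter4j calls, but the dates are made up and are not part of the recipe's code), the search can be bounded by date:

query.setSince("2014-01-01"); // tweets on or after this date (illustrative value)
query.setUntil("2014-02-01"); // tweets before this date (illustrative value)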
The following loop then runs the query and pages through the results, accumulating tweets:
List<String[]> csvRows = new ArrayList<String[]>();
while (csvRows.size() < MAX_TWEETS) {
    QueryResult result = twitter.search(query);
    List<Status> resultTweets = result.getTweets();
    for (Status tweetStatus : resultTweets) {
        String row[] = new String[Util.ROW_LENGTH];
        row[Util.TEXT_OFFSET] = tweetStatus.getText();
        csvRows.add(row);
    }
    System.out.println("Tweets Accumulated: " + csvRows.size());
    if ((query = result.nextQuery()) == null) {
        break;
    }
}
The preceding snippet is pretty standard code slinging, albeit without the usual hardening for external-facing code: try/catch, timeouts, and retries (a minimal sketch of such hardening follows). One potentially confusing bit is the use of query to handle paging through the search results; result.nextQuery() returns null when no more pages are available. The current Twitter API allows a maximum of 100 results per page, so in order to get 1500 results, we need to rerun the search until there are no more results, or until we get 1500 tweets.
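Here is a minimal, hedged sketch of that hardening: a retry wrapper around twitter.search() that waits out rate-limit windows. The helper name and the retry policy are our own illustration, not part of the recipe's code:

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Twitter;
import twitter4j.TwitterException;

// Hypothetical helper: one search call with basic rate-limit handling.
static QueryResult searchWithRetry(Twitter twitter, Query query)
        throws TwitterException, InterruptedException {
    for (int attempt = 0; attempt < 3; ++attempt) {
        try {
            return twitter.search(query);
        } catch (TwitterException e) {
            if (!e.exceededRateLimitation()) {
                throw e; // not a rate-limit problem; don't retry
            }
            // Wait until the rate-limit window resets, plus a second of slack.
            int waitSecs = e.getRateLimitStatus().getSecondsUntilReset();
            Thread.sleep((waitSecs + 1) * 1000L);
        }
    }
    throw new TwitterException("giving up after repeated rate limiting");
}

The next step involves a bit of reporting and writing: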
System.out.println("writing to disk " + csvRows.size() + " tweets at " + outFilePath); Util.writeCsvAddHeader(csvRows, outFile);
The list of tweets is then written to a .csv file using the Util.writeCsvAddHeader method:
public static void writeCsvAddHeader(List<String[]> data, File file) throws IOException {
    CSVWriter csvWriter = new CSVWriter(new OutputStreamWriter(new FileOutputStream(file), Strings.UTF8));
    csvWriter.writeNext(ANNOTATION_HEADER_ROW);
    csvWriter.writeAll(data);
    csvWriter.close();
}
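As a quick sanity check, the rows can be read back with opencsv's CSVReader. This is a hedged sketch rather than part of the recipe; it assumes the UTF-8 encoding used above and skips the header row written by writeCsvAddHeader:

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.List;
import au.com.bytecode.opencsv.CSVReader; // opencsv 2.4 package

CSVReader csvReader = new CSVReader(
    new InputStreamReader(new FileInputStream(file), "UTF-8"));
csvReader.readNext();                      // skip the annotation header row
List<String[]> rows = csvReader.readAll(); // the remaining tweet rows
csvReader.close();
System.out.println("Read back " + rows.size() + " rows");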
We will be using this .csv file to run the language ID test in the next section.
See also
For more details on using the Twitter API and twitter4j, please go to their documentation pages: the Twitter developer documentation at dev.twitter.com and the twitter4j documentation at twitter4j.org.